-
Notifications
You must be signed in to change notification settings - Fork 1k
faster bmerge numeric and roll #5187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5187 +/- ##
=======================================
Coverage 99.50% 99.51%
=======================================
Files 77 77
Lines 14535 14573 +38
=======================================
+ Hits 14463 14502 +39
+ Misses 72 71 -1
Continue to review full report at Codecov.
|
|
LMK if you'd like an extra pair of eyes on this / when it's ready |
|
Status so far in the branch on this particular large-to-small benchmark. Low hanging fruit without tackling the too-much-recursion yet. The code rework allows the too-much-recursion to be done next. setDTthreads(1)
n <- 1e7
x.df <- data.frame(date = as.POSIXct("2000-1-1") + 1:n, price = rnorm(n))
x.dt <- data.table(x.df, key = "date")
n2 <- 100
y.df <- data.frame(date = sort(sample(x.df$date, n2)), adjustment = runif(n2))
y.dt <- data.table(y.df, key = "date")
falign <- function(x, y) {
i <- findInterval(as.numeric(x$date), as.numeric(y$date))
i[which(i == 0)] <- NA
x$price.adj <- x[, "price"] + y[i, "adjustment"]
x
}
falign.dt <- function(x, y) {
i <- findInterval(as.numeric(x$date), as.numeric(y$date))
i[which(i == 0)] <- NA
x[, price.adj := x[, "price"] + y[i, "adjustment"]]
x
}
microbenchmark(
falign(x.df, y.df),
falign.dt(x.dt, y.dt),
x.dt[, price.adj := y.dt[x.dt, price+adjustment, roll=TRUE]],
times=10, unit="s"
)) |
|
Anyone want to take over this PR? @jangorecki and @tlapak are the only two I see with commits to this file in recent memory. |
|
This PR will have conflicts with mergelist, which has been already merged and conflicts resolved to recent master |
|
@Anirban166 @DorisAmoakohene please investigate speedups with atime |
|
Marking as draft over |
Thanks to a report comparing a single column case to
base::findInterval, I had a look.Some low hanging fruit was that INHERITS() (an R API call looking up class attribute and looping through the class vector) was being calling deeply in the recursive call to support integer64. Also the XIND branch, although a trinary, is called in many places and the branch could be raised.
dtwiddle()has long been sand in the cogs given that the default rounding is 0 anyway, and there was a todo to get NA for double out of the way up front. Tackling those yielded a 2x speedup. There's too much recursion in the case of a large table rolling to small (many rows rolling to the same row) and I'm hoping for another 2x speedup on that.Related #2490