Skip to content

Conversation

@mattdowle
Copy link
Member

@mattdowle mattdowle commented Sep 30, 2021

Thanks to a report comparing a single column case to base::findInterval, I had a look.
Some low hanging fruit was that INHERITS() (an R API call looking up class attribute and looping through the class vector) was being calling deeply in the recursive call to support integer64. Also the XIND branch, although a trinary, is called in many places and the branch could be raised. dtwiddle() has long been sand in the cogs given that the default rounding is 0 anyway, and there was a todo to get NA for double out of the way up front. Tackling those yielded a 2x speedup. There's too much recursion in the case of a large table rolling to small (many rows rolling to the same row) and I'm hoping for another 2x speedup on that.
Related #2490

@mattdowle mattdowle added the WIP label Sep 30, 2021
@mattdowle mattdowle added this to the 1.14.3 milestone Sep 30, 2021
@codecov
Copy link

codecov bot commented Sep 30, 2021

Codecov Report

Merging #5187 (66cfc7e) into master (0c59f36) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #5187   +/-   ##
=======================================
  Coverage   99.50%   99.51%           
=======================================
  Files          77       77           
  Lines       14535    14573   +38     
=======================================
+ Hits        14463    14502   +39     
+ Misses         72       71    -1     
Impacted Files Coverage Δ
src/bmerge.c 100.00% <100.00%> (+0.48%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c59f36...66cfc7e. Read the comment docs.

@MichaelChirico
Copy link
Member

LMK if you'd like an extra pair of eyes on this / when it's ready

@mattdowle
Copy link
Member Author

Status so far in the branch on this particular large-to-small benchmark. Low hanging fruit without tackling the too-much-recursion yet. The code rework allows the too-much-recursion to be done next.

setDTthreads(1)
n <- 1e7 
x.df <- data.frame(date = as.POSIXct("2000-1-1") + 1:n, price = rnorm(n))
x.dt <- data.table(x.df, key = "date")

n2 <- 100
y.df <- data.frame(date = sort(sample(x.df$date, n2)), adjustment = runif(n2))
y.dt <- data.table(y.df, key = "date")

falign <- function(x, y) {
  i <- findInterval(as.numeric(x$date), as.numeric(y$date))
  i[which(i == 0)] <- NA
  x$price.adj <- x[, "price"] + y[i, "adjustment"]
  x
}

falign.dt <- function(x, y) {
  i <- findInterval(as.numeric(x$date), as.numeric(y$date))
  i[which(i == 0)] <- NA
  x[, price.adj := x[, "price"] + y[i, "adjustment"]]
  x
}

microbenchmark(
  falign(x.df, y.df),
  falign.dt(x.dt, y.dt),
  x.dt[, price.adj := y.dt[x.dt, price+adjustment, roll=TRUE]],
  times=10, unit="s"
))
# v1.14.0
Unit: seconds
                                                              expr       min        lq      mean    median        uq       max neval
                                                falign(x.df, y.df) 0.1488798 0.1542454 0.1617588 0.1593861 0.1620447 0.1992783    10
                                             falign.dt(x.dt, y.dt) 0.2815458 0.2901276 0.2958419 0.2960433 0.3009906 0.3185344    10
 x.dt[, `:=`(price.adj, y.dt[x.dt, price + adjustment, roll = T])] 0.8177566 0.8206021 0.8365346 0.8349606 0.8461528 0.8593044    10

# this branch as of now (3rd line down from 0.83 to 0.57 but still a lot slower than `findInterval`) : 
                                                              expr       min        lq      mean    median        uq       max neval
                                                falign(x.df, y.df) 0.1460495 0.1486709 0.1554336 0.1537538 0.1616508 0.1732326    10
                                             falign.dt(x.dt, y.dt) 0.2766880 0.3006418 0.3010593 0.3025785 0.3060092 0.3161539    10
 x.dt[, `:=`(price.adj, y.dt[x.dt, price + adjustment, roll = T])] 0.5418650 0.5635949 0.5735462 0.5717389 0.5876390 0.5971523    10

@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
@jangorecki jangorecki modified the milestones: 1.15.0, 1.15.1 Nov 6, 2023
@MichaelChirico
Copy link
Member

Anyone want to take over this PR? @jangorecki and @tlapak are the only two I see with commits to this file in recent memory.

@jangorecki
Copy link
Member

This PR will have conflicts with mergelist, which has been already merged and conflicts resolved to recent master

@tdhock
Copy link
Member

tdhock commented Jan 8, 2024

@Anirban166 @DorisAmoakohene please investigate speedups with atime

@MichaelChirico
Copy link
Member

Marking as draft over label:WIP

@MichaelChirico MichaelChirico marked this pull request as draft February 19, 2024 02:45
@MichaelChirico MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024
@MichaelChirico MichaelChirico modified the milestones: 1.17.0, 1.18.0 Dec 3, 2024
@jangorecki jangorecki modified the milestones: 1.18.0, 1.19.0 Nov 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants