benchmarking vignette #3132
Open: jangorecki wants to merge 13 commits into `master` from `bench-vign`.
Changes from 4 commits. All 13 commits:

- `60a5694` order of points in bench vign is not relevant (jangorecki)
- `9554c3d` mention parallel pkg in bench vignette (jangorecki)
- `291c0a2` index aware benchmark will be also valid for grouping, joining, etc. (jangorecki)
- `2556b05` lazy evaluation aware benchmarking (jangorecki)
- `213295c` address feedback on bench-vign improvements (jangorecki)
- `1cda608` reflect change of cores to 50 pct (jangorecki)
- `4e4578a` add example requested by Matt (jangorecki)
- `06fbb59` Merge branch 'master' into bench-vign (jangorecki)
- `6ee3826` Merge branch 'master' into bench-vign (MichaelChirico)
- `5b5b4bf` Update vignettes/datatable-benchmarking.Rmd (jangorecki)
- `dd49bd5` Update vignettes/datatable-benchmarking.Rmd (jangorecki)
- `7501185` Update vignettes/datatable-benchmarking.Rmd (jangorecki)
- `f414740` Update vignettes/datatable-benchmarking.Rmd (jangorecki)
```diff
@@ -4,7 +4,7 @@ date: "`r Sys.Date()`"
 output:
   rmarkdown::html_vignette:
     toc: true
-    number_sections: true
+    number_sections: false
 vignette: >
   %\VignetteIndexEntry{Benchmarking data.table}
   %\VignetteEngine{knitr::rmarkdown}
```
|
|
````diff
@@ -49,7 +49,7 @@ DT[V1 %in% v & V2 %in% v & V3 %in% v & V4 %in% v, verbose=TRUE]
 #...
 ```
 
-# subset: index aware benchmarking
+# index aware benchmarking
 
 For convenience `data.table` automatically builds an index on the fields you use to subset data. This adds some overhead to the first subset on particular fields but greatly reduces the time to query those columns in subsequent runs. When measuring speed, it is best to measure index creation and the query that uses the index separately. With such timings it is easy to decide on the optimal strategy for your use case.
 To control usage of the index, use the following options:
````
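As a sketch of the separate-measurement strategy described above, one can toggle auto-indexing and time the first and subsequent subsets independently (the data size and column name here are illustrative):

```r
library(data.table)
set.seed(1)
DT = data.table(V1 = sample(100L, 1e7L, replace = TRUE))

options(datatable.auto.index = FALSE)  # benchmark the index-free subset
system.time(DT[V1 == 5L])

options(datatable.auto.index = TRUE)
system.time(DT[V1 == 5L])  # first run also pays the cost of building the index
system.time(DT[V1 == 5L])  # subsequent runs can use the index
```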
|
|
````diff
@@ -105,6 +105,8 @@ setDTthreads(0) # use all available cores (default)
 getDTthreads() # check how many cores are currently used
 ```
 
+Keep in mind that using the `parallel` R package together with `data.table` will force `data.table` to use only a single core. Thus it is recommended to verify core utilization in resource monitoring tools, for example `htop`.
+
 # inside a loop prefer `set` instead of `:=`
 
 Unless you are utilizing an index when doing _sub-assign by reference_, you should prefer the `set` function, which does not impose the overhead of the `[.data.table` method call.
````
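A minimal sketch of that pattern (column names illustrative): updating one cell per iteration with `set` avoids repeated method dispatch.

```r
library(data.table)
DT = data.table(a = 1:10000, b = 0L)

# set(DT, i, j, value) performs the same sub-assign by reference as
# DT[i, b := value], but without the `[.data.table` method call overhead
for (i in seq_len(nrow(DT))) {
  set(DT, i = i, j = "b", value = DT$a[i] * 2L)
}
```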
|
|
@@ -125,3 +127,24 @@ setindex(DT, a)

# inside a loop prefer `setDT` instead of `data.table()`

As of now `data.table()` has some overhead, thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.
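For example (a sketch; the list and names are illustrative):

```r
library(data.table)

l = list(a = 1:3, b = c("x", "y", "z"))  # a valid list: equal-length columns

# data.table(a = ..., b = ...) would copy and run full validity checks;
# setDT() converts the list to a data.table by reference
setDT(l)
```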
|
|
||||||
# lazy evaluation aware benchmarking

## let applications optimize queries

In languages like Python, which do not support _lazy evaluation_, the following two filter queries would be processed in exactly the same way.
|
```r
DT = data.table(a=1L, b=2L)
DT[a == 1L]

col = "a"
filter = 1L
DT[DT[[col]] == filter]
```

R has a _lazy evaluation_ feature which allows an application to investigate and optimize expressions before they get evaluated. In the case above, if we filter using `DT[[col]] == filter`, we force materialization of the whole LHS logical vector. This prevents `data.table` from optimizing the expression whenever possible, and it basically falls back to the base R `data.frame` way of doing a subset. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).
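One way to observe the difference described here is `verbose=TRUE` (a sketch; the exact verbose output varies between versions):

```r
library(data.table)
DT = data.table(a = sample(10L, 1e6L, replace = TRUE), b = rnorm(1e6L))

DT[a == 1L, verbose = TRUE]   # expression is visible to data.table and may be optimized

col = "a"
filter = 1L
DT[DT[[col]] == filter]       # the logical vector is materialized before `[` is called
```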
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
|
||||||
## force applications to finish computation

There are multiple tools which try to be as lazy as possible. As a result, you might find that a query against such a solution finishes instantly, but that printing the result then takes much more time. This is because the query was not actually computed when it was called; it got computed when its results were required. You should therefore ensure that computation really took place. That is not a trivial task; the ultimate way to ensure it is to dump the results to disk, but that adds the overhead of writing to disk, which is then included in the timings of the query being benchmarked. An easy and cheap way to deal with it is, for example, to print the dimensions of the result (useful in grouping benchmarks), or to print the first and last element (useful in sorting benchmarks).
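The cheap checks mentioned above could look like this (a sketch; `data.table` itself computes eagerly, so this mainly matters when benchmarking it against lazy solutions):

```r
library(data.table)
DT = data.table(a = sample(5L, 1e6L, replace = TRUE), b = rnorm(1e6L))

res = DT[, .(total = sum(b)), by = a]

print(dim(res))                      # grouping benchmark: confirm the result materialized
print(res$total[c(1L, nrow(res))])   # first and last element, as in sorting benchmarks
```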