benchmarking vignette #3132
@@ -14,6 +14,9 @@ vignette: >
  \usepackage[utf8]{inputenc}
---

***

<style>
h2 {
  font-size: 20px;

@@ -22,7 +25,30 @@ h2 {
This document is meant to guide you in measuring the performance of `data.table`: a single place to document best practices and traps to avoid.

## General suggestions

Let's assume you are measuring a particular process. It is blazingly fast, taking only microseconds to evaluate.
What does that mean, and how should you approach such measurements?
The smaller the time measurements are, the relatively bigger the call overhead is. Call overhead can be perceived as noise in the measurement, caused by method dispatch, package/class initialization, low-level object constructors, etc. As a result you may naturally want to repeat the timing many times and take the average to deal with the noise. That is a valid approach, but the magnitude of the timing is much more important. What will be the impact of an extra 5, or let's say 5000, microseconds if writing the results to the target environment/format takes a minute? One second is 1 000 000 microseconds. Do microseconds, or even milliseconds, make any difference? There are cases where they do, for example when you call a function for every row; then you should definitely care about micro timings. The point is that in most users' benchmarks they won't make a difference. Most common R functions are vectorized, so you are not calling them for every row. If something is blazingly fast for your data and use case, then perhaps you do not have to worry about performance and benchmarks. Unless you want to scale your process, that is; then you should worry, because something that is blazingly fast today might not be as fast tomorrow, simply because your process will receive more data on input. Consequently, you should confirm that your process will scale.
Member: This paragraph gets close to the Knuth quote:
I think it would be almost strange not to include such a famous quote in a benchmarking vignette!

Member: Do we also want to cite a study about human perception? Here we get 13 milliseconds as the shortest time we can even detect:

Member: I would also just drop the last paragraph (Unless...scale), I feel it's belaboring the point.
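To put micro-level timings in perspective, here is a minimal sketch; the operation and numbers are only illustrative:

```r
library(microbenchmark)

# A "blazingly fast" call: per-evaluation cost is a few microseconds,
# so a large share of what we measure is call overhead / noise
microbenchmark(sum(1:100), times = 100L)

# Perspective: even 5000 extra microseconds is only 0.005 seconds,
# negligible next to a write step that takes a minute (60 seconds)
5000 / 1e6
```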
There are multiple dimensions that you should consider when examining how your process scales:

- increasing the number of rows on input
- cardinality of the data

Member: what does cardinality mean here? number of groups?

- skewness of the data - for most cases this should have the least importance

Member: I think you mean how much variation there is in

- increasing the number of columns on input - this is mostly relevant when your input is a matrix; for data frames a variable number of columns should be avoided as it leads to an undefined schema. We suggest modelling your data into a predefined schema so that extra columns are modeled (using *melt*/*unpivot*) as new groups of rows.
- prevalence of NAs in the input
- sortedness of the input
To measure the *scaling factor* for input size you have to measure timings for at least three different sizes, let's say 1 million, 10 million and 100 million rows. Those three measurements will allow you to conclude how your process scales. Why three and not two? From two sizes you cannot yet conclude whether the process scales linearly or exponentially. In theory, based on that, you can estimate how many rows you would need to receive on input for your process to take, for example, a minute or an hour to finish.

Member: I don't like "conclude",

Member: Drop this sentence? I haven't seen a use case for this, I'm not sure it adds much.
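A rough sketch of how such a scaling measurement could look; the grouping query and sizes below are only illustrative, and the largest size should be reduced if memory is limited:

```r
library(data.table)

sizes = c(1e6L, 1e7L, 1e8L)  # illustrative input sizes; shrink 1e8 if memory is tight
elapsed = sapply(sizes, function(n) {
  DT = data.table(id = sample(1e4L, n, TRUE), value = rnorm(n))
  system.time(DT[, .(value = mean(value)), by = id])[["elapsed"]]
})
data.frame(rows = sizes, elapsed = elapsed)
```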
Once we have our input scaled up to reduce the impact of call overhead, the next question that springs to mind is: should I repeat measurements multiple times? The answer strongly depends on your use case, that is, your data processing workflow. If a process is called just once in your workflow, why should you bother about its timing on the second, third... and 100th run? Things like the disk cache might make subsequent runs evaluate faster. Other optimizations might be triggered, such as memoizing results for a given input, or using indexes created on the first run. If your workflow does not repeatedly call your process, why should you do so in a benchmark? The main focus of benchmarks should be real use case scenarios.
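For example, `data.table`'s auto-indexing can make a second run of the same subset much faster than the first, so repeating and averaging would hide the one-off cost. A small sketch, with illustrative data:

```r
library(data.table)
set.seed(1)
DT = data.table(id = sample(1e7L), value = rnorm(1e7L))

system.time(DT[id == 5e6L])  # first run: may also build an index (auto-indexing)
system.time(DT[id == 5e6L])  # subsequent runs can reuse the index and are faster
```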
You should not forget to take extra care about the environment in which you are running the benchmark. It should be stripped of startup configurations, so consider `R --vanilla` mode. Any extra configuration should be well documented. Be sure to use recent releases of the tools you are benchmarking.
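One cheap way to document the environment is to record it together with the results, for example:

```r
# Record the benchmark environment alongside the results
sessionInfo()
packageVersion("data.table")
data.table::getDTthreads(verbose = TRUE)
```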
You should also not forget about being polite: if you're about to publish benchmarking results against another library, reach out to the authors of that package to check with them that you're using their library correctly.

Member: Not sure this advice is generally practical, maybe "reach out to experts of that package" is better?
***

## Best practices

### fread: clear caches

Ideally, each `fread` call should be run in a fresh session, with the following commands preceding the R execution. They clear the OS file cache in RAM and the HD cache.

@@ -35,7 +61,7 @@ sudo hdparm -t /dev/sda

When comparing `fread` to non-R solutions be aware that R requires values of character columns to be added to _R's global string cache_. This takes time when reading data but later operations benefit since the character strings have already been cached. Consequently, in addition to timing isolated tasks (such as `fread` alone), it's a good idea to benchmark the total time of an end-to-end pipeline of tasks such as reading data, manipulating it, and producing final output.

### subset: threshold for index optimization on compound queries

Index optimization for compound filter queries will not be used when the cross product of the elements provided to filter on exceeds 1e4 elements.
@@ -58,7 +84,7 @@ DT[V1 %in% v & V2 %in% v & V3 %in% v & V4 %in% v, verbose=TRUE]
#...
```

### index aware benchmarking

For convenience `data.table` automatically builds an index on the fields you use to subset data. It adds some overhead to the first subset on particular fields, but greatly reduces the time to query those columns in subsequent runs. When measuring speed, the best way is to measure index creation and the query that uses the index separately. Having such timings, it is easy to decide what the optimal strategy is for your use case.
To control the usage of indices use the following options:

@@ -79,42 +105,83 @@ options(datatable.optimize=3L)
`options(datatable.optimize=2L)` will turn off optimization of subsets completely, while `options(datatable.optimize=3L)` will switch it back on.
Those options affect many more optimizations and thus should not be used when only control of indices is needed. Read more in `?datatable.optimize`.
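A minimal sketch of index aware timing, separating index creation from the indexed query; the data and column names are illustrative, and `datatable.use.index` is one of the options discussed above:

```r
library(data.table)
set.seed(1)
DT = data.table(id = sample(1e7L), value = rnorm(1e7L))

system.time(setindex(DT, id))          # time to build the index, measured on its own
system.time(DT[id == 5e6L, value])     # query that can use the existing index

options(datatable.use.index = FALSE)
system.time(DT[id == 5e6L, value])     # the same query without using the index
options(datatable.use.index = TRUE)
```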
### _by reference_ operations

When benchmarking `set*` functions it only makes sense to measure the first run. These functions update their input by reference, so subsequent runs will use the already-processed `data.table`, biasing the results.

Protecting your `data.table` from being updated by reference operations can be achieved using the `copy` or `data.table:::shallow` functions. Be aware that `copy` might be very expensive as it needs to duplicate the whole object. It is unlikely we want to include that duplication time in the timing of the actual task we are benchmarking.
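A small sketch of the pattern, assuming `setkey` is the operation under test; the `copy` happens outside the timed expression so its cost is not mixed into the measurement:

```r
library(data.table)
DT = data.table(x = sample(1e7L))

DT2 = copy(DT)                 # duplicate outside the timing
system.time(setkey(DT2, x))    # first run: the only one worth reporting

# A second run would operate on an already-keyed table and be misleading
system.time(setkey(DT2, x))
```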
### try to benchmark atomic processes

If your benchmark is meant to be published, it will be much more insightful if you split it up to measure the time of atomic processes. This way your readers can see how much time was spent on reading data from the source, cleaning, the actual transformation, and exporting the results.
Of course, if your benchmark is meant to present an _end-to-end workflow_, then it makes perfect sense to present the overall timing. Nevertheless, separating out the timing of individual steps is useful for understanding which steps are the main bottlenecks of a workflow.
There are other cases when atomic benchmarking might not be desirable, for example when _reading a csv_ is followed by _grouping_. R requires populating _R's global string cache_, which adds extra overhead when importing character data into an R session. On the other hand, the _global string cache_ might speed up processes like _grouping_. In such cases, when comparing R to other languages, it might be useful to include the total timing.
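A rough sketch of splitting a pipeline into timed atomic steps; the data and query are only illustrative:

```r
library(data.table)
csv = tempfile(fileext = ".csv")
fwrite(data.table(id = sample(1e3L, 1e6L, TRUE), value = rnorm(1e6L)), csv)

t_read  = system.time(DT  <- fread(csv))[["elapsed"]]
t_group = system.time(agg <- DT[, .(value = mean(value)), by = id])[["elapsed"]]
t_write = system.time(fwrite(agg, tempfile(fileext = ".csv")))[["elapsed"]]
c(read = t_read, group = t_group, write = t_write, total = t_read + t_group + t_write)
```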
### avoid class coercion

Unless this is what you truly want to measure, you should prepare input objects of the expected class for every tool you are benchmarking, so that the benchmark measures the timing of the actual relevant computation rather than class coercion time.
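For instance, a hedged sketch with illustrative data: coerce once before benchmarking, rather than inside the timed expression:

```r
library(data.table)
library(microbenchmark)
df = data.frame(id = sample(1e6L), value = rnorm(1e6L))
dt = as.data.table(df)   # coerce once, outside the benchmark

microbenchmark(
  coerced_inside = as.data.table(df)[id %between% c(1L, 1000L)],
  prepared_input = dt[id %between% c(1L, 1000L)],
  times = 10L
)
```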
### avoid `microbenchmark(..., times=100)`

Be sure to read the _General suggestions_ section at the top of this document, which also covers this topic.
Repeating a benchmark many times usually does not give the clearest picture for data processing tools. Of course, it makes perfect sense for more atomic calculations, but this is not a good representation of the most common way these tools will actually be used, namely for data processing tasks, which consist of batches of sequentially provided transformations, each run once.

Matt once said:

> I'm very wary of benchmarks measured in anything under 1 second. Much prefer 10 seconds or more for a single run, achieved by increasing data size. A repetition count of 500 is setting off alarm bells. 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale.

This is very valid. The smaller the time measurement is, the relatively bigger the noise is: noise generated by method dispatch, package/class initialization, etc. The main focus of a benchmark should be real use case scenarios.
The example below illustrates the problem discussed:
```r
library(microbenchmark)
library(data.table)
set.seed(108)

N = 1e5L
dt = data.table(id=sample(N), value=rnorm(N))
setindex(dt, "id")
df = as.data.frame(dt)
microbenchmark(
  dt[id==5e4L, value],
  df[df$id==5e4L, "value"],
  times = 1000
)
#Unit: microseconds
#                         expr      min        lq      mean    median        uq       max neval
#      dt[id == 50000L, value] 1237.964 1359.5635 1466.9513 1392.1735 1443.1725 14500.751  1000
# df[df$id == 50000L, "value"]  355.063  391.2695  430.3884  404.7575  429.5605  2481.442  1000

N = 1e7L
dt = data.table(id=sample(N), value=rnorm(N))
setindex(dt, "id")
df = as.data.frame(dt)
microbenchmark(
  dt[id==5e6L, value],
  df[df$id==5e6L, "value"],
  times = 5
)
#Unit: milliseconds
#                           expr       min        lq     mean    median        uq       max neval
#      dt[id == 5000000L, value]  1.306013  1.367846  1.59317  1.709714  1.748953  1.833324     5
# df[df$id == 5000000L, "value"] 47.359246 47.858230 50.83947 51.774551 53.020058 54.185249     5
```
### multithreaded processing

One of the main factors that is likely to impact timings is the number of threads available to your R session. In recent versions of `data.table`, some functions are parallelized.
You can control the number of threads you want to use with `setDTthreads`.
By default `data.table` uses only half of the detected logical cores. Unless your environment shares resources with other heavy processes, you should get a speed-up from using all available cores.

```r
setDTthreads(NULL) # use half of available cores
setDTthreads(0)    # use all available cores
getDTthreads()     # check how many threads are currently set
```
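If you want to see the effect of the thread count on your own query, a small sketch with illustrative data could be:

```r
library(data.table)
DT = data.table(id = sample(1e5L, 1e7L, TRUE), value = rnorm(1e7L))

setDTthreads(1L)
t_single = system.time(DT[, .(value = mean(value)), by = id])[["elapsed"]]
setDTthreads(0L)   # all available cores
t_all = system.time(DT[, .(value = mean(value)), by = id])[["elapsed"]]
c(single = t_single, all = t_all)
```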
Keep in mind that using the `parallel` R package together with `data.table` will force `data.table` to use only a single core. Thus it is recommended to verify core utilization in resource monitoring tools, for example `htop`.
### inside a loop prefer `set` instead of `:=`

Unless you are utilizing an index when doing _sub-assign by reference_, you should prefer the `set` function, which does not impose the overhead of the `[.data.table` method call.

@@ -131,6 +198,25 @@ setindex(DT, a)
# }
```
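A rough sketch of the difference inside a loop; the column names and sizes are only illustrative:

```r
library(data.table)
library(microbenchmark)
DT1 = data.table(x = 1:1e4L)
DT2 = data.table(x = 1:1e4L)

microbenchmark(
  set    = for (i in 1:100L) set(DT1, j = "y", value = DT1$x + i),  # no [.data.table dispatch
  assign = for (i in 1:100L) DT2[, y := x + i],                     # dispatches [.data.table each time
  times = 5L
)
```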
### inside a loop prefer `setDT()` instead of `data.table()`

As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()`, better yet `setDT()`, or ideally to avoid class coercion altogether as described in _avoid class coercion_ above.
Member: Is there a bug for removing that overhead? Should it be cited as an HTML comment here?

Member (Author): No bug, I don't think we should be focusing on optimizing it, as.data.table is meant to be fast. These are still microseconds so it is only relevant when someone is looping on it.
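A small sketch of the overhead mentioned above, assuming the loop body only needs to wrap an existing list; the repetition counts are illustrative:

```r
library(data.table)
library(microbenchmark)

microbenchmark(
  data.table = for (i in 1:100L) data.table(a = 1L, b = 2L),   # constructor overhead each iteration
  setDT      = for (i in 1:100L) setDT(list(a = 1L, b = 2L)),  # converts a valid list in place
  times = 5L
)
```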
### lazy evaluation aware benchmarking

#### let applications optimize queries

In languages like Python, which do not support _lazy evaluation_, the following two filter queries would be processed exactly the same way.
```r
library(data.table)
DT = data.table(a=1L, b=2L)
DT[a == 1L]

DT[DT[["a"]] == 1L]
```
R has a _lazy evaluation_ feature which allows an application to inspect and optimize expressions before they get evaluated. In the above case, if we filter using `DT[[col]] == filter`, we force the whole LHS to be materialized. This prevents `data.table` from optimizing the expression where possible and basically falls back to the base R `data.frame` way of doing the subset. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).

#### force applications to finish computation

There are multiple applications which try to be as lazy as possible. As a result you might find that running a query against such a solution finishes instantly, but then printing the results takes much longer. That is because the query was not actually computed when it was called; it got computed (or perhaps only partially computed) when its results were required. Because of that, you should ensure that the computation has taken place completely. This is not a trivial task; the ultimate way to ensure it is to dump the results to disk, but that adds the overhead of writing to disk, which is then included in the timing of the query being benchmarked. An easy and cheap way to deal with it could be, for example, printing the dimensions of the result (useful in grouping benchmarks), or printing the first and last elements (useful in sorting benchmarks).

Member: I actually think the way out here is to focus again on end-to-end benchmarks. Actually lazy operations are very good if the computation never needs to materialize in the actual workflow. If we have a file from disk and read 100 columns in as lazy ALTREP, but the workflow only needs 5 columns, it's inefficient to materialize the other 95. So again making the benchmark as realistic as possible is key.
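A cheap way to force and verify that a result is fully materialized could look like the sketch below; note that `data.table` itself evaluates eagerly, so this only illustrates the general pattern with illustrative data:

```r
library(data.table)
DT = data.table(id = sample(1e3L, 1e6L, TRUE), value = rnorm(1e6L))

system.time({
  res = DT[, .(value = sum(value)), by = id]
  print(dim(res))        # forces the result's shape to be known (grouping benchmarks)
  print(res[c(1L, .N)])  # touch the first and last rows (sorting benchmarks)
})
```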