From 60a5694046d6ffa9f1ec0c78c5bc5793c168a1c9 Mon Sep 17 00:00:00 2001
From: jangorecki
Date: Thu, 1 Nov 2018 10:55:28 +0000
Subject: [PATCH 01/11] order of points in bench vign is not relevant

---
 vignettes/datatable-benchmarking.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/datatable-benchmarking.Rmd b/vignettes/datatable-benchmarking.Rmd
index a2622585ec..0664c60b36 100644
--- a/vignettes/datatable-benchmarking.Rmd
+++ b/vignettes/datatable-benchmarking.Rmd
@@ -4,7 +4,7 @@ date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
-    number_sections: true
+    number_sections: false
vignette: >
  %\VignetteIndexEntry{Benchmarking data.table}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

From 9554c3db438bcf5e0c2f98e98d7dfdcd763e89eb Mon Sep 17 00:00:00 2001
From: jangorecki
Date: Thu, 1 Nov 2018 11:01:47 +0000
Subject: [PATCH 02/11] mention parallel pkg in bench vignette

---
 vignettes/datatable-benchmarking.Rmd | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/vignettes/datatable-benchmarking.Rmd b/vignettes/datatable-benchmarking.Rmd
index 0664c60b36..88cee84c78 100644
--- a/vignettes/datatable-benchmarking.Rmd
+++ b/vignettes/datatable-benchmarking.Rmd
@@ -105,6 +105,8 @@ setDTthreads(0) # use all available cores (default)
getDTthreads() # check how many cores are currently used
```

+Keep in mind that using the `parallel` R package together with `data.table` will force `data.table` to use only a single core. Thus it is recommended to verify core utilization in resource monitoring tools, for example `htop`.
+
# inside a loop prefer `set` instead of `:=`

Unless you are utilizing an index when doing _sub-assign by reference_, you should prefer the `set` function, which does not impose the overhead of the `[.data.table` method call.

From 291c0a221ae77459ff19924ee65bcf89adc6755d Mon Sep 17 00:00:00 2001
From: jangorecki
Date: Thu, 1 Nov 2018 11:04:39 +0000
Subject: [PATCH 03/11] index aware benchmark will be also valid for grouping, joining, etc.

---
 vignettes/datatable-benchmarking.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/datatable-benchmarking.Rmd b/vignettes/datatable-benchmarking.Rmd
index 88cee84c78..247b0662d0 100644
--- a/vignettes/datatable-benchmarking.Rmd
+++ b/vignettes/datatable-benchmarking.Rmd
@@ -49,7 +49,7 @@ DT[V1 %in% v & V2 %in% v & V3 %in% v & V4 %in% v, verbose=TRUE]
#...
```

-# subset: index aware benchmarking
+# index aware benchmarking

For convenience `data.table` automatically builds an index on the fields you use to subset data. It adds some overhead to the first subset on particular fields but greatly reduces the time to query those columns in subsequent runs. When measuring speed, the best way is to measure index creation and the query using an index separately. Having such timings, it is easy to decide what the optimal strategy is for your use case.
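To make that split concrete, below is a minimal sketch of such a measurement (the table size, column names and subset values are illustrative assumptions, not part of the patch):

```r
library(data.table)
set.seed(108)
N = 1e7L
DT = data.table(V1 = sample(N), V2 = rnorm(N))

# the first subset pays the cost of building the index on V1
system.time(DT[V1 %in% 100:200])
# subsequent subsets reuse the index and are much faster
system.time(DT[V1 %in% 100:200])

# alternatively, time index creation and the indexed query separately
setindex(DT, NULL)                # drop existing indices
system.time(setindex(DT, V1))     # cost of building the index alone
system.time(DT[V1 %in% 100:200])  # cost of the indexed query alone
```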
To control usage of the index, use the following options:

From 2556b05a65c71aaca485aac3f53cf2908e5e6c8e Mon Sep 17 00:00:00 2001
From: jangorecki
Date: Thu, 1 Nov 2018 11:27:41 +0000
Subject: [PATCH 04/11] lazy evaluation aware benchmarking

---
 vignettes/datatable-benchmarking.Rmd | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/vignettes/datatable-benchmarking.Rmd b/vignettes/datatable-benchmarking.Rmd
index 247b0662d0..53d5db575a 100644
--- a/vignettes/datatable-benchmarking.Rmd
+++ b/vignettes/datatable-benchmarking.Rmd
@@ -127,3 +127,24 @@ setindex(DT, a)
# inside a loop prefer `setDT` instead of `data.table()`

As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.
+
+# lazy evaluation aware benchmarking
+
+## let applications optimize queries
+
+In languages like Python, which do not support _lazy evaluation_, the following two filter queries would be processed exactly the same way.
+
+```r
+DT = data.table(a=1L, b=2L)
+DT[a == 1L]
+
+col = "a"
+filter = 1L
+DT[DT[[col]] == filter]
+```
+
+R has a _lazy evaluation_ feature which allows an application to investigate and optimize an expression before it gets evaluated. In the above case, if we filter using `DT[[col]] == filter`, we force materialization of the whole LHS. This prevents `data.table` from optimizing the expression whenever possible and basically falls back to the base R `data.frame` way of doing subsets. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).
+
+## force applications to finish computation
+
+There are multiple applications which try to be as lazy as possible. As a result you might find that when you run a query against such a solution it finishes instantly, but then printing the results takes much more time. That is because the query was not actually computed at the time it was called, but only when its results were required. Because of the above you should ensure that computation took place. It is not a trivial task; the ultimate way to ensure it is to dump results to disk, but that adds the overhead of writing to disk, which is then included in the timings of the query we are benchmarking. An easy and cheap way to deal with it is, for example, printing the dimensions of the result (useful in grouping benchmarks), or printing the first and last elements (useful in sorting benchmarks).

From 213295c7827f6e74200cc0d1045407bb54d5824e Mon Sep 17 00:00:00 2001
From: jangorecki
Date: Wed, 6 Mar 2019 10:14:25 +0530
Subject: [PATCH 05/11] address feedback on bench-vign improvements

---
 vignettes/datatable-benchmarking.Rmd | 55 ++++++++++++++++++++--------
 1 file changed, 39 insertions(+), 16 deletions(-)

diff --git a/vignettes/datatable-benchmarking.Rmd b/vignettes/datatable-benchmarking.Rmd
index 53d5db575a..54d2dae871 100644
--- a/vignettes/datatable-benchmarking.Rmd
+++ b/vignettes/datatable-benchmarking.Rmd
@@ -4,16 +4,38 @@ date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
-    number_sections: false
vignette: >
  %\VignetteIndexEntry{Benchmarking data.table}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

This document is meant to guide on measuring performance of `data.table`. It is a single place to document best practices and traps to avoid.

***

## General suggestions

Lets assume you are measuring particular process. It is blazingly fast, it takes only microseonds to evalute.
What does it mean and how to approach such measurements?
The smaller the time measurements are, the relatively bigger the call overhead is. Call overhead can be perceived as noise in a measurement, caused by method dispatch, package/class initialization, low-level object constructors, etc. As a result you naturally may want to measure the timing many times and take the average, to deal with the noise. This is a valid approach, but the magnitude of the timing is much more important. What will be the impact of an extra 5, or let's say 5000, microseconds if writing results to the target environment/format takes a minute? 1 second is 1 000 000 microseconds. Do microseconds, or even milliseconds, make any difference? There are cases where they do, for example when you call a function for every row; then you definitely should care about micro timings. The point is that in most users' benchmarks it won't make a difference. Most common R functions are vectorized, thus you are not calling them for every row. If something is blazingly fast for your data and use case then perhaps you do not have to worry about performance and benchmarks. Unless you want to scale your process; then you should worry, because if something is blazingly fast today it might not be that fast tomorrow, simply because your process will receive more data on input. In consequence you should confirm that your process will scale.
There are multiple dimensions that you should consider when examining scaling of your process.
- increase numbers of rows on input
- cardinality of data
- skewness of data - for most cases this should have the least importance
- increase numbers of columns on input - this will be mostly valid when your input is a matrix, for data frames a variable number of columns should be avoided as it leads to an undefined schema. We suggest to model your data into a predefined schema so the extra columns are modeled (using *melt*/*unpivot*) as new groups of rows.
- presence of NAs in input
- sortedness of input

To measure the *scaling factor* for input size you have to measure timings of at least three different sizes, let's say numbers of rows of 1 million, 10 million and 100 million. Those three measurements will allow you to conclude how your process scales. Why three and not two? From two sizes you cannot yet conclude whether the process scales linearly or exponentially. In theory, based on that, you can estimate how many rows you would need to receive on input so that your process would take, for example, a minute or an hour to finish.
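A sketch of such a three-size measurement (the grouping query, cardinality and sizes are illustrative assumptions; the largest size needs several GB of memory):

```r
library(data.table)
time_grouping = function(n) {
  DT = data.table(id = sample(n %/% 100L, n, replace=TRUE), v = rnorm(n))
  system.time(DT[, .(s = sum(v)), by = id])[["elapsed"]]
}
sizes = c(1e6L, 1e7L, 1e8L)
data.table(rows = sizes, seconds = sapply(sizes, time_grouping))
# compare the growth of `seconds` against the growth of `rows`
```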
Once we have our input scaled up to reduce the impact of call overhead, the next thing that springs to mind is: should I repeat measurements multiple times? The answer is that it strongly depends on your use case, a data processing workflow. If a process is called just once in your workflow, why should you bother about its timing on the second, third... and 100th run? Things like disk cache might cause subsequent runs to evaluate faster. Other optimizations might be triggered, like memoization of results for a given input, or use of indexes created on the first run. If your workflow does not repeatedly call your process, why should you do it in a benchmark? The main focus of benchmarks should be real use case scenarios.

You should not forget to take extra care about the environment in which you are running the benchmark. It should be stripped of startup configurations, so consider `R --vanilla` mode. Any extra configurations should be well documented.
Be sure to use recent releases of the tools you are benchmarking.
You should also not forget about being polite, and if you're about to publish some benchmarking results against another library -- reach out to the authors of that other package to check with them if you're using their library correctly.

***

## Best practices

### fread: clear caches

Ideally each `fread` call should be run in a fresh session, with the following commands preceding R execution. This clears the OS file cache in RAM and the HD cache.

```sh
free -g
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
sudo hdparm -t /dev/sda
```

When comparing `fread` to non-R solutions, be aware that R requires values of character columns to be added to _R's global string cache_. This takes time when reading data, but later operations benefit since the character strings have already been cached. Consequently, as well as timing isolated tasks (such as `fread` alone), it's a good idea to benchmark a pipeline of tasks such as reading data, computing operators and producing final output, and report the total time of the pipeline.

### subset: threshold for index optimization on compound queries

Index optimization for compound filter queries will not be used when the cross product of the elements provided to filter on exceeds 1e4 elements.

```r
DT = data.table(V1=1:10, V2=1:10, V3=1:10, V4=1:10)
nrow(DT[V1 %in% 1:9 & V2 %in% 1:9 & V3 %in% 1:9 & V4 %in% 1:9, verbose=TRUE])
#Optimized subsetting with index 'V1__V2__V3__V4'
#[1] 9
v = 1:10
nrow(DT[V1 %in% v & V2 %in% v & V3 %in% v & V4 %in% v, verbose=TRUE])
#...
```

### index aware benchmarking

For convenience `data.table` automatically builds an index on the fields you use to subset data. It adds some overhead to the first subset on particular fields but greatly reduces the time to query those columns in subsequent runs. When measuring speed, the best way is to measure index creation and the query using an index separately. Having such timings, it is easy to decide what the optimal strategy is for your use case.
To control usage of the index, use the following options:

```r
options(datatable.auto.index=TRUE)
options(datatable.use.index=TRUE)
```

- `use.index=FALSE` will force the query not to use indices even if they exist, but existing keys are still used for optimization.
- `auto.index=FALSE` disables building an index automatically when doing a subset on non-indexed data, but if indices were created before this option was set, or explicitly by calling `setindex`, they will still be used for optimization.

Two other options control optimization globally, including the use of indices:

```r
options(datatable.optimize=2L)
options(datatable.optimize=3L)
```

`options(datatable.optimize=2L)` will turn off optimization of subsets completely, while `options(datatable.optimize=3L)` will switch it back on. These options affect many more optimizations, thus they should not be used when only control of the index is needed. Read more in `?datatable.optimize`.

### _by reference_ operations

When benchmarking `set*` functions it makes sense to measure only the first run. Those functions update the data.table by reference, thus in subsequent runs they get an already-processed `data.table` on input.
Protecting your `data.table` from being updated by reference operations can be achieved using the `copy` or `data.table:::shallow` functions. Be aware that `copy` might be very expensive, as it needs to duplicate the whole object. It is unlikely we want to include duplication time in the timing of the actual task we are benchmarking.

### try to benchmark atomic processes

If your benchmark is meant to be published it will be much more insightful if you split it to measure the time of atomic processes. This way your readers can see how much time was spent on reading data from source, cleaning, actual transformation, and exporting results. Of course if your benchmark is meant to present a _full workflow_ then it makes perfect sense to present the total timing; still, splitting the timings might give good insight into bottlenecks in such a workflow.
There are other cases when it might not be desired, for example when benchmarking _reading csv_, followed by _grouping_. R requires population of _R's global string cache_, which adds extra overhead when importing character data to an R session. On the other hand, the _global string cache_ might speed up processes like _grouping_. In such cases, when comparing R to other languages, it might be useful to include the total timing.
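A sketch of such split timings for a small read-transform-write pipeline (the file name and columns are hypothetical; each step is timed on its own so the report can show both atomic and total times):

```r
library(data.table)
t_read      = system.time(DT  <- fread("data.csv"))             # reading csv
t_transform = system.time(ANS <- DT[, .(s = sum(v)), by = id])  # actual transformation
t_write     = system.time(fwrite(ANS, "ans.csv"))               # exporting results
rbind(t_read, t_transform, t_write)  # atomic timings; their sum is the full workflow time
```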
### avoid class coercion

Unless this is what you truly want to measure you should prepare input objects for every tools you are benchmarking in expected class.

### avoid `microbenchmark(..., times=100)`

Be sure to read the _General suggestions_ section at the top of this document, as it also covers this topic well.
Repeating a benchmark many times usually does not fit well for data processing tools. Of course it makes perfect sense for more atomic calculations, but it does not represent well the use case for common data processing tasks, which rather consist of batches of sequentially provided transformations, each run once.
Matt once said:

> I'm very wary of benchmarks measured in anything under 1 second. Much prefer 10 seconds or more for a single run, achieved by increasing data size. A repetition count of 500 is setting off alarm bells. 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale.

This is very valid. The smaller the time measurement is, the relatively bigger the noise is. Noise is generated by method dispatch, package/class initialization, etc. The main focus of a benchmark should be real use case scenarios.

### multithreaded processing

One of the main factors that is likely to impact timings is the number of threads in your machine. In recent versions of `data.table` some of the functions have been parallelized. You can control how many threads you want to use with `setDTthreads`.

```r
setDTthreads(0) # use all available cores (default)
getDTthreads() # check how many cores are currently used
```

Keep in mind that using the `parallel` R package together with `data.table` will force `data.table` to use only a single core. Thus it is recommended to verify core utilization in resource monitoring tools, for example `htop`.

### inside a loop prefer `set` instead of `:=`

Unless you are utilizing an index when doing _sub-assign by reference_, you should prefer the `set` function, which does not impose the overhead of the `[.data.table` method call.

```r
DT = data.table(a=3:1, b=letters[1:3])
setindex(DT, a)
# ...
# }
```
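A minimal sketch of the difference (the loop bounds and columns are arbitrary illustration; `set` calls into C directly, while `:=` goes through `[.data.table` on every iteration):

```r
library(data.table)
DT = data.table(a = 1:1000, b = 0)

system.time(
  for (i in 1:1000) DT[i, b := a + 1]  # dispatches `[.data.table` on each iteration
)
system.time(
  for (i in 1:1000) set(DT, i = i, j = "b", value = DT$a[i] + 1)  # minimal per-call overhead
)
```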
### inside a loop prefer `setDT` instead of `data.table()`

As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.

### lazy evaluation aware benchmarking

#### let applications optimize queries

In languages like Python, which do not support _lazy evaluation_, the following two filter queries would be processed exactly the same way.

```r
DT = data.table(a=1L, b=2L)
DT[a == 1L]

col = "a"
filter = 1L
DT[DT[[col]] == filter]
```

R has a _lazy evaluation_ feature which allows an application to investigate and optimize an expression before it gets evaluated. In the above case, if we filter using `DT[[col]] == filter`, we force materialization of the whole LHS. This prevents `data.table` from optimizing the expression whenever possible and basically falls back to the base R `data.frame` way of doing subsets. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).

#### force applications to finish computation

There are multiple applications which try to be as lazy as possible. As a result you might find that when you run a query against such a solution it finishes instantly, but then printing the results takes much more time. That is because the query was not actually computed at the time it was called, but only when its results were required. Because of the above you should ensure that computation took place. It is not a trivial task; the ultimate way to ensure it is to dump results to disk, but that adds the overhead of writing to disk, which is then included in the timings of the query we are benchmarking. An easy and cheap way to deal with it is, for example, printing the dimensions of the result (useful in grouping benchmarks), or printing the first and last elements (useful in sorting benchmarks).

From 1cda60800799e8fc01c2d2f60b093dbb19f08509 Mon Sep 17 00:00:00 2001
From: jangorecki
Date: Wed, 6 Mar 2019 10:25:21 +0530
Subject: [PATCH 06/11] reflect change of cores to 50 pct

---
 vignettes/datatable-benchmarking.Rmd | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/vignettes/datatable-benchmarking.Rmd b/vignettes/datatable-benchmarking.Rmd
index 54d2dae871..8f24532fe8 100644
--- a/vignettes/datatable-benchmarking.Rmd
+++ b/vignettes/datatable-benchmarking.Rmd
@@ -122,10 +122,12 @@ This is very valid. The smaller time measurement is the relatively bigger noise

One of the main factors that is likely to impact timings is the number of threads in your machine. In recent versions of `data.table` some of the functions have been parallelized. You can control how many threads you want to use with `setDTthreads`.

+Starting from 1.12.2, `data.table` uses only half of the available logical cores by default. Unless your environment is sharing resources with other heavy processes, you should get a speed-up when setting it to use all available cores.
```r
-setDTthreads(0) # use all available cores (default)
-getDTthreads() # check how many cores are currently used
+setDTthreads(NULL) # use half of available cores
+setDTthreads(0) # use all available cores
+getDTthreads() # check how many cores are set
```

Keep in mind that using the `parallel` R package together with `data.table` will force `data.table` to use only a single core. Thus it is recommended to verify core utilization in resource monitoring tools, for example `htop`.

From 4e4578a7ba9177b12b1e6081f2ae704ce62bb8d8 Mon Sep 17 00:00:00 2001
From: jangorecki
Date: Mon, 6 Nov 2023 18:25:40 +0100
Subject: [PATCH 07/11] add example requested by Matt

---
 vignettes/datatable-benchmarking.Rmd | 47 +++++++++++++++++++++++-----
 1 file changed, 40 insertions(+), 7 deletions(-)

diff --git a/vignettes/datatable-benchmarking.Rmd b/vignettes/datatable-benchmarking.Rmd
index 8f24532fe8..3ad6d013a7 100644
--- a/vignettes/datatable-benchmarking.Rmd
+++ b/vignettes/datatable-benchmarking.Rmd
@@ -106,7 +106,7 @@ There are another cases when it might not be desired, for example when benchmark

### avoid class coercion

-Unless this is what you truly want to measure you should prepare input objects for every tools you are benchmarking in expected class.
+Unless this is what you truly want to measure, you should prepare input objects for every tool you are benchmarking in their expected class, so the benchmark measures the timing of the actual computation rather than class coercion time plus computation time.

### avoid `microbenchmark(..., times=100)`

Be sure to read the _General suggestions_ section at the top of this document, as it also covers this topic well.
Repeating a benchmark many times usually does not fit well for data processing tools. Of course it makes perfect sense for more atomic calculations, but it does not represent well the use case for common data processing tasks, which rather consist of batches of sequentially provided transformations, each run once.
Matt once said:

> I'm very wary of benchmarks measured in anything under 1 second. Much prefer 10 seconds or more for a single run, achieved by increasing data size. A repetition count of 500 is setting off alarm bells. 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale.

This is very valid. The smaller the time measurement is, the relatively bigger the noise is. Noise is generated by method dispatch, package/class initialization, etc. The main focus of a benchmark should be real use case scenarios.

+The example below illustrates the problem discussed:
+```r
+library(microbenchmark)
+library(data.table)
+set.seed(108)
+
+N = 1e5L
+dt = data.table(id=sample(N), value=rnorm(N))
+setindex(dt, "id")
+df = as.data.frame(dt)
+microbenchmark(
+  dt[id==5e4L, value],
+  df[df$id==5e4L, "value"],
+  times = 1000
+)
+#Unit: microseconds
+#                         expr      min        lq      mean    median        uq       max neval
+#      dt[id == 50000L, value] 1237.964 1359.5635 1466.9513 1392.1735 1443.1725 14500.751  1000
+# df[df$id == 50000L, "value"]  355.063  391.2695  430.3884  404.7575  429.5605  2481.442  1000
+
+N = 1e7L
+dt = data.table(id=sample(N), value=rnorm(N))
+setindex(dt, "id")
+df = as.data.frame(dt)
+microbenchmark(
+  dt[id==5e6L, value],
+  df[df$id==5e6L, "value"],
+  times = 5
+)
+#Unit: milliseconds
+#                           expr       min        lq     mean    median        uq       max neval
+#      dt[id == 5000000L, value]  1.306013  1.367846  1.59317  1.709714  1.748953  1.833324     5
+# df[df$id == 5000000L, "value"] 47.359246 47.858230 50.83947 51.774551 53.020058 54.185249     5
+```

### multithreaded processing

-### inside a loop prefer `setDT` instead of `data.table()`
+### inside a loop prefer `setDT()` instead of `data.table()`

-As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.
+As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()`, even better `setDT()`, or ideally avoid class coercion as described in _avoid class coercion_ above.

#### let applications optimize queries

In languages like Python, which do not support _lazy evaluation_, the following two filter queries would be processed exactly the same way.

```r
DT = data.table(a=1L, b=2L)
DT[a == 1L]

-col = "a"
-filter = 1L
-DT[DT[[col]] == filter]
+DT[DT[["a"]] == 1L]
```

R has a _lazy evaluation_ feature which allows an application to investigate and optimize an expression before it gets evaluated. In the above case, if we filter using `DT[DT[["a"]] == 1L]`, we force materialization of the whole LHS. This prevents `data.table` from optimizing the expression whenever possible and basically falls back to the base R `data.frame` way of doing subsets. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).

#### force applications to finish computation

There are multiple applications which try to be as lazy as possible. As a result you might find that when you run a query against such a solution it finishes instantly, but then printing the results takes much more time. That is because the query was not actually computed at the time it was called, but got computed (or even only partially computed) when its results were required. Because of that you should ensure that computation took place completely. It is not a trivial task; the ultimate way to ensure it is to dump results to disk, but that adds the overhead of writing to disk, which is then included in the timings of the query we are benchmarking. An easy and cheap way to deal with it is, for example, printing the dimensions of the result (useful in grouping benchmarks), or printing the first and last elements (useful in sorting benchmarks).
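A sketch of that cheap-materialization recipe (`data.table` itself computes eagerly, so this only illustrates the pattern you would apply to lazier solutions; the data and queries are made up):

```r
library(data.table)
set.seed(108)
DT = data.table(id = sample(1e4L, 1e6L, replace=TRUE), v = rnorm(1e6L))

system.time({
  ANS = DT[, .(s = sum(v)), by = id]
  print(dim(ANS))                 # forces and proves the grouped result exists
})
system.time({
  ANS = DT[order(v)]
  print(ANS$v[c(1L, nrow(ANS))])  # first and last element prove the sort completed
})
```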
From 5b5b4bf60c63d8a63c5ead471ae5c5a76f640c65 Mon Sep 17 00:00:00 2001
From: Jan Gorecki
Date: Wed, 27 Aug 2025 20:57:55 +0600
Subject: [PATCH 08/11] Update vignettes/datatable-benchmarking.Rmd

Co-authored-by: Michael Chirico
---
 vignettes/datatable-benchmarking.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/datatable-benchmarking.Rmd b/vignettes/datatable-benchmarking.Rmd
index c1af2cb16e..3006e5f033 100644
--- a/vignettes/datatable-benchmarking.Rmd
+++ b/vignettes/datatable-benchmarking.Rmd
@@ -27,7 +27,7 @@ This document is meant to guide on measuring performance of `data.table`. Single

## General suggestions

-Lets assume you are measuring particular process. It is blazingly fast, it takes only microseonds to evalute.
+Let's assume you are measuring a particular process. It is blazingly fast, taking only microseconds to evaluate.
What does it mean and how to approach such measurements?
The smaller the time measurements are, the relatively bigger the call overhead is. Call overhead can be perceived as noise in a measurement, caused by method dispatch, package/class initialization, low-level object constructors, etc. As a result you naturally may want to measure the timing many times and take the average, to deal with the noise. This is a valid approach, but the magnitude of the timing is much more important. What will be the impact of an extra 5, or let's say 5000, microseconds if writing results to the target environment/format takes a minute? 1 second is 1 000 000 microseconds. Do microseconds, or even milliseconds, make any difference? There are cases where they do, for example when you call a function for every row; then you definitely should care about micro timings. The point is that in most users' benchmarks it won't make a difference. Most common R functions are vectorized, thus you are not calling them for every row. If something is blazingly fast for your data and use case then perhaps you do not have to worry about performance and benchmarks. Unless you want to scale your process; then you should worry, because if something is blazingly fast today it might not be that fast tomorrow, simply because your process will receive more data on input. In consequence you should confirm that your process will scale.

From dd49bd5031aa0fb23c02b91836d38e090f1c4ece Mon Sep 17 00:00:00 2001
From: Jan Gorecki
Date: Wed, 27 Aug 2025 20:59:11 +0600
Subject: [PATCH 09/11] Update vignettes/datatable-benchmarking.Rmd

Co-authored-by: Michael Chirico
---
 vignettes/datatable-benchmarking.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/datatable-benchmarking.Rmd b/vignettes/datatable-benchmarking.Rmd
index 3006e5f033..87a99c1d46 100644
--- a/vignettes/datatable-benchmarking.Rmd
+++ b/vignettes/datatable-benchmarking.Rmd
@@ -30,7 +30,7 @@ This document is meant to guide on measuring performance of `data.table`. Single

Let's assume you are measuring a particular process. It is blazingly fast, taking only microseconds to evaluate.
What does it mean and how to approach such measurements?
-There are multiple dimensions that you should consider when examining scaling of your process.
+There are multiple dimensions that you should consider when examining how your process scales:
- increase numbers of rows on input
- cardinality of data
- skewness of data - for most cases this should have the least importance

From 7501185070be6ac04af7f0600acb13316bca2eb0 Mon Sep 17 00:00:00 2001
From: Jan Gorecki
Date: Wed, 27 Aug 2025 20:59:22 +0600
Subject: [PATCH 10/11] Update vignettes/datatable-benchmarking.Rmd

Co-authored-by: Michael Chirico
---
 vignettes/datatable-benchmarking.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/datatable-benchmarking.Rmd b/vignettes/datatable-benchmarking.Rmd
index 87a99c1d46..2ce64ed6ec 100644
--- a/vignettes/datatable-benchmarking.Rmd
+++ b/vignettes/datatable-benchmarking.Rmd
@@ -35,7 +35,7 @@ There are multiple dimensions that you should consider when examining how your p
- cardinality of data
- skewness of data - for most cases this should have the least importance
- increase numbers of columns on input - this will be mostly valid when your input is a matrix, for data frames a variable number of columns should be avoided as it leads to an undefined schema.
We suggest to model your data into a predefined schema so the extra columns are modeled (using *melt*/*unpivot*) as new groups of rows.
-- presence of NAs in input
+- prevalence of NAs in input
- sortedness of input

From f4147404012b5d6639f5c19787c3cf062958ff3b Mon Sep 17 00:00:00 2001
From: Jan Gorecki
Date: Wed, 27 Aug 2025 20:59:40 +0600
Subject: [PATCH 11/11] Update vignettes/datatable-benchmarking.Rmd

Co-authored-by: Michael Chirico
---
 vignettes/datatable-benchmarking.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/datatable-benchmarking.Rmd b/vignettes/datatable-benchmarking.Rmd
index 2ce64ed6ec..6301cfbcaa 100644
--- a/vignettes/datatable-benchmarking.Rmd
+++ b/vignettes/datatable-benchmarking.Rmd
@@ -31,7 +31,7 @@ Let's assume you are measuring a particular process. It is blazingly fast, takin
What does it mean and how to approach such measurements?
There are multiple dimensions that you should consider when examining how your process scales:
-- increase numbers of rows on input
+- increase number of rows on input
- cardinality of data
- skewness of data - for most cases this should have the least importance
- increase numbers of columns on input - this will be mostly valid when your input is a matrix, for data frames a variable number of columns should be avoided as it leads to an undefined schema. We suggest to model your data into a predefined schema so the extra columns are modeled (using *melt*/*unpivot*) as new groups of rows.
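The scaling dimensions listed in the final state of this list can be exercised with a small input generator; a sketch (all knobs and defaults here are illustrative assumptions, not part of the patches):

```r
library(data.table)
make_input = function(n, cardinality = n %/% 10L, na_share = 0, sorted = FALSE) {
  x = sample(cardinality, n, replace = TRUE)         # number of rows, cardinality
  if (na_share > 0) x[sample(n, n * na_share)] = NA  # prevalence of NAs in input
  if (sorted) x = sort(x, na.last = TRUE)            # sortedness of input
  data.table(id = x, v = rnorm(n))
}
# benchmark the same query across variants of one dimension at a time, e.g.
DT_sorted   = make_input(1e6L, cardinality = 100L, sorted = TRUE)
DT_unsorted = make_input(1e6L, cardinality = 100L, sorted = FALSE)
```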