27 changes: 25 additions & 2 deletions vignettes/datatable-benchmarking.Rmd
@@ -4,7 +4,7 @@
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: true
number_sections: false
vignette: >
%\VignetteIndexEntry{Benchmarking data.table}
%\VignetteEngine{knitr::rmarkdown}
@@ -49,7 +49,7 @@
```r
DT[V1 %in% v & V2 %in% v & V3 %in% v & V4 %in% v, verbose=TRUE]
#...
```

# index aware benchmarking

For convenience, `data.table` automatically builds an index on the fields you use to subset data. This adds some overhead to the first subset on particular fields, but greatly reduces the time to query those columns in subsequent runs. When measuring speed, it is best to measure index creation and a query that uses an index separately; with such timings it is easy to decide on the optimal strategy for your use case.
To control usage of the index, use the following options:
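For instance, a minimal sketch that separates the three measurements, assuming the `datatable.auto.index` and `datatable.use.index` options provided by `data.table`:

```r
library(data.table)
set.seed(108)
DT = data.table(V1 = sample(100L, 1e7L, TRUE))

options(datatable.auto.index = FALSE) # do not create indices as a side effect of subsetting
options(datatable.use.index = FALSE)  # ignore indices even if they exist
system.time(DT[V1 %in% 3:5])          # baseline: no index involved

options(datatable.use.index = TRUE)
system.time(setindex(DT, V1))         # cost of building the index, measured on its own
system.time(DT[V1 %in% 3:5])          # cost of the index-aware subset
```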
@@ -105,6 +105,8 @@
```r
setDTthreads(0) # use all available cores (default)
getDTthreads() # check how many cores are currently used
```

Keep in mind that using the `parallel` R package together with `data.table` will force `data.table` to use only a single core. Thus it is recommended to verify core utilization in resource monitoring tools, for example `htop`.
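As a quick check from within R itself (a sketch; `mclapply` forks only on unix-alikes, and `data.table` drops to a single thread inside forked children):

```r
library(data.table)
library(parallel)

getDTthreads() # typically all available cores in the main session

# data.table switches itself to a single thread inside forked workers
unlist(mclapply(1:2, function(i) getDTthreads(), mc.cores = 2L))
```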
# inside a loop prefer `set` instead of `:=`

Unless you are utilizing an index when doing _sub-assign by reference_, you should prefer the `set` function, which does not impose the overhead of the `[.data.table` method call; see the sketch below.
@@ -125,3 +125,24 @@
```r
setindex(DT, a)
```
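A minimal sketch contrasting the two in a loop (toy sizes; absolute timings will vary, the gap comes from repeating the method dispatch):

```r
library(data.table)
DT = data.table(a = 1:1000, b = 0L)

# `:=` goes through the `[.data.table` method on every iteration
system.time(for (i in seq_len(nrow(DT))) DT[i, b := a])

# `set` performs the same sub-assign by reference without that dispatch
system.time(for (i in seq_len(nrow(DT))) set(DT, i, "b", DT$a[i]))
```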
# inside a loop prefer `setDT` instead of `data.table()`

As of now, `data.table()` has an overhead; thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.
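A minimal sketch of the difference (a hypothetical toy loop; the list passed to `setDT` must already be a valid list of equal-length columns):

```r
library(data.table)

system.time(for (i in 1:1e4) data.table(a = 1L, b = 2L))  # full constructor every iteration
system.time(for (i in 1:1e4) setDT(list(a = 1L, b = 2L))) # cheap promotion of a valid list by reference
```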

# lazy evaluation aware benchmarking

## let applications optimize queries

In languages like Python which do not support _lazy evaluation_, the following two filter queries would be processed exactly the same way.
```r
DT = data.table(a=1L, b=2L)
DT[a == 1L]

col = "a"
filter = 1L
DT[DT[[col]] == filter]
```

R has _lazy evaluation_, which allows an application to investigate and optimize an expression before it gets evaluated; SQL engines also do this. In the above case, if we filter using `DT[[col]] == filter`, we force the whole LHS to be materialized. This prevents `data.table` from optimizing the expression whenever possible, and it basically falls back to the base R `data.frame` way of doing a subset. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).
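One way to observe the difference is `verbose=TRUE` (a sketch; the exact messages depend on the `data.table` version):

```r
library(data.table)
DT = data.table(a = sample(10L, 1e6L, TRUE))

DT[a == 1L, verbose=TRUE] # data.table sees the expression and can create and use an index

col = "a"
filter = 1L
DT[DT[[col]] == filter, verbose=TRUE] # the logical vector is already materialized; nothing to optimize
```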
## force applications to finish computation

There are multiple applications which try to be as lazy as possible. As a result, you might find that a query against such a solution finishes instantly, but then printing the results takes much more time. That is because the query was not actually computed when it was called; it was computed only when its results were required. Because of this, you should ensure that the computation really took place. That is not a trivial task; the ultimate way to ensure it is to dump the results to disk, but this adds the overhead of writing to disk, which is then included in the timings of the query being benchmarked. An easy and cheap way to deal with it is, for example, to print the dimensions of the result (useful in grouping benchmarks), or to print the first and last elements (useful in sorting benchmarks).
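Such a pattern could look as follows (a sketch; `force_result` is a hypothetical helper, not part of any package):

```r
library(data.table)

# hypothetical helper: touch the result so a lazy backend has to compute it
force_result = function(res) {
  print(dim(res))                # dimensions of the result: handy in grouping benchmarks
  print(res[c(1L, nrow(res)), ]) # first and last rows: handy in sorting benchmarks
  invisible(res)
}

DT = data.table(g = rep(1:3, each = 2L), v = 1:6)
system.time(force_result(DT[, .(s = sum(v)), by = g]))
```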