27 changes: 25 additions & 2 deletions vignettes/datatable-benchmarking.Rmd
@@ -4,7 +4,7 @@
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: true
number_sections: false
vignette: >
%\VignetteIndexEntry{Benchmarking data.table}
%\VignetteEngine{knitr::rmarkdown}
@@ -49,7 +49,7 @@
```r
DT[V1 %in% v & V2 %in% v & V3 %in% v & V4 %in% v, verbose=TRUE]
#...
```

# index aware benchmarking

For convenience, `data.table` automatically builds an index on the fields you use to subset data. This adds some overhead to the first subset on particular fields, but greatly reduces the time to query those columns in subsequent runs. When measuring speed, it is best to measure index creation and a query that uses an index separately; with such timings it is easy to decide on the optimal strategy for your use case.
To control usage of the index, use the following options:
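For instance, a minimal sketch that separates the three measurements, assuming the `datatable.auto.index` and `datatable.use.index` options provided by `data.table`:

```r
library(data.table)
set.seed(108)
DT = data.table(V1 = sample(100L, 1e7L, TRUE))

options(datatable.auto.index = FALSE) # do not create indices as a side effect of subsetting
options(datatable.use.index = FALSE)  # ignore indices even if they exist
system.time(DT[V1 %in% 3:5])          # baseline: no index involved

options(datatable.use.index = TRUE)
system.time(setindex(DT, V1))         # cost of building the index, measured on its own
system.time(DT[V1 %in% 3:5])          # cost of the index-aware subset
```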
@@ -105,6 +105,8 @@
```r
setDTthreads(0) # use all available cores (default)
getDTthreads() # check how many cores are currently used
```

Keep in mind that using the `parallel` R package together with `data.table` will force `data.table` to use only a single core. Thus it is recommended to verify core utilization in resource monitoring tools, for example `htop`.
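As a quick check from within R itself (a sketch; `mclapply` forks only on unix-alikes, and `data.table` drops to a single thread inside forked children):

```r
library(data.table)
library(parallel)

getDTthreads() # typically all available cores in the main session

# data.table switches itself to a single thread inside forked workers
unlist(mclapply(1:2, function(i) getDTthreads(), mc.cores = 2L))
```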
# inside a loop prefer `set` instead of `:=`

Unless you are utilizing an index when doing _sub-assign by reference_, you should prefer the `set` function, which does not impose the overhead of the `[.data.table` method call; see the sketch below.
@@ -125,3 +125,24 @@
```r
setindex(DT, a)
```
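A minimal sketch contrasting the two in a loop (toy sizes; absolute timings will vary, the gap comes from repeating the method dispatch):

```r
library(data.table)
DT = data.table(a = 1:1000, b = 0L)

# `:=` goes through the `[.data.table` method on every iteration
system.time(for (i in seq_len(nrow(DT))) DT[i, b := a])

# `set` performs the same sub-assign by reference without that dispatch
system.time(for (i in seq_len(nrow(DT))) set(DT, i, "b", DT$a[i]))
```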
# inside a loop prefer `setDT` instead of `data.table()`

As of now, `data.table()` has an overhead; thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.
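A minimal sketch of the difference (a hypothetical toy loop; the list passed to `setDT` must already be a valid list of equal-length columns):

```r
library(data.table)

system.time(for (i in 1:1e4) data.table(a = 1L, b = 2L))  # full constructor every iteration
system.time(for (i in 1:1e4) setDT(list(a = 1L, b = 2L))) # cheap promotion of a valid list by reference
```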

# lazy evaluation aware benchmarking

## let applications optimize queries

In languages like Python which do not support _lazy evaluation_, the following two filter queries would be processed exactly the same way.
```r
DT = data.table(a=1L, b=2L)
DT[a == 1L]

col = "a"
filter = 1L
DT[DT[[col]] == filter]
```

R has _lazy evaluation_, which allows an application to investigate and optimize an expression before it gets evaluated; SQL engines also do this. In the above case, if we filter using `DT[[col]] == filter`, we force the whole LHS to be materialized. This prevents `data.table` from optimizing the expression whenever possible, and it basically falls back to the base R `data.frame` way of doing a subset. For more information on that subject refer to the [R language manual](https://cran.r-project.org/doc/manuals/r-release/R-lang.html).
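One way to observe the difference is `verbose=TRUE` (a sketch; the exact messages depend on the `data.table` version):

```r
library(data.table)
DT = data.table(a = sample(10L, 1e6L, TRUE))

DT[a == 1L, verbose=TRUE] # data.table sees the expression and can create and use an index

col = "a"
filter = 1L
DT[DT[[col]] == filter, verbose=TRUE] # the logical vector is already materialized; nothing to optimize
```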
## force applications to finish computation

There are multiple applications which try to be as lazy as possible. As a result, you might find that a query against such a solution finishes instantly, but then printing the results takes much more time. That is because the query was not actually computed when it was called; it was computed only when its results were required. Because of this, you should ensure that the computation really took place. That is not a trivial task; the ultimate way to ensure it is to dump the results to disk, but this adds the overhead of writing to disk, which is then included in the timings of the query being benchmarked. An easy and cheap way to deal with it is, for example, to print the dimensions of the result (useful in grouping benchmarks), or to print the first and last elements (useful in sorting benchmarks).
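Such a pattern could look as follows (a sketch; `force_result` is a hypothetical helper, not part of any package):

```r
library(data.table)

# hypothetical helper: touch the result so a lazy backend has to compute it
force_result = function(res) {
  print(dim(res))                # dimensions of the result: handy in grouping benchmarks
  print(res[c(1L, nrow(res)), ]) # first and last rows: handy in sorting benchmarks
  invisible(res)
}

DT = data.table(g = rep(1:3, each = 2L), v = 1:6)
system.time(force_result(DT[, .(s = sum(v)), by = g]))
```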