TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Follow publication

data.table speed with dplyr syntax: Yes we can!

Iyar Lin
TDS Archive
Published in
4 min readApr 13, 2021
Photo by Sam Pearce-Warrilow on Unsplash

R has many great tools for data wrangling. Two of those are the dplyr and data.table packages. While dplyr has very flexible and intuitive syntax, data.table can be orders of magnitude faster in some scenarios.

One of those scenarios is when performing operations over a very large number of groups. This can happen when for example working with CRM data, where each row describes a touch point or transaction and one is interested with calculating the number of rows per customer, monetary value of all transactions per customer etc.

Recently Rstudio released dtplyr package version 1.0.0 which provides a data.table backend for dplyr.

Using dtplyr requires learning almost no additional code. One initiates a data.table sequence using the lazy_dt function, after which regular dplyr code is written. Execution of the code is done only when calling as_tibble (or as.data.frame etc).

So for example a simple pipeline utilizing dtplyr for many group operations would look like:

mtcars %>%
lazy_dt() %>%
filter(wt < 5) %>%
mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
group_by(cyl) %>%
summarise(l100k = mean(l100k))
## Source: local data table [?? x 2]
## Call: `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)),
## keyby = .(cyl)]
##
## cyl l100k
## <dbl> <dbl>
## 1 4 9.05
## 2 6 12.0
## 3 8 14.9
##
## # Use as.data.table()/as.data.frame()/as_tibble() to access results

The only caveat is, as with all other dbplyr like interfaces, that some of the more complex operations might not be supported.

Interestingly enough, while there are plenty of benchmarks for data.table I wasn’t able to find any bench-marking for dtplyr. So I decided to go ahead and run a quick benchmark test myself.

In this post I’ll check by how much does dtplyr improve on dplyr and whether it’s performance is close enough to data.table to be considered a valid alternative.

To that end, I’ll reproduce some of the benchmarking done by data.table author Matt Dowle back at Dec 2018.

The bench-marking consist of:

  • 5 simple queries: large groups and small groups on different columns of different types. Similar to what a data analyst might do in practice; i.e., various ad hoc aggregations as the data is explored and investigated.
  • Each package is tested separately in its own fresh session. To that end I’ve restarted my machine before running the benchmark code for every package.
  • Each query is repeated once more, immediately. This is to isolate cache effects and confirm the first timing. The first and second total elapsed times are plotted.

My analysis will diverge from the original in the following respects:

  1. I’ll compare data.table, dtyplr and dplyr. I’ll also check how starting a dtplyr pipe with a data.table rather than a data.frame affects performance (dubbed dt_dtplyr below)
  2. I’ll use my personal laptop instead of spinning up a virtual machine
  3. I’m generating a much smaller dataset (~4.9 Gb). I think that is representative of some of the larger datasets I’ve worked with in-memory (for larger datasets I usually switch to Spark).

Other than that the code is mostly the same. You can find it at the bottom of this post.

And now, without further ado, let’s check out the results:

We can see that using dtplyr improves the performance quite a bit, though still not as fast as data.table. It would seem however that most of the difference stems from the need to convert the data.frame object to a data.table one. That can be done once when reading in the file for example. Thus it would seem that ultimately the sacrifice in performance for the added benefit of tidy syntax (for those who dig tidy) isn’t too bad.

Personally, I’m hooked on the tidyverse and the dtplyr package is just another reason to keep using it, even for operations over a large number of groups.

dtplyr benchmarking code

Session info

sessionInfo()## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.12.8 pander_0.6.3 dtplyr_1.0.0 forcats_0.4.0
## [5] stringr_1.4.0 dplyr_0.8.5 purrr_0.3.4 readr_1.3.1
## [9] tidyr_1.0.3 tibble_3.0.1 ggplot2_3.3.0 tidyverse_1.3.0
## [13] pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.12 haven_2.2.0 lattice_0.20-38
## [5] colorspace_1.4-1 vctrs_0.3.0 generics_0.0.2 htmltools_0.4.0
## [9] yaml_2.2.1 utf8_1.1.4 rlang_0.4.6 pillar_1.4.4
## [13] glue_1.4.1 withr_2.1.2 DBI_1.1.0 dbplyr_1.4.2
## [17] modelr_0.1.5 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0
## [21] blogdown_0.17 gtable_0.3.0 cellranger_1.1.0 rvest_0.3.5
## [25] evaluate_0.14 labeling_0.3 knitr_1.27 fansi_0.4.1
## [29] broom_0.5.3 Rcpp_1.0.4.6 scales_1.1.0 backports_1.1.5
## [33] jsonlite_1.6.1 farver_2.0.3 fs_1.4.1 hms_0.5.3
## [37] digest_0.6.25 stringi_1.4.6 bookdown_0.17 grid_3.6.2
## [41] cli_2.0.2 tools_3.6.2 magrittr_1.5 crayon_1.3.4
## [45] pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.2.2 reprex_0.3.0
## [49] lubridate_1.7.8 assertthat_0.2.1 rmarkdown_2.0 httr_1.4.1
## [53] rstudioapi_0.11 R6_2.4.1 nlme_3.1-143 compile

Originally published at https://iyarlin.github.io on May 26, 2020

TDS Archive
TDS Archive

Published in TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Iyar Lin
Iyar Lin

Written by Iyar Lin

Data scientist by title and statistician by heart, I’ll probably keep doing applied research into my pension. And play the Cello again!

Responses (1)

Write a response