2  Quick data.table Intro for Beginners

The data.table package implements an alternative to R’s data.frame() under the same name, i.e., an object for storing tabular data.


We decided to use data.table because it is very fast and scales well to larger data. Many mlr3 functions return data.tables, which can conveniently be subset or combined with other outputs. If you do not like the syntax or feel more comfortable with other tools, base data.frames or tibble/dplyr are just a single as.data.frame() or as_tibble() call away.

Data tables can be constructed using the data.table() function (whose interface is similar to data.frame()) or by converting an object with as.data.table().

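library(data.table)

# a base data.frame with an integer column and a character column
df = data.frame(x = 1:12, y = rep(letters[1:3], each = 4))
# convert it to a data.table
dt = as.data.table(df)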
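As a minimal sketch of the other route mentioned above, the same table can be built directly with data.table() and converted back with as.data.frame(). The object names dt_direct and df_back are only illustrative and are not used elsewhere in this section.

```r
library(data.table)

# Construct the table directly; the interface mirrors data.frame()
dt_direct = data.table(x = 1:12, y = rep(letters[1:3], each = 4))

# A data.table is also a data.frame, so it inherits from both classes
class(dt_direct)

# Convert back to a plain data.frame if you prefer base R tooling
df_back = as.data.frame(dt_direct)
class(df_back)
```

The examples that follow continue with df and dt as defined above.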

Although df and dt store the data in essentially the same way in memory (as a list of column vectors), they behave quite differently. First, the index operator [ works slightly differently. For both objects, the first argument (i) is used to select rows and the second argument (j) is used to select columns. In a data.table, however, these arguments are evaluated within the table, so column names can be used directly as variables:

df[df$y == "a", ]
  x y
1 1 a
2 2 a
3 3 a
4 4 a
dt[y == "a", ]
   x y
1: 1 a
2: 2 a
3: 3 a
4: 4 a
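The example above only uses the first argument (i). As a rough sketch of how the second argument (j) can be used, the following block selects columns and evaluates an expression on a column. It rebuilds the example table so that it runs on its own; the names sub, reordered and total are only illustrative.

```r
library(data.table)

dt = data.table(x = 1:12, y = rep(letters[1:3], each = 4))

# Wrapping columns in .() (an alias for list()) returns a data.table
sub = dt[y == "a", .(x)]

# A character vector of column names selects (and orders) columns
reordered = dt[, c("y", "x")]

# j can also hold an arbitrary expression that uses the column names
total = dt[y == "a", sum(x)]
```

Here sub is a one-column data.table with four rows, reordered contains all twelve rows with y as the first column, and total is the sum of x over the rows where y equals "a".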

Second, there is no optional drop argument (drop is always FALSE for a data.table); instead, there are several additional arguments for querying the data in a very concise way. Most importantly, you can group the data with the by argument and combine this with aggregating functions supplied via the second argument (j):

dt[, mean(x), by = "y"]
   y   V1
1: a  2.5
2: b  6.5
3: c 10.5
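The aggregate above receives the default column name V1. A small sketch of how the result columns could be named explicitly, and how several aggregates can be computed at once, is shown below. The names agg, mean_x and n are chosen here for illustration only.

```r
library(data.table)

dt = data.table(x = 1:12, y = rep(letters[1:3], each = 4))

# Name the aggregates inside .() instead of relying on the default V1;
# .N is a special symbol holding the number of rows in each group
agg = dt[, .(mean_x = mean(x), n = .N), by = y]
agg
```

The result has one row per level of y, with the group mean in mean_x and the group size in n.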

There is also extensive support for all kinds of database join operations (see, e.g., this RPubs post by Ronald Stalder).
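As a minimal sketch of such a join, the block below attaches a label to each row of the example table. The lookup table and its label column are hypothetical and exist only for this illustration.

```r
library(data.table)

dt = data.table(x = 1:12, y = rep(letters[1:3], each = 4))

# Hypothetical lookup table with one row per level of y
lookup = data.table(y = c("a", "b", "c"), label = c("low", "mid", "high"))

# Left join: for every row of dt, look up the matching row of lookup
joined = lookup[dt, on = "y"]

# Same rows via merge(); column order differs and rows are sorted by y
merged = merge(dt, lookup, by = "y", all.x = TRUE)
```

The X[Y, on = ...] form keeps the row order of dt, which is why it is often read as "look up the rows of Y in X"; merge() follows the interface known from base R and may be more familiar to newcomers.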

For an in-depth introduction, we refer to the excellent data.table introduction vignette. Also don’t miss the other vignettes linked on the CRAN page of data.table!