1  Introduction and Overview

The mlr3 (Lang et al. 2019) package and ecosystem provide a generic, object-oriented, and extensible framework for classification, regression, survival analysis, and other machine learning tasks for the R language (R Core Team 2019). This unified interface provides functionality to extend and combine existing learners, intelligently select and tune the most appropriate technique for a task, and perform large-scale comparisons that enable meta-learning. Examples of this advanced functionality include hyperparameter tuning and feature selection. Parallelization of many operations is natively supported.

Note

We do not implement any learners ourselves, but provide a unified interface to many existing learners in R.

1.1 Target Audience

We expect that users of mlr3 have at least basic knowledge of machine learning and R. The later chapters of this book describe advanced functionality that requires more advanced knowledge of both. mlr3 is suitable for complex projects that use advanced functionality as well as one-liners to quickly prototype specific tasks.

mlr3 provides a domain-specific language for machine learning in R. We target both practitioners who want to quickly apply machine learning algorithms and researchers who want to implement, benchmark, and compare their new methods in a structured environment.

Note

The package is a rewrite of an earlier version of mlr that leverages many years of experience to provide a state-of-the-art system that is easy to use and extend.

1.2 Why a Rewrite?

mlr (Bischl et al. 2016) was first released to CRAN in 2013, with the core design and architecture dating back much further. Over time, the addition of many features has led to a considerably more complex design that made it harder to build, maintain, and extend than we had hoped for. With hindsight, we saw that some design and architecture choices in mlr made it difficult to support new features, in particular with respect to pipelines. Furthermore, the R ecosystem as well as helpful packages such as data.table have undergone major changes in the meantime.

It would have been nearly impossible to integrate all of these changes into the original design of mlr. Instead, we decided to start working on a reimplementation in 2018, which resulted in the first release of mlr3 on CRAN in July 2019. The new design and the integration of further and newly developed R packages (especially R6, future, and data.table) make mlr3 much easier to use and maintain, and in many regards more efficient, than its predecessor mlr.

1.3 Design Principles

We follow these general design principles in the mlr3 package and ecosystem.

  • Backend over frontend. Most packages of the mlr3 ecosystem focus on processing and transforming data, applying machine learning algorithms, and computing results. Our core packages do not provide graphical user interfaces (GUIs), whose dependencies would make installation on a computing server unnecessarily tedious; visualizations of data and results are provided in extra packages like mlr3viz. mlr3shiny provides an interface for some basic machine learning tasks using the shiny package.
  • Object orientation. Embrace R6 for a clean, object-oriented design, object state-changes, and reference semantics.
  • Tabular data. Embrace data.table for fast and convenient computations on tabular data.
  • Unify container and result classes as much as possible and provide result data in data.tables. This considerably simplifies the API and allows easy selection and “split-apply-combine” (aggregation) operations. We combine data.table and R6 to place references to non-atomic and compound objects in tables and make heavy use of list columns.
  • Defensive programming and type safety. All user input is checked with checkmate (Lang 2017). Return types are documented, and mechanisms popular in base R which “simplify” the result unpredictably (e.g., sapply() or the drop argument for indexing data.frames) are avoided.
  • Be light on dependencies. One of the main maintenance burdens for mlr was to keep up with changing learner interfaces and behavior of the many packages it depended on. We require far fewer packages in mlr3 to make installation and maintenance easier.
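
The "simplification" problem mentioned under defensive programming can be illustrated with a small base R sketch. The variable names are chosen for illustration only, and vapply() is shown as one common type-stable alternative, not necessarily what mlr3 uses internally.

```r
# sapply() changes its return type depending on the shape of the results.
same_len = sapply(1:3, function(i) i^2)          # numeric vector
mixed_len = sapply(1:3, function(i) seq_len(i))  # list, because lengths differ
zero_len = sapply(integer(0), function(i) i^2)   # empty list, not numeric(0)

# vapply() declares the expected result type and errors on a mismatch.
safe = vapply(1:3, function(i) i^2, FUN.VALUE = numeric(1))
safe_empty = vapply(integer(0), function(i) i^2, FUN.VALUE = numeric(1))

# Selecting a single data.frame column drops to a vector unless drop = FALSE.
df = data.frame(x = 1:3, y = letters[1:3])
dropped = df[, "x"]             # integer vector
kept = df[, "x", drop = FALSE]  # data.frame with one column
```

Code that receives `zero_len` or `dropped` has to handle a different type than in the usual case, which is the kind of surprise that type-stable return values avoid.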

1.4 Package Ecosystem

mlr3 builds upon the following packages not developed by core members of the mlr3 team:

All these packages are well curated and mature; we expect no problems with dependencies.

The mlr3 package itself provides the base functionality that the rest of the ecosystem relies on and some fundamental building blocks for machine learning. The following packages extend mlr3 with capabilities for preprocessing, pipelining, visualizations, additional learners, additional task types, and more:

Tip

A complete list with links to the repositories for the respective packages can be found on our package overview.

We build upon R6 for object orientation and data.table to store and operate on tabular data. In the following subsections, we briefly introduce both packages since at least a basic understanding is required to work efficiently with mlr3.

1.5 Quick R6 Intro for Beginners

R6 is one of R’s more recent dialects for object-oriented programming (OO). It addresses shortcomings of earlier OO implementations in R, such as S3, which we used in mlr. If you have done any object-oriented programming before, R6 should feel familiar. We focus on the parts of R6 that you need to know to use mlr3 here.

Objects are created by calling the constructor of an R6::R6Class() object, specifically the initialization method $new(). For example, foo = Foo$new(bar = 1) creates a new object of class Foo, setting the bar argument of the constructor to the value 1. Most objects in mlr3 are created through special functions (e.g. lrn("regr.rpart")) that encapsulate this call and are also referred to as sugar functions.

Objects have mutable state that is encapsulated in their fields, which can be accessed through the dollar operator. We can access the bar value in the foo variable from above through foo$bar and set its value by assigning the field, e.g. foo$bar = 2.
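
As a minimal sketch of construction and field access, the following defines a hypothetical class Foo matching the example in the text. The class and its bar field are made up for illustration and are not part of mlr3.

```r
library("R6")

# Hypothetical class mirroring the Foo example in the text.
Foo = R6Class("Foo",
  public = list(
    bar = NULL,
    # initialize() is the method that $new() calls to set up the object.
    initialize = function(bar) {
      self$bar = bar
    }
  )
)

foo = Foo$new(bar = 1)
foo$bar      # read the field
foo$bar = 2  # overwrite the field in place
foo$bar
```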

In addition to fields, objects expose methods that allow you to inspect the object’s state, retrieve information, or perform an action that changes the internal state of the object. For example, the $train() method of a learner changes the internal state of the learner by building and storing a trained model, which can then be used to make predictions on new data.
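
The idea of a method that changes internal state can be sketched with a toy class. MeanLearner below is hypothetical and far simpler than a real mlr3 learner; it only borrows the method names train and predict to show the pattern.

```r
library("R6")

# Hypothetical toy class, not part of mlr3: it "learns" the mean of a vector.
MeanLearner = R6Class("MeanLearner",
  public = list(
    model = NULL,
    # Changes the internal state by storing the fitted value.
    train = function(y) {
      self$model = mean(y)
      invisible(self)
    },
    # Uses the stored state and fails if train() has not been called.
    predict = function(n) {
      if (is.null(self$model)) stop("Learner has not been trained yet")
      rep(self$model, n)
    }
  )
)

learner = MeanLearner$new()
is.null(learner$model)     # TRUE before training
learner$train(c(1, 2, 3))  # stores the mean in learner$model
learner$predict(2)         # repeats the stored mean twice
```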

Objects can have public and private fields and methods. The public fields and methods define the API to interact with the object. Private methods are only relevant for you if you want to extend mlr3, e.g. with new learners.
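
A short sketch of the public/private split, using a hypothetical Counter class: the count is kept in a private field and can only be changed through the public methods.

```r
library("R6")

# Hypothetical class: the counter lives in a private field.
Counter = R6Class("Counter",
  public = list(
    increment = function() {
      private$count = private$count + 1
      invisible(self)  # returning self allows method chaining
    },
    value = function() {
      private$count
    }
  ),
  private = list(
    count = 0
  )
)

counter = Counter$new()
counter$increment()$increment()
counter$value()
is.null(counter$count)  # the private field is not reachable via $ from outside
```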

R6 objects are internally environments, and as such have reference semantics. For example, foo2 = foo does not create a copy of foo in foo2, but another reference to the same actual object. Setting foo$bar = 3 will also change foo2$bar to 3 and vice versa.

To copy an object, use the $clone() method. If the object contains other R6 objects in its fields, pass deep = TRUE so that these nested objects are copied as well, for example, foo2 = foo$clone(deep = TRUE).
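
The difference between a second reference, a shallow clone, and a deep clone can be sketched as follows. The Inner and Outer classes are hypothetical and exist only to create a nested object.

```r
library("R6")

# Hypothetical classes: Outer holds another R6 object in its inner field.
Inner = R6Class("Inner", public = list(value = 1))
Outer = R6Class("Outer",
  public = list(
    inner = NULL,
    initialize = function() {
      self$inner = Inner$new()
    }
  )
)

a = Outer$new()
b = a                        # second reference to the same object
shallow = a$clone()          # new Outer that still shares the same Inner
deep = a$clone(deep = TRUE)  # new Outer with its own copy of Inner

a$inner$value = 99
b$inner$value        # 99, b is the same object as a
shallow$inner$value  # 99, the nested object was not copied
deep$inner$value     # 1, the deep clone is independent
```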

Tip

For more details on R6, have a look at the excellent R6 vignettes, especially the introduction.

1.6 Quick data.table Intro for Beginners

The package data.table essentially implements the eponymous alternative to R’s data.frame(), i.e. an object to store tabular data.

Note

We decided to use data.table because it is very fast and scales well to larger data. Many functions of mlr3 return data.tables, which can conveniently be subset or combined with other outputs. If you do not like the syntax or feel more comfortable with other tools, base data.frames or tibble/dplyr are just a single as.data.frame() or as_tibble() away.

Data tables can be constructed using the data.table() function (whose interface is similar to data.frame()) or by converting an object with as.data.table().

library("data.table")
df = data.frame(x = 1:12, y = rep(letters[1:3], each = 4))
dt = as.data.table(df)

Although both objects store the data identically in memory, they differ considerably in how they are operated on. First, the index operator [ works slightly differently. For both objects, the first argument (i) is used to select rows and the second argument (j) is used to select columns. In a data.table, however, column names are in the search path inside the brackets and can thus be used directly:

df[df$y == "a", ]
  x y
1 1 a
2 2 a
3 3 a
4 4 a
dt[y == "a", ]
   x y
1: 1 a
2: 2 a
3: 3 a
4: 4 a

Second, there is no optional drop argument (drop is always FALSE for a data.table); instead, there are several additional arguments to query the data in a data.table in a very concise way. Most importantly, you can group the data with the argument by and combine this with aggregating functions provided via the second argument (j):

dt[, mean(x), by = "y"]
   y   V1
1: a  2.5
2: b  6.5
3: c 10.5
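
As a small extension of the grouping example, the sketch below names the aggregated columns and counts the rows per group, and also shows that selecting a single column keeps the data.table class. The result names mean_x and n are chosen for illustration.

```r
library("data.table")

dt = data.table(x = 1:12, y = rep(letters[1:3], each = 4))

# .() is data.table's alias for list(); .N is the number of rows per group.
res = dt[, .(mean_x = mean(x), n = .N), by = y]
res

# Selecting one column by name still returns a data.table, not a vector.
one_col = dt[, "x"]
is.data.table(one_col)

# Use [[ to extract the column as a plain vector.
dt[["x"]]
```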

There is also extensive support for all kinds of database join operations (see, e.g., this RPubs post by Ronald Stalder).
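
Two common joins can be sketched as follows. The lookup table labels is hypothetical and was chosen so that one key in dt has no match and one key in labels is unused.

```r
library("data.table")

dt = data.table(x = 1:12, y = rep(letters[1:3], each = 4))
# Hypothetical lookup table: "c" has no entry, and "d" does not occur in dt.
labels = data.table(y = c("a", "b", "d"), label = c("first", "second", "fourth"))

# Inner join: only keys present in both tables are kept.
inner = merge(dt, labels, by = "y")

# Left join: keep all rows of dt and fill missing labels with NA.
left = merge(dt, labels, by = "y", all.x = TRUE)

# The same left join in bracket syntax: for each row of dt,
# look up the matching row in labels.
left2 = labels[dt, on = "y"]
```

merge() will be familiar from base R, while the bracket form avoids an extra function call and composes with the other arguments of [.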

For an in-depth introduction, we refer to the excellent data.table introduction vignette. Also don’t miss the other vignettes linked on the CRAN page of data.table!