1  Introduction and Overview

The mlr3 package (Lang et al. 2019), short for Machine Learning in R, and its ecosystem provide a generic, object-oriented, and extensible framework for classification, regression, survival analysis, and other machine learning tasks in the R language (R Core Team 2019); task types are discussed in detail in Section 2.1.1. This unified interface provides functionality to extend and combine existing machine learning algorithms (learners), intelligently select and tune the most appropriate technique for a specific machine learning task, and perform large-scale comparisons that enable meta-learning. Examples of this advanced functionality include hyperparameter tuning (Chapter 4) and feature selection (Chapter 5). Parallelization of many operations is natively supported (Section 8.1).

mlr3 has similar overall aims to caret, tidymodels, scikit-learn for Python, and MLJ for Julia. In general, mlr3 is designed to provide more flexibility than other machine learning frameworks while still offering easy ways to use advanced functionality. While tidymodels in particular makes it very easy to perform simple machine learning tasks, mlr3 is geared more towards advanced machine learning. To get a quick overview of how to do things in the mlr3verse, see the mlr3 cheatsheets.

Note

mlr3 provides a unified interface to existing learners in R. With few exceptions, we do not implement any learners ourselves, although we often augment the functionality provided by the underlying learners. This includes, in particular, the definition of hyperparameter spaces for tuning.

1.1 Target audience

We assume that users of mlr3 have taken an introductory machine learning course or have the equivalent expertise and some basic experience with R. A background in computer science or statistics is beneficial for understanding the advanced functionality described in the later chapters of this book, but not required. James et al. (2014) provide a comprehensive introduction for those new to machine learning.

mlr3 provides a domain-specific language for machine learning in R that allows users to do everything from simple exercises to complex projects. We target both practitioners who want to quickly apply machine learning algorithms and researchers who want to implement, benchmark, and compare their new methods in a structured environment.

1.2 From mlr to mlr3

The mlr package (Bischl et al. 2016) was first released to CRAN in 2013, with the core design and architecture dating back much further. Over time, the addition of many features led to a considerably more complex design that made the package harder to build, maintain, and extend than we had hoped. In hindsight, we saw that some design and architecture choices in mlr made it difficult to support new features, in particular with respect to pipelines. Furthermore, the R ecosystem and helpful packages such as data.table have undergone major changes since the initial design of mlr.

It would have been impossible to integrate all of these changes into the original design of mlr. Instead, we decided to start working on a reimplementation in 2018, which resulted in the first release of mlr3 on CRAN in July 2019.

The new design and the integration of further and newly developed R packages (especially R6, future, and data.table) make mlr3 much easier to use and maintain, and in many regards more efficient, than its predecessor mlr.

1.3 Design principles

We follow these general design principles in the mlr3 package and mlr3verse ecosystem.

  • Command-line before GUI. Most packages of the mlr3 ecosystem focus on processing and transforming data, applying machine learning algorithms, and computing results. Our core packages do not provide graphical user interfaces (GUIs) because their dependencies would make installation unnecessarily complex, especially on headless servers. For the same reason, visualizations of data and results are provided in the extra package mlr3viz, which keeps dependencies such as ggplot2 out of the core packages. mlr3shiny provides an interface for some basic machine learning tasks using the shiny package.
  • Object-oriented programming (OOP). Embrace R6 for a clean, object-oriented design, object state-changes, and reference semantics.
  • Tabular data. Embrace data.table for fast and convenient computations on tabular data.
  • Unify container and result classes as much as possible and provide result data in data.tables. This considerably simplifies the API and allows easy selection and “split-apply-combine” (aggregation) operations. We combine data.table and R6 to place references to non-atomic and compound objects in tables and make heavy use of list columns.
  • Defensive programming and type safety. All user input is checked with checkmate (Lang 2017). We document return types, and avoid mechanisms popular in base R which “simplify” the result unpredictably (e.g., sapply() or the drop argument for indexing data.frames).
  • Light on dependencies. One of the main maintenance burdens for mlr was to keep up with changing learner interfaces and behavior of the many packages it depended on. We require far fewer packages in mlr3 to make installation and maintenance easier. We still provide the same functionality, but it is split into more packages that have fewer dependencies individually. As mentioned above, this is particularly the case for all visualization functionality, which is contained in a separate package to avoid unnecessary dependencies in all other packages.
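The defensive-programming principle above can be illustrated with a short sketch. It is not taken from the mlr3 sources: the function set_folds() is a hypothetical helper invented for this example, and the snippet assumes the checkmate and data.table packages are installed.

```r
library(checkmate)
library(data.table)

# sapply() changes its return type with the input: a vector for non-empty
# input, but an empty list for empty input.
class(sapply(list(a = 1:3, b = 4:6), length))
class(sapply(list(), length))

# vapply() requires the result type up front and always honors it.
vapply(list(), length, integer(1))

# Indexing one column of a data.frame silently drops to a vector,
# while the same call on a data.table keeps the table structure.
df = data.frame(a = 1:3, b = letters[1:3])
class(df[, "a"])
class(as.data.table(df)[, "a"])

# Hypothetical helper: validate user input before doing any work.
set_folds = function(folds) {
  assert_int(folds, lower = 2)
  folds
}
set_folds(5)
```

Calling set_folds("five") or set_folds(1) stops with an informative error instead of failing later in a less obvious place.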

1.4 Package ecosystem

mlr3 uses the following packages that are not developed by core members of the mlr3 team:

These are core packages in the R ecosystem.

The mlr3 package itself provides the base functionality that the rest of the ecosystem (mlr3verse) relies on and the fundamental building blocks for machine learning. Figure 1.1 shows the packages in the mlr3verse that extend mlr3 with capabilities for preprocessing, pipelining, visualizations, additional learners, additional task types, and more.

Diagram showing the packages of the mlr3verse and their relationship.

Figure 1.1: Overview of the mlr3verse.

Tip

A complete list with links to the repository for the respective package can be found on our package overview page.

We build on R6 for object orientation and data.table to store and operate on tabular data. Both are core to mlr3; we briefly introduce both packages for beginners. While in-depth expertise with these packages is not necessary, a basic understanding is required to work effectively with mlr3.

1.5 Quick R6 introduction for beginners

R6 is one of R’s more recent paradigms for object-oriented programming (OOP). It addresses shortcomings of earlier OO implementations in R, such as S3, which we used in mlr. If you have done any object-oriented programming before, R6 should feel familiar. We focus on the parts of R6 that you need to know to use mlr3.

Objects are created by calling the constructor of an R6::R6Class() object, specifically the initialization method $new(). For example, foo = Foo$new(bar = 1) creates a new object of class Foo, setting the bar argument of the constructor to the value 1.

Objects have mutable state that is encapsulated in their fields, which can be accessed through the dollar operator. We can access the bar value in the foo variable from above through foo$bar and set its value by assigning the field, e.g. foo$bar = 2.
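As a minimal sketch of the two paragraphs above, the following defines a toy class Foo with a single public field bar. The class exists only for illustration and is not part of mlr3; the snippet assumes the R6 package is installed.

```r
library(R6)

# Hypothetical class used only for illustration.
Foo = R6Class("Foo",
  public = list(
    bar = NULL,
    # initialize() is what runs when Foo$new() is called
    initialize = function(bar) {
      self$bar = bar
    }
  )
)

foo = Foo$new(bar = 1)  # construct an object
foo$bar                 # read the field
foo$bar = 2             # assign a new value to the field
foo$bar
```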

In addition to fields, objects expose methods that allow you to inspect the object’s state, retrieve information, or perform an action that changes the internal state of the object. For example, the $train() method of a learner changes the internal state of the learner by building and storing a model, which can then be used to make predictions.
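The state change caused by $train() can be observed directly. The sketch below assumes that mlr3 and rpart are installed and uses the mtcars task that ships with mlr3; the choice of task and learner is arbitrary.

```r
library(mlr3)

task = tsk("mtcars")         # regression task shipped with mlr3
learner = lrn("regr.rpart")  # regression tree from the rpart package

is.null(learner$model)       # no model is stored before training
learner$train(task)          # fits a model and stores it inside the learner
is.null(learner$model)       # the learner now holds a fitted model

# The stored model is what $predict() uses.
prediction = learner$predict(task)
```

Note that learner$train(task) is called for its side effect: the result is not assigned to a new variable, because the learner object itself is modified.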

Objects can have public and private fields and methods. The public fields and methods define the API to interact with the object. Private methods are only relevant for you if you want to extend mlr3, e.g. with new learners.

Technically, R6 objects are environments, and as such have reference semantics. For example, foo2 = foo does not create a copy of foo in foo2, but another reference to the same actual object. Setting foo$bar = 3 will also change foo2$bar to 3 and vice versa.

To copy an object, use the $clone() method and the deep = TRUE argument for nested objects, for example, foo2 = foo$clone(deep = TRUE).
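The difference between a second reference, a shallow clone, and a deep clone is easiest to see in a small experiment. The class Foo below is the same hypothetical toy class as in the text, redefined here so that the sketch runs on its own; it assumes the R6 package is installed.

```r
library(R6)

# Hypothetical class used only for illustration.
Foo = R6Class("Foo",
  public = list(
    bar = NULL,
    initialize = function(bar) {
      self$bar = bar
    }
  )
)

foo = Foo$new(bar = 1)
foo2 = foo           # a second reference, not a copy
foo$bar = 3
foo2$bar             # also 3, both names point to the same object

foo3 = foo$clone()   # an independent copy
foo3$bar = 10
foo$bar              # still 3

# Nested objects: a shallow clone shares the inner object, a deep clone does not.
outer = Foo$new(bar = Foo$new(bar = 1))
shallow = outer$clone()
deep = outer$clone(deep = TRUE)
outer$bar$bar = 99
shallow$bar$bar      # 99, the inner object is shared
deep$bar$bar         # 1, the inner object was copied
```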

Tip

For more details on R6, have a look at the excellent R6 vignettes, especially the introduction. For comprehensive R6 information, we refer to the R6 chapter from Advanced R.

Sugar functions

Most objects in mlr3 can be created through special functions that are called sugar functions. They provide shortcuts for common code idioms, reducing the amount of code a user has to write. We heavily use sugar functions throughout this book and give the equivalent “full form” only for reference. In most cases, the sugar functions will achieve what you want to do, but in special cases you may have to use the full R6 code. For example, lrn("regr.rpart") is the sugar version of dictionary_sugar_get(mlr_learners, "regr.rpart").
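A short sketch of the sugar form next to a more explicit retrieval from the learner dictionary. It assumes mlr3 and rpart are installed; the hyperparameter maxdepth = 3 is an arbitrary value chosen for illustration.

```r
library(mlr3)

# Sugar function: retrieve a learner by its key.
learner_sugar = lrn("regr.rpart")

# More explicit form: ask the learner dictionary directly.
learner_dict = mlr_learners$get("regr.rpart")

learner_sugar$id
identical(class(learner_sugar), class(learner_dict))

# Sugar functions also accept hyperparameter values at construction.
learner_small = lrn("regr.rpart", maxdepth = 3)
learner_small$param_set$values$maxdepth
```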

Dictionaries

mlr3 uses dictionaries for learners, tasks, and other objects that are often used in common machine learning tasks. These are key-value stores that associate a key with a value, which can be an R6 object, much like paper dictionaries associate words with their definitions. Often, values in dictionaries are accessed through sugar functions that automatically use the applicable dictionary without the user having to specify it; only the key to be retrieved needs to be specified. Dictionaries are used to group relevant objects so that they can be listed and retrieved easily.
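The following sketch lists and retrieves entries from two of these dictionaries. It assumes mlr3 and data.table are installed; the keys shown depend on which extension packages are loaded, so no output is reproduced here.

```r
library(mlr3)
library(data.table)

# List the keys stored in the learner dictionary.
head(mlr_learners$keys())

# Restrict the listing to keys that match a pattern.
mlr_learners$keys("rpart")

# Convert the dictionary to a table for an overview of its entries.
learner_table = as.data.table(mlr_learners)

# Retrieve a value by key, explicitly and through the sugar function.
task_a = mlr_tasks$get("mtcars")
task_b = tsk("mtcars")
task_a$id == task_b$id
```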

1.6 Quick data.table introduction for beginners

The package data.table implements a popular alternative to R’s data.frame(), i.e. an object to store tabular data. We decided to use data.table because it is blazingly fast and scales well to bigger data.

Note

Many mlr3 functions return data.tables, which can conveniently be subset or combined with other outputs. If you do not like the syntax or feel more comfortable with other tools, base data.frames or tibbles are just a single as.data.frame() or as_tibble() call away.

Data tables are constructed with the data.table() function (whose interface is similar to data.frame()) or by converting an object with as.data.table().

library("data.table")
dt = data.table(x = 1:6, y = rep(letters[1:3], each = 2))
dt
   x y
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
6: 6 c
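Conversion with as.data.table(), and back with as.data.frame(), can be sketched as follows. The column name "model" for the former row names is an arbitrary choice made for this example.

```r
library(data.table)

# Convert an existing data.frame.
df = data.frame(x = 1:3, y = c("a", "b", "c"))
dt_from_df = as.data.table(df)
class(dt_from_df)

# Row names are dropped by default; keep.rownames stores them in a column.
cars = as.data.table(mtcars, keep.rownames = "model")
names(cars)[1]

# Convert back to a plain data.frame when other tools expect one.
df_again = as.data.frame(dt_from_df)
class(df_again)
```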

data.tables can be used much like data.frames, but they do provide additional functionality that makes complex operations easier. For example, data can be summarized by groups with the [ operator:

dt[, mean(x), by = "y"]
   y  V1
1: a 1.5
2: b 3.5
3: c 5.5
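A few further idioms of the [ operator, sketched on the same toy table. The column names mean_x, n, and x_centered are chosen for this example only.

```r
library(data.table)

dt = data.table(x = 1:6, y = rep(letters[1:3], each = 2))

# The first argument filters rows.
dt[x > 2]

# Several named summaries per group; .N is the number of rows in each group.
summary_dt = dt[, .(mean_x = mean(x), n = .N), by = y]
summary_dt

# := adds a column by reference, i.e. dt itself is modified in place.
dt[, x_centered := x - mean(x), by = y]
names(dt)
```

Naming the summaries avoids the automatically generated column name V1 seen in the output above.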

There is also extensive support for many kinds of database join operations (see e.g. this RPubs post by Ronald Stalder) that make it easy to combine multiple data.tables in different ways.
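As a minimal sketch of such joins, the following combines the toy table from above with a small lookup table. The table labels and its contents are made up for this example.

```r
library(data.table)

dt = data.table(x = 1:6, y = rep(letters[1:3], each = 2))
labels = data.table(y = c("a", "b", "d"), label = c("alpha", "beta", "delta"))

# Inner join: keep only keys present in both tables.
inner = merge(dt, labels, by = "y")

# Left join: keep every row of dt, with NA where no label exists.
left = merge(dt, labels, by = "y", all.x = TRUE)

# data.table join syntax: look up each row of labels in dt.
right = dt[labels, on = "y"]

nrow(inner)
nrow(left)
nrow(right)
```

Here the key "c" has no label and the key "d" has no data, so the three joins return tables with different numbers of rows.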