The mlr3 (Machine Learning in R) package (Lang et al. 2019) and its ecosystem provide a generic, object-oriented, and extensible framework for classification, regression, survival analysis, and other machine learning tasks for the R language (R Core Team 2019). This unified interface provides functionality to extend and combine existing machine learning algorithms (learners), intelligently select and tune the most appropriate technique for a specific machine learning task, and perform large-scale comparisons that enable meta-learning. Examples of this advanced functionality include hyperparameter tuning (Chapter 4) and feature selection (Chapter 5). Parallelization of many operations is natively supported (Section 9.1).
mlr3 has similar overall aims to caret, tidymodels, scikit-learn for Python, and MLJ for Julia. In general, mlr3 is designed to provide more flexibility than other machine learning frameworks while still offering easy ways to use advanced functionality. While tidymodels in particular makes it very easy to perform simple machine learning tasks, mlr3 is more geared towards advanced machine learning. To get a quick overview of how to do things in the mlr3verse, see the mlr3 cheatsheets[1].
We assume that users of mlr3 have taken an introductory machine learning course or have equivalent expertise, as well as some basic experience with R. A background in computer science or statistics is beneficial for understanding the advanced functionality described in the later chapters of this book, but not required. James et al. (2014) provides a comprehensive introduction for those new to machine learning.
mlr3 provides a domain-specific language for machine learning in R that lets users do everything from simple exercises to complex projects. We target both practitioners who want to quickly apply machine learning algorithms and researchers who want to implement, benchmark, and compare their new methods in a structured environment.
The mlr package (Bischl et al. 2016) was first released to CRAN[2] in 2013, with the core design and architecture dating back much further. Over time, the addition of many features led to a considerably more complex design that made the package harder to build, maintain, and extend than we had hoped. In hindsight, we saw that some design and architecture choices in mlr made it difficult to support new features, in particular with respect to machine learning pipelines. Furthermore, the R ecosystem and helpful packages such as data.table have undergone major changes since the initial design of mlr.
It would have been impossible to integrate all of these changes into the original design of mlr. Instead, we decided to start working on a reimplementation in 2018, which resulted in the first release of mlr3 on CRAN in July 2019.
The new design and the integration of further and newly developed R packages (especially R6, future, and data.table) make mlr3 much easier to use and maintain, and in many regards more efficient, than its predecessor mlr. The packages in the ecosystem are less tightly coupled, which makes them easier to maintain and develop, especially the very specialized ones.
We follow these general design principles in the mlr3 package and mlr3verse ecosystem:
- Packages in the mlr3 ecosystem focus on processing and transforming data, applying machine learning algorithms, and computing results. Our core packages do not provide graphical user interfaces (GUIs) because their dependencies would make installation unnecessarily complex, especially on headless servers. For the same reason, visualizations of data and results are provided in the extra package mlr3viz, which keeps the dependency on ggplot2 out of the core packages. mlr3shiny provides an interface for some basic machine learning tasks using the shiny package.
- We use R6 for a clean, object-oriented design, object state-changes, and reference semantics.
- We use data.table for fast and convenient computations on tabular data (but we support other types of data as well).
- We combine data.table and R6 to place references to non-atomic and compound objects in tables and make heavy use of list columns.
- We check user input with checkmate (Lang 2017), document return types, and avoid mechanisms popular in base R that “simplify” the result unpredictably (e.g., sapply() or the drop argument for indexing data.frames). We also have extensive unit tests.
- A major maintenance burden for mlr was to keep up with changing learner interfaces and behavior of the many packages it depended on. We require far fewer packages in mlr3 to make installation and maintenance easier. We still provide the same functionality, but it is split into more packages that have fewer dependencies individually. As mentioned above, this is particularly the case for all visualization functionality, which is contained in a separate package to avoid unnecessary dependencies in all other packages.

mlr3 uses the following packages that are not developed by core members of the mlr3 team:
- R6: Reference class objects.
- data.table: Extension of R’s data.frame.
- digest: Hash digests.
- uuid: Unique string identifiers.
- lgr: Logging.
- mlbench: Collection of machine learning data sets.
- evaluate: For capturing output, warnings, and exceptions (Section 9.2).
- future / future.apply: For parallelization (Section 9.1).

These are core packages in the R ecosystem.
The mlr3 package itself provides the base functionality that the rest of the ecosystem (the mlr3verse) relies on, as well as the fundamental building blocks for machine learning. Figure 1.1 shows the packages in the mlr3verse that extend mlr3 with capabilities for preprocessing, pipelining, visualizations, additional learners, additional task types, and more.
[Figure 1.1: mlr3 ecosystem, the mlr3verse.]

A complete list with links to the repositories for the respective packages can be found on our package overview page[3].
We build on R6 for object orientation and data.table to store and operate on tabular data. Both are core to mlr3, so we briefly introduce them for beginners. While in-depth expertise with these packages is not necessary, a basic understanding is required to work effectively with mlr3.
R6 is one of R’s more recent paradigms for object-oriented programming (OOP). It addresses shortcomings of earlier OO implementations in R, such as S3, which we used in mlr. If you have done any object-oriented programming before, R6 should feel familiar. We focus on the parts of R6 that you need to know to use mlr3.
Objects are created by calling the constructor of an R6::R6Class() object, specifically the initialization method $new(). For example, foo = Foo$new(bar = 1) creates a new object of class Foo, setting the bar argument of the constructor to the value 1.
Objects have mutable state that is encapsulated in their fields, which can be accessed through the dollar operator. We can access the bar value in the foo variable from above through foo$bar and set its value by assigning to the field, e.g., foo$bar = 2.
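The construction and field access described above can be sketched as follows. The class Foo is purely hypothetical and defined here only so that the example runs; it is not part of mlr3 or R6.

```r
library(R6)

# Hypothetical class for illustration; `Foo` and `bar` mirror the names in the text.
Foo = R6Class("Foo",
  public = list(
    bar = NULL,
    # Called by Foo$new(); stores the constructor argument in the field.
    initialize = function(bar) {
      self$bar = bar
    }
  )
)

foo = Foo$new(bar = 1)  # construct an object
foo$bar                 # read the field
foo$bar = 2             # overwrite the field in place
foo$bar
```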
In addition to fields, objects expose methods that let you inspect the object’s state, retrieve information, or perform an action that changes the internal state of the object. For example, the $train() method of a learner changes the internal state of the learner by building and storing a model, which can then be used to make predictions.
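As a small illustration of a method that changes internal state, the sketch below trains a learner and inspects it before and after. The choice of the mtcars task and the regr.rpart learner is an assumption made for this example (the latter requires the rpart package); tasks and learners themselves are covered in the next chapter.

```r
library(mlr3)

task = tsk("mtcars")         # example regression task shipped with mlr3
learner = lrn("regr.rpart")  # regression tree learner

is.null(learner$model)       # no model is stored before training
learner$train(task)          # builds the model and stores it in the learner
class(learner$model)         # the fitted model is now part of the learner's state

# The stored model is what $predict() uses.
prediction = learner$predict(task)
head(prediction$response)
```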
Objects can have public and private fields and methods. The public fields and methods define the API to interact with the object. Private methods are only relevant if you want to extend mlr3, e.g., with new learners.
Technically, R6 objects are environments, and as such have reference semantics. For example, foo2 = foo does not create a copy of foo in foo2, but another reference to the same actual object. Setting foo$bar = 3 will also change foo2$bar to 3 and vice versa.
To copy an object, use the $clone() method, and add the deep = TRUE argument if the object contains nested objects, for example, foo2 = foo$clone(deep = TRUE).
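The difference between a second reference, a shallow clone, and a deep clone can be sketched as follows. The classes Foo and Inner are hypothetical and exist only for this illustration.

```r
library(R6)

# Hypothetical classes for illustration only.
Inner = R6Class("Inner", public = list(value = 0))
Foo = R6Class("Foo",
  public = list(
    bar = NULL,
    inner = NULL,
    initialize = function(bar) {
      self$bar = bar
      self$inner = Inner$new()  # nested R6 object
    }
  )
)

foo = Foo$new(bar = 1)

foo2 = foo                     # second reference to the same object
foo$bar = 3
foo2$bar                       # changed as well

foo3 = foo$clone()             # shallow copy: nested R6 objects are shared
foo3$bar = 10
foo$bar                        # unaffected by the change to foo3
foo3$inner$value = 1
foo$inner$value                # changed, because `inner` is shared

foo4 = foo$clone(deep = TRUE)  # deep copy: nested R6 objects are cloned too
foo4$inner$value = 99
foo$inner$value                # unaffected by the change to foo4
```

For objects that hold other R6 objects, as many mlr3 objects do, the deep clone is usually the safer default.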
For more details on R6, have a look at the excellent R6 vignettes[4], especially the introduction[5]. For comprehensive R6 information, we refer to the R6 chapter of Advanced R[6].
data.table introduction for beginners

The package data.table implements a popular alternative to R’s data.frame(), i.e., an object to store tabular data. We decided to use data.table because it is blazingly fast and scales well to bigger data.
Many mlr3 functions return data.tables, which can conveniently be subsetted or combined with other outputs. If you do not like the syntax or feel more comfortable with other tools, base data.frames or tibbles (for use with dplyr) are just a single as.data.frame() or as_tibble() call away.
Data tables are constructed with the data.table() function (whose interface is similar to data.frame()) or by converting an object with as.data.table().
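Both routes can be sketched as follows. The column values are an assumption, chosen so that the result matches the table printed below; the variable names dt, df, and dt2 are arbitrary.

```r
library(data.table)

# Direct construction; the interface mirrors data.frame().
dt = data.table(x = 1:6, y = rep(letters[1:3], each = 2))
dt

# Conversion from an existing data.frame, and back again.
df = data.frame(x = 1:6, y = rep(letters[1:3], each = 2))
dt2 = as.data.table(df)
as.data.frame(dt2)
```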
   x y
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
6: 6 c
data.tables can be used much like data.frames, but they provide additional functionality that makes complex operations easier. For example, data can be summarized by groups using the by argument of the [ operator.
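A grouped summary might look like the following sketch, which recreates a small example table (an assumption for illustration) and computes the mean of x within each group of y.

```r
library(data.table)

dt = data.table(x = 1:6, y = rep(letters[1:3], each = 2))

# Mean of x within each group of y; the unnamed result column is called V1.
dt[, mean(x), by = "y"]

# The same aggregation with an explicitly named result column.
res = dt[, .(mean_x = mean(x)), by = y]
res
```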
There is also extensive support for many kinds of database join operations (see e.g. this RPubs post by Ronald Stalder[7]) that make it easy to combine multiple data.tables in different ways.
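As a minimal sketch of two common joins, the example below combines a data table with a hypothetical lookup table named labels. Both tables are made up for illustration.

```r
library(data.table)

dt = data.table(x = 1:6, y = rep(letters[1:3], each = 2))
# Hypothetical lookup table; note that group "c" has no entry.
labels = data.table(y = c("a", "b"), label = c("first", "second"))

# Inner join: only rows with a match in both tables are kept.
inner = merge(dt, labels, by = "y")
inner

# Left join: all rows of dt are kept, unmatched labels become NA.
left = merge(dt, labels, by = "y", all.x = TRUE)
left

# The same left join written in data.table's own syntax.
labels[dt, on = "y"]
```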
For an in-depth introduction, we refer the reader to the excellent data.table introduction vignette[8].
mlr3 utilities

Most objects in mlr3 can be created through convenience functions called sugar functions. They provide shortcuts for common code idioms, reducing the amount of code a user has to write. We heavily use sugar functions throughout this book and give the equivalent “full form” only for reference. In most cases, the sugar functions will achieve what you want to do, but in special cases you may have to use the full R6 code. For example, lrn("regr.rpart") is the sugar version of LearnerRegrRpart$new().
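The equivalence of the two forms can be checked with a short sketch. The variable names and the maxdepth value are arbitrary choices for this illustration, and the rpart package is assumed to be installed.

```r
library(mlr3)

learner_sugar = lrn("regr.rpart")      # sugar function
learner_full = LearnerRegrRpart$new()  # full R6 form

# Both calls create an object of the same class.
identical(class(learner_sugar), class(learner_full))

# Sugar functions also accept hyperparameter values as additional arguments.
learner_shallow = lrn("regr.rpart", maxdepth = 3)
learner_shallow$param_set$values$maxdepth
```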
mlr3 uses dictionaries for learners, tasks, and other objects that are often used in common machine learning tasks. These are key-value stores that associate a key with a value, which can be an R6 object, much like paper dictionaries associate words with their definitions. Often, values in dictionaries are accessed through sugar functions that automatically use the applicable dictionary without the user having to specify it; only the key to be retrieved needs to be specified. Dictionaries are used to group relevant objects so that they can be listed and retrieved easily. For example, a learner can be retrieved directly from the mlr_learners dictionary using the key "classif.featureless" (mlr_learners$get("classif.featureless")).
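The following sketch shows how a dictionary might be listed and queried, and how the corresponding sugar function retrieves the same kind of object. The set of available keys depends on which mlr3 packages are installed and loaded.

```r
library(mlr3)

# List some of the keys stored in the learner dictionary.
head(mlr_learners$keys())

# Retrieve a learner directly from the dictionary.
learner = mlr_learners$get("classif.featureless")
learner$id

# The sugar function looks up the same dictionary.
learner2 = lrn("classif.featureless")
identical(class(learner), class(learner2))

# Dictionaries can be converted to a table for browsing.
head(as.data.table(mlr_learners)$key)
```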
mlr3viz

mlr3viz is the package for all plotting functionality in the mlr3 ecosystem. The package uses a common theme (ggplot2::theme_minimal()) so that all generated plots have a similar aesthetic. Under the hood, mlr3viz uses ggplot2. mlr3viz extends fortify and autoplot for use with common mlr3 outputs including Prediction, Learner, and Benchmark objects (these objects will be introduced and covered in the next chapter).

The most common use of mlr3viz is the autoplot() function, where the type of the object passed determines the type of the plot. Plot types are documented in the respective manual page that can be accessed through ?autoplot.X. For example, the documentation of plots for regression tasks can be found by running ?autoplot.TaskRegr.
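A minimal sketch of this dispatch is shown below. The mtcars task and the regr.rpart learner are assumptions made for the example, and the default plot types may differ between mlr3viz versions.

```r
library(mlr3)
library(mlr3viz)
library(ggplot2)

task = tsk("mtcars")
learner = lrn("regr.rpart")
learner$train(task)
prediction = learner$predict(task)

# The class of the object passed determines which plot is drawn.
p_task = autoplot(task)        # dispatches to autoplot.TaskRegr
p_pred = autoplot(prediction)  # dispatches to autoplot.PredictionRegr

print(p_task)
print(p_pred)
```

Because the results are ordinary ggplot2 objects, they can be customized further with the usual ggplot2 functions, for example by adding a title with ggtitle().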