1 Introduction
Lars Kotthoff
University of Wyoming
Raphael Sonabend
Imperial College London
Natalie Foss
University of Wyoming
Bernd Bischl
Welcome to the Machine Learning in R universe (mlr3verse). In this book we will guide you through the functionality offered by mlr3 step by step. If you want to contribute to our universe, ask any questions, read documentation, or just chat to the team, head to https://github.com/mlr-org/mlr3, which has several useful links in the README.
Before we begin, make sure you have installed mlr3 if you want to follow along. We recommend installing the complete mlr3verse, which will install all of the important packages:

install.packages("mlr3verse")

Or you can install just the base package:

install.packages("mlr3")
1.1 Target audience and how to use this book
The mlr3 ecosystem is the result of many years of methodological and applied research, with the design and implementation of the packages refined over the years. This book describes the resulting features of the mlr3verse and discusses best practices for ML, technical implementation details, extension guidelines, and in-depth considerations for optimizing ML. It is suitable for a wide range of readers and levels of ML expertise, but we assume that users of mlr3 have taken an introductory machine learning course or have the equivalent expertise and some basic experience with R. A background in computer science or statistics is beneficial for understanding the advanced functionality described in the later chapters of this book, but not required. A comprehensive introduction for those new to machine learning can be found in James et al. (2013), and Wickham and Grolemund (2017) gives a comprehensive introduction to data science in R. This book may also be helpful for practitioners who want to quickly apply machine learning algorithms, as well as researchers who want to implement, benchmark, and compare their new methods in a structured environment.
Chapter 1, Chapter 2, and Chapter 3 cover the basics of mlr3. These chapters are essential to understanding the core infrastructure of ML in mlr3. We recommend that all readers study these chapters to become familiar with basic mlr3 terminology, syntax, and style. Chapter 4, Chapter 5, Chapter 6, and Chapter 7 contain more advanced implementation details and more complex concepts and algorithms. Chapter 8 delves into detail on domain-specific methods that are implemented in our extension packages. Readers may choose to selectively read sections in this chapter depending on their use cases (i.e., if they have domain-specific problems to tackle), or to use these as introductions to new domains to explore. Chapter 9 contains technical implementation details that are essential reading for advanced users who require parallelisation, custom error handling, and fine control over hyperparameters and large databases. Finally, Chapter 11 discusses packages that can be integrated with mlr3 to provide model-agnostic interpretability methods.
Of course, you can also read the book cover to cover from start to finish. We have marked ‘optional’ sections that are particularly complex, with respect to either technical or methodological detail, which could be skipped on a first read.
Each chapter includes tutorials, API references, explanations of methodologies, and exercises to test yourself on what you have learnt. You can find the solutions to these exercises in Appendix A.
If you want to reproduce any of the results in this book, note that at the start of each chapter we run set.seed(123), and the sessionInfo() at the time of publication is printed in Appendix E.
1.2 Installation guidelines
All packages in the mlr3 ecosystem can be installed from GitHub and R-universe; the majority (but not all) packages can also be installed from CRAN. We recommend adding the mlr-org R-universe1 to your R options so that you can install all packages with install.packages() without having to worry about which package repository each one comes from. To do this, install the usethis package and run the following:
1 R-universe is an alternative package repository to CRAN. The bit of code below tells R to look at both R-universe and CRAN when trying to install packages. R will always install the latest version of a package.
usethis::edit_r_profile()
In the file that opens, add or change the repos argument in options so it looks something like the example below (you might need to add the full code block or just edit the existing options call).
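A sketch of what the resulting options call could look like; the repository URLs here are assumptions based on the standard R-universe and CRAN addresses:

options(repos = c(
  mlrorg = "https://mlr-org.r-universe.dev",
  CRAN = "https://cloud.r-project.org/"
))

With this in place, install.packages() consults both repositories and installs the latest available version of each package.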
Save the file, restart your R session, and you are ready to go!
install.packages("mlr3verse")
If you want the latest development version of any of our packages, run

remotes::install_github("mlr-org/{pkg}")

with {pkg} replaced with the name of the package you want to install. You can see an up-to-date list of all our extension packages at https://github.com/mlr-org/mlr3/wiki/Extension-Packages.
1.3 mlr3book code style
Throughout this book we will use the following code style:

- We always use = instead of <- for assignment.
- Class names are in UpperCamelCase.
- Function and method names are in lower_snake_case.
- When referencing functions, we will only include the package prefix (e.g., pkg::function) for functions outside the mlr3 universe or when there may be ambiguity about which package the function lives in. Note you can use environment(function) to see which namespace a function is loaded from.
- We denote packages, fields, methods, and functions as follows:
  - package - with a link (if online) to the package's CRAN, R-universe, or GitHub page
  - package::function() (for functions outside the mlr-org ecosystem)
  - function() (for functions inside the mlr-org ecosystem)
  - $field for fields (data encapsulated in an R6 class)
  - $method() for methods (functions encapsulated in an R6 class)
Now let us see this in practice with our first example.
1.4 mlr3 example
The mlr3 universe includes a wide range of tools taking you from basic ML to complex experiments. To get started, here is an example of the simplest functionality: training a model and making predictions.
library(mlr3)
task = tsk("penguins")
split = partition(task)
learner = lrn("classif.rpart")
learner$train(task, row_ids = split$train)
learner$model
n= 231
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 231 129 Adelie (0.441558442 0.199134199 0.359307359)
2) flipper_length< 206.5 144 44 Adelie (0.694444444 0.298611111 0.006944444)
4) bill_length< 43.05 98 3 Adelie (0.969387755 0.030612245 0.000000000) *
5) bill_length>=43.05 46 6 Chinstrap (0.108695652 0.869565217 0.021739130) *
3) flipper_length>=206.5 87 5 Gentoo (0.022988506 0.034482759 0.942528736) *
predictions = learner$predict(task, row_ids = split$test)
predictions
<PredictionClassif> for 113 observations:
row_ids truth response
1 Adelie Adelie
2 Adelie Adelie
3 Adelie Adelie
---
328 Chinstrap Chinstrap
331 Chinstrap Adelie
339 Chinstrap Chinstrap
predictions$score(msr("classif.acc"))
classif.acc
0.9557522
In this example, we trained a decision tree on a subset of the penguins dataset, made predictions on the rest of the data, and then evaluated these with the accuracy measure. In Chapter 2 we will break this down in more detail.
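As a small preview of what Chapter 2 covers, the same predictions object from above can be scored against several measures at once, and its confusion matrix inspected (a sketch; exact numbers depend on the random split):

# score multiple measures; msrs() retrieves a list of measures by key
predictions$score(msrs(c("classif.acc", "classif.ce")))

# cross-tabulation of true vs. predicted classes
predictions$confusion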
mlr3 makes training and predicting easy, but it also allows us to perform very complex operations in just a few lines of code:
library(mlr3verse)
library(mlr3pipelines)
library(mlr3benchmark)
tasks = tsks(c("breast_cancer", "sonar"))
tuned_rf = auto_tuner(
tnr("grid_search", resolution = 5),
lrn("classif.ranger", num.trees = to_tune(200, 500)),
rsmp("holdout")
)
tuned_rf = pipeline_robustify(NULL, tuned_rf, TRUE) %>>%
po("learner", tuned_rf)
stack_lrn = ppl(
"stacking",
base_learners = lrns(c("classif.rpart", "classif.kknn")),
lrn("classif.log_reg"))
stack_lrn = pipeline_robustify(NULL, stack_lrn, TRUE) %>>%
po("learner", stack_lrn)
learners = c(tuned_rf, stack_lrn)
bm = benchmark(benchmark_grid(tasks, learners, rsmp("holdout")))
bma = bm$aggregate(msr("classif.acc"))[, c("task_id", "learner_id",
"classif.acc")]
bma$learner_id = rep(c("RF", "Stack"), 2)
bma
task_id learner_id classif.acc
1: breast_cancer RF 0.9780702
2: breast_cancer Stack 0.9385965
3: sonar RF 0.8550725
4: sonar Stack 0.7246377
In this (much more complex!) example we chose two tasks and two learners and used automated tuning to optimize the number of trees in the random forest learner (Chapter 4), and an ML pipeline that imputes missing data, collapses factor levels, and creates stacked models (Chapter 6). We also showed basic features like loading learners (Chapter 2) and choosing resampling strategies for benchmarking (Chapter 3). Finally, we compared the performance of the models using the mean accuracy on the test set, and applied a statistical test to see if the learners performed significantly differently (they did not!).
You will learn how to do all this and more in this book.
1.5 MLR: Machine Learning in R
The mlr3 (Machine Learning in R) package and ecosystem (Lang et al. 2019) provide a generic, object-oriented, and extensible framework for regression (Section 2.1), classification (Section 2.5), and other machine learning tasks (Chapter 8) for the R language (R Core Team 2019). This unified interface provides functionality to extend and combine existing machine learning algorithms (Learners (Section 2.2)), intelligently select and tune the most appropriate technique for a given machine learning task (Section 2.1), and perform large-scale comparisons that enable meta-learning. In addition, mlr3 includes advanced functionality such as hyperparameter tuning (Chapter 4) and feature selection (Chapter 5). Parallelization of many operations is natively supported (Section 9.1).
mlr3 has similar overall aims to caret and tidymodels for R, scikit-learn2 for Python, and MLJ3 for Julia. In general, mlr3 is designed to provide more flexibility than other ML frameworks while still offering easy ways to use advanced functionality. While tidymodels in particular makes it very easy to perform simple ML tasks, mlr3 is more geared towards advanced ML.
1.6 From mlr to mlr3
The mlr package (Bischl et al. 2016) was first released to CRAN4 in 2013, with the core design and architecture dating back somewhat further. Over time, the addition of many features has led to a considerably more complex design that made it harder to build, maintain, and extend than we had hoped for. In hindsight, we saw that some design and architecture choices in mlr made it difficult to support new features, in particular with respect to ML pipelines. Furthermore, the R ecosystem and packages within it, such as data.table, had undergone major changes after the initial design of mlr.

It would have been difficult to integrate all of these changes into the original design of mlr. Instead, we decided to start working on a reimplementation in 2018, which resulted in the first release of mlr3 on CRAN in July 2019.
The new design and the integration of further and newly-developed R packages (especially R6, future, and data.table) make mlr3 much easier to use and maintain, and in many regards more efficient than its predecessor mlr. The packages in the ecosystem are less tightly coupled, making them easier to maintain and develop, especially very specialized packages.
1.7 Design principles
We follow these general design principles in the mlr3 package and mlr3verse ecosystem.
- Object-oriented programming (OOP). We embrace R6 for a clean, object-oriented design, object state-changes, and reference semantics. This means that the state of common objects (e.g., tasks (Section 2.1) and learners (Section 2.2)) is encapsulated within the object, for example to keep track of whether a model has been trained, without the user having to worry about this. We also use inheritance to specialize objects, e.g., all learners are derived from a common base class that provides basic functionality.
- Tabular data. We embrace data.table for its top-notch computational performance, and tabular data as a structure that can be easily processed further.
- Unify input and output data formats. This considerably simplifies the API and allows easy selection and "split-apply-combine" (aggregation) operations. We combine data.table and R6 to place references to non-atomic and compound objects in tables and make heavy use of list columns.
- Defensive programming and type safety. All user input is checked with checkmate (Lang 2017). We use data.table, which documents its return types, unlike other mechanisms popular in base R that "simplify" the result unpredictably (e.g., sapply() or the drop argument for indexing data.frames). And we have extensive unit tests!
- Light on dependencies. One of the main maintenance burdens for mlr was to keep up with changing learner interfaces and behavior of the many packages it depended on. We require far fewer packages in mlr3, which makes installation and maintenance easier. We still provide the same functionality, but it is split into more packages that have fewer dependencies individually.
- Separation of computation and presentation. Most packages of the mlr3 ecosystem focus on processing and transforming data, applying ML algorithms, and computing results. Our core packages do not provide visualizations because their dependencies would make installation unnecessarily complex, especially on headless servers (i.e., computers without a monitor where graphical libraries are not installed). For the same reason, visualizations of data and results are provided in the extra package mlr3viz, so the core packages avoid a dependency on ggplot2.
1.8 The mlr3 ecosystem
Throughout this book we often refer to mlr3, by which we mean not the single mlr3 base package but all the packages in our ecosystem. The mlr3 package provides the base functionality that the rest of the ecosystem depends on for building more advanced ML tools. Figure 1.1 shows the packages in the mlr3verse that extend mlr3 with capabilities for preprocessing, pipelining, visualizations, additional learners, additional task types, and more.
Figure 1.1: The mlr3 ecosystem, the mlr3verse.

A complete and up-to-date list of extension packages can be found at https://mlr-org.com/ecosystem.html.
As well as packages within the mlr3 ecosystem, software in the mlr3verse also depends on the following popular and well-established packages:
- R6: The class system predominantly used in mlr3.
- data.table: High-performance extension of R's data.frame.
- digest: Cryptographic hash functions.
- uuid: Generation of universally unique identifiers.
- lgr: Configurable logging library.
- mlbench and palmerpenguins: More ML data sets.
- evaluate: For capturing output, warnings, and exceptions (Section 9.2).
- future / future.apply / parallelly: For parallelization (Section 9.1).
We build on R6 for object orientation and data.table to store and operate on tabular data. As both are core to mlr3, we briefly introduce both packages for beginners; in-depth expertise with these packages is not necessary to work with mlr3.
1.8.1 data.table for beginners
The package data.table implements the data.table class, a popular alternative to R's data.frame(). We use data.table because it is blazingly fast and scales well to bigger data.
As with data.frame, data.tables can be constructed with data.table() or as.data.table():
library(data.table)
# converting a matrix with as.data.table
as.data.table(matrix(runif(4), 2, 2))
V1 V2
1: 0.8585914 0.4890915
2: 0.8873848 0.7180918
# using data.table
dt = data.table(x = 1:6, y = rep(letters[1:3], each = 2))
dt
x y
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
6: 6 c
data.tables can be used much like data.frames, but they provide additional functionality that makes complex operations easier. For example, data can be summarized by groups with a by argument in the [ operator:
dt[, mean(x), by = "y"]
y V1
1: a 1.5
2: b 3.5
3: c 5.5
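Grouping can also be combined with row filtering and multiple summary expressions in a single [ call. Continuing with the same dt object (the derived column names here are just illustrative):

# filter rows, then compute several summaries per group;
# .N is the number of rows in each group
dt[x > 1, .(mean_x = mean(x), n = .N), by = y]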
There is also extensive support for many kinds of database join operations that make it easy to combine multiple data.tables in different ways (Stalder 2014). For an in-depth introduction, we recommend the data.table vignette "Introduction to Data.table" (2023).
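As a small illustration of such a join, here is a self-contained sketch; the lookup table and its columns are made up for demonstration:

library(data.table)

dt = data.table(x = 1:6, y = rep(letters[1:3], each = 2))

# a small lookup table mapping each group to a label
labels = data.table(y = c("a", "b", "c"), label = c("first", "second", "third"))

# join labels onto dt by the shared column y
dt[labels, on = "y"]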
1.8.2 R6 for beginners
R6 is one of R's more recent paradigms for object-oriented programming (OOP). It addresses shortcomings of earlier OO implementations in R, such as S3, which we used in mlr. If you have done any class-based object-oriented programming before, R6 should feel familiar. We focus on the parts of R6 that you need to know to use mlr3.
Objects are created by calling the constructor of an R6::R6Class object, specifically the initialization method $new(). For example, say we have implemented a class called Foo; then foo = Foo$new(bar = 1) would create a new object of class Foo, setting the bar argument of the constructor to the value 1. In practice, we implement a lot of sugar functionality (Section 1.9) in mlr3 so you do not need to interact with R6 constructors in this way if you would prefer not to.
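To make the constructor mechanics concrete, here is a minimal sketch of such a hypothetical Foo class:

library(R6)

# a made-up class purely for illustration
Foo = R6Class("Foo",
  public = list(
    bar = NULL,
    # the constructor invoked by Foo$new()
    initialize = function(bar) {
      self$bar = bar
    }
  )
)

foo = Foo$new(bar = 1)
foo$bar  # access the field with $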
Some R6 objects may have mutable states that are encapsulated in their fields, which can be accessed through the dollar, $, operator. Continuing the previous example, we can access the bar value in the foo object by using foo$bar, or we could give it a new value, e.g., foo$bar = 2. Some fields are implemented as 'active bindings', and it is important to note that when these are accessed, some computations are actually being run in the background.
In addition to fields, methods allow users to inspect the object's state, retrieve information, or perform an action that changes the internal state of the object. For example, in mlr3, the $train() method of a learner changes the internal state of the learner by building and storing a model.

Fields and methods can be public or private. The public fields and methods define the API to interact with the object. In mlr3, you can safely ignore private methods unless you are looking to extend our universe by adding a new class (Chapter 9).
Finally, R6 objects are environments, and as such have reference semantics. This means that, for example, foo2 = foo does not create a new variable called foo2 that is a copy of foo. Instead, it creates a variable called foo2 that references foo, and so setting foo$bar = 3 will also change foo2$bar to 3, and vice versa. To copy an object, use the $clone(deep = TRUE) method; so to copy foo: foo2 = foo$clone(deep = TRUE).
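The reference behavior and $clone() can be demonstrated with a small made-up class:

library(R6)

# minimal illustrative class with a single public field
Counter = R6Class("Counter", public = list(n = 0))

a = Counter$new()
b = a                       # 'b' references the same object as 'a'
a$n = 3
b$n                         # also 3, because of reference semantics

cc = a$clone(deep = TRUE)   # an independent copy
a$n = 5
cc$n                        # still 3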
For a longer introduction, we recommend the R6 vignettes found at https://r6.r-lib.org/; more detail can be found at https://adv-r.hadley.nz/r6.html.
1.9 Essential mlr3 utilities

Finally, the following sections cover some important utilities that are essential to navigating the mlr3verse.
Sugar Functions

Most objects in mlr3 can be created through convenience functions called helper functions or sugar functions. They provide shortcuts for common code idioms, reducing the amount of code a user has to write. For example, lrn("regr.rpart") is the sugar version of LearnerRegrRpart$new(). We heavily use helper functions throughout this book and give the equivalent "full form" for complete detail at the end of each chapter. The helper functions are designed to cover the majority of use cases for most users; knowledge about the full R6 backend is usually only required for custom objects or extensions.
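To see the two forms side by side (a short sketch; it assumes the rpart package is installed, as the learner depends on it):

library(mlr3)

# sugar function: concise construction via the learner's key
l1 = lrn("classif.rpart")

# equivalent "full form": calling the R6 constructor directly
l2 = LearnerClassifRpart$new()

# both produce objects of the same class
class(l1)[1] == class(l2)[1]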
Dictionaries

mlr3 uses dictionaries to store objects like learners or tasks. Much like paper dictionaries associate words with their definitions, our dictionaries associate keys (i.e., identifiers) with objects (for example, R6 objects). Values in dictionaries are often accessed through sugar functions that retrieve objects from the relevant dictionary; for example, lrn("regr.rpart") is a wrapper around mlr_learners$get("regr.rpart") and is thus a simpler way to load a decision tree learner from the mlr_learners dictionary. We use dictionaries to group large collections of relevant objects so they can be listed and retrieved easily. For example, you can see an overview of available learners (that are in loaded packages) and their properties with as.data.table(mlr_learners).
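A short sketch of browsing a dictionary in this way (the exact columns shown depend on the installed mlr3 version):

library(mlr3)

# retrieve a learner directly from the dictionary (full form of lrn())
learner = mlr_learners$get("classif.rpart")

# list available learners with some of their properties as a table
learners_table = as.data.table(mlr_learners)
head(learners_table[, c("key", "task_type", "predict_types")])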
mlr3viz

mlr3viz includes all plotting functionality in mlr3 and uses ggplot2 under the hood. We use ggplot2::theme_minimal() in all our plots to unify our aesthetic, but as with all ggplot outputs, users can fully customize this.
mlr3viz extends fortify and autoplot for use with common mlr3 outputs, including Prediction, Learner, and BenchmarkResult objects (which we will introduce and cover in the next chapters). We will cover major plot types throughout the book, but the best way to learn about mlr3viz is through experimentation: load the package and see what happens when you run autoplot on an mlr3 object. Plot types are documented in the respective manual page that can be accessed through ?autoplot.X; for example, the documentation of plots for regression tasks can be found by running ?autoplot.TaskRegr.
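For instance, a quick experiment along those lines might look like this (it assumes the rpart package is installed for the learner):

library(mlr3)
library(mlr3viz)

task = tsk("penguins")
learner = lrn("classif.rpart", predict_type = "prob")
learner$train(task)
prediction = learner$predict(task)

# autoplot() dispatches on the object's class;
# see ?autoplot.PredictionClassif for the available plot types
autoplot(prediction)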