Preface
Welcome to the Machine Learning in R universe (mlr3verse)! Before we begin, make sure you have installed mlr3 if you want to follow along. We recommend installing the complete mlr3verse, which will install all of the important packages. Alternatively, you can install just the base package, mlr3.
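The two installation options mentioned above can be sketched with base R's install.packages(). This assumes the packages are available from a repository that R already knows about (for example CRAN); only one of the two calls is needed, so the second is commented out.

```r
# Install the complete mlr3verse (recommended); this also installs mlr3
# and the most important extension packages
install.packages("mlr3verse")

# Alternatively, install only the base package
# install.packages("mlr3")
```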
In our first example, we will show you some of the most basic functionality – training a model and making predictions.
library(mlr3)

# Train a decision tree on a random subset of the penguins task
task = tsk("penguins")
split = partition(task)
learner = lrn("classif.rpart")
learner$train(task, row_ids = split$train)
learner$model

n= 231

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 231 129 Adelie (0.441558442 0.199134199 0.359307359)
  2) flipper_length< 207.5 145 44 Adelie (0.696551724 0.296551724 0.006896552)
    4) bill_length< 44.65 100 2 Adelie (0.980000000 0.020000000 0.000000000) *
    5) bill_length>=44.65 45 4 Chinstrap (0.066666667 0.911111111 0.022222222) *
  3) flipper_length>=207.5 86 4 Gentoo (0.011627907 0.034883721 0.953488372) *

# Predict on the remaining rows (calls reconstructed from the printed output)
prediction = learner$predict(task, row_ids = split$test)
prediction

<PredictionClassif> for 113 observations:
 row_ids     truth  response
       3    Adelie    Adelie
       4    Adelie    Adelie
       5    Adelie    Adelie
     ---
     341 Chinstrap    Adelie
     343 Chinstrap    Gentoo
     344 Chinstrap Chinstrap

# Score the predictions with the accuracy measure
prediction$score(msr("classif.acc"))

classif.acc
  0.9380531
In this example, we trained a decision tree on a subset of the penguins dataset, made predictions on the rest of the data and then evaluated these with the accuracy measure. In Chapter 2 we will break this down in more detail.
mlr3 makes training and predicting easy, but it also allows us to perform very complex operations in just a few lines of code:
library(mlr3verse)
library(mlr3pipelines)
library(mlr3benchmark)

tasks = tsks(c("breast_cancer", "sonar"))

# Random forest whose number of trees is tuned by grid search
tuned_rf = auto_tuner(
  tnr("grid_search", resolution = 5),
  lrn("classif.ranger", num.trees = to_tune(200, 500)),
  rsmp("holdout")
)
# Prepend robustifying preprocessing (imputation, collapsing factor levels)
tuned_rf = pipeline_robustify(NULL, tuned_rf, TRUE) %>>%
  po("learner", tuned_rf)

# Stacked model: a logistic regression combines the base learners' predictions
stack_lrn = ppl(
  "stacking",
  base_learners = lrns(c("classif.rpart", "classif.kknn")),
  lrn("classif.log_reg"))
stack_lrn = pipeline_robustify(NULL, stack_lrn, TRUE) %>>%
  po("learner", stack_lrn)

# Benchmark both pipelines on both tasks with a holdout split
learners = c(tuned_rf, stack_lrn)
bm = benchmark(benchmark_grid(tasks, learners, rsmp("holdout")))

bma = bm$aggregate(msr("classif.acc"))[, c("task_id", "learner_id",
  "classif.acc")]
bma$learner_id = rep(c("RF", "Stack"), 2)
bma

         task_id learner_id classif.acc
1: breast_cancer         RF   0.9605263
2: breast_cancer      Stack   0.9122807
3:         sonar         RF   0.7681159
4:         sonar      Stack   0.7101449

# Friedman test across tasks (call reconstructed from the output below;
# assumes as_benchmark_aggr() from mlr3benchmark)
as_benchmark_aggr(bm)$friedman_test()

    Friedman rank sum test

data:  ce and learner_id and task_id
Friedman chi-squared = 2, df = 1, p-value = 0.1573
In this (much more complex!) example we chose two tasks and two machine learning (ML) algorithms (“learners” in mlr3 terms). We used automated tuning to optimize the number of trees in the random forest learner (Chapter 4) and an ML pipeline that imputes missing data, collapses factor levels, and creates stacked models (Chapter 6). We also showed basic features like loading learners (Chapter 2) and choosing resampling strategies for benchmarking (Chapter 3). Finally, we compared the performance of the models using the mean accuracy on the test set, and applied a statistical test to see if the learners performed significantly differently (they did not!).
You will learn how to do all this and more in this book. We will walk through the functionality offered by mlr3 and the packages in the mlr3verse step by step. There are a few different ways you can use this book, which we will discuss next.
How to use this book
The mlr3 ecosystem is the result of many years of methodological and applied research, together with continual improvement of the design and implementation of the packages. This book describes the resulting features of the mlr3verse and discusses best practices for ML, technical implementation details, extension guidelines, and in-depth considerations for optimizing ML. It is suitable for a wide range of readers and levels of ML expertise.
Chapter 1, Chapter 2, and Chapter 3 cover the basics of mlr3. These chapters are essential to understanding the core infrastructure of ML in mlr3. We recommend that all readers study these chapters to become familiar with basic mlr3 terminology, syntax, and style. Chapter 4, Chapter 5, and Chapter 6 contain more advanced implementation details and some ML theory. Chapter 8 delves into detail on domain-specific methods that are implemented in our extension packages. Readers may choose to selectively read sections in this chapter depending on their use cases (i.e., if they have domain-specific problems to tackle), or to use these as introductions to new domains to explore. Chapter 9 contains technical implementation details that are essential reading for advanced users who require parallelisation, custom error handling, and fine control over hyperparameters and large databases. Chapter 10 discusses packages that can be integrated with mlr3 to provide model-agnostic interpretability methods. Finally, anyone who would like to contribute to our ecosystem should read Chapter 11.
Of course, you can also read the book cover to cover. We have marked any section that contains complex technical information, and you may wish to skip these if you are only interested in basic functionality. Similarly, we have marked sections that are optional, such as parts that are more methodologically focused and do not discuss the software implementation. Readers who are interested in the more technical detail will likely want to pay attention to the tables at the end of each chapter that show the relationship between our S3 ‘sugar’ functions and the underlying R6 classes; this is explained in more detail in Chapter 1.
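To give a flavour of the relationship between sugar functions and R6 classes, here is a minimal sketch using the decision tree learner from the first example. It assumes that mlr3 is installed and that the class behind lrn("classif.rpart") is the exported LearnerClassifRpart; it is an illustration only and does not replace the tables at the end of each chapter.

```r
library(mlr3)

# Sugar function: a short constructor that looks the learner up by its key
learner_sugar = lrn("classif.rpart")

# Direct construction from the underlying R6 class
learner_r6 = LearnerClassifRpart$new()

# Both objects are instances of the same R6 class
class(learner_sugar)[1]
identical(class(learner_sugar), class(learner_r6))
```

The sugar form is shorter to type, while the R6 class name is what appears in the documentation and in class(), so it helps to recognise both.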
This book tries to follow the Diátaxis framework for documentation and so we include tutorials, how-to guides, API references, and explanations. This means that the conclusion of each chapter includes a short reference to the core functions learnt in the chapter, links to relevant posts in the mlr3gallery, and a few exercises that will cover content introduced in the chapter. You can find the solutions to these exercises in Appendix A.
Finally, if you want to reproduce any of the results in this book, note that at the start of each chapter we run set.seed(<chapter number>), and the sessionInfo at the time of publication is printed in Appendix E.
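As a small illustration of why the seed matters, the sketch below seeds R's random number generator twice with the same value and checks that the random draws agree. The value 2 merely stands in for a chapter number, and sample() stands in for any random operation such as a train/test split. Matching the seed alone may not be enough if your R or package versions differ from those recorded in the session information.

```r
# Seed the random number generator, as done at the start of each chapter
# (2 stands in for the chapter number)
set.seed(2)
first_draw = sample(10)

# Re-seeding with the same value reproduces exactly the same draw
set.seed(2)
second_draw = sample(10)
identical(first_draw, second_draw)

# Print the R version, platform, and attached package versions to compare
# against the session information recorded in the appendix
sessionInfo()
```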
Installation guidelines
All packages in the mlr3 ecosystem can be installed from GitHub and R-universe; the majority (but not all) of the packages can also be installed from CRAN. R-universe is an alternative package repository to CRAN. We recommend adding the mlr-org R-universe to your R options so that you can install all packages with install.packages() without having to worry about which package repository they come from; R will then look at both R-universe and CRAN when installing packages and will always install the latest version of a package. To do this, install the usethis package and use it to open your R startup file (.Rprofile).
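The helper meant here is presumably edit_r_profile() from usethis, which opens your user-level .Rprofile in an editor. Treat the exact call as an assumption, since the original command is not shown.

```r
# Assumes usethis is installed: install.packages("usethis")
# Opens the user-level .Rprofile for editing
usethis::edit_r_profile()
```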
In the file that opens, add or change the repos argument in options so that it lists the mlr-org R-universe alongside CRAN (you might need to add a full call to options or just edit the existing one).
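A minimal sketch of what the repos entry might look like. It assumes that the mlr-org R-universe is served at https://mlr-org.r-universe.dev and uses the CRAN cloud mirror as the second repository; the name mlrorg is only a label, and any CRAN mirror can be substituted.

```r
# In .Rprofile: make both the mlr-org R-universe and CRAN available
# to install.packages()
options(repos = c(
  mlrorg = "https://mlr-org.r-universe.dev",
  CRAN = "https://cloud.r-project.org/"
))
```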
Save the file, restart your R session, and you are ready to go!
If you want the latest development version of any of our packages, install it from its GitHub repository, mlr-org/{pkg}, with {pkg} replaced by the name of the package you want to install. You can see an up-to-date list of all our extension packages at https://github.com/mlr-org/mlr3/wiki/Extension-Packages.
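One common way to install a development version from GitHub is install_github() from the remotes package. This is an assumption on our part, since the original command is not shown, and other installers would work as well. The sketch uses mlr3 as a stand-in for {pkg}.

```r
# Assumes the remotes package is installed: install.packages("remotes")
# "mlr3" stands in for {pkg}; replace it with the package you want
remotes::install_github("mlr-org/mlr3")
```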
Community links
The mlr community is open to all and we welcome everybody, from those completely new to ML and R to advanced coders and professional data scientists. You can reach us on our Mattermost.
For case studies and how-to guides, check out the mlr3gallery for extended practical blog posts. For updates on mlr you might find our blog a useful point of reference.
We appreciate all contributions, whether they are bug reports, feature requests, or pull requests that fix bugs or extend functionality. Each of our GitHub repositories includes issue and pull request templates to ensure we can help you as much as possible to get started. Please make sure you read our code of conduct and contribution guidelines. With so many packages in our universe it may be hard to keep track of where to open issues. As a general rule:
- If you have a question about using any part of the mlr3 ecosystem, ask on StackOverflow and use the tag #mlr3 – one of our team will answer you there. Be sure to include a reproducible example (reprex) and if we think you found a bug then we will refer you to the relevant GitHub repository.
- Bug reports or pull requests about core functionality (train, predict, etc.) should be opened in the mlr3 GitHub repository.
- Bug reports or pull requests about learners should be opened in the mlr3extralearners GitHub repository.
- Bug reports or pull requests about measures should be opened in the mlr3measures GitHub repository.
- Bug reports or pull requests about domain-specific functionality should be opened in the GitHub repository of the respective package (see Chapter 1).
Do not worry about opening an issue in the wrong place; we will transfer it to the right one!
Citation info
Every package in the mlr3verse has its own citation details that can be found on the respective GitHub repository.
To reference this book please use:
Becker M, Binder M, Bischl B, Foss N, Kotthoff L, Lang M, Pfisterer F,
Reich N G, Richter J, Schratz P, Sonabend R, Pulatov D.
2023. "Preface". https://mlr3book.mlr-org.com.
@misc{mlr3book,
  title = {Preface},
  author = {Marc Becker and Martin Binder and Bernd Bischl and Natalie Foss and
    Lars Kotthoff and Michel Lang and Florian Pfisterer and Nicholas G. Reich and
    Jakob Richter and Patrick Schratz and Raphael Sonabend and Damir Pulatov},
  url = {https://mlr3book.mlr-org.com},
  year = {2023}
}
To reference the mlr3 package, please cite our JOSS paper:
Lang M, Binder M, Richter J, Schratz P, Pfisterer F, Coors S, Au Q,
Casalicchio G, Kotthoff L, Bischl B (2019). “mlr3: A modern object-oriented
machine learning framework in R.” Journal of Open Source Software.
doi: 10.21105/joss.01903.
@Article{mlr3,
title = {{mlr3}: A modern object-oriented machine learning framework in {R}},
author = {Michel Lang and Martin Binder and Jakob Richter and Patrick Schratz and
Florian Pfisterer and Stefan Coors and Quay Au and Giuseppe Casalicchio and
Lars Kotthoff and Bernd Bischl},
journal = {Journal of Open Source Software},
year = {2019},
month = {dec},
doi = {10.21105/joss.01903},
url = {https://joss.theoj.org/papers/10.21105/joss.01903},
}
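Citation details for an installed package can also be queried from within R using the base function citation(). The sketch below assumes that mlr3 is installed; what it prints depends on the CITATION information shipped with the installed version, so it may not match the entry above word for word.

```r
# Retrieve the citation information of the installed mlr3 package
mlr3_citation = citation("mlr3")
mlr3_citation

# Convert the same information to BibTeX for use in a reference manager
toBibtex(mlr3_citation)
```

The same call works for any other installed package by changing the package name.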
mlr3book style guide
Throughout this book we will use our own style guide that can be found in the mlr3 wiki. Below are the most important style choices relevant to the book.
- We always use = instead of <- for assignment.
- Class names are in UpperCamelCase.
- Function and method names are in lower_snake_case.
- When referencing functions, we will only include the package prefix (e.g., pkg::function) for functions outside the mlr3 universe or when there may be ambiguity about which package the function lives in. Note that you can use environment(function) to see which namespace a function is loaded from.
- We denote packages, fields, methods, and functions as follows:
  - package - With link (if online) to package CRAN, R-Universe, or GitHub page
  - package::function() (for functions outside the mlr-org ecosystem)
  - function() (for functions inside the mlr-org ecosystem) - With link to function documentation page
  - $field for fields (data encapsulated in an R6 class)
  - $method() for methods (functions encapsulated in an R6 class)
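The conventions above can be illustrated with a small, self-contained sketch. The class RunningMean and all of its members are hypothetical and exist only for this illustration; the only assumption is that the R6 package is installed.

```r
library(R6)

# Class names use UpperCamelCase; RunningMean is a hypothetical class
RunningMean = R6Class("RunningMean",
  public = list(
    # Fields: data encapsulated in the class, accessed as $n_values and $total
    n_values = 0,
    total = 0,

    # Methods use lower_snake_case and are called as $add_value()
    add_value = function(x) {
      self$n_values = self$n_values + 1
      self$total = self$total + x
      invisible(self)  # return the object so that calls can be chained
    },

    current_mean = function() {
      self$total / self$n_values
    }
  )
)

# Assignment uses = rather than <-
running_mean = RunningMean$new()
running_mean$add_value(2)$add_value(4)
running_mean$n_values        # field access
running_mean$current_mean()  # method call

# environment() shows which namespace a function is loaded from
environment(stats::median)
```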