Preface

Welcome to the Machine Learning in R universe (mlr3verse)! Before we begin, make sure you have installed mlr3 if you want to follow along. We recommend installing the complete mlr3verse, which will install all of the important packages.

install.packages("mlr3verse")

Or you can install just the base package:

install.packages("mlr3")

In our first example, we will show you some of the most basic functionality – training a model and making predictions.

library(mlr3)
task = tsk("penguins")
split = partition(task)
learner = lrn("classif.rpart")

learner$train(task, row_ids = split$train)
learner$model
n= 231 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 231 129 Adelie (0.441558442 0.199134199 0.359307359)  
  2) flipper_length< 207.5 145  44 Adelie (0.696551724 0.296551724 0.006896552)  
    4) bill_length< 44.65 100   2 Adelie (0.980000000 0.020000000 0.000000000) *
    5) bill_length>=44.65 45   4 Chinstrap (0.066666667 0.911111111 0.022222222) *
  3) flipper_length>=207.5 86   4 Gentoo (0.011627907 0.034883721 0.953488372) *
predictions = learner$predict(task, row_ids = split$test)
predictions
<PredictionClassif> for 113 observations:
    row_ids     truth  response
          3    Adelie    Adelie
          4    Adelie    Adelie
          5    Adelie    Adelie
---                            
        341 Chinstrap    Adelie
        343 Chinstrap    Gentoo
        344 Chinstrap Chinstrap
predictions$score(msr("classif.acc"))
classif.acc 
  0.9380531 

In this example, we trained a decision tree on a subset of the penguins dataset, made predictions on the rest of the data and then evaluated these with the accuracy measure. In Chapter 2 we will break this down in more detail.

mlr3 makes training and predicting easy, but it also allows us to perform very complex operations in just a few lines of code:

library(mlr3verse)
library(mlr3pipelines)
library(mlr3benchmark)

tasks = tsks(c("breast_cancer", "sonar"))
tuned_rf = auto_tuner(
    tnr("grid_search", resolution = 5),
    lrn("classif.ranger", num.trees = to_tune(200, 500)),
    rsmp("holdout")
)
tuned_rf = pipeline_robustify(NULL, tuned_rf, TRUE) %>>%
    po("learner", tuned_rf)
stack_lrn = ppl(
    "stacking",
    base_learners = lrns(c("classif.rpart", "classif.kknn")),
    lrn("classif.log_reg"))
stack_lrn = pipeline_robustify(NULL, stack_lrn, TRUE) %>>%
    po("learner", stack_lrn)

learners = c(tuned_rf, stack_lrn)
bm = benchmark(benchmark_grid(tasks, learners, rsmp("holdout")))
bma = bm$aggregate(msr("classif.acc"))[, c("task_id", "learner_id",
  "classif.acc")]
bma$learner_id = rep(c("RF", "Stack"), 2)
bma
         task_id learner_id classif.acc
1: breast_cancer         RF   0.9605263
2: breast_cancer      Stack   0.9122807
3:         sonar         RF   0.7681159
4:         sonar      Stack   0.7101449
as.BenchmarkAggr(bm)$friedman_test()

    Friedman rank sum test

data:  ce and learner_id and task_id
Friedman chi-squared = 2, df = 1, p-value = 0.1573

In this (much more complex!) example we chose two tasks and two machine learning (ML) algorithms (“learners” in mlr3 terms). We used automated tuning to optimize the number of trees in the random forest learner (Chapter 4) and an ML pipeline that imputes missing data, collapses factor levels, and creates stacked models (Chapter 6). We also showed basic features like loading learners (Chapter 2) and choosing resampling strategies for benchmarking (Chapter 3). Finally, we compared the performance of the models using the mean accuracy on the test set, and applied a statistical test to see if the learners performed significantly differently (they did not!).
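To see where the test statistic above comes from, the following sketch re-runs the Friedman test with base R's friedman.test() on the four accuracies printed in the table. The matrix name acc is our own; rows are tasks (blocks) and columns are learners (groups). With only two tasks and two learners this illustrates the mechanics rather than a meaningful analysis.

```r
# Accuracies copied from the benchmark table above:
# rows = tasks (blocks), columns = learners (groups).
acc = matrix(
  c(0.9605263, 0.9122807,
    0.7681159, 0.7101449),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    task = c("breast_cancer", "sonar"),
    learner = c("RF", "Stack")
  )
)

# The Friedman test ranks the learners within each task and
# compares the rank sums across learners.
ft = friedman.test(acc)
ft$statistic  # Friedman chi-squared
ft$parameter  # degrees of freedom
ft$p.value
```

The random forest ranks first on both tasks, so the statistic is 2 on 1 degree of freedom with a p-value of about 0.157, matching the output above. Ranking by classification error instead of accuracy reverses the ranks but leaves the statistic unchanged.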

You will learn how to do all this and more in this book. We will walk through the functionality offered by mlr3 and the packages in the mlr3verse step by step. There are a few different ways you can use this book, which we will discuss next.

How to use this book

The mlr3 ecosystem is the result of many years of methodological and applied research, together with continual improvements to the design and implementation of its packages. This book describes the resulting features of the mlr3verse and discusses best practices for ML, technical implementation details, extension guidelines, and in-depth considerations for optimizing ML. It is suitable for a wide range of readers and levels of ML expertise.

Chapter 1, Chapter 2, and Chapter 3 cover the basics of mlr3. These chapters are essential to understanding the core infrastructure of ML in mlr3. We recommend that all readers study these chapters to become familiar with basic mlr3 terminology, syntax, and style. Chapter 4, Chapter 5, and Chapter 6 contain more advanced implementation details and some ML theory. Chapter 8 delves into detail on domain-specific methods that are implemented in our extension packages. Readers may choose to read sections of this chapter selectively depending on their use cases (i.e., if they have domain-specific problems to tackle), or to use these sections as introductions to new domains to explore. Chapter 9 contains technical implementation details that are essential reading for advanced users who require parallelisation, custom error handling, and fine control over hyperparameters and large databases. Chapter 10 discusses packages that can be integrated with mlr3 to provide model-agnostic interpretability methods. Finally, anyone who would like to contribute to our ecosystem should read Chapter 11.

Of course, you can also read the book from cover to cover. We have marked any section that contains complex technical information, and you may wish to skip these if you are only interested in basic functionality. Similarly, we have marked sections that are optional, such as parts that are more methodologically focused and do not discuss the software implementation. Readers who are interested in the more technical detail will likely want to pay attention to the tables at the end of each chapter that show the relationship between our S3 ‘sugar’ functions and the underlying R6 classes; this is explained in more detail in Chapter 1.
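As a small illustration of the relationship between sugar functions and R6 classes, the sketch below constructs the same decision tree learner in three ways. It assumes mlr3 is installed; the variable names are our own.

```r
library(mlr3)

# Sugar function: short and convenient.
learner_sugar = lrn("classif.rpart")

# The same learner retrieved from the underlying dictionary.
learner_dict = mlr_learners$get("classif.rpart")

# The same learner built directly from its R6 class.
learner_r6 = LearnerClassifRpart$new()

# All three objects share the same class and the same id.
class(learner_sugar)[1]
identical(class(learner_sugar), class(learner_dict))
identical(class(learner_sugar), class(learner_r6))
learner_sugar$id
```

The sugar function is usually the most readable choice; the R6 constructor is mainly useful when you want to see or extend what happens underneath.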

This book tries to follow the Diátaxis framework for documentation, so we include tutorials, how-to guides, API references, and explanations. This means that the conclusion of each chapter includes a short reference to the core functions learnt in the chapter, links to relevant posts in the mlr3gallery, and a few exercises that cover content introduced in the chapter. You can find the solutions to these exercises in Appendix A.

Finally, if you want to reproduce any of the results in this book, note that at the start of each chapter we run set.seed(<chapter number>), and the output of sessionInfo() at the time of publication is printed in Appendix E.
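The effect of seeding can be checked with base R alone. In this sketch the seed value 1 stands in for a chapter number, and sample() stands in for any random operation, such as the train/test split created by partition().

```r
# Seed the random number generator, then draw a random permutation.
set.seed(1)
first_draw = sample(10)

# Re-seeding with the same value reproduces the draw exactly.
set.seed(1)
second_draw = sample(10)
identical(first_draw, second_draw)

# Without re-seeding, the next draw continues the random stream
# and will generally differ from the first one.
third_draw = sample(10)
```

Results can still differ across R or package versions, which is why comparing the session information is worthwhile as well.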

Installation guidelines

All packages in the mlr3 ecosystem can be installed from GitHub and R-universe; the majority (but not all) of the packages can also be installed from CRAN. We recommend adding the mlr-org R-universe to your R options so that you can install all packages with install.packages() without having to worry about which package repository each one comes from. R-universe is an alternative package repository to CRAN; the code below tells R to look at both R-universe and CRAN when trying to install packages, and R will always install the latest version of a package. To do this, install the usethis package and run the following:

  • usethis::edit_r_profile()

    In the file that opens add or change the repos argument in options so it looks something like this (you might need to add the full code block below or just edit the existing options function).

    options(repos = c(
      mlrorg = "https://mlr-org.r-universe.dev",
      CRAN = "https://cloud.r-project.org/"
    ))

    Save the file, restart your R session, and you are ready to go!

    install.packages("mlr3verse")

    If you want the latest development version of any of our packages, run

    remotes::install_github("mlr-org/{pkg}")

    with {pkg} replaced with the name of the package you want to install. You can see an up-to-date list of all our extension packages at https://github.com/mlr-org/mlr3/wiki/Extension-Packages.
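If you would rather not edit your R profile, a similar effect can be obtained for the current session only. The sketch below sets the same two repositories with options() and inspects the result; it is an alternative to the persistent setup above rather than a replacement for it, and the setting is lost when R restarts.

```r
# Remember the current setting so it can be restored later.
old_repos = getOption("repos")

# Use the mlr-org R-universe and CRAN for this session only.
options(repos = c(
  mlrorg = "https://mlr-org.r-universe.dev",
  CRAN = "https://cloud.r-project.org/"
))

# Inspect the repositories that install.packages() will now consult.
new_repos = getOption("repos")
print(new_repos)

# Restore the previous setting if desired:
# options(repos = old_repos)
```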

Citation info

    Every package in the mlr3verse has its own citation details that can be found on the respective GitHub repository.

    To reference this book please use:

    Becker M, Binder M, Bischl B, Foss N, Kotthoff L, Lang M, Pfisterer F,
    Reich N G, Richter J, Schratz P, Sonabend R, Pulatov D.
    2023. "Preface". https://mlr3book.mlr-org.com.
    @misc{mlr3book,
        title = {Preface},
        author = {Marc Becker and Martin Binder and Bernd Bischl and Natalie Foss and
        Lars Kotthoff and Michel Lang and Florian Pfisterer and Nicholas G. Reich and
        Jakob Richter and Patrick Schratz and Raphael Sonabend and Damir Pulatov},
        url = {https://mlr3book.mlr-org.com},
        year = {2023}
    }

    To reference the mlr3 package, please cite our JOSS paper:

    Lang M, Binder M, Richter J, Schratz P, Pfisterer F, Coors S, Au Q,
    Casalicchio G, Kotthoff L, Bischl B (2019). “mlr3: A modern object-oriented
    machine learning framework in R.” Journal of Open Source Software.
    doi: 10.21105/joss.01903.
    
    @Article{mlr3,
      title = {{mlr3}: A modern object-oriented machine learning framework in {R}},
      author = {Michel Lang and Martin Binder and Jakob Richter and Patrick Schratz and
      Florian Pfisterer and Stefan Coors and Quay Au and Giuseppe Casalicchio and
      Lars Kotthoff and Bernd Bischl},
      journal = {Journal of Open Source Software},
      year = {2019},
      month = {dec},
      doi = {10.21105/joss.01903},
      url = {https://joss.theoj.org/papers/10.21105/joss.01903},
    }

mlr3book style guide

    Throughout this book we will use our own style guide, which can be found in the mlr3 wiki. Below are the most important style choices relevant to the book.

    1. We always use = instead of <- for assignment.

    2. Class names are in UpperCamelCase.

    3. Function and method names are in lower_snake_case.

    4. When referencing functions, we will only include the package prefix (e.g., pkg::function) for functions outside the mlr3 universe or when there may be ambiguity about which package a function lives in. Note that you can call environment() on a function to see which namespace it is loaded from.

    5. We denote packages, fields, methods, and functions as follows:

      • package - with a link (if online) to the package's CRAN, R-universe, or GitHub page
      • package::function() (for functions outside the mlr-org ecosystem)
      • function() (for functions inside the mlr-org ecosystem) - with a link to the function's documentation page
      • $field for fields (data encapsulated in an R6 class)
      • $method() for methods (functions encapsulated in an R6 class)
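
The tip in item 4 can be tried with any function. The sketch below uses sd() from the stats package, which ships with R, and follows the style choices above (= for assignment, lower_snake_case names); the variable name is our own.

```r
# environment() returns the environment a function was defined in;
# for a package function this is the package namespace.
environment(sd)

# environmentName() extracts just the package name as a string.
sd_namespace = environmentName(environment(sd))
sd_namespace

# Primitive functions such as sum() have no enclosing environment,
# so environment() returns NULL for them.
is.null(environment(sum))
```

The same call works for mlr3 functions once the package is loaded, which helps when two loaded packages export functions with the same name.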