1  Introduction

Lars Kotthoff
University of Wyoming

Raphael Sonabend
Imperial College London

Natalie Foss
University of Wyoming

Bernd Bischl
Ludwig-Maximilians-Universität München

Welcome to the Machine Learning in R universe (mlr3verse). In this book we will guide you through the functionality offered by mlr3 step by step. If you want to contribute to our universe, ask questions, read documentation, or just chat with the team, head to https://github.com/mlr-org/mlr3 which has several useful links in the README.

Before we begin, make sure you have installed mlr3 if you want to follow along. We recommend installing the complete mlr3verse, which will install all of the important packages.

install.packages("mlr3verse")

Or you can install just the base package:

install.packages("mlr3")

1.1 Target audience and how to use this book

The mlr3 ecosystem is the result of many years of methodological and applied research, and of continuous improvement to the design and implementation of its packages. This book describes the resulting features of the mlr3verse and discusses best practices for ML, technical implementation details, extension guidelines, and in-depth considerations for optimizing ML. It is suitable for a wide range of readers and levels of ML expertise, but we assume that users of mlr3 have taken an introductory machine learning course, or have the equivalent expertise, and have some basic experience with R. A background in computer science or statistics is beneficial for understanding the advanced functionality described in the later chapters of this book, but not required. A comprehensive introduction for those new to machine learning can be found in James et al. (2013), and Wickham and Grolemund (2017) give a comprehensive introduction to data science in R. This book may also be helpful for both practitioners who want to quickly apply machine learning algorithms and researchers who want to implement, benchmark, and compare their new methods in a structured environment.

Chapter 1, Chapter 2, and Chapter 3 cover the basics of mlr3. These chapters are essential to understanding the core infrastructure of ML in mlr3. We recommend that all readers study these chapters to become familiar with basic mlr3 terminology, syntax, and style. Chapter 4, Chapter 5, Chapter 6, and Chapter 7 contain more advanced implementation details and more complex concepts and algorithms. Chapter 8 delves into detail on domain-specific methods that are implemented in our extension packages. Readers may choose to selectively read sections in this chapter depending on their use cases (i.e., if they have domain-specific problems to tackle), or to use these sections as introductions to new domains to explore. Chapter 9 contains technical implementation details that are essential reading for advanced users who require parallelization, custom error handling, and fine control over hyperparameters and large databases. Finally, Chapter 11 discusses packages that can be integrated with mlr3 to provide model-agnostic interpretability methods.

Of course, you can also read the book cover to cover from start to finish. We have marked as ‘optional’ those sections that are particularly complex with respect to either technical or methodological detail; these can be skipped on a first read.

Each chapter includes tutorials, API references, explanations of methodologies, and exercises to test yourself on what you have learnt. You can find the solutions to these exercises in Appendix A.

If you want to reproduce any of the results in this book, note that at the start of each chapter we run set.seed(123), and the output of sessionInfo() at the time of publication is printed in Appendix E.
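A seed makes the pseudo-random draws that ML methods rely on reproducible; as a small base-R illustration:

```r
set.seed(123)
a = runif(3)  # three pseudo-random numbers

set.seed(123)
b = runif(3)  # resetting the seed replays the same sequence

identical(a, b)  # TRUE
```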

1.2 Installation guidelines

All packages in the mlr3 ecosystem can be installed from GitHub and R-universe; the majority (but not all) can also be installed from CRAN. We recommend adding the mlr-org R-universe1 to your R options so that you can install all packages with install.packages() without having to worry about which repository each package comes from. To do this, install the usethis package and run the following:

  • 1 R-universe is an alternative package repository to CRAN. The bit of code below tells R to look at both R-universe and CRAN when trying to install packages. R will always install the latest version of a package.

  • usethis::edit_r_profile()

    In the file that opens, add or change the repos argument in options so it looks something like the code below (you might need to add the full code block or just edit the existing options call).

    options(repos = c(
      mlrorg = "https://mlr-org.r-universe.dev",
      CRAN = "https://cloud.r-project.org/"
    ))

    Save the file, restart your R session, and you are ready to go!

    install.packages("mlr3verse")

    If you want the latest development version of any of our packages, run

    remotes::install_github("mlr-org/{pkg}")

    with {pkg} replaced with the name of the package you want to install. You can see an up-to-date list of all our extension packages at https://github.com/mlr-org/mlr3/wiki/Extension-Packages.

    1.3 mlr3book code style

    Throughout this book we will use the following code style:

    1. We always use = instead of <- for assignment.

    2. Class names are in UpperCamelCase.

    3. Function and method names are in lower_snake_case.

    4. When referencing functions, we will only include the package prefix (e.g., pkg::function) for functions outside the mlr3 universe or when there may be ambiguity about which package the function lives in. Note that you can use environment(function) to see which namespace a function is loaded from.

    5. We denote packages, fields, methods, and functions as follows:

      • package - With link (if online) to package CRAN, R-Universe, or GitHub page
      • package::function() (for functions outside the mlr-org ecosystem)
      • function() (for functions inside the mlr-org ecosystem)
      • $field for fields (data encapsulated in an R6 class)
      • $method() for methods (functions encapsulated in an R6 class)
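As a small base-R illustration of point 4 above, environment() reveals which namespace a function was loaded from:

```r
# for a function exported by a package, the enclosing
# environment is that package's namespace
environment(stats::median)
#> <environment: namespace:stats>

environmentName(environment(stats::median))
#> [1] "stats"
```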

    Now let us see this in practice with our first example.

    1.4 mlr3 example

    The mlr3 universe includes a wide range of tools taking you from basic ML to complex experiments. To get started, here is an example of the simplest functionality – training a model and making predictions.

    library(mlr3)
    task = tsk("penguins")
    split = partition(task)
    learner = lrn("classif.rpart")
    
    learner$train(task, row_ids = split$train)
    learner$model
    n= 231 
    
    node), split, n, loss, yval, (yprob)
          * denotes terminal node
    
    1) root 231 129 Adelie (0.441558442 0.199134199 0.359307359)  
      2) flipper_length< 206.5 144  44 Adelie (0.694444444 0.298611111 0.006944444)  
        4) bill_length< 43.05 98   3 Adelie (0.969387755 0.030612245 0.000000000) *
        5) bill_length>=43.05 46   6 Chinstrap (0.108695652 0.869565217 0.021739130) *
      3) flipper_length>=206.5 87   5 Gentoo (0.022988506 0.034482759 0.942528736) *
    predictions = learner$predict(task, row_ids = split$test)
    predictions
    <PredictionClassif> for 113 observations:
        row_ids     truth  response
              1    Adelie    Adelie
              2    Adelie    Adelie
              3    Adelie    Adelie
    ---                            
            328 Chinstrap Chinstrap
            331 Chinstrap    Adelie
            339 Chinstrap Chinstrap
    predictions$score(msr("classif.acc"))
    classif.acc 
      0.9557522 

    In this example, we trained a decision tree on a subset of the penguins dataset, made predictions on the rest of the data and then evaluated these with the accuracy measure. In Chapter 2 we will break this down in more detail.

    mlr3 makes training and predicting easy, but it also allows us to perform very complex operations in just a few lines of code:

    library(mlr3verse)
    library(mlr3pipelines)
    library(mlr3benchmark)
    
    tasks = tsks(c("breast_cancer", "sonar"))
    tuned_rf = auto_tuner(
        tnr("grid_search", resolution = 5),
        lrn("classif.ranger", num.trees = to_tune(200, 500)),
        rsmp("holdout")
    )
    tuned_rf = pipeline_robustify(NULL, tuned_rf, TRUE) %>>%
        po("learner", tuned_rf)
    stack_lrn = ppl(
        "stacking",
        base_learners = lrns(c("classif.rpart", "classif.kknn")),
        lrn("classif.log_reg"))
    stack_lrn = pipeline_robustify(NULL, stack_lrn, TRUE) %>>%
        po("learner", stack_lrn)
    
    learners = c(tuned_rf, stack_lrn)
    bm = benchmark(benchmark_grid(tasks, learners, rsmp("holdout")))
    
    bma = bm$aggregate(msr("classif.acc"))[, c("task_id", "learner_id",
      "classif.acc")]
    bma$learner_id = rep(c("RF", "Stack"), 2)
    bma
             task_id learner_id classif.acc
    1: breast_cancer         RF   0.9780702
    2: breast_cancer      Stack   0.9385965
    3:         sonar         RF   0.8550725
    4:         sonar      Stack   0.7246377

    In this (much more complex!) example we chose two tasks and two learners and used automated tuning to optimize the number of trees in the random forest learner (Chapter 4), and an ML pipeline that imputes missing data, collapses factor levels, and creates stacked models (Chapter 6). We also showed basic features like loading learners (Chapter 2) and choosing resampling strategies for benchmarking (Chapter 3). Finally, we compared the performance of the models using the mean accuracy on the test set, and applied a statistical test to see if the learners performed significantly differently (they did not!).

    You will learn how to do all this and more in this book.

    1.5 MLR: Machine Learning in R

    The mlr3 (Machine Learning in R) package (Lang et al. 2019) and its ecosystem provide a generic, object-oriented, and extensible framework for regression (Section 2.1), classification (Section 2.5), and other machine learning tasks (Chapter 8) for the R language (R Core Team 2019). This unified interface provides functionality to extend and combine existing machine learning algorithms (Learners (Section 2.2)), intelligently select and tune the most appropriate technique for a given machine learning task (Section 2.1), and perform large-scale comparisons that enable meta-learning. In addition, mlr3 includes advanced functionality such as hyperparameter tuning (Chapter 4) and feature selection (Chapter 5). Parallelization of many operations is natively supported (Section 9.1).


    mlr3 has similar overall aims to caret and tidymodels for R, scikit-learn for Python, and MLJ for Julia. In general mlr3 is designed to provide more flexibility than other ML frameworks while still offering easy ways to use advanced functionality. While tidymodels in particular makes it very easy to perform simple ML tasks, mlr3 is more geared towards advanced ML.

    1.6 From mlr to mlr3

    The mlr package (Bischl et al. 2016) was first released to CRAN in 2013, with the core design and architecture dating back somewhat further. Over time, the addition of many features led to a considerably more complex design that made the package harder to build, maintain, and extend than we had hoped for. In hindsight, we saw that some design and architecture choices in mlr made it difficult to support new features, in particular with respect to ML pipelines. Furthermore, the R ecosystem, and packages within it such as data.table, had undergone major changes after the initial design of mlr.

    It would have been difficult to integrate all of these changes into the original design of mlr. Instead, we decided to start working on a reimplementation in 2018, which resulted in the first release of mlr3 on CRAN in July 2019.

    The new design and the integration of further and newly-developed R packages (especially R6, future, and data.table) makes mlr3 much easier to use, maintain, and in many regards more efficient than its predecessor mlr. The packages in the ecosystem are less tightly coupled, making them easier to maintain and easier to develop, especially very specialized packages.

    1.7 Design principles

    This section covers advanced ML or technical details that can be skipped.

    We follow these general design principles in the mlr3 package and mlr3verse ecosystem.

    • Object-oriented programming (OOP). We embrace R6 for a clean, object-oriented design, object state-changes, and reference semantics. This means that the state of common objects (e.g. tasks (Section 2.1) and learners (Section 2.2)) is encapsulated within the object, for example to keep track of whether a model has been trained, without the user having to worry about this. We also use inheritance to specialize objects, e.g. all learners are derived from a common base class that provides basic functionality.
    • Tabular data. Embrace data.table for its top-notch computation performance as well as tabular data as a structure that can be easily processed further.
    • Unify input and output data formats. This considerably simplifies the API and allows easy selection and “split-apply-combine” (aggregation) operations. We combine data.table and R6 to place references to non-atomic and compound objects in tables and make heavy use of list columns.
    • Defensive programming and type safety. All user input is checked with checkmate (Lang 2017). We use data.table which documents return types unlike other mechanisms popular in base R which “simplify” the result unpredictably (e.g., sapply() or the drop argument for indexing data.frames). And we have extensive unit tests!
    • Light on dependencies. One of the main maintenance burdens for mlr was to keep up with changing learner interfaces and behavior of the many packages it depended on. We require far fewer packages in mlr3, which makes installation and maintenance easier. We still provide the same functionality, but it is split into more packages that have fewer dependencies individually.
    • Separation of computation and presentation. Most packages of the mlr3 ecosystem focus on processing and transforming data, applying ML algorithms, and computing results. Our core packages do not provide visualizations because their dependencies would make installation unnecessarily complex, especially on headless servers (i.e., computers without a monitor where graphical libraries are not installed). For the same reason, visualizations of data and results are provided in the extra package mlr3viz, which avoids dependencies on ggplot2.
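The ‘unpredictable simplification’ mentioned in the type-safety principle above can be demonstrated with base R alone (a small illustration, not mlr3 code):

```r
df = data.frame(a = 1:3, b = 4:6)

# single-column indexing silently drops the data.frame class...
class(df[, "a"])
#> [1] "integer"

# ...unless simplification is explicitly disabled
class(df[, "a", drop = FALSE])
#> [1] "data.frame"

# similarly, sapply() may return a vector, a matrix, or a list,
# depending on the data it happens to see
sapply(1:3, function(i) i^2)         # a numeric vector
sapply(1:2, function(i) seq_len(i))  # a list (results have uneven lengths)
```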

    1.8 The mlr3 ecosystem

    Throughout this book we often refer to mlr3, which does not refer to the single mlr3 base package but all the packages in our ecosystem. The mlr3 package provides the base functionality that the rest of the ecosystem depends on for building more advanced ML tools. Figure 1.1 shows the packages in the mlr3verse that extend mlr3 with capabilities for preprocessing, pipelining, visualizations, additional learners, additional task types, and more.

    Diagram showing the packages of the mlr3verse and their relationship.

    Figure 1.1: Overview of the mlr3 ecosystem, the mlr3verse.

    A complete and up-to-date list of extension packages can be found at https://mlr-org.com/ecosystem.html.

    As well as packages within the mlr3 ecosystem, software in the mlr3verse also depends on two popular and well-established packages in particular:

    We build on R6 for object orientation and data.table to store and operate on tabular data. As both are core to mlr3, we briefly introduce each for beginners; in-depth expertise with either package is not necessary to work with mlr3.

    1.8.1 data.table for beginners

    The package data.table implements the data.table class, a popular alternative to R’s data.frame. We use data.table because it is blazingly fast and scales well to bigger data.

    As with data.frame, data.tables can be constructed with data.table() or as.data.table():

    library(data.table)
    # converting a matrix with as.data.table
    as.data.table(matrix(runif(4), 2, 2))
              V1        V2
    1: 0.8585914 0.4890915
    2: 0.8873848 0.7180918
    # using data.table
    dt = data.table(x = 1:6, y = rep(letters[1:3], each = 2))
    dt
       x y
    1: 1 a
    2: 2 a
    3: 3 b
    4: 4 b
    5: 5 c
    6: 6 c

    data.tables can be used much like data.frames, but they provide additional functionality that makes complex operations easier. For example, data can be summarized by groups with a by argument in the [ operator:

    dt[, mean(x), by = "y"]
       y  V1
    1: a 1.5
    2: b 3.5
    3: c 5.5

    There is also extensive support for many kinds of database join operations that make it easy to combine multiple data.tables in different ways (Stalder 2014). For an in-depth introduction, we recommend the data.table vignette “Introduction to Data.table” (2023).
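For instance, two data.tables sharing a key column can be combined with merge() (a small sketch; the column names are invented for illustration):

```r
library(data.table)

dt1 = data.table(id = 1:3, x = c("a", "b", "c"))
dt2 = data.table(id = 2:4, y = c(10, 20, 30))

# inner join on the shared 'id' column: only ids present in
# both tables survive, giving rows for id 2 and 3
merge(dt1, dt2, by = "id")
```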

    1.8.2 R6 for beginners

    R6 is one of R’s more recent paradigms for object-oriented programming (OOP). It addresses shortcomings of earlier OO implementations in R, such as S3, which we used in mlr. If you have done any class-based object-oriented programming before, R6 should feel familiar. Here we focus on the parts of R6 that you need to know to use mlr3.

    Objects are created by calling the constructor of an R6::R6Class object, specifically the initialization method $new(). For example, say we have implemented a class called Foo, then foo = Foo$new(bar = 1) would create a new object of class Foo, setting the bar argument of the constructor to the value 1. In practice, we implement a lot of sugar functionality (Section 1.9) in mlr3 so you do not need to interact with R6 constructors in this way if you would prefer not to.
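Continuing the hypothetical Foo example, such a class could be defined as follows (a sketch assuming the R6 package is installed; Foo and bar are the invented names from the text):

```r
library(R6)

# a minimal class with one public field set by the constructor
Foo = R6Class("Foo",
  public = list(
    bar = NULL,
    initialize = function(bar) {
      self$bar = bar
    }
  )
)

foo = Foo$new(bar = 1)  # call the constructor via $new()
foo$bar                 # access the public field: 1
```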


    R6 objects have mutable state that is encapsulated in their fields, which can be accessed through the dollar operator, $. Continuing the previous example, we can access the bar value in the foo object with foo$bar, or assign it a new value, e.g. foo$bar = 2. Some fields are implemented as ‘active bindings’, which look like ordinary fields but actually run computations in the background when they are accessed.


    In addition to fields, methods allow users to inspect the object’s state, retrieve information, or perform an action that changes the internal state of the object. For example, in mlr3, the $train() method of a learner changes the internal state of the learner by building and storing a model.
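As a concrete sketch of this (assuming mlr3 is installed), calling $train() populates the learner’s $model field:

```r
library(mlr3)

task = tsk("penguins")
learner = lrn("classif.featureless")

is.null(learner$model)  # TRUE: no model stored yet
learner$train(task)     # changes the learner's internal state
is.null(learner$model)  # FALSE: the fitted model is now stored
```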


    Fields and methods can be public or private. The public fields and methods define the API to interact with the object. In mlr3, you can safely ignore private methods unless you are looking to extend our universe by adding a new class (Chapter 9).

    Finally, R6 objects are environments, and as such have reference semantics. This means that, for example, foo2 = foo does not create a new variable called foo2 that is a copy of foo. Instead, it creates a variable called foo2 that references foo, so setting foo$bar = 3 will also change foo2$bar to 3, and vice versa. To copy an object, use the $clone(deep = TRUE) method, so to copy foo: foo2 = foo$clone(deep = TRUE).
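Because R6 objects are environments, their reference semantics can be demonstrated with a plain environment, without any packages:

```r
# environments (and hence R6 objects) are not copied on assignment
foo = new.env()
foo$bar = 1

foo2 = foo   # foo2 references the same object; no copy is made
foo$bar = 3
foo2$bar     # also 3: both names point to the same environment

# an explicit copy is unaffected by later changes to the original
foo3 = list2env(as.list(foo))  # crude copy for plain environments;
                               # R6 objects provide $clone(deep = TRUE)
foo$bar = 5
foo3$bar     # still 3
```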


    For a longer introduction, we recommend the R6 vignettes found at https://r6.r-lib.org/; more detail can be found at https://adv-r.hadley.nz/r6.html.

    1.9 Essential mlr3 utilities

    Finally, the following sections cover some important utilities that are essential to navigating the mlr3verse.

    Sugar Functions

    Most objects in mlr3 can be created through convenience functions known as helper functions or sugar functions. They provide shortcuts for common code idioms, reducing the amount of code a user has to write. For example, lrn("regr.rpart") is the sugar version of LearnerRegrRpart$new(). We heavily use sugar functions throughout this book and give the equivalent “full form” for complete detail at the end of each chapter. The sugar functions are designed to cover the majority of use cases for most users; knowledge of the full R6 backend is usually only required for custom objects or extensions.

    Dictionaries

    mlr3 uses dictionaries to store objects such as learners or tasks. Much like paper dictionaries associate words with their definitions, our dictionaries associate keys (i.e., identifiers) with objects (for example, R6 objects). Values in dictionaries are often accessed through sugar functions that retrieve objects from the relevant dictionary; for example, lrn("regr.rpart") is a wrapper around mlr_learners$get("regr.rpart") and is thus a simpler way to load a decision tree learner from the mlr_learners dictionary. We use dictionaries to group large collections of relevant objects so they can be listed and retrieved easily. For example, you can see an overview of available learners (in loaded packages) and their properties with as.data.table(mlr_learners).
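To make the sugar/dictionary correspondence concrete (assuming mlr3 is installed), the following two calls retrieve equivalent learners:

```r
library(mlr3)

# sugar function...
learner1 = lrn("classif.rpart")
# ...is a wrapper around retrieval from the mlr_learners dictionary
learner2 = mlr_learners$get("classif.rpart")

learner1$id == learner2$id  # TRUE: the same learner class

# overview of all learners in currently loaded packages
head(as.data.table(mlr_learners))
```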

    mlr3viz

    mlr3viz provides the plotting functionality for the mlr3 ecosystem, using ggplot2 under the hood. We use ggplot2::theme_minimal() in all our plots to unify our aesthetic, but as with all ggplot2 outputs, users can fully customize this.

    mlr3viz extends fortify and autoplot for use with common mlr3 outputs, including Prediction, Learner, and BenchmarkResult objects (which we will introduce and cover in the next chapters). We will cover the major plot types throughout the book, but the best way to learn about mlr3viz is through experimentation: load the package and see what happens when you run autoplot on an mlr3 object. Plot types are documented in the respective manual page, which can be accessed through ?autoplot.X; for example, the documentation of plots for regression tasks can be found by running ?autoplot.TaskRegr.