1  Introduction and Overview

Lars Kotthoff
University of Wyoming

Raphael Sonabend
Imperial College London

Natalie Foss
University of Wyoming

Bernd Bischl
Ludwig-Maximilians-Universität München, and Munich Center for Machine Learning (MCML)

Welcome to the Machine Learning in R universe. In this book, we will guide you through the functionality offered by mlr3 step by step. If you want to contribute to our universe, ask any questions, read documentation, or just chat with the team, head to https://github.com/mlr-org/mlr3 which has several useful links in the README.

The mlr3 (Lang et al. 2019) package and the wider mlr3 ecosystem provide a generic, object-oriented, and extensible framework for regression (Section 2.1), classification (Section 2.5), and other machine learning tasks (Chapter 13) for the R language (R Core Team 2019). On the most basic level, the unified interface provides functionality to train, test, and evaluate many machine learning algorithms. You can also take this a step further with hyperparameter optimization, computational pipelines, model interpretation, and much more. mlr3 has similar overall aims to caret and tidymodels for R, scikit-learn for Python, and MLJ for Julia. In general, mlr3 is designed to provide more flexibility than other ML frameworks while still offering easy ways to use advanced functionality. While tidymodels in particular makes it very easy to perform simple ML tasks, mlr3 is more geared towards advanced ML.

Before we can show you the full power of mlr3, we recommend installing the mlr3verse package, which will install several important packages in the mlr3 ecosystem.

install.packages("mlr3verse")

1.1 Installation Guidelines

There are many packages in the mlr3 ecosystem that you may want to use as you work through this book. All our packages can be installed from GitHub and R-universe[1]; the majority of (but not all) packages can also be installed from CRAN. We recommend adding the mlr-org R-universe to your R options so you can install all packages with install.packages(), without having to worry about which repository each package comes from. To do this, install usethis and run the following:

[1] R-universe is an alternative package repository to CRAN. The bit of code below tells R to look at both R-universe and CRAN when trying to install packages. R will always install the latest version of a package.

usethis::edit_r_profile()

In the file that opens add or change the repos argument in options so it looks something like the code below (you might need to add the full code block below or just edit the existing options function).

options(repos = c(
  mlrorg = "https://mlr-org.r-universe.dev",
  CRAN = "https://cloud.r-project.org/"
))

Save the file, restart your R session, and you are ready to go!
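As a quick check, the same repositories can also be set for the current session only and then inspected with getOption(). This is a minimal sketch that reuses the two URLs from the options() call above; it does not edit your R profile, and the variable name repos is only illustrative.

```r
# Set the repositories for this session only (same URLs as above)
options(repos = c(
  mlrorg = "https://mlr-org.r-universe.dev",
  CRAN = "https://cloud.r-project.org/"
))

# Inspect which repositories install.packages() will now search
repos = getOption("repos")
print(repos)
```

If both names, mlrorg and CRAN, are printed, install.packages() will search both repositories.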

If you want the latest development version of any of our packages, run

remotes::install_github("mlr-org/{pkg}")

with {pkg} replaced with the name of the package you want to install. You can see an up-to-date list of all our extension packages at https://github.com/mlr-org/mlr3/wiki/Extension-Packages.

1.2 How to Use This Book

You could read this book cover to cover, but you may benefit more from dipping in and out of chapters as suits your needs; we have provided a comprehensive index to help you find relevant pages and sections. We do recommend reading the first part of the book in its entirety as this will provide you with a complete overview of our basic infrastructure and design, which is used throughout our ecosystem.

We have marked sections that are particularly complex with respect to either technical or methodological detail and could be skipped on a first read with the following information box:

This section covers advanced ML or technical details.

Each chapter includes examples, API references, and explanations of methodologies. At the end of each part of the book we have included exercises for you to test yourself on what you have learned; you can find the solutions to these exercises at https://mlr3book.mlr-org.com/solutions.html. We have marked more challenging (and possibly time-consuming) exercises with an asterisk, ’*’.

If you want more detail about any of the tasks used in this book or links to all the mlr3 dictionaries, please see the appendices in the online version of the book at https://mlr3book.mlr-org.com/.

Reproducibility

At the start of each chapter we run set.seed(123) and use renv to manage package versions. You can find our lockfile at https://github.com/mlr-org/mlr3book/blob/main/book/renv.lock.
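To illustrate why the seed matters, the following base R sketch draws random numbers twice after setting the same seed. The variable names first and second are only illustrative.

```r
# Setting the same seed before each draw reproduces the same numbers
set.seed(123)
first = runif(3)

set.seed(123)
second = runif(3)

identical(first, second)  # TRUE
```

Without the second call to set.seed(), the two draws would differ.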

1.3 mlr3book Code Style

Throughout this book we will use the following code style:

  1. We always use = instead of <- for assignment.

  2. Class names are in UpperCamelCase.

  3. Function and method names are in lower_snake_case.

  4. When referencing functions, we will only include the package prefix (e.g., pkg::function) for functions outside the mlr3 universe or when there may be ambiguity about in which package the function lives. Note you can use environment(function) to see which namespace a function is loaded from.

  5. We denote packages, fields, methods, and functions as follows:

    • package (highlighted in the first instance)
    • package::function() or function() (see point 4)
    • $field for fields (data encapsulated in an R6 class)
    • $method() for methods (functions encapsulated in an R6 class)
    • Class (for R6 classes primarily, these can be distinguished from packages by context)
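As a small illustration of point 4, the sketch below uses environment() on a base R function to see which namespace it comes from. The choice of sd() and the variable names are only for illustration, and the code follows the book's convention of assigning with =.

```r
# sd() is attached by default; environment() reveals the namespace it lives in
sd_env = environment(sd)
sd_env

# environmentName() returns just the namespace name as a string
sd_pkg = environmentName(sd_env)
sd_pkg  # "stats"
```

Here the function would be written as stats::sd() wherever its origin might be ambiguous.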

Now let us see this in practice with our first example.

1.4 mlr3 by Example

The mlr3 universe includes a wide range of tools taking you from basic ML to complex experiments. To get started, here is an example of the simplest functionality – training a model and making predictions.

library(mlr3)
task = tsk("penguins")
split = partition(task)
learner = lrn("classif.rpart")

learner$train(task, row_ids = split$train)
learner$model
n= 231 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 231 129 Adelie (0.441558 0.199134 0.359307)  
  2) flipper_length< 206.5 144  44 Adelie (0.694444 0.298611 0.006944)  
    4) bill_length< 43.05 98   3 Adelie (0.969388 0.030612 0.000000) *
    5) bill_length>=43.05 46   6 Chinstrap (0.108696 0.869565 0.021739) *
  3) flipper_length>=206.5 87   5 Gentoo (0.022989 0.034483 0.942529) *
prediction = learner$predict(task, row_ids = split$test)
prediction
<PredictionClassif> for 113 observations:
    row_ids     truth  response
          1    Adelie    Adelie
          2    Adelie    Adelie
          3    Adelie    Adelie
---                            
        328 Chinstrap Chinstrap
        331 Chinstrap    Adelie
        339 Chinstrap Chinstrap
prediction$score(msr("classif.acc"))
classif.acc 
     0.9558 

In this example, we trained a decision tree on a subset of the penguins dataset, made predictions on the rest of the data and then evaluated these with the accuracy measure. In Chapter 2 we will break this down in more detail.
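To make the accuracy measure concrete, the following base R sketch computes the proportion of correct predictions by hand. The two vectors are made-up toy data, not the penguins predictions above.

```r
# Toy ground truth and predicted classes (hypothetical values)
truth = factor(c("Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"))
response = factor(
  c("Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"),
  levels = levels(truth)
)

# Accuracy is the share of observations where prediction and truth agree
acc = mean(truth == response)
acc  # 0.8
```

By the same arithmetic, the score of 0.9558 on 113 test observations reported above would correspond to 108 correct predictions.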

The mlr3 interface also lets you run more complicated experiments in just a few lines of code:

library(mlr3verse)

tasks = tsks(c("breast_cancer", "sonar"))

glrn_rf_tuned = as_learner(ppl("robustify") %>>% auto_tuner(
    tnr("grid_search", resolution = 5),
    lrn("classif.ranger", num.trees = to_tune(200, 500)),
    rsmp("holdout")
))
glrn_rf_tuned$id = "RF"

glrn_stack = as_learner(ppl("robustify") %>>% ppl("stacking",
    lrns(c("classif.rpart", "classif.kknn")),
    lrn("classif.log_reg")
))
glrn_stack$id = "Stack"

learners = c(glrn_rf_tuned, glrn_stack)
bmr = benchmark(benchmark_grid(tasks, learners, rsmp("cv", folds = 3)))

bmr$aggregate(msr("classif.acc"))
         task_id learner_id classif.acc
1: breast_cancer         RF      0.9781
2: breast_cancer      Stack      0.9386
3:         sonar         RF      0.8551
4:         sonar      Stack      0.7246

In this (much more complex!) example we chose two tasks and two learners and used automated tuning to optimize the number of trees in the random forest learner (Chapter 4), and a machine learning pipeline that imputes missing data, collapses factor levels, and stacks models (Chapter 7 and Chapter 8). We also showed basic features like loading learners (Chapter 2) and choosing resampling strategies for benchmarking (Chapter 3). Finally, we compared the performance of the models using the mean accuracy with three-fold cross-validation.

You will learn how to do all this and more in this book.

1.5 The mlr3 Ecosystem

Throughout this book, we often refer to mlr3, which may refer to the single mlr3 base package but usually refers to all packages in our ecosystem; this should be clear from context. The mlr3 package provides the base functionality that the rest of the ecosystem depends on for building more advanced machine learning tools. Figure 1.1 shows the packages in our ecosystem that extend mlr3 with capabilities for preprocessing, pipelining, visualizations, additional learners, additional task types, and much more.

Mindmap showing the packages of the mlr3verse and their relationship. Center `mlr3`, immediately connected to that are 'Learners', 'Evaluation', 'Tuning', 'Feature Selection', 'Utilities', 'Special Tasks', 'Data', and 'Pipelines'. Within each group is: Learners: `mlr3learners`, `mlr3extralearners`, `mlr3torch`; Evaluation: `mlr3measures`, `mlr3benchmark`; Tuning: `mlr3tuning`, `miesmuschel`, `mlr3hyperband`, `mlr3mbo`, `bbotk`, `mlr3tuningspaces`; Feature Selection: `mlr3filters`, `mlr3fselect`; Utilities: `mlr3misc`, `mlr3viz`, `mlr3verse`, `mlr3batchmark`, `paradox`; Special Tasks: `mlr3spatiotempcv`, `mlr3spatial`, `mlr3proba`, `mlr3cluster`, `mlr3fda`, `mlr3fairness`; Data: `mlr3db`, `mlr3oml`, `mlr3data`; Pipelines: `mlr3pipelines`. `mlr3fda` and `mlr3torch` are connected by gray dashed lines.
Figure 1.1: Overview of the mlr3 ecosystem. The packages with gray dashed lines are still in development; all others have a stable interface.

A complete and up-to-date list of extension packages can be found at https://mlr-org.com/ecosystem.html.

As well as packages within the mlr3 ecosystem, software in the mlr3verse also depends on two popular and well-established packages: we build on R6 for object orientation and data.table to store and operate on tabular data. As both are core to mlr3, we briefly introduce them for beginners; in-depth expertise with these packages is not necessary to work with mlr3.

1.5.1 R6 for Beginners

R6 is one of R’s more recent paradigms for object-oriented programming. If you have experience with any (class) object-oriented programming then R6 should feel familiar. We focus on the parts of R6 that you need to know to use mlr3.

Objects are created by constructing an instance of an R6Class variable using the $new() initialization method. For example, say we have implemented a class called Foo, then foo = Foo$new(bar = 1) would create a new object of class Foo and set the bar argument of the constructor to the value 1. In practice, we implement a lot of sugar functionality (Section 1.6) in mlr3 that makes construction and access a bit more convenient.

Some R6 objects may have mutable states that are encapsulated in their fields, which can be accessed through the dollar, $, operator. Continuing the previous example, we can access the bar value in the foo object by using foo$bar or we could give it a new value, e.g. foo$bar = 2. These fields can also be ‘active bindings’, which perform additional computations when referenced or modified.
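The following is a minimal sketch of how such a class might look. Foo and its bar field come from the running example, while the read-only active binding bar_squared is a hypothetical addition for illustration.

```r
library(R6)

# Hypothetical class Foo with one public field and one active binding
Foo = R6Class("Foo",
  public = list(
    bar = NULL,
    initialize = function(bar) {
      self$bar = bar
    }
  ),
  active = list(
    # Active binding: recomputed every time it is read
    bar_squared = function(value) {
      if (!missing(value)) {
        stop("bar_squared is read-only")
      }
      self$bar^2
    }
  )
)

foo = Foo$new(bar = 1)  # construct an object
foo$bar                 # access the field: 1
foo$bar = 2             # give the field a new value
foo$bar_squared         # the active binding reflects the change: 4
```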

In addition to fields, methods allow users to inspect the object’s state, retrieve information, or perform an action that changes the internal state of the object. For example, in mlr3, the $train() method of a learner changes the internal state of the learner by building and storing a model. Methods that modify the internal state of an object often return the object itself. Other methods may return a new R6 object. In both cases, it is possible to ‘chain’ methods by calling one immediately after the other using the $-operator; this is similar to the %>%-operator used in tidyverse packages. For example, Foo$bar()$hello_world() would run the $bar() method of the object Foo and then the $hello_world() method of the object returned by $bar() (which may be Foo itself).
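Method chaining can be sketched with a small class whose method returns the object itself via invisible(self). The class Counter and its add() method are hypothetical and not part of mlr3.

```r
library(R6)

# Hypothetical class whose method modifies state and returns the object
Counter = R6Class("Counter",
  public = list(
    count = 0,
    add = function(n = 1) {
      self$count = self$count + n
      invisible(self)  # returning self is what makes chaining possible
    }
  )
)

counter = Counter$new()
counter$add(2)$add(3)  # two chained calls on the same object
counter$count  # 5
```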

Fields and methods can be public or private. The public fields and methods define the API to interact with the object. In mlr3, you can safely ignore private methods unless you are looking to extend our universe by adding a new class (Chapter 10).

Finally, R6 objects are environments, and as such have reference semantics. This means that, for example, foo2 = foo does not create a new variable called foo2 that is a copy of foo. Instead, it creates a variable called foo2 that references foo, and so setting foo$bar = 3 will also change foo2$bar to 3 and vice versa. To copy an object, use the $clone(deep = TRUE) method, so to copy foo: foo2 = foo$clone(deep = TRUE).
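The difference between a second reference and a deep clone can be sketched as follows. Foo is again a hypothetical class defined only for this illustration.

```r
library(R6)

# Hypothetical class with a single public field
Foo = R6Class("Foo",
  public = list(
    bar = NULL,
    initialize = function(bar) {
      self$bar = bar
    }
  )
)

foo = Foo$new(bar = 1)
foo2 = foo                     # second reference to the same object
foo3 = foo$clone(deep = TRUE)  # independent copy

foo$bar = 3
foo2$bar  # 3, because foo2 and foo are the same object
foo3$bar  # 1, because the clone is unaffected
```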

For a longer introduction, we recommend the R6 vignettes found at https://r6.r-lib.org/; more detail can be found in https://adv-r.hadley.nz/r6.html.

1.5.2 data.table for Beginners

The package data.table implements data.table(), which is a popular alternative to R’s data.frame(). We use data.table because it is blazingly fast and scales well to bigger data.

As with data.frame, data.tables can be constructed with data.table() or as.data.table():

library(data.table)
# converting a matrix with as.data.table
as.data.table(matrix(runif(4), 2, 2))
       V1     V2
1: 0.8586 0.4891
2: 0.8874 0.7181
# using data.table
dt = data.table(x = 1:6, y = rep(letters[1:3], each = 2))
dt
   x y
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
6: 6 c

data.tables can be used much like data.frames, but they provide additional functionality that makes complex operations easier. For example, data can be summarized by groups with a by argument in the [ operator and they can be modified in-place with the := operator.

# mean of x column in groups given by y
dt[, mean(x), by = "y"]
   y  V1
1: a 1.5
2: b 3.5
3: c 5.5
# adding a new column with :=
dt[, z := x * 3]
dt
   x y  z
1: 1 a  3
2: 2 a  6
3: 3 b  9
4: 4 b 12
5: 5 c 15
6: 6 c 18

Finally, data.table also uses reference semantics, so you will need to use copy() to clone a data.table. For an in-depth introduction, we recommend the vignette “Introduction to Data.table” (2023).
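The effect of reference semantics can be sketched by comparing a plain assignment with copy(). The table and variable names below are only illustrative.

```r
library(data.table)

dt = data.table(x = 1:3)
dt_ref = dt         # second reference to the same table
dt_copy = copy(dt)  # independent copy

# := modifies dt in place, so the change is visible through dt_ref as well
dt[, z := x * 2L]

names(dt_ref)   # "x" "z"
names(dt_copy)  # "x"
```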

1.6 Essential mlr3 Utilities

mlr3 includes a few important utilities that are essential to simplifying code in our ecosystem.

Sugar Functions

Most objects in mlr3 can be created through convenience functions called helper functions or sugar functions. They provide shortcuts for common code idioms, reducing the amount of code a user has to write. For example, lrn("regr.rpart") returns the learner without having to explicitly create a new R6 object. We heavily use sugar functions throughout this book and provide the equivalent “full form” for complete detail at the end of each chapter. The sugar functions are designed to cover the majority of use cases for most users; knowledge about the full R6 backend is only required if you want to build custom objects or extensions.

Many object names in mlr3 are standardized according to the convention: mlr_<type>_<key>, where <type> will be tasks, learners, measures, and other classes that will be covered in the book, and <key> refers to the ID of the object. To simplify the process of constructing objects, you only need to know the object key and the sugar function for constructing the type. For example: mlr_tasks_mtcars becomes tsk("mtcars"); mlr_learners_regr.rpart becomes lrn("regr.rpart"); and mlr_measures_regr.mse becomes msr("regr.mse"). Throughout this book, we will refer to all objects using this abbreviated form.

Dictionaries

mlr3 uses dictionaries, which associate keys (unique identifiers) with values (R6 objects), to store R6 classes. Values in dictionaries are often accessed through sugar functions that retrieve objects from the relevant dictionary. For example, lrn("regr.rpart") is a wrapper around mlr_learners$get("regr.rpart") and is thus a simpler way to load a decision tree learner from mlr_learners. We use dictionaries to group large collections of relevant objects so they can be listed and retrieved easily. For example, you can see an overview of available learners (that are in loaded packages) and their properties with as.data.table(mlr_learners) or by calling the sugar function without any arguments, e.g. lrn().
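The relationship between sugar functions and dictionaries can be sketched as follows, assuming mlr3 and rpart are installed. The variable names are only illustrative.

```r
library(mlr3)

# Sugar function and explicit dictionary lookup construct the same kind of learner
learner_sugar = lrn("regr.rpart")
learner_dict = mlr_learners$get("regr.rpart")
identical(class(learner_sugar), class(learner_dict))

# The key passed to the sugar function becomes the ID of the object
task = tsk("mtcars")
measure = msr("regr.mse")
c(task$id, learner_sugar$id, measure$id)

# List the keys stored in the learner dictionary
learner_table = as.data.table(mlr_learners)
head(learner_table$key)
```

Which keys are listed depends on the packages that are currently loaded.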

mlr3viz

mlr3viz includes all plotting functionality in mlr3 and uses ggplot2 under the hood. We use theme_minimal() in all our plots to unify our aesthetic, but as with all ggplot outputs, users can fully customize this. mlr3viz extends fortify and autoplot for use with common mlr3 outputs including Prediction, Learner, and BenchmarkResult objects (which we will introduce and cover in the next chapters). We will cover major plot types throughout the book. The best way to learn about mlr3viz is through experimentation; load the package and see what happens when you run autoplot on an mlr3 object. Plot types are documented in the respective manual page that can be accessed through ?autoplot.<class>. For example, you can find different types of plots for regression tasks by running ?autoplot.TaskRegr.
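A minimal sketch of such an experiment is shown below, assuming mlr3, mlr3viz, and ggplot2 are installed. It uses the default plot type for a regression task, and swapping the theme is only one example of customization.

```r
library(mlr3)
library(mlr3viz)

task = tsk("mtcars")

# autoplot() returns a ggplot object, here using the default plot type
p = autoplot(task)
class(p)

# As with any ggplot, the result can be customized further
p_custom = p + ggplot2::theme_bw()
```

Printing p or p_custom draws the plot on the active graphics device.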

1.7 Design Principles

This section covers advanced ML or technical details.

Learning from over a decade of design and adaptation from mlr to mlr3, we now follow these design principles in the mlr3 ecosystem:

  • Object-oriented programming. We embrace R6 for a clean, object-oriented design, object state changes, and reference semantics. This means that the state of common objects (e.g. tasks (Section 2.1) and learners (Section 2.2)) is encapsulated within the object, for example, to keep track of whether a model has been trained, without the user having to worry about this. We also use inheritance to specialize objects, e.g. all learners are derived from a common base class that provides basic functionality.
  • Tabular data. Embrace data.table for its top-notch computational performance as well as tabular data as a structure that can be easily processed further.
  • Unified tabular input and output data formats. This considerably simplifies the API and allows easy selection and “split-apply-combine” (aggregation) operations. We combine data.table and R6 to place references to non-atomic and compound objects in tables and make heavy use of list columns.
  • Defensive programming and type safety. All user input is checked with checkmate (Lang 2017). We use data.table, which has behavior that is more consistent than several base R methods (e.g., indexing data.frames simplifies the result when the drop argument is omitted). And we have extensive unit tests!
  • Light on dependencies. One of the main maintenance burdens for mlr was to keep up with changing learner interfaces and behavior of the many packages it depended on. We require far fewer packages in mlr3, which makes installation and maintenance easier. We still provide the same functionality, but it is split into more packages that have fewer dependencies individually.
  • Separation of computation and presentation. Most packages of the mlr3 ecosystem focus on processing and transforming data, applying ML algorithms, and computing results. Our core packages do not provide visualizations because their dependencies would make installation unnecessarily complex, especially on headless servers (i.e., computers without a monitor where graphical libraries are not installed). Hence, visualizations of data and results are provided in mlr3viz.
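The indexing inconsistency mentioned under defensive programming can be sketched as follows, assuming data.table is installed. The toy data and variable names are only illustrative.

```r
library(data.table)

df = data.frame(x = 1:3, y = c("a", "b", "c"))
dt = as.data.table(df)

# Selecting one column of a data.frame silently simplifies to a vector
df_col = df[, 1]
class(df_col)  # "integer"

# The same selection on a data.table keeps the tabular structure
dt_col = dt[, 1]
class(dt_col)  # "data.table" "data.frame"
```

With the data.frame, passing drop = FALSE would be needed to keep a one-column table.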

1.8 Citation

Please cite this chapter as:

Kotthoff L, Sonabend R, Foss N, Bischl B. (2024). Introduction and Overview. In Bischl B, Sonabend R, Kotthoff L, Lang M, (Eds.), Applied Machine Learning Using mlr3 in R. CRC Press. https://mlr3book.mlr-org.com/introduction_and_overview.html.

@incollection{citekey, 
  author = "Lars Kotthoff and Raphael Sonabend and Natalie Foss and Bernd Bischl", 
  title = "Introduction and Overview",
  booktitle = "Applied Machine Learning Using {m}lr3 in {R}",
  publisher = "CRC Press", year = "2024",
  editor = "Bernd Bischl and Raphael Sonabend and Lars Kotthoff and Michel Lang", 
  url = "https://mlr3book.mlr-org.com/introduction_and_overview.html"
}