2  Fundamentals

Natalie Foss
University of Wyoming

Lars Kotthoff
University of Wyoming

In this chapter, we will introduce the mlr3 objects and corresponding R6 classes that implement the essential building blocks of machine learning (ML). These building blocks include the data (and the methods of creating training and test sets), the ML algorithm (and its training and prediction process), the configuration of a ML algorithm through its hyperparameters, and evaluation measures to assess the quality of predictions.

In the simplest definition, machine learning is the process of using computer models to learn relationships from data. Supervised learning is a subfield of ML in which datasets consist of observations (rows in tabular data) that are labeled, which means that each data point includes features (columns in tabular data) and a quantity that we are trying to predict, also called a target. A classic example might be trying to predict a car’s miles per gallon (the target) based on properties (the features) such as horsepower and the number of gears (we will return to this particular example later). In mlr3, we refer to datasets and their associated metadata as tasks (Section 2.1). The term ‘task’ refers to the ML task (i.e., mathematical problem) that we are trying to solve. Tasks are defined by the features used for prediction and the targets to predict, so there can be multiple tasks associated with any given dataset. For example, predicting miles per gallon (mpg) from horsepower is one task, predicting horsepower from miles per gallon is another, and predicting the number of gears from the model is yet another.

Supervised learning can be further divided into regression, which is the prediction of numeric target values (e.g., predicting a car’s mpg), and classification, which is the prediction of categorical values/labels (e.g., predicting a car’s model). Other tasks are also encompassed by supervised learning; we return to these in Chapter 8, where we will also consider unsupervised learning tasks. For any supervised learning task, the goal is to build a model that captures the relationship between features and target, often so that it can make predictions for new and previously unseen data. A model is formally a mapping from a feature vector to predictions. Such models are induced by passing training data to machine learning algorithms, including decision trees, support vector machines, neural networks, and many more. ML algorithms are called learners in mlr3 (Section 2.2) because, given data, they learn models. Each learner has a parameterized space that potential models are drawn from, and during the training process these parameters are fitted to best match the data. For example, the parameters could be the weights given to individual features when training a linear regression model. During training, ML algorithms are fitted by optimizing a loss function that quantifies the mismatch between the ground truth target values in the training data and the predictions of the model.
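To make the idea of fitting parameters by minimizing a loss more concrete, here is a minimal base R sketch (it does not use mlr3). It assumes a squared-error loss and a linear model with a single feature; the helper name squared_loss is ours and purely illustrative.

```r
# Squared-error loss of the linear model mpg ~ b0 + b1 * hp on mtcars
data("mtcars", package = "datasets")

squared_loss = function(beta) {
  pred = beta[1] + beta[2] * mtcars$hp
  mean((mtcars$mpg - pred)^2)
}

# lm() fits the two parameters by least squares, i.e. by minimizing this loss
fit = lm(mpg ~ hp, data = mtcars)
beta_hat = unname(coef(fit))

loss_fitted = squared_loss(beta_hat)
# shifting the intercept away from the fitted value increases the training loss
loss_perturbed = squared_loss(beta_hat + c(1, 0))
c(fitted = loss_fitted, perturbed = loss_perturbed)
```

Other learners use different model spaces and loss functions, but the principle of searching for parameters that reduce a loss on the training data is the same.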


For a model to be most useful, it should generalize beyond the training data to make ‘good’ predictions (Section 2.2.2) on new and previously ‘unseen’ (by the model) data. The simplest way to determine if a model will generalize to make ‘good’ predictions for new data is to split the data into training data and test data. The model is trained on the training data, and the separate test data is then used to evaluate it in an unbiased way by assessing to what extent the model has learned the true relationships that underlie the data (Chapter 3). This evaluation procedure estimates a model’s generalization error, i.e., how well we expect the model to perform in general. There are many ways to evaluate models (Chapter 3) and to split data for estimating generalization error (Section 3.2).
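The train/test idea can be sketched in a few lines of base R before any mlr3 objects are introduced. The example below is only an illustration under our own assumptions: a random 67:33 split, a linear model on two features, and mean squared error as the quality measure; the helper mse is hypothetical.

```r
data("mtcars", package = "datasets")
set.seed(1)

# randomly assign about two thirds of the rows to training, the rest to testing
n = nrow(mtcars)
train_rows = sample(n, size = round(0.67 * n))
test_rows = setdiff(seq_len(n), train_rows)

# fit on the training rows only
fit = lm(mpg ~ hp + wt, data = mtcars[train_rows, ])

# mean squared error between true and predicted values
mse = function(truth, response) mean((truth - response)^2)
train_mse = mse(mtcars$mpg[train_rows], predict(fit, mtcars[train_rows, ]))
test_mse = mse(mtcars$mpg[test_rows], predict(fit, mtcars[test_rows, ]))
c(train = train_mse, test = test_mse)
```

The error on the held-out rows is the estimate of generalization error; the error on the training rows is usually too optimistic because the model has already seen those observations.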

This brief overview of ML provides the basic knowledge required to use software in mlr3 and is summarized in Figure 2.1. In the rest of this book we will also provide introductions to methodology when relevant, and in Chapter 8 we will introduce applications to other tasks. For texts about ML, including detailed methodology and underpinnings of different algorithms, we recommend Hastie, Friedman, and Tibshirani (2001), James et al. (2013), and Bishop (2006).

In the next few sections we will look at the building blocks of mlr3 using regression as an example. We will then consider how to extend this to classification in Section 2.5; for other tasks, see Chapter 8.

A flowchart starting with the task (data), which splits into training and test sets. The training set is used with the learner to fit a model, which is then used with the test set to make predictions. A performance measure is applied to the predictions and results in a performance estimate. Resampling refers to the repeated application of this process.

Figure 2.1: General overview of the machine learning process.

2.1 Tasks

Tasks are objects that contain the (usually tabular) data and additional metadata that define a ML problem. The metadata contain, for example, the name of the target feature for supervised ML problems. This information is used automatically by operations that can be performed on a task so that for example the user does not have to specify the prediction target every time a model is trained.

2.1.1 Constructing Tasks

mlr3 includes a few predefined ML tasks in the mlr_tasks Dictionary.

mlr_tasks
<DictionaryTask> with 20 stored values
Keys: bike_sharing, boston_housing, breast_cancer, german_credit, ilpd,
  iris, kc_housing, moneyball, mtcars, optdigits, penguins,
  penguins_simple, pima, ruspini, sonar, spam, titanic, usarrests,
  wine, zoo
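Besides printing the dictionary, it can be inspected programmatically. The sketch below assumes that the dictionary can be converted with as.data.table() and that the resulting table has the columns key and task_type; the variable names are ours.

```r
library(mlr3)

# all keys stored in the task dictionary
task_keys = mlr_tasks$keys()

# tabular overview of the predefined tasks
tasks_dt = as.data.table(mlr_tasks)

# keys of the predefined regression tasks only
regr_keys = tasks_dt$key[tasks_dt$task_type == "regr"]
regr_keys
```

The exact set of keys depends on the installed version of mlr3 and on which extension packages are loaded.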

To get a task from the dictionary, use the tsk() function and assign the return value to a new variable. Below we retrieve the task mlr_tasks_mtcars, which uses the datasets::mtcars dataset:

task_mtcars = tsk("mtcars")
task_mtcars
<TaskRegr:mtcars> (32 x 11): Motor Trends
* Target: mpg
* Properties: -
* Features (10):
  - dbl (10): am, carb, cyl, disp, drat, gear, hp, qsec, vs, wt

Help pages

Usually in R, the help pages of functions can be queried with ?. The same is true of R6 classes, so if you want to find the help page of the mtcars task you could use ?mlr_tasks_mtcars. We have also added a $help() method to many of our classes, which allows you to access the help page of a class from any instance of that class, for example: tsk("mtcars")$help().

Class naming conventions

Many object names in mlr3 are standardized according to the convention mlr_<types>_<key>, where <types> will be tasks, learners, measures, and others to be covered later in the book, and <key> refers to the ID of the object. To simplify the process of constructing objects, you only need to know the object key and the sugar function for construction.

For example: mlr_tasks_mtcars becomes tsk("mtcars"); mlr_learners_regr.rpart becomes lrn("regr.rpart"); and mlr_measures_regr.mse becomes msr("regr.mse").
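As a small sketch of this convention, the code below constructs the same three objects once with the sugar functions and once by retrieving them from the corresponding dictionaries with $get(). The variable names are ours.

```r
library(mlr3)

# sugar functions
task_sugar = tsk("mtcars")
learner_sugar = lrn("regr.rpart")
measure_sugar = msr("regr.mse")

# equivalent retrieval from the dictionaries
task_dict = mlr_tasks$get("mtcars")
learner_dict = mlr_learners$get("regr.rpart")
measure_dict = mlr_measures$get("regr.mse")

# the key is also the ID of the constructed object
c(task_sugar$id, learner_sugar$id, measure_sugar$id)
```

Both routes create new objects of the same class, so the shorter sugar functions are used throughout the rest of the chapter.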

To create your own regression task, you will need to construct a new instance of TaskRegr. The simplest way to do this is with the function as_task_regr(), which converts a data.frame type object to a regression task; the target feature is specified by passing its name to the target argument. As an example, we will imagine that mtcars was not already available as a predefined task in mlr3. In the code below we load the datasets::mtcars dataset, subset the data to only include the columns "mpg", "cyl", and "disp", print the structure of the subset, and then set up a regression task called "cars" (id = "cars") in which we will try to predict miles per gallon (target = "mpg") from the number of cylinders ("cyl") and displacement ("disp"):

data("mtcars", package = "datasets")
mtcars_subset = subset(mtcars, select = c("mpg", "cyl", "disp"))
str(mtcars_subset)
'data.frame':   32 obs. of  3 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
task_mtcars = as_task_regr(mtcars_subset, target = "mpg", id = "cars")

The data can be in any tabular format, e.g. a data.frame(), data.table(), or tibble(). The target argument specifies the prediction target column. The id argument is optional and specifies an identifier for the task that is used in plots and summaries; if omitted the variable name of the data will be used as the id.
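The effect of the optional id argument can be seen in the following short sketch, which builds the same task with and without an explicit identifier. The variable names task_named and task_default are ours.

```r
library(mlr3)

data("mtcars", package = "datasets")
mtcars_subset = subset(mtcars, select = c("mpg", "cyl", "disp"))

# explicit identifier
task_named = as_task_regr(mtcars_subset, target = "mpg", id = "cars")

# no identifier: the name of the data variable is used instead
task_default = as_task_regr(mtcars_subset, target = "mpg")

c(named = task_named$id, default = task_default$id)
```

Setting an explicit id is mostly a matter of readability, as the identifier appears in printed summaries and plot titles.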


As many ML models do not work properly with arbitrary UTF8 names, mlr3 defaults to throwing an error if any of the column names passed to as_task_regr() (and other task constructors) contain a non-ASCII character or do not comply with R’s variable naming scheme. Therefore, we recommend converting names with make.names() if possible; if not, you can bypass this check in mlr3 by setting options(mlr3.allow_utf8_names = TRUE) (but do not be surprised if an underlying package implementation throws a related error).
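A minimal sketch of the recommended conversion is shown below. The data and its column names are made up for illustration; make.names() is the base R function mentioned above.

```r
library(mlr3)

# hypothetical data whose column names are not valid R variable names
raw = data.frame(c(21, 22.8, 18.7), c(6, 4, 8), c(160, 108, 360))
names(raw) = c("miles per gallon", "2cyl", "disp")

# convert to syntactically valid, unique names
clean = raw
names(clean) = make.names(names(raw), unique = TRUE)
names(clean)

# the cleaned names can now be used to construct a task
task_clean = as_task_regr(clean, target = "miles.per.gallon", id = "cars_clean")
task_clean$feature_names
```

Note that make.names() changes the names, so any later code that refers to the columns has to use the converted versions.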

Printing a task provides a short summary. In this case, we can see the task has 32 observations and 3 columns (32 x 3), of which mpg is the target, there are no special properties, and there are 2 features stored in double-precision floating point format.

task_mtcars
<TaskRegr:cars> (32 x 3)
* Target: mpg
* Properties: -
* Features (2):
  - dbl (2): cyl, disp

We can plot the task using the mlr3viz package, which gives a graphical summary of the distribution of the target and feature values:

library(mlr3viz)
autoplot(task_mtcars, type = "pairs")

Diagram shows six plots, three are line plots showing the relationship between continuous variables, and three are scatter plots showing relationships between other variables.

Overview of the mtcars dataset.

2.1.2 Retrieving Data

We have looked at how to create tasks to store data and metadata; now we will look at how to retrieve the stored data.

Various fields can be used to retrieve metadata about a task. The dimensions, for example, can be retrieved using $nrow and $ncol:

c(task_mtcars$nrow, task_mtcars$ncol)
[1] 32  3

The names of the feature and target columns are stored in the $feature_names and $target_names fields, respectively.

c(Features = task_mtcars$feature_names, Target = task_mtcars$target_names)
Features1 Features2    Target 
    "cyl"    "disp"     "mpg" 

While the columns of a task have unique character-valued names, their rows are identified by unique natural numbers, called row IDs. They can be accessed through the $row_ids field:

head(task_mtcars$row_ids)
[1] 1 2 3 4 5 6

Row IDs are not used as features when training or predicting but are metadata that allow access to individual observations. Note that row IDs are not the same as row numbers. This is best demonstrated by example: below we create a regression task from random data and print the original row IDs, which correspond to row numbers 1-5; we then filter three rows (we will return to this method just below) and print the new row IDs, which no longer correspond to the row numbers.

task = as_task_regr(data.frame(x = runif(5), y = runif(5)), target = "y")
task$row_ids
[1] 1 2 3 4 5
task$filter(c(4, 1, 3))
task$row_ids
[1] 1 3 4

This design decision allows tasks and learners to transparently operate on real database management systems, where uniqueness is the only requirement for primary keys (and not the actual row ID value).

The data contained in a task can be accessed through $data(), which returns a data.table object. This method has optional rows and cols arguments to specify subsets of the data to retrieve.

# retrieve all data
task_mtcars$data()
     mpg cyl  disp
 1: 21.0   6 160.0
 2: 21.0   6 160.0
 3: 22.8   4 108.0
 4: 21.4   6 258.0
 5: 18.7   8 360.0
---
28: 30.4   4  95.1
29: 15.8   8 351.0
30: 19.7   6 145.0
31: 15.0   8 301.0
32: 21.4   4 121.0
# retrieve data for rows with IDs 1, 5, and 10 and feature columns
task_mtcars$data(rows = c(1, 5, 10), cols = task_mtcars$feature_names)
   cyl  disp
1:   6 160.0
2:   8 360.0
3:   6 167.6

You can work with row numbers instead of row IDs by adding a step to extract the corresponding row ID:

# select the 2nd row of the task by extracting the second row_id:
task$data(rows = task$row_ids[2])

You can always use ‘standard’ R methods to extract summary data from a task, for example to summarize the underlying data:

summary(as.data.table(task_mtcars))
      mpg             cyl             disp      
 Min.   :10.40   Min.   :4.000   Min.   : 71.1  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8  
 Median :19.20   Median :6.000   Median :196.3  
 Mean   :20.09   Mean   :6.188   Mean   :230.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0  

2.1.3 Task Mutators

After a task has been created, you may want to perform operations on the task such as filtering down to subsets of rows and columns, which is often useful for manually creating train and test splits or for fitting models on a subset of the given features. Above we saw how to access subsets of the underlying dataset using $data(); however, this does not change the underlying task. Therefore, we provide mutators, which modify the given Task in place, as can be seen in the examples below.


Subsetting by features (columns) is possible with $select() with the desired feature names passed as a character vector, and subsetting by observations (rows) is performed with $filter() by passing the row IDs as a numeric vector:

task_mtcars_small = tsk("mtcars") # initialize with the full task
task_mtcars_small$select(c("am", "carb")) # keep only these features
task_mtcars_small$filter(2:4) # keep only these rows
task_mtcars_small$data()
    mpg am carb
1: 21.0  1    4
2: 22.8  1    1
3: 21.4  0    1
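As mentioned above, mutators can be used to create a manual train/test split. The following sketch is one possible way to do this, assuming a simple (non-random) split by row ID; each task is constructed freshly from the dictionary so that the two filters do not interfere with each other.

```r
library(mlr3)

# rows 1-20 for training, rows 21-32 for testing
task_train = tsk("mtcars")$filter(1:20)
task_test = tsk("mtcars")$filter(21:32)

# the two tasks contain disjoint sets of row IDs
c(train = task_train$nrow, test = task_test$nrow)
intersect(task_train$row_ids, task_test$row_ids)
```

In practice a random split is usually preferable; a helper for this is introduced later in the chapter.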

As R6 uses reference semantics (Section 1.8.2), you need to use $clone() if you want to copy a task and then mutate it further:

# the wrong way
task_mtcars_small = tsk("mtcars")$filter(1:2)$select("cyl")
task_mtcars_wrong = task_mtcars_small
task_mtcars_small$data()
   mpg cyl
1:  21   6
2:  21   6
task_mtcars_wrong$filter(1)
# original data affected
task_mtcars_small$data()
   mpg cyl
1:  21   6
# the right way
task_mtcars_small = tsk("mtcars")$filter(1:2)$select("cyl")
task_mtcars_right = task_mtcars_small$clone()
task_mtcars_small$data()
   mpg cyl
1:  21   6
2:  21   6
task_mtcars_right$filter(1)
# original data unaffected
task_mtcars_small$data()
   mpg cyl
1:  21   6
2:  21   6

To add extra rows and columns to a task, you can use $rbind() and $cbind() respectively:

task_mtcars_small$cbind( # add another column
  data.frame(disp = c(150, 160))
)
task_mtcars_small$rbind( # add another row
  data.frame(mpg = 23, cyl = 5, disp = 170)
)
task_mtcars_small$data()
   mpg cyl disp
1:  21   6  150
2:  21   6  160
3:  23   5  170

2.2 Learners

Objects of class Learner provide a unified interface to many popular ML algorithms in R. The mlr_learners dictionary contains all the learners available in mlr3. We will discuss the available learners in Section 2.7; for now we will just use a regression tree learner as an example to discuss the Learner interface. As with tasks, you can access learners from the dictionary with a single sugar function, in this case lrn().

lrn("regr.rpart")
<LearnerRegrRpart:regr.rpart>: Regression Tree
* Model: -
* Parameters: xval=0
* Packages: mlr3, rpart
* Predict Types:  [response]
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, selected_features, weights

All Learner objects include the following metadata, which can be seen in the output above:

  • $feature_types: the type of features the learner can handle.
  • $packages: the packages required to be installed to use the learner.
  • $properties: special properties the model can handle, for example the “missings” property means a model can handle missing data, and “importance” means it can compute the relative importance of each feature.
  • $predict_types: the types of prediction that the model can make (Section 2.2.2).
  • $param_set: the set of available hyperparameters (Section 2.2.3).
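The metadata fields listed above can be queried directly from any learner object. The short sketch below does this for the regression tree; the variable name learner is ours.

```r
library(mlr3)

learner = lrn("regr.rpart")

learner$feature_types   # feature types the learner can handle
learner$packages        # packages required to use the learner
learner$properties      # special properties, e.g. "missings"
learner$predict_types   # supported prediction types
learner$param_set$ids() # names of the available hyperparameters
```

Checks such as "missings" %in% learner$properties can be useful to verify programmatically that a learner is suitable for a given dataset.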

To run an ML experiment, learners pass through two stages (Figure 2.2):

  • Training: A training Task is passed to the learner’s $train() method, which trains and stores a model, i.e., the learned relationship of the features to the target.
  • Predicting: New data, often a different partition of the original dataset, is passed to the $predict() method of the trained learner to predict the target values.

Diagram shows two boxes, the first is labelled "$train()" and shows data being passed to a Learner. The second is labelled "$predict()" and shows "Inference Data" being passed to the "Learner" which now includes a "$model"; an arrow then shows predictions being made.

Figure 2.2: Overview of the different stages of a learner. Top – data (split into features and a target) is passed to an (untrained) learner. Bottom – new data is passed to the trained model which makes predictions for the ‘missing’ target column.

2.2.1 Training

In the simplest use-case, models are trained by passing a task to a learner with the $train() method:

# load mtcars task
task = tsk("mtcars")
# load a regression tree
learner_rpart = lrn("regr.rpart")
# pass the task to the learner via $train()
learner_rpart$train(task)

After training, the fitted model is stored in the $model field for future inspection and prediction:

# inspect the trained model
learner_rpart$model
n= 32 

node), split, n, deviance, yval
      * denotes terminal node

1) root 32 1126.04700 20.09062  
  2) cyl>=5 21  198.47240 16.64762  
    4) hp>=192.5 7   28.82857 13.41429 *
    5) hp< 192.5 14   59.87214 18.26429 *
  3) cyl< 5 11  203.38550 26.66364 *

We see that the regression tree has identified features in the task that are predictive of the target (mpg) and used them to partition observations. The textual representation of the model depends on the type of learner. For more information on any model see the learner help page, which can be accessed in the same way as for tasks with the $help() method, e.g., learner_rpart$help().

Partitioning data

When performing simple examples to assess the quality of a model’s predictions, you will likely want to partition your dataset to get a fair and unbiased estimate of a model’s generalization error. In Chapter 3 we will look at resampling and benchmark experiments, which will go into more detail about performance estimation; for now we will just discuss the simplest method of splitting data using the partition() function. This function randomly splits the given task into two disjoint sets: a training set (67% of the total data, the default) and a test set (the remaining 33% of the data).

# changing default to a 70:30 train:test split
splits = partition(task_mtcars, ratio = 0.7)
splits
$train
 [1]  1  3  4  5  8 10 21 25 32  6  7 11 14 15 16 17 22 24 31 18 19 20 26

$test
[1]  2  9 27 30 12 13 23 29 28

Now when training we will tell the model to only use the training data by passing the row IDs from partition to the row_ids argument of $train():

learner_rpart$train(task_mtcars, row_ids = splits$train)
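Because partition() splits at random, it can be helpful to fix the random seed and to check the properties of the split. The sketch below is one way to do this; the seed value and variable names are arbitrary choices of ours.

```r
library(mlr3)

set.seed(42) # make the random split reproducible
task = tsk("mtcars")
splits = partition(task, ratio = 0.7)

n_train = length(splits$train)
n_test = length(splits$test)

# every row is used exactly once: the sets are disjoint and cover the task
overlap = intersect(splits$train, splits$test)
c(train = n_train, test = n_test, overlap = length(overlap))
```

Without set.seed(), repeated calls return different row IDs, so results that depend on the split will vary between runs.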

Now we can use our trained learner to make predictions on new data.

2.2.2 Predicting

Predicting from trained models is as simple as passing your data to the $predict() method of the trained Learner.


Carrying straight on from our last example, we will call the $predict() method from our trained learner and again will use the row_ids argument, but this time to pass the IDs of our test set:

predictions = learner_rpart$predict(task_mtcars, row_ids = splits$test)

The $predict() method returns an object inheriting from Prediction, in this case PredictionRegr as this is a regression task.

predictions
<PredictionRegr> for 9 observations:
    row_ids truth response
          2  21.0  16.2800
          9  22.8  26.7625
         27  26.0  26.7625
---
         23  15.2  16.2800
         29  15.8  16.2800
         28  30.4  26.7625

The row_ids column corresponds to the row IDs of the predicted observations. The truth column contains the ground truth data, which the object extracts from the task, in this case: task_mtcars$truth(splits$test). Finally, the response column contains the values predicted by the model. The Prediction object can easily be converted into a data.table or data.frame using as.data.table()/as.data.frame() respectively.

All data in the above columns can be accessed directly, for example to get the first two predicted responses:

predictions$response[1:2]
[1] 16.2800 26.7625
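Direct access to the columns makes it easy to work with predictions outside of mlr3. The following self-contained sketch converts a prediction to a data.table and computes the mean squared error by hand; the seed and the variable names pred_dt and manual_mse are ours, and the values will differ from the output above because a new random split is drawn.

```r
library(mlr3)

set.seed(42)
task = tsk("mtcars")
splits = partition(task)

learner = lrn("regr.rpart")$train(task, row_ids = splits$train)
predictions = learner$predict(task, row_ids = splits$test)

# tabular view with one row per predicted observation
pred_dt = as.data.table(predictions)
names(pred_dt)

# mean squared error computed directly from the stored columns
manual_mse = mean((predictions$truth - predictions$response)^2)
manual_mse
```

This is only meant to show how the fields can be used; evaluation measures provided by mlr3 are introduced in Section 2.3.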

Similarly to plotting Tasks, mlr3viz provides an autoplot() method for Prediction objects.

library(mlr3viz)
predictions = learner_rpart$predict(task_mtcars, splits$test)
autoplot(predictions)

A scatter plot of predicted values on one axis and ground truth values on the other. A trend line is fit to show that in general there is good agreement between predicted and ground truth values.

Comparing predicted and ground truth values for the mtcars dataset.

In the examples above we made predictions by passing a task to $predict(). If you would rather pass a data.frame type object directly, you can use $predict_newdata() instead:

library(data.table)
mtcars_new = data.table(cyl = c(5, 6), disp = c(100, 120),
  hp = c(100, 150), drat = c(4, 3.9), wt = c(3.8, 4.1),
  qsec = c(18, 19.5), vs = c(1, 0), am = c(1, 1),
  gear = c(6, 4), carb = c(3, 5))
predictions = learner_rpart$predict_newdata(mtcars_new)
predictions
<PredictionRegr> for 2 observations:
 row_ids truth response
       1    NA  26.7625
       2    NA  26.7625

Changing the Prediction Type

Whilst predicting a single numeric quantity is the most common prediction type in regression, it is not the only prediction type. Several regression models can also predict standard errors, which are computed during training. To predict these, the $predict_type field of a LearnerRegr must be changed from “response” (the default) to “se” before training, which is most simply done during construction. The rpart learner we used above does not support predicting standard errors, so in the example below we will use a linear regression model implemented in mlr3learners::LearnerRegrLm. Note how the output now includes standard errors.

library(mlr3learners)
learner_lm = lrn("regr.lm", predict_type = "se")
learner_lm$train(task_mtcars, splits$train)
learner_lm$predict(task_mtcars, splits$test)
<PredictionRegr> for 9 observations:
    row_ids truth response       se
          2  21.0 21.92034 1.193742
          9  22.8 25.36346 1.346913
         27  26.0 25.80819 1.219812
---
         23  15.2 15.76983 1.349221
         29  15.8 14.75022 1.056986
         28  30.4 26.35487 1.145776

We will see prediction types playing an even bigger part in classification in Section 2.5.3.
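Whether a learner supports a given prediction type can be checked through its $predict_types field, and the prediction type can also be changed after construction, as long as this happens before training. The sketch below illustrates both points; the fixed row ranges used for training and prediction are an arbitrary choice of ours.

```r
library(mlr3)
library(mlr3learners)

task = tsk("mtcars")

# check support before requesting standard errors
rpart_supports_se = "se" %in% lrn("regr.rpart")$predict_types
lm_supports_se = "se" %in% lrn("regr.lm")$predict_types
c(rpart = rpart_supports_se, lm = lm_supports_se)

# change the prediction type after construction but before training
learner_lm = lrn("regr.lm")
learner_lm$predict_type = "se"
learner_lm$train(task, row_ids = 1:25)
pred = learner_lm$predict(task, row_ids = 26:32)

# standard errors are stored next to the predicted responses
head(pred$se)
```

Setting the prediction type in lrn() at construction, as in the example above, is equivalent and usually more compact.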

The final step of the basic ML workflow (Figure 2.1) is to evaluate the quality of predictions to see if our trained model is any ‘good’. We will cover basic evaluation in Section 2.3 and then more advanced evaluation for data resampling and model comparison in Chapter 3. But first we will cover the final element that makes ML models powerful predictive tools: their hyperparameters, which give users control over the fitting and predicting process and, when set correctly, can result in more accurate models.

2.2.3 Hyperparameters

Learners encapsulate an ML algorithm and its hyperparameters, which are free parameters that can be set by the user to affect how the algorithm is run. Hyperparameters may affect how a model is trained or how it makes predictions, and deciding which hyperparameters to set can require expert knowledge, though often there is an element of trial and error. Hyperparameters are hugely important to a model performing well, and therefore setting hyperparameters manually is rarely a good idea. In practice, automated hyperparameter optimization is more common, which we will return to in Chapter 4. For this chapter we will refer to manual setting of hyperparameters for the sake of brevity. We will first look at paradox and ParamSet objects, which are used to store learner hyperparameters, and then we will look at getting and setting these values.

Paradox and parameter sets

We will continue our running example with a regression tree learner. To access the hyperparameters in the decision tree, we use $param_set:

learner_rpart$param_set
                id    class lower upper nlevels        default value
 1:             cp ParamDbl     0     1     Inf           0.01      
 2:     keep_model ParamLgl    NA    NA       2          FALSE      
 3:     maxcompete ParamInt     0   Inf     Inf              4      
 4:       maxdepth ParamInt     1    30      30             30      
 5:   maxsurrogate ParamInt     0   Inf     Inf              5      
 6:      minbucket ParamInt     1   Inf     Inf <NoDefault[3]>      
 7:       minsplit ParamInt     1   Inf     Inf             20      
 8: surrogatestyle ParamInt     0     1       2              0      
 9:   usesurrogate ParamInt     0     2       3              2      
10:           xval ParamInt     0   Inf     Inf             10     0

The output above is a paradox::ParamSet object from the package paradox. These objects provide information on hyperparameters including their name (id), data types (class), acceptable ranges for hyperparameter values (lower, upper), the number of levels possible if the data type is categorical (nlevels), the default value from the underlying package (default), and finally the set value if different from the default (value). The second column references classes defined in paradox that determine the class of the parameter and the possible values it can take. Table 2.1 lists the possible hyperparameter types, all of which inherit from paradox::Param.

Table 2.1: Hyperparameter classes and the type of hyperparameter they represent.

Hyperparameter Class   Description
ParamDbl               Real-valued (Numeric) Parameters
ParamInt               Integer Parameters
ParamFct               Categorical (Factor) Parameters
ParamLgl               Logical / Boolean Parameters
ParamUty               Untyped Parameters

Let us carry on the example above and consider some specific hyperparameters. From the decision tree ParamSet output we can infer the following:

  • cp must be a “double” (ParamDbl) taking values between 0 (lower) and 1 (upper) with a default of 0.01 (default).
  • keep_model must be a “logical” (ParamLgl) taking values TRUE or FALSE with default FALSE.
  • xval must be an “integer” (ParamInt) taking values between 0 and Inf with a default of 10 and a set value of 0.
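The same information can be extracted programmatically instead of being read off the printed output. The sketch below assumes that the parameter set can be converted with as.data.table() and that the resulting table contains the columns id, class, lower, and upper shown above; the variable names are ours.

```r
library(mlr3)

param_set = lrn("regr.rpart")$param_set
ps_dt = as.data.table(param_set)

# class and range of the cp hyperparameter
cp_row = ps_dt[ps_dt$id == "cp", ]
c(class = cp_row$class, lower = cp_row$lower, upper = cp_row$upper)

# default from the underlying package versus the value set in construction
param_set$default$maxdepth
param_set$values$xval
```

Querying the table in this way is convenient when many learners have to be compared or when hyperparameter ranges are needed in later code.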

In rare cases (we try to minimize this as much as possible), we alter hyperparameter values in construction. When we do this, the reason will always be given in the learner help page. In the case of regr.rpart, we change the xval default to 0 because xval controls internal cross-validations and if a user accidentally leaves this at 10 then model training can take a long time.

Getting and setting hyperparameter values

Now that we have looked at how parameter sets are stored, we can think about getting and setting parameters. Returning to our decision tree, say we are interested in growing a tree with depth 1, which means a tree where data is split once into two terminal nodes. From the parameter set output, we know that the maxdepth parameter has a default of 30 and that it takes integer values.

There are a few different ways we could change this hyperparameter. The simplest way to set a hyperparameter is in construction of the learner by simply passing the hyperparameter name and new value to lrn():

learner_rpart = lrn("regr.rpart", maxdepth = 1)

We can view the set of non-default hyperparameters (i.e., those changed by the user) by using $param_set$values:

learner_rpart$param_set$values
$xval
[1] 0

$maxdepth
[1] 1

# depth 1
learner_rpart$train(tsk("mtcars"))$model
n= 32 

node), split, n, deviance, yval
      * denotes terminal node

1) root 32 1126.0470 20.09062  
  2) cyl>=5 21  198.4724 16.64762 *
  3) cyl< 5 11  203.3855 26.66364 *

Now we can see that maxdepth = 1 (as we discussed above xval = 0 is changed in construction) and the constructed regression tree reflects this.

This values field simply returns a list of set hyperparameters, so another way to update hyperparameters is by updating an element in the list:

learner_rpart$param_set$values$maxdepth = 2
learner_rpart$param_set$values
$xval
[1] 0

$maxdepth
[1] 2

# depth 2
learner_rpart$train(tsk("mtcars"))$model
n= 32 

node), split, n, deviance, yval
      * denotes terminal node

1) root 32 1126.04700 20.09062  
  2) cyl>=5 21  198.47240 16.64762  
    4) hp>=192.5 7   28.82857 13.41429 *
    5) hp< 192.5 14   59.87214 18.26429 *
  3) cyl< 5 11  203.38550 26.66364 *

Finally, to set multiple values at once we recommend either setting these in construction or using $set_values().

learner_rpart = lrn("regr.rpart", maxdepth = 3, xval = 1)
learner_rpart$param_set$values
$xval
[1] 1

$maxdepth
[1] 3

# or with set_values
learner_rpart$param_set$set_values(xval = 2, cp = 0.5)
learner_rpart$param_set$values
$xval
[1] 2

$maxdepth
[1] 3

$cp
[1] 0.5

As learner_rpart$param_set$values returns a list, some users may be tempted to set hyperparameters by passing a new list to $values. This would work, but we do not recommend it because passing a list will wipe any existing hyperparameter values that are not included in the list. For example:

rpart_params = lrn("regr.rpart")
# values at construction
rpart_params$param_set$values
$xval
[1] 0

# passing maxdepth the wrong way
rpart_params$param_set$values = list(maxdepth = 1)
# we have removed xval by mistake
rpart_params$param_set$values
$maxdepth
[1] 1

# now with set_values
rpart_params = lrn("regr.rpart")
rpart_params$param_set$set_values(maxdepth = 1)
rpart_params$param_set$values
$xval
[1] 0

$maxdepth
[1] 1

All methods have safety checks to ensure your new values fall within the allowed parameter range:

lrn("regr.rpart", cp = 2, maxdepth = 2)
Error in self$assert(xs): Assertion on 'xs' failed: cp: Element 1 is not <= 1.

Parameter dependencies

This section covers advanced ML or technical details that can be skipped.

More complex hyperparameter spaces may include parameter dependencies, which occur when setting a hyperparameter is conditional on the value of another hyperparameter; this is most important in the context of model tuning (Chapter 4). One such example is an SVM classifier, implemented in mlr3learners::LearnerClassifSVM. The parameter set of this model has an additional column called ‘parents’, which tells us there are parameter dependencies in the learner.

lrn("classif.svm")$param_set
                 id    class lower upper nlevels          default parents
 1:       cachesize ParamDbl  -Inf   Inf     Inf               40        
 2:   class.weights ParamUty    NA    NA     Inf                         
 3:           coef0 ParamDbl  -Inf   Inf     Inf                0  kernel
 4:            cost ParamDbl     0   Inf     Inf                1    type
 5:           cross ParamInt     0   Inf     Inf                0        
 6: decision.values ParamLgl    NA    NA       2            FALSE        
 7:          degree ParamInt     1   Inf     Inf                3  kernel
 8:         epsilon ParamDbl     0   Inf     Inf              0.1        
 9:          fitted ParamLgl    NA    NA       2             TRUE        
10:           gamma ParamDbl     0   Inf     Inf   <NoDefault[3]>  kernel
11:          kernel ParamFct    NA    NA       4           radial        
12:              nu ParamDbl  -Inf   Inf     Inf              0.5    type
13:           scale ParamUty    NA    NA     Inf             TRUE        
14:       shrinking ParamLgl    NA    NA       2             TRUE        
15:       tolerance ParamDbl     0   Inf     Inf            0.001        
16:            type ParamFct    NA    NA       2 C-classification        
1 variable not shown: [value]

To view exactly what the dependency is we can use $deps, which returns a data.table that can be queried in the usual way. So to see the dependencies of the SVM and to inspect the conditions we could do the following:

lrn("classif.svm")$param_set$deps
       id     on           cond
1:   cost   type <CondEqual[9]>
2:     nu   type <CondEqual[9]>
3: degree kernel <CondEqual[9]>
4:  coef0 kernel <CondAnyOf[9]>
5:  gamma kernel <CondAnyOf[9]>
lrn("classif.svm")$param_set$deps[1, cond]
CondEqual: x = C-classification
lrn("classif.svm")$param_set$deps[4, cond]
CondAnyOf: x ∈ {polynomial, sigmoid}

This tells us that the parameter cost should only be set if the type parameter is set to "C-classification". Similarly, the coef0 parameter should only be set if the kernel parameter is "polynomial" or "sigmoid".

# errors because "eps-classification" is not an allowed value of type
lrn("classif.svm", type = "eps-classification", cost = 0.5)
Error in self$assert(xs): Assertion on 'xs' failed: type: Must be element of set {'C-classification','nu-classification'}, but is 'eps-classification'.
# works because type is C-classification
lrn("classif.svm", type = "C-classification", cost = 0.5)
* Model: -
* Parameters: type=C-classification, cost=0.5
* Packages: mlr3, mlr3learners, e1071
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric
* Properties: multiclass, twoclass
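
Because $deps returns a data.table, the dependency table shown earlier can be filtered like any other. The following is a minimal sketch, assuming the mlr3learners package (which provides classif.svm) is installed; the variable name deps is only illustrative.

```r
library(mlr3)
library(mlr3learners)  # provides the classif.svm learner

# dependency table of the SVM's hyperparameters
deps = lrn("classif.svm")$param_set$deps

# hyperparameters that may only be set for particular kernels
deps[on == "kernel", id]

# hyperparameters that may only be set for a particular SVM type
deps[on == "type", id]
```

Filtering on the on column answers the question "which hyperparameters become available once I fix this one?", which is often easier to read than the full table.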

2.2.4 Baseline learners

This section covers advanced ML or technical details that can be skipped.

Before we move onto learner evaluation, we will first highlight one particularly important class of learners that are useful in many aspects of ML. Contrary to expectations, these are actually the ‘bad’ or ‘weak’ learners known as baselines. Baselines are useful in model comparison (Chapter 3), as fallback learners (Section 4.7.1, Section 9.2.2), to be ‘composed’ into more complex models (Section 8.2.4), and can be used by sophisticated models (e.g., random forests) during training and/or prediction. For regression, we have implemented the baseline regr.featureless, which always predicts the mean of the target of the training data:

# generate data
df = as_task_regr(data.frame(x = runif(1000), y = rnorm(1000, 2, 1)), target = "y")
lrn("regr.featureless")$train(df, 1:995)$predict(df, 996:1000)
<PredictionRegr> for 5 observations:
 row_ids    truth response
     996 3.675092 1.977081
     997 3.651012 1.977081
     998 1.803173 1.977081
     999 1.195544 1.977081
    1000 1.861893 1.977081

It is good practice to test all new models against a baseline, and also to include baselines in experiments with multiple other models. In general, a model that does not outperform a baseline is a ‘bad’ model. On the other hand, a model is not necessarily ‘good’ just because it outperforms the baseline.
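
To make the idea of a featureless regression baseline concrete, here is a hand-rolled sketch in base R. It is an illustration only, not mlr3's implementation; the seed and all variable names are arbitrary choices.

```r
# Hand-rolled featureless regression baseline (illustrative only)
set.seed(1)                        # arbitrary seed, only for reproducibility
y = rnorm(1000, mean = 2, sd = 1)  # same kind of target as in the example above
train_ids = 1:995
test_ids = 996:1000

# "training" just stores the mean of the training targets
baseline_mean = mean(y[train_ids])

# "prediction" repeats that constant for every new observation
baseline_pred = rep(baseline_mean, length(test_ids))
baseline_pred
```

Every test observation receives the same prediction, which is why the response column in the mlr3 output above is constant.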

2.3 Evaluation

Perhaps the most important step of the ML workflow is evaluating model performance. Without this step, we would have no way to know if our trained model makes very accurate predictions, is worse than randomly guessing, or somewhere in between. We will continue with our decision tree example to establish if the quality of our predictions is ‘good’. First we will rerun the above code so it is easier to follow along.

learner_rpart = lrn("regr.rpart")
task_mtcars = tsk("mtcars")
splits = partition(task_mtcars)
learner_rpart$train(task_mtcars, splits$train)
predictions = learner_rpart$predict(task_mtcars, splits$test)

2.3.1 Measures

Analogously to Tasks and Learners, the available measures in mlr3 are stored in a dictionary called mlr_measures, which can be converted to a data.table to view all available measures. The sugar function msr() simplifies retrieving a measure, and again you can use the $help() method to find documentation for any measure.

as.data.table(mlr_measures)
             key                          label task_type          packages
 1:          aic   Akaike Information Criterion      <NA>              mlr3
 2:          bic Bayesian Information Criterion      <NA>              mlr3
 3:  classif.acc        Classification Accuracy   classif mlr3,mlr3measures
 4:  classif.auc       Area Under the ROC Curve   classif mlr3,mlr3measures
 5: classif.bacc              Balanced Accuracy   classif mlr3,mlr3measures
---
62:  sim.jaccard       Jaccard Similarity Index      <NA> mlr3,mlr3measures
63:      sim.phi     Phi Coefficient Similarity      <NA> mlr3,mlr3measures
64:    time_both                   Elapsed Time      <NA>              mlr3
65: time_predict                   Elapsed Time      <NA>              mlr3
66:   time_train                   Elapsed Time      <NA>              mlr3
2 variables not shown: [predict_type, task_properties]

All measures implemented in mlr3 are defined primarily by three components: 1) the function that defines the measure; 2) whether a lower or higher value is considered ‘good’; and 3) the range of possible values the measure can take. As well as these defining elements, other metadata are important to consider when selecting and using a Measure, including if the measure has any special properties (e.g., requires training data), the type of predictions the measure can evaluate, and whether the measure has any ‘control parameters’. All this information is encapsulated in the Measure object. As an example, let us consider the mean absolute error (regr.mae).

measure = msr("regr.mae")
measure
<MeasureRegrSimple:regr.mae>: Mean Absolute Error
* Packages: mlr3, mlr3measures
* Range: [0, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: list()
* Properties: -
* Predict type: response

This measure compares the absolute difference (‘error’) between true and predicted values: \(f(y, \hat{y}) = | y - \hat{y} |\). Lower values are considered better (Minimize: TRUE), which is intuitive as we would like the true values, \(y\), to be identical (or as close as possible) in value to the predicted values, \(\hat{y}\). Finally we can see that the range of possible values the measure can take is from \(0\) to \(\infty\) (Range: [0, Inf]). The measure has no special properties (Properties: -), it evaluates response type predictions for regression models (Predict type: response), and it has no control parameters (Parameters: list()).

2.3.2 Scoring Predictions

All supervised learning measures compare the difference between predicted values and the ground truth. mlr3 simplifies the process of bringing these quantities together by storing the predictions and true outcomes in the Prediction object as we have already seen.

predictions
<PredictionRegr> for 11 observations:
    row_ids truth response
          2  21.0 16.70000
          8  24.4 26.81429
         21  21.5 26.81429
         31  15.0 16.70000
         18  32.4 26.81429
         26  27.3 26.81429

To actually calculate model performance, we simply call the $score() method of a Prediction object and pass as a single argument the measure (or measures passed as a list) that we want to compute. Less abstractly:
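
As a minimal, self-contained sketch (the objects from the start of this section are recreated so the block runs on its own, and the seed is an arbitrary choice), the following scores the decision tree's predictions with the mean absolute error and checks the result against a by-hand computation. The names mae_mlr3 and mae_manual are illustrative.

```r
library(mlr3)

set.seed(1)  # arbitrary seed so that the random partition is reproducible
task_mtcars = tsk("mtcars")
splits = partition(task_mtcars)
learner_rpart = lrn("regr.rpart")
learner_rpart$train(task_mtcars, splits$train)
predictions = learner_rpart$predict(task_mtcars, splits$test)

# score the predictions with a single measure
measure = msr("regr.mae")
mae_mlr3 = predictions$score(measure)
mae_mlr3

# the same quantity computed by hand from the stored truth and response
mae_manual = mean(abs(predictions$truth - predictions$response))
all.equal(unname(mae_mlr3), mae_manual)
```

The by-hand version makes explicit that $score() only needs the truth and response values already stored in the Prediction object.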


Note that all task types have default measures that are used if the argument to $score() is omitted. For regression this is the mean squared error (regr.mse), which is the squared difference between true and predicted values, \(f(y, \hat{y}) = (y - \hat{y})^2\), averaged over all observations.

It is possible to calculate multiple measures at the same time by passing multiple measures to $score(). For example, below we compute performance for mean squared error (regr.mse) and mean absolute error (regr.mae) – note we use msrs() to load multiple measures at once.

measures = msrs(c("regr.mse", "regr.mae"))
predictions$score(measures)
regr.mse regr.mae 
9.566957 2.590909 

2.3.3 Technical measures

This section covers advanced ML or technical details that can be skipped.

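mlr3 also provides measures that do not quantify the quality of the predictions of a model, but instead provide ‘meta’ information about the model. In particular, we have implemented:

  • time_train: the time taken to train a model.
  • time_predict: the time taken for the model to make predictions.
  • time_both: the total time taken to train the model and then make predictions.
  • selected_features: the number of features selected by a model.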

So we can now score our decision tree to see how long it takes to train the model and then make predictions:

measures = msrs(c("time_train", "time_predict", "time_both"))
predictions$score(measures, learner = learner_rpart)
  time_train time_predict    time_both 
       0.004        0.003        0.007 

Notice a few key properties of these measures:

  1. time_both is simply the sum of time_train and time_predict
  2. We had to pass learner = learner_rpart to $score() as these measures have the requires_learner property:
msr("time_train")$properties
[1] "requires_learner"
  3. These can be used after model training and predicting because we automatically store model run times whenever $train() and $predict() are called, so the measures above are equivalent to:
c(learner_rpart$timings, both = sum(learner_rpart$timings))
  train predict    both 
  0.004   0.003   0.007 

The selected_features measure calculates how many features were selected as important by the learner.

measure_sf = msr("selected_features")
measure_sf
<MeasureSelectedFeatures:selected_features>: Absolute or Relative Frequency of Selected Features
* Packages: mlr3
* Range: [0, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: normalize=FALSE
* Properties: requires_task, requires_learner, requires_model
* Predict type: NA

We can see that this measure contains control parameters (Parameters: normalize=FALSE), which are parameters that control how the measure is computed. As with hyperparameters these can be viewed with $param_set:

measure_sf$param_set
          id    class lower upper nlevels default value
1: normalize ParamLgl    NA    NA       2   FALSE FALSE

The normalize control parameter specifies whether the returned number of selected features should be normalized by the total number of features. This is useful if you are comparing this value across tasks with differing numbers of features, so let us change the value to TRUE and see what proportion of features our decision tree selected:

measure_sf$param_set$values$normalize = TRUE
predictions$score(measure_sf, task = task_mtcars, learner = learner_rpart)

Note that we passed the task and learner as the measure has the requires_task and requires_learner properties.
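
The effect of normalize can be checked directly. The sketch below is self-contained (objects are recreated, the seed is arbitrary, and sf_raw and sf_norm are illustrative names) and assumes, following the description above, that the normalized value is the raw count divided by the number of features in the task.

```r
library(mlr3)

set.seed(1)  # arbitrary seed for a reproducible partition
task_mtcars = tsk("mtcars")
splits = partition(task_mtcars)
learner_rpart = lrn("regr.rpart")
learner_rpart$train(task_mtcars, splits$train)
predictions = learner_rpart$predict(task_mtcars, splits$test)

# raw count of selected features (normalize = FALSE is the default)
sf_raw = predictions$score(msr("selected_features"),
  task = task_mtcars, learner = learner_rpart)

# proportion of selected features
sf_norm = predictions$score(msr("selected_features", normalize = TRUE),
  task = task_mtcars, learner = learner_rpart)

# assumed relationship between the two values
all.equal(unname(sf_norm), unname(sf_raw) / length(task_mtcars$feature_names))
```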

2.4 Our first regression experiment

Before we go on to look at how the building blocks of mlr3 extend to classification, we will take a brief pause to put together everything above in a short experiment. In this experiment we will compare the performance of a featureless regression learner to a decision tree with changed parameters.

library(mlr3)
set.seed(349)
# load and partition our task
task = tsk("mtcars")
splits = partition(task)
# load featureless learner
featureless = lrn("regr.featureless")
# load decision tree with different hyperparameters
rpart = lrn("regr.rpart", cp = 0.2, maxdepth = 5)
# load MSE and MAE measures
measures = msrs(c("regr.mse", "regr.mae"))
# train learners
featureless$train(task, splits$train)
rpart$train(task, splits$train)
# make and score predictions
featureless$predict(task, splits$test)$score(measures)
 regr.mse  regr.mae 
26.726772  4.512987 
rpart$predict(task, splits$test)$score(measures)
regr.mse regr.mae 
6.932709 2.206494 

Before starting the experiment we load the mlr3 library and set a seed (in an exercise below you will be asked to think about why setting a seed is essential for reproducibility in this experiment). In this experiment we loaded the regression task mtcars with tsk() and then split this using partition() with the default 2/3 split. Next we loaded a featureless baseline learner (regr.featureless) with the lrn() function. We then loaded a decision tree (regr.rpart) but changed the complexity parameter and maximum tree depth from their defaults. We then used msrs() to load multiple measures at once, the mean squared error (MSE) (regr.mse) and the mean absolute error (MAE) (regr.mae). With all objects loaded we then trained our models, passing the same training data to both. Finally we made predictions from our trained models and scored these. Note how we use ‘method chaining’, which is an R6 technique to combine multiple methods (called with $()) in a row on the same line. For both MSE and MAE, lower values are ‘better’ (Minimize: TRUE), therefore we can conclude that the decision tree performs better than the featureless baseline as its MSE and MAE are both lower. In Section 3.3 we will see how to formalize comparison between models in a more efficient way using benchmark().

Now we have put everything together you may notice that our learners and measures both have the "regr." prefix, which is a handy way of reminding us that we are working with a regression task and therefore must make use of learners and measures built for regression. In the next section, we will extend the building blocks of mlr3 to consider classification tasks, which make use of learners and measures with the "classif." prefix.

2.5 Classification

Classification problems are ones in which a model tries to predict a discrete, categorical target, as opposed to a continuous, numeric quantity. For example, predicting the species of penguin from its physical characteristics would be a classification problem as there are only a finite number of species. mlr3 ensures that the interface for all tasks is as similar as possible (if not identical) and therefore we will not repeat any content from the previous section but will just focus on differences that make classification a unique ML problem. We will first demonstrate the similarities between regression and classification by performing an experiment very similar to the one in Section 2.4 using code that will now be familiar to you. We will then move to differences in tasks, learners and predictions, before looking at thresholding, which is a method specific to classification.

2.5.1 Our first classification experiment

The interface for classification tasks, learners, and measures, is identical to the regression setting, except the underlying objects inherit from TaskClassif, LearnerClassif, and MeasureClassif, respectively.

We can therefore run a very similar experiment to the one above.

# load and partition our task
task_pen = tsk("penguins")
splits = partition(task_pen)
# load featureless learner
featureless = lrn("classif.featureless")
# load decision tree with different hyperparameters
rpart = lrn("classif.rpart", cp = 0.2, maxdepth = 5)
# load accuracy measure
measure = msr("classif.acc")
# train learners
featureless$train(task_pen, splits$train)
rpart$train(task_pen, splits$train)
# make and score predictions
featureless$predict(task_pen, splits$test)$score(measure)
rpart$predict(task_pen, splits$test)$score(measure)

In this experiment we loaded the predefined task mlr_tasks_penguins, which is based on the palmerpenguins::penguins dataset, then partitioned the data into training and test splits. We loaded the featureless classification baseline (which always predicts the most common class in the training data) and a classification decision tree, then the accuracy measure (number of correct predictions divided by total number of predictions). We then trained our models, made predictions, and scored them. In this experiment the decision tree is clearly the better performing model as it is vastly more accurate.

Now we have seen the similarities between classification and regression, we can turn to some key differences.

2.5.2 TaskClassif

Classification tasks, objects inheriting from TaskClassif, are very similar to regression tasks, except the target variable is of type factor and will have a limited number of possible classes/categories that observations can fall into.


You can view the predefined classification tasks in mlr3 by filtering the mlr_tasks dictionary, and you can create your own with as_task_classif.

as.data.table(mlr_tasks)[task_type == "classif"]
                key                                     label task_type nrow
 1:   breast_cancer                   Wisconsin Breast Cancer   classif  683
 2:   german_credit                             German Credit   classif 1000
 3:            ilpd                 Indian Liver Patient Data   classif  583
 4:            iris                              Iris Flowers   classif  150
 5:       optdigits Optical Recognition of Handwritten Digits   classif 5620
 6:        penguins                           Palmer Penguins   classif  344
 7: penguins_simple                Simplified Palmer Penguins   classif  333
 8:            pima                      Pima Indian Diabetes   classif  768
 9:           sonar                    Sonar: Mines vs. Rocks   classif  208
10:            spam                         HP Spam Detection   classif 4601
11:         titanic                                   Titanic   classif 1309
12:            wine                              Wine Regions   classif  178
13:             zoo                               Zoo Animals   classif  101
9 variables not shown: [ncol, properties, lgl, int, dbl, chr, fct, ord, pxc]
as_task_classif(palmerpenguins::penguins, target = "species")
<TaskClassif:palmerpenguins::penguins> (344 x 8)
* Target: species
* Properties: multiclass
* Features (7):
  - int (3): body_mass_g, flipper_length_mm, year
  - dbl (2): bill_depth_mm, bill_length_mm
  - fct (2): island, sex

There are two types of classification task supported in mlr3: binary classification, in which the outcome can be one of two categories, and multiclass classification, where the outcome can be one of three or more categories.


The sonar task (mlr_tasks_sonar) is an example of a binary classification problem, as its target has only two possible classes; in mlr3 terminology it has the “twoclass” property:

task_sonar = tsk("sonar")
task_sonar
<TaskClassif:sonar> (208 x 61): Sonar: Mines vs. Rocks
* Target: Class
* Properties: twoclass
* Features (60):
  - dbl (60): V1, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V2,
    V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V3, V30, V31,
    V32, V33, V34, V35, V36, V37, V38, V39, V4, V40, V41, V42, V43,
    V44, V45, V46, V47, V48, V49, V5, V50, V51, V52, V53, V54, V55,
    V56, V57, V58, V59, V6, V60, V7, V8, V9
task_sonar$class_names
[1] "M" "R"

In contrast, penguins (mlr_tasks_penguins) is a multiclass problem as there are more than two species of penguins; in mlr3 terminology it has the “multiclass” property:

task_penguins = tsk("penguins")
task_penguins
<TaskClassif:penguins> (344 x 8): Palmer Penguins
* Target: species
* Properties: multiclass
* Features (7):
  - int (3): body_mass, flipper_length, year
  - dbl (2): bill_depth, bill_length
  - fct (2): island, sex
task_penguins$class_names
[1] "Adelie"    "Chinstrap" "Gentoo"   

In mlr3, the only difference between these is that binary classification tasks have an extra field called $positive, which defines the ‘positive’ class. In binary classification, as there are only two possible class types, by convention one of these is known as the positive class and the other as the negative class; it is arbitrary which is which, though often the more ‘important’ class is set as the positive class. You can set the positive class during or after construction, if no positive class is specified then mlr3 assumes the first level in the target column is the positive class, which can lead to misleading results, as shown in the example below.

# create a dataset with factor target
data = data.frame(x = runif(5), y = factor(c("neg", "pos", "neg", "neg", "pos")))
# specifying the positive class:
as_task_classif(data, target = "y", positive = "pos")$positive
[1] "pos"
# default is first class, which here is "neg"
classif_task = as_task_classif(data, target = "y")
classif_task$positive
[1] "neg"
# changing after construction
classif_task$positive = "pos"
classif_task$positive
[1] "pos"

Whilst the choice of positive and negative class is arbitrary, it is essential to ensuring results from models and performance measures are interpreted as expected – this is best demonstrated when we discuss thresholding (Section 2.5.4) and ROC metrics (Section 3.4).

Finally, plotting is possible with mlr3viz::autoplot.TaskClassif, below we plot a comparison between the target column and features.

autoplot(tsk("penguins"), type = "duo") +
  ggplot2::theme(strip.text.y = ggplot2::element_text(angle = -45, size = 8))

Diagram showing the distribution of target and feature values for a subset of the penguins data.

Overview of part of the penguins dataset.

2.5.3 LearnerClassif and MeasureClassif

Classification learners, which inherit from LearnerClassif, have the same interface as regression learners. However, a key difference is that the possible prediction types in classification are either "response" – predicting an observation’s class (a penguin’s species in our example) – or "prob" – predicting the probability of an observation belonging to each class. In classification, the latter is more informative as it provides more information about the confidence of the predictions:

learner_rpart = lrn("classif.rpart", predict_type = "prob")
learner_rpart$train(task_penguins, splits$train)
predictions = learner_rpart$predict(task_penguins, splits$test)
predictions
<PredictionClassif> for 113 observations:
    row_ids     truth  response prob.Adelie prob.Chinstrap prob.Gentoo
          2    Adelie    Adelie  0.97029703     0.02970297  0.00000000
          4    Adelie    Adelie  0.97029703     0.02970297  0.00000000
          7    Adelie    Adelie  0.97029703     0.02970297  0.00000000
        338 Chinstrap Chinstrap  0.04651163     0.93023256  0.02325581
        341 Chinstrap    Adelie  0.97029703     0.02970297  0.00000000
        344 Chinstrap Chinstrap  0.04651163     0.93023256  0.02325581

Notice how the predictions include the predicted probabilities for all three classes, as well as the response, which (by default) is the class with the highest predicted probability.

The interface for classification measures, which are of class MeasureClassif, is identical to regression measures. The key difference in usage is that predict types are more important to be aware of, to ensure you are evaluating the ‘correct’ predictions. To evaluate "response" predictions, you will need measures with predict_type = "response", or to evaluate probability predictions you will require predict_type = "prob". The easiest way to find these measures is by filtering the mlr_measures dictionary:

as.data.table(mlr_measures)[task_type == "classif" & predict_type == "prob" & task_properties != "twoclass"]
                 key                                      label task_type
1:   classif.logloss                                   Log Loss   classif
2: classif.mauc_au1p    Weighted average 1 vs. 1 multiclass AUC   classif
3: classif.mauc_au1u             Average 1 vs. 1 multiclass AUC   classif
4: classif.mauc_aunp Weighted average 1 vs. rest multiclass AUC   classif
5: classif.mauc_aunu          Average 1 vs. rest multiclass AUC   classif
6:    classif.mbrier                     Multiclass Brier Score   classif
3 variables not shown: [packages, predict_type, task_properties]

We also filtered to remove any measures that have the “twoclass” property as this would conflict with our “multiclass” task. Now we can evaluate the quality of our probability predictions and response predictions simultaneously:

measures = msrs(c("classif.mbrier", "classif.logloss", "classif.acc"))
predictions$score(measures)
 classif.mbrier classif.logloss     classif.acc 
      0.1016821       0.2291407       0.9469027 

The accuracy measure evaluates the "response" predictions whereas the brier score (classif.mbrier) (squared difference between predicted probabilities and the truth) and logloss (classif.logloss) (negative logarithm of the predicted probability for the true class) are evaluating the probability predictions.

If no measure is passed to $score(), the default classification error (classif.ce) is calculated, which is the number of misclassifications divided by the number of predictions, i.e., 1 - classif.acc.

2.5.4 PredictionClassif, Confusion Matrix, and Thresholding

PredictionClassif objects have two important differences from the regression case. Firstly, the added field $confusion, and secondly the added method $set_threshold().


Confusion matrix

A confusion matrix is a popular way to show the quality of classification (response) predictions in a more detailed fashion by seeing if a model is good at (mis)classifying observations in a particular class. For binary and multiclass classification, the confusion matrix is stored in the $confusion field of the PredictionClassif object:

predictions$confusion
           truth
response    Adelie Chinstrap Gentoo
  Adelie        49         3      0
  Chinstrap      1        18      1
  Gentoo         0         1     40

The rows in a confusion matrix are the predicted class and the columns are the true class. All off-diagonal entries are incorrectly classified observations, and all diagonal entries are correctly classified. In this case, the classifier does fairly well classifying all penguins, but we could have found that it only classifies the Adelie species well but often conflates Chinstrap and Gentoo. You can visualize a confusion matrix with autoplot.PredictionClassif.
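
The link between the confusion matrix and the accuracy reported earlier can be verified by hand. The base R sketch below re-enters the matrix printed above (the object names are illustrative) and computes the accuracy from the diagonal.

```r
# confusion matrix copied from the output above: rows = predicted, columns = truth
classes = c("Adelie", "Chinstrap", "Gentoo")
confusion = matrix(
  c(49,  1,  0,   # truth = Adelie
     3, 18,  1,   # truth = Chinstrap
     0,  1, 40),  # truth = Gentoo
  nrow = 3,
  dimnames = list(response = classes, truth = classes)
)

# diagonal entries are the correctly classified observations
n_correct = sum(diag(confusion))
accuracy = n_correct / sum(confusion)   # 107 / 113
classif_error = 1 - accuracy

# misclassified observations per true class (column sums minus diagonal)
n_missed = colSums(confusion) - diag(confusion)
n_missed
```

The resulting accuracy, 107/113 ≈ 0.947, agrees with the classif.acc value scored in Section 2.5.3, and n_missed shows that most errors come from Chinstrap penguins.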


If we take task_sonar$positive (M) to be the positive class then the confusion matrix corresponds to true positives (top left), false positives (top right), false negatives (bottom left), and true negatives (bottom right) (see Figure 3.10):

splits = partition(task_sonar)
lrn("classif.rpart")$
  train(task_sonar, splits$train)$
  predict(task_sonar, splits$test)$
  confusion
        truth
response  M  R
       M 29 10
       R  8 22

We will return to the concept of binary (mis)classification in greater detail in Section 3.4.


The final big difference to discuss is thresholding. We saw previously that the response prediction type by default is calculated as the class that has the highest predicted probability. For n classes, with predicted probabilities \(p_1,...,p_n\), this is the same as saying response = argmax\(\{p_1,...,p_n\}\). If the maximum probability is not unique, i.e., multiple classes are predicted to have the highest probability, then the response is chosen randomly from these. In binary classification this means that the positive class will be selected if its predicted probability is greater than 50%, and the negative class otherwise.


This 50% value is known as the threshold and it can be useful to change this threshold if there is class imbalance (when one class is over- or under-represented in a dataset), or if there are different costs associated with classes, or simply if there is a preference to ‘over’-predict one class. As an example, let us take the german_credit task in which 700 customers have good credit and 300 have bad. Now we could easily build a model with 70% accuracy simply by always predicting a customer will have good credit:

task_credit = tsk("german_credit")
learn_featureless = lrn("classif.featureless", predict_type = "prob")
split = partition(task_credit)
learn_featureless$train(task_credit, split$train)
preds = learn_featureless$predict(task_credit, split$test)
preds$score(msr("classif.acc"))

Whilst this model may appear ‘good’ on the surface, in fact it just ignores all ‘bad’ customers – this can create very big problems in healthcare and other settings where there are data biases, as well as for the lender if false positives cost more than false negatives (see Section 8.1 for cost-sensitive classification).

Thresholding allows classes to be selected with a different probability threshold, so instead of predicting a customer has bad credit if P(good) < 50%, we might predict bad credit if P(good) < 70% – notice how we write this in terms of the positive class, which in this task is ‘good’. Let us see this in practice:
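
A minimal, self-contained sketch of this step follows. The featureless learner is retrained so the block runs on its own, the seed is arbitrary, and acc_default and acc_thresholded are illustrative names; no particular output is claimed because ties between classes are broken at random.

```r
library(mlr3)

set.seed(1)  # arbitrary seed for a reproducible partition
task_credit = tsk("german_credit")
split = partition(task_credit)
learn_featureless = lrn("classif.featureless", predict_type = "prob")
learn_featureless$train(task_credit, split$train)
preds = learn_featureless$predict(task_credit, split$test)

# accuracy with the default threshold of 0.5
acc_default = preds$score(msr("classif.acc"))

# use 0.7 instead of 0.5 as the threshold for the positive class ("good")
preds$set_threshold(0.7)
acc_thresholded = preds$score(msr("classif.acc"))

c(default = unname(acc_default), thresholded = unname(acc_thresholded))
preds$confusion
```

Note that $set_threshold() modifies the Prediction object in place, so the response column and the confusion matrix both reflect the new threshold afterwards.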


Whilst our model performs ‘worse’ overall, i.e. with lower accuracy, it is still a ‘better’ model as it more accurately captures the relationship between classes.

In the binary classification setting, $set_threshold only requires one numeric argument, which corresponds with the threshold for the positive class – hence why it is essential to ensure the positive class is correctly set in your task.

In multiclass classification, thresholding works by first assigning a threshold to each of the n classes, dividing the predicted probabilities for each class by these thresholds to return n ratios, and then the class with the highest ratio is selected. As an example, say we are predicting if a new observation will be of class A, B, C, or D and we have predicted \(P(A) = 0.2, P(B) = 0.4, P(C) = 0.1, P(D) = 0.3\). For now we will assume that the threshold for all classes is identical. Note that it is arbitrary which thresholds are chosen if they are all identical, so below we just use 1:

probs = c(0.2, 0.4, 0.1, 0.3)
thresholds = c(A = 1, B = 1, C = 1, D = 1)
probs/thresholds
  A   B   C   D 
0.2 0.4 0.1 0.3 

We would therefore predict our observation is of class B as this is the highest ratio. However, we could change our thresholds so that D has the lowest threshold and is therefore most likely to be predicted, A has the highest threshold, and B and C are equal:

thresholds = c(A = 0.5, B = 0.25, C = 0.25, D = 0.1)
probs/thresholds
  A   B   C   D 
0.4 1.6 0.4 3.0 

Now our observation will be predicted to be in class D.
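
The two worked examples above can be wrapped in a small helper that also performs the final "pick the highest ratio" step. This is a sketch in base R; threshold_response is a hypothetical function, not part of mlr3, and unlike the random tie-breaking described earlier it simply takes the first maximum.

```r
# hypothetical helper: ratio-based thresholding for a single observation
threshold_response = function(probs, thresholds) {
  stopifnot(length(probs) == length(thresholds), all(thresholds > 0))
  ratios = probs / thresholds
  # which.max() returns the first maximum; ties are not broken at random here
  names(thresholds)[which.max(ratios)]
}

probs = c(0.2, 0.4, 0.1, 0.3)

# identical thresholds: plain argmax of the probabilities
threshold_response(probs, c(A = 1, B = 1, C = 1, D = 1))

# lower threshold for D makes D the predicted class
threshold_response(probs, c(A = 0.5, B = 0.25, C = 0.25, D = 0.1))
```

The first call returns "B" and the second returns "D", matching the two cases discussed in the text.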

In mlr3, the same principle is followed with $set_threshold by passing a named numeric vector with one threshold per class. This is demonstrated below with mlr_tasks_zoo. Before changing the thresholds, some classes are never predicted and some are overpredicted.


library(ggplot2)
library(patchwork)
task = tsk("zoo")
splits = partition(task)
learner = lrn("classif.rpart", predict_type = "prob")
learner$train(task, splits$train)
preds = learner$predict(task, splits$test)
before = autoplot(preds) + ggtitle("Default thresholds")
new_thresh = proportions(table(task$truth(splits$train)))
new_thresh
       mammal          bird       reptile          fish     amphibian 
   0.40298507    0.19402985    0.04477612    0.13432836    0.04477612 
       insect mollusc.et.al 
   0.07462687    0.10447761 
preds$set_threshold(new_thresh)
after = autoplot(preds) + ggtitle("Inverse weighting thresholds")
before + after + plot_layout(guides = "collect")

A stacked bar plot of predicted values in one bar and ground truth values in the other. Some classes are predicted more often than in the ground truth data, some less often.

Comparing predicted and ground truth values for the zoo dataset.

Again we see that the model better represents all classes after thresholding. In this example we set the new thresholds to be the proportions of each class in the training set. Doing so, known as inverse weighting, effectively weights each class by the inverse of its probability of occurring, which means that more common classes will have higher thresholds and vice versa.

In Chapter 6 we will return to thresholding to see how to automatically choose and set thresholds, and in Section 8.1 we will look at cost-sensitive classification where each class has a different associated cost.

2.6 Task Column Roles

This section covers advanced ML or technical details that can be skipped.

Now we have covered regression and classification, we can briefly return to tasks and in particular to column roles, which are used to customize tasks further. Column roles are used by Task objects to define important metadata that can be used by learners and other objects to interact with the task. We have already seen some of these in action with targets and features. There are seven column roles available:

  1. "feature": Features used for prediction.
  2. "target": Target variable to predict.
  3. "name": Row names/observation labels, for mtcars this is the "model" column.
  4. "order": Variable(s) used to order data returned by $data(); must be sortable with order().
  5. "group": Variable used to keep observations together during resampling.
  6. "stratum": Variable(s) to stratify during resampling.
  7. "weight": Observation weights. Not more than one numeric column may have this role.

We have already seen how features and targets work in Section 2.1; these are the only required column roles. In Section 3.2.5 we will have a look at the stratum and group column roles, so for now we will only look at order and weight. We will not go into detail about name, which is primarily used by plotting and will almost always be the rownames() of the underlying data.

Column roles are updated using the $set_col_roles() method. When we set the order column role, the data is ordered according to that column(s), as in the following example.

df = data.frame(mtcars[1:2, ], idx = 2:1)
task_mtcars_order = as_task_regr(df, target = "mpg")
task_mtcars_order$data(ordered = TRUE)
   mpg am carb cyl disp drat gear  hp idx  qsec vs    wt
1:  21  1    4   6  160  3.9    4 110   2 16.46  0 2.620
2:  21  1    4   6  160  3.9    4 110   1 17.02  0 2.875
# order by "idx" column
task_mtcars_order$set_col_roles("idx", roles = "order")
task_mtcars_order$data(ordered = TRUE)
   mpg am carb cyl disp drat gear  hp  qsec vs    wt
1:  21  1    4   6  160  3.9    4 110 17.02  0 2.875
2:  21  1    4   6  160  3.9    4 110 16.46  0 2.620

In this example we can see that by setting "idx" to have the order column role, it is no longer displayed when we run $data() but instead is used to order the observations according to its value. This demonstrates how the Task object can hold metadata that is not passed to the learner.

The weight column role is used to weight data points differently. One reason to do this is severe class imbalance in a classification task, where weighting the minority class rows more heavily may improve the model’s performance on that class. For example, in the breast_cancer dataset there are more instances of benign tumors than malignant tumors, so if we want to better predict malignant tumors we could weight the data in favour of this class:

cancer_unweighted = tsk("breast_cancer")
summary(cancer_unweighted$data()$class)
malignant    benign 
      239       444 
df = cancer_unweighted$data()
# adding a column where the weight is 2 when the class == "malignant", and 1 otherwise
df$weights = ifelse(df$class == "malignant", 2, 1)
cancer_weighted = as_task_classif(df, target = "class")
cancer_weighted$set_col_roles("weights", roles = "weight")
# compare weighted and unweighted predictions
split = partition(cancer_unweighted)
lrn_rf = lrn("classif.ranger")
lrn_rf$train(cancer_unweighted, split$train)$predict(cancer_unweighted, split$test)$score()
lrn_rf$train(cancer_weighted, split$train)$predict(cancer_weighted, split$test)$score()

In this example, weighting may marginally improve the model performance (see Chapter 3 for more thorough comparison methods). Not all models can handle weights in the data so it’s important to check a learner’s properties to make sure this column role is being used as expected. Furthermore, algorithms will make use of weights in different ways so it’s important to read the implementation’s documentation to understand how weights are being used.
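One way to check whether a learner declares support for weights is to look for the "weights" entry in its `$properties` field. The sketch below does this for the classification tree that ships with mlr3, and then lists the keys of all learners in the currently loaded packages that declare the same property. The variable names are illustrative only, and the length of the list depends on which packages are loaded.

```r
library(mlr3)

# classification tree shipped with mlr3 (uses the rpart package)
lrn_rpart = lrn("classif.rpart")

# TRUE if the learner declares that it can use observation weights
supports_weights = "weights" %in% lrn_rpart$properties

# keys of all registered learners that declare the "weights" property
learners_dt = as.data.table(mlr_learners)
weight_keys = learners_dt[
  sapply(properties, function(x) "weights" %in% x), key]
```

A learner without this property cannot make use of the weight column role, so it is worth running a check like this before comparing weighted and unweighted results.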

2.7 Supported Algorithms

mlr3 supports many algorithms (some through multiple implementations) as Learners. These are primarily accessed through the mlr3, mlr3learners and mlr3extralearners packages; however, all packages that implement new tasks (Chapter 8) also include a handful of simple algorithms.

The list of learners included in mlr3 is deliberately small to avoid large sets of dependencies:

  • Featureless learners (regr.featureless/classif.featureless), which are implemented in mlr3 and are baseline learners used for model comparison or as fallback learners (Section 9.2.2). The former predicts the mean of the target values in the training set for all new observations; the latter predicts the most frequent label.
  • Debug learners (regr.debug/classif.debug), which are implemented in mlr3 and used only to debug code (Section 9.2).
  • Classification and regression trees (CART) (regr.rpart/classif.rpart).
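The behaviour of the featureless regression learner described above can be sketched as follows. The train/test split by row number is arbitrary and only for illustration, and `robust = FALSE` is set explicitly as an assumption, so that the mean rather than a robust alternative is used regardless of the installed version's default.

```r
library(mlr3)

task = tsk("mtcars")
train_ids = 1:20
test_ids = 21:32

# baseline learner that ignores all features
lrn_fl = lrn("regr.featureless", robust = FALSE)
lrn_fl$train(task, row_ids = train_ids)
pred = lrn_fl$predict(task, row_ids = test_ids)

# mean of the target over the training rows
train_mean = mean(task$data(rows = train_ids, cols = "mpg")$mpg)

# every prediction should equal the training mean
all_equal_mean = isTRUE(all.equal(pred$response,
  rep(train_mean, length(test_ids))))
```

Any learner worth using should outperform this baseline, which is what makes it useful for model comparison.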

The mlr3learners package contains a selection of algorithms (and select implementations) chosen by the mlr team that we recommend as a good starting point for most experiments:

  • Linear (regr.lm) and logistic (classif.log_reg) regression.
  • Penalized Generalized Linear Models (regr.glmnet/classif.glmnet), optionally with built-in optimization of the penalization parameter (regr.cv_glmnet/classif.cv_glmnet).
  • Weighted \(k\)-Nearest Neighbors regression (regr.kknn/classif.kknn).
  • Kriging / Gaussian Process Regression (regr.km).
  • Linear (classif.lda) and Quadratic (classif.qda) Discriminant Analysis.
  • Naïve Bayes Classification (classif.naive_bayes).
  • Support-Vector machines (regr.svm/classif.svm).
  • Gradient Boosting (regr.xgboost/classif.xgboost).
  • Random Forests for regression and classification (regr.ranger/classif.ranger).

Most other learners are in mlr3extralearners. You can find an up-to-date list of learners here: https://mlr-org.com/learners.html.

The dictionary mlr_learners contains learners that are supported in loaded packages. You can list all learners by converting the mlr_learners dictionary into a data.table:

as.data.table(mlr_learners)
                     key                       label task_type
  1:  classif.AdaBoostM1           Adaptive Boosting   classif
  2:         classif.C50            Tree-based Model   classif
  3:         classif.IBk           Nearest Neighbour   classif
  4:         classif.J48            Tree-based Model   classif
  5:        classif.JRip Propositional Rule Learner.   classif
136: surv.priority_lasso              Priority Lasso      surv
137:         surv.ranger               Random Forest      surv
138:          surv.rfsrc               Random Forest      surv
139:            surv.svm      Support Vector Machine      surv
140:        surv.xgboost           Gradient Boosting      surv
4 variables not shown: [feature_types, packages, properties, predict_types]

The resulting data.table contains a lot of metadata that is useful for identifying learners with particular properties. For example, we can list all learners that support classification problems:

as.data.table(mlr_learners)[task_type == "classif"]
                   key                       label task_type
 1: classif.AdaBoostM1           Adaptive Boosting   classif
 2:        classif.C50            Tree-based Model   classif
 3:        classif.IBk           Nearest Neighbour   classif
 4:        classif.J48            Tree-based Model   classif
 5:       classif.JRip Propositional Rule Learner.   classif
41:     classif.ranger                        <NA>   classif
42:      classif.rfsrc               Random Forest   classif
43:      classif.rpart         Classification Tree   classif
44:        classif.svm                        <NA>   classif
45:    classif.xgboost                        <NA>   classif
4 variables not shown: [feature_types, packages, properties, predict_types]

We can filter by multiple conditions, for example to list all regression learners that can predict standard errors:

as.data.table(mlr_learners)[task_type == "regr" &
    sapply(predict_types, function(x) "se" %in% x)]
                key                                    label task_type
1:       regr.debug             Debug Learner for Regression      regr
2:       regr.earth Multivariate Adaptive Regression Splines      regr
3: regr.featureless           Featureless Regression Learner      regr
4:         regr.gam    Generalized Additive Regression Model      regr
5:         regr.glm            Generalized Linear Regression      regr
6:          regr.km                                     <NA>      regr
7:          regr.lm                                     <NA>      regr
8:         regr.mob       Model-based Recursive Partitioning      regr
9:      regr.ranger                                     <NA>      regr
4 variables not shown: [feature_types, packages, properties, predict_types]
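The same pattern extends to the other metadata columns. As a minimal sketch, the code below first uses the dictionary's `$keys()` method with a regular expression to get the keys of regression learners without building the full table, and then filters the table for classification learners that declare the "missings" property. The results depend on which packages are loaded, and the variable names are illustrative.

```r
library(mlr3)

# keys of all regression learners currently registered in the dictionary
regr_keys = mlr_learners$keys("^regr\\.")

# classification learners that declare support for missing values
learners_dt = as.data.table(mlr_learners)
missings_keys = learners_dt[task_type == "classif" &
    sapply(properties, function(x) "missings" %in% x), key]
```

List columns such as `properties` and `predict_types` hold one character vector per learner, which is why the conditions are wrapped in `sapply()`.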

2.8 Conclusion

In this chapter we covered the building blocks of mlr3. We first introduced basic ML methodology and then showed how this is implemented in mlr3. We began by looking at the Task class, which is used to define ML tasks or problems to solve. We then looked at the Learner class, which encapsulates ML algorithms, hyperparameters, and other meta-information. Finally, we considered how to evaluate ML models with objects from the Measure class. After looking at regression implementations, we extended all of the above to the classification setting, before finally looking at some extra details about tasks and the algorithms that are implemented across mlr3. The rest of this book will build on the basic elements seen in this chapter, starting with more advanced model comparison methods in Chapter 3 before moving to improving model performance with automated hyperparameter tuning in Chapter 4. Table 2.2 summarizes the most important functions and methods seen in this chapter.

Table 2.2: Important classes and functions covered in this chapter with underlying R6 class (if applicable), constructor to create an object of the class, and important class methods.
Underlying R6 Class   Constructor (if applicable)   Important methods
Task                  tsk()/tsks()/as_task_X        $filter()/$data()
Learner               lrn()/lrns()                  $train()/$predict()
Prediction            some_learner$predict()        $score()
Measure               msr()/msrs()
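As a compact recap, the entries of the table can be combined into one short workflow. This is only a sketch: the choice of task, learner, measure and seed is arbitrary and stands in for whatever combination is relevant to the problem at hand.

```r
library(mlr3)
set.seed(1)

# Task: predefined regression task
task = tsk("mtcars")

# random split into training and test row ids
split = partition(task)

# Learner: regression tree, trained on the training rows only
learner = lrn("regr.rpart")
learner$train(task, row_ids = split$train)

# Prediction: returned by the learner's $predict() method
prediction = learner$predict(task, row_ids = split$test)

# Measure: mean squared error, passed to $score()
measure = msr("regr.mse")
score = prediction$score(measure)
```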

2.9 Exercises

  1. Set the seed to 124 then train a classification tree model with classif.rpart and default hyperparameters on 80% of the data in the predefined sonar task. Evaluate the model’s performance with the classification error measure on the remaining data. Also think about why we need to set the seed in this example.
  2. Calculate the true positive, false positive, true negative, and false negative rates of the predictions made by the model in exercise 1.
  3. Change the threshold of the model from exercise 1 such that the false positive rate is lower than the false negative rate. What is one reason you might do this in practice?