Raphael Sonabend
Imperial College London
Patrick Schratz
Friedrich Schiller University Jena
Damir Pulatov
University of Wyoming
So far in this book we have only considered two tasks. In Chapter 2 we introduced deterministic regression as well as deterministic and probabilistic single-label classification (Table 8.1). But our infrastructure also works well for many other tasks, some of which are available in extension packages (Figure 1.1) and some of which are available by creating pipelines with mlr3pipelines. In this chapter, we will take you through just a subset of these new tasks, focusing on the ones that have a stable API. As we work through this chapter we will refer to the ‘building blocks’ of mlr3, by which we mean the base classes that must be extended to create new tasks: Task, Learner, Prediction, and Measure.
Table 8.1 summarizes available extension tasks, including the package(s) they are implemented in and a brief description of the task.

Task | Package(s) | Description
Deterministic regression | mlr3 | Point prediction of a continuous variable.
Deterministic single-label classification | mlr3 | Prediction of a single class for each observation.
Probabilistic single-label classification | mlr3 | Prediction of the probability of an observation falling into one or more mutually exclusive categories.
Cost-sensitive classification | mlr3 and mlr3pipelines | Classification predictions with unequal costs associated with misclassifications.
Survival analysis | mlr3proba | Time-to-event predictions with possible ‘censoring’.
Density estimation | mlr3proba | Unsupervised estimation of probability density functions.
Spatiotemporal analysis | mlr3spatiotempcv and mlr3spatial | Supervised prediction of data with spatial (e.g., coordinates) and/or temporal outcomes.
Cluster analysis | mlr3cluster | Unsupervised estimation of homogeneous clusters of data points.
We begin by discussing a task that does not require any additional packages or infrastructure, only the tools we have already learned about from earlier chapters. In ‘regular’ classification, the aim is to optimize a metric (often the misclassification rate) whilst assuming all misclassification errors are equally severe. A more general approach is cost-sensitive classification, in which the costs caused by different kinds of errors may not be equal. The objective of cost-sensitive classification is to minimize the expected costs. As we discuss this task we will work with mlr_tasks_german_credit (Appendix C) as a running example.
Imagine you are trying to calculate whether giving someone a loan of $5K will result in a profit after one year, assuming they are expected to pay back $6K. To make this calculation, you will need to predict whether the person will have good credit. This is a deterministic classification problem where we are predicting whether someone will be in class ‘good’ or ‘bad’. Now we can define the potential costs associated with each prediction and the eventual truth:
costs = matrix(c(-1, 0, 5, 0), nrow = 2, dimnames =
list("Predicted Credit" = c("good", "bad"),
Truth = c("good", "bad")))
costs
                Truth
Predicted Credit good bad
            good   -1   5
            bad     0   0
If the model predicts that the individual has bad credit then there is no profit or loss, as the loan is not provided. If the model predicts that the individual has good credit and they pay back the loan with interest then you make a $1K profit, but if they default then you lose $5K. Note that cost-sensitive classification is a minimization problem where we assume lower costs correspond to higher profits/positive outcomes, hence above we wrote the profit and loss as -1 and +5 respectively.
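For intuition, the expected-cost arithmetic is just an element-wise product of a confusion matrix with the cost matrix, averaged over the number of observations. Below is a base-R sketch with an invented confusion matrix for 100 applicants; we assume this mirrors what a cost measure with normalization computes, but it is an illustration, not the measure's internals.

```r
# Cost matrix as defined above: -1 profit for a correct 'good', 5 loss for
# a 'bad' customer predicted as 'good', 0 when no loan is given.
costs = matrix(c(-1, 0, 5, 0), nrow = 2,
  dimnames = list("Predicted Credit" = c("good", "bad"),
                  Truth = c("good", "bad")))

# Hypothetical confusion matrix for 100 loan applicants (invented numbers)
confusion = matrix(c(60, 10, 8, 22), nrow = 2,
  dimnames = list("Predicted Credit" = c("good", "bad"),
                  Truth = c("good", "bad")))

# Average cost per observation: element-wise product, summed, normalized
avg_cost = sum(confusion * costs) / sum(confusion)
avg_cost  # (60 * -1 + 8 * 5) / 100 = -0.2, i.e. an expected profit
```

A negative average cost corresponds to an expected profit under the minimization convention described above.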
We will now see how to implement a more nuanced approach to classification errors with MeasureClassifCosts. This measure takes one argument, which is a matrix with row and column names corresponding to the class labels in the task of interest. Let us put our loan example into practice; notice that we have already named the cost matrix as required for the measure:
library(mlr3verse)
task = tsk("german_credit")
cost_measure = msr("classif.costs", costs = costs)
cost_measure
<MeasureClassifCosts:classif.costs>: Cost-sensitive Classification
* Packages: mlr3
* Range: [-Inf, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: normalize=TRUE
* Properties: -
* Predict type: response
learners = lrns(c("classif.log_reg", "classif.featureless", "classif.ranger"))
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(cost_measure)[, c(4, 7)]
learner_id classif.costs
1: classif.log_reg 0.1790683
2: classif.featureless 0.8001654
3: classif.ranger 0.2491294
In this experiment we find that the logistic regression learner happens to perform best as it minimizes the expected costs (and maximizes expected profits), while the featureless learner performs worst. However, all costs are positive, which actually means a loss is made, so let us now see if we can improve these models by using thresholding.
As we discussed in Chapter 2, thresholding is a method to fine-tune the probability at which an observation will be predicted as one class label or another. Currently in our running example, the models above will predict a customer has good credit (in the class ‘good’) if the probability of good credit is greater than 0.5. However, this may not be a sensible approach as the costs of a false positive and a false negative are not equal. In fact, a false positive results in a cost of +5 whereas a false negative only results in a cost of 0, hence we would prefer a model with a higher false negative rate and a lower false positive rate. This is highlighted in the "threshold" autoplot:
prediction = lrn("classif.log_reg",
predict_type = "prob")$train(task)$predict(task)
autoplot(prediction, type = "threshold", measure = cost_measure)
As expected, the optimal threshold is greater than 0.5, indicating that to maximize profits, the majority of observations should be predicted to have bad credit.
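The idea behind picking a better threshold can be sketched in base R: scan a grid of thresholds and keep the one minimizing average cost. The probabilities and labels below are simulated purely for illustration; they are not the german_credit predictions.

```r
# Sketch: cost-minimizing threshold by grid search (simulated data)
set.seed(1)
prob_good = runif(200)  # predicted P(good) for 200 hypothetical customers
truth = ifelse(runif(200) < prob_good, "good", "bad")  # synthetic truth

avg_cost = function(threshold) {
  pred = ifelse(prob_good >= threshold, "good", "bad")
  # -1 for a correct 'good', +5 for a 'bad' predicted 'good', 0 otherwise
  cost = ifelse(pred == "bad", 0, ifelse(truth == "good", -1, 5))
  mean(cost)
}

thresholds = seq(0.05, 0.95, by = 0.05)
costs_by_t = vapply(thresholds, avg_cost, numeric(1))
best = thresholds[which.min(costs_by_t)]
best
```

With these costs, predicting ‘good’ only pays off when \(-1 \cdot p + 5(1-p) < 0\), i.e., \(p > 5/6\), so the cost-minimizing threshold lands well above 0.5, mirroring the message of the plot above.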
To automate the process of optimizing the threshold, we can make use of mlr3tuning (Chapter 4) and mlr3pipelines (Chapter 6) by creating a graph with PipeOpTuneThreshold. Continuing the same example:
gr = po("learner_cv", lrn("classif.log_reg", predict_type = "prob")) %>>%
po("tunethreshold", measure = cost_measure)
learners = list(as_learner(gr), lrn("classif.log_reg"))
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(cost_measure)[, c(4, 7)]
learner_id classif.costs
1: classif.log_reg.tunethreshold -0.1060282
2: classif.log_reg 0.1481062
As expected, our tuned learner performs much better and now we expect a profit and not a loss.
Survival analysis is a field of statistics concerned with trying to predict/estimate the time until an event takes place. This predictive problem is unique as survival models are trained and tested on data that may include ‘censoring’, which occurs when the event of interest does not take place. Survival analysis can be hard to explain in the abstract, so as a working example consider a marathon runner in a race. Here the ‘survival problem’ is trying to predict the time when the marathon runner finishes the race. However, if the event of interest does not take place (i.e., the marathon runner gives up and does not finish the race), the observation is said to be censored. Instead of throwing away information about censored events, survival analysis datasets include a status variable that provides information about the ‘status’ of an observation. So in our example we might write the runner’s outcome as \((4, 1)\) if they finish the race at 4 hours, otherwise if they give up at 2 hours we would write \((2, 0)\).
The key to modelling in survival analysis is that we assume there exists a hypothetical time the marathon runner would have finished if they had not been censored; it is then the job of a survival learner to estimate what the true survival time would have been for a similar runner, assuming they are not censored (see Figure 8.1). Mathematically, this is represented by the hypothetical event time, \(Y\), the hypothetical censoring time, \(C\), the observed outcome time, \(T = \min(Y, C)\), the event indicator, \(\Delta = (T = Y)\), and, as usual, some features, \(X\). Learners are trained on \((T, \Delta)\) but, critically, make predictions of \(Y\) from previously unseen features. This means that, unlike classification and regression, learners are trained on two variables, \((T, \Delta)\), which, in R, are often captured in a survival::Surv object. Relating to our example above, the runner’s outcome would then be \((T = 4, \Delta = 1)\) or \((T = 2, \Delta = 0)\). Another example is given in the code below, where we randomly generate six survival times and six event indicators; an outcome with a + indicates that the outcome is censored, otherwise the event of interest occurred.
[1] 0.5522635+ 0.2905186 0.4404405+ 0.1184443 0.9216186+ 0.7325895
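The construction of \((T, \Delta)\) from hypothetical event and censoring times can be sketched in base R; the exponential distributions below are arbitrary choices for illustration, not the generating process used above.

```r
# Sketch: censored survival outcomes (T, Delta) from hypothetical
# event times Y and censoring times C (illustrative distributions).
set.seed(42)
Y = rexp(6, rate = 1)       # hypothetical event times
C = rexp(6, rate = 0.5)     # hypothetical censoring times
T_obs = pmin(Y, C)          # observed time T = min(Y, C)
delta = as.integer(Y <= C)  # event indicator: 1 = event observed, 0 = censored
data.frame(T_obs = round(T_obs, 3), delta = delta)
```

These two columns are exactly what survival::Surv(T_obs, delta) bundles into a single outcome object.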
Readers familiar with survival analysis will recognize that the description above applies specifically to ‘right censoring’. Currently, this is the only form of censoring available in the mlr3 universe, hence we restrict our discussion to that setting. For a good introduction to survival analysis see Collett (2014), or for machine learning in survival analysis specifically see R. Sonabend and Bender (2023).
For the remainder of this section we will look at how mlr3proba (R. Sonabend et al. 2021) extends the building blocks of mlr3 for survival analysis. We will begin by looking at the objects used to construct machine learning tasks for survival analysis, then turn to the learners implemented to solve these tasks, before looking at measures for evaluating survival analysis predictions, and finally considering how to transform prediction types.
As we saw in the introduction to this section, survival algorithms require two targets for training, which means the new TaskSurv object expects two targets. The simplest way to create a survival task is to use as_task_surv, as in the following code chunk. Note that this has more arguments than as_task_regr to reflect multiple targets and censoring types: the time and event arguments expect strings representing the column names where the ‘time’ and ‘event’ variables are stored, while type refers to the censoring type (currently only right censoring is supported, so this is the default). Note how as_task_surv coerces the target columns into a survival::Surv object.
library("mlr3verse")
library("mlr3proba")
library("survival")
task = as_task_surv(survival::rats, time = "time",
event = "status", type = "right", id = "rats")
task$head()
time status litter rx sex
1: 101 0 1 1 f
2: 49 1 1 0 f
3: 104 0 1 0 f
4: 91 0 2 1 m
5: 104 0 2 0 m
6: 102 0 2 0 m
Plotting the task with autoplot results in a Kaplan-Meier plot, a non-parametric estimator of the probability of survival for the average observation in the training set.
autoplot(task)
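For intuition, the Kaplan-Meier estimate can be computed by hand: at each event time \(t\) the running survival probability is multiplied by \(1 - d_t/n_t\), where \(d_t\) is the number of events at \(t\) and \(n_t\) the number at risk just before \(t\). Below is a base-R sketch on a tiny invented sample, not the rats data.

```r
# Sketch of the Kaplan-Meier estimator on a tiny right-censored sample
time   = c(3, 5, 5, 8, 10, 12)
status = c(1, 1, 0, 1, 0, 1)  # 1 = event observed, 0 = censored

event_times = sort(unique(time[status == 1]))
factors = vapply(event_times, function(t) {
  d = sum(time == t & status == 1)  # events at t
  n = sum(time >= t)                # at risk just before t
  1 - d / n
}, numeric(1))
S = cumprod(factors)  # survival probability after each event time
data.frame(time = event_times, surv = round(S, 3))
```

For this sample the curve drops to 0.833, 0.667, 0.444 and finally 0; the same numbers can be obtained with survival::survfit on a Surv object.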
In the above example we used the survival::rats dataset, which looks at predicting if a drug treatment was successful in preventing 150 rats from developing tumors. Note that the dataset (by its own admission) is not perfect and should generally be treated as ‘dummy’ data, which is good for examples but not real-world analysis.
As well as creating your own tasks, you can load any of the tasks shipped with mlr3proba:
as.data.table(mlr_tasks)[task_type == "surv"]
key label task_type nrow ncol properties lgl int
1: actg ACTG 320 surv 1151 13 0 3
2: gbcs German Breast Cancer surv 686 10 0 4
3: grace GRACE 1000 surv 1000 8 0 2
4: lung Lung Cancer surv 228 10 0 7
5: rats Rats surv 300 5 0 2
6: unemployment Unemployment Duration surv 3343 6 0 1
7: whas Worcester Heart Attack surv 481 11 0 4
5 variables not shown: [dbl, chr, fct, ord, pxc]
The interface for LearnerSurv and PredictionSurv objects is identical to the regression and classification settings discussed in Chapter 2. Similarly to those settings, survival learners are constructed with lrn(); available learners are listed in Appendix D.
mlr3proba has a different predict interface to mlr3: all possible types of prediction (‘predict types’) are returned when possible for all survival models, i.e., if a model can compute a particular predict type then it is returned in PredictionSurv. The reason for this design decision is that all these predict types can be transformed to one another, and it is therefore computationally simpler to return all at once instead of re-running models to change the predict type. In survival analysis, the following predictions can be made:
- response: Predicted survival time.
- distr: Predicted survival distribution, either discrete or continuous.
- lp: Linear predictor calculated as the fitted coefficients multiplied by the test data.
- crank: Continuous risk ranking.

We will go through each of these prediction types in more detail and with examples to make them less abstract. We will use the following setup for most of the examples. In this chunk we are partitioning the survival::rats task, training a Cox Proportional Hazards model (mlr_learners_surv.coxph) on the training set, and making predictions for the predict set. For this model, all predict types except response can be computed.
t = tsk("rats")
split = partition(t)
p = lrn("surv.coxph")$train(t, split$train)$predict(t, split$test)
p
<PredictionSurv> for 99 observations:
row_ids time status crank lp distr
8 102 FALSE 0.1577286 0.1577286 <list[1]>
16 98 FALSE 1.9548867 1.9548867 <list[1]>
24 76 FALSE 2.7149884 2.7149884 <list[1]>

241 72 TRUE 0.8826867 0.8826867 <list[1]>
247 73 TRUE 0.8896945 0.8896945 <list[1]>
249 66 TRUE 0.1225849 0.1225849 <list[1]>
predict_type = "response"
Counterintuitively for many, the response
prediction of predicted survival times is actually the least common predict type in survival analysis. The likely reason for this is due to the presence of censoring. We rarely observe the true survival time for many observations and therefore it is unlikely any survival model can confidently make predictions for survival times. This is illustrated in the code below.
In the example below we train and predict from a survival support vector machine (mlr_learners_surv.svm). Note that we use type = "regression" to select the algorithm that optimizes survival time predictions; gamma.mu = 1e-3 is selected arbitrarily as this is a required parameter (in practice this parameter should be tuned). We then compare the predictions from the model to the true data.
library(mlr3extralearners)
pred = lrn("surv.svm", type = "regression", gamma.mu = 1e-3)$
train(t, split$train)$predict(t, split$test)
data.frame(pred = pred$response[1:3], truth = pred$truth[1:3])
pred truth
1 87.56067 102+
2 86.97710 98+
3 86.58935 76+
As can be seen from the output, the first two predictions are less than the observed censoring times, so we know the model underestimated the truth for those observations. However, because each of the true values is a censored time, we have absolutely no way of knowing whether these predictions are slightly bad or absolutely terrible (i.e., the true survival times could be \(105, 99, 92\) or they could be \(300, 1000, 200\)). Hence, with no realistic way to evaluate these models, survival time predictions are rarely useful.
predict_type = "distr"
So, unlike regression, in which deterministic/point predictions are most common, in survival analysis distribution predictions are much more common. You will therefore find that the majority of survival models in mlr3proba make distribution predictions by default. These predictions are implemented using the distr6 package, which allows visualization and evaluation of survival curves (defined as 1 - cumulative distribution function). In the example below we train a Cox PH model on the rats dataset and then evaluate the survival function of three predictions at \(t = 77\).
t = tsk("rats")
split = partition(t)
p = lrn("surv.coxph")$train(t, split$train)$predict(t, split$test)
p$distr[1:3]$survival(77)
[,1] [,2] [,3]
77 0.959636 0.9925459 0.9586408
The output indicates that there is a 96.0%, 99.3%, and 95.9% chance, respectively, of the first three predicted rats being alive at time 77.
predict_type = "lp"
lp, often written as \(\eta\) in academic writing, is computationally the simplest prediction and has a natural analogue in regression modelling. Readers familiar with linear regression will know that when fitting a simple linear regression model, \(Y = X\beta\), we are actually estimating the values of \(\beta\), and the estimated linear predictor (lp) is then \(X\hat{\beta}\), where \(\hat{\beta}\) are our estimated coefficients. In simple survival models, the linear predictor is the same quantity (but estimated in a slightly more complicated way). The learner implementations in mlr3proba are primarily machine-learning focused, and few of these models have a simple linear form, which means that lp cannot be computed for most of them. In practice, when used for prediction, lp is a proxy for a relative risk/continuous ranking prediction, which is discussed next.
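The idea can be illustrated with an ordinary linear model in base R. This is only an illustration of the lp concept: survival models estimate \(\hat{\beta}\) differently and typically omit the intercept, but the linear predictor has the same \(X\hat{\beta}\) form.

```r
# Illustrative only: lp = X %*% beta-hat, shown with lm() on the built-in
# 'cars' data (survival learners estimate beta differently).
fit = lm(dist ~ speed, data = cars)
beta_hat = coef(fit)
X = cbind(1, cars$speed[1:3])  # design matrix rows for three observations
lp = drop(X %*% beta_hat)
all.equal(lp, unname(predict(fit)[1:3]))  # TRUE
```

For a linear model the linear predictor and the fitted values coincide; for survival models, lp is reported on its own because there is no single point prediction it maps to.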
predict_type = "crank"
The final prediction type, crank, is the most common in survival analysis and perhaps also the most confusing. Academic texts will often refer to ‘risk’ predictions in survival analysis (hence why survival models are often known as ‘risk prediction models’), without defining what ‘risk’ means. Often risk is defined as \(\exp(\eta)\), as this is a common quantity found in simple linear survival models. However, sometimes risk is defined as \(\exp(-\eta)\), and sometimes it can be an arbitrary quantity that does not have a meaningful interpretation. To prevent this confusion in mlr3proba, we define the predict type crank, which stands for continuous ranking. This is best explained by example.
The interpretation of ‘risk’ for survival predictions differs across R packages and sometimes even between models in the same package. In mlr3proba there is one consistent interpretation of crank: lower values represent a lower risk of the event taking place and higher values represent higher risk.
Continuing from the previous example, we output the first three crank predictions. The output tells us that the second rat is at the highest risk of death (larger values represent higher risk) and the third rat is at the lowest risk of death. The distances between predictions also tell us that the difference in risk between the first and second rat is smaller than the difference between the second and third. The actual values themselves are meaningless, and therefore comparing crank values between samples (or papers or experiments) is not meaningful.
p$crank[1:3]
1 2 3
0.5893286 2.2952962 0.5644561
The crank prediction type is informative and common in practice because it allows identifying observations at lower/higher risk relative to each other, which is useful for resource allocation and prioritization (e.g., which patient should be given an expensive treatment) and clinical trials (e.g., are people in the treatment arm at lower risk of disease X than people in the control arm).
Survival models in mlr3proba are evaluated with MeasureSurv objects, which are constructed in the usual way with msr(); the measures currently implemented are listed in Appendix D.
In general, survival measures can be grouped into the following:

- Discrimination measures, which quantify whether a model correctly identifies if one observation is at higher risk than another; these evaluate crank and/or lp predictions.
- Calibration measures, which quantify whether the average prediction is close to the truth; these evaluate crank and/or distr predictions.
- Scoring rules, which quantify whether probabilistic predictions are close to the true values; these evaluate distr predictions.

head(as.data.table(mlr_measures)[
task_type == "surv", c("key", "predict_type")])
key predict_type
1: surv.brier distr
2: surv.calib_alpha distr
3: surv.calib_beta lp
4: surv.chambless_auc lp
5: surv.cindex crank
6: surv.dcalib distr
There is a lot of debate in the literature around the ‘best’ survival measures to use to evaluate models. As a general rule, we recommend the RCLL (mlr_measures_surv.rcll) to evaluate the quality of distr predictions, the concordance index (mlr_measures_surv.cindex) to evaluate a model’s discrimination, and D-Calibration (mlr_measures_surv.dcalib) to evaluate a model’s calibration.
We can now evaluate our predictions from the previous example. In the code below we use the recommended measures and find that this model’s performance seems okay, as the RCLL and D-Calibration are relatively close to 0 and the C-index is greater than 0.5.
Throughout the mlr3proba documentation we refer to “native” and “composed” predictions. We define a ‘native’ prediction as the prediction made by a model without any post-processing, whereas a ‘composed’ prediction is one that is returned after post-processing.
In mlr3proba we make use of composition internally to return a "crank" prediction for every learner. This is to ensure that we can meaningfully benchmark all models according to at least one criterion. The package uses the following rules to create "crank" predictions:

- If the model makes a risk prediction then we set crank = risk (we may multiply this by \(-1\) to ensure the ‘low value low risk’ interpretation).
- Else, if the model makes a response prediction then we set crank = response.
- Else, if the model makes an lp prediction then we set crank = lp (or crank = -lp if needed).
- Else, if the model makes a distr prediction then we set crank as the sum of the cumulative hazard function (see R. Sonabend, Bender, and Vollmer (2022) for a full discussion of why we picked this method).

At the start of this section we mentioned that it is possible to transform prediction types between each other. In mlr3proba this is possible with ‘compositor’ pipelines (Chapter 6). There are a number of pipelines implemented in the package, but two in particular focus on predict type transformation:
- pipeline_crankcompositor(): transforms a "distr" prediction to "crank"; and
- pipeline_distrcompositor(): transforms an "lp" prediction to "distr".

In practice, the second pipeline is more common, as we internally use a version of the first pipeline whenever we return predictions from survival models (so only use the first pipeline to overwrite these ranking predictions); hence here we will just look at the second pipeline.
In the example below we load the rats dataset, remove factor columns, and then partition the data into training and testing sets. We construct the distrcompositor pipeline around a survival GLMnet learner (mlr_learners_surv.glmnet), which by default can only make predictions for "lp" and "crank". In the pipeline we specify that we will estimate the baseline distribution with a Kaplan-Meier estimator (estimator = "kaplan") and that we want to assume a proportional hazards form for our estimated distribution. We then train and predict in the usual way, and in our output we can now see a distr prediction.
library(mlr3verse)
library(mlr3extralearners)
task = tsk("rats")
task$select(c("litter", "rx"))
split = partition(task)
learner = lrn("surv.glmnet")
# no distr output
learner$train(task, split$train)$predict(task, split$test)
<PredictionSurv> for 99 observations:
row_ids time status crank.1 lp.1
1 101 FALSE 0.707318789 0.707318789
3 104 FALSE 0.003141698 0.003141698
4 91 FALSE 0.710460487 0.710460487

230 84 TRUE 0.241910708 0.241910708
235 80 TRUE 0.952371195 0.952371195
293 75 TRUE 0.307886356 0.307886356
pipe = as_learner(ppl(
"distrcompositor",
learner = learner,
estimator = "kaplan",
form = "ph"
))
# now with distr
pipe$train(task, split$train)$predict(task, split$test)
<PredictionSurv> for 99 observations:
row_ids time status crank.1 lp.1 distr
1 101 FALSE 0.707318789 0.707318789 <list[1]>
3 104 FALSE 0.003141698 0.003141698 <list[1]>
4 91 FALSE 0.710460487 0.710460487 <list[1]>

230 84 TRUE 0.241910708 0.241910708 <list[1]>
235 80 TRUE 0.952371195 0.952371195 <list[1]>
293 75 TRUE 0.307886356 0.307886356 <list[1]>
Mathematically, we have estimated the baseline survival function, \(S_0\), with the Kaplan-Meier estimator and then composed a distribution prediction under a proportional hazards assumption, \(S_i(t) = S_0(t)^{\exp(\eta_i)}\), where \(\eta_i\) is the linear predictor for observation \(i\).
For more details about prediction types and compositions we recommend Kalbfleisch and Prentice (2011).
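The proportional hazards composition can be sketched numerically in base R: each predicted survival curve is the baseline curve raised to the power \(\exp(\eta_i)\). The baseline probabilities and linear predictors below are invented for illustration.

```r
# Sketch of the "ph" composition: S_i(t) = S_0(t) ^ exp(lp_i)
S0 = c(0.95, 0.85, 0.70, 0.50)  # baseline survival on a time grid
lp = c(-1, 0, 2)                # linear predictors for three observations

S = sapply(lp, function(eta) S0 ^ exp(eta))
round(S, 3)  # columns are observations; higher lp gives lower survival
```

An observation with lp = 0 reproduces the baseline exactly, while larger linear predictors push the whole survival curve downwards, which is exactly the proportional hazards behavior assumed by the pipeline.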
Finally, we will put all the above into practice in a small benchmark experiment. We first load the grace dataset and subset it to the first 500 rows. We then select the RCLL, D-Calibration, and C-index to evaluate predictions, set up the same pipeline we used in the previous experiment, and load a Cox PH and a Kaplan-Meier estimator. We run our experiment with 3-fold cross-validation and aggregate the results.
library(mlr3verse)
library(mlr3extralearners)
task = tsk("grace")$filter(1:500)
msr_txt = c("surv.rcll", "surv.cindex", "surv.dcalib")
measures = msrs(msr_txt)
pipe = as_learner(ppl(
"distrcompositor",
learner = lrn("surv.glmnet"),
estimator = "kaplan",
form = "ph"
))
pipe$id = "Coxnet"
learners = c(lrns(c("surv.coxph", "surv.kaplan")), pipe)
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measures)[, c("learner_id", ..msr_txt)]
learner_id surv.rcll surv.cindex surv.dcalib
1: surv.coxph 2.937825 0.8016593 2.004290
2: surv.kaplan 3.094982 0.5000000 3.224777
3: Coxnet 2.940040 0.8021212 5.658987
In this small experiment, Coxnet and Cox PH have the best discrimination with \(C \approx 0.80\), Cox PH has the best calibration (as its D-Calibration is closest to 0), and Coxnet and Cox PH have similar overall predictive accuracy (with the lowest RCLL).
Density estimation is the learning task of estimating the unknown distribution from which a univariate dataset is generated or, put more simply, estimating the probability density (or mass) function of a single variable. As with survival analysis, density estimation is implemented in mlr3proba, as both tasks make probability distribution predictions (hence the name “mlr3probabilistic”). Unconditional density estimation (i.e., estimation of a target without any covariates) is viewed as an unsupervised task, which means the ‘truth’ is never known. For a good overview of density estimation see Silverman (1986).
The package mlr3proba extends mlr3 with the following objects for density estimation:

- TaskDens to define density tasks.
- LearnerDens as the base class for density estimators.
- PredictionDens for density predictions.
- MeasureDens as a specialized class for density performance measures.

We will consider each in turn.
As density estimation is an unsupervised task, there is no target for prediction. In the code below we construct a density task using as_task_dens, which takes one argument: a data.frame type object with exactly one column (which we will use to estimate the underlying distribution).
task = as_task_dens(data.table(x = rnorm(1000)))
task
<TaskDens:data.table(x = rnorm(1000))> (1000 x 1)
* Target: 
* Properties: 
* Features (1):
 dbl (1): x
As with other tasks, a couple of density tasks come shipped with mlr3proba:
as.data.table(mlr_tasks)[task_type == "dens", c(1:2, 4:5)]
key label nrow ncol
1: faithful Old Faithful Eruptions 272 1
2: precip Annual Precipitation 70 1
LearnerDens and PredictionDens

Density learners can make one of three possible prediction types:

- distr: probability distribution.
- pdf: probability density function.
- cdf: cumulative distribution function.

All learners will return a distr and pdf prediction, but only some can make cdf predictions. Again, the distr predict type is implemented using distr6.
learn = lrn("dens.hist")
p = learn$train(task, 1:900)$predict(task, 901:1000)
x = seq.int(-2, 2, 0.01)
ggplot(data.frame(x = x, y = p$distr$pdf(x)), aes(x = x, y = y)) +
geom_line() + theme_minimal()
The pdf and cdf predict types are simply wrappers around distr$pdf and distr$cdf respectively, which is best demonstrated by example:
learn = lrn("dens.hist")
p = learn$train(task, 1:10)$predict(task, 11:13)
p
<PredictionDens> for 3 observations:
row_ids pdf cdf distr
11 0.2 0.5933786 <Distribution[39]>
12 0.8 0.3599538 <Distribution[39]>
13 0.0 0.0000000 <Distribution[39]>
cbind(p$distr$cdf(task$data()$x[11:13]), p$cdf[1:3])
[,1] [,2]
[1,] 0.5933786 0.5933786
[2,] 0.3599538 0.3599538
[3,] 0.0000000 0.0000000
The reason for returning pdf and cdf in this way is to support measures for evaluating the quality of our estimations, which we will return to in the next section.
MeasureDens

Currently the only measure implemented in mlr3proba for density estimation is the log-loss, defined in the same way as in classification: \(L(y) = -\log(\hat{f}_Y(y))\), where \(\hat{f}_Y\) is our estimated probability density function. Putting this together with the above, we are now ready to train a density learner, estimate a distribution, and evaluate our estimation:
meas = msr("dens.logloss")
meas
<MeasureDensLogloss:dens.logloss>: Log Loss
* Packages: mlr3, mlr3proba
* Range: [0, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: eps=1e-15
* Properties: 
* Predict type: pdf
p$score(meas)
dens.logloss
12.12379
As with any scoring rule this output cannot be interpreted by itself and is more useful in a benchmark experiment, which we will look at in the final part of this section.
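For intuition, the score above can be reproduced by hand from the pdf column, assuming (as the measure output suggests) that zero densities are clipped at the eps parameter before taking logs.

```r
# Reproducing the log-loss by hand from the pdf values predicted above:
# L(y) = -log(f-hat(y)), with zero densities clipped at eps.
pdf = c(0.2, 0.8, 0.0)
eps = 1e-15
logloss = mean(-log(pmax(pdf, eps)))
logloss  # 12.12379
```

The very large contribution comes entirely from the third observation, which fell in a histogram bin with zero estimated density; this sensitivity to zero-density regions is a known property of the log-loss.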
Finally, we conduct a small benchmark study on the mlr_tasks_faithful task using some of the integrated density learners:
library(mlr3extralearners)
task = tsk("faithful")
learners = lrns(c("dens.hist", "dens.pen", "dens.kde"))
measure = msr("dens.logloss")
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measure)
autoplot(bmr, measure = measure)
The results of this experiment show that the sophisticated Penalized Density Estimator does not outperform the baseline histogram, whereas the Kernel Density Estimator achieves consistently better (i.e., lower log-loss) results.
Cluster analysis is another unsupervised task implemented in mlr3. The objective of cluster analysis is to group data into clusters, where each cluster contains similar observations. The similarity is based on specified metrics that are task- and application-dependent. Unlike classification, where we try to predict a class for each observation, in cluster analysis there is no ‘true’ label or class to predict.
The package mlr3cluster extends mlr3 with the following objects for cluster analysis:

- TaskClust to define clustering tasks.
- LearnerClust as the base class for clustering learners.
- PredictionClust as a specialized class for Prediction objects.
- MeasureClust as a specialized class for performance measures.

We will consider each in turn.
Similarly to density estimation (Section 8.3), there is no target for prediction and so no truth field in TaskClust. Let us look at the cluster::ruspini dataset, often used for cluster analysis examples.
The dataset has 75 rows and two columns and was first introduced in Ruspini (1970) to illustrate different clustering techniques. As we will see from the plots below, the observations form four natural clusters.
In the code below we construct a cluster task using as_task_clust, which takes only one argument: a data.frame type object.
library(mlr3verse)
library(cluster)
task = as_task_clust(ruspini)
task
<TaskClust:ruspini> (75 x 2)
* Target: 
* Properties: 
* Features (2):
 int (2): x, y
autoplot(task)
Technically, we did not need to create a new task for the ruspini dataset since it is already included in the package. Currently there are two clustering tasks shipped with mlr3cluster:
as.data.table(mlr_tasks)[task_type == "clust", c(1:2, 4:5)]
key label nrow ncol
1: ruspini Ruspini 75 2
2: usarrests US Arrests 50 4
As with density estimation, we refer to training and predicting for clustering to be consistent with the mlr3 interface, but strictly speaking this should be clustering and assigning (the latter we will return to shortly). Two predict_types are available for clustering learners:

- partition: an estimate of which cluster an observation falls into.
- prob: the probability of an observation belonging to each cluster.

Hence, similarly to classification, prediction types of clustering learners are either deterministic (partition) or probabilistic (prob).
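For intuition, a prob prediction can be reduced to a partition by taking the most probable cluster for each row. Below is a base-R sketch with an invented membership matrix; it illustrates the relationship between the two predict types rather than mlr3cluster internals.

```r
# Sketch: membership probabilities (rows = observations, cols = clusters)
prob = matrix(c(0.7, 0.2, 0.1,
                0.1, 0.3, 0.6,
                0.2, 0.5, 0.3), nrow = 3, byrow = TRUE)
partition = max.col(prob)  # most probable cluster per observation
partition  # 1 3 2
```

This mirrors how a deterministic assignment can always be derived from a probabilistic one, while the reverse is not possible.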
Below we construct a C-Means clustering learner with the prob prediction type and 3 clusters, train it on the cluster::ruspini dataset, and then return the cluster assignments ($assignments) for each observation.
learner = lrn("clust.cmeans", predict_type = "prob", centers = 3)
learner
<LearnerClustCMeans:clust.cmeans>: Fuzzy C-Means Clustering Learner
* Model: 
* Parameters: centers=3
* Packages: mlr3, mlr3cluster, e1071
* Predict Types: partition, [prob]
* Feature Types: logical, integer, numeric
* Properties: complete, fuzzy, partitional
learner$train(task)
learner$assignments[1:6]
[1] 2 2 2 2 2 2
As clustering is unsupervised, it often does not make sense to use predict on new data; however, this is still possible using the mlr3 interface.
# using different data for estimation (rare use case)
learner$train(task, 1:30)$predict(task, 31:32)
<PredictionClust> for 2 observations:
row_ids partition prob.1 prob.2 prob.3
31 3 0.2746426 0.008320118 0.7170373
32 3 0.3716548 0.006463253 0.6218819
# using same data for estimation (common use case)
prediction = learner$train(task)$predict(task)
autoplot(prediction, task)
Whilst two prediction types are possible, there are some learners where ‘prediction’ can never make sense, for example in hierarchical clustering. In hierarchical clustering, the goal is to build a hierarchy of nested clusters by either splitting large clusters into smaller ones or merging smaller clusters into bigger ones. The final result is a tree or dendrogram, which can change if a new data point is added. For consistency, mlr3cluster offers a predict method for hierarchical clustering learners, but with a warning:
learner = lrn("clust.hclust")
learner$train(task)
learner$predict(task)
Warning: Learner 'clust.hclust' doesn't predict on new data and predictions may
not make sense on new data
<PredictionClust> for 75 observations:
row_ids partition
1 1
2 1
3 1

73 1
74 1
75 1
autoplot(learner) + theme(axis.text = element_text(size = 3.5))
In this case, the predict method simply cuts the dendrogram into the number of clusters specified by the k parameter of the learner.
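For instance, the sketch below (assuming the ruspini task, with all other hclust settings left at their defaults) shows that changing k changes the number of clusters in the returned partition:

```r
library(mlr3)
library(mlr3cluster)

task = tsk("ruspini")
# k determines where the dendrogram is cut at predict time
learner = lrn("clust.hclust", k = 4)
learner$train(task)
# suppress the 'does not predict on new data' warning shown above
prediction = suppressWarnings(learner$predict(task))
table(prediction$partition)  # four groups of observations
```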
As previously discussed, unsupervised tasks do not have ground truth data to compare to in model evaluation. However, we can still measure the quality of cluster assignments by quantifying how closely objects within the same cluster are related (cluster cohesion) as well as how distinct different clusters are from each other (cluster separation). There are a few built-in evaluation metrics available to assess the quality of clustering, see Appendix D.
Two common measures are the within sum of squares (WSS) measure, mlr_measures_clust.wss, and the silhouette coefficient, mlr_measures_clust.silhouette. WSS calculates the sum of squared differences between observations and centroids, which is a quantification of cluster cohesion (smaller values indicate more compact clusters). The silhouette coefficient quantifies how well each point belongs to its assigned cluster versus neighboring clusters, where scores closer to 1 indicate well-clustered observations and scores closer to -1 indicate poorly clustered ones. Note that the silhouette measure in mlr3cluster returns the mean silhouette score across all observations, and when there is only a single cluster, the measure simply outputs 0.
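To make WSS concrete, here is a small base-R sketch (independent of mlr3cluster) that computes it by hand and checks the result against the value reported by kmeans():

```r
set.seed(1)
# two well-separated blobs of points
x = rbind(matrix(rnorm(50, mean = 0), ncol = 2),
          matrix(rnorm(50, mean = 5), ncol = 2))
km = kmeans(x, centers = 2)
# WSS: squared distance of each observation to its assigned centroid, summed
wss = sum((x - km$centers[km$cluster, ])^2)
all.equal(wss, km$tot.withinss)  # TRUE
```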
Putting this together with the above we can now score our cluster estimation (note we must pass the task to $score):
measures = msrs(c("clust.wss", "clust.silhouette"))
prediction$score(measures, task = task)
clust.wss clust.silhouette
5.115541e+04 6.413923e-01
The very high WSS and middling mean silhouette coefficient indicate that our clusters could do with a bit more work. Often reducing an unsupervised task to a quantitative measure may not be useful (given no ground truth) and instead visualization (discussed below) may be a more effective tool for assessing the quality of the clusters.
As clustering is an unsupervised task, visualization can be essential not just for ‘evaluating’ models but also for determining if our learners are performing as expected for our task. The next section will look at visualizations for supporting clustering choices and following that we will consider plots for evaluating model performance.
It is easy to rely on clustering measures to assess the quality of clustering. However, this should be done with some care; for example, consider cluster analysis on the following dataset.
spirals = mlbench::mlbench.spirals(n = 300, sd = 0.01)
task = as_task_clust(as.data.frame(spirals$x))
autoplot(task)
Now we fit two clustering learners.
learners = list(
lrn("clust.kmeans"),
lrn("clust.dbscan", eps = 0.1)
)
measures = list(msr("clust.silhouette"))
bmr = benchmark(benchmark_grid(task, learners, rsmp("insample")))
bmr$aggregate(measures)[, c(4, 7)]
learner_id clust.silhouette
1: clust.kmeans 0.37230996
2: clust.dbscan 0.02941168
We can see that K-means clustering gives us a higher average silhouette score, and we might assume that the K-means learner with 2 centroids is a better choice than the DBSCAN method. However, now take a look at the cluster assignment plots.
prediction_kmeans = bmr$resample_results$resample_result[[1]]$prediction()
prediction_dbscan = bmr$resample_results$resample_result[[2]]$prediction()
autoplot(prediction_kmeans, task)
autoplot(prediction_dbscan, task)
The two learners arrived at different results – the K-means algorithm assigned points that are part of the same line into two different clusters, whereas DBSCAN assigned each line to its own cluster. Which one of these approaches is correct? The answer is: it depends on your specific task and the goal of the cluster analysis. If we had only relied on the silhouette score, then the details of how exactly the clustering was done would have been masked and we would not have been able to decide which method was appropriate for the task.
The two most important plots implemented in mlr3viz
to support evaluation of cluster learners are principal components analysis (PCA) and silhouette plots.
PCA is a commonly used dimension reduction method in ML to reduce the number of variables in a dataset or to visualize the most important ‘components’, which are linear transformations of the dataset features. Components are considered more important if they have higher variance (and therefore more predictive power). In the context of clustering, by plotting observations against the first two components, and then coloring them by cluster, we could visualize our high dimensional dataset and we would expect to see observations in distinct groups.
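As a quick base-R illustration of this idea (outside the mlr3 workflow), prcomp() orders components by variance, and the cumulative proportion shows how much of the total variance the leading components capture:

```r
# PCA on the built-in USArrests data, scaled so all features are comparable
pca = prcomp(USArrests, scale. = TRUE)
# proportion of total variance captured by the first two components
summary(pca)$importance["Cumulative Proportion", 1:2]
```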
Since our running example only has two features, using PCA to visualize the data does not make sense, so we will use a task based on the USArrests dataset instead. By plotting the result of PCA, we see that our model has separated observations into two clusters along the first two principal components.
task = mlr_tasks$get("usarrests")
learner = lrn("clust.kmeans")
prediction = learner$train(task)$predict(task)
autoplot(prediction, task, type = "pca")
Silhouette plots visually assess the quality of the estimated clusters by visualizing if observations in a cluster are wellplaced both individually and as a group. The plots include a dotted line which visualizes the average silhouette coefficient across all data points and each data point’s silhouette value is represented by a bar colored by cluster. In our particular case, the average silhouette index is 0.59. If the average silhouette value for a given cluster is below the average silhouette coefficient line then this implies that the cluster is not well defined.
Continuing with our new example, we find that a lot of observations are actually below the average line and close to zero, and therefore the quality of our cluster assignments is not very good, meaning that many observations are likely assigned to the wrong cluster.
autoplot(prediction, task, type = "sil")
Finally, we conduct a small benchmark study using the ruspini
data and with a few integrated cluster learners:
task = tsk("ruspini")
learners = list(
lrn("clust.featureless"),
lrn("clust.kmeans", centers = 4L),
lrn("clust.cmeans", centers = 3L)
)
measures = list(msr("clust.wss"), msr("clust.silhouette"))
bmr = benchmark(benchmark_grid(task, learners, rsmp("insample")))
bmr$aggregate(measures)[, c(4, 7, 8)]
learner_id clust.wss clust.silhouette
1: clust.featureless 244373.87 0.0000000
2: clust.kmeans 48309.14 0.5537388
3: clust.cmeans 51063.48 0.6327047
The experiment shows that the K-means algorithm with four centers has the best cluster cohesion (lowest within sum of squares), while fuzzy C-means with three centers achieves the best average silhouette score.
The final task we will discuss in this book is spatial analysis. Spatial analysis can be a subset of any other machine learning task (e.g., regression or classification) and is defined by the presence of spatial information in a dataset, usually stored as coordinates that are often named “x” and “y” or “lat” and “lon” (for ‘latitude’ and ‘longitude’ respectively).
Spatial analysis is its own task as spatial data must be handled carefully due to the complexity of ‘autocorrelation’. Where correlation is defined as a statistical association between two variables, autocorrelation is a statistical association within one variable. In machine learning terms, in a dataset with features and observations, correlation occurs when two or more features are statistically associated in some way, whereas autocorrelation occurs when two or more observations are statistically associated across one feature. Autocorrelation therefore violates one of the fundamental assumptions of ML that all observations in a dataset are independent, which results in lower confidence about the quality of a trained ML model and the resulting performance estimates (Hastie, Friedman, and Tibshirani 2001).
Autocorrelation is present in spatial data as there is implicit information encoded in coordinates, such as whether two observations (e.g., cities, countries, continents) are close together or far apart. For example, say we are predicting the number of cases of a disease two months after outbreak in Germany. Outbreaks radiate outwards from an epicenter and therefore countries closer to Germany will have higher numbers of cases, and countries further away will have lower numbers (Figure 8.4, bottom). Thus, looking at the data spatially shows clear signs of autocorrelation across nearby observations. Note in this example the autocorrelation is radial but in practice this will not always be the case.
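A minimal numeric sketch of this idea (purely illustrative, one spatial dimension, no extra packages): observations generated along a smooth spatial trend are strongly correlated with their neighbors, so they are far from independent:

```r
set.seed(42)
coord = seq(0, 10, length.out = 200)   # one-dimensional 'spatial' locations
y = sin(coord) + rnorm(200, sd = 0.1)  # smooth spatial trend plus noise
# lag-1 autocorrelation: each observation against its immediate neighbor
cor(y[-1], y[-length(y)])              # close to 1
```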
Unlike other tasks we have looked at in this chapter, there is no underlying difference to the implemented learners or measures. Instead we provide additional resampling methods in mlr3spatiotempcv
to account for the similarity in the train and test sets during resampling that originates from spatiotemporal autocorrelation.
Throughout this section we will use the mlr3spatiotempcv::ecuador
dataset and task as a working example.
TaskClassifST and TaskRegrST
To make use of spatial resampling methods, we have implemented two extensions of TaskClassif
and TaskRegr
to accommodate spatial data, TaskClassifST
and TaskRegrST
respectively. Below we only show classification examples but regression follows trivially.
library(mlr3spatial)
library(mlr3spatiotempcv)
# create task from `data.frame`
task = as_task_classif_st(ecuador, id = "ecuador_task",
target = "slides", positive = "TRUE",
coordinate_names = c("x", "y"), crs = "32717")
# or create task from 'sf' object
data_sf = sf::st_as_sf(ecuador, coords = c("x", "y"), crs = "32717")
task = as_task_classif_st(data_sf, target = "slides", positive = "TRUE")
task
<TaskClassifST:data_sf> (751 x 11)
* Target: slides
* Properties: twoclass
* Features (10):
 dbl (10): carea, cslope, dem, distdeforest, distroad,
distslidespast, hcurv, log.carea, slope, vcurv
* Coordinates:
X Y
1: 712882.5 9560002
2: 715232.5 9559582
3: 715392.5 9560172
4: 715042.5 9559312
5: 715382.5 9560142

747: 714472.5 9558482
748: 713142.5 9560992
749: 713322.5 9560562
750: 715392.5 9557932
751: 713802.5 9560862
Once a task is created, you can train and predict as normal.
lrn("classif.rpart")$train(task)$predict(task)
<PredictionClassif> for 751 observations:
row_ids truth response
1 TRUE TRUE
2 TRUE TRUE
3 TRUE TRUE

749 FALSE FALSE
750 FALSE FALSE
751 FALSE TRUE
However, as discussed above, it is best to use the specialized resampling methods to achieve bias-reduced estimates of model performance.
Before we look at the spatial resampling methods implemented in mlr3spatiotempcv, we will first show what can go wrong if non-spatial resampling methods are used for spatial data. Below we benchmark a decision tree on the mlr_tasks_ecuador task using two different repeated cross-validation resampling methods: the first (“NSpCV”, non-spatial cross-validation) is a non-spatial resampling method from mlr3; the second (“SpCV”, spatial cross-validation) is from mlr3spatiotempcv and is optimized for spatial data. The example highlights how “NSpCV” makes it appear as if the decision tree is performing better than it is, with significantly higher estimated performance; however, this is an overconfident estimate due to the autocorrelation in the data.
learner = lrn("classif.rpart", predict_type = "prob")
rsmp_nsp = rsmp("repeated_cv", folds = 3, repeats = 2, id = "NSpCV")
rsmp_sp = rsmp("repeated_spcv_coords", folds = 3, repeats = 2, id = "SpCV")
design = benchmark_grid(task, learner, c(rsmp_nsp, rsmp_sp))
bmr = benchmark(design)
bmr$aggregate(msr("classif.acc"))[, c(5, 7)]
resampling_id classif.acc
1: NSpCV 0.6744197
2: SpCV 0.5842370
In the above example, applying non-spatial resampling results in train and test sets that are very similar due to the underlying spatial autocorrelation. Hence there is little difference from testing a model on the same data it was trained on, which should be avoided for an honest performance estimate (see Chapter 2). In contrast, the spatial method has accounted for autocorrelation, and the test data is therefore less correlated (though some association will still remain) with the training data. Visually this can be seen using the built-in autoplot methods. In Figure 8.5 we visualize how the task is partitioned according to the spatial resampling method (Figure 8.5, top) and the non-spatial resampling method (Figure 8.5, bottom). There is a clear separation in space for the respective partitions when using the spatial resampling, whereas the train and test splits overlap a lot (and are therefore more correlated) using the non-spatial method.
library(patchwork)
autoplot(rsmp_sp, task, fold_id = c(1:3), size = 0.7) /
autoplot(rsmp_nsp, task, fold_id = c(1:3), size = 0.7) &
scale_y_continuous(breaks = seq(3.97, 4, 0.01)) &
scale_x_continuous(breaks = seq(79.06, 79.08, 0.02))
Now that we have seen why spatial resampling matters, we can take a look at what methods are available in mlr3spatiotempcv. The resampling methods we have added fall into several categories.
The choice of method may depend on specific characteristics of the dataset and there is no easy rule to pick one method over another, full details of different methods can be found in Schratz et al. (2021) – the paper deliberately avoids recommending one method over another as the ‘optimal’ choice is highly dependent on the predictive task, autocorrelation in the data, and the spatial structure of the sampling design. The documentation for each of the implemented methods includes details of each method as well as references to original publications.
We have focused on spatial analysis but referred to “spatiotemporal” and “spatiotemp”. The spatial-only resampling methods discussed in this section can be extended to temporal analysis (or spatial and temporal analysis combined) by setting the "time" col_role in the task (Chapter 3) – this is an advanced topic that may be added in future editions of this book. See the mlr3spatiotempcv visualization vignette^{1} for specific details about 3D spatiotemporal visualization.
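A rough sketch of how this might look (the data and its date column are hypothetical, and we assume as_task_regr_st mirrors the classification constructor shown earlier; consult the mlr3spatiotempcv documentation before relying on this):

```r
library(mlr3spatiotempcv)

# hypothetical data: coordinates, an observation date, and a response
d = data.frame(x = runif(20), y = runif(20),
               date = as.Date("2020-01-01") + 1:20,
               response = rnorm(20))
task = as_task_regr_st(d, target = "response",
  coordinate_names = c("x", "y"), crs = "4326")
# assign the 'time' column role so temporal resampling methods can use it
task$set_col_roles("date", roles = "time")
task$col_roles$time
```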
Until now we have looked at resampling to accommodate spatiotemporal features, but what if you want to make spatiotemporal predictions? In this case the goal is to make classification or regression predictions at the pixel level, i.e., for an area of a raster image defined by its geometric resolution.
To enable these predictions we have created a new function, predict_spatial
, to allow spatial predictions using any of the following spatial data classes:
* stars (from package stars)
* SpatRaster (from package terra)
* RasterLayer (from package raster)
* RasterStack (from package raster)

To use a raster image for prediction, it must be wrapped in TaskUnsupervised
. In the example below we load the leipzig_points
dataset for training and coerce this to a spatiotemporal task with as_task_classif_st
, and we load the leipzig_raster
raster image and coerce this to an unsupervised task. Both files are included as example data in mlr3spatial
.
library(sf)
# load sample points
leipzig_vector = sf::read_sf(system.file("extdata",
"leipzig_points.gpkg", package = "mlr3spatial"),
stringsAsFactors = TRUE)
# create training data
tsk_leipzig = as_task_classif_st(leipzig_vector, target = "land_cover")
# load raster image
leipzig_raster = terra::rast(system.file("extdata", "leipzig_raster.tif",
package = "mlr3spatial"))
# create testing data
tsk_predict = as_task_unsupervised(leipzig_raster)
Now we can continue as normal to train and predict with a classification learner, in this case a random forest.
learner = lrn("classif.ranger")$train(tsk_leipzig)
pred = predict_spatial(tsk_predict, learner, format = "terra")
pred
class : SpatRaster
dimensions : 206, 154, 1 (nrow, ncol, nlyr)
resolution : 10, 10 (x, y)
extent : 731810, 733350, 5692030, 5694090 (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 32N (EPSG:32632)
source : file29cb6ba3c678.tif
categories : categories
name : land_cover
min value : forest
max value : water
In this example we specified creation of a terra
object, which can be visualized with inbuilt plotting methods.
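For example, a toy sketch (a small random categorical raster standing in for pred above) using terra's plot method:

```r
library(terra)
# a small raster of class labels standing in for the land-cover prediction
r = rast(nrows = 10, ncols = 10,
         vals = sample(1:3, 100, replace = TRUE))
plot(r, main = "Predicted classes (toy example)")
```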
In this chapter, we explored going beyond deterministic regression and classification to see how functions in mlr3
can be used to implement other machine learning tasks. Cost-sensitive classification extends the ‘normal’ classification setting by assuming the costs associated with false negatives and false positives are unequal. Running cost-sensitive classification experiments is possible using only features in mlr3
. Survival analysis, available in mlr3proba
, can be thought of as a regression problem when the outcome may be censored, which means it may never be observed within a given time frame. The final task in mlr3proba
is density estimation, the unsupervised task concerned with estimating univariate probability distributions. Using mlr3cluster
, you can perform cluster analysis on observations, which involves grouping observations together according to similarities in their variables. Finally, with mlr3spatial
and mlr3spatiotempcv
, it is possible to perform spatial analysis to make predictions using coordinates as features and to make spatial predictions. The mlr3
interface is highly extensible, which means future machine learning tasks can (and will) be added to our universe and we will add these to this chapter of the book in future editions.
We will run set.seed(11)
before each of our solutions so you can reproduce our results.
1. Run a benchmark experiment on the german_credit task with the algorithms featureless, log_reg, and ranger. Tune the featureless model using tunethreshold and learner_cv. Use 2-fold CV and evaluate with msr("classif.costs", costs = costs), where you should make the parameter costs so that the cost of a true positive is -10, the cost of a true negative is -1, the cost of a false positive is 2, and the cost of a false negative is 3. Are your results surprising?
2. Train and predict a survival forest using rfsrc (from mlr3extralearners). Run this experiment using task = tsk("rats"); split = partition(task). Evaluate your model with the RCLL measure.
3. Estimate the density of the tsk("precip") data using logspline (from mlr3extralearners). Run this experiment using task = tsk("precip"); split = partition(task). Evaluate your model with the logloss measure.
4. Run a benchmark clustering experiment on the wine dataset without a label column. Compare the performance of the kmeans learner with k equal to 2, 3 and 4 using the silhouette measure and the insample resampling technique. What value of k would you choose based on the silhouette scores?
5. Run a benchmark experiment on the ecuador task with a featureless learner and xgboost, and evaluate with the binary Brier score.