8 Special Tasks
This chapter explores the different functions of mlr3
when dealing with specific data sets that require further statistical modification to undertake sensible analysis.
Following topics are discussed:
Survival Analysis
This subchapter explains how to conduct sound survival analysis in mlr3. Survival analysis is used to monitor the period of time until a specific event takes places. This specific event could be e.g. death, transmission of a disease, marriage or divorce. Two considerations are important when conducting survival analysis:
 Whether the event occurred within the frame of the given data
 How much time it took until the event occurred
In summary, this subchapter explains how to account for these considerations and conduct survival analysis using the mlr3proba
extension package.
Density Estimation
This subchapter explains how to conduct (unconditional) density estimation in mlr3. Density estimation is used to estimate the probability density function of a continuous variable. Unconditional density estimation is an unsupervised task so there is no ‘value’ to predict, instead densities are estimated.
This subchapter explains how to estimate probability distributions for continuous variables using the mlr3proba
extension package.
Spatiotemporal Analysis
Spatiotemporal analysis data observations entail reference information about spatial and temporal characteristics. One of the largest issues of spatiotemporal data analysis is the inevitable presence of autocorrelation in the data. Autocorrelation is especially severe in data with marginal spatiotemporal variation. The subchapter on Spatiotemporal analysis provides instructions on how to account for spatiotemporal data.
Ordinal Analysis
This is work in progress. See mlr3ordinal for the current state.
Functional Analysis
Functional analysis contains data that consists of curves varying over a continuum e.g. time, frequency or wavelength. This type of analysis is frequently used when examining measurements over various time points. Steps on how to accommodate functional data structures in mlr3 are explained in the functional analysis subchapter.
Multilabel Classification
Multilabel classification deals with objects that can belong to more than one category at the same time. Numerous target labels are attributed to a single observation. Working with multilabel data requires one to use modified algorithms, to accommodate data specific characteristics. Two approaches to multilabel classification are prominently used:
 The problem transformation method
 The algorithm adaption method
Instructions on how to deal with multilabel classification in mlr3 can be found in this subchapter.
Cost Sensitive Classification
This subchapter deals with the implementation of costsensitive classification. Regular classification aims to minimize the misclassification rate and thus all types of misclassification errors are deemed equally severe. Costsensitive classification is a setting where the costs caused by different kinds of errors are not assumed to be equal. The objective is to minimize the expected costs.
Analytical data for a big credit institution is used as a use case to illustrate the different features. Firstly, the subchapter provides guidance on how to implement a first model. Subsequently, the subchapter contains instructions on how to modify cost sensitivity measures, thresholding and threshold tuning.
Cluster Analysis
Cluster analysis aims to group data into clusters such that objects that are similar end up in the same cluster.
Fundamentally, clustering and classification are similar.
However, clustering is an unsupervised task because observations do not contain true labels while in classification, labels are needed in order to train a model.
This subchapter explains how to perform cluster analysis in mlr3 with the help of mlr3cluster
extension package.
8.1 Survival Analysis
Survival analysis is a subfield of supervised machine learning in which the aim is to predict the survival distribution of a given individual. Arguably the main feature of survival analysis is that unlike classification and regression, learners are trained on two features:
 the time until the event takes place
 the event type: either censoring or death.
At a particular timepoint, an individual is either: alive, dead, or censored. Censoring occurs if it is unknown if an individual is alive or dead. For example, say we are interested in patients in hospital and every day it is recorded if they are alive or dead, then after a patient leaves it is unknown if they are alive or dead, hence they are censored. If there was no censoring, then ordinary regression analysis could be used instead. Furthermore, survival data contains solely positive values and therefore needs to be transformed to avoid biases.
Note that survival analysis accounts for both censored and uncensored observations while adjusting respective model parameters.
The package mlr3proba (Sonabend et al. 2021) extends mlr3 with the following objects for survival analysis:

TaskSurv
to define (censored) survival tasks 
LearnerSurv
as base class for survival learners 
PredictionSurv
as specialized class forPrediction
objects 
MeasureSurv
as specialized class for performance measures
For a good introduction to survival analysis see Modelling Survival Data in Medical Research (Collett 2014).
8.1.1 TaskSurv
Unlike TaskClassif
and TaskRegr
which have a single ‘target’ argument, TaskSurv
mimics the
survival::Surv
object and has three to four target arguments (dependent on censoring type).
A TaskSurv
can be constructed with the function as_task_surv()
:
library("mlr3")
library("mlr3proba")
library("survival")
as_task_surv(survival::bladder2[, 1L], id = "interval_censored",
time = "start", event = "event", time2 = "stop", type = "interval")
## <TaskSurv:interval_censored> (178 x 7)
## * Target: start, stop, event
## * Properties: 
## * Features (4):
##  dbl (2): enum, rx
##  int (2): number, size
# type = "right" is default
task = as_task_surv(survival::rats, id = "right_censored",
time = "time", event = "status", type = "right")
print(task)
## <TaskSurv:right_censored> (300 x 5)
## * Target: time, status
## * Properties: 
## * Features (3):
##  int (1): litter
##  dbl (1): rx
##  chr (1): sex
# the target column is a survival object:
head(task$truth())
## [1] 101+ 49 104+ 91+ 104+ 102+
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
8.1.2 Predict Types  crank, lp, and distr
Every PredictionSurv
object can predict one or more of:

lp
 Linear predictor calculated as the fitted coefficients multiplied by the test data. 
distr
 Predicted survival distribution, either discrete or continuous. Implemented in distr6. 
crank
 Continuous risk ranking. 
response
 Predicted survival time.
lp
and crank
can be used with measures of discrimination such as the concordance index.
Whilst lp
is a specific mathematical prediction, crank
is any continuous ranking that identifies who is more or less likely to experience the event.
So far the only implemented learner that only returns a continuous ranking is surv.svm
.
If a PredictionSurv
returns an lp
then the crank
is identical to this.
Otherwise crank
is calculated as the expectation of the predicted survival distribution.
Note that for linear proportional hazards models, the ranking (but not necessarily the crank
score itself) given by lp
and the expectation of distr
, is identical.
The example below uses the rats
task shipped with mlr3proba.
task = tsk("rats")
learn = lrn("surv.coxph")
train_set = sample(task$nrow, 0.8 * task$nrow)
test_set = setdiff(seq_len(task$nrow), train_set)
learn$train(task, row_ids = train_set)
prediction = learn$predict(task, row_ids = test_set)
print(prediction)
## <PredictionSurv> for 60 observations:
## row_ids time status crank lp distr
## 3 104 FALSE 0.4230 0.4230 <list[1]>
## 6 102 FALSE 3.3329 3.3329 <list[1]>
## 17 62 FALSE 3.2993 3.2993 <list[1]>
## 
## 290 91 FALSE 0.3816 0.3816 <list[1]>
## 292 104 FALSE 1.8071 1.8071 <list[1]>
## 294 64 FALSE 2.5283 2.5283 <list[1]>
8.1.3 Composition
Finally we take a look at the PipeOp
s implemented in mlr3proba, which are used for composition of predict types.
For example, a predict linear predictor does not have a lot of meaning by itself, but it can be composed into a survival distribution.
See mlr3pipelines for full tutorials and details on PipeOp
s.
library("mlr3pipelines")
library("mlr3learners")
# PipeOpDistrCompositor  Train one model with a baseline distribution,
# (KaplanMeier or NelsonAalen), and another with a predicted linear predictor.
task = tsk("rats")
# remove the factor column for support with glmnet
task$select(c("litter", "rx"))
learner_lp = lrn("surv.glmnet")
learner_distr = lrn("surv.kaplan")
prediction_lp = learner_lp$train(task)$predict(task)
prediction_distr = learner_distr$train(task)$predict(task)
prediction_lp$distr
# Doesn't need training. Base = baseline distribution. ph = Proportional hazards.
pod = po("compose_distr", form = "ph", overwrite = FALSE)
prediction = pod$predict(list(base = prediction_distr, pred = prediction_lp))$output
# Now we have a predicted distr!
prediction$distr
# This can all be simplified by using the distrcompose pipeline
glm.distr = ppl("distrcompositor", learner = lrn("surv.glmnet"),
estimator = "kaplan", form = "ph", overwrite = FALSE, graph_learner = TRUE)
glm.distr$train(task)$predict(task)
8.1.4 Benchmark Experiment
Finally, we conduct a small benchmark study on the rats
task using some of the integrated survival learners:
library("mlr3learners")
task = tsk("rats")
# some integrated learners
learners = lrns(c("surv.coxph", "surv.kaplan", "surv.ranger"))
print(learners)
## [[1]]
## <LearnerSurvCoxPH:surv.coxph>: Cox Proportional Hazards
## * Model: 
## * Parameters: list()
## * Packages: mlr3, mlr3proba, survival, distr6
## * Predict Type: distr
## * Feature types: logical, integer, numeric, factor
## * Properties: weights
##
## [[2]]
## <LearnerSurvKaplan:surv.kaplan>: KaplanMeier Estimator
## * Model: 
## * Parameters: list()
## * Packages: mlr3, mlr3proba, survival, distr6
## * Predict Type: crank
## * Feature types: logical, integer, numeric, character, factor, ordered
## * Properties: missings
##
## [[3]]
## <LearnerSurvRanger:surv.ranger>
## * Model: 
## * Parameters: num.threads=1
## * Packages: mlr3, mlr3proba, mlr3learners, ranger
## * Predict Type: distr
## * Feature types: logical, integer, numeric, character, factor, ordered
## * Properties: importance, oob_error, weights
## <MeasureSurvCindex:surv.cindex>
## * Packages: mlr3, mlr3proba
## * Range: [0, 1]
## * Minimize: FALSE
## * Average: macro
## * Parameters: weight_meth=I, tiex=0.5
## * Properties: 
## * Predict type: crank
## * Return type: Score
set.seed(1)
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measure)
## nr resample_result task_id learner_id resampling_id iters surv.cindex
## 1: 1 <ResampleResult[22]> rats surv.coxph cv 3 0.7671
## 2: 2 <ResampleResult[22]> rats surv.kaplan cv 3 0.5000
## 3: 3 <ResampleResult[22]> rats surv.ranger cv 3 0.7799
autoplot(bmr, measure = measure)
The experiment indicates that both the Cox PH and the random forest have better discrimination than the KaplanMeier baseline estimator, but that the machine learning random forest is not consistently better than the interpretable Cox PH.
8.2 Density Estimation
Density estimation is the learning task to find the unknown distribution from which an i.i.d. data set is generated. We interpret this broadly, with this distribution not necessarily being continuous (so may possess a mass not density). The conditional case, where a distribution is predicted conditional on covariates, is known as ‘probabilistic supervised regression’, and will be implemented in mlr3proba in the nearfuture. Unconditional density estimation is viewed as an unsupervised task. For a good overview to density estimation see Density estimation for statistics and data analysis (Silverman 1986).
The package mlr3proba extends mlr3 with the following objects for density estimation:

TaskDens
to define density tasks 
LearnerDens
as base class for density estimators 
PredictionDens
as specialized class forPrediction
objects 
MeasureDens
as specialized class for performance measures
In this example we demonstrate the basic functionality of the package on the faithful
data from the datasets package.
This task ships as predefined TaskDens
with mlr3proba.
## <TaskDens:precip> (70 x 1): Annual Precipitation
## * Target: 
## * Properties: 
## * Features (1):
##  dbl (1): precip
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Unconditional density estimation is an unsupervised method.
Hence, TaskDens
is an unsupervised task which inherits directly from Task
unlike TaskClassif
and TaskRegr
.
However, TaskDens
still has a target
argument and a $truth
field defined by:

target
 the name of the variable in the data for which to estimate density 
$truth
 the values of thetarget
column (which is not the true density, which is always unknown)
8.2.1 Train and Predict
Density learners have train
and predict
methods, though being unsupervised, ‘prediction’ is actually ‘estimation’.
In training, a distr6 object is created,
see here for full tutorials on how to access the probability density function, pdf
, cumulative distribution function, cdf
, and other important fields and methods.
The predict method is simply a wrapper around self$model$pdf
and if available self$model$cdf
, i.e. evaluates the pdf/cdf at given points.
Note that in prediction the points to evaluate the pdf and cdf are determined by the target
column in the TaskDens
object used for testing.
# create task and learner
task_faithful = TaskDens$new(id = "eruptions", backend = datasets::faithful$eruptions)
learner = lrn("dens.hist")
# train/test split
train_set = sample(task_faithful$nrow, 0.8 * task_faithful$nrow)
test_set = setdiff(seq_len(task_faithful$nrow), train_set)
# fitting KDE and model inspection
learner$train(task_faithful, row_ids = train_set)
learner$model
## $distr
## Histogram()
##
## $hist
## $breaks
## [1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
##
## $counts
## [1] 42 32 4 5 24 60 48 2
##
## $density
## [1] 0.38710 0.29493 0.03687 0.04608 0.22120 0.55300 0.44240 0.01843
##
## $mids
## [1] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25
##
## $xname
## [1] "dat"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
##
## attr(,"class")
## [1] "dens.hist"
class(learner$model)
## [1] "dens.hist"
# make predictions for new data
prediction = learner$predict(task_faithful, row_ids = test_set)
Every PredictionDens
object can estimate:

pdf
 probability density function
Some learners can estimate:

cdf
 cumulative distribution function
8.2.2 Benchmark Experiment
Finally, we conduct a small benchmark study on the precip
task using some of the integrated survival learners:
## [[1]]
## <LearnerDensHistogram:dens.hist>: Histogram Density Estimator
## * Model: 
## * Parameters: list()
## * Packages: mlr3, mlr3proba, distr6
## * Predict Type: pdf
## * Feature types: integer, numeric
## * Properties: 
##
## [[2]]
## <LearnerDensKDE:dens.kde>: Kernel Density Estimator
## * Model: 
## * Parameters: kernel=Epan, bandwidth=silver
## * Packages: mlr3, mlr3proba, distr6
## * Predict Type: pdf
## * Feature types: integer, numeric
## * Properties: missings
## <MeasureDensLogloss:dens.logloss>: Log Loss
## * Packages: mlr3, mlr3proba
## * Range: [0, Inf]
## * Minimize: TRUE
## * Average: macro
## * Parameters: eps=1e15
## * Properties: 
## * Predict type: pdf
set.seed(1)
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measure)
## nr resample_result task_id learner_id resampling_id iters dens.logloss
## 1: 1 <ResampleResult[22]> precip dens.hist cv 3 4.396
## 2: 2 <ResampleResult[22]> precip dens.kde cv 3 4.818
autoplot(bmr, measure = measure)
The results of this experiment show that the sophisticated Penalized Density Estimator does not outperform the baseline Histogram, but that the Kernel Density Estimator has at least consistently better
(i.e. lower logloss) results.
8.3 Spatiotemporal Analysis
Data observations may entail reference information about spatial or temporal characteristics. Spatial information is stored as coordinates, usually named “x” and “y” or “lat”/“lon”. Treating spatiotemporal data using nonspatial data methods can lead to overoptimistic performance estimates. Hence, methods specifically designed to account for the special nature of spatiotemporal data are needed.
In the mlr3 framework, the following packages relate to this field:
 mlr3spatiotemporal (biasedreduced performance estimation)
 mlr3forecasting (timeseries support)
 mlr3spatial (spatial prediction support)
The following (sub)sections introduce the potential pitfalls of spatiotemporal data in machine learning and how to account for it. Note that not all functionality will be covered, and that some of the used packages are still in early lifecycles. If you want to contribute to one of the packages mentioned above, please contact Patrick Schratz.
8.3.1 Creating a spatial Task
To make use of spatial resampling methods, a {mlr3} task that is aware of its spatial characteristic needs to be created. Two child classes exist in {mlr3spatiotempcv} for this purpose:
TaskClassifST
TaskRegrST
To create one of these, one can either pass a sf
object as the “backend” directly:
# create 'sf' object
data_sf = sf::st_as_sf(ecuador, coords = c("x", "y"), crs = 32717)
# create mlr3 task
task = TaskClassifST$new("ecuador_sf",
backend = data_sf, target = "slides", positive = "TRUE"
)
or use a plain data.frame
.
In this case, the constructor of TaskClassifST
needs a few more arguments:
data = mlr3::as_data_backend(ecuador)
task = TaskClassifST$new("ecuador",
backend = data, target = "slides",
positive = "TRUE", extra_args = list(coordinate_names = c("x", "y"),
crs = 32717)
)
Now this Task can be used as a normal {mlr3} task in any kind of modeling scenario.
8.3.2 Autocorrelation
Data which includes spatial or temporal information requires special treatment in machine learning (similar to survival, ordinal and other task types listed in the special tasks chapter). In contrast to nonspatial/nontemporal data, observations inherit a natural grouping, either in space or time or in both space and time (Legendre 1993). This grouping causes observations to be autocorrelated, either in space (spatial autocorrelation (SAC)), time (temporal autocorrelation (TAC)) or both space and time (spatiotemporal autocorrelation (STAC)). For simplicity, the acronym STAC is used as a generic term in the following chapter for all the different characteristics introduced above.
What effects does STAC have in statistical/machine learning?
The overarching problem is that STAC violates the assumption that the observations in the train and test datasets are independent (Hastie, Friedman, and Tibshirani 2001). If this assumption is violated, the reliability of the resulting performance estimates, for example retrieved via crossvalidation, is decreased. The magnitude of this decrease is linked to the magnitude of STAC in the dataset, which cannot be determined easily.
One approach to account for the existence of STAC is to use dedicated resampling methods. mlr3spatiotemporal provides access to the most frequently used spatiotemporal resampling methods. The following example showcases how a spatial dataset can be used to retrieve a biasreduced performance estimate of a learner.
The following examples use the ecuador dataset created by Jannes Muenchow. It contains information on the occurrence of landslides (binary) in the Andes of Southern Ecuador. The landslides were mapped from aerial photos taken in 2000. The dataset is well suited to serve as an example because it it relatively small and of course due to the spatial nature of the observations. Please refer to Muenchow, Brenning, and Richter (2012) for a detailed description of the dataset.
To account for the spatial autocorrelation probably present in the landslide data, we will make use one of the most used spatial partitioning methods, a clusterbased kmeans grouping (Brenning 2012), ("spcv_coords"
in mlr3spatiotemporal).
This method performs a clustering in 2D space which contrasts with the commonly used random partitioning for nonspatial data.
The grouping has the effect that train and test data are more separated in space as they would be by conducting a random partitioning, thereby reducing the effect of STAC.
By contrast, when using the classical random partitioning approach with spatial data, train and test observations would be located sidebyside across the full study area (a visual example is provided further below). This leads to a high similarity between train and test sets, resulting in “better” but biased performance estimates in every fold of a CV compared to the spatial CV approach. However, these low error rates are mainly caused due to the STAC in the observations and the lack of appropriate partitioning methods and not by the power of the fitted model.
8.3.3 Spatial CV vs. NonSpatial CV
In the following a spatial and a nonspatial CV will be conducted to showcase the mentioned performance differences.
The performance of a simple classification tree ("classif.rpart"
) is evaluated on a random partitioning ("repeated_cv"
) with four folds and two repetitions.
The chosen evaluation measure is “classification error” ("classif.ce"
).
The only difference in the spatial setting is that "repeated_spcv_coords"
is chosen instead of "repeated_cv"
.
8.3.3.1 NonSpatial CV
library("mlr3")
library("mlr3spatiotempcv")
set.seed(42)
# be less verbose
lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")
task = tsk("ecuador")
learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling_nsp = rsmp("repeated_cv", folds = 4, repeats = 2)
rr_nsp = resample(
task = task, learner = learner,
resampling = resampling_nsp)
rr_nsp$aggregate(measures = msr("classif.ce"))
## classif.ce
## 0.3389
8.3.3.2 Spatial CV
task = tsk("ecuador")
learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling_sp = rsmp("repeated_spcv_coords", folds = 4, repeats = 2)
rr_sp = resample(
task = task, learner = learner,
resampling = resampling_sp)
rr_sp$aggregate(measures = msr("classif.ce"))
## classif.ce
## 0.4125
Here, the classification tree learner is around 0.05 percentage points worse when using Spatial CrossValidation (SpCV) compared to NonSpatial CrossValidation (NSpCV). The magnitude of this difference is variable as it depends on the dataset, the magnitude of STAC and the learner itself. For algorithms with a higher tendency of overfitting to the training set, the difference between the two methods will be larger.
8.3.4 Visualization of Spatiotemporal Partitions
Every partitioning method in mlr3spatiotemporal comes with a generic plot()
method to visualize the created groups.
In a 2D space this happens via ggplot2 while for spatiotemporal methods 3D visualizations via plotly are created.
autoplot(resampling_sp, task, fold_id = c(1:4), size = 0.7) *
ggplot2::scale_y_continuous(breaks = seq(3.97, 4, 0.01)) *
ggplot2::scale_x_continuous(breaks = seq(79.06, 79.08, 0.01))
Note that setting the correct CRS for the given data is important which is done during task creation Spatial offsets of up to multiple meters may occur if the wrong CRS is supplied initially.
This example used an already created task via the sugar function tsk()
.
In practice however, one needs to create a spatiotemporal task via TaskClassifST()
/TaskRegrST()
and set the crs
argument.
The spatial grouping of the kmeans based approach above contrasts visually ver well compared to the NSpCV (random) partitioning:
autoplot(resampling_nsp, task, fold_id = c(1:4), size = 0.7) *
ggplot2::scale_y_continuous(breaks = seq(3.97, 4, 0.01)) *
ggplot2::scale_x_continuous(breaks = seq(79.06, 79.08, 0.01))
8.3.5 Spatial Block Visualization
The spcvblock
method makes use of rectangular blocks to divide the study area into equallysized parts.
These blocks can be visualized by their spatial location and fold ID to get a better understanding how these influenced the final partitions.
task = tsk("ecuador")
resampling = rsmp("spcv_block", range = 1000L)
resampling$instantiate(task)
## Visualize train/test splits of multiple folds
autoplot(resampling, task, size = 0.7,
fold_id = c(1, 2), show_blocks = TRUE, show_labels = TRUE) *
ggplot2::scale_x_continuous(breaks = seq(79.085, 79.055, 0.01))
8.3.6 Choosing a Resampling Method
While the example used the "spcv_coords"
method, this does not mean that this method is the best or only method suitable for this task.
Even though this method is quite popular, it was mainly chosen because of the clear visual grouping differences compared to random partitioning.
In fact, most often multiple spatial partitioning methods can be used for a dataset. It is recommended (required) that users familiarize themselves with each implemented method and decide which method to choose based on the specific characteristics of the dataset. For almost all methods implemented in mlr3spatiotemporal, there is a scientific publication describing the strengths and weaknesses of the respective approach (either linked in the help file of mlr3spatiotemporal or its respective dependency packages).
In the example above, a crossvalidation without hyperparameter tuning was shown. If a nested CV is desired, it is recommended to use the same spatial partitioning method for the inner loop (= tuning level). See Schratz et al. (2019) for more details and chapter 11 of Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019)^{3}.
A list of all implemented methods in mlr3spatiotemporal can be found in the Getting Started vignette of the package.
If you want to learn even more about the field of spatial partitioning, STAC and the problems associated with it, the work of Prof. Hanna Meyer is very much recommended for further reference.
8.3.7 Spatial Prediction
Experimental support for spatial prediction with terra, raster, stars and sf objects is available in mlr3spatial.
Until the package is released on CRAN, please see the package vignettes for usage examples.
8.4 Ordinal Analysis
This is work in progress. See mlr3ordinal for the current state of the implementation.
8.5 Functional Analysis
Functional data is data containing an ordering on the dimensions. This implies that functional data consists of curves varying over a continuum, such as time, frequency, or wavelength.
8.5.1 How to model functional data?
There are two ways to model functional data:
 Modification of the learner, so that the learner is suitable for the functional data
 Modification of the task, so that the task matches the standard or classificationlearner
The development has not started yet, we are looking for contributors. Open an issue in mlr3fda if you are interested!
8.6 Multilabel Classification
Multilabel classification deals with objects that can belong to more than one category at the same time.
The development has not started yet, we are looking for contributes. Open an issue in mlr3multioutput if you are interested!
8.7 CostSensitive Classification
In regular classification the aim is to minimize the misclassification rate and thus all types of misclassification errors are deemed equally severe. A more general setting is costsensitive classification. Cost sensitive classification does not assume that the costs caused by different kinds of errors are equal. The objective of cost sensitive classification is to minimize the expected costs.
Imagine you are an analyst for a big credit institution. Let’s also assume that a correct decision of the bank would result in 35% of the profit at the end of a specific period. A correct decision means that the bank predicts that a customer will pay their bills (hence would obtain a loan), and the customer indeed has good credit. On the other hand, a wrong decision means that the bank predicts that the customer’s credit is in good standing, but the opposite is true. This would result in a loss of 100% of the given loan.
Good Customer (truth)  Bad Customer (truth)  

Good Customer (predicted)  + 0.35   1.0 
Bad Customer (predicted)  0  0 
Expressed as costs (instead of profit), we can write down the costmatrix as follows:
costs = matrix(c(0.35, 0, 1, 0), nrow = 2)
dimnames(costs) = list(response = c("good", "bad"), truth = c("good", "bad"))
print(costs)
## truth
## response good bad
## good 0.35 1
## bad 0.00 0
An exemplary data set for such a problem is the German Credit
task:
##
## good bad
## 700 300
The data has 70% customers who are able to pay back their credit, and 30% bad customers who default on the debt. A manager, who doesn’t have any model, could decide to give either everybody a credit or to give nobody a credit. The resulting costs for the German credit data are:
# nobody:
(700 * costs[2, 1] + 300 * costs[2, 2]) / 1000
## [1] 0
# everybody
(700 * costs[1, 1] + 300 * costs[1, 2]) / 1000
## [1] 0.055
If the average loan is $20,000, the credit institute would lose more than one million dollar if it would grant everybody a credit:
# average profit * average loan * number of customers
0.055 * 20000 * 1000
## [1] 1100000
Our goal is to find a model which minimizes the costs (and thereby maximizes the expected profit).
8.7.1 A First Model
For our first model, we choose an ordinary logistic regression (implemented in the addon package mlr3learners). We first create a classification task, then resample the model using a 10fold cross validation and extract the resulting confusion matrix:
library("mlr3learners")
learner = lrn("classif.log_reg")
rr = resample(task, learner, rsmp("cv"))
confusion = rr$prediction()$confusion
print(confusion)
## truth
## response good bad
## good 599 153
## bad 101 147
To calculate the average costs like above, we can simply multiply the elements of the confusion matrix with the elements of the previously introduced cost matrix, and sum the values of the resulting matrix:
## [1] 0.05665
With an average loan of $20,000, the logistic regression yields the following costs:
avg_costs * 20000 * 1000
## [1] 1133000
Instead of losing over $1,000,000, the credit institute now can expect a profit of more than $1,000,000.
8.7.2 Costsensitive Measure
Our natural next step would be to further improve the modeling step in order to maximize the profit.
For this purpose we first create a costsensitive classification measure which calculates the costs based on our cost matrix.
This allows us to conveniently quantify and compare modeling decisions.
Fortunately, there already is a predefined measure Measure
for this purpose: MeasureClassifCosts
:
## <MeasureClassifCosts:classif.costs>: Costsensitive Classification
## * Packages: mlr3
## * Range: [Inf, Inf]
## * Minimize: TRUE
## * Average: macro
## * Parameters: normalize=TRUE
## * Properties: requires_task
## * Predict type: response
If we now call resample()
or benchmark()
, the costsensitive measures will be evaluated.
We compare the logistic regression to a simple featureless learner and to a random forest from package ranger :
learners = list(
lrn("classif.log_reg"),
lrn("classif.featureless"),
lrn("classif.ranger")
)
cv3 = rsmp("cv", folds = 3)
bmr = benchmark(benchmark_grid(task, learners, cv3))
bmr$aggregate(cost_measure)
## nr resample_result task_id learner_id resampling_id
## 1: 1 <ResampleResult[22]> german_credit classif.log_reg cv
## 2: 2 <ResampleResult[22]> german_credit classif.featureless cv
## 3: 3 <ResampleResult[22]> german_credit classif.ranger cv
## iters classif.costs
## 1: 3 0.06605
## 2: 3 0.05500
## 3: 3 0.05454
As expected, the featureless learner is performing comparably bad. The logistic regression and the random forest work equally well.
8.7.3 Thresholding
Although we now correctly evaluate the models in a costsensitive fashion, the models themselves are unaware of the classification costs. They assume the same costs for both wrong classification decisions (false positives and false negatives). Some learners natively support costsensitive classification (e.g., XXX). However, we will concentrate on a more generic approach which works for all models which can predict probabilities for class labels: thresholding.
Most learners can calculate the probability \(p\) for the positive class. If \(p\) exceeds the threshold \(0.5\), they predict the positive class, and the negative class otherwise.
For our binary classification case of the credit data, the we primarily want to minimize the errors where the model predicts “good”, but truth is “bad” (i.e., the number of false positives) as this is the more expensive error. If we now increase the threshold to values \(> 0.5\), we reduce the number of false negatives. Note that we increase the number of false positives simultaneously, or, in other words, we are trading false positives for false negatives.
# fit models with probability prediction
learner = lrn("classif.log_reg", predict_type = "prob")
rr = resample(task, learner, rsmp("cv"))
p = rr$prediction()
print(p)
## <PredictionClassif> for 1000 observations:
## row_ids truth response prob.good prob.bad
## 10 bad bad 0.1416 0.85844
## 13 good good 0.8989 0.10108
## 16 bad good 0.5547 0.44528
## 
## 991 good good 0.9528 0.04721
## 997 good good 0.5954 0.40459
## 1000 good good 0.8678 0.13218
# helper function to try different threshold values interactively
with_threshold = function(p, th) {
p$set_threshold(th)
list(confusion = p$confusion, costs = p$score(measures = cost_measure, task = task))
}
with_threshold(p, 0.5)
## $confusion
## truth
## response good bad
## good 606 153
## bad 94 147
##
## $costs
## classif.costs
## 0.0591
with_threshold(p, 0.75)
## $confusion
## truth
## response good bad
## good 475 82
## bad 225 218
##
## $costs
## classif.costs
## 0.08425
with_threshold(p, 1.0)
## $confusion
## truth
## response good bad
## good 1 0
## bad 699 300
##
## $costs
## classif.costs
## 0.00035
# TODO: include plot of threshold vs performance
Instead of manually trying different threshold values, one uses use optimize()
to find a good threshold value w.r.t. our performance measure:
# simple wrapper function which takes a threshold and returns the resulting model performance
# this wrapper is passed to optimize() to find its minimum for thresholds in [0.5, 1]
f = function(th) {
with_threshold(p, th)$costs
}
best = optimize(f, c(0.5, 1))
print(best)
## $minimum
## [1] 0.8296
##
## $objective
## classif.costs
## 0.08765
# optimized confusion matrix:
with_threshold(p, best$minimum)$confusion
## truth
## response good bad
## good 399 52
## bad 301 248
Note that the function optimize()
is intended for unimodal functions and therefore may converge to a local optimum here.
See below for better alternatives to find good threshold values.
8.7.4 Threshold Tuning
Before we start, we have load all required packages:
## Loading required package: paradox
8.7.5 Adjusting thresholds: Two strategies
Currently mlr3pipelines
offers two main strategies towards adjusting classification thresholds
.
We can either expose the thresholds as a hyperparameter
of the Learner by using PipeOpThreshold
.
This allows us to tune the thresholds
via an outside optimizer from mlr3tuning
.
Alternatively, we can also use PipeOpTuneThreshold
which automatically tunes the threshold after each learner is fit.
In this blogpost, we’ll go through both strategies.
8.7.6 PipeOpThreshold
PipeOpThreshold
can be put directly after a Learner
.
A simple example would be:
gr = lrn("classif.rpart", predict_type = "prob") %>>% po("threshold")
l = as_learner(gr)
Note, that predict_type
= “prob” is required for po("threshold")
to have any effect.
The thresholds
are now exposed as a hyperparameter
of the GraphLearner
we created:
l$param_set
## <ParamSetCollection>
## id class lower upper nlevels default
## 1: classif.rpart.cp ParamDbl 0 1 Inf 0.01
## 2: classif.rpart.keep_model ParamLgl NA NA 2 FALSE
## 3: classif.rpart.maxcompete ParamInt 0 Inf Inf 4
## 4: classif.rpart.maxdepth ParamInt 1 30 30 30
## 5: classif.rpart.maxsurrogate ParamInt 0 Inf Inf 5
## 6: classif.rpart.minbucket ParamInt 1 Inf Inf <NoDefault[3]>
## 7: classif.rpart.minsplit ParamInt 1 Inf Inf 20
## 8: classif.rpart.surrogatestyle ParamInt 0 1 2 0
## 9: classif.rpart.usesurrogate ParamInt 0 2 3 2
## 10: classif.rpart.xval ParamInt 0 Inf Inf 10
## 11: threshold.thresholds ParamUty NA NA Inf <NoDefault[3]>
## value
## 1:
## 2:
## 3:
## 4:
## 5:
## 6:
## 7:
## 8:
## 9:
## 10: 0
## 11: 0.5
We can now tune those thresholds from the outside as follows:
Before tuning
, we have to define which hyperparameters we want to tune over.
In this example, we only tune over the thresholds
parameter of the threshold
pipeop.
you can easily imagine, that we can also jointly tune over additional hyperparameters, i.e. rpart’s cp
parameter.
As the Task
we aim to optimize for is a binary task, we can simply specify the threshold param:
We now create a AutoTuner
, which automatically tunes the supplied learner over the ParamSet
we supplied above.
at = AutoTuner$new(
learner = l,
resampling = rsmp("cv", folds = 3L),
measure = msr("classif.ce"),
search_space = ps,
terminator = trm("evals", n_evals = 5L),
tuner = tnr("random_search")
)
at$train(tsk("german_credit"))
## INFO [15:49:54.063] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerRandomSearch>' and '<TerminatorEvals> [n_evals=5, k=0]'
## INFO [15:49:54.107] [bbotk] Evaluating 1 configuration(s)
## INFO [15:49:54.405] [bbotk] Result of batch 1:
## INFO [15:49:54.407] [bbotk] threshold.thresholds classif.ce warnings errors runtime_learners
## INFO [15:49:54.407] [bbotk] 0.2789 0.279 0 0 0.204
## INFO [15:49:54.407] [bbotk] uhash
## INFO [15:49:54.407] [bbotk] edbf9c5eac5d42108f9c4a2a061c4370
## INFO [15:49:54.410] [bbotk] Evaluating 1 configuration(s)
## INFO [15:49:54.741] [bbotk] Result of batch 2:
## INFO [15:49:54.743] [bbotk] threshold.thresholds classif.ce warnings errors runtime_learners
## INFO [15:49:54.743] [bbotk] 0.5091 0.291 0 0 0.244
## INFO [15:49:54.743] [bbotk] uhash
## INFO [15:49:54.743] [bbotk] 8fe5147c612b46a298676bcad3b28369
## INFO [15:49:54.746] [bbotk] Evaluating 1 configuration(s)
## INFO [15:49:55.026] [bbotk] Result of batch 3:
## INFO [15:49:55.028] [bbotk] threshold.thresholds classif.ce warnings errors runtime_learners
## INFO [15:49:55.028] [bbotk] 0.0653 0.302 0 0 0.193
## INFO [15:49:55.028] [bbotk] uhash
## INFO [15:49:55.028] [bbotk] 51766fd2bf374d55a842b92093886dd9
## INFO [15:49:55.031] [bbotk] Evaluating 1 configuration(s)
## INFO [15:49:55.312] [bbotk] Result of batch 4:
## INFO [15:49:55.314] [bbotk] threshold.thresholds classif.ce warnings errors runtime_learners
## INFO [15:49:55.314] [bbotk] 0.1273 0.289 0 0 0.196
## INFO [15:49:55.314] [bbotk] uhash
## INFO [15:49:55.314] [bbotk] 9f8882f22f1f49f4b6edc45135b619b4
## INFO [15:49:55.317] [bbotk] Evaluating 1 configuration(s)
## INFO [15:49:55.614] [bbotk] Result of batch 5:
## INFO [15:49:55.616] [bbotk] threshold.thresholds classif.ce warnings errors runtime_learners
## INFO [15:49:55.616] [bbotk] 0.1514 0.287 0 0 0.202
## INFO [15:49:55.616] [bbotk] uhash
## INFO [15:49:55.616] [bbotk] c8f908ae8c9949b5ab06650f7b4d80a3
## INFO [15:49:55.626] [bbotk] Finished optimizing after 5 evaluation(s)
## INFO [15:49:55.626] [bbotk] Result:
## INFO [15:49:55.628] [bbotk] threshold.thresholds learner_param_vals x_domain classif.ce
## INFO [15:49:55.628] [bbotk] 0.2789 <list[2]> <list[1]> 0.279
Inside the trafo
, we simply collect all set params into a named vector via map_dbl
and store it
in the threshold.thresholds
slot expected by the learner.
Again, we create a AutoTuner
, which automatically tunes the supplied learner over the ParamSet
we supplied above.
One drawback of this strategy is, that this requires us to fit a new model for each new threshold setting. While setting a threshold and computing performance is relatively cheap, fitting the learner is often more computationally demanding. A better strategy is therefore often to optimize the thresholds separately after each model fit.
8.7.7 PipeOpTunethreshold
PipeOpTuneThreshold
on the other hand works together with PipeOpLearnerCV
.
It directly optimizes the crossvalidated
predictions made by this PipeOp
.
This is done in order to avoid overfitting the threshold tuning.
A simple example would be:
gr = po("learner_cv", lrn("classif.rpart", predict_type = "prob")) %>>% po("tunethreshold")
l2 = as_learner(gr)
Note, that predict_type
= “prob” is required for po("tunethreshold")
to work.
Additionally, note that this time no threshold
parameter is exposed, it is automatically tuned internally.
l2$param_set
## <ParamSetCollection>
## id class lower upper nlevels
## 1: classif.rpart.resampling.method ParamFct NA NA 2
## 2: classif.rpart.resampling.folds ParamInt 2 Inf Inf
## 3: classif.rpart.resampling.keep_response ParamLgl NA NA 2
## 4: classif.rpart.cp ParamDbl 0 1 Inf
## 5: classif.rpart.keep_model ParamLgl NA NA 2
## 6: classif.rpart.maxcompete ParamInt 0 Inf Inf
## 7: classif.rpart.maxdepth ParamInt 1 30 30
## 8: classif.rpart.maxsurrogate ParamInt 0 Inf Inf
## 9: classif.rpart.minbucket ParamInt 1 Inf Inf
## 10: classif.rpart.minsplit ParamInt 1 Inf Inf
## 11: classif.rpart.surrogatestyle ParamInt 0 1 2
## 12: classif.rpart.usesurrogate ParamInt 0 2 3
## 13: classif.rpart.xval ParamInt 0 Inf Inf
## 14: classif.rpart.affect_columns ParamUty NA NA Inf
## 15: tunethreshold.measure ParamUty NA NA Inf
## 16: tunethreshold.optimizer ParamUty NA NA Inf
## 17: tunethreshold.log_level ParamUty NA NA Inf
## default value
## 1: <NoDefault[3]> cv
## 2: <NoDefault[3]> 3
## 3: <NoDefault[3]> FALSE
## 4: 0.01
## 5: FALSE
## 6: 4
## 7: 30
## 8: 5
## 9: <NoDefault[3]>
## 10: 20
## 11: 0
## 12: 2
## 13: 10 0
## 14: <Selector[1]>
## 15: <NoDefault[3]> classif.ce
## 16: <NoDefault[3]> gensa
## 17: <function[1]> warn
Note that we can set rsmp("intask")
as a resampling strategy for “learner_cv” in order to evaluate
predictions on the “training” data. This is generally not advised, as it might lead to overfitting
on the thresholds but can significantly reduce runtime.
For more information, see the post on Threshold Tuning on the mlr3 gallery.
8.8 Cluster Analysis
Cluster analysis is a type of unsupervised machine learning where the goal is to group data into clusters, where each cluster contains similar observations. The similarity is based on specified metrics that are task and application dependent. Cluster analysis is closely related to classification in a sense that each observation needs to be assigned to a cluster or a class. However, unlike classification problems where each observation is labeled, clustering works on data sets without true labels or class assignments.
The package mlr3cluster extends mlr3 with the following objects for cluster analysis:

TaskClust
to define clustering tasks 
LearnerClust
as base class for clustering learners 
PredictionClust
as specialized class forPrediction
objects 
MeasureClust
as specialized class for performance measures
Since clustering is a type of unsupervised learning, TaskClust
is slightly different from TaskRegr
and TaskClassif
objects.
More specifically:

truth()
function is missing because observations are not labeled. 
target
field is empty and will returncharacter(0)
if accessed anyway.
Additionally, LearnerClust
provides two extra fields that are absent from supervised learners:

assignments
returns cluster assignments for training data. It returnNULL
if accessed before training. 
save_assignments
is a boolean field that controls whether or not to store training set assignments in a learner.
Finally, PredictionClust
contains additional two fields:

partition
stores cluster partitions. 
prob
stores cluster probabilities for each observation.
8.8.1 Train and Predict
Clustering learners provide both train
and predict
methods.
The analysis typically consists of building clusters using all available data.
To be consistent with the rest of the library, we refer to this process as training.
Some learners can assign new observations to existing groups with predict
.
However, prediction does not always make sense, as it is the case for hierarchical clustering.
In hierarchical clustering, the goal is to build a hierarchy of nested clusters by either splitting large clusters into smaller ones or merging smaller clusters into bigger ones.
The final result is a tree or dendrogram which can change if a new data point is added.
For consistency with the rest of the ecosystem, mlr3cluster offers predict
method for hierarchical clusterers but it simply assigns all points to a specified number of clusters by cutting the resulting tree at a corresponding level.
Moreover, some learners estimate the probability of each observation belonging to a given cluster.
predict_types
field gives a list of prediction types for each learner.
After training, the model
field stores a learner’s model that looks different for each learner depending on the underlying library.
predict
returns a PredictionClust
object that gives a simplified view of the learned model.
If the data given to the predict
method is the same as the one on which the learner was trained, predict
simply returns cluster assignments for the “training” observations.
On the other hand, if the test set contains new data, predict
will estimate cluster assignments for that data set.
Some learners do not support estimating cluster partitions on new data and will instead return assignments for training data and print a warning message.
In the following example, a $k$means learner
is applied on the US arrest data set
.
The class labels are predicted and the contribution of the task features to assignment of the respective class are visualized.
library("mlr3")
library("mlr3cluster")
library("mlr3viz")
set.seed(1L)
# create an example task
task = tsk("usarrests")
print(task)
## <TaskClust:usarrests> (50 x 4): US Arrests
## * Target: 
## * Properties: 
## * Features (4):
##  int (2): Assault, UrbanPop
##  dbl (2): Murder, Rape
autoplot(task)
# create a kmeans learner
learner = lrn("clust.kmeans")
# assigning each observation to one of the two clusters (default in clust.kmeans)
learner$train(task)
learner$model
## Kmeans clustering with 2 clusters of sizes 21, 29
##
## Cluster means:
## Assault Murder Rape UrbanPop
## 1 255.0 11.857 28.11 67.62
## 2 109.8 4.841 16.25 64.03
##
## Clustering vector:
## [1] 1 1 1 1 1 1 2 1 1 1 2 2 1 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 2 2 1 1 1 2 2 2 2 2
## [39] 2 1 2 1 1 2 2 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 41637 54762
## (between_SS / total_SS = 72.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# make "predictions" for the same data
prediction = learner$predict(task)
autoplot(prediction, task)
8.8.2 Measures
The difference between supervised and unsupervised learning is that there is no ground truth data in unsupervised learning. In a supervised setting, such as classification, we would need to compare our predictions to true labels. Since clustering is an example of unsupervised learning, there are no true labels to which we can compare. However, we can still measure the quality of cluster assignments by quantifying how closely objects within the same cluster are related (cluster cohesion) as well as how distinct different clusters are from each other (cluster separation).
To assess the quality of clustering, there are a few builtin evaluation metrics available.
One of them is within sum of squares (WSS)
which calculates the sum of squared differences between observations and centroids.
WSS is useful because it quantifies cluster cohesion.
The range of this measure is \([0, \infty)\) where a smaller value means that clusters are more compact.
Another measure is silhouette quality index
that quantifies how well each point belongs to its assigned cluster versus neighboring cluster.
Silhouette values are in \([1, 1]\) range.
Points with silhouette closer to:
 1 are well clustered
 0 lie between two clusters
 1 likely placed in the wrong cluster
The following is an example of conducting a benchmark experiment with various learners on iris data set
without target variable and assessing the quality of each learner with both within sum of squares and silhouette measures.
design = benchmark_grid(
tasks = TaskClust$new("iris", iris[5]),
learners = list(
lrn("clust.kmeans", centers = 3L),
lrn("clust.pam", k = 2L),
lrn("clust.cmeans", centers = 3L)),
resamplings = rsmp("insample"))
print(design)
## task learner resampling
## 1: <TaskClust[46]> <LearnerClustKMeans[38]> <ResamplingInsample[20]>
## 2: <TaskClust[46]> <LearnerClustPAM[38]> <ResamplingInsample[20]>
## 3: <TaskClust[46]> <LearnerClustCMeans[38]> <ResamplingInsample[20]>
# execute benchmark
bmr = benchmark(design)
# define measure
measures = list(msr("clust.wss"), msr("clust.silhouette"))
bmr$aggregate(measures)
## nr resample_result task_id learner_id resampling_id iters clust.wss
## 1: 1 <ResampleResult[22]> iris clust.kmeans insample 1 78.85
## 2: 2 <ResampleResult[22]> iris clust.pam insample 1 153.33
## 3: 3 <ResampleResult[22]> iris clust.cmeans insample 1 79.03
## clust.silhouette
## 1: 0.5555
## 2: 0.7158
## 3: 0.5493
The experiment shows that using kmeans algorithm with three centers produces a better within sum of squares score than any other learner considered. However, pam (partitioning around medoids) learner with two clusters performs the best when considering silhouette measure which takes into the account both cluster cohesion and separation.
8.8.3 Visualization
Cluster analysis in mlr3 is integrated with mlr3viz which provides a number of useful plots. Some of those plots are shown below.
task = TaskClust$new("iris", iris[5])
learner = lrn("clust.kmeans")
learner$train(task)
prediction = learner$predict(task)
# performing PCA on task and showing assignments
autoplot(prediction, task, type = "pca")
# same as above but with probability ellipse that assumes normal distribution
autoplot(prediction, task, type = "pca", frame = TRUE, frame.type = "norm")
task = tsk("usarrests")
learner = lrn("clust.agnes")
learner$train(task)
# dendrogram for hierarchical clustering
autoplot(learner)
# advanced dendrogram options from `factoextra::fviz_dend`
autoplot(learner,
k = learner$param_set$values$k, rect_fill = TRUE,
rect = TRUE)
Silhouette plots can help to visually assess the quality of the analysis and help choose a number of clusters for a given data set. The red dotted line shows the mean silhouette value and each bar represents a data point. If most points in each cluster have an index around or higher than mean silhouette, the number of clusters is chosen well.
# silhouette plot allows to visually inspect the quality of clustering
task = TaskClust$new("iris", iris[5])
learner = lrn("clust.kmeans")
learner$param_set$values = list(centers = 5L)
learner$train(task)
prediction = learner$predict(task)
autoplot(prediction, task, type = "sil")
The plot shows that all points in cluster 5 and almost all points in clusters 4, 2 and 1 are below average silhouette index. This means that a lot of observations lie either on the border of clusters or are likely assigned to the wrong cluster.
learner = lrn("clust.kmeans")
learner$param_set$values = list(centers = 2L)
learner$train(task)
prediction = learner$predict(task)
autoplot(prediction, task, type = "sil")
Setting the number of centers to two improves both average silhouette score as well as overall quality of clustering because almost all points in cluster 1 are higher than and a lot of points in cluster 2 are close to mean silhouette. Hence, having two centers might be a better choice for the number of clusters.