8  Special Tasks

Author 1

Affiliation 1

Author 2

Affiliation 2

TODO (150-200 WORDS)

This chapter explores the different functions of mlr3 when dealing with specific data sets that require further statistical modification to undertake sensible analysis. Following topics are discussed:

Survival Analysis

This sub-chapter explains how to conduct sound survival analysis in mlr3. Survival analysis is used to monitor the period of time until a specific event takes places. This specific event could be, e.g., death, the transmission of a disease, marriage, or divorce. Two considerations are important when conducting survival analysis:

In summary, this sub-chapter explains how to account for these considerations and conduct survival analysis using the mlr3proba extension package.

Density Estimation

This sub-chapter explains how to conduct (unconditional) density estimation in mlr3. Density estimation is used to estimate the probability density function of a continuous variable. Unconditional density estimation is an unsupervised task, so there is no ‘value’ to predict. Instead, densities are estimated.

This sub-chapter explains how to estimate probability distributions for continuous variables using the mlr3proba extension package.

Spatiotemporal Analysis

Spatiotemporal analysis data observations entail reference information about spatial and temporal characteristics. One of the largest issues of spatiotemporal data analysis is the inevitable presence of auto-correlation in the data. Auto-correlation is especially severe in data with a marginal spatiotemporal variation. The sub-chapter on Spatiotemporal analysis provides instructions on how to account for spatiotemporal data.

Multilabel Classification

Multilabel classification deals with objects that can belong to more than one category at the same time. Numerous target labels are attributed to a single observation. Working with multilabel data requires one to use modified algorithms, to accommodate data specific characteristics. Two approaches to multilabel classification are prominently used:

Instructions on how to deal with multilabel classification inmlr3can be found in this sub-chapter.

Cost Sensitive Classification

This sub-chapter deals with the implementation of cost-sensitive classification. Regular classification aims to minimize the misclassification rate and thus all types of misclassification errors are deemed equally severe. Cost-sensitive classification is a setting where the costs caused by different kinds of errors are not assumed to be equal. The objective is to minimize the expected costs.

Analytical data for a big credit institution is used as a use case to illustrate the different features. Firstly, the sub-chapter provides guidance on how to implement a first model. Subsequently, the sub-chapter contains instructions on how to modify cost sensitivity measures, thresholding and threshold tuning.

Cluster Analysis

Cluster analysis aims to group data into clusters such that objects that are similar end up in the same cluster. Fundamentally, clustering and classification are similar. However, clustering is an unsupervised task because observations do not contain true labels while in classification, labels are needed in order to train a model.

This sub-chapter explains how to perform cluster analysis inmlr3with the help of mlr3cluster extension package.

Coming soon

In development we are also working on

Send us a message if you want to help out on any of these!

8.1 Survival Analysis


Survival analysis with {mlr3proba} is currently in a fragile state after its removal from CRAN. Hence most code examples listed in this page will not work for the time being. We are actively working on a solution!

Survival analysis is a sub-field of supervised machine learning in which the aim is to predict the survival distribution of a given individual. Arguably the main feature of survival analysis is that, unlike classification and regression, learners are trained on two features:

  1. the time until the event takes place
  2. the event type: either censoring or death.

At a particular time-point, an individual is either: alive, dead, or censored. Censoring occurs if it is unknown if an individual is alive or dead. For example, say we are interested in patients in hospital and every day it is recorded if they are alive or dead, then after a patient leaves, it is unknown if they are alive or dead. Hence they are censored. If there was no censoring, then ordinary regression analysis could be used instead. Furthermore, survival data contains solely positive values and therefore needs to be transformed to avoid biases.

Note that survival analysis accounts for both censored and uncensored observations while adjusting respective model parameters.

The package mlr3proba (Sonabend et al. 2021) extends mlr3 with the following objects for survival analysis:

For a good introduction to survival analysis see Modelling Survival Data in Medical Research (Collett 2014).

8.1.1 TaskSurv

Unlike TaskClassif and TaskRegr which have a single ‘target’ argument, TaskSurv mimics the survival::Surv object and has three to four target arguments (dependent on censoring type). A TaskSurv can be constructed with the function as_task_surv():


as_task_surv(survival::bladder2[, -1L], id = "interval_censored",
  time = "start", event = "event", time2 = "stop", type = "interval")

# type = "right" is default
task = as_task_surv(survival::rats, id = "right_censored",
  time = "time", event = "status", type = "right")


# the target column is a survival object:

# kaplan-meier plot
# library("mlr3viz")

8.1.2 Predict Types - crank, lp, and distr

Every PredictionSurv object can predict one or more of:

  • lp - Linear predictor calculated as the fitted coefficients multiplied by the test data.
  • distr - Predicted survival distribution, either discrete or continuous. Implemented in distr6.
  • crank - Continuous risk ranking.
  • response - Predicted survival time.

lp and crank can be used with measures of discrimination such as the concordance index. Whilst lp is a specific mathematical prediction, crank is any continuous ranking that identifies who is more or less likely to experience the event. So far the only implemented learner that only returns a continuous ranking is surv.svm. If a PredictionSurv returns an lp then the crank is identical to this. Otherwise crank is calculated as the expectation of the predicted survival distribution. Note that for linear proportional hazards models, the ranking (but not necessarily the crank score itself) given by lp and the expectation of distr, is identical.

The example below uses the rats task shipped with mlr3proba.

task = tsk("rats")
learn = lrn("surv.coxph")

train_set = sample(task$nrow, 0.8 * task$nrow)
test_set = setdiff(seq_len(task$nrow), train_set)

learn$train(task, row_ids = train_set)
prediction = learn$predict(task, row_ids = test_set)

<PredictionSurv> for 60 observations:
    row_ids time status      crank         lp     distr
          2   49   TRUE -0.5038971 -0.5038971 <list[1]>
          3  104  FALSE -0.5038971 -0.5038971 <list[1]>
         12  102  FALSE -3.4446695 -3.4446695 <list[1]>
        281   89  FALSE -2.5217340 -2.5217340 <list[1]>
        295  104  FALSE  1.3955682  1.3955682 <list[1]>
        300  102  FALSE -2.4602050 -2.4602050 <list[1]>

8.1.3 Composition

Finally we take a look at the PipeOps implemented in mlr3proba, which are used for composition of predict types. For example, a predict linear predictor does not have a lot of meaning by itself, but it can be composed into a survival distribution. See mlr3pipelines for full tutorials and details on PipeOps.

# PipeOpDistrCompositor - Train one model with a baseline distribution,
# (Kaplan-Meier or Nelson-Aalen), and another with a predicted linear predictor.
task = tsk("rats")
# remove the factor column for support with glmnet
task$select(c("litter", "rx"))
learner_lp = lrn("surv.glmnet")
learner_distr = lrn("surv.kaplan")
prediction_lp = learner_lp$train(task)$predict(task)
prediction_distr = learner_distr$train(task)$predict(task)

# Doesn't need training. Base = baseline distribution. ph = Proportional hazards.

pod = po("compose_distr", form = "ph", overwrite = FALSE)
prediction = pod$predict(list(base = prediction_distr, pred = prediction_lp))$output

# Now we have a predicted distr!


# This can all be simplified by using the distrcompose pipeline

glm.distr = ppl("distrcompositor", learner = lrn("surv.glmnet"),
  estimator = "kaplan", form = "ph", overwrite = FALSE, graph_learner = TRUE)

8.1.4 Benchmark Experiment

Finally, we conduct a small benchmark study on the rats task using some of the integrated survival learners:


task = tsk("rats")

# some integrated learners
learners = lrns(c("surv.coxph", "surv.kaplan", "surv.ranger"))
<LearnerSurvCoxPH:surv.coxph>: Cox Proportional Hazards
* Model: -
* Parameters: list()
* Packages: mlr3, mlr3proba, survival, distr6
* Predict Types:  crank, [distr], lp
* Feature Types: logical, integer, numeric, factor
* Properties: weights

<LearnerSurvKaplan:surv.kaplan>: Kaplan-Meier Estimator
* Model: -
* Parameters: list()
* Packages: mlr3, mlr3proba, survival, distr6
* Predict Types:  [crank], distr
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: missings

<LearnerSurvRanger:surv.ranger>: Random Forest
* Model: -
* Parameters: num.threads=1
* Packages: mlr3, mlr3proba, mlr3extralearners, ranger
* Predict Types:  crank, [distr]
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: importance, oob_error, weights
# Harrell's C-Index for survival
measure = msr("surv.cindex")
* Packages: mlr3, mlr3proba
* Range: [0, 1]
* Minimize: FALSE
* Average: macro
* Parameters: weight_meth=I, tiex=0.5, eps=0.001
* Properties: -
* Predict type: crank
* Return type: Score
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
   nr      resample_result task_id  learner_id resampling_id iters surv.cindex
1:  1 <ResampleResult[21]>    rats  surv.coxph            cv     3   0.7671307
2:  2 <ResampleResult[21]>    rats surv.kaplan            cv     3   0.5000000
3:  3 <ResampleResult[21]>    rats surv.ranger            cv     3   0.7799153
autoplot(bmr, measure = measure)

The experiment indicates that both the Cox PH and the random forest have better discrimination than the Kaplan-Meier baseline estimator, but that the machine learning random forest is not consistently better than the interpretable Cox PH.

8.2 Density Estimation


Survival analysis with {mlr3proba} is currently in a fragile state after its removal from CRAN. Hence most code examples listed in this page will not work for the time being. We are actively working on a solution!

Density estimation is the learning task to find the unknown distribution from which an i.i.d. data set is generated. We interpret this broadly, with this distribution not necessarily being continuous (so may possess a mass not density). The conditional case, where a distribution is predicted conditional on covariates, is known as ‘probabilistic supervised regression’, and will be implemented in mlr3proba in the near-future. Unconditional density estimation is viewed as an unsupervised task. For a good overview to density estimation see Density estimation for statistics and data analysis (Silverman 1986).

The package mlr3proba extends mlr3 with the following objects for density estimation:

In this example we demonstrate the basic functionality of the package on the faithful data from the datasets package. This task ships as pre-defined TaskDens with mlr3proba.


task = tsk("precip")
<TaskDens:precip> (70 x 1): Annual Precipitation
* Target: -
* Properties: -
* Features (1):
  - dbl (1): precip
# histogram and density plot
# autoplot(task, type = "overlay")

Unconditional density estimation is an unsupervised method. Hence, TaskDens is an unsupervised task which inherits directly from Task unlike TaskClassif and TaskRegr. However, TaskDens still has a target argument and a $truth field defined by:

  • target - the name of the variable in the data for which to estimate density
  • $truth - the values of the target column (which is not the true density, which is always unknown)

8.2.1 Train and Predict

Density learners have train and predict methods, though being unsupervised, ‘prediction’ is actually ‘estimation’. In training, a distr6 object is created, see here for full tutorials on how to access the probability density function, pdf, cumulative distribution function, cdf, and other important fields and methods. The predict method is simply a wrapper around self$model$pdf and if available self$model$cdf, i.e. evaluates the pdf/cdf at given points. Note that in prediction the points to evaluate the pdf and cdf are determined by the target column in the TaskDens object used for testing.

# create task and learner

task_faithful = TaskDens$new(id = "eruptions", backend = datasets::faithful$eruptions)
learner = lrn("dens.hist")

# train/test split
train_set = sample(task_faithful$nrow, 0.8 * task_faithful$nrow)
test_set = setdiff(seq_len(task_faithful$nrow), train_set)

# fitting KDE and model inspection
learner$train(task_faithful, row_ids = train_set)

[1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5

[1] 39 30  5  4 25 62 49  3

[1] 0.35944700 0.27649770 0.04608295 0.03686636 0.23041475 0.57142857 0.45161290
[8] 0.02764977

[1] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25

[1] "dat"

[1] TRUE

[1] "histogram"

[1] "dens.hist"
[1] "dens.hist"
# make predictions for new data
prediction = learner$predict(task_faithful, row_ids = test_set)

Every PredictionDens object can estimate:

  • pdf - probability density function

Some learners can estimate:

  • cdf - cumulative distribution function

8.2.2 Benchmark Experiment

Finally, we conduct a small benchmark study on the precip task using some of the integrated survival learners:

# some integrated learners
learners = lrns(c("dens.hist", "dens.kde"))
<LearnerDensHistogram:dens.hist>: Histogram Density Estimator
* Model: -
* Parameters: list()
* Packages: mlr3, mlr3proba, distr6
* Predict Types:  [pdf], cdf, distr
* Feature Types: integer, numeric
* Properties: -

<LearnerDensKDE:dens.kde>: Kernel Density Estimator
* Model: -
* Parameters: kernel=Epan, bandwidth=silver
* Packages: mlr3, mlr3proba, distr6
* Predict Types:  [pdf], distr
* Feature Types: integer, numeric
* Properties: missings
# Logloss for probabilistic predictions
measure = msr("dens.logloss")
<MeasureDensLogloss:dens.logloss>: Log Loss
* Packages: mlr3, mlr3proba
* Range: [0, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: eps=1e-15
* Properties: -
* Predict type: pdf
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
   nr      resample_result task_id learner_id resampling_id iters dens.logloss
1:  1 <ResampleResult[21]>  precip  dens.hist            cv     3     4.396138
2:  2 <ResampleResult[21]>  precip   dens.kde            cv     3     4.817715
autoplot(bmr, measure = measure)

The results of this experiment show that the sophisticated Penalized Density Estimator does not outperform the baseline histogram, but that the Kernel Density Estimator has at least consistently better (i.e. lower logloss) results.

8.3 Spatiotemporal Analysis

Data observations may entail reference information about spatial or temporal characteristics. Spatial information is stored as coordinates, usually named “x” and “y” or “lat”/“lon”. Treating spatiotemporal data using non-spatial data methods can lead to over-optimistic performance estimates. Hence, methods specifically designed to account for the special nature of spatiotemporal data are needed.

In the mlr3 framework, the following packages relate to this field:

The following (sub-)sections introduce the potential pitfalls of spatiotemporal data in machine learning and how to account for it. Note that not all functionality will be covered, and that some of the used packages are still in early lifecycles. If you want to contribute to one of the packages mentioned above, please contact Patrick Schratz.

8.3.1 Creating a spatial Task

To make use of spatial resampling methods, a {mlr3} task that is aware of its spatial characteristic needs to be created. Two Task child classes exist in {mlr3spatiotempcv} for this purpose:

  • TaskClassifST
  • TaskRegrST

To create one of these, you have multiple options:

  1. Use the constructor of the Task directly via $new() - this only works for data.table backends (!)
  2. Use the as_task_* converters (e.g. if your data is stored in an sf object)

We recommend the latter, as the as_task_* converters aim to make task construction easier, e.g., by creating the DataBackend (which is required to create a Task in {mlr3}) automatically and setting the crs and coordinate_names fields. Let’s assume your (point) data is stored in with an sf object, which is a common scenario for spatial analysis in R.

# create 'sf' object
data_sf = sf::st_as_sf(ecuador, coords = c("x", "y"), crs = 32717)

# create `TaskClassifST` from `sf` object
task = as_task_classif_st(data_sf, id = "ecuador_task", target = "slides", positive = "TRUE")

You can also use a plain data.frame. In this case, crs and coordinate_names need to be passed along explicitly as they cannot be inferred directly from the sf object:

task = as_task_classif_st(ecuador, id = "ecuador_task", target = "slides",
  positive = "TRUE", coordinate_names = c("x", "y"), crs = 32717)

The *ST task family prints a subset of the coordinates by default:

<TaskClassifST:ecuador_task> (751 x 11)
* Target: slides
* Properties: twoclass
* Features (10):
  - dbl (10): carea, cslope, dem, distdeforest, distroad,
    distslidespast, hcurv, log.carea, slope, vcurv
* Coordinates:
            x       y
  1: 712882.5 9560002
  2: 715232.5 9559582
  3: 715392.5 9560172
  4: 715042.5 9559312
  5: 715382.5 9560142
747: 714472.5 9558482
748: 713142.5 9560992
749: 713322.5 9560562
750: 715392.5 9557932
751: 713802.5 9560862

All *ST tasks can be treated as their super class equivalents TaskClassif or TaskRegr in subsequent {mlr3} modeling steps.

8.3.2 Autocorrelation

Data which includes spatial or temporal information requires special treatment in machine learning (similar to survival, ordinal and other task types listed in the special tasks chapter). In contrast to non-spatial/non-temporal data, observations inherit a natural grouping, either in space or time or in both space and time (Legendre 1993). This grouping causes observations to be autocorrelated, either in space (spatial autocorrelation (SAC)), time (temporal autocorrelation (TAC)) or both space and time (spatiotemporal autocorrelation (STAC)). For simplicity, the acronym STAC is used as a generic term in the following chapter for all the different characteristics introduced above.

What effects does STAC have in statistical/machine learning?

The overarching problem is that STAC violates the assumption that the observations in the train and test datasets are independent (Hastie, Friedman, and Tibshirani 2001). If this assumption is violated, the reliability of the resulting performance estimates, for example retrieved via cross-validation, is decreased. The magnitude of this decrease is linked to the magnitude of STAC in the dataset, which cannot be determined easily.

One approach to account for the existence of STAC is to use dedicated resampling methods. mlr3spatiotempcv provides access to the most frequently used spatiotemporal resampling methods. The following example showcases how a spatial dataset can be used to retrieve a bias-reduced performance estimate of a learner.

The following examples use the ecuador dataset created by Jannes Muenchow. It contains information on the occurrence of landslides (binary) in the Andes of Southern Ecuador. The landslides were mapped from aerial photos taken in 2000. The dataset is well suited to serve as an example because it it relatively small and of course due to the spatial nature of the observations. Please refer to Muenchow, Brenning, and Richter (2012) for a detailed description of the dataset.

To account for the spatial autocorrelation probably present in the landslide data, we will make use one of the most used spatial partitioning methods, a cluster-based k-means grouping (Brenning 2012), (spcv_coords in mlr3spatiotempcv). This method performs a clustering in 2D space which contrasts with the commonly used random partitioning for non-spatial data. The grouping has the effect that train and test data are more separated in space as they would be by conducting a random partitioning, thereby reducing the effect of STAC.

By contrast, when using the classical random partitioning approach with spatial data, train and test observations would be located side-by-side across the full study area (a visual example is provided further below). This leads to a high similarity between train and test sets, resulting in “better” but biased performance estimates in every fold of a CV compared to the spatial CV approach. However, these low error rates are mainly caused due to the STAC in the observations and the lack of appropriate partitioning methods and not by the power of the fitted model.

8.3.3 Spatiotemporal Cross-Validation and Partitioning

One part of spatiotemporal machine learning is dealing with the spatiotemporal components of the data during performance estimation. Performance is commonly estimated via cross-validation and mlr3spatiotempcv provides specialized resamplings methods for spatiotemporal data. The following chapters showcases how these methods can be applied and how they differ compared to non-spatial resampling methods, e.g. random partitioning. In addition, examples which show how resamplings with spatial information can be visualized using mlr3spatiotempcv.

Besides performance estimation, prediction on spatiotemporal data is another challenging task. See Section 8.3.6 for more information about how this topic is handled within the mlr3 framework. Spatial CV vs. Non-Spatial CV

In the following a spatial and non-spatial CV will be applied to showcase the mentioned performance differences.

The performance of a simple classification tree ("classif.rpart") is evaluated on a random partitioning (repeated_cv) with four folds and two repetitions. The chosen evaluation measure is “classification error” ("classif.ce").

For the spatial example, repeated_spcv_coords is chosen whereas repeated_cv represents the non-spatial example.


The selection of repeated_spcv_coords in this example is arbitrary. For your use case, you might want to use a different spatial partitioning method (but not necessarily!). Have a look at the “Getting Started” vignette of mlr3spatiotempcv to see all available methods and choose one which fits your data and its prediction purpose. Non-Spatial CV

In this example the ecuador example task is taken to estimate the performance of an rpart learner with fixed parameters on it.


In practice you usually might want to tune the hyperparameters of the learner in this case and apply a nested CV in which the inner loop is used for hyperparameter tuning.


# be less verbose

task = tsk("ecuador")

learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling_nsp = rsmp("repeated_cv", folds = 4, repeats = 2)
rr_nsp = resample(
  task = task, learner = learner,
  resampling = resampling_nsp)

rr_nsp$aggregate(measures = msr("classif.ce"))
 0.3388575 Spatial CV
task = tsk("ecuador")

learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling_sp = rsmp("repeated_spcv_coords", folds = 4, repeats = 2)
rr_sp = resample(
  task = task, learner = learner,
  resampling = resampling_sp)

rr_sp$aggregate(measures = msr("classif.ce"))

Here, the performance of the classification tree learner is around 7 percentage points worse when using Spatial Cross-Validation (SpCV) compared to Non-Spatial Cross-Validation (NSpCV). The resulting difference in performance is variable as it depends on the dataset, the magnitude of STAC and the learner itself. For algorithms with a higher tendency of overfitting to the training set, the difference between the two methods will be larger.

Now, what does it mean that the performance in the spatial case is worse? Should you ditch SpCV and keep using non-spatial partitioning? The answer is NO. The reason why the spatial partitioning scenario results in a lower predictive performance is because throughout the CV the model has been trained on data that is less similar than the test data compared against the non-spatial scenario. Or in other words: in the non-spatial scenario, train and test data are almost identical (due to spatial autocorrelation).

This means that the result from the SpCV setting is more close to the true predictive power of the model - whereas the result from non-spatial CV is overoptimistic and biased.


The result of the SpCV setting is by no means the absolute truth - it is also biased, but (most often) less compared to the non-spatial setting.

8.3.4 Visualization of Spatiotemporal Partitions

Every partitioning method in mlr3spatiotempcv comes with S3 methods for plot() and autoplot() to visualize the created groups. In a 2D space ggplot2 is used in the backgroudn while for spatiotemporal methods 3D visualizations via plotly are created.

The following examples shows how the resampling_sp object from the previous example can be visualized. In this case I want to look at the first four partitions of the first repetition. The point size is adjusted via argument size. After the plot creation, additional scale_* calls are used to adjust the coordinate labels on the x and y axes, respectively.

autoplot(resampling_sp, task, fold_id = c(1:4), size = 0.7) *
  ggplot2::scale_y_continuous(breaks = seq(-3.97, -4, -0.01)) *
  ggplot2::scale_x_continuous(breaks = seq(-79.06, -79.08, -0.02))


Note that setting the correct CRS for the given data is important which is done during task creation. Spatial offsets of up to multiple meters may occur if the wrong CRS is supplied initially.

This example used a built-in mlr3 task via tsk(). In practice however, one needs to create a spatiotemporal task via TaskClassifST/TaskRegrST and set the crs argument (unless a sf object is handed over).

mlr3spatiotempcv can also visualize non-spatial partitonings. This helps to visually compare differences. Let’s use the objects from the previous example again, this time resampling_nsp.

autoplot(resampling_nsp, task, fold_id = c(1:4), size = 0.7) *
  ggplot2::scale_y_continuous(breaks = seq(-3.97, -4, -0.01)) *
  ggplot2::scale_x_continuous(breaks = seq(-79.06, -79.08, -0.02))

The visualization show very well how close train and test observations are located next to each other. Spatial Block Visualization

This examples showcases another SpCV method: spcv_block. This method makes use of rectangular blocks to divide the study area into equally-sized parts. {mlr3spatiotempcv} has support for visualizing the created blocks and displaying their respective fold ID to get a better understanding how the final folds were composed out of the partitions. E.g. the “Fold 1 Repetition 1” plot shows that the test set is composed out of two “blocks” with the ID “1” in this case.


The use of range = 1000L is arbitrary here and should not be copy-pasted into a real application.

task = tsk("ecuador")
resampling = rsmp("spcv_block", range = 1000L)

# Visualize train/test splits of multiple folds
autoplot(resampling, task, size = 0.7,
  fold_id = c(1, 2), show_blocks = TRUE, show_labels = TRUE) *
  ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.02)) Spatiotemporal Visualization

When going spatiotemporal, two dimensions are not enough anymore. To visualize space and time together, a 3D solution is needed. {mlr3spatiotempcv} makes use of plotly for this purpose.

The following examples uses a modified version of the cookfarm_mlr3 task for showcasing reasons. By adjusting some levels, the individual partitions can be recognized very well.


In practice, you should not modify your data to achieve “good looking” visualizations as done in this example. This is only done for (visual) demonstration purposes.

In the following we use the sptcv_cstf method after Meyer et al. (2018). Only the temporal variable is used in this first example, denoted by setting the column role “time” to variable “Date”. This column role is then picked up by the resampling method. Last, autoplot() is called with an explicit definition of plot3D = TRUE. This is because sptcv_cstf can also be visualized in 2D (which only makes sense if the “space” column role is used for partitioning).

Last, sample_fold_n is used to take a stratified random sample from all partitions. The call to plotly::layout() only adjusts the default viewing angle of the plot - in interactive visualizations, this is not needed and the viewing angle can be adjusted with the mouse.


This is done to reduce the final plot size and keep things small for demos like this one. We don’t recommend doing this in actual studies - unless you prominently communicate this alongside the resulting plot.

data = cookfarm_mlr3
data$Date = sample(rep(c(
  "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01",
  "2020-05-01"), times = 1, each = 35768))
task_spt = as_task_regr_st(data,
  id = "cookfarm", target = "PHIHOX",
  coordinate_names = c("x", "y"), coords_as_features = FALSE,
  crs = 26911)
task_spt$set_col_roles("Date", roles = "time")

rsmp_cstf_time = rsmp("sptcv_cstf", folds = 5)

plot = autoplot(rsmp_cstf_time,
  fold_id = 5, task = task_spt, plot3D = TRUE,
  sample_fold_n = 3000L
plotly::layout(plot, scene = list(camera = list(eye = list(z = 0.58))))

If both space and time are used for partitioning in sptcv_cstf, the visualization becomes even more powerful as it allows to also show the observations which are omitted, i.e., not being used in either train and test sets for a specific fold.

task_spt$set_col_roles("SOURCEID", roles = "space")
task_spt$set_col_roles("Date", roles = "time")

rsmp_cstf_space_time = rsmp("sptcv_cstf", folds = 5)

plot = autoplot(rsmp_cstf_space_time,
  fold_id = 4, task = task_spt, plot3D = TRUE,
  show_omitted = TRUE, sample_fold_n = 3000L)

plotly::layout(plot, scene = list(camera =
list(eye = list(z = 0.58, x = -1.4, y = 1.6))))

Combining multiple spatiotemporal plots with plotly::layout() is possible but somewhat cumbersome. First, a list of plots containing the individuals plots must be created. These plots can then be passed to plotly::subplot(). This return is then passed to plotly::layout().

pl = autoplot(rsmp_cstf_space_time, task = task_spt,
  fold_id = c(1, 2, 3, 4), point_size = 3,
  axis_label_fontsize = 10, plot3D = TRUE,
  sample_fold_n = 3000L, show_omitted = TRUE

# Warnings can be ignored
pl_subplot = plotly::subplot(pl)

  title = "Individual Folds",
  scene = list(
    domain = list(x = c(0, 0.5), y = c(0.5, 1)),
    aspectmode = "cube",
    camera = list(eye = list(z = 0.20, x = -1.4, y = 1.6))
  scene2 = list(
    domain = list(x = c(0.5, 1), y = c(0.5, 1)),
    aspectmode = "cube",
    camera = list(eye = list(z = 0.1, x = -1.4, y = 1.6))
  scene3 = list(
    domain = list(x = c(0, 0.5), y = c(0, 0.5)),
    aspectmode = "cube",
    camera = list(eye = list(z = 0.1, x = -1.4, y = 1.6))
  scene4 = list(
    domain = list(x = c(0.5, 1), y = c(0, 0.5)),
    aspectmode = "cube",
    camera = list(eye = list(z = 0.58, x = -1.4, y = 1.6))

Unfortunately, titles in subplots cannot be created dynamically. However, there is a manual workaround via annotations show in this RPubs post.

8.3.5 Choosing a Resampling Method

While the example in this section made use of the spcv_coords method, this should by no means infer that this method is the best or only method suitable for this task. Even though this method is quite popular, it was mainly chosen because of the clear visual grouping differences when being applied on the ecuador task when compared to random partitioning.

In fact, most often multiple spatial partitioning methods can be used for a dataset. It is recommended (required) that users familiarize themselves with each implemented method and decide which method to choose based on the specific characteristics of the dataset. For almost all methods implemented in mlr3spatiotempcv, there is a scientific publication describing the strengths and weaknesses of the respective approach (either linked in the help file of mlr3spatiotempcv or its respective dependency packages).


In the example above, a cross-validation without hyperparameter tuning was shown. If a nested CV is desired, it is recommended to use the same spatial partitioning method for the inner loop (= tuning level). See Schratz et al. (2019) for more details and chapter 11 of Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019).


A list of all implemented methods in mlr3spatiotempcv can be found in the Getting Started vignette of the package.

If you want to learn even more about the field of spatial partitioning, STAC and the problems associated with it, the works of Prof. Hanna Meyer and Prof. Alexander Brenning are very much recommended for further reference.

8.3.6 Spatial Prediction

Support for (parallel) spatial prediction with terra, raster, stars and sf objects is available via mlr3spatial.

mlr3spatial has two main scopes:

  • Provide DataBackends for spatial objects (vector and raster data)
  • Simplify spatial prediction tasks

Overall it aims to reduce the roundtripping between spatial objects -> data.frame / data.table -> spatial object during modeling. Spatial Data Backends

A common scenario is the existence of spatial information in (point) vector data. mlr3spatial provides as_task_* helpers to directly convert these into a spatiotemporal (ST) task:


# load sample points
leipzig_vector = sf::read_sf(system.file("extdata",
  "leipzig_points.gpkg", package = "mlr3spatial"),
  stringsAsFactors = TRUE)

# create land cover task
tsk_leipzig = as_task_classif_st(leipzig_vector, target = "land_cover")
<TaskClassifST:leipzig_vector> (97 x 9)
* Target: land_cover
* Properties: multiclass
* Features (8):
  - dbl (8): b02, b03, b04, b06, b07, b08, b11, ndvi
* Coordinates:
           X       Y
 1: 732480.1 5693957
 2: 732217.4 5692769
 3: 732737.2 5692469
 4: 733169.3 5692777
 5: 732202.2 5692644
93: 733018.7 5692342
94: 732551.4 5692887
95: 732520.4 5692589
96: 732542.2 5692204
97: 732437.8 5692300

This saves users from stripping the coordinates from the vector data or transforming the sf object to a data.frame or data.table in the first place.

mlr3spatial adds support for the following spatial data classes:

  • stars (from package stars)
  • SpatRaster (from package terra)
  • RasterLayer (from package raster)
  • RasterStack (from package raster) Spatial prediction

The goal of spatial prediction is usually a raster image that covers a specific area. Most often this area is quite large and contains millions of pixels. The goal is to predict on each pixel and return a raster image containing these predictions.

To save users from having to go the extra mile of extracting the final model and using it with the respective predict() function of the spatial R package of their desired outcome class, mlr3spatial provides predict_spatial(). predict_spatial() let’s users

  • choose the output class of the resulting raster image
  • optionally predict in parallel via the future package, using a parallel backend of their choice

To use a raster image for prediction, it must be wrapped into a TaskUnsupervised for internal mlr3 reasons. Next, the learner can be used to predict on the task.

leipzig_raster = terra::rast(system.file("extdata", "leipzig_raster.tif",
  package = "mlr3spatial"))
tsk_predict = as_task_unsupervised(leipzig_raster)

lrn = lrn("classif.ranger")

# plan("multisession") # optional parallelization
pred = predict_spatial(tsk_predict, lrn, format = "terra")
[1] "SpatRaster"
[1] "terra"

The resulting object can be visualized and treated as any other terra object. Thanks to the improved handling of factor variables in terra, the raster image contains the correct labeled classes from the prediction right away.

library(terra, exclude = "resample")
plot(pred, col = c("#440154FF", "#443A83FF", "#31688EFF",
  "#21908CFF", "#35B779FF", "#8FD744FF", "#FDE725FF"))

If you are interested in the performance of parallel spatial prediction, have a look at the Spatial Prediction Benchmark vignette - the resulting plot is shown below.

The results of the shown benchmark results should be taken with care as something seems to go wrong when using all-core parallelization with terra::predict(). Also both relative and absolute numbers will vary on your local machine and results might change depending on eventual package updates in the future.

8.4 Cost-Sensitive Classification

In regular classification, the aim is to minimize the misclassification rate, and thus all types of misclassification errors are deemed equally severe. A more general setting is cost-sensitive classification. Cost-sensitive classification does not assume that the costs caused by different kinds of errors are equal. The objective of cost-sensitive classification is to minimize the expected costs.

Imagine you are an analyst for a big credit institution. Let’s also assume that a correct bank decision would result in 35% of the profit at the end of a specific period. A correct decision means that the bank predicts that a customer will pay their bills (hence would obtain a loan), and the customer indeed has good credit. On the other hand, a wrong decision means that the bank predicts that the customer’s credit is in good standing, but the opposite is true. This would result in a loss of 100% of the given loan.

Good Customer (truth) Bad Customer (truth)
Good Customer (predicted) + 0.35 - 1.0
Bad Customer (predicted) 0 0

Expressed as costs (instead of profit), we can write down the cost-matrix as follows:

costs = matrix(c(-0.35, 0, 1, 0), nrow = 2)
dimnames(costs) = list(response = c("good", "bad"), truth = c("good", "bad"))
response  good bad
    good -0.35   1
    bad   0.00   0

An exemplary data set for such a problem is the German Credit task:

task = tsk("german_credit")

good  bad 
 700  300 

The data has 70% of customers who can pay back their credit and 30% of bad customers who default on their debt. A manager, who doesn’t have any model, could decide to give either everybody credit or to give nobody credit. The resulting costs for the German credit data are:

# nobody:
(700 * costs[2, 1] + 300 * costs[2, 2]) / 1000
[1] 0
# everybody
(700 * costs[1, 1] + 300 * costs[1, 2]) / 1000
[1] 0.055

If the average loan is $20,000, the credit institute will lose more than one million dollars if it would grant everybody a credit:

# average profit * average loan * number of customers
0.055 * 20000 * 1000
[1] 1100000

Our goal is to find a model that minimizes the costs (and maximizes the expected profit).

8.4.1 A First Model

For our first model, we choose an ordinary logistic regression (implemented in add-on package mlr3learners). We first create a classification task, then resample the model using 10-fold cross-validation and extract the resulting confusion matrix:

learner = lrn("classif.log_reg")
rr = resample(task, learner, rsmp("cv"))

confusion = rr$prediction()$confusion
response good bad
    good  606 154
    bad    94 146

To calculate the average costs like above, we can simply multiply the elements of the confusion matrix with the elements of the previously introduced cost matrix and sum the values of the resulting matrix:

avg_costs = sum(confusion * costs) / 1000
[1] -0.0581

With an average loan of $20,000, the logistic regression yields the following costs:

avg_costs * 20000 * 1000
[1] -1162000

Instead of losing over $1,000,000, the credit institute now can expect a profit of more than $1,000,000.

8.4.2 Cost-sensitive Measure

Our natural next step would be to further improve the modeling step in order to maximize the profit. For this purpose, we first create a cost-sensitive classification measure that calculates the costs based on our cost matrix. This allows us to conveniently quantify and compare modeling decisions. Fortunately, there already is a predefined measure Measure for this purpose: MeasureClassifCosts. The costs have to be provided as a numeric matrix whose columns and rows are named with class labels (just like the previously constructed costs matrix):

cost_measure = msr("classif.costs", costs = costs)
<MeasureClassifCosts:classif.costs>: Cost-sensitive Classification
* Packages: mlr3
* Range: [-Inf, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: normalize=TRUE
* Properties: -
* Predict type: response

If we now call resample() or benchmark(), the cost-sensitive measures will be evaluated. We compare the logistic regression to a simple featureless learner and to a random forest from package ranger :

learners = list(
cv10 = rsmp("cv", folds = 10)
bmr = benchmark(benchmark_grid(task, learners, cv10))
autoplot(bmr, measure = cost_measure) + ggplot2::geom_hline(yintercept = 0, colour = "red")

As expected, the featureless learner is performing comparably badly. The logistic regression and the random forest both yield a profit on average.

8.4.3 Thresholding

Although we now correctly evaluate the models in a cost-sensitive fashion, the models themselves are unaware of the classification costs. They assume the same costs for both wrong classification decisions (false positives and false negatives). Some learners natively support cost-sensitive classification (e.g., XXX). However, we will concentrate on a more generic approach that works for all models which can predict probabilities for class labels: thresholding.

Most learners can calculate the probability \(p\) for the positive class. If \(p\) exceeds the threshold \(0.5\), they predict the positive class and the negative class otherwise.

For our binary classification case of the credit data, we primarily want to minimize the errors where the model predicts “good”, but truth is “bad” (i.e., the number of false positives) as this is the more expensive error. If we now increase the threshold to values \(> 0.5\), we reduce the number of false negatives. Note that we increase the number of false positives simultaneously, or, in other words, we are trading false positives for false negatives.

# fit models with probability prediction
learner = lrn("classif.log_reg", predict_type = "prob")
rr = resample(task, learner, rsmp("cv"))
p = rr$prediction()
<PredictionClassif> for 1000 observations:
    row_ids truth response  prob.good  prob.bad
          2   bad      bad 0.40126720 0.5987328
         17  good     good 0.97406360 0.0259364
         46  good     good 0.74870171 0.2512983
        973   bad      bad 0.09917846 0.9008215
        976  good     good 0.78358973 0.2164103
        997  good     good 0.51492948 0.4850705
# helper function to try different threshold values interactively
with_threshold = function(p, th) {
  list(confusion = p$confusion, costs = p$score(measures = cost_measure))

with_threshold(p, 0.5)
response good bad
    good  598 152
    bad   102 148

with_threshold(p, 0.75)
response good bad
    good  462  75
    bad   238 225

with_threshold(p, 1.0)
response good bad
    good    0   0
    bad   700 300


There is also an autoplot() method which systematically varies the threshold between 0 and 1 and calculates the corresponding scores:

autoplot(p, type = "threshold", measure = cost_measure)

Instead of manually or visually searching for good values, the base R function optimize() can do the job for us:

# simple wrapper function that takes a threshold and returns the resulting model performance
# this wrapper is passed to optimize() to find its minimum for thresholds in [0.5, 1]
f = function(th) {
  with_threshold(p, th)$costs
best = optimize(f, c(0.5, 1))
[1] 0.6905289

# optimized confusion matrix:
with_threshold(p, best$minimum)$confusion
response good bad
    good  515  91
    bad   185 209

Note that the function "optimize()" is intended for unimodal functions and, therefore, may converge to a local optimum here. See the next section for better alternatives to tune the threshold.

8.4.4 Threshold Tuning

Currently mlr3pipelines offers two main strategies towards adjusting classification thresholds. We can either expose the thresholds as a hyperparameter of the Learner by using PipeOpThreshold. This allows us to tune the thresholds via an outside optimizer from mlr3tuning. Alternatively, we can also use PipeOpTuneThreshold which automatically tunes the threshold after each learner is fit. Both methods are described in the following subsections.

8.4.5 PipeOpThreshold

PipeOpThreshold can be put directly after a Learner.

A simple example would be:

gr = lrn("classif.rpart", predict_type = "prob") %>>% po("threshold")
l = as_learner(gr)

Note, that predict_type = "prob" is required for po("threshold") to have any effect.

The thresholds are now exposed as a hyperparameter of the GraphLearner we created:

                              id    class lower upper nlevels        default
 1:             classif.rpart.cp ParamDbl     0     1     Inf           0.01
 2:     classif.rpart.keep_model ParamLgl    NA    NA       2          FALSE
 3:     classif.rpart.maxcompete ParamInt     0   Inf     Inf              4
 4:       classif.rpart.maxdepth ParamInt     1    30      30             30
 5:   classif.rpart.maxsurrogate ParamInt     0   Inf     Inf              5
 6:      classif.rpart.minbucket ParamInt     1   Inf     Inf <NoDefault[3]>
 7:       classif.rpart.minsplit ParamInt     1   Inf     Inf             20
 8: classif.rpart.surrogatestyle ParamInt     0     1       2              0
 9:   classif.rpart.usesurrogate ParamInt     0     2       3              2
10:           classif.rpart.xval ParamInt     0   Inf     Inf             10
11:         threshold.thresholds ParamUty    NA    NA     Inf <NoDefault[3]>
1 variable not shown: [value]

We can now tune those thresholds from the outside as follows:

Before tuning, we have to define which hyperparameters we want to tune over. In this example, we only tune over the thresholds parameter of the threshold pipe operator. You can easily imagine that we can also jointly tune over additional hyperparameters, i.e., rpart’s cp parameter.

As the Task we aim to optimize for is a binary task, we can simply specify the threshold param:

ps = ps(threshold.thresholds = p_dbl(lower = 0, upper = 1))

We now create a AutoTuner which automatically tunes the supplied learner over the ParamSet we supplied above.

at = AutoTuner$new(
  learner = l,
  resampling = rsmp("cv", folds = 3L),
  measure = msr("classif.ce"),
  search_space = ps,
  terminator = trm("evals", n_evals = 5L),
  tuner = tnr("random_search")


Inside the trafo, we simply collect all set params into a named vector via map_dbl and store it in the threshold.thresholds slot expected by the learner.

Again, we create a AutoTuner, which automatically tunes the supplied learner over the ParamSet we supplied above.

One drawback of this strategy is that this requires us to fit a new model for each new threshold setting. While setting a threshold and computing performance is relatively cheap, fitting the learner is often more computationally demanding. An often better strategy is, therefore, to optimize the thresholds separately after each model fit.

8.4.6 PipeOpTunethreshold

PipeOpTuneThreshold on the other hand works together with PipeOpLearnerCV. It directly optimizes the cross-validated predictions made by this PipeOp. This is necessary to avoid over-fitting the threshold tuning.

A simple example would be:

gr = po("learner_cv", lrn("classif.rpart", predict_type = "prob")) %>>% po("tunethreshold")
l2 = as_learner(gr)

Note, that predict_type = “prob” is required for po("tunethreshold") to work. Additionally, note that this time no threshold parameter is exposed, and it is automatically tuned internally.

                                        id    class lower upper nlevels
 1:        classif.rpart.resampling.method ParamFct    NA    NA       2
 2:         classif.rpart.resampling.folds ParamInt     2   Inf     Inf
 3: classif.rpart.resampling.keep_response ParamLgl    NA    NA       2
 4:                       classif.rpart.cp ParamDbl     0     1     Inf
 5:               classif.rpart.keep_model ParamLgl    NA    NA       2
 6:               classif.rpart.maxcompete ParamInt     0   Inf     Inf
 7:                 classif.rpart.maxdepth ParamInt     1    30      30
 8:             classif.rpart.maxsurrogate ParamInt     0   Inf     Inf
 9:                classif.rpart.minbucket ParamInt     1   Inf     Inf
10:                 classif.rpart.minsplit ParamInt     1   Inf     Inf
11:           classif.rpart.surrogatestyle ParamInt     0     1       2
12:             classif.rpart.usesurrogate ParamInt     0     2       3
13:                     classif.rpart.xval ParamInt     0   Inf     Inf
14:           classif.rpart.affect_columns ParamUty    NA    NA     Inf
15:                  tunethreshold.measure ParamUty    NA    NA     Inf
16:                tunethreshold.optimizer ParamUty    NA    NA     Inf
17:                tunethreshold.log_level ParamUty    NA    NA     Inf
2 variables not shown: [default, value]

Note that we can set rsmp("intask") as a resampling strategy for “learner_cv” in order to evaluate predictions on the “training” data. This is generally not advised, as it might lead to over-fitting on the thresholds but can significantly reduce runtime.

For more information, see the post on Threshold Tuning on the mlr3 gallery.

8.5 Cluster Analysis

Cluster analysis is a type of unsupervised machine learning where the goal is to group data into clusters, where each cluster contains similar observations. The similarity is based on specified metrics that are task and application-dependent. Cluster analysis is closely related to classification in the sense that each observation needs to be assigned to a cluster or a class. However, unlike classification problems where each observation is labeled, clustering works on data sets without true labels or class assignments.

The package mlr3cluster extends mlr3 with the following objects for cluster analysis:

Since clustering is a type of unsupervised learning, TaskClust is slightly different from TaskRegr and TaskClassif objects. More specifically:

  • truth() function is missing because observations are not labeled.
  • target field is empty and will return character(0) if accessed anyway.

Additionally, LearnerClust provides two extra fields that are absent from supervised learners:

  • assignments returns cluster assignments for training data. It returns NULL if accessed before training.
  • save_assignments is a boolean field that controls whether or not to store training set assignments in a learner.

Finally, PredictionClust contains two additional fields:

  • partition stores cluster partitions.
  • prob stores cluster probabilities for each observation.

8.5.1 Train and Predict

Clustering learners provide both train and predict methods. The analysis typically consists of building clusters using all available data. To be consistent with the rest of the library, we refer to this process as training.

Some learners can assign new observations to existing groups with predict. However, prediction does not always make sense, as is the case for hierarchical clustering. In hierarchical clustering, the goal is to build a hierarchy of nested clusters by either splitting large clusters into smaller ones or merging smaller clusters into bigger ones. The final result is a tree or dendrogram which can change if a new data point is added. For consistency with the rest of the ecosystem, mlr3cluster offers predict method for hierarchical clusterers but it simply assigns all points to a specified number of clusters by cutting the resulting tree at a corresponding level. Moreover, some learners estimate the probability of each observation belonging to a given cluster. predict_types field gives a list of prediction types for each learner.

After training, the model field stores a learner’s model that looks different for each learner depending on the underlying library. predict returns a PredictionClust object that gives a simplified view of the learned model. If the data given to the predict method is the same as the one on which the learner was trained, predict simply returns cluster assignments for the “training” observations. On the other hand, if the test set contains new data, predict will estimate cluster assignments for that data set. Some learners do not support estimating cluster partitions on new data and will instead return assignments for training data and print a warning message.

In the following example, a $k$-means learner is applied on the US arrest data set. The class labels are predicted and the contribution of the task features to assignment of the respective class are visualized.


# create an example task
task = tsk("usarrests")
<TaskClust:usarrests> (50 x 4): US Arrests
* Target: -
* Properties: -
* Features (4):
  - int (2): Assault, UrbanPop
  - dbl (2): Murder, Rape

# create a k-means learner
learner = lrn("clust.kmeans")

# assigning each observation to one of the two clusters (default in clust.kmeans)
K-means clustering with 2 clusters of sizes 21, 29

Cluster means:
   Assault    Murder     Rape UrbanPop
1 255.0000 11.857143 28.11429 67.61905
2 109.7586  4.841379 16.24828 64.03448

Clustering vector:
 [1] 1 1 1 1 1 1 2 1 1 1 2 2 1 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 2 2 1 1 1 2 2 2 2 2
[39] 2 1 2 1 1 2 2 2 2 2 2 2

Within cluster sum of squares by cluster:
[1] 41636.73 54762.30
 (between_SS / total_SS =  72.9 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
# make "predictions" for the same data
prediction = learner$predict(task)
autoplot(prediction, task)

8.5.2 Measures

The difference between supervised and unsupervised learning is that there is no ground truth data in unsupervised learning. In a supervised setting, such as classification, we would need to compare our predictions to true labels. Since clustering is an example of unsupervised learning, there are no true labels to which we can compare. However, we can still measure the quality of cluster assignments by quantifying how closely objects within the same cluster are related (cluster cohesion) as well as how distinct different clusters are from each other (cluster separation).

There are a few built-in evaluation metrics available to assess the quality of clustering. One of them is within sum of squares (WSS) which calculates the sum of squared differences between observations and centroids. WSS is useful because it quantifies cluster cohesion. The range of this measure is \([0, \infty)\) where a smaller value means that clusters are more compact.

Another measure is silhouette quality index that quantifies how well each point belongs to its assigned cluster versus neighboring cluster. Silhouette values are in \([-1, 1]\) range.

Points with silhouette closer to:

  • 1 are well clustered
  • 0 lie between two clusters
  • -1 likely placed in the wrong cluster

The following is an example of conducting a benchmark experiment with various learners on iris data set without target variable and assessing the quality of each learner with both within sum of squares and silhouette measures.

design = benchmark_grid(
  tasks = TaskClust$new("iris", iris[-5]),
  learners = list(
    lrn("clust.kmeans", centers = 3L),
    lrn("clust.pam", k = 2L),
    lrn("clust.cmeans", centers = 3L)),
  resamplings = rsmp("insample"))
              task                  learner               resampling
1: <TaskClust[46]> <LearnerClustKMeans[38]> <ResamplingInsample[20]>
2: <TaskClust[46]>    <LearnerClustPAM[38]> <ResamplingInsample[20]>
3: <TaskClust[46]> <LearnerClustCMeans[38]> <ResamplingInsample[20]>
# execute benchmark
bmr = benchmark(design)

# define measure
measures = list(msr("clust.wss"), msr("clust.silhouette"))
   nr      resample_result task_id   learner_id resampling_id iters clust.wss
1:  1 <ResampleResult[21]>    iris clust.kmeans      insample     1  78.85144
2:  2 <ResampleResult[21]>    iris    clust.pam      insample     1 153.32572
3:  3 <ResampleResult[21]>    iris clust.cmeans      insample     1  79.02617
1 variable not shown: [clust.silhouette]

The experiment shows that using k-means algorithm with three centers produces a better within sum of squares score than any other learner considered. However, pam (partitioning around medoids) learner with two clusters performs the best when considering silhouette measure which takes into the account both cluster cohesion and separation.

8.5.3 Visualization

Cluster analysis in mlr3 is integrated with mlr3viz which provides a number of useful plots. Some of those plots are shown below.

task = TaskClust$new("iris", iris[-5])
learner = lrn("clust.kmeans")
prediction = learner$predict(task)

# performing PCA on task and showing assignments
autoplot(prediction, task, type = "pca")

# same as above but with probability ellipse that assumes normal distribution
autoplot(prediction, task, type = "pca", frame = TRUE, frame.type = "norm")

task = tsk("usarrests")
learner = lrn("clust.hclust")

# dendrogram for hierarchical clustering
autoplot(learner, task = task)

Silhouette plots can help to visually assess the quality of the analysis and help choose a number of clusters for a given data set. The red dotted line shows the mean silhouette value and each bar represents a data point. If most points in each cluster have an index around or higher than mean silhouette, the number of clusters is chosen well.

# silhouette plot allows to visually inspect the quality of clustering
task = TaskClust$new("iris", iris[-5])
learner = lrn("clust.kmeans")
learner$param_set$values = list(centers = 5L)
prediction = learner$predict(task)
autoplot(prediction, task, type = "sil")

The plot shows that all points in cluster 5 and almost all points in clusters 4, 2 and 1 are below average silhouette index. This means that a lot of observations lie either on the border of clusters or are likely assigned to the wrong cluster.

learner = lrn("clust.kmeans")
learner$param_set$values = list(centers = 2L)
prediction = learner$predict(task)
autoplot(prediction, task, type = "sil")

Setting the number of centers to two improves both the average silhouette score as well as the overall quality of clustering because almost all points in cluster 1 are higher than, and a lot of points in cluster 2 are close to the mean silhouette. Hence, having two centers might be a better choice for the number of clusters.