8 Special Tasks
This chapter explores the different functions of mlr3 when dealing with specific data sets that require further statistical modification to undertake sensible analysis. Following topics are discussed:
Survival Analysis
This subchapter explains how to conduct sound survival analysis in mlr3. Survival analysis is used to monitor the period of time until a specific event takes places. This specific event could be e.g. death, transmission of a disease, marriage or divorce. Two considerations are important when conducting survival analysis:
 Whether the event occurred within the frame of the given data
 How much time it took until the event occurred
In summary, this subchapter explains how to account for these considerations and conduct survival analysis using the mlr3proba
extension package.
Density Estimation
This subchapter explains how to conduct (unconditional) density estimation in mlr3. Density estimation is used to estimate the probability density function of a continuous variable. Unconditional density estimation is an unsupervised task so there is no ‘value’ to predict, instead densities are estimated.
This subchapter explains how to estimate probability distributions for continuous variables using the mlr3proba
extension package.
Spatiotemporal Analysis
Spatiotemporal analysis data observations entail reference information about spatial and temporal characteristics. One of the largest issues of spatiotemporal data analysis is the inevitable presence of autocorrelation in the data. Autocorrelation is especially severe in data with marginal spatiotemporal variation. The subchapter on Spatiotemporal analysis provides instructions on how to account for spatiotemporal data.
Ordinal Analysis
This is work in progress. See mlr3ordinal for the current state.
Functional Analysis
Functional analysis contains data that consists of curves varying over a continuum e.g. time, frequency or wavelength. This type of analysis is frequently used when examining measurements over various time points. Steps on how to accommodate functional data structures inmlr3are explained in the functional analysis subchapter.
Multilabel Classification
Multilabel classification deals with objects that can belong to more than one category at the same time. Numerous target labels are attributed to a single observation. Working with multilabel data requires one to use modified algorithms, to accommodate data specific characteristics. Two approaches to multilabel classification are prominently used:
 The problem transformation method
 The algorithm adaption method
Instructions on how to deal with multilabel classification inmlr3can be found in this subchapter.
Cost Sensitive Classification
This subchapter deals with the implementation of costsensitive classification. Regular classification aims to minimize the misclassification rate and thus all types of misclassification errors are deemed equally severe. Costsensitive classification is a setting where the costs caused by different kinds of errors are not assumed to be equal. The objective is to minimize the expected costs.
Analytical data for a big credit institution is used as a use case to illustrate the different features. Firstly, the subchapter provides guidance on how to implement a first model. Subsequently, the subchapter contains instructions on how to modify cost sensitivity measures, thresholding and threshold tuning.
Cluster Analysis
Cluster analysis aims to group data into clusters such that objects that are similar end up in the same cluster. Fundamentally, clustering and classification are similar. However, clustering is an unsupervised task because observations do not contain true labels while in classification, labels are needed in order to train a model.
This subchapter explains how to perform cluster analysis inmlr3with the help of mlr3cluster
extension package.
8.1 Survival Analysis
Survival analysis is a subfield of supervised machine learning in which the aim is to predict the survival distribution of a given individual. Arguably the main feature of survival analysis is that unlike classification and regression, learners are trained on two features:
 the time until the event takes place
 the event type: either censoring or death.
At a particular timepoint, an individual is either: alive, dead, or censored. Censoring occurs if it is unknown if an individual is alive or dead. For example, say we are interested in patients in hospital and every day it is recorded if they are alive or dead, then after a patient leaves it is unknown if they are alive or dead, hence they are censored. If there was no censoring, then ordinary regression analysis could be used instead. Furthermore, survival data contains solely positive values and therefore needs to be transformed to avoid biases.
Note that survival analysis accounts for both censored and uncensored observations while adjusting respective model parameters.
The package mlr3proba (Sonabend et al. 2021) extends mlr3 with the following objects for survival analysis:

TaskSurv
to define (censored) survival tasks 
LearnerSurv
as base class for survival learners 
PredictionSurv
as specialized class forPrediction
objects 
MeasureSurv
as specialized class for performance measures
For a good introduction to survival analysis see Modelling Survival Data in Medical Research (Collett 2014).
8.1.1 TaskSurv
Unlike TaskClassif
and TaskRegr
which have a single ‘target’ argument, TaskSurv
mimics the survival::Surv
object and has three to four target arguments (dependent on censoring type). A TaskSurv
can be constructed with the function as_task_surv()
:
library("mlr3")
library("mlr3proba")
library("survival")
as_task_surv(survival::bladder2[, 1L], id = "interval_censored",
time = "start", event = "event", time2 = "stop", type = "interval")
# type = "right" is default
task = as_task_surv(survival::rats, id = "right_censored",
time = "time", event = "status", type = "right")
print(task)
# the target column is a survival object:
head(task$truth())
# kaplanmeier plot
# library("mlr3viz")
autoplot(task)
8.1.2 Predict Types  crank, lp, and distr
Every PredictionSurv
object can predict one or more of:

lp
 Linear predictor calculated as the fitted coefficients multiplied by the test data. 
distr
 Predicted survival distribution, either discrete or continuous. Implemented in distr6. 
crank
 Continuous risk ranking. 
response
 Predicted survival time.
lp
and crank
can be used with measures of discrimination such as the concordance index. Whilst lp
is a specific mathematical prediction, crank
is any continuous ranking that identifies who is more or less likely to experience the event. So far the only implemented learner that only returns a continuous ranking is surv.svm
. If a PredictionSurv
returns an lp
then the crank
is identical to this. Otherwise crank
is calculated as the expectation of the predicted survival distribution. Note that for linear proportional hazards models, the ranking (but not necessarily the crank
score itself) given by lp
and the expectation of distr
, is identical.
The example below uses the rats
task shipped with mlr3proba.
task = tsk("rats")
learn = lrn("surv.coxph")
train_set = sample(task$nrow, 0.8 * task$nrow)
test_set = setdiff(seq_len(task$nrow), train_set)
learn$train(task, row_ids = train_set)
prediction = learn$predict(task, row_ids = test_set)
print(prediction)
<PredictionSurv> for 60 observations:
row_ids time status crank lp distr
2 49 TRUE 0.5038971 0.5038971 <list[1]>
3 104 FALSE 0.5038971 0.5038971 <list[1]>
12 102 FALSE 3.4446695 3.4446695 <list[1]>

281 89 FALSE 2.5217340 2.5217340 <list[1]>
295 104 FALSE 1.3955682 1.3955682 <list[1]>
300 102 FALSE 2.4602050 2.4602050 <list[1]>
8.1.3 Composition
Finally we take a look at the PipeOp
s implemented in mlr3proba, which are used for composition of predict types. For example, a predict linear predictor does not have a lot of meaning by itself, but it can be composed into a survival distribution. See mlr3pipelines for full tutorials and details on PipeOp
s.
library("mlr3pipelines")
library("mlr3learners")
# PipeOpDistrCompositor  Train one model with a baseline distribution,
# (KaplanMeier or NelsonAalen), and another with a predicted linear predictor.
task = tsk("rats")
# remove the factor column for support with glmnet
task$select(c("litter", "rx"))
learner_lp = lrn("surv.glmnet")
learner_distr = lrn("surv.kaplan")
prediction_lp = learner_lp$train(task)$predict(task)
prediction_distr = learner_distr$train(task)$predict(task)
prediction_lp$distr
# Doesn't need training. Base = baseline distribution. ph = Proportional hazards.
pod = po("compose_distr", form = "ph", overwrite = FALSE)
prediction = pod$predict(list(base = prediction_distr, pred = prediction_lp))$output
# Now we have a predicted distr!
prediction$distr
# This can all be simplified by using the distrcompose pipeline
glm.distr = ppl("distrcompositor", learner = lrn("surv.glmnet"),
estimator = "kaplan", form = "ph", overwrite = FALSE, graph_learner = TRUE)
glm.distr$train(task)$predict(task)
8.1.4 Benchmark Experiment
Finally, we conduct a small benchmark study on the rats
task using some of the integrated survival learners:
library("mlr3learners")
task = tsk("rats")
# some integrated learners
learners = lrns(c("surv.coxph", "surv.kaplan", "surv.ranger"))
print(learners)
$surv.coxph
<LearnerSurvCoxPH:surv.coxph>: Cox Proportional Hazards
* Model: 
* Parameters: list()
* Packages: mlr3, mlr3proba, survival, distr6
* Predict Types: crank, [distr], lp
* Feature Types: logical, integer, numeric, factor
* Properties: weights
$surv.kaplan
<LearnerSurvKaplan:surv.kaplan>: KaplanMeier Estimator
* Model: 
* Parameters: list()
* Packages: mlr3, mlr3proba, survival, distr6
* Predict Types: [crank], distr
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: missings
$surv.ranger
<LearnerSurvRanger:surv.ranger>: Random Forest
* Model: 
* Parameters: num.threads=1
* Packages: mlr3, mlr3proba, mlr3extralearners, ranger
* Predict Types: crank, [distr]
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: importance, oob_error, weights
<MeasureSurvCindex:surv.cindex>
* Packages: mlr3, mlr3proba
* Range: [0, 1]
* Minimize: FALSE
* Average: macro
* Parameters: weight_meth=I, tiex=0.5
* Properties: 
* Predict type: crank
* Return type: Score
set.seed(1)
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measure)
nr resample_result task_id learner_id resampling_id iters surv.cindex
1: 1 <ResampleResult[21]> rats surv.coxph cv 3 0.7671307
2: 2 <ResampleResult[21]> rats surv.kaplan cv 3 0.5000000
3: 3 <ResampleResult[21]> rats surv.ranger cv 3 0.7799153
autoplot(bmr, measure = measure)
The experiment indicates that both the Cox PH and the random forest have better discrimination than the KaplanMeier baseline estimator, but that the machine learning random forest is not consistently better than the interpretable Cox PH.
8.2 Density Estimation
Density estimation is the learning task to find the unknown distribution from which an i.i.d. data set is generated. We interpret this broadly, with this distribution not necessarily being continuous (so may possess a mass not density). The conditional case, where a distribution is predicted conditional on covariates, is known as ‘probabilistic supervised regression’, and will be implemented in mlr3proba in the nearfuture. Unconditional density estimation is viewed as an unsupervised task. For a good overview to density estimation see Density estimation for statistics and data analysis (Silverman 1986).
The package mlr3proba extends mlr3 with the following objects for density estimation:

TaskDens
to define density tasks 
LearnerDens
as base class for density estimators 
PredictionDens
as specialized class forPrediction
objects 
MeasureDens
as specialized class for performance measures
In this example we demonstrate the basic functionality of the package on the faithful
data from the datasets package. This task ships as predefined TaskDens
with mlr3proba.
<TaskDens:precip> (70 x 1): Annual Precipitation
* Target: 
* Properties: 
* Features (1):
 dbl (1): precip
Unconditional density estimation is an unsupervised method. Hence, TaskDens
is an unsupervised task which inherits directly from Task
unlike TaskClassif
and TaskRegr
. However, TaskDens
still has a target
argument and a $truth
field defined by:

target
 the name of the variable in the data for which to estimate density 
$truth
 the values of thetarget
column (which is not the true density, which is always unknown)
8.2.1 Train and Predict
Density learners have train
and predict
methods, though being unsupervised, ‘prediction’ is actually ‘estimation’. In training, a distr6 object is created, see here for full tutorials on how to access the probability density function, pdf
, cumulative distribution function, cdf
, and other important fields and methods. The predict method is simply a wrapper around self$model$pdf
and if available self$model$cdf
, i.e. evaluates the pdf/cdf at given points. Note that in prediction the points to evaluate the pdf and cdf are determined by the target
column in the TaskDens
object used for testing.
# create task and learner
task_faithful = TaskDens$new(id = "eruptions", backend = datasets::faithful$eruptions)
learner = lrn("dens.hist")
# train/test split
train_set = sample(task_faithful$nrow, 0.8 * task_faithful$nrow)
test_set = setdiff(seq_len(task_faithful$nrow), train_set)
# fitting KDE and model inspection
learner$train(task_faithful, row_ids = train_set)
learner$model
$distr
Histogram()
$hist
$breaks
[1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
$counts
[1] 39 30 5 4 25 62 49 3
$density
[1] 0.35944700 0.27649770 0.04608295 0.03686636 0.23041475 0.57142857 0.45161290
[8] 0.02764977
$mids
[1] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25
$xname
[1] "dat"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
attr(,"class")
[1] "dens.hist"
class(learner$model)
[1] "dens.hist"
# make predictions for new data
prediction = learner$predict(task_faithful, row_ids = test_set)
Every PredictionDens
object can estimate:

pdf
 probability density function
Some learners can estimate:

cdf
 cumulative distribution function
8.2.2 Benchmark Experiment
Finally, we conduct a small benchmark study on the precip
task using some of the integrated survival learners:
$dens.hist
<LearnerDensHistogram:dens.hist>: Histogram Density Estimator
* Model: 
* Parameters: list()
* Packages: mlr3, mlr3proba, distr6
* Predict Types: [pdf], cdf, distr
* Feature Types: integer, numeric
* Properties: 
$dens.kde
<LearnerDensKDE:dens.kde>: Kernel Density Estimator
* Model: 
* Parameters: kernel=Epan, bandwidth=silver
* Packages: mlr3, mlr3proba, distr6
* Predict Types: [pdf], distr
* Feature Types: integer, numeric
* Properties: missings
<MeasureDensLogloss:dens.logloss>: Log Loss
* Packages: mlr3, mlr3proba
* Range: [0, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: eps=1e15
* Properties: 
* Predict type: pdf
set.seed(1)
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measure)
nr resample_result task_id learner_id resampling_id iters dens.logloss
1: 1 <ResampleResult[21]> precip dens.hist cv 3 4.396138
2: 2 <ResampleResult[21]> precip dens.kde cv 3 4.817715
autoplot(bmr, measure = measure)
The results of this experiment show that the sophisticated Penalized Density Estimator does not outperform the baseline histogram, but that the Kernel Density Estimator has at least consistently better (i.e. lower logloss) results.
8.3 Spatiotemporal Analysis
Author: Patrick Schratz
Data observations may entail reference information about spatial or temporal characteristics. Spatial information is stored as coordinates, usually named “x” and “y” or “lat”/“lon”. Treating spatiotemporal data using nonspatial data methods can lead to overoptimistic performance estimates. Hence, methods specifically designed to account for the special nature of spatiotemporal data are needed.
In the mlr3 framework, the following packages relate to this field:
 mlr3spatiotemporal (biasedreduced performance estimation)
 mlr3temporal (timeseries support)
 mlr3spatial (spatial prediction support)
The following (sub)sections introduce the potential pitfalls of spatiotemporal data in machine learning and how to account for it. Note that not all functionality will be covered, and that some of the used packages are still in early lifecycles. If you want to contribute to one of the packages mentioned above, please contact Patrick Schratz.
8.3.1 Creating a spatial Task
To make use of spatial resampling methods, a {mlr3} task that is aware of its spatial characteristic needs to be created. Two Task
child classes exist in {mlr3spatiotempcv} for this purpose:
TaskClassifST
TaskRegrST
To create one of these, you have multiple options:
 Use the constructor of the
Task
directly via$new()
 this only works for data.table backends (!)  Use the
as_task_*
converters (e.g. if your data is stored in ansf
object)
We recommend the latter, as the as_task_*
converters aim to make task construction easier, e.g., by creating the DataBackend
(which is required to create a Task in {mlr3}) automatically and setting the crs
and coordinate_names
fields. Let’s assume your (point) data is stored in with an sf
object, which is a common scenario for spatial analysis in R.
You can also use a plain data.frame
. In this case, crs
and coordinate_names
need to be passed along explicitly as they cannot be inferred directly from the sf
object:
task = as_task_classif_st(ecuador, id = "ecuador_task", target = "slides",
positive = "TRUE", coordinate_names = c("x", "y"), crs = 32717)
The *ST
task family prints a subset of the coordinates by default:
print(task)
<TaskClassifST:ecuador_task> (751 x 11)
* Target: slides
* Properties: twoclass
* Features (10):
 dbl (10): carea, cslope, dem, distdeforest, distroad,
distslidespast, hcurv, log.carea, slope, vcurv
* Coordinates:
x y
1: 712882.5 9560002
2: 715232.5 9559582
3: 715392.5 9560172
4: 715042.5 9559312
5: 715382.5 9560142

747: 714472.5 9558482
748: 713142.5 9560992
749: 713322.5 9560562
750: 715392.5 9557932
751: 713802.5 9560862
All *ST
tasks can be treated as their super class equivalents TaskClassif
or TaskRegr
in subsequent {mlr3} modeling steps.
8.3.2 Autocorrelation
Data which includes spatial or temporal information requires special treatment in machine learning (similar to survival, ordinal and other task types listed in the special tasks chapter). In contrast to nonspatial/nontemporal data, observations inherit a natural grouping, either in space or time or in both space and time (Legendre 1993). This grouping causes observations to be autocorrelated, either in space (spatial autocorrelation (SAC)), time (temporal autocorrelation (TAC)) or both space and time (spatiotemporal autocorrelation (STAC)). For simplicity, the acronym STAC is used as a generic term in the following chapter for all the different characteristics introduced above.
What effects does STAC have in statistical/machine learning?
The overarching problem is that STAC violates the assumption that the observations in the train and test datasets are independent (Hastie, Friedman, and Tibshirani 2001). If this assumption is violated, the reliability of the resulting performance estimates, for example retrieved via crossvalidation, is decreased. The magnitude of this decrease is linked to the magnitude of STAC in the dataset, which cannot be determined easily.
One approach to account for the existence of STAC is to use dedicated resampling methods. mlr3spatiotemporal provides access to the most frequently used spatiotemporal resampling methods. The following example showcases how a spatial dataset can be used to retrieve a biasreduced performance estimate of a learner.
The following examples use the ecuador
dataset created by Jannes Muenchow. It contains information on the occurrence of landslides (binary) in the Andes of Southern Ecuador. The landslides were mapped from aerial photos taken in 2000. The dataset is well suited to serve as an example because it it relatively small and of course due to the spatial nature of the observations. Please refer to Muenchow, Brenning, and Richter (2012) for a detailed description of the dataset.
To account for the spatial autocorrelation probably present in the landslide data, we will make use one of the most used spatial partitioning methods, a clusterbased kmeans grouping (Brenning 2012), (spcv_coords
in mlr3spatiotemporal). This method performs a clustering in 2D space which contrasts with the commonly used random partitioning for nonspatial data. The grouping has the effect that train and test data are more separated in space as they would be by conducting a random partitioning, thereby reducing the effect of STAC.
By contrast, when using the classical random partitioning approach with spatial data, train and test observations would be located sidebyside across the full study area (a visual example is provided further below). This leads to a high similarity between train and test sets, resulting in “better” but biased performance estimates in every fold of a CV compared to the spatial CV approach. However, these low error rates are mainly caused due to the STAC in the observations and the lack of appropriate partitioning methods and not by the power of the fitted model.
8.3.3 Spatiotemporal CrossValidation and Partitioning
One part of spatiotemporal machine learning is dealing with the spatiotemporal components of the data during performance estimation. Performance is commonly estimated via crossvalidation and mlr3spatiotemporal provides specialized resamplings methods for spatiotemporal data. The following chapters showcases how these methods can be applied and how they differ compared to nonspatial resampling methods, e.g. random partitioning. In addition, examples which show how resamplings with spatial information can be visualized using mlr3spatiotemporal.
Besides performance estimation, prediction on spatiotemporal data is another challenging task. See Section 8.3.6 for more information about how this topic is handled within the mlr3 framework.
8.3.3.1 Spatial CV vs. NonSpatial CV
In the following a spatial and nonspatial CV will be applied to showcase the mentioned performance differences.
The performance of a simple classification tree ("classif.rpart"
) is evaluated on a random partitioning (repeated_cv
) with four folds and two repetitions. The chosen evaluation measure is “classification error” ("classif.ce"
).
For the spatial example, repeated_spcv_coords
is chosen whereas repeated_cv
represents the nonspatial example.
8.3.3.1.1 NonSpatial CV
In this example the ecuador
example task is taken to estimate the performance of an rpart
learner with fixed parameters on it.
library("mlr3")
library("mlr3spatiotempcv")
set.seed(42)
# be less verbose
lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")
task = tsk("ecuador")
learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling_nsp = rsmp("repeated_cv", folds = 4, repeats = 2)
rr_nsp = resample(
task = task, learner = learner,
resampling = resampling_nsp)
rr_nsp$aggregate(measures = msr("classif.ce"))
classif.ce
0.3388575
8.3.3.1.2 Spatial CV
task = tsk("ecuador")
learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling_sp = rsmp("repeated_spcv_coords", folds = 4, repeats = 2)
rr_sp = resample(
task = task, learner = learner,
resampling = resampling_sp)
rr_sp$aggregate(measures = msr("classif.ce"))
classif.ce
0.412529
Here, the performance of the classification tree learner is around 7 percentage points worse when using Spatial CrossValidation (SpCV) compared to NonSpatial CrossValidation (NSpCV). The resulting difference in performance is variable as it depends on the dataset, the magnitude of STAC and the learner itself. For algorithms with a higher tendency of overfitting to the training set, the difference between the two methods will be larger.
Now, what does it mean that the performance in the spatial case is worse? Should you ditch SpCV and keep using nonspatial partitioning? The answer is NO. The reason why the spatial partitioning scenario results in a lower predictive performance is because throughout the CV the model has been trained on data that is less similar than the test data compared against the nonspatial scenario. Or in other words: in the nonspatial scenario, train and test data are almost identical (due to spatial autocorrelation).
This means that the result from the SpCV setting is more close to the true predictive power of the model  whereas the result from nonspatial CV is overoptimistic and biased.
8.3.4 Visualization of Spatiotemporal Partitions
Every partitioning method in mlr3spatiotemporal comes with S3 methods for plot()
and autoplot()
to visualize the created groups. In a 2D space ggplot2 is used in the backgroudn while for spatiotemporal methods 3D visualizations via plotly are created.
The following examples shows how the resampling_sp
object from the previous example can be visualized. In this case I want to look at the first four partitions of the first repetition. The point size is adjusted via argument size
. After the plot creation, additional scale_*
calls are used to adjust the coordinate labels on the x and y axes, respectively.
autoplot(resampling_sp, task, fold_id = c(1:4), size = 0.7) *
ggplot2::scale_y_continuous(breaks = seq(3.97, 4, 0.01)) *
ggplot2::scale_x_continuous(breaks = seq(79.06, 79.08, 0.02))
This example used a builtin mlr3 task via tsk()
. In practice however, one needs to create a spatiotemporal task via TaskClassifST
/TaskRegrST
and set the crs
argument (unless a sf
object is handed over).
mlr3spatiotemporal can also visualize nonspatial partitonings. This helps to visually compare differences. Let’s use the objects from the previous example again, this time resampling_nsp
.
autoplot(resampling_nsp, task, fold_id = c(1:4), size = 0.7) *
ggplot2::scale_y_continuous(breaks = seq(3.97, 4, 0.01)) *
ggplot2::scale_x_continuous(breaks = seq(79.06, 79.08, 0.02))
The visualization show very well how close train and test observations are located next to each other.
8.3.4.1 Spatial Block Visualization
This examples showcases another SpCV method: spcv_block
. This method makes use of rectangular blocks to divide the study area into equallysized parts. {mlr3spatiotempcv} has support for visualizing the created blocks and displaying their respective fold ID to get a better understanding how the final folds were composed out of the partitions. E.g. the “Fold 1 Repetition 1” plot shows that the test set is composed out of two “blocks” with the ID “1” in this case.
8.3.4.2 Spatiotemporal Visualization
When going spatiotemporal, two dimensions are not enough anymore. To visualize space and time together, a 3D solution is needed. {mlr3spatiotempcv} makes use of plotly for this purpose.
The following examples uses a modified version of the cookfarm_mlr3
task for showcasing reasons. By adjusting some levels, the individual partitions can be recognized very well.
In the following we use the sptcv_cstf
method after Meyer et al. (2018). Only the temporal variable is used in this first example, denoted by setting the column role “time” to variable “Date”. This column role is then picked up by the resampling method. Last, autoplot()
is called with an explicit definition of plot3D = TRUE
. This is because sptcv_cstf
can also be visualized in 2D (which only makes sense if the “space” column role is used for partitioning).
Last, sample_fold_n
is used to take a stratified random sample from all partitions. The call to plotly::layout()
only adjusts the default viewing angle of the plot  in interactive visualizations, this is not needed and the viewing angle can be adjusted with the mouse.
data = cookfarm_mlr3
set.seed(42)
data$Date = sample(rep(c(
"20200101", "20200201", "20200301", "20200401",
"20200501"), times = 1, each = 35768))
task_spt = as_task_regr_st(data,
id = "cookfarm", target = "PHIHOX",
coordinate_names = c("x", "y"), coords_as_features = FALSE,
crs = 26911)
task_spt$set_col_roles("Date", roles = "time")
rsmp_cstf_time = rsmp("sptcv_cstf", folds = 5)
plot = autoplot(rsmp_cstf_time,
fold_id = 5, task = task_spt, plot3D = TRUE,
sample_fold_n = 3000L
)
plotly::layout(plot, scene = list(camera = list(eye = list(z = 0.58))))
If both space and time are used for partitioning in sptcv_cstf
, the visualization becomes even more powerful as it allows to also show the observations which are omitted, i.e., not being used in either train and test sets for a specific fold.
task_spt$set_col_roles("SOURCEID", roles = "space")
task_spt$set_col_roles("Date", roles = "time")
rsmp_cstf_space_time = rsmp("sptcv_cstf", folds = 5)
plot = autoplot(rsmp_cstf_space_time,
fold_id = 4, task = task_spt, plot3D = TRUE,
show_omitted = TRUE, sample_fold_n = 3000L)
plotly::layout(plot, scene = list(camera =
list(eye = list(z = 0.58, x = 1.4, y = 1.6))))
Combining multiple spatiotemporal plots with plotly::layout()
is possible but somewhat cumbersome. First, a list of plots containing the individuals plots must be created. These plots can then be passed to plotly::subplot()
. This return is then passed to plotly::layout()
.
pl = autoplot(rsmp_cstf_space_time, task = task_spt,
fold_id = c(1, 2, 3, 4), point_size = 3,
axis_label_fontsize = 10, plot3D = TRUE,
sample_fold_n = 3000L, show_omitted = TRUE
)
# Warnings can be ignored
pl_subplot = plotly::subplot(pl)
plotly::layout(pl_subplot,
title = "Individual Folds",
scene = list(
domain = list(x = c(0, 0.5), y = c(0.5, 1)),
aspectmode = "cube",
camera = list(eye = list(z = 0.20, x = 1.4, y = 1.6))
),
scene2 = list(
domain = list(x = c(0.5, 1), y = c(0.5, 1)),
aspectmode = "cube",
camera = list(eye = list(z = 0.1, x = 1.4, y = 1.6))
),
scene3 = list(
domain = list(x = c(0, 0.5), y = c(0, 0.5)),
aspectmode = "cube",
camera = list(eye = list(z = 0.1, x = 1.4, y = 1.6))
),
scene4 = list(
domain = list(x = c(0.5, 1), y = c(0, 0.5)),
aspectmode = "cube",
camera = list(eye = list(z = 0.58, x = 1.4, y = 1.6))
)
)
Unfortunately, titles in subplots cannot be created dynamically. However, there is a manual workaround via annotations show in this RPubs post.
8.3.5 Choosing a Resampling Method
While the example in this section made use of the spcv_coords
method, this should by no means infer that this method is the best or only method suitable for this task. Even though this method is quite popular, it was mainly chosen because of the clear visual grouping differences when being applied on the ecuador
task when compared to random partitioning.
In fact, most often multiple spatial partitioning methods can be used for a dataset. It is recommended (required) that users familiarize themselves with each implemented method and decide which method to choose based on the specific characteristics of the dataset. For almost all methods implemented in mlr3spatiotemporal, there is a scientific publication describing the strengths and weaknesses of the respective approach (either linked in the help file of mlr3spatiotemporal or its respective dependency packages).
If you want to learn even more about the field of spatial partitioning, STAC and the problems associated with it, the works of Prof. Hanna Meyer and Prof. Alexander Brenning are very much recommended for further reference.
8.3.6 Spatial Prediction
Support for (parallel) spatial prediction with terra, raster, stars and sf objects is available via mlr3spatial.
mlr3spatial has two main scopes:
 Provide DataBackends for spatial objects (vector and raster data)
 Simplify spatial prediction tasks
Overall it aims to reduce the roundtripping between spatial objects > data.frame / data.table > spatial object during modeling.
8.3.6.1 Spatial Data Backends
A common scenario is the existence of spatial information in (point) vector data. mlr3spatial provides as_task_*
helpers to directly convert these into a spatiotemporal (ST) task:
library(mlr3)
library(mlr3spatial)
library(sf)
# load sample points
leipzig_vector = sf::read_sf(system.file("extdata",
"leipzig_points.gpkg", package = "mlr3spatial"),
stringsAsFactors = TRUE)
# create land cover task
tsk_leipzig = as_task_classif_st(leipzig_vector, target = "land_cover")
tsk_leipzig
<TaskClassifST:leipzig_vector> (97 x 9)
* Target: land_cover
* Properties: multiclass
* Features (8):
 dbl (8): b02, b03, b04, b06, b07, b08, b11, ndvi
* Coordinates:
X Y
1: 732480.1 5693957
2: 732217.4 5692769
3: 732737.2 5692469
4: 733169.3 5692777
5: 732202.2 5692644

93: 733018.7 5692342
94: 732551.4 5692887
95: 732520.4 5692589
96: 732542.2 5692204
97: 732437.8 5692300
This saves users from stripping the coordinates from the vector data or transforming the sf
object to a data.frame
or data.table
in the first place.
mlr3spatial adds support for the following spatial data classes:
8.3.6.2 Spatial prediction
The goal of spatial prediction is usually a raster image that covers a specific area. Most often this area is quite large and contains millions of pixels. The goal is to predict on each pixel and return a raster image containing these predictions.
To save users from having to go the extra mile of extracting the final model and using it with the respective predict()
function of the spatial R package of their desired outcome class, mlr3spatial provides predict_spatial()
. predict_spatial()
let’s users
 choose the output class of the resulting raster image
 optionally predict in parallel via the future package, using a parallel backend of their choice
To use a raster image for prediction, it must be wrapped into a TaskUnsupervised
for internal mlr3 reasons. Next, the learner can be used to predict on the task.
leipzig_raster = terra::rast(system.file("extdata", "leipzig_raster.tif",
package = "mlr3spatial"))
tsk_predict = as_task_unsupervised(leipzig_raster)
lrn = lrn("classif.ranger")
lrn$train(tsk_leipzig)
# plan("multisession") # optional parallelization
pred = predict_spatial(tsk_predict, lrn, format = "terra")
class(pred)
[1] "SpatRaster"
attr(,"package")
[1] "terra"
The resulting object can be visualized and treated as any other terra
object. Thanks to the improved handling of factor variables in terra, the raster image contains the correct labeled classes from the prediction right away.
library(terra, exclude = "resample")
plot(pred, col = c("#440154FF", "#443A83FF", "#31688EFF",
"#21908CFF", "#35B779FF", "#8FD744FF", "#FDE725FF"))
If you are interested in the performance of parallel spatial prediction, have a look at the Spatial Prediction Benchmark vignette  the resulting plot is shown below.
The results of the shown benchmark results should be taken with care as something seems to go wrong when using allcore parallelization with terra::predict()
. Also both relative and absolute numbers will vary on your local machine and results might change depending on eventual package updates in the future.
8.4 Ordinal Analysis
This is work in progress. See mlr3ordinal for the current state of the implementation.
8.5 Functional Analysis
Functional data is data containing an ordering on the dimensions. This implies that functional data consists of curves varying over a continuum, such as time, frequency, or wavelength.
8.5.1 How to model functional data?
There are two ways to model functional data:
 Modification of the learner, so that the learner is suitable for the functional data
 Modification of the task, so that the task matches the standard or classificationlearner
The development has not started yet, we are looking for contributors. Open an issue in mlr3fda if you are interested!
8.6 Multilabel Classification
Multilabel classification deals with objects that can belong to more than one category at the same time.
The development has not started yet, we are looking for contributes. Open an issue in mlr3multioutput if you are interested!
8.7 CostSensitive Classification
In regular classification the aim is to minimize the misclassification rate and thus all types of misclassification errors are deemed equally severe. A more general setting is costsensitive classification. Cost sensitive classification does not assume that the costs caused by different kinds of errors are equal. The objective of cost sensitive classification is to minimize the expected costs.
Imagine you are an analyst for a big credit institution. Let’s also assume that a correct decision of the bank would result in 35% of the profit at the end of a specific period. A correct decision means that the bank predicts that a customer will pay their bills (hence would obtain a loan), and the customer indeed has good credit. On the other hand, a wrong decision means that the bank predicts that the customer’s credit is in good standing, but the opposite is true. This would result in a loss of 100% of the given loan.
Good Customer (truth)  Bad Customer (truth)  

Good Customer (predicted)  + 0.35   1.0 
Bad Customer (predicted)  0  0 
Expressed as costs (instead of profit), we can write down the costmatrix as follows:
costs = matrix(c(0.35, 0, 1, 0), nrow = 2)
dimnames(costs) = list(response = c("good", "bad"), truth = c("good", "bad"))
print(costs)
truth
response good bad
good 0.35 1
bad 0.00 0
An exemplary data set for such a problem is the German Credit
task:
The data has 70% customers who are able to pay back their credit, and 30% bad customers who default on the debt. A manager, who doesn’t have any model, could decide to give either everybody a credit or to give nobody a credit. The resulting costs for the German credit data are:
# nobody:
(700 * costs[2, 1] + 300 * costs[2, 2]) / 1000
[1] 0
# everybody
(700 * costs[1, 1] + 300 * costs[1, 2]) / 1000
[1] 0.055
If the average loan is $20,000, the credit institute would lose more than one million dollar if it would grant everybody a credit:
# average profit * average loan * number of customers
0.055 * 20000 * 1000
[1] 1100000
Our goal is to find a model which minimizes the costs (and thereby maximizes the expected profit).
8.7.1 A First Model
For our first model, we choose an ordinary logistic regression (implemented in addon package mlr3learners). We first create a classification task, then resample the model using a 10fold cross validation and extract the resulting confusion matrix:
library("mlr3learners")
learner = lrn("classif.log_reg")
rr = resample(task, learner, rsmp("cv"))
confusion = rr$prediction()$confusion
print(confusion)
truth
response good bad
good 606 154
bad 94 146
To calculate the average costs like above, we can simply multiply the elements of the confusion matrix with the elements of the previously introduced cost matrix, and sum the values of the resulting matrix:
With an average loan of $20,000, the logistic regression yields the following costs:
avg_costs * 20000 * 1000
[1] 1162000
Instead of losing over $1,000,000, the credit institute now can expect a profit of more than $1,000,000.
8.7.2 Costsensitive Measure
Our natural next step would be to further improve the modeling step in order to maximize the profit. For this purpose we first create a costsensitive classification measure which calculates the costs based on our cost matrix. This allows us to conveniently quantify and compare modeling decisions. Fortunately, there already is a predefined measure Measure
for this purpose: MeasureClassifCosts
:
<MeasureClassifCosts:classif.costs>: Costsensitive Classification
* Packages: mlr3
* Range: [Inf, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: normalize=TRUE
* Properties: 
* Predict type: response
If we now call resample()
or benchmark()
, the costsensitive measures will be evaluated. We compare the logistic regression to a simple featureless learner and to a random forest from package ranger :
learners = list(
lrn("classif.log_reg"),
lrn("classif.featureless"),
lrn("classif.ranger")
)
cv3 = rsmp("cv", folds = 3)
bmr = benchmark(benchmark_grid(task, learners, cv3))
bmr$aggregate(cost_measure)
nr resample_result task_id learner_id resampling_id
1: 1 <ResampleResult[21]> german_credit classif.log_reg cv
2: 2 <ResampleResult[21]> german_credit classif.featureless cv
3: 3 <ResampleResult[21]> german_credit classif.ranger cv
2 variables not shown: [iters, classif.costs]
As expected, the featureless learner is performing comparably bad. The logistic regression and the random forest work equally well.
8.7.3 Thresholding
Although we now correctly evaluate the models in a costsensitive fashion, the models themselves are unaware of the classification costs. They assume the same costs for both wrong classification decisions (false positives and false negatives). Some learners natively support costsensitive classification (e.g., XXX). However, we will concentrate on a more generic approach which works for all models which can predict probabilities for class labels: thresholding.
Most learners can calculate the probability \(p\) for the positive class. If \(p\) exceeds the threshold \(0.5\), they predict the positive class, and the negative class otherwise.
For our binary classification case of the credit data, the we primarily want to minimize the errors where the model predicts “good”, but truth is “bad” (i.e., the number of false positives) as this is the more expensive error. If we now increase the threshold to values \(> 0.5\), we reduce the number of false negatives. Note that we increase the number of false positives simultaneously, or, in other words, we are trading false positives for false negatives.
# fit models with probability prediction
learner = lrn("classif.log_reg", predict_type = "prob")
rr = resample(task, learner, rsmp("cv"))
p = rr$prediction()
print(p)
<PredictionClassif> for 1000 observations:
row_ids truth response prob.good prob.bad
2 bad bad 0.40126720 0.5987328
17 good good 0.97406360 0.0259364
46 good good 0.74870171 0.2512983

973 bad bad 0.09917846 0.9008215
976 good good 0.78358973 0.2164103
997 good good 0.51492948 0.4850705
# helper function to try different threshold values interactively
with_threshold = function(p, th) {
p$set_threshold(th)
list(confusion = p$confusion, costs = p$score(measures = cost_measure, task = task))
}
with_threshold(p, 0.5)
$confusion
truth
response good bad
good 598 152
bad 102 148
$costs
classif.costs
0.0573
with_threshold(p, 0.75)
$confusion
truth
response good bad
good 462 75
bad 238 225
$costs
classif.costs
0.0867
with_threshold(p, 1.0)
$confusion
truth
response good bad
good 0 0
bad 700 300
$costs
classif.costs
0
# TODO: include plot of threshold vs performance
Instead of manually trying different threshold values, one uses use optimize()
to find a good threshold value w.r.t. our performance measure:
# simple wrapper function which takes a threshold and returns the resulting model performance
# this wrapper is passed to optimize() to find its minimum for thresholds in [0.5, 1]
f = function(th) {
with_threshold(p, th)$costs
}
best = optimize(f, c(0.5, 1))
print(best)
$minimum
[1] 0.6905289
$objective
classif.costs
0.08925
# optimized confusion matrix:
with_threshold(p, best$minimum)$confusion
truth
response good bad
good 515 91
bad 185 209
Note that the function "optimize()"
is intended for unimodal functions and therefore may converge to a local optimum here. See below for better alternatives to find good threshold values.
8.7.4 Threshold Tuning
Before we start, we have load all required packages:
Loading required package: paradox
8.7.5 Adjusting thresholds: Two strategies
Currently mlr3pipelines offers two main strategies towards adjusting classification thresholds
. We can either expose the thresholds as a hyperparameter
of the Learner by using PipeOpThreshold
. This allows us to tune the thresholds
via an outside optimizer from mlr3tuning.
Alternatively, we can also use PipeOpTuneThreshold
which automatically tunes the threshold after each learner is fit.
In this blogpost, we’ll go through both strategies.
8.7.6 PipeOpThreshold
PipeOpThreshold
can be put directly after a Learner
.
A simple example would be:
gr = lrn("classif.rpart", predict_type = "prob") %>>% po("threshold")
l = as_learner(gr)
Note, that predict_type = "prob"
is required for po("threshold")
to have any effect.
The thresholds
are now exposed as a hyperparameter
of the GraphLearner
we created:
l$param_set
<ParamSetCollection>
id class lower upper nlevels default
1: classif.rpart.cp ParamDbl 0 1 Inf 0.01
2: classif.rpart.keep_model ParamLgl NA NA 2 FALSE
3: classif.rpart.maxcompete ParamInt 0 Inf Inf 4
4: classif.rpart.maxdepth ParamInt 1 30 30 30
5: classif.rpart.maxsurrogate ParamInt 0 Inf Inf 5
6: classif.rpart.minbucket ParamInt 1 Inf Inf <NoDefault[3]>
7: classif.rpart.minsplit ParamInt 1 Inf Inf 20
8: classif.rpart.surrogatestyle ParamInt 0 1 2 0
9: classif.rpart.usesurrogate ParamInt 0 2 3 2
10: classif.rpart.xval ParamInt 0 Inf Inf 10
11: threshold.thresholds ParamUty NA NA Inf <NoDefault[3]>
1 variable not shown: [value]
We can now tune those thresholds from the outside as follows:
Before tuning, we have to define which hyperparameters we want to tune over. In this example, we only tune over the thresholds
parameter of the threshold
pipeop. you can easily imagine, that we can also jointly tune over additional hyperparameters, i.e. rpart’s cp
parameter.
As the Task
we aim to optimize for is a binary task, we can simply specify the threshold param:
We now create a AutoTuner
, which automatically tunes the supplied learner over the ParamSet
we supplied above.
Inside the trafo
, we simply collect all set params into a named vector via map_dbl
and store it in the threshold.thresholds
slot expected by the learner.
Again, we create a AutoTuner
, which automatically tunes the supplied learner over the ParamSet
we supplied above.
One drawback of this strategy is, that this requires us to fit a new model for each new threshold setting. While setting a threshold and computing performance is relatively cheap, fitting the learner is often more computationally demanding. A better strategy is therefore often to optimize the thresholds separately after each model fit.
8.7.7 PipeOpTunethreshold
PipeOpTuneThreshold
on the other hand works together with PipeOpLearnerCV
. It directly optimizes the crossvalidated
predictions made by this PipeOp
. This is done in order to avoid overfitting the threshold tuning.
A simple example would be:
gr = po("learner_cv", lrn("classif.rpart", predict_type = "prob")) %>>% po("tunethreshold")
l2 = as_learner(gr)
Note, that predict_type
= “prob” is required for po("tunethreshold")
to work. Additionally, note that this time no threshold
parameter is exposed, it is automatically tuned internally.
l2$param_set
<ParamSetCollection>
id class lower upper nlevels
1: classif.rpart.resampling.method ParamFct NA NA 2
2: classif.rpart.resampling.folds ParamInt 2 Inf Inf
3: classif.rpart.resampling.keep_response ParamLgl NA NA 2
4: classif.rpart.cp ParamDbl 0 1 Inf
5: classif.rpart.keep_model ParamLgl NA NA 2
6: classif.rpart.maxcompete ParamInt 0 Inf Inf
7: classif.rpart.maxdepth ParamInt 1 30 30
8: classif.rpart.maxsurrogate ParamInt 0 Inf Inf
9: classif.rpart.minbucket ParamInt 1 Inf Inf
10: classif.rpart.minsplit ParamInt 1 Inf Inf
11: classif.rpart.surrogatestyle ParamInt 0 1 2
12: classif.rpart.usesurrogate ParamInt 0 2 3
13: classif.rpart.xval ParamInt 0 Inf Inf
14: classif.rpart.affect_columns ParamUty NA NA Inf
15: tunethreshold.measure ParamUty NA NA Inf
16: tunethreshold.optimizer ParamUty NA NA Inf
17: tunethreshold.log_level ParamUty NA NA Inf
2 variables not shown: [default, value]
Note that we can set rsmp("intask")
as a resampling strategy for “learner_cv” in order to evaluate predictions on the “training” data. This is generally not advised, as it might lead to overfitting on the thresholds but can significantly reduce runtime.
For more information, see the post on Threshold Tuning on the mlr3 gallery.
8.8 Cluster Analysis
Cluster analysis is a type of unsupervised machine learning where the goal is to group data into clusters, where each cluster contains similar observations. The similarity is based on specified metrics that are task and application dependent. Cluster analysis is closely related to classification in a sense that each observation needs to be assigned to a cluster or a class. However, unlike classification problems where each observation is labeled, clustering works on data sets without true labels or class assignments.
The package mlr3cluster extends mlr3 with the following objects for cluster analysis:

TaskClust
to define clustering tasks 
LearnerClust
as base class for clustering learners 
PredictionClust
as specialized class forPrediction
objects 
MeasureClust
as specialized class for performance measures
Since clustering is a type of unsupervised learning, TaskClust
is slightly different from TaskRegr
and TaskClassif
objects. More specifically:

truth()
function is missing because observations are not labeled. 
target
field is empty and will returncharacter(0)
if accessed anyway.
Additionally, LearnerClust
provides two extra fields that are absent from supervised learners:

assignments
returns cluster assignments for training data. It returnNULL
if accessed before training. 
save_assignments
is a boolean field that controls whether or not to store training set assignments in a learner.
Finally, PredictionClust
contains additional two fields:

partition
stores cluster partitions. 
prob
stores cluster probabilities for each observation.
8.8.1 Train and Predict
Clustering learners provide both train
and predict
methods. The analysis typically consists of building clusters using all available data. To be consistent with the rest of the library, we refer to this process as training.
Some learners can assign new observations to existing groups with predict
. However, prediction does not always make sense, as it is the case for hierarchical clustering. In hierarchical clustering, the goal is to build a hierarchy of nested clusters by either splitting large clusters into smaller ones or merging smaller clusters into bigger ones. The final result is a tree or dendrogram which can change if a new data point is added. For consistency with the rest of the ecosystem, mlr3cluster offers predict
method for hierarchical clusterers but it simply assigns all points to a specified number of clusters by cutting the resulting tree at a corresponding level. Moreover, some learners estimate the probability of each observation belonging to a given cluster. predict_types
field gives a list of prediction types for each learner.
After training, the model
field stores a learner’s model that looks different for each learner depending on the underlying library. predict
returns a PredictionClust
object that gives a simplified view of the learned model. If the data given to the predict
method is the same as the one on which the learner was trained, predict
simply returns cluster assignments for the “training” observations. On the other hand, if the test set contains new data, predict
will estimate cluster assignments for that data set. Some learners do not support estimating cluster partitions on new data and will instead return assignments for training data and print a warning message.
In the following example, a $k$means learner
is applied on the US arrest data set
. The class labels are predicted and the contribution of the task features to assignment of the respective class are visualized.
library("mlr3")
library("mlr3cluster")
library("mlr3viz")
set.seed(1L)
# create an example task
task = tsk("usarrests")
print(task)
<TaskClust:usarrests> (50 x 4): US Arrests
* Target: 
* Properties: 
* Features (4):
 int (2): Assault, UrbanPop
 dbl (2): Murder, Rape
autoplot(task)
# create a kmeans learner
learner = lrn("clust.kmeans")
# assigning each observation to one of the two clusters (default in clust.kmeans)
learner$train(task)
learner$model
Kmeans clustering with 2 clusters of sizes 21, 29
Cluster means:
Assault Murder Rape UrbanPop
1 255.0000 11.857143 28.11429 67.61905
2 109.7586 4.841379 16.24828 64.03448
Clustering vector:
[1] 1 1 1 1 1 1 2 1 1 1 2 2 1 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 2 2 1 1 1 2 2 2 2 2
[39] 2 1 2 1 1 2 2 2 2 2 2 2
Within cluster sum of squares by cluster:
[1] 41636.73 54762.30
(between_SS / total_SS = 72.9 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
# make "predictions" for the same data
prediction = learner$predict(task)
autoplot(prediction, task)
8.8.2 Measures
The difference between supervised and unsupervised learning is that there is no ground truth data in unsupervised learning. In a supervised setting, such as classification, we would need to compare our predictions to true labels. Since clustering is an example of unsupervised learning, there are no true labels to which we can compare. However, we can still measure the quality of cluster assignments by quantifying how closely objects within the same cluster are related (cluster cohesion) as well as how distinct different clusters are from each other (cluster separation).
To assess the quality of clustering, there are a few builtin evaluation metrics available. One of them is within sum of squares (WSS)
which calculates the sum of squared differences between observations and centroids. WSS is useful because it quantifies cluster cohesion. The range of this measure is \([0, \infty)\) where a smaller value means that clusters are more compact.
Another measure is silhouette quality index
that quantifies how well each point belongs to its assigned cluster versus neighboring cluster. Silhouette values are in \([1, 1]\) range.
Points with silhouette closer to:
 1 are well clustered
 0 lie between two clusters
 1 likely placed in the wrong cluster
The following is an example of conducting a benchmark experiment with various learners on iris data set
without target variable and assessing the quality of each learner with both within sum of squares and silhouette measures.
design = benchmark_grid(
tasks = TaskClust$new("iris", iris[5]),
learners = list(
lrn("clust.kmeans", centers = 3L),
lrn("clust.pam", k = 2L),
lrn("clust.cmeans", centers = 3L)),
resamplings = rsmp("insample"))
print(design)
task learner resampling
1: <TaskClust[46]> <LearnerClustKMeans[38]> <ResamplingInsample[20]>
2: <TaskClust[46]> <LearnerClustPAM[38]> <ResamplingInsample[20]>
3: <TaskClust[46]> <LearnerClustCMeans[38]> <ResamplingInsample[20]>
# execute benchmark
bmr = benchmark(design)
# define measure
measures = list(msr("clust.wss"), msr("clust.silhouette"))
bmr$aggregate(measures)
nr resample_result task_id learner_id resampling_id iters clust.wss
1: 1 <ResampleResult[21]> iris clust.kmeans insample 1 78.85144
2: 2 <ResampleResult[21]> iris clust.pam insample 1 153.32572
3: 3 <ResampleResult[21]> iris clust.cmeans insample 1 79.02617
1 variable not shown: [clust.silhouette]
The experiment shows that using kmeans algorithm with three centers produces a better within sum of squares score than any other learner considered. However, pam (partitioning around medoids) learner with two clusters performs the best when considering silhouette measure which takes into the account both cluster cohesion and separation.
8.8.3 Visualization
Cluster analysis in mlr3 is integrated with mlr3viz which provides a number of useful plots. Some of those plots are shown below.
task = TaskClust$new("iris", iris[5])
learner = lrn("clust.kmeans")
learner$train(task)
prediction = learner$predict(task)
# performing PCA on task and showing assignments
autoplot(prediction, task, type = "pca")
# same as above but with probability ellipse that assumes normal distribution
autoplot(prediction, task, type = "pca", frame = TRUE, frame.type = "norm")
task = tsk("usarrests")
learner = lrn("clust.agnes")
learner$train(task)
# dendrogram for hierarchical clustering
autoplot(learner)
# advanced dendrogram options from `factoextra::fviz_dend`
autoplot(learner,
k = learner$param_set$values$k, rect_fill = TRUE,
rect = TRUE)
Silhouette plots can help to visually assess the quality of the analysis and help choose a number of clusters for a given data set. The red dotted line shows the mean silhouette value and each bar represents a data point. If most points in each cluster have an index around or higher than mean silhouette, the number of clusters is chosen well.
# silhouette plot allows to visually inspect the quality of clustering
task = TaskClust$new("iris", iris[5])
learner = lrn("clust.kmeans")
learner$param_set$values = list(centers = 5L)
learner$train(task)
prediction = learner$predict(task)
autoplot(prediction, task, type = "sil")
The plot shows that all points in cluster 5 and almost all points in clusters 4, 2 and 1 are below average silhouette index. This means that a lot of observations lie either on the border of clusters or are likely assigned to the wrong cluster.
learner = lrn("clust.kmeans")
learner$param_set$values = list(centers = 2L)
learner$train(task)
prediction = learner$predict(task)
autoplot(prediction, task, type = "sil")
Setting the number of centers to two improves both average silhouette score as well as overall quality of clustering because almost all points in cluster 1 are higher than and a lot of points in cluster 2 are close to mean silhouette. Hence, having two centers might be a better choice for the number of clusters.