library("mlr3")
library("mlr3proba")
library("survival")
as_task_surv(survival::bladder2[, -1L], id = "interval_censored",
time = "start", event = "event", time2 = "stop", type = "interval")
# type = "right" is default
task = as_task_surv(survival::rats, id = "right_censored",
time = "time", event = "status", type = "right")
print(task)
# the target column is a survival object:
head(task$truth())
# kaplan-meier plot
# library("mlr3viz")
autoplot(task)
8 Special Tasks
This chapter explores the different functions of mlr3 when dealing with specific data sets that require further statistical modification to undertake sensible analysis. The following topics are discussed:
Survival Analysis
This sub-chapter explains how to conduct sound survival analysis in mlr3. Survival analysis is used to monitor the period of time until a specific event takes place. This specific event could be, e.g., death, the transmission of a disease, marriage, or divorce. Two considerations are important when conducting survival analysis:
- Whether the event occurred within the frame of the given data
- How much time it took until the event occurred
In summary, this sub-chapter explains how to account for these considerations and conduct survival analysis using the mlr3proba extension package.
Density Estimation
This sub-chapter explains how to conduct (unconditional) density estimation in mlr3. Density estimation is used to estimate the probability density function of a continuous variable. Unconditional density estimation is an unsupervised task, so there is no ‘value’ to predict. Instead, densities are estimated.
This sub-chapter explains how to estimate probability distributions for continuous variables using the mlr3proba extension package.
Spatiotemporal Analysis
In spatiotemporal analysis, data observations entail reference information about spatial and temporal characteristics. One of the largest issues of spatiotemporal data analysis is the inevitable presence of autocorrelation in the data. Autocorrelation is especially severe in data with a marginal spatiotemporal variation. The sub-chapter on spatiotemporal analysis provides instructions on how to account for spatiotemporal data.
Multilabel Classification
Multilabel classification deals with objects that can belong to more than one category at the same time: numerous target labels are attributed to a single observation. Working with multilabel data requires modified algorithms to accommodate data-specific characteristics. Two approaches to multilabel classification are prominently used:
- The problem transformation method
- The algorithm adaptation method
Instructions on how to deal with multilabel classification in mlr3 can be found in this sub-chapter.
Cost-Sensitive Classification
This sub-chapter deals with the implementation of cost-sensitive classification. Regular classification aims to minimize the misclassification rate and thus all types of misclassification errors are deemed equally severe. Cost-sensitive classification is a setting where the costs caused by different kinds of errors are not assumed to be equal. The objective is to minimize the expected costs.
Analytical data for a big credit institution is used as a use case to illustrate the different features. First, the sub-chapter provides guidance on how to implement an initial model. Subsequently, it contains instructions on cost-sensitive measures, thresholding, and threshold tuning.
Cluster Analysis
Cluster analysis aims to group data into clusters such that objects that are similar end up in the same cluster. Fundamentally, clustering and classification are similar. However, clustering is an unsupervised task because observations do not contain true labels while in classification, labels are needed in order to train a model.
This sub-chapter explains how to perform cluster analysis in mlr3 with the help of the mlr3cluster extension package.
Coming soon
In development, we are also working on:
- Multilabel classification - mlr3multioutput
- Functional data analysis - mlr3fda
- Ordinal analysis - mlr3ordinal
- Time-series analysis / forecasting - mlr3temporal
Send us a message if you want to help out on any of these!
8.1 Survival Analysis
Survival analysis with {mlr3proba} is currently in a fragile state after its removal from CRAN. Hence most code examples on this page will not work for the time being. We are actively working on a solution!
Survival analysis is a sub-field of supervised machine learning in which the aim is to predict the survival distribution of a given individual. Arguably the main feature of survival analysis is that, unlike classification and regression, learners are trained on two features:
- the time until the event takes place
- the event type: either censoring or death.
At a particular time-point, an individual is either alive, dead, or censored. Censoring occurs if it is unknown whether an individual is alive or dead. For example, say we are interested in patients in a hospital and every day it is recorded if they are alive or dead; then after a patient leaves, it is unknown if they are alive or dead, hence they are censored. If there were no censoring, then ordinary regression analysis could be used instead. Furthermore, survival data contains solely positive values and therefore needs to be transformed to avoid biases.
Note that survival analysis accounts for both censored and uncensored observations while adjusting respective model parameters.
The package mlr3proba (Sonabend et al. 2021) extends mlr3 with the following objects for survival analysis:
- TaskSurv to define (censored) survival tasks
- LearnerSurv as base class for survival learners
- PredictionSurv as specialized class for Prediction objects
- MeasureSurv as specialized class for performance measures
For a good introduction to survival analysis see Modelling Survival Data in Medical Research (Collett 2014).
8.1.1 TaskSurv
Unlike TaskClassif and TaskRegr, which have a single 'target' argument, TaskSurv mimics the survival::Surv object and has three to four target arguments (depending on the censoring type). A TaskSurv can be constructed with the function as_task_surv():
library("mlr3")
library("mlr3proba")
library("survival")
as_task_surv(survival::bladder2[, -1L], id = "interval_censored",
time = "start", event = "event", time2 = "stop", type = "interval")
# type = "right" is the default
task = as_task_surv(survival::rats, id = "right_censored",
time = "time", event = "status", type = "right")
print(task)
# the target column is a survival object:
head(task$truth())
# Kaplan-Meier plot
library("mlr3viz")
autoplot(task)
8.1.2 Predict Types - crank, lp, and distr
Every PredictionSurv object can predict one or more of:
- lp - Linear predictor calculated as the fitted coefficients multiplied by the test data.
- distr - Predicted survival distribution, either discrete or continuous. Implemented in distr6.
- crank - Continuous risk ranking.
- response - Predicted survival time.
lp and crank can be used with measures of discrimination such as the concordance index. Whilst lp is a specific mathematical prediction, crank is any continuous ranking that identifies who is more or less likely to experience the event. So far, the only implemented learner that returns only a continuous ranking is surv.svm. If a PredictionSurv returns an lp, then the crank is identical to it. Otherwise, crank is calculated as the expectation of the predicted survival distribution. Note that for linear proportional hazards models, the ranking (but not necessarily the crank score itself) given by lp and by the expectation of distr is identical.
The example below uses the rats task shipped with mlr3proba.
task = tsk("rats")
learn = lrn("surv.coxph")
train_set = sample(task$nrow, 0.8 * task$nrow)
test_set = setdiff(seq_len(task$nrow), train_set)
learn$train(task, row_ids = train_set)
prediction = learn$predict(task, row_ids = test_set)
print(prediction)
<PredictionSurv> for 60 observations:
row_ids time status crank lp distr
2 49 TRUE -0.5038971 -0.5038971 <list[1]>
3 104 FALSE -0.5038971 -0.5038971 <list[1]>
12 102 FALSE -3.4446695 -3.4446695 <list[1]>
---
281 89 FALSE -2.5217340 -2.5217340 <list[1]>
295 104 FALSE 1.3955682 1.3955682 <list[1]>
300 102 FALSE -2.4602050 -2.4602050 <list[1]>
8.1.3 Composition
Finally, we take a look at the PipeOps implemented in mlr3proba, which are used for composition of predict types. For example, a predicted linear predictor does not have a lot of meaning by itself, but it can be composed into a survival distribution. See mlr3pipelines for full tutorials and details on PipeOps.
library("mlr3pipelines")
library("mlr3learners")
# PipeOpDistrCompositor - Train one model with a baseline distribution,
# (Kaplan-Meier or Nelson-Aalen), and another with a predicted linear predictor.
task = tsk("rats")
# remove the factor column for support with glmnet
task$select(c("litter", "rx"))
learner_lp = lrn("surv.glmnet")
learner_distr = lrn("surv.kaplan")
prediction_lp = learner_lp$train(task)$predict(task)
prediction_distr = learner_distr$train(task)$predict(task)
prediction_lp$distr
# Doesn't need training. Base = baseline distribution. ph = Proportional hazards.
pod = po("compose_distr", form = "ph", overwrite = FALSE)
prediction = pod$predict(list(base = prediction_distr, pred = prediction_lp))$output
# Now we have a predicted distr!
prediction$distr
# This can all be simplified by using the distrcompose pipeline
glm.distr = ppl("distrcompositor", learner = lrn("surv.glmnet"),
estimator = "kaplan", form = "ph", overwrite = FALSE, graph_learner = TRUE)
glm.distr$train(task)$predict(task)
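To verify that the pipeline indeed produces a distribution prediction, we can score it with a distribution-based measure. A minimal sketch, assuming the integrated Graf (integrated Brier) score surv.graf is available:
prediction = glm.distr$train(task)$predict(task)
# score the composed distribution prediction with the integrated Graf score
prediction$score(msr("surv.graf"))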
8.1.4 Benchmark Experiment
Finally, we conduct a small benchmark study on the rats task using some of the integrated survival learners:
library("mlr3learners")
task = tsk("rats")
# some integrated learners
learners = lrns(c("surv.coxph", "surv.kaplan", "surv.ranger"))
print(learners)
$surv.coxph
<LearnerSurvCoxPH:surv.coxph>: Cox Proportional Hazards
* Model: -
* Parameters: list()
* Packages: mlr3, mlr3proba, survival, distr6
* Predict Types: crank, [distr], lp
* Feature Types: logical, integer, numeric, factor
* Properties: weights
$surv.kaplan
<LearnerSurvKaplan:surv.kaplan>: Kaplan-Meier Estimator
* Model: -
* Parameters: list()
* Packages: mlr3, mlr3proba, survival, distr6
* Predict Types: [crank], distr
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: missings
$surv.ranger
<LearnerSurvRanger:surv.ranger>: Random Forest
* Model: -
* Parameters: num.threads=1
* Packages: mlr3, mlr3proba, mlr3extralearners, ranger
* Predict Types: crank, [distr]
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: importance, oob_error, weights
measure = msr("surv.cindex")
print(measure)
<MeasureSurvCindex:surv.cindex>
* Packages: mlr3, mlr3proba
* Range: [0, 1]
* Minimize: FALSE
* Average: macro
* Parameters: weight_meth=I, tiex=0.5, eps=0.001
* Properties: -
* Predict type: crank
* Return type: Score
set.seed(1)
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measure)
nr resample_result task_id learner_id resampling_id iters surv.cindex
1: 1 <ResampleResult[21]> rats surv.coxph cv 3 0.7671307
2: 2 <ResampleResult[21]> rats surv.kaplan cv 3 0.5000000
3: 3 <ResampleResult[21]> rats surv.ranger cv 3 0.7799153
The experiment indicates that both the Cox PH and the random forest have better discrimination than the Kaplan-Meier baseline estimator, but that the random forest is not consistently better than the interpretable Cox PH.
8.2 Density Estimation
Density estimation with {mlr3proba} is currently in a fragile state after its removal from CRAN. Hence most code examples on this page will not work for the time being. We are actively working on a solution!
Density estimation is the learning task of finding the unknown distribution from which an i.i.d. data set is generated. We interpret this broadly, as the distribution is not necessarily continuous (so it may possess a mass rather than a density). The conditional case, where a distribution is predicted conditional on covariates, is known as 'probabilistic supervised regression' and will be implemented in mlr3proba in the near future. Unconditional density estimation is viewed as an unsupervised task. For a good overview of density estimation see Density Estimation for Statistics and Data Analysis (Silverman 1986).
The package mlr3proba extends mlr3 with the following objects for density estimation:
- TaskDens to define density tasks
- LearnerDens as base class for density estimators
- PredictionDens as specialized class for Prediction objects
- MeasureDens as specialized class for performance measures
In this example we demonstrate the basic functionality of the package on the precip data from the datasets package. This task ships as a pre-defined TaskDens with mlr3proba.
task = tsk("precip")
print(task)
<TaskDens:precip> (70 x 1): Annual Precipitation
* Target: -
* Properties: -
* Features (1):
- dbl (1): precip
Unconditional density estimation is an unsupervised method. Hence, TaskDens is an unsupervised task which inherits directly from Task, unlike TaskClassif and TaskRegr. However, TaskDens still has a target argument and a $truth field defined by:
- target - the name of the variable in the data for which to estimate the density
- $truth - the values of the target column (which are not the true density, as the true density is always unknown)
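For instance, for the precip task created above, the accessor simply returns the observed precipitation values. A quick sketch, assuming $truth() behaves like its supervised counterparts:
head(task$truth())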
8.2.1 Train and Predict
Density learners have train
and predict
methods, though being unsupervised, ‘prediction’ is actually ‘estimation’. In training, a distr6
object is created, see here for full tutorials on how to access the probability density function, pdf
, cumulative distribution function, cdf
, and other important fields and methods. The predict method is simply a wrapper around self$model$pdf
and if available self$model$cdf
, i.e. evaluates the pdf/cdf at given points. Note that in prediction the points to evaluate the pdf and cdf are determined by the target
column in the TaskDens
object used for testing.
# create task and learner
task_faithful = TaskDens$new(id = "eruptions", backend = datasets::faithful$eruptions)
learner = lrn("dens.hist")
# train/test split
train_set = sample(task_faithful$nrow, 0.8 * task_faithful$nrow)
test_set = setdiff(seq_len(task_faithful$nrow), train_set)
# fitting the histogram estimator and model inspection
learner$train(task_faithful, row_ids = train_set)
learner$model
$distr
Histogram()
$hist
$breaks
[1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
$counts
[1] 39 30 5 4 25 62 49 3
$density
[1] 0.35944700 0.27649770 0.04608295 0.03686636 0.23041475 0.57142857 0.45161290
[8] 0.02764977
$mids
[1] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25
$xname
[1] "dat"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
attr(,"class")
[1] "dens.hist"
[1] "dens.hist"
Every PredictionDens object can estimate:
- pdf - probability density function
Some learners can additionally estimate:
- cdf - cumulative distribution function
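Continuing the example above, a sketch of estimating the pdf (and cdf) on the held-out observations; as.data.table() is used here only to inspect the estimated values:
library("data.table")
# evaluate the estimated pdf (and cdf) at the held-out target values
prediction = learner$predict(task_faithful, row_ids = test_set)
head(as.data.table(prediction))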
8.2.2 Benchmark Experiment
Finally, we conduct a small benchmark study on the precip task using some of the integrated density learners:
learners = lrns(c("dens.hist", "dens.kde"))
print(learners)
$dens.hist
<LearnerDensHistogram:dens.hist>: Histogram Density Estimator
* Model: -
* Parameters: list()
* Packages: mlr3, mlr3proba, distr6
* Predict Types: [pdf], cdf, distr
* Feature Types: integer, numeric
* Properties: -
$dens.kde
<LearnerDensKDE:dens.kde>: Kernel Density Estimator
* Model: -
* Parameters: kernel=Epan, bandwidth=silver
* Packages: mlr3, mlr3proba, distr6
* Predict Types: [pdf], distr
* Feature Types: integer, numeric
* Properties: missings
measure = msr("dens.logloss")
print(measure)
<MeasureDensLogloss:dens.logloss>: Log Loss
* Packages: mlr3, mlr3proba
* Range: [0, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: eps=1e-15
* Properties: -
* Predict type: pdf
set.seed(1)
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measure)
nr resample_result task_id learner_id resampling_id iters dens.logloss
1: 1 <ResampleResult[21]> precip dens.hist cv 3 4.396138
2: 2 <ResampleResult[21]> precip dens.kde cv 3 4.817715
The results of this experiment show that the baseline histogram estimator achieves a lower (i.e., better) logloss than the kernel density estimator on this task.
8.3 Spatiotemporal Analysis
Data observations may entail reference information about spatial or temporal characteristics. Spatial information is stored as coordinates, usually named “x” and “y” or “lat”/“lon”. Treating spatiotemporal data using non-spatial data methods can lead to over-optimistic performance estimates. Hence, methods specifically designed to account for the special nature of spatiotemporal data are needed.
In the mlr3 framework, the following packages relate to this field:
- mlr3spatiotempcv (bias-reduced performance estimation)
- mlr3spatial (spatial prediction support)
The following (sub-)sections introduce the potential pitfalls of spatiotemporal data in machine learning and how to account for it. Note that not all functionality will be covered, and that some of the used packages are still in early lifecycles. If you want to contribute to one of the packages mentioned above, please contact Patrick Schratz.
8.3.1 Creating a spatial Task
To make use of spatial resampling methods, an {mlr3} task that is aware of its spatial characteristics needs to be created. Two Task child classes exist in {mlr3spatiotempcv} for this purpose:
- TaskClassifST
- TaskRegrST
To create one of these, you have multiple options:
- Use the constructor of the Task directly via $new() - this only works for data.table backends (!)
- Use the as_task_* converters (e.g. if your data is stored in an sf object)
We recommend the latter, as the as_task_* converters aim to make task construction easier, e.g., by creating the DataBackend (which is required to create a Task in {mlr3}) automatically and setting the crs and coordinate_names fields. Let's assume your (point) data is stored in an sf object, which is a common scenario for spatial analysis in R.
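A minimal sketch of this path, assuming the ecuador data.frame shipped with mlr3spatiotempcv and its UTM zone 17S coordinate reference system (EPSG 32717):
library("mlr3spatiotempcv")
library("sf")
data("ecuador", package = "mlr3spatiotempcv")
# convert the plain data.frame to an sf object first
ecuador_sf = sf::st_as_sf(ecuador, coords = c("x", "y"), crs = 32717)
# crs and coordinate names are inferred from the sf object
task = as_task_classif_st(ecuador_sf, id = "ecuador_task",
target = "slides", positive = "TRUE")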
You can also use a plain data.frame. In this case, crs and coordinate_names need to be passed along explicitly, as they cannot be inferred the way they are from an sf object:
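A corresponding sketch for the data.frame path, again assuming the ecuador data and EPSG 32717:
task = as_task_classif_st(ecuador, id = "ecuador_task",
target = "slides", positive = "TRUE",
coordinate_names = c("x", "y"), crs = 32717)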
The *ST task family prints a subset of the coordinates by default:
<TaskClassifST:ecuador_task> (751 x 11)
* Target: slides
* Properties: twoclass
* Features (10):
- dbl (10): carea, cslope, dem, distdeforest, distroad,
distslidespast, hcurv, log.carea, slope, vcurv
* Coordinates:
x y
1: 712882.5 9560002
2: 715232.5 9559582
3: 715392.5 9560172
4: 715042.5 9559312
5: 715382.5 9560142
---
747: 714472.5 9558482
748: 713142.5 9560992
749: 713322.5 9560562
750: 715392.5 9557932
751: 713802.5 9560862
All *ST tasks can be treated as their superclass equivalents TaskClassif or TaskRegr in subsequent {mlr3} modeling steps.
8.3.2 Autocorrelation
Data which includes spatial or temporal information requires special treatment in machine learning (similar to survival, ordinal and other task types listed in the special tasks chapter). In contrast to non-spatial/non-temporal data, observations inherit a natural grouping, either in space or time or in both space and time (Legendre 1993). This grouping causes observations to be autocorrelated, either in space (spatial autocorrelation (SAC)), time (temporal autocorrelation (TAC)) or both space and time (spatiotemporal autocorrelation (STAC)). For simplicity, the acronym STAC is used as a generic term in the following chapter for all the different characteristics introduced above.
What effects does STAC have in statistical/machine learning?
The overarching problem is that STAC violates the assumption that the observations in the train and test datasets are independent (Hastie, Friedman, and Tibshirani 2001). If this assumption is violated, the reliability of the resulting performance estimates, for example retrieved via cross-validation, is decreased. The magnitude of this decrease is linked to the magnitude of STAC in the dataset, which cannot be determined easily.
One approach to account for the existence of STAC is to use dedicated resampling methods. mlr3spatiotempcv provides access to the most frequently used spatiotemporal resampling methods. The following example showcases how a spatial dataset can be used to retrieve a bias-reduced performance estimate of a learner.
The following examples use the ecuador dataset created by Jannes Muenchow. It contains information on the occurrence of landslides (binary) in the Andes of Southern Ecuador. The landslides were mapped from aerial photos taken in 2000. The dataset is well suited to serve as an example because it is relatively small and, of course, due to the spatial nature of the observations. Please refer to Muenchow, Brenning, and Richter (2012) for a detailed description of the dataset.
To account for the spatial autocorrelation probably present in the landslide data, we will make use of one of the most widely used spatial partitioning methods, a cluster-based k-means grouping (Brenning 2012) (spcv_coords in mlr3spatiotempcv). This method performs a clustering in 2D space, which contrasts with the commonly used random partitioning for non-spatial data. The grouping has the effect that train and test data are more separated in space than they would be by a random partitioning, thereby reducing the effect of STAC.
By contrast, when using the classical random partitioning approach with spatial data, train and test observations would be located side-by-side across the full study area (a visual example is provided further below). This leads to a high similarity between train and test sets, resulting in "better" but biased performance estimates in every fold of a CV compared to the spatial CV approach. However, these low error rates are mainly caused by the STAC in the observations and the lack of appropriate partitioning methods, not by the power of the fitted model.
8.3.3 Spatiotemporal Cross-Validation and Partitioning
One part of spatiotemporal machine learning is dealing with the spatiotemporal components of the data during performance estimation. Performance is commonly estimated via cross-validation, and mlr3spatiotempcv provides specialized resampling methods for spatiotemporal data. The following sections showcase how these methods can be applied and how they differ from non-spatial resampling methods, e.g., random partitioning. In addition, examples show how resamplings with spatial information can be visualized using mlr3spatiotempcv.
Besides performance estimation, prediction on spatiotemporal data is another challenging task. See Section 8.3.6 for more information about how this topic is handled within the mlr3 framework.
8.3.3.1 Spatial CV vs. Non-Spatial CV
In the following, a spatial and a non-spatial CV will be applied to showcase the mentioned performance differences.
The performance of a simple classification tree ("classif.rpart") is evaluated on a random partitioning (repeated_cv) with four folds and two repetitions. The chosen evaluation measure is the classification error ("classif.ce").
For the spatial example, repeated_spcv_coords is chosen, whereas repeated_cv represents the non-spatial example.
The selection of repeated_spcv_coords in this example is arbitrary. For your use case, you might want to use a different spatial partitioning method (but not necessarily!). Have a look at the "Getting Started" vignette of mlr3spatiotempcv to see all available methods and choose one which fits your data and its prediction purpose.
8.3.3.1.1 Non-Spatial CV
In this example the ecuador example task is used to estimate the performance of an rpart learner with fixed parameters.
In practice, you might want to tune the hyperparameters of the learner and apply a nested CV, in which the inner loop is used for hyperparameter tuning.
library("mlr3")
library("mlr3spatiotempcv")
set.seed(42)
# be less verbose
lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")
task = tsk("ecuador")
learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling_nsp = rsmp("repeated_cv", folds = 4, repeats = 2)
rr_nsp = resample(
task = task, learner = learner,
resampling = resampling_nsp)
rr_nsp$aggregate(measures = msr("classif.ce"))
classif.ce
0.3388575
8.3.3.1.2 Spatial CV
task = tsk("ecuador")
learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling_sp = rsmp("repeated_spcv_coords", folds = 4, repeats = 2)
rr_sp = resample(
task = task, learner = learner,
resampling = resampling_sp)
rr_sp$aggregate(measures = msr("classif.ce"))
classif.ce
0.412529
Here, the performance of the classification tree learner is around 7 percentage points worse when using spatial cross-validation (SpCV) compared to non-spatial cross-validation (NSpCV). The resulting difference in performance varies, as it depends on the dataset, the magnitude of STAC and the learner itself. For algorithms with a higher tendency to overfit the training set, the difference between the two methods will be larger.
Now, what does it mean that the performance in the spatial case is worse? Should you ditch SpCV and keep using non-spatial partitioning? The answer is NO. The reason why the spatial partitioning scenario results in a lower predictive performance is that, throughout the CV, the model has been trained on data that is less similar to the test data than in the non-spatial scenario. Or in other words: in the non-spatial scenario, train and test data are almost identical (due to spatial autocorrelation).
This means that the result from the SpCV setting is closer to the true predictive power of the model, whereas the result from the non-spatial CV is overoptimistic and biased.
The result of the SpCV setting is by no means the absolute truth - it is also biased, but (most often) less so than in the non-spatial setting.
8.3.4 Visualization of Spatiotemporal Partitions
Every partitioning method in mlr3spatiotempcv comes with S3 methods for plot() and autoplot() to visualize the created groups. In 2D space, ggplot2 is used in the background, while for spatiotemporal methods 3D visualizations are created via plotly.
The following example shows how the resampling_sp object from the previous example can be visualized. In this case, we look at the first four partitions of the first repetition. The point size is adjusted via the argument size. After the plot creation, additional scale_* calls are used to adjust the coordinate labels on the x and y axes, respectively.
autoplot(resampling_sp, task, fold_id = c(1:4), size = 0.7) *
ggplot2::scale_y_continuous(breaks = seq(-3.97, -4, -0.01)) *
ggplot2::scale_x_continuous(breaks = seq(-79.06, -79.08, -0.02))
Note that setting the correct CRS for the given data is important; this is done during task creation. Spatial offsets of up to multiple meters may occur if the wrong CRS is supplied initially.
This example used a built-in mlr3 task via tsk(). In practice, however, one needs to create a spatiotemporal task via TaskClassifST/TaskRegrST and set the crs argument (unless an sf object is handed over).
mlr3spatiotempcv can also visualize non-spatial partitionings. This helps to visually compare the differences. Let's use the objects from the previous example again, this time resampling_nsp.
autoplot(resampling_nsp, task, fold_id = c(1:4), size = 0.7) *
ggplot2::scale_y_continuous(breaks = seq(-3.97, -4, -0.01)) *
ggplot2::scale_x_continuous(breaks = seq(-79.06, -79.08, -0.02))
The visualization shows very well how close train and test observations are located to each other.
8.3.4.1 Spatial Block Visualization
This examples showcases another SpCV method: spcv_block
. This method makes use of rectangular blocks to divide the study area into equally-sized parts. {mlr3spatiotempcv} has support for visualizing the created blocks and displaying their respective fold ID to get a better understanding how the final folds were composed out of the partitions. E.g. the “Fold 1 Repetition 1” plot shows that the test set is composed out of two “blocks” with the ID “1” in this case.
The use of range = 1000L in the sketch below is arbitrary and should not be copy-pasted into a real application.
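A minimal sketch of such a block resampling; the show_blocks and show_labels arguments of autoplot() are assumed to be available for this method:
task = tsk("ecuador")
resampling_block = rsmp("spcv_block", folds = 4, range = 1000L)
resampling_block$instantiate(task)
# visualize the rectangular blocks together with their fold IDs
autoplot(resampling_block, task, fold_id = 1,
show_blocks = TRUE, show_labels = TRUE)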
8.3.4.2 Spatiotemporal Visualization
When going spatiotemporal, two dimensions are not enough anymore. To visualize space and time together, a 3D solution is needed. {mlr3spatiotempcv} makes use of plotly for this purpose.
The following example uses a modified version of the cookfarm_mlr3 task for demonstration purposes. By adjusting some levels, the individual partitions can be recognized very well.
In practice, you should not modify your data to achieve "good looking" visualizations as done in this example. This is only done for (visual) demonstration purposes.
In the following, we use the sptcv_cstf method after Meyer et al. (2018). Only the temporal variable is used in this first example, denoted by setting the column role "time" for the variable "Date". This column role is then picked up by the resampling method. Last, autoplot() is called with plot3D = TRUE set explicitly. This is because sptcv_cstf can also be visualized in 2D (which only makes sense if the "space" column role is used for partitioning).
Last, sample_fold_n is used to take a stratified random sample from all partitions. This is done to reduce the final plot size and keep things small for demos like this one; we don't recommend doing this in actual studies, unless you prominently communicate it alongside the resulting plot. The call to plotly::layout() only adjusts the default viewing angle of the plot - in interactive visualizations, this is not needed and the viewing angle can be adjusted with the mouse.
data = cookfarm_mlr3
set.seed(42)
data$Date = sample(rep(c(
"2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01",
"2020-05-01"), times = 1, each = 35768))
task_spt = as_task_regr_st(data,
id = "cookfarm", target = "PHIHOX",
coordinate_names = c("x", "y"), coords_as_features = FALSE,
crs = 26911)
task_spt$set_col_roles("Date", roles = "time")
rsmp_cstf_time = rsmp("sptcv_cstf", folds = 5)
plot = autoplot(rsmp_cstf_time,
fold_id = 5, task = task_spt, plot3D = TRUE,
sample_fold_n = 3000L
)
plotly::layout(plot, scene = list(camera = list(eye = list(z = 0.58))))
If both space and time are used for partitioning in sptcv_cstf, the visualization becomes even more powerful, as it also allows showing the observations which are omitted, i.e., not used in either the train or test set of a specific fold.
task_spt$set_col_roles("SOURCEID", roles = "space")
task_spt$set_col_roles("Date", roles = "time")
rsmp_cstf_space_time = rsmp("sptcv_cstf", folds = 5)
plot = autoplot(rsmp_cstf_space_time,
fold_id = 4, task = task_spt, plot3D = TRUE,
show_omitted = TRUE, sample_fold_n = 3000L)
plotly::layout(plot, scene = list(camera =
list(eye = list(z = 0.58, x = -1.4, y = 1.6))))
Combining multiple spatiotemporal plots with plotly::layout() is possible but somewhat cumbersome. First, a list containing the individual plots must be created. This list is then passed to plotly::subplot(), and the return value is passed on to plotly::layout().
pl = autoplot(rsmp_cstf_space_time, task = task_spt,
fold_id = c(1, 2, 3, 4), point_size = 3,
axis_label_fontsize = 10, plot3D = TRUE,
sample_fold_n = 3000L, show_omitted = TRUE
)
# Warnings can be ignored
pl_subplot = plotly::subplot(pl)
plotly::layout(pl_subplot,
title = "Individual Folds",
scene = list(
domain = list(x = c(0, 0.5), y = c(0.5, 1)),
aspectmode = "cube",
camera = list(eye = list(z = 0.20, x = -1.4, y = 1.6))
),
scene2 = list(
domain = list(x = c(0.5, 1), y = c(0.5, 1)),
aspectmode = "cube",
camera = list(eye = list(z = 0.1, x = -1.4, y = 1.6))
),
scene3 = list(
domain = list(x = c(0, 0.5), y = c(0, 0.5)),
aspectmode = "cube",
camera = list(eye = list(z = 0.1, x = -1.4, y = 1.6))
),
scene4 = list(
domain = list(x = c(0.5, 1), y = c(0, 0.5)),
aspectmode = "cube",
camera = list(eye = list(z = 0.58, x = -1.4, y = 1.6))
)
)
Unfortunately, titles in subplots cannot be created dynamically. However, there is a manual workaround via annotations shown in this RPubs post.
8.3.5 Choosing a Resampling Method
While the example in this section made use of the spcv_coords method, this should by no means imply that this method is the best or only method suitable for this task. Even though this method is quite popular, it was mainly chosen because of the clear visual grouping differences when applied to the ecuador task, compared to random partitioning.
In fact, most often multiple spatial partitioning methods can be used for a dataset. It is recommended (required) that users familiarize themselves with each implemented method and decide which method to choose based on the specific characteristics of the dataset. For almost all methods implemented in mlr3spatiotempcv, there is a scientific publication describing the strengths and weaknesses of the respective approach (either linked in the help file of mlr3spatiotempcv or its respective dependency packages).
In the example above, a cross-validation without hyperparameter tuning was shown. If a nested CV is desired, it is recommended to use the same spatial partitioning method for the inner loop (= tuning level). See Schratz et al. (2019) for more details and chapter 11 of Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019).
A list of all implemented methods in mlr3spatiotempcv can be found in the Getting Started vignette of the package.
If you want to learn even more about the field of spatial partitioning, STAC and the problems associated with it, the works of Prof. Hanna Meyer and Prof. Alexander Brenning are very much recommended for further reference.
8.3.6 Spatial Prediction
Support for (parallel) spatial prediction with terra, raster, stars and sf objects is available via mlr3spatial.
mlr3spatial has two main scopes:
- Provide DataBackends for spatial objects (vector and raster data)
- Simplify spatial prediction tasks
Overall, it aims to reduce the roundtripping between spatial object -> data.frame / data.table -> spatial object during modeling.
8.3.6.1 Spatial Data Backends
A common scenario is the existence of spatial information in (point) vector data. mlr3spatial provides as_task_* helpers to directly convert these into a spatiotemporal (ST) task:
library(mlr3verse)
library(mlr3spatial)
library(sf)
# load sample points
leipzig_vector = sf::read_sf(system.file("extdata",
"leipzig_points.gpkg", package = "mlr3spatial"),
stringsAsFactors = TRUE)
# create land cover task
tsk_leipzig = as_task_classif_st(leipzig_vector, target = "land_cover")
tsk_leipzig
<TaskClassifST:leipzig_vector> (97 x 9)
* Target: land_cover
* Properties: multiclass
* Features (8):
- dbl (8): b02, b03, b04, b06, b07, b08, b11, ndvi
* Coordinates:
X Y
1: 732480.1 5693957
2: 732217.4 5692769
3: 732737.2 5692469
4: 733169.3 5692777
5: 732202.2 5692644
---
93: 733018.7 5692342
94: 732551.4 5692887
95: 732520.4 5692589
96: 732542.2 5692204
97: 732437.8 5692300
This saves users from stripping the coordinates from the vector data or transforming the sf object to a data.frame or data.table in the first place.
mlr3spatial adds support for the spatial data classes of the sf (vector data) package as well as the terra, raster and stars (raster data) packages.
8.3.6.2 Spatial prediction
The goal of spatial prediction is usually a raster image that covers a specific area. Most often this area is quite large and contains millions of pixels. The goal is to predict on each pixel and return a raster image containing these predictions.
To save users from having to go the extra mile of extracting the final model and using it with the respective predict() function of the spatial R package of their desired outcome class, mlr3spatial provides predict_spatial(). predict_spatial() lets users
- choose the output class of the resulting raster image
- optionally predict in parallel via the future package, using a parallel backend of their choice
To use a raster image for prediction, it must be wrapped into a TaskUnsupervised for internal mlr3 reasons. Next, the learner can be used to predict on the task.
leipzig_raster = terra::rast(system.file("extdata", "leipzig_raster.tif",
package = "mlr3spatial"))
tsk_predict = as_task_unsupervised(leipzig_raster)
lrn = lrn("classif.ranger")
lrn$train(tsk_leipzig)
# plan("multisession") # optional parallelization
pred = predict_spatial(tsk_predict, lrn, format = "terra")
class(pred)
[1] "SpatRaster"
attr(,"package")
[1] "terra"
The resulting object can be visualized and treated as any other terra object. Thanks to the improved handling of factor variables in terra, the raster image contains the correctly labeled classes from the prediction right away.
library(terra, exclude = "resample")
plot(pred, col = c("#440154FF", "#443A83FF", "#31688EFF",
"#21908CFF", "#35B779FF", "#8FD744FF", "#FDE725FF"))
If you are interested in the performance of parallel spatial prediction, have a look at the Spatial Prediction Benchmark vignette - the resulting plot is shown below.
The shown benchmark results should be taken with care, as something seems to go wrong when using all-core parallelization with terra::predict(). Also, both relative and absolute numbers will vary on your local machine, and results might change with future package updates.
8.4 Cost-Sensitive Classification
In regular classification, the aim is to minimize the misclassification rate, and thus all types of misclassification errors are deemed equally severe. A more general setting is cost-sensitive classification. Cost-sensitive classification does not assume that the costs caused by different kinds of errors are equal. The objective of cost-sensitive classification is to minimize the expected costs.
Imagine you are an analyst for a big credit institution. Let's also assume that a correct bank decision would result in a profit of 35% at the end of a specific period. A correct decision means that the bank predicts that a customer will pay their bills (hence would obtain a loan), and the customer indeed has good credit. On the other hand, a wrong decision means that the bank predicts that the customer's credit is in good standing, but the opposite is true. This would result in a loss of 100% of the given loan.
| | Good Customer (truth) | Bad Customer (truth) |
|---|---|---|
| Good Customer (predicted) | +0.35 | -1.0 |
| Bad Customer (predicted) | 0 | 0 |
Expressed as costs (instead of profit), we can write down the cost-matrix as follows:
costs = matrix(c(-0.35, 0, 1, 0), nrow = 2)
dimnames(costs) = list(response = c("good", "bad"), truth = c("good", "bad"))
print(costs)
truth
response good bad
good -0.35 1
bad 0.00 0
An exemplary data set for such a problem is the German Credit task:
task = tsk("german_credit")
The data has 70% of customers who can pay back their credit and 30% of bad customers who default on their debt. A manager who has no model could decide to give either everybody or nobody credit. The resulting average costs per customer for the German credit data can be computed directly from the cost matrix:
# nobody gets credit: all customers are predicted to be "bad"
(700 * costs["bad", "good"] + 300 * costs["bad", "bad"]) / 1000
[1] 0
# everybody gets credit: all customers are predicted to be "good"
(700 * costs["good", "good"] + 300 * costs["good", "bad"]) / 1000
[1] 0.055
If the average loan is $20,000, the credit institution will lose more than one million dollars if it grants everybody credit:
0.055 * 20000 * 1000
[1] 1100000
Our goal is to find a model that minimizes the costs (and maximizes the expected profit).
8.4.1 A First Model
For our first model, we choose an ordinary logistic regression (implemented in the add-on package mlr3learners). We first create a classification task, then resample the model using 10-fold cross-validation and extract the resulting confusion matrix:
library("mlr3learners")
learner = lrn("classif.log_reg")
rr = resample(task, learner, rsmp("cv"))
confusion = rr$prediction()$confusion
print(confusion)
truth
response good bad
good 606 154
bad 94 146
To calculate the average costs like above, we can simply multiply the elements of the confusion matrix with the elements of the previously introduced cost matrix and sum the values of the resulting matrix:
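A sketch of this computation: since the confusion matrix and the cost matrix share the same row and column order, element-wise multiplication works directly, and dividing by the 1000 observations yields the average cost per customer. With the confusion matrix shown above, this gives:
avg_costs = sum(confusion * costs) / 1000
print(avg_costs)
[1] -0.0581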
With an average loan of $20,000, the logistic regression yields the following costs:
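Scaled up to the 1000 customers and an average loan of $20,000, the expected total costs follow directly (negative costs are profit):
avg_costs * 20000 * 1000
[1] -1162000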
Instead of losing over $1,000,000, the credit institute now can expect a profit of more than $1,000,000.
8.4.2 Cost-sensitive Measure
Our natural next step would be to further improve the modeling step in order to maximize the profit. For this purpose, we first create a cost-sensitive classification measure which calculates the costs based on our cost matrix. This allows us to conveniently quantify and compare modeling decisions. Fortunately, there already is a predefined Measure for this purpose: MeasureClassifCosts. The costs have to be provided as a numeric matrix whose columns and rows are named with class labels (just like the previously constructed costs matrix):
cost_measure = msr("classif.costs", costs = costs)
print(cost_measure)
<MeasureClassifCosts:classif.costs>: Cost-sensitive Classification
* Packages: mlr3
* Range: [-Inf, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: normalize=TRUE
* Properties: -
* Predict type: response
If we now call resample() or benchmark(), the cost-sensitive measure will be evaluated. We compare the logistic regression to a simple featureless learner and to a random forest from the package ranger:
learners = list(
lrn("classif.log_reg"),
lrn("classif.featureless"),
lrn("classif.ranger")
)
cv10 = rsmp("cv", folds = 10)
bmr = benchmark(benchmark_grid(task, learners, cv10))
autoplot(bmr, measure = cost_measure) + ggplot2::geom_hline(yintercept = 0, colour = "red")
As expected, the featureless learner is performing comparably badly. The logistic regression and the random forest both yield a profit on average.
8.4.3 Thresholding
Although we now correctly evaluate the models in a cost-sensitive fashion, the models themselves are unaware of the classification costs. They assume the same costs for both wrong classification decisions (false positives and false negatives). Some learners natively support cost-sensitive classification. However, we will concentrate on a more generic approach that works for all models which can predict probabilities for class labels: thresholding.
Most learners can calculate the probability \(p\) for the positive class. If \(p\) exceeds the threshold \(0.5\), they predict the positive class, and the negative class otherwise.
For our binary classification case of the credit data, we primarily want to minimize the errors where the model predicts "good" but the truth is "bad" (i.e., the number of false positives), as this is the more expensive error. If we now increase the threshold to values \(> 0.5\), we reduce the number of false positives. Note that we increase the number of false negatives simultaneously, or, in other words, we are trading false negatives for false positives.
# fit models with probability prediction
learner = lrn("classif.log_reg", predict_type = "prob")
rr = resample(task, learner, rsmp("cv"))
p = rr$prediction()
print(p)
<PredictionClassif> for 1000 observations:
row_ids truth response prob.good prob.bad
2 bad bad 0.40126720 0.5987328
17 good good 0.97406360 0.0259364
46 good good 0.74870171 0.2512983
---
973 bad bad 0.09917846 0.9008215
976 good good 0.78358973 0.2164103
997 good good 0.51492948 0.4850705
# helper function to try different threshold values interactively
with_threshold = function(p, th) {
p$set_threshold(th)
list(confusion = p$confusion, costs = p$score(measures = cost_measure))
}
with_threshold(p, 0.5)
$confusion
truth
response good bad
good 598 152
bad 102 148
$costs
classif.costs
-0.0573
# a higher threshold (the exact value, e.g. 0.75, is illustrative)
# trades false positives for false negatives
with_threshold(p, 0.75)
$confusion
truth
response good bad
good 462 75
bad 238 225
$costs
classif.costs
-0.0867
# at threshold 1, every customer is predicted "bad" and no loans are granted
with_threshold(p, 1)
$confusion
truth
response good bad
good 0 0
bad 700 300
$costs
classif.costs
0
There is also an autoplot() method which systematically varies the threshold between 0 and 1 and calculates the corresponding scores:
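A sketch of this call, assuming mlr3viz is loaded and supports the threshold plot type for classification predictions:
library("mlr3viz")
autoplot(p, type = "threshold", measure = cost_measure)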
Instead of manually or visually searching for good values, the base R function optimize() can do the job for us:
# simple wrapper function that takes a threshold and returns the resulting model performance
# this wrapper is passed to optimize() to find its minimum for thresholds in [0.5, 1]
f = function(th) {
with_threshold(p, th)$costs
}
best = optimize(f, c(0.5, 1))
print(best)
$minimum
[1] 0.6905289
$objective
classif.costs
-0.08925
# inspect the confusion matrix at the optimized threshold
with_threshold(p, best$minimum)$confusion
truth
response good bad
good 515 91
bad 185 209
Note that the function "optimize()"
is intended for unimodal functions and, therefore, may converge to a local optimum here. See the next section for better alternatives to tune the threshold.
8.4.4 Threshold Tuning
Currently, mlr3pipelines offers two main strategies for adjusting classification thresholds. We can either expose the thresholds as a hyperparameter of the Learner by using PipeOpThreshold, which allows us to tune the thresholds via an outside optimizer from mlr3tuning. Alternatively, we can use PipeOpTuneThreshold, which automatically tunes the threshold after each learner fit. Both methods are described in the following subsections.
8.4.5 PipeOpThreshold
PipeOpThreshold can be put directly after a Learner.
A simple example would be:
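A minimal sketch; the object names graph and graph_learner are our own:
library("mlr3pipelines")
# chain a probability-predicting learner with the threshold pipe operator
graph = lrn("classif.rpart", predict_type = "prob") %>>% po("threshold")
graph_learner = GraphLearner$new(graph)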
Note, that predict_type = "prob"
is required for po("threshold")
to have any effect.
The thresholds are now exposed as a hyperparameter of the GraphLearner we created:
graph_learner$param_set
<ParamSetCollection>
id class lower upper nlevels default
1: classif.rpart.cp ParamDbl 0 1 Inf 0.01
2: classif.rpart.keep_model ParamLgl NA NA 2 FALSE
3: classif.rpart.maxcompete ParamInt 0 Inf Inf 4
4: classif.rpart.maxdepth ParamInt 1 30 30 30
5: classif.rpart.maxsurrogate ParamInt 0 Inf Inf 5
6: classif.rpart.minbucket ParamInt 1 Inf Inf <NoDefault[3]>
7: classif.rpart.minsplit ParamInt 1 Inf Inf 20
8: classif.rpart.surrogatestyle ParamInt 0 1 2 0
9: classif.rpart.usesurrogate ParamInt 0 2 3 2
10: classif.rpart.xval ParamInt 0 Inf Inf 10
11: threshold.thresholds ParamUty NA NA Inf <NoDefault[3]>
1 variable not shown: [value]
We can now tune those thresholds from the outside as follows. Before tuning, we have to define which hyperparameters we want to tune over. In this example, we only tune over the thresholds parameter of the threshold pipe operator. You can easily imagine that we could also jointly tune over additional hyperparameters, e.g., rpart's cp parameter.
As the Task we aim to optimize for is a binary task, we can simply specify the threshold parameter:
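A sketch using paradox's ps() helper; the parameter id matches the threshold.thresholds hyperparameter exposed above:
library("paradox")
search_space = ps(
threshold.thresholds = p_dbl(lower = 0, upper = 1)
)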
We now create an AutoTuner, which automatically tunes the supplied learner over the ParamSet we supplied above:
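A sketch of this setup, reusing the german_credit task and cost_measure from above; the resampling, budget and tuner choices are illustrative:
library("mlr3tuning")
at = AutoTuner$new(
learner = graph_learner,
resampling = rsmp("cv", folds = 3L),
measure = cost_measure,
search_space = search_space,
terminator = trm("evals", n_evals = 10L),
tuner = tnr("random_search")
)
at$train(task)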
Inside a trafo, we can alternatively collect all set parameters into a named vector via map_dbl and store it in the threshold.thresholds slot expected by the learner, as sketched below.
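A sketch of such a search space for our binary task, with one parameter per class level; the .extra_trafo collects the parameters into the named vector described above:
search_space = ps(
good = p_dbl(lower = 0, upper = 1),
bad = p_dbl(lower = 0, upper = 1),
.extra_trafo = function(x, param_set) {
# collect all set params into a named numeric vector
list(threshold.thresholds = mlr3misc::map_dbl(x, identity))
}
)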
Again, we create an AutoTuner, which automatically tunes the supplied learner over the ParamSet we supplied above.
One drawback of this strategy is that this requires us to fit a new model for each new threshold setting. While setting a threshold and computing performance is relatively cheap, fitting the learner is often more computationally demanding. An often better strategy is, therefore, to optimize the thresholds separately after each model fit.
8.4.6 PipeOpTuneThreshold
PipeOpTuneThreshold, on the other hand, works together with PipeOpLearnerCV. It directly optimizes the cross-validated predictions made by this PipeOp. This is necessary to avoid over-fitting the threshold tuning.
A simple example would be:
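A minimal sketch, again with our own object names:
# cross-validated predictions of the wrapped learner feed the threshold tuning
graph = po("learner_cv", lrn("classif.rpart", predict_type = "prob")) %>>%
po("tunethreshold")
graph_learner = GraphLearner$new(graph)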
Note that predict_type = "prob" is required for po("tunethreshold") to work. Additionally, note that this time no threshold parameter is exposed; it is automatically tuned internally:
graph_learner$param_set
<ParamSetCollection>
id class lower upper nlevels
1: classif.rpart.resampling.method ParamFct NA NA 2
2: classif.rpart.resampling.folds ParamInt 2 Inf Inf
3: classif.rpart.resampling.keep_response ParamLgl NA NA 2
4: classif.rpart.cp ParamDbl 0 1 Inf
5: classif.rpart.keep_model ParamLgl NA NA 2
6: classif.rpart.maxcompete ParamInt 0 Inf Inf
7: classif.rpart.maxdepth ParamInt 1 30 30
8: classif.rpart.maxsurrogate ParamInt 0 Inf Inf
9: classif.rpart.minbucket ParamInt 1 Inf Inf
10: classif.rpart.minsplit ParamInt 1 Inf Inf
11: classif.rpart.surrogatestyle ParamInt 0 1 2
12: classif.rpart.usesurrogate ParamInt 0 2 3
13: classif.rpart.xval ParamInt 0 Inf Inf
14: classif.rpart.affect_columns ParamUty NA NA Inf
15: tunethreshold.measure ParamUty NA NA Inf
16: tunethreshold.optimizer ParamUty NA NA Inf
17: tunethreshold.log_level ParamUty NA NA Inf
2 variables not shown: [default, value]
Note that we can set rsmp("intask")
as a resampling strategy for “learner_cv” in order to evaluate predictions on the “training” data. This is generally not advised, as it might lead to over-fitting on the thresholds but can significantly reduce runtime.
For more information, see the post on Threshold Tuning on the mlr3 gallery.
8.5 Cluster Analysis
Cluster analysis is a type of unsupervised machine learning where the goal is to group data into clusters, where each cluster contains similar observations. The similarity is based on specified metrics that are task and application-dependent. Cluster analysis is closely related to classification in the sense that each observation needs to be assigned to a cluster or a class. However, unlike classification problems where each observation is labeled, clustering works on data sets without true labels or class assignments.
The package mlr3cluster extends mlr3 with the following objects for cluster analysis:
- TaskClust to define clustering tasks
- LearnerClust as base class for clustering learners
- PredictionClust as specialized class for Prediction objects
- MeasureClust as specialized class for performance measures
Since clustering is a type of unsupervised learning, TaskClust is slightly different from TaskRegr and TaskClassif objects. More specifically:
- the truth() function is missing because observations are not labeled
- the target field is empty and will return character(0) if accessed anyway
Additionally, LearnerClust provides two extra fields that are absent from supervised learners:
- assignments returns cluster assignments for training data; it returns NULL if accessed before training
- save_assignments is a boolean field that controls whether or not to store training set assignments in a learner
Finally, PredictionClust contains two additional fields:
- partition stores cluster partitions
- prob stores cluster probabilities for each observation
8.5.1 Train and Predict
Clustering learners provide both train and predict methods. The analysis typically consists of building clusters using all available data. To be consistent with the rest of the library, we refer to this process as training.
Some learners can assign new observations to existing groups with predict. However, prediction does not always make sense, as is the case for hierarchical clustering. In hierarchical clustering, the goal is to build a hierarchy of nested clusters by either splitting large clusters into smaller ones or merging smaller clusters into bigger ones. The final result is a tree or dendrogram, which can change if a new data point is added. For consistency with the rest of the ecosystem, mlr3cluster offers a predict method for hierarchical clusterers, but it simply assigns all points to a specified number of clusters by cutting the resulting tree at the corresponding level. Moreover, some learners estimate the probability of each observation belonging to a given cluster. The predict_types field gives a list of prediction types for each learner.
After training, the model field stores a learner's model, which looks different for each learner depending on the underlying library. predict returns a PredictionClust object that gives a simplified view of the learned model. If the data given to the predict method is the same as the one on which the learner was trained, predict simply returns cluster assignments for the "training" observations. On the other hand, if the test set contains new data, predict will estimate cluster assignments for that data set. Some learners do not support estimating cluster partitions on new data and will instead return assignments for the training data and print a warning message.
In the following example, a k-means learner is applied on the US arrests data set. The class labels are predicted, and the contribution of the task features to the assignment of the respective class is visualized.
library("mlr3")
library("mlr3cluster")
library("mlr3viz")
set.seed(1L)
# create an example task
task = tsk("usarrests")
print(task)
<TaskClust:usarrests> (50 x 4): US Arrests
* Target: -
* Properties: -
* Features (4):
- int (2): Assault, UrbanPop
- dbl (2): Murder, Rape
# create a k-means learner
learner = lrn("clust.kmeans")
# assigning each observation to one of the two clusters (default in clust.kmeans)
learner$train(task)
learner$model
K-means clustering with 2 clusters of sizes 21, 29
Cluster means:
Assault Murder Rape UrbanPop
1 255.0000 11.857143 28.11429 67.61905
2 109.7586 4.841379 16.24828 64.03448
Clustering vector:
[1] 1 1 1 1 1 1 2 1 1 1 2 2 1 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 2 2 1 1 1 2 2 2 2 2
[39] 2 1 2 1 1 2 2 2 2 2 2 2
Within cluster sum of squares by cluster:
[1] 41636.73 54762.30
(between_SS / total_SS = 72.9 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
# make "predictions" for the same data
prediction = learner$predict(task)
autoplot(prediction, task)
8.5.2 Measures
The difference between supervised and unsupervised learning is that there is no ground truth data in unsupervised learning. In a supervised setting, such as classification, we would need to compare our predictions to true labels. Since clustering is an example of unsupervised learning, there are no true labels to which we can compare. However, we can still measure the quality of cluster assignments by quantifying how closely objects within the same cluster are related (cluster cohesion) as well as how distinct different clusters are from each other (cluster separation).
There are a few built-in evaluation metrics available to assess the quality of clustering. One of them is the within sum of squares (WSS), which calculates the sum of squared differences between observations and centroids. WSS is useful because it quantifies cluster cohesion. The range of this measure is \([0, \infty)\), where a smaller value means that clusters are more compact.
Another measure is the silhouette quality index, which quantifies how well each point belongs to its assigned cluster versus neighboring clusters. Silhouette values are in the \([-1, 1]\) range.
Points with a silhouette value close to:
- 1 are well clustered
- 0 lie between two clusters
- -1 are likely placed in the wrong cluster
The following is an example of conducting a benchmark experiment with various learners on the iris data set (without the target variable) and assessing the quality of each learner with both the within sum of squares and the silhouette measure.
design = benchmark_grid(
tasks = TaskClust$new("iris", iris[-5]),
learners = list(
lrn("clust.kmeans", centers = 3L),
lrn("clust.pam", k = 2L),
lrn("clust.cmeans", centers = 3L)),
resamplings = rsmp("insample"))
print(design)
task learner resampling
1: <TaskClust[46]> <LearnerClustKMeans[38]> <ResamplingInsample[20]>
2: <TaskClust[46]> <LearnerClustPAM[38]> <ResamplingInsample[20]>
3: <TaskClust[46]> <LearnerClustCMeans[38]> <ResamplingInsample[20]>
# execute benchmark
bmr = benchmark(design)
# define measure
measures = list(msr("clust.wss"), msr("clust.silhouette"))
bmr$aggregate(measures)
nr resample_result task_id learner_id resampling_id iters clust.wss
1: 1 <ResampleResult[21]> iris clust.kmeans insample 1 78.85144
2: 2 <ResampleResult[21]> iris clust.pam insample 1 153.32572
3: 3 <ResampleResult[21]> iris clust.cmeans insample 1 79.02617
1 variable not shown: [clust.silhouette]
The experiment shows that the k-means algorithm with three centers produces a better within sum of squares score than any other learner considered. However, the pam (partitioning around medoids) learner with two clusters performs best on the silhouette measure, which takes into account both cluster cohesion and separation.
8.5.3 Visualization
Cluster analysis in mlr3 is integrated with mlr3viz, which provides a number of useful plots. Some of those plots are shown below.
task = TaskClust$new("iris", iris[-5])
learner = lrn("clust.kmeans")
learner$train(task)
prediction = learner$predict(task)
# performing PCA on task and showing assignments
autoplot(prediction, task, type = "pca")
# same as above but with probability ellipse that assumes normal distribution
autoplot(prediction, task, type = "pca", frame = TRUE, frame.type = "norm")
task = tsk("usarrests")
learner = lrn("clust.hclust")
learner$train(task)
# dendrogram for hierarchical clustering
autoplot(learner, task = task)
Silhouette plots can help to visually assess the quality of the analysis and help choose the number of clusters for a given data set. The red dotted line shows the mean silhouette value and each bar represents a data point. If most points in each cluster have an index around or higher than the mean silhouette, the number of clusters is chosen well.
# silhouette plot allows to visually inspect the quality of clustering
task = TaskClust$new("iris", iris[-5])
learner = lrn("clust.kmeans")
learner$param_set$values = list(centers = 5L)
learner$train(task)
prediction = learner$predict(task)
autoplot(prediction, task, type = "sil")
The plot shows that all points in cluster 5 and almost all points in clusters 4, 2 and 1 are below the average silhouette index. This means that many observations lie either on the border of clusters or are likely assigned to the wrong cluster.
learner = lrn("clust.kmeans")
learner$param_set$values = list(centers = 2L)
learner$train(task)
prediction = learner$predict(task)
autoplot(prediction, task, type = "sil")
Setting the number of centers to two improves both the average silhouette score and the overall quality of the clustering, because almost all points in cluster 1 are above the mean silhouette and many points in cluster 2 are close to it. Hence, two centers might be a better choice for the number of clusters.