8 Special Tasks

This chapter explores the functionality of mlr3 for data sets that require specialized statistical treatment for a sensible analysis. The following topics are discussed:

Survival Analysis

This sub-chapter explains how to conduct sound survival analysis in mlr3. Survival analysis is used to monitor the period of time until a specific event takes place. This event could be, for example, death, transmission of a disease, marriage, or divorce. Two considerations are important when conducting survival analysis:

  • Whether the event occurred within the frame of the given data
  • How much time it took until the event occurred

In summary, this sub-chapter explains how to account for these considerations and conduct survival analysis using the mlr3proba extension package.

Density Estimation

This sub-chapter explains how to conduct (unconditional) density estimation in mlr3. Density estimation is used to estimate the probability density function of a continuous variable. Unconditional density estimation is an unsupervised task, so there is no ‘value’ to predict; instead, densities are estimated.

This sub-chapter explains how to estimate probability distributions for continuous variables using the mlr3proba extension package.

Spatiotemporal Analysis

In spatiotemporal analysis, observations carry additional information about their spatial and temporal characteristics. One of the largest issues of spatiotemporal data analysis is the inevitable presence of autocorrelation in the data. Autocorrelation is especially severe in data with little spatiotemporal variation. The sub-chapter on spatiotemporal analysis provides instructions on how to account for the special structure of spatiotemporal data.

Ordinal Analysis

This is work in progress. See mlr3ordinal for the current state.

Functional Analysis

Functional data analysis deals with data consisting of curves varying over a continuum, e.g. time, frequency, or wavelength. This type of analysis is frequently used when examining measurements over various time points. Steps on how to accommodate functional data structures in mlr3 are explained in the functional analysis sub-chapter.

Multilabel Classification

Multilabel classification deals with objects that can belong to more than one category at the same time: numerous target labels are attributed to a single observation. Working with multilabel data requires modified algorithms that accommodate its specific characteristics. Two approaches to multilabel classification are prominently used:

  • The problem transformation method
  • The algorithm adaptation method

Instructions on how to deal with multilabel classification in mlr3 can be found in this sub-chapter.

Cost Sensitive Classification

This sub-chapter deals with the implementation of cost-sensitive classification. Regular classification aims to minimize the misclassification rate and thus all types of misclassification errors are deemed equally severe. Cost-sensitive classification is a setting where the costs caused by different kinds of errors are not assumed to be equal. The objective is to minimize the expected costs.

Data from a large credit institution is used as a use case to illustrate the different features. The sub-chapter first provides guidance on how to implement an initial model, and subsequently covers cost-sensitive measures, thresholding, and threshold tuning.

Cluster Analysis

Cluster analysis aims to group data into clusters such that objects that are similar end up in the same cluster.
Fundamentally, clustering and classification are similar. However, clustering is an unsupervised task because observations do not contain true labels while in classification, labels are needed in order to train a model.

This sub-chapter explains how to perform cluster analysis in mlr3 with the help of the mlr3cluster extension package.

8.1 Survival Analysis

Survival analysis is a sub-field of supervised machine learning in which the aim is to predict the survival distribution of a given individual. Arguably the main feature of survival analysis is that, unlike classification and regression, learners are trained on two outcome variables:

  1. the time until the event takes place
  2. the event type: either censoring or death.

At a particular time-point, an individual is either alive, dead, or censored. Censoring occurs if it is unknown whether an individual is alive or dead. For example, say we are interested in patients in a hospital whose status is recorded every day; after a patient leaves the hospital, it is unknown whether they are alive or dead, hence they are censored. If there were no censoring, ordinary regression analysis could be used instead. Furthermore, survival data contains solely positive values and therefore needs to be transformed to avoid biases.

Note that survival analysis accounts for both censored and uncensored observations while adjusting respective model parameters.
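To see how censoring is encoded in practice, the survival package prints censored observations with a trailing +. A minimal sketch with made-up follow-up times:

library("survival")
# status = 1: the event was observed; status = 0: the observation is censored
time = c(4, 3, 1, 1, 2, 2, 3)
status = c(1, 1, 1, 0, 1, 1, 0)
Surv(time, status)  # censored entries (rows 4 and 7) print with a "+"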

The package mlr3proba (Sonabend et al. 2021) extends mlr3 with the following objects for survival analysis:

  • TaskSurv to define (censored) survival tasks
  • LearnerSurv as base class for survival learners
  • PredictionSurv as specialized class for Prediction objects
  • MeasureSurv as specialized class for performance measures

For a good introduction to survival analysis see Modelling Survival Data in Medical Research (Collett 2014).

8.1.1 TaskSurv

Unlike TaskClassif and TaskRegr, which have a single ‘target’ argument, TaskSurv mimics the survival::Surv object and has three to four target arguments (depending on the censoring type). A TaskSurv can be constructed with the function as_task_surv():

library("mlr3")
library("mlr3proba")
library("survival")

as_task_surv(survival::bladder2[, -1L], id = "interval_censored",
  time = "start", event = "event", time2 = "stop", type = "interval")
## <TaskSurv:interval_censored> (178 x 7)
## * Target: start, stop, event
## * Properties: -
## * Features (4):
##   - dbl (2): enum, rx
##   - int (2): number, size
# type = "right" is default
task = as_task_surv(survival::rats, id = "right_censored",
  time = "time", event = "status", type = "right")

print(task)
## <TaskSurv:right_censored> (300 x 5)
## * Target: time, status
## * Properties: -
## * Features (3):
##   - int (1): litter
##   - dbl (1): rx
##   - chr (1): sex
# the target column is a survival object:
head(task$truth())
## [1] 101+  49  104+  91+ 104+ 102+
# kaplan-meier plot
library("mlr3viz")
autoplot(task)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

8.1.2 Predict Types - crank, lp, and distr

Every PredictionSurv object can predict one or more of:

  • lp - Linear predictor calculated as the fitted coefficients multiplied by the test data.
  • distr - Predicted survival distribution, either discrete or continuous. Implemented in distr6.
  • crank - Continuous risk ranking.

lp and crank can be used with measures of discrimination such as the concordance index. While lp is a specific mathematical prediction, crank is any continuous ranking that identifies who is more or less likely to experience the event. So far, the only implemented learner that returns only a continuous ranking is surv.svm. If a PredictionSurv returns an lp, then crank is identical to it; otherwise, crank is calculated as the expectation of the predicted survival distribution. Note that for linear proportional hazards models, the ranking (though not necessarily the crank score itself) given by lp and by the expectation of distr is identical.

The example below uses the rats task shipped with mlr3proba.

task = tsk("rats")
learn = lrn("surv.coxph")

train_set = sample(task$nrow, 0.8 * task$nrow)
test_set = setdiff(seq_len(task$nrow), train_set)

learn$train(task, row_ids = train_set)
prediction = learn$predict(task, row_ids = test_set)

print(prediction)
## <PredictionSurv> for 60 observations:
##     row_ids time status   crank      lp     distr
##           2   49   TRUE -0.4151 -0.4151 <list[1]>
##          10   91  FALSE -2.7347 -2.7347 <list[1]>
##          11  104  FALSE -3.2501 -3.2501 <list[1]>
## ---                                              
##         270  102  FALSE -2.5238 -2.5238 <list[1]>
##         290   91  FALSE  0.3958  0.3958 <list[1]>
##         299  104  FALSE -2.4393 -2.4393 <list[1]>
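The predicted distr can be queried through the distr6 interface, for example for survival probabilities at given time points. A short sketch, assuming distr6's $survival() accessor, which evaluates the predicted survival function for each observation:

# survival probabilities at times 50 and 100 for the test observations
surv_probs = prediction$distr$survival(c(50, 100))
surv_probs[, 1:3]  # values for the first three test observations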

8.1.3 Composition

Finally, we take a look at the PipeOps implemented in mlr3proba, which are used for composition of predict types. For example, a predicted linear predictor does not have a lot of meaning by itself, but it can be composed into a survival distribution. See mlr3pipelines for full tutorials and details on PipeOps.

library("mlr3pipelines")
library("mlr3learners")
# PipeOpDistrCompositor - Train one model with a baseline distribution,
# (Kaplan-Meier or Nelson-Aalen), and another with a predicted linear predictor.
task = tsk("rats")
# remove the factor column for support with glmnet
task$select(c("litter", "rx"))
learner_lp = lrn("surv.glmnet")
learner_distr = lrn("surv.kaplan")
prediction_lp = learner_lp$train(task)$predict(task)
prediction_distr = learner_distr$train(task)$predict(task)
prediction_lp$distr

# Doesn't need training. Base = baseline distribution. ph = Proportional hazards.

pod = po("compose_distr", form = "ph", overwrite = FALSE)
prediction = pod$predict(list(base = prediction_distr, pred = prediction_lp))$output

# Now we have a predicted distr!

prediction$distr

# This can all be simplified by using the distrcompositor pipeline

glm.distr = ppl("distrcompositor", learner = lrn("surv.glmnet"),
  estimator = "kaplan", form = "ph", overwrite = FALSE, graph_learner = TRUE)
glm.distr$train(task)$predict(task)

8.1.4 Benchmark Experiment

Finally, we conduct a small benchmark study on the rats task using some of the integrated survival learners:

library("mlr3learners")

task = tsk("rats")

# some integrated learners
learners = lrns(c("surv.coxph", "surv.kaplan", "surv.ranger"))
print(learners)
## [[1]]
## <LearnerSurvCoxPH:surv.coxph>
## * Model: -
## * Parameters: list()
## * Packages: mlr3, survival, distr6
## * Predict Type: distr
## * Feature types: logical, integer, numeric, factor
## * Properties: weights
## 
## [[2]]
## <LearnerSurvKaplan:surv.kaplan>
## * Model: -
## * Parameters: list()
## * Packages: mlr3, survival, distr6
## * Predict Type: crank
## * Feature types: logical, integer, numeric, character, factor, ordered
## * Properties: missings
## 
## [[3]]
## <LearnerSurvRanger:surv.ranger>
## * Model: -
## * Parameters: num.threads=1
## * Packages: mlr3, mlr3learners, ranger
## * Predict Type: distr
## * Feature types: logical, integer, numeric, character, factor, ordered
## * Properties: importance, oob_error, weights
# Harrell's C-Index for survival
measure = msr("surv.cindex")
print(measure)
## <MeasureSurvCindex:surv.harrell_c>
## * Packages: mlr3
## * Range: [0, 1]
## * Minimize: FALSE
## * Average: macro
## * Parameters: list()
## * Properties: -
## * Predict type: crank
## * Return type: Score
set.seed(1)
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measure)
##    nr      resample_result task_id  learner_id resampling_id iters
## 1:  1 <ResampleResult[22]>    rats  surv.coxph            cv     3
## 2:  2 <ResampleResult[22]>    rats surv.kaplan            cv     3
## 3:  3 <ResampleResult[22]>    rats surv.ranger            cv     3
##    surv.harrell_c
## 1:         0.7671
## 2:         0.5000
## 3:         0.7737
autoplot(bmr, measure = measure)

The experiment indicates that both the Cox PH and the random forest have better discrimination than the Kaplan-Meier baseline estimator, but that the random forest is not consistently better than the interpretable Cox PH.

8.2 Density Estimation

Density estimation is the learning task of estimating the unknown distribution from which an i.i.d. data set is generated. We interpret this broadly: the distribution need not be continuous (it may have a probability mass function rather than a density). The conditional case, in which a distribution is predicted conditional on covariates, is known as ‘probabilistic supervised regression’ and will be implemented in mlr3proba in the near future. Unconditional density estimation is viewed as an unsupervised task. For a good overview of density estimation see Density estimation for statistics and data analysis (Silverman 1986).

The package mlr3proba extends mlr3 with the following objects for density estimation:

  • TaskDens to define density tasks
  • LearnerDens as base class for density estimators
  • PredictionDens as specialized class for Prediction objects
  • MeasureDens as specialized class for performance measures

In this example we demonstrate the basic functionality of the package on the precip data from the datasets package. This task ships as the pre-defined TaskDens "precip" with mlr3proba.

library("mlr3")
library("mlr3proba")

task = tsk("precip")
print(task)
## <TaskDens:precip> (70 x 1)
## * Target: -
## * Properties: -
## * Features (1):
##   - dbl (1): precip
# histogram and density plot
library("mlr3viz")
autoplot(task, type = "overlay")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Unconditional density estimation is an unsupervised method. Hence, TaskDens is an unsupervised task which inherits directly from Task unlike TaskClassif and TaskRegr. However, TaskDens still has a target argument and a $truth field defined by:

  • target - the name of the variable in the data for which to estimate density
  • $truth - the values of the target column (not the true density, which is always unknown)
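As a quick check on the precip task from above, $truth simply returns the observed precipitation values, not density values:

# values of the target column for the first observations
head(task$truth())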

8.2.1 Train and Predict

Density learners have train and predict methods, though being unsupervised, ‘prediction’ is actually ‘estimation’. In training, a distr6 object is created; see the distr6 documentation for full tutorials on how to access the probability density function (pdf), cumulative distribution function (cdf), and other important fields and methods. The predict method is simply a wrapper around self$model$pdf and, if available, self$model$cdf, i.e. it evaluates the pdf/cdf at given points. Note that in prediction, the points at which to evaluate the pdf and cdf are determined by the target column of the TaskDens object used for testing.

# create task and learner

task_faithful = TaskDens$new(id = "eruptions", backend = datasets::faithful$eruptions)
learner = lrn("dens.hist")

# train/test split
train_set = sample(task_faithful$nrow, 0.8 * task_faithful$nrow)
test_set = setdiff(seq_len(task_faithful$nrow), train_set)

# fitting the histogram estimator and inspecting the model
learner$train(task_faithful, row_ids = train_set)
learner$model
## $distr
## Histogram() 
## 
## $hist
## $breaks
## [1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
## 
## $counts
## [1] 45 34  4  4 24 57 46  3
## 
## $density
## [1] 0.41475 0.31336 0.03687 0.03687 0.22120 0.52535 0.42396 0.02765
## 
## $mids
## [1] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25
## 
## $xname
## [1] "dat"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## attr(,"class")
## [1] "dens.hist"
class(learner$model)
## [1] "dens.hist"
# make predictions for new data
prediction = learner$predict(task_faithful, row_ids = test_set)
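As elsewhere in mlr3, the prediction can be converted to a table for inspection; each row holds the estimated pdf (and, for learners that support it, cdf) value at one test point:

# one row per test observation with the estimated pdf value
head(as.data.table(prediction))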

Every PredictionDens object can estimate:

  • pdf - probability density function

Some learners can estimate:

  • cdf - cumulative distribution function

8.2.2 Benchmark Experiment

Finally, we conduct a small benchmark study on the precip task using some of the integrated density learners:

# some integrated learners
learners = lrns(c("dens.hist", "dens.kde"))
print(learners)
## [[1]]
## <LearnerDensHistogram:dens.hist>
## * Model: -
## * Parameters: list()
## * Packages: mlr3, distr6
## * Predict Type: pdf
## * Feature types: integer, numeric
## * Properties: -
## 
## [[2]]
## <LearnerDensKDE:dens.kde>
## * Model: -
## * Parameters: kernel=Epan, bandwidth=silver
## * Packages: mlr3, distr6
## * Predict Type: pdf
## * Feature types: integer, numeric
## * Properties: missings
# Logloss for probabilistic predictions
measure = msr("dens.logloss")
print(measure)
## <MeasureDensLogloss:dens.logloss>
## * Packages: mlr3
## * Range: [0, Inf]
## * Minimize: TRUE
## * Average: macro
## * Parameters: list()
## * Properties: -
## * Predict type: pdf
set.seed(1)
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measure)
##    nr      resample_result task_id learner_id resampling_id iters dens.logloss
## 1:  1 <ResampleResult[22]>  precip  dens.hist            cv     3        4.396
## 2:  2 <ResampleResult[22]>  precip   dens.kde            cv     3        4.818
autoplot(bmr, measure = measure)

The results of this experiment show that the simple histogram estimator outperforms the kernel density estimator on this task, achieving a lower (i.e. better) logloss.

8.3 Spatiotemporal Analysis

Data observations may entail reference information about spatial or temporal characteristics. Spatial information is stored as coordinates, usually named “x” and “y” or “lat”/“lon”. Treating spatiotemporal data with methods for non-spatial data can lead to over-optimistic performance estimates. Hence, methods specifically designed to account for the special nature of spatiotemporal data are needed.

In the mlr3 framework, the following packages relate to this field:

  • mlr3spatiotempcv, which provides spatiotemporal resampling and cross-validation methods
  • mlr3spatial, which provides (experimental) support for spatial prediction

The following (sub-)sections introduce the potential pitfalls of spatiotemporal data in machine learning and how to account for them. Note that not all functionality will be covered, and that some of the packages used are still in an early stage of their lifecycle. If you want to contribute to one of the packages mentioned above, please contact Patrick Schratz.

8.3.1 Creating a spatial Task

To make use of spatial resampling methods, an mlr3 task that is aware of its spatial characteristics needs to be created. Two child classes exist in mlr3spatiotempcv for this purpose:

  • TaskClassifST
  • TaskRegrST

To create one of these, one can either pass an sf object as the “backend” directly:

library("mlr3spatiotempcv")
# create 'sf' object (the ecuador data ships with mlr3spatiotempcv)
data_sf = sf::st_as_sf(ecuador, coords = c("x", "y"), crs = 32717)
# create mlr3 task
task = TaskClassifST$new("ecuador_sf",
  backend = data_sf, target = "slides", positive = "TRUE"
)

or use a plain data.frame. In this case, the constructor of TaskClassifST needs a few more arguments:

data = mlr3::as_data_backend(ecuador)
task = TaskClassifST$new("ecuador",
  backend = data, target = "slides",
  positive = "TRUE", extra_args = list(coordinate_names = c("x", "y"),
  crs = 32717)
)

Now this task can be used as a normal mlr3 task in any kind of modeling scenario.

8.3.2 Autocorrelation

Data which includes spatial or temporal information requires special treatment in machine learning (similar to survival, ordinal, and the other task types listed in this chapter). In contrast to non-spatial/non-temporal data, observations inherit a natural grouping, either in space, in time, or in both (Legendre 1993). This grouping causes observations to be autocorrelated, either in space (spatial autocorrelation (SAC)), in time (temporal autocorrelation (TAC)), or in both space and time (spatiotemporal autocorrelation (STAC)). For simplicity, the acronym STAC is used in the following as a generic term for all of the characteristics introduced above.

What effects does STAC have in statistical/machine learning?

The overarching problem is that STAC violates the assumption that the observations in the train and test datasets are independent (Hastie, Friedman, and Tibshirani 2001). If this assumption is violated, the reliability of the resulting performance estimates, for example retrieved via cross-validation, is decreased. The magnitude of this decrease is linked to the magnitude of STAC in the dataset, which cannot be determined easily.

One approach to account for the existence of STAC is to use dedicated resampling methods. mlr3spatiotempcv provides access to the most frequently used spatiotemporal resampling methods. The following example showcases how a spatial dataset can be used to retrieve a bias-reduced performance estimate of a learner.

The following examples use the ecuador dataset created by Jannes Muenchow. It contains information on the occurrence of landslides (binary) in the Andes of Southern Ecuador. The landslides were mapped from aerial photos taken in 2000. The dataset is well suited to serve as an example because it is relatively small and, of course, because of the spatial nature of its observations. Please refer to Muenchow, Brenning, and Richter (2012) for a detailed description of the dataset.

To account for the spatial autocorrelation probably present in the landslide data, we will make use of one of the most widely used spatial partitioning methods, a cluster-based k-means grouping (Brenning 2012) ("spcv_coords" in mlr3spatiotempcv). This method performs a clustering in 2D space, which contrasts with the random partitioning commonly used for non-spatial data. The grouping has the effect that train and test data are more separated in space than they would be under random partitioning, thereby reducing the effect of STAC.

By contrast, when using the classical random partitioning approach with spatial data, train and test observations would be located side-by-side across the full study area (a visual example is provided further below). This leads to a high similarity between train and test sets, resulting in “better” but biased performance estimates in every fold of a CV compared to the spatial CV approach. However, these low error rates are mainly caused by the STAC in the observations and the lack of appropriate partitioning methods, and not by the power of the fitted model.

8.3.3 Spatial CV vs. Non-Spatial CV

In the following a spatial and a non-spatial CV will be conducted to showcase the mentioned performance differences.

The performance of a simple classification tree ("classif.rpart") is evaluated on a random partitioning ("repeated_cv") with four folds and two repetitions. The chosen evaluation measure is “classification error” ("classif.ce"). The only difference in the spatial setting is that "repeated_spcv_coords" is chosen instead of "repeated_cv".

8.3.3.1 Non-Spatial CV

library("mlr3")
library("mlr3spatiotempcv")
set.seed(42)

# be less verbose
lgr::get_logger("bbotk")$set_threshold("warn")
lgr::get_logger("mlr3")$set_threshold("warn")

task = tsk("ecuador")

learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling_nsp = rsmp("repeated_cv", folds = 4, repeats = 2)
rr_nsp = resample(
  task = task, learner = learner,
  resampling = resampling_nsp)

rr_nsp$aggregate(measures = msr("classif.ce"))
## classif.ce 
##     0.3389

8.3.3.2 Spatial CV

task = tsk("ecuador")

learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling_sp = rsmp("repeated_spcv_coords", folds = 4, repeats = 2)
rr_sp = resample(
  task = task, learner = learner,
  resampling = resampling_sp)

rr_sp$aggregate(measures = msr("classif.ce"))
## classif.ce 
##     0.4125

Here, the classification tree learner performs around seven percentage points worse (0.41 vs. 0.34 classification error) with spatial cross-validation (SpCV) than with non-spatial cross-validation (NSpCV). The magnitude of this difference varies, as it depends on the dataset, the magnitude of STAC, and the learner itself. For algorithms with a higher tendency to overfit the training set, the difference between the two methods will be larger.

8.3.4 Visualization of Spatiotemporal Partitions

Every partitioning method in mlr3spatiotempcv comes with a generic plot() method to visualize the created groups. In 2D space this happens via ggplot2, while for spatiotemporal methods 3D visualizations are created via plotly.

autoplot(resampling_sp, task, fold_id = c(1:4), size = 0.7) *
  ggplot2::scale_y_continuous(breaks = seq(-3.97, -4, -0.01)) *
  ggplot2::scale_x_continuous(breaks = seq(-79.06, -79.08, -0.01))

Note that setting the correct CRS for the given data is important; this is done during task creation. Spatial offsets of up to multiple meters may occur if the wrong CRS is supplied initially.

This example used an already created task via the sugar function tsk(). In practice, however, one needs to create a spatiotemporal task via TaskClassifST/TaskRegrST and set the crs argument.

The spatial grouping of the k-means based approach above contrasts visually very well with the random (non-spatial) partitioning:

autoplot(resampling_nsp, task, fold_id = c(1:4), size = 0.7) *
  ggplot2::scale_y_continuous(breaks = seq(-3.97, -4, -0.01)) *
  ggplot2::scale_x_continuous(breaks = seq(-79.06, -79.08, -0.01))

8.3.5 Spatial Block Visualization

The "spcv_block" method makes use of rectangular blocks to divide the study area into equally sized parts. These blocks can be visualized by their spatial location and fold ID to get a better understanding of how they influenced the final partitions.

task = tsk("ecuador")
resampling = rsmp("spcv_block", range = 1000L)
resampling$instantiate(task)

## Visualize train/test splits of multiple folds
autoplot(resampling, task, size = 0.7,
  fold_id = c(1, 2), show_blocks = TRUE, show_labels = TRUE) *
  ggplot2::scale_x_continuous(breaks = seq(-79.085, -79.055, 0.01))

8.3.6 Choosing a Resampling Method

While the example used the "spcv_coords" method, this does not mean that it is the best or only method suitable for this task. Even though it is quite popular, it was mainly chosen because of its clear visual grouping differences compared to random partitioning.

In fact, most often multiple spatial partitioning methods can be used for a dataset. It is strongly recommended that users familiarize themselves with each implemented method and decide which one to choose based on the specific characteristics of the dataset. For almost all methods implemented in mlr3spatiotempcv, there is a scientific publication describing the strengths and weaknesses of the respective approach (either linked in the help file of mlr3spatiotempcv or its respective dependency packages).

In the example above, a cross-validation without hyperparameter tuning was shown. If a nested CV is desired, it is recommended to use the same spatial partitioning method for the inner loop (= tuning level). See Schratz et al. (2019) for more details and chapter 11 of Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019).

A list of all methods implemented in mlr3spatiotempcv can be found in the Getting Started vignette of the package.

If you want to learn even more about the field of spatial partitioning, STAC and the problems associated with it, the work of Prof. Hanna Meyer is very much recommended for further reference.

8.3.7 Spatial Prediction

Experimental support for spatial prediction with terra, raster, stars and sf objects is available in mlr3spatial.

Until the package is released on CRAN, please see the package vignettes for usage examples.

8.4 Ordinal Analysis

This is work in progress. See mlr3ordinal for the current state of the implementation.

8.5 Functional Analysis

Functional data is data containing an ordering on the dimensions. This implies that functional data consists of curves varying over a continuum, such as time, frequency, or wavelength.

8.5.1 How to model functional data?

There are two ways to model functional data:

  • Modification of the learner, so that it is suitable for functional data
  • Modification of the task, so that it matches a standard regression or classification learner

The development has not started yet; we are looking for contributors. Open an issue in mlr3fda if you are interested!

8.6 Multilabel Classification

Multilabel classification deals with objects that can belong to more than one category at the same time.

The development has not started yet; we are looking for contributors. Open an issue in mlr3multioutput if you are interested!

8.7 Cost-Sensitive Classification

In regular classification, the aim is to minimize the misclassification rate, and thus all types of misclassification errors are deemed equally severe. A more general setting is cost-sensitive classification, which does not assume that the costs caused by different kinds of errors are equal. Its objective is to minimize the expected costs.

Imagine you are an analyst for a big credit institution. Let's assume that a correct decision of the bank results in a profit of 35% at the end of a specific period. A correct decision means that the bank predicts that a customer will pay their bills (hence grants the loan), and the customer indeed has good credit. On the other hand, a wrong decision means that the bank predicts that the customer's credit is in good standing, but the opposite is true. This would result in a loss of 100% of the given loan.

                           Good Customer (truth)   Bad Customer (truth)
Good Customer (predicted)         +0.35                  -1.0
Bad Customer (predicted)           0                      0

Expressed as costs (instead of profit), we can write down the cost matrix as follows:

costs = matrix(c(-0.35, 0, 1, 0), nrow = 2)
dimnames(costs) = list(response = c("good", "bad"), truth = c("good", "bad"))
print(costs)
##         truth
## response  good bad
##     good -0.35   1
##     bad   0.00   0

An exemplary data set for such a problem is the German Credit task:

library("mlr3")
task = tsk("german_credit")
table(task$truth())
## 
## good  bad 
##  700  300

The data contains 70% good customers who are able to pay back their credit, and 30% bad customers who default on the debt. A manager without any model could decide to grant credit either to everybody or to nobody. The resulting costs for the German credit data are:

# nobody:
(700 * costs[2, 1] + 300 * costs[2, 2]) / 1000
## [1] 0
# everybody
(700 * costs[1, 1] + 300 * costs[1, 2]) / 1000
## [1] 0.055

If the average loan is $20,000, the credit institute would lose more than one million dollars if it granted everybody a credit:

# average cost per customer * average loan * number of customers
0.055 * 20000 * 1000
## [1] 1100000

Our goal is to find a model which minimizes the costs (and thereby maximizes the expected profit).

8.7.1 A First Model

For our first model, we choose an ordinary logistic regression (implemented in the add-on package mlr3learners). We resample the model using a 10-fold cross-validation and extract the resulting confusion matrix:

library("mlr3learners")
learner = lrn("classif.log_reg")
rr = resample(task, learner, rsmp("cv"))

confusion = rr$prediction()$confusion
print(confusion)
##         truth
## response good bad
##     good  601 158
##     bad    99 142

To calculate the average costs like above, we can simply multiply the elements of the confusion matrix with the elements of the previously introduced cost matrix, and sum the values of the resulting matrix:

avg_costs = sum(confusion * costs) / 1000
print(avg_costs)
## [1] -0.05235

With an average loan of $20,000, the logistic regression yields the following costs:

avg_costs * 20000 * 1000
## [1] -1047000

Instead of losing over $1,000,000, the credit institute can now expect a profit of more than $1,000,000.

8.7.2 Cost-sensitive Measure

Our natural next step is to further improve the modeling in order to maximize the profit. For this purpose, we first create a cost-sensitive classification measure which calculates the costs based on our cost matrix. This allows us to conveniently quantify and compare modeling decisions. Fortunately, there already is a predefined Measure for this purpose: MeasureClassifCosts:

cost_measure = msr("classif.costs", costs = costs)
print(cost_measure)
## <MeasureClassifCosts:classif.costs>
## * Packages: mlr3
## * Range: [-Inf, Inf]
## * Minimize: TRUE
## * Average: macro
## * Parameters: normalize=TRUE
## * Properties: requires_task
## * Predict type: response

If we now call resample() or benchmark(), the cost-sensitive measure will be evaluated. We compare the logistic regression to a simple featureless learner and to a random forest from the package ranger:

learners = list(
  lrn("classif.log_reg"),
  lrn("classif.featureless"),
  lrn("classif.ranger")
)
cv3 = rsmp("cv", folds = 3)
bmr = benchmark(benchmark_grid(task, learners, cv3))
bmr$aggregate(cost_measure)
##    nr      resample_result       task_id          learner_id resampling_id
## 1:  1 <ResampleResult[22]> german_credit     classif.log_reg            cv
## 2:  2 <ResampleResult[22]> german_credit classif.featureless            cv
## 3:  3 <ResampleResult[22]> german_credit      classif.ranger            cv
##    iters classif.costs
## 1:     3      -0.06125
## 2:     3       0.05498
## 3:     3      -0.05026

As expected, the featureless learner performs comparably badly. The logistic regression and the random forest work equally well.

8.7.3 Thresholding

Although we now correctly evaluate the models in a cost-sensitive fashion, the models themselves are unaware of the classification costs. They assume the same costs for both wrong classification decisions (false positives and false negatives). While some learners natively support cost-sensitive classification, we will concentrate on a more generic approach which works for all models that can predict probabilities for class labels: thresholding.

Most learners can calculate the probability \(p\) for the positive class. If \(p\) exceeds the threshold \(0.5\), they predict the positive class, and the negative class otherwise.

For our binary classification case of the credit data, we primarily want to minimize the errors where the model predicts “good” but the truth is “bad” (i.e., the number of false positives), as this is the more expensive error. If we now increase the threshold to values \(> 0.5\), we reduce the number of false positives. Note that we simultaneously increase the number of false negatives; in other words, we are trading fewer false positives for more false negatives.

# fit models with probability prediction
learner = lrn("classif.log_reg", predict_type = "prob")
rr = resample(task, learner, rsmp("cv"))
p = rr$prediction()
print(p)
## <PredictionClassif> for 1000 observations:
##     row_ids truth response prob.good prob.bad
##          22  good     good    0.7471 0.252924
##          41  good     good    0.7236 0.276401
##          54  good     good    0.9958 0.004214
## ---                                          
##         894  good     good    0.8657 0.134345
##         899  good     good    0.9605 0.039505
##         979   bad     good    0.7007 0.299265
# helper function to try different threshold values interactively
with_threshold = function(p, th) {
  p$set_threshold(th)
  list(confusion = p$confusion, costs = p$score(measures = cost_measure, task = task))
}

with_threshold(p, 0.5)
## $confusion
##         truth
## response good bad
##     good  602 154
##     bad    98 146
## 
## $costs
## classif.costs 
##       -0.0567
with_threshold(p, 0.75)
## $confusion
##         truth
## response good bad
##     good  463  74
##     bad   237 226
## 
## $costs
## classif.costs 
##      -0.08805
with_threshold(p, 1.0)
## $confusion
##         truth
## response good bad
##     good    0   0
##     bad   700 300
## 
## $costs
## classif.costs 
##             0
# TODO: include plot of threshold vs performance

Instead of manually trying different threshold values, one can use optimize() to find a good threshold value w.r.t. our performance measure:

# simple wrapper function which takes a threshold and returns the resulting model performance
# this wrapper is passed to optimize() to find its minimum for thresholds in [0.5, 1]
f = function(th) {
  with_threshold(p, th)$costs
}
best = optimize(f, c(0.5, 1))
print(best)
## $minimum
## [1] 0.7611
## 
## $objective
## classif.costs 
##       -0.0926
# optimized confusion matrix:
with_threshold(p, best$minimum)$confusion
##         truth
## response good bad
##     good  456  67
##     bad   244 233

Note that the function optimize() is intended for unimodal functions and therefore may converge to a local optimum here. See below for better alternatives to find good threshold values.

8.7.4 Threshold Tuning

Before we start, we have to load all required packages:

library("mlr3")
library("mlr3pipelines")
library("mlr3tuning")
## Loading required package: paradox

8.7.5 Adjusting thresholds: Two strategies

Currently mlr3pipelines offers two main strategies for adjusting classification thresholds. We can either expose the thresholds as a hyperparameter of the Learner by using PipeOpThreshold. This allows us to tune the thresholds via an outside optimizer from mlr3tuning.

Alternatively, we can also use PipeOpTuneThreshold which automatically tunes the threshold after each learner is fit.

In the following, we'll go through both strategies.

8.7.6 PipeOpThreshold

PipeOpThreshold can be put directly after a Learner.

A simple example would be:

gr = lrn("classif.rpart", predict_type = "prob") %>>% po("threshold")
l = as_learner(gr)

Note that predict_type = "prob" is required for po("threshold") to have any effect.

The thresholds are now exposed as a hyperparameter of the GraphLearner we created:

l$param_set
## <ParamSetCollection>
##                               id    class lower upper nlevels        default
##  1:             classif.rpart.cp ParamDbl     0     1     Inf           0.01
##  2:     classif.rpart.keep_model ParamLgl    NA    NA       2          FALSE
##  3:     classif.rpart.maxcompete ParamInt     0   Inf     Inf              4
##  4:       classif.rpart.maxdepth ParamInt     1    30      30             30
##  5:   classif.rpart.maxsurrogate ParamInt     0   Inf     Inf              5
##  6:      classif.rpart.minbucket ParamInt     1   Inf     Inf <NoDefault[3]>
##  7:       classif.rpart.minsplit ParamInt     1   Inf     Inf             20
##  8: classif.rpart.surrogatestyle ParamInt     0     1       2              0
##  9:   classif.rpart.usesurrogate ParamInt     0     2       3              2
## 10:           classif.rpart.xval ParamInt     0   Inf     Inf             10
## 11:         threshold.thresholds ParamUty    NA    NA     Inf <NoDefault[3]>
##     value
##  1:      
##  2:      
##  3:      
##  4:      
##  5:      
##  6:      
##  7:      
##  8:      
##  9:      
## 10:     0
## 11:   0.5
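The exposed threshold can also be set manually instead of being tuned. A minimal sketch, assuming a scalar value suffices for a binary task (consistent with the default of 0.5 shown above):

# set the threshold directly and refit
l$param_set$values$threshold.thresholds = 0.7
l$train(tsk("german_credit"))
l$predict(tsk("german_credit"))$confusion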

We can now tune these thresholds from the outside as follows.

Before tuning, we have to define which hyperparameters we want to tune over. In this example, we only tune over the thresholds parameter of the threshold PipeOp. You can easily imagine that we could also jointly tune over additional hyperparameters, e.g. rpart's cp parameter.

As the task we aim to optimize is a binary task, we can simply specify the threshold parameter:

library("paradox")
ps = ps(threshold.thresholds = p_dbl(lower = 0, upper = 1))

We now create an AutoTuner, which automatically tunes the supplied learner over the ParamSet we supplied above.

at = AutoTuner$new(
  learner = l,
  resampling = rsmp("cv", folds = 3L),
  measure = msr("classif.ce"),
  search_space = ps,
  terminator = trm("evals", n_evals = 5L),
  tuner = tnr("random_search")
)

at$train(tsk("german_credit"))
## INFO  [20:24:10.802] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerRandomSearch>' and '<TerminatorEvals> [n_evals=5, k=0]' 
## INFO  [20:24:10.852] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:24:11.299] [bbotk] Result of batch 1: 
## INFO  [20:24:11.301] [bbotk]  threshold.thresholds classif.ce runtime_learners 
## INFO  [20:24:11.301] [bbotk]                0.4271      0.266            0.283 
## INFO  [20:24:11.301] [bbotk]                                 uhash 
## INFO  [20:24:11.301] [bbotk]  3b0dc9c7-d3f3-4c3c-9d3e-3f696ab23794 
## INFO  [20:24:11.304] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:24:11.661] [bbotk] Result of batch 2: 
## INFO  [20:24:11.663] [bbotk]  threshold.thresholds classif.ce runtime_learners 
## INFO  [20:24:11.663] [bbotk]                0.5862      0.263            0.211 
## INFO  [20:24:11.663] [bbotk]                                 uhash 
## INFO  [20:24:11.663] [bbotk]  c8bc7f26-cf2e-4060-b7d7-4f2c786f667c 
## INFO  [20:24:11.666] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:24:12.026] [bbotk] Result of batch 3: 
## INFO  [20:24:12.028] [bbotk]  threshold.thresholds classif.ce runtime_learners 
## INFO  [20:24:12.028] [bbotk]                0.4186      0.266            0.216 
## INFO  [20:24:12.028] [bbotk]                                 uhash 
## INFO  [20:24:12.028] [bbotk]  b6073ab0-a472-4a52-8c85-14e279f1c84a 
## INFO  [20:24:12.031] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:24:12.401] [bbotk] Result of batch 4: 
## INFO  [20:24:12.404] [bbotk]  threshold.thresholds classif.ce runtime_learners 
## INFO  [20:24:12.404] [bbotk]                0.4328      0.266            0.221 
## INFO  [20:24:12.404] [bbotk]                                 uhash 
## INFO  [20:24:12.404] [bbotk]  4b231ad5-d4d3-439a-852b-7ab1a51a737f 
## INFO  [20:24:12.407] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:24:12.774] [bbotk] Result of batch 5: 
## INFO  [20:24:12.776] [bbotk]  threshold.thresholds classif.ce runtime_learners 
## INFO  [20:24:12.776] [bbotk]                0.4008      0.266            0.226 
## INFO  [20:24:12.776] [bbotk]                                 uhash 
## INFO  [20:24:12.776] [bbotk]  083201c0-dd67-4370-a6e3-c9c7d696c32d 
## INFO  [20:24:12.785] [bbotk] Finished optimizing after 5 evaluation(s) 
## INFO  [20:24:12.786] [bbotk] Result: 
## INFO  [20:24:12.788] [bbotk]  threshold.thresholds learner_param_vals  x_domain classif.ce 
## INFO  [20:24:12.788] [bbotk]                0.5862          <list[2]> <list[1]>      0.263


One drawback of this strategy is that it requires fitting a new model for each threshold setting. While setting a threshold and computing performance is relatively cheap, fitting the learner is often more computationally demanding. A better strategy is therefore often to optimize the thresholds separately after each model fit.

8.7.7 PipeOpTuneThreshold

PipeOpTuneThreshold on the other hand works together with PipeOpLearnerCV. It directly optimizes the cross-validated predictions made by this PipeOp. This is done in order to avoid over-fitting the threshold tuning.

A simple example would be:

gr = po("learner_cv", lrn("classif.rpart", predict_type = "prob")) %>>% po("tunethreshold")
l2 = as_learner(gr)

Note that predict_type = "prob" is required for po("tunethreshold") to work. Additionally, note that this time no threshold parameter is exposed; it is automatically tuned internally.

l2$param_set
## <ParamSetCollection>
##                                         id    class lower upper nlevels
##  1:        classif.rpart.resampling.method ParamFct    NA    NA       2
##  2:         classif.rpart.resampling.folds ParamInt     2   Inf     Inf
##  3: classif.rpart.resampling.keep_response ParamLgl    NA    NA       2
##  4:                       classif.rpart.cp ParamDbl     0     1     Inf
##  5:               classif.rpart.keep_model ParamLgl    NA    NA       2
##  6:               classif.rpart.maxcompete ParamInt     0   Inf     Inf
##  7:                 classif.rpart.maxdepth ParamInt     1    30      30
##  8:             classif.rpart.maxsurrogate ParamInt     0   Inf     Inf
##  9:                classif.rpart.minbucket ParamInt     1   Inf     Inf
## 10:                 classif.rpart.minsplit ParamInt     1   Inf     Inf
## 11:           classif.rpart.surrogatestyle ParamInt     0     1       2
## 12:             classif.rpart.usesurrogate ParamInt     0     2       3
## 13:                     classif.rpart.xval ParamInt     0   Inf     Inf
## 14:           classif.rpart.affect_columns ParamUty    NA    NA     Inf
## 15:                  tunethreshold.measure ParamUty    NA    NA     Inf
## 16:                tunethreshold.optimizer ParamUty    NA    NA     Inf
## 17:                tunethreshold.log_level ParamUty    NA    NA     Inf
##            default      value
##  1: <NoDefault[3]>         cv
##  2: <NoDefault[3]>          3
##  3: <NoDefault[3]>      FALSE
##  4:           0.01           
##  5:          FALSE           
##  6:              4           
##  7:             30           
##  8:              5           
##  9: <NoDefault[3]>           
## 10:             20           
## 11:              0           
## 12:              2           
## 13:             10          0
## 14:  <Selector[1]>           
## 15: <NoDefault[3]> classif.ce
## 16: <NoDefault[3]>      gensa
## 17:  <function[1]>       warn

Note that we can set resampling.method = "insample" for "learner_cv" in order to evaluate predictions on the “training” data. This is generally not advised, as it might lead to over-fitting on the thresholds, but it can significantly reduce runtime.
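A minimal sketch of this shortcut via the resampling.method parameter shown in the parameter set above:

# "insample" makes the PipeOp predict on its own training data:
# faster, but with a risk of over-fitting the thresholds
gr_fast = po("learner_cv", lrn("classif.rpart", predict_type = "prob"),
  resampling.method = "insample") %>>% po("tunethreshold")
l3 = as_learner(gr_fast)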

For more information, see the post on Threshold Tuning on the mlr3 gallery.

8.8 Cluster Analysis

Cluster analysis is a type of unsupervised machine learning where the goal is to group data into clusters such that each cluster contains similar observations. The similarity is based on specified metrics that are task- and application-dependent. Cluster analysis is closely related to classification in the sense that each observation needs to be assigned to a cluster or a class. However, unlike classification, where observations are labeled, clustering works on data sets without true labels or class assignments.

The package mlr3cluster extends mlr3 with the following objects for cluster analysis:

  • TaskClust to define clustering tasks
  • LearnerClust as base class for clustering learners
  • PredictionClust as specialized class for Prediction objects
  • MeasureClust as specialized class for performance measures

Since clustering is a type of unsupervised learning, TaskClust is slightly different from TaskRegr and TaskClassif objects. More specifically:

  • the truth() function is missing because observations are not labeled.
  • the target field is empty and returns character(0) if accessed anyway.

Additionally, LearnerClust provides two extra fields that are absent from supervised learners:

  • assignments returns cluster assignments for training data. It returns NULL if accessed before training.
  • save_assignments is a boolean field that controls whether or not to store training set assignments in a learner.

Finally, PredictionClust contains two additional fields:

  • partition stores cluster partitions.
  • prob stores cluster probabilities for each observation.

8.8.1 Train and Predict

Clustering learners provide both train and predict methods. The analysis typically consists of building clusters using all available data. To be consistent with the rest of the library, we refer to this process as training.

Some learners can assign new observations to existing groups with predict. However, prediction does not always make sense, as is the case for hierarchical clustering. In hierarchical clustering, the goal is to build a hierarchy of nested clusters by either splitting large clusters into smaller ones or merging smaller clusters into bigger ones. The final result is a tree or dendrogram, which can change if a new data point is added. For consistency with the rest of the ecosystem, mlr3cluster offers a predict method for hierarchical clusterers, but it simply assigns all points to a specified number of clusters by cutting the resulting tree at the corresponding level. Moreover, some learners estimate the probability of each observation belonging to a given cluster. The predict_types field gives the list of prediction types for each learner.
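A short sketch of this behavior with the hierarchical agnes learner (also used in the visualization section below), where the parameter k controls into how many clusters the dendrogram is cut:

library("mlr3cluster")
task_arrests = tsk("usarrests")
# agnes builds a dendrogram during training; "predicting" cuts it into k clusters
learner_hc = lrn("clust.agnes", k = 3)
learner_hc$train(task_arrests)
head(learner_hc$predict(task_arrests)$partition)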

After training, the model field stores a learner’s model that looks different for each learner depending on the underlying library. predict returns a PredictionClust object that gives a simplified view of the learned model. If the data given to the predict method is the same as the one on which the learner was trained, predict simply returns cluster assignments for the “training” observations. On the other hand, if the test set contains new data, predict will estimate cluster assignments for that data set. Some learners do not support estimating cluster partitions on new data and will instead return assignments for training data and print a warning message.

In the following example, a k-means learner is applied to the US arrests data set. Cluster labels are predicted, and the contribution of the task's features to the resulting assignments is visualized.

library("mlr3")
library("mlr3cluster")
library("mlr3viz")
set.seed(1L)

# create an example task
task = tsk("usarrests")
print(task)
## <TaskClust:usarrests> (50 x 4)
## * Target: -
## * Properties: -
## * Features (4):
##   - int (2): Assault, UrbanPop
##   - dbl (2): Murder, Rape
autoplot(task)
# create a k-means learner
learner = lrn("clust.kmeans")

# assigning each observation to one of the two clusters (default in clust.kmeans)
learner$train(task)
learner$model
## K-means clustering with 2 clusters of sizes 21, 29
## 
## Cluster means:
##   Assault Murder  Rape UrbanPop
## 1   255.0 11.857 28.11    67.62
## 2   109.8  4.841 16.25    64.03
## 
## Clustering vector:
##  [1] 1 1 1 1 1 1 2 1 1 1 2 2 1 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 2 2 1 1 1 2 2 2 2 2
## [39] 2 1 2 1 1 2 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 41637 54762
##  (between_SS / total_SS =  72.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# make "predictions" for the same data
prediction = learner$predict(task)
autoplot(prediction, task)
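Because save_assignments is enabled by default, the assignments made during training can also be read directly from the learner:

# training-set cluster assignments stored by the learner
head(learner$assignments)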

8.8.2 Measures

The difference between supervised and unsupervised learning is that there is no ground truth in unsupervised learning. In a supervised setting, such as classification, we compare our predictions to true labels. Since clustering is an example of unsupervised learning, there are no true labels to compare against. However, we can still measure the quality of cluster assignments by quantifying how closely objects within the same cluster are related (cluster cohesion) as well as how distinct different clusters are from each other (cluster separation).

To assess the quality of a clustering, a few built-in evaluation metrics are available. One of them is the within-cluster sum of squares (WSS), which calculates the sum of squared distances between observations and their cluster centroids. WSS quantifies cluster cohesion; its range is \([0, \infty)\), where a smaller value means that clusters are more compact.
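For \(K\) clusters \(C_1, \ldots, C_K\) with centroids \(\mu_1, \ldots, \mu_K\), this can be written as

\[ \mathrm{WSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 . \]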

Another measure is the silhouette quality index, which quantifies how well each point belongs to its assigned cluster versus neighboring clusters. Silhouette values lie in the range \([-1, 1]\).
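For an observation \(i\), let \(a(i)\) denote the mean distance to all other points in its own cluster and \(b(i)\) the mean distance to the points of the nearest neighboring cluster; the silhouette is then

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} . \]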

Points with silhouette closer to:

  • 1 are well clustered
  • 0 lie between two clusters
  • -1 likely placed in the wrong cluster

The following is an example of conducting a benchmark experiment with various learners on the iris data set (without the target variable) and assessing the quality of each learner with both the within sum of squares and the silhouette measure.

design = benchmark_grid(
  tasks = TaskClust$new("iris", iris[-5]),
  learners = list(
    lrn("clust.kmeans", centers = 3L),
    lrn("clust.pam", k = 2L),
    lrn("clust.cmeans", centers = 3L)),
  resamplings = rsmp("insample"))
print(design)
##               task                  learner               resampling
## 1: <TaskClust[45]> <LearnerClustKMeans[37]> <ResamplingInsample[19]>
## 2: <TaskClust[45]>    <LearnerClustPAM[37]> <ResamplingInsample[19]>
## 3: <TaskClust[45]> <LearnerClustCMeans[37]> <ResamplingInsample[19]>
# execute benchmark
bmr = benchmark(design)

# define measure
measures = list(msr("clust.wss"), msr("clust.silhouette"))
bmr$aggregate(measures)
##    nr      resample_result task_id   learner_id resampling_id iters clust.wss
## 1:  1 <ResampleResult[22]>    iris clust.kmeans      insample     1     78.85
## 2:  2 <ResampleResult[22]>    iris    clust.pam      insample     1    153.33
## 3:  3 <ResampleResult[22]>    iris clust.cmeans      insample     1     79.03
##    clust.silhouette
## 1:           0.5555
## 2:           0.7158
## 3:           0.5493

The experiment shows that the k-means algorithm with three centers produces a better within sum of squares score than the other learners considered. However, the pam learner (partitioning around medoids) with two clusters performs best on the silhouette measure, which takes into account both cluster cohesion and separation.

8.8.3 Visualization

Cluster analysis in mlr3 is integrated with mlr3viz which provides a number of useful plots. Some of those plots are shown below.

task = TaskClust$new("iris", iris[-5])
learner = lrn("clust.kmeans")
learner$train(task)
prediction = learner$predict(task)

# performing PCA on task and showing assignments
autoplot(prediction, task, type = "pca")
# same as above but with probability ellipse that assumes normal distribution
autoplot(prediction, task, type = "pca", frame = TRUE, frame.type = "norm")
task = tsk("usarrests")
learner = lrn("clust.agnes")
learner$train(task)

# dendrogram for hierarchical clustering
autoplot(learner)
# advanced dendrogram options from `factoextra::fviz_dend`
autoplot(learner,
  k = learner$param_set$values$k, rect_fill = TRUE,
  rect = TRUE)

Silhouette plots can help to visually assess the quality of the analysis and to choose the number of clusters for a given data set. The red dotted line shows the mean silhouette value, and each bar represents a data point. If most points in each cluster have an index around or higher than the mean silhouette, the number of clusters is chosen well.

# silhouette plot allows to visually inspect the quality of clustering
task = TaskClust$new("iris", iris[-5])
learner = lrn("clust.kmeans")
learner$param_set$values = list(centers = 5L)
learner$train(task)
prediction = learner$predict(task)
autoplot(prediction, task, type = "sil")

The plot shows that all points in cluster 5, and almost all points in clusters 4, 2, and 1, are below the average silhouette index. This means that many observations lie either on the border of clusters or are likely assigned to the wrong cluster.

learner = lrn("clust.kmeans")
learner$param_set$values = list(centers = 2L)
learner$train(task)
prediction = learner$predict(task)
autoplot(prediction, task, type = "sil")

Setting the number of centers to two improves both the average silhouette score and the overall quality of the clustering, because almost all points in cluster 1 are above the mean silhouette and many points in cluster 2 are close to it. Hence, two centers might be a better choice for the number of clusters.