37  Cluster Analysis

Cluster analysis is a type of unsupervised machine learning where the goal is to group data into clusters, where each cluster contains similar observations. The similarity is based on specified metrics that are task and application dependent. Cluster analysis is closely related to classification in a sense that each observation needs to be assigned to a cluster or a class. However, unlike classification problems where each observation is labeled, clustering works on data sets without true labels or class assignments.

The package mlr3cluster extends mlr3 with the following objects for cluster analysis:

Since clustering is a type of unsupervised learning, TaskClust is slightly different from TaskRegr and TaskClassif objects. More specifically:

Additionally, LearnerClust provides two extra fields that are absent from supervised learners:

Finally, PredictionClust contains additional two fields:

37.1 Train and Predict

Clustering learners provide both train and predict methods. The analysis typically consists of building clusters using all available data. To be consistent with the rest of the library, we refer to this process as training.

Some learners can assign new observations to existing groups with predict. However, prediction does not always make sense, as it is the case for hierarchical clustering. In hierarchical clustering, the goal is to build a hierarchy of nested clusters by either splitting large clusters into smaller ones or merging smaller clusters into bigger ones. The final result is a tree or dendrogram which can change if a new data point is added. For consistency with the rest of the ecosystem, mlr3cluster offers predict method for hierarchical clusterers but it simply assigns all points to a specified number of clusters by cutting the resulting tree at a corresponding level. Moreover, some learners estimate the probability of each observation belonging to a given cluster. predict_types field gives a list of prediction types for each learner.

After training, the model field stores a learner’s model that looks different for each learner depending on the underlying library. predict returns a PredictionClust object that gives a simplified view of the learned model. If the data given to the predict method is the same as the one on which the learner was trained, predict simply returns cluster assignments for the “training” observations. On the other hand, if the test set contains new data, predict will estimate cluster assignments for that data set. Some learners do not support estimating cluster partitions on new data and will instead return assignments for training data and print a warning message.

In the following example, a $k$-means learner is applied on the US arrest data set. The class labels are predicted and the contribution of the task features to assignment of the respective class are visualized.

library("mlr3")
library("mlr3cluster")
library("mlr3viz")
set.seed(1L)

# create an example task
task = tsk("usarrests")
print(task)
<TaskClust:usarrests> (50 x 4): US Arrests
* Target: -
* Properties: -
* Features (4):
  - int (2): Assault, UrbanPop
  - dbl (2): Murder, Rape
autoplot(task)

# create a k-means learner
learner = lrn("clust.kmeans")

# assigning each observation to one of the two clusters (default in clust.kmeans)
learner$train(task)
learner$model
K-means clustering with 2 clusters of sizes 21, 29

Cluster means:
   Assault    Murder     Rape UrbanPop
1 255.0000 11.857143 28.11429 67.61905
2 109.7586  4.841379 16.24828 64.03448

Clustering vector:
 [1] 1 1 1 1 1 1 2 1 1 1 2 2 1 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 2 2 1 1 1 2 2 2 2 2
[39] 2 1 2 1 1 2 2 2 2 2 2 2

Within cluster sum of squares by cluster:
[1] 41636.73 54762.30
 (between_SS / total_SS =  72.9 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
# make "predictions" for the same data
prediction = learner$predict(task)
autoplot(prediction, task)

37.2 Measures

The difference between supervised and unsupervised learning is that there is no ground truth data in unsupervised learning. In a supervised setting, such as classification, we would need to compare our predictions to true labels. Since clustering is an example of unsupervised learning, there are no true labels to which we can compare. However, we can still measure the quality of cluster assignments by quantifying how closely objects within the same cluster are related (cluster cohesion) as well as how distinct different clusters are from each other (cluster separation).

To assess the quality of clustering, there are a few built-in evaluation metrics available. One of them is within sum of squares (WSS) which calculates the sum of squared differences between observations and centroids. WSS is useful because it quantifies cluster cohesion. The range of this measure is \([0, \infty)\) where a smaller value means that clusters are more compact.

Another measure is silhouette quality index that quantifies how well each point belongs to its assigned cluster versus neighboring cluster. Silhouette values are in \([-1, 1]\) range.

Points with silhouette closer to:

  • 1 are well clustered
  • 0 lie between two clusters
  • -1 likely placed in the wrong cluster

The following is an example of conducting a benchmark experiment with various learners on iris data set without target variable and assessing the quality of each learner with both within sum of squares and silhouette measures.

design = benchmark_grid(
  tasks = TaskClust$new("iris", iris[-5]),
  learners = list(
    lrn("clust.kmeans", centers = 3L),
    lrn("clust.pam", k = 2L),
    lrn("clust.cmeans", centers = 3L)),
  resamplings = rsmp("insample"))
print(design)
              task                  learner               resampling
1: <TaskClust[46]> <LearnerClustKMeans[38]> <ResamplingInsample[20]>
2: <TaskClust[46]>    <LearnerClustPAM[38]> <ResamplingInsample[20]>
3: <TaskClust[46]> <LearnerClustCMeans[38]> <ResamplingInsample[20]>
# execute benchmark
bmr = benchmark(design)
INFO  [21:40:33.718] [mlr3] Running benchmark with 3 resampling iterations 
INFO  [21:40:33.845] [mlr3] Applying learner 'clust.kmeans' on task 'iris' (iter 1/1) 
INFO  [21:40:34.005] [mlr3] Applying learner 'clust.cmeans' on task 'iris' (iter 1/1) 
INFO  [21:40:34.045] [mlr3] Applying learner 'clust.pam' on task 'iris' (iter 1/1) 
INFO  [21:40:34.063] [mlr3] Finished benchmark 
# define measure
measures = list(msr("clust.wss"), msr("clust.silhouette"))
bmr$aggregate(measures)
   nr      resample_result task_id   learner_id resampling_id iters clust.wss
1:  1 <ResampleResult[22]>    iris clust.kmeans      insample     1  78.85144
2:  2 <ResampleResult[22]>    iris    clust.pam      insample     1 153.32572
3:  3 <ResampleResult[22]>    iris clust.cmeans      insample     1  79.02617
   clust.silhouette
1:        0.5555218
2:        0.7158437
3:        0.5492507

The experiment shows that using k-means algorithm with three centers produces a better within sum of squares score than any other learner considered. However, pam (partitioning around medoids) learner with two clusters performs the best when considering silhouette measure which takes into the account both cluster cohesion and separation.

37.3 Visualization

Cluster analysis in mlr3 is integrated with mlr3viz which provides a number of useful plots. Some of those plots are shown below.

task = TaskClust$new("iris", iris[-5])
learner = lrn("clust.kmeans")
learner$train(task)
prediction = learner$predict(task)

# performing PCA on task and showing assignments
autoplot(prediction, task, type = "pca")

# same as above but with probability ellipse that assumes normal distribution
autoplot(prediction, task, type = "pca", frame = TRUE, frame.type = "norm")

task = tsk("usarrests")
learner = lrn("clust.agnes")
learner$train(task)

# dendrogram for hierarchical clustering
autoplot(learner)

# advanced dendrogram options from `factoextra::fviz_dend`
autoplot(learner,
  k = learner$param_set$values$k, rect_fill = TRUE,
  rect = TRUE)

Silhouette plots can help to visually assess the quality of the analysis and help choose a number of clusters for a given data set. The red dotted line shows the mean silhouette value and each bar represents a data point. If most points in each cluster have an index around or higher than mean silhouette, the number of clusters is chosen well.

# silhouette plot allows to visually inspect the quality of clustering
task = TaskClust$new("iris", iris[-5])
learner = lrn("clust.kmeans")
learner$param_set$values = list(centers = 5L)
learner$train(task)
prediction = learner$predict(task)
autoplot(prediction, task, type = "sil")

The plot shows that all points in cluster 5 and almost all points in clusters 4, 2 and 1 are below average silhouette index. This means that a lot of observations lie either on the border of clusters or are likely assigned to the wrong cluster.

learner = lrn("clust.kmeans")
learner$param_set$values = list(centers = 2L)
learner$train(task)
prediction = learner$predict(task)
autoplot(prediction, task, type = "sil")

Setting the number of centers to two improves both average silhouette score as well as overall quality of clustering because almost all points in cluster 1 are higher than and a lot of points in cluster 2 are close to mean silhouette. Hence, having two centers might be a better choice for the number of clusters.