8  Beyond Deterministic Regression and Single-Label Classification

Authors
Raphael Sonabend (Imperial College London)
Patrick Schratz
Damir Pulatov (University of Wyoming)

Abstract
Previous chapters have considered the core functionality of the mlr3 universe and its helper packages, which together create a modular ecosystem. This chapter uses that modularity to discuss tasks beyond deterministic regression and single-label classification. The chapter starts with cost-sensitive classification, which makes use of mlr3, mlr3tuning, and mlr3pipelines to implement a classification task in which different types of errors incur different costs. Next, the chapter looks at probabilistic regression, a regression problem where the prediction is a distribution rather than a single point. We first discuss how this is implemented in mlr3 and how it is extended in mlr3proba. We then discuss survival analysis and density estimation, which are both implemented in mlr3proba and make use of the distr6 package for distribution predictions. After these tasks, we look at unsupervised learning in the form of clustering with mlr3cluster, and finally spatial analysis with mlr3spatial and mlr3spatiotempcv. As some or all of these tasks may be new to some readers, we include an overview of the theory behind each, before looking at the extensions to common mlr3 objects (learners, predictions, tasks, and measures), and then finally consider task-specific extensions.

The chapter title reflects that up until now we have only considered two tasks: regression (specifically deterministic) and classification (specifically single-label). So far, this book has focused on a detailed explanation of the mlr3 universe of packages, which is abstracted in such a way that users can choose the level of complexity that suits each project. For example, you could limit your usage of these packages to just mlr3 to train a small subset of regression and classification learners on a dataset and then use your model to make predictions for new, unseen data. You could extend this to many more regression and classification learners with mlr3learners and mlr3extralearners. By making use of mlr3tuning and mlr3pipelines you could optimize your model further, including with pre- and post-processing of data and algorithms. However, the real power of the mlr3 universe is its ability to take all of these features and extend them to many other machine learning tasks.

In Chapter 2 we introduced deterministic regression and deterministic & probabilistic single-label classification (Table 8.1). But our infrastructure also works well for survival analysis, density estimation, clustering, spatiotemporal analysis, time-series forecasting, deep learning architectures, image analysis, multi-label classification, probabilistic regression, and many more. Some of these tasks are implemented in extension packages as part of the mlr3 universe (see Figure 1.1), some are available by creating pipelines with mlr3pipelines, some are in development, and some are waiting to be developed (maybe by some readers of this book). In this chapter, we will take you through just a subset of these new tasks, focusing on the ones that have a stable API. As we work through this chapter we will refer to the ‘building blocks’ of mlr3: the base classes that must be extended to create new tasks, namely Prediction, Learner, Measure, and Task.

Table 8.1 summarizes the available extension tasks, including the package(s) they are implemented in, the stability of the task implementation, and a brief description. Only tasks with stable APIs are discussed in this chapter, although the table also lists tasks in development.

Table 8.1: Table of extension tasks that can be used with mlr3 infrastructure. ‘Stable’ tasks have an interface that is unlikely to undergo any major changes in the future; these tasks are discussed in this chapter. ‘Developing’ tasks are in packages that might have breaking API changes. As we have a growing community of contributors, this list is far from exhaustive and many ‘experimental’ task implementations exist; the list only represents tasks with a functioning interface.
Task | Package | Status | Description
Deterministic regression | mlr3 | Stable | Point prediction of a continuous variable.
Probabilistic regression | mlr3 and mlr3proba | Stable | Probability distribution prediction of a continuous variable.
Deterministic single-label classification | mlr3 | Stable | Prediction of the single class an observation falls into.
Probabilistic single-label classification | mlr3 | Stable | Prediction of the probability of an observation falling into one of two or more mutually exclusive categories.
Cost-sensitive classification | mlr3 and mlr3pipelines | Stable | Classification predictions with unequal costs associated with misclassifications.
Survival analysis | mlr3proba | Stable | Time-to-event predictions with possible ‘censoring’.
Density estimation | mlr3proba | Stable | Unsupervised estimation of probability density functions.
Spatiotemporal analysis | mlr3spatiotempcv and mlr3spatial | Stable | Supervised prediction of data with spatial (e.g., coordinates) and/or temporal dimensions.
Cluster analysis | mlr3cluster | Stable | Unsupervised estimation of homogeneous clusters of data points.
Time series forecasting | mlr3temporal | Developing | Supervised prediction of a target over time (e.g., daily temperature).
Functional data analysis | mlr3fda | Developing | Supervised predictions from functional data.

8.1 Cost-Sensitive Classification

We begin by discussing a task that does not require any additional packages or infrastructure, only the tools we have already learnt about in earlier chapters. In ‘regular’ classification, the aim is to optimize a metric (often the misclassification rate) whilst assuming all misclassification errors are equally severe. A more general setting is cost-sensitive classification, in which costs caused by different kinds of errors may not be equal. The objective of cost-sensitive classification is to minimize the expected costs. As we discuss this task we will work with mlr_tasks_german_credit (Appendix C) as a running example.


Imagine in this task that an insurance company, InsureU, provides fixed loans of $5K over a one-year period and customers are expected to pay back $6K at the end of this period. InsureU needs to predict whether a new customer will have good credit in order to decide if a loan should be given. This is a deterministic classification problem in which InsureU classifies whether a customer will be in class ‘Good’ or ‘Bad’. We can now write up the potential costs associated with each prediction and the eventual truth:

costs = matrix(c(-1, 0, 5, 0), nrow = 2, dimnames =
  list("Predicted Credit" = c("good", "bad"),
    Truth = c("good", "bad")))
costs
                Truth
Predicted Credit good bad
            good   -1   5
            bad     0   0

If InsureU thinks a customer has bad credit then the loan is not provided, so there is no profit or loss. If a customer has good credit and returns the loan then InsureU makes a $1K profit, but if they default then InsureU loses $5K. Note that cost-sensitive classification is a minimization problem in which we assume lower costs mean higher profits/more positive outcomes, hence above we wrote the profit and loss as -1 and +5 respectively.
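
To make this concrete, the expected cost is just the average of the per-observation costs, weighted by how often each prediction/truth combination occurs. The following small sketch computes this quantity by hand for a hypothetical confusion matrix (the counts are made up purely for illustration); this is essentially what the cost-sensitive measure introduced in the next section computes when normalize = TRUE:

# hypothetical confusion matrix of predictions (rows) against truth (columns)
confusion = matrix(c(700, 100, 120, 80), nrow = 2, dimnames = dimnames(costs))
# average cost per observation: elementwise product with the cost matrix,
# summed and divided by the number of observations
sum(confusion * costs) / sum(confusion)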

8.1.1 Cost-sensitive Measure

We will now see how to implement a more nuanced approach to classification errors with MeasureClassifCosts. This measure takes one argument, a matrix with row and column names corresponding to the class labels in the task of interest. Let's put our insurance example into practice; notice that we have already named the rows and columns of the cost matrix as required by the measure:

library(mlr3verse)

task = tsk("german_credit")

cost_measure = msr("classif.costs", costs = costs)
cost_measure
<MeasureClassifCosts:classif.costs>: Cost-sensitive Classification
* Packages: mlr3
* Range: [-Inf, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: normalize=TRUE
* Properties: -
* Predict type: response
learners = lrns(
  paste0("classif.", c("log_reg", "featureless", "ranger")))
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(cost_measure)[, c(4, 7)]
            learner_id classif.costs
1:     classif.log_reg     0.1790683
2: classif.featureless     0.8001654
3:      classif.ranger     0.2491294

In this experiment we find that the logistic regression learner happens to perform best as it minimizes the expected costs (and maximizes expected profits) and the featureless learner performs the worst. However, we could improve this prediction by considering thresholding.

8.1.2 Thresholding

As discussed in Chapter 2, thresholding is a method to determine at what probability an observation will be assigned to one outcome class or another. Currently in our running example, the models above will predict that a customer has good credit (class ‘Good’) if the probability of good credit is greater than 0.5. However, this may not be a sensible approach as the costs of a false positive and a false negative are not equal. In fact, a false positive results in a cost of +5 whereas a false negative only results in a cost of 0, hence we would prefer a model with a higher false negative rate and a lower false positive rate. This is highlighted in the "threshold" autoplot:

prediction = lrn("classif.log_reg",
  predict_type = "prob")$train(task)$predict(task)
autoplot(prediction, type = "threshold", measure = cost_measure)

As expected, the optimal threshold is greater than 0.5, indicating that to maximize profits, the majority of observations should be predicted to have bad credit.
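
Before automating this, we could manually adjust the threshold on the prediction object from above and rescore it with our cost measure. The threshold of 0.7 in the following sketch is arbitrary (not optimized) and assumes the task's positive class is ‘good’:

# require at least 70% predicted probability of good credit
prediction$set_threshold(0.7)
prediction$score(cost_measure)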

To automate the process of optimizing the threshold, we can make use of mlr3tuning (Chapter 4) and mlr3pipelines (Chapter 6) by creating a graph with PipeOpTuneThreshold. Continuing the same example:

gr = po("learner_cv", lrn("classif.log_reg", predict_type = "prob")) %>>%
  po("tunethreshold", measure = cost_measure)

learners = list(as_learner(gr), lrn("classif.log_reg"))
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(cost_measure)[, c(4, 7)]
                      learner_id classif.costs
1: classif.log_reg.tunethreshold    -0.1060282
2:               classif.log_reg     0.1481062

As expected, our tuned learner performs much better and may even make InsureU a profit rather than a loss.

8.2 Probabilistic Regression

Probabilistic regression is a simple extension to deterministic regression in which the goal is to predict a full distribution around a single ‘point prediction’. For example, instead of just predicting someone’s height, \(\mu\), we could also predict how confident we are about our prediction, usually estimated as a standard error, \(\sigma\). This ‘confidence’ could be expressed as a confidence interval, for example \(\mu \pm \sigma\), or by assuming a distribution, for example \(\mathcal{N}(\mu, \sigma)\).

8.2.1 Confidence interval predictions

Predicting a confidence interval is possible using only features in mlr3. Several (but not all) implemented regression learners can make standard error predictions by setting predict_type = "se". For compatible learners this means the learner will predict both the ‘usual’ response predict type, as well as the standard error of the prediction, se. Manually combining these predict types leads to a confidence interval. An example is in Figure 8.1 where we assume a symmetric Normal 95% confidence interval.

t = tsk("mtcars")
split = partition(t)
l = lrn("regr.ranger", predict_type = "se")
l$train(t, split$train)
p = l$predict(t, split$test)
p
<PredictionRegr> for 11 observations:
    row_ids truth response        se
          4  21.4 20.91669 1.7147686
          8  24.4 23.27838 0.7933228
         21  21.5 23.73936 1.0114484
---                                 
         31  15.0 15.63936 0.5377532
         19  30.4 29.63470 1.3813498
         26  27.3 30.38791 1.4538484
autoplot(p, type = "confidence", quantile = 1.96)

x-label says "Response plus-minus 1.96se" and y-label says "Truth". Plot shows a black line running through the x=y axis. 7 blue points are plotted with error bars. Plot shows that 3 points are lying on the x=y line. Other points are not on the line and their error bars do not overlap with the line either.

Figure 8.1: Scatter plot of predictions (x) vs truth (y) where the straight line is \(x=y\). Points lying on \(x=y\) indicate good (mean) predictions; points with error bars overlapping \(x=y\) indicate the truth is within the 95% confidence interval.

This representation is more informative than a standard ‘point prediction’ as it includes confidence about the predictions. For example, the top prediction in Figure 8.1 might seem like a ‘good’ prediction as the mean response is close to the truth. However, the error bars suggest that we actually have low confidence in this prediction. On the other hand, the fourth prediction from the top is slightly further from the truth, but we are much more confident about it. Analogously to classification, we might think of the top prediction as “P(Y = 30) = 0.4” versus the fourth prediction as “P(Y = 23) = 0.7” (numbers not exact). This level of detail allows more nuanced metrics to be utilized to measure probabilistic predictions, which we will turn to in the next section.

8.2.2 Probability distribution predictions

In the example above we used the estimated standard error to calculate an approximate symmetric Normal confidence interval. Instead we could have plugged the predicted response and se into a probability distribution as an estimate of the distribution mean and standard deviation. By assuming a probability distribution, we can make our predictions even more detailed by not just considering the mean and standard error of the distribution, but also the probability density function, cumulative distribution function, and all other common properties and methods.

distr6 implements many probability distributions with R6 and provides a flexible interface that was originally designed to be used with mlr3proba. Below we demonstrate construction of a Uniform distribution in distr6 and how to access its methods and properties: we call the pdf (probability density function) and cdf (cumulative distribution function) methods and look at the most important properties with summary. Full details of the distr6 interface can be found on the package website1.

library(distr6)
response = 5
se = 2

# construct Uniform distribution
unif = dstr("Uniform",
  lower = response - se * sqrt(3),
  upper = response + se * sqrt(3))
# summary of distribution properties
summary(unif)
Uniform Probability Distribution. 
Parameterised with:

      Id Support    Value     Tags
1: lower       ℝ 1.535898 required
2: upper       ℝ 8.464102 required


Quick Statistics
    Mean:       5
    Variance:   4
    Skewness:   0
    Ex. Kurtosis:   -1.2

Support: [1.53589838486225,8.46410161513775]    Scientific Type: ℝ 

Traits:     continuous; univariate
Properties: symmetric; platykurtic; no skew
# evaluate cdf and pdf
rbind(cdf = unif$cdf(1:6), pdf = unif$pdf(1:6))
    [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
cdf    0 0.0669873 0.2113249 0.3556624 0.5000000 0.6443376
pdf    0 0.1443376 0.1443376 0.1443376 0.1443376 0.1443376
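
A Normal distribution can be constructed in exactly the same way; in this small sketch we simply plug the response and se from above into the mean and standard deviation parameters:

# construct Normal distribution from the same response and se
norm = dstr("Normal", mean = response, sd = se)
# evaluate cdf and pdf on the same grid as above
rbind(cdf = norm$cdf(1:6), pdf = norm$pdf(1:6))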

In the above example we manually transformed the response and se ‘predictions’ to be compatible with the construction of a Uniform distribution, but making these transformations manually can be time-consuming. mlr3proba includes a pipeline pipeline_probregr() that incorporates the key steps of distribution prediction for probabilistic regression, which are:

  1. Train and predict a learner to make response predictions
  2. Train and predict a learner to make se predictions
  3. Select a probability distribution to model these predictions
  4. Transform predictions to parameterize the chosen distribution
  5. Return prediction

This pipeline is a wrapper around PipeOpProbregr and is showcased in the code below, where we specify three inputs: 1) a random forest as the main learner, which predicts the response; 2) a featureless learner to estimate the standard deviation, which simply predicts the standard error as the sample standard deviation of the training data; and 3) a Normal distribution as the assumed form of the predicted distribution. We visualize the quality of predictions in Figure 8.2.

library(mlr3verse)
library(mlr3proba)
library(ggplot2)

# create our pipeline
l = as_learner(ppl("probregr",
  learner = lrn("regr.ranger"),
  learner_se = lrn("regr.featureless"),
  dist = "Normal")
)

# train and predict
task = tsk("mtcars")
split = partition(task)
l$train(task, split$train)
p = l$predict(task, split$test)
p
<PredictionRegr> for 11 observations:
    row_ids truth response       se                    distr
          4  21.4 19.19134 5.683925 <VectorDistribution[11]>
          9  22.8 22.49444 5.683925 <VectorDistribution[11]>
         27  26.0 25.11810 5.683925 <VectorDistribution[11]>
---                                                         
         29  15.8 17.75105 5.683925 <VectorDistribution[11]>
         19  30.4 28.32432 5.683925 <VectorDistribution[11]>
         20  33.9 28.61724 5.683925 <VectorDistribution[11]>
plot_probregr(p, 3, "point", "random") +
  ggtitle(sprintf("Logloss = %s", round(p$score(msr("regr.logloss")), 2)))


Figure 8.2: Three probability distribution predictions plotted as probability density functions. True values are represented as points. The plot shows fairly good predictions for the green and red observations, as the truth is close to the distribution mean/mode, but a less good (though not terrible) prediction for the blue observation. The logloss value is reasonably low but could be better (lower values indicate better predictions).

The $distr column in our PredictionRegr contains objects inheriting from distr6::Distribution, which allows querying of all standard probability distribution methods and properties, as we have seen in the code above. This example shows that although none of our predictions were perfect – the truth was not exactly in line with the peak of the predicted distributions – in each case the truth was at least within an area of reasonable confidence.

In future versions of mlr3proba we will extend functionality further by adding models that can make probabilistic predictions (e.g., with Bayesian methods), more losses (e.g., Brier score), and a dedicated predict type for interval predictions. We now turn to the next task implemented in mlr3proba, which follows closely from probabilistic regression.

8.3 Survival Analysis

Survival analysis is a field of statistics concerned with trying to predict/estimate the time until an event takes place. This predictive problem is unique as survival models are trained and tested on data that may include ‘censoring’, which occurs when the event of interest is not observed. Survival analysis can be hard to explain in the abstract, so as a working example consider a marathon runner in a race. Here the ‘survival problem’ is trying to predict the time at which the marathon runner finishes the race. However, if the event of interest is not observed, i.e., the marathon runner gives up and does not finish the race, then they are said to be censored. Instead of throwing away information about censoring events, survival analysis datasets include a status variable that provides information about the ‘status’ of an observation. So in our example we might write the runner’s outcome as \((4, 1)\) if they finish the race at 4 hours, or \((2, 0)\) if they give up at 2 hours.


The key to modelling in survival analysis is that we assume there exists a hypothetical time at which the marathon runner would have finished if they had not been censored; it is then the job of a survival learner to estimate what the true survival time would have been for a similar runner, assuming they are not censored (see Figure 8.3). Mathematically, this is represented by the hypothetical event time, \(Y\), the hypothetical censoring time, \(C\), the observed outcome time, \(T = \min(Y, C)\), the event indicator \(\Delta = (T = Y)\), and as usual some features, \(X\). Learners are trained on \((T, \Delta)\) but, critically, make predictions of \(Y\) from previously unseen features. This means that unlike classification and regression, learners are trained on two variables, \((T, \Delta)\), which, in R, is often captured in a survival::Surv object. An example of this is in the code below, where we randomly generate 6 survival times and 6 event indicators; an outcome marked with a + indicates censoring, otherwise it is the event of interest.

library(survival)
Surv(runif(6), rbinom(6, 1, 0.5))
[1] 0.7110534+ 0.5259401  0.3869013+ 0.7185668  0.8801807+ 0.1211602 

Readers familiar with survival analysis will recognize that the description above applies specifically to ‘right censoring’. Currently, this is the only form of censoring available in the mlr3 universe, hence restricting our discussion to that setting. For a good introduction to survival analysis see Collett (2014) or for machine learning in survival analysis specifically see R. Sonabend and Bender (2023).

For the remainder of this section we will look at how mlr3proba (R. Sonabend et al. 2021) extends the building blocks of mlr3 for survival analysis. We will begin by looking at objects used to construct machine learning tasks for survival analysis, then we will turn to the learners we have implemented to solve these tasks, before looking at measures for evaluating survival analysis predictions, and then finally we will consider how to transform prediction types.

Figure 8.3: Dead and censored subjects (y-axis) over time (x-axis). Black diamonds indicate true death times and white circles indicate censoring times. Vertical line is the study end time. Subjects 1 and 2 die in the study time. Subject 3 is censored in the study and (unknown) dies within the study time. Subject 4 is censored in the study and (unknown) dies after the study. Subject 5 dies after the end of the study. Figure and caption from R. E. B. Sonabend (2021).

8.3.1 TaskSurv

As we saw in the introduction to this section, survival algorithms require two targets for training, which means the new TaskSurv object expects two targets. The simplest way to create a survival task is to use as_task_surv, as in the following code chunk. Note how as_task_surv coerces the target columns into a survival::Surv object.

library("mlr3verse")
library("mlr3proba")
library("survival")

task = as_task_surv(survival::rats, time = "time",
  event = "status", type = "right", id = "rats")

task$head()
   time status litter rx sex
1:  101      0      1  1   f
2:   49      1      1  0   f
3:  104      0      1  0   f
4:   91      0      2  1   m
5:  104      0      2  0   m
6:  102      0      2  0   m
autoplot(task)

In the above example we used the survival::rats dataset, which looks at predicting if a drug treatment was successful in preventing 150 rats from developing tumors. Note that the dataset (by its own admission) is not perfect and should generally be treated as ‘dummy’ data, which is good for examples but not real-world analysis.

As well as creating your own tasks, you can load any of the tasks shipped with mlr3proba:

as.data.table(mlr_tasks)[task_type == "surv"]
            key                  label task_type nrow ncol properties lgl int
1:         actg               ACTG 320      surv 1151   13              0   3
2:         gbcs   German Breast Cancer      surv  686   10              0   4
3:        grace             GRACE 1000      surv 1000    8              0   2
4:         lung            Lung Cancer      surv  228   10              0   7
5:         rats                   Rats      surv  300    5              0   2
6: unemployment  Unemployment Duration      surv 3343    6              0   1
7:         whas Worcester Heart Attack      surv  481   11              0   4
5 variables not shown: [dbl, chr, fct, ord, pxc]

8.3.2 LearnerSurv, PredictionSurv and predict types

The interface for LearnerSurv and PredictionSurv objects is identical to the regression and classification settings discussed in Chapter 2. Similarly to these settings, survival learners are constructed with "lrn"; available learners are listed in Appendix D.

The primary difference in constructing a survival learner (and the resulting predictions) is the number of possible predict_type options that can be passed to lrn.

Similarly to deterministic regression (Section 2.1), a survival model can make a single point prediction, and similarly to probabilistic regression (Section 8.2), a survival model could also make a prediction of a probability distribution. mlr3proba has a different predict interface to mlr3 as all possible types of prediction (‘predict types’) are returned for all survival models. The reason for this decision is that all these predict types can be transformed to one another and it is therefore computationally simpler to return all at once instead of rerunning models to change predict type. In survival analysis, the following predictions can be made:

  • response - Predicted survival time.
  • distr - Predicted survival distribution, either discrete or continuous.
  • lp - Linear predictor calculated as the fitted coefficients multiplied by the test data.
  • crank - Continuous risk ranking.

We will go through each of these prediction types in more detail and with examples to make them less abstract. We will use the following setup for most of the examples. In this chunk we partition the survival::rats task, train a Cox Proportional Hazards model (mlr_learners_surv.coxph) on the training set, and make predictions for the test set. For this model, all predict types except response can be computed.

t = tsk("rats")
split = partition(t)
p = lrn("surv.coxph")$train(t, split$train)$predict(t, split$test)
p
<PredictionSurv> for 99 observations:
    row_ids time status      crank         lp     distr
          4   91  FALSE -2.4935516 -2.4935516 <list[1]>
          5  104  FALSE -2.8118294 -2.8118294 <list[1]>
          6  102  FALSE -2.8118294 -2.8118294 <list[1]>
---                                                    
        253   92   TRUE  0.4733165  0.4733165 <list[1]>
        255  102   TRUE  0.1550388  0.1550388 <list[1]>
        277   89   TRUE  0.5081450  0.5081450 <list[1]>

8.3.2.1 predict_type = "response"

Counterintuitively for many, the response prediction of a survival time is actually the least common predict type in survival analysis. The likely reason for this is the presence of censoring: we rarely observe the true survival time for many observations, and therefore it is unlikely any survival model can confidently make predictions of survival times. This is illustrated in the code below.

In the example below we train and predict from a survival support vector machine (mlr_learners_surv.svm), note we use type = "regression" to select the algorithm that optimizes survival time predictions. We then compare the predictions from the model to the true data.

library(mlr3extralearners)
pred = lrn("surv.svm", type = "regression", gamma.mu = 1e-3)$
  train(t, split$train)$predict(t, split$test)
data.frame(pred = pred$response[1:3], truth = pred$truth[1:3])
      pred truth
1 89.34273   91+
2 89.34125  104+
3 89.34125  102+

As can be seen from the output, our predictions are all less than the true observed times, which means we know our model definitely underestimated the truth. However, because each of the true values is a censored time, we have absolutely no way of knowing if these predictions are slightly bad or absolutely terrible; i.e., the true survival times could be \(92, 105, 103\) or they could be \(300, 1000, 200\). Hence, with no realistic way to evaluate these models, survival time predictions are rarely useful.

8.3.2.2 predict_type = "distr"

So unlike regression, in which deterministic/point predictions are most common, in survival analysis distribution predictions are much more common. You will therefore find that the majority of survival models in mlr3proba make distribution predictions by default. As with probabilistic regression, we implement these predictions using the distr6 interface, which allows visualization and evaluation of survival curves (defined as 1 - cumulative distribution function). In the example below we train a Cox PH model on the rats dataset, plot the survival function of the first three predictions, and evaluate this function for all three predictions at \(t = \{40, 70, 120\}\).

t = tsk("rats")
split = partition(t)
p = lrn("surv.coxph")$train(t, split$train)$predict(t, split$test)
plot(p$distr[1:3], fun = "survival")

p$distr[1:3]$survival(c(40, 70, 120))
         [,1]      [,2]      [,3]
40  0.9860413 0.9856659 0.9924366
70  0.9416197 0.9400865 0.9680333
120 0.7304510 0.7242618 0.8439682

8.3.2.3 predict_type = "lp"

lp, often written as \(\eta\) in academic writing, is computationally the simplest prediction and has a natural analogue in regression modelling. Readers familiar with linear regression will know that when fitting a simple linear regression model, \(Y = X\beta\), we are actually estimating the values for \(\beta\), and the estimated linear predictor (lp) is then \(X\hat{\beta}\), where \(\hat{\beta}\) are our estimated coefficients. In simple survival models, the linear predictor is the same quantity (but estimated in a slightly more complicated way). The learner implementations in mlr3proba are primarily machine-learning focused and few of these models have a simple linear form, which means that lp cannot be computed for most of these.
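
For the Cox model used above, the linear predictor also populates the ranking prediction, as can be seen in the earlier prediction output where the crank and lp columns are identical. A quick, informal check on the prediction object from earlier:

# for surv.coxph the lp and crank predictions coincide
all.equal(unname(p$lp), unname(p$crank))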

8.3.2.4 predict_type = "crank"

The final prediction type, crank, is the most common in survival analysis and perhaps also the most confusing. Academic texts will often refer to ‘risk’ predictions in survival analysis (hence why survival models are often known as ‘risk prediction models’), without defining what ‘risk’ means. Often risk is defined as \(exp(\eta)\) as this is a common quantity found in simple linear survival models. However, sometimes risk is defined as \(exp(-\eta)\), and sometimes it can be an arbitrary quantity that does not have a meaningful interpretation. To prevent this confusion in mlr3proba, we define the predict type crank, which stands for continuous ranking. This is best explained by example.

Continuing from the previous example, we output the first three crank predictions. The output tells us that the second rat is at the highest risk of death (larger values represent higher risk) and the third rat is at the lowest risk of death. The distance between predictions also tells us that the difference in risk between the first and second rat is smaller than the difference between the second and third. The key points in this interpretation are: 1) lower values represent lower risk; 2) the difference between values has meaning but should not be over-interpreted; and 3) the actual values are meaningless. The last point is the most important: comparing these values between samples (or papers or experiments) is not meaningful.

p$crank[1:3]
           1            2            3 
 0.004592434  0.031323182 -0.611415996 

The crank prediction type is informative and common in practice because it allows identifying observations at lower/higher risk relative to each other, which is useful for resource allocation and prioritization (e.g., which patient should be given an expensive treatment) and clinical trials (e.g., are people in the treatment arm at lower risk of disease X than people in the control arm?).

8.3.3 MeasureSurv

Survival models in mlr3proba are evaluated with MeasureSurv objects, which are constructed in the usual way with "msr"; measures currently implemented are listed in Appendix D.

In general survival measures can be grouped into the following:

  1. Discrimination measures – Quantify if a model correctly identifies if one observation is at higher risk than another. Evaluate crank and/or lp predictions.
  2. Calibration measures – Quantify if the average prediction is close to the truth (all definitions of calibration are unfortunately vague in a survival context). Evaluate crank and/or distr predictions.
  3. Scoring rules – Quantify if probabilistic predictions are close to true values. Evaluates distr predictions.
head(as.data.table(mlr_measures)[
  task_type == "surv", c("key", "predict_type")])
                  key predict_type
1:         surv.brier        distr
2:   surv.calib_alpha        distr
3:    surv.calib_beta           lp
4: surv.chambless_auc           lp
5:        surv.cindex        crank
6:        surv.dcalib        distr

There is a lot of debate in the literature around the ‘best’ survival measures to use to evaluate models. As a general rule, we recommend RCLL (mlr_measures_surv.rcll) to evaluate the quality of distr predictions, the concordance index (mlr_measures_surv.cindex) to evaluate a model’s discrimination, and D-Calibration (mlr_measures_surv.dcalib) to evaluate a model’s calibration.
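
As a sketch of how this looks in practice, we can score the Cox predictions from earlier with these three recommended measures (keys as named in the text above):

# evaluate distr, crank, and distr predictions respectively
p$score(msrs(c("surv.rcll", "surv.cindex", "surv.dcalib")))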

8.3.4 Composition

Throughout mlr3proba documentation we refer to “native” and “composed” predictions. We define a ‘native’ prediction as the prediction made by a model without any post-processing, whereas a ‘composed’ prediction is one that is returned after post-processing.

8.3.4.1 Internal composition

In mlr3proba we make use of composition internally to return a "crank" prediction for every learner. This is to ensure that we can meaningfully benchmark all models according to at least one criterion. The package uses the following rules to create "crank" predictions:

  1. If a model returns a ‘risk’ prediction then crank = risk (we may multiply this by \(-1\) to ensure the ‘low value low risk’ interpretation).
  2. Else if a model returns a response prediction then we set crank = -response.
  3. Else if a model returns a lp prediction then we set crank = lp (or crank = -lp if needed).
  4. Else if a model returns a distr prediction then we set crank as the sum of the cumulative hazard function (see R. Sonabend, Bender, and Vollmer (2022) for full discussion as to why we picked this method).

8.3.4.2 Explicit composition and pipelines

At the start of this section we mentioned that it is possible to transform between prediction types. In mlr3proba this is possible with ‘compositor’ pipelines (Chapter 6). There are a number of pipelines implemented in the package but two in particular focus on predict type transformation:

  1. pipeline_crankcompositor() - Transforms a "distr" prediction to "crank"; and
  2. pipeline_distrcompositor() - Transforms a "lp" prediction to "distr".

In practice, the second pipeline is more common as we internally use a version of the first pipeline whenever we return predictions from survival models. You may want to use the first pipeline to overwrite the default method of transforming distributions to rankings. For now we will just look at the second pipeline.

In the example below we load the rats dataset, remove factor columns, and then partition the data into training and testing. We construct the distrcompositor pipeline around a survival GLMnet learner (mlr_learners_surv.glmnet) which by default can only make predictions for "lp" and "crank". In the pipeline we specify that we will estimate the baseline distribution with a Kaplan-Meier estimator (estimator = "kaplan") and that we want to assume a proportional hazards form for our estimated distribution. We then train and predict in the usual way and in our output we can now see a distr prediction.

library(mlr3verse)
library(mlr3extralearners)

task = tsk("rats")
task$select(c("litter", "rx"))
split = partition(task)

learner = lrn("surv.glmnet")

# no distr output
learner$train(task, split$train)$predict(task, split$test)
<PredictionSurv> for 99 observations:
    row_ids time status    crank.1       lp.1
          1  101  FALSE 0.52464562 0.52464562
          4   91  FALSE 0.53174836 0.53174836
          6  102  FALSE 0.01420547 0.01420547
---                                          
        231   78   TRUE 0.54691061 0.54691061
        235   80   TRUE 1.07865896 1.07865896
        249   66   TRUE 0.58952702 0.58952702
pipe = as_learner(ppl(
  "distrcompositor",
  learner = learner,
  estimator = "kaplan",
  form = "ph"
))

# now with distr
pipe$train(task, split$train)$predict(task, split$test)
<PredictionSurv> for 99 observations:
    row_ids time status    crank.1       lp.1     distr
          1  101  FALSE 0.52464562 0.52464562 <list[1]>
          4   91  FALSE 0.53174836 0.53174836 <list[1]>
          6  102  FALSE 0.01420547 0.01420547 <list[1]>
---                                                    
        231   78   TRUE 0.54691061 0.54691061 <list[1]>
        235   80   TRUE 1.07865896 1.07865896 <list[1]>
        249   66   TRUE 0.58952702 0.58952702 <list[1]>

Mathematically, we have done the following:

  1. Assume our estimated distribution will have the form \(h(t) = h_0(t)\exp(\eta)\), where \(h\) is the hazard function and \(h_0\) is the baseline hazard function.
  2. Estimate the prediction \(\hat{\eta}\) using GLMnet.
  3. Estimate \(\hat{h}_0(t)\) with the Kaplan-Meier estimator.
  4. Put this all together as \(\hat{h}(t) = \hat{h}_0(t)\exp(\hat{\eta})\).
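
To sanity-check the composition, we could store the pipeline prediction and query the composed distribution directly; as a small sketch, we evaluate the predicted survival probability at \(t = 90\) for the first three rats in the test set:

# store the composed prediction and evaluate survival probabilities at t = 90
pred = pipe$train(task, split$train)$predict(task, split$test)
pred$distr[1:3]$survival(90)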

For more details about prediction types and compositions we recommend Kalbfleisch and Prentice (2011).

8.3.5 Putting it all together

Finally, we will put all of the above into practice in a small benchmark experiment. We first load the grace task and subset it to the first 500 rows. We then select the RCLL, D-Calibration, and C-index measures to evaluate predictions, set up the same pipeline we used in the previous example, and load a Cox PH model and a Kaplan-Meier estimator. We run our experiment with 3-fold cross-validation and aggregate the results.

library(mlr3verse)
library(mlr3extralearners)

task = tsk("grace")$filter(1:500)
msr_txt = paste0("surv.", c("cindex", "dcalib", "rcll"))
measures = msrs(msr_txt)

pipe = as_learner(ppl(
  "distrcompositor",
  learner = lrn("surv.glmnet"),
  estimator = "kaplan",
  form = "ph"
))
pipe$id = "Coxnet"
learners = c(lrns(c("surv.coxph", "surv.kaplan")), pipe)

bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measures)[, c("learner_id", ..msr_txt)]
    learner_id surv.cindex surv.dcalib surv.rcll
1:  surv.coxph   0.8237529    3.788224  3.114973
2: surv.kaplan   0.5000000    4.165848  3.263463
3:      Coxnet   0.8258842    9.480867  3.131287

In this small experiment, Coxnet and Cox PH have the best discrimination, Cox PH has the best calibration, and Coxnet and Cox PH have similar overall predictive accuracy.

8.4 Density Estimation

Density estimation is the task of estimating the unknown distribution from which a univariate dataset is generated, or, put more simply, of estimating the probability density (or mass) function of a variable. Density estimation is implemented in mlr3proba alongside the previous two tasks, as all three involve predicting or estimating probability distributions (hence the name “mlr3probabilistic”). Unconditional density estimation, i.e., estimation of a target without any covariates, is viewed as an unsupervised task, which means the ‘truth’ is never known. For a good overview of density estimation see Silverman (1986).

The package mlr3proba extends mlr3 with the following objects for density estimation:

  • TaskDens - density estimation tasks
  • LearnerDens - density estimation learners
  • PredictionDens - density estimation predictions
  • MeasureDens - measures to evaluate density estimates

We will consider each in turn.

8.4.1 TaskDens

As density estimation is an unsupervised task, there is no target for prediction. In the code below we construct a density task using as_task_dens which takes one argument, a data.frame type object with exactly one column (which we will use to estimate the underlying distribution).

task = as_task_dens(data.frame(x = rnorm(1000)))
task
<TaskDens:data.frame(x = rnorm(1000))> (1000 x 1)
* Target: -
* Properties: -
* Features (1):
  - dbl (1): x

As with other tasks, we have included a couple of tasks that come shipped with mlr3proba:

as.data.table(mlr_tasks)[task_type == "dens", c(1:2, 4:5)]
        key                  label nrow ncol
1: faithful Old Faithful Eruptions  272    1
2:   precip   Annual Precipitation   70    1

8.4.2 LearnerDens and PredictionDens

Density learners have three possible prediction types:

  1. distr - probability distribution
  2. pdf - probability density function
  3. cdf - cumulative distribution function

All learners will return a distr and pdf prediction but only some can make cdf predictions. Similarly to survival analysis and probabilistic regression, the distr predict type is implemented using distr6.

learn = lrn("dens.hist")
p = learn$train(task, 1:900)$predict(task, 901:1000)
x = seq.int(-2, 2, 0.01)
ggplot(data.frame(x = x, y = p$distr$pdf(x)), aes(x = x, y = y)) +
  geom_line() + theme_minimal()

The pdf and cdf predict types are simply wrappers around distr$pdf and distr$cdf respectively, which is best demonstrated by example:

learn = lrn("dens.hist")
p = learn$train(task, 1:10)$predict(task, 11:13)
p
<PredictionDens> for 3 observations:
 row_ids pdf       cdf              distr
      11 0.6 0.5036701 <Distribution[39]>
      12 0.6 0.7035210 <Distribution[39]>
      13 0.4 0.9134401 <Distribution[39]>
cbind(p$distr$cdf(task$data()$x[11:13]), p$cdf[1:3])
          [,1]      [,2]
[1,] 0.5036701 0.5036701
[2,] 0.7035210 0.7035210
[3,] 0.9134401 0.9134401

The reason for returning pdf and cdf in this way is to enable measures that can be used to evaluate the quality of our estimations, which we will return to in the next section.

8.4.3 MeasureDens

Currently the only measure implemented in mlr3proba for density estimation is logloss, which is defined in the same way as in classification and probabilistic regression: \(L(y) = -\log(\hat{f}_Y(y))\), where \(\hat{f}_Y\) is our estimated probability density function. Putting this together with the above, we are now ready to train a density learner, estimate a distribution, and evaluate our estimation:

meas = msr("dens.logloss")
meas
<MeasureDensLogloss:dens.logloss>: Log Loss
* Packages: mlr3, mlr3proba
* Range: [0, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: eps=1e-15
* Properties: -
* Predict type: pdf
p$score(meas)
dens.logloss 
   0.6459807 
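
As a quick sanity check, the same value can be computed by hand as the mean negative log of the pdf column (ignoring the eps parameter, which only guards against zero densities):

# manual computation of the density logloss
mean(-log(p$pdf))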

8.4.4 Unconditional density estimation vs. probabilistic regression

Some readers may spot parallels between density estimation and probabilistic regression (Section 8.2). In fact, density estimation is just a special case of probabilistic regression, occurring when we make an unconditional prediction. For example, given annual rainfall measurements across US cities, estimating the distribution of rainfall would be unsupervised density estimation. If we now wanted to estimate the distribution of rainfall next year given rainfall this year, then we would be back to supervised probabilistic regression. This is illustrated in the code below.

We load the precip task and train, predict, and score a Normal kernel density estimation learner (mlr_learners_dens.kde). We then create a new dataset based on precip but with an added variable for rainfall next year. Now we can specify a target for prediction based on another covariate, thus making this a supervised learning task. We then use the probregr pipeline (Section 8.2) and train, predict, and score a random forest learner (mlr_learners_regr.ranger). Note the key difference in these outputs: whilst both make distribution predictions, the former (density) estimates a single probability distribution for the variable of interest, whereas the latter (regression) predicts a distribution for each observation.

task = tsk("precip")
split = partition(task)
learn = lrn("dens.kde", kernel = "Norm")
p = learn$train(task, split$train)$predict(task, split$test)
p
<PredictionDens> for 23 observations:
    row_ids         pdf              distr
          3 0.008201882 <Distribution[39]>
          4 0.017647600 <Distribution[39]>
          6 0.012930403 <Distribution[39]>
---                                       
         57 0.022636656 <Distribution[39]>
         60 0.018232811 <Distribution[39]>
         69 0.012844688 <Distribution[39]>
p$score(msr("dens.logloss"))
dens.logloss 
    3.868815 
df = task$data()
df$next_year = (df$precip * 1.5) + rnorm(length(df$precip))
task = as_task_regr(df, target = "next_year")
pipe = as_learner(ppl("probregr", lrn("regr.ranger"), dist = "Normal"))
p = pipe$train(task, split$train)$predict(task, split$test)
p
<PredictionRegr> for 23 observations:
    row_ids     truth response        se                    distr
          3  9.855528 13.32877 1.9314220 <VectorDistribution[23]>
          4 73.348915 71.81530 1.5944345 <VectorDistribution[23]>
          6 24.690852 25.88878 1.0082043 <VectorDistribution[23]>
---                                                              
         57 69.050214 68.72069 0.8467608 <VectorDistribution[23]>
         60 70.349208 71.81530 1.5944345 <VectorDistribution[23]>
         69 22.141849 23.37564 1.0210862 <VectorDistribution[23]>
p$score(msr("regr.logloss"))
regr.logloss 
    2.408236 

8.4.5 Putting it all together

Finally, we conduct a small benchmark study on the mlr_tasks_faithful task using some of the integrated density learners:

library(mlr3extralearners)
task = tsk("faithful")
learners = lrns(c("dens.hist", "dens.pen", "dens.kde"))
measure = msr("dens.logloss")
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measure)
autoplot(bmr, measure = measure)

The results of this experiment show that the sophisticated Penalized Density Estimator does not outperform the baseline histogram, but the Kernel Density Estimator does achieve consistently better (i.e., lower logloss) results.

8.5 Cluster Analysis

Cluster analysis is another unsupervised task implemented in mlr3. The objective of cluster analysis is to group data into clusters, where each cluster contains similar observations. The similarity is based on specified metrics that are task and application-dependent. Unlike classification where we try to predict a class for each observation, in cluster analysis there is no ‘true’ label or class to predict.


The package mlr3cluster extends mlr3 with the following objects for cluster analysis:

  • TaskClust - clustering tasks
  • LearnerClust - clustering learners
  • PredictionClust - cluster assignment predictions
  • MeasureClust - measures to evaluate cluster assignments

We will consider each in turn.

8.5.1 TaskClust

Similarly to density estimation (Section 8.4), there is no target for prediction and so no truth field in TaskClust. Let’s look at the cluster::ruspini dataset often used for cluster analysis examples.

library(cluster)
head(ruspini)
   x  y
1  4 53
2  5 63
3 10 59
4  9 77
5 13 49
6 13 69

The dataset has 75 rows and two columns and was first introduced in Ruspini (1970) to illustrate different clustering techniques. As we will see from the plots below, the observations form four natural clusters.

In the code below we construct a cluster task using as_task_clust which only takes one argument, a data.frame type object.

library(mlr3verse)
library(cluster)
task = as_task_clust(ruspini)
task
<TaskClust:ruspini> (75 x 2)
* Target: -
* Properties: -
* Features (2):
  - int (2): x, y
autoplot(task)

Technically, we did not need to create a new task for the ruspini dataset since one is already included in the package. Currently we have two clustering tasks shipped with mlr3cluster:

as.data.table(mlr_tasks)[task_type == "clust", c(1:2, 4:5)]
         key      label nrow ncol
1:   ruspini    Ruspini   75    2
2: usarrests US Arrests   50    4

8.5.2 LearnerClust and PredictionClust

As with density estimation, we refer to training and predicting for clustering to be consistent with the mlr3 interface, but strictly speaking this should be clustering and assigning (the latter we will return to shortly). Two predict_types are available for clustering learners:

  1. partition – estimate of which cluster an observation falls into
  2. prob – probability of an observation belonging to each cluster

Hence, similarly to classification, prediction types of clustering learners are either deterministic (partition) or probabilistic (prob).

Below we construct a C-Means clustering learner with prob prediction type and 3 clusters, train it on the cluster::ruspini dataset and then return the cluster assignments ($assignments) for each observation.

learner = lrn("clust.cmeans", predict_type = "prob", centers = 3)
learner
<LearnerClustCMeans:clust.cmeans>: Fuzzy C-Means Clustering Learner
* Model: -
* Parameters: centers=3
* Packages: mlr3, mlr3cluster, e1071
* Predict Types:  partition, [prob]
* Feature Types: logical, integer, numeric
* Properties: complete, fuzzy, partitional
learner$train(task)
learner$assignments[1:6]
[1] 3 3 3 3 3 3

As clustering is unsupervised, it often does not make sense to use predict for new data; however, this is still possible using the mlr3 interface:

# predicting on new data (rare use case)
learner$train(task, 1:30)$predict(task, 31:32)
<PredictionClust> for 2 observations:
 row_ids partition    prob.1     prob.2     prob.3
      31         1 0.9663368 0.01814576 0.01551747
      32         1 0.9782005 0.01177975 0.01001979
# using same data for estimation (common use case)
prediction = learner$train(task)$predict(task)
autoplot(prediction, task)

Whilst two prediction types are possible, there are some learners for which ‘prediction’ can never make sense, for example in hierarchical clustering. In hierarchical clustering, the goal is to build a hierarchy of nested clusters by either splitting large clusters into smaller ones or merging smaller clusters into bigger ones. The final result is a tree or dendrogram, which can change if a new data point is added. For consistency, mlr3cluster offers a predict method for hierarchical clustering, but with a warning:

learner = lrn("clust.hclust")
learner$train(task)
learner$predict(task)
Warning: Learner 'clust.hclust' doesn't predict on new data and predictions may
not make sense on new data
<PredictionClust> for 75 observations:
    row_ids partition
          1         1
          2         1
          3         1
---                  
         73         1
         74         1
         75         1
autoplot(learner) + theme(axis.text = element_text(size = 3.5))

In this case, the predict method simply cuts the dendrogram into the number of clusters specified by the k parameter of the learner.
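
The number of clusters used when cutting the dendrogram can be controlled when constructing the learner; a small sketch (assuming k is the relevant parameter, as named above):

# request four clusters when the dendrogram is cut at predict time
hclust_k4 = lrn("clust.hclust", k = 4)
hclust_k4$train(task)$predict(task)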

8.5.3 MeasureClust

As previously discussed, unsupervised tasks do not have ground truth data to compare to in model evaluation. However, we can still measure the quality of cluster assignments by quantifying how closely objects within the same cluster are related (cluster cohesion) as well as how distinct different clusters are from each other (cluster separation). A few built-in evaluation metrics are available to assess the quality of clustering; see Appendix D.

Two common measures are the within-cluster sum of squares (WSS) measure, mlr_measures_clust.wss, and the silhouette coefficient, mlr_measures_clust.silhouette. WSS calculates the sum of squared differences between observations and centroids, which is a quantification of cluster cohesion (smaller values indicate more compact clusters). The silhouette coefficient quantifies how well each point belongs to its assigned cluster versus neighboring clusters, where scores closer to 1 indicate observations that are well clustered and scores closer to -1 indicate observations that are poorly clustered. Note that the silhouette measure in mlr3cluster returns the mean silhouette score across all observations, and when there is only a single cluster the measure simply outputs 0.

Putting this together with the above we can now score our cluster estimation (note we must pass the task to msr):

meas = msrs(c("clust.wss", "clust.silhouette"))

prediction$score(meas, task = task)
       clust.wss clust.silhouette 
    5.106348e+04     6.327047e-01 

The very high WSS and middling mean silhouette coefficient indicate that our clusters could do with a bit more work. Often reducing an unsupervised task to a quantitative measure may not be useful (given no ground truth) and instead visualization may be a more effective tool for assessing the quality of the clusters.

8.5.3.1 Measure Limitations

It is tempting to rely on clustering measures alone to assess the quality of a clustering. However, this should be done with some care; for example, consider cluster analysis on the following dataset.

spirals = mlbench::mlbench.spirals(n = 300, sd = 0.01)
task = as_task_clust(as.data.frame(spirals$x))
autoplot(task)

Now we fit our two clustering learners.

learners = list(
  lrn("clust.kmeans"),
  lrn("clust.dbscan", eps = 0.1)
)
measures = list(msr("clust.silhouette"))
bmr = benchmark(benchmark_grid(task, learners, rsmp("insample")))
bmr$aggregate(measures)[, c(4, 7)]
     learner_id clust.silhouette
1: clust.kmeans       0.37258038
2: clust.dbscan       0.02915828

We can see that K-means clustering gives us a higher average silhouette score and so we might assume that the K-means learner with two centroids is a better choice than the DBSCAN method. However, now take a look at the cluster assignment plots.

prediction_kmeans = bmr$resample_results$resample_result[[1]]$prediction()
prediction_dbscan = bmr$resample_results$resample_result[[2]]$prediction()
autoplot(prediction_kmeans, task)

autoplot(prediction_dbscan, task)

The two learners arrived at different results – the K-means algorithm assigned points that are part of the same line into two different clusters whereas DBSCAN assigned each line to its own cluster. Which one of these approaches is correct? The answer is it depends on your specific task and the goal of cluster analysis. If we had only relied on the silhouette score, then the details of how exactly the clustering was done would have been masked and we would not be able to decide which method was appropriate for the task.

8.5.4 Visualization

As with other tasks, we use mlr3viz to visualize performance of cluster learners. The two most important plots are principal components analysis (PCA) and silhouette plots.

PCA is a commonly used dimension reduction method in ML to reduce the number of variables in a dataset or to visualize the most important ‘components’, which are linear transformations of the dataset features. Components are considered more important if they have higher variance (and therefore more predictive power). In the context of clustering, by plotting observations against the first two components, and then coloring them by cluster, we could visualize our high dimensional dataset and we would expect to see observations in distinct groups.

Since our running example only has two features, using PCA to visualize the data does not make sense. So we will use a task based on the USArrests dataset instead. By plotting the result of PCA, we see that our model has separated observations into two clusters along the first two principal components.

task = mlr_tasks$get("usarrests")
learner = lrn("clust.kmeans")
prediction = learner$train(task)$predict(task)
autoplot(prediction, task, type = "pca")

Silhouette plots visually assess the quality of the estimated clusters by visualizing whether observations in a cluster are well placed both individually and as a group. The plots include a dotted line, which marks the average silhouette coefficient across all data points, and each data point’s silhouette value is represented by a bar colored by its cluster. In our particular case, the average silhouette index is 0.59. If the average silhouette value for a given cluster is below the average silhouette coefficient line then this implies that the cluster is not well defined.

Continuing with our new example, we find that a lot of observations are actually below the average line and close to zero, and therefore the quality of our cluster assignments is not very good, meaning that many observations are likely assigned to the wrong cluster.

autoplot(prediction, task, type = "sil")

8.5.5 Putting it all together

Finally, we conduct a small benchmark study using the ruspini data and with a few integrated cluster learners:

task = tsk("ruspini")
learners = list(
  lrn("clust.featureless"),
  lrn("clust.kmeans", centers = 4L),
  lrn("clust.cmeans", centers = 3L)
)
measures = list(msr("clust.wss"), msr("clust.silhouette"))
bmr = benchmark(benchmark_grid(task, learners, rsmp("insample")))
bmr$aggregate(measures)[, c(4, 7, 8)]
          learner_id clust.wss clust.silhouette
1: clust.featureless 244373.87        0.0000000
2:      clust.kmeans  49512.16        0.5056154
3:      clust.cmeans  51063.48        0.6327047

The experiment shows that the K-means algorithm with four centers achieves the best cluster cohesion (lowest within-cluster sum of squares), while the fuzzy C-means algorithm with three centers achieves the best average silhouette score.

8.6 Spatial Analysis

The final task we will discuss in this book is spatial analysis. Spatial analysis can be a subset of any other machine learning task (e.g., regression or classification) and is defined by the presence of spatial information in a dataset, usually stored as coordinates that are often named “x” and “y”, or “lat” and “lon” (for ‘latitude’ and ‘longitude’ respectively).

Spatial Analysis

Spatial analysis is its own task as spatial data must be handled carefully due to the complexity of ‘autocorrelation’. Where correlation is a statistical association between two variables, autocorrelation is a statistical association within one variable. In machine learning terms, in a dataset with features and observations, correlation occurs when two or more features are statistically associated in some way, whereas autocorrelation occurs when two or more observations are statistically associated across one feature. Autocorrelation therefore violates one of the fundamental assumptions of ML, that all observations in a dataset are independent; this reduces our confidence in the quality of a trained ML model, which means we cannot trust its performance estimates (Hastie, Friedman, and Tibshirani 2001).

Autocorrelation

Autocorrelation is present in spatial data as there is implicit information encoded in coordinates, such as whether two observations (e.g., cities, countries, continents) are close together or far apart. For example, say we are predicting the number of cases of a disease two months after an outbreak in Germany. Outbreaks radiate outwards from an epicenter, therefore countries closer to Germany will have higher numbers of cases and countries further away will have lower numbers (Figure 8.4, bottom); thus there is autocorrelation in the ‘country’ feature.

Image shows two separate maps of Europe. The top map has a random distribution of colors from yellow to red. The bottom map shows the darkest color (dark orange) at Germany with increasing lightness as countries are increasingly further away.

Figure 8.4: Heatmaps where darker countries indicate a higher number of cases and lighter countries indicate a lower number of cases of imaginary Disease X with epicenter in Germany. The top map imagines a world in which there is no spatial autocorrelation and the number of cases of the disease is randomly distributed. The bottom map shows a more accurate world in which the number of cases radiates outwards from the epicenter (Germany).

Unlike the other tasks we have looked at in this chapter, there is no underlying difference to the implemented learners or measures; instead we provide new resampling methods in mlr3spatiotempcv to account for any overconfidence in predictions that might occur from autocorrelation. To make use of these resamplings, we have also implemented special tasks in mlr3spatial to hold spatial data.

Throughout this section we will use the mlr3spatiotempcv::ecuador dataset and task as a working example.
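Before turning to the infrastructure, it can be instructive to quantify the autocorrelation described above. The following is only a sketch using Moran’s I from the spdep package (an assumed extra dependency, not part of the mlr3 universe), applied to the numeric dem (elevation) feature of the ecuador data purely for illustration; a value close to zero would indicate no spatial autocorrelation.

# illustrative only: quantify spatial autocorrelation with Moran's I (spdep)
library(spdep)
dat = mlr3spatiotempcv::ecuador
coords = as.matrix(dat[, c("x", "y")])
nb = knn2nb(knearneigh(coords, k = 5))  # neighbourhood of 5 nearest points
lw = nb2listw(nb, style = "W")          # row-standardized spatial weights
moran.test(dat$dem, lw)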

8.6.1 TaskClassifST and TaskRegrST

To make use of spatial resampling methods, we have implemented two extensions of TaskClassif and TaskRegr to accommodate spatial data, TaskClassifST and TaskRegrST respectively. Below we only show classification examples; the regression case follows analogously (a brief sketch is given after the task output below).

library(mlr3spatial)
library(mlr3spatiotempcv)

# create task from `data.frame`
task = as_task_classif_st(ecuador, id = "ecuador_task",
  target = "slides", positive = "TRUE",
  coordinate_names = c("x", "y"), crs = 32717)

# or create task from 'sf' object
data_sf = sf::st_as_sf(ecuador, coords = c("x", "y"), crs = 32717)
task = as_task_classif_st(data_sf, target = "slides", positive = "TRUE")
task
<TaskClassifST:data_sf> (751 x 11)
* Target: slides
* Properties: twoclass
* Features (10):
  - dbl (10): carea, cslope, dem, distdeforest, distroad,
    distslidespast, hcurv, log.carea, slope, vcurv
* Coordinates:
            X       Y
  1: 712882.5 9560002
  2: 715232.5 9559582
  3: 715392.5 9560172
  4: 715042.5 9559312
  5: 715382.5 9560142
 ---                 
747: 714472.5 9558482
748: 713142.5 9560992
749: 713322.5 9560562
750: 715392.5 9557932
751: 713802.5 9560862
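As a brief sketch of the analogous regression case, the following builds a TaskRegrST from the same data; here we (artificially) use the numeric dem feature as the target, purely to demonstrate the constructor, which mirrors the classification one above.

# illustrative regression variant: 'dem' is not a meaningful target,
# it is only used here to demonstrate the constructor
task_regr = as_task_regr_st(ecuador, id = "ecuador_regr",
  target = "dem", coordinate_names = c("x", "y"), crs = 32717)
task_regr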

Once a task is created, you can train and predict as normal.

lrn("classif.rpart")$train(task)$predict(task)
<PredictionClassif> for 751 observations:
    row_ids truth response
          1  TRUE     TRUE
          2  TRUE     TRUE
          3  TRUE     TRUE
---                       
        749 FALSE    FALSE
        750 FALSE    FALSE
        751 FALSE     TRUE

However, as discussed above, it is best to use the specialized resampling methods in order to obtain trustworthy estimates of model performance.

8.6.2 Spatiotemporal Cross-Validation

Before we look at the spatial resampling methods implemented in mlr3spatiotempcv, we will first show what can go wrong if standard resampling methods are used. Below we benchmark a decision tree on the ecuador task using two different repeated cross-validation resampling methods: the first (“NSpCV”) is a standard resampling method from mlr3, the second (“SpCV”) is from mlr3spatiotempcv and is optimized for spatial data. The example highlights how “NSpCV” makes the decision tree appear to perform better than it actually does, reporting a substantially higher estimated performance; this estimate is overconfident due to the autocorrelation in the data.

learner = lrn("classif.rpart", predict_type = "prob")
rsmp_nsp = rsmp("repeated_cv", folds = 3, repeats = 2, id = "NSpCV")
rsmp_sp = rsmp("repeated_spcv_coords", folds = 3, repeats = 2, id = "SpCV")

design = benchmark_grid(task, learner, c(rsmp_nsp, rsmp_sp))
bmr = benchmark(design)
bmr$aggregate(msr("classif.acc"))[, c(5, 7)]
   resampling_id classif.acc
1:         NSpCV   0.6864064
2:          SpCV   0.5842370

In the above example, the non-spatial resampling method means that the train and test data are very similar due to spatial autocorrelation, so this is almost as bad as evaluating a model on the training data (see Chapter 2). In contrast, the spatial method has accounted for autocorrelation and the test data are therefore less correlated with the training data (though some association will remain). Visually this can be seen using built-in autoplot methods. In Figure 8.5 we visualize how the task is partitioned according to the spatial resampling method (Figure 8.5, top) and the non-spatial resampling method (Figure 8.5, bottom). There is a clear separation between coordinates using the ‘correct’ spatial resampling method, whereas the train and test splits overlap significantly (and are therefore highly correlated) using the non-spatial method.

library(patchwork)
autoplot(rsmp_sp, task, fold_id = c(1:3), size = 0.7) /
  autoplot(rsmp_nsp, task, fold_id = c(1:3), size = 0.7) &
  scale_y_continuous(breaks = seq(-3.97, -4, -0.01)) &
  scale_x_continuous(breaks = seq(-79.06, -79.08, -0.02))

Six scatter plots are shown with y-axes from 4 degrees South to 3.97 degrees South and x-axes from 79.08 degrees West to 79.06 degrees West. The top row of plots shows a very clear separation with no overlap between blue and orange datapoints, whereas the bottom row shows significant overlap.

Figure 8.5: Scatterplots show separation of train (blue) and test (orange) data for the first three (left to right) folds of the first repetition of the cross-validation. The top row is spatial resampling where train and test data are clearly separated. The bottom row is non-spatial resampling where there is overlap in train and test data.

Now that we have seen why spatial resampling methods matter, we can take a look at what is implemented in mlr3spatiotempcv. At the time of writing we have implemented 19 different resampling methods for spatial and spatiotemporal analysis. These are categorized into:

  • Blocking – Create rectangular blocks in 2D/3D space (as opposed to partitioning in 1D space used in non-spatial methods)
  • Buffering – Create buffering zones to remove observations between train and test sets
  • Spatiotemporal clustering – Clusters based on coordinates (and/or time-points)
  • Feature space clustering – Clusters based on feature space and not necessarily spatiotemporal
  • Partitioning – Grouped by factor variables

The choice of method may depend on specific characteristics of the dataset and there is no easy rule for picking one method over another; full details of the different methods can be found in Schratz et al. (2021). The paper deliberately avoids recommending one method over another, as the ‘optimal’ choice is highly dependent on the predictive task, the autocorrelation in the data, and the spatial structure of the sampling design. The documentation for each of the implemented methods includes details of the method as well as references to the original publications.
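As a sketch of how you might explore what is available: after loading mlr3spatiotempcv, the spatial and spatiotemporal methods appear in the mlr_resamplings dictionary and can be constructed with rsmp() like any other resampling; the number of folds below is illustrative only.

# list registered spatial/spatiotemporal resamplings and construct one of them
keys = mlr_resamplings$keys()
keys[grepl("spcv|sptcv", keys)]
rsmp_coords = rsmp("spcv_coords", folds = 5)
rsmp_coords$instantiate(task)  # instantiate on the ecuador task created above
rsmp_coords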

Spatiotemporal resampling

We have focused on spatial analysis but referred to “spatiotemporal” and “spatiotemp”. The resampling methods discussed in this section extend trivially to temporal analysis (or spatial and temporal analysis combined) by setting the "time" col_role in the task (Chapter 3). See the mlr3spatiotempcv visualization vignette for specific details about 3D spatiotemporal visualization.
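As a purely hypothetical sketch of that col_role (all data and column names below are made up for illustration, and we assume the spatiotemporal task types support the "time" role as described above), a temporal column could be declared like this:

# toy spatiotemporal data; 'date' is stored as POSIXct because Date columns
# are not a supported feature type in mlr3
dat = data.frame(
  x = runif(20), y = runif(20),
  date = as.POSIXct("2020-01-01", tz = "UTC") + 3600 * (0:19),
  response = rnorm(20)
)
task_toy = as_task_regr_st(dat, target = "response",
  coordinate_names = c("x", "y"), crs = 4326)
task_toy$set_col_roles("date", roles = "time")  # mark the temporal dimension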

8.6.3 Spatial Prediction

Until now we have looked at resampling to accommodate spatiotemporal features, but what if you want to make spatiotemporal predictions? In this case the goal is to make classification or regression predictions for spatial coordinates at a pixel level, usually by creating and labelling raster images.

To enable these predictions we have created a new function, predict_spatial, to allow spatial predictions using any of the following spatial data classes:

  • stars (from package stars)
  • SpatRaster (from package terra)
  • RasterLayer (from package raster)
  • RasterStack (from package raster)

To use a raster image for prediction, it must be wrapped in TaskUnsupervised. In the example below we load the leipzig_points file for training and coerce this to a spatiotemporal task with as_task_classif_st, and we load the leipzig_raster raster image and coerce this to an unsupervised task. Both files are shipped with, and can be downloaded from, mlr3spatial.

library(sf)

# load sample points
leipzig_vector = sf::read_sf(system.file("extdata",
  "leipzig_points.gpkg", package = "mlr3spatial"),
  stringsAsFactors = TRUE)
# create training data
tsk_leipzig = as_task_classif_st(leipzig_vector, target = "land_cover")

# load raster image
leipzig_raster = terra::rast(system.file("extdata", "leipzig_raster.tif",
  package = "mlr3spatial"))
# create testing data
tsk_predict = as_task_unsupervised(leipzig_raster)

Now we can continue as normal to train and predict with a classification learner, in this case a random forest.

lrn = lrn("classif.ranger")$train(tsk_leipzig)
pred = predict_spatial(tsk_predict, lrn, format = "terra")
pred
class       : SpatRaster 
dimensions  : 206, 154, 1  (nrow, ncol, nlyr)
resolution  : 10, 10  (x, y)
extent      : 731810, 733350, 5692030, 5694090  (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 32N (EPSG:32632) 
source      : file265478673975.tif 
categories  : categories 
name        : land_cover 
min value   :     forest 
max value   :      water 

In this example we specified the creation of a terra object, which can be visualized with built-in plotting methods.

library(terra, exclude = "resample")
plot(pred, col = c("#440154FF", "#443A83FF", "#31688EFF",
  "#21908CFF", "#35B779FF", "#8FD744FF", "#FDE725FF"))

8.7 Conclusion

In this chapter, we explored going beyond deterministic regression and single-label classification to see how functions in the mlr3 universe can be used to implement other machine learning tasks. Cost-sensitive classification extends the ‘normal’ classification setting by assuming that the errors associated with false negatives and false positives are unequal. Running cost-sensitive classification experiments is possible using features in mlr3 alone. Probabilistic regression extends deterministic regression by making predictions over a full probability distribution and not just a single point. You can train and predict probabilistic regression models with mlr3, or more effectively in pipelines with mlr3proba. Survival analysis, also available in mlr3proba, is a special form of probabilistic regression in which the outcome is censored, which means it may never be observed within a given time frame. The final task in mlr3proba is density estimation, the unsupervised task concerned with estimating univariate probability distributions. Using mlr3cluster, you can perform cluster analysis on observations, which involves grouping observations together according to similarities in their variables. Finally, with mlr3spatial and mlr3spatiotempcv, it is possible to perform spatial analysis, to make predictions using coordinates as features, and to make spatial predictions. The mlr3 interface is highly extensible, which means future machine learning tasks can (and will) be added to our universe and we will add these to this chapter of the book in future editions.

8.8 Exercises

We will run set.seed(11) before each of our solutions so you can reproduce our results.

  1. Run a benchmark experiment on the german_credit task with algorithms: featureless, log_reg, ranger. Tune the featureless model using tunethreshold and learner_cv. Use 2-fold CV and evaluate with msr("classif.costs", costs = costs) where you should make the parameter costs so that the cost of a true positive is -10, the cost of a true negative is -1, the cost of a false positive is 2, and the cost of a false negative is 3. Are your results surprising?

  2. Use the probregr pipeline to create a probabilistic regression model using: xgboost for the response prediction, a featureless learner for the se prediction, and assuming a Cauchy distribution. Train and predict on task = tsk("mtcars"); split = partition(task). Evaluate your model with the logloss measure.

  3. Train and predict a survival forest using rfsrc (from mlr3extralearners). Run this experiment using task = tsk("rats"); split = partition(task). Evaluate your model with the RCLL measure.

  4. Estimate the density of the tsk("precip") data using logspline (from mlr3extralearners). Run this experiment using task = tsk("precip"); split = partition(task). Evaluate your model with the logloss measure.

  5. Run a benchmark clustering experiment on the wine dataset without a label column. Compare the performance of the k-means learner with k equal to 2, 3 and 4 using the silhouette measure. Use the insample resampling technique. What value of k would you choose based on the silhouette scores?

  6. Run a (spatially) unbiased classification benchmark experiment on the ecuador task with a featureless learner and xgboost, and evaluate with the binary Brier score.