```
# Performance Evaluation and Comparison {#sec-performance}
{{< include _setup.qmd >}}
Now that we are familiar with the basics of how to construct tasks and learners and how to fit models, we would like to know more about how to assess fitted models.
In particular, we would like to assign one or several numeric *scores* to a given model.
This allows us to select a *best* model out of several fitted models and furthermore, helps us assess whether the models we have fitted are useful for a task at hand.
Scores in this context can reflect different qualities of a model, depending on the measures we employ.
To provide an example, in some cases we might be interested in how well our model predicts (in terms of global accuracy).
In others we might care about other perspectives, such as how many features it uses or how fast it predicts.
In supervised learning contexts, the core idea in performance evaluation is often that we would like to assess how good a fitted model will work on new, unseen data points.
In order to mimic this, we often hold out some data during model development.
Afterwards, we use predictions made on the held-out data and compare them to the true label to simulate how our model would have fared on new data points.
This strategy is often called *train-test* split or *holdout*.
[Performance Measures](#measures) now summarize such comparisons in a single number, e.g. a model's accuracy is the fraction of cases in which it predicted the true outcome correctly.
If datasets are small, the model might perform poorly on the test data by chance.
We would therefore like to repeat this process multiple times to reduce the variance in our estimation.
This gives rise to different [Resampling](#resampling) strategies encoding more complex strategies.
**Performance Measures**
[Measures](#measures) are a way to quantify model properties along relevant dimensions.
We cover how to
* construct `Measure` objects for different `task_types`,
* use `Measure` objects to score models
An in-depth guide on how to construct custom measures can be found in the [extending measures](##extending-measures) section.
**Resampling**
[Resampling](#resampling) is a methodology to construct training and test splits.
We cover how to
* access and select [resampling strategies](#resampling-settings),
* instantiate the [split into training and test sets](#resampling-inst) by applying the resampling, and
* execute the resampling to obtain [results](#resampling-exec).
In this section, we will use `Resampling` objects that define the resampling strategy we want to employ.
Calling `resample` on a `Task`, `Learner` and a `Resampling` object then will return a `ResampleResult` which contains the results of applying the strategy on a `Task` and `Learner`.
Additional information on resampling can be found in the section about [nested resampling](#nested-resampling) and in the chapter on [model optimization](#optimization).
**Benchmarking**
[Benchmarking](#benchmarking) is used to compare the performance of different models, for example models trained with different learners, on different tasks, or with different resampling methods.
This is usually done to obtain an overview of how different methods perform across different tasks.
We cover how to
* construct a [benchmarking design](#bm-design) using `benchmark_grid`,
* [execute a design](#bm-exec) and aggregate results, and
* [convert benchmarking objects](#bm-resamp) to other types of objects that can be used for different purposes.
#### Further Reading
* If data is pre-processed before feeding it to a model, the train-test split needs to be taken into account. [Pipelines](pipelines) solve this problem by cleanly separating train and predict steps!
* Depending on the task at hand, more involved resampling strategies might be required, e.g. for [spatiotemporal data](#spatiotemporal).
* [@bischl2012resampling](https://direct.mit.edu/evco/article-abstract/20/2/249/925/Resampling-Methods-for-Meta-Model-Validation-with?redirectedFrom=fulltext) provide an overview of resampling methods.
## ROC Curve and Thresholds {#binary-roc}
As we have seen before, binary classification is unique because of the presence of a positive and negative class and a threshold probability to distinguish between the two.
ROC Analysis, which stands for receiver operating characteristics, applies specifically to this case and allows a better picture of the trade-offs when choosing between the two classes.
We saw earlier that one can retrieve the confusion matrix of a `r ref("Prediction")` by accessing the `$confusion` field.
In the following code chunk, we first retrieve the Sonar classification task and construct a classification tree learner.
Next, we use the `r ref("partition()")` helper function to randomly split the rows of the Sonar task into two disjunct sets: a training set and a test set.
We train the learner on the training set and use the trained model to generate predictions on the test set.
Finally, we retrieve the confusion matrix.
```{r performance-001}
task = tsk("sonar")
learner = lrn("classif.rpart", predict_type = "prob")
# split into training and test
splits = partition(task, ratio = 0.8)
print(str(splits))
pred = learner$train(task, splits$train)$predict(task, splits$test)
pred$confusion
```
The upper left quadrant denotes the number of times our model predicted the positive class and was correct about it.
Similarly, the lower right quadrant denotes the number of times our model predicted the negative class and was also correct about it.
Together, the elements on the diagonal are called True Positives (TP) and True Negatives (TN).
The upper right quadrant denotes the number of times we falsely predicted a positive label and is called False Positives (FP).
The lower left quadrant is called False Negatives (FN).
We can derive the following performance metrics from the confusion matrix:
* **True Positive Rate (TPR)**: How many of the true positives did we predict as positive?
* **True Negative Rate (TNR)**: How many of the true negatives did we predict as negative?
* **Positive Predictive Value PPV**: If we predict positive how likely is it a true positive?
* **Negative Predictive Value NPV**: If we predict negative how likely is it a true negative?
It is difficult to achieve a high TPR and low FPR simultaneously.
We can characterize the behavior of a binary classifier for different thresholds by plotting the TPR and FPR values -- this is the ROC curve.
The best classifier lies on the top-left corner -- the TPR is 1 and the FPR is 0.
Classifiers on the diagonal produce labels randomly (possibly with different proportions).
For example, if each positive $x$ will be randomly classified with 25\% as "positive", we get a TPR of 0.25.
If we assign each negative $x$ randomly to "positive" we get a FPR of 0.25.
In practice, we should never obtain a classifier clearly below the diagonal -- ROC curves for different labels are symmetric with respect to the diagonal, so a curve below the diagonal would indicate that the positive and negative class labels have been switched by the classifier.
For `r mlr3` prediction objects, the ROC curve can easily be constructed with `r mlr3viz` which relies on the `r ref_pkg("precrec")` to calculate and plot ROC curves:
```{r performance-002}
library("mlr3viz")
# TPR vs FPR / Sensitivity vs (1 - Specificity)
autoplot(pred, type = "roc")
```
We can also plot the precision-recall curve (PPV vs. TPR).
The main difference between ROC curves and precision-recall curves (PRC) is that the number of true-negative results is not used for making a PRC.
PRCs are preferred over ROC curves for imbalanced populations.
```{r performance-003}
# Precision vs Recall
autoplot(pred, type = "prc")
```
## Resampling {#sec-resampling}
When evaluating the performance of a model, we are interested in its generalization performance -- how well will it perform on new data that has not been seen during training?
We can estimate the generalization performance by evaluating a model on a test set, as we have done above, that was constructed to contain only observations that are not contained in the training set.
There are many different strategies for partitioning a data set into training and test; in `r mlr3` we call these strategies "resampling".
`r mlr3` includes the following predefined [resampling](#resampling) strategies:
| Name | Identifier |
|:---|:---|
| `r ref("mlr_resamplings_cv", text = "cross validation")` | `"cv"` |
| `r ref("mlr_resamplings_loo", text = "leave-one-out cross validation")` | `"loo"` |
| `r ref("mlr_resamplings_repeated_cv", text = "repeated cross validation")` | `"repeated_cv"` |
| `r ref("mlr_resamplings_bootstrap", text = "bootstrapping")`| `"bootstrap"` |
| `r ref("mlr_resamplings_subsampling", text = "subsampling")` | `"subsampling"` |
| `r ref("mlr_resamplings_holdout", text = "holdout")` | `"holdout"` |
| `r ref("mlr_resamplings_insample", text = "in-sample resampling")`| `"insample"` |
| `r ref("mlr_resamplings_custom", text = "custom resampling")` | `"custom"` |
:::{.callout-note}
It is often desirable to repeatedly split the entire data in different ways to ensure that a "lucky" or "unlucky" split does not bias the generalization performance estimate.
Without resampling strategies like the ones we provide here, this is a tedious and error-prone process.
:::
The following sections provide guidance on how to select a resampling strategy and how to use it.
Here is a graphical illustration of the resampling process in general:
```{r performance-004, echo=FALSE, fig.align="center"}
knitr::include_graphics("images/ml_abstraction.svg")
```
### Settings {#resampling-settings}
We will use the `r ref("mlr_tasks_penguins", text = "penguins")` task and a simple classification tree from the `r ref_pkg("rpart")` package as an example here.
```{r performance-005}
task = tsk("penguins")
learner = lrn("classif.rpart")
```
When performing resampling with a dataset, we first need to define which approach to use.
`r mlr3` resampling strategies and their parameters can be queried by looking at the `data.table` output of the `r ref("mlr_resamplings")` dictionary; this also lists the parameters that can be changed to affect the behavior of each strategy:
```{r performance-006}
as.data.table(mlr_resamplings)
```
:::{.callout-tip}
Additional resampling methods for special use cases are available via extension packages, such as `r mlr3spatiotempcv` for spatial data.
:::
What we showed in the [train/predict/score](#train-predict) part is the equivalent of holdout resampling, done manually, so let's consider this one first.
We can retrieve elements from the dictionary `r ref("mlr_resamplings")` with the convenience function `r ref("rsmp()")`:
```{r performance-007}
resampling = rsmp("holdout")
print(resampling)
```
:::{.callout-note}
Note that the `$is_instantiated` field is set to `FALSE`.
This means we did not apply the strategy to a dataset yet.
:::
By default we get a .66/.33 split of the data into training and test.
There are two ways in which this ratio can be changed:
1. Overwriting the slot in `$param_set$values` using a named list:
```{r performance-008}
resampling$param_set$values = list(ratio = 0.8)
```
2. Specifying the resampling parameters directly during construction:
```{r performance-009}
rsmp("holdout", ratio = 0.8)
```
### Instantiation {#resampling-inst}
So far, we have only chosen a resampling strategy; we now need to instantiate it with data.
To perform the splitting and obtain indices for the training and the test split, the resampling needs a `r ref("Task")`.
By calling the method `instantiate()`, we split the row indices of the task into indices for training and test sets.
These resulting indices are stored in the `r ref("Resampling")` objects.
```{r performance-010}
resampling$instantiate(task)
train = resampling$train_set(1)
test = resampling$test_set(1)
str(train)
str(test)
# are they disjunct?
intersect(train, test)
# are all row ids either in train or test?
setequal(c(train, test), task$row_ids)
```
Note that if you want to compare multiple [Learners](#learners) in a fair manner, using the same instantiated resampling for each learner is mandatory, such that each learner gets exactly the same training data and the performance of the trained model is evaluated in exactly the same test set.
A way to greatly simplify the comparison of multiple learners is discussed in the [section on benchmarking](#benchmarking).
### Execution {#resampling-exec}
With a `r ref("Task")`, a `r ref("Learner")`, and a `r ref("Resampling")` object we can now perform a resampling: fit the learner on a subset of the task repeatedly and predict on the left-out observations.
For this, we call the `r ref("resample()")` function which returns a `r ref("ResampleResult")` object.
We can tell `r ref("resample()")` to keep the fitted models (for example for later inspection) by setting the `store_models` option to `TRUE`.
:::{.callout-note}
The `r ref_pkg("butcher")` package allows you to remove components of a fitted model that may no longer be needed.
However, from the point of view of the mlr3 framework, it is impossible to find a good compromise here.
Of course, as much balast as possible should be removed.
However, if too much is removed, some graphics, outputs or third-party extensions may not work anymore as intended.
Our conservative policy here is that the training data should not be part of the stored model.
If this is the case somewhere, please [open an issue](https://github.com/mlr-org/mlr3/issues).
To manually apply butcher, just replace the stored model in the learner:
```{r performance-011, eval = FALSE}
task = tsk("penguins")
learner = lrn("classif.rpart")
learner$train(task)
learner$model = butcher::butcher(learner$model)
```
:::
Per default, mlr3 discards fitted models after the prediction step to keep the `r ref("ResampleResult")` object small w.r.t. its memory consumption.
```{r performance-012}
task = tsk("penguins")
learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling = rsmp("cv", folds = 3)
rr = resample(task, learner, resampling, store_models = TRUE)
print(rr)
```
Here we use a three-fold cross-validation resampling, which trains and evaluates on three different training and test sets.
The returned `r ref("ResampleResult")`, stored as `rr`, provides various getters to access and aggregate the stored information.
Here are a few examples:
- Calculate the average performance across all resampling iterations, in terms of classification error:
```{r performance-013}
rr$aggregate(msr("classif.ce"))
```
This is useful to check if one (or more) of the iterations are very different from the average.
- Check for warnings or errors:
```{r performance-014}
rr$warnings
rr$errors
```
- Extract and inspect the resampling splits; this allows us to see in detail which observations were used for what purpose when:
```{r performance-015}
rr$resampling
rr$resampling$iters
str(rr$resampling$train_set(1))
str(rr$resampling$test_set(1))
```
- Retrieve the model trained in a specific iteration and inspect it, for example, to investigate why the performance in this iteration was very different from the average:
```{r performance-016}
lrn = rr$learners[[1]]
lrn$model
```
- Extract the individual predictions:
```{r performance-017}
rr$prediction() # all predictions merged into a single Prediction object
rr$predictions()[[1]] # predictions of first resampling iteration
```
- Filter the result, keep only specified resampling iterations:
```{r performance-018}
rr$filter(c(1, 3))
print(rr)
```
### Custom resampling {#resamp-custom}
Sometimes it is necessary to perform resampling with custom splits, e.g., to reproduce results reported in a study.
A manual resampling instance can be constructed using the `"custom"` template.
```{r performance-019}
resampling = rsmp("custom")
resampling$instantiate(task,
train = list(c(1:10, 51:60, 101:110)),
test = list(c(11:20, 61:70, 111:120))
)
resampling$iters
resampling$train_set(1)
resampling$test_set(1)
```
### Resampling with (predefined) groups
In some cases, it is desirable to keep observations together, i.e., either have all observations of the same group in the training set or have all observations of the same group in the test set.
This can be defined through the column role `"group"` during Task creation, i.e., a column in the data specifies the groups (see also the [help page](https://mlr3.mlr-org.com/reference/Resampling.html#grouping-blocking) on this column role).
A potential use case for this is spatial modeling, where one wants to account for groups of observations during resampling.
:::{.callout-tip appearance="simple"}
Besides the `"group"` column role, dedicated spatiotemporal resampling methods are available in `r mlr3spatiotempcv`.
More general information about this topic can be found in the [spatiotemporal resampling](special.html#spatiotemp-cv) section.
:::
If you want the ensure that the observations in the training and test set are similarly distributed as in the complete task, one or multiple columns can be used for stratification via the column role `stratum`.
See also the [mlr3gallery post on this topic](https://mlr3gallery.mlr-org.com/posts/2020-03-30-stratification-blocking/) for a practical introduction.
:::{.callout-note appearance="simple"}
In the old `r ref_pkg("mlr")` package, grouping was called "blocking".
:::
### Plotting Resample Results {#autoplot-resampleresult}
`r mlr3viz` provides a `r ref("ggplot2::autoplot()", text = "autoplot()")` method for resampling results.
As an example, we construct a binary classification task with two features, perform a resampling with a 10-fold cross-validation and visualize the results:
```{r performance-020}
task = tsk("pima")
task$select(c("glucose", "mass"))
learner = lrn("classif.rpart", predict_type = "prob")
rr = resample(task, learner, rsmp("cv"), store_models = TRUE)
rr$score(msr("classif.auc"))
# boxplot of AUC values across the 10 folds
library("mlr3viz")
autoplot(rr, measure = msr("classif.auc"))
# ROC curve, averaged over 10 folds
autoplot(rr, type = "roc")
```
We can also plot the predictions of individual models:
```{r performance-021}
# learner predictions for the first fold
rr1 = rr$clone()$filter(1)
autoplot(rr1, type = "prediction")
```
:::{.callout-tip}
All available plot types are listed on the manual page of `r ref("autoplot.ResampleResult()")`.
:::
## Benchmarking {#benchmarking}
Comparing the performance of different learners on multiple tasks and/or different resampling schemes is a common task.
This operation is usually referred to as "benchmarking" in the field of machine learning.
The `r mlr3` package offers the `r ref("benchmark()")` convenience function that takes care of most of the work of repeatedly training and evaluating models under the same conditions.
### Design Creation {#bm-design}
Benchmark experiments in `r mlr3` are specified through a design.
Such a design is essentially a table of scenarios to be evaluated; in particular unique combinations of `r ref("Task")`, `r ref("Learner")` and `r ref("Resampling")` triplets.
We use the `r ref("benchmark_grid()")` function to construct an exhaustive design (that evaluates each learner on each task with each resampling) and instantiate the resampling properly, so that all learners are executed on the same train/test split for each tasks.
We set the learners to predict probabilities and also tell them to predict for the observations of the training set (by setting `predict_sets` to `c("train", "test")`).
Additionally, we use `r ref("tsks()")`, `r ref("lrns()")`, and `r ref("rsmps()")` to retrieve lists of `r ref("Task")`, `r ref("Learner")` and `r ref("Resampling")`.
```{r performance-022}
library("mlr3verse")
design = benchmark_grid(
tasks = tsks(c("spam", "german_credit", "sonar")),
learners = lrns(c("classif.ranger", "classif.rpart", "classif.featureless"),
predict_type = "prob", predict_sets = c("train", "test")),
resamplings = rsmps("cv", folds = 3)
)
print(design)
```
The constructed design table can be passed to `r ref("benchmark()")` to start the computation.
It is also possible to construct a custom design manually, for example, to exclude certain task-learner combinations.
:::{.callout-caution}
However, if you construct a custom design with `r ref("data.table()")`, the train/test splits will be different for each row of the design if you do not [**manually instantiate**](#resampling-inst) the resampling before creating the design.
See the help page on `r ref("benchmark_grid()")` for an example.
:::
### Execution Aggregation of Results {#bm-exec}
After the [benchmark design](#bm-design) is ready, we can call `r ref("benchmark()")` on it:
```{r performance-023}
bmr = benchmark(design)
```
:::{.callout-tip}
Note that we did not have to instantiate the resampling manually.
`r ref("benchmark_grid()")` took care of it for us: each resampling strategy is instantiated once for each task during the construction of the exhaustive grid.
This way, each learner operates on the same training and test sets which makes the results easier to compare.
:::
Once the benchmarking is finished (and, depending on the size of your design, this can take quite some time), we can aggregate the performance with the `$aggregate()` method of the returned `r ref("BenchmarkResult")`.
We construct two measures to calculate the area under the curve (AUC) for the training and the test set:
```{r performance-024}
measures = list(
msr("classif.auc", predict_sets = "train", id = "auc_train"),
msr("classif.auc", id = "auc_test")
)
tab = bmr$aggregate(measures)
print(tab[, .(task_id, learner_id, auc_train, auc_test)])
```
We can aggregate the results even further.
For example, we might be interested to know which learner performed best across all tasks.
Simply aggregating the performances with the mean is usually not statistically sound.
Instead, we calculate the rank statistic for each learner, grouped by task.
Then the calculated ranks, grouped by the learner, are aggregated with the `r ref_pkg("data.table")` package.
As larger AUC scores are better, we multiply the values by $-1$ such that the best learner has a rank of $1$.
```{r performance-025}
library("data.table")
# group by levels of task_id, return columns:
# - learner_id
# - rank of col '-auc_train' (per level of learner_id)
# - rank of col '-auc_test' (per level of learner_id)
ranks = tab[, .(learner_id, rank_train = rank(-auc_train), rank_test = rank(-auc_test)), by = task_id]
print(ranks)
# group by levels of learner_id, return columns:
# - mean rank of col 'rank_train' (per level of learner_id)
# - mean rank of col 'rank_test' (per level of learner_id)
ranks = ranks[, .(mrank_train = mean(rank_train), mrank_test = mean(rank_test)), by = learner_id]
# print the final table, ordered by mean rank of AUC test
ranks[order(mrank_test)]
```
Unsurprisingly, the featureless learner is worse overall.
The winner is the classification forest, which outperforms a single classification tree.
### Plotting Benchmark Results {#autoplot-benchmarkresult}
Similar to [tasks](#autoplot-task), [predictions](#autoplot-prediction), or [resample results](#autoplot-resampleresult), `r mlr3viz` also provides a `r ref("ggplot2::autoplot()", text = "autoplot()")` method for benchmark results.
```{r performance-026}
autoplot(bmr) + ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))
```
Such a plot gives a nice overview of the overall performance and how learners compare on different tasks in an intuitive way.
We can also plot ROC (receiver operating characteristics) curves.
We filter the `r ref("BenchmarkResult")` to only contain a single `r ref("Task")`, then we simply plot the result:
```{r performance-027}
bmr_small = bmr$clone(deep = TRUE)$filter(task_id = "german_credit")
autoplot(bmr_small, type = "roc")
```
All available plot types are listed on the manual page of `r ref("autoplot.BenchmarkResult()")`.
### Statistical Tests
The package `r mlr3benchmark` provides some infrastructure for applying statistical significance tests on `r ref("BenchmarkResult")`.
Currently, Friedman tests and pairwise Friedman-Nemenyi tests [@demsar2006] are supported for benchmarks with at least two tasks and at least two learners.
The results can be summarized in Critical Difference Plots.
```{r performance-028}
library("mlr3benchmark")
bma = as.BenchmarkAggr(bmr, measures = msr("classif.auc"))
bma$friedman_posthoc()
autoplot(bma, type = "cd")
```
### Extracting ResampleResults {#bm-resamp}
A `r ref("BenchmarkResult")` object is essentially a collection of multiple `r ref("ResampleResult")` objects.
As these are stored in a column of the aggregated `r ref("data.table()")`, we can easily extract them:
```{r performance-029}
tab = bmr$aggregate(measures)
rr = tab[task_id == "german_credit" & learner_id == "classif.ranger"]$resample_result[[1]]
print(rr)
```
We can now investigate this resampling and even single resampling iterations using one of the approaches shown in [the previous section on resampling](#resampling):
```{r performance-030}
measure = msr("classif.auc")
rr$aggregate(measure)
# get the iteration with worst AUC
perf = rr$score(measure)
i = which.min(perf$classif.auc)
# get the corresponding learner and training set
print(rr$learners[[i]])
head(rr$resampling$train_set(i))
```
### Converting and Merging
A `r ref("ResampleResult")` can be converted to a `r ref("BenchmarkResult")` with the function `r ref("as_benchmark_result()")`.
We can also combine two `r ref("BenchmarkResult", text = "BenchmarkResults")` into a larger result object, for example, for two related benchmarks that are computed on different machines.
```{r performance-031}
task = tsk("iris")
resampling = rsmp("holdout")$instantiate(task)
rr1 = resample(task, lrn("classif.rpart"), resampling)
rr2 = resample(task, lrn("classif.featureless"), resampling)
# Cast both ResampleResults to BenchmarkResults
bmr1 = as_benchmark_result(rr1)
bmr2 = as_benchmark_result(rr2)
# Merge 2nd BMR into the first BMR
bmr = c(bmr1, bmr2)
print(bmr)
```
```