# 3  Performance Evaluation and Comparison

Now that we are familiar with the basics of how to create tasks and learners, how to fit models, let’s have a look at some of the details, and in particular how mlr3 makes it easy to perform many common machine learning steps.

We will cover the following topics:

Performance Scoring

Resampling

Resampling is a methodology to create training and test splits. We cover how to

Additional information on resampling can be found in the section about nested resampling and in the chapter on model optimization.

Benchmarking

Benchmarking is used to compare the performance of different models, for example models trained with different learners, on different tasks, or with different resampling methods. This is usually done to get an overview of how different methods perform across different tasks. We cover how to

## 3.1 ROC Curve and Thresholds

As we have seen before, binary classification is unique because of the presence of a positive and negative class and a threshold probability to distinguish between the two. ROC Analysis, which stands for receiver operating characteristics, applies specifically to this case and allows a better picture of the trade-offs when choosing between the two classes.

We saw earlier that one can retrieve the confusion matrix of a Prediction by accessing the $confusion field. In the following code chunk, we first retrieve the Sonar classification task and construct a classification tree learner. Next, we use the partition() helper function to randomly split the rows of the Sonar task into two disjunct sets: a training set and a test set. We train the learner on the training set and use the trained model to generate predictions on the test set. Finally, we retrieve the confusion matrix. task = tsk("sonar") learner = lrn("classif.rpart", predict_type = "prob") # split into training and test splits = partition(task, ratio = 0.8) print(str(splits)) List of 2$ train: int [1:167] 3 4 5 6 7 8 9 12 13 14 ...
$test : int [1:41] 1 2 10 11 24 28 35 37 46 48 ... NULL pred = learner$train(task, splits$train)$predict(task, splits$test) pred$confusion
        truth
response  M  R
M 19  5
R  3 14

The upper left quadrant denotes the number of times our model predicted the positive class and was correct about it. Similarly, the lower right quadrant denotes the number of times our model predicted the negative class and was also correct about it. Together, the elements on the diagonal are called True Positives (TP) and True Negatives (TN). The upper right quadrant denotes the number of times we falsely predicted a positive label and is called False Positives (FP). The lower left quadrant is called False Negatives (FN).

We can derive the following performance metrics from the confusion matrix:

• True Positive Rate (TPR): How many of the true positives did we predict as positive?
• True Negative Rate (TNR): How many of the true negatives did we predict as negative?
• Positive Predictive Value PPV: If we predict positive how likely is it a true positive?
• Negative Predictive Value NPV: If we predict negative how likely is it a true negative?

It is difficult to achieve a high TPR and low FPR simultaneously. We can characterize the behavior of a binary classifier for different thresholds by plotting the TPR and FPR values – this is the ROC curve. The best classifier lies on the top-left corner – the TPR is 1 and the FPR is 0. Classifiers on the diagonal produce labels randomly (possibly with different proportions). For example, if each positive $$x$$ will be randomly classified with 25% as “positive”, we get a TPR of 0.25. If we assign each negative $$x$$ randomly to “positive” we get a FPR of 0.25. In practice, we should never obtain a classifier clearly below the diagonal – ROC curves for different labels are symmetric with respect to the diagonal, so a curve below the diagonal would indicate that the positive and negative class labels have been switched by the classifier.

For mlr3 prediction objects, the ROC curve can easily be created with mlr3viz which relies on the precrec to calculate and plot ROC curves:

library("mlr3viz")

# TPR vs FPR / Sensitivity vs (1 - Specificity)
autoplot(pred, type = "roc") We can also plot the precision-recall curve (PPV vs. TPR). The main difference between ROC curves and precision-recall curves (PRC) is that the number of true-negative results is not used for making a PRC. PRCs are preferred over ROC curves for imbalanced populations.

# Precision vs Recall
autoplot(pred, type = "prc") ## 3.2 Resampling

When evaluating the performance of a model, we are interested in its generalization performance – how well will it perform on new data that has not been seen during training? We can estimate the generalization performance by evaluating a model on a test set, as we have done above, that was created to contain only observations that are not contained in the training set. There are many different strategies for partitioning a data set into training and test; in mlr3 we call these strategies “resampling”. mlr3 includes the following predefined resampling strategies:

Name Identifier
cross validation "cv"
leave-one-out cross validation "loo"
repeated cross validation "repeated_cv"
bootstrapping "bootstrap"
subsampling "subsampling"
holdout "holdout"
in-sample resampling "insample"
custom resampling "custom"
Note

It is often desirable to repeatedly split the entire data in different ways to ensure that a “lucky” or “unlucky” split does not bias the generalization performance estimate. Without resampling strategies like the ones we provide here, this is a tedious and error-prone process.

The following sections provide guidance on how to select a resampling strategy and how to use it.

Here is a graphical illustration of the resampling process in general:

### 3.2.1 Settings

We will use the penguins task and a simple classification tree from the rpart package as an example here.

task = tsk("penguins")
learner = lrn("classif.rpart")

When performing resampling with a dataset, we first need to define which approach to use. mlr3 resampling strategies and their parameters can be queried by looking at the data.table output of the mlr_resamplings dictionary; this also lists the parameters that can be changed to affect the behavior of each strategy:

as.data.table(mlr_resamplings)
           key                         label        params iters
1:   bootstrap                     Bootstrap ratio,repeats    30
2:      custom                 Custom Splits                  NA
3:   custom_cv Custom Split Cross-Validation                  NA
4:          cv              Cross-Validation         folds    10
5:     holdout                       Holdout         ratio     1
6:    insample           Insample Resampling                   1
7:         loo                 Leave-One-Out                  NA
8: repeated_cv     Repeated Cross-Validation folds,repeats   100
9: subsampling                   Subsampling ratio,repeats    30
Tip

Additional resampling methods for special use cases are available via extension packages, such as mlr3spatiotemporal for spatial data.

What we showed in the train/predict/score part is the equivalent of holdout resampling, done manually, so let’s consider this one first. We can retrieve elements from the dictionary mlr_resamplings with the convenience function rsmp():

resampling = rsmp("holdout")
print(resampling)
<ResamplingHoldout>: Holdout
* Iterations: 1
* Instantiated: FALSE
* Parameters: ratio=0.6667
Note

Note that the $is_instantiated field is set to FALSE. This means we did not apply the strategy to a dataset yet. By default we get a .66/.33 split of the data into training and test. There are two ways in which this ratio can be changed: 1. Overwriting the slot in $param_set$values using a named list: resampling$param_set$values = list(ratio = 0.8) 1. Specifying the resampling parameters directly during construction: rsmp("holdout", ratio = 0.8) <ResamplingHoldout>: Holdout * Iterations: 1 * Instantiated: FALSE * Parameters: ratio=0.8 ### 3.2.2 Instantiation So far, we have only chosen a resampling strategy; we now need to instantiate it with data. To perform the splitting and obtain indices for the training and the test split, the resampling needs a Task. By calling the method instantiate(), we split the row indices of the task into indices for training and test sets. These resulting indices are stored in the Resampling objects. resampling$instantiate(task)
train = resampling$train_set(1) test = resampling$test_set(1)
str(train)
 int [1:275] 2 3 4 5 6 7 8 9 10 11 ...
str(test)
 int [1:69] 1 13 14 15 17 20 28 32 42 44 ...
# are they disjunct?
intersect(train, test)
integer(0)
# are all row ids either in train or test?
setequal(c(train, test), task$row_ids)  TRUE Note that if you want to compare multiple Learners in a fair manner, using the same instantiated resampling for each learner is mandatory, such that each learner gets exactly the same training data and the performance of the trained model is evaluated in exactly the same test set. A way to greatly simplify the comparison of multiple learners is discussed in the section on benchmarking. ### 3.2.3 Execution With a Task, a Learner, and a Resampling object we can now perform a resampling: fit the learner on a subset of the task repeatedly and predict on the left-out observations. For this, we call the resample() function which returns a ResampleResult object. We can tell resample() to keep the fitted models (for example for later inspection) by setting the store_models option to TRUE. Note The butcher package allows you to remove components of a fitted model that may no longer be needed. However, from the point of view of the mlr3 framework, it is impossible to find a good compromise here. Of course, as much balast as possible should be removed. However, if too much is removed, some graphics, outputs or third-party extensions may not work anymore as intended. Our conservative policy here is that the training data should not be part of the stored model. If this is the case somewhere, please open an issue. To manually apply butcher, just replace the stored model in the learner: task = tsk("penguins") learner = lrn("classif.rpart") learner$train(task)

learner$model = butcher::butcher(learner$model)

Per default, mlr3 discards fitted models after the prediction step to keep the ResampleResult object small w.r.t. its memory consumption.

task = tsk("penguins")
learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling = rsmp("cv", folds = 3)

rr = resample(task, learner, resampling, store_models = TRUE)
print(rr)
<ResampleResult> of 3 iterations
* Learner: classif.rpart
* Warnings: 0 in 0 iterations
* Errors: 0 in 0 iterations

Here we use a three-fold cross-validation resampling, which trains and evaluates on three different training and test sets. The returned ResampleResult, stored as rr, provides various getters to access and aggregate the stored information. Here are a few examples:

• Calculate the average performance across all resampling iterations, in terms of classification error:

rr$aggregate(msr("classif.ce")) classif.ce 0.05819985  This is useful to check if one (or more) of the iterations are very different from the average. • Check for warnings or errors: rr$warnings
Empty data.table (0 rows and 2 cols): iteration,msg
rr$errors Empty data.table (0 rows and 2 cols): iteration,msg • Extract and inspect the resampling splits; this allows us to see in detail which observations were used for what purpose when: rr$resampling
<ResamplingCV>: Cross-Validation
* Iterations: 3
* Instantiated: TRUE
* Parameters: folds=3
rr$resampling$iters
 3
str(rr$resampling$train_set(1))
 int [1:229] 9 12 16 17 19 24 25 29 30 34 ...
str(rr$resampling$test_set(1))
 int [1:115] 1 5 8 10 15 18 21 22 23 26 ...
• Retrieve the model trained in a specific iteration and inspect it, for example, to investigate why the performance in this iteration was very different from the average:

lrn = rr$learners[] lrn$model
n= 229

node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 229 128 Adelie (0.44104803 0.22707424 0.33187773)
2) bill_length< 42.35 94   1 Adelie (0.98936170 0.00000000 0.01063830) *
3) bill_length>=42.35 135  60 Gentoo (0.05925926 0.38518519 0.55555556)
6) island=Dream,Torgersen 58   6 Chinstrap (0.10344828 0.89655172 0.00000000) *
7) island=Biscoe 77   2 Gentoo (0.02597403 0.00000000 0.97402597) *
• Extract the individual predictions:

rr$prediction() # all predictions merged into a single Prediction object <PredictionClassif> for 344 observations: row_ids truth response prob.Adelie prob.Chinstrap prob.Gentoo 1 Adelie Adelie 0.98936170 0.00000000 0.01063830 5 Adelie Adelie 0.98936170 0.00000000 0.01063830 8 Adelie Adelie 0.98936170 0.00000000 0.01063830 --- 340 Chinstrap Gentoo 0.01162791 0.01162791 0.97674419 342 Chinstrap Chinstrap 0.04761905 0.92857143 0.02380952 343 Chinstrap Gentoo 0.01162791 0.01162791 0.97674419 rr$predictions()[] # predictions of first resampling iteration
<PredictionClassif> for 115 observations:
row_ids     truth  response prob.Adelie prob.Chinstrap prob.Gentoo
---
335 Chinstrap Chinstrap   0.1034483      0.8965517   0.0000000
336 Chinstrap Chinstrap   0.1034483      0.8965517   0.0000000
339 Chinstrap Chinstrap   0.1034483      0.8965517   0.0000000
• Filter the result, keep only specified resampling iterations:

rr$filter(c(1, 3)) print(rr) <ResampleResult> of 2 iterations * Task: penguins * Learner: classif.rpart * Warnings: 0 in 0 iterations * Errors: 0 in 0 iterations ### 3.2.4 Custom resampling Sometimes it is necessary to perform resampling with custom splits, e.g., to reproduce results reported in a study. A manual resampling instance can be created using the "custom" template. resampling = rsmp("custom") resampling$instantiate(task,
train = list(c(1:10, 51:60, 101:110)),
test = list(c(11:20, 61:70, 111:120))
)
resampling$iters  1 resampling$train_set(1)
    1   2   3   4   5   6   7   8   9  10  51  52  53  54  55  56  57  58  59
  60 101 102 103 104 105 106 107 108 109 110
resampling$test_set(1)   11 12 13 14 15 16 17 18 19 20 61 62 63 64 65 66 67 68 69  70 111 112 113 114 115 116 117 118 119 120 ### 3.2.5 Resampling with (predefined) groups In some cases, it is desirable to keep observations together, i.e., either have all observations of the same group in the training set or have all observations of the same group in the test set. This can be defined through the column role "group" during Task creation, i.e., a column in the data specifies the groups (see also the help page on this column role). A potential use case for this is spatial modeling, where one wants to account for groups of observations during resampling. Besides the "group" column role, dedicated spatiotemporal resampling methods are available in mlr3spatiotemporal. More general information about this topic can be found in the spatiotemporal resampling section. If you want the ensure that the observations in the training and test set are similarly distributed as in the complete task, one or multiple columns can be used for stratification via the column role stratum. See also the mlr3gallery post on this topic for a practical introduction. In the old mlr package, grouping was called “blocking”. ### 3.2.6 Plotting Resample Results mlr3viz provides a autoplot() method for resampling results. As an example, we create a binary classification task with two features, perform a resampling with a 10-fold cross-validation and visualize the results: task = tsk("pima") task$select(c("glucose", "mass"))
learner = lrn("classif.rpart", predict_type = "prob")
rr = resample(task, learner, rsmp("cv"), store_models = TRUE)
rr$score(msr("classif.auc"))  task task_id learner learner_id 1: <TaskClassif> pima <LearnerClassifRpart> classif.rpart 2: <TaskClassif> pima <LearnerClassifRpart> classif.rpart 3: <TaskClassif> pima <LearnerClassifRpart> classif.rpart 4: <TaskClassif> pima <LearnerClassifRpart> classif.rpart 5: <TaskClassif> pima <LearnerClassifRpart> classif.rpart 6: <TaskClassif> pima <LearnerClassifRpart> classif.rpart 7: <TaskClassif> pima <LearnerClassifRpart> classif.rpart 8: <TaskClassif> pima <LearnerClassifRpart> classif.rpart 9: <TaskClassif> pima <LearnerClassifRpart> classif.rpart 10: <TaskClassif> pima <LearnerClassifRpart> classif.rpart 5 variables not shown: [resampling, resampling_id, iteration, prediction, classif.auc] # boxplot of AUC values across the 10 folds library("mlr3viz") autoplot(rr, measure = msr("classif.auc")) # ROC curve, averaged over 10 folds autoplot(rr, type = "roc") We can also plot the predictions of individual models: # learner predictions for the first fold rr1 = rr$clone()$filter(1) autoplot(rr1, type = "prediction") Warning: Removed 2 rows containing missing values (geom_point). Tip All available plot types are listed on the manual page of autoplot.ResampleResult(). ## 3.3 Benchmarking Comparing the performance of different learners on multiple tasks and/or different resampling schemes is a common task. This operation is usually referred to as “benchmarking” in the field of machine learning. The mlr3 package offers the benchmark() convenience function that takes care of most of the work of repeatedly training and evaluating models under the same conditions. ### 3.3.1 Design Creation Benchmark experiments in mlr3 are specified through a design. Such a design is essentially a table of scenarios to be evaluated; in particular unique combinations of Task, Learner and Resampling triplets. We use the benchmark_grid() function to create an exhaustive design (that evaluates each learner on each task with each resampling) and instantiate the resampling properly, so that all learners are executed on the same train/test split for each tasks. We set the learners to predict probabilities and also tell them to predict for the observations of the training set (by setting predict_sets to c("train", "test")). Additionally, we use tsks(), lrns(), and rsmps() to retrieve lists of Task, Learner and Resampling. library("mlr3verse") design = benchmark_grid( tasks = tsks(c("spam", "german_credit", "sonar")), learners = lrns(c("classif.ranger", "classif.rpart", "classif.featureless"), predict_type = "prob", predict_sets = c("train", "test")), resamplings = rsmps("cv", folds = 3) ) print(design)  task learner resampling 1: <TaskClassif> <LearnerClassifRanger> <ResamplingCV> 2: <TaskClassif> <LearnerClassifRpart> <ResamplingCV> 3: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV> 4: <TaskClassif> <LearnerClassifRanger> <ResamplingCV> 5: <TaskClassif> <LearnerClassifRpart> <ResamplingCV> 6: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV> 7: <TaskClassif> <LearnerClassifRanger> <ResamplingCV> 8: <TaskClassif> <LearnerClassifRpart> <ResamplingCV> 9: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV> The created design table can be passed to benchmark() to start the computation. It is also possible to create a custom design manually, for example, to exclude certain task-learner combinations. Danger However, if you create a custom design with data.table(), the train/test splits will be different for each row of the design if you do not manually instantiate the resampling before creating the design. See the help page on benchmark_grid() for an example. ### 3.3.2 Execution Aggregation of Results After the benchmark design is ready, we can call benchmark() on it: bmr = benchmark(design) Tip Note that we did not have to instantiate the resampling manually. benchmark_grid() took care of it for us: each resampling strategy is instantiated once for each task during the construction of the exhaustive grid. This way, each learner operates on the same training and test sets which makes the results easier to compare. Once the benchmarking is finished (and, depending on the size of your design, this can take quite some time), we can aggregate the performance with the $aggregate() method of the returned BenchmarkResult. We create two measures to calculate the area under the curve (AUC) for the training and the test set:

measures = list(
msr("classif.auc", predict_sets = "train", id = "auc_train"),
msr("classif.auc", id = "auc_test")
)

tab = bmr$aggregate(measures) print(tab[, .(task_id, learner_id, auc_train, auc_test)])  task_id learner_id auc_train auc_test 1: spam classif.ranger 0.9994318 0.9860031 2: spam classif.rpart 0.9152765 0.9061875 3: spam classif.featureless 0.5000000 0.5000000 4: german_credit classif.ranger 0.9984109 0.8055811 5: german_credit classif.rpart 0.8073881 0.7110071 6: german_credit classif.featureless 0.5000000 0.5000000 7: sonar classif.ranger 1.0000000 0.9142603 8: sonar classif.rpart 0.8953818 0.7588270 9: sonar classif.featureless 0.5000000 0.5000000 We can aggregate the results even further. For example, we might be interested to know which learner performed best across all tasks. Simply aggregating the performances with the mean is usually not statistically sound. Instead, we calculate the rank statistic for each learner, grouped by task. Then the calculated ranks, grouped by the learner, are aggregated with the data.table package. As larger AUC scores are better, we multiply the values by $$-1$$ such that the best learner has a rank of $$1$$. library("data.table") # group by levels of task_id, return columns: # - learner_id # - rank of col '-auc_train' (per level of learner_id) # - rank of col '-auc_test' (per level of learner_id) ranks = tab[, .(learner_id, rank_train = rank(-auc_train), rank_test = rank(-auc_test)), by = task_id] print(ranks)  task_id learner_id rank_train rank_test 1: spam classif.ranger 1 1 2: spam classif.rpart 2 2 3: spam classif.featureless 3 3 4: german_credit classif.ranger 1 1 5: german_credit classif.rpart 2 2 6: german_credit classif.featureless 3 3 7: sonar classif.ranger 1 1 8: sonar classif.rpart 2 2 9: sonar classif.featureless 3 3 # group by levels of learner_id, return columns: # - mean rank of col 'rank_train' (per level of learner_id) # - mean rank of col 'rank_test' (per level of learner_id) ranks = ranks[, .(mrank_train = mean(rank_train), mrank_test = mean(rank_test)), by = learner_id] # print the final table, ordered by mean rank of AUC test ranks[order(mrank_test)]  learner_id mrank_train mrank_test 1: classif.ranger 1 1 2: classif.rpart 2 2 3: classif.featureless 3 3 Unsurprisingly, the featureless learner is worse overall. The winner is the classification forest, which outperforms a single classification tree. ### 3.3.3 Plotting Benchmark Results Similar to tasks, predictions, or resample results, mlr3viz also provides a autoplot() method for benchmark results. autoplot(bmr) + ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1)) Such a plot gives a nice overview of the overall performance and how learners compare on different tasks in an intuitive way. We can also plot ROC (receiver operating characteristics) curves. We filter the BenchmarkResult to only contain a single Task, then we simply plot the result: bmr_small = bmr$clone(deep = TRUE)$filter(task_id = "german_credit") autoplot(bmr_small, type = "roc") All available plot types are listed on the manual page of autoplot.BenchmarkResult(). ### 3.3.4 Statistical Tests The package mlr3benchmark provides some infrastructure for applying statistical significance tests on BenchmarkResult. Currently, Friedman tests and pairwise Friedman-Nemenyi tests are supported for benchmarks with at least two tasks and at least two learners. The results can be summarized in Critical Difference Plots. library("mlr3benchmark") bma = as.BenchmarkAggr(bmr, measures = msr("classif.auc")) bma$friedman_posthoc()

Pairwise comparisons using Nemenyi-Wilcoxon-Wilcox all-pairs test for a two-way balanced complete block design
data: auc and learner_id and task_id
            ranger rpart
rpart       0.438  -
featureless 0.038  0.438

P value adjustment method: single-step
autoplot(bma, type = "cd") ### 3.3.5 Extracting ResampleResults

A BenchmarkResult object is essentially a collection of multiple ResampleResult objects. As these are stored in a column of the aggregated data.table(), we can easily extract them:

tab = bmr$aggregate(measures) rr = tab[task_id == "german_credit" & learner_id == "classif.ranger"]$resample_result[]
print(rr)
<ResampleResult> of 3 iterations
* Learner: classif.ranger
* Warnings: 0 in 0 iterations
* Errors: 0 in 0 iterations

We can now investigate this resampling and even single resampling iterations using one of the approaches shown in the previous section on resampling:

measure = msr("classif.auc")
rr$aggregate(measure) classif.auc 0.8055811  # get the iteration with worst AUC perf = rr$score(measure)
i = which.min(perf$classif.auc) # get the corresponding learner and training set print(rr$learners[[i]])
<LearnerClassifRanger:classif.ranger>
* Model: -
* Packages: mlr3, mlr3learners, ranger
* Predict Types:  response, [prob]
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_backward, importance, multiclass, oob_error,
twoclass, weights
head(rr$resampling$train_set(i))
  2  5  7  8  9 10

### 3.3.6 Converting and Merging

A ResampleResult can be converted to a BenchmarkResult with the function as_benchmark_result(). We can also combine two BenchmarkResults into a larger result object, for example, for two related benchmarks that are computed on different machines.

task = tsk("iris")

rr1 = resample(task, lrn("classif.rpart"), resampling)
rr2 = resample(task, lrn("classif.featureless"), resampling)

# Cast both ResampleResults to BenchmarkResults
bmr1 = as_benchmark_result(rr1)
bmr2 = as_benchmark_result(rr2)

# Merge 2nd BMR into the first BMR
bmr = c(bmr1, bmr2)
print(bmr)
<BenchmarkResult> of 2 rows with 2 resampling runs
nr task_id          learner_id resampling_id iters warnings errors
1    iris       classif.rpart       holdout     1        0      0
2    iris classif.featureless       holdout     1        0      0