# 3 Performance Evaluation and Comparison

Now that we are familiar with the basics of how to create tasks and learners, how to fit models, and do some basic performance evaluation, let’s have a look at some of the details, and in particular how mlr3 makes it easy to perform many common machine learning steps.

We will cover the following topics:

Binary classification and ROC curves

Binary classification is a special case of classification where the target variable to predict has only two possible values. In this case, additional considerations apply; in particular ROC curves and the threshold of where to predict one class versus the other.

Resampling

A resampling is a method to create training and test splits. We cover how to

Additional information on resampling can be found in the section about nested resampling and in the chapter on model optimization.

Benchmarking

Benchmarking is used to compare the performance of different models, for example models trained with different learners, on different tasks, or with different resampling methods. This is usually done to get an overview of how different methods perform across different tasks. We cover how to

## 3.1 ROC Curve and Thresholds

As we have seen before, binary classification is special because of the presence of a positive and negative class and a threshold probability to ditinguish between the two. ROC Analysis, which stands for receiver operating characteristics, applies specifically to this case and allows to get a better picture of the trade offs when choosing between the two classes. We saw earlier that one can retrieve the confusion matrix of a Prediction by accessing the $confusion field: data("Sonar", package = "mlbench") task = as_task_classif(Sonar, target = "Class", positive = "M") learner = lrn("classif.rpart", predict_type = "prob") pred = learner$train(task)$predict(task) C = pred$confusion
print(C)
##         truth
## response  M  R
##        M 95 10
##        R 16 87

The upper left quandrant denotes the number of times our model predicted the positive class and was correct about it. Similarly, the lower right quadrant denotes the number of times our model predicted the negative class and was also correct about it. Together, the elements on the diagonal are called True Positives (TP) and True Negatives (TN). The upper right quadrant denotes the number of times we falsely predicted a positive label, and is called False Positives (FP). The lower left quadrant is called False Negatives (FN).

We can derive the following performance metrics from the confusion matrix:

• True Positive Rate (TPR): How many of the true positives did we predict as positive?
• True Negative Rate (TNR): How many of the true negatives did we predict as negative?
• Positive Predictive Value PPV: If we predict positive how likely is it a true positive?
• Negative Predictive Value NPV: If we predict negative how likely is it a true negative?

It is difficult to achieve a high TPR and low FPR at the same time. We can characterize the behavior of a binary classifier for different thresholds by plotting the TPR and FPR values – this is the ROC curve. The best classifier lies on the top-left corner – the TPR is 1 and the FPR is 0. The worst classifier lies at the diagonal. Classifiers on the diagonal produce labels essentially randomly (possibly with different proportions). For example, if each positive $$x$$ will be randomly classified with 25% as “positive”, we get a TPR of 0.25. If we assign each negative $$x$$ randomly to “positive” we get a FPR of 0.25. In practice, we should never obtain a classifier below the diagonal – ROC curves for different labels are symmetric with respect to the diagonal, so a curve below the diagonal would indicate that the positive and negative class labels have been switched by the classifier.

For mlr3 prediction objects, the ROC curve can easily be created with mlr3viz which relies on the precrec to calculate and plot ROC curves:

library("mlr3viz")

# TPR vs FPR / Sensitivity vs (1 - Specificity)
autoplot(pred, type = "roc") We can also plot the precision-recall curve (PPV vs. TPR). The main difference between ROC curves and precision-recall curves (PRC) is that the number of true-negative results is not used for making a PRC. PRCs are preferred over ROC curves for imbalanced populations.

# Precision vs Recall
autoplot(pred, type = "prc") ## 3.2 Resampling

When evaluating the performance of a model, we are interested in its generalization performance – how well will it perform on new data that has not been seen during training? We can estimate the generalization performance by evaluating a model on a test set, as we have done above, that was created to contain only observations that are not contained in the training set. There are many different strategies for partitioning a data set into training and test; in mlr3 we call these strategies “resampling”. mlr3 includes the following predefined resampling strategies:

In particular, it is often desirable to repeatedly split the entire data in different ways to ensure that a “lucky” or “unlucky” split does not bias the generalization performance estimate. Without resampling strategies like the ones we provide here, this is a tedious and error-prone process.

The following sections provide guidance on how to select a resampling strategy and how to use it.

Here is a graphical illustration of the resampling process in general: ### 3.2.1 Settings

We will use the penguins task and a simple classification tree from the rpart package as an example here.

library("mlr3verse")

learner = lrn("classif.rpart")

When performing resampling with a dataset, we first need to define which approach should be used. mlr3 resampling strategies and their parameters can be queried by looking at the data.table output of the mlr_resamplings dictionary; this also lists the parameters that can be changed to affect the behavior of each strategy:

as.data.table(mlr_resamplings)
##            key                         label        params iters
## 1:   bootstrap                     Bootstrap ratio,repeats    30
## 2:      custom                 Custom Splits                  NA
## 3:   custom_cv Custom Split Cross-Validation                  NA
## 4:          cv              Cross-Validation         folds    10
## 5:     holdout                       Holdout         ratio     1
## 6:    insample           Insample Resampling                   1
## 7:         loo                 Leave-One-Out                  NA
## 8: repeated_cv     Repeated Cross-Validation folds,repeats   100
## 9: subsampling                   Subsampling ratio,repeats    30

Additional resampling methods for special use cases are available via extension packages, such as mlr3spatiotemporal for spatial data.

What we showed in the train/predict/score part is the equivalent of holdout resampling, done manually, so let’s consider this one first. We can retrieve elements from the dictionary mlr_resamplings via $get() or the convenience function rsmp(): resampling = rsmp("holdout") print(resampling) ## <ResamplingHoldout>: Holdout ## * Iterations: 1 ## * Instantiated: FALSE ## * Parameters: ratio=0.6667 Note that the $is_instantiated field is set to FALSE. This means we did not actually apply the strategy to a dataset yet.

By default we get a .66/.33 split of the data into training and test. There are two ways in which this ratio can be changed:

1. Overwriting the slot in $param_set$values using a named list:
resampling$param_set$values = list(ratio = 0.8)
1. Specifying the resampling parameters directly during construction:
rsmp("holdout", ratio = 0.8)
## <ResamplingHoldout>: Holdout
## * Iterations: 1
## * Instantiated: FALSE
## * Parameters: ratio=0.8

### 3.2.2 Instantiation

So far we have only chosen a resampling strategy; we now need to instantiate it with data.

To actually perform the splitting and obtain indices for the training and the test split, the resampling needs a Task. By calling the method instantiate(), we split the indices of the data into indices for training and test sets. These resulting indices are stored in the Resampling objects.

resampling$instantiate(task) str(resampling$train_set(1))
##  int [1:275] 1 2 3 4 5 6 7 9 10 11 ...
str(resampling$test_set(1)) ## int [1:69] 8 18 28 32 34 37 39 42 59 60 ... Note that if you want to compare multiple Learners in a fair manner, using the same instantiated resampling for each learner is mandatory, such that each learner gets exactly the same training data and the performance of the trained model is evaluated in exactly the same test set. A way to greatly simplify the comparison of multiple learners is discussed in the section on benchmarking. ### 3.2.3 Execution With a Task, a Learner, and a Resampling object we can call resample(), which fits the learner on the training set and evaluates it on the test set. This may happen multiple times, depending on the given resampling strategy. The result of running the resample() function is a ResampleResult object. We can tell resample() to keep the fitted models (for example for later inspection) by setting the store_models option to TRUEand then starting the computation: task = tsk("penguins") learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob") resampling = rsmp("cv", folds = 3) rr = resample(task, learner, resampling, store_models = TRUE) print(rr) ## <ResampleResult> of 3 iterations ## * Task: penguins ## * Learner: classif.rpart ## * Warnings: 0 in 0 iterations ## * Errors: 0 in 0 iterations Here we use a three-fold cross-validation resampling, which trains and evaluates on three different training and test sets. The returned ResampleResult, stored as rr, provides various getters to access and aggregate the stored information. Here are a few examples: • Calculate the average performance across all resampling iterations, in terms of classification error: rr$aggregate(msr("classif.ce"))
## classif.ce
##     0.0552
• Extract the performance for the individual resampling iterations:

rr$score(msr("classif.ce")) ## task task_id learner learner_id ## 1: <TaskClassif> penguins <LearnerClassifRpart> classif.rpart ## 2: <TaskClassif> penguins <LearnerClassifRpart> classif.rpart ## 3: <TaskClassif> penguins <LearnerClassifRpart> classif.rpart ## resampling resampling_id iteration prediction ## 1: <ResamplingCV> cv 1 <PredictionClassif> ## 2: <ResamplingCV> cv 2 <PredictionClassif> ## 3: <ResamplingCV> cv 3 <PredictionClassif> ## classif.ce ## 1: 0.06087 ## 2: 0.06087 ## 3: 0.04386 This is useful to check if one (or more) of the iterations are very different from the average. • Check for warnings or errors: rr$warnings
## Empty data.table (0 rows and 2 cols): iteration,msg
rr$errors ## Empty data.table (0 rows and 2 cols): iteration,msg • Extract and inspect the resampling splits; this allows to see in detail which observations were used for what purpose when: rr$resampling
## <ResamplingCV>: Cross-Validation
## * Iterations: 3
## * Instantiated: TRUE
## * Parameters: folds=3
rr$resampling$iters
##  3
str(rr$resampling$test_set(1))
##  int [1:115] 1 3 4 6 8 10 15 17 18 23 ...
str(rr$resampling$train_set(1))
##  int [1:229] 2 5 9 11 19 20 21 29 30 35 ...
• Retrieve the model trained in a specific iteration and inspect it, for example to investigate why the performance in this iteration was very different from the average:

lrn = rr$learners[] lrn$model
## n= 229
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
## 1) root 229 132 Adelie (0.42358 0.21397 0.36245)
##   2) flipper_length< 207.5 142  47 Adelie (0.66901 0.33099 0.00000)
##     4) bill_length< 43.35 97   4 Adelie (0.95876 0.04124 0.00000) *
##     5) bill_length>=43.35 45   2 Chinstrap (0.04444 0.95556 0.00000) *
##   3) flipper_length>=207.5 87   4 Gentoo (0.02299 0.02299 0.95402) *
• Extract the individual predictions:

rr$prediction() # all predictions merged into a single Prediction object ## <PredictionClassif> for 344 observations: ## row_ids truth response prob.Adelie prob.Chinstrap prob.Gentoo ## 1 Adelie Adelie 0.95876 0.04124 0.00000 ## 3 Adelie Adelie 0.95876 0.04124 0.00000 ## 4 Adelie Adelie 0.95876 0.04124 0.00000 ## --- ## 335 Chinstrap Chinstrap 0.09091 0.88636 0.02273 ## 339 Chinstrap Chinstrap 0.09091 0.88636 0.02273 ## 342 Chinstrap Chinstrap 0.09091 0.88636 0.02273 rr$predictions()[] # predictions of first resampling iteration
## <PredictionClassif> for 115 observations:
##     row_ids     truth  response prob.Adelie prob.Chinstrap prob.Gentoo
## ---
##         337 Chinstrap Chinstrap     0.04444        0.95556       0.000
##         338 Chinstrap Chinstrap     0.04444        0.95556       0.000
##         343 Chinstrap    Gentoo     0.02299        0.02299       0.954
• Filter the result to only keep specified resampling iterations:

rr$filter(c(1, 3)) print(rr) ## <ResampleResult> of 2 iterations ## * Task: penguins ## * Learner: classif.rpart ## * Warnings: 0 in 0 iterations ## * Errors: 0 in 0 iterations ### 3.2.4 Custom resampling Sometimes it is necessary to perform resampling with custom splits, e.g. to reproduce results reported in a study. A manual resampling instance can be created using the "custom" template. resampling = rsmp("custom") resampling$instantiate(task,
train = list(c(1:10, 51:60, 101:110)),
test = list(c(11:20, 61:70, 111:120))
)
resampling$iters ##  1 resampling$train_set(1)
##     1   2   3   4   5   6   7   8   9  10  51  52  53  54  55  56  57  58  59
##   60 101 102 103 104 105 106 107 108 109 110
resampling$test_set(1) ##  11 12 13 14 15 16 17 18 19 20 61 62 63 64 65 66 67 68 69 ##  70 111 112 113 114 115 116 117 118 119 120 ### 3.2.5 Resampling with (predefined) groups In some cases, it is desirable to keep observations together, i.e. to not separate them into training and test set. This can be defined through the column role "group" during Task creation, i.e. a special column in the data specifies the groups (see also the help page on this column role). In mlr this was called “blocking”. See also the mlr3gallery post on this topic for a practical example. ### 3.2.6 Plotting Resample Results mlr3viz provides a autoplot() method for resampling results. As an example, we create a binary classification task with two features, perform a resampling with a 10-fold cross-validation and visualize the results: task = tsk("pima") task$select(c("glucose", "mass"))
learner = lrn("classif.rpart", predict_type = "prob")
rr = resample(task, learner, rsmp("cv"), store_models = TRUE)

# boxplot of AUC values across the 10 folds
autoplot(rr, measure = msr("classif.auc")) # ROC curve, averaged over 10 folds
autoplot(rr, type = "roc") We can also plot the predictions of individual models:

# learner predictions for the first fold
rr$filter(1) autoplot(rr, type = "prediction") ## Warning: Removed 4 rows containing missing values (geom_point). All available plot types are listed on the manual page of autoplot.ResampleResult(). ## 3.3 Benchmarking Comparing the performance of different learners on multiple tasks and/or different resampling schemes is a common task. This operation is usually referred to as “benchmarking” in the field of machine learning. The mlr3 package offers the benchmark() convenience function that takes care of most of the work of repeatedly training and evaluating models under the same conditions. ### 3.3.1 Design Creation Benchmark experiments in mlr3 are specified through a design. Such a design is essentially a table of scenarios to be evaluated; in particular unique combinations of Task, Learner and Resampling triplets. We use the benchmark_grid() function to create an exhaustive design (that evaluates each learner on each task with each resampling) and instantiate the resampling properly, so that all learners are executed on the same train/test split for each tasks. We set the learners to predict probabilities and also tell them to predict for the observations of the training set (by setting predict_sets to c("train", "test")). Additionally, we use tsks(), lrns(), and rsmps() to retrieve lists of Task, Learner and Resampling in the same fashion as tsk(), lrn() and rsmp(). library("mlr3verse") design = benchmark_grid( tasks = tsks(c("spam", "german_credit", "sonar")), learners = lrns(c("classif.ranger", "classif.rpart", "classif.featureless"), predict_type = "prob", predict_sets = c("train", "test")), resamplings = rsmps("cv", folds = 3) ) print(design) ## task learner resampling ## 1: <TaskClassif> <LearnerClassifRanger> <ResamplingCV> ## 2: <TaskClassif> <LearnerClassifRpart> <ResamplingCV> ## 3: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV> ## 4: <TaskClassif> <LearnerClassifRanger> <ResamplingCV> ## 5: <TaskClassif> <LearnerClassifRpart> <ResamplingCV> ## 6: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV> ## 7: <TaskClassif> <LearnerClassifRanger> <ResamplingCV> ## 8: <TaskClassif> <LearnerClassifRpart> <ResamplingCV> ## 9: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV> The created design can be passed to benchmark() to start the computation. It is also possible to create a custom design manually, for example to exclude certain task-learner combinations. However, if you create a custom task with data.table(), the train/test splits will be different for each row of the design if you do not manually instantiate the resampling before creating the design. See the help page on benchmark_grid() for an example. ### 3.3.2 Execution and Aggregation of Results After the benchmark design is ready, we can call benchmark() on it: bmr = benchmark(design) Note that we did not have to instantiate the resampling manually. benchmark_grid() took care of it for us: each resampling strategy is instantiated once for each task during the construction of the exhaustive grid. Once the benchmarking is done (and, depending on the size of your design, this can take quite some time), we can aggregate the performance with $aggregate(). We create two measures to calculate the area under the curve (AUC) for the training and the test set:

measures = list(
msr("classif.auc", predict_sets = "train", id = "auc_train"),
msr("classif.auc", id = "auc_test")
)

tab = bmr$aggregate(measures) print(tab) ## nr resample_result task_id learner_id resampling_id ## 1: 1 <ResampleResult> spam classif.ranger cv ## 2: 2 <ResampleResult> spam classif.rpart cv ## 3: 3 <ResampleResult> spam classif.featureless cv ## 4: 4 <ResampleResult> german_credit classif.ranger cv ## 5: 5 <ResampleResult> german_credit classif.rpart cv ## 6: 6 <ResampleResult> german_credit classif.featureless cv ## 7: 7 <ResampleResult> sonar classif.ranger cv ## 8: 8 <ResampleResult> sonar classif.rpart cv ## 9: 9 <ResampleResult> sonar classif.featureless cv ## iters auc_train auc_test ## 1: 3 0.9995 0.9841 ## 2: 3 0.9040 0.8927 ## 3: 3 0.5000 0.5000 ## 4: 3 0.9984 0.7965 ## 5: 3 0.8100 0.7225 ## 6: 3 0.5000 0.5000 ## 7: 3 1.0000 0.9078 ## 8: 3 0.9223 0.7381 ## 9: 3 0.5000 0.5000 We can aggregate the results even further. For example, we might be interested to know which learner performed best across all tasks. Simply aggregating the performances with the mean is usually not statistically sound. Instead, we calculate the rank statistic for each learner, grouped by task. Then the calculated ranks, grouped by learner, are aggregated with the data.table package. As larger AUC scores are better, we multiply the values by $$-1$$ such that the best learner has a rank of $$1$$. library("data.table") # group by levels of task_id, return columns: # - learner_id # - rank of col '-auc_train' (per level of learner_id) # - rank of col '-auc_test' (per level of learner_id) ranks = tab[, .(learner_id, rank_train = rank(-auc_train), rank_test = rank(-auc_test)), by = task_id] print(ranks) ## task_id learner_id rank_train rank_test ## 1: spam classif.ranger 1 1 ## 2: spam classif.rpart 2 2 ## 3: spam classif.featureless 3 3 ## 4: german_credit classif.ranger 1 1 ## 5: german_credit classif.rpart 2 2 ## 6: german_credit classif.featureless 3 3 ## 7: sonar classif.ranger 1 1 ## 8: sonar classif.rpart 2 2 ## 9: sonar classif.featureless 3 3 # group by levels of learner_id, return columns: # - mean rank of col 'rank_train' (per level of learner_id) # - mean rank of col 'rank_test' (per level of learner_id) ranks = ranks[, .(mrank_train = mean(rank_train), mrank_test = mean(rank_test)), by = learner_id] # print the final table, ordered by mean rank of AUC test ranks[order(mrank_test)] ## learner_id mrank_train mrank_test ## 1: classif.ranger 1 1 ## 2: classif.rpart 2 2 ## 3: classif.featureless 3 3 Unsurprisingly, the featureless learner is worse overall. The winner is the classification forest, which outperforms a single classification tree. ### 3.3.3 Plotting Benchmark Results Similar to tasks, predictions, or resample results, mlr3viz also provides a autoplot() method for benchmark results. autoplot(bmr) + ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1)) Such a plot gives a nice overview of the overall performance and how learners compare on different tasks in an intuitive way. We can also plot ROC (receiver operating characteristics) curves. We filter the BenchmarkResult to only contain a single Task, then we simply plot the result: bmr_small = bmr$clone()$filter(task_id = "german_credit") autoplot(bmr_small, type = "roc") All available plot types are listed on the manual page of autoplot.BenchmarkResult(). ### 3.3.4 Extracting ResampleResults A BenchmarkResult object is essentially a collection of multiple ResampleResult objects. As these are stored in a column of the aggregated data.table(), we can easily extract them: tab = bmr$aggregate(measures)
rr = tab[task_id == "german_credit" & learner_id == "classif.ranger"]$resample_result[] print(rr) ## <ResampleResult> of 3 iterations ## * Task: german_credit ## * Learner: classif.ranger ## * Warnings: 0 in 0 iterations ## * Errors: 0 in 0 iterations We can now investigate this resampling and even single resampling iterations using one of the approaches shown in the previous section: measure = msr("classif.auc") rr$aggregate(measure)
## classif.auc
##      0.7965
# get the iteration with worst AUC
perf = rr$score(measure) i = which.min(perf$classif.auc)

# get the corresponding learner and training set
print(rr$learners[[i]]) ## <LearnerClassifRanger:classif.ranger> ## * Model: - ## * Parameters: num.threads=1 ## * Packages: mlr3, mlr3learners, ranger ## * Predict Type: prob ## * Feature types: logical, integer, numeric, character, factor, ordered ## * Properties: hotstart_backward, importance, multiclass, oob_error, ## twoclass, weights head(rr$resampling$train_set(i)) ##  1 4 6 9 11 12 ### 3.3.5 Converting and Merging A ResampleResult can be converted to a BenchmarkResult with the function as_benchmark_result(). We can also merge two BenchmarkResults into a larger result object, for example for two related benchmarks that were done on different machines. task = tsk("iris") resampling = rsmp("holdout")$instantiate(task)

bmr1
## <BenchmarkResult> of 2 rows with 2 resampling runs
##   2    iris classif.featureless       holdout     1        0      0