8  Benchmarking

Comparing the performance of different learners on multiple tasks and/or with different resampling schemes is a recurring task in machine learning, usually referred to as “benchmarking”. The mlr3 package offers the benchmark() convenience function that takes care of most of the work of repeatedly training and evaluating models under the same conditions.

8.1 Design Creation

Benchmark experiments in mlr3 are specified through a design. Such a design is essentially a table of scenarios to be evaluated, i.e., unique triplets of Task, Learner and Resampling.

We use the benchmark_grid() function to create an exhaustive design (which evaluates each learner on each task with each resampling) and to instantiate the resampling properly, so that all learners are evaluated on the same train/test splits for each task. We set the learners to predict probabilities and also tell them to predict for the observations of the training set (by setting predict_sets to c("train", "test")). Additionally, we use tsks(), lrns(), and rsmps() to retrieve lists of Task, Learner and Resampling objects.

library("mlr3verse")

design = benchmark_grid(
  tasks = tsks(c("spam", "german_credit", "sonar")),
  learners = lrns(c("classif.ranger", "classif.rpart", "classif.featureless"),
    predict_type = "prob", predict_sets = c("train", "test")),
  resamplings = rsmps("cv", folds = 3)
)
print(design)
                task                         learner         resampling
1: <TaskClassif[50]>      <LearnerClassifRanger[38]> <ResamplingCV[20]>
2: <TaskClassif[50]>       <LearnerClassifRpart[38]> <ResamplingCV[20]>
3: <TaskClassif[50]> <LearnerClassifFeatureless[38]> <ResamplingCV[20]>
4: <TaskClassif[50]>      <LearnerClassifRanger[38]> <ResamplingCV[20]>
5: <TaskClassif[50]>       <LearnerClassifRpart[38]> <ResamplingCV[20]>
6: <TaskClassif[50]> <LearnerClassifFeatureless[38]> <ResamplingCV[20]>
7: <TaskClassif[50]>      <LearnerClassifRanger[38]> <ResamplingCV[20]>
8: <TaskClassif[50]>       <LearnerClassifRpart[38]> <ResamplingCV[20]>
9: <TaskClassif[50]> <LearnerClassifFeatureless[38]> <ResamplingCV[20]>

The created design can be passed to benchmark() to start the computation. It is also possible to create a custom design manually, for example to exclude certain task-learner combinations.

Danger

However, if you create a custom design with data.table(), the train/test splits will be different for each row of the design unless you manually instantiate the resampling before creating the design; see the sketch below and the help page of benchmark_grid() for examples.
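
For illustration, here is a minimal sketch of such a manual design (the task and learner choices are arbitrary); the resampling is instantiated once and the same instantiated object is reused in every row, so all rows share the same splits:

task = tsk("german_credit")
resampling = rsmp("cv", folds = 3)$instantiate(task)

# hand-built design: one row per task/learner/resampling combination
design_manual = data.table::data.table(
  task = list(task, task),
  learner = list(lrn("classif.rpart"), lrn("classif.featureless")),
  resampling = list(resampling, resampling)
)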

8.2 Execution and Aggregation of Results

After the benchmark design is ready, we can call benchmark() on it:

bmr = benchmark(design)
INFO  [21:32:11.740] [mlr3] Running benchmark with 27 resampling iterations 
INFO  [21:32:11.840] [mlr3] Applying learner 'classif.rpart' on task 'spam' (iter 2/3) 
INFO  [21:32:11.959] [mlr3] Applying learner 'classif.featureless' on task 'spam' (iter 3/3) 
INFO  [21:32:11.993] [mlr3] Applying learner 'classif.rpart' on task 'german_credit' (iter 2/3) 
INFO  [21:32:12.046] [mlr3] Applying learner 'classif.ranger' on task 'spam' (iter 1/3) 
INFO  [21:32:15.395] [mlr3] Applying learner 'classif.ranger' on task 'sonar' (iter 1/3) 
INFO  [21:32:15.529] [mlr3] Applying learner 'classif.ranger' on task 'sonar' (iter 3/3) 
INFO  [21:32:15.665] [mlr3] Applying learner 'classif.ranger' on task 'german_credit' (iter 2/3) 
INFO  [21:32:16.066] [mlr3] Applying learner 'classif.rpart' on task 'spam' (iter 1/3) 
INFO  [21:32:16.212] [mlr3] Applying learner 'classif.rpart' on task 'german_credit' (iter 3/3) 
INFO  [21:32:16.263] [mlr3] Applying learner 'classif.rpart' on task 'sonar' (iter 2/3) 
INFO  [21:32:16.308] [mlr3] Applying learner 'classif.ranger' on task 'german_credit' (iter 1/3) 
INFO  [21:32:16.709] [mlr3] Applying learner 'classif.ranger' on task 'spam' (iter 2/3) 
INFO  [21:32:18.915] [mlr3] Applying learner 'classif.ranger' on task 'spam' (iter 3/3) 
INFO  [21:32:21.174] [mlr3] Applying learner 'classif.rpart' on task 'spam' (iter 3/3) 
INFO  [21:32:21.520] [mlr3] Applying learner 'classif.featureless' on task 'spam' (iter 2/3) 
INFO  [21:32:21.550] [mlr3] Applying learner 'classif.rpart' on task 'sonar' (iter 1/3) 
INFO  [21:32:21.592] [mlr3] Applying learner 'classif.featureless' on task 'sonar' (iter 1/3) 
INFO  [21:32:21.612] [mlr3] Applying learner 'classif.ranger' on task 'german_credit' (iter 3/3) 
INFO  [21:32:21.969] [mlr3] Applying learner 'classif.featureless' on task 'sonar' (iter 2/3) 
INFO  [21:32:21.990] [mlr3] Applying learner 'classif.featureless' on task 'german_credit' (iter 3/3) 
INFO  [21:32:22.011] [mlr3] Applying learner 'classif.featureless' on task 'sonar' (iter 3/3) 
INFO  [21:32:22.031] [mlr3] Applying learner 'classif.rpart' on task 'sonar' (iter 3/3) 
INFO  [21:32:22.077] [mlr3] Applying learner 'classif.ranger' on task 'sonar' (iter 2/3) 
INFO  [21:32:22.218] [mlr3] Applying learner 'classif.featureless' on task 'german_credit' (iter 2/3) 
INFO  [21:32:22.240] [mlr3] Applying learner 'classif.rpart' on task 'german_credit' (iter 1/3) 
INFO  [21:32:22.287] [mlr3] Applying learner 'classif.featureless' on task 'spam' (iter 1/3) 
INFO  [21:32:22.318] [mlr3] Applying learner 'classif.featureless' on task 'german_credit' (iter 1/3) 
INFO  [21:32:22.346] [mlr3] Finished benchmark 
Tip

Note that we did not have to instantiate the resampling manually. benchmark_grid() took care of it for us: each resampling strategy is instantiated once for each task during the construction of the exhaustive grid.
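
To convince yourself, you can inspect the design directly. The following check is only a sketch that assumes the row order printed above (rows 1 and 2 both refer to the spam task); rows sharing a task should report the same fold assignment:

# the resampling objects stored in the design are already instantiated
design$resampling[[1]]$is_instantiated

# rows 1 and 2 belong to the same task and thus share the same splits
identical(design$resampling[[1]]$instance, design$resampling[[2]]$instance)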

Once the benchmarking is done (and, depending on the size of your design, this can take quite some time), we can aggregate the performance with $aggregate(). We create two measures to calculate the area under the curve (AUC) for the training and the test set:

measures = list(
  msr("classif.auc", predict_sets = "train", id = "auc_train"),
  msr("classif.auc", id = "auc_test")
)

tab = bmr$aggregate(measures)
print(tab)
   nr      resample_result       task_id          learner_id resampling_id
1:  1 <ResampleResult[22]>          spam      classif.ranger            cv
2:  2 <ResampleResult[22]>          spam       classif.rpart            cv
3:  3 <ResampleResult[22]>          spam classif.featureless            cv
4:  4 <ResampleResult[22]> german_credit      classif.ranger            cv
5:  5 <ResampleResult[22]> german_credit       classif.rpart            cv
6:  6 <ResampleResult[22]> german_credit classif.featureless            cv
7:  7 <ResampleResult[22]>         sonar      classif.ranger            cv
8:  8 <ResampleResult[22]>         sonar       classif.rpart            cv
9:  9 <ResampleResult[22]>         sonar classif.featureless            cv
   iters auc_train  auc_test
1:     3 0.9994742 0.9852367
2:     3 0.9072487 0.9008022
3:     3 0.5000000 0.5000000
4:     3 0.9985898 0.7841484
5:     3 0.8153702 0.6931914
6:     3 0.5000000 0.5000000
7:     3 1.0000000 0.9208535
8:     3 0.9220016 0.7174544
9:     3 0.5000000 0.5000000

We can aggregate the results even further. For example, we might be interested in which learner performed best across all tasks. Simply aggregating the performance values with the mean is usually not statistically sound. Instead, we compute the rank of each learner within each task and then aggregate these ranks per learner with the data.table package. As larger AUC scores are better, we multiply the values by \(-1\) so that the best learner gets rank \(1\).

library("data.table")
# group by levels of task_id, return columns:
# - learner_id
# - rank of col '-auc_train' (rank 1 = best learner within the task)
# - rank of col '-auc_test' (rank 1 = best learner within the task)
ranks = tab[, .(learner_id, rank_train = rank(-auc_train), rank_test = rank(-auc_test)), by = task_id]
print(ranks)
         task_id          learner_id rank_train rank_test
1:          spam      classif.ranger          1         1
2:          spam       classif.rpart          2         2
3:          spam classif.featureless          3         3
4: german_credit      classif.ranger          1         1
5: german_credit       classif.rpart          2         2
6: german_credit classif.featureless          3         3
7:         sonar      classif.ranger          1         1
8:         sonar       classif.rpart          2         2
9:         sonar classif.featureless          3         3
# group by levels of learner_id, return columns:
# - mean of col 'rank_train' (average rank across tasks)
# - mean of col 'rank_test' (average rank across tasks)
ranks = ranks[, .(mrank_train = mean(rank_train), mrank_test = mean(rank_test)), by = learner_id]

# print the final table, ordered by mean rank of AUC test
ranks[order(mrank_test)]
            learner_id mrank_train mrank_test
1:      classif.ranger           1          1
2:       classif.rpart           2          2
3: classif.featureless           3          3

Unsurprisingly, the featureless learner performs worst overall. The winner is the random forest (classif.ranger), which outperforms a single classification tree on all three tasks.

8.3 Plotting Benchmark Results

Similar to tasks, predictions, or resample results, mlr3viz also provides an autoplot() method for benchmark results.

autoplot(bmr) + ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))

Such a plot gives an intuitive overview of the overall performance and of how the learners compare across the different tasks.
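
By default the boxplots show the default measure of the task type (the classification error for classification tasks); a different measure can be passed explicitly, for example the AUC used above (a brief sketch):

autoplot(bmr, measure = msr("classif.auc")) +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))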

We can also plot ROC (receiver operating characteristic) curves. To do so, we first filter the BenchmarkResult so that it only contains a single Task and then plot the result:

bmr_small = bmr$clone()$filter(task_id = "german_credit")
autoplot(bmr_small, type = "roc")

All available plot types are listed on the manual page of autoplot.BenchmarkResult().
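
For binary classification results such as the filtered object above, precision-recall curves can be drawn in the same way (a short sketch using the "prc" plot type documented there):

autoplot(bmr_small, type = "prc")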

8.4 Extracting ResampleResults

A BenchmarkResult object is essentially a collection of multiple ResampleResult objects. As these are stored in a column of the aggregated data.table(), we can easily extract them:

tab = bmr$aggregate(measures)
rr = tab[task_id == "german_credit" & learner_id == "classif.ranger"]$resample_result[[1]]
print(rr)
<ResampleResult> of 3 iterations
* Task: german_credit
* Learner: classif.ranger
* Warnings: 0 in 0 iterations
* Errors: 0 in 0 iterations

We can now investigate this resample result, and even single resampling iterations, using the approaches shown in the previous section on resampling:

measure = msr("classif.auc")
rr$aggregate(measure)
classif.auc 
  0.7841484 
# get the iteration with worst AUC
perf = rr$score(measure)
i = which.min(perf$classif.auc)

# get the corresponding learner and training set
print(rr$learners[[i]])
<LearnerClassifRanger:classif.ranger>
* Model: -
* Parameters: num.threads=1
* Packages: mlr3, mlr3learners, ranger
* Predict Type: prob
* Feature types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_backward, importance, multiclass, oob_error,
  twoclass, weights
head(rr$resampling$train_set(i))
[1]  2  7 13 15 16 18
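
To dig deeper into the worst-performing fold, we can also retrieve its predictions, for example to inspect the confusion matrix (a short sketch reusing the index i from above):

# predictions of the worst iteration on the test set
pred = rr$predictions()[[i]]
pred$confusion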

8.5 Converting and Merging

A ResampleResult can be converted to a BenchmarkResult with the function as_benchmark_result(). We can also merge two BenchmarkResult objects into a larger result object, for example to combine two related benchmarks that were run on different machines.

task = tsk("iris")
resampling = rsmp("holdout")$instantiate(task)

rr1 = resample(task, lrn("classif.rpart"), resampling)
INFO  [21:32:25.216] [mlr3] Applying learner 'classif.rpart' on task 'iris' (iter 1/1) 
rr2 = resample(task, lrn("classif.featureless"), resampling)
INFO  [21:32:25.306] [mlr3] Applying learner 'classif.featureless' on task 'iris' (iter 1/1) 
# Cast both ResampleResults to BenchmarkResults
bmr1 = as_benchmark_result(rr1)
bmr2 = as_benchmark_result(rr2)

# Merge 2nd BMR into the first BMR
bmr1$combine(bmr2)

bmr1
<BenchmarkResult> of 2 rows with 2 resampling runs
 nr task_id          learner_id resampling_id iters warnings errors
  1    iris       classif.rpart       holdout     1        0      0
  2    iris classif.featureless       holdout     1        0      0
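
The merged object behaves like any other BenchmarkResult; for example, it can be aggregated with a measure of our choice (classification accuracy here, purely as an illustration):

bmr1$aggregate(msr("classif.acc"))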