2.6 Benchmarking

Comparing the performance of different learners on multiple tasks and/or different resampling schemes is a recurring task in machine learning, commonly referred to as “benchmarking”. The mlr3 package offers the benchmark() function for this purpose.

2.6.1 Design Creation

In mlr3 we require you to supply a “design” for your benchmark experiment. By “design” we essentially mean the matrix of settings you want to execute: a table of unique Task, Learner and Resampling triplets.

Here, we set up a benchmark that performs a single holdout split on a single task with two learners. We use the benchmark_grid() function to create an exhaustive design and to instantiate the resampling properly, so that all learners are evaluated on the same train/test split for each task:
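
A minimal sketch of such a call (the concrete task used here is an illustrative assumption; only the learners and the resampling are visible in the printed design):

library(mlr3)

# two learners, one task, a single holdout split;
# benchmark_grid() builds the exhaustive design and instantiates the resampling
design = benchmark_grid(
  tasks = tsk("german_credit"),    # illustrative task choice
  learners = list(lrn("classif.rpart"), lrn("classif.featureless")),
  resamplings = rsmp("holdout")
)
print(design)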

##             task                     learner          resampling
## 1: <TaskClassif>       <LearnerClassifRpart> <ResamplingHoldout>
## 2: <TaskClassif> <LearnerClassifFeatureless> <ResamplingHoldout>

Instead of using benchmark_grid() you could also create the design manually as a data.table and use the full flexibility of the benchmark() function. The design does not have to be exhaustive; it can, for example, contain a different learner for each task. Note, however, that benchmark_grid() makes sure to instantiate the resamplings for each task. If you create the design manually and do not instantiate the resamplings yourself before building the design, the train/test splits will differ for each row of the design, even when the same task appears multiple times. A sketch of a manual design is shown below.
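
For illustration, a manually assembled design could look like the following sketch (the task/learner pairing is made up); the resamplings are instantiated up front so that rows sharing a task would also share splits:

library(mlr3)
library(data.table)

# instantiate one resampling per task first ...
task_credit = tsk("german_credit")
task_sonar  = tsk("sonar")
res_credit = rsmp("holdout")$instantiate(task_credit)
res_sonar  = rsmp("holdout")$instantiate(task_sonar)

# ... then assemble the design by hand, here with a different learner per task
design = data.table(
  task       = list(task_credit, task_sonar),
  learner    = list(lrn("classif.rpart"), lrn("classif.featureless")),
  resampling = list(res_credit, res_sonar)
)
# benchmark(design) would then execute this custom design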

Let’s construct a more complex design to show the full capabilities of the benchmark() function.
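
A sketch of such a grid, with the tasks, learners, folds and predict settings inferred from the aggregated results shown further below (german_credit and sonar, four learners, 3-fold cross-validation, probability predictions on both the train and the test set); classif.ranger and classif.kknn additionally require the mlr3learners package:

library(mlr3)
library(mlr3learners)   # provides classif.ranger and classif.kknn

tasks = lapply(c("german_credit", "sonar"), tsk)

# predict_type = "prob" and predict_sets = c("train", "test") are needed later
# to compute the AUC on both the training and the test set
learners = lapply(
  c("classif.featureless", "classif.rpart", "classif.ranger", "classif.kknn"),
  lrn, predict_type = "prob", predict_sets = c("train", "test")
)

design = benchmark_grid(tasks, learners, rsmp("cv", folds = 3))
print(design)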

##             task                     learner     resampling
## 1: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV>
## 2: <TaskClassif>       <LearnerClassifRpart> <ResamplingCV>
## 3: <TaskClassif>      <LearnerClassifRanger> <ResamplingCV>
## 4: <TaskClassif>        <LearnerClassifKKNN> <ResamplingCV>
## 5: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV>
## 6: <TaskClassif>       <LearnerClassifRpart> <ResamplingCV>
## 7: <TaskClassif>      <LearnerClassifRanger> <ResamplingCV>
## 8: <TaskClassif>        <LearnerClassifKKNN> <ResamplingCV>

2.6.2 Execution and Aggregation of Results

After the benchmark design is ready, we can directly call benchmark():
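
Assuming the grid created above is stored in design, the call is a one-liner:

bmr = benchmark(design)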

Note that we did not instantiate the resampling manually; benchmark_grid() took care of it for us: each resampling strategy is instantiated once per task during the construction of the exhaustive grid.

After the benchmark, we can calculate and aggregate the performance with $aggregate():
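
Judging from the auc_train and auc_test columns below, two AUC measures are used, one evaluated on the train and one on the test predictions; a sketch:

measures = list(
  msr("classif.auc", predict_sets = "train", id = "auc_train"),
  msr("classif.auc", id = "auc_test")
)
bmr$aggregate(measures)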

##    nr  resample_result       task_id          learner_id resampling_id iters
## 1:  1 <ResampleResult> german_credit classif.featureless            cv     3
## 2:  2 <ResampleResult> german_credit       classif.rpart            cv     3
## 3:  3 <ResampleResult> german_credit      classif.ranger            cv     3
## 4:  4 <ResampleResult> german_credit        classif.kknn            cv     3
## 5:  5 <ResampleResult>         sonar classif.featureless            cv     3
## 6:  6 <ResampleResult>         sonar       classif.rpart            cv     3
## 7:  7 <ResampleResult>         sonar      classif.ranger            cv     3
## 8:  8 <ResampleResult>         sonar        classif.kknn            cv     3
##    auc_train auc_test
## 1:    0.5000   0.5000
## 2:    0.8108   0.7136
## 3:    0.9986   0.7967
## 4:    0.9887   0.7104
## 5:    0.5000   0.5000
## 6:    0.9128   0.7687
## 7:    1.0000   0.9213
## 8:    0.9986   0.9260

Subsequently, we can aggregate the results further. For example, we might be interested in which learner performed best across all tasks. Simply averaging the performances with the mean is usually not statistically sound. Instead, we compute the rank statistic of each learner within each task and then aggregate these ranks per learner with data.table, as sketched below. Since the AUC is to be maximized, we multiply the scores by \(-1\) before ranking so that the best learner receives rank \(1\).
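
A sketch of this two-step aggregation with data.table, reusing the measures defined above:

library(data.table)

tab = bmr$aggregate(measures)
print(tab)

# rank the learners within each task; negating the AUC makes rank 1 the best
ranks = tab[, .(learner_id,
  rank_train = rank(-auc_train),
  rank_test  = rank(-auc_test)), by = task_id]
print(ranks)

# average the ranks per learner across tasks and sort by the test rank
ranks[, .(mrank_train = mean(rank_train),
  mrank_test = mean(rank_test)), by = learner_id][order(mrank_test)]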

##    nr  resample_result       task_id          learner_id resampling_id iters
## 1:  1 <ResampleResult> german_credit classif.featureless            cv     3
## 2:  2 <ResampleResult> german_credit       classif.rpart            cv     3
## 3:  3 <ResampleResult> german_credit      classif.ranger            cv     3
## 4:  4 <ResampleResult> german_credit        classif.kknn            cv     3
## 5:  5 <ResampleResult>         sonar classif.featureless            cv     3
## 6:  6 <ResampleResult>         sonar       classif.rpart            cv     3
## 7:  7 <ResampleResult>         sonar      classif.ranger            cv     3
## 8:  8 <ResampleResult>         sonar        classif.kknn            cv     3
##    auc_train auc_test
## 1:    0.5000   0.5000
## 2:    0.8108   0.7136
## 3:    0.9986   0.7967
## 4:    0.9887   0.7104
## 5:    0.5000   0.5000
## 6:    0.9128   0.7687
## 7:    1.0000   0.9213
## 8:    0.9986   0.9260
##          task_id          learner_id rank_train rank_test
## 1: german_credit classif.featureless          4         4
## 2: german_credit       classif.rpart          3         2
## 3: german_credit      classif.ranger          1         1
## 4: german_credit        classif.kknn          2         3
## 5:         sonar classif.featureless          4         4
## 6:         sonar       classif.rpart          3         3
## 7:         sonar      classif.ranger          1         2
## 8:         sonar        classif.kknn          2         1
##             learner_id mrank_train mrank_test
## 1:      classif.ranger           1        1.5
## 2:        classif.kknn           2        2.0
## 3:       classif.rpart           3        2.5
## 4: classif.featureless           4        4.0

Unsurprisingly, the featureless learner is outperformed on both the training and the test set.

2.6.3 Plotting Benchmark Results

Analogously to plotting tasks, predictions or resample results, mlr3viz also provides an autoplot() method for benchmark results.
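
A sketch, assuming the benchmark result from above is stored in bmr; without further arguments, autoplot() draws boxplots of the resampled performance values grouped by task and learner:

library(mlr3viz)

autoplot(bmr)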

We can also plot ROC curves. To do so, we first need to filter the BenchmarkResult to only contain a single Task:
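
A sketch of this filtering step; the chosen task is illustrative, $filter() modifies the object in place (hence the clone), and the exact argument name (task_ids vs. task_id) may depend on the installed mlr3 version:

library(mlr3viz)

# keep only one task, then request the ROC plot
bmr_sonar = bmr$clone(deep = TRUE)$filter(task_ids = "sonar")
autoplot(bmr_sonar, type = "roc")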

All available types are listed on the manual page of autoplot.BenchmarkResult().

2.6.4 Extracting ResampleResults

A BenchmarkResult object is essentially a collection of multiple ResampleResult objects. As these are stored in a column of the aggregated data.table, we can easily extract them:
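
A sketch of the extraction via data.table subsetting of the aggregated table; the selected row corresponds to the printout below:

tab = bmr$aggregate(measures)

# pick the ResampleResult of the ranger learner on the sonar task
rr = tab[task_id == "sonar" & learner_id == "classif.ranger"]$resample_result[[1]]
print(rr)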

## <ResampleResult> of 3 iterations
## * Task: sonar
## * Learner: classif.ranger
## * Warnings: 0 in 0 iterations
## * Errors: 0 in 0 iterations

We can now investigate this resampling and even single resampling iterations using one of the approaches shown in the previous section:
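
For example, the following sketch re-computes the aggregated AUC, inspects the learner of a single iteration (models are not stored by default, hence “Model: -” below), and looks at the first row ids of its training set; selecting the iteration via which.min() is an illustrative choice:

measure = msr("classif.auc")
rr$aggregate(measure)

# score each iteration and pick the one with the lowest AUC
perf = rr$score(measure)
i = which.min(perf$classif.auc)

# learner used in that iteration ...
print(rr$learners[[i]])

# ... and the first row ids of the corresponding training set
head(rr$resampling$train_set(i))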

## classif.auc 
##      0.9213
## <LearnerClassifRanger:classif.ranger>
## * Model: -
## * Parameters: list()
## * Packages: ranger
## * Predict Type: prob
## * Feature types: logical, integer, numeric, character, factor, ordered
## * Properties: importance, multiclass, oob_error, twoclass, weights
## [1]  1  4 10 14 17 19

2.6.5 Converting and Merging ResampleResults

It is also possible to cast a single ResampleResult to a BenchmarkResult using the converter as_benchmark_result().
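
A sketch, assuming two separate resample() runs on the same instantiated holdout split; after converting each ResampleResult, the two BenchmarkResult objects are merged with $combine():

task = tsk("iris")
resampling = rsmp("holdout")$instantiate(task)

rr1 = resample(task, lrn("classif.rpart"), resampling)
rr2 = resample(task, lrn("classif.featureless"), resampling)

# convert each ResampleResult and merge them into a single BenchmarkResult
bmr_new = as_benchmark_result(rr1)
bmr_new$combine(as_benchmark_result(rr2))
print(bmr_new)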

## <BenchmarkResult> of 2 rows with 2 resampling runs
##  nr task_id          learner_id resampling_id iters warnings errors
##   1    iris       classif.rpart       holdout     1        0      0
##   2    iris classif.featureless       holdout     1        0      0