3.6 Benchmarking

Comparing the performance of different learners on multiple tasks and/or different resampling schemes is a recurring task. This operation is usually referred to as "benchmarking" in the field of machine learning. mlr3 offers the benchmark() function for this purpose.

3.6.1 Design Creation

In mlr3 we require you to supply a "design" of your benchmark experiment. By "design" we essentially mean the table of settings you want to execute: each row combines a Task, a Learner and a Resampling. Additionally, you can supply one or more Measure objects alongside.

Here, we assemble such a design manually as a data.table with list columns and call benchmark() to perform a single holdout split on a single task with two learners:
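
To make this concrete, here is a minimal sketch of such a manually assembled design. The task ("sonar"), the sugar functions tsk(), lrn() and rsmp(), and the automatic per-row instantiation by benchmark() are assumptions based on the description above; details may differ between mlr3 versions.

library(mlr3)
library(data.table)

# a binary classification task and two learners (the task choice is arbitrary)
task = tsk("sonar")
learners = list(lrn("classif.rpart"), lrn("classif.featureless"))

# one design row per learner, each with its own (not yet instantiated) holdout
design = data.table(
  task       = list(task, task),
  learner    = learners,
  resampling = list(rsmp("holdout"), rsmp("holdout"))
)

# benchmark() instantiates each row's resampling on the fly, as described below
bmr = benchmark(design)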

Note that the holdout splits have been automatically instantiated for each row of the design. As a result, the rpart learner used a different training set than the featureless learner. However, for comparison of learners you usually want the learners to see the same splits into train and test sets. To overcome this issue, the resampling strategy needs to be manually instantiated before creating the design.
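
A sketch of this manual instantiation, reusing the objects from above:

# instantiate a single holdout split once, then reuse it in every design row
resampling = rsmp("holdout")
resampling$instantiate(task)

design = data.table(
  task       = list(task, task),
  learner    = learners,
  resampling = list(resampling, resampling)
)

# both learners now see exactly the same train/test split
bmr = benchmark(design)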

While the interface of benchmark() allows full flexibility, the creation of such design tables can be tedious. Therefore, mlr3 provides a convenience function to quickly generate design tables and instantiate resampling strategies in an exhaustive grid fashion: expand_grid().
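
A rough sketch of its use with two tasks, two learners and a 3-fold cross-validation; the argument names (tasks, learners, resamplings) and the predict_type setting needed for the AUC later on are assumptions, and in recent mlr3 releases this helper is called benchmark_grid():

# every learner is paired with every task; the resampling is instantiated per task
design = expand_grid(
  tasks       = list(tsk("sonar"), tsk("pima")),
  learners    = list(
    lrn("classif.rpart", predict_type = "prob"),
    lrn("classif.featureless", predict_type = "prob")
  ),
  resamplings = list(rsmp("cv", folds = 3))
)
print(design)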

3.6.2 Execution and Aggregation of Results

After the benchmark design is ready, we can directly call benchmark():
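
Continuing the sketch from above:

bmr = benchmark(design)
print(bmr)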

Note that we did not instantiate the resampling manually: expand_grid() took care of it for us, instantiating each resampling strategy for each task during the construction of the exhaustive grid.

After the benchmark, we can calculate and aggregate the performance with .$aggregate():
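
For example, with accuracy and AUC as measures (the AUC requires the learners to predict probabilities, as set up above):

measures = list(msr("classif.acc"), msr("classif.auc"))
tab = bmr$aggregate(measures)
print(tab)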

We can aggregate the results further. For example, we might be interested in which learner performed best across all tasks. Since we have a data.table object here, we could do the following:
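
A sketch using data.table, assuming the aggregated performance columns are named after the measure ids classif.acc and classif.auc:

# mean performance per learner, averaged over all tasks
tab[, .(acc = mean(classif.acc), auc = mean(classif.auc)), by = "learner_id"]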

Alternatively, we can also use the tidyverse approach:
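
For instance, again assuming the column names classif.acc and classif.auc:

library(dplyr)

as_tibble(tab) %>%
  group_by(learner_id) %>%
  summarise(acc = mean(classif.acc), auc = mean(classif.auc))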

## # A tibble: 2 x 3
##   learner_id            acc   auc
##   <chr>               <dbl> <dbl>
## 1 classif.featureless 0.597 0.5  
## 2 classif.rpart       0.781 0.804

Unsurprisingly, the classification tree outperformed the featureless learner.