2.6 Benchmarking
Comparing the performance of different learners on multiple tasks and/or different resampling schemes is a common task.
This operation is usually referred to as “benchmarking” in the field of machine learning.
The mlr3 package offers the benchmark() convenience function for this purpose.
2.6.1 Design Creation
In mlr3 we require you to supply a “design” of your benchmark experiment.
A “design” is essentially a matrix of settings you want to execute; it consists of unique combinations of Task, Learner and Resampling triplets.
Here, we call benchmark() to perform a single holdout split on a single task with two learners.
We use the benchmark_grid() function to create an exhaustive design and instantiate the resampling properly, so that all learners are executed on the same train/test split for each task:
library("data.table")
library("mlr3")
= benchmark_grid(
design tasks = tsk("iris"),
learners = list(lrn("classif.rpart"), lrn("classif.featureless")),
resamplings = rsmp("holdout")
)print(design)
## task learner resampling
## 1: <TaskClassif[45]> <LearnerClassifRpart[34]> <ResamplingHoldout[19]>
## 2: <TaskClassif[45]> <LearnerClassifFeatureless[34]> <ResamplingHoldout[19]>
bmr = benchmark(design)
Instead of using benchmark_grid() you could also create the design manually as a data.table and use the full flexibility of the benchmark() function.
The design does not have to be exhaustive, e.g. it can also contain a different learner for each task.
However, note that benchmark_grid() makes sure to instantiate the resamplings for each task.
If you create the design manually and do not instantiate the resampling before constructing the design, the train/test splits will differ between rows, even when the same task is used multiple times.
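To make this concrete, here is a minimal sketch of such a manual design (the object names are only illustrative): the resampling is instantiated up front and reused in both rows, so both learners are evaluated on the same split.

# manually assemble a design as a data.table (sketch);
# instantiating the resampling first ensures both rows share one split
task = tsk("iris")
resampling = rsmp("holdout")$instantiate(task)

design = data.table(
  task = list(task, task),
  learner = list(lrn("classif.rpart"), lrn("classif.featureless")),
  resampling = list(resampling, resampling)
)
bmr = benchmark(design)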
Let’s construct a more complex design to show the full capabilities of the benchmark() function.
# get some example tasks
tasks = lapply(c("german_credit", "sonar"), tsk)

# get some learners and for all learners ...
# * predict probabilities
# * predict also on the training set
library("mlr3learners")
learners = c("classif.featureless", "classif.rpart", "classif.ranger", "classif.kknn")
learners = lapply(learners, lrn,
  predict_type = "prob", predict_sets = c("train", "test"))

# compare via 3-fold cross validation
resamplings = rsmp("cv", folds = 3)

# create a BenchmarkDesign object
design = benchmark_grid(tasks, learners, resamplings)
print(design)
## task learner resampling
## 1: <TaskClassif[45]> <LearnerClassifFeatureless[34]> <ResamplingCV[19]>
## 2: <TaskClassif[45]> <LearnerClassifRpart[34]> <ResamplingCV[19]>
## 3: <TaskClassif[45]> <LearnerClassifRanger[34]> <ResamplingCV[19]>
## 4: <TaskClassif[45]> <LearnerClassifKKNN[32]> <ResamplingCV[19]>
## 5: <TaskClassif[45]> <LearnerClassifFeatureless[34]> <ResamplingCV[19]>
## 6: <TaskClassif[45]> <LearnerClassifRpart[34]> <ResamplingCV[19]>
## 7: <TaskClassif[45]> <LearnerClassifRanger[34]> <ResamplingCV[19]>
## 8: <TaskClassif[45]> <LearnerClassifKKNN[32]> <ResamplingCV[19]>
2.6.2 Execution and Aggregation of Results
After the benchmark design is ready, we can directly call benchmark():
# execute the benchmark
bmr = benchmark(design)
Note that we did not instantiate the resampling instance manually.
benchmark_grid() took care of it for us: each resampling strategy is instantiated once per task during the construction of the exhaustive grid.
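If you want to convince yourself of this, a quick (purely illustrative) check is to compare the stored instantiations of two design rows that refer to the same task:

# rows 1 and 2 both refer to the german_credit task, so they should
# carry the same instantiated CV splits (illustrative check)
all.equal(design$resampling[[1]]$instance, design$resampling[[2]]$instance)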
Once the benchmarking is done, we can aggregate the performance with $aggregate():
# measures:
# * area under the curve (auc) on training
# * area under the curve (auc) on test
measures = list(
  msr("classif.auc", id = "auc_train", predict_sets = "train"),
  msr("classif.auc", id = "auc_test")
)
bmr$aggregate(measures)
## nr resample_result task_id learner_id resampling_id
## 1: 1 <ResampleResult[21]> german_credit classif.featureless cv
## 2: 2 <ResampleResult[21]> german_credit classif.rpart cv
## 3: 3 <ResampleResult[21]> german_credit classif.ranger cv
## 4: 4 <ResampleResult[21]> german_credit classif.kknn cv
## 5: 5 <ResampleResult[21]> sonar classif.featureless cv
## 6: 6 <ResampleResult[21]> sonar classif.rpart cv
## 7: 7 <ResampleResult[21]> sonar classif.ranger cv
## 8: 8 <ResampleResult[21]> sonar classif.kknn cv
## iters auc_train auc_test
## 1: 3 0.5000 0.5000
## 2: 3 0.8101 0.7199
## 3: 3 0.9984 0.7887
## 4: 3 0.9866 0.7190
## 5: 3 0.5000 0.5000
## 6: 3 0.9023 0.7138
## 7: 3 1.0000 0.9077
## 8: 3 0.9992 0.9315
We can aggregate the results even further. For example, we might be interested in knowing which learner performed best across all tasks simultaneously. Simply aggregating the performance values with the mean is usually not statistically sound. Instead, we calculate the rank statistic for each learner, grouped by task. The ranks are then grouped by learner and aggregated with data.table. Since the AUC needs to be maximized, we multiply the values by \(-1\) so that the best learner has a rank of \(1\).
tab = bmr$aggregate(measures)
print(tab)
## nr resample_result task_id learner_id resampling_id
## 1: 1 <ResampleResult[21]> german_credit classif.featureless cv
## 2: 2 <ResampleResult[21]> german_credit classif.rpart cv
## 3: 3 <ResampleResult[21]> german_credit classif.ranger cv
## 4: 4 <ResampleResult[21]> german_credit classif.kknn cv
## 5: 5 <ResampleResult[21]> sonar classif.featureless cv
## 6: 6 <ResampleResult[21]> sonar classif.rpart cv
## 7: 7 <ResampleResult[21]> sonar classif.ranger cv
## 8: 8 <ResampleResult[21]> sonar classif.kknn cv
## iters auc_train auc_test
## 1: 3 0.5000 0.5000
## 2: 3 0.8101 0.7199
## 3: 3 0.9984 0.7887
## 4: 3 0.9866 0.7190
## 5: 3 0.5000 0.5000
## 6: 3 0.9023 0.7138
## 7: 3 1.0000 0.9077
## 8: 3 0.9992 0.9315
# group by levels of task_id, return columns:
# - learner_id
# - rank of col '-auc_train' (per level of task_id)
# - rank of col '-auc_test' (per level of task_id)
ranks = tab[, .(learner_id, rank_train = rank(-auc_train), rank_test = rank(-auc_test)), by = task_id]
print(ranks)
## task_id learner_id rank_train rank_test
## 1: german_credit classif.featureless 4 4
## 2: german_credit classif.rpart 3 2
## 3: german_credit classif.ranger 1 1
## 4: german_credit classif.kknn 2 3
## 5: sonar classif.featureless 4 4
## 6: sonar classif.rpart 3 3
## 7: sonar classif.ranger 1 2
## 8: sonar classif.kknn 2 1
# group by levels of learner_id, return columns:
# - mean rank of col 'rank_train' (per level of learner_id)
# - mean rank of col 'rank_test' (per level of learner_id)
ranks = ranks[, .(mrank_train = mean(rank_train), mrank_test = mean(rank_test)), by = learner_id]

# print the final table, ordered by mean rank of AUC test
ranks[order(mrank_test)]
## learner_id mrank_train mrank_test
## 1: classif.ranger 1 1.5
## 2: classif.kknn 2 2.0
## 3: classif.rpart 3 2.5
## 4: classif.featureless 4 4.0
Unsurprisingly, the featureless learner is outperformed on both the training and the test set.
2.6.3 Plotting Benchmark Results
Analogously to plotting tasks, predictions or resample results, mlr3viz also provides an autoplot() method for benchmark results.
library("mlr3viz")
library("ggplot2")
autoplot(bmr) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
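The default plot shows boxplots of the performance per learner and task. If you want to inspect a different measure, autoplot.BenchmarkResult() accepts a measure argument (check the manual page of your installed mlr3viz version); for example, to plot the test AUC:

# boxplots of the AUC on the test set (assumes the `measure` argument
# of autoplot.BenchmarkResult() is available in your mlr3viz version)
autoplot(bmr, measure = msr("classif.auc")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))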
We can also plot ROC curves.
To do so, we first need to filter the BenchmarkResult to only contain a single Task:
autoplot(bmr$clone()$filter(task_id = "german_credit"), type = "roc")
All available types are listed on the manual page of autoplot.BenchmarkResult().
2.6.4 Extracting ResampleResults
A BenchmarkResult object is essentially a collection of multiple ResampleResult objects.
As these are stored in a column of the aggregated data.table(), we can easily extract them:
tab = bmr$aggregate(measures)
rr = tab[task_id == "german_credit" & learner_id == "classif.ranger"]$resample_result[[1]]
print(rr)
## <ResampleResult> of 3 iterations
## * Task: german_credit
## * Learner: classif.ranger
## * Warnings: 0 in 0 iterations
## * Errors: 0 in 0 iterations
We can now investigate this resampling and even single resampling iterations using one of the approaches shown in the previous section:
= msr("classif.auc")
measure $aggregate(measure) rr
## classif.auc
## 0.7887
# get the iteration with worst AUC
perf = rr$score(measure)
i = which.min(perf$classif.auc)

# get the corresponding learner and train set
print(rr$learners[[i]])
## <LearnerClassifRanger:classif.ranger>
## * Model: -
## * Parameters: list()
## * Packages: ranger
## * Predict Type: prob
## * Feature types: logical, integer, numeric, character, factor, ordered
## * Properties: importance, multiclass, oob_error, twoclass, weights
head(rr$resampling$train_set(i))
## [1] 3 4 8 11 17 19
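Similarly, the test predictions of that iteration can be pulled out for a closer look; the following is a short sketch using ResampleResult$predictions():

# get the test predictions of the worst iteration
# (one Prediction object per resampling iteration)
pred = rr$predictions()[[i]]
pred$confusion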
2.6.5 Converting and Merging ResampleResults
It is also possible to cast a single ResampleResult to a BenchmarkResult using the converter as_benchmark_result().
= tsk("iris")
task = rsmp("holdout")$instantiate(task)
resampling
= resample(task, lrn("classif.rpart"), resampling)
rr1 = resample(task, lrn("classif.featureless"), resampling)
rr2
# Cast both ResampleResults to BenchmarkResults
= as_benchmark_result(rr1)
bmr1 = as_benchmark_result(rr2)
bmr2
# Merge 2nd BMR into the first BMR
$combine(bmr2)
bmr1
bmr1
## <BenchmarkResult> of 2 rows with 2 resampling runs
## nr task_id learner_id resampling_id iters warnings errors
## 1 iris classif.rpart holdout 1 0 0
## 2 iris classif.featureless holdout 1 0 0