3  Performance Evaluation and Comparison

Now that we are familiar with the basics of how to construct tasks and learners and how to fit models, we would like to know more about how to assess fitted models. In particular, we would like to assign one or several numeric scores to a given model. This allows us to select a best model out of several fitted models and furthermore, helps us assess whether the models we have fitted are useful for a task at hand. Scores in this context can reflect different qualities of a model, depending on the measures we employ. To provide an example, in some cases we might be interested in how well our model predicts (in terms of global accuracy). In others we might care about other perspectives, such as how many features it uses or how fast it predicts.

In supervised learning contexts, the core idea in performance evaluation is often that we would like to assess how good a fitted model will work on new, unseen data points. In order to mimic this, we often hold out some data during model development. Afterwards, we use predictions made on the held-out data and compare them to the true label to simulate how our model would have fared on new data points. This strategy is often called train-test split or holdout. Performance Measures now summarize such comparisons in a single number, e.g. a model’s accuracy is the fraction of cases in which it predicted the true outcome correctly.

If datasets are small, the model might perform poorly on the test data by chance. We would therefore like to repeat this process multiple times to reduce the variance in our estimation. This gives rise to different Resampling strategies encoding more complex strategies.

Performance Measures

Measures are a way to quantify model properties along relevant dimensions. We cover how to * construct Measure objects for different task_types, * use Measure objects to score models

An in-depth guide on how to construct custom measures can be found in the extending measures section.


Resampling is a methodology to construct training and test splits. We cover how to

In this section, we will use Resampling objects that define the resampling strategy we want to employ. Calling resample on a Task, Learner and a Resampling object then will return a ResampleResult which contains the results of applying the strategy on a Task and Learner. Additional information on resampling can be found in the section about nested resampling and in the chapter on model optimization.


Benchmarking is used to compare the performance of different models, for example models trained with different learners, on different tasks, or with different resampling methods. This is usually done to obtain an overview of how different methods perform across different tasks. We cover how to Further Reading

  • If data is pre-processed before feeding it to a model, the train-test split needs to be taken into account. Pipelines solve this problem by cleanly separating train and predict steps!
  • Depending on the task at hand, more involved resampling strategies might be required, e.g. for spatiotemporal data.
  • Bischl et al. (2012) provide an overview of resampling methods.

3.1 ROC Curve and Thresholds

As we have seen before, binary classification is unique because of the presence of a positive and negative class and a threshold probability to distinguish between the two. ROC Analysis, which stands for receiver operating characteristics, applies specifically to this case and allows a better picture of the trade-offs when choosing between the two classes.

We saw earlier that one can retrieve the confusion matrix of a mlr3::Prediction by accessing the $confusion field. In the following code chunk, we first retrieve the Sonar classification task and construct a classification tree learner. Next, we use the mlr3::partition() helper function to randomly split the rows of the Sonar task into two disjunct sets: a training set and a test set. We train the learner on the training set and use the trained model to generate predictions on the test set. Finally, we retrieve the confusion matrix.

task = tsk("sonar")
learner = lrn("classif.rpart", predict_type = "prob")

# split into training and test
splits = partition(task, ratio = 0.8)
List of 2
 $ train: int [1:167] 3 4 5 6 7 8 9 12 13 14 ...
 $ test : int [1:41] 1 2 10 11 24 28 35 37 46 48 ...
pred = learner$train(task, splits$train)$predict(task, splits$test)
response  M  R
       M 19  5
       R  3 14

The upper left quadrant denotes the number of times our model predicted the positive class and was correct about it. Similarly, the lower right quadrant denotes the number of times our model predicted the negative class and was also correct about it. Together, the elements on the diagonal are called True Positives (TP) and True Negatives (TN). The upper right quadrant denotes the number of times we falsely predicted a positive label and is called False Positives (FP). The lower left quadrant is called False Negatives (FN).

We can derive the following performance metrics from the confusion matrix:

  • True Positive Rate (TPR): How many of the true positives did we predict as positive?
  • True Negative Rate (TNR): How many of the true negatives did we predict as negative?
  • Positive Predictive Value PPV: If we predict positive how likely is it a true positive?
  • Negative Predictive Value NPV: If we predict negative how likely is it a true negative?

It is difficult to achieve a high TPR and low FPR simultaneously. We can characterize the behavior of a binary classifier for different thresholds by plotting the TPR and FPR values – this is the ROC curve. The best classifier lies on the top-left corner – the TPR is 1 and the FPR is 0. Classifiers on the diagonal produce labels randomly (possibly with different proportions). For example, if each positive \(x\) will be randomly classified with 25% as “positive”, we get a TPR of 0.25. If we assign each negative \(x\) randomly to “positive” we get a FPR of 0.25. In practice, we should never obtain a classifier clearly below the diagonal – ROC curves for different labels are symmetric with respect to the diagonal, so a curve below the diagonal would indicate that the positive and negative class labels have been switched by the classifier.

For mlr3 prediction objects, the ROC curve can easily be constructed with mlr3viz which relies on the precrec to calculate and plot ROC curves:


# TPR vs FPR / Sensitivity vs (1 - Specificity)
autoplot(pred, type = "roc")

We can also plot the precision-recall curve (PPV vs. TPR). The main difference between ROC curves and precision-recall curves (PRC) is that the number of true-negative results is not used for making a PRC. PRCs are preferred over ROC curves for imbalanced populations.

# Precision vs Recall
autoplot(pred, type = "prc")

3.2 Resampling

When evaluating the performance of a model, we are interested in its generalization performance – how well will it perform on new data that has not been seen during training? We can estimate the generalization performance by evaluating a model on a test set, as we have done above, that was constructed to contain only observations that are not contained in the training set. There are many different strategies for partitioning a data set into training and test; in mlr3 we call these strategies “resampling”. mlr3 includes the following predefined resampling strategies:

Name Identifier
mlr3::cross validation "cv"
mlr3::leave-one-out cross validation "loo"
mlr3::repeated cross validation "repeated_cv"
mlr3::bootstrapping "bootstrap"
mlr3::subsampling "subsampling"
mlr3::holdout "holdout"
mlr3::in-sample resampling "insample"
mlr3::custom resampling "custom"

It is often desirable to repeatedly split the entire data in different ways to ensure that a “lucky” or “unlucky” split does not bias the generalization performance estimate. Without resampling strategies like the ones we provide here, this is a tedious and error-prone process.

The following sections provide guidance on how to select a resampling strategy and how to use it.

Here is a graphical illustration of the resampling process in general:

3.2.1 Settings

We will use the mlr3::penguins task and a simple classification tree from the rpart package as an example here.

task = tsk("penguins")
learner = lrn("classif.rpart")

When performing resampling with a dataset, we first need to define which approach to use. mlr3 resampling strategies and their parameters can be queried by looking at the data.table output of the mlr3::mlr_resamplings dictionary; this also lists the parameters that can be changed to affect the behavior of each strategy:

           key                         label        params iters
1:   bootstrap                     Bootstrap ratio,repeats    30
2:      custom                 Custom Splits                  NA
3:   custom_cv Custom Split Cross-Validation                  NA
4:          cv              Cross-Validation         folds    10
5:     holdout                       Holdout         ratio     1
6:    insample           Insample Resampling                   1
7:         loo                 Leave-One-Out                  NA
8: repeated_cv     Repeated Cross-Validation folds,repeats   100
9: subsampling                   Subsampling ratio,repeats    30

Additional resampling methods for special use cases are available via extension packages, such as mlr3spatiotempcv for spatial data.

What we showed in the train/predict/score part is the equivalent of holdout resampling, done manually, so let’s consider this one first. We can retrieve elements from the dictionary mlr3::mlr_resamplings with the convenience function mlr3::rsmp():

resampling = rsmp("holdout")
<ResamplingHoldout>: Holdout
* Iterations: 1
* Instantiated: FALSE
* Parameters: ratio=0.6667

Note that the $is_instantiated field is set to FALSE. This means we did not apply the strategy to a dataset yet.

By default we get a .66/.33 split of the data into training and test. There are two ways in which this ratio can be changed:

  1. Overwriting the slot in $param_set$values using a named list:
resampling$param_set$values = list(ratio = 0.8)
  1. Specifying the resampling parameters directly during construction:
rsmp("holdout", ratio = 0.8)
<ResamplingHoldout>: Holdout
* Iterations: 1
* Instantiated: FALSE
* Parameters: ratio=0.8

3.2.2 Instantiation

So far, we have only chosen a resampling strategy; we now need to instantiate it with data.

To perform the splitting and obtain indices for the training and the test split, the resampling needs a mlr3::Task. By calling the method instantiate(), we split the row indices of the task into indices for training and test sets. These resulting indices are stored in the mlr3::Resampling objects.

train = resampling$train_set(1)
test = resampling$test_set(1)
 int [1:275] 2 3 4 5 6 7 8 9 10 11 ...
 int [1:69] 1 13 14 15 17 20 28 32 42 44 ...
# are they disjunct?
intersect(train, test)
# are all row ids either in train or test?
setequal(c(train, test), task$row_ids)
[1] TRUE

Note that if you want to compare multiple Learners in a fair manner, using the same instantiated resampling for each learner is mandatory, such that each learner gets exactly the same training data and the performance of the trained model is evaluated in exactly the same test set. A way to greatly simplify the comparison of multiple learners is discussed in the section on benchmarking.

3.2.3 Execution

With a mlr3::Task, a mlr3::Learner, and a mlr3::Resampling object we can now perform a resampling: fit the learner on a subset of the task repeatedly and predict on the left-out observations. For this, we call the mlr3::resample() function which returns a mlr3::ResampleResult object. We can tell mlr3::resample() to keep the fitted models (for example for later inspection) by setting the store_models option to TRUE.


The butcher package allows you to remove components of a fitted model that may no longer be needed. However, from the point of view of the mlr3 framework, it is impossible to find a good compromise here. Of course, as much balast as possible should be removed. However, if too much is removed, some graphics, outputs or third-party extensions may not work anymore as intended. Our conservative policy here is that the training data should not be part of the stored model. If this is the case somewhere, please open an issue.

To manually apply butcher, just replace the stored model in the learner:

task = tsk("penguins")
learner = lrn("classif.rpart")

learner$model = butcher::butcher(learner$model)

Per default, mlr3 discards fitted models after the prediction step to keep the mlr3::ResampleResult object small w.r.t. its memory consumption.

task = tsk("penguins")
learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling = rsmp("cv", folds = 3)

rr = resample(task, learner, resampling, store_models = TRUE)
<ResampleResult> of 3 iterations
* Task: penguins
* Learner: classif.rpart
* Warnings: 0 in 0 iterations
* Errors: 0 in 0 iterations

Here we use a three-fold cross-validation resampling, which trains and evaluates on three different training and test sets. The returned mlr3::ResampleResult, stored as rr, provides various getters to access and aggregate the stored information. Here are a few examples:

  • Calculate the average performance across all resampling iterations, in terms of classification error:


    This is useful to check if one (or more) of the iterations are very different from the average.

  • Check for warnings or errors:

    Empty data.table (0 rows and 2 cols): iteration,msg
    Empty data.table (0 rows and 2 cols): iteration,msg
  • Extract and inspect the resampling splits; this allows us to see in detail which observations were used for what purpose when:

    <ResamplingCV>: Cross-Validation
    * Iterations: 3
    * Instantiated: TRUE
    * Parameters: folds=3
    [1] 3
     int [1:229] 9 12 16 17 19 24 25 29 30 34 ...
     int [1:115] 1 5 8 10 15 18 21 22 23 26 ...
  • Retrieve the model trained in a specific iteration and inspect it, for example, to investigate why the performance in this iteration was very different from the average:

    lrn = rr$learners[[1]]
    n= 229 
    node), split, n, loss, yval, (yprob)
          * denotes terminal node
    1) root 229 128 Adelie (0.44104803 0.22707424 0.33187773)  
      2) bill_length< 42.35 94   1 Adelie (0.98936170 0.00000000 0.01063830) *
      3) bill_length>=42.35 135  60 Gentoo (0.05925926 0.38518519 0.55555556)  
        6) island=Dream,Torgersen 58   6 Chinstrap (0.10344828 0.89655172 0.00000000) *
        7) island=Biscoe 77   2 Gentoo (0.02597403 0.00000000 0.97402597) *
  • Extract the individual predictions:

    rr$prediction() # all predictions merged into a single Prediction object
    <PredictionClassif> for 344 observations:
        row_ids     truth  response prob.Adelie prob.Chinstrap prob.Gentoo
              1    Adelie    Adelie  0.98936170     0.00000000  0.01063830
              5    Adelie    Adelie  0.98936170     0.00000000  0.01063830
              8    Adelie    Adelie  0.98936170     0.00000000  0.01063830
            340 Chinstrap    Gentoo  0.01162791     0.01162791  0.97674419
            342 Chinstrap Chinstrap  0.04761905     0.92857143  0.02380952
            343 Chinstrap    Gentoo  0.01162791     0.01162791  0.97674419
    rr$predictions()[[1]] # predictions of first resampling iteration
    <PredictionClassif> for 115 observations:
        row_ids     truth  response prob.Adelie prob.Chinstrap prob.Gentoo
              1    Adelie    Adelie   0.9893617      0.0000000   0.0106383
              5    Adelie    Adelie   0.9893617      0.0000000   0.0106383
              8    Adelie    Adelie   0.9893617      0.0000000   0.0106383
            335 Chinstrap Chinstrap   0.1034483      0.8965517   0.0000000
            336 Chinstrap Chinstrap   0.1034483      0.8965517   0.0000000
            339 Chinstrap Chinstrap   0.1034483      0.8965517   0.0000000
  • Filter the result, keep only specified resampling iterations:

    rr$filter(c(1, 3))
    <ResampleResult> of 2 iterations
    * Task: penguins
    * Learner: classif.rpart
    * Warnings: 0 in 0 iterations
    * Errors: 0 in 0 iterations

3.2.4 Custom resampling

Sometimes it is necessary to perform resampling with custom splits, e.g., to reproduce results reported in a study. A manual resampling instance can be constructed using the "custom" template.

resampling = rsmp("custom")
  train = list(c(1:10, 51:60, 101:110)),
  test = list(c(11:20, 61:70, 111:120))
[1] 1
 [1]   1   2   3   4   5   6   7   8   9  10  51  52  53  54  55  56  57  58  59
[20]  60 101 102 103 104 105 106 107 108 109 110
 [1]  11  12  13  14  15  16  17  18  19  20  61  62  63  64  65  66  67  68  69
[20]  70 111 112 113 114 115 116 117 118 119 120

3.2.5 Resampling with (predefined) groups

In some cases, it is desirable to keep observations together, i.e., either have all observations of the same group in the training set or have all observations of the same group in the test set. This can be defined through the column role "group" during Task creation, i.e., a column in the data specifies the groups (see also the help page on this column role).

A potential use case for this is spatial modeling, where one wants to account for groups of observations during resampling.

Besides the "group" column role, dedicated spatiotemporal resampling methods are available in mlr3spatiotempcv. More general information about this topic can be found in the spatiotemporal resampling section.

If you want the ensure that the observations in the training and test set are similarly distributed as in the complete task, one or multiple columns can be used for stratification via the column role stratum. See also the mlr3gallery post on this topic for a practical introduction.

In the old mlr package, grouping was called “blocking”.

3.2.6 Plotting Resample Results

mlr3viz provides a ggplot2::autoplot() method for resampling results. As an example, we construct a binary classification task with two features, perform a resampling with a 10-fold cross-validation and visualize the results:

task = tsk("pima")
task$select(c("glucose", "mass"))
learner = lrn("classif.rpart", predict_type = "prob")
rr = resample(task, learner, rsmp("cv"), store_models = TRUE)
                 task task_id                   learner    learner_id
 1: <TaskClassif[50]>    pima <LearnerClassifRpart[38]> classif.rpart
 2: <TaskClassif[50]>    pima <LearnerClassifRpart[38]> classif.rpart
 3: <TaskClassif[50]>    pima <LearnerClassifRpart[38]> classif.rpart
 4: <TaskClassif[50]>    pima <LearnerClassifRpart[38]> classif.rpart
 5: <TaskClassif[50]>    pima <LearnerClassifRpart[38]> classif.rpart
 6: <TaskClassif[50]>    pima <LearnerClassifRpart[38]> classif.rpart
 7: <TaskClassif[50]>    pima <LearnerClassifRpart[38]> classif.rpart
 8: <TaskClassif[50]>    pima <LearnerClassifRpart[38]> classif.rpart
 9: <TaskClassif[50]>    pima <LearnerClassifRpart[38]> classif.rpart
10: <TaskClassif[50]>    pima <LearnerClassifRpart[38]> classif.rpart
5 variables not shown: [resampling, resampling_id, iteration, prediction, classif.auc]
# boxplot of AUC values across the 10 folds
autoplot(rr, measure = msr("classif.auc"))

# ROC curve, averaged over 10 folds
autoplot(rr, type = "roc")

We can also plot the predictions of individual models:

# learner predictions for the first fold
rr1 = rr$clone()$filter(1)
autoplot(rr1, type = "prediction")
Warning: Removed 2 rows containing missing values (`geom_point()`).


All available plot types are listed on the manual page of mlr3viz::autoplot.ResampleResult().

3.3 Benchmarking

Comparing the performance of different learners on multiple tasks and/or different resampling schemes is a common task. This operation is usually referred to as “benchmarking” in the field of machine learning. The mlr3 package offers the mlr3::benchmark() convenience function that takes care of most of the work of repeatedly training and evaluating models under the same conditions.

3.3.1 Design Creation

Benchmark experiments in mlr3 are specified through a design. Such a design is essentially a table of scenarios to be evaluated; in particular unique combinations of mlr3::Task, mlr3::Learner and mlr3::Resampling triplets.

We use the mlr3::benchmark_grid() function to construct an exhaustive design (that evaluates each learner on each task with each resampling) and instantiate the resampling properly, so that all learners are executed on the same train/test split for each tasks. We set the learners to predict probabilities and also tell them to predict for the observations of the training set (by setting predict_sets to c("train", "test")). Additionally, we use mlr3::tsks(), mlr3::lrns(), and mlr3::rsmps() to retrieve lists of mlr3::Task, mlr3::Learner and mlr3::Resampling.


design = benchmark_grid(
  tasks = tsks(c("spam", "german_credit", "sonar")),
  learners = lrns(c("classif.ranger", "classif.rpart", "classif.featureless"),
    predict_type = "prob", predict_sets = c("train", "test")),
  resamplings = rsmps("cv", folds = 3)
                task                         learner         resampling
1: <TaskClassif[50]>      <LearnerClassifRanger[38]> <ResamplingCV[20]>
2: <TaskClassif[50]>       <LearnerClassifRpart[38]> <ResamplingCV[20]>
3: <TaskClassif[50]> <LearnerClassifFeatureless[38]> <ResamplingCV[20]>
4: <TaskClassif[50]>      <LearnerClassifRanger[38]> <ResamplingCV[20]>
5: <TaskClassif[50]>       <LearnerClassifRpart[38]> <ResamplingCV[20]>
6: <TaskClassif[50]> <LearnerClassifFeatureless[38]> <ResamplingCV[20]>
7: <TaskClassif[50]>      <LearnerClassifRanger[38]> <ResamplingCV[20]>
8: <TaskClassif[50]>       <LearnerClassifRpart[38]> <ResamplingCV[20]>
9: <TaskClassif[50]> <LearnerClassifFeatureless[38]> <ResamplingCV[20]>

The constructed design table can be passed to mlr3::benchmark() to start the computation. It is also possible to construct a custom design manually, for example, to exclude certain task-learner combinations.


However, if you construct a custom design with data.table::data.table(), the train/test splits will be different for each row of the design if you do not manually instantiate the resampling before creating the design. See the help page on mlr3::benchmark_grid() for an example.

3.3.2 Execution Aggregation of Results

After the benchmark design is ready, we can call mlr3::benchmark() on it:

bmr = benchmark(design)

Note that we did not have to instantiate the resampling manually. mlr3::benchmark_grid() took care of it for us: each resampling strategy is instantiated once for each task during the construction of the exhaustive grid. This way, each learner operates on the same training and test sets which makes the results easier to compare.

Once the benchmarking is finished (and, depending on the size of your design, this can take quite some time), we can aggregate the performance with the $aggregate() method of the returned mlr3::BenchmarkResult. We construct two measures to calculate the area under the curve (AUC) for the training and the test set:

measures = list(
  msr("classif.auc", predict_sets = "train", id = "auc_train"),
  msr("classif.auc", id = "auc_test")

tab = bmr$aggregate(measures)
print(tab[, .(task_id, learner_id, auc_train, auc_test)])
         task_id          learner_id auc_train  auc_test
1:          spam      classif.ranger 0.9994318 0.9860031
2:          spam       classif.rpart 0.9152765 0.9061875
3:          spam classif.featureless 0.5000000 0.5000000
4: german_credit      classif.ranger 0.9984109 0.8055811
5: german_credit       classif.rpart 0.8073881 0.7110071
6: german_credit classif.featureless 0.5000000 0.5000000
7:         sonar      classif.ranger 1.0000000 0.9142603
8:         sonar       classif.rpart 0.8953818 0.7588270
9:         sonar classif.featureless 0.5000000 0.5000000

We can aggregate the results even further. For example, we might be interested to know which learner performed best across all tasks. Simply aggregating the performances with the mean is usually not statistically sound. Instead, we calculate the rank statistic for each learner, grouped by task. Then the calculated ranks, grouped by the learner, are aggregated with the data.table package. As larger AUC scores are better, we multiply the values by \(-1\) such that the best learner has a rank of \(1\).

# group by levels of task_id, return columns:
# - learner_id
# - rank of col '-auc_train' (per level of learner_id)
# - rank of col '-auc_test' (per level of learner_id)
ranks = tab[, .(learner_id, rank_train = rank(-auc_train), rank_test = rank(-auc_test)), by = task_id]
         task_id          learner_id rank_train rank_test
1:          spam      classif.ranger          1         1
2:          spam       classif.rpart          2         2
3:          spam classif.featureless          3         3
4: german_credit      classif.ranger          1         1
5: german_credit       classif.rpart          2         2
6: german_credit classif.featureless          3         3
7:         sonar      classif.ranger          1         1
8:         sonar       classif.rpart          2         2
9:         sonar classif.featureless          3         3
# group by levels of learner_id, return columns:
# - mean rank of col 'rank_train' (per level of learner_id)
# - mean rank of col 'rank_test' (per level of learner_id)
ranks = ranks[, .(mrank_train = mean(rank_train), mrank_test = mean(rank_test)), by = learner_id]

# print the final table, ordered by mean rank of AUC test
            learner_id mrank_train mrank_test
1:      classif.ranger           1          1
2:       classif.rpart           2          2
3: classif.featureless           3          3

Unsurprisingly, the featureless learner is worse overall. The winner is the classification forest, which outperforms a single classification tree.

3.3.3 Plotting Benchmark Results

Similar to tasks, predictions, or resample results, mlr3viz also provides a ggplot2::autoplot() method for benchmark results.

autoplot(bmr) + ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))

Such a plot gives a nice overview of the overall performance and how learners compare on different tasks in an intuitive way.

We can also plot ROC (receiver operating characteristics) curves. We filter the mlr3::BenchmarkResult to only contain a single mlr3::Task, then we simply plot the result:

bmr_small = bmr$clone(deep = TRUE)$filter(task_id = "german_credit")
autoplot(bmr_small, type = "roc")

All available plot types are listed on the manual page of mlr3viz::autoplot.BenchmarkResult().

3.3.4 Statistical Tests

The package mlr3benchmark provides some infrastructure for applying statistical significance tests on mlr3::BenchmarkResult. Currently, Friedman tests and pairwise Friedman-Nemenyi tests (Demšar 2006) are supported for benchmarks with at least two tasks and at least two learners. The results can be summarized in Critical Difference Plots.


bma = as.BenchmarkAggr(bmr, measures = msr("classif.auc"))

    Pairwise comparisons using Nemenyi-Wilcoxon-Wilcox all-pairs test for a two-way balanced complete block design
data: auc and learner_id and task_id
            ranger rpart
rpart       0.438  -    
featureless 0.038  0.438

P value adjustment method: single-step
autoplot(bma, type = "cd")

3.3.5 Extracting ResampleResults

A mlr3::BenchmarkResult object is essentially a collection of multiple mlr3::ResampleResult objects. As these are stored in a column of the aggregated data.table::data.table(), we can easily extract them:

tab = bmr$aggregate(measures)
rr = tab[task_id == "german_credit" & learner_id == "classif.ranger"]$resample_result[[1]]
<ResampleResult> of 3 iterations
* Task: german_credit
* Learner: classif.ranger
* Warnings: 0 in 0 iterations
* Errors: 0 in 0 iterations

We can now investigate this resampling and even single resampling iterations using one of the approaches shown in the previous section on resampling:

measure = msr("classif.auc")
# get the iteration with worst AUC
perf = rr$score(measure)
i = which.min(perf$classif.auc)

# get the corresponding learner and training set
* Model: -
* Parameters: num.threads=1
* Packages: mlr3, mlr3learners, ranger
* Predict Types:  response, [prob]
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_backward, importance, multiclass, oob_error,
  twoclass, weights
[1]  2  5  7  8  9 10

3.3.6 Converting and Merging

A mlr3::ResampleResult can be converted to a mlr3::BenchmarkResult with the function mlr3::as_benchmark_result(). We can also combine two mlr3::BenchmarkResults into a larger result object, for example, for two related benchmarks that are computed on different machines.

task = tsk("iris")
resampling = rsmp("holdout")$instantiate(task)

rr1 = resample(task, lrn("classif.rpart"), resampling)
rr2 = resample(task, lrn("classif.featureless"), resampling)

# Cast both ResampleResults to BenchmarkResults
bmr1 = as_benchmark_result(rr1)
bmr2 = as_benchmark_result(rr2)

# Merge 2nd BMR into the first BMR
bmr = c(bmr1, bmr2)
<BenchmarkResult> of 2 rows with 2 resampling runs
 nr task_id          learner_id resampling_id iters warnings errors
  1    iris       classif.rpart       holdout     1        0      0
  2    iris classif.featureless       holdout     1        0      0