7  Resampling

When evaluating the performance of a model, we are interested in its generalization performance: how well will it perform on new data that has not been seen during training? We can estimate the generalization performance by evaluating the model on a test set, as we have done above, which contains only observations that are not part of the training set. There are many different strategies for partitioning a data set into training and test sets; in mlr3 we call these strategies "resampling". mlr3 includes the following predefined resampling strategies:

Name                             Identifier
cross-validation                 "cv"
leave-one-out cross-validation   "loo"
repeated cross-validation        "repeated_cv"
bootstrapping                    "bootstrap"
subsampling                      "subsampling"
holdout                          "holdout"
in-sample resampling             "insample"
custom resampling                "custom"
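As a quick preview, each identifier in this table can be passed to the convenience function rsmp() (introduced below) to construct the corresponding strategy, optionally together with strategy-specific parameters such as folds and repeats (listed in the mlr_resamplings dictionary shown in the next section):

library("mlr3verse")

# construct resampling strategies from their identifiers
rsmp("cv", folds = 5)                       # 5-fold cross-validation
rsmp("repeated_cv", folds = 5, repeats = 2) # repeated cross-validation
rsmp("bootstrap", repeats = 30)             # bootstrapping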
Note

It is often desirable to repeatedly split the entire data set in different ways to ensure that a "lucky" or "unlucky" split does not bias the generalization performance estimate. Without resampling strategies like the ones provided here, this would be a tedious and error-prone manual process.

The following sections provide guidance on how to select a resampling strategy and how to use it.

Here is a graphical illustration of the resampling process in general:

[Figure: the general resampling process]

7.1 Settings

We will use the penguins task and a simple classification tree from the rpart package as an example here.

library("mlr3verse")

task = tsk("penguins")
learner = lrn("classif.rpart")

When performing resampling with a dataset, we first need to define which approach should be used. mlr3 resampling strategies and their parameters can be queried by looking at the data.table output of the mlr_resamplings dictionary; this also lists the parameters that can be changed to affect the behavior of each strategy:

as.data.table(mlr_resamplings)
           key                         label        params iters
1:   bootstrap                     Bootstrap ratio,repeats    30
2:      custom                 Custom Splits                  NA
3:   custom_cv Custom Split Cross-Validation                  NA
4:          cv              Cross-Validation         folds    10
5:     holdout                       Holdout         ratio     1
6:    insample           Insample Resampling                   1
7:         loo                 Leave-One-Out                  NA
8: repeated_cv     Repeated Cross-Validation folds,repeats   100
9: subsampling                   Subsampling ratio,repeats    30
Tip

Additional resampling methods for special use cases are available via extension packages, such as mlr3spatiotempcv for spatiotemporal data.

What we showed in the train/predict/score part is the equivalent of holdout resampling, done manually, so let’s consider this one first. We can retrieve elements from the dictionary mlr_resamplings via $get() or the convenience function rsmp():

resampling = rsmp("holdout")
print(resampling)
<ResamplingHoldout>: Holdout
* Iterations: 1
* Instantiated: FALSE
* Parameters: ratio=0.6667
Note

Note that the $is_instantiated field is set to FALSE. This means the strategy has not been applied to a dataset yet.

By default, we get a roughly 2/3 vs. 1/3 split of the data into training and test sets. There are two ways in which this ratio can be changed:

  1. Overwriting the slot in $param_set$values using a named list:
resampling$param_set$values = list(ratio = 0.8)
  2. Specifying the resampling parameters directly during construction:
rsmp("holdout", ratio = 0.8)
<ResamplingHoldout>: Holdout
* Iterations: 1
* Instantiated: FALSE
* Parameters: ratio=0.8

7.2 Instantiation

So far we have only chosen a resampling strategy; we now need to instantiate it with data.

To actually perform the splitting and obtain indices for the training and the test set, the resampling needs a Task. By calling the method instantiate(), we split the indices of the data into indices for training and test sets. The resulting indices are stored in the Resampling object.

resampling$instantiate(task)
str(resampling$train_set(1))
 int [1:275] 2 3 6 7 8 9 11 12 13 15 ...
str(resampling$test_set(1))
 int [1:69] 1 4 5 10 14 16 17 21 28 29 ...

Note that if you want to compare multiple Learners in a fair manner, it is mandatory to use the same instantiated resampling for each learner, such that each learner gets exactly the same training data and the trained models are evaluated on exactly the same test sets. A minimal sketch of this pattern is shown below. A way to greatly simplify the comparison of multiple learners is discussed in the section on benchmarking.
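Here is a minimal sketch of this pattern, using the featureless baseline learner (classif.featureless) as a second learner purely for illustration; resample() is introduced in the next section and uses the existing splits when the resampling is already instantiated:

resampling = rsmp("cv", folds = 3)
resampling$instantiate(task)

# both learners are trained and tested on the identical splits stored above
rr_rpart = resample(task, lrn("classif.rpart"), resampling)
rr_baseline = resample(task, lrn("classif.featureless"), resampling)

rr_rpart$aggregate(msr("classif.ce"))
rr_baseline$aggregate(msr("classif.ce"))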

7.3 Execution

With a Task, a Learner, and a Resampling object, we can call resample(), which fits the learner on the training set and evaluates it on the test set. This may happen multiple times, depending on the given resampling strategy. The result of the resample() call is a ResampleResult object. We can tell resample() to keep the fitted models (for example, for later inspection) by setting the store_models option to TRUE and then starting the computation:

task = tsk("penguins")
learner = lrn("classif.rpart", maxdepth = 3, predict_type = "prob")
resampling = rsmp("cv", folds = 3)

rr = resample(task, learner, resampling, store_models = TRUE)
INFO  [21:31:54.843] [mlr3] Applying learner 'classif.rpart' on task 'penguins' (iter 2/3) 
INFO  [21:31:54.924] [mlr3] Applying learner 'classif.rpart' on task 'penguins' (iter 3/3) 
INFO  [21:31:54.952] [mlr3] Applying learner 'classif.rpart' on task 'penguins' (iter 1/3) 
print(rr)
<ResampleResult> of 3 iterations
* Task: penguins
* Learner: classif.rpart
* Warnings: 0 in 0 iterations
* Errors: 0 in 0 iterations

Here we use a three-fold cross-validation resampling, which trains and evaluates on three different training and test sets. The returned ResampleResult, stored as rr, provides various getters to access and aggregate the stored information. Here are a few examples:

  • Calculate the average performance across all resampling iterations, in terms of classification error:

    rr$aggregate(msr("classif.ce"))
    classif.ce 
    0.08433766 
  • Extract the performance for the individual resampling iterations:

    rr$score(msr("classif.ce"))
                    task  task_id                   learner    learner_id
    1: <TaskClassif[50]> penguins <LearnerClassifRpart[38]> classif.rpart
    2: <TaskClassif[50]> penguins <LearnerClassifRpart[38]> classif.rpart
    3: <TaskClassif[50]> penguins <LearnerClassifRpart[38]> classif.rpart
               resampling resampling_id iteration              prediction
    1: <ResamplingCV[20]>            cv         1 <PredictionClassif[20]>
    2: <ResamplingCV[20]>            cv         2 <PredictionClassif[20]>
    3: <ResamplingCV[20]>            cv         3 <PredictionClassif[20]>
       classif.ce
    1: 0.06086957
    2: 0.09565217
    3: 0.09649123

    This is useful to check if one (or more) of the iterations are very different from the average.

  • Check for warnings or errors:

    rr$warnings
    Empty data.table (0 rows and 2 cols): iteration,msg
    rr$errors
    Empty data.table (0 rows and 2 cols): iteration,msg
  • Extract and inspect the resampling splits; this allows you to see in detail which observations were used for training and which for testing in each iteration:

    rr$resampling
    <ResamplingCV>: Cross-Validation
    * Iterations: 3
    * Instantiated: TRUE
    * Parameters: folds=3
    rr$resampling$iters
    [1] 3
    str(rr$resampling$test_set(1))
     int [1:115] 1 7 8 9 10 11 13 17 18 19 ...
    str(rr$resampling$train_set(1))
     int [1:229] 3 4 16 22 23 26 27 28 37 38 ...
  • Retrieve the model trained in a specific iteration and inspect it, for example to investigate why the performance in this iteration was very different from the average:

    lrn = rr$learners[[1]]
    lrn$model
    n= 229 
    
    node), split, n, loss, yval, (yprob)
          * denotes terminal node
    
    1) root 229 132 Adelie (0.423580786 0.231441048 0.344978166)  
      2) flipper_length< 207.5 146  50 Adelie (0.657534247 0.335616438 0.006849315)  
        4) bill_length< 44.65 95   2 Adelie (0.978947368 0.021052632 0.000000000) *
        5) bill_length>=44.65 51   4 Chinstrap (0.058823529 0.921568627 0.019607843) *
      3) flipper_length>=207.5 83   5 Gentoo (0.012048193 0.048192771 0.939759036)  
        6) bill_depth>=17.05 7   3 Chinstrap (0.142857143 0.571428571 0.285714286) *
        7) bill_depth< 17.05 76   0 Gentoo (0.000000000 0.000000000 1.000000000) *
  • Extract the individual predictions:

    rr$prediction() # all predictions merged into a single Prediction object
    <PredictionClassif> for 344 observations:
        row_ids     truth  response prob.Adelie prob.Chinstrap prob.Gentoo
              1    Adelie    Adelie   0.9789474     0.02105263           0
              7    Adelie    Adelie   0.9789474     0.02105263           0
              8    Adelie    Adelie   0.9789474     0.02105263           0
    ---                                                                   
            329 Chinstrap Chinstrap   0.0000000     1.00000000           0
            338 Chinstrap Chinstrap   0.0000000     1.00000000           0
            343 Chinstrap Chinstrap   0.0000000     1.00000000           0
    rr$predictions()[[1]] # predictions of first resampling iteration
    <PredictionClassif> for 115 observations:
        row_ids     truth  response prob.Adelie prob.Chinstrap prob.Gentoo
              1    Adelie    Adelie  0.97894737     0.02105263  0.00000000
              7    Adelie    Adelie  0.97894737     0.02105263  0.00000000
              8    Adelie    Adelie  0.97894737     0.02105263  0.00000000
    ---                                                                   
            337 Chinstrap Chinstrap  0.05882353     0.92156863  0.01960784
            341 Chinstrap    Adelie  0.97894737     0.02105263  0.00000000
            344 Chinstrap Chinstrap  0.05882353     0.92156863  0.01960784
  • Filter the result to only keep specified resampling iterations:

    rr$filter(c(1, 3))
    print(rr)
    <ResampleResult> of 2 iterations
    * Task: penguins
    * Learner: classif.rpart
    * Warnings: 0 in 0 iterations
    * Errors: 0 in 0 iterations

7.4 Custom resampling

Sometimes it is necessary to perform resampling with custom splits, e.g. to reproduce results reported in a study. A manual resampling instance can be created using the "custom" template.

resampling = rsmp("custom")
resampling$instantiate(task,
  train = list(c(1:10, 51:60, 101:110)),
  test = list(c(11:20, 61:70, 111:120))
)
resampling$iters
[1] 1
resampling$train_set(1)
 [1]   1   2   3   4   5   6   7   8   9  10  51  52  53  54  55  56  57  58  59
[20]  60 101 102 103 104 105 106 107 108 109 110
resampling$test_set(1)
 [1]  11  12  13  14  15  16  17  18  19  20  61  62  63  64  65  66  67  68  69
[20]  70 111 112 113 114 115 116 117 118 119 120

7.5 Resampling with (predefined) groups

In some cases, it is desirable to keep certain observations together, i.e. to never split them across training and test sets. This can be defined through the column role "group" during Task creation, i.e. a special column in the data specifies the groups (see also the help page on this column role); a minimal sketch is given after the note below. See the mlr3gallery post on this topic for a practical example.

Note

In the old mlr package this was called “blocking”.
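Here is a minimal sketch of this mechanism, assuming a hypothetical subject column that is added to the penguins data purely to define artificial groups:

library("mlr3verse")

# add a hypothetical grouping column: each observation gets one of 10 group ids
data = tsk("penguins")$data()
data$subject = sample(1:10, nrow(data), replace = TRUE)

task_grouped = as_task_classif(data, target = "species")
task_grouped$set_col_roles("subject", roles = "group")

# observations sharing the same "subject" value are now kept together,
# i.e. they always end up in the same training or test set
resampling = rsmp("cv", folds = 5)
resampling$instantiate(task_grouped)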

7.6 Plotting Resample Results

mlr3viz provides an autoplot() method for resampling results. As an example, we create a binary classification task with two features, perform 10-fold cross-validation, and visualize the results:

task = tsk("pima")
task$select(c("glucose", "mass"))
learner = lrn("classif.rpart", predict_type = "prob")
rr = resample(task, learner, rsmp("cv"), store_models = TRUE)
INFO  [21:31:55.390] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 4/10) 
INFO  [21:31:55.413] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 1/10) 
INFO  [21:31:55.438] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 6/10) 
INFO  [21:31:55.461] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 8/10) 
INFO  [21:31:55.484] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 2/10) 
INFO  [21:31:55.507] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 3/10) 
INFO  [21:31:55.538] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 5/10) 
INFO  [21:31:55.561] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 7/10) 
INFO  [21:31:55.583] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 9/10) 
INFO  [21:31:55.605] [mlr3] Applying learner 'classif.rpart' on task 'pima' (iter 10/10) 
# boxplot of AUC values across the 10 folds
autoplot(rr, measure = msr("classif.auc"))

# ROC curve, averaged over 10 folds
autoplot(rr, type = "roc")

We can also plot the predictions of individual models:

# learner predictions for the first fold
rr$filter(1)
autoplot(rr, type = "prediction")

Tip

All available plot types are listed on the manual page of autoplot.ResampleResult().