3.3 Nested Resampling

In order to obtain unbiased performance estimates for learners, all parts of the model building process (preprocessing and model selection steps) should be included in the resampling, i.e., repeated for every pair of training/test data. For steps that themselves require resampling, such as hyperparameter tuning or feature selection (via the wrapper approach), this results in two nested resampling loops.

The graphic above illustrates nested resampling for parameter tuning with 3-fold cross-validation in the outer loop and 4-fold cross-validation in the inner loop.

In the outer resampling loop, we have three pairs of training/test sets. On each of these outer training sets parameter tuning is done, thereby executing the inner resampling loop. This way, we get one set of selected hyperparameters for each outer training set. Then the learner is fitted on each outer training set using the corresponding selected hyperparameters and its performance is evaluated on the outer test sets.

In mlr3, you get nested resampling for free, without writing any looping code, by using the mlr3tuning::AutoTuner class. This works as follows:

  1. Generate a wrapped Learner via class mlr3tuning::AutoTuner or mlr3filters::AutoSelect (not yet implemented).
  2. Specify all required settings - see section “Automating the Tuning” for help.
  3. Call function resample() or benchmark() with the created Learner.

You can freely combine different inner and outer resampling strategies.

A common setup is prediction and performance evaluation on a fixed outer test set. This can be achieved by passing the Resampling strategy (rsmp("holdout")) as the outer resampling instance to either resample() or benchmark().

The inner resampling strategy could be cross-validation (rsmp("cv")), as the sizes of the outer training sets might differ. By default, the inner resampling is instantiated once for every outer training set.
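
Schematically, and assuming the AutoTuner at is constructed as shown in the Execution section below, the two strategies enter at different places; the 4-fold inner cross-validation here is just an illustrative choice:

library(mlr3)

inner = rsmp("cv", folds = 4)  # handed to the AutoTuner: tuning on each outer training set
outer = rsmp("holdout")        # handed to resample(): evaluation on a fixed outer test set

# at = AutoTuner$new(learner, inner, ...)  # construction as in the Execution section below
# rr = resample(task, at, outer)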

Nested resampling is computationally expensive. For this reason, the examples shown below use relatively small search spaces and a low number of resampling iterations; in practice, you normally have to increase both. As this is computationally intensive, you might want to have a look at the section Parallelization.
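
For example, mlr3 parallelizes resampling via the future framework; a minimal sketch, where the number of workers is an arbitrary choice for illustration:

# Run the resampling iterations in parallel on 4 local worker processes.
future::plan("multisession", workers = 4)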

3.3.1 Execution

To optimize hyperparameters or conduct feature selection within nested resampling, you need to create a learner using either:

  • the AutoTuner class, or
  • the mlr3filters::AutoSelect class (not yet implemented)

We use the example from the section “Automating the Tuning” and pass the resulting learner to a resample() call.

library(mlr3)
library(mlr3tuning)

task = tsk("iris")
learner = lrn("classif.rpart")

# Settings for the inner tuning loop: resampling, measure, search space,
# termination criterion and tuning algorithm.
resampling = rsmp("holdout")
measures = msr("classif.ce")
param_set = paradox::ParamSet$new(
  params = list(paradox::ParamDbl$new("cp", lower = 0.001, upper = 0.1)))
terminator = term("evals", n_evals = 5)
tuner = tnr("grid_search", resolution = 10)

# Wrap the learner and its tuning settings into a single Learner object.
at = AutoTuner$new(learner, resampling, measures = measures,
  param_set, terminator, tuner = tuner)
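
As the AutoTuner is a regular Learner, it can also be trained directly, outside of nested resampling; a quick sketch:

# Stand-alone use: tune `cp` via the inner resampling, then refit on the full task.
at$train(task)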

Now construct the resample() call:

# Outer resampling: 3-fold cross-validation around the complete tuning procedure.
resampling_outer = rsmp("cv", folds = 3)
rr = resample(task = task, learner = at, resampling = resampling_outer)
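
The benchmark() variant mentioned above works analogously; a minimal sketch that, as an arbitrary illustration, compares the AutoTuner with an untuned classification tree under the same outer resampling:

# Benchmark design: all combinations of tasks, learners and resamplings.
design = benchmark_grid(
  tasks = task,
  learners = list(at, lrn("classif.rpart")),
  resamplings = resampling_outer
)
bmr = benchmark(design)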

3.3.2 Evaluation

With the created ResampleResult we can now inspect the executed resampling iterations more closely. See also section Resampling for more detailed information about ResampleResult objects.

For example, we can query the aggregated performance result:

rr$aggregate()
## classif.ce 
##       0.08
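
Besides the aggregate, the performance on each individual outer fold can be queried (a sketch; output omitted):

# One row per outer resampling iteration, scored on that iteration's test set.
rr$score(msr("classif.ce"))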

Check for errors that occurred in the folds during execution (if no errors were recorded, this is an empty data.table()):

rr$errors
## Empty data.table (0 rows and 2 cols): iteration,msg
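
Warnings are recorded analogously (assuming the corresponding $warnings field, this is again an empty data.table() if nothing was raised):

rr$warnings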

Or take a look at the confusion matrix of the joined predictions:

rr$prediction()$confusion
##             truth
## response     setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         9
##   virginica       0          3        41