3.3 Nested Resampling

To obtain unbiased performance estimates for learners, every part of model building (preprocessing and model selection steps) must be included in the resampling, i.e., repeated for every pair of training/test data. For steps that themselves require resampling, such as hyperparameter tuning or feature selection (via the wrapper approach), this results in two nested resampling loops.

The graphic above illustrates nested resampling for parameter tuning with 3-fold cross-validation in the outer and 4-fold cross-validation in the inner loop.

In the outer resampling loop, we have three pairs of training/test sets. On each of these outer training sets parameter tuning is done, thereby executing the inner resampling loop. This way, we get one set of selected hyperparameters for each outer training set. Then the learner is fitted on each outer training set using the corresponding selected hyperparameters. Subsequently, we can evaluate the performance of the learner on the outer test sets.
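The procedure just described can be written out by hand to make the two loops explicit. The following is a minimal sketch in base R plus the recommended package rpart, not mlr3 code; the candidate grid `cp_grid` and the helpers `ce()` and `make_folds()` are hypothetical names introduced only for illustration.

```r
library(rpart)  # recommended package that ships with R

set.seed(1)
n = nrow(iris)

# Hypothetical candidate values for the rpart complexity parameter `cp`
cp_grid = c(0.001, 0.01, 0.1)

# Misclassification rate of a tree fitted on `train` and evaluated on `test`
ce = function(train, test, cp) {
  fit = rpart(Species ~ ., data = train, control = rpart.control(cp = cp))
  pred = predict(fit, newdata = test, type = "class")
  mean(pred != test$Species)
}

# Randomly assign each of n rows to one of k folds
make_folds = function(n, k) sample(rep(seq_len(k), length.out = n))

outer_folds = make_folds(n, 3)
outer_scores = numeric(3)
selected_cp = numeric(3)

for (i in 1:3) {
  outer_train = iris[outer_folds != i, ]
  outer_test = iris[outer_folds == i, ]

  # Inner loop: 4-fold CV on the outer training set only
  inner_folds = make_folds(nrow(outer_train), 4)
  inner_ce = sapply(cp_grid, function(cp) {
    mean(sapply(1:4, function(j) {
      ce(outer_train[inner_folds != j, ], outer_train[inner_folds == j, ], cp)
    }))
  })

  # One selected hyperparameter value per outer training set
  selected_cp[i] = cp_grid[which.min(inner_ce)]

  # Refit on the full outer training set, evaluate on the outer test set
  outer_scores[i] = ce(outer_train, outer_test, selected_cp[i])
}

selected_cp
mean(outer_scores)
```

Note that the outer test set is never touched while `cp` is being selected, which is what keeps the outer estimate unbiased.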

In mlr3, you can run nested resampling for free without programming any loops by using the mlr3tuning::AutoTuner class. This works as follows:

  1. Generate a wrapped Learner via class mlr3tuning::AutoTuner or mlr3filters::AutoSelect (not yet implemented).
  2. Specify all required settings - see section “Automating the Tuning” for help.
  3. Call function resample() or benchmark() with the created Learner.
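The three steps above might look roughly as follows. This is a sketch that assumes a reasonably recent mlr3tuning release in which `AutoTuner$new()` accepts the named arguments `tuner`, `measure`, `terminator` and `search_space`; older releases used different argument names. The task, learner, tuned parameter (`cp`) and all budgets are illustrative choices, not prescribed by this section.

```r
library(mlr3)
library(mlr3tuning)
library(paradox)

set.seed(1)
task = tsk("iris")

# Steps 1 and 2: wrap the learner in an AutoTuner and specify the tuning settings
at = AutoTuner$new(
  tuner = tnr("grid_search", resolution = 5),
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 4),        # inner resampling
  measure = msr("classif.ce"),
  terminator = trm("evals", n_evals = 5),
  search_space = ps(cp = p_dbl(lower = 0.001, upper = 0.1))
)

# Step 3: pass the wrapped learner to resample() with the outer resampling
rr = resample(task, at, rsmp("cv", folds = 3))
rr$aggregate(msr("classif.ce"))
```

Because the AutoTuner behaves like any other Learner, swapping the outer `rsmp("cv", folds = 3)` for another strategy requires no change to the tuning setup.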

You can freely combine different inner and outer resampling strategies.

A common setup is prediction and performance evaluation on a fixed outer test set. This can be achieved by passing a holdout Resampling (rsmp("holdout")) as the outer resampling strategy to either resample() or benchmark().

The inner resampling strategy could be a cross-validation (rsmp("cv")). Because the sizes of the outer training sets might differ, the inner resampling is, by default, instantiated once for every outer training set.

Note that nested resampling is computationally expensive. For this reason, the examples shown below use relatively small search spaces and a low number of resampling iterations. In practice, you normally have to increase both, so you might want to have a look at the section on Parallelization.

3.3.2 Evaluation

With the created ResampleResult we can now inspect the executed resampling iterations more closely. See the section on Resampling for more detailed information about ResampleResult objects.

For example, we can query the aggregated performance result:

## classif.ce 
##       0.06

Check for any errors in the folds during execution (if no warnings or errors were recorded, this is an empty data.table()):

## Empty data.table (0 rows and 2 cols): iteration,msg

Or take a look at the confusion matrix of the joined predictions:

##             truth
## response     setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         45         4
##   virginica       0          5        46
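Outputs like the three above come from accessors on the ResampleResult. The sketch below shows the calls involved; to keep it quick and self-contained it builds a ResampleResult named `rr` (an assumed name) from a plain, untuned learner rather than an AutoTuner, so its numbers will not match the outputs shown above.

```r
library(mlr3)

set.seed(1)
task = tsk("iris")

# A plain learner stands in for the AutoTuner; the accessors are the same
rr = resample(task, lrn("classif.rpart"), rsmp("cv", folds = 3))

# Aggregated performance over all outer iterations
rr$aggregate(msr("classif.ce"))

# Errors recorded during execution (empty data.table if there were none)
rr$errors

# Confusion matrix of the joined predictions of all iterations
rr$prediction()$confusion
```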