3.3 Nested Resampling

Evaluating a machine learning model often requires an additional layer of resampling when hyperparameters or features have to be selected. Nested resampling separates these model selection steps from the process of estimating the performance of the model. If the same data is used for the model selection steps and the evaluation of the model itself, the resulting performance estimate might be severely biased. One reason is that the repeated evaluation of the model on the test data can leak information about the structure of the test data into the model, which results in over-optimistic performance estimates. Keep in mind that nested resampling is a statistical procedure to estimate the predictive performance of the model trained on the full dataset. It is not a procedure to select optimal hyperparameters. The resampling produces many hyperparameter configurations, which should not be used to construct a final model (Simon 2007).

The graphic above illustrates nested resampling for hyperparameter tuning with 3-fold cross-validation in the outer and 4-fold cross-validation in the inner loop.

In the outer resampling loop, we have three pairs of training/test sets. On each of these outer training sets parameter tuning is done, thereby executing the inner resampling loop. This way, we get one set of selected hyperparameters for each outer training set. Then the learner is fitted on each outer training set using the corresponding selected hyperparameters. Subsequently, we can evaluate the performance of the learner on the outer test sets. The aggregated performance on the outer test sets is the unbiased performance estimate of the model.
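
The nested structure described above can be sketched in base R, without any mlr3 code. The snippet is only an illustration under assumptions: the dataset size `n = 120`, the helper `make_folds()` and all variable names are hypothetical and were chosen for this example.

```r
set.seed(1)
n = 120  # hypothetical number of observations

# Hypothetical helper: randomly split a vector of row ids into k folds
make_folds = function(ids, k) {
  split(ids, sample(rep(seq_len(k), length.out = length(ids))))
}

# outer loop: 3-fold cross-validation
outer_folds = make_folds(seq_len(n), k = 3)

for (i in seq_along(outer_folds)) {
  outer_test = outer_folds[[i]]
  outer_train = setdiff(seq_len(n), outer_test)

  # inner loop: 4-fold cross-validation on the outer training set only
  inner_folds = make_folds(outer_train, k = 4)
  for (j in seq_along(inner_folds)) {
    inner_test = inner_folds[[j]]
    inner_train = setdiff(outer_train, inner_test)
    # a configuration would be fitted on inner_train and scored on inner_test;
    # the outer test rows never take part in the tuning
    stopifnot(length(intersect(inner_train, outer_test)) == 0)
  }

  cat(sprintf("outer fold %d: %d train rows, %d test rows, %d inner folds\n",
    i, length(outer_train), length(outer_test), length(inner_folds)))
}
```

The check inside the inner loop makes the key property explicit: no row of an outer test set is ever used while tuning on the corresponding outer training set.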

3.3.1 Execution

The previous section examined the optimization of a simple classification tree on the pima task (mlr_tasks_pima). We continue the example and estimate the predictive performance of the model with nested resampling.

We use holdout resampling in the inner resampling loop. The AutoTuner executes the hyperparameter tuning, which is stopped after 5 evaluations. The hyperparameter configurations are proposed by grid search with a resolution of 10.

library("mlr3verse")

learner = lrn("classif.rpart")
resampling = rsmp("holdout")
measure = msr("classif.ce")
search_space = ps(cp = p_dbl(lower = 0.001, upper = 0.1))
terminator = trm("evals", n_evals = 5)
tuner = tnr("grid_search", resolution = 10)

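# pass all arguments by name so the call does not rely on their positional order
at = AutoTuner$new(learner = learner, resampling = resampling, measure = measure,
  terminator = terminator, tuner = tuner, search_space = search_space)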
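
To see which values of cp the grid search can propose, the grid can be reproduced in base R. This sketch assumes that a resolution of 10 yields 10 equally spaced points between the lower and the upper bound, and that the points are visited in random order. The variable names `grid` and `visited` are illustrative only.

```r
# candidate values for cp: 10 equally spaced points in [0.001, 0.1]
grid = seq(0.001, 0.1, length.out = 10)
print(round(grid, 3))

# with a budget of 5 evaluations only half of the grid is visited;
# a random draw mimics the random evaluation order
set.seed(1)
visited = sample(grid, 5)
print(round(sort(visited), 3))
```

Because the terminator stops the tuning after 5 evaluations, each tuning run covers only a random half of the 10 grid points, which is one source of variation between the outer iterations.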

A 3-fold cross-validation is used in the outer resampling loop. On each of the three outer training sets, hyperparameter tuning is done, and we receive three optimized hyperparameter configurations. To execute the nested resampling, we pass the AutoTuner to the resample() function. We have to set store_models = TRUE because we need the AutoTuner models to investigate the inner tuning.

task = tsk("pima")
outer_resampling = rsmp("cv", folds = 3)

rr = resample(task, at, outer_resampling, store_models = TRUE)
## INFO  [08:36:11.799] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerGridSearch>' and '<TerminatorEvals> [n_evals=5]' 
## INFO  [08:36:11.847] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:11.981] [bbotk] Result of batch 1: 
## INFO  [08:36:11.985] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:11.985] [bbotk]  0.067     0.2573 ebc46036-00f3-4763-8c45-f227ed5225a8 
## INFO  [08:36:11.987] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:12.099] [bbotk] Result of batch 2: 
## INFO  [08:36:12.102] [bbotk]   cp classif.ce                                uhash 
## INFO  [08:36:12.102] [bbotk]  0.1     0.2924 8882bd1b-6da9-4c18-bec6-e1cb59b3c8fa 
## INFO  [08:36:12.103] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:12.220] [bbotk] Result of batch 3: 
## INFO  [08:36:12.222] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:12.222] [bbotk]  0.001     0.2865 3f1b965f-a65d-4189-89c6-49d2942a3e6d 
## INFO  [08:36:12.224] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:12.328] [bbotk] Result of batch 4: 
## INFO  [08:36:12.330] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:12.330] [bbotk]  0.023     0.2573 373cdfbb-c50d-4eb7-85f8-dc48f22970ba 
## INFO  [08:36:12.332] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:12.438] [bbotk] Result of batch 5: 
## INFO  [08:36:12.440] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:12.440] [bbotk]  0.056     0.2573 b0e9da36-84bb-4aef-b66a-8ee9fe6ee7f0 
## INFO  [08:36:12.448] [bbotk] Finished optimizing after 5 evaluation(s) 
## INFO  [08:36:12.449] [bbotk] Result: 
## INFO  [08:36:12.450] [bbotk]     cp learner_param_vals  x_domain classif.ce 
## INFO  [08:36:12.450] [bbotk]  0.067          <list[2]> <list[1]>     0.2573 
## INFO  [08:36:12.516] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerGridSearch>' and '<TerminatorEvals> [n_evals=5]' 
## INFO  [08:36:12.520] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:12.628] [bbotk] Result of batch 1: 
## INFO  [08:36:12.630] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:12.630] [bbotk]  0.056     0.2515 2842472d-34b0-4305-9ec6-6d0a34ddf069 
## INFO  [08:36:12.632] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:12.738] [bbotk] Result of batch 2: 
## INFO  [08:36:12.740] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:12.740] [bbotk]  0.067     0.2515 7a814ffc-084d-4575-aaef-6b82b5aa373d 
## INFO  [08:36:12.742] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:12.836] [bbotk] Result of batch 3: 
## INFO  [08:36:12.838] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:12.838] [bbotk]  0.001     0.2749 44cb6e32-7b1c-4b9a-adac-84af79d6d443 
## INFO  [08:36:12.840] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:12.954] [bbotk] Result of batch 4: 
## INFO  [08:36:12.956] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:12.956] [bbotk]  0.012     0.2807 4637d6a5-2a55-45c3-9948-90b05944412d 
## INFO  [08:36:12.957] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:13.059] [bbotk] Result of batch 5: 
## INFO  [08:36:13.062] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:13.062] [bbotk]  0.078     0.2515 329ee2e8-fcb6-423b-85b7-4a8e7f8f5889 
## INFO  [08:36:13.071] [bbotk] Finished optimizing after 5 evaluation(s) 
## INFO  [08:36:13.072] [bbotk] Result: 
## INFO  [08:36:13.074] [bbotk]     cp learner_param_vals  x_domain classif.ce 
## INFO  [08:36:13.074] [bbotk]  0.056          <list[2]> <list[1]>     0.2515 
## INFO  [08:36:13.141] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerGridSearch>' and '<TerminatorEvals> [n_evals=5]' 
## INFO  [08:36:13.144] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:13.240] [bbotk] Result of batch 1: 
## INFO  [08:36:13.242] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:13.242] [bbotk]  0.067     0.2398 f0a8eeb7-62fb-496b-9f3f-acbd62976887 
## INFO  [08:36:13.244] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:13.348] [bbotk] Result of batch 2: 
## INFO  [08:36:13.350] [bbotk]   cp classif.ce                                uhash 
## INFO  [08:36:13.350] [bbotk]  0.1     0.2398 e2a6b89d-e777-446e-bf58-342f182ffc29 
## INFO  [08:36:13.351] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:13.451] [bbotk] Result of batch 3: 
## INFO  [08:36:13.453] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:13.453] [bbotk]  0.089     0.2398 1f66fb91-0957-4613-b595-620a7a021b87 
## INFO  [08:36:13.454] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:13.556] [bbotk] Result of batch 4: 
## INFO  [08:36:13.558] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:13.558] [bbotk]  0.012      0.193 2c1a57ec-ea15-48f1-a540-07ddbfcc2fee 
## INFO  [08:36:13.560] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:13.669] [bbotk] Result of batch 5: 
## INFO  [08:36:13.671] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:13.671] [bbotk]  0.001     0.1696 0cb5ce45-3155-4e7e-8a11-c4c4daecbb4f 
## INFO  [08:36:13.678] [bbotk] Finished optimizing after 5 evaluation(s) 
## INFO  [08:36:13.679] [bbotk] Result: 
## INFO  [08:36:13.681] [bbotk]     cp learner_param_vals  x_domain classif.ce 
## INFO  [08:36:13.681] [bbotk]  0.001          <list[2]> <list[1]>     0.1696

You can freely combine different inner and outer resampling strategies. Nested resampling is not restricted to hyperparameter tuning. You can swap the AutoTuner for an AutoFSelector and estimate the performance of a model that is fitted on an optimized feature subset.
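
As a rough sketch of this swap, the snippet below wraps the same classification tree in an AutoFSelector instead of an AutoTuner. It assumes that mlr3fselect is attached through mlr3verse; random search is chosen here only for illustration, and the names `afs` and `rr_fs` are hypothetical.

```r
library("mlr3verse")

# AutoFSelector analogous to the AutoTuner; all arguments are passed by name
afs = AutoFSelector$new(
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = trm("evals", n_evals = 5),
  fselector = fs("random_search")  # assumption: any other fselector works as well
)

# outer loop: 3-fold cross-validation, as in the tuning example
rr_fs = resample(tsk("pima"), afs, rsmp("cv", folds = 3), store_models = TRUE)

# aggregated classification error on the outer test sets
rr_fs$aggregate()
```

The outer loop is unchanged; only the object that performs the inner model selection differs.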

3.3.2 Evaluation

With the created ResampleResult we can now inspect the executed resampling iterations more closely. See the section on Resampling for more detailed information about ResampleResult objects.

We check the inner tuning results for stable hyperparameters. This means that the selected hyperparameters should not vary too much. We might observe unstable models in this example because the small data set and the low number of resampling iterations might introduce too much randomness. Usually, we aim for the selection of stable hyperparameters for all outer training sets.

extract_inner_tuning_results(rr)
##       cp learner_param_vals  x_domain classif.ce
## 1: 0.056          <list[2]> <list[1]>     0.2515
## 2: 0.067          <list[2]> <list[1]>     0.2573
## 3: 0.001          <list[2]> <list[1]>     0.1696

Next, we compare the predictive performance estimated on the outer resampling to that of the inner resampling. A substantially lower predictive performance on the outer resampling (here, a higher classification error) indicates that the models with the optimized hyperparameters overfit the data.

rr$score()
##                 task task_id         learner          learner_id
## 1: <TaskClassif[46]>    pima <AutoTuner[38]> classif.rpart.tuned
## 2: <TaskClassif[46]>    pima <AutoTuner[38]> classif.rpart.tuned
## 3: <TaskClassif[46]>    pima <AutoTuner[38]> classif.rpart.tuned
##            resampling resampling_id iteration              prediction
## 1: <ResamplingCV[19]>            cv         1 <PredictionClassif[19]>
## 2: <ResamplingCV[19]>            cv         2 <PredictionClassif[19]>
## 3: <ResamplingCV[19]>            cv         3 <PredictionClassif[19]>
##    classif.ce
## 1:     0.2539
## 2:     0.3047
## 3:     0.2969
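
To make this comparison concrete, the errors printed above can be averaged in base R. The numbers are copied from the two outputs; the variable names are illustrative, and the means are used so that no pairing between inner and outer iterations has to be assumed.

```r
# values copied from the printed outputs above
inner_ce = c(0.2515, 0.2573, 0.1696)  # best inner classif.ce per outer iteration
outer_ce = c(0.2539, 0.3047, 0.2969)  # classif.ce on the outer test sets

mean_inner = mean(inner_ce)
mean_outer = mean(outer_ce)

# a positive gap means the inner (tuning) estimate is more optimistic
gap = mean_outer - mean_inner
print(round(c(inner = mean_inner, outer = mean_outer, gap = gap), 4))
```

In this run the mean outer error (about 0.285) is higher than the mean inner error (about 0.226), which is the direction expected when the inner estimates are optimistic.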

The aggregated performance of all outer resampling iterations is essentially the unbiased performance estimate of the model with the optimal hyperparameters found by grid search.

rr$aggregate()
## classif.ce 
##     0.2852

Note that nested resampling is computationally expensive. For this reason, we use a relatively small number of hyperparameter configurations and a low number of resampling iterations in this example. In practice, you normally have to increase both. As this is computationally intensive, you might want to have a look at the section on Parallelization.
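
As a minimal sketch of how the computation could be spread over several cores, the snippet below sets a plan from the future package before calling resample(). It assumes that the future package is installed and that parallelizing the outer loop while keeping the inner tuning sequential is a sensible choice for the machine at hand.

```r
library("mlr3verse")

# outer resampling iterations in parallel, inner tuning sequential
future::plan(list("multisession", "sequential"))

at = AutoTuner$new(
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = trm("evals", n_evals = 5),
  tuner = tnr("grid_search", resolution = 10),
  search_space = ps(cp = p_dbl(lower = 0.001, upper = 0.1))
)

rr = resample(tsk("pima"), at, rsmp("cv", folds = 3))

# switch back to sequential execution afterwards
future::plan("sequential")
```

For an example as small as this one, the overhead of starting worker processes can outweigh the gain; the benefit shows with larger grids and more resampling iterations.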

3.3.3 Final Model

We can use the AutoTuner to tune the hyperparameters of our learner and fit the final model on the full data set.

at$train(task)
## INFO  [08:36:13.982] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerGridSearch>' and '<TerminatorEvals> [n_evals=5]' 
## INFO  [08:36:13.986] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:14.088] [bbotk] Result of batch 1: 
## INFO  [08:36:14.090] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:14.090] [bbotk]  0.078     0.2734 47bba7da-f502-474d-bfce-205732715824 
## INFO  [08:36:14.092] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:14.197] [bbotk] Result of batch 2: 
## INFO  [08:36:14.200] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:14.200] [bbotk]  0.045     0.2734 b7caf1dd-c315-4906-aa0d-c7c2e9d0419b 
## INFO  [08:36:14.201] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:14.306] [bbotk] Result of batch 3: 
## INFO  [08:36:14.308] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:14.308] [bbotk]  0.001     0.3008 55b41b1d-90c7-4306-aeca-b96669855fe2 
## INFO  [08:36:14.310] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:14.421] [bbotk] Result of batch 4: 
## INFO  [08:36:14.423] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:14.423] [bbotk]  0.067     0.2734 d25d6cb8-7c66-4cfe-9c2f-6077c7de340e 
## INFO  [08:36:14.425] [bbotk] Evaluating 1 configuration(s) 
## INFO  [08:36:14.530] [bbotk] Result of batch 5: 
## INFO  [08:36:14.532] [bbotk]     cp classif.ce                                uhash 
## INFO  [08:36:14.532] [bbotk]  0.012     0.2578 7ad1e6d3-5463-48a7-9060-b1889ee0ff47 
## INFO  [08:36:14.539] [bbotk] Finished optimizing after 5 evaluation(s) 
## INFO  [08:36:14.540] [bbotk] Result: 
## INFO  [08:36:14.542] [bbotk]     cp learner_param_vals  x_domain classif.ce 
## INFO  [08:36:14.542] [bbotk]  0.012          <list[2]> <list[1]>     0.2578

The trained model can now be used to make predictions on new data. A common mistake is to report the performance estimated on the resampling sets on which the tuning was performed (at$tuning_result$classif.ce) as the model’s performance. Instead, we report the performance estimated with nested resampling as the performance of the model.
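
The following sketch shows both steps on a freshly trained AutoTuner. The first five rows of the task serve as a stand-in for new data, which is an assumption made only to keep the example self-contained; real new observations would come from outside the training data.

```r
library("mlr3verse")

task = tsk("pima")
at = AutoTuner$new(
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = trm("evals", n_evals = 5),
  tuner = tnr("grid_search", resolution = 10),
  search_space = ps(cp = p_dbl(lower = 0.001, upper = 0.1))
)
at$train(task)

# selected configuration with its inner estimate; not the number to report
at$tuning_result

# stand-in for new data: five rows of the task without the target column
newdata = task$data(rows = 1:5, cols = task$feature_names)
prediction = at$predict_newdata(newdata)
prediction$response
```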

References

Simon, Richard. 2007. “Resampling Strategies for Model Assessment and Selection.” In Fundamentals of Data Mining in Genomics and Proteomics, edited by Werner Dubitzky, Martin Granzow, and Daniel Berrar, 173–86. Boston, MA: Springer US. https://doi.org/10.1007/978-0-387-47509-7_8.