4 Model Optimization

In machine learning, when you are dissatisfied with the performance of a model, you might ask yourself how to best improve the model: * Can it be improved by tweaking the hyperparameters of the learner, i.e. the configuration options that affect its behavior? * Or, should you just use a completely different learner for this particular task? This chapter might help answer this question.

Model Tuning

Machine learning algorithms have default values set for their hyperparameters. In many cases, these hyperparameters need to be changed by the user to achieve optimal performance on the given dataset. While you can certainly search for hyperparameter settings that improve performance manually, we do not recommend this approach as it is tedious and rarely leads to the best performance. Fortunately, the mlr3 ecosystem provides packages and tools for automated tuning. In order to tune a machine learning algorithm, you have to specify (1) the search space, (2) the optimization algorithm (i.e. tuning method), (3) an evaluation method (i.e. a resampling strategy), and (4) a performance measure.

In the tuning part, we will have a look at:

We will use the mlr3tuning package, which supports common tuning operations.

Feature Selection

Tuning the hyperparameters is only one way of improving the performance of your model. The second part of this chapter explains feature selection, also known as variable or descriptor selection. Feature selection is the process of finding the feature subset that is most relevant with respect to the prediction or for which the learner fits a model with the highest performance. Apart from improving model performance, there are additional reasons to perform feature selection:

  • enhance the interpretability of the model,
  • speed up model fitting, or
  • eliminate the need to collect lots of expensive features.

Here, we mostly focus on feature selection as a means for improving model performance.

There are different approaches to identify the relevant features. In the feature selection part, we describe three methods:

  • Filter algorithms select features independently of the learner by scoring the different features.
  • Variable importance filters select features that are important according the model induced by a learner.
  • Wrapper methods iteratively select features to optimize a performance measure, each time fitting and evaluating a model with a different subset of features.

Note that filters operate independently of learners. Variable importance filters rely on the learner to extract information on feature importance from a trained model, for example by inspecting a learned decision tree and returning the features that are used in the first few levels. The obtained importance values can be used to subset the data, which can then be used to train a learner. Wrapper methods can be used with any learner but need to train the learner potentially many times, making this the most expensive method.

Nested Resampling

For hyperparameter tuning, a normal resampling (e.g. a cross-validation) is no longer sufficient to ensure an unbiased evaluation. Consider the following thought experiment to gain intuition for why this is the case. Suppose a learner has a hyperparameter that has no real effect on the fitted model, but only introduces random noise into the predictions. Evaluating different values for this hyperparameter, one will show the best performance (purely randomly). This is the hyperparameter value that will be chosen as the best, although the hyperparameter has no real effect. To discover this, another separate validation set is required – it will reveal that the “optimized” setting really does not perform better than anything else.

We need a nested resampling to ensure unbiased estimates of the generalization error during hyperparameter optimization. We discuss the following aspects in this part:

4.1 Hyperparameter Tuning

Hyperparameters are the parameters of the learners that control how a model is fit to the data. They are sometimes called second-level or second-order parameters of machine learning – the parameters of the models are the first-order parameters and “fit” to the data during model training. The hyerparameters of a learner can have a major impact on the performance of a learned model, but are often only optimized in an ad-hoc manner or not at all. This process is often called model ‘tuning’.

Hyperparameter tuning is supported via the mlr3tuning extension package. Below you can find an illustration of the general process:

At the heart of mlr3tuning are the R6 classes

4.1.1 The TuningInstance* Classes

We will examine the optimization of a simple classification tree on the Pima Indian Diabetes data set as an introductory example here.

library("mlr3verse")
task = tsk("pima")
print(task)
## <TaskClassif:pima> (768 x 9)
## * Target: diabetes
## * Properties: twoclass
## * Features (8):
##   - dbl (8): age, glucose, insulin, mass, pedigree, pregnant, pressure,
##     triceps

We use the rpart classification tree and choose a subset of the hyperparameters we want to tune. This is often referred to as the “tuning space”. First, let’s look at all the hyperparameters that are available. Information on what they do can be found in the documentation of the learner.

learner = lrn("classif.rpart")
learner$param_set
## <ParamSet>
##                 id    class lower upper nlevels        default value
##  1:             cp ParamDbl     0     1     Inf           0.01      
##  2:     keep_model ParamLgl    NA    NA       2          FALSE      
##  3:     maxcompete ParamInt     0   Inf     Inf              4      
##  4:       maxdepth ParamInt     1    30      30             30      
##  5:   maxsurrogate ParamInt     0   Inf     Inf              5      
##  6:      minbucket ParamInt     1   Inf     Inf <NoDefault[3]>      
##  7:       minsplit ParamInt     1   Inf     Inf             20      
##  8: surrogatestyle ParamInt     0     1       2              0      
##  9:   usesurrogate ParamInt     0     2       3              2      
## 10:           xval ParamInt     0   Inf     Inf             10     0

Here, we opt to tune two hyperparameters:

  • The complexity hyperparameter cp that controls when the learner considers introducing another branch.
  • The minsplit hyperparameter that controls how many observations must be present in a leaf for another split to be attempted.

The tuning space needs to be bounded with lower and upper bounds for the values of the hyperparameters:

search_space = ps(
  cp = p_dbl(lower = 0.001, upper = 0.1),
  minsplit = p_int(lower = 1, upper = 10)
)
search_space
## <ParamSet>
##          id    class lower upper nlevels        default value
## 1:       cp ParamDbl 0.001   0.1     Inf <NoDefault[3]>      
## 2: minsplit ParamInt 1.000  10.0      10 <NoDefault[3]>

The bounds are usually set based on experience.

Next, we need to specify how to evaluate the performance of a trained model. For this, we need to choose a resampling strategy and a performance measure.

hout = rsmp("holdout")
measure = msr("classif.ce")

Finally, we have to specify the budget available for tuning. This is a crucial step, as exhaustively evaluating all possible hyperparameter configurations is usually not feasible. mlr3 allows to specify complex termination criteria by selecting one of the available Terminators:

For this short introduction, we specify a budget of 20 iterations and then put everything together into a TuningInstanceSingleCrit:

## Loading required package: paradox
evals20 = trm("evals", n_evals = 20)

instance = TuningInstanceSingleCrit$new(
  task = task,
  learner = learner,
  resampling = hout,
  measure = measure,
  search_space = search_space,
  terminator = evals20
)
instance
## <TuningInstanceSingleCrit>
## * State:  Not optimized
## * Objective: <ObjectiveTuning:classif.rpart_on_pima>
## * Search Space:
## <ParamSet>
##          id    class lower upper nlevels        default value
## 1:       cp ParamDbl 0.001   0.1     Inf <NoDefault[3]>      
## 2: minsplit ParamInt 1.000  10.0      10 <NoDefault[3]>      
## * Terminator: <TerminatorEvals>
## * Terminated: FALSE
## * Archive:
## <ArchiveTuning>
## Null data.table (0 rows and 0 cols)

To start the tuning, we still need to select how the optimization should take place. In other words, we need to choose the optimization algorithm via the Tuner class.

4.1.2 The Tuner Class

The following algorithms are currently implemented in mlr3tuning:

If you’re interested in learning more about these approaches, the Wikipedia page on hyperparameter optimization is a good place to start.

In this example, we will use a simple grid search with a grid resolution of 5.

tuner = tnr("grid_search", resolution = 5)

As we have only numeric parameters, TunerGridSearch will create an equidistant grid between the respective upper and lower bounds. Our two-dimensional grid of resolution 5 consists of \(5^2 = 25\) configurations. Each configuration is a distinct setting of hyperparameter values for the previously defined Learner which is then fitted to the task and evaluated using the provided Resampling. All configurations will be examined by the tuner (in a random order), until either all configurations are evaluated or the Terminator signals that the budget is exhausted, i.e. here the tuner will stop after evaluating 20 of the 25 total configurations.

4.1.3 Triggering the Tuning

To start the tuning, we simply pass the TuningInstanceSingleCrit to the $optimize() method of the initialized Tuner. The tuner proceeds as follows:

  1. The Tuner proposes at least one hyperparameter configuration to evaluate (the Tuner may propose multiple points to be able to evaluate them in parallel, which can be controlled via the setting batch_size).
  2. For each configuration, the given Learner is fitted on the Task and evaluated using the provided Resampling. All evaluations are stored in the archive of the TuningInstanceSingleCrit.
  3. The Terminator is queried if the budget is exhausted. If the budget is not exhausted, go back to 1), else terminate.
  4. Determine the configurations with the best observed performance from the archive.
  5. Store the best configurations as result in the tuning instance object. The best hyperparameter settings ($result_learner_param_vals) and the corresponding measured performance ($result_y) can be retrieved from the tuning instance.
tuner$optimize(instance)
## INFO  [20:20:58.654] [bbotk] Starting to optimize 2 parameter(s) with '<OptimizerGridSearch>' and '<TerminatorEvals> [n_evals=20, k=0]' 
## INFO  [20:20:58.697] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:20:58.893] [bbotk] Result of batch 1: 
## INFO  [20:20:58.895] [bbotk]      cp minsplit classif.ce runtime_learners 
## INFO  [20:20:58.895] [bbotk]  0.0505        8     0.2031            0.021 
## INFO  [20:20:58.895] [bbotk]                                 uhash 
## INFO  [20:20:58.895] [bbotk]  df90c7fb-1bb0-46d2-a44d-ff440d5039f4 
## INFO  [20:20:58.897] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:20:58.990] [bbotk] Result of batch 2: 
## INFO  [20:20:58.993] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:20:58.993] [bbotk]  0.02575        8     0.2148            0.015 
## INFO  [20:20:58.993] [bbotk]                                 uhash 
## INFO  [20:20:58.993] [bbotk]  482b4ea9-c3f9-45a7-ab28-66caf6ec2f88 
## INFO  [20:20:58.994] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:20:59.095] [bbotk] Result of batch 3: 
## INFO  [20:20:59.097] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:20:59.097] [bbotk]  0.02575       10     0.2148            0.012 
## INFO  [20:20:59.097] [bbotk]                                 uhash 
## INFO  [20:20:59.097] [bbotk]  b3142b2e-e442-4276-bfc9-fbd608a9f6dd 
## INFO  [20:20:59.099] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:20:59.190] [bbotk] Result of batch 4: 
## INFO  [20:20:59.193] [bbotk]     cp minsplit classif.ce runtime_learners 
## INFO  [20:20:59.193] [bbotk]  0.001        8     0.2617            0.015 
## INFO  [20:20:59.193] [bbotk]                                 uhash 
## INFO  [20:20:59.193] [bbotk]  41d887d1-77b6-4406-9151-3b08befbf78a 
## INFO  [20:20:59.194] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:20:59.291] [bbotk] Result of batch 5: 
## INFO  [20:20:59.293] [bbotk]   cp minsplit classif.ce runtime_learners                                uhash 
## INFO  [20:20:59.293] [bbotk]  0.1       10     0.2422            0.021 ba561e0a-5f15-4d2c-bf34-d2814ff1db72 
## INFO  [20:20:59.294] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:20:59.386] [bbotk] Result of batch 6: 
## INFO  [20:20:59.388] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:20:59.388] [bbotk]  0.07525        8     0.2031            0.013 
## INFO  [20:20:59.388] [bbotk]                                 uhash 
## INFO  [20:20:59.388] [bbotk]  78b04d22-183d-4003-9359-7e8a22492196 
## INFO  [20:20:59.390] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:20:59.567] [bbotk] Result of batch 7: 
## INFO  [20:20:59.569] [bbotk]   cp minsplit classif.ce runtime_learners                                uhash 
## INFO  [20:20:59.569] [bbotk]  0.1        1     0.2422            0.013 2fb5b192-9085-4619-a32a-ee53777ef45d 
## INFO  [20:20:59.570] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:20:59.662] [bbotk] Result of batch 8: 
## INFO  [20:20:59.664] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:20:59.664] [bbotk]  0.07525        1     0.2031            0.015 
## INFO  [20:20:59.664] [bbotk]                                 uhash 
## INFO  [20:20:59.664] [bbotk]  cf6d3107-408b-4ed5-a3dc-b1991f33ff06 
## INFO  [20:20:59.666] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:20:59.757] [bbotk] Result of batch 9: 
## INFO  [20:20:59.759] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:20:59.759] [bbotk]  0.07525       10     0.2031            0.013 
## INFO  [20:20:59.759] [bbotk]                                 uhash 
## INFO  [20:20:59.759] [bbotk]  30fdcad8-7568-47b7-8c51-9baf8875aabe 
## INFO  [20:20:59.760] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:20:59.862] [bbotk] Result of batch 10: 
## INFO  [20:20:59.864] [bbotk]   cp minsplit classif.ce runtime_learners                                uhash 
## INFO  [20:20:59.864] [bbotk]  0.1        5     0.2422             0.02 243e2e71-7d86-4379-a823-58a6c5926fb9 
## INFO  [20:20:59.866] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:20:59.957] [bbotk] Result of batch 11: 
## INFO  [20:20:59.959] [bbotk]      cp minsplit classif.ce runtime_learners 
## INFO  [20:20:59.959] [bbotk]  0.0505        5     0.2031            0.014 
## INFO  [20:20:59.959] [bbotk]                                 uhash 
## INFO  [20:20:59.959] [bbotk]  8ed3f485-6098-419e-ac37-6419d4ad7b09 
## INFO  [20:20:59.961] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:00.063] [bbotk] Result of batch 12: 
## INFO  [20:21:00.065] [bbotk]     cp minsplit classif.ce runtime_learners 
## INFO  [20:21:00.065] [bbotk]  0.001        5     0.2578            0.016 
## INFO  [20:21:00.065] [bbotk]                                 uhash 
## INFO  [20:21:00.065] [bbotk]  83b41136-ca1a-40b0-bdfe-f47dd48601b9 
## INFO  [20:21:00.067] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:00.159] [bbotk] Result of batch 13: 
## INFO  [20:21:00.161] [bbotk]     cp minsplit classif.ce runtime_learners 
## INFO  [20:21:00.161] [bbotk]  0.001        1     0.2812            0.016 
## INFO  [20:21:00.161] [bbotk]                                 uhash 
## INFO  [20:21:00.161] [bbotk]  ea58074a-f7b6-413b-ad6a-86d08163778c 
## INFO  [20:21:00.162] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:00.255] [bbotk] Result of batch 14: 
## INFO  [20:21:00.257] [bbotk]      cp minsplit classif.ce runtime_learners 
## INFO  [20:21:00.257] [bbotk]  0.0505        1     0.2031            0.013 
## INFO  [20:21:00.257] [bbotk]                                 uhash 
## INFO  [20:21:00.257] [bbotk]  7b196c76-ebb3-467b-8538-dc9aef6f6f48 
## INFO  [20:21:00.259] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:00.355] [bbotk] Result of batch 15: 
## INFO  [20:21:00.357] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:21:00.357] [bbotk]  0.07525        5     0.2031            0.013 
## INFO  [20:21:00.357] [bbotk]                                 uhash 
## INFO  [20:21:00.357] [bbotk]  b5b833a5-02b1-497d-859d-a0ec1cba5b71 
## INFO  [20:21:00.359] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:00.455] [bbotk] Result of batch 16: 
## INFO  [20:21:00.457] [bbotk]     cp minsplit classif.ce runtime_learners 
## INFO  [20:21:00.457] [bbotk]  0.001       10     0.2539            0.014 
## INFO  [20:21:00.457] [bbotk]                                 uhash 
## INFO  [20:21:00.457] [bbotk]  8a772760-e5c2-4642-b276-cb7336255d2f 
## INFO  [20:21:00.459] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:00.563] [bbotk] Result of batch 17: 
## INFO  [20:21:00.565] [bbotk]      cp minsplit classif.ce runtime_learners 
## INFO  [20:21:00.565] [bbotk]  0.0505       10     0.2031            0.014 
## INFO  [20:21:00.565] [bbotk]                                 uhash 
## INFO  [20:21:00.565] [bbotk]  4875ecca-3bb9-4271-9b8e-5b6c49c0f8d5 
## INFO  [20:21:00.566] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:00.657] [bbotk] Result of batch 18: 
## INFO  [20:21:00.659] [bbotk]   cp minsplit classif.ce runtime_learners                                uhash 
## INFO  [20:21:00.659] [bbotk]  0.1        3     0.2422            0.013 ef551e26-4bad-4167-b6e6-67660a2dc829 
## INFO  [20:21:00.661] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:00.756] [bbotk] Result of batch 19: 
## INFO  [20:21:00.758] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:21:00.758] [bbotk]  0.02575        5     0.2148            0.014 
## INFO  [20:21:00.758] [bbotk]                                 uhash 
## INFO  [20:21:00.758] [bbotk]  4d3300e0-c717-4138-9e7d-a875368eed3b 
## INFO  [20:21:00.759] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:00.859] [bbotk] Result of batch 20: 
## INFO  [20:21:00.861] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:21:00.861] [bbotk]  0.02575        3     0.2148            0.013 
## INFO  [20:21:00.861] [bbotk]                                 uhash 
## INFO  [20:21:00.861] [bbotk]  7127a94b-362a-4a8e-8cbf-65c430474266 
## INFO  [20:21:00.868] [bbotk] Finished optimizing after 20 evaluation(s) 
## INFO  [20:21:00.868] [bbotk] Result: 
## INFO  [20:21:00.870] [bbotk]      cp minsplit learner_param_vals  x_domain classif.ce 
## INFO  [20:21:00.870] [bbotk]  0.0505        8          <list[3]> <list[2]>     0.2031
##        cp minsplit learner_param_vals  x_domain classif.ce
## 1: 0.0505        8          <list[3]> <list[2]>     0.2031
instance$result_learner_param_vals
## $xval
## [1] 0
## 
## $cp
## [1] 0.0505
## 
## $minsplit
## [1] 8
instance$result_y
## classif.ce 
##     0.2031

You can investigate all of the evaluations that were performed; they are stored in the archive of the TuningInstanceSingleCrit and can be accessed by using as.data.table():

as.data.table(instance$archive)
##          cp minsplit classif.ce x_domain_cp x_domain_minsplit runtime_learners
##  1: 0.05050        8     0.2031     0.05050                 8            0.021
##  2: 0.02575        8     0.2148     0.02575                 8            0.015
##  3: 0.02575       10     0.2148     0.02575                10            0.012
##  4: 0.00100        8     0.2617     0.00100                 8            0.015
##  5: 0.10000       10     0.2422     0.10000                10            0.021
##  6: 0.07525        8     0.2031     0.07525                 8            0.013
##  7: 0.10000        1     0.2422     0.10000                 1            0.013
##  8: 0.07525        1     0.2031     0.07525                 1            0.015
##  9: 0.07525       10     0.2031     0.07525                10            0.013
## 10: 0.10000        5     0.2422     0.10000                 5            0.020
## 11: 0.05050        5     0.2031     0.05050                 5            0.014
## 12: 0.00100        5     0.2578     0.00100                 5            0.016
## 13: 0.00100        1     0.2812     0.00100                 1            0.016
## 14: 0.05050        1     0.2031     0.05050                 1            0.013
## 15: 0.07525        5     0.2031     0.07525                 5            0.013
## 16: 0.00100       10     0.2539     0.00100                10            0.014
## 17: 0.05050       10     0.2031     0.05050                10            0.014
## 18: 0.10000        3     0.2422     0.10000                 3            0.013
## 19: 0.02575        5     0.2148     0.02575                 5            0.014
## 20: 0.02575        3     0.2148     0.02575                 3            0.013
##               timestamp batch_nr      resample_result
##  1: 2021-11-25 20:20:58        1 <ResampleResult[22]>
##  2: 2021-11-25 20:20:58        2 <ResampleResult[22]>
##  3: 2021-11-25 20:20:59        3 <ResampleResult[22]>
##  4: 2021-11-25 20:20:59        4 <ResampleResult[22]>
##  5: 2021-11-25 20:20:59        5 <ResampleResult[22]>
##  6: 2021-11-25 20:20:59        6 <ResampleResult[22]>
##  7: 2021-11-25 20:20:59        7 <ResampleResult[22]>
##  8: 2021-11-25 20:20:59        8 <ResampleResult[22]>
##  9: 2021-11-25 20:20:59        9 <ResampleResult[22]>
## 10: 2021-11-25 20:20:59       10 <ResampleResult[22]>
## 11: 2021-11-25 20:20:59       11 <ResampleResult[22]>
## 12: 2021-11-25 20:21:00       12 <ResampleResult[22]>
## 13: 2021-11-25 20:21:00       13 <ResampleResult[22]>
## 14: 2021-11-25 20:21:00       14 <ResampleResult[22]>
## 15: 2021-11-25 20:21:00       15 <ResampleResult[22]>
## 16: 2021-11-25 20:21:00       16 <ResampleResult[22]>
## 17: 2021-11-25 20:21:00       17 <ResampleResult[22]>
## 18: 2021-11-25 20:21:00       18 <ResampleResult[22]>
## 19: 2021-11-25 20:21:00       19 <ResampleResult[22]>
## 20: 2021-11-25 20:21:00       20 <ResampleResult[22]>

Altogether, the grid search evaluated 20/25 different hyperparameter configurations in a random order before the Terminator stopped the tuning.

The associated resampling iterations can be accessed in the BenchmarkResult of the tuning instance:

instance$archive$benchmark_result
## <BenchmarkResult> of 20 rows with 20 resampling runs
##  nr task_id    learner_id resampling_id iters warnings errors
##   1    pima classif.rpart       holdout     1        0      0
##   2    pima classif.rpart       holdout     1        0      0
##   3    pima classif.rpart       holdout     1        0      0
##   4    pima classif.rpart       holdout     1        0      0
##   5    pima classif.rpart       holdout     1        0      0
##   6    pima classif.rpart       holdout     1        0      0
##   7    pima classif.rpart       holdout     1        0      0
##   8    pima classif.rpart       holdout     1        0      0
##   9    pima classif.rpart       holdout     1        0      0
##  10    pima classif.rpart       holdout     1        0      0
##  11    pima classif.rpart       holdout     1        0      0
##  12    pima classif.rpart       holdout     1        0      0
##  13    pima classif.rpart       holdout     1        0      0
##  14    pima classif.rpart       holdout     1        0      0
##  15    pima classif.rpart       holdout     1        0      0
##  16    pima classif.rpart       holdout     1        0      0
##  17    pima classif.rpart       holdout     1        0      0
##  18    pima classif.rpart       holdout     1        0      0
##  19    pima classif.rpart       holdout     1        0      0
##  20    pima classif.rpart       holdout     1        0      0

The uhash column links the resampling iterations to the evaluated configurations stored in instance$archive$data. This allows e.g. to score the included ResampleResults on a different performance measure.

instance$archive$benchmark_result$score(msr("classif.acc"))
##                                    uhash nr              task task_id
##  1: df90c7fb-1bb0-46d2-a44d-ff440d5039f4  1 <TaskClassif[49]>    pima
##  2: 482b4ea9-c3f9-45a7-ab28-66caf6ec2f88  2 <TaskClassif[49]>    pima
##  3: b3142b2e-e442-4276-bfc9-fbd608a9f6dd  3 <TaskClassif[49]>    pima
##  4: 41d887d1-77b6-4406-9151-3b08befbf78a  4 <TaskClassif[49]>    pima
##  5: ba561e0a-5f15-4d2c-bf34-d2814ff1db72  5 <TaskClassif[49]>    pima
##  6: 78b04d22-183d-4003-9359-7e8a22492196  6 <TaskClassif[49]>    pima
##  7: 2fb5b192-9085-4619-a32a-ee53777ef45d  7 <TaskClassif[49]>    pima
##  8: cf6d3107-408b-4ed5-a3dc-b1991f33ff06  8 <TaskClassif[49]>    pima
##  9: 30fdcad8-7568-47b7-8c51-9baf8875aabe  9 <TaskClassif[49]>    pima
## 10: 243e2e71-7d86-4379-a823-58a6c5926fb9 10 <TaskClassif[49]>    pima
## 11: 8ed3f485-6098-419e-ac37-6419d4ad7b09 11 <TaskClassif[49]>    pima
## 12: 83b41136-ca1a-40b0-bdfe-f47dd48601b9 12 <TaskClassif[49]>    pima
## 13: ea58074a-f7b6-413b-ad6a-86d08163778c 13 <TaskClassif[49]>    pima
## 14: 7b196c76-ebb3-467b-8538-dc9aef6f6f48 14 <TaskClassif[49]>    pima
## 15: b5b833a5-02b1-497d-859d-a0ec1cba5b71 15 <TaskClassif[49]>    pima
## 16: 8a772760-e5c2-4642-b276-cb7336255d2f 16 <TaskClassif[49]>    pima
## 17: 4875ecca-3bb9-4271-9b8e-5b6c49c0f8d5 17 <TaskClassif[49]>    pima
## 18: ef551e26-4bad-4167-b6e6-67660a2dc829 18 <TaskClassif[49]>    pima
## 19: 4d3300e0-c717-4138-9e7d-a875368eed3b 19 <TaskClassif[49]>    pima
## 20: 7127a94b-362a-4a8e-8cbf-65c430474266 20 <TaskClassif[49]>    pima
##                       learner    learner_id              resampling
##  1: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
##  2: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
##  3: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
##  4: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
##  5: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
##  6: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
##  7: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
##  8: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
##  9: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
## 10: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
## 11: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
## 12: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
## 13: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
## 14: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
## 15: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
## 16: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
## 17: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
## 18: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
## 19: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
## 20: <LearnerClassifRpart[37]> classif.rpart <ResamplingHoldout[19]>
##     resampling_id iteration              prediction classif.acc
##  1:       holdout         1 <PredictionClassif[20]>      0.7969
##  2:       holdout         1 <PredictionClassif[20]>      0.7852
##  3:       holdout         1 <PredictionClassif[20]>      0.7852
##  4:       holdout         1 <PredictionClassif[20]>      0.7383
##  5:       holdout         1 <PredictionClassif[20]>      0.7578
##  6:       holdout         1 <PredictionClassif[20]>      0.7969
##  7:       holdout         1 <PredictionClassif[20]>      0.7578
##  8:       holdout         1 <PredictionClassif[20]>      0.7969
##  9:       holdout         1 <PredictionClassif[20]>      0.7969
## 10:       holdout         1 <PredictionClassif[20]>      0.7578
## 11:       holdout         1 <PredictionClassif[20]>      0.7969
## 12:       holdout         1 <PredictionClassif[20]>      0.7422
## 13:       holdout         1 <PredictionClassif[20]>      0.7188
## 14:       holdout         1 <PredictionClassif[20]>      0.7969
## 15:       holdout         1 <PredictionClassif[20]>      0.7969
## 16:       holdout         1 <PredictionClassif[20]>      0.7461
## 17:       holdout         1 <PredictionClassif[20]>      0.7969
## 18:       holdout         1 <PredictionClassif[20]>      0.7578
## 19:       holdout         1 <PredictionClassif[20]>      0.7852
## 20:       holdout         1 <PredictionClassif[20]>      0.7852

Now we can take the optimized hyperparameters, set them for the previously-created Learner, and train it on the full dataset.

learner$param_set$values = instance$result_learner_param_vals
learner$train(task)

The trained model can now be used to make a prediction on new, external data. Note that predicting on observations present in the task should be avoided because the model has seen these observations already during tuning and training and therefore performance values would be statistically biased – the resulting performance measure would be over-optimistic. To get statistically unbiased performance estimates for a given task, nested resampling is required.

4.1.4 Automating the Tuning

We can automate this entire process in mlr3 so that learners are tuned transparently, without the need to extract information on the best hyperparameter settings at the end.. The AutoTuner wraps a learner and augments it with an automatic tuning process for a given set of hyperparameters. Because the AutoTuner itself inherits from the Learner base class, it can be used like any other learner. In keeping with our example above, we create a classification learner that tunes itself automatically. This classification tree learner tunes the parameters cp and minsplit using an inner resampling (holdout). We create a terminator which allows 10 evaluations, and use a simple random search as tuning algorithm:

learner = lrn("classif.rpart")
search_space = ps(
  cp = p_dbl(lower = 0.001, upper = 0.1),
  minsplit = p_int(lower = 1, upper = 10)
)
terminator = trm("evals", n_evals = 10)
tuner = tnr("random_search")

at = AutoTuner$new(
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  search_space = search_space,
  terminator = terminator,
  tuner = tuner
)
at
## <AutoTuner:classif.rpart.tuned>
## * Model: -
## * Search Space:
## <ParamSet>
##          id    class lower upper nlevels        default value
## 1:       cp ParamDbl 0.001   0.1     Inf <NoDefault[3]>      
## 2: minsplit ParamInt 1.000  10.0      10 <NoDefault[3]>      
## * Packages: mlr3, rpart
## * Predict Type: response
## * Feature Types: logical, integer, numeric, factor, ordered
## * Properties: importance, missings, multiclass, selected_features,
##   twoclass, weights

We can now use the learner like any other learner, calling the $train() and $predict() method. The differnce to a normal learner is that $train() runs the tuning, which will take longer than a normal training process.

at$train(task)
## INFO  [20:21:01.372] [bbotk] Starting to optimize 2 parameter(s) with '<OptimizerRandomSearch>' and '<TerminatorEvals> [n_evals=10, k=0]' 
## INFO  [20:21:01.390] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:01.486] [bbotk] Result of batch 1: 
## INFO  [20:21:01.488] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:21:01.488] [bbotk]  0.07304        5      0.293            0.014 
## INFO  [20:21:01.488] [bbotk]                                 uhash 
## INFO  [20:21:01.488] [bbotk]  f986f989-e042-4fad-8230-ebbbe582d484 
## INFO  [20:21:01.492] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:01.592] [bbotk] Result of batch 2: 
## INFO  [20:21:01.594] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:21:01.594] [bbotk]  0.07593        2      0.293            0.016 
## INFO  [20:21:01.594] [bbotk]                                 uhash 
## INFO  [20:21:01.594] [bbotk]  42057fd3-5898-471e-bf80-7c22f4d230cb 
## INFO  [20:21:01.598] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:01.697] [bbotk] Result of batch 3: 
## INFO  [20:21:01.699] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:21:01.699] [bbotk]  0.05476        4      0.293            0.014 
## INFO  [20:21:01.699] [bbotk]                                 uhash 
## INFO  [20:21:01.699] [bbotk]  dd6c2d82-ebae-4adb-9c0a-6acf492f7be9 
## INFO  [20:21:01.703] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:01.796] [bbotk] Result of batch 4: 
## INFO  [20:21:01.798] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:21:01.798] [bbotk]  0.06905        3      0.293            0.014 
## INFO  [20:21:01.798] [bbotk]                                 uhash 
## INFO  [20:21:01.798] [bbotk]  3aebac62-9240-44d9-9fcb-1338a3964c35 
## INFO  [20:21:01.802] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:01.904] [bbotk] Result of batch 5: 
## INFO  [20:21:01.907] [bbotk]      cp minsplit classif.ce runtime_learners 
## INFO  [20:21:01.907] [bbotk]  0.0359        7      0.293            0.014 
## INFO  [20:21:01.907] [bbotk]                                 uhash 
## INFO  [20:21:01.907] [bbotk]  73298d80-0c2c-49af-8d53-fd7be0676d80 
## INFO  [20:21:01.911] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:02.008] [bbotk] Result of batch 6: 
## INFO  [20:21:02.010] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:21:02.010] [bbotk]  0.02021        7     0.2617            0.014 
## INFO  [20:21:02.010] [bbotk]                                 uhash 
## INFO  [20:21:02.010] [bbotk]  25518693-6f5f-4b5b-87e4-ef404cbe6bc8 
## INFO  [20:21:02.015] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:02.121] [bbotk] Result of batch 7: 
## INFO  [20:21:02.123] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:21:02.123] [bbotk]  0.09625        8      0.293            0.015 
## INFO  [20:21:02.123] [bbotk]                                 uhash 
## INFO  [20:21:02.123] [bbotk]  ad104c31-c115-439f-907a-597e6d683ffd 
## INFO  [20:21:02.127] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:02.222] [bbotk] Result of batch 8: 
## INFO  [20:21:02.224] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:21:02.224] [bbotk]  0.05129        2      0.293            0.014 
## INFO  [20:21:02.224] [bbotk]                                 uhash 
## INFO  [20:21:02.224] [bbotk]  819a22ff-80f9-48b6-bdaa-0c050663a864 
## INFO  [20:21:02.229] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:02.327] [bbotk] Result of batch 9: 
## INFO  [20:21:02.346] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:21:02.346] [bbotk]  0.01606        1     0.2578            0.015 
## INFO  [20:21:02.346] [bbotk]                                 uhash 
## INFO  [20:21:02.346] [bbotk]  201a870e-e961-446c-a523-4240d78c949c 
## INFO  [20:21:02.350] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:02.447] [bbotk] Result of batch 10: 
## INFO  [20:21:02.449] [bbotk]       cp minsplit classif.ce runtime_learners 
## INFO  [20:21:02.449] [bbotk]  0.03342        7      0.293            0.014 
## INFO  [20:21:02.449] [bbotk]                                 uhash 
## INFO  [20:21:02.449] [bbotk]  bb056383-5af8-4356-882e-76aae0a35974 
## INFO  [20:21:02.458] [bbotk] Finished optimizing after 10 evaluation(s) 
## INFO  [20:21:02.459] [bbotk] Result: 
## INFO  [20:21:02.460] [bbotk]       cp minsplit learner_param_vals  x_domain classif.ce 
## INFO  [20:21:02.460] [bbotk]  0.01606        1          <list[3]> <list[2]>     0.2578

We can also pass it to resample() and benchmark(), just like any other learner. This would result in a nested resampling.

4.2 Tuning Search Spaces

When running an optimization, it is important to inform the tuning algorithm about what hyperparameters are valid. Here the names, types, and valid ranges of each hyperparameter are important. All this information is communicated with objects of the class ParamSet, which is defined in paradox. While it is possible to create ParamSet-objects using its $new-constructor, it is much shorter and readable to use the ps-shortcut, which will be presented here. For an in-depth description of paradox and its classes, see the paradox chapter.

Note, that ParamSet objects exist in two contexts. First, ParamSet-objects are used to define the space of valid parameter setting for a learner (and other objects). Second, they are used to define a search space for tuning. We are mainly interested in the latter. For an example we can consider the minsplit parameter of the classif.rpart Learner. The ParamSet associated with the learner has a lower but no upper bound. However, for tuning the value, a lower and upper bound must be given because tuning search spaces need to be bounded. For Learner or PipeOp objects, typically “unbounded” ParamSets are used. Here, however, we will mainly focus on creating “bounded” ParamSets that can be used for tuning. See the in-depth paradox chapter for more details on using ParamSets to define parameter ranges for use-cases besides tuning.

4.2.1 Creating ParamSets

An empty ParamSet – not yet very useful – can be constructed using just the ps call:

library("mlr3verse")

search_space = ps()
print(search_space)
## <ParamSet>
## Empty.

ps takes named Domain arguments that are turned into parameters. A possible search space for the "classif.svm" learner could for example be:

search_space = ps(
  cost = p_dbl(lower = 0.1, upper = 10),
  kernel = p_fct(levels = c("polynomial", "radial"))
)
print(search_space)
## <ParamSet>
##        id    class lower upper nlevels        default value
## 1:   cost ParamDbl   0.1    10     Inf <NoDefault[3]>      
## 2: kernel ParamFct    NA    NA       2 <NoDefault[3]>

There are five domain constructors that produce a parameters when given to ps:

Constructor Description Is bounded? Underlying Class
p_dbl Real valued parameter (“double”) When upper and lower are given ParamDbl
p_int Integer parameter When upper and lower are given ParamInt
p_fct Discrete valued parameter (“factor”) Always ParamFct
p_lgl Logical / Boolean parameter Always ParamLgl
p_uty Untyped parameter Never ParamUty

These domain constructors each take some of the following arguments:

  • lower, upper: lower and upper bound of numerical parameters (p_dbl and p_int). These need to be given to get bounded parameter spaces valid for tuning.
  • levels: Allowed categorical values for p_fct parameters. Required argument for p_fct. See below for more details on this parameter.
  • trafo: transformation function, see below.
  • depends: dependencies, see below.
  • tags: Further information about a parameter, used for example by the hyperband tuner.
  • default: Value corresponding to default behavior when the parameter is not given. Not used for tuning search spaces.
  • special_vals: Valid values besides the normally accepted values for a parameter. Not used for tuning search spaces.
  • custom_check: Function that checks whether a value given to p_uty is valid. Not used for tuning search spaces.

The lower, upper, or levels parameters are always at the first (or second, for upper) position of the respective constructors, so it is preferred to omit them when defining a ParamSet, for improved conciseness:

search_space = ps(cost = p_dbl(0.1, 10), kernel = p_fct(c("polynomial", "radial")))

4.2.2 Transformations (trafo)

We can use the paradox function generate_design_grid to look at the values that would be evaluated by grid search. (We are using rbindlist() here because the result of $transpose() is a list that is harder to read. If we didn’t use $transpose(), on the other hand, the transformations that we investigate here are not applied.)

library("data.table")
rbindlist(generate_design_grid(search_space, 3)$transpose())
##     cost     kernel
## 1:  0.10 polynomial
## 2:  0.10     radial
## 3:  5.05 polynomial
## 4:  5.05     radial
## 5: 10.00 polynomial
## 6: 10.00     radial

We notice that the cost parameter is taken on a linear scale. We assume, however, that the difference of cost between 0.1 and 1 should have a similar effect as the difference between 1 and 10. Therefore it makes more sense to tune it on a logarithmic scale. This is done by using a transformation (trafo). This is a function that is applied to a parameter after it has been sampled by the tuner. We can tune cost on a logarithmic scale by sampling on the linear scale [-1, 1] and computing 10^x from that value.

search_space = ps(
  cost = p_dbl(-1, 1, trafo = function(x) 10^x),
  kernel = p_fct(c("polynomial", "radial"))
)
rbindlist(generate_design_grid(search_space, 3)$transpose())
##    cost     kernel
## 1:  0.1 polynomial
## 2:  0.1     radial
## 3:  1.0 polynomial
## 4:  1.0     radial
## 5: 10.0 polynomial
## 6: 10.0     radial

It is even possible to attach another transformation to the ParamSet as a whole that gets executed after individual parameter’s transformations were performed. It is given through the .extra_trafo argument and should be a function with parameters x and param_set that takes a list of parameter values in x and returns a modified list. This transformation can access all parameter values of an evaluation and modify them with interactions. It is even possible to add or remove parameters. (The following is a bit of a silly example.)

search_space = ps(
  cost = p_dbl(-1, 1, trafo = function(x) 10^x),
  kernel = p_fct(c("polynomial", "radial")),
  .extra_trafo = function(x, param_set) {
    if (x$kernel == "polynomial") {
      x$cost = x$cost * 2
    }
    x
  }
)
rbindlist(generate_design_grid(search_space, 3)$transpose())
##    cost     kernel
## 1:  0.2 polynomial
## 2:  0.1     radial
## 3:  2.0 polynomial
## 4:  1.0     radial
## 5: 20.0 polynomial
## 6: 10.0     radial

The available types of search space parameters are limited: continuous, integer, discrete, and logical scalars. There are many machine learning algorithms, however, that take parameters of other types, for example vectors or functions. These can not be defined in a search space ParamSet, and they are often given as ParamUty in the Learner’s ParamSet. When trying to tune over these hyperparameters, it is necessary to perform a Transformation that changes the type of a parameter.

An example is the class.weights parameter of the SVM, which takes a named vector of class weights with one entry for each target class. The trafo that would tune class.weights for the tsk("spam") dataset could be:

search_space = ps(
  class.weights = p_dbl(0.1, 0.9, trafo = function(x) c(spam = x, nonspam = 1 - x))
)
generate_design_grid(search_space, 3)$transpose()
## [[1]]
## [[1]]$class.weights
##    spam nonspam 
##     0.1     0.9 
## 
## 
## [[2]]
## [[2]]$class.weights
##    spam nonspam 
##     0.5     0.5 
## 
## 
## [[3]]
## [[3]]$class.weights
##    spam nonspam 
##     0.9     0.1

(We are omitting rbindlist() in this example because it breaks the vector valued return elements.)

4.2.3 Automatic Factor Level Transformation

A common use-case is the necessity to specify a list of values that should all be tried (or sampled from). It may be the case that a hyperparameter accepts function objects as values and a certain list of functions should be tried. Or it may be that a choice of special numeric values should be tried. For this, the p_fct constructor’s level argument may be a value that is not a character vector, but something else. If, for example, only the values 0.1, 3, and 10 should be tried for the cost parameter, even when doing random search, then the following search space would achieve that:

search_space = ps(
  cost = p_fct(c(0.1, 3, 10)),
  kernel = p_fct(c("polynomial", "radial"))
)
rbindlist(generate_design_grid(search_space, 3)$transpose())
##    cost     kernel
## 1:  0.1 polynomial
## 2:  0.1     radial
## 3:  3.0 polynomial
## 4:  3.0     radial
## 5: 10.0 polynomial
## 6: 10.0     radial

This is equivalent to the following:

search_space = ps(
  cost = p_fct(c("0.1", "3", "10"),
    trafo = function(x) list(`0.1` = 0.1, `3` = 3, `10` = 10)[[x]]),
  kernel = p_fct(c("polynomial", "radial"))
)
rbindlist(generate_design_grid(search_space, 3)$transpose())
##    cost     kernel
## 1:  0.1 polynomial
## 2:  0.1     radial
## 3:  3.0 polynomial
## 4:  3.0     radial
## 5: 10.0 polynomial
## 6: 10.0     radial

This may seem silly, but makes sense when considering that factorial tuning parameters are always character values:

search_space = ps(
  cost = p_fct(c(0.1, 3, 10)),
  kernel = p_fct(c("polynomial", "radial"))
)
typeof(search_space$params$cost$levels)
## [1] "character"

Be aware that this results in an “unordered” hyperparameter, however. Tuning algorithms that make use of ordering information of parameters, like genetic algorithms or model based optimization, will perform worse when this is done. For these algorithms, it may make more sense to define a p_dbl or p_int with a more fitting trafo.

The class.weights case from above can also be implemented like this, if there are only a few candidates of class.weights vectors that should be tried. Note that the levels argument of p_fct must be named if there is no easy way for as.character() to create names:

search_space = ps(
  class.weights = p_fct(
    list(
      candidate_a = c(spam = 0.5, nonspam = 0.5),
      candidate_b = c(spam = 0.3, nonspam = 0.7)
    )
  )
)
generate_design_grid(search_space)$transpose()
## [[1]]
## [[1]]$class.weights
##    spam nonspam 
##     0.5     0.5 
## 
## 
## [[2]]
## [[2]]$class.weights
##    spam nonspam 
##     0.3     0.7

4.2.4 Parameter Dependencies (depends)

Some parameters are only relevant when another parameter has a certain value, or one of several values. The SVM, for example, has the degree parameter that is only valid when kernel is "polynomial". This can be specified using the depends argument. It is an expression that must involve other parameters and be of the form <param> == <scalar>, <param> %in% <vector>, or multiple of these chained by &&. To tune the degree parameter, one would need to do the following:

search_space = ps(
  cost = p_dbl(-1, 1, trafo = function(x) 10^x),
  kernel = p_fct(c("polynomial", "radial")),
  degree = p_int(1, 3, depends = kernel == "polynomial")
)
rbindlist(generate_design_grid(search_space, 3)$transpose(), fill = TRUE)
##     cost     kernel degree
##  1:  0.1 polynomial      1
##  2:  0.1 polynomial      2
##  3:  0.1 polynomial      3
##  4:  0.1     radial     NA
##  5:  1.0 polynomial      1
##  6:  1.0 polynomial      2
##  7:  1.0 polynomial      3
##  8:  1.0     radial     NA
##  9: 10.0 polynomial      1
## 10: 10.0 polynomial      2
## 11: 10.0 polynomial      3
## 12: 10.0     radial     NA

4.2.5 Creating Tuning ParamSets from other ParamSets

Having to define a tuning ParamSet for a Learner that already has parameter set information may seem unnecessarily tedious, and there is indeed a way to create tuning ParamSets from a Learner’s ParamSet, making use of as much information as already available.

This is done by setting values of a Learner’s ParamSet to so-called TuneTokens, constructed with a to_tune call. This can be done in the same way that other hyperparameters are set to specific values. It can be understood as the hyperparameters being tagged for later tuning. The resulting ParamSet used for tuning can be retrieved using the $search_space() method.

learner = lrn("classif.svm")
learner$param_set$values$kernel = "polynomial" # for example
learner$param_set$values$degree = to_tune(lower = 1, upper = 3)

print(learner$param_set$search_space())
## <ParamSet>
##        id    class lower upper nlevels        default value
## 1: degree ParamInt     1     3       3 <NoDefault[3]>
rbindlist(generate_design_grid(learner$param_set$search_space(), 3)$transpose())
##    degree
## 1:      1
## 2:      2
## 3:      3

It is possible to omit lower here, because it can be inferred from the lower bound of the degree parameter itself. For other parameters, that are already bounded, it is possible to not give any bounds at all, because their ranges are already bounded. An example is the logical shrinking hyperparameter:

learner$param_set$values$shrinking = to_tune()

print(learner$param_set$search_space())
## <ParamSet>
##           id    class lower upper nlevels        default value
## 1:    degree ParamInt     1     3       3 <NoDefault[3]>      
## 2: shrinking ParamLgl    NA    NA       2           TRUE
rbindlist(generate_design_grid(learner$param_set$search_space(), 3)$transpose())
##    degree shrinking
## 1:      1      TRUE
## 2:      1     FALSE
## 3:      2      TRUE
## 4:      2     FALSE
## 5:      3      TRUE
## 6:      3     FALSE

to_tune can also be constructed with a Domain object, i.e. something constructed with a p_*** call. This way it is possible to tune continuous parameters with discrete values, or to give trafos or dependencies. One could, for example, tune the cost as above on three given special values, and introduce a dependency of shrinking on it. Notice that a short form for to_tune(<levels>) is a short form of to_tune(p_fct(<levels>)). (When introducing the dependency, we need to use the degree value from before the implicit trafo, which is the name or as.character() of the respective value, here "val2"!)

learner$param_set$values$type = "C-classification" # needs to be set because of a bug in paradox
learner$param_set$values$cost = to_tune(c(val1 = 0.3, val2 = 0.7))
learner$param_set$values$shrinking = to_tune(p_lgl(depends = cost == "val2"))

print(learner$param_set$search_space())
## <ParamSet>
##           id    class lower upper nlevels        default parents value
## 1:      cost ParamFct    NA    NA       2 <NoDefault[3]>              
## 2:    degree ParamInt     1     3       3 <NoDefault[3]>              
## 3: shrinking ParamLgl    NA    NA       2 <NoDefault[3]>    cost      
## Trafo is set.
rbindlist(generate_design_grid(learner$param_set$search_space(), 3)$transpose(), fill = TRUE)
##    degree cost shrinking
## 1:      1  0.3        NA
## 2:      1  0.7      TRUE
## 3:      1  0.7     FALSE
## 4:      2  0.3        NA
## 5:      2  0.7      TRUE
## 6:      2  0.7     FALSE
## 7:      3  0.3        NA
## 8:      3  0.7      TRUE
## 9:      3  0.7     FALSE

The search_space() picks up dependencies fromt the underlying ParamSet automatically. So if the kernel is tuned, then degree automatically gets the dependency on it, without us having to specify that. (Here we reset cost and shrinking to NULL for the sake of clarity of the generated output.)

learner$param_set$values$cost = NULL
learner$param_set$values$shrinking = NULL
learner$param_set$values$kernel = to_tune(c("polynomial", "radial"))

print(learner$param_set$search_space())
## <ParamSet>
##        id    class lower upper nlevels        default parents value
## 1: degree ParamInt     1     3       3 <NoDefault[3]>  kernel      
## 2: kernel ParamFct    NA    NA       2 <NoDefault[3]>
rbindlist(generate_design_grid(learner$param_set$search_space(), 3)$transpose(), fill = TRUE)
##        kernel degree
## 1: polynomial      1
## 2: polynomial      2
## 3: polynomial      3
## 4:     radial     NA

It is even possible to define whole ParamSets that get tuned over for a single parameter. This may be especially useful for vector hyperparameters that should be searched along multiple dimensions. This ParamSet must, however, have an .extra_trafo that returns a list with a single element, because it corresponds to a single hyperparameter that is being tuned. Suppose the class.weights hyperparameter should be tuned along two dimensions:

learner$param_set$values$class.weights = to_tune(
  ps(spam = p_dbl(0.1, 0.9), nonspam = p_dbl(0.1, 0.9),
    .extra_trafo = function(x, param_set) list(c(spam = x$spam, nonspam = x$nonspam))
))
head(generate_design_grid(learner$param_set$search_space(), 3)$transpose(), 3)
## [[1]]
## [[1]]$kernel
## [1] "polynomial"
## 
## [[1]]$degree
## [1] 1
## 
## [[1]]$class.weights
##    spam nonspam 
##     0.1     0.1 
## 
## 
## [[2]]
## [[2]]$kernel
## [1] "polynomial"
## 
## [[2]]$degree
## [1] 1
## 
## [[2]]$class.weights
##    spam nonspam 
##     0.1     0.5 
## 
## 
## [[3]]
## [[3]]$kernel
## [1] "polynomial"
## 
## [[3]]$degree
## [1] 1
## 
## [[3]]$class.weights
##    spam nonspam 
##     0.1     0.9

4.3 Nested Resampling

Evaluating a machine learning model often requires an additional layer of resampling when hyperparameters or features have to be selected. Nested resampling separates these model selection steps from the process estimating the performance of the model. If the same data is used for the model selection steps and the evaluation of the model itself, the resulting performance estimate of the model might be severely biased. One reason is that the repeated evaluation of the model on the test data could leak information about its structure into the model, what results in over-optimistic performance estimates. Keep in mind that nested resampling is a statistical procedure to estimate the predictive performance of the model trained on the full dataset. Nested resampling is not a procedure to select optimal hyperparameters. The resampling produces many hyperparameter configurations which should be not used to construct a final model (Simon 2007).

The graphic above illustrates nested resampling for hyperparameter tuning with 3-fold cross-validation in the outer and 4-fold cross-validation in the inner loop.

In the outer resampling loop, we have three pairs of training/test sets. On each of these outer training sets parameter tuning is done, thereby executing the inner resampling loop. This way, we get one set of selected hyperparameters for each outer training set. Then the learner is fitted on each outer training set using the corresponding selected hyperparameters. Subsequently, we can evaluate the performance of the learner on the outer test sets. The aggregated performance on the outer test sets is the unbiased performance estimate of the model.

4.3.1 Execution

The previous section examined the optimization of a simple classification tree on the mlr_tasks_pima. We continue the example and estimate the predictive performance of the model with nested resampling.

We use a 4-fold cross-validation in the inner resampling loop. The AutoTuner executes the hyperparameter tuning and is stopped after 5 evaluations. The hyperparameter configurations are proposed by grid search.

library("mlr3verse")

learner = lrn("classif.rpart")
resampling = rsmp("cv", folds = 4)
measure = msr("classif.ce")
search_space = ps(cp = p_dbl(lower = 0.001, upper = 0.1))
terminator = trm("evals", n_evals = 5)
tuner = tnr("grid_search", resolution = 10)

at = AutoTuner$new(learner, resampling, measure, terminator, tuner, search_space)

A 3-fold cross-validation is used in the outer resampling loop. On each of the three outer train sets hyperparameter tuning is done and we receive three optimized hyperparameter configurations. To execute the nested resampling, we pass the AutoTuner to the resample() function. We have to set store_models = TRUE because we need the AutoTuner models to investigate the inner tuning.

task = tsk("pima")
outer_resampling = rsmp("cv", folds = 3)

rr = resample(task, at, outer_resampling, store_models = TRUE)
## INFO  [20:21:12.449] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerGridSearch>' and '<TerminatorEvals> [n_evals=5, k=0]' 
## INFO  [20:21:12.493] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:12.714] [bbotk] Result of batch 1: 
## INFO  [20:21:12.717] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:12.717] [bbotk]  0.056     0.2578            0.061 d131598f-191f-4313-bbd2-19d19158e650 
## INFO  [20:21:12.718] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:12.928] [bbotk] Result of batch 2: 
## INFO  [20:21:12.930] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:12.930] [bbotk]  0.012     0.2559            0.068 6eeff693-4937-4808-a01c-dfccf8e108b6 
## INFO  [20:21:12.932] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:13.131] [bbotk] Result of batch 3: 
## INFO  [20:21:13.133] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:13.133] [bbotk]  0.034      0.248            0.062 c322e1d8-36fc-4b3b-ba7f-53c71bea40c3 
## INFO  [20:21:13.135] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:13.409] [bbotk] Result of batch 4: 
## INFO  [20:21:13.411] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:13.411] [bbotk]  0.089      0.291            0.054 d6733896-f20f-4c73-a343-ece0b0ad4954 
## INFO  [20:21:13.412] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:13.604] [bbotk] Result of batch 5: 
## INFO  [20:21:13.606] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:13.606] [bbotk]  0.001     0.2695            0.052 c3f11ca5-45d4-4a27-b84f-68dcc0af6a51 
## INFO  [20:21:13.612] [bbotk] Finished optimizing after 5 evaluation(s) 
## INFO  [20:21:13.613] [bbotk] Result: 
## INFO  [20:21:13.614] [bbotk]     cp learner_param_vals  x_domain classif.ce 
## INFO  [20:21:13.614] [bbotk]  0.034          <list[2]> <list[1]>      0.248 
## INFO  [20:21:13.687] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerGridSearch>' and '<TerminatorEvals> [n_evals=5, k=0]' 
## INFO  [20:21:13.691] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:13.879] [bbotk] Result of batch 1: 
## INFO  [20:21:13.882] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:13.882] [bbotk]  0.023      0.252            0.053 f7cbb537-6e83-4c53-a4e8-3a5a34203ab2 
## INFO  [20:21:13.884] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:14.091] [bbotk] Result of batch 2: 
## INFO  [20:21:14.093] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:14.093] [bbotk]  0.078     0.2773            0.056 43f6facc-05ee-475f-bd34-c574b8ae551b 
## INFO  [20:21:14.095] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:14.295] [bbotk] Result of batch 3: 
## INFO  [20:21:14.297] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:14.297] [bbotk]  0.089     0.2773            0.057 ab350b40-4f96-40d7-aed8-4cb97709f0d5 
## INFO  [20:21:14.299] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:14.491] [bbotk] Result of batch 4: 
## INFO  [20:21:14.493] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:14.493] [bbotk]  0.045     0.2754            0.053 734337ae-b0a3-451c-8877-201603c84270 
## INFO  [20:21:14.495] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:14.707] [bbotk] Result of batch 5: 
## INFO  [20:21:14.710] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:14.710] [bbotk]  0.001     0.2695            0.056 aa53647e-2b08-4519-b937-18380d75befa 
## INFO  [20:21:14.715] [bbotk] Finished optimizing after 5 evaluation(s) 
## INFO  [20:21:14.716] [bbotk] Result: 
## INFO  [20:21:14.718] [bbotk]     cp learner_param_vals  x_domain classif.ce 
## INFO  [20:21:14.718] [bbotk]  0.023          <list[2]> <list[1]>      0.252 
## INFO  [20:21:14.807] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerGridSearch>' and '<TerminatorEvals> [n_evals=5, k=0]' 
## INFO  [20:21:14.812] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:15.010] [bbotk] Result of batch 1: 
## INFO  [20:21:15.012] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:15.012] [bbotk]  0.056     0.2676            0.058 4813bef7-6d6a-45b9-af47-3e8b270c1a0e 
## INFO  [20:21:15.014] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:15.210] [bbotk] Result of batch 2: 
## INFO  [20:21:15.212] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:15.212] [bbotk]  0.078     0.2598            0.056 2a1768a1-1920-4b86-ad31-7e3946579472 
## INFO  [20:21:15.214] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:15.411] [bbotk] Result of batch 3: 
## INFO  [20:21:15.413] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:15.413] [bbotk]  0.067     0.2676            0.052 a14321bb-ed3b-4649-91bf-21662b441ec7 
## INFO  [20:21:15.415] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:15.612] [bbotk] Result of batch 4: 
## INFO  [20:21:15.614] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:15.614] [bbotk]  0.023     0.2695            0.056 7dd0fc64-4628-494c-9de3-df486d234af1 
## INFO  [20:21:15.616] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:15.814] [bbotk] Result of batch 5: 
## INFO  [20:21:15.816] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:15.816] [bbotk]  0.045     0.2676            0.054 355b47a3-6647-4e6a-848a-e9008b04b123 
## INFO  [20:21:15.822] [bbotk] Finished optimizing after 5 evaluation(s) 
## INFO  [20:21:15.823] [bbotk] Result: 
## INFO  [20:21:15.824] [bbotk]     cp learner_param_vals  x_domain classif.ce 
## INFO  [20:21:15.824] [bbotk]  0.078          <list[2]> <list[1]>     0.2598

You can freely combine different inner and outer resampling strategies. Nested resampling is not restricted to hyperparameter tuning. You can swap the AutoTuner for a AutoFSelector and estimate the performance of a model which is fitted on an optimized feature subset.

4.3.2 Evaluation

With the created ResampleResult we can now inspect the executed resampling iterations more closely. See the section on Resampling for more detailed information about ResampleResult objects.

We check the inner tuning results for stable hyperparameters. This means that the selected hyperparameters should not vary too much. We might observe unstable models in this example because the small data set and the low number of resampling iterations might introduces too much randomness. Usually, we aim for the selection of stable hyperparameters for all outer training sets.

##    iteration    cp classif.ce learner_param_vals  x_domain task_id
## 1:         1 0.023     0.2520          <list[2]> <list[1]>    pima
## 2:         2 0.034     0.2480          <list[2]> <list[1]>    pima
## 3:         3 0.078     0.2598          <list[2]> <list[1]>    pima
##             learner_id resampling_id
## 1: classif.rpart.tuned            cv
## 2: classif.rpart.tuned            cv
## 3: classif.rpart.tuned            cv

Next, we want to compare the predictive performances estimated on the outer resampling to the inner resampling. Significantly lower predictive performances on the outer resampling indicate that the models with the optimized hyperparameters overfit the data.

rr$score()
##                 task task_id         learner          learner_id
## 1: <TaskClassif[49]>    pima <AutoTuner[41]> classif.rpart.tuned
## 2: <TaskClassif[49]>    pima <AutoTuner[41]> classif.rpart.tuned
## 3: <TaskClassif[49]>    pima <AutoTuner[41]> classif.rpart.tuned
##            resampling resampling_id iteration              prediction
## 1: <ResamplingCV[19]>            cv         1 <PredictionClassif[20]>
## 2: <ResamplingCV[19]>            cv         2 <PredictionClassif[20]>
## 3: <ResamplingCV[19]>            cv         3 <PredictionClassif[20]>
##    classif.ce
## 1:     0.2422
## 2:     0.2266
## 3:     0.2695

The aggregated performance of all outer resampling iterations is essentially the unbiased performance of the model with optimal hyperparameter found by grid search.

rr$aggregate()
## classif.ce 
##     0.2461

Note that nested resampling is computationally expensive. For this reason we use relatively small number of hyperparameter configurations and a low number of resampling iterations in this example. In practice, you normally have to increase both. As this is computationally intensive you might want to have a look at the section on Parallelization.

4.3.3 Final Model

We can use the AutoTuner to tune the hyperparameters of our learner and fit the final model on the full data set.

at$train(task)
## INFO  [20:21:16.203] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerGridSearch>' and '<TerminatorEvals> [n_evals=5, k=0]' 
## INFO  [20:21:16.206] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:16.406] [bbotk] Result of batch 1: 
## INFO  [20:21:16.408] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:16.408] [bbotk]  0.056     0.2695            0.057 a6f5f060-2083-43da-9c82-5d1851d88db1 
## INFO  [20:21:16.409] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:16.604] [bbotk] Result of batch 2: 
## INFO  [20:21:16.606] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:16.606] [bbotk]  0.012     0.2474            0.062 e2e4ef0e-f241-406c-b2f3-640d67c1734c 
## INFO  [20:21:16.608] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:16.805] [bbotk] Result of batch 3: 
## INFO  [20:21:16.807] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:16.807] [bbotk]  0.067     0.2695            0.053 2976e257-7352-456c-b982-a64ed34cdcb0 
## INFO  [20:21:16.809] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:17.004] [bbotk] Result of batch 4: 
## INFO  [20:21:17.007] [bbotk]     cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:17.007] [bbotk]  0.045     0.2487            0.052 a897daa9-6d78-4348-a2ac-5618d730f430 
## INFO  [20:21:17.008] [bbotk] Evaluating 1 configuration(s) 
## INFO  [20:21:17.204] [bbotk] Result of batch 5: 
## INFO  [20:21:17.206] [bbotk]   cp classif.ce runtime_learners                                uhash 
## INFO  [20:21:17.206] [bbotk]  0.1     0.2695            0.055 538d06bb-1265-4963-800e-00815f88e248 
## INFO  [20:21:17.213] [bbotk] Finished optimizing after 5 evaluation(s) 
## INFO  [20:21:17.214] [bbotk] Result: 
## INFO  [20:21:17.215] [bbotk]     cp learner_param_vals  x_domain classif.ce 
## INFO  [20:21:17.215] [bbotk]  0.012          <list[2]> <list[1]>     0.2474

The trained model can now be used to make predictions on new data. A common mistake is to report the performance estimated on the resampling sets on which the tuning was performed (at$tuning_result$classif.ce) as the model’s performance. Instead, we report the performance estimated with nested resampling as the performance of the model.

4.4 Tuning with Hyperband

Besides the more traditional tuning methods, the ecosystem around mlr3 offers another procedure for hyperparameter optimization called Hyperband implemented in the mlr3hyperband package.

Hyperband is a budget-oriented procedure, weeding out suboptimal performing configurations early on during a partially sequential training process, increasing tuning efficiency as a consequence. For this, a combination of incremental resource allocation and early stopping is used: As optimization progresses, computational resources are increased for more promising configurations, while less promising ones are terminated early.

To give an introductory analogy, imagine two horse trainers are given eight untrained horses. Both trainers want to win the upcoming race, but they are only given 32 units of food. Given that each horse can be fed up to 8 units food (“maximum budget” per horse), there is not enough food for all the horses. It is critical to identify the most promising horses early, and give them enough food to improve. So, the trainers need to develop a strategy to split up the food in the best possible way. The first trainer is very optimistic and wants to explore the full capabilities of a horse, because he does not want to pass a judgment on a horse’s performance unless it has been fully trained. So, he divides his budget by the maximum amount he can give to a horse (lets say eight, so \(32 / 8 = 4\)) and randomly picks four horses - his budget simply is not enough to fully train more. Those four horses are then trained to their full capabilities, while the rest is set free. This way, the trainer is confident about choosing the best out of the four trained horses, but he might have overlooked the horse with the highest potential since he only focused on half of them. The other trainer is more creative and develops a different strategy. He thinks, if a horse is not performing well at the beginning, it will also not improve after further training. Based on this assumption, he decides to give one unit of food to each horse and observes how they develop. After the initial food is consumed, he checks their performance and kicks the slowest half out of his training regime. Then, he increases the available food for the remaining, further trains them until the food is consumed again, only to kick out the worst half once more. He repeats this until the one remaining horse gets the rest of the food. This means only one horse is fully trained, but on the flip side, he was able to start training with all eight horses.

On race day, all the horses are put on the starting line. But which trainer will have the winning horse? The one, who tried to train a maximum amount of horses to their fullest? Or the other one, who made assumptions about the training progress of his horses? How the training phases may possibly look like is visualized in figure 4.1.

Visulization of how the training processes may look like. The left plot corresponds to the non-selective trainer, while the right one to the selective trainer.

Figure 4.1: Visulization of how the training processes may look like. The left plot corresponds to the non-selective trainer, while the right one to the selective trainer.

Hyperband works very similar in some ways, but also different in others. It is not embodied by one of the trainers in our analogy, but more by the person, who would pay them. Hyperband consists of several brackets, each bracket corresponding to a trainer, and we do not care about horses but about hyperparameter configurations of a machine learning algorithm. The budget is not in terms of food, but in terms of a hyperparameter of the learner that scales in some way with the computational effort. An example is the number of epochs we train a neural network, or the number of iterations in boosting. Furthermore, there are not only two brackets (or trainers), but several, each placed at a unique spot between fully explorative of later training stages and extremely selective, equal to higher exploration of early training stages. The level of selection aggressiveness is handled by a user-defined parameter called \(\eta\). So, \(1/\eta\) is the fraction of remaining configurations after a bracket removes his worst performing ones, but \(\eta\) is also the factor by that the budget is increased for the next stage. Because there is a different maximum budget per configuration that makes sense in different scenarios, the user also has to set this as the \(R\) parameter. No further parameters are required for Hyperband – the full required budget across all brackets is indirectly given by \[ (\lfloor \log_{\eta}{R} \rfloor + 1)^2 * R\] (Li et al. 2016). To give an idea how a full bracket layout might look like for a specific \(R\) and \(\eta\), a quick overview is given in the following table.

Table 4.1: Hyperband layout for \(\eta = 2\) and \(R = 8\), consisting of four brackets with \(n\) as the amount of active configurations.
stage budget n
1 1 8
2 2 4
3 4 2
4 8 1
stage budget n
1 2 6
2 4 3
3 8 1
stage budget n
1 4 4
2 8 2
stage budget n
1 8 4

Of course, early termination based on a performance criterion may be disadvantageous if it is done too aggressively in certain scenarios. A learner to jumping radically in its estimated performance during the training phase may get the best configurations canceled too early, simply because they do not improve quickly enough compared to others. In other words, it is often unclear beforehand if having an high amount of configurations \(n\), that gets aggressively discarded early, is better than having a high budget \(B\) per configuration. The arising tradeoff, that has to be made, is called the “\(n\) versus \(B/n\) problem”. To create a balance between selection based on early training performance versus exploration of training performances in later training stages, \(\lfloor \log_{\eta}{R} \rfloor + 1\) brackets are constructed with an associated set of varying sized configurations. Thus, some brackets contain more configurations, with a small initial budget. In these, a lot are discarded after having been trained for only a short amount of time, corresponding to the selective trainer in our horse analogy. Others are constructed with fewer configurations, where discarding only takes place after a significant amount of budget was consumed. The last bracket usually never discards anything, but also starts with only very few configurations – this is equivalent to the trainer explorative of later stages. The former corresponds high \(n\), while the latter high \(B/n\). Even though different brackets are initialized with a different amount of configurations and different initial budget sizes, each bracket is assigned (approximately) the same budget \((\lfloor \log_{\eta}{R} \rfloor + 1) * R\).

The configurations at the start of each bracket are initialized by random, often uniform sampling. Note that currently all configurations are trained completely from the beginning, so no online updates of models from stage to stage is happening.

To identify the budget for evaluating Hyperband, the user has to specify explicitly which hyperparameter of the learner influences the budget by extending a single hyperparameter in the ParamSet with an argument (tags = "budget"), like in the following snippet:

library("mlr3verse")

# Hyperparameter subset of XGBoost
search_space = ps(
  nrounds = p_int(lower = 1, upper = 16, tags = "budget"),
  booster = p_fct(levels = c("gbtree", "gblinear", "dart"))
)

Thanks to the broad ecosystem of the mlr3verse a learner does not require a natural budget parameter. A typical case of this would be decision trees. By using subsampling as preprocessing with mlr3pipelines, we can work around a lacking budget parameter.

set.seed(123)

# extend "classif.rpart" with "subsampling" as preprocessing step
ll = po("subsample") %>>% lrn("classif.rpart")

# extend hyperparameters of "classif.rpart" with subsampling fraction as budget
search_space = ps(
  classif.rpart.cp = p_dbl(lower = 0.001, upper = 0.1),
  classif.rpart.minsplit = p_int(lower = 1, upper = 10),
  subsample.frac = p_dbl(lower = 0.1, upper = 1, tags = "budget")
)

We can now plug the new learner with the extended hyperparameter set into a TuningInstanceSingleCrit the same way as usual. Naturally, Hyperband terminates once all of its brackets are evaluated, so a Terminator in the tuning instance acts as an upper bound and should be only set to a low value if one is unsure of how long Hyperband will take to finish under the given settings.

instance = TuningInstanceSingleCrit$new(
  task = tsk("iris"),
  learner = ll,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = trm("none"), # hyperband terminates itself
  search_space = search_space
)

Now, we initialize a new instance of the mlr3hyperband::mlr_tuners_hyperband class and start tuning with it.

## Loading required package: mlr3tuning
## Loading required package: paradox
tuner = tnr("hyperband", eta = 3)

# reduce logging output
lgr::get_logger("bbotk")$set_threshold("warn")

tuner$optimize(instance)
##    classif.rpart.cp classif.rpart.minsplit subsample.frac learner_param_vals
## 1:          0.07348                      5         0.1111          <list[6]>
##     x_domain classif.ce
## 1: <list[3]>       0.02

To receive the results of each sampled configuration, we simply run the following snippet.

as.data.table(instance$archive)[, c(
  "subsample.frac",
  "classif.rpart.cp",
  "classif.rpart.minsplit",
  "classif.ce"
), with = FALSE]
##     subsample.frac classif.rpart.cp classif.rpart.minsplit classif.ce
##  1:         0.1111          0.02533                      3       0.04
##  2:         0.1111          0.07348                      5       0.02
##  3:         0.1111          0.08490                      3       0.02
##  4:         0.1111          0.05026                      6       0.02
##  5:         0.1111          0.03940                      4       0.02
##  6:         0.1111          0.02540                      7       0.42
##  7:         0.1111          0.01200                      4       0.14
##  8:         0.1111          0.03961                      4       0.02
##  9:         0.1111          0.05762                      6       0.02
## 10:         0.3333          0.07348                      5       0.06
## 11:         0.3333          0.08490                      3       0.04
## 12:         0.3333          0.05026                      6       0.06
## 13:         1.0000          0.08490                      3       0.04
## 14:         0.3333          0.08650                      6       0.02
## 15:         0.3333          0.07491                      9       0.06
## 16:         0.3333          0.06716                      6       0.04
## 17:         0.3333          0.06218                      9       0.08
## 18:         0.3333          0.03785                      4       0.06
## 19:         1.0000          0.08650                      6       0.04
## 20:         1.0000          0.02724                     10       0.04
## 21:         1.0000          0.05689                      3       0.04
## 22:         1.0000          0.09141                      4       0.04
##     subsample.frac classif.rpart.cp classif.rpart.minsplit classif.ce

You can access the best found configuration through the instance object.

instance$result
##    classif.rpart.cp classif.rpart.minsplit subsample.frac learner_param_vals
## 1:          0.07348                      5         0.1111          <list[6]>
##     x_domain classif.ce
## 1: <list[3]>       0.02
instance$result_learner_param_vals
## $subsample.frac
## [1] 0.1111
## 
## $subsample.stratify
## [1] FALSE
## 
## $subsample.replace
## [1] FALSE
## 
## $classif.rpart.xval
## [1] 0
## 
## $classif.rpart.cp
## [1] 0.07348
## 
## $classif.rpart.minsplit
## [1] 5
instance$result_y
## classif.ce 
##       0.02

If you are familiar with the original paper, you may have wondered how we just used Hyperband with a parameter ranging from 0.1 to 1.0 (Li et al. 2016). The answer is, with the help the internal rescaling of the budget parameter. mlr3hyperband automatically divides the budget parameters boundaries with its lower bound, ending up with a budget range starting again at 1, like it is the case originally. If we want an overview of what bracket layout Hyperband created and how the rescaling in each bracket worked, we can print a compact table to see this information.

unique(as.data.table(instance$archive)[, .(bracket, bracket_stage, budget_scaled, budget_real, n_configs)])
##    bracket bracket_stage budget_scaled budget_real n_configs
## 1:       2             0         1.111      0.1111         9
## 2:       2             1         3.333      0.3333         3
## 3:       2             2        10.000      1.0000         1
## 4:       1             0         3.333      0.3333         5
## 5:       1             1        10.000      1.0000         1
## 6:       0             0        10.000      1.0000         3

In the traditional way, Hyperband uses uniform sampling to receive a configuration sample at the start of each bracket. But it is also possible to define a custom Sampler for each hyperparameter.

search_space = ps(
  nrounds = p_int(lower = 1, upper = 16, tags = "budget"),
  eta = p_dbl(lower = 0, upper = 1),
  booster = p_fct(levels = c("gbtree", "gblinear", "dart"))
)

instance = TuningInstanceSingleCrit$new(
  task = tsk("iris"),
  learner = lrn("classif.xgboost"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = trm("none"), # hyperband terminates itself
  search_space = search_space
)

# beta distribution with alpha = 2 and beta = 5
# categorical distribution with custom probabilities
sampler = SamplerJointIndep$new(list(
  Sampler1DRfun$new(search_space$params$eta, function(n) rbeta(n, 2, 5)),
  Sampler1DCateg$new(search_space$params$booster, prob = c(0.2, 0.3, 0.5))
))

Then, the defined sampler has to be given as an argument during instance creation. Afterwards, the usual tuning can proceed.

tuner = tnr("hyperband", eta = 2, sampler = sampler)
set.seed(123)
tuner$optimize(instance)
##    nrounds    eta booster learner_param_vals  x_domain classif.ce
## 1:       1 0.2415    dart          <list[5]> <list[3]>       0.04
instance$result
##    nrounds    eta booster learner_param_vals  x_domain classif.ce
## 1:       1 0.2415    dart          <list[5]> <list[3]>       0.04

Furthermore, we extended the original algorithm, to make it also possible to use mlr3hyperband for multi-objective optimization. To do this, simply specify more measures in the TuningInstanceMultiCrit and run the rest as usual.

instance = TuningInstanceMultiCrit$new(
  task = tsk("pima"),
  learner = lrn("classif.xgboost"),
  resampling = rsmp("holdout"),
  measures = msrs(c("classif.tpr", "classif.fpr")),
  terminator = trm("none"), # hyperband terminates itself
  search_space = search_space
)

tuner = tnr("hyperband", eta = 4)
tuner$optimize(instance)
##     nrounds     eta  booster learner_param_vals  x_domain classif.tpr
##  1:       1 0.34093 gblinear          <list[5]> <list[3]>      0.0000
##  2:       1 0.50700 gblinear          <list[5]> <list[3]>      0.0000
##  3:       1 0.77369 gblinear          <list[5]> <list[3]>      0.0000
##  4:       1 0.55967 gblinear          <list[5]> <list[3]>      0.0000
##  5:       1 0.46843 gblinear          <list[5]> <list[3]>      0.0000
##  6:       1 0.27922 gblinear          <list[5]> <list[3]>      0.0000
##  7:       1 0.79274 gblinear          <list[5]> <list[3]>      0.0000
##  8:       1 0.58557 gblinear          <list[5]> <list[3]>      0.0000
##  9:       1 0.39476 gblinear          <list[5]> <list[3]>      0.0000
## 10:       1 0.58145 gblinear          <list[5]> <list[3]>      0.0000
## 11:      16 0.01919   gbtree          <list[5]> <list[3]>      0.6374
## 12:       4 0.55531     dart          <list[5]> <list[3]>      0.5824
## 13:       4 0.65920   gbtree          <list[5]> <list[3]>      0.6044
## 14:      16 0.13587     dart          <list[5]> <list[3]>      0.6264
##     classif.fpr
##  1:      0.0000
##  2:      0.0000
##  3:      0.0000
##  4:      0.0000
##  5:      0.0000
##  6:      0.0000
##  7:      0.0000
##  8:      0.0000
##  9:      0.0000
## 10:      0.0000
## 11:      0.2364
## 12:      0.1576
## 13:      0.1758
## 14:      0.1818

Now the result is not a single best configuration but an estimated Pareto front. All red points are not dominated by another parameter configuration regarding their fpr and tpr performance measures.

instance$result
##     nrounds     eta  booster learner_param_vals  x_domain classif.tpr
##  1:       1 0.34093 gblinear          <list[5]> <list[3]>      0.0000
##  2:       1 0.50700 gblinear          <list[5]> <list[3]>      0.0000
##  3:       1 0.77369 gblinear          <list[5]> <list[3]>      0.0000
##  4:       1 0.55967 gblinear          <list[5]> <list[3]>      0.0000
##  5:       1 0.46843 gblinear          <list[5]> <list[3]>      0.0000
##  6:       1 0.27922 gblinear          <list[5]> <list[3]>      0.0000
##  7:       1 0.79274 gblinear          <list[5]> <list[3]>      0.0000
##  8:       1 0.58557 gblinear          <list[5]> <list[3]>      0.0000
##  9:       1 0.39476 gblinear          <list[5]> <list[3]>      0.0000
## 10:       1 0.58145 gblinear          <list[5]> <list[3]>      0.0000
## 11:      16 0.01919   gbtree          <list[5]> <list[3]>      0.6374
## 12:       4 0.55531     dart          <list[5]> <list[3]>      0.5824
## 13:       4 0.65920   gbtree          <list[5]> <list[3]>      0.6044
## 14:      16 0.13587     dart          <list[5]> <list[3]>      0.6264
##     classif.fpr
##  1:      0.0000
##  2:      0.0000
##  3:      0.0000
##  4:      0.0000
##  5:      0.0000
##  6:      0.0000
##  7:      0.0000
##  8:      0.0000
##  9:      0.0000
## 10:      0.0000
## 11:      0.2364
## 12:      0.1576
## 13:      0.1758
## 14:      0.1818
plot(classif.tpr ~ classif.fpr, instance$archive$data)
points(classif.tpr ~ classif.fpr, instance$result, col = "red")

4.5 Feature Selection / Filtering

Often, data sets include a large number of features. The technique of extracting a subset of relevant features is called “feature selection”.

The objective of feature selection is to fit the sparse dependent of a model on a subset of available data features in the most suitable manner. Feature selection can enhance the interpretability of the model, speed up the learning process and improve the learner performance. Different approaches exist to identify the relevant features. Two different approaches are emphasized in the literature: one is called Filtering and the other approach is often referred to as feature subset selection or wrapper methods.

What are the differences (Guyon and Elisseeff 2003; Chandrashekar and Sahin 2014)?

  • Filtering: An external algorithm computes a rank of the features (e.g. based on the correlation to the response). Then, features are subsetted by a certain criteria, e.g. an absolute number or a percentage of the number of variables. The selected features will then be used to fit a model (with optional hyperparameters selected by tuning). This calculation is usually cheaper than “feature subset selection” in terms of computation time. All filters are connected via package mlr3filters.

  • Wrapper Methods: Here, no ranking of features is done. Instead, an optimization algorithm selects a subset of the features, evaluates the set by calculating the resampled predictive performance, and then proposes a new set of features (or terminates). A simple example is the sequential forward selection. This method is usually computationally very intensive as a lot of models are fitted. Also, strictly speaking, all these models would need to be tuned before the performance is estimated. This would require an additional nested level in a CV setting. After undertaken all of these steps, the final set of selected features is again fitted (with optional hyperparameters selected by tuning). Wrapper methods are implemented in the mlr3fselect package.

  • Embedded Methods: Many learners internally select a subset of the features which they find helpful for prediction. These subsets can usually be queried, as the following example demonstrates:

    library("mlr3verse")
    
    task = tsk("iris")
    learner = lrn("classif.rpart")
    
    # ensure that the learner selects features
    stopifnot("selected_features" %in% learner$properties)
    
    # fit a simple classification tree
    learner = learner$train(task)
    
    # extract all features used in the classification tree:
    learner$selected_features()
    ## [1] "Petal.Length" "Petal.Width"

There are also ensemble filters built upon the idea of stacking single filter methods. These are not yet implemented.

4.5.1 Filters

Filter methods assign an importance value to each feature. Based on these values the features can be ranked. Thereafter, we are able to select a feature subset. There is a list of all implemented filter methods in the Appendix.

4.5.2 Calculating filter values

Currently, only classification and regression tasks are supported.

The first step it to create a new R object using the class of the desired filter method. Similar to other instances in mlr3, these are registered in a dictionary (mlr_filters) with an associated shortcut function flt(). Each object of class Filter has a .$calculate() method which computes the filter values and ranks them in a descending order.

filter = flt("jmim")

task = tsk("iris")
filter$calculate(task)

as.data.table(filter)
##         feature  score
## 1:  Petal.Width 1.0000
## 2: Sepal.Length 0.6667
## 3: Petal.Length 0.3333
## 4:  Sepal.Width 0.0000

Some filters support changing specific hyperparameters. This is similar to setting hyperparameters of a Learner using .$param_set$values:

filter_cor = flt("correlation")
filter_cor$param_set
## <ParamSet>
##        id    class lower upper nlevels    default value
## 1:    use ParamFct    NA    NA       5 everything      
## 2: method ParamFct    NA    NA       3    pearson
# change parameter 'method'
filter_cor$param_set$values = list(method = "spearman")
filter_cor$param_set
## <ParamSet>
##        id    class lower upper nlevels    default    value
## 1:    use ParamFct    NA    NA       5 everything         
## 2: method ParamFct    NA    NA       3    pearson spearman

4.5.3 Variable Importance Filters

All Learner with the property “importance” come with integrated feature selection methods.

You can find a list of all learners with this property in the Appendix.

For some learners the desired filter method needs to be set during learner creation. For example, learner classif.ranger comes with multiple integrated methods, c.f. the help page of ranger::ranger(). To use method “impurity”, you need to set the filter method during construction.

lrn = lrn("classif.ranger", importance = "impurity")

Now you can use the FilterImportance filter class for algorithm-embedded methods:

task = tsk("iris")
filter = flt("importance", learner = lrn)
filter$calculate(task)
head(as.data.table(filter), 3)
##         feature  score
## 1: Petal.Length 45.334
## 2:  Petal.Width 42.640
## 3: Sepal.Length  9.045

4.5.4 Wrapper Methods

Wrapper feature selection is supported via the mlr3fselect extension package. At the heart of mlr3fselect are the R6 classes:

4.5.5 The FSelectInstance Classes

The following sub-section examines the feature selection on the Pima data set which is used to predict whether or not a patient has diabetes.

task = tsk("pima")
print(task)
## <TaskClassif:pima> (768 x 9)
## * Target: diabetes
## * Properties: twoclass
## * Features (8):
##   - dbl (8): age, glucose, insulin, mass, pedigree, pregnant, pressure,
##     triceps

We use the classification tree from rpart.

learner = lrn("classif.rpart")

Next, we need to specify how to evaluate the performance of the feature subsets. For this, we need to choose a resampling strategy and a performance measure.

hout = rsmp("holdout")
measure = msr("classif.ce")

Finally, one has to choose the available budget for the feature selection. This is done by selecting one of the available Terminators:

For this short introduction, we specify a budget of 20 evaluations and then put everything together into a FSelectInstanceSingleCrit:

evals20 = trm("evals", n_evals = 20)

instance = FSelectInstanceSingleCrit$new(
  task = task,
  learner = learner,
  resampling = hout,
  measure = measure,
  terminator = evals20
)
instance
## <FSelectInstanceSingleCrit>
## * State:  Not optimized
## * Objective: <ObjectiveFSelect:classif.rpart_on_pima>
## * Search Space:
## <ParamSet>
##          id    class lower upper nlevels        default value
## 1:      age ParamLgl    NA    NA       2 <NoDefault[3]>      
## 2:  glucose ParamLgl    NA    NA       2 <NoDefault[3]>      
## 3:  insulin ParamLgl    NA    NA       2 <NoDefault[3]>      
## 4:     mass ParamLgl    NA    NA       2 <NoDefault[3]>      
## 5: pedigree ParamLgl    NA    NA       2 <NoDefault[3]>      
## 6: pregnant ParamLgl    NA    NA       2 <NoDefault[3]>      
## 7: pressure ParamLgl    NA    NA       2 <NoDefault[3]>      
## 8:  triceps ParamLgl    NA    NA       2 <NoDefault[3]>      
## * Terminator: <TerminatorEvals>
## * Terminated: FALSE
## * Archive:
## <ArchiveFSelect>
## Null data.table (0 rows and 0 cols)

To start the feature selection, we still need to select an algorithm which are defined via the FSelector class

4.5.6 The FSelector Class

The following algorithms are currently implemented in mlr3fselect:

In this example, we will use a simple random search and retrieve it from the dictionary mlr_fselectors with the fs() function:

fselector = fs("random_search")

4.5.7 Triggering the Tuning

To start the feature selection, we simply pass the FSelectInstanceSingleCrit to the $optimize() method of the initialized FSelector. The algorithm proceeds as follows

  1. The FSelector proposes at least one feature subset and may propose multiple subsets to improve parallelization, which can be controlled via the setting batch_size).
  2. For each feature subset, the given Learner is fitted on the Task using the provided Resampling. All evaluations are stored in the archive of the FSelectInstanceSingleCrit.
  3. The Terminator is queried if the budget is exhausted. If the budget is not exhausted, restart with 1) until it is.
  4. Determine the feature subset with the best observed performance.
  5. Store the best feature subset as the result in the instance object. The best feature subset ($result_feature_set) and the corresponding measured performance ($result_y) can be accessed from the instance.
# reduce logging output
lgr::get_logger("bbotk")$set_threshold("warn")

fselector$optimize(instance)
##     age glucose insulin mass pedigree pregnant pressure triceps
## 1: TRUE    TRUE    TRUE TRUE     TRUE     TRUE     TRUE    TRUE
##                                          features classif.ce
## 1: age,glucose,insulin,mass,pedigree,pregnant,...     0.2344
instance$result_feature_set
## [1] "age"      "glucose"  "insulin"  "mass"     "pedigree" "pregnant" "pressure"
## [8] "triceps"
instance$result_y
## classif.ce 
##     0.2344

One can investigate all resamplings which were undertaken, as they are stored in the archive of the FSelectInstanceSingleCrit and can be accessed by using as.data.table():

as.data.table(instance$archive)
##       age glucose insulin  mass pedigree pregnant pressure triceps classif.ce
##  1:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE     0.2344
##  2: FALSE    TRUE    TRUE FALSE    FALSE    FALSE    FALSE   FALSE     0.2461
##  3: FALSE    TRUE   FALSE  TRUE    FALSE    FALSE     TRUE   FALSE     0.2422
##  4:  TRUE   FALSE   FALSE  TRUE     TRUE    FALSE    FALSE   FALSE     0.3320
##  5:  TRUE   FALSE   FALSE  TRUE    FALSE    FALSE    FALSE    TRUE     0.2891
##  6:  TRUE   FALSE    TRUE  TRUE    FALSE     TRUE     TRUE   FALSE     0.2617
##  7: FALSE   FALSE    TRUE FALSE    FALSE    FALSE    FALSE   FALSE     0.3906
##  8:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE     0.2344
##  9: FALSE    TRUE    TRUE  TRUE    FALSE    FALSE     TRUE   FALSE     0.2422
## 10:  TRUE   FALSE    TRUE FALSE    FALSE     TRUE     TRUE    TRUE     0.2773
## 11:  TRUE   FALSE    TRUE  TRUE    FALSE    FALSE     TRUE    TRUE     0.2734
## 12: FALSE   FALSE   FALSE FALSE     TRUE    FALSE     TRUE   FALSE     0.3516
## 13: FALSE    TRUE   FALSE FALSE    FALSE     TRUE    FALSE   FALSE     0.2617
## 14: FALSE    TRUE   FALSE FALSE     TRUE    FALSE    FALSE   FALSE     0.2656
## 15:  TRUE    TRUE    TRUE FALSE     TRUE    FALSE     TRUE   FALSE     0.2695
## 16:  TRUE    TRUE   FALSE FALSE     TRUE     TRUE    FALSE   FALSE     0.2656
## 17: FALSE   FALSE   FALSE  TRUE     TRUE     TRUE     TRUE   FALSE     0.3438
## 18:  TRUE    TRUE   FALSE FALSE    FALSE    FALSE    FALSE    TRUE     0.2344
## 19:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE     0.2344
## 20: FALSE   FALSE   FALSE FALSE    FALSE     TRUE    FALSE   FALSE     0.3281
##     runtime_learners           timestamp batch_nr      resample_result
##  1:            0.092 2021-11-25 20:21:47        1 <ResampleResult[22]>
##  2:            0.085 2021-11-25 20:21:47        2 <ResampleResult[22]>
##  3:            0.172 2021-11-25 20:21:48        3 <ResampleResult[22]>
##  4:            0.072 2021-11-25 20:21:48        4 <ResampleResult[22]>
##  5:            0.079 2021-11-25 20:21:48        5 <ResampleResult[22]>
##  6:            0.078 2021-11-25 20:21:49        6 <ResampleResult[22]>
##  7:            0.075 2021-11-25 20:21:49        7 <ResampleResult[22]>
##  8:            0.076 2021-11-25 20:21:49        8 <ResampleResult[22]>
##  9:            0.085 2021-11-25 20:21:49        9 <ResampleResult[22]>
## 10:            0.081 2021-11-25 20:21:49       10 <ResampleResult[22]>
## 11:            0.080 2021-11-25 20:21:50       11 <ResampleResult[22]>
## 12:            0.076 2021-11-25 20:21:50       12 <ResampleResult[22]>
## 13:            0.077 2021-11-25 20:21:50       13 <ResampleResult[22]>
## 14:            0.073 2021-11-25 20:21:51       14 <ResampleResult[22]>
## 15:            0.082 2021-11-25 20:21:51       15 <ResampleResult[22]>
## 16:            0.090 2021-11-25 20:21:51       16 <ResampleResult[22]>
## 17:            0.074 2021-11-25 20:21:51       17 <ResampleResult[22]>
## 18:            0.072 2021-11-25 20:21:52       18 <ResampleResult[22]>
## 19:            0.080 2021-11-25 20:21:52       19 <ResampleResult[22]>
## 20:            0.081 2021-11-25 20:21:52       20 <ResampleResult[22]>

The associated resampling iterations can be accessed in the BenchmarkResult:

instance$archive$benchmark_result$data
## Warning: '.__BenchmarkResult__data' is deprecated.
## Use 'as.data.table(benchmark_result)' instead.
## See help("Deprecated")
## <ResultData>
##   Public:
##     as_data_table: function (view = NULL, reassemble_learners = TRUE, convert_predictions = TRUE, 
##     clone: function (deep = FALSE) 
##     combine: function (rdata) 
##     data: list
##     discard: function (backends = FALSE, models = FALSE) 
##     initialize: function (data = NULL, store_backends = TRUE) 
##     iterations: function (view = NULL) 
##     learners: function (view = NULL, states = TRUE, reassemble = TRUE) 
##     logs: function (view = NULL, condition) 
##     prediction: function (view = NULL, predict_sets = "test") 
##     predictions: function (view = NULL, predict_sets = "test") 
##     resamplings: function (view = NULL) 
##     sweep: function () 
##     task_type: active binding
##     tasks: function (view = NULL) 
##     uhashes: function (view = NULL) 
##   Private:
##     deep_clone: function (name, value) 
##     get_view_index: function (view)

The uhash column links the resampling iterations to the evaluated feature subsets stored in instance$archive$data(). This allows e.g. to score the included ResampleResults on a different measure.

Now the optimized feature subset can be used to subset the task and fit the model on all observations.

task$select(instance$result_feature_set)
learner$train(task)

The trained model can now be used to make a prediction on external data. Note that predicting on observations present in the task, should be avoided. The model has seen these observations already during feature selection and therefore results would be statistically biased. Hence, the resulting performance measure would be over-optimistic. Instead, to get statistically unbiased performance estimates for the current task, nested resampling is required.

4.5.8 Automating the Feature Selection

The AutoFSelector wraps a learner and augments it with an automatic feature selection for a given task. Because the AutoFSelector itself inherits from the Learner base class, it can be used like any other learner. Analogously to the previous subsection, a new classification tree learner is created. This classification tree learner automatically starts a feature selection on the given task using an inner resampling (holdout). We create a terminator which allows 10 evaluations, and uses a simple random search as feature selection algorithm:

learner = lrn("classif.rpart")
terminator = trm("evals", n_evals = 10)
fselector = fs("random_search")

at = AutoFSelector$new(
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = terminator,
  fselector = fselector
)
at
## <AutoFSelector:classif.rpart.fselector>
## * Model: -
## * Parameters: xval=0
## * Packages: mlr3, rpart
## * Predict Type: response
## * Feature types: logical, integer, numeric, factor, ordered
## * Properties: importance, missings, multiclass, selected_features,
##   twoclass, weights

We can now use the learner like any other learner, calling the $train() and $predict() method. This time however, we pass it to benchmark() to compare the optimized feature subset to the complete feature set. This way, the AutoFSelector will do its resampling for feature selection on the training set of the respective split of the outer resampling. The learner then undertakes predictions using the test set of the outer resampling. This yields unbiased performance measures, as the observations in the test set have not been used during feature selection or fitting of the respective learner. This is called nested resampling.

To compare the optimized feature subset with the complete feature set, we can use benchmark():

grid = benchmark_grid(
  task = tsk("pima"),
  learner = list(at, lrn("classif.rpart")),
  resampling = rsmp("cv", folds = 3)
)

bmr = benchmark(grid, store_models = TRUE)
bmr$aggregate(msrs(c("classif.ce", "time_train")))
##    nr      resample_result task_id              learner_id resampling_id iters
## 1:  1 <ResampleResult[22]>    pima classif.rpart.fselector            cv     3
## 2:  2 <ResampleResult[22]>    pima           classif.rpart            cv     3
##    classif.ce time_train
## 1:     0.2513          0
## 2:     0.2409          0

Note that we do not expect any significant differences since we only evaluated a small fraction of the possible feature subsets.