4  Model Optimization

In machine learning, when you are dissatisfied with the performance of a model, you might ask yourself how best to improve it.

This chapter might help answer this question.

Model Tuning

Machine learning algorithms have default values set for their hyperparameters. In many cases, these hyperparameters need to be changed by the user to achieve optimal performance on the given dataset. While you can certainly search for hyperparameter settings that improve performance manually, we do not recommend this approach as it is tedious and rarely leads to the best performance. Fortunately, the mlr3 ecosystem provides packages and tools for automated tuning. To tune a machine learning algorithm, you have to specify

  1. the search space,
  2. the optimization algorithm (i.e. tuning method),
  3. an evaluation method (i.e., a resampling strategy), and
  4. a performance measure.

The tuning part of this chapter walks through each of these steps in turn.

We will use the mlr3tuning package, which supports common tuning operations.

Feature Selection

Tuning the hyperparameters is only one way of improving the performance of your model. The second part of this chapter explains feature selection, also known as variable or descriptor selection. Feature selection is the process of finding the feature subset that is most relevant with respect to the prediction or for which the learner fits a model with the highest performance. Apart from improving model performance, there are additional reasons to perform feature selection, for example to make models easier to interpret or to speed up training and prediction.

Here, we mostly focus on feature selection as a means of improving model performance.

There are different approaches to identifying the relevant features. In the feature selection part, we describe three methods: filters, variable importance filters, and wrapper methods.

Note that filters operate independently of learners. Variable importance filters rely on the learner to extract information on feature importance from a trained model, for example, by inspecting a learned decision tree and returning the features that are used in the first few levels. The obtained importance values can be used to subset the data, which can then be used to train a learner. Wrapper methods can be used with any learner but need to train the learner potentially many times, making this the most expensive method.
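
As a brief preview, the following sketch shows what a variable importance filter could look like in code, using mlr3filters (which is attached by mlr3verse). The choice of the "importance" filter and the rpart learner here is only for illustration; the feature selection part covers this in detail.

library("mlr3verse")

task = tsk("pima")
# variable importance filter backed by a classification tree
filter = flt("importance", learner = lrn("classif.rpart"))
filter$calculate(task)
head(as.data.table(filter))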

Nested Resampling

For hyperparameter tuning, a normal resampling (e.g. a cross-validation) is no longer sufficient to ensure an unbiased evaluation. Consider the following thought experiment to gain intuition for why this is the case. Suppose a learner has a hyperparameter that has no real effect on the fitted model, but only introduces random noise into the predictions. When evaluating different values for this hyperparameter, one of them will appear to perform best purely by chance. This is the hyperparameter value that will be chosen as the best, although the hyperparameter has no real effect. To discover this, another separate validation set is required – it will reveal that the “optimized” setting really does not perform better than anything else.

We need nested resampling to ensure unbiased estimates of the generalization error during hyperparameter optimization. The final part of this chapter explains how nested resampling works and how to use it in mlr3.

4.1 Hyperparameter Tuning

Hyperparameters are the parameters of the learners that control how a model is fit to the data. They are sometimes called second-level or second-order parameters of machine learning – the parameters of the models are the first-order parameters and are “fit” to the data during model training. The hyperparameters of a learner can have a major impact on the performance of a learned model, but are often only optimized in an ad-hoc manner or not at all. The process of optimizing them is often called model ‘tuning’.
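
For example, hyperparameters can be set manually when constructing a learner or afterwards via its parameter set (the values below are chosen arbitrarily for illustration):

library("mlr3verse")

# set a hyperparameter at construction time ...
learner = lrn("classif.rpart", cp = 0.05)
# ... or change it later via the parameter set
learner$param_set$values$minsplit = 15
learner$param_set$values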

Hyperparameter tuning is supported via the mlr3tuning extension package. Below you can find an illustration of the general process.

At the heart of mlr3tuning are the R6 classes TuningInstanceSingleCrit and TuningInstanceMultiCrit, which describe the tuning problem and store the results, and Tuner, the base class for implementations of tuning algorithms.

4.1.1 The TuningInstance* Classes

We will examine the optimization of a simple classification tree on the Pima Indian Diabetes data set as an introductory example here.

library("mlr3verse")
task = tsk("pima")
print(task)
<TaskClassif:pima> (768 x 9): Pima Indian Diabetes
* Target: diabetes
* Properties: twoclass
* Features (8):
  - dbl (8): age, glucose, insulin, mass, pedigree, pregnant, pressure,
    triceps

We use the rpart classification tree and choose a subset of the hyperparameters we want to tune. This is often referred to as the “tuning space”. First, let’s look at all the hyperparameters that are available. Information on what they do can be found in the documentation of the learner.

learner = lrn("classif.rpart")
learner$param_set
<ParamSet>
                id    class lower upper nlevels        default value
 1:             cp ParamDbl     0     1     Inf           0.01      
 2:     keep_model ParamLgl    NA    NA       2          FALSE      
 3:     maxcompete ParamInt     0   Inf     Inf              4      
 4:       maxdepth ParamInt     1    30      30             30      
 5:   maxsurrogate ParamInt     0   Inf     Inf              5      
 6:      minbucket ParamInt     1   Inf     Inf <NoDefault[3]>      
 7:       minsplit ParamInt     1   Inf     Inf             20      
 8: surrogatestyle ParamInt     0     1       2              0      
 9:   usesurrogate ParamInt     0     2       3              2      
10:           xval ParamInt     0   Inf     Inf             10     0

Here, we opt to tune two hyperparameters:

  • The complexity hyperparameter cp that controls when the learner considers introducing another branch.
  • The minsplit hyperparameter that controls how many observations must be present in a leaf for another split to be attempted.

The tuning space needs to be bounded with lower and upper bounds for the values of the hyperparameters:

search_space = ps(
  cp = p_dbl(lower = 0.001, upper = 0.1),
  minsplit = p_int(lower = 1, upper = 10)
)
search_space
<ParamSet>
         id    class lower upper nlevels        default value
1:       cp ParamDbl 0.001   0.1     Inf <NoDefault[3]>      
2: minsplit ParamInt 1.000  10.0      10 <NoDefault[3]>      

The bounds are usually set based on experience.

Next, we need to specify how to evaluate the performance of a trained model. For this, we need to choose a resampling strategy and a performance measure.

hout = rsmp("holdout")
measure = msr("classif.ce")

Finally, we have to specify the budget available for tuning. This is a crucial step, as exhaustively evaluating all possible hyperparameter configurations is usually not feasible. mlr3 allows you to specify complex termination criteria by selecting one of the available Terminators.
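
For illustration, a few Terminators can be constructed like this (the limits chosen here are arbitrary):

# stop after 20 evaluations
trm("evals", n_evals = 20)
# stop after 5 minutes of tuning
trm("run_time", secs = 300)
# no limit, e.g. when the tuning algorithm stops on its own
trm("none")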

For this short introduction, we specify a budget of 20 iterations and then put everything together into a TuningInstanceSingleCrit:

evals20 = trm("evals", n_evals = 20)

instance = TuningInstanceSingleCrit$new(
  task = task,
  learner = learner,
  resampling = hout,
  measure = measure,
  search_space = search_space,
  terminator = evals20
)
instance
<TuningInstanceSingleCrit>
* State:  Not optimized
* Objective: <ObjectiveTuning:classif.rpart_on_pima>
* Search Space:
         id    class lower upper nlevels
1:       cp ParamDbl 0.001   0.1     Inf
2: minsplit ParamInt 1.000  10.0      10
* Terminator: <TerminatorEvals>

To start the tuning, we still need to select how the optimization should take place. In other words, we need to choose the optimization algorithm via the Tuner class.

4.1.2 The Tuner Class

The following algorithms are currently implemented in mlr3tuning:

If you’re interested in learning more about these approaches, the Wikipedia page on hyperparameter optimization is a good place to start.

In this example, we will use a simple grid search with a grid resolution of 5.

tuner = tnr("grid_search", resolution = 5)

As we have only numeric parameters, TunerGridSearch will create an equidistant grid between the respective upper and lower bounds. Our two-dimensional grid of resolution 5 consists of \(5^2 = 25\) configurations. Each configuration is a distinct setting of hyperparameter values for the previously defined Learner which is then fitted to the task and evaluated using the provided Resampling. All configurations will be examined by the tuner (in a random order), until either all configurations are evaluated or the Terminator signals that the budget is exhausted, i.e. here the tuner will stop after evaluating 20 of the 25 total configurations.
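
If you want to inspect the candidate configurations before running the tuner, paradox's generate_design_grid() can list the full grid; this is only a quick check and not required for tuning:

# the full 5 x 5 grid the tuner will draw from
generate_design_grid(search_space, resolution = 5)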

4.1.3 Triggering the Tuning

To start the tuning, we simply pass the TuningInstanceSingleCrit to the $optimize() method of the initialized Tuner. The tuner proceeds as follows:

  1. The Tuner proposes at least one hyperparameter configuration to evaluate (the Tuner may propose multiple points to be able to evaluate them in parallel, which can be controlled via the setting batch_size).
  2. For each configuration, the given Learner is fitted on the Task and evaluated using the provided Resampling. All evaluations are stored in the archive of the TuningInstanceSingleCrit.
  3. The Terminator is queried whether the budget is exhausted. If the budget is not exhausted, the process goes back to 1), else it terminates.
  4. Determine the configurations with the best observed performance from the archive.
  5. Store the best configurations as result in the tuning instance object. The best hyperparameter settings ($result_learner_param_vals) and the corresponding measured performance ($result_y) can be retrieved from the tuning instance.
tuner$optimize(instance)
      cp minsplit learner_param_vals  x_domain classif.ce
1: 0.001       10          <list[3]> <list[2]>  0.2734375
instance$result_learner_param_vals
$xval
[1] 0

$cp
[1] 0.001

$minsplit
[1] 10
instance$result_y
classif.ce 
 0.2734375 

You can investigate all of the evaluations that were performed; they are stored in the archive of the TuningInstanceSingleCrit and can be accessed by using as.data.table():

as.data.table(instance$archive)
         cp minsplit classif.ce x_domain_cp x_domain_minsplit runtime_learners
 1: 0.00100        5  0.2929688     0.00100                 5            0.026
 2: 0.00100        3  0.3125000     0.00100                 3            0.018
 3: 0.10000        1  0.2773438     0.10000                 1            0.021
 4: 0.02575       10  0.2773438     0.02575                10            0.017
 5: 0.10000        5  0.2773438     0.10000                 5            0.016
 6: 0.10000        3  0.2773438     0.10000                 3            0.015
 7: 0.00100       10  0.2734375     0.00100                10            0.023
 8: 0.00100        1  0.3125000     0.00100                 1            0.018
 9: 0.00100        8  0.2890625     0.00100                 8            0.018
10: 0.07525        1  0.2773438     0.07525                 1            0.016
11: 0.05050        8  0.2773438     0.05050                 8            0.016
12: 0.05050        3  0.2773438     0.05050                 3            0.015
13: 0.02575        1  0.2773438     0.02575                 1            0.024
14: 0.02575        3  0.2773438     0.02575                 3            0.016
15: 0.07525        3  0.2773438     0.07525                 3            0.015
16: 0.02575        5  0.2773438     0.02575                 5            0.015
17: 0.07525        8  0.2773438     0.07525                 8            0.017
18: 0.05050       10  0.2773438     0.05050                10            0.015
19: 0.07525        5  0.2773438     0.07525                 5            0.022
20: 0.05050        1  0.2773438     0.05050                 1            0.023
5 variables not shown: [timestamp, batch_nr, warnings, errors, resample_result]

Altogether, the grid search evaluated 20/25 different hyperparameter configurations in a random order before the Terminator stopped the tuning. In this example there were multiple configurations with the same best classification error, and without other criteria, the first one was returned. You may want to choose the configuration with the lowest classification error as well as the lowest time to train the model, or some other combination of criteria for hyperparameter selection. You can do this with TuningInstanceMultiCrit, see Tuning with Multiple Performance Measures.

The associated resampling iterations can be accessed in the BenchmarkResult of the tuning instance:

instance$archive$benchmark_result
<BenchmarkResult> of 20 rows with 20 resampling runs
 nr task_id    learner_id resampling_id iters warnings errors
  1    pima classif.rpart       holdout     1        0      0
  2    pima classif.rpart       holdout     1        0      0
  3    pima classif.rpart       holdout     1        0      0
  4    pima classif.rpart       holdout     1        0      0
  5    pima classif.rpart       holdout     1        0      0
  6    pima classif.rpart       holdout     1        0      0
  7    pima classif.rpart       holdout     1        0      0
  8    pima classif.rpart       holdout     1        0      0
  9    pima classif.rpart       holdout     1        0      0
 10    pima classif.rpart       holdout     1        0      0
 11    pima classif.rpart       holdout     1        0      0
 12    pima classif.rpart       holdout     1        0      0
 13    pima classif.rpart       holdout     1        0      0
 14    pima classif.rpart       holdout     1        0      0
 15    pima classif.rpart       holdout     1        0      0
 16    pima classif.rpart       holdout     1        0      0
 17    pima classif.rpart       holdout     1        0      0
 18    pima classif.rpart       holdout     1        0      0
 19    pima classif.rpart       holdout     1        0      0
 20    pima classif.rpart       holdout     1        0      0

The uhash column links the resampling iterations to the evaluated configurations stored in instance$archive$data. This allows, for example, scoring the included ResampleResults with a different performance measure.

instance$archive$benchmark_result$score(msr("classif.acc"))
                                   uhash nr              task task_id
 1: cb18fdee-f7aa-4687-ad76-a02acbe8e976  1 <TaskClassif[50]>    pima
 2: e1d36e0b-4e2a-4bc9-8e99-c218b34ac381  2 <TaskClassif[50]>    pima
 3: 1166b8f9-f21d-4294-95b4-83d31be88451  3 <TaskClassif[50]>    pima
 4: 35cf36a3-829a-4df9-96c0-258596b0dc26  4 <TaskClassif[50]>    pima
 5: 9b308e1a-bcb1-43d4-820c-b249296b7571  5 <TaskClassif[50]>    pima
 6: ebdc36be-58f7-4ba0-b92d-4ff6e12763df  6 <TaskClassif[50]>    pima
 7: 865a38c1-5da5-45cf-af71-ff9265b26799  7 <TaskClassif[50]>    pima
 8: 3f702309-dd85-4c0b-869f-c798b1dfd6d0  8 <TaskClassif[50]>    pima
 9: 5c3d216f-9261-4737-abf2-4a2d29f2d268  9 <TaskClassif[50]>    pima
10: b1069a43-c781-4609-9807-65df359a815a 10 <TaskClassif[50]>    pima
11: 38d02c92-9818-43aa-a6ab-312355df6193 11 <TaskClassif[50]>    pima
12: bd6b34b6-a521-4403-9cc9-2be084a1c7f2 12 <TaskClassif[50]>    pima
13: bf10ba14-7b97-48b9-b8fc-0392f7cb9901 13 <TaskClassif[50]>    pima
14: 73062648-aa19-45b3-b97b-13905357b4db 14 <TaskClassif[50]>    pima
15: 8f3a3cf7-6a21-4a46-ae3c-5384cbf69cdc 15 <TaskClassif[50]>    pima
16: f086cd9f-87c3-433d-a1a2-18f8686a0bc5 16 <TaskClassif[50]>    pima
17: 7785616a-695d-4013-a4d3-311d3b620a30 17 <TaskClassif[50]>    pima
18: 71d3fed4-5521-4897-b2d3-cbf87b41f271 18 <TaskClassif[50]>    pima
19: 8301e9f4-e7dd-4ff5-970a-5eb26e462a23 19 <TaskClassif[50]>    pima
20: 75e2e2fb-12a9-4675-8c89-1f0dd39d99dc 20 <TaskClassif[50]>    pima
7 variables not shown: [learner, learner_id, resampling, resampling_id, iteration, prediction, classif.acc]

Now we can take the optimized hyperparameters, set them for the previously-created Learner, and train it on the full dataset.

learner$param_set$values = instance$result_learner_param_vals
learner$train(task)

The trained model can now be used to make a prediction on new, external data. Note that predicting on observations present in the Task should be avoided because the model has seen these observations already during tuning and training and therefore performance values would be statistically biased – the resulting performance measure would be over-optimistic. To get statistically unbiased performance estimates for a given task, nested resampling is required.
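
As a minimal sketch, new observations can be passed to the learner's $predict_newdata() method; the data.frame below is entirely made up for illustration:

# hypothetical new observation (feature values invented for illustration)
new_obs = data.frame(
  age = 50, glucose = 148, insulin = 155, mass = 33.6,
  pedigree = 0.63, pregnant = 6, pressure = 72, triceps = 35
)
learner$predict_newdata(new_obs)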

4.1.4 Tuning with Multiple Performance Measures

When tuning, you might want to use multiple criteria to find the best configuration of hyperparameters. For example, you might want the configuration with the lowest classification error and lowest time to train the model. The full list of performance measures can be found here.

Continuing the above example and tuning the same hyperparameters:

  • The complexity hyperparameter cp that controls when the learner considers introducing another branch.
  • The minsplit hyperparameter that controls how many observations must be present in a leaf for another split to be attempted.

The tuning process is identical to the previous example; however, this time we will specify two performance measures: classification error and time to train the model (time_train).

measures = msrs(c("classif.ce", "time_train"))

Instead of creating a new TuningInstanceSingleCrit with a single measure, we create a new TuningInstanceMultiCrit with the two measures we are interested in here. Otherwise, it is the same as above.

library("mlr3tuning")

evals20 = trm("evals", n_evals = 20)

instance = TuningInstanceMultiCrit$new(
  task = task,
  learner = learner,
  resampling = hout,
  measures = measures,
  search_space = search_space,
  terminator = evals20
)
instance
<TuningInstanceMultiCrit>
* State:  Not optimized
* Objective: <ObjectiveTuning:classif.rpart_on_pima>
* Search Space:
         id    class lower upper nlevels
1:       cp ParamDbl 0.001   0.1     Inf
2: minsplit ParamInt 1.000  10.0      10
* Terminator: <TerminatorEvals>

After triggering the tuning, the instance will contain the configurations that perform best with respect to both classification error and training time.

tuner$optimize(instance)
    cp minsplit learner_param_vals  x_domain classif.ce time_train
1: 0.1        3          <list[3]> <list[2]>  0.2421875      0.007
instance$result_learner_param_vals
[[1]]
[[1]]$xval
[1] 0

[[1]]$cp
[1] 0.1

[[1]]$minsplit
[1] 3
instance$result_y
   classif.ce time_train
1:  0.2421875      0.007

4.1.5 Automating the Tuning

We can automate this entire process in mlr3 so that learners are tuned transparently, without the need to extract information on the best hyperparameter settings at the end. The AutoTuner wraps a learner and augments it with an automatic tuning process for a given set of hyperparameters. Because the AutoTuner itself inherits from the Learner base class, it can be used like any other learner. In keeping with our example above, we create a classification learner that tunes itself automatically. This classification tree learner tunes the parameters cp and minsplit using an inner resampling (holdout). We create a terminator which allows 10 evaluations, and use a simple random search as tuning algorithm:

learner = lrn("classif.rpart")
search_space = ps(
  cp = p_dbl(lower = 0.001, upper = 0.1),
  minsplit = p_int(lower = 1, upper = 10)
)
terminator = trm("evals", n_evals = 10)
tuner = tnr("random_search")

at = AutoTuner$new(
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  search_space = search_space,
  terminator = terminator,
  tuner = tuner
)
at
<AutoTuner:classif.rpart.tuned>
* Model: list
* Search Space:
<ParamSet>
         id    class lower upper nlevels        default value
1:       cp ParamDbl 0.001   0.1     Inf <NoDefault[3]>      
2: minsplit ParamInt 1.000  10.0      10 <NoDefault[3]>      
* Packages: mlr3, mlr3tuning, rpart
* Predict Type: response
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, multiclass, selected_features,
  twoclass, weights

We can now use the learner like any other learner, calling the $train() and $predict() methods. The difference to a normal learner is that $train() runs the tuning, which will take longer than a normal training process.

at$train(task)

We can also pass it to resample() and benchmark(), just like any other learner. This results in nested resampling, as sketched below.
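
A minimal sketch of such a call (nested resampling is discussed in detail later in this chapter, with a slightly different setup):

# inner holdout tuning inside an outer 3-fold cross-validation
rr = resample(task, at, rsmp("cv", folds = 3), store_models = TRUE)
rr$aggregate(msr("classif.ce"))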

4.2 Tuning Search Spaces

When running an optimization, it is important to inform the tuning algorithm about what hyperparameters are valid. Here the names, types, and valid ranges of each hyperparameter are important. All this information is communicated with objects of the class ParamSet, which is defined in paradox. While it is possible to create ParamSet objects using their $new() constructor, it is much shorter and more readable to use the ps() shortcut, which is presented here. For an in-depth description of paradox and its classes, see the [paradox](https://paradox.mlr-org.com) chapter.

Note that ParamSet objects exist in two contexts. First, ParamSet objects are used to define the space of valid parameter settings for a learner (and other objects). Second, they are used to define a search space for tuning. We are mainly interested in the latter. For example, consider the minsplit parameter of the classif.rpart Learner. The ParamSet associated with the learner has a lower but no upper bound. However, for tuning the value, a lower and upper bound must be given because tuning search spaces need to be bounded. For Learner or PipeOp objects, typically “unbounded” ParamSets are used. Here, however, we will mainly focus on creating “bounded” ParamSets that can be used for tuning. See the in-depth [paradox](https://paradox.mlr-org.com) chapter for more details on using ParamSets to define parameter ranges for use-cases besides tuning.
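
As a quick check of the claim above, the learner's own ParamSet shows that minsplit is bounded below but not above:

# minsplit of classif.rpart: lower bound 1, upper bound Inf
lrn("classif.rpart")$param_set$params$minsplit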

4.2.1 Creating ParamSets

An empty "ParamSet") – not yet very useful – can be constructed using just the "ps") call:

library("mlr3verse")

search_space = ps()
print(search_space)
<ParamSet>
Empty.

ps takes named Domain arguments that are turned into parameters. A possible search space for the "classif.svm" learner could for example be:

search_space = ps(
  cost = p_dbl(lower = 0.1, upper = 10),
  kernel = p_fct(levels = c("polynomial", "radial"))
)
print(search_space)
<ParamSet>
       id    class lower upper nlevels        default value
1:   cost ParamDbl   0.1    10     Inf <NoDefault[3]>      
2: kernel ParamFct    NA    NA       2 <NoDefault[3]>      

There are five domain constructors that produce parameters when given to ps:

Constructor   Description                            Is bounded?                      Underlying Class
p_dbl         Real valued parameter ("double")       When upper and lower are given   ParamDbl
p_int         Integer parameter                      When upper and lower are given   ParamInt
p_fct         Discrete valued parameter ("factor")   Always                           ParamFct
p_lgl         Logical / Boolean parameter            Always                           ParamLgl
p_uty         Untyped parameter                      Never                            ParamUty

These domain constructors each take some of the following arguments:

  • lower, upper: lower and upper bound of numerical parameters (p_dbl and p_int). These need to be given to get bounded parameter spaces valid for tuning.
  • levels: Allowed categorical values for p_fct parameters. Required argument for p_fct. See below for more details on this parameter.
  • trafo: transformation function, see below.
  • depends: dependencies, see below.
  • tags: Further information about a parameter, used for example by the hyperband tuner.
  • default: Value corresponding to default behavior when the parameter is not given. Not used for tuning search spaces.
  • special_vals: Valid values besides the normally accepted values for a parameter. Not used for tuning search spaces.
  • custom_check: Function that checks whether a value given to p_uty is valid. Not used for tuning search spaces.

The lower and upper parameters are always in the first and second positions respectively, except for p_fct, where levels is in the first position. It is therefore preferred to omit the argument names (e.g., lower = 0.1 becomes just 0.1). This way of defining a ParamSet is more concise than the equivalent definition above. Preferred:

search_space = ps(cost = p_dbl(0.1, 10), kernel = p_fct(c("polynomial", "radial")))

4.2.2 Transformations (trafo)

We can use the paradox function generate_design_grid() to look at the values that would be evaluated by grid search. (We are using rbindlist() here because the result of $transpose() is a list that is harder to read; if we did not use $transpose(), on the other hand, the transformations that we investigate here would not be applied.) In generate_design_grid(search_space, 3), search_space is the ParamSet argument and 3 is the specified resolution in the parameter space. The resolution for categorical parameters is ignored; these parameters always produce a grid over all of their valid levels. For numerical parameters, the endpoints of the ranges are always included in the grid, so if there were 3 levels for the kernel instead of 2 there would be 9 rows, and if the resolution were 4 in this example there would be 8 rows in the resulting table.

library("data.table")
rbindlist(generate_design_grid(search_space, 3)$transpose())
    cost     kernel
1:  0.10 polynomial
2:  0.10     radial
3:  5.05 polynomial
4:  5.05     radial
5: 10.00 polynomial
6: 10.00     radial

We notice that the cost parameter is taken on a linear scale. We assume, however, that the difference in cost between 0.1 and 1 should have a similar effect as the difference between 1 and 10. Therefore it makes more sense to tune it on a logarithmic scale. This is done by using a transformation (trafo). This is a function that is applied to a parameter after it has been sampled by the tuner. We can tune cost on a logarithmic scale by sampling on the linear scale [-1, 1] and computing 10^x from that value.

search_space = ps(
  cost = p_dbl(-1, 1, trafo = function(x) 10^x),
  kernel = p_fct(c("polynomial", "radial"))
)
rbindlist(generate_design_grid(search_space, 3)$transpose())
   cost     kernel
1:  0.1 polynomial
2:  0.1     radial
3:  1.0 polynomial
4:  1.0     radial
5: 10.0 polynomial
6: 10.0     radial

It is even possible to attach another transformation to the ParamSet as a whole that gets executed after the individual parameters' transformations have been performed. It is given through the .extra_trafo argument and should be a function with parameters x and param_set that takes a list of parameter values in x and returns a modified list. This transformation can access all parameter values of an evaluation and modify them with interactions. It is even possible to add or remove parameters. (The following is a bit of a silly example.)

search_space = ps(
  cost = p_dbl(-1, 1, trafo = function(x) 10^x),
  kernel = p_fct(c("polynomial", "radial")),
  .extra_trafo = function(x, param_set) {
    if (x$kernel == "polynomial") {
      x$cost = x$cost * 2
    }
    x
  }
)
rbindlist(generate_design_grid(search_space, 3)$transpose())
   cost     kernel
1:  0.2 polynomial
2:  0.1     radial
3:  2.0 polynomial
4:  1.0     radial
5: 20.0 polynomial
6: 10.0     radial

The available types of search space parameters are limited: continuous, integer, discrete, and logical scalars. There are many machine learning algorithms, however, that take parameters of other types, for example vectors or functions. These cannot be defined in a search space ParamSet, and they are often given as ParamUty in the Learner's ParamSet. When trying to tune over such hyperparameters, it is necessary to perform a transformation that changes the type of the parameter.

An example is the class.weights parameter of the Support Vector Machine (SVM), which takes a named vector of class weights with one entry for each target class. A trafo that tunes class.weights for the spam dataset (tsk("spam")) could be:

search_space = ps(
  class.weights = p_dbl(0.1, 0.9, trafo = function(x) c(spam = x, nonspam = 1 - x))
)
generate_design_grid(search_space, 3)$transpose()
[[1]]
[[1]]$class.weights
   spam nonspam 
    0.1     0.9 


[[2]]
[[2]]$class.weights
   spam nonspam 
    0.5     0.5 


[[3]]
[[3]]$class.weights
   spam nonspam 
    0.9     0.1 

(We are omitting rbindlist() in this example because it breaks the vector valued return elements.)

4.2.3 Automatic Factor Level Transformation

A common use-case is the necessity to specify a list of values that should all be tried (or sampled from). It may be the case that a hyperparameter accepts function objects as values and a certain list of functions should be tried. Or it may be that a choice of special numeric values should be tried. For this, the p_fct constructor's levels argument may be a value that is not a character vector, but something else. If, for example, only the values 0.1, 3, and 10 should be tried for the cost parameter, even when doing random search, then the following search space would achieve that:

search_space = ps(
  cost = p_fct(c(0.1, 3, 10)),
  kernel = p_fct(c("polynomial", "radial"))
)
rbindlist(generate_design_grid(search_space, 3)$transpose())
   cost     kernel
1:  0.1 polynomial
2:  0.1     radial
3:  3.0 polynomial
4:  3.0     radial
5: 10.0 polynomial
6: 10.0     radial

This is equivalent to the following:

search_space = ps(
  cost = p_fct(c("0.1", "3", "10"),
    trafo = function(x) list(`0.1` = 0.1, `3` = 3, `10` = 10)[[x]]),
  kernel = p_fct(c("polynomial", "radial"))
)
rbindlist(generate_design_grid(search_space, 3)$transpose())
   cost     kernel
1:  0.1 polynomial
2:  0.1     radial
3:  3.0 polynomial
4:  3.0     radial
5: 10.0 polynomial
6: 10.0     radial

Note: Though the resolution is 3 here, in this case it doesn’t matter because both cost and kernel are factors (the resolution for categorical variables is ignored, these parameters always produce a grid over all their valid levels).

This may seem silly, but makes sense when considering that factor tuning parameters are always encoded as character values:

search_space = ps(
  cost = p_fct(c(0.1, 3, 10)),
  kernel = p_fct(c("polynomial", "radial"))
)
typeof(search_space$params$cost$levels)
[1] "character"

Be aware that this results in an “unordered” hyperparameter, however. Tuning algorithms that make use of ordering information of parameters, like genetic algorithms or model based optimization, will perform worse when this is done. For these algorithms, it may make more sense to define a p_dbl or p_int with a more fitting trafo.
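
A sketch of such an ordered alternative: instead of a factor over the three values, one could tune an integer exponent and map it to the actual cost value with a trafo (note that the resulting values differ from the 0.1, 3, 10 of the factor version above):

# ordered alternative: tune an exponent and transform it to the cost value
search_space = ps(
  cost = p_int(-1, 1, trafo = function(x) 10^x),
  kernel = p_fct(c("polynomial", "radial"))
)
rbindlist(generate_design_grid(search_space, 3)$transpose())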

The class.weights case from above can also be implemented like this, if there are only a few candidates of class.weights vectors that should be tried. Note that the levels argument of p_fct must be named if there is no easy way for as.character() to create names:

search_space = ps(
  class.weights = p_fct(
    list(
      candidate_a = c(spam = 0.5, nonspam = 0.5),
      candidate_b = c(spam = 0.3, nonspam = 0.7)
    )
  )
)
generate_design_grid(search_space)$transpose()
[[1]]
[[1]]$class.weights
   spam nonspam 
    0.5     0.5 


[[2]]
[[2]]$class.weights
   spam nonspam 
    0.3     0.7 

4.2.4 Parameter Dependencies (depends)

Some parameters are only relevant when another parameter has a certain value, or one of several values. The Support Vector Machine (SVM), for example, has the degree parameter that is only valid when kernel is "polynomial". This can be specified using the depends argument. It is an expression that must involve other parameters and be of the form <param> == <scalar>, <param> %in% <vector>, or multiple of these chained by &&. To tune the degree parameter, one would need to do the following:

search_space = ps(
  cost = p_dbl(-1, 1, trafo = function(x) 10^x),
  kernel = p_fct(c("polynomial", "radial")),
  degree = p_int(1, 3, depends = kernel == "polynomial")
)
rbindlist(generate_design_grid(search_space, 3)$transpose(), fill = TRUE)
    cost     kernel degree
 1:  0.1 polynomial      1
 2:  0.1 polynomial      2
 3:  0.1 polynomial      3
 4:  0.1     radial     NA
 5:  1.0 polynomial      1
 6:  1.0 polynomial      2
 7:  1.0 polynomial      3
 8:  1.0     radial     NA
 9: 10.0 polynomial      1
10: 10.0 polynomial      2
11: 10.0 polynomial      3
12: 10.0     radial     NA

4.2.5 Creating Tuning ParamSets from other ParamSets

Having to define a tuning ParamSet for a Learner that already has parameter set information may seem unnecessarily tedious, and there is indeed a way to create tuning ParamSets from a Learner's ParamSet, making use of as much information as is already available.

This is done by setting values of a Learner’s ParamSet to so-called TuneTokens, constructed with a to_tune call. This can be done in the same way that other hyperparameters are set to specific values. It can be understood as the hyperparameters being tagged for later tuning. The resulting ParamSet used for tuning can be retrieved using the $search_space() method.

learner = lrn("classif.svm")
learner$param_set$values$kernel = "polynomial" # for example
learner$param_set$values$degree = to_tune(lower = 1, upper = 3)

print(learner$param_set$search_space())
<ParamSet>
       id    class lower upper nlevels        default value
1: degree ParamInt     1     3       3 <NoDefault[3]>      
rbindlist(generate_design_grid(
  learner$param_set$search_space(), 3)$transpose()
)
   degree
1:      1
2:      2
3:      3

It is possible to omit lower here, because it can be inferred from the lower bound of the degree parameter itself. For other parameters that are already bounded, it is possible to give no bounds at all, because their ranges are already restricted. An example is the logical shrinking hyperparameter:

learner$param_set$values$shrinking = to_tune()

print(learner$param_set$search_space())
<ParamSet>
          id    class lower upper nlevels        default value
1:    degree ParamInt     1     3       3 <NoDefault[3]>      
2: shrinking ParamLgl    NA    NA       2           TRUE      
rbindlist(generate_design_grid(
  learner$param_set$search_space(), 3)$transpose()
)
   degree shrinking
1:      1      TRUE
2:      1     FALSE
3:      2      TRUE
4:      2     FALSE
5:      3      TRUE
6:      3     FALSE

"to_tune") can also be constructed with a Domain object, i.e. something constructed with a p_*** call. This way it is possible to tune continuous parameters with discrete values, or to give trafos or dependencies. One could, for example, tune the cost as above on three given special values, and introduce a dependency of shrinking on it. Notice that a short form for to_tune(<levels>) is a short form of to_tune(p_fct(<levels>)).

Note

When introducing the dependency, we need to use the cost value from before the implicit trafo, which is the name or as.character() of the respective value, here "val2"!

learner$param_set$values$type = "C-classification" # needs to be set because of a bug in paradox
learner$param_set$values$cost = to_tune(c(val1 = 0.3, val2 = 0.7))
learner$param_set$values$shrinking = to_tune(p_lgl(depends = cost == "val2"))

print(learner$param_set$search_space())
<ParamSet>
          id    class lower upper nlevels        default parents value
1:      cost ParamFct    NA    NA       2 <NoDefault[3]>              
2:    degree ParamInt     1     3       3 <NoDefault[3]>              
3: shrinking ParamLgl    NA    NA       2 <NoDefault[3]>    cost      
Trafo is set.
rbindlist(generate_design_grid(learner$param_set$search_space(), 3)$transpose(), fill = TRUE)
   degree cost shrinking
1:      1  0.3        NA
2:      1  0.7      TRUE
3:      1  0.7     FALSE
4:      2  0.3        NA
5:      2  0.7      TRUE
6:      2  0.7     FALSE
7:      3  0.3        NA
8:      3  0.7      TRUE
9:      3  0.7     FALSE

The "search_space() picks up dependencies fromt the underlying ParamSet automatically. So if the kernel is tuned, then degree automatically gets the dependency on it, without us having to specify that. (Here we reset cost and shrinking to NULL for the sake of clarity of the generated output.)

learner$param_set$values$cost = NULL
learner$param_set$values$shrinking = NULL
learner$param_set$values$kernel = to_tune(c("polynomial", "radial"))

print(learner$param_set$search_space())
<ParamSet>
       id    class lower upper nlevels        default parents value
1: degree ParamInt     1     3       3 <NoDefault[3]>  kernel      
2: kernel ParamFct    NA    NA       2 <NoDefault[3]>              
rbindlist(generate_design_grid(learner$param_set$search_space(), 3)$transpose(), fill = TRUE)
       kernel degree
1: polynomial      1
2: polynomial      2
3: polynomial      3
4:     radial     NA

It is even possible to define whole ParamSets that get tuned over for a single parameter. This may be especially useful for vector hyperparameters that should be searched along multiple dimensions. This ParamSet must, however, have an .extra_trafo that returns a list with a single element, because it corresponds to a single hyperparameter that is being tuned. Suppose the class.weights hyperparameter should be tuned along two dimensions:

learner$param_set$values$class.weights = to_tune(
  ps(spam = p_dbl(0.1, 0.9), nonspam = p_dbl(0.1, 0.9),
    .extra_trafo = function(x, param_set) list(c(spam = x$spam, nonspam = x$nonspam))
))
head(generate_design_grid(learner$param_set$search_space(), 3)$transpose(), 3)
[[1]]
[[1]]$kernel
[1] "polynomial"

[[1]]$degree
[1] 1

[[1]]$class.weights
   spam nonspam 
    0.1     0.1 


[[2]]
[[2]]$kernel
[1] "polynomial"

[[2]]$degree
[1] 1

[[2]]$class.weights
   spam nonspam 
    0.1     0.5 


[[3]]
[[3]]$kernel
[1] "polynomial"

[[3]]$degree
[1] 1

[[3]]$class.weights
   spam nonspam 
    0.1     0.9 

4.3 Nested Resampling

Evaluating a machine learning model often requires an additional layer of resampling when hyperparameters or features have to be selected. Nested resampling separates these model selection steps from the process of estimating the performance of the model. If the same data is used for the model selection steps and the evaluation of the model itself, the resulting performance estimate might be severely biased. One reason for this bias is that the repeated evaluation of the model on the test data could leak information about its structure into the model, which results in over-optimistic performance estimates. Keep in mind that nested resampling is a statistical procedure to estimate the predictive performance of the model trained on the full dataset; it is not a procedure to select optimal hyperparameters. The resampling produces many hyperparameter configurations, which should not be used to construct a final model (Simon 2007).

The graphic above illustrates nested resampling for hyperparameter tuning with 3-fold cross-validation in the outer resampling and 4-fold cross-validation in the inner resampling.

The nested resampling process:

  1. Uses a 3-fold cross-validation to get different testing and training data sets (outer resampling).
  2. Within the training data uses a 4-fold cross-validation to get different inner testing and training data sets (inner resampling).
  3. Tunes the hyperparameters using the inner data splits.
  4. Fits the learner on the outer training data set using the tuned hyperparameter configuration obtained with the inner resampling.
  5. Evaluates the performance of the learner on the outer testing data.
  6. Steps 2-5 are repeated for each of the three folds (outer resampling).
  7. The three performance values are aggregated for an unbiased performance estimate.

See also this article for more explanations.

4.3.1 Execution

The previous section examined the optimization of a simple classification tree on the Pima Indian Diabetes data set (tsk("pima")). We continue the example and estimate the predictive performance of the model with nested resampling.

We use a 4-fold cross-validation in the inner resampling loop. The AutoTuner executes the hyperparameter tuning and is stopped after 5 evaluations. The hyperparameter configurations are proposed by grid search.

library("mlr3verse")

learner = lrn("classif.rpart")
resampling = rsmp("cv", folds = 4)
measure = msr("classif.ce")
search_space = ps(cp = p_dbl(lower = 0.001, upper = 0.1))
terminator = trm("evals", n_evals = 5)
tuner = tnr("grid_search", resolution = 10)

at = AutoTuner$new(learner, resampling, measure, terminator, tuner, search_space)

A 3-fold cross-validation is used in the outer resampling loop. Hyperparameter tuning is performed on each of the three outer training sets, so we obtain three optimized hyperparameter configurations. To execute the nested resampling, we pass the AutoTuner to the resample() function. We have to set store_models = TRUE because we need the AutoTuner models to investigate the inner tuning.

task = tsk("pima")
outer_resampling = rsmp("cv", folds = 3)

rr = resample(task, at, outer_resampling, store_models = TRUE)

You can freely combine different inner and outer resampling strategies. Nested resampling is not restricted to hyperparameter tuning: you can swap the AutoTuner for an AutoFSelector and estimate the performance of a model which is fitted on an optimized feature subset, as sketched below.
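
A rough sketch of what this might look like with mlr3fselect; the constructor arguments mirror those of the AutoTuner, but check the mlr3fselect documentation for the exact interface:

library("mlr3fselect")

afs = AutoFSelector$new(
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 4),
  measure = msr("classif.ce"),
  terminator = trm("evals", n_evals = 5),
  fselector = fs("random_search")
)
# nested resampling over feature subsets instead of hyperparameters
rr_fs = resample(tsk("pima"), afs, rsmp("cv", folds = 3), store_models = TRUE)
rr_fs$aggregate()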

4.3.2 Evaluation

With the created ResampleResult we can now inspect the executed resampling iterations more closely. See the section on Resampling for more detailed information about ResampleResult objects.

We check the inner tuning results for stable hyperparameters, meaning that the selected hyperparameters should not vary too much between the outer folds. We might observe unstable models in this example because the small data set and the low number of resampling iterations introduce too much randomness. Usually, we aim for the selection of stable hyperparameters for all outer training sets.
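
The table below contains the inner tuning results of the three outer folds; one way to obtain it is mlr3tuning's extract_inner_tuning_results() helper:

extract_inner_tuning_results(rr)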

   iteration    cp classif.ce learner_param_vals  x_domain task_id
1:         1 0.056  0.2734375          <list[2]> <list[1]>    pima
2:         2 0.034  0.2460938          <list[2]> <list[1]>    pima
3:         3 0.001  0.2519531          <list[2]> <list[1]>    pima
2 variables not shown: [learner_id, resampling_id]

Next, we want to compare the predictive performances estimated on the outer resampling to the inner resampling. Significantly lower predictive performances on the outer resampling indicate that the models with the optimized hyperparameters overfit the data.

rr$score()
                task task_id         learner          learner_id
1: <TaskClassif[50]>    pima <AutoTuner[42]> classif.rpart.tuned
2: <TaskClassif[50]>    pima <AutoTuner[42]> classif.rpart.tuned
3: <TaskClassif[50]>    pima <AutoTuner[42]> classif.rpart.tuned
5 variables not shown: [resampling, resampling_id, iteration, prediction, classif.ce]

The aggregated performance of all outer resampling iterations is essentially the unbiased performance of the model with optimal hyperparameters found by grid search.

rr$aggregate()
classif.ce 
 0.2838542 

Note that nested resampling is computationally expensive. For this reason we use a relatively small number of hyperparameter configurations and a low number of resampling iterations in this example. In practice, you normally have to increase both. As this is computationally intensive, you might want to have a look at the section on Parallelization.

4.3.3 Final Model

We can use the AutoTuner to tune the hyperparameters of our learner and fit the final model on the full data set.

at$train(task)

The trained model can now be used to make predictions on new data.

Warning

A common mistake is to report the performance estimated on the resampling sets on which the tuning was performed (at$tuning_result$classif.ce) as the model’s performance.

Instead, the performance estimated with nested resampling should be reported as the actual performance of the model.

4.4 Tuning with Hyperband

Besides the more traditional tuning methods, the ecosystem around mlr3 offers another procedure for hyperparameter optimization called Hyperband implemented in the mlr3hyperband package.

Hyperband is a budget-oriented procedure, weeding out suboptimal performing configurations early on during a partially sequential training process, increasing tuning efficiency as a consequence. For this, a combination of incremental resource allocation and early stopping is used: As optimization progresses, computational resources are increased for more promising configurations, while less promising ones are terminated early.

To give an introductory analogy, imagine two horse trainers are given eight untrained horses. Both trainers want to win the upcoming race, but they are only given 32 units of food. Given that each horse can be fed up to 8 units of food (the “maximum budget” per horse), there is not enough food for all the horses. It is critical to identify the most promising horses early and give them enough food to improve, so the trainers need to develop a strategy to split up the food in the best possible way.

The first trainer is very optimistic and wants to explore the full capabilities of a horse, because he does not want to pass judgment on a horse's performance unless it has been fully trained. So he divides his budget by the maximum amount he can give to a horse (let's say eight, so \(32 / 8 = 4\)) and randomly picks four horses – his budget simply is not enough to fully train more. Those four horses are then trained to their full capabilities, while the rest are set free. This way, the trainer is confident about choosing the best of the four trained horses, but he might have overlooked the horse with the highest potential, since he only focused on half of them.

The other trainer is more creative and develops a different strategy. He thinks that if a horse is not performing well at the beginning, it will also not improve after further training. Based on this assumption, he decides to give one unit of food to each horse and observes how they develop. After the initial food is consumed, he checks their performance and kicks the slowest half out of his training regime. Then he increases the available food for the remaining horses and trains them further until the food is consumed again, only to kick out the worst half once more. He repeats this until the one remaining horse gets the rest of the food. This means only one horse is fully trained, but on the flip side, he was able to start training with all eight horses.

On race day, all the horses are put on the starting line. But which trainer will have the winning horse? The one who tried to train a maximum number of horses to their fullest? Or the other one, who made assumptions about the training progress of his horses? How the training phases might look is visualized in Figure 4.1.

Figure 4.1: Visualization of how the training processes may look. The left plot corresponds to the non-selective trainer, while the right one corresponds to the selective trainer.

Hyperband works very similarly in some ways, but differently in others. It is not embodied by one of the trainers in our analogy, but rather by the person who pays them. Hyperband consists of several brackets, each bracket corresponding to a trainer, and we do not care about horses but about hyperparameter configurations of a machine learning algorithm. The budget is not in terms of food, but in terms of a hyperparameter of the learner that scales in some way with the computational effort. An example is the number of epochs we train a neural network, or the number of iterations in boosting. Furthermore, there are not only two brackets (or trainers) but several, each placed at a unique spot between being fully explorative of later training stages and being extremely selective, i.e. focusing on early training stages.

The level of selection aggressiveness is handled by a user-defined parameter called \(\eta\). Here, \(1/\eta\) is the fraction of configurations that remain after a bracket removes its worst-performing ones, and \(\eta\) is also the factor by which the budget is increased for the next stage. Because a different maximum budget per configuration makes sense in different scenarios, the user also has to set this maximum as the \(R\) parameter. No further parameters are required for Hyperband – the full required budget across all brackets is indirectly given by \[ (\lfloor \log_{\eta}{R} \rfloor + 1)^2 \cdot R \]

(Li et al. 2016). To give an idea of how a full bracket layout might look for a specific \(R\) and \(\eta\), a quick overview is given in the following table.

Hyperband layout for \(\eta = 2\) and \(R = 8\), consisting of four brackets with \(n\) as the number of active configurations.

Bracket 1          Bracket 2          Bracket 3          Bracket 4
stage budget  n    stage budget  n    stage budget  n    stage budget  n
    1      1  8        1      2  6        1      4  4        1      8  4
    2      2  4        2      4  3        2      8  2
    3      4  2        3      8  1
    4      8  1
Of course, early termination based on a performance criterion may be disadvantageous if it is done too aggressively in certain scenarios. A learner that jumps around radically in its estimated performance during the training phase may get its best configurations canceled too early, simply because they do not improve quickly enough compared to others. In other words, it is often unclear beforehand whether having a high number of configurations \(n\) that get aggressively discarded early is better than having a high budget per configuration (\(B/n\)). The resulting tradeoff is called the “\(n\) versus \(B/n\) problem”. To create a balance between selection based on early training performance and exploration of training performance in later training stages, \(\lfloor \log_{\eta}{R} \rfloor + 1\) brackets are constructed with associated sets of configurations of varying size. Thus, some brackets contain more configurations, with a small initial budget: in these, many are discarded after having been trained for only a short amount of time, corresponding to the selective trainer in our horse analogy. Others are constructed with fewer configurations, where discarding only takes place after a significant amount of budget has been consumed. The last bracket usually never discards anything, but also starts with only very few configurations – this is equivalent to the trainer explorative of later stages. The former corresponds to high \(n\), while the latter corresponds to high \(B/n\). Even though different brackets are initialized with different numbers of configurations and different initial budget sizes, each bracket is assigned (approximately) the same budget \((\lfloor \log_{\eta}{R} \rfloor + 1) \cdot R\).

The configurations at the start of each bracket are initialized by random, often uniform sampling. Note that currently all configurations are trained completely from the beginning, so no online updating of models from stage to stage takes place.

To identify the budget for evaluating Hyperband, the user has to specify explicitly which hyperparameter of the learner influences the budget, by tagging a single hyperparameter in the ParamSet with tags = "budget", like in the following snippet:

library("mlr3verse")

# Hyperparameter subset of XGBoost
search_space = ps(
  nrounds = p_int(lower = 1, upper = 16, tags = "budget"),
  booster = p_fct(levels = c("gbtree", "gblinear", "dart"))
)

Thanks to the broad ecosystem of the mlr3verse, a learner does not need a natural budget parameter. A typical case of this would be decision trees. By using subsampling as a preprocessing step with mlr3pipelines, we can work around a missing budget parameter.

set.seed(123)

# extend "classif.rpart" with "subsampling" as preprocessing step
ll = po("subsample") %>>% lrn("classif.rpart")

# extend hyperparameters of "classif.rpart" with subsampling fraction as budget
search_space = ps(
  classif.rpart.cp = p_dbl(lower = 0.001, upper = 0.1),
  classif.rpart.minsplit = p_int(lower = 1, upper = 10),
  subsample.frac = p_dbl(lower = 0.1, upper = 1, tags = "budget")
)

We can now plug the new learner with the extended hyperparameter set into a TuningInstanceSingleCrit the same way as usual. Naturally, Hyperband terminates once all of its brackets are evaluated, so a Terminator in the tuning instance acts as an upper bound and should only be set to a low value if one is unsure of how long Hyperband will take to finish under the given settings.

instance = TuningInstanceSingleCrit$new(
  task = tsk("iris"),
  learner = ll,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = trm("none"), # hyperband terminates itself
  search_space = search_space
)

Now we initialize a new instance of the Hyperband tuner (TunerHyperband from mlr3hyperband) and start tuning with it.

library("mlr3hyperband")
tuner = tnr("hyperband", eta = 3)

# reduce logging output
lgr::get_logger("bbotk")$set_threshold("warn")

tuner$optimize(instance)
   classif.rpart.cp classif.rpart.minsplit subsample.frac learner_param_vals
1:       0.07348139                      5      0.1111111          <list[6]>
2 variables not shown: [x_domain, classif.ce]

To receive the results of each sampled configuration, we simply run the following snippet.

as.data.table(instance$archive)[, c(
  "subsample.frac",
  "classif.rpart.cp",
  "classif.rpart.minsplit",
  "classif.ce"
), with = FALSE]
    subsample.frac classif.rpart.cp classif.rpart.minsplit classif.ce
 1:      0.1111111       0.02532664                      3       0.04
 2:      0.1111111       0.07348139                      5       0.02
 3:      0.1111111       0.08489786                      3       0.02
 4:      0.1111111       0.05025520                      6       0.02
 5:      0.1111111       0.03940299                      4       0.02
 6:      0.1111111       0.02539845                      7       0.42
 7:      0.1111111       0.01199855                      4       0.14
 8:      0.1111111       0.03960945                      4       0.02
 9:      0.1111111       0.05762160                      6       0.02
10:      0.3333333       0.04186187                      8       0.02
11:      0.3333333       0.02730298                      7       0.08
12:      0.3333333       0.06336733                      7       0.04
13:      0.3333333       0.01919902                      4       0.06
14:      0.3333333       0.08650077                      6       0.02
15:      0.3333333       0.07348139                      5       0.04
16:      0.3333333       0.08489786                      3       0.02
17:      0.3333333       0.05025520                      6       0.06
18:      1.0000000       0.08413701                      3       0.04
19:      1.0000000       0.03193237                      6       0.04
20:      1.0000000       0.07112074                      5       0.04
21:      1.0000000       0.08489786                      3       0.04
22:      1.0000000       0.04186187                      8       0.04
    subsample.frac classif.rpart.cp classif.rpart.minsplit classif.ce

You can access the best found configuration through the instance object.

instance$result
   classif.rpart.cp classif.rpart.minsplit subsample.frac learner_param_vals
1:       0.07348139                      5      0.1111111          <list[6]>
2 variables not shown: [x_domain, classif.ce]
instance$result_learner_param_vals
$subsample.frac
[1] 0.1111111

$subsample.stratify
[1] FALSE

$subsample.replace
[1] FALSE

$classif.rpart.xval
[1] 0

$classif.rpart.cp
[1] 0.07348139

$classif.rpart.minsplit
[1] 5
instance$result_y
classif.ce 
      0.02 

Traditionally, Hyperband uses uniform sampling to draw the configurations at the start of each bracket, but it is also possible to define a custom Sampler for each hyperparameter.

search_space = ps(
  nrounds = p_int(lower = 1, upper = 16, tags = "budget"),
  eta = p_dbl(lower = 0, upper = 1),
  booster = p_fct(levels = c("gbtree", "gblinear", "dart"))
)

instance = TuningInstanceSingleCrit$new(
  task = tsk("iris"),
  learner = lrn("classif.xgboost"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = trm("none"), # hyperband terminates itself
  search_space = search_space
)

# beta distribution with alpha = 2 and beta = 5
# categorical distribution with custom probabilities
sampler = SamplerJointIndep$new(list(
  Sampler1DRfun$new(search_space$params$eta, function(n) rbeta(n, 2, 5)),
  Sampler1DCateg$new(search_space$params$booster, prob = c(0.2, 0.3, 0.5))
))

Then, the defined sampler has to be passed as an argument when constructing the tuner. Afterwards, the usual tuning can proceed.

tuner = tnr("hyperband", eta = 2, sampler = sampler)
set.seed(123)
tuner$optimize(instance)
   nrounds       eta booster learner_param_vals  x_domain classif.ce
1:       1 0.2414701    dart          <list[5]> <list[3]>       0.04
instance$result
   nrounds       eta booster learner_param_vals  x_domain classif.ce
1:       1 0.2414701    dart          <list[5]> <list[3]>       0.04

Furthermore, the original algorithm was extended so that mlr3hyperband can also be used for multi-objective optimization. To do this, simply specify multiple measures in a TuningInstanceMultiCrit and run the rest as usual.

instance = TuningInstanceMultiCrit$new(
  task = tsk("pima"),
  learner = lrn("classif.xgboost"),
  resampling = rsmp("holdout"),
  measures = msrs(c("classif.tpr", "classif.fpr")),
  terminator = trm("none"), # hyperband terminates itself
  search_space = search_space
)

tuner = tnr("hyperband", eta = 4)
tuner$optimize(instance)
    nrounds       eta  booster learner_param_vals  x_domain classif.tpr
 1:       1 0.5654654 gblinear          <list[5]> <list[3]>  0.00000000
 2:       1 0.8296239 gblinear          <list[5]> <list[3]>  0.00000000
 3:       1 0.3914988 gblinear          <list[5]> <list[3]>  0.00000000
 4:       1 0.2736227 gblinear          <list[5]> <list[3]>  0.00000000
 5:       1 0.5938669 gblinear          <list[5]> <list[3]>  0.00000000
 6:       1 0.1601848 gblinear          <list[5]> <list[3]>  0.00000000
 7:       1 0.8477392 gblinear          <list[5]> <list[3]>  0.00000000
 8:       1 0.7736921 gblinear          <list[5]> <list[3]>  0.00000000
 9:       1 0.2954001 gblinear          <list[5]> <list[3]>  0.00000000
10:       4 0.2688034 gblinear          <list[5]> <list[3]>  0.00000000
11:       4 0.5551772 gblinear          <list[5]> <list[3]>  0.06172840
12:       4 0.2312309   gbtree          <list[5]> <list[3]>  0.72839506
13:       4 0.5432329 gblinear          <list[5]> <list[3]>  0.06172840
14:       4 0.4475701 gblinear          <list[5]> <list[3]>  0.03703704
15:       4 0.1302881 gblinear          <list[5]> <list[3]>  0.00000000
16:       4 0.5654654 gblinear          <list[5]> <list[3]>  0.07407407
17:       4 0.7736921 gblinear          <list[5]> <list[3]>  0.14814815
18:      16 0.2882662   gbtree          <list[5]> <list[3]>  0.71604938
19:      16 0.7275336 gblinear          <list[5]> <list[3]>  0.45679012
20:      16 0.2688034 gblinear          <list[5]> <list[3]>  0.20987654
1 variable not shown: [classif.fpr]

Now the result is not a single best configuration but an estimated Pareto front. In the plot below, the red points mark configurations that are not dominated by any other configuration with respect to the fpr and tpr performance measures.

instance$result
    nrounds       eta  booster learner_param_vals  x_domain classif.tpr
 1:       1 0.5654654 gblinear          <list[5]> <list[3]>  0.00000000
 2:       1 0.8296239 gblinear          <list[5]> <list[3]>  0.00000000
 3:       1 0.3914988 gblinear          <list[5]> <list[3]>  0.00000000
 4:       1 0.2736227 gblinear          <list[5]> <list[3]>  0.00000000
 5:       1 0.5938669 gblinear          <list[5]> <list[3]>  0.00000000
 6:       1 0.1601848 gblinear          <list[5]> <list[3]>  0.00000000
 7:       1 0.8477392 gblinear          <list[5]> <list[3]>  0.00000000
 8:       1 0.7736921 gblinear          <list[5]> <list[3]>  0.00000000
 9:       1 0.2954001 gblinear          <list[5]> <list[3]>  0.00000000
10:       4 0.2688034 gblinear          <list[5]> <list[3]>  0.00000000
11:       4 0.5551772 gblinear          <list[5]> <list[3]>  0.06172840
12:       4 0.2312309   gbtree          <list[5]> <list[3]>  0.72839506
13:       4 0.5432329 gblinear          <list[5]> <list[3]>  0.06172840
14:       4 0.4475701 gblinear          <list[5]> <list[3]>  0.03703704
15:       4 0.1302881 gblinear          <list[5]> <list[3]>  0.00000000
16:       4 0.5654654 gblinear          <list[5]> <list[3]>  0.07407407
17:       4 0.7736921 gblinear          <list[5]> <list[3]>  0.14814815
18:      16 0.2882662   gbtree          <list[5]> <list[3]>  0.71604938
19:      16 0.7275336 gblinear          <list[5]> <list[3]>  0.45679012
20:      16 0.2688034 gblinear          <list[5]> <list[3]>  0.20987654
1 variable not shown: [classif.fpr]
plot(classif.tpr ~ classif.fpr, instance$archive$data)
points(classif.tpr ~ classif.fpr, instance$result, col = "red")

4.5 Feature Selection

Often, data sets include a large number of features. The technique of extracting a subset of relevant features is called “feature selection”.

The objective of feature selection is to fit a model that depends only on a subset of the available features, chosen in the most suitable manner, i.e. to obtain a sparse model. Feature selection can enhance the interpretability of the model, speed up the learning process and improve the learner performance. Different approaches exist to identify the relevant features. Two approaches are emphasized in the literature: one is called filtering, the other is often referred to as feature subset selection or wrapper methods.

What are the differences (Guyon and Elisseeff 2003; Chandrashekar and Sahin 2014)?

  • Filtering: An external algorithm computes a rank of the features (e.g. based on the correlation to the response). Then, features are subsetted by a certain criterion, e.g. an absolute number or a percentage of the number of variables. The selected features are then used to fit a model (with optional hyperparameters selected by tuning). This calculation is usually cheaper than “feature subset selection” in terms of computation time. All filters are connected via the package mlr3filters. A short sketch of this workflow is shown after this list.

  • Wrapper Methods: Here, no ranking of features is done. Instead, an optimization algorithm selects a subset of the features, evaluates the set by calculating the resampled predictive performance, and then proposes a new set of features (or terminates). A simple example is sequential forward selection. This method is usually computationally very intensive, as a lot of models are fitted. Also, strictly speaking, all these models would need to be tuned before the performance is estimated, which would require an additional nested level in a CV setting. After undertaking all of these steps, a final model is fitted on the selected feature set (with optional hyperparameters selected by tuning). Wrapper methods are implemented in the mlr3fselect package.

  • Embedded Methods: Many learners internally select a subset of the features which they find helpful for prediction. These subsets can usually be queried, as the following example demonstrates:

    library("mlr3verse")
    
    task = tsk("iris")
    learner = lrn("classif.rpart")
    
    # ensure that the learner selects features
    stopifnot("selected_features" %in% learner$properties)
    
    # fit a simple classification tree
    learner = learner$train(task)
    
    # extract all features used in the classification tree:
    learner$selected_features()
    [1] "Petal.Length" "Petal.Width" 

There are also ensemble filters built upon the idea of stacking single filter methods. These are not yet implemented.

4.5.1 Filters

Filter methods assign an importance value to each feature. Based on these values, the features can be ranked and a feature subset can be selected. A list of all implemented filter methods can be found in the Appendix.

4.5.1.1 Calculating filter values

Currently, only classification and regression tasks are supported.

The first step is to create a new R object using the class of the desired filter method. Similar to other instances in mlr3, these are registered in a dictionary (mlr_filters) with an associated shortcut function flt(). Each object of class Filter has a .$calculate() method which computes the filter values and ranks them in descending order.

filter = flt("jmim")

task = tsk("iris")
filter$calculate(task)

as.data.table(filter)
        feature     score
1:  Petal.Width 1.0000000
2: Sepal.Length 0.6666667
3: Petal.Length 0.3333333
4:  Sepal.Width 0.0000000

Some filters support changing specific hyperparameters. This is similar to setting hyperparameters of a Learner using .$param_set$values:

filter_cor = flt("correlation")
filter_cor$param_set
<ParamSet>
       id    class lower upper nlevels    default value
1:    use ParamFct    NA    NA       5 everything      
2: method ParamFct    NA    NA       3    pearson      
# change parameter 'method'
filter_cor$param_set$values = list(method = "spearman")
filter_cor$param_set
<ParamSet>
       id    class lower upper nlevels    default    value
1:    use ParamFct    NA    NA       5 everything         
2: method ParamFct    NA    NA       3    pearson spearman

4.5.1.2 Variable Importance Filters

All Learners with the property “importance” come with integrated feature selection methods.

You can find a list of all learners with this property in the Appendix.
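
As a rough sketch (not part of the original text; the exact column layout of the dictionary table may differ between mlr3 versions), such learners can also be queried programmatically from the dictionary mlr_learners:

# sketch: list the keys of all learners that report a feature importance
tab = as.data.table(mlr_learners)
tab[sapply(properties, function(p) "importance" %in% p), key]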

For some learners, the desired filter method needs to be set during learner creation. For example, the learner classif.ranger comes with multiple integrated methods, cf. the help page of ranger::ranger(). To use the method “impurity”, you need to set the filter method during construction.

lrn = lrn("classif.ranger", importance = "impurity")

Now you can use the FilterImportance filter class for algorithm-embedded methods:

task = tsk("iris")
filter = flt("importance", learner = lrn)
filter$calculate(task)
head(as.data.table(filter), 3)
        feature     score
1:  Petal.Width 43.831226
2: Petal.Length 43.663788
3: Sepal.Length  9.210399

4.5.2 Wrapper Methods

Wrapper feature selection is supported via the mlr3fselect extension package. At the heart of mlr3fselect are the R6 classes:

  • FSelectInstanceSingleCrit, FSelectInstanceMultiCrit: These two classes describe the feature selection problem and store the results.
  • FSelector: This class is the base class for implementations of feature selection algorithms.

4.5.2.1 The FSelectInstance Classes

The following sub-section examines feature selection on the Pima Indian Diabetes data set, which is used to predict whether or not a patient has diabetes.

task = tsk("pima")
print(task)
<TaskClassif:pima> (768 x 9): Pima Indian Diabetes
* Target: diabetes
* Properties: twoclass
* Features (8):
  - dbl (8): age, glucose, insulin, mass, pedigree, pregnant, pressure,
    triceps

We use the classification tree from rpart.

learner = lrn("classif.rpart")

Next, we need to specify how to evaluate the performance of the feature subsets. For this, we need to choose a resampling strategy and a performance measure.

hout = rsmp("holdout")
measure = msr("classif.ce")

Finally, one has to choose the available budget for the feature selection. This is done by selecting one of the available Terminators.
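
As a quick sketch (not part of the original example), the available terminators can be listed from the dictionary mlr_terminators provided by the bbotk package:

# sketch: list all registered terminators
as.data.table(mlr_terminators)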

For this short introduction, we specify a budget of 20 evaluations and then put everything together into a FSelectInstanceSingleCrit:

evals20 = trm("evals", n_evals = 20)

instance = FSelectInstanceSingleCrit$new(
  task = task,
  learner = learner,
  resampling = hout,
  measure = measure,
  terminator = evals20
)
instance
<FSelectInstanceSingleCrit>
* State:  Not optimized
* Objective: <ObjectiveFSelect:classif.rpart_on_pima>
* Search Space:
         id    class lower upper nlevels
1:      age ParamLgl    NA    NA       2
2:  glucose ParamLgl    NA    NA       2
3:  insulin ParamLgl    NA    NA       2
4:     mass ParamLgl    NA    NA       2
5: pedigree ParamLgl    NA    NA       2
6: pregnant ParamLgl    NA    NA       2
7: pressure ParamLgl    NA    NA       2
8:  triceps ParamLgl    NA    NA       2
* Terminator: <TerminatorEvals>

To start the feature selection, we still need to select an algorithm. These algorithms are defined via the FSelector class.

4.5.2.2 The FSelector Class

The feature selection algorithms currently implemented in mlr3fselect are registered in the dictionary mlr_fselectors.
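
As a quick sketch (not part of the original example), the dictionary can be converted into a table to get an overview of the available algorithms:

# sketch: list all registered feature selection algorithms
as.data.table(mlr_fselectors)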

In this example, we will use a simple random search and retrieve it from the dictionary mlr_fselectors with the fs() function:

fselector = fs("random_search")

4.5.2.3 Triggering the Feature Selection

To start the feature selection, we simply pass the FSelectInstanceSingleCrit to the $optimize() method of the initialized FSelector. The algorithm proceeds as follows:

  1. The FSelector proposes at least one feature subset; it may propose multiple subsets to improve parallelization, which can be controlled via the setting batch_size.
  2. For each feature subset, the given Learner is fitted on the Task using the provided Resampling. All evaluations are stored in the archive of the FSelectInstanceSingleCrit.
  3. The Terminator is queried whether the budget is exhausted. If the budget is not exhausted, the process restarts at 1) until it is.
  4. Determine the feature subset with the best observed performance.
  5. Store the best feature subset as the result in the instance object. The best feature subset ($result_feature_set) and the corresponding measured performance ($result_y) can be accessed from the instance.
# reduce logging output
lgr::get_logger("bbotk")$set_threshold("warn")

fselector$optimize(instance)
    age glucose insulin mass pedigree pregnant pressure triceps
1: TRUE    TRUE    TRUE TRUE    FALSE    FALSE     TRUE   FALSE
2 variables not shown: [features, classif.ce]
instance$result_feature_set
[1] "age"      "glucose"  "insulin"  "mass"     "pressure"
instance$result_y
classif.ce 
 0.1953125 

One can investigate all resamplings which were undertaken, as they are stored in the archive of the FSelectInstanceSingleCrit and can be accessed by using as.data.table():

as.data.table(instance$archive)
      age glucose insulin  mass pedigree pregnant pressure triceps classif.ce
 1: FALSE   FALSE   FALSE FALSE    FALSE     TRUE     TRUE    TRUE  0.3281250
 2: FALSE   FALSE    TRUE  TRUE    FALSE    FALSE     TRUE    TRUE  0.3242188
 3:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE    FALSE   FALSE  0.2187500
 4: FALSE   FALSE    TRUE FALSE    FALSE    FALSE    FALSE    TRUE  0.3710938
 5: FALSE    TRUE    TRUE  TRUE    FALSE     TRUE    FALSE   FALSE  0.2226562
 6:  TRUE   FALSE   FALSE FALSE    FALSE    FALSE    FALSE    TRUE  0.3085938
 7:  TRUE   FALSE    TRUE  TRUE    FALSE     TRUE     TRUE   FALSE  0.2578125
 8:  TRUE   FALSE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE  0.2851562
 9: FALSE   FALSE   FALSE FALSE     TRUE    FALSE    FALSE   FALSE  0.3671875
10:  TRUE    TRUE    TRUE FALSE     TRUE    FALSE    FALSE    TRUE  0.2851562
11:  TRUE    TRUE   FALSE FALSE     TRUE     TRUE     TRUE    TRUE  0.2500000
12:  TRUE    TRUE   FALSE FALSE     TRUE    FALSE    FALSE   FALSE  0.2539062
13:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE    FALSE    TRUE  0.2343750
14:  TRUE   FALSE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE  0.2851562
15:  TRUE   FALSE    TRUE FALSE    FALSE     TRUE    FALSE    TRUE  0.2851562
16:  TRUE   FALSE   FALSE FALSE    FALSE    FALSE     TRUE    TRUE  0.3710938
17:  TRUE    TRUE   FALSE FALSE    FALSE     TRUE    FALSE    TRUE  0.2382812
18:  TRUE    TRUE    TRUE  TRUE    FALSE    FALSE     TRUE   FALSE  0.1953125
19: FALSE   FALSE   FALSE FALSE    FALSE    FALSE    FALSE    TRUE  0.3750000
20:  TRUE    TRUE    TRUE FALSE     TRUE     TRUE     TRUE    TRUE  0.2460938
4 variables not shown: [runtime_learners, timestamp, batch_nr, resample_result]

The associated resampling iterations can be accessed in the BenchmarkResult:

instance$archive$benchmark_result$data
NULL

The uhash column links the resampling iterations to the evaluated feature subsets stored in instance$archive$data. This allows, for example, scoring the included ResampleResults on a different measure.
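
As a sketch (not from the original output, and assuming the benchmark result was actually stored in the archive), the archived resample results could be re-scored on the accuracy measure like this:

# sketch: re-score the archived resample results on a different measure
instance$archive$benchmark_result$score(msr("classif.acc"))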

Now the optimized feature subset can be used to subset the task and fit the model on all observations.

task$select(instance$result_feature_set)
learner$train(task)

The trained model can now be used to make predictions on external data. Note that predicting on observations present in the Task should be avoided: the model has already seen these observations during feature selection, so the results would be statistically biased and the resulting performance estimate over-optimistic. Instead, to get statistically unbiased performance estimates for the current task, nested resampling is required.
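
As a minimal sketch (not part of the original example; new_data is a hypothetical data.frame with the same feature columns as the task), prediction on external data could look like this:

# sketch: predict on previously unseen observations
prediction = learner$predict_newdata(new_data)
prediction$response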

4.5.2.4 Feature Selection with Multiple Performance Measures

When performing feature selection, you might want to use multiple criteria to evaluate the performance of the feature subsets. For example, you might want the subset with the lowest classification error and the lowest time to train the model. The full list of performance measures can be found here.

We will expand the previous example and perform feature selection on the same dataset, Pima Indian Diabetes. This time, however, we will use FSelectInstanceMultiCrit to select the subset of features that has the lowest classification error and the lowest time to train the model.

The feature selection process with multiple criteria is very similar to the one with a single criterion.

measures = msrs(c("classif.ce", "time_train"))

Instead of creating a new FSelectInstanceSingleCrit with a single measure, we create a new FSelectInstanceMultiCrit with the two measures we are interested in here. Otherwise, it is the same as above.

library("mlr3filters")

evals20 = trm("evals", n_evals = 20)
instance = FSelectInstanceMultiCrit$new(
  task = task,
  learner = learner,
  resampling = hout,
  measures = measures,
  terminator = evals20
)
instance
<FSelectInstanceMultiCrit>
* State:  Not optimized
* Objective: <ObjectiveFSelect:classif.rpart_on_pima>
* Search Space:
         id    class lower upper nlevels
1:      age ParamLgl    NA    NA       2
2:  glucose ParamLgl    NA    NA       2
3:  insulin ParamLgl    NA    NA       2
4:     mass ParamLgl    NA    NA       2
5: pressure ParamLgl    NA    NA       2
* Terminator: <TerminatorEvals>

After triggering the feature selection, the instance holds the Pareto-optimal feature subsets with respect to classification error and training time.

# reduce logging output
lgr::get_logger("bbotk")$set_threshold("warn")

fselector$optimize(instance)
    age glucose insulin  mass pressure                          features
1: TRUE    TRUE    TRUE  TRUE     TRUE age,glucose,insulin,mass,pressure
2: TRUE    TRUE    TRUE FALSE     TRUE      age,glucose,insulin,pressure
2 variables not shown: [classif.ce, time_train]
instance$result_feature_set
[[1]]
[1] "age"      "glucose"  "insulin"  "mass"     "pressure"

[[2]]
[1] "age"      "glucose"  "insulin"  "pressure"
instance$result_y
   classif.ce time_train
1:  0.2500000      0.042
2:  0.2695312      0.041

4.5.2.5 Automating the Feature Selection

The AutoFSelector wraps a learner and augments it with automatic feature selection for a given task. Because the AutoFSelector itself inherits from the Learner base class, it can be used like any other learner. Analogously to the previous subsection, a new classification tree learner is created. This learner automatically performs feature selection on the given task using an inner resampling (holdout). We create a terminator that allows 10 evaluations and use a simple random search as the feature selection algorithm:

learner = lrn("classif.rpart")
terminator = trm("evals", n_evals = 10)
fselector = fs("random_search")

at = AutoFSelector$new(
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = terminator,
  fselector = fselector
)
at
<AutoFSelector:classif.rpart.fselector>
* Model: -
* Parameters: xval=0
* Packages: mlr3, mlr3fselect, rpart
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, multiclass, selected_features,
  twoclass, weights

We can now use the learner like any other learner, calling the $train() and $predict() methods. This time, however, we pass it to benchmark() to compare the optimized feature subset to the complete feature set. This way, the AutoFSelector performs its resampling for feature selection on the training set of the respective split of the outer resampling. The learner then makes predictions using the test set of the outer resampling. This yields unbiased performance measures, as the observations in the test set have not been used during feature selection or fitting of the respective learner. This is called nested resampling.

To compare the optimized feature subset with the complete feature set, we can use benchmark():

grid = benchmark_grid(
  task = tsk("pima"),
  learner = list(at, lrn("classif.rpart")),
  resampling = rsmp("cv", folds = 3)
)

bmr = benchmark(grid, store_models = TRUE)
bmr$aggregate(msrs(c("classif.ce", "time_train")))
   nr      resample_result task_id              learner_id resampling_id iters
1:  1 <ResampleResult[21]>    pima classif.rpart.fselector            cv     3
2:  2 <ResampleResult[21]>    pima           classif.rpart            cv     3
2 variables not shown: [classif.ce, time_train]

Note that we do not expect any significant differences since we only evaluated a small fraction of the possible feature subsets.
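
Since the models were stored (store_models = TRUE), the feature sets selected within each outer fold can also be inspected. As a sketch (assuming a version of mlr3fselect that provides extract_inner_fselect_results()):

# sketch: inspect the feature subsets chosen within each outer resampling fold
extract_inner_fselect_results(bmr)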