13  Feature Selection / Filtering

Often, data sets include a large number of features. The technique of extracting a subset of relevant features is called “feature selection”.

The objective of feature selection is to identify the most suitable subset of available features on which to fit the model. Feature selection can enhance the interpretability of the model, speed up the learning process and improve predictive performance. Different approaches exist to identify the relevant features. Two approaches are emphasized in the literature: one is called filtering and the other is often referred to as feature subset selection or wrapper methods.

What are the differences (Guyon and Elisseeff 2003; Chandrashekar and Sahin 2014)? Filter methods score each feature independently of any learner, which makes them cheap to compute, whereas wrapper methods repeatedly evaluate candidate feature subsets by resampling a learner on them; this is computationally more expensive but accounts for interactions between the features and the learner.

There are also ensemble filters built upon the idea of stacking single filter methods. These are not yet implemented.

13.1 Filters

Filter methods assign an importance value to each feature. Based on these values, the features can be ranked. Thereafter, we are able to select a feature subset. There is a list of all implemented filter methods in the Appendix.

13.2 Calculating filter values

Currently, only classification and regression tasks are supported.

The first step is to create a new R object using the class of the desired filter method. Similar to other instances in mlr3, these are registered in a dictionary (mlr_filters) with an associated shortcut function flt(). Each object of class Filter has a .$calculate() method which computes the filter values and ranks them in descending order.

library("mlr3")
library("mlr3filters")

filter = flt("jmim")

task = tsk("iris")
filter$calculate(task)

as.data.table(filter)
        feature     score
1:  Petal.Width 1.0000000
2: Sepal.Length 0.6666667
3: Petal.Length 0.3333333
4:  Sepal.Width 0.0000000
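
Based on this ranking, a subset can be selected, e.g. the two highest-scoring features. A minimal sketch, assuming the filter's $scores field holds the named vector of (decreasingly sorted) filter values:

# keep the two highest-scoring features on a copy of the task
top2 = head(names(filter$scores), 2)
task_top = task$clone()
task_top$select(top2)
task_top$feature_names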

Some filters support changing specific hyperparameters. This is similar to setting hyperparameters of a Learner using .$param_set$values:

filter_cor = flt("correlation")
filter_cor$param_set
<ParamSet>
       id    class lower upper nlevels    default value
1:    use ParamFct    NA    NA       5 everything      
2: method ParamFct    NA    NA       3    pearson      
# change parameter 'method'
filter_cor$param_set$values = list(method = "spearman")
filter_cor$param_set
<ParamSet>
       id    class lower upper nlevels    default    value
1:    use ParamFct    NA    NA       5 everything         
2: method ParamFct    NA    NA       3    pearson spearman
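
As a shorthand, hyperparameters can usually also be passed directly to flt() at construction (a hedged sketch, assuming the sugar function forwards them to the parameter set):

# equivalent to setting 'method' via $param_set$values
filter_cor = flt("correlation", method = "spearman")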

13.3 Variable Importance Filters

All Learners with the property “importance” come with integrated feature selection methods.

You can find a list of all learners with this property in the Appendix.

For some learners, the desired filter method needs to be set during learner creation. For example, the learner classif.ranger comes with multiple integrated methods, cf. the help page of ranger::ranger(). To use the method “impurity”, you need to set it during construction:

library("mlr3learners")  # provides the classif.ranger learner

learner = lrn("classif.ranger", importance = "impurity")

Now you can use the FilterImportance filter class for algorithm-embedded methods:

task = tsk("iris")
filter = flt("importance", learner = learner)
filter$calculate(task)
head(as.data.table(filter), 3)
        feature     score
1:  Petal.Width 43.954917
2: Petal.Length 42.988042
3: Sepal.Length  9.788235
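
The importance values can also be queried directly from a trained learner; a minimal sketch using the learner created above:

# learners with the "importance" property expose the values via $importance()
learner$train(task)
learner$importance()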

13.4 Wrapper Methods

Wrapper feature selection is supported via the mlr3fselect extension package. At the heart of mlr3fselect are the R6 classes:

  • FSelectInstanceSingleCrit, FSelectInstanceMultiCrit: These two classes describe the feature selection problem and store the results.
  • FSelector: This class is the base class for implementations of feature selection algorithms.

13.5 The FSelectInstance Classes

The following sub-section examines feature selection on the Pima data set, which is used to predict whether or not a patient has diabetes.

task = tsk("pima")
print(task)
<TaskClassif:pima> (768 x 9): Pima Indian Diabetes
* Target: diabetes
* Properties: twoclass
* Features (8):
  - dbl (8): age, glucose, insulin, mass, pedigree, pregnant, pressure,
    triceps

We use the classification tree from rpart.

learner = lrn("classif.rpart")

Next, we need to specify how to evaluate the performance of the feature subsets. For this, we need to choose a resampling strategy and a performance measure.

hout = rsmp("holdout")
measure = msr("classif.ce")

Finally, one has to choose the available budget for the feature selection. This is done by selecting one of the available Terminators:

  • Terminate after a given time (TerminatorClockTime)
  • Terminate after a given number of iterations (TerminatorEvals)
  • Terminate after a specific performance has been reached (TerminatorPerfReached)
  • Terminate when feature selection does not improve (TerminatorStagnation)
  • A combination of the above in an ALL or ANY fashion (TerminatorCombo)
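
For example, terminators can be constructed via trm(); a hedged sketch, assuming the usual bbotk parameter names:

# stop after 60 seconds of runtime
trm("run_time", secs = 60)

# stop once a classification error of 0.2 or better is reached
trm("perf_reached", level = 0.2)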

For this short introduction, we specify a budget of 20 evaluations and then put everything together into a FSelectInstanceSingleCrit:

library("mlr3fselect")

evals20 = trm("evals", n_evals = 20)

instance = FSelectInstanceSingleCrit$new(
  task = task,
  learner = learner,
  resampling = hout,
  measure = measure,
  terminator = evals20
)
instance
<FSelectInstanceSingleCrit>
* State:  Not optimized
* Objective: <ObjectiveFSelect:classif.rpart_on_pima>
* Search Space:
         id    class lower upper nlevels
1:      age ParamLgl    NA    NA       2
2:  glucose ParamLgl    NA    NA       2
3:  insulin ParamLgl    NA    NA       2
4:     mass ParamLgl    NA    NA       2
5: pedigree ParamLgl    NA    NA       2
6: pregnant ParamLgl    NA    NA       2
7: pressure ParamLgl    NA    NA       2
8:  triceps ParamLgl    NA    NA       2
* Terminator: <TerminatorEvals>

To start the feature selection, we still need to select an algorithm. These algorithms are defined via the FSelector class.

13.6 The FSelector Class

The following algorithms are currently implemented in mlr3fselect:

  • Random Search (FSelectorRandomSearch)
  • Exhaustive Search (FSelectorExhaustiveSearch)
  • Sequential Search (FSelectorSequential)
  • Recursive Feature Elimination (FSelectorRFE)
  • Design Points (FSelectorDesignPoints)

In this example, we will use a simple random search and retrieve it from the dictionary mlr_fselectors with the fs() function:

fselector = fs("random_search")
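
The dictionary can be listed to inspect all available algorithms:

# show all feature selection algorithms registered in mlr_fselectors
as.data.table(mlr_fselectors)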

13.7 Triggering the Feature Selection

To start the feature selection, we simply pass the FSelectInstanceSingleCrit to the $optimize() method of the initialized FSelector. The algorithm proceeds as follows:

  1. The FSelector proposes at least one feature subset and may propose multiple subsets to improve parallelization, which can be controlled via the setting batch_size (see the sketch after this list).
  2. For each feature subset, the given Learner is fitted on the Task using the provided Resampling. All evaluations are stored in the archive of the FSelectInstanceSingleCrit.
  3. The Terminator is queried if the budget is exhausted. If the budget is not exhausted, restart with 1) until it is.
  4. Determine the feature subset with the best observed performance.
  5. Store the best feature subset as the result in the instance object. The best feature subset ($result_feature_set) and the corresponding measured performance ($result_y) can be accessed from the instance.
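
A hedged sketch of setting the batch size at construction (illustrative only; the run below keeps the default):

# propose 5 feature subsets per batch
fselector_batched = fs("random_search", batch_size = 5)
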
# reduce logging output
lgr::get_logger("bbotk")$set_threshold("warn")

fselector$optimize(instance)
INFO  [21:34:34.649] [mlr3] Running benchmark with 1 resampling iterations 
INFO  [21:34:34.757] [mlr3] Applying learner 'select.classif.rpart' on task 'pima' (iter 1/1) 
INFO  [21:34:34.867] [mlr3] Finished benchmark 
... (analogous log messages for the remaining 19 evaluations omitted)
     age glucose insulin mass pedigree pregnant pressure triceps
1: FALSE    TRUE    TRUE TRUE     TRUE     TRUE     TRUE    TRUE
                                              features classif.ce
1: glucose,insulin,mass,pedigree,pregnant,pressure,...   0.203125
instance$result_feature_set
[1] "glucose"  "insulin"  "mass"     "pedigree" "pregnant" "pressure" "triceps" 
instance$result_y
classif.ce 
  0.203125 

One can investigate all resamplings which were undertaken, as they are stored in the archive of the FSelectInstanceSingleCrit and can be accessed by using as.data.table():

as.data.table(instance$archive)
      age glucose insulin  mass pedigree pregnant pressure triceps classif.ce
 1: FALSE   FALSE   FALSE  TRUE    FALSE    FALSE    FALSE   FALSE  0.2968750
 2: FALSE   FALSE   FALSE  TRUE    FALSE     TRUE     TRUE   FALSE  0.2851562
 3: FALSE    TRUE   FALSE FALSE     TRUE     TRUE    FALSE    TRUE  0.2304688
 4:  TRUE   FALSE   FALSE FALSE     TRUE    FALSE    FALSE   FALSE  0.3359375
 5:  TRUE   FALSE    TRUE  TRUE     TRUE    FALSE    FALSE    TRUE  0.3046875
 6:  TRUE   FALSE    TRUE  TRUE    FALSE     TRUE     TRUE   FALSE  0.2500000
 7: FALSE   FALSE    TRUE  TRUE     TRUE    FALSE    FALSE    TRUE  0.3046875
 8: FALSE   FALSE   FALSE  TRUE     TRUE     TRUE    FALSE   FALSE  0.2617188
 9:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE   FALSE  0.2460938
10: FALSE   FALSE    TRUE FALSE    FALSE     TRUE    FALSE   FALSE  0.3242188
11: FALSE   FALSE   FALSE FALSE    FALSE    FALSE     TRUE   FALSE  0.3515625
12:  TRUE    TRUE    TRUE FALSE     TRUE     TRUE     TRUE    TRUE  0.2578125
13: FALSE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE  0.2031250
14: FALSE   FALSE   FALSE FALSE     TRUE    FALSE    FALSE   FALSE  0.3125000
15:  TRUE    TRUE   FALSE  TRUE    FALSE    FALSE    FALSE   FALSE  0.2187500
16: FALSE   FALSE    TRUE FALSE     TRUE     TRUE     TRUE   FALSE  0.3007812
17:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE  0.2460938
18: FALSE   FALSE    TRUE  TRUE    FALSE    FALSE     TRUE   FALSE  0.3828125
19:  TRUE   FALSE    TRUE FALSE     TRUE    FALSE    FALSE    TRUE  0.2734375
20: FALSE   FALSE   FALSE FALSE    FALSE    FALSE     TRUE    TRUE  0.3750000
    runtime_learners           timestamp batch_nr      resample_result
 1:            0.089 2022-06-29 21:34:34        1 <ResampleResult[22]>
 2:            0.079 2022-06-29 21:34:35        2 <ResampleResult[22]>
 3:            0.079 2022-06-29 21:34:35        3 <ResampleResult[22]>
 4:            0.077 2022-06-29 21:34:35        4 <ResampleResult[22]>
 5:            0.087 2022-06-29 21:34:35        5 <ResampleResult[22]>
 6:            0.098 2022-06-29 21:34:36        6 <ResampleResult[22]>
 7:            0.091 2022-06-29 21:34:36        7 <ResampleResult[22]>
 8:            0.098 2022-06-29 21:34:36        8 <ResampleResult[22]>
 9:            0.087 2022-06-29 21:34:36        9 <ResampleResult[22]>
10:            0.088 2022-06-29 21:34:37       10 <ResampleResult[22]>
11:            0.075 2022-06-29 21:34:37       11 <ResampleResult[22]>
12:            0.086 2022-06-29 21:34:37       12 <ResampleResult[22]>
13:            0.082 2022-06-29 21:34:37       13 <ResampleResult[22]>
14:            0.160 2022-06-29 21:34:38       14 <ResampleResult[22]>
15:            0.085 2022-06-29 21:34:38       15 <ResampleResult[22]>
16:            0.090 2022-06-29 21:34:38       16 <ResampleResult[22]>
17:            0.092 2022-06-29 21:34:38       17 <ResampleResult[22]>
18:            0.083 2022-06-29 21:34:39       18 <ResampleResult[22]>
19:            0.078 2022-06-29 21:34:39       19 <ResampleResult[22]>
20:            0.076 2022-06-29 21:34:39       20 <ResampleResult[22]>

The associated resampling iterations can be accessed in the BenchmarkResult:

instance$archive$benchmark_result$data
Warning: '.__BenchmarkResult__data' is deprecated.
Use 'as.data.table(benchmark_result)' instead.
See help("Deprecated")
<ResultData>
  Public:
    as_data_table: function (view = NULL, reassemble_learners = TRUE, convert_predictions = TRUE, 
    clone: function (deep = FALSE) 
    combine: function (rdata) 
    data: list
    discard: function (backends = FALSE, models = FALSE) 
    initialize: function (data = NULL, store_backends = TRUE) 
    iterations: function (view = NULL) 
    learner_states: function (view = NULL) 
    learners: function (view = NULL, states = TRUE, reassemble = TRUE) 
    logs: function (view = NULL, condition) 
    prediction: function (view = NULL, predict_sets = "test") 
    predictions: function (view = NULL, predict_sets = "test") 
    resamplings: function (view = NULL) 
    sweep: function () 
    task_type: active binding
    tasks: function (view = NULL) 
    uhashes: function (view = NULL) 
  Private:
    deep_clone: function (name, value) 
    get_view_index: function (view) 

The uhash column links the resampling iterations to the evaluated feature subsets stored in instance$archive$data(). This allows one, for example, to score the included ResampleResults on a different measure.
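
A hedged sketch of re-scoring one of the stored ResampleResults with a different measure:

# take the first stored ResampleResult from the archive and aggregate it
# with an additional measure
rr = as.data.table(instance$archive)$resample_result[[1]]
rr$aggregate(msr("classif.acc"))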

Now the optimized feature subset can be used to subset the task and fit the model on all observations.

task$select(instance$result_feature_set)
learner$train(task)

The trained model can now be used to make predictions on external data. Note that predicting on observations present in the Task should be avoided. The model has already seen these observations during feature selection, so the results would be statistically biased. Hence, the resulting performance measure would be over-optimistic. Instead, to get statistically unbiased performance estimates for the current task, nested resampling is required.

13.8 Feature Selection with Multiple Performance Measures

When performing feature selection, you might want to use multiple criteria to evaluate the performance of the feature subsets. For example, you might want the subset with the lowest classification error and the lowest training time. The full list of performance measures can be found in the Appendix.

We will expand the previous example and perform feature selection on the same data set, Pima Indian Diabetes. This time, however, we will use FSelectInstanceMultiCrit to select the subset of features that has both the lowest classification error and the lowest training time.

The selection process with multiple criteria is very similar to that with a single criterion.

measures = msrs(c("classif.ce", "time_train"))

Instead of creating a new FSelectInstanceSingleCrit with a single measure, we create a new FSelectInstanceMultiCrit with the two measures we are interested in here. Otherwise, it is the same as above.

library("mlr3filters")

evals20 = trm("evals", n_evals = 20)
instance = FSelectInstanceMultiCrit$new(
task = task,
learner = learner,
resampling = hout,
measures = measures,
terminator = evals20
)
instance
<FSelectInstanceMultiCrit>
* State:  Not optimized
* Objective: <ObjectiveFSelect:classif.rpart_on_pima>
* Search Space:
         id    class lower upper nlevels
1:  glucose ParamLgl    NA    NA       2
2:  insulin ParamLgl    NA    NA       2
3:     mass ParamLgl    NA    NA       2
4: pedigree ParamLgl    NA    NA       2
5: pregnant ParamLgl    NA    NA       2
6: pressure ParamLgl    NA    NA       2
7:  triceps ParamLgl    NA    NA       2
* Terminator: <TerminatorEvals>

After triggering the feature selection, we obtain the subset of features with the best classification error and training time.

# reduce logging output
lgr::get_logger("bbotk")$set_threshold("warn")

fselector$optimize(instance)
INFO  [21:34:40.222] [mlr3] Running benchmark with 1 resampling iterations 
INFO  [21:34:40.229] [mlr3] Applying learner 'select.classif.rpart' on task 'pima' (iter 1/1) 
INFO  [21:34:40.317] [mlr3] Finished benchmark 
... (analogous log messages for the remaining 19 evaluations omitted)
   glucose insulin mass pedigree pregnant pressure triceps
1:    TRUE   FALSE TRUE     TRUE     TRUE    FALSE    TRUE
                                 features classif.ce time_train
1: glucose,mass,pedigree,pregnant,triceps  0.2382812          0
instance$result_feature_set
[[1]]
[1] "glucose"  "mass"     "pedigree" "pregnant" "triceps" 
instance$result_y
   classif.ce time_train
1:  0.2382812          0
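
With multiple criteria there is in general no single best subset; a hedged sketch, assuming $result lists all non-dominated (Pareto-optimal) subsets found:

# all non-dominated feature subsets together with their scores
instance$result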

13.9 Automating the Feature Selection

The AutoFSelector wraps a learner and augments it with automatic feature selection for a given task. Because the AutoFSelector itself inherits from the Learner base class, it can be used like any other learner. Analogously to the previous subsection, a new classification tree learner is created. This learner automatically starts a feature selection on the given task, using an inner resampling (holdout), when it is trained. We create a terminator which allows 10 evaluations and use a simple random search as the feature selection algorithm:

learner = lrn("classif.rpart")
terminator = trm("evals", n_evals = 10)
fselector = fs("random_search")

at = AutoFSelector$new(
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = terminator,
  fselector = fselector
)
at
<AutoFSelector:classif.rpart.fselector>
* Model: -
* Parameters: xval=0
* Packages: mlr3, mlr3fselect, rpart
* Predict Type: response
* Feature types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, multiclass, selected_features,
  twoclass, weights

We can now use the learner like any other learner, calling the $train() and $predict() method. This time however, we pass it to benchmark() to compare the optimized feature subset to the complete feature set. This way, the AutoFSelector will do its resampling for feature selection on the training set of the respective split of the outer resampling. The learner then undertakes predictions using the test set of the outer resampling. This yields unbiased performance measures, as the observations in the test set have not been used during feature selection or fitting of the respective learner. This is called nested resampling.
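
For illustration, a hedged sketch of the stand-alone usage (the row split is arbitrary and only for demonstration):

# inner feature selection runs on the provided training rows
at$train(tsk("pima"), row_ids = 1:600)

# the selected feature subset and its inner resampling performance
at$fselect_result

# predict on the held-out rows
prediction = at$predict(tsk("pima"), row_ids = 601:768)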

To compare the optimized feature subset with the complete feature set, we can use benchmark():

grid = benchmark_grid(
  task = tsk("pima"),
  learner = list(at, lrn("classif.rpart")),
  resampling = rsmp("cv", folds = 3)
)

bmr = benchmark(grid, store_models = TRUE)
INFO  [21:34:45.434] [mlr3] Running benchmark with 6 resampling iterations 
INFO  [21:34:45.441] [mlr3] Applying learner 'classif.rpart.fselector' on task 'pima' (iter 3/3) 
... (log messages of the inner feature selections and the remaining outer iterations omitted)
INFO  [21:34:52.965] [mlr3] Finished benchmark 
bmr$aggregate(msrs(c("classif.ce", "time_train")))
   nr      resample_result task_id              learner_id resampling_id iters
1:  1 <ResampleResult[22]>    pima classif.rpart.fselector            cv     3
2:  2 <ResampleResult[22]>    pima           classif.rpart            cv     3
   classif.ce time_train
1:  0.2552083          0
2:  0.2539062          0

Note that we do not expect any significant differences since we only evaluated a small fraction of the possible feature subsets.
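
The feature subsets selected within each outer fold can also be inspected; a hedged sketch using the extract_inner_fselect_results() helper from mlr3fselect (requires store_models = TRUE, as set above):

# feature subsets chosen in each outer resampling iteration
extract_inner_fselect_results(bmr)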