3.5 Feature Selection / Filtering

Often, data sets include a large number of features. The technique of extracting a subset of relevant features is called “feature selection.”

The objective of feature selection is to make the model depend on a subset of the available features in the most suitable, typically sparse, manner. Feature selection can enhance the interpretability of the model, speed up the learning process and improve the learner performance. Different approaches exist to identify the relevant features. Two approaches are emphasized in the literature: one is called filtering, the other is often referred to as feature subset selection or wrapper methods.

What are the differences (Guyon and Elisseeff 2003; Chandrashekar and Sahin 2014)?

  • Filtering: An external algorithm computes a rank of the features (e.g. based on the correlation to the response). Then, features are subsetted by a certain criterion, e.g. an absolute number or a percentage of the number of variables. The selected features are then used to fit a model (with optional hyperparameters selected by tuning). This calculation is usually cheaper than “feature subset selection” in terms of computation time. All filters are provided by the package mlr3filters.

  • Wrapper Methods: Here, no ranking of features is done. Instead, an optimization algorithm selects a subset of the features, evaluates the set by calculating the resampled predictive performance, and then proposes a new set of features (or terminates). A simple example is sequential forward selection. This method is usually computationally very intensive, as a lot of models are fitted. Also, strictly speaking, all these models would need to be tuned before the performance is estimated, which would require an additional nested level in a CV setting. After undertaking all of these steps, a final model is fitted on the selected feature set (with optional hyperparameters selected by tuning). Wrapper methods are implemented in the mlr3fselect package.

  • Embedded Methods: Many learners internally select a subset of the features which they find helpful for prediction. These subsets can usually be queried, as the following example demonstrates:

    library("mlr3verse")
    
    task = tsk("iris")
    learner = lrn("classif.rpart")
    
    # ensure that the learner selects features
    stopifnot("selected_features" %in% learner$properties)
    
    # fit a simple classification tree
    learner = learner$train(task)
    
    # extract all features used in the classification tree:
    learner$selected_features()
    ## [1] "Petal.Length" "Petal.Width"

There are also ensemble filters built upon the idea of stacking single filter methods. These are not yet implemented.
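
To illustrate the idea, the following sketch stacks two single filters by hand and averages their feature ranks. This is only an illustration of the concept; mlr3filters does not provide such an ensemble filter, and the choice of the "anova" and "variance" filters is arbitrary:

task = tsk("iris")

# hand-rolled "ensemble filter": average the per-filter feature ranks
f1 = flt("anova")
f2 = flt("variance")
f1$calculate(task)
f2$calculate(task)

ranks = data.table::data.table(
  feature = names(f1$scores),
  rank_anova = rank(-f1$scores),
  rank_variance = rank(-f2$scores[names(f1$scores)])
)
# lower mean rank = more important on average across both filters
ranks[, mean_rank := (rank_anova + rank_variance) / 2]
ranks[order(mean_rank)]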

3.5.1 Filters

Filter methods assign an importance value to each feature. Based on these values the features can be ranked. Thereafter, we are able to select a feature subset. There is a list of all implemented filter methods in the Appendix.
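
Besides consulting the Appendix, the implemented filters can also be listed from the dictionary mlr_filters (only filters of currently loaded packages are shown); this is just a convenience lookup:

# list the filters registered in the dictionary
head(as.data.table(mlr_filters))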

3.5.2 Calculating filter values

Currently, only classification and regression tasks are supported.

The first step is to create a new R object using the class of the desired filter method. Similar to other instances in mlr3, these are registered in a dictionary (mlr_filters) with an associated shortcut function flt(). Each object of class Filter has a .$calculate() method which computes the filter values and ranks them in a descending order.

filter = flt("jmim")

task = tsk("iris")
filter$calculate(task)

as.data.table(filter)
##         feature score
## 1:  Petal.Width   1.0
## 2: Sepal.Length   0.5
## 3: Petal.Length   0.0
## 4:  Sepal.Width    NA
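
Based on these scores we can subset the task, e.g. to the two highest-ranked features. The cutoff of two is an arbitrary choice for illustration:

# keep the two best features according to the filter (illustrative cutoff)
top2 = head(as.data.table(filter), 2)$feature
task_top = task$clone()
task_top$select(top2)
task_top$feature_names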

Some filters support changing specific hyperparameters. This is similar to setting hyperparameters of a Learner using .$param_set$values:

filter_cor = flt("correlation")
filter_cor$param_set
## <ParamSet>
##        id    class lower upper nlevels    default value
## 1:    use ParamFct    NA    NA       5 everything      
## 2: method ParamFct    NA    NA       3    pearson
# change parameter 'method'
filter_cor$param_set$values = list(method = "spearman")
filter_cor$param_set
## <ParamSet>
##        id    class lower upper nlevels    default    value
## 1:    use ParamFct    NA    NA       5 everything         
## 2: method ParamFct    NA    NA       3    pearson spearman
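
The configured filter is applied just like before. Note that the correlation filter operates on regression tasks; as an illustration we use the mtcars task here:

# the correlation filter requires a regression task, e.g. mtcars
task_mtcars = tsk("mtcars")
filter_cor$calculate(task_mtcars)
head(as.data.table(filter_cor), 3)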

3.5.3 Variable Importance Filters

All Learners with the property “importance” come with integrated feature selection methods.

You can find a list of all learners with this property in the Appendix.
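
Alternatively, the mlr_learners dictionary can be queried directly. The following sketch lists all currently loaded learners that declare the “importance” property:

# learners (of loaded packages) supporting the "importance" property
learners_tab = as.data.table(mlr_learners)
learners_tab[sapply(properties, function(p) "importance" %in% p), "key"]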

For some learners the desired filter method needs to be set during learner creation. For example, learner classif.ranger comes with multiple integrated methods, cf. the help page of ranger::ranger(). To use the method “impurity,” you need to set the filter method during construction:

lrn = lrn("classif.ranger", importance = "impurity")

Now you can use the FilterImportance filter class for algorithm-embedded methods:

task = tsk("iris")
filter = flt("importance", learner = lrn)
filter$calculate(task)
head(as.data.table(filter), 3)
##         feature  score
## 1:  Petal.Width 44.559
## 2: Petal.Length 43.208
## 3: Sepal.Length  9.381

3.5.4 Wrapper Methods

Wrapper feature selection is supported via the mlr3fselect extension package. At the heart of mlr3fselect are the R6 classes FSelectInstanceSingleCrit and FSelectInstanceMultiCrit, which describe the feature selection problem and store the result, and FSelector, the base class of the feature selection algorithms.

3.5.5 The FSelectInstance Classes

The following sub-section examines feature selection on the Pima data set, which is used to predict whether or not a patient has diabetes.

task = tsk("pima")
print(task)
## <TaskClassif:pima> (768 x 9)
## * Target: diabetes
## * Properties: twoclass
## * Features (8):
##   - dbl (8): age, glucose, insulin, mass, pedigree, pregnant, pressure,
##     triceps

We use the classification tree from rpart.

learner = lrn("classif.rpart")

Next, we need to specify how to evaluate the performance of the feature subsets. For this, we need to choose a resampling strategy and a performance measure.

hout = rsmp("holdout")
measure = msr("classif.ce")

Finally, one has to choose the available budget for the feature selection. This is done by selecting one of the available Terminators, e.g. a fixed number of evaluations, a maximum run time, or a required performance level.
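
For illustration, here is how two such terminators from the bbotk package could be constructed (these objects are not used below):

# stop after a fixed number of evaluations
trm("evals", n_evals = 20)

# stop after a given run time (in seconds)
trm("run_time", secs = 60)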

For this short introduction, we specify a budget of 20 evaluations and then put everything together into a FSelectInstanceSingleCrit:

evals20 = trm("evals", n_evals = 20)

instance = FSelectInstanceSingleCrit$new(
  task = task,
  learner = learner,
  resampling = hout,
  measure = measure,
  terminator = evals20
)
instance
## <FSelectInstanceSingleCrit>
## * State:  Not optimized
## * Objective: <ObjectiveFSelect:classif.rpart_on_pima>
## * Search Space:
## <ParamSet>
##          id    class lower upper nlevels        default value
## 1:      age ParamLgl    NA    NA       2 <NoDefault[3]>      
## 2:  glucose ParamLgl    NA    NA       2 <NoDefault[3]>      
## 3:  insulin ParamLgl    NA    NA       2 <NoDefault[3]>      
## 4:     mass ParamLgl    NA    NA       2 <NoDefault[3]>      
## 5: pedigree ParamLgl    NA    NA       2 <NoDefault[3]>      
## 6: pregnant ParamLgl    NA    NA       2 <NoDefault[3]>      
## 7: pressure ParamLgl    NA    NA       2 <NoDefault[3]>      
## 8:  triceps ParamLgl    NA    NA       2 <NoDefault[3]>      
## * Terminator: <TerminatorEvals>
## * Terminated: FALSE
## * Archive:
## <ArchiveFSelect>
## Null data.table (0 rows and 0 cols)

To start the feature selection, we still need to select an algorithm; these algorithms are defined via the FSelector class.

3.5.6 The FSelector Class

Several feature selection algorithms are currently implemented in mlr3fselect, e.g. random search, exhaustive search, sequential search and recursive feature elimination.
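
The available algorithms are registered in the dictionary mlr_fselectors and, analogous to the other mlr3 dictionaries, can be listed from it:

as.data.table(mlr_fselectors)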

In this example, we will use a simple random search and retrieve it from the dictionary mlr_fselectors with the fs() function:

fselector = fs("random_search")

3.5.7 Triggering the Feature Selection

To start the feature selection, we simply pass the FSelectInstanceSingleCrit to the $optimize() method of the initialized FSelector. The algorithm proceeds as follows:

  1. The FSelector proposes at least one feature subset and may propose multiple subsets to improve parallelization (this can be controlled via the setting batch_size).
  2. For each feature subset, the given Learner is fitted on the Task using the provided Resampling. All evaluations are stored in the archive of the FSelectInstanceSingleCrit.
  3. The Terminator is queried if the budget is exhausted. If the budget is not exhausted, restart with 1) until it is.
  4. Determine the feature subset with the best observed performance.
  5. Store the best feature subset as the result in the instance object. The best feature subset ($result_feature_set) and the corresponding measured performance ($result_y) can be accessed from the instance.

# reduce logging output
lgr::get_logger("bbotk")$set_threshold("warn")

fselector$optimize(instance)
##     age glucose insulin mass pedigree pregnant pressure triceps
## 1: TRUE    TRUE    TRUE TRUE     TRUE    FALSE     TRUE    TRUE
##                                          features classif.ce
## 1: age,glucose,insulin,mass,pedigree,pressure,...       0.25
instance$result_feature_set
## [1] "age"      "glucose"  "insulin"  "mass"     "pedigree" "pressure" "triceps"
instance$result_y
## classif.ce 
##       0.25

One can investigate all resamplings which were undertaken; they are stored in the archive of the FSelectInstanceSingleCrit and can be accessed via as.data.table():

as.data.table(instance$archive)
##       age glucose insulin  mass pedigree pregnant pressure triceps classif.ce
##  1:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE   FALSE     0.2734
##  2: FALSE   FALSE   FALSE FALSE     TRUE    FALSE    FALSE   FALSE     0.3711
##  3:  TRUE    TRUE    TRUE FALSE     TRUE     TRUE     TRUE    TRUE     0.2734
##  4: FALSE   FALSE   FALSE FALSE    FALSE     TRUE    FALSE   FALSE     0.3281
##  5:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE     0.2734
##  6:  TRUE    TRUE    TRUE  TRUE     TRUE    FALSE     TRUE    TRUE     0.2500
##  7:  TRUE   FALSE   FALSE FALSE     TRUE    FALSE    FALSE   FALSE     0.3672
##  8: FALSE    TRUE   FALSE FALSE     TRUE     TRUE    FALSE    TRUE     0.2617
##  9: FALSE   FALSE   FALSE  TRUE     TRUE    FALSE    FALSE    TRUE     0.3242
## 10:  TRUE   FALSE   FALSE FALSE    FALSE    FALSE    FALSE   FALSE     0.3398
## 11:  TRUE   FALSE   FALSE  TRUE     TRUE     TRUE    FALSE    TRUE     0.3203
## 12:  TRUE   FALSE   FALSE  TRUE    FALSE    FALSE     TRUE   FALSE     0.3164
## 13:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE     0.2734
## 14: FALSE   FALSE   FALSE FALSE    FALSE    FALSE     TRUE   FALSE     0.3516
## 15:  TRUE    TRUE    TRUE FALSE     TRUE     TRUE     TRUE    TRUE     0.2734
## 16:  TRUE   FALSE   FALSE  TRUE    FALSE    FALSE     TRUE    TRUE     0.3359
## 17:  TRUE    TRUE    TRUE FALSE     TRUE     TRUE    FALSE    TRUE     0.2695
## 18:  TRUE    TRUE    TRUE FALSE    FALSE     TRUE    FALSE    TRUE     0.3008
## 19:  TRUE   FALSE    TRUE FALSE     TRUE    FALSE    FALSE   FALSE     0.3633
## 20:  TRUE    TRUE    TRUE  TRUE     TRUE    FALSE     TRUE    TRUE     0.2500
##                                    uhash           timestamp batch_nr
##  1: e3bad0e4-8ec7-44a1-bad8-692320bad8f8 2021-06-27 08:36:47        1
##  2: 2396089c-d0fb-4490-a480-cb2d2321c3fc 2021-06-27 08:36:47        1
##  3: eb028bb1-a783-481c-ba3f-8b26924f7679 2021-06-27 08:36:47        1
##  4: e67be419-fd42-42dd-bea8-637d952ab002 2021-06-27 08:36:47        1
##  5: 8e8f4a57-5f1d-4e1b-8306-6dfb7b0218e1 2021-06-27 08:36:47        1
##  6: f01613bd-0196-4cf2-bb39-4488fd1fa2b1 2021-06-27 08:36:47        1
##  7: f2dba3f4-7634-40d1-9421-cb5d873e90fd 2021-06-27 08:36:47        1
##  8: 29cacbcd-a8ae-4c1d-9e3b-edfed5221819 2021-06-27 08:36:47        1
##  9: 871e353b-fa98-4af2-ab4a-766648bad088 2021-06-27 08:36:47        1
## 10: 1a1bb78c-ee49-4342-9fca-dc0825a93526 2021-06-27 08:36:47        1
## 11: 24179e65-79b7-4ba4-9d31-3b7c430d1e23 2021-06-27 08:36:49        2
## 12: d5cc313d-f7f4-45c4-96d6-fe42de9c2a75 2021-06-27 08:36:49        2
## 13: a4b3bd68-1dfa-4a55-9e22-8d37084bbec3 2021-06-27 08:36:49        2
## 14: 2f15bd3b-c642-4925-8925-5fc7f747cc9a 2021-06-27 08:36:49        2
## 15: d21c12ef-2a72-4f0a-bf6e-4cfa3a8cc2b5 2021-06-27 08:36:49        2
## 16: 62b2453c-1d27-463f-a37e-e7e892d96134 2021-06-27 08:36:49        2
## 17: 7e09c009-5429-496b-bd65-3142789ab2ad 2021-06-27 08:36:49        2
## 18: b88c14f6-c08c-4c20-a4fe-e9db47c81b39 2021-06-27 08:36:49        2
## 19: fbf42e27-0f5e-4a5f-b39b-1e2a4cb07c62 2021-06-27 08:36:49        2
## 20: 0f113642-702b-4398-910e-f2f19cd5ef52 2021-06-27 08:36:49        2

The associated resampling iterations can be accessed in the BenchmarkResult:

instance$archive$benchmark_result$data
## <ResultData>
##   Public:
##     as_data_table: function (view = NULL, reassemble_learners = TRUE, convert_predictions = TRUE, 
##     clone: function (deep = FALSE) 
##     combine: function (rdata) 
##     data: list
##     initialize: function (data = NULL, store_backends = TRUE) 
##     iterations: function (view = NULL) 
##     learners: function (view = NULL, states = TRUE, reassemble = TRUE) 
##     logs: function (view = NULL, condition) 
##     prediction: function (view = NULL, predict_sets = "test") 
##     predictions: function (view = NULL, predict_sets = "test") 
##     resamplings: function (view = NULL) 
##     sweep: function () 
##     task_type: active binding
##     tasks: function (view = NULL) 
##     uhashes: function (view = NULL) 
##   Private:
##     deep_clone: function (name, value) 
##     get_view_index: function (view)

The uhash column links the resampling iterations to the evaluated feature subsets stored in instance$archive$data(). This allows one, for example, to score the included ResampleResults on a different measure.
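
As a small sketch, the archived resampling iterations can be scored with the classification accuracy instead of the classification error:

# score the archived resampling iterations with a different measure
instance$archive$benchmark_result$score(msr("classif.acc"))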

Now the optimized feature subset can be used to subset the task and fit the model on all observations.

task$select(instance$result_feature_set)
learner$train(task)

The trained model can now be used to make predictions on external data. Note that predicting on observations present in the task should be avoided: the model has already seen these observations during feature selection, so the resulting performance estimates would be statistically biased and over-optimistic. Instead, to get statistically unbiased performance estimates for the current task, nested resampling is required.

3.5.8 Automating the Feature Selection

The AutoFSelector wraps a learner and augments it with automatic feature selection for a given task. Because the AutoFSelector itself inherits from the Learner base class, it can be used like any other learner. Analogously to the previous subsection, a new classification tree learner is created. This learner automatically starts a feature selection on the given task using an inner resampling (holdout). We create a terminator which allows 10 evaluations and use simple random search as the feature selection algorithm:

learner = lrn("classif.rpart")
terminator = trm("evals", n_evals = 10)
fselector = fs("random_search")

at = AutoFSelector$new(
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = terminator,
  fselector = fselector
)
at
## <AutoFSelector:classif.rpart.fselector>
## * Model: -
## * Parameters: xval=0
## * Packages: rpart
## * Predict Type: response
## * Feature types: logical, integer, numeric, factor, ordered
## * Properties: importance, missings, multiclass, selected_features,
##   twoclass, weights

We can now use the learner like any other learner, calling the $train() and $predict() methods. This time, however, we pass it to benchmark() to compare the optimized feature subset to the complete feature set. This way, the AutoFSelector performs its resampling for feature selection on the training set of the respective split of the outer resampling. The learner then predicts on the test set of the outer resampling. This yields unbiased performance measures, as the observations in the test set have not been used during feature selection or fitting of the respective learner. This is called nested resampling.

To compare the optimized feature subset with the complete feature set, we can use benchmark():

grid = benchmark_grid(
  task = tsk("pima"),
  learner = list(at, lrn("classif.rpart")),
  resampling = rsmp("cv", folds = 3)
)

bmr = benchmark(grid, store_models = TRUE)
bmr$aggregate(msrs(c("classif.ce", "time_train")))
##    nr      resample_result task_id              learner_id resampling_id iters
## 1:  1 <ResampleResult[21]>    pima classif.rpart.fselector            cv     3
## 2:  2 <ResampleResult[21]>    pima           classif.rpart            cv     3
##    classif.ce time_train
## 1:     0.2591          0
## 2:     0.2422          0

Note that we do not expect any significant differences since we only evaluated a small fraction of the possible feature subsets.
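
Since the models were stored (store_models = TRUE), the feature sets selected within each outer fold can also be inspected, e.g. with the helper extract_inner_fselect_results() assumed to be available in recent versions of mlr3fselect:

# feature subsets selected by the AutoFSelector on each outer training set
extract_inner_fselect_results(bmr)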

References

Chandrashekar, Girish, and Ferat Sahin. 2014. “A Survey on Feature Selection Methods.” Computers and Electrical Engineering 40 (1): 16–28. https://doi.org/10.1016/j.compeleceng.2013.11.024.

Guyon, Isabelle, and André Elisseeff. 2003. “An Introduction to Variable and Feature Selection.” Journal of Machine Learning Research 3 (Mar): 1157–82.