3.5 Feature Selection / Filtering

Often, data sets include a large number of features. The technique of extracting a subset of relevant features is called “feature selection.”

The objective of feature selection is to identify a subset of the available features on which the model can be fitted in the most suitable manner, i.e. to make the model depend only on a sparse set of relevant features. Feature selection can enhance the interpretability of the model, speed up the learning process and improve the learner's performance. Different approaches exist to identify the relevant features. Two approaches are emphasized in the literature: one is called filtering, the other is often referred to as feature subset selection or wrapper methods.

What are the differences (Guyon and Elisseeff 2003; Chandrashekar and Sahin 2014)?

  • Filtering: An external algorithm computes a ranking of the features (e.g. based on the correlation to the response). Then, features are subsetted by a certain criterion, e.g. an absolute number or a percentage of the number of variables. The selected features are then used to fit a model (with optional hyperparameters selected by tuning). This calculation is usually cheaper than “feature subset selection” in terms of computation time. All filters are provided via the package mlr3filters.

  • Wrapper Methods: Here, no ranking of features is done. Instead, an optimization algorithm selects a subset of the features, evaluates the subset by calculating the resampled predictive performance, and then proposes a new subset of features (or terminates). A simple example is sequential forward selection. This approach is usually computationally very intensive, as a lot of models are fitted. Also, strictly speaking, all these models would need to be tuned before the performance is estimated, which would require an additional nested level in a CV setting. After all of these steps have been undertaken, a model is fitted on the final set of selected features (with optional hyperparameters selected by tuning). Wrapper methods are implemented in the mlr3fselect package.

  • Embedded Methods: Many learners internally select a subset of the features which they find helpful for prediction. These subsets can usually be queried, as the following example demonstrates:

    library("mlr3verse")
    
    task = tsk("iris")
    learner = lrn("classif.rpart")
    
    # ensure that the learner selects features
    stopifnot("selected_features" %in% learner$properties)
    
    # fit a simple classification tree
    learner = learner$train(task)
    
    # extract all features used in the classification tree:
    learner$selected_features()
    ## [1] "Petal.Length" "Petal.Width"

There are also ensemble filters built upon the idea of stacking single filter methods. These are not yet implemented.

3.5.1 Filters

Filter methods assign an importance value to each feature. Based on these values the features can be ranked. Thereafter, we are able to select a feature subset. There is a list of all implemented filter methods in the Appendix.

3.5.2 Calculating filter values

Currently, only classification and regression tasks are supported.

The first step is to create a new R object using the class of the desired filter method. As with other objects in mlr3, these are registered in a dictionary (mlr_filters) with an associated shortcut function flt(). Each object of class Filter has a .$calculate() method which computes the filter values and ranks them in descending order.

filter = flt("jmim")

task = tsk("iris")
filter$calculate(task)

as.data.table(filter)
##         feature score
## 1:  Petal.Width   1.0
## 2: Sepal.Length   0.5
## 3: Petal.Length   0.0
## 4:  Sepal.Width    NA
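
The computed scores can be used to subset the task, e.g. by keeping only the highest-ranked features. A minimal sketch (keeping two features here is an arbitrary cutoff chosen for illustration):

# keep the two highest-ranked features; the cutoff of 2 is arbitrary
keep = names(head(filter$scores, 2))

# work on a clone so the original task stays untouched
task_top = task$clone()
task_top$select(keep)
task_top$feature_names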

Some filters support changing specific hyperparameters. This is similar to setting hyperparameters of a Learner using .$param_set$values:

filter_cor = flt("correlation")
filter_cor$param_set
## <ParamSet>
##        id    class lower upper nlevels    default value
## 1:    use ParamFct    NA    NA       5 everything      
## 2: method ParamFct    NA    NA       3    pearson
# change parameter 'method'
filter_cor$param_set$values = list(method = "spearman")
filter_cor$param_set
## <ParamSet>
##        id    class lower upper nlevels    default    value
## 1:    use ParamFct    NA    NA       5 everything         
## 2: method ParamFct    NA    NA       3    pearson spearman
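
Note that the correlation filter operates on a numeric target and is therefore restricted to regression tasks. A short sketch on the built-in mtcars regression task (no output shown, as the scores depend on the data):

task_mtcars = tsk("mtcars")
filter_cor$calculate(task_mtcars)
head(as.data.table(filter_cor), 3)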

3.5.3 Variable Importance Filters

All Learners with the property “importance” come with integrated feature selection methods.

You can find a list of all learners with this property in the Appendix.

For some learners, the desired filter method needs to be set during learner creation. For example, the learner classif.ranger comes with multiple integrated methods, cf. the help page of ranger::ranger(). To use the method “impurity”, you need to set the filter method during construction.

lrn = lrn("classif.ranger", importance = "impurity")

Now you can use the FilterImportance filter class for algorithm-embedded methods:

task = tsk("iris")
filter = flt("importance", learner = lrn)
filter$calculate(task)
head(as.data.table(filter), 3)
##         feature  score
## 1: Petal.Length 47.956
## 2:  Petal.Width 39.951
## 3: Sepal.Length  9.031

3.5.4 Wrapper Methods

Wrapper feature selection is supported via the mlr3fselect extension package. At the heart of mlr3fselect are the R6 classes FSelectInstanceSingleCrit (and its multi-criteria counterpart FSelectInstanceMultiCrit), which describe the feature selection problem and store the results, and FSelector, the base class of the feature selection algorithms.

3.5.5 The FSelectInstance Classes

The following subsection examines feature selection on the Pima data set, which is used to predict whether or not a patient has diabetes.

task = tsk("pima")
print(task)
## <TaskClassif:pima> (768 x 9)
## * Target: diabetes
## * Properties: twoclass
## * Features (8):
##   - dbl (8): age, glucose, insulin, mass, pedigree, pregnant, pressure,
##     triceps

We use the classification tree from rpart.

learner = lrn("classif.rpart")

Next, we need to specify how to evaluate the performance of the feature subsets. For this, we need to choose a resampling strategy and a performance measure.

hout = rsmp("holdout")
measure = msr("classif.ce")

Finally, one has to choose the available budget for the feature selection. This is done by selecting one of the available Terminators, e.g. a limit on the number of evaluations, the total run time, or the performance level to be reached.
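
For instance, a time-based budget could be specified as sketched below (the limit of 60 seconds is an arbitrary value chosen for illustration):

# terminate after 60 seconds of run time
trm("run_time", secs = 60)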

For this short introduction, we specify a budget of 20 evaluations and then put everything together into a FSelectInstanceSingleCrit:

evals20 = trm("evals", n_evals = 20)

instance = FSelectInstanceSingleCrit$new(
  task = task,
  learner = learner,
  resampling = hout,
  measure = measure,
  terminator = evals20
)
instance
## <FSelectInstanceSingleCrit>
## * State:  Not optimized
## * Objective: <ObjectiveFSelect:classif.rpart_on_pima>
## * Search Space:
## <ParamSet>
##          id    class lower upper nlevels        default value
## 1:      age ParamLgl    NA    NA       2 <NoDefault[3]>      
## 2:  glucose ParamLgl    NA    NA       2 <NoDefault[3]>      
## 3:  insulin ParamLgl    NA    NA       2 <NoDefault[3]>      
## 4:     mass ParamLgl    NA    NA       2 <NoDefault[3]>      
## 5: pedigree ParamLgl    NA    NA       2 <NoDefault[3]>      
## 6: pregnant ParamLgl    NA    NA       2 <NoDefault[3]>      
## 7: pressure ParamLgl    NA    NA       2 <NoDefault[3]>      
## 8:  triceps ParamLgl    NA    NA       2 <NoDefault[3]>      
## * Terminator: <TerminatorEvals>
## * Terminated: FALSE
## * Archive:
## <ArchiveFSelect>
## Null data.table (0 rows and 0 cols)

To start the feature selection, we still need to select an algorithm. These algorithms are defined via the FSelector class, described in the next section.

3.5.6 The FSelector Class

Several feature selection algorithms are currently implemented in mlr3fselect, e.g. random search, exhaustive search, sequential selection and recursive feature elimination.

In this example, we will use a simple random search and retrieve it from the dictionary mlr_fselectors with the fs() function:

fselector = fs("random_search")

3.5.7 Triggering the Feature Selection

To start the feature selection, we simply pass the FSelectInstanceSingleCrit to the $optimize() method of the initialized FSelector. The algorithm proceeds as follows:

  1. The FSelector proposes at least one feature subset and may propose multiple subsets to improve parallelization (this can be controlled via the setting batch_size).
  2. For each feature subset, the given Learner is fitted on the Task using the provided Resampling. All evaluations are stored in the archive of the FSelectInstanceSingleCrit.
  3. The Terminator is queried if the budget is exhausted. If the budget is not exhausted, restart with 1) until it is.
  4. Determine the feature subset with the best observed performance.
  5. Store the best feature subset as the result in the instance object. The best feature subset ($result_feature_set) and the corresponding measured performance ($result_y) can be accessed from the instance.

# reduce logging output
lgr::get_logger("bbotk")$set_threshold("warn")

fselector$optimize(instance)
##      age glucose insulin  mass pedigree pregnant pressure triceps
## 1: FALSE    TRUE    TRUE FALSE     TRUE    FALSE     TRUE   FALSE
##                             features classif.ce
## 1: glucose,insulin,pedigree,pressure     0.2461
instance$result_feature_set
## [1] "glucose"  "insulin"  "pedigree" "pressure"
instance$result_y
## classif.ce 
##     0.2461

One can investigate all resamplings which were undertaken, as they are stored in the archive of the FSelectInstanceSingleCrit and can be accessed by using as.data.table():

as.data.table(instance$archive)
##       age glucose insulin  mass pedigree pregnant pressure triceps classif.ce
##  1:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE     0.2656
##  2:  TRUE    TRUE   FALSE FALSE     TRUE    FALSE    FALSE   FALSE     0.2539
##  3: FALSE   FALSE    TRUE  TRUE     TRUE    FALSE     TRUE    TRUE     0.3359
##  4: FALSE   FALSE   FALSE FALSE    FALSE     TRUE    FALSE    TRUE     0.3242
##  5: FALSE   FALSE   FALSE FALSE    FALSE     TRUE    FALSE   FALSE     0.3086
##  6:  TRUE   FALSE    TRUE FALSE    FALSE    FALSE     TRUE    TRUE     0.3438
##  7:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE     0.2656
##  8:  TRUE   FALSE   FALSE  TRUE     TRUE     TRUE     TRUE   FALSE     0.2891
##  9:  TRUE   FALSE   FALSE FALSE     TRUE    FALSE     TRUE    TRUE     0.3438
## 10: FALSE   FALSE   FALSE  TRUE     TRUE    FALSE    FALSE   FALSE     0.3516
## 11: FALSE   FALSE   FALSE FALSE    FALSE     TRUE    FALSE   FALSE     0.3086
## 12:  TRUE    TRUE    TRUE  TRUE    FALSE     TRUE     TRUE    TRUE     0.2617
## 13: FALSE   FALSE   FALSE FALSE    FALSE    FALSE     TRUE   FALSE     0.3164
## 14:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE   FALSE     0.2656
## 15: FALSE    TRUE    TRUE FALSE     TRUE    FALSE     TRUE   FALSE     0.2461
## 16:  TRUE   FALSE    TRUE FALSE     TRUE    FALSE    FALSE    TRUE     0.3242
## 17:  TRUE   FALSE   FALSE FALSE    FALSE    FALSE     TRUE   FALSE     0.3867
## 18: FALSE   FALSE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE     0.3242
## 19:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE     0.2656
## 20:  TRUE    TRUE    TRUE  TRUE     TRUE     TRUE     TRUE    TRUE     0.2656
##                                    uhash           timestamp batch_nr
##  1: 213a04ee-2ac9-4568-94b5-5729076d3d14 2021-06-01 09:36:22        1
##  2: 05db2ac6-f4c7-4ac4-a0e5-86a9de3abc07 2021-06-01 09:36:22        1
##  3: 53b4b4cc-0469-4431-a32f-342bb7765a69 2021-06-01 09:36:22        1
##  4: 23cc9bac-8e3a-4562-af2d-e32b61708771 2021-06-01 09:36:22        1
##  5: e90406cd-4de8-4301-bb8f-ac1451fbb9e1 2021-06-01 09:36:22        1
##  6: 0a64e05e-fcfe-4c40-ad6d-a623af35a4ea 2021-06-01 09:36:22        1
##  7: 2710f530-5313-4e65-b181-230b1308b896 2021-06-01 09:36:22        1
##  8: 5217f6a7-11b1-488a-aef1-d01618992dae 2021-06-01 09:36:22        1
##  9: 0469b6b5-0f07-47d6-980e-605f747e04d8 2021-06-01 09:36:22        1
## 10: 1f36d11e-246d-4a5f-9263-36775cf876fe 2021-06-01 09:36:22        1
## 11: 01799802-fe64-4cab-8bfd-e0fe3d2d4ef9 2021-06-01 09:36:24        2
## 12: 256126b0-8ee5-4067-869b-1c2e0fcabf39 2021-06-01 09:36:24        2
## 13: ee6389c9-383c-48ea-8be8-2ec4c99aa004 2021-06-01 09:36:24        2
## 14: 5f61ecbf-6fe4-4b29-8da0-9dd6df35ae9b 2021-06-01 09:36:24        2
## 15: f93034ac-ab8a-4955-bccd-c41394517d00 2021-06-01 09:36:24        2
## 16: bc6cab2a-dfed-4c9b-af5e-31ec487af3a8 2021-06-01 09:36:24        2
## 17: 41c6d026-1cd8-4c9f-9f96-70e38d6b07c5 2021-06-01 09:36:24        2
## 18: 19652134-d439-4c26-806f-afa70b9f4809 2021-06-01 09:36:24        2
## 19: e2be8303-71a7-4821-ab6f-e7644e5a5157 2021-06-01 09:36:24        2
## 20: 8b0c8d38-d6f9-4269-99fe-1ba88977009e 2021-06-01 09:36:24        2

The associated resampling iterations can be accessed in the BenchmarkResult:

instance$archive$benchmark_result$data
## <ResultData>
##   Public:
##     as_data_table: function (view = NULL, reassemble_learners = TRUE, convert_predictions = TRUE, 
##     clone: function (deep = FALSE) 
##     combine: function (rdata) 
##     data: list
##     initialize: function (data = NULL, store_backends = TRUE) 
##     iterations: function (view = NULL) 
##     learners: function (view = NULL, states = TRUE, reassemble = TRUE) 
##     logs: function (view = NULL, condition) 
##     prediction: function (view = NULL, predict_sets = "test") 
##     predictions: function (view = NULL, predict_sets = "test") 
##     resamplings: function (view = NULL) 
##     sweep: function () 
##     task_type: active binding
##     tasks: function (view = NULL) 
##     uhashes: function (view = NULL) 
##   Private:
##     deep_clone: function (name, value) 
##     get_view_index: function (view)

The uhash column links the resampling iterations to the evaluated feature subsets stored in instance$archive$data(). This allows, for example, scoring the included ResampleResults on a different measure.
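
A sketch of how the ResampleResult of the first archived evaluation could be retrieved via its uhash and scored with an additional measure (classification accuracy is chosen here purely for illustration):

# look up the uhash of the first evaluation and fetch the matching ResampleResult
archive_tab = as.data.table(instance$archive)
rr = instance$archive$benchmark_result$resample_result(uhash = archive_tab$uhash[1])

# score it with another measure
rr$score(msr("classif.acc"))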

Now the optimized feature subset can be used to subset the task and fit the model on all observations.

task$select(instance$result_feature_set)
learner$train(task)

The trained model can now be used to make a prediction on external data. Note that predicting on observations present in the task should be avoided. The model has seen these observations already during feature selection and therefore results would be statistically biased. Hence, the resulting performance measure would be over-optimistic. Instead, to get statistically unbiased performance estimates for the current task, nested resampling is required.
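
A minimal sketch of such a prediction on new, external observations (the feature values below are made up purely for illustration and correspond to the feature set selected above):

# hypothetical new observation with made-up values for the selected features
newdata = data.frame(glucose = 120, insulin = 100, pedigree = 0.5, pressure = 70)
learner$predict_newdata(newdata)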

3.5.8 Automating the Feature Selection

The AutoFSelector wraps a learner and augments it with automatic feature selection for a given task. Because the AutoFSelector itself inherits from the Learner base class, it can be used like any other learner. Analogously to the previous subsection, a new classification tree learner is created. This classification tree learner automatically starts a feature selection on the given task using an inner resampling (holdout). We create a terminator which allows 10 evaluations and use simple random search as the feature selection algorithm:

learner = lrn("classif.rpart")
terminator = trm("evals", n_evals = 10)
fselector = fs("random_search")

at = AutoFSelector$new(
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = terminator,
  fselector = fselector
)
at
## <AutoFSelector:classif.rpart.fselector>
## * Model: -
## * Parameters: xval=0
## * Packages: rpart
## * Predict Type: response
## * Feature types: logical, integer, numeric, factor, ordered
## * Properties: importance, missings, multiclass, selected_features,
##   twoclass, weights

We can now use the learner like any other learner, calling the $train() and $predict() methods. When used inside resampling or benchmarking, the AutoFSelector performs its feature selection on the training set of the respective split of the outer resampling, and the resulting learner then makes predictions on the corresponding test set. This yields unbiased performance measures, as the observations in the test set have not been used during feature selection or fitting of the learner. This is called nested resampling.

To compare the optimized feature subset with the complete feature set, we can use benchmark():

grid = benchmark_grid(
  task = tsk("pima"),
  learner = list(at, lrn("classif.rpart")),
  resampling = rsmp("cv", folds = 3)
)

bmr = benchmark(grid, store_models = TRUE)
bmr$aggregate(msrs(c("classif.ce", "time_train")))
##    nr      resample_result task_id              learner_id resampling_id iters
## 1:  1 <ResampleResult[21]>    pima classif.rpart.fselector            cv     3
## 2:  2 <ResampleResult[21]>    pima           classif.rpart            cv     3
##    classif.ce time_train
## 1:     0.2669          0
## 2:     0.2383          0

Note that we do not expect any significant differences since we only evaluated a small fraction of the possible feature subsets.
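
Since the models were stored (store_models = TRUE), the feature sets selected in each outer fold can also be inspected. A sketch, assuming the AutoFSelector corresponds to the first resample result of the benchmark (as in the aggregate table above):

# the resample result of the AutoFSelector
rr_at = bmr$resample_result(1)

# the feature set selected in each of the three outer folds
lapply(rr_at$learners, function(l) unlist(l$fselect_result$features))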

References

Chandrashekar, Girish, and Ferat Sahin. 2014. “A Survey on Feature Selection Methods.” Computers and Electrical Engineering 40 (1): 16–28. https://doi.org/10.1016/j.compeleceng.2013.11.024.

Guyon, Isabelle, and André Elisseeff. 2003. “An Introduction to Variable and Feature Selection.” Journal of Machine Learning Research 3 (Mar): 1157–82.