5  Feature Selection

5.1 Introduction

Feature selection, also known as variable or descriptor selection, is the process of finding a subset of features to use with a given model. Using an optimal set of features can have several benefits:

  • improved performance, since we reduce overfitting on irrelevant features,
  • robust models that do not rely on noisy features,
  • simpler models that are easier to interpret,
  • faster model fitting, and
  • no need to collect potentially expensive features.

Reducing the number of features can improve models across many scenarios, and it is especially helpful in datasets with a high number of features in comparison to the number of data points. Many learners perform implicit feature selection, e.g. via the choice of variables used for splitting in a decision tree. Most other feature selection methods are model agnostic, i.e. they can be used together with any learner. Of the many different approaches to identifying relevant features, we will focus on two general concepts, which are described in detail below: Filter and Wrapper methods (Guyon and Elisseeff 2003; Chandrashekar and Sahin 2014).

Filter methods are preprocessing steps that can be applied before training a model. A very simple filter approach could look like this:

  1. calculate the correlation coefficient \(\rho\) between each feature and the target variable, and
  2. select all features with \(\rho > 0.2\) for further modelling steps.
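This two-step recipe can be sketched in plain R; the data frame below is synthetic and purely illustrative, not part of mlr3:

```r
# Toy univariate correlation filter on synthetic regression data
set.seed(1)
df = data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y = 2 * df$x1 + 0.1 * df$x3 + rnorm(100)

# 1. correlation of each feature with the target
scores = sapply(df[c("x1", "x2", "x3")], function(x) cor(x, df$y))

# 2. keep features with rho > 0.2
keep = names(scores[scores > 0.2])
keep
```

Here x1 would almost surely be kept, while the purely random x2 would almost surely be dropped.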

This approach is a univariate filter because it only considers the univariate relationship between each feature and the target variable. Further, it can only be applied to regression tasks with continuous features and the threshold of \(\rho > 0.2\) is quite arbitrary. Thus, more advanced filter methods, e.g. multivariate filters based on feature importance, usually perform better (Bommert et al. 2020). In the Filters section, it is described how to calculate univariate, multivariate and feature importance filters, how to access implicitly selected features, how to integrate filters in a machine learning pipeline and how to optimize filter thresholds.

Wrapper methods work by fitting models on selected feature subsets and evaluating their performance. This can be done in a sequential fashion, e.g. by iteratively adding features to the model in the so-called sequential forward selection, or in a parallel fashion, e.g. by evaluating random feature subsets in a random search. Below, in the Wrapper Methods section, the use of these simple approaches is described in a common framework along with more advanced methods such as Genetic Search. It is further shown how to select features by optimizing multiple performance measures and how to wrap a learner with feature selection to use it in pipelines or benchmarks.
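Sequential forward selection, as described above, can be sketched as a generic R function; evaluate() is a hypothetical helper that resamples the learner on a given feature subset and returns a performance score (higher is better):

```r
# Sketch of sequential forward selection; evaluate() is a hypothetical
# stand-in for fitting and resampling a model on a feature subset.
forward_selection = function(all_features, evaluate) {
  selected = character(0)
  best_score = -Inf
  repeat {
    candidates = setdiff(all_features, selected)
    if (length(candidates) == 0) break
    scores = sapply(candidates, function(f) evaluate(c(selected, f)))
    if (max(scores) <= best_score) break  # stop if no improvement
    best_score = max(scores)
    selected = c(selected, candidates[which.max(scores)])
  }
  selected
}
```

In mlr3fselect, this strategy is available out of the box, as shown in the Wrapper Methods section below.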

For this chapter, the reader should know the basic concepts of mlr3, i.e. know about tasks and learners. Basics about performance evaluation, i.e. resampling and benchmarking are helpful but not strictly necessary. In this section, we mostly focus on feature selection as a means of improving model performance.

5.1.1 Further Reading

5.2 Filters

Filter algorithms assign a numeric score to each feature, e.g. the correlation between the feature and the target variable, use these scores to rank the features, and select a feature subset based on the ranking. Features with lower scores can then be omitted in subsequent modeling steps. All filters are implemented via the package mlr3filters. We distinguish between several types of filters, most notably univariate and multivariate filters. While univariate filter methods only consider the relationship between each feature and the target variable, multivariate filters can take interactions with other features into account. A benefit of univariate filters is that they are usually computationally cheaper than more complex filter or wrapper methods. Below, we cover how to

  • instantiate a Filter object,
  • calculate scores for a given task, and
  • use calculated scores to select or drop features.

One special case of filters are feature importance filters. They select features that are important according to the model induced by a selected Learner. Feature importance filters rely on the learner to extract information on feature importance from a trained model, for example, by inspecting a learned decision tree and returning the features that are used as split variables, or by computing model-agnostic feature importance values for each feature.

There is a list of all implemented filter methods in the Appendix.

5.2.1 Calculating Filter Values

The first step is to create a new R object using the class of the desired filter method. Similar to other instances in mlr3, these are registered in a dictionary (mlr3filters::mlr_filters) with an associated shortcut function mlr3filters::flt(). Each object of class Filter has a $calculate() method which computes the filter values and ranks them in descending order. For example, to calculate an information gain filter:

filter = flt("information_gain")
task = tsk("penguins")
filter$calculate(task)
as.data.table(filter)

          feature       score
1: flipper_length 0.581167901
2:    bill_length 0.544896584
3:     bill_depth 0.538718879
4:         island 0.520157171
5:      body_mass 0.442879511
6:            sex 0.007244168
7:           year 0.000000000

Some filters have hyperparameters, which can be changed similarly to setting hyperparameters of a mlr3::Learner using $param_set$values. For example, to calculate "spearman" instead of "pearson" correlation with the correlation filter:

filter_cor = flt("correlation")
filter_cor$param_set$values = list(method = "spearman")
filter_cor$param_set
       id    class lower upper nlevels    default    value
1:    use ParamFct    NA    NA       5 everything         
2: method ParamFct    NA    NA       3    pearson spearman

5.2.2 Feature Importance Filters

To use feature importance filters, we can use a learner with integrated feature importance methods. All learners with the property “importance” have this functionality. A list of all learners with this property is in the Appendix.

For some learners, the desired filter method needs to be set during learner creation. For example, learner mlr3learners::classif.ranger comes with multiple integrated methods, cf. the help page of ranger::ranger(). To use the feature importance method “impurity”, select it during learner construction:

lrn = lrn("classif.ranger", importance = "impurity")

Now you can use the mlr3filters::FilterImportance filter class:

task = tsk("penguins")

# Remove observations with missing data
task$filter(which(complete.cases(task$data())))

filter = flt("importance", learner = lrn)
filter$calculate(task)
as.data.table(filter)
          feature     score
1:    bill_length 76.374739
2: flipper_length 45.348924
3:     bill_depth 36.305939
4:      body_mass 26.457564
5:         island 24.077990
6:            sex  1.597289
7:           year  1.215536

5.2.3 Embedded Methods

Another alternative is to use embedded methods. Many learners internally select a subset of the features which they find helpful for prediction. These subsets can usually be queried, as the following example demonstrates:

task = tsk("penguins")
learner = lrn("classif.rpart")

# ensure that the learner selects features
stopifnot("selected_features" %in% learner$properties)

learner = learner$train(task)
learner$selected_features()
[1] "flipper_length" "bill_length"    "island"        

The features selected by the model can be extracted by a Filter object, where $calculate() corresponds to training the learner on the given task:

filter = flt("selected_features", learner = learner)
filter$calculate(task)
as.data.table(filter)
          feature score
1:         island     1
2: flipper_length     1
3:    bill_length     1
4:     bill_depth     0
5:            sex     0
6:           year     0
7:      body_mass     0

Contrary to other filter methods, embedded methods just return a value of 1 (selected feature) or 0 (dropped feature).

5.2.4 Filter-based Feature Selection

After calculating a score for each feature, one has to select the features to be kept or those to be dropped from further modelling steps. For the "selected_features" filter described in the embedded methods section, this step is straightforward since the method assigns either a value of 1 for a feature to be kept or 0 for a feature to be dropped. With task$select() the features with a value of 1 can be selected:

task = tsk("penguins")
learner = lrn("classif.rpart")
filter = flt("selected_features", learner = learner)
filter$calculate(task)

# select all features used by rpart
keep = names(which(filter$scores == 1))
keep
[1] "bill_length"    "flipper_length" "island"        

task$select(keep)

Note that we use the function task$select() and not task$filter(), which is used to filter rows (not columns) of the data matrix, see task mutators.
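As a small aside, the difference between the two mutators can be seen directly on the penguins task (illustrative snippet):

```r
library(mlr3)
task = tsk("penguins")
task$select(c("bill_length", "island"))  # keep only these two feature columns
task$filter(1:100)                       # keep only the first 100 rows
c(task$ncol, task$nrow)                  # 2 features plus the target, 100 rows
```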

For filter methods which assign continuous scores, there are essentially two ways to select features:

  • select the top \(k\) features, or
  • select all features with a score above a threshold \(\tau\)

The first option is equivalent to dropping the bottom \(p-k\) features. For both options, one has to decide on a cutoff, which is often quite arbitrary. For example, to implement the first option with the information gain filter:

task = tsk("penguins")
filter = flt("information_gain")
filter$calculate(task)

# select top 3 features from information gain filter
keep = names(head(filter$scores, 3))
keep
[1] "bill_depth"     "bill_length"    "flipper_length"

Or, the second option with \(\tau = 0.5\):

task = tsk("penguins")
filter = flt("information_gain")
filter$calculate(task)

# select all features with score > 0.5 from information gain filter
keep = names(which(filter$scores > 0.5))
keep
[1] "bill_depth"     "bill_length"    "flipper_length" "island"        

Filters can be integrated into Pipelines. While pipelines are described in detail in the Pipelines Chapter, here is a brief preview:

task = tsk("penguins")

# combine filter (keep top 3 features) with learner
graph = po("filter", filter = flt("information_gain"), filter.nfeat = 3) %>>%
  po("learner", lrn("classif.rpart"))

# now it can be used as any learner, but it includes the feature selection
learner = as_learner(graph)

Pipelines can also be used to apply Hyperparameter Optimization to the filter, i.e. tune the filter threshold to optimize the feature selection regarding prediction performance:

# combine filter with learner
graph = po("filter", filter = flt("information_gain")) %>>%
  po("learner", lrn("classif.rpart"))
learner = as_learner(graph)

# tune how many features to include
library("mlr3tuning")
search_space = ps(information_gain.filter.nfeat = p_int(lower = 1, upper = 7))
instance = TuningInstanceSingleCrit$new(
  task = task,
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  search_space = search_space,
  terminator = trm("none")
)
tuner = tnr("grid_search")
tuner$optimize(instance)
   information_gain.filter.nfeat learner_param_vals  x_domain classif.acc
1:                             5          <list[2]> <list[1]>   0.9391304

# plot tuning results
autoplot(instance, type = "performance")

For more details, see the Pipelines and Hyperparameter Optimization chapters.

5.3 Wrapper Methods

Wrapper methods iteratively select features that optimize a performance measure. Instead of ranking features, a model is fit on a selected subset of features in each iteration and evaluated with respect to a selected performance measure. The strategy that determines which feature subset is used in each iteration is given by the FSelector object. A simple example is the sequential forward selection that starts with computing each single-feature model and then iteratively adds the feature that leads to the largest performance improvement. Wrapper methods can be used with any learner but need to train the learner potentially many times, leading to a computationally intensive method. All wrapper methods are implemented via the package mlr3fselect. In this chapter, we cover how to

  • instantiate an FSelector object,
  • configure it, e.g. to respect a runtime limit or to optimize different objectives, and
  • run it or fuse it with a Learner via an AutoFSelector.

5.3.1 Simple Forward Selection Example

We start with the simple example from above and do sequential forward selection with the penguins data:


instance = fselect(
  method = "sequential",
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc")
)

To show all analyzed feature subsets and the corresponding performance, use:

as.data.table(instance$archive)

   bill_depth bill_length body_mass flipper_length island   sex  year
1:       TRUE       FALSE     FALSE          FALSE  FALSE FALSE FALSE
2:      FALSE        TRUE     FALSE          FALSE  FALSE FALSE FALSE
3:      FALSE       FALSE      TRUE          FALSE  FALSE FALSE FALSE
4:      FALSE       FALSE     FALSE           TRUE  FALSE FALSE FALSE
5:      FALSE       FALSE     FALSE          FALSE   TRUE FALSE FALSE
6:      FALSE       FALSE     FALSE          FALSE  FALSE  TRUE FALSE
7 variables not shown: [classif.acc, runtime_learners, timestamp, batch_nr, warnings, errors, resample_result]

And to only show the best feature set:

instance$result_feature_set

[1] "bill_length"    "flipper_length" "island"        

Internally, the mlr3fselect::fselect function creates an mlr3fselect::FSelectInstanceSingleCrit object and executes the feature selection with an mlr3fselect::FSelector object, based on the selected method, in this example an mlr3fselect::FSelectorSequential object. It uses the supplied resampling and measure to evaluate all feature subsets provided by the mlr3fselect::FSelector on the task.

At the heart of mlr3fselect are the R6 classes mlr3fselect::FSelectInstanceSingleCrit, mlr3fselect::FSelectInstanceMultiCrit and mlr3fselect::FSelector.

In the following two sections, these classes will be created manually, to learn more about the mlr3fselect package.

5.3.2 The FSelectInstance Classes

To create an mlr3fselect::FSelectInstanceSingleCrit object, we use the sugar function mlr3fselect::fsi, which is short for FSelectInstanceSingleCrit$new() or FSelectInstanceMultiCrit$new(), depending on the selected measure(s):

instance = fsi(
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  terminator = trm("evals", n_evals = 20)
)

Note that we have not selected a feature selection algorithm and thus did not select any features, yet. We have also supplied a so-called bbotk::Terminator, which is used to stop the feature selection. For the forward selection in the example above, we did not need a terminator because we simply tried all remaining features until the full model. However, for other feature selection algorithms such as random search, a terminator is required. The following terminators are available:

  • trm("evals") to stop after a given number of evaluations,
  • trm("run_time") to stop after a given runtime budget,
  • trm("clock_time") to stop at a fixed point in time,
  • trm("perf_reached") to stop once a given performance level is reached,
  • trm("stagnation") to stop when the performance stagnates,
  • trm("combo") to combine several terminators, and
  • trm("none") for no termination criterion.

Above we used the sugar function bbotk::trm to select bbotk::TerminatorEvals with 20 evaluations.

To start the feature selection, we still need to select an algorithm. Algorithms are defined via the mlr3fselect::FSelector class, described in the next section.

5.3.3 The FSelector Class

The mlr3fselect::FSelector class is the base class for different feature selection algorithms. Several algorithms are currently implemented in mlr3fselect, among them random search, exhaustive search, sequential selection and genetic search.

In this example, we will use a simple random search and retrieve it from the dictionary mlr3fselect::mlr_fselectors with the mlr3fselect::fs() sugar function, which is short for FSelectorRandomSearch$new():

fselector = fs("random_search")

5.3.4 Starting the Feature Selection

To start the feature selection, we pass the mlr3fselect::FSelectInstanceSingleCrit object to the $optimize() method of the initialized mlr3fselect::FSelector object:

fselector$optimize(instance)

   bill_depth bill_length body_mass flipper_length island  sex year
1:      FALSE        TRUE     FALSE           TRUE   TRUE TRUE TRUE
2 variables not shown: [features, classif.acc]

The algorithm proceeds as follows:

  1. The mlr3fselect::FSelector proposes at least one feature subset and may propose multiple subsets to improve parallelization (controlled via the setting batch_size).
  2. For each feature subset, the given mlr3::Learner is fitted on the mlr3::Task using the provided mlr3::Resampling and evaluated with the given mlr3::Measure.
  3. All evaluations are stored in the archive of the mlr3fselect::FSelectInstanceSingleCrit.
  4. The bbotk::Terminator is queried if the budget is exhausted. If the budget is not exhausted, restart with 1) until it is.
  5. Determine the feature subset with the best observed performance.
  6. Store the best feature subset as the result in the instance object.
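Put schematically, this loop looks as follows (pseudo-code; propose(), evaluate() and the archive methods are hypothetical stand-ins, not the real mlr3fselect API):

```r
# Pseudo-code of the wrapper loop, not runnable as-is
while (!terminator$is_terminated(archive)) {
  subsets = fselector$propose(batch_size)  # step 1: new candidate subsets
  results = lapply(subsets, evaluate)      # step 2: resample the learner
  archive$add(subsets, results)            # step 3: store the evaluations
}                                          # step 4: repeat until budget is gone
best = archive$best()                      # steps 5 and 6: final result
```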

The best feature subset ($result_feature_set) and the corresponding measured performance ($result_y) can be accessed from the instance:

instance$result_feature_set

[1] "bill_length"    "flipper_length" "island"         "sex"           
[5] "year"          

As in the forward selection example above, one can investigate all resamplings which were undertaken, as they are stored in the archive of the mlr3fselect::FSelectInstanceSingleCrit and can be accessed via data.table::as.data.table():

as.data.table(instance$archive)

   bill_depth bill_length body_mass flipper_length island   sex  year
1:      FALSE       FALSE     FALSE           TRUE  FALSE FALSE FALSE
2:      FALSE        TRUE     FALSE           TRUE   TRUE  TRUE  TRUE
3:       TRUE       FALSE     FALSE          FALSE  FALSE FALSE FALSE
4:       TRUE        TRUE      TRUE           TRUE  FALSE FALSE  TRUE
5:      FALSE       FALSE     FALSE           TRUE   TRUE FALSE FALSE
6:      FALSE       FALSE     FALSE           TRUE  FALSE FALSE FALSE
7 variables not shown: [classif.acc, runtime_learners, timestamp, batch_nr, warnings, errors, resample_result]

Now the optimized feature subset can be used to subset the task and fit the model on all observations:

task = tsk("penguins")
learner = lrn("classif.rpart")

task$select(instance$result_feature_set)
learner$train(task)

The trained model can now be used to make a prediction on external data. Note that predicting on observations present in the task used for feature selection should be avoided. The model has seen these observations already during feature selection and therefore performance evaluation results would be over-optimistic. Instead, to get unbiased performance estimates for the current task, nested resampling is required.

5.3.5 Optimizing Multiple Performance Measures

You might want to use multiple criteria to evaluate the performance of the feature subsets. For example, you might want to select the subset with the lowest classification error and lowest time to train the model. The full list of performance measures can be found in the mlr3::mlr_measures dictionary.

We will expand the previous example and perform feature selection on the penguins dataset, however, this time we will use mlr3fselect::FSelectInstanceMultiCrit to select the subset of features that has the highest classification accuracy and the lowest time to train the model.

The feature selection process with multiple criteria is similar to that with a single criterion, except that we select two measures to be optimized:

instance = fsi(
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msrs(c("classif.acc", "time_train")),
  terminator = trm("evals", n_evals = 20)
)

The function mlr3fselect::fsi creates an instance of FSelectInstanceMultiCrit if more than one measure is selected. We now create an mlr3fselect::FSelector, e.g. using random search, and call the $optimize() function of the FSelector with the FSelectInstanceMultiCrit object, to search for the subset of features with the best classification accuracy and time to train the model.

fselector = fs("random_search")
fselector$optimize(instance)
   bill_depth bill_length body_mass flipper_length island  sex year
1:       TRUE        TRUE     FALSE           TRUE   TRUE TRUE TRUE
3 variables not shown: [features, classif.acc, time_train]

As above, the best feature subset ($result_feature_set) and the corresponding measured performance ($result_y) can be accessed from the instance. However, in this simple case, if the fastest subset is not also the best performing subset, the result consists of two subsets: one with the lowest training time and one with the best classification accuracy.

instance$result_feature_set
[1] "bill_depth"     "bill_length"    "flipper_length" "island"        
[5] "sex"            "year"          

instance$result_y
   classif.acc time_train
1:   0.9391304      0.026

More generally, the result is the Pareto-optimal solution, i.e. the best feature subset for each of the criteria that is not dominated by another subset. For the example with classification accuracy and training time, a feature subset that is best in both accuracy and training time will dominate all other subsets and thus will be the only Pareto-optimal solution. If, however, different subsets are best in the two criteria, both subsets are Pareto-optimal.
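The dominance check itself is simple to write down; the following illustrative helper (not part of mlr3fselect) flags the non-dominated subsets given accuracy (to maximize) and training time (to minimize):

```r
# Flag Pareto-optimal points: higher acc is better, lower time is better
pareto_optimal = function(acc, time) {
  sapply(seq_along(acc), function(i) {
    # i is dominated if some point is at least as good in both
    # criteria and strictly better in at least one
    !any(acc >= acc[i] & time <= time[i] & (acc > acc[i] | time < time[i]))
  })
}

acc  = c(0.94, 0.90, 0.94, 0.85)
time = c(0.030, 0.010, 0.025, 0.020)
pareto_optimal(acc, time)
# [1] FALSE  TRUE  TRUE FALSE
```

The first subset is dominated by the third (same accuracy, faster), and the fourth by the second (better accuracy, faster).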

5.3.6 Automating the Feature Selection

The mlr3fselect::AutoFSelector class wraps a learner and augments it with an automatic feature selection for a given task. Because the mlr3fselect::AutoFSelector itself inherits from the mlr3::Learner base class, it can be used like any other learner. Analogously to the previous subsection, a new classification tree ("classif.rpart") learner is created. This classification tree learner is then wrapped in a random search feature selector, which automatically starts a feature selection on the given task using an inner resampling (holdout), as soon as the wrapped learner is trained. Here, the function mlr3fselect::auto_fselector creates an instance of AutoFSelector, i.e. it is short for AutoFSelector$new().

learner = lrn("classif.rpart")

at = auto_fselector(
  method = fs("random_search"),
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  terminator = trm("evals", n_evals = 10)
)
at
<AutoFSelector:classif.rpart.fselector>
* Model: list
* Packages: mlr3, mlr3fselect, rpart
* Predict Type: response
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, multiclass, selected_features,
  twoclass, weights

We can now, as with any other learner, call the $train() and $predict() method. This time however, we pass it to mlr3::benchmark() to compare the optimized feature subset to the complete feature set. This way, the mlr3fselect::AutoFSelector will do its resampling for feature selection on the training set of the respective split of the outer resampling. The learner then undertakes predictions using the test set of the outer resampling. Here, the outer resampling refers to the resampling specified in benchmark(), whereas the inner resampling is that specified in auto_fselector(). This is called nested resampling and yields unbiased performance measures, as the observations in the test set have not been used during feature selection or fitting of the respective learner.

In the call to benchmark(), we compare our wrapped learner at with a normal classification tree lrn("classif.rpart"). For that, we create a benchmark grid with the task, the learners and a 3-fold cross validation.

grid = benchmark_grid(
  task = tsk("penguins"),
  learner = list(at, lrn("classif.rpart")),
  resampling = rsmp("cv", folds = 3)
)

bmr = benchmark(grid)

Now, we compare those two learners regarding classification accuracy and training time:

aggr = bmr$aggregate(msrs(c("classif.acc", "time_train")))
as.data.table(aggr)[, .(learner_id, classif.acc, time_train)]
                learner_id classif.acc  time_train
1: classif.rpart.fselector   0.9419019 1.785333333
2:           classif.rpart   0.9418764 0.003666667

Because of the implicit feature selection in classification trees, we do not expect huge performance improvements from the feature selection in this example.