5  Feature Selection

Author: Marvin N. Wright

Affiliations: Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany; University of Bremen, Germany; University of Copenhagen, Denmark

Abstract
Feature selection is a key component of data analysis and is part of most machine learning applications. This chapter introduces two concepts of feature selection and how to use them in mlr3. Filter methods are preprocessing steps that are independent of the model to be fitted, whereas wrapper methods work by fitting models on feature subsets and evaluating their performance. We start with simple examples such as univariate filters and sequential forward selection but also cover more advanced topics such as optimizing multiple performance measures simultaneously. To integrate feature selection into complex machine learning pipelines and the mlr3 ecosystem, we show how to combine feature selection with mlr3 pipelines and how to automate feature selection by wrapping it in mlr3 learners.

Feature selection, also known as variable or descriptor selection, is the process of finding a subset of features to use with a given task and learner. Using an optimal set of features can have several benefits:

  • improved predictive performance, since irrelevant or noisy features are removed,
  • more interpretable models, as fewer features have to be considered,
  • faster model fitting and lower computational cost, and
  • cheaper data collection, since fewer features need to be measured.

However, these objectives will not necessarily be optimized by the same set of features; feature selection is thus inherently multi-objective. In this chapter, we mostly focus on feature selection as a means of improving predictive performance, but also briefly cover optimization of multiple criteria (Section 5.2.5).

Reducing the number of features can improve models across many scenarios, but it is especially helpful for datasets with a large number of features relative to the number of data points. Many learners perform implicit, also called embedded, feature selection, e.g. via the choice of variables used for splitting in a decision tree. Most other feature selection methods are model-agnostic, i.e. they can be used together with any learner. Of the many different approaches to identifying relevant features, we will focus on two general concepts, which are described in detail below: filter and wrapper methods (Guyon and Elisseeff 2003; Chandrashekar and Sahin 2014).

For this chapter, the reader should know the basic concepts of mlr3 (Chapter 2), i.e. tasks (Section 2.1) and learners (Section 2.2). Basic knowledge of performance evaluation (Chapter 3), i.e. resampling (Section 3.2) and benchmarking (Section 3.3), is helpful but not strictly necessary.

5.1 Filters

Filter methods are preprocessing steps that can be applied before training a model. A very simple filter approach could look like this:

  1. calculate the correlation coefficient \(\rho\) between each feature and a numeric target variable, and
  2. select all features with \(\rho > 0.2\) for further modelling steps.

This approach is a univariate filter because it considers only the univariate relationship between each feature and the target variable. Further, it can only be applied to regression tasks with continuous features, and the threshold of \(\rho > 0.2\) is quite arbitrary. Thus, more advanced filter methods, e.g. multivariate filters based on feature importance, usually perform better (Bommert et al. 2020). On the other hand, a benefit of univariate filters is that they are usually computationally cheaper than more complex filter or wrapper methods. In the following, we describe how to calculate univariate, multivariate and feature importance filters, how to access implicitly selected features, how to integrate filters into a machine learning pipeline and how to optimize filter thresholds.
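To make this concrete, here is a minimal base-R sketch of such a univariate correlation filter, using the built-in mtcars data with mpg as the numeric target purely for illustration:

# univariate correlation filter in base R (illustration on mtcars, target mpg)
target = mtcars$mpg
features = mtcars[, setdiff(names(mtcars), "mpg")]

# 1. correlation between each feature and the target
rho = sapply(features, cor, y = target)

# 2. keep all features with rho > 0.2 (in practice, abs(rho) would be more sensible)
keep = names(rho)[rho > 0.2]
keep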

Filter algorithms assign a numeric score to each feature, e.g. the correlation between the feature and the target variable, rank the features by these scores and select a feature subset based on the ranking. Features with lower scores can then be omitted in subsequent modeling steps. All filters are implemented via the package mlr3filters. Below, we cover how to

  • instantiate a Filter object,
  • calculate scores for a given task, and
  • use calculated scores to select or drop features.

One special case of filters are feature importance filters (Section 5.1.2). They select features that are important according to the model induced by a selected Learner. Feature importance filters rely on the learner to extract information on feature importance from a trained model, for example, by inspecting a learned decision tree and returning the features that are used as split variables, or by computing model-agnostic feature importance (Chapter 10) values for each feature.

Many filter methods are implemented in mlr3filters, for example:

  • Correlation, calculating Pearson or Spearman correlation between numeric features and numeric targets (FilterCorrelation)
  • Information gain, i.e. mutual information of the feature and the target or the reduction of uncertainty of the target due to a feature (FilterInformationGain)
  • Minimal joint mutual information maximization, minimizing the joint information between selected features to avoid redundancy (FilterJMIM)
  • Permutation score, which calculates permutation feature importance (see Chapter 10) with a given learner for each feature (FilterPermutation)
  • Area under the ROC curve calculated for each feature separately (FilterAUC)

Most of the filter methods have some limitations, e.g. the correlation filter can only be calculated for regression tasks with numeric features. For a full list of all implemented filter methods we refer the reader to the mlr3filters website, which also shows the supported task and feature types. A benchmark of filter methods was performed by Bommert et al. (2020), who recommend not relying on a single filter method but trying several if the available computational resources allow. If only a single filter method is to be used, the authors recommend a feature importance filter based on random forest permutation importance (see Section 5.1.2), similar to the permutation method described above; the JMIM and AUC filters also performed well in their comparison.
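Following this recommendation, several filters can be computed side by side and their rankings compared. Below is a minimal sketch using the sonar task, chosen here only because the AUC filter requires a binary classification target:

library("mlr3verse")
task = tsk("sonar")

# calculate two different filters and rank the features by each
filters = list(flt("information_gain"), flt("auc"))
ranks = sapply(filters, function(f) {
  f$calculate(task)
  rank(-f$scores[task$feature_names])  # rank 1 = highest score
})
colnames(ranks) = c("information_gain", "auc")
head(ranks)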

5.1.1 Calculating Filter Values

The first step is to create a new R object using the class of the desired filter method. Similar to other classes in mlr3, these are registered in a dictionary (mlr_filters) with an associated shortcut function flt(). Each object of class Filter has a $calculate() method which computes the filter values and ranks them in descending order. For example, we can use the information gain filter described above:

library("mlr3verse")
filter = flt("information_gain")

Such a Filter object can now be used to calculate the filter on the penguins data and get the results:

task = tsk("penguins")
filter$calculate(task)

as.data.table(filter)
          feature       score
1: flipper_length 0.581167901
2:    bill_length 0.544896584
3:     bill_depth 0.538718879
4:         island 0.520157171
5:      body_mass 0.442879511
6:            sex 0.007244168
7:           year 0.000000000

Some filters have hyperparameters, which can be changed similarly to the hyperparameters of a Learner, using $param_set$values. For example, to calculate "spearman" instead of "pearson" correlation with the correlation filter:

filter_cor = flt("correlation")
filter_cor$param_set$values = list(method = "spearman")
filter_cor$param_set
<ParamSet>
       id    class lower upper nlevels    default    value
1:    use ParamFct    NA    NA       5 everything         
2: method ParamFct    NA    NA       3    pearson spearman
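
Because the correlation filter is only applicable to regression tasks with numeric features, it could, for example, be calculated on the mtcars regression task; a brief sketch:

task_mtcars = tsk("mtcars")
filter_cor$calculate(task_mtcars)
head(as.data.table(filter_cor), 3)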

5.1.2 Feature Importance Filters

To use feature importance filters, we can use a learner with integrated feature importance methods. All learners with the property “importance” have this functionality. A list of all learners with this property can be found with

as.data.table(mlr_learners)[sapply(properties, function(x) "importance" %in% x)]
                         key                              label task_type
 1:         classif.catboost                  Gradient Boosting   classif
 2:      classif.featureless Featureless Classification Learner   classif
 3:              classif.gbm                  Gradient Boosting   classif
 4: classif.imbalanced_rfsrc           Imbalanced Random Forest   classif
 5:         classif.lightgbm                  Gradient Boosting   classif
---                                                                      
22:                 surv.gbm                  Gradient Boosting      surv
23:              surv.mboost Boosted Generalized Additive Model      surv
24:              surv.ranger                      Random Forest      surv
25:               surv.rfsrc                      Random Forest      surv
26:             surv.xgboost                  Gradient Boosting      surv
4 variables not shown: [feature_types, packages, properties, predict_types]

For some learners, the desired filter method needs to be set during learner creation. For example, the learner classif.ranger comes with multiple integrated methods, cf. the help page of ranger::ranger(). To use the feature importance method “impurity”, select it during learner construction:

lrn = lrn("classif.ranger", importance = "impurity")

Now we can use the FilterImportance filter class:

task = tsk("penguins")

# Remove observations with missing data
task$filter(which(complete.cases(task$data())))

filter = flt("importance", learner = lrn)
filter$calculate(task)
as.data.table(filter)
          feature     score
1:    bill_length 76.374739
2: flipper_length 45.348924
3:     bill_depth 36.305939
4:      body_mass 26.457564
5:         island 24.077990
6:            sex  1.597289
7:           year  1.215536

5.1.3 Embedded Methods

Many learners internally select a subset of the features which they find helpful for prediction, but ignore other features. For example, a decision tree might never select some features for splitting. These subsets can be used for feature selection, which we call embedded methods because the feature selection is embedded in the learner. The selected features (and those not selected) can be queried if the learner has the "selected_features" property, as the following example demonstrates:

task = tsk("penguins")
learner = lrn("classif.rpart")

# ensure that the learner selects features
stopifnot("selected_features" %in% learner$properties)

learner = learner$train(task)
learner$selected_features()
[1] "flipper_length" "bill_length"    "island"        

The features selected by the model can be extracted by a Filter object, where $calculate() corresponds to training the learner on the given task:

filter = flt("selected_features", learner = learner)
filter$calculate(task)
as.data.table(filter)
          feature score
1:         island     1
2: flipper_length     1
3:    bill_length     1
4:     bill_depth     0
5:            sex     0
6:           year     0
7:      body_mass     0

In contrast to other filter methods, embedded methods simply return values of 1 (selected features) and 0 (dropped features).

5.1.4 Filter-based Feature Selection

After calculating a score for each feature, one has to select the features to be kept or those to be dropped from further modelling steps. For the "selected_features" filter described in embedded methods (Section 5.1.3), this step is straightforward since the method assigns either a value of 1 to a feature to be kept or 0 to a feature to be dropped. With task$select() the features with a value of 1 can be selected:

task = tsk("penguins")
learner = lrn("classif.rpart")
filter = flt("selected_features", learner = learner)
filter$calculate(task)

# select all features used by rpart
keep = names(which(filter$scores == 1))
task$select(keep)
task$feature_names
[1] "bill_length"    "flipper_length" "island"        
Tip

To select features, we use the function task$select() and not task$filter(), which is used to filter rows (not columns) of the data matrix, see task mutators (Section 2.1.3).

For filter methods which assign continuous scores, there are essentially two ways to select features:

  • select the top \(k\) features, or
  • select all features with a score above a threshold \(\tau\).

The first option is equivalent to dropping the bottom \(p-k\) features. For both options, one has to decide on a cutoff (\(k\) or \(\tau\)), which is often quite arbitrary. For example, to implement the first option with the information gain filter:

task = tsk("penguins")
filter = flt("information_gain")
filter$calculate(task)

# select top 3 features from information gain filter
keep = names(head(filter$scores, 3))
task$select(keep)
task$feature_names
[1] "bill_depth"     "bill_length"    "flipper_length"

Or, the second option with \(\tau = 0.5\):

task = tsk("penguins")
filter = flt("information_gain")
filter$calculate(task)

# select all features with score >0.5 from information gain filter
keep = names(which(filter$scores > 0.5))
task$select(keep)
task$feature_names
[1] "bill_depth"     "bill_length"    "flipper_length" "island"        

Filters can be integrated into pipelines. While pipelines are described in detail in Chapter 6, here is a brief preview:

library(mlr3pipelines)
task = tsk("penguins")

# combine filter (keep top 3 features) with learner
graph = po("filter", filter = flt("information_gain"), filter.nfeat = 3) %>>%
  po("learner", lrn("classif.rpart"))

# now it can be used as any learner, but it includes the feature selection
learner = as_learner(graph)
learner$train(task)

Pipelines can also be used to apply hyperparameter optimization (Chapter 4) to the filter, i.e. tune the filter threshold to optimize the feature selection regarding prediction performance. We first combine a filter with a learner,

graph = po("filter", filter = flt("information_gain")) %>>%
  po("learner", lrn("classif.rpart"))
learner = as_learner(graph)

and tune how many features to include:

library("mlr3tuning")
ps = ps(information_gain.filter.nfeat = p_int(lower = 1, upper = 7))
instance = TuningInstanceSingleCrit$new(
  task = task,
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  search_space = ps,
  terminator = trm("none")
)
tuner = tnr("grid_search")
tuner$optimize(instance)
   information_gain.filter.nfeat learner_param_vals  x_domain classif.acc
1:                             5          <list[2]> <list[1]>   0.9391304

The output above shows only the best result. To show the results of all tuning steps, retrieve them from the archive of the tuning instance:

as.data.table(instance$archive)
   information_gain.filter.nfeat classif.acc
1:                             2   0.9304348
2:                             5   0.9391304
3:                             1   0.7478261
4:                             7   0.9391304
5:                             3   0.9391304
6:                             6   0.9391304
7:                             4   0.9391304
7 variables not shown: [x_domain_information_gain.filter.nfeat, runtime_learners, timestamp, batch_nr, warnings, errors, resample_result]
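
The tuned value of nfeat could then be applied by setting the resulting parameter values on the graph learner and retraining it on the complete task; a minimal sketch:

# set the tuned filter size (and the other parameter values) and retrain
learner$param_set$values = instance$result_learner_param_vals
learner$train(task)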

We can also plot the tuning results:

autoplot(instance)

Figure 5.1: Model performance with different numbers of features, selected by an information gain filter. Adding a second and third feature improves performance, while adding further features yields no additional gain.

For more details, see Pipelines (Chapter 6) and Hyperparameter Optimization (Chapter 4).

5.2 Wrapper Methods

Wrapper methods work by fitting models on selected feature subsets and evaluating their performance. This can be done in a sequential fashion, e.g. by iteratively adding features to the model in the so-called sequential forward selection, or in a parallel fashion, e.g. by evaluating random feature subsets in a random search. Below, the use of these simple approaches is described in a common framework along with more advanced methods such as genetic search. It is further shown how to select features by optimizing multiple performance measures and how to wrap a learner with feature selection to use it in pipelines or benchmarks.

In more detail, wrapper methods iteratively select features that optimize a performance measure. Instead of ranking features, a model is fit on a selected subset of features in each iteration and evaluated in resampling with respect to a selected performance measure. The strategy that determines which feature subset is used in each iteration is given by the FSelector object. A simple example is the sequential forward selection that starts with computing each single-feature model, selects the best one, and then iteratively adds the feature that leads to the largest performance improvement. Wrapper methods can be used with any learner but need to train the learner potentially many times, leading to a computationally intensive method. All wrapper methods are implemented via the package mlr3fselect. In this chapter, we cover how to

  • instantiate an FSelector object,
  • configure it, e.g. to respect a runtime limit or to optimize different objectives,
  • run it or fuse it with a Learner via an AutoFSelector.
Note

Wrapper-based feature selection is very similar to hyperparameter optimization (Chapter 4). The major difference is that we search for well-performing feature subsets instead of hyperparameter configurations. We will see below that we can even use the same terminators, that some feature selection algorithms are similar to tuners, and that we can also optimize multiple performance measures with feature selection.

5.2.1 Simple Forward Selection Example

We start with the simple example from above and do sequential forward selection with the penguins data:

library("mlr3fselect")

# subset features to ease visualization
task = tsk("penguins")
task$select(c("bill_depth", "bill_length", "body_mass", "flipper_length"))

instance = fselect(
  fselector = fs("sequential"),
  task = task,
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc")
)

To show all analyzed feature subsets and the corresponding performance, we use as.data.table(instance$archive).

dt = as.data.table(instance$archive)
dt[batch_nr == 1, 1:5]
   bill_depth bill_length body_mass flipper_length classif.acc
1:       TRUE       FALSE     FALSE          FALSE   0.6956522
2:      FALSE        TRUE     FALSE          FALSE   0.7652174
3:      FALSE       FALSE      TRUE          FALSE   0.7043478
4:      FALSE       FALSE     FALSE           TRUE   0.7913043

We see that the feature flipper_length achieved the highest prediction performance in the first iteration and is thus selected. In the second round, adding bill_length improves performance to over 90%:

dt[batch_nr == 2, 1:5]
   bill_depth bill_length body_mass flipper_length classif.acc
1:       TRUE       FALSE     FALSE           TRUE   0.7652174
2:      FALSE        TRUE     FALSE           TRUE   0.9391304
3:      FALSE       FALSE      TRUE           TRUE   0.8173913

However, adding a third feature does not improve performance

dt[batch_nr == 3, 1:5]
   bill_depth bill_length body_mass flipper_length classif.acc
1:       TRUE        TRUE     FALSE           TRUE   0.9391304
2:      FALSE        TRUE      TRUE           TRUE   0.9391304

and the algorithm terminates. To directly show the best feature set, we can use:

instance$result_feature_set
[1] "bill_length"    "flipper_length"
Note

instance$result_feature_set shows features in alphabetical order and not in the order selected.

Internally, the fselect function creates an FSelectInstanceSingleCrit object and executes the feature selection with an FSelector object, based on the selected method, in this example an FSelectorSequential object. It uses the supplied resampling and measure to evaluate all feature subsets provided by the FSelector on the task.

At the heart of mlr3fselect are the R6 classes:

  • FSelectInstanceSingleCrit, FSelectInstanceMultiCrit: These two classes describe the feature selection problem and store the results.
  • FSelector: This class is the base class for implementations of feature selection algorithms.

In the following two sections, we will create these classes manually to learn more about the mlr3fselect package.

5.2.2 The FSelectInstance Classes

To create an FSelectInstanceSingleCrit object, we use the sugar function fsi, which is short for FSelectInstanceSingleCrit$new() or FSelectInstanceMultiCrit$new(), depending on the selected measure(s):

instance = fsi(
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  terminator = trm("evals", n_evals = 20)
)

Note that we have not selected a feature selection algorithm and thus did not select any features, yet. We have also supplied a so-called Terminator, which is used to stop the feature selection. For the forward selection in the example above, we did not need a terminator because it simply tried adding features until either the full model was reached or no further performance improvement was achieved. However, other feature selection algorithms such as random search require a terminator. Available terminators include, for example, stopping after a given number of evaluations (TerminatorEvals), after a given run time (TerminatorRunTime), when a given performance level is reached (TerminatorPerfReached), or when performance stagnates (TerminatorStagnation).

Above we used the sugar function trm to select TerminatorEvals with 20 evaluations.
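
Terminators can also be combined. As a sketch, stopping after 25 evaluations or 60 seconds of run time, whichever comes first (the specific budgets here are arbitrary):

# stop when either 25 evaluations are reached or 60 seconds have passed
terminator = trm("combo",
  terminators = list(
    trm("evals", n_evals = 25),
    trm("run_time", secs = 60)
  )
)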

To start the feature selection, we still need to select an algorithm. Algorithms are defined via the FSelector class, described in the next section.

5.2.3 The FSelector Class

The FSelector class is the base class for different feature selection algorithms. The following algorithms are currently implemented in mlr3fselect:

  • Random search, trying random feature subsets until termination (FSelectorRandomSearch)
  • Exhaustive search, trying all possible feature subsets (FSelectorExhaustiveSearch)
  • Sequential search, i.e. sequential forward or backward selection (FSelectorSequential)
  • Recursive feature elimination, which uses a learner’s importance scores to iteratively remove features with low importance (FSelectorRFE)
  • Design points, trying all user-supplied feature sets (FSelectorDesignPoints)
  • Genetic search, implementing a genetic algorithm which treats the features as a binary sequence and tries to find the best subset with mutations (FSelectorGeneticSearch)
  • Shadow variable search, which adds permuted copies of all features (shadow variables) and stops when a shadow variable is selected (FSelectorShadowVariableSearch)

In this example, we will use a simple random search and retrieve it from the dictionary mlr_fselectors with the fs() sugar function, which is short for FSelectorRandomSearch$new():

fselector = fs("random_search")
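
Most FSelectors have settings of their own, which can be passed during construction or changed via $param_set$values. As a brief sketch, a random search proposing five feature subsets per batch (the object name fselector_batched and the value 5 are arbitrary choices for illustration):

# random search evaluating 5 feature subsets per batch
fselector_batched = fs("random_search", batch_size = 5)
fselector_batched$param_set$values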

5.2.4 Starting the Feature Selection

To start the feature selection, we pass the FSelectInstanceSingleCrit object to the $optimize() method of the initialized FSelector object:

fselector$optimize(instance)

The algorithm proceeds as follows:

  1. The FSelector proposes at least one feature subset and may propose multiple subsets to improve parallelization, which can be controlled via the setting batch_size.
  2. For each feature subset, the given learner is fitted on the task using the provided resampling and evaluated with the given measure.
  3. All evaluations are stored in the archive of the FSelectInstanceSingleCrit object.
  4. The terminator is queried whether the budget is exhausted. If the budget is not exhausted, restart with 1) until it is.
  5. Determine the feature subset with the best observed performance.
  6. Store the best feature subset as the result in the instance object.

The best feature subset and the corresponding measured performance can be accessed from the instance:

as.data.table(instance$result)[, .(features, classif.acc)]
                                                         features classif.acc
1: bill_depth,bill_length,body_mass,flipper_length,island,sex,...   0.9391304

As in the forward selection example above, one can investigate all resamplings which were undertaken, as they are stored in the archive of the FSelectInstanceSingleCrit object and can be accessed by using as.data.table():

as.data.table(instance$archive)[, .(bill_depth, bill_length, body_mass, classif.acc)]
    bill_depth bill_length body_mass classif.acc
 1:       TRUE        TRUE      TRUE   0.9391304
 2:      FALSE       FALSE      TRUE   0.7391304
 3:       TRUE        TRUE      TRUE   0.9391304
 4:       TRUE        TRUE      TRUE   0.9391304
 5:       TRUE        TRUE      TRUE   0.9391304
 6:      FALSE       FALSE     FALSE   0.6869565
 7:       TRUE       FALSE      TRUE   0.8086957
 8:      FALSE       FALSE      TRUE   0.7391304
 9:       TRUE       FALSE     FALSE   0.7739130
10:       TRUE       FALSE     FALSE   0.8000000
11:      FALSE       FALSE     FALSE   0.8086957
12:      FALSE       FALSE      TRUE   0.6869565
13:       TRUE        TRUE      TRUE   0.9391304
14:      FALSE       FALSE     FALSE   0.6173913
15:       TRUE        TRUE      TRUE   0.9217391
16:       TRUE        TRUE      TRUE   0.9391304
17:       TRUE        TRUE      TRUE   0.9043478
18:      FALSE       FALSE      TRUE   0.7391304
19:      FALSE       FALSE     FALSE   0.7739130
20:      FALSE       FALSE     FALSE   0.8086957

Now the optimized feature subset can be used to subset the task and fit the model on all observations:

task = tsk("penguins")
learner = lrn("classif.rpart")

task$select(instance$result_feature_set)
learner$train(task)

The trained model can now be used to make a prediction on external data.
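
As a sketch, a prediction for a single made-up penguin could look like the following; the new_penguin data.frame and its values are purely hypothetical and rely on mlr3 converting the columns to the feature types seen during training:

# hypothetical new observation with the same feature columns as the task
new_penguin = data.frame(
  bill_depth = 18.0, bill_length = 44.0, body_mass = 4100L,
  flipper_length = 200L, island = "Biscoe", sex = "female", year = 2009L
)
learner$predict_newdata(new_penguin)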

Warning

Predicting on observations present in the task used for feature selection should be avoided. The model has seen these observations already during feature selection and therefore performance evaluation results would be over-optimistic. Instead, to get unbiased performance estimates for the current task, nested resampling (see Section 5.2.6 and Section 4.4) is required.

5.2.5 Optimizing Multiple Performance Measures

You might want to use multiple criteria to evaluate the performance of the feature subsets. For example, you might want to select the subset with the highest classification accuracy and the lowest time to train the model. However, these two criteria will generally not be optimized by the same subset, i.e. the subset with the highest classification accuracy will probably differ from the one with the lowest training time. With mlr3fselect, the result is the set of Pareto-optimal solutions, i.e. the feature subsets that are not dominated by any other subset. For the example with classification accuracy and training time, a feature subset that is best in both accuracy and training time dominates all other subsets and is thus the only Pareto-optimal solution. If, however, different subsets are best in the two criteria, both subsets are Pareto-optimal.

We will expand the previous example and perform feature selection on the penguins dataset, however, this time we will use FSelectInstanceMultiCrit to select the subset of features that has the highest classification accuracy and the one with the lowest time to train the model.

The feature selection process with multiple criteria is similar to that with a single criterion, except that we select two measures to be optimized:

instance = fsi(
  task = tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msrs(c("classif.acc", "time_train")),
  terminator = trm("evals", n_evals = 5)
)

The function fsi creates an instance of FSelectInstanceMultiCrit if more than one measure is selected. We now create an FSelector and call the $optimize() function of the FSelector with the FSelectInstanceMultiCrit object, to search for the subset of features with the best classification accuracy and time to train the model. This time, we use design points to manually specify two feature sets to try: one with only the feature sex and one with all features except island, sex and year. We expect the sex-only model to train fast and the model including many features to be accurate.

design = mlr3misc::rowwise_table(
  ~bill_depth, ~bill_length, ~body_mass, ~flipper_length, ~island, ~sex, ~year,
  FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
  TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE
)

fselector = fs("design_points", design = design)
fselector$optimize(instance)

As above, the best feature subset and the corresponding measured performance can be accessed from the instance. However, in this simple case, if the fastest subset is not also the best performing subset, the result consists of two subsets: one with the lowest training time and one with the best classification accuracy:

as.data.table(instance$result)[, .(features, classif.acc, time_train)]
                                          features classif.acc time_train
1:                                             sex   0.4347826      0.003
2: bill_depth,bill_length,body_mass,flipper_length   0.9304348      0.004

As explained above, these are the Pareto-optimal solutions.
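
If a single subset is needed for further modeling, one could, for example, pick the Pareto-optimal entry with the highest accuracy; a minimal sketch:

res = as.data.table(instance$result)

# choose the Pareto-optimal feature subset with the highest accuracy
best_features = res$features[[which.max(res$classif.acc)]]
best_features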

5.2.6 Automating the Feature Selection

The AutoFSelector class wraps a learner and augments it with an automatic feature selection for a given task. Because the AutoFSelector itself inherits from the Learner base class, it can be used like any other learner. Below, a new learner is created. This learner is then wrapped in a random search feature selector, which automatically starts a feature selection on the given task using an inner resampling, as soon as the wrapped learner is trained. Here, the function auto_fselector creates an instance of AutoFSelector, i.e. it is short for AutoFSelector$new().

at = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.log_reg"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  terminator = trm("evals", n_evals = 10)
)
at
<AutoFSelector:classif.log_reg.fselector>
* Model: list
* Packages: mlr3, mlr3fselect, mlr3learners, stats
* Predict Type: response
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: loglik, twoclass

We can now, as with any other learner, call the $train() and $predict() methods. This time, however, we pass it to benchmark() to compare the optimized feature subset to the complete feature set. This way, the AutoFSelector will do its resampling for feature selection on the training set of the respective split of the outer resampling. The learner then makes predictions using the test set of the outer resampling. Here, the outer resampling refers to the resampling specified in benchmark(), whereas the inner resampling is the one specified in auto_fselector(). This is called nested resampling (Section 4.4) and yields unbiased performance measures, as the observations in the test set have not been used during feature selection or fitting of the respective learner.

In the call to benchmark(), we compare our wrapped learner at with a plain logistic regression lrn("classif.log_reg"). For this, we create a benchmark grid with the sonar task, the two learners and a 3-fold cross-validation.

grid = benchmark_grid(
  task = tsk("sonar"),
  learner = list(at, lrn("classif.log_reg")),
  resampling = rsmp("cv", folds = 3)
)

bmr = benchmark(grid)

Now, we compare those two learners regarding classification accuracy and training time:

aggr = bmr$aggregate(msrs(c("classif.acc", "time_train")))
as.data.table(aggr)[, .(learner_id, classif.acc, time_train)]
                  learner_id classif.acc time_train
1: classif.log_reg.fselector   0.7351967 1.05400000
2:           classif.log_reg   0.6585231 0.03566667

We can see that, in this example, the feature selection improves prediction performance but also drastically increases the training time, since the feature selection (including resampling and random search) is part of the model training of the wrapped learner.
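
To inspect which features were selected on each outer training set, mlr3fselect provides extract_inner_fselect_results(), which collects the inner feature selection results from the benchmark result; a brief sketch:

# inner feature selection results for each outer resampling iteration
extract_inner_fselect_results(bmr)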

5.3 Conclusion

In this chapter, we learned how to perform feature selection with mlr3. We introduced filter and wrapper methods, combined feature selection with pipelines, learned how to automate the feature selection and covered the optimization of multiple performance measures. Table 5.1 gives an overview of the most important functions (S3) and classes (R6) used in this chapter.

Table 5.1: Core S3 ‘sugar’ functions for feature selection in mlr3, the underlying R6 class that is constructed when these functions are called (if applicable), and a summary of the purpose of the functions.

  • flt() (R6 class Filter): selects features by calculating a score for each feature
  • Filter$calculate() (method of Filter): calculates scores on a given task
  • fselect() (R6 class FSelectInstanceSingleCrit or FSelectInstanceMultiCrit): specifies a feature selection problem and stores the results
  • fs() (R6 class FSelector): specifies a feature selection algorithm
  • FSelector$optimize() (method of FSelector): executes the feature selection specified by the FSelectInstance with the algorithm specified by the FSelector
  • auto_fselector() (R6 class AutoFSelector): defines a learner that includes feature selection

Resources

  • A list of implemented filters in the mlr3filters package is provided on the mlr3filters website.
  • A summary of wrapper-based feature selection with the mlr3fselect package is provided in the mlr3fselect cheatsheet.
  • An overview of feature selection methods is provided by Chandrashekar and Sahin (2014).
  • A more formal and detailed introduction to filters and wrappers is given in Guyon and Elisseeff (2003).
  • Bommert et al. (2020) perform a benchmark of filter methods.
  • Filters can be used as part of a machine learning pipeline (Chapter 6).
  • Filters can be optimized with hyperparameter optimization (Chapter 4).

5.4 Exercises

  1. Calculate a correlation filter on the Motor Trend data set (mtcars).
  2. Use the filter from the first exercise to select the five best features in the mtcars data set.
  3. Apply a backward selection to the penguins data set with a classification tree learner "classif.rpart" and holdout resampling by the measure classification accuracy. Compare the results with those in Section 5.2.1. Answer the following questions:
    1. Do the selected features differ?
    2. Which feature selection method achieves a higher classification accuracy?
    3. Are the accuracy values in b) directly comparable? If not, what has to be changed to make them comparable?
  4. Automate the feature selection as in Section 5.2.6 with the spam data set and a logistic regression learner ("classif.log_reg"). Hint: Remember to call library("mlr3learners") for the logistic regression learner.