8.2 Feature Selection / Filtering

Data sets often include a large number of features. The technique of extracting a subset of relevant features is called “feature selection”. Feature selection can enhance the interpretability of the model, speed up the learning process and improve the performance of the learner. In the literature, two different approaches exist to identify relevant features: one is called “filtering” and the other is often referred to as “feature subset selection” or “wrapper methods”.

What are the differences (Chandrashekar and Sahin 2014)?

  • Filter: An external algorithm computes a rank of the features (e.g. based on the correlation to the response). Then, features are subsetted by a certain criterion, e.g. an absolute number or a percentage of the number of features. The selected features are then used to fit a model (with optional hyperparameters selected by tuning). This calculation is usually cheaper than “feature subset selection” in terms of computation time; a minimal sketch follows after this list.
  • Feature subset selection: Here, no ranking of features is done. Instead, (random) subsets of the features are selected, a model is fit on each subset and its performance is evaluated. This is done for many feature combinations in a CV setting and the best combination is reported. This method is computationally very intensive, as many models are fitted. Also, strictly speaking, all these models would need to be tuned before the performance is estimated, which would require an additional nested level in a CV setting. Afterwards, a final model is fit on the selected feature subset (with optional hyperparameters selected by tuning). A toy sketch is given in the wrapper methods section below.

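To make the filter approach concrete, here is a minimal sketch in plain base R (independent of the mlr3featsel API): features are ranked by their absolute correlation with a numeric response, and a fixed number of top-ranked features is kept.

# toy data: a numeric response and three numeric features
set.seed(1)
data = data.frame(
  y  = rnorm(100),
  x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100)
)

# filter step: rank features by absolute correlation with the response
scores = sapply(data[, c("x1", "x2", "x3")], function(x) abs(cor(x, data$y)))
ranking = sort(scores, decreasing = TRUE)

# selection step: keep the two top-ranked features
selected = names(ranking)[1:2]
selected
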
There is also a third approach which can be attributed to the “filter” family: the embedded feature-selection methods of some Learners. Read more about how to use these in the section on embedded methods below.

Ensemble filters build upon the idea of stacking single filter methods. These are not yet implemented.

All feature-selection-related functionality is implemented via the extension package mlr3featsel.

8.2.1 Filters

Filter methods assign an importance value to each feature. Based on these values the features can be ranked and a feature subset can be selected. There is a list of all implemented filter methods in the Appendix.

8.2.1.1 Calculating the feature importance

Currently, only classification and regression tasks are supported.

The first step is to create a new R object using the class of the desired filter method. Each object of class Filter has a .$calculate() method which calculates the ranking of the features. This method requires a Task and returns the calculated values for each feature.

filter = FilterJMIM$new()

task = mlr3::mlr_tasks$get("iris")
filter$calculate(task)
as.data.table(filter)
##            name  value
## 1: Sepal.Length 1.0401
## 2:  Petal.Width 0.9894
## 3: Petal.Length 0.9881
## 4:  Sepal.Width 0.8314

The combination of single filter results is not yet supported.

8.2.1.2 Selecting a feature subset

Instead of calculating the raw filter values, you can directly subset the task by using the member functions .$filter_abs(), .$filter_perc() and .$filter_thresh().

  • .$filter_abs(): Keep a certain absolute number (abs) of features with highest importance.
  • .$filter_perc(): Keep a certain percentage (perc) of features with highest importance.
  • .$filter_thresh(): Keep all features whose importance exceeds a certain threshold value (threshold).

filter$filter_abs(task, 2)
task$feature_names
## [1] "Petal.Width"  "Sepal.Length"
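
The other two member functions work analogously. A short usage sketch (assuming perc is given as a percentage; a fresh task is fetched first, since the call above modified task in place):

task = mlr3::mlr_tasks$get("iris")
filter$calculate(task)

# keep the top 50% of features by importance;
# filter$filter_thresh(task, 0.9) would instead keep features scoring above 0.9
filter$filter_perc(task, 50)
task$feature_names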

8.2.2 Wrapper Methods

Work in progress :)
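
Until then, the general idea described at the beginning of this section can be illustrated with a toy sketch in plain base R (not part of the mlr3 API): every candidate feature subset is evaluated with a model fit, and the best-performing combination is kept. A single holdout split stands in for the CV setting described above.

set.seed(1)

# all non-empty subsets of four candidate features of mtcars
features = c("cyl", "disp", "hp", "wt")
subsets = unlist(lapply(seq_along(features),
  function(k) combn(features, k, simplify = FALSE)), recursive = FALSE)

# evaluate each subset: fit a linear model on the training split,
# measure RMSE on the holdout split
train = sample(nrow(mtcars), 22)
rmse = sapply(subsets, function(fs) {
  fit = lm(reformulate(fs, response = "mpg"), data = mtcars[train, ])
  pred = predict(fit, newdata = mtcars[-train, ])
  sqrt(mean((mtcars$mpg[-train] - pred)^2))
})

# best-performing feature combination
subsets[[which.min(rmse)]]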

8.2.3 Embedded Methods

All Learners with the property “importance” come with integrated feature selection methods.

You can find a list of all learners with this property in the Appendix.

For some learners the desired filter method needs to be set during learner creation. For example, learner classif.ranger comes with multiple integrated methods. See the help page of ranger::ranger. To use method “impurity”, you need to set it via the param_vals argument:

lrn = mlr_learners$get("classif.ranger",
  param_vals = list(importance = "impurity"))

Now you can use the mlr3featsel::FilterVariableImportance class to filter a Task with such algorithm-embedded methods.

task = mlr_tasks$get("iris")
filter = FilterVariableImportance$new(learner = lrn)
filter$calculate(task)
head(as.data.table(filter), 3)
##            name  value
## 1:  Petal.Width 45.119
## 2: Petal.Length 41.934
## 3: Sepal.Length  9.754
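
Since FilterVariableImportance is a regular Filter object, the subsetting member functions shown above should apply here as well (a short sketch, reusing the values just calculated):

# keep the two features ranked highest by the embedded importance
filter$filter_abs(task, 2)
task$feature_names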

8.2.4 Ensemble Methods

Work in progress :)
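
Although ensemble filters are not yet implemented, the underlying stacking idea (see above) can be sketched in plain base R: compute several single filter scores and aggregate their rankings, here via a simple mean rank. This is purely illustrative and not part of the mlr3featsel API.

set.seed(1)

# toy data and two simple single-filter scores
x = data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
y = x$x1 + rnorm(100)
score_cor = sapply(x, function(v) abs(cor(v, y)))  # correlation filter
score_var = sapply(x, var)                         # variance filter

# aggregate: mean rank across both filters (rank 1 = best)
mean_rank = (rank(-score_cor) + rank(-score_var)) / 2
sort(mean_rank)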

References

Chandrashekar, Girish, and Ferat Sahin. 2014. “A Survey on Feature Selection Methods.” Computers & Electrical Engineering 40 (1): 16–28. https://doi.org/10.1016/j.compeleceng.2013.11.024.