3.5 Feature Selection / Filtering
Often, data sets include a large number of features. The technique of extracting a subset of relevant features is called “feature selection.”
The objective of feature selection is to find the subset of available features on which the model can be fitted in the most suitable manner. Feature selection can enhance the interpretability of the model, speed up the learning process and improve the learner's performance. Different approaches exist to identify the relevant features. Two approaches are emphasized in the literature: one is called filtering and the other is often referred to as feature subset selection or wrapper methods.
What are the differences (Guyon and Elisseeff 2003; Chandrashekar and Sahin 2014)?
Filtering: An external algorithm computes a rank of the features (e.g. based on the correlation to the response). Then, features are subsetted by a certain criterion, e.g. an absolute number or a percentage of the number of variables. The selected features will then be used to fit a model (with optional hyperparameters selected by tuning). This calculation is usually cheaper than "feature subset selection" in terms of computation time. All filters are connected via the package mlr3filters.
Wrapper Methods: Here, no ranking of features is done. Instead, an optimization algorithm selects a subset of the features, evaluates the set by calculating the resampled predictive performance, and then proposes a new set of features (or terminates). A simple example is sequential forward selection. This method is usually computationally very intensive, as a lot of models are fitted. Also, strictly speaking, all these models would need to be tuned before the performance is estimated, which would require an additional nested level in a CV setting. After undertaking all of these steps, the model is fitted again on the final set of selected features (with optional hyperparameters selected by tuning). Wrapper methods are implemented in the mlr3fselect package.
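To make the wrapper idea concrete, here is a minimal hand-rolled sketch of sequential forward selection using plain mlr3 resampling. It is purely illustrative (no tuning, no nested resampling) and is not the mlr3fselect implementation used later in this section; the task, learner and resampling below are chosen arbitrarily.

library("mlr3")

# hand-rolled greedy forward selection, only to illustrate the wrapper idea;
# in practice the algorithms from mlr3fselect (see below) should be used
task = tsk("pima")
learner = lrn("classif.rpart")
resampling = rsmp("cv", folds = 3)
measure = msr("classif.ce")

selected = character(0)
candidates = task$feature_names
best_score = Inf

repeat {
  if (length(candidates) == 0) break
  # resample the learner once for every candidate feature added to the current set
  scores = sapply(candidates, function(feature) {
    subtask = task$clone()
    subtask$select(c(selected, feature))
    unname(resample(subtask, learner, resampling)$aggregate(measure))
  })
  if (min(scores) >= best_score) break  # stop when no candidate improves the error
  best_score = min(scores)
  best = names(which.min(scores))
  selected = c(selected, best)
  candidates = setdiff(candidates, best)
}
selected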
Embedded Methods: Many learners internally select a subset of the features which they find helpful for prediction. These subsets can usually be queried, as the following example demonstrates:
= tsk("iris") task = lrn("classif.rpart") learner # ensure that the learner selects features stopifnot("selected_features" %in% learner$properties) # fit a simple classification tree = learner$train(task) learner # extract all features used in the classification tree: $selected_features() learner
## [1] "Petal.Length" "Petal.Width"
There are also Ensemble filters built upon the idea of stacking single filter methods. These are not yet implemented.
3.5.1 Filters
Filter methods assign an importance value to each feature. Based on these values the features can be ranked. Thereafter, we are able to select a feature subset. There is a list of all implemented filter methods in the Appendix.
3.5.2 Calculating filter values
Currently, only classification and regression tasks are supported.
The first step is to create a new R object using the class of the desired filter method. Each object of class Filter has a $calculate() method which computes the filter values and ranks them in descending order.
library("mlr3filters")
filter = FilterJMIM$new()

task = tsk("iris")
filter$calculate(task)
as.data.table(filter)
## feature score
## 1: Petal.Width 1.0000
## 2: Sepal.Length 0.6667
## 3: Petal.Length 0.3333
## 4: Sepal.Width 0.0000
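The computed scores can be used directly to keep only the highest-ranked features. A short sketch, assuming the $scores field of the filter (a named numeric vector sorted in decreasing order) and an arbitrary cutoff of two features:

# keep the two highest-ranked features and subset a copy of the task
top2 = head(names(filter$scores), 2)
task_top2 = task$clone()
task_top2$select(top2)
task_top2$feature_names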
Some filters support changing specific hyperparameters. This is similar to setting hyperparameters of a Learner using $param_set$values:
filter_cor = FilterCorrelation$new()
filter_cor$param_set
## <ParamSet>
## id class lower upper
## 1: use ParamFct NA NA
## 2: method ParamFct NA NA
## levels
## 1: everything,all.obs,complete.obs,na.or.complete,pairwise.complete.obs
## 2: pearson,kendall,spearman
## default value
## 1: everything
## 2: pearson
# change parameter 'method'
filter_cor$param_set$values = list(method = "spearman")
filter_cor$param_set
## <ParamSet>
## id class lower upper
## 1: use ParamFct NA NA
## 2: method ParamFct NA NA
## levels
## 1: everything,all.obs,complete.obs,na.or.complete,pairwise.complete.obs
## 2: pearson,kendall,spearman
## default value
## 1: everything
## 2: pearson spearman
Rather than taking the “long” R6 way to create a filter, there is also a built-in shorthand notation for filter creation:
= flt("cmim")
filter filter
## <FilterCMIM:cmim>
## Task Types: classif, regr
## Task Properties: -
## Packages: praznik
## Feature types: integer, numeric, factor, ordered
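Hyperparameters can presumably also be passed directly to the flt() shorthand, analogous to lrn(); a brief sketch for the correlation filter from above:

# set the 'method' hyperparameter while constructing the filter
filter_spearman = flt("correlation", method = "spearman")
filter_spearman$param_set$values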
3.5.3 Variable Importance Filters
All Learners with the property "importance" come with integrated feature selection methods. You can find a list of all learners with this property in the Appendix.

For some learners the desired filter method needs to be set during learner creation. For example, the learner classif.ranger (in the package mlr3learners) comes with multiple integrated methods; see the help page of ranger::ranger. To use the method "impurity", you need to set the filter method during construction.
library("mlr3learners")
= lrn("classif.ranger", importance = "impurity") lrn
Now you can use the mlr3filters::FilterImportance class for algorithm-embedded methods to filter a Task.
library("mlr3learners")
= tsk("iris")
task = flt("importance", learner = lrn)
filter $calculate(task)
filterhead(as.data.table(filter), 3)
## feature score
## 1: Petal.Width 45.405
## 2: Petal.Length 41.951
## 3: Sepal.Length 9.551
3.5.4 Wrapper Methods
Wrapper feature selection is supported via the mlr3fselect extension package. At the heart of mlr3fselect are the R6 classes:

- FSelectInstanceSingleCrit, FSelectInstanceMultiCrit: These two classes describe the feature selection problem and store the results.
- FSelector: This class is the base class for implementations of feature selection algorithms.
3.5.5 The FSelectInstance Classes
The following sub-section examines feature selection on the Pima data set, which is used to predict whether or not a patient has diabetes.
= tsk("pima")
task print(task)
## <TaskClassif:pima> (768 x 9)
## * Target: diabetes
## * Properties: twoclass
## * Features (8):
## - dbl (8): age, glucose, insulin, mass, pedigree, pregnant, pressure,
## triceps
We use the classification tree from rpart.
= lrn("classif.rpart") learner
Next, we need to specify how to evaluate the performance of the feature subsets.
For this, we need to choose a resampling strategy and a performance measure.
= rsmp("holdout")
hout = msr("classif.ce") measure
Finally, one has to choose the available budget for the feature selection.
This is done by selecting one of the available Terminators:

- Terminate after a given time (TerminatorClockTime)
- Terminate after a given amount of iterations (TerminatorEvals)
- Terminate after a specific performance is reached (TerminatorPerfReached)
- Terminate when feature selection does not improve (TerminatorStagnation)
- A combination of the above in an ALL or ANY fashion (TerminatorCombo)
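As a sketch of the last option, terminators could be combined as follows; the "perf_reached" key, its level parameter and the any flag are assumptions about the bbotk API and may differ between versions:

library("bbotk")

# stop after 20 evaluations OR once a classification error of 0.25 is reached,
# whichever happens first ("ANY" fashion)
combo = TerminatorCombo$new(list(
  trm("evals", n_evals = 20),
  trm("perf_reached", level = 0.25)
))
combo$param_set$values$any = TRUE
combo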
For this short introduction, we specify a budget of 20 evaluations and then put everything together into a FSelectInstanceSingleCrit:
library("mlr3fselect")
= trm("evals", n_evals = 20)
evals20
= FSelectInstanceSingleCrit$new(
instance task = task,
learner = learner,
resampling = hout,
measure = measure,
terminator = evals20
) instance
## <FSelectInstanceSingleCrit>
## * State: Not optimized
## * Objective: <ObjectiveFSelect:classif.rpart_on_pima>
## * Search Space:
## <ParamSet>
## id class lower upper levels default value
## 1: age ParamLgl NA NA TRUE,FALSE <NoDefault[3]>
## 2: glucose ParamLgl NA NA TRUE,FALSE <NoDefault[3]>
## 3: insulin ParamLgl NA NA TRUE,FALSE <NoDefault[3]>
## 4: mass ParamLgl NA NA TRUE,FALSE <NoDefault[3]>
## 5: pedigree ParamLgl NA NA TRUE,FALSE <NoDefault[3]>
## 6: pregnant ParamLgl NA NA TRUE,FALSE <NoDefault[3]>
## 7: pressure ParamLgl NA NA TRUE,FALSE <NoDefault[3]>
## 8: triceps ParamLgl NA NA TRUE,FALSE <NoDefault[3]>
## * Terminator: <TerminatorEvals>
## * Terminated: FALSE
## * Archive:
## <ArchiveFSelect>
## Null data.table (0 rows and 0 cols)
To start the feature selection, we still need to select an algorithm. These algorithms are defined via the FSelector class.
3.5.6 The FSelector Class
The following algorithms are currently implemented in mlr3fselect:
- Random Search (FSelectorRandomSearch)
- Exhaustive Search (FSelectorExhaustiveSearch)
- Sequential Search (FSelectorSequential)
- Recursive Feature Elimination (FSelectorRFE)
- Design Points (FSelectorDesignPoints)
In this example, we will use a simple random search.
= fs("random_search") fselector
3.5.7 Triggering the Feature Selection
To start the feature selection, we simply pass the FSelectInstanceSingleCrit to the $optimize() method of the initialized FSelector. The algorithm proceeds as follows:
1. The FSelector proposes at least one feature subset and may propose multiple subsets to improve parallelization, which can be controlled via the setting batch_size (a short sketch follows after this list).
2. For each feature subset, the given Learner is fitted on the Task using the provided Resampling. All evaluations are stored in the archive of the FSelectInstanceSingleCrit.
3. The Terminator is queried if the budget is exhausted. If the budget is not exhausted, restart with 1) until it is.
4. Determine the feature subset with the best observed performance.
5. Store the best feature subset as the result in the instance object.
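The batch size could, for example, be set when constructing the FSelector; the batch_size parameter of FSelectorRandomSearch is assumed here:

# propose 5 feature subsets per batch (evaluated together, e.g. in parallel)
fselector_batched = fs("random_search", batch_size = 5)
fselector_batched$param_set$values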
The best feature subset ($result_feature_set) and the corresponding measured performance ($result_y) can be accessed from the instance.
fselector$optimize(instance)
## INFO [20:21:27.177] [bbotk] Starting to optimize 8 parameter(s) with '<FSelectorRandomSearch>' and '<TerminatorEvals> [n_evals=20]'
## INFO [20:21:27.361] [bbotk] Evaluating 10 configuration(s)
## INFO [20:21:29.777] [bbotk] Result of batch 1:
## INFO [20:21:29.779] [bbotk] age glucose insulin mass pedigree pregnant pressure triceps classif.ce
## INFO [20:21:29.779] [bbotk] TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE 0.2969
## INFO [20:21:29.779] [bbotk] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE 0.3008
## INFO [20:21:29.779] [bbotk] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE 0.3047
## INFO [20:21:29.779] [bbotk] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE 0.3555
## INFO [20:21:29.779] [bbotk] FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE 0.3125
## INFO [20:21:29.779] [bbotk] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 0.3594
## INFO [20:21:29.779] [bbotk] FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE 0.2656
## INFO [20:21:29.779] [bbotk] FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE 0.3555
## INFO [20:21:29.779] [bbotk] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 0.2930
## INFO [20:21:29.779] [bbotk] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE 0.3828
## INFO [20:21:29.779] [bbotk] uhash
## INFO [20:21:29.779] [bbotk] 2e8dd8cb-c9e9-492d-b366-0b62a996eb0b
## INFO [20:21:29.779] [bbotk] 92de7559-3e19-4ccf-8982-026010399c67
## INFO [20:21:29.779] [bbotk] 7700c348-70c3-4408-a746-cf42b3d3acf9
## INFO [20:21:29.779] [bbotk] b902f323-3d82-4efe-ada3-79f2be3fc9ab
## INFO [20:21:29.779] [bbotk] 7d11db1d-f27d-4f5e-b296-db4e24f3388a
## INFO [20:21:29.779] [bbotk] f63f618b-3d1c-4f0a-b982-61cad21ba7dc
## INFO [20:21:29.779] [bbotk] f653577c-50ad-44da-b11a-aaba3032c8d2
## INFO [20:21:29.779] [bbotk] 0354ac45-7c78-4aa2-9ad2-f360307e87a5
## INFO [20:21:29.779] [bbotk] 5490ea47-a03c-4f54-aaee-fb8062f6a762
## INFO [20:21:29.779] [bbotk] 3f0f1d50-aa17-45a6-bbfd-eaa7e277838e
## INFO [20:21:29.781] [bbotk] Evaluating 10 configuration(s)
## INFO [20:21:32.039] [bbotk] Result of batch 2:
## INFO [20:21:32.041] [bbotk] age glucose insulin mass pedigree pregnant pressure triceps classif.ce
## INFO [20:21:32.041] [bbotk] TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE 0.3008
## INFO [20:21:32.041] [bbotk] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE 0.2891
## INFO [20:21:32.041] [bbotk] TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE 0.3008
## INFO [20:21:32.041] [bbotk] TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE 0.3086
## INFO [20:21:32.041] [bbotk] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 0.2617
## INFO [20:21:32.041] [bbotk] TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE 0.2969
## INFO [20:21:32.041] [bbotk] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE 0.4062
## INFO [20:21:32.041] [bbotk] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE 0.3320
## INFO [20:21:32.041] [bbotk] TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE 0.3359
## INFO [20:21:32.041] [bbotk] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 0.3594
## INFO [20:21:32.041] [bbotk] uhash
## INFO [20:21:32.041] [bbotk] 357c99e8-ea9e-4ddd-9bcc-cb4557faf910
## INFO [20:21:32.041] [bbotk] a083e3bc-04be-4909-b795-b1f8690e1935
## INFO [20:21:32.041] [bbotk] d8cd43b2-b8c8-44df-b94d-d239d7485870
## INFO [20:21:32.041] [bbotk] 10947ed0-e99c-4f6d-9ec9-0dae6e3806ab
## INFO [20:21:32.041] [bbotk] b4098d04-55b9-4684-ba42-da2a3df26aae
## INFO [20:21:32.041] [bbotk] 022601e4-360f-4510-8424-1961c6042c38
## INFO [20:21:32.041] [bbotk] 1655ecaf-abb8-4a68-a4c6-894dc98cfaae
## INFO [20:21:32.041] [bbotk] b4c69b2d-2699-48aa-ac6e-40d3786c16a6
## INFO [20:21:32.041] [bbotk] e0371776-84a1-49aa-b9ab-99534177adb1
## INFO [20:21:32.041] [bbotk] 6323f37a-c9d3-49e0-9a5c-14332be2dffd
## INFO [20:21:32.048] [bbotk] Finished optimizing after 20 evaluation(s)
## INFO [20:21:32.049] [bbotk] Result:
## INFO [20:21:32.051] [bbotk] age glucose insulin mass pedigree pregnant pressure triceps
## INFO [20:21:32.051] [bbotk] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## INFO [20:21:32.051] [bbotk] features x_domain classif.ce
## INFO [20:21:32.051] [bbotk] glucose,insulin,mass,pedigree,pregnant,pressure,... <list[8]> 0.2617
## age glucose insulin mass pedigree pregnant pressure triceps
## 1: FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## features x_domain classif.ce
## 1: glucose,insulin,mass,pedigree,pregnant,pressure,... <list[8]> 0.2617
instance$result_feature_set
## [1] "glucose" "insulin" "mass" "pedigree" "pregnant" "pressure" "triceps"
instance$result_y
## classif.ce
## 0.2617
One can investigate all resamplings which were undertaken, as they are stored in the archive of the FSelectInstanceSingleCrit and can be accessed by using as.data.table():
as.data.table(instance$archive)
## age glucose insulin mass pedigree pregnant pressure triceps classif.ce
## 1: TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE 0.2969
## 2: TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE 0.3008
## 3: FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE 0.3047
## 4: TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE 0.3555
## 5: FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE 0.3125
## 6: FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 0.3594
## 7: FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE 0.2656
## 8: FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE 0.3555
## 9: TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 0.2930
## 10: FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE 0.3828
## 11: TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE 0.3008
## 12: FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE 0.2891
## 13: TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE 0.3008
## 14: TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE 0.3086
## 15: FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 0.2617
## 16: TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE 0.2969
## 17: FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE 0.4062
## 18: FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE 0.3320
## 19: TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE 0.3359
## 20: FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 0.3594
## uhash timestamp batch_nr
## 1: 2e8dd8cb-c9e9-492d-b366-0b62a996eb0b 2021-02-23 20:21:29 1
## 2: 92de7559-3e19-4ccf-8982-026010399c67 2021-02-23 20:21:29 1
## 3: 7700c348-70c3-4408-a746-cf42b3d3acf9 2021-02-23 20:21:29 1
## 4: b902f323-3d82-4efe-ada3-79f2be3fc9ab 2021-02-23 20:21:29 1
## 5: 7d11db1d-f27d-4f5e-b296-db4e24f3388a 2021-02-23 20:21:29 1
## 6: f63f618b-3d1c-4f0a-b982-61cad21ba7dc 2021-02-23 20:21:29 1
## 7: f653577c-50ad-44da-b11a-aaba3032c8d2 2021-02-23 20:21:29 1
## 8: 0354ac45-7c78-4aa2-9ad2-f360307e87a5 2021-02-23 20:21:29 1
## 9: 5490ea47-a03c-4f54-aaee-fb8062f6a762 2021-02-23 20:21:29 1
## 10: 3f0f1d50-aa17-45a6-bbfd-eaa7e277838e 2021-02-23 20:21:29 1
## 11: 357c99e8-ea9e-4ddd-9bcc-cb4557faf910 2021-02-23 20:21:32 2
## 12: a083e3bc-04be-4909-b795-b1f8690e1935 2021-02-23 20:21:32 2
## 13: d8cd43b2-b8c8-44df-b94d-d239d7485870 2021-02-23 20:21:32 2
## 14: 10947ed0-e99c-4f6d-9ec9-0dae6e3806ab 2021-02-23 20:21:32 2
## 15: b4098d04-55b9-4684-ba42-da2a3df26aae 2021-02-23 20:21:32 2
## 16: 022601e4-360f-4510-8424-1961c6042c38 2021-02-23 20:21:32 2
## 17: 1655ecaf-abb8-4a68-a4c6-894dc98cfaae 2021-02-23 20:21:32 2
## 18: b4c69b2d-2699-48aa-ac6e-40d3786c16a6 2021-02-23 20:21:32 2
## 19: e0371776-84a1-49aa-b9ab-99534177adb1 2021-02-23 20:21:32 2
## 20: 6323f37a-c9d3-49e0-9a5c-14332be2dffd 2021-02-23 20:21:32 2
## x_domain_age x_domain_glucose x_domain_insulin x_domain_mass
## 1: TRUE TRUE FALSE TRUE
## 2: TRUE FALSE TRUE TRUE
## 3: FALSE FALSE FALSE TRUE
## 4: TRUE FALSE FALSE FALSE
## 5: FALSE TRUE FALSE FALSE
## 6: FALSE FALSE FALSE FALSE
## 7: FALSE TRUE TRUE TRUE
## 8: FALSE FALSE TRUE TRUE
## 9: TRUE TRUE TRUE TRUE
## 10: FALSE FALSE FALSE TRUE
## 11: TRUE TRUE TRUE FALSE
## 12: FALSE TRUE FALSE FALSE
## 13: TRUE FALSE TRUE TRUE
## 14: TRUE TRUE FALSE FALSE
## 15: FALSE TRUE TRUE TRUE
## 16: TRUE TRUE TRUE TRUE
## 17: FALSE FALSE FALSE FALSE
## 18: FALSE FALSE TRUE TRUE
## 19: TRUE FALSE TRUE FALSE
## 20: FALSE FALSE FALSE FALSE
## x_domain_pedigree x_domain_pregnant x_domain_pressure x_domain_triceps
## 1: FALSE FALSE TRUE TRUE
## 2: TRUE TRUE TRUE TRUE
## 3: FALSE TRUE FALSE FALSE
## 4: FALSE TRUE TRUE TRUE
## 5: TRUE TRUE TRUE TRUE
## 6: FALSE FALSE FALSE TRUE
## 7: FALSE FALSE TRUE FALSE
## 8: FALSE FALSE FALSE TRUE
## 9: TRUE TRUE TRUE TRUE
## 10: FALSE FALSE FALSE TRUE
## 11: FALSE TRUE FALSE TRUE
## 12: TRUE FALSE FALSE FALSE
## 13: TRUE TRUE TRUE FALSE
## 14: FALSE FALSE TRUE TRUE
## 15: TRUE TRUE TRUE TRUE
## 16: FALSE FALSE TRUE TRUE
## 17: FALSE FALSE TRUE TRUE
## 18: FALSE TRUE TRUE TRUE
## 19: FALSE FALSE TRUE FALSE
## 20: FALSE FALSE FALSE TRUE
The associated resampling iterations can be accessed in the BenchmarkResult:
instance$archive$benchmark_result$data
## <ResultData>
## Public:
## as_data_table: function (view = NULL, reassemble_learners = TRUE, convert_predictions = TRUE,
## clone: function (deep = FALSE)
## combine: function (rdata)
## data: list
## initialize: function (data = NULL, store_backends = TRUE)
## iterations: function (view = NULL)
## learners: function (view = NULL, states = TRUE, reassemble = TRUE)
## logs: function (view = NULL, condition)
## prediction: function (view = NULL, predict_sets = "test")
## predictions: function (view = NULL, predict_sets = "test")
## resamplings: function (view = NULL)
## sweep: function ()
## task_type: active binding
## tasks: function (view = NULL)
## uhashes: function (view = NULL)
## Private:
## deep_clone: function (name, value)
## get_view_index: function (view)
The uhash column links the resampling iterations to the evaluated feature subsets stored in instance$archive$data(). This allows, for example, scoring the included ResampleResults on a different measure.
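A minimal sketch of such re-scoring, assuming the archive's benchmark_result supports $score() with an arbitrary measure (accuracy is used here only for illustration):

# score every archived resampling iteration with classification accuracy
scores_acc = instance$archive$benchmark_result$score(msr("classif.acc"))
head(scores_acc)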
Now the optimized feature subset can be used to subset the task and fit the model on all observations.
task$select(instance$result_feature_set)
learner$train(task)
The trained model can now be used to make a prediction on external data.
Note that predicting on observations present in the task should be avoided.
The model has seen these observations already during feature selection and therefore results would be statistically biased.
Hence, the resulting performance measure would be over-optimistic.
Instead, to get statistically unbiased performance estimates for the current task, nested resampling is required.
3.5.8 Automating the Feature Selection
The AutoFSelector wraps a learner and augments it with an automatic feature selection for a given task. Because the AutoFSelector itself inherits from the Learner base class, it can be used like any other learner.
Analogously to the previous subsection, a new classification tree learner is created.
This classification tree learner automatically starts a feature selection on the given task using an inner resampling (holdout).
We create a terminator which allows 10 evaluations and use a simple random search as the feature selection algorithm:
library("paradox")
library("mlr3fselect")
= lrn("classif.rpart")
learner = trm("evals", n_evals = 10)
terminator = fs("random_search")
fselector
= AutoFSelector$new(
at learner = learner,
resampling = rsmp("holdout"),
measure = msr("classif.ce"),
terminator = terminator,
fselector = fselector
) at
## <AutoFSelector:classif.rpart.fselector>
## * Model: -
## * Parameters: xval=0
## * Packages: rpart
## * Predict Type: response
## * Feature types: logical, integer, numeric, factor, ordered
## * Properties: importance, missings, multiclass, selected_features,
## twoclass, weights
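Because it inherits from Learner, the AutoFSelector could in principle be trained directly on a task; a brief sketch (the $fselect_result field is an assumption about the current mlr3fselect API):

# train the AutoFSelector directly; feature selection runs on its inner holdout
at$train(tsk("pima"))
at$fselect_result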
We can now use the learner like any other learner, calling the $train() and $predict() methods. This time, however, we pass it to benchmark() to compare the optimized feature subset to the complete feature set. This way, the AutoFSelector performs its resampling for feature selection on the training set of the respective split of the outer resampling; the learner then undertakes predictions using the test set of the outer resampling. This yields unbiased performance measures, as the observations in the test set have not been used during feature selection or fitting of the respective learner. This is called nested resampling.

To compare the optimized feature subset with the complete feature set, we set up the benchmark as follows:
grid = benchmark_grid(
  task = tsk("pima"),
  learner = list(at, lrn("classif.rpart")),
  resampling = rsmp("cv", folds = 3)
)

# avoid console output from mlr3fselect
logger = lgr::get_logger("bbotk")
logger$set_threshold("warn")

bmr = benchmark(grid, store_models = TRUE)
bmr$aggregate(msrs(c("classif.ce", "time_train")))
## nr resample_result task_id learner_id resampling_id iters
## 1: 1 <ResampleResult[21]> pima classif.rpart.fselector cv 3
## 2: 2 <ResampleResult[21]> pima classif.rpart cv 3
## classif.ce time_train
## 1: 0.2747 0
## 2: 0.2643 0
Note that we do not expect any significant differences since we only evaluated a small fraction of the possible feature subsets.
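Since the models were stored (store_models = TRUE), the feature sets chosen in the inner resampling loops could also be inspected; the helper extract_inner_fselect_results() is assumed to be available in the installed mlr3fselect version:

# one row per outer resampling iteration of the AutoFSelector,
# showing the inner feature selection result
extract_inner_fselect_results(bmr)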