5.6 Special Operators

This section introduces some special operators that are useful in many applications.

5.6.1 Imputation: PipeOpImpute

Imputation of missing data is a common preprocessing task. Imputation methods range from relatively simple ones, which use the mean, the median or a histogram, to far more involved ones, which use machine learning algorithms to predict the missing values.

The following PipeOp imputes numeric values from a histogram, adds a new level for factors and additionally adds a column marking whether a value for a given feature was missing or not.
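A minimal sketch of such an imputation pipeline is given here. It assumes a recent mlr3pipelines release in which the three steps are available as po("missind"), po("imputehist") and po("imputeoor"); these keys, and the use of the "pima" task as example data, are assumptions and are not taken from the text above.

```r
library(mlr3)
library(mlr3pipelines)

# Assumption: the "pima" task shipped with mlr3 contains missing numeric values.
task = tsk("pima")

# Branch 1 creates indicator columns marking which values were missing.
# Branch 2 imputes numerics from a histogram and adds a new level for factors.
# po("missind") drops the original features, so both branches are merged
# again with po("featureunion").
imputer = gunion(list(
  po("missind"),
  po("imputehist") %>>% po("imputeoor")
)) %>>% po("featureunion")

imputed = imputer$train(task)[[1]]

sum(task$missings())     # number of missing values before imputation
sum(imputed$missings())  # number of missing values after imputation
imputed$feature_names    # original features plus the indicator columns
```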

A learner can thus be equipped with automatic imputation of missing values by adding an imputation PipeOp in front of it.
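As a sketch of how this might look, the imputation step can be chained in front of a learner and the resulting graph wrapped in a GraphLearner. The choice of po("imputehist"), the rpart learner and the "pima" task are illustrative assumptions.

```r
library(mlr3)
library(mlr3pipelines)

# Imputation followed by a learner, combined into one graph.
graph = po("imputehist") %>>% po("learner", lrn("classif.rpart"))

# Wrapping the graph makes it behave like any other mlr3 learner.
glrn = GraphLearner$new(graph)

task = tsk("pima")  # assumption: example task with missing numeric values
glrn$train(task)
pred = glrn$predict(task)
pred$score(msr("classif.acc"))
```

Because the imputation is part of the learner, it is re-estimated on the training data of every resampling split instead of once on the full data set.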

5.6.2 Feature Engineering: PipeOpMutate

New features can be added or computed from a task using PipeOpMutate. The operator evaluates one or more expressions provided in an alist. In this example, we compute some new features on top of the iris task and add them to the data.
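The following sketch illustrates the idea. It assumes a recent mlr3pipelines release, in which the mutation parameter takes a named list of one-sided formulas rather than an alist; the feature names Sepal.Sum, Petal.Sum and Sepal.Petal.Ratio are made up for illustration.

```r
library(mlr3)
library(mlr3pipelines)

pom = po("mutate")

# Each list element names a new feature; the right-hand side of the
# formula is evaluated on the columns of the task.
pom$param_set$values$mutation = list(
  Sepal.Sum = ~ Sepal.Length + Sepal.Width,
  Petal.Sum = ~ Petal.Length + Petal.Width,
  Sepal.Petal.Ratio = ~ Sepal.Length / Petal.Length
)

mutated = pom$train(list(tsk("iris")))[[1]]
head(mutated$data())
```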

If outside data is required, we can use the env parameter to provide an environment in which the expressions are evaluated (env defaults to .GlobalEnv).

5.6.3 Training on data subsets: PipeOpChunk

When the data is too big to fit into the machine’s memory, a common technique is to split the data into several parts, train on each part separately and aggregate the resulting models afterwards. In this example, we split our data into 4 parts using PipeOpChunk. Additionally, we create 4 PipeOpLearner PipeOps, each of which is then trained on one split of the data.

Afterwards, we can use PipeOpMajorityVote to aggregate the predictions of the 4 models into a single prediction.

We can now connect the different operators and visualize the full graph:
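The whole construction can be sketched as follows. This assumes a recent mlr3pipelines release in which majority voting over classification predictions is provided by po("classifavg"); the learner ids rpart_1 to rpart_4 are hypothetical names chosen so that the four copies do not clash.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")

# Split the training data into 4 chunks.
chks = po("chunk", outnum = 4)

# One learner per chunk; ids must be unique within a graph.
lrns = gunion(lapply(1:4, function(i) {
  po("learner", lrn("classif.rpart"), id = paste0("rpart_", i))
}))

# Aggregate the 4 predictions (majority vote for response predictions).
avg = po("classifavg", innum = 4)

pipeline = chks %>>% lrns %>>% avg

pipeline$train(task)
pred = pipeline$predict(task)[[1]]
pred$score(msr("classif.acc"))

# pipeline$plot()  # visualizes the graph; requires the igraph package
```

During prediction, the chunk operator passes the full data to every learner, so each model predicts on all rows and the votes are combined per row.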

5.6.4 Feature Selection: PipeOpFilter, PipeOpSelect

mlr3featsel contains many different Filters that can be used to select features for subsequent learners. This is often required when the data has a large number of features.

The number of features to keep can be set using filter_nfeat (an absolute number of features), filter_frac (a fraction of the features) or filter_cutoff (a minimum filter score).
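A minimal sketch of a filter step is shown here. It assumes a recent setup in which the filters are provided by the mlr3filters package and the parameters are spelled filter.nfeat, filter.frac and filter.cutoff; the variance filter is chosen only because it needs no additional packages.

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3filters)

# Keep the 2 features with the highest variance.
pof = po("filter", flt("variance"), filter.nfeat = 2)

filtered = pof$train(list(tsk("iris")))[[1]]
filtered$feature_names
```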

Features can be selected or de-selected by name using PipeOpSelect.
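As an illustration, the sketch below keeps two iris features by name and, in a second operator, drops the same two features by inverting the selector. It assumes the selector helpers selector_name() and selector_invert() from a recent mlr3pipelines release.

```r
library(mlr3)
library(mlr3pipelines)

keep = selector_name(c("Petal.Length", "Petal.Width"))

# Select features by name.
pos_keep = po("select", selector = keep)
kept = pos_keep$train(list(tsk("iris")))[[1]]
kept$feature_names

# De-select the same features by inverting the selector.
pos_drop = po("select", selector = selector_invert(keep), id = "deselect")
dropped = pos_drop$train(list(tsk("iris")))[[1]]
dropped$feature_names
```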