19  Special Operators

This section introduces some special operators that may prove useful in numerous further applications.

19.1 Imputation: PipeOpImpute

Often you will be working with data sets that have missing values. There are many methods of dealing with this issue, ranging from relatively simple approaches using the mean, median, or a histogram to more involved methods that use machine learning algorithms to predict the missing values. Replacing missing values in this way is called imputation.

In the following example, we combine several imputation PipeOps into a pipeline. We:

  • Add an indicator column for each numeric feature, marking whether its value was missing
  • Impute numeric values from a histogram
  • Impute categorical values using a learner
  • Use po("featureunion") together with po("nop") to cbind the indicator columns to the rest of the data
# Imputation example
task = tsk("penguins")
task$missings()
       species     bill_depth    bill_length      body_mass flipper_length 
             0              2              2              2              2 
        island            sex           year 
             0             11              0 
# Add missing indicator columns ("dummy columns") to the Task
pom = po("missind")
# Simply pushes the input forward
nop = po("nop")
# Imputes numerical features by histogram.
pon = po("imputehist", id = "imputer_num")
# combines features (used here to add indicator columns to original data)
pou = po("featureunion")
# Impute categorical features by fitting a Learner ("classif.rpart") for each feature.
pof = po("imputelearner", lrn("classif.rpart"), id = "imputer_fct", affect_columns = selector_type("factor"))

Now we construct the graph.

impgraph = list(
  pom,
  nop
) %>>% pou %>>% pof %>>% pon

impgraph$plot()

Now we obtain the new task and can see that all of the missing values have been imputed.

new_task = impgraph$train(task)[[1]]

new_task$missings()
               species     missing_bill_depth    missing_bill_length 
                     0                      0                      0 
     missing_body_mass missing_flipper_length                 island 
                     0                      0                      0 
                  year                    sex             bill_depth 
                     0                      0                      0 
           bill_length              body_mass         flipper_length 
                     0                      0                      0 

A learner can thus be equipped with automatic imputation of missing values by prepending an imputation PipeOp.

polrn = po("learner", lrn("classif.rpart"))
# name the graph learner glrn to avoid masking the lrn() function
glrn = as_learner(impgraph %>>% polrn)
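As a sketch of how such a combined learner can be used (the graph learner is rebuilt here so the snippet is self-contained; the 3-fold cross-validation is an arbitrary choice):

```r
# Wrap the imputation graph and a classification tree into a single learner
imp_lrn = as_learner(impgraph %>>% po("learner", lrn("classif.rpart")))

# Resample on the original penguins task, which still contains missing values;
# imputation is then carried out inside each resampling fold
rr = resample(tsk("penguins"), imp_lrn, rsmp("cv", folds = 3))
rr$aggregate(msr("classif.ce"))
```

Because the imputation steps are part of the learner, they are re-fitted on each training split, which avoids information leaking from the test data into the imputation models.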

19.2 Feature Engineering: PipeOpMutate

New features can be added to or computed from a task using PipeOpMutate. The operator evaluates one or more expressions, provided as formulas in a named list. In this example, we compute some new features on top of the iris task and then add them to the data, as illustrated below.

The iris dataset looks like this:

task = tsk("iris")
head(as.data.table(task))
   Species Petal.Length Petal.Width Sepal.Length Sepal.Width
1:  setosa          1.4         0.2          5.1         3.5
2:  setosa          1.4         0.2          4.9         3.0
3:  setosa          1.3         0.2          4.7         3.2
4:  setosa          1.5         0.2          4.6         3.1
5:  setosa          1.4         0.2          5.0         3.6
6:  setosa          1.7         0.4          5.4         3.9

Once we apply the mutations, we can see the new columns:

pom = po("mutate")

# Define a set of mutations
mutations = list(
  Sepal.Sum = ~ Sepal.Length + Sepal.Width,
  Petal.Sum = ~ Petal.Length + Petal.Width,
  Sepal.Petal.Ratio = ~ (Sepal.Length / Petal.Length)
)
pom$param_set$values$mutation = mutations

new_task = pom$train(list(task))[[1]]
head(as.data.table(new_task))
   Species Petal.Length Petal.Width Sepal.Length Sepal.Width Sepal.Sum
1:  setosa          1.4         0.2          5.1         3.5       8.6
2:  setosa          1.4         0.2          4.9         3.0       7.9
3:  setosa          1.3         0.2          4.7         3.2       7.9
4:  setosa          1.5         0.2          4.6         3.1       7.7
5:  setosa          1.4         0.2          5.0         3.6       8.6
6:  setosa          1.7         0.4          5.4         3.9       9.3
   Petal.Sum Sepal.Petal.Ratio
1:       1.6          3.642857
2:       1.6          3.500000
3:       1.5          3.615385
4:       1.7          3.066667
5:       1.6          3.571429
6:       2.1          3.176471

If outside data is required, we can make use of the env parameter: it provides the environment in which the expressions are evaluated (env defaults to .GlobalEnv).
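A minimal sketch of how env could be used, assuming a constant scale_factor defined outside the task data (the variable names here are made up for illustration):

```r
pom_env = po("mutate")
pom_env$param_set$values$mutation = list(
  # scale_factor is not a column of the task; it is looked up in env
  Petal.Length.Scaled = ~ Petal.Length * scale_factor
)

# Provide an environment that contains scale_factor
myenv = new.env()
myenv$scale_factor = 10
pom_env$param_set$values$env = myenv

head(as.data.table(pom_env$train(list(tsk("iris")))[[1]]))
```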

19.3 Training on data subsets: PipeOpChunk

In cases where the data is too big to fit into the machine's memory, an often-used technique is to split the data into several parts and train a separate model on each part.

Afterwards, we aggregate the models. In this example, we split our data into 4 parts using PipeOpChunk. Additionally, we create 4 PipeOpLearner POs, which are then trained on the respective chunks of the data.

chks = po("chunk", 4)
lrns = ppl("greplicate", po("learner", lrn("classif.rpart")), 4)

Afterwards we can use PipeOpClassifAvg to aggregate the predictions from the 4 different models into a single prediction.

mjv = po("classifavg", 4)

We can now connect the different operators and visualize the full graph:

pipeline = chks %>>% lrns %>>% mjv
pipeline$plot(html = FALSE)

task = tsk("iris")
train.idx = sample(seq_len(task$nrow), 120)
test.idx = setdiff(seq_len(task$nrow), train.idx)

pipelrn = as_learner(pipeline)
pipelrn$train(task, train.idx)$
  predict(task, test.idx)$
  score()

19.4 Feature Selection: PipeOpFilter and PipeOpSelect

The package mlr3filters contains many different Filters that can be used to select features for subsequent learners. This is often required when the data has a large number of features.

A PipeOp for filters is PipeOpFilter:

po("filter", mlr3filters::flt("information_gain"))
PipeOp: <information_gain> (not trained)
values: <list()>
Input channels <name [train type, predict type]>:
  input [Task,Task]
Output channels <name [train type, predict type]>:
  output [Task,Task]

How many features to keep can be set using the filter.nfeat, filter.frac and filter.cutoff parameters.
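For example, to keep only the three features with the highest information gain (assuming the filter.nfeat parameter name):

```r
# Keep the 3 features with the highest information gain
pofilter = po("filter", mlr3filters::flt("information_gain"), filter.nfeat = 3)
filtered = pofilter$train(list(tsk("iris")))[[1]]
filtered$feature_names
```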

Features can be selected or de-selected by name using PipeOpSelect.
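A short sketch of PipeOpSelect, using a Selector function to keep only the petal features of the iris task:

```r
# Keep only features whose names match the regular expression "^Petal"
posel = po("select", selector = selector_grep("^Petal"))
sel_task = posel$train(list(tsk("iris")))[[1]]
sel_task$feature_names
```

Other Selectors, such as selector_name() or selector_type(), can be used in the same way.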