5 Pipelines
mlr3pipelines (Binder et al. 2021) is a dataflow programming toolkit. This chapter focuses on the user's side of the package. A more in-depth and technically oriented guide can be found in the In-depth look into mlr3pipelines chapter.
Machine learning workflows can be written as directed “Graphs”/“Pipelines” that represent data flows between preprocessing, model fitting, and ensemble learning units in an expressive and intuitive language. We will most often use the term “Graph” in this manual but it can interchangeably be used with “pipeline” or “workflow”.
Below you can examine an example for such a graph:
Single computational steps can be represented as so-called PipeOps, which can then be connected with directed edges in a Graph. The scope of mlr3pipelines is still growing. Currently supported features are:
 Data manipulation and preprocessing operations, e.g. PCA, feature filtering, imputation
 Task subsampling for speed and outcome class imbalance handling
 mlr3 Learner operations for prediction and stacking
 Ensemble methods and aggregation of predictions
Additionally, we implement several meta operators that can be used to construct powerful pipelines:
 Simultaneous path branching (data going both ways)
 Alternative path branching (data going one specific way, controlled by hyperparameters)
An extensive introduction to creating custom PipeOps (POs) can be found in the technical introduction.
Using methods from mlr3tuning, it is even possible to simultaneously optimize parameters of multiple processing units.
A predecessor to this package is the mlrCPO package, which works with mlr 2.x. Other packages that provide, to varying degrees, some preprocessing functionality or a machine learning domain-specific language, are:
An example for a Pipeline that can be constructed using mlr3pipelines is depicted below:
5.1 The Building Blocks: PipeOps
The building blocks of mlr3pipelines are PipeOp objects (POs).
They can be constructed directly using PipeOp<NAME>$new(), but the recommended way is to retrieve them from the mlr_pipeops dictionary:
library("mlr3pipelines")
as.data.table(mlr_pipeops)
## key packages
## 1: boxcox mlr3pipelines,bestNormalize
## 2: branch mlr3pipelines
## 3: chunk mlr3pipelines
## 4: classbalancing mlr3pipelines
## 5: classifavg mlr3pipelines,stats
## 6: classweights mlr3pipelines
## 7: colapply mlr3pipelines
## 8: collapsefactors mlr3pipelines
## 9: colroles mlr3pipelines
## 10: copy mlr3pipelines
## 11: datefeatures mlr3pipelines
## 12: encode mlr3pipelines,stats
## 13: encodeimpact mlr3pipelines
## 14: encodelmer mlr3pipelines,lme4,nloptr
## 15: featureunion mlr3pipelines
## 16: filter mlr3pipelines
## 17: fixfactors mlr3pipelines
## 18: histbin mlr3pipelines,graphics
## 19: ica mlr3pipelines,fastICA
## 20: imputeconstant mlr3pipelines
## 21: imputehist mlr3pipelines,graphics
## 22: imputelearner mlr3pipelines
## 23: imputemean mlr3pipelines
## 24: imputemedian mlr3pipelines,stats
## 25: imputemode mlr3pipelines
## 26: imputeoor mlr3pipelines
## 27: imputesample mlr3pipelines
## 28: kernelpca mlr3pipelines,kernlab
## 29: learner mlr3pipelines
## 30: learner_cv mlr3pipelines
## 31: missind mlr3pipelines
## 32: modelmatrix mlr3pipelines,stats
## 33: multiplicityexply mlr3pipelines
## 34: multiplicityimply mlr3pipelines
## 35: mutate mlr3pipelines
## 36: nmf mlr3pipelines,MASS,NMF
## 37: nop mlr3pipelines
## 38: ovrsplit mlr3pipelines
## 39: ovrunite mlr3pipelines
## 40: pca mlr3pipelines
## 41: proxy mlr3pipelines
## 42: quantilebin mlr3pipelines,stats
## 43: randomprojection mlr3pipelines
## 44: randomresponse mlr3pipelines
## 45: regravg mlr3pipelines
## 46: removeconstants mlr3pipelines
## 47: renamecolumns mlr3pipelines
## 48: replicate mlr3pipelines
## 49: scale mlr3pipelines
## 50: scalemaxabs mlr3pipelines
## 51: scalerange mlr3pipelines
## 52: select mlr3pipelines
## 53: smote mlr3pipelines,smotefamily
## 54: spatialsign mlr3pipelines
## 55: subsample mlr3pipelines
## 56: targetinvert mlr3pipelines
## 57: targetmutate mlr3pipelines
## 58: targettrafoscalerange mlr3pipelines
## 59: textvectorizer mlr3pipelines,quanteda,stopwords
## 60: threshold mlr3pipelines
## 61: tunethreshold mlr3pipelines,bbotk
## 62: unbranch mlr3pipelines
## 63: vtreat mlr3pipelines,vtreat
## 64: yeojohnson mlr3pipelines,bestNormalize
## key packages
## tags
## 1: data transform
## 2: meta
## 3: meta
## 4: imbalanced data,data transform
## 5: ensemble
## 6: imbalanced data,data transform
## 7: data transform
## 8: data transform
## 9: data transform
## 10: meta
## 11: data transform
## 12: encode,data transform
## 13: encode,data transform
## 14: encode,data transform
## 15: ensemble
## 16: feature selection,data transform
## 17: robustify,data transform
## 18: data transform
## 19: data transform
## 20: missings
## 21: missings
## 22: missings
## 23: missings
## 24: missings
## 25: missings
## 26: missings
## 27: missings
## 28: data transform
## 29: learner
## 30: learner,ensemble,data transform
## 31: missings,data transform
## 32: data transform
## 33: multiplicity
## 34: multiplicity
## 35: data transform
## 36: data transform
## 37: meta
## 38: target transform,multiplicity
## 39: multiplicity,ensemble
## 40: data transform
## 41: meta
## 42: data transform
## 43: data transform
## 44: abstract
## 45: ensemble
## 46: robustify,data transform
## 47: data transform
## 48: multiplicity
## 49: data transform
## 50: data transform
## 51: data transform
## 52: feature selection,data transform
## 53: imbalanced data,data transform
## 54: data transform
## 55: data transform
## 56: abstract
## 57: target transform
## 58: target transform
## 59: data transform
## 60: target transform
## 61: target transform
## 62: meta
## 63: encode,missings,data transform
## 64: data transform
## tags
## feature_types input.num output.num
## 1: numeric,integer 1 1
## 2: NA 1 NA
## 3: NA 1 NA
## 4: logical,integer,numeric,character,factor,ordered,... 1 1
## 5: NA NA 1
## 6: logical,integer,numeric,character,factor,ordered,... 1 1
## 7: logical,integer,numeric,character,factor,ordered,... 1 1
## 8: factor,ordered 1 1
## 9: logical,integer,numeric,character,factor,ordered,... 1 1
## 10: NA 1 NA
## 11: POSIXct 1 1
## 12: factor,ordered 1 1
## 13: factor,ordered 1 1
## 14: factor,ordered 1 1
## 15: NA NA 1
## 16: logical,integer,numeric,character,factor,ordered,... 1 1
## 17: factor,ordered 1 1
## 18: numeric,integer 1 1
## 19: numeric,integer 1 1
## 20: logical,integer,numeric,character,factor,ordered,... 1 1
## 21: integer,numeric 1 1
## 22: logical,factor,ordered 1 1
## 23: numeric,integer 1 1
## 24: numeric,integer 1 1
## 25: factor,integer,logical,numeric,ordered 1 1
## 26: character,factor,integer,numeric,ordered 1 1
## 27: factor,integer,logical,numeric,ordered 1 1
## 28: numeric,integer 1 1
## 29: NA 1 1
## 30: logical,integer,numeric,character,factor,ordered,... 1 1
## 31: logical,integer,numeric,character,factor,ordered,... 1 1
## 32: logical,integer,numeric,character,factor,ordered,... 1 1
## 33: NA 1 NA
## 34: NA NA 1
## 35: logical,integer,numeric,character,factor,ordered,... 1 1
## 36: numeric,integer 1 1
## 37: NA 1 1
## 38: NA 1 1
## 39: NA 1 1
## 40: numeric,integer 1 1
## 41: NA NA 1
## 42: numeric,integer 1 1
## 43: numeric,integer 1 1
## 44: NA 1 1
## 45: NA NA 1
## 46: logical,integer,numeric,character,factor,ordered,... 1 1
## 47: logical,integer,numeric,character,factor,ordered,... 1 1
## 48: NA 1 1
## 49: numeric,integer 1 1
## 50: numeric,integer 1 1
## 51: numeric,integer 1 1
## 52: logical,integer,numeric,character,factor,ordered,... 1 1
## 53: logical,integer,numeric,character,factor,ordered,... 1 1
## 54: numeric,integer 1 1
## 55: logical,integer,numeric,character,factor,ordered,... 1 1
## 56: NA 2 1
## 57: NA 1 2
## 58: NA 1 2
## 59: character 1 1
## 60: NA 1 1
## 61: NA 1 1
## 62: NA NA 1
## 63: logical,integer,numeric,character,factor,ordered,... 1 1
## 64: numeric,integer 1 1
## feature_types input.num output.num
## input.type.train input.type.predict output.type.train output.type.predict
## 1: Task Task Task Task
## 2: * * * *
## 3: Task Task Task Task
## 4: TaskClassif TaskClassif TaskClassif TaskClassif
## 5: NULL PredictionClassif NULL PredictionClassif
## 6: TaskClassif TaskClassif TaskClassif TaskClassif
## 7: Task Task Task Task
## 8: Task Task Task Task
## 9: Task Task Task Task
## 10: * * * *
## 11: Task Task Task Task
## 12: Task Task Task Task
## 13: Task Task Task Task
## 14: Task Task Task Task
## 15: Task Task Task Task
## 16: Task Task Task Task
## 17: Task Task Task Task
## 18: Task Task Task Task
## 19: Task Task Task Task
## 20: Task Task Task Task
## 21: Task Task Task Task
## 22: Task Task Task Task
## 23: Task Task Task Task
## 24: Task Task Task Task
## 25: Task Task Task Task
## 26: Task Task Task Task
## 27: Task Task Task Task
## 28: Task Task Task Task
## 29: TaskClassif TaskClassif NULL PredictionClassif
## 30: TaskClassif TaskClassif TaskClassif TaskClassif
## 31: Task Task Task Task
## 32: Task Task Task Task
## 33: [*] [*] * *
## 34: * * [*] [*]
## 35: Task Task Task Task
## 36: Task Task Task Task
## 37: * * * *
## 38: TaskClassif TaskClassif [TaskClassif] [TaskClassif]
## 39: [NULL] [PredictionClassif] NULL PredictionClassif
## 40: Task Task Task Task
## 41: * * * *
## 42: Task Task Task Task
## 43: Task Task Task Task
## 44: NULL Prediction NULL Prediction
## 45: NULL PredictionRegr NULL PredictionRegr
## 46: Task Task Task Task
## 47: Task Task Task Task
## 48: * * [*] [*]
## 49: Task Task Task Task
## 50: Task Task Task Task
## 51: Task Task Task Task
## 52: Task Task Task Task
## 53: Task Task Task Task
## 54: Task Task Task Task
## 55: Task Task Task Task
## 56: NULL,NULL function,Prediction NULL Prediction
## 57: Task Task NULL,Task function,Task
## 58: TaskRegr TaskRegr NULL,TaskRegr function,TaskRegr
## 59: Task Task Task Task
## 60: NULL PredictionClassif NULL PredictionClassif
## 61: Task Task NULL Prediction
## 62: * * * *
## 63: Task Task Task Task
## 64: Task Task Task Task
## input.type.train input.type.predict output.type.train output.type.predict
Single POs can be created using the dictionary:
pca = mlr_pipeops$get("pca")
or using syntactic sugar po(<name>):
pca = po("pca")
Some POs require additional arguments for construction:
learner = po("learner")
# Error in as_learner(learner) : argument "learner" is missing, with no default
learner = mlr_pipeops$get("learner", lrn("classif.rpart"))
or in short po("learner", lrn("classif.rpart")).
Hyperparameters of POs can be set through the param_vals argument.
Here we set the fraction of features for a filter, either through param_vals or, in short notation, as a direct argument to po().
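The two ways of setting the filter fraction can be sketched as follows. This is a minimal sketch, assuming the variance filter from mlr3filters that is used later in this chapter; the variable names filter_long and filter_short are chosen here for illustration only.

```r
library("mlr3")
library("mlr3pipelines")
library("mlr3filters")

# Long form: hyperparameters are passed as a named list via `param_vals`
filter_long = mlr_pipeops$get("filter",
  filter = flt("variance"),
  param_vals = list(filter.frac = 0.5))

# Short form: po() accepts hyperparameters as additional named arguments
filter_short = po("filter", flt("variance"), filter.frac = 0.5)

# Both PipeOps carry the same hyperparameter value
filter_long$param_set$values$filter.frac
filter_short$param_set$values$filter.frac
```

Note that the resulting PipeOp takes its id from the filter, so it is called "variance" rather than "filter".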
The figure below shows an exemplary PipeOp.
It takes an input, transforms it during .$train and .$predict, and returns data:
5.2 The Pipeline Operator: %>>%
It is possible to create intricate Graphs with edges going all over the place (as long as no loops are introduced).
Regardless, there is usually a clear direction of flow between “layers” in the Graph.
It is therefore convenient to build up a Graph from layers.
This can be done using the %>>% (“double-arrow”) operator.
It takes either a PipeOp or a Graph on each of its sides and connects each output of its left-hand side to one input of its right-hand side.
The number of inputs therefore must match the number of outputs.
library("magrittr")
gr = po("scale") %>>% po("pca")
gr$plot(html = FALSE)
5.3 Nodes, Edges and Graphs
POs are combined into Graphs.
POs are identified by their $id.
Note that the operations all modify the object in-place and return the object itself.
Therefore, multiple modifications can be chained.
For this example we use the pca PO defined above and a new PO named “mutate”.
The latter creates a new feature from existing variables.
Additionally, we use the filter PO again.
mutate = po("mutate")
filter = po("filter",
  filter = mlr3filters::flt("variance"),
  param_vals = list(filter.frac = 0.5))
The recommended way to construct a graph is to use the %>>% operator to chain POs or Graphs.
graph = mutate %>>% filter
To illustrate how this sugar operator works under the surface, we include an example of the manual (that is, hard) way to construct a Graph.
This is done by creating an empty graph first.
Then one fills the empty graph with POs and connects the POs with edges.
Conceptually, this may look like this:
graph = Graph$new()$
  add_pipeop(mutate)$
  add_pipeop(filter)$
  add_edge("mutate", "variance") # connect mutate -> filter (the filter PO has the id "variance")
The constructed Graph can be inspected using its $plot() function:
graph$plot()
Chaining multiple POs of the same kind
If multiple POs of the same kind should be chained, it is necessary to change the id to avoid name clashes.
This can be done by either accessing the $id slot or during construction:
graph$add_pipeop(po("pca"))
graph$add_pipeop(po("pca", id = "pca2"))
graph$plot()
5.4 Modeling
The main purpose of a Graph is to build combined preprocessing and model fitting pipelines that can be used as an mlr3 Learner.
Conceptually, the process may be summarized as follows:
In the following we chain two preprocessing tasks:
 mutate (creation of a new feature)
 filter (filtering the dataset)
Subsequently one can chain a PO learner to train and predict on the modified dataset.
mutate = po("mutate")
filter = po("filter",
  filter = mlr3filters::flt("variance"),
  param_vals = list(filter.frac = 0.5))
graph = mutate %>>%
  filter %>>%
  po("learner",
    learner = lrn("classif.rpart"))
Up to this point we have defined the main pipeline, stored as a Graph.
Now we can train the pipeline and predict with it:
task = tsk("iris")
graph$train(task)
## $classif.rpart.output
## NULL
graph$predict(task)
## $classif.rpart.output
## <PredictionClassif> for 150 observations:
## row_ids truth response
## 1 setosa setosa
## 2 setosa setosa
## 3 setosa setosa
## 
## 148 virginica virginica
## 149 virginica virginica
## 150 virginica virginica
Rather than calling $train() and $predict() manually, we can put the pipeline Graph into a GraphLearner object.
A GraphLearner encapsulates the whole pipeline (including the preprocessing steps) and can be put into resample() or benchmark().
If you are familiar with the old mlr package, this is the equivalent of all the make*Wrapper() functions.
The pipeline being encapsulated (here Graph) must always produce a Prediction with its $predict() call, so it will probably contain at least one PipeOpLearner.
glrn = as_learner(graph)
This learner can be used for model fitting, resampling, benchmarking, and tuning:
## <ResampleResult> of 3 iterations
## * Task: iris
## * Learner: mutate.variance.classif.rpart
## * Warnings: 0 in 0 iterations
## * Errors: 0 in 0 iterations
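The resampling result printed above can be obtained with a call along the following lines. This is a sketch under assumptions: the 3-fold cross-validation is inferred from the “3 iterations” in the printed result, and the graph is rebuilt here only to keep the example self-contained.

```r
library("mlr3")
library("mlr3pipelines")
library("mlr3filters")

# Rebuild the pipeline from this section and wrap it into a GraphLearner
graph = po("mutate") %>>%
  po("filter", filter = flt("variance"), param_vals = list(filter.frac = 0.5)) %>>%
  po("learner", learner = lrn("classif.rpart"))
glrn = as_learner(graph)

# Resample the whole pipeline, preprocessing included, with 3-fold CV
task = tsk("iris")
cv3 = rsmp("cv", folds = 3)
rr = resample(task, glrn, cv3)

# Aggregate the classification error over the three folds
rr$aggregate(msr("classif.ce"))
```

Because the preprocessing steps are part of the learner, they are refitted on the training split of every fold, which avoids leaking information from the test split.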
5.4.1 Setting Hyperparameters
Individual POs offer hyperparameters because they contain $param_set slots that can be read and written from $param_set$values (via the paradox package).
The parameters get passed down to the Graph, and finally to the GraphLearner.
This makes it not only possible to easily change the behavior of a Graph / GraphLearner and try different settings manually, but also to perform tuning using the mlr3tuning package.
glrn$param_set$values$variance.filter.frac = 0.25
cv3 = rsmp("cv", folds = 3)
resample(task, glrn, cv3)
## <ResampleResult> of 3 iterations
## * Task: iris
## * Learner: mutate.variance.classif.rpart
## * Warnings: 0 in 0 iterations
## * Errors: 0 in 0 iterations
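To see how the hyperparameters of individual POs surface in the GraphLearner, one can list the ids of its joint parameter set. The sketch below assumes the same three-step pipeline as above; each parameter name is prefixed with the id of the PO it belongs to.

```r
library("mlr3")
library("mlr3pipelines")
library("mlr3filters")

graph = po("mutate") %>>%
  po("filter", filter = flt("variance"), filter.frac = 0.5) %>>%
  po("learner", lrn("classif.rpart"))
glrn = as_learner(graph)

# The filter PO has the id "variance", so its `filter.frac` parameter
# appears as "variance.filter.frac" in the GraphLearner
grep("filter.frac", glrn$param_set$ids(), value = TRUE, fixed = TRUE)

# Learner parameters are prefixed with the learner id
grep("classif.rpart.cp", glrn$param_set$ids(), value = TRUE, fixed = TRUE)

# Values are set through the prefixed name
glrn$param_set$values$variance.filter.frac = 0.25
```

The same prefixed names are the ones used when defining a search space for tuning.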
5.4.2 Tuning
If you are unfamiliar with tuning in mlr3, we recommend taking a look at the section about tuning first.
Here we define a ParamSet for the “rpart” learner and the “variance” filter which should be optimized during the tuning process.
library("paradox")
ps = ps(
  classif.rpart.cp = p_dbl(lower = 0, upper = 0.05),
  variance.filter.frac = p_dbl(lower = 0.25, upper = 1)
)
After defining the search space, we create a tuning instance that terminates after 20 evaluations and optimize it with a random search.
For the inner resampling, we simply use holdout (a single split into train/test) to keep the runtimes reasonable.
library("mlr3tuning")
instance = TuningInstanceSingleCrit$new(
  task = task,
  learner = glrn,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  search_space = ps,
  terminator = trm("evals", n_evals = 20)
)
tuner = tnr("random_search")
tuner$optimize(instance)
The tuning result can be found in the respective result slots.
instance$result_learner_param_vals
instance$result_y
5.5 Non-Linear Graphs
The Graphs seen so far all have a linear structure. Some POs may have multiple input or output channels. These channels make it possible to create non-linear Graphs with alternative paths taken by the data.
Possible types are:
 Branching: Splitting of a node into several paths, e.g. useful when comparing multiple feature-selection methods (pca, filters). Only one path will be executed.
 Copying: Splitting of a node into several paths, all paths will be executed (sequentially). Parallel execution is not yet supported.
 Stacking: Single graphs are stacked onto each other, i.e. the output of one Graph is the input for another. In machine learning this means that the prediction of one Graph is used as input for another Graph.
5.5.1 Branching & Copying
The PipeOpBranch and PipeOpUnbranch POs make it possible to specify multiple alternative paths.
Only one path is actually executed, the others are ignored.
The active path is determined by a hyperparameter.
This concept makes it possible to tune alternative preprocessing paths (or learner models).
Below a conceptual visualization of branching:
PipeOp(Un)Branch is initialized either with the number of branches, or with a character vector indicating the names of the branches.
If names are given, the “branch-choosing” hyperparameter becomes more readable.
In the following, we set three options:
 Doing nothing (“nop”)
 Applying a PCA
 Scaling the data
It is important to “unbranch” again after “branching”, so that the outputs are merged into one result object.
In the following we first create the branched graph and then show what happens if the “unbranching” is not applied:
graph = po("branch", c("nop", "pca", "scale")) %>>%
  gunion(list(
    po("nop", id = "null1"),
    po("pca"),
    po("scale")
  ))
Without “unbranching” one creates the following graph:
graph$plot(html = FALSE)
Now when “unbranching”, we obtain the following results:
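A sketch of the unbranched graph follows. It assumes the same three branch names as above and additionally shows how the active path might be chosen through the hyperparameter branch.selection (the prefix comes from the default id "branch" of PipeOpBranch); the variable names are illustrative.

```r
library("mlr3")
library("mlr3pipelines")

# Branched graph as constructed above
branched = po("branch", c("nop", "pca", "scale")) %>>%
  gunion(list(
    po("nop", id = "null1"),
    po("pca"),
    po("scale")
  ))

# Merge the three alternative paths back into a single output
unbranched = branched %>>% po("unbranch", c("nop", "pca", "scale"))

# Select the PCA path and train the graph on the iris task
unbranched$param_set$values$branch.selection = "pca"
out = unbranched$train(tsk("iris"))[[1]]
out$feature_names
```

Since only the selected path is executed, the resulting task contains the principal components rather than the original or scaled features.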
The same can be achieved using a shorter notation:
# List of pipeops
opts = list(po("nop", "no_op"), po("pca"), po("scale"))
# List of po ids
opt_ids = mlr3misc::map_chr(opts, `[[`, "id")
po("branch", options = opt_ids) %>>%
  gunion(opts) %>>%
  po("unbranch", options = opt_ids)
## Graph with 5 PipeOps:
## ID State sccssors prdcssors
## branch <<UNTRAINED>> no_op,pca,scale
## no_op <<UNTRAINED>> unbranch branch
## pca <<UNTRAINED>> unbranch branch
## scale <<UNTRAINED>> unbranch branch
## unbranch <<UNTRAINED>> no_op,pca,scale
5.5.2 Model Ensembles
We can leverage the different operations presented to connect POs. This allows us to form powerful graphs.
Before we go into details, we split the task into train and test indices.
task = tsk("iris")
train.idx = sample(seq_len(task$nrow), 120)
test.idx = setdiff(seq_len(task$nrow), train.idx)
5.5.2.1 Bagging
We first examine Bagging, introduced by Breiman (1996). The basic idea is to create multiple predictors and then aggregate those to a single, more powerful predictor.
“… multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets” (Breiman 1996)
Bagging then aggregates a set of predictors by averaging (regression) or majority vote (classification). The idea behind bagging is that a set of weak, but different predictors can be combined in order to arrive at a single, better predictor.
We can achieve this by downsampling our data before training a learner, repeating this e.g. 10 times and then performing a majority vote on the predictions. Graphically, it may be summarized as follows:
First, we create a simple pipeline that uses PipeOpSubsample before a PipeOpLearner is trained:
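Such a single-predictor pipeline might look as follows. This is a sketch: the name single_pred matches the one used in the ppl() call further down, while the subsampling fraction of 0.7 and the choice of classif.rpart are assumptions made for illustration.

```r
library("mlr3")
library("mlr3pipelines")

# Subsample the training data, then fit a decision tree on the subsample
single_pred = po("subsample", frac = 0.7) %>>%
  po("learner", lrn("classif.rpart"))

single_pred$ids()
```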
We can now copy this operation 10 times using pipeline_greplicate.
The pipeline_greplicate function allows us to parallelize many copies of an operation by creating a Graph containing n copies of the input Graph.
We can also create it using syntactic sugar via ppl():
pred_set = ppl("greplicate", single_pred, 10L)
Afterwards we need to aggregate the 10 pipelines to form a single model:
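The aggregation step can be sketched with PipeOpClassifAvg, which averages or majority-votes the predictions of its inputs. The name bagging matches the object plotted below; the subsampling fraction is again an assumption.

```r
library("mlr3")
library("mlr3pipelines")

# One subsample-then-learn pipeline, replicated 10 times
single_pred = po("subsample", frac = 0.7) %>>%
  po("learner", lrn("classif.rpart"))
pred_set = ppl("greplicate", single_pred, 10L)

# Aggregate the 10 predictions into a single one
bagging = pred_set %>>%
  po("classifavg", innum = 10)

length(bagging$pipeops)
```

The number of inputs of the averaging PO (innum) has to match the number of replicated pipelines.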
Now we can plot again to see what happens:
bagging$plot(html = FALSE)
This pipeline can again be used in conjunction with GraphLearner in order for Bagging to be used like a Learner:
baglrn = as_learner(bagging)
baglrn$train(task, train.idx)
baglrn$predict(task, test.idx)
## <PredictionClassif> for 30 observations:
## row_ids truth response prob.setosa prob.versicolor prob.virginica
## 6 setosa setosa 1 0 0
## 14 setosa setosa 1 0 0
## 17 setosa setosa 1 0 0
## 
## 144 virginica virginica 0 0 1
## 145 virginica virginica 0 0 1
## 146 virginica virginica 0 0 1
In conjunction with different Backends, this can be a very powerful tool.
In cases when the data does not fully fit in memory, one can obtain a fraction of the data for each learner from a DataBackend and then aggregate predictions over all learners.
5.5.2.2 Stacking
Stacking (Wolpert 1992) is another technique that can improve model performance. The basic idea behind stacking is the use of predictions from one model as features for a subsequent model to possibly improve performance.
Below is a conceptual illustration of stacking:
As an example we can train a decision tree and use the predictions from this model in conjunction with the original features in order to train an additional model on top.
To limit overfitting, we additionally do not predict on the original predictions of the learner.
Instead, we predict on out-of-bag predictions.
To do all this, we can use PipeOpLearnerCV.
PipeOpLearnerCV performs nested cross-validation on the training data, fitting a model in each fold.
Each of the models is then used to predict on the out-of-fold data.
As a result, we obtain predictions for every data point in our input data.
We first create a “level 0” learner, which is used to extract a lower level prediction.
Additionally, we clone() the learner object to obtain a copy of the learner.
Subsequently, one sets a custom id for the PipeOp.
We use PipeOpNOP in combination with gunion, in order to send the unchanged Task to the next level.
There it is combined with the predictions from our decision tree learner.
Afterwards, we want to concatenate the predictions from PipeOpLearnerCV and the original Task using PipeOpFeatureUnion:
Now we can train another learner on top of the combined features:
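The stacking steps described in this section can be sketched as one graph. The name stack matches the object used below; the other names (lrn_base, lrn_0, level_0, combined) and the id "rpart_cv" are illustrative assumptions.

```r
library("mlr3")
library("mlr3pipelines")

lrn_base = lrn("classif.rpart")

# Level 0: cross-validated predictions of a cloned decision tree
lrn_0 = po("learner_cv", lrn_base$clone(), id = "rpart_cv")

# Send the unchanged task alongside the level 0 predictions
level_0 = gunion(list(lrn_0, po("nop")))

# Combine predictions and original features into one task
combined = level_0 %>>% po("featureunion", 2)

# Level 1: train another learner on the combined features
stack = combined %>>% po("learner", lrn_base$clone())

stack$ids()
```

Cloning the learner ensures that the level 0 and level 1 POs do not share the same learner object.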
stacklrn = as_learner(stack)
stacklrn$train(task, train.idx)
stacklrn$predict(task, test.idx)
## <PredictionClassif> for 30 observations:
## row_ids truth response
## 6 setosa setosa
## 14 setosa setosa
## 17 setosa setosa
## 
## 144 virginica virginica
## 145 virginica virginica
## 146 virginica virginica
In this vignette, we showed a very simple use case for stacking. In many real-world applications, stacking is done for multiple levels and on multiple representations of the dataset. On a lower level, different preprocessing methods can be defined in conjunction with several learners. On a higher level, we can then combine those predictions in order to form a very powerful model.
5.5.2.3 Multilevel Stacking
In order to showcase the power of mlr3pipelines, we will show a more complicated stacking example.
In this case, we train a glmnet and 2 different rpart models (some of which transform their inputs using PipeOpPCA) on our task in “level 0” and concatenate them with the original features (via gunion).
The result is then passed on to “level 1”, where we copy the concatenated features 3 times and put this task into an rpart and a glmnet model.
Additionally, we keep a version of the “level 0” output (via PipeOpNOP) and pass this on to “level 2”.
In “level 2” we simply concatenate all “level 1” outputs and train a final decision tree.
In the following examples, use <lrn>$param_set$values$<param_name> = <param_value> to set hyperparameters for the different learners.
library("magrittr")
library("mlr3learners") # for classif.glmnet
rprt = lrn("classif.rpart", predict_type = "prob")
glmn = lrn("classif.glmnet", predict_type = "prob")
# Create Learner CV Operators
lrn_0 = po("learner_cv", rprt, id = "rpart_cv_1")
lrn_0$param_set$values$maxdepth = 5L
lrn_1 = po("pca", id = "pca1") %>>% po("learner_cv", rprt, id = "rpart_cv_2")
lrn_1$param_set$values$rpart_cv_2.maxdepth = 1L
lrn_2 = po("pca", id = "pca2") %>>% po("learner_cv", glmn)
# Union them with a PipeOpNOP to keep original features
level_0 = gunion(list(lrn_0, lrn_1, lrn_2, po("nop", id = "NOP1")))
# Cbind the output 3 times, train 2 learners but also keep level
# 0 predictions
level_1 = level_0 %>>%
  po("featureunion", 4) %>>%
  po("copy", 3) %>>%
  gunion(list(
    po("learner_cv", rprt, id = "rpart_cv_l1"),
    po("learner_cv", glmn, id = "glmnt_cv_l1"),
    po("nop", id = "NOP_l1")
  ))
# Cbind predictions, train a final learner
level_2 = level_1 %>>%
  po("featureunion", 3, id = "u2") %>>%
  po("learner", rprt, id = "rpart_l2")
# Plot the resulting graph
level_2$plot(html = FALSE)
task = tsk("iris")
lrn = as_learner(level_2)
And we can again call .$train and .$predict:
lrn$
  train(task, train.idx)$
  predict(task, test.idx)$
  score()
## classif.ce
## 0.03333
5.6 Special Operators
This section introduces some special operators that might be useful in numerous further applications.
5.6.1 Imputation: PipeOpImpute
Often you will be using data sets that have missing values. There are many methods of dealing with this issue, from relatively simple imputation using either mean, median or histograms to way more involved methods including using machine learning algorithms in order to predict missing values. These methods are called imputation.
In the following example, we use PipeOps to:
 Add an indicator column marking whether a value for a given feature was missing or not (numeric only)
 Impute numeric values from a histogram
 Impute categorical values using a learner
We use po("featureunion") and po("nop") to cbind the missing indicator features, in other words to combine the indicator columns with the rest of the data.
# Imputation example
task = tsk("penguins")
task$missings()
## species bill_depth bill_length body_mass flipper_length
## 0 2 2 2 2
## island sex year
## 0 11 0
# Add missing indicator columns ("dummy columns") to the Task
pom = po("missind")
# Simply pushes the input forward
nop = po("nop")
# Imputes numerical features by histogram.
pon = po("imputehist", id = "imputer_num")
# combines features (used here to add indicator columns to original data)
pou = po("featureunion")
# Impute categorical features by fitting a Learner ("classif.rpart") for each feature.
pof = po("imputelearner", lrn("classif.rpart"), id = "imputer_fct", affect_columns = selector_type("factor"))
Now we construct the graph.
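One way to assemble the graph from the POs defined above is sketched below. The name impgraph matches the object used in the following lines; the order of the steps (indicator columns first, then learner-based imputation of factors, then histogram imputation of numerics) is an assumption, and the POs are redefined only to keep the example self-contained.

```r
library("mlr3")
library("mlr3pipelines")

task = tsk("penguins")

pom = po("missind")
nop = po("nop")
pon = po("imputehist", id = "imputer_num")
pou = po("featureunion")
pof = po("imputelearner", lrn("classif.rpart"), id = "imputer_fct",
  affect_columns = selector_type("factor"))

# Indicator columns and original data run in parallel, are combined by the
# feature union, and are then passed through both imputers
impgraph = gunion(list(pom, nop)) %>>% pou %>>% pof %>>% pon

# Total number of missing values after training the graph
sum(impgraph$train(task)[[1]]$missings())
```

The indicator columns have to be created before imputation, because afterwards the information about which values were missing is lost.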
After training the graph, we obtain the new task and can see that all of the missing values have been imputed.
new_task = impgraph$train(task)[[1]]
new_task$missings()
## species missing_bill_depth missing_bill_length
## 0 0 0
## missing_body_mass missing_flipper_length island
## 0 0 0
## year sex bill_depth
## 0 0 0
## bill_length body_mass flipper_length
## 0 0 0
A learner can thus be equipped with automatic imputation of missing values by adding an imputation PipeOp.
polrn = po("learner", lrn("classif.rpart"))
lrn = as_learner(impgraph %>>% polrn)
5.6.2 Feature Engineering: PipeOpMutate
New features can be added or computed from a task using PipeOpMutate.
The operator evaluates one or multiple expressions provided as a named list of formulas.
In this example, we compute some new features on top of the iris task.
Then we add them to the data as illustrated below:
The iris dataset looks like this:
task = tsk("iris")
head(as.data.table(task))
## Species Petal.Length Petal.Width Sepal.Length Sepal.Width
## 1: setosa 1.4 0.2 5.1 3.5
## 2: setosa 1.4 0.2 4.9 3.0
## 3: setosa 1.3 0.2 4.7 3.2
## 4: setosa 1.5 0.2 4.6 3.1
## 5: setosa 1.4 0.2 5.0 3.6
## 6: setosa 1.7 0.4 5.4 3.9
Once we do the mutations, you can see the new columns:
pom = po("mutate")
# Define a set of mutations
mutations = list(
  Sepal.Sum = ~ Sepal.Length + Sepal.Width,
  Petal.Sum = ~ Petal.Length + Petal.Width,
  Sepal.Petal.Ratio = ~ (Sepal.Length / Petal.Length)
)
pom$param_set$values$mutation = mutations
new_task = pom$train(list(task))[[1]]
head(as.data.table(new_task))
## Species Petal.Length Petal.Width Sepal.Length Sepal.Width Sepal.Sum
## 1: setosa 1.4 0.2 5.1 3.5 8.6
## 2: setosa 1.4 0.2 4.9 3.0 7.9
## 3: setosa 1.3 0.2 4.7 3.2 7.9
## 4: setosa 1.5 0.2 4.6 3.1 7.7
## 5: setosa 1.4 0.2 5.0 3.6 8.6
## 6: setosa 1.7 0.4 5.4 3.9 9.3
## Petal.Sum Sepal.Petal.Ratio
## 1: 1.6 3.643
## 2: 1.6 3.500
## 3: 1.5 3.615
## 4: 1.7 3.067
## 5: 1.6 3.571
## 6: 2.1 3.176
If outside data is required, we can make use of the env parameter.
Moreover, we provide an environment, where expressions are evaluated (env defaults to .GlobalEnv).
5.6.3 Training on data subsets: PipeOpChunk
In cases where data is too big to fit into the machine’s memory, an often-used technique is to split the data into several parts. Subsequently, a learner is trained on each part of the data.
After undertaking these steps, we aggregate the models.
In this example, we split our data into 4 parts using PipeOpChunk.
Additionally, we create 4 PipeOpLearner POs, which are then trained on each split of the data.
Afterwards we can use PipeOpClassifAvg to aggregate the predictions from the 4 different models into a new one.
mjv = po("classifavg", 4)
We can now connect the different operators and visualize the full graph:
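A sketch of the complete chunking pipeline follows. The name pipeline matches the object used below and mjv the averaging PO defined above; chks and lrns are illustrative names, and classif.rpart is an assumed choice of learner.

```r
library("mlr3")
library("mlr3pipelines")

# Split the training data into 4 chunks
chks = po("chunk", 4)

# One learner per chunk
lrns = ppl("greplicate", po("learner", lrn("classif.rpart")), 4)

# Aggregate the predictions of the 4 models
mjv = po("classifavg", 4)

# Connect the pieces into one graph
pipeline = chks %>>% lrns %>>% mjv

length(pipeline$pipeops)
```

The resulting graph can be visualized with pipeline$plot(html = FALSE), in the same way as the other graphs in this chapter.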
task = tsk("iris")
train.idx = sample(seq_len(task$nrow), 120)
test.idx = setdiff(seq_len(task$nrow), train.idx)
pipelrn = as_learner(pipeline)
pipelrn$train(task, train.idx)$
  predict(task, train.idx)$
  score()
## classif.ce
## 0.03333
5.6.4 Feature Selection: PipeOpFilter and PipeOpSelect
The package mlr3filters contains many different mlr3filters::Filters that can be used to select features for subsequent learners.
This is often required when the data has a large number of features.
A PipeOp for filters is PipeOpFilter:
## PipeOp: <information_gain> (not trained)
## values: <list()>
## Input channels <name [train type, predict type]>:
## input [Task,Task]
## Output channels <name [train type, predict type]>:
## output [Task,Task]
How many features to keep can be set using filter.nfeat, filter.frac and filter.cutoff.
Features can be selected / deselected by name using PipeOpSelect.
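Both operators can be tried out on the iris task. The sketch below uses the variance filter instead of information gain, purely to avoid an additional package dependency, and the regular expression passed to selector_grep() is an illustrative choice.

```r
library("mlr3")
library("mlr3pipelines")
library("mlr3filters")

task = tsk("iris")

# PipeOpFilter: keep the two features with the highest variance
flt_op = po("filter", flt("variance"), filter.nfeat = 2L)
filtered = flt_op$train(list(task))[[1]]
filtered$feature_names

# PipeOpSelect: keep only features whose name starts with "Petal"
sel_op = po("select", selector = selector_grep("^Petal"))
selected = sel_op$train(list(task))[[1]]
selected$feature_names
```

PipeOpFilter ranks features by a score computed on the training data, whereas PipeOpSelect applies a fixed rule that does not depend on the data values.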
5.7 In-depth look into mlr3pipelines
This vignette is an in-depth introduction to mlr3pipelines, the dataflow programming toolkit for machine learning in R using mlr3.
It will go through basic concepts and then give a few examples that both show the simplicity as well as the power and versatility of using mlr3pipelines.
5.7.1 What’s the Point
Machine learning toolkits often try to abstract away the processes happening inside machine learning algorithms.
This makes it easy for the user to switch out one algorithm for another without having to worry about what is happening inside it, what kind of data it is able to operate on etc.
The benefit of using mlr3, for example, is that one can create a Learner, a Task, a Resampling etc. and use them for typical machine learning operations.
It is trivial to exchange individual components and therefore use, for example, a different Learner in the same experiment for comparison.
task = as_task_classif(iris, target = "Species")
lrn = lrn("classif.rpart")
rsmp = rsmp("holdout")
resample(task, lrn, rsmp)
## <ResampleResult> of 1 iterations
## * Task: iris
## * Learner: classif.rpart
## * Warnings: 0 in 0 iterations
## * Errors: 0 in 0 iterations
However, this modularity breaks down as soon as the learning algorithm encompasses more than just model fitting, like data preprocessing, ensembles or other meta models.
mlr3pipelines takes modularity one step further than mlr3: it makes it possible to build individual steps within a Learner out of building blocks called PipeOps.
5.7.2 PipeOp: Pipeline Operators
The most basic unit of functionality within mlr3pipelines is the PipeOp, short for “pipeline operator”, which represents a transformative operation on input (for example a training dataset) leading to output.
It can therefore be seen as a generalized notion of a function, with a certain twist: PipeOps behave differently during a “training phase” and a “prediction phase”.
The training phase will typically generate a certain model of the data that is saved as internal state.
The prediction phase will then operate on the input data depending on the trained model.
An example of this behavior is the principal component analysis operation (“PipeOpPCA”):
During training, it will transform incoming data by rotating it in a way that leads to uncorrelated features ordered by their contribution to total variance.
It will also save the rotation matrix to be used for new data during the “prediction phase”.
This makes it possible to perform “prediction” with single rows of new data, where a row’s scores on each of the principal components (the components of the training data!) are computed.
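The training step that produces the output below is not shown in this section. A minimal sketch of what it might look like follows, assuming the iris task from the introductory example and the variable name po that the later $predict() and $state calls use; pca_out is an illustrative name.

```r
library(mlr3)
library(mlr3pipelines)

task = as_task_classif(iris, target = "Species")

# Construct the PCA operator; the name `po` matches the later calls
po = po("pca")

# $train() takes a list of inputs and returns a list of outputs;
# as a side effect the rotation is stored in po$state
pca_out = po$train(list(task))[[1]]$data()
pca_out
```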
## Species PC1 PC2 PC3 PC4
## 1: setosa 2.684 0.31940 0.02791 0.002262
## 2: setosa 2.714 0.17700 0.21046 0.099027
## 3: setosa 2.889 0.14495 0.01790 0.019968
## 4: setosa 2.745 0.31830 0.03156 0.075576
## 5: setosa 2.729 0.32675 0.09008 0.061259
## 
## 146: virginica 1.944 0.18753 0.17783 0.426196
## 147: virginica 1.527 0.37532 0.12190 0.254367
## 148: virginica 1.764 0.07886 0.13048 0.137001
## 149: virginica 1.901 0.11663 0.72325 0.044595
## 150: virginica 1.390 0.28266 0.36291 0.155039
single_line_task = task$clone()$filter(1)
po$predict(list(single_line_task))[[1]]$data()
## Species PC1 PC2 PC3 PC4
## 1: setosa 2.684 0.3194 0.02791 0.002262
po$state
## Standard deviations (1, .., p=4):
## [1] 2.0563 0.4926 0.2797 0.1544
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Petal.Length 0.85667 0.17337 0.07624 0.4798
## Petal.Width 0.35829 0.07548 0.54583 0.7537
## Sepal.Length 0.36139 0.65659 0.58203 0.3155
## Sepal.Width 0.08452 0.73016 0.59791 0.3197
This shows the most important primitives incorporated in a PipeOp:
* $train(), taking a list of input arguments, turning them into a list of outputs, meanwhile saving a state in $state
* $predict(), taking a list of input arguments, turning them into a list of outputs, making use of the saved $state
* $state, the “model” trained with $train() and utilized during $predict().
Schematically we can represent the PipeOp like so:
5.7.2.1 Why the $state
It is important to take a moment and notice the importance of a $state variable and the $train() / $predict() dichotomy in a PipeOp.
There are many preprocessing methods, for example scaling of features or imputation, that could in theory just be applied to training data and prediction / validation data separately, or they could be applied to a task before resampling is performed.
This would, however, be fallacious:
 The preprocessing of each instance of prediction data should not depend on the remaining prediction dataset. A prediction on a single instance of new data should give the same result as prediction performed on a whole dataset.
 If preprocessing is performed on a task before resampling is done, information about the test set can leak into the training set. Resampling should evaluate the generalization performance of the entire machine learning method, therefore the behavior of this entire method must depend only on the content of the training split during resampling.
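The first point can be checked directly. The sketch below, which assumes an arbitrary split of iris into rows 1 to 100 for training and rows 101 to 150 for prediction, trains a scaling PipeOp once and then compares the prediction for one row on its own with the same row predicted as part of a larger set. The names po_scale, all_rows, one_row and same are illustrative.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")

# Centering and scaling constants are learned from the training rows only
po_scale = po("scale")
po_scale$train(list(task$clone()$filter(1:100)))

# Predict 50 new rows at once, then only the first of them
all_rows = po_scale$predict(list(task$clone()$filter(101:150)))[[1]]$data()
one_row = po_scale$predict(list(task$clone()$filter(101)))[[1]]$data()

# The single-row result does not depend on the other prediction rows
same = isTRUE(all.equal(as.data.frame(all_rows[1, ]), as.data.frame(one_row)))
same
```

Because the constants live in $state, scaling the prediction rows never looks at the prediction set itself.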
5.7.2.2 Where to Get PipeOps
Each PipeOp is an instance of an “R6” class, many of which are provided by the mlr3pipelines package itself.
They can be constructed explicitly (“PipeOpPCA$new()”) or retrieved from the mlr_pipeops dictionary: po("pca").
The entire list of available PipeOps, and some meta-information, can be retrieved using as.data.table():
as.data.table(mlr_pipeops)[, c("key", "input.num", "output.num")]
## key input.num output.num
## 1: boxcox 1 1
## 2: branch 1 NA
## 3: chunk 1 NA
## 4: classbalancing 1 1
## 5: classifavg NA 1
## 6: classweights 1 1
## 7: colapply 1 1
## 8: collapsefactors 1 1
## 9: colroles 1 1
## 10: copy 1 NA
## 11: datefeatures 1 1
## 12: encode 1 1
## 13: encodeimpact 1 1
## 14: encodelmer 1 1
## 15: featureunion NA 1
## 16: filter 1 1
## 17: fixfactors 1 1
## 18: histbin 1 1
## 19: ica 1 1
## 20: imputeconstant 1 1
## 21: imputehist 1 1
## 22: imputelearner 1 1
## 23: imputemean 1 1
## 24: imputemedian 1 1
## 25: imputemode 1 1
## 26: imputeoor 1 1
## 27: imputesample 1 1
## 28: kernelpca 1 1
## 29: learner 1 1
## 30: learner_cv 1 1
## 31: missind 1 1
## 32: modelmatrix 1 1
## 33: multiplicityexply 1 NA
## 34: multiplicityimply NA 1
## 35: mutate 1 1
## 36: nmf 1 1
## 37: nop 1 1
## 38: ovrsplit 1 1
## 39: ovrunite 1 1
## 40: pca 1 1
## 41: proxy NA 1
## 42: quantilebin 1 1
## 43: randomprojection 1 1
## 44: randomresponse 1 1
## 45: regravg NA 1
## 46: removeconstants 1 1
## 47: renamecolumns 1 1
## 48: replicate 1 1
## 49: scale 1 1
## 50: scalemaxabs 1 1
## 51: scalerange 1 1
## 52: select 1 1
## 53: smote 1 1
## 54: spatialsign 1 1
## 55: subsample 1 1
## 56: targetinvert 2 1
## 57: targetmutate 1 2
## 58: targettrafoscalerange 1 2
## 59: textvectorizer 1 1
## 60: threshold 1 1
## 61: tunethreshold 1 1
## 62: unbranch NA 1
## 63: vtreat 1 1
## 64: yeojohnson 1 1
## key input.num output.num
When retrieving PipeOps from the mlr_pipeops dictionary, it is also possible to give additional constructor arguments, such as an id or parameter values.
po("pca", rank. = 3)
## PipeOp: <pca> (not trained)
## values: <rank.=3>
## Input channels <name [train type, predict type]>:
## input [Task,Task]
## Output channels <name [train type, predict type]>:
## output [Task,Task]
5.7.3 PipeOp Channels
5.7.3.1 Input Channels
Just like functions, PipeOps can take multiple inputs.
These multiple inputs are always given as elements in the input list.
For example, there is a PipeOpFeatureUnion that combines multiple tasks with different features and “cbind()s” them together, creating one combined task.
When two halves of the iris task are given, for example, it recreates the original task:
iris_first_half = task$clone()$select(c("Petal.Length", "Petal.Width"))
iris_second_half = task$clone()$select(c("Sepal.Length", "Sepal.Width"))
pofu = po("featureunion", innum = 2)
pofu$train(list(iris_first_half, iris_second_half))[[1]]$data()
## Species Petal.Length Petal.Width Sepal.Length Sepal.Width
## 1: setosa 1.4 0.2 5.1 3.5
## 2: setosa 1.4 0.2 4.9 3.0
## 3: setosa 1.3 0.2 4.7 3.2
## 4: setosa 1.5 0.2 4.6 3.1
## 5: setosa 1.4 0.2 5.0 3.6
## 
## 146: virginica 5.2 2.3 6.7 3.0
## 147: virginica 5.0 1.9 6.3 2.5
## 148: virginica 5.2 2.0 6.5 3.0
## 149: virginica 5.4 2.3 6.2 3.4
## 150: virginica 5.1 1.8 5.9 3.0
Because PipeOpFeatureUnion effectively takes two input arguments here, we can say it has two input channels.
An input channel also carries information about the type of input that is acceptable.
The input channels of the pofu object constructed above, for example, each accept a Task during training and prediction.
This information can be queried from the $input slot:
pofu$input
## name train predict
## 1: input1 Task Task
## 2: input2 Task Task
Other PipeOps may have channels that take different types during different phases.
The backuplearner PipeOp, for example, takes a NULL and a Task during training, and a Prediction and a Task during prediction:
## TODO this is an important case to handle here, do not delete unless there is a better example.
## po("backuplearner")$input
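The call above is commented out in the source. As a stand-in that can be run, the sketch below inspects the targetinvert PipeOp, which the table above lists with two inputs; choosing it as the example is an assumption of this sketch, not the author's selection. Its channels expect different types in the two phases.

```r
library(mlr3)
library(mlr3pipelines)

po_inv = po("targetinvert")

# One row per input channel, with the type accepted during
# training and during prediction
po_inv$input

# The single output channel
po_inv$output
```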
5.7.3.2 Output Channels
Unlike the typical notion of a function, PipeOps can also have multiple output channels.
$train() and $predict() always return a list, so certain PipeOps may return lists with more than one element.
Similar to input channels, the information about the number and type of outputs given by a PipeOp is available in the $output slot.
The chunk PipeOp, for example, chunks a given Task into subsets and consequently returns multiple Task objects, both during training and prediction.
The number of output channels must be given during construction through the outnum argument.
po("chunk", outnum = 3)$output
## name train predict
## 1: output1 Task Task
## 2: output2 Task Task
## 3: output3 Task Task
Note that the number of output channels during training and prediction is the same.
A schema of a PipeOp with two output channels:
5.7.3.3 Channel Configuration
Most PipeOps have only one input channel (so they take a list with a single element), but there are a few with more than one.
In many cases, the number of input or output channels is determined during construction, e.g. through the innum / outnum arguments.
The input.num and output.num columns of the mlr_pipeops table above show the default number of channels, and NA if the number depends on a construction argument.
The default printer of a PipeOp gives information about channel names and types:
## po("backuplearner")
5.7.4 Graph: Networks of PipeOps
5.7.4.1 Basics
What is the advantage of this tedious way of declaring input and output channels and handling in/output through lists?
Because each PipeOp has a known number of input and output channels that always produce or accept data of a known type, it is possible to network them together in Graphs.
A Graph is a collection of PipeOps with “edges” that mandate that data should be flowing along them.
Edges always pass between PipeOp channels, so it is possible not only to explicitly prescribe which position of an input or output list an edge refers to, but also to make different components of a PipeOp’s output flow to multiple different other PipeOps, and to have a PipeOp gather its input from multiple other PipeOps.
A schema of a simple graph of PipeOps:
A Graph is empty when first created, and PipeOps can be added using the $add_pipeop() method.
The $add_edge() method is used to create connections between them.
While the printer of a Graph gives some information about its layout, the most intuitive way of visualizing it is using the $plot() function.
gr = Graph$new()
gr$add_pipeop(po("scale"))
gr$add_pipeop(po("subsample", frac = 0.1))
gr$add_edge("scale", "subsample")
print(gr)
## Graph with 2 PipeOps:
## ID State sccssors prdcssors
## scale <<UNTRAINED>> subsample
## subsample <<UNTRAINED>> scale
gr$plot(html = FALSE)
A Graph itself has a $train() and a $predict() method that accept some data and propagate this data through the network of PipeOps.
The return value corresponds to the output of the PipeOp output channels that are not connected to other PipeOps.
gr$train(task)[[1]]$data()
## Species Petal.Length Petal.Width Sepal.Length Sepal.Width
## 1: setosa 1.16581 1.0486668 0.53538 1.9333
## 2: setosa 1.50569 1.4422448 1.86378 0.1315
## 3: setosa 1.22246 1.0486668 1.01844 0.7862
## 4: setosa 1.27910 1.4422448 0.77691 2.3922
## 5: versicolor 0.53362 0.5256453 0.55149 0.5567
## 6: versicolor 0.13709 0.1320673 0.30996 0.5904
## 7: versicolor 0.36368 0.2632600 0.91378 0.1315
## 8: versicolor 0.08044 0.0008746 0.05233 0.8198
## 9: versicolor 0.13709 0.0008746 0.05233 1.0493
## 10: virginica 1.27004 1.7063794 0.55149 0.5567
## 11: virginica 0.98680 1.1816087 1.15530 0.1315
## 12: virginica 0.70356 1.0504160 0.17309 1.2787
## 13: virginica 1.10010 1.4439941 1.27607 0.3273
## 14: virginica 1.27004 0.7880307 1.63836 0.3273
## 15: virginica 1.21339 1.4439941 1.15530 0.3273
gr$predict(single_line_task)[[1]]$data()
## Species Petal.Length Petal.Width Sepal.Length Sepal.Width
## 1: setosa 1.336 1.311 0.8977 1.016
The collection of PipeOps inside a Graph can be accessed through the $pipeops slot.
The set of edges in the Graph can be inspected through the $edges slot.
It is possible to modify individual PipeOps and edges in a Graph through these slots, but this is not recommended because no error checking is performed and it may put the Graph in an unsupported state.
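For read-only inspection the two slots are safe to use. The sketch below rebuilds the two-step Graph from above so that it runs on its own, then lists the PipeOp ids and the edge table.

```r
library(mlr3)
library(mlr3pipelines)

gr = Graph$new()
gr$add_pipeop(po("scale"))
gr$add_pipeop(po("subsample", frac = 0.1))
gr$add_edge("scale", "subsample")

# Named list of PipeOps, keyed by their ids
names(gr$pipeops)

# One row per edge: source id/channel and destination id/channel
gr$edges
```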
5.7.4.2 Networks
The example above showed a linear preprocessing pipeline, but it is in fact possible to build true “graphs” of operations, as long as no loops are introduced^{1}.
PipeOps with multiple output channels can feed their data to multiple different subsequent PipeOps, and PipeOps with multiple input channels can take results from different PipeOps.
When a PipeOp has more than one input / output channel, then the Graph’s $add_edge() method needs an additional argument that indicates which channel to connect to.
This argument can be given in the form of an integer, or as the name of the channel.
The following constructs a Graph that copies the input and gives one copy each to a “scale” and a “pca” PipeOp.
The resulting columns of each operation are put next to each other by “featureunion”.
gr = Graph$new()$
add_pipeop(po("copy", outnum = 2))$
add_pipeop(po("scale"))$
add_pipeop(po("pca"))$
add_pipeop(po("featureunion", innum = 2))
gr$
add_edge("copy", "scale", src_channel = 1)$ # designating channel by index
add_edge("copy", "pca", src_channel = "output2")$ # designating channel by name
add_edge("scale", "featureunion", dst_channel = 1)$
add_edge("pca", "featureunion", dst_channel = 2)
gr$plot(html = FALSE)
gr$train(iris_first_half)[[1]]$data()
## Species Petal.Length Petal.Width PC1 PC2
## 1: setosa 1.3358 1.3111 2.561 0.006922
## 2: setosa 1.3358 1.3111 2.561 0.006922
## 3: setosa 1.3924 1.3111 2.653 0.031850
## 4: setosa 1.2791 1.3111 2.469 0.045694
## 5: setosa 1.3358 1.3111 2.561 0.006922
## 
## 146: virginica 0.8169 1.4440 1.756 0.455479
## 147: virginica 0.7036 0.9192 1.417 0.164312
## 148: virginica 0.8169 1.0504 1.640 0.178946
## 149: virginica 0.9302 1.4440 1.940 0.377936
## 150: virginica 0.7602 0.7880 1.470 0.033362
5.7.4.3 Syntactic Sugar
Although it is possible to create intricate Graphs with edges going all over the place (as long as no loops are introduced), there is usually a clear direction of flow between “layers” in the Graph.
It is therefore convenient to build up a Graph from layers, which can be done using the %>>% (“double-arrow”) operator.
It takes either a PipeOp or a Graph on each of its sides and connects each output of its left-hand side to one input of its right-hand side; the number of inputs therefore must match the number of outputs.
Together with the gunion() operation, which takes PipeOps or Graphs and arranges them next to each other akin to a (disjoint) graph union, the above network can more easily be constructed as follows:
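The code for this step is missing from the section, so the following is a sketch of one way to write it with %>>% and gunion(), reusing the same four PipeOps as in the explicit construction above.

```r
library(mlr3)
library(mlr3pipelines)

# copy -> (scale | pca) -> featureunion, built layer by layer
gr = po("copy", outnum = 2) %>>%
  gunion(list(po("scale"), po("pca"))) %>>%
  po("featureunion", innum = 2)

print(gr)
```

The two outputs of copy are matched, in order, to the inputs of scale and pca, and their outputs in turn to the two inputs of featureunion.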
5.7.4.4 PipeOp IDs and ID Name Clashes
PipeOps within a graph are addressed by their $id slot.
It is therefore necessary for all PipeOps within a Graph to have a unique $id.
The $id can be set during or after construction, but it should not directly be changed after a PipeOp was inserted in a Graph.
At that point, the $set_names() method can be used to change PipeOp ids.
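The error shown next arises when two PipeOps with the same id are connected. The code that triggers it is not included in this section; a sketch of such a clash, using the names po1 and po2 from the lines that follow, could look like this. The tryCatch() wrapper and the name clash are additions of this sketch so that the script keeps running.

```r
library(mlr3)
library(mlr3pipelines)

po1 = po("scale")
po2 = po("scale")

# Both operators carry the id "scale", so connecting them fails;
# tryCatch() captures the error message instead of stopping the script
clash = tryCatch(po1 %>>% po2, error = function(e) conditionMessage(e))
print(clash)
```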
## Error in gunion(list(g1, g2), in_place = c(TRUE, TRUE)): Assertion on 'ids of pipe operators of graphs' failed: Must have unique names, but element 2 is duplicated.
po2$id = "scale2"
gr = po1 %>>% po2
gr
## Graph with 2 PipeOps:
## ID State sccssors prdcssors
## scale <<UNTRAINED>> scale2
## scale2 <<UNTRAINED>> scale
## Alternative ways of getting new ids:
po("scale", id = "scale2")
## PipeOp: <scale2> (not trained)
## values: <robust=FALSE>
## Input channels <name [train type, predict type]>:
## input [Task,Task]
## Output channels <name [train type, predict type]>:
## output [Task,Task]
## sometimes names of PipeOps within a Graph need to be changed
gr2 = po("scale") %>>% po("pca")
gr %>>% gr2
## Error in gunion(list(g1, g2), in_place = c(TRUE, TRUE)): Assertion on 'ids of pipe operators of graphs' failed: Must have unique names, but element 3 is duplicated.
gr2$set_names("scale", "scale3")
gr %>>% gr2
## Graph with 4 PipeOps:
## ID State sccssors prdcssors
## scale <<UNTRAINED>> scale2
## scale2 <<UNTRAINED>> scale3 scale
## scale3 <<UNTRAINED>> pca scale2
## pca <<UNTRAINED>> scale3
5.7.5 Learners in Graphs, Graphs in Learners
The true power of mlr3pipelines derives from the fact that it can be integrated seamlessly with mlr3.
Two components are mainly responsible for this:
* PipeOpLearner, a PipeOp that encapsulates an mlr3 Learner and creates a Prediction object in its $predict() phase
* GraphLearner, an mlr3 Learner that can be used in place of any other mlr3 Learner, but which does prediction using a Graph given to it
Note that these are dual to each other: One takes a Learner and produces a PipeOp (and by extension a Graph); the other takes a Graph and produces a Learner.
5.7.5.1 PipeOpLearner
The PipeOpLearner is constructed using an mlr3 Learner and will use it to create a Prediction in the $predict() phase.
The output during $train() is NULL.
It can be used after a preprocessing pipeline, and it is even possible to perform operations on the Prediction, for example by averaging multiple predictions or by using the “PipeOpBackupLearner” operator to impute predictions that a given model failed to create.
The following is a very simple Graph that performs training and prediction on data after performing principal component analysis.
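The construction of this Graph is not shown in the section. Judging from the output channel name classif.rpart.output below, it plausibly combines a pca PipeOp with a PipeOpLearner around classif.rpart; the sketch below is written under that assumption.

```r
library(mlr3)
library(mlr3pipelines)

task = as_task_classif(iris, target = "Species")

# PCA preprocessing followed by a decision tree wrapped in a PipeOpLearner
gr = po("pca") %>>% po("learner", lrn("classif.rpart"))
```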
gr$train(task)
## $classif.rpart.output
## NULL
gr$predict(task)
## $classif.rpart.output
## <PredictionClassif> for 150 observations:
## row_ids truth response
## 1 setosa setosa
## 2 setosa setosa
## 3 setosa setosa
## 
## 148 virginica virginica
## 149 virginica virginica
## 150 virginica virginica
5.7.5.2 GraphLearner
Although a Graph has $train() and $predict() functions, it cannot be used directly in places where mlr3 Learners can be used, like resampling or benchmarks.
For this, it needs to be wrapped in a GraphLearner object, which is a thin wrapper that enables this functionality.
The resulting Learner is extremely versatile, because every part of it can be modified, replaced, parameterized and optimized over.
Resampling the graph above can be done the same way that resampling of the Learner was performed in the introductory example.
lrngrph = as_learner(gr)
resample(task, lrngrph, rsmp)
## <ResampleResult> of 1 iterations
## * Task: iris
## * Learner: pca.classif.rpart
## * Warnings: 0 in 0 iterations
## * Errors: 0 in 0 iterations
5.7.6 Hyperparameters
mlr3pipelines relies on the paradox package to provide parameters that can modify each PipeOp’s behavior.
paradox parameters provide information about the parameters that can be changed, as well as their types and ranges.
They provide a unified interface for benchmarks and parameter optimization (“tuning”).
For a deep dive into paradox, see the tuning chapter or the in-depth paradox chapter.
The ParamSet, representing the space of possible parameter configurations of a PipeOp, can be inspected by accessing the $param_set slot of a PipeOp or a Graph.
op_pca = po("pca")
op_pca$param_set
## <ParamSet:pca>
## id class lower upper nlevels default value
## 1: center ParamLgl NA NA 2 TRUE
## 2: scale. ParamLgl NA NA 2 FALSE
## 3: rank. ParamInt 1 Inf Inf
## 4: affect_columns ParamUty NA NA Inf <Selector[1]>
To set or retrieve a parameter, the $param_set$values slot can be accessed.
Alternatively, parameter values can be given during construction, either through the param_vals argument or as named arguments to po().
op_pca$param_set$values$center = FALSE
op_pca$param_set$values
## $center
## [1] FALSE
op_pca = po("pca", center = TRUE)
op_pca$param_set$values
## $center
## [1] TRUE
Each PipeOp can bring its own individual parameters which are collected together in the Graph’s $param_set.
A PipeOp’s parameter names are prefixed with its $id to prevent parameter name clashes.
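The Graph whose parameter set is printed next is not constructed in this section. Based on the pca. and scale. prefixes in the output, a sketch of it might be the following, assuming op_pca is the PCA operator created above with center = TRUE.

```r
library(mlr3)
library(mlr3pipelines)

op_pca = po("pca", center = TRUE)
gr = op_pca %>>% po("scale")

# Parameter ids are prefixed with the id of the PipeOp they belong to
gr$param_set
```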
## <ParamSetCollection>
## id class lower upper nlevels default value
## 1: pca.center ParamLgl NA NA 2 TRUE TRUE
## 2: pca.scale. ParamLgl NA NA 2 FALSE
## 3: pca.rank. ParamInt 1 Inf Inf
## 4: pca.affect_columns ParamUty NA NA Inf <Selector[1]>
## 5: scale.center ParamLgl NA NA 2 TRUE
## 6: scale.scale ParamLgl NA NA 2 TRUE
## 7: scale.robust ParamLgl NA NA 2 <NoDefault[3]> FALSE
## 8: scale.affect_columns ParamUty NA NA Inf <Selector[1]>
gr$param_set$values
## $pca.center
## [1] TRUE
##
## $scale.robust
## [1] FALSE
Both PipeOpLearner and GraphLearner preserve parameters of the objects they encapsulate.
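The op_rpart object used below is not defined in this section. A minimal sketch, assuming it is a PipeOpLearner wrapping classif.rpart as the printed ParamSet suggests:

```r
library(mlr3)
library(mlr3pipelines)

# Wrap the learner in a PipeOp; its parameters stay accessible
op_rpart = po("learner", lrn("classif.rpart"))
op_rpart$param_set
```

Inside the PipeOp the parameter ids are unprefixed (for example cp); the prefix classif.rpart. only appears once the PipeOp is part of a Graph or GraphLearner.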
## <ParamSet:classif.rpart>
## id class lower upper nlevels default value
## 1: cp ParamDbl 0 1 Inf 0.01
## 2: keep_model ParamLgl NA NA 2 FALSE
## 3: maxcompete ParamInt 0 Inf Inf 4
## 4: maxdepth ParamInt 1 30 30 30
## 5: maxsurrogate ParamInt 0 Inf Inf 5
## 6: minbucket ParamInt 1 Inf Inf <NoDefault[3]>
## 7: minsplit ParamInt 1 Inf Inf 20
## 8: surrogatestyle ParamInt 0 1 2 0
## 9: usesurrogate ParamInt 0 2 3 2
## 10: xval ParamInt 0 Inf Inf 10 0
glrn = as_learner(gr %>>% op_rpart)
glrn$param_set
## <ParamSetCollection>
## id class lower upper nlevels default value
## 1: pca.center ParamLgl NA NA 2 TRUE TRUE
## 2: pca.scale. ParamLgl NA NA 2 FALSE
## 3: pca.rank. ParamInt 1 Inf Inf
## 4: pca.affect_columns ParamUty NA NA Inf <Selector[1]>
## 5: scale.center ParamLgl NA NA 2 TRUE
## 6: scale.scale ParamLgl NA NA 2 TRUE
## 7: scale.robust ParamLgl NA NA 2 <NoDefault[3]> FALSE
## 8: scale.affect_columns ParamUty NA NA Inf <Selector[1]>
## 9: classif.rpart.cp ParamDbl 0 1 Inf 0.01
## 10: classif.rpart.keep_model ParamLgl NA NA 2 FALSE
## 11: classif.rpart.maxcompete ParamInt 0 Inf Inf 4
## 12: classif.rpart.maxdepth ParamInt 1 30 30 30
## 13: classif.rpart.maxsurrogate ParamInt 0 Inf Inf 5
## 14: classif.rpart.minbucket ParamInt 1 Inf Inf <NoDefault[3]>
## 15: classif.rpart.minsplit ParamInt 1 Inf Inf 20
## 16: classif.rpart.surrogatestyle ParamInt 0 1 2 0
## 17: classif.rpart.usesurrogate ParamInt 0 2 3 2
## 18: classif.rpart.xval ParamInt 0 Inf Inf 10 0