10 Appendix
10.1 Integrated Learners
Learners are available from one of the following packages:
 mlr3: debug learner and rpart learners.
 mlr3learners: opinionated selection of some default learners.
 mlr3proba: base learners for survival and probabilistic regression.
 mlr3cluster: learners for unsupervised clustering.
 mlr3extralearners: more experimental learners for regression, classification and survival.
Use the interactive search table to look through all our learners.
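As a minimal sketch of how these packages surface their learners in a session: learners of loaded packages are registered in the mlr_learners dictionary and can be constructed with lrn(). The example assumes that only mlr3 is attached, so only the learners shipped with mlr3 itself are guaranteed to appear; attaching one of the other packages listed above would add further keys.

```r
library("mlr3")

# Keys of all learners registered by the currently loaded packages
keys = mlr_learners$keys()
head(keys)

# Construct one of the learners that ships with mlr3 itself
learner = lrn("classif.rpart")

# Packages required by this learner
learner$packages
```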
10.2 Integrated Performance Measures
Also see the overview on the website of mlr3measures.
10.3 Integrated Filter Methods
10.3.1 Standalone filter methods
Id  Packages  Task Types  Feature Types
anova  stats  classif  int, dbl
auc  mlr3measures  classif  int, dbl
carscore  care  regr  dbl
cmim  praznik  classif, regr  int, dbl, fct, ord
correlation  stats  regr  int, dbl
disr  praznik  classif, regr  int, dbl, fct, ord
find_correlation  stats  classif, regr  int, dbl
importance  mlr3, rpart  classif  lgl, int, dbl, fct, ord
information_gain  FSelectorRcpp  classif, regr  int, dbl, fct, ord
jmi  praznik  classif, regr  int, dbl, fct, ord
jmim  praznik  classif, regr  int, dbl, fct, ord
kruskal_test  stats  classif  int, dbl
mim  praznik  classif, regr  int, dbl, fct, ord
mrmr  praznik  classif, regr  int, dbl, fct, ord
njmim  praznik  classif, regr  int, dbl, fct, ord
performance  mlr3, mlr3measures, rpart  classif  lgl, int, dbl, fct, ord
permutation  mlr3, mlr3measures, rpart  classif  lgl, int, dbl, fct, ord
relief  FSelectorRcpp  classif, regr  int, dbl, fct, ord
variance  stats  classif, regr  int, dbl
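To illustrate how an entry of this table is used, the following sketch scores the features of a task with the variance filter. It assumes the mlr3filters package is installed and uses the built-in iris task purely for illustration.

```r
library("mlr3")
library("mlr3filters")

task = tsk("iris")

# Construct the filter by its Id from the table above
filter = flt("variance")

# Compute one score per feature; higher scores rank first
filter$calculate(task)

# Scores as a table, sorted in decreasing order
as.data.table(filter)
```

The named vector filter$scores holds the same information and can be used to select the top features by name.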
10.4 Integrated Pipe Operators
Id  Packages  Tags  Train  Predict
boxcox  bestNormalize, mlr3pipelines  data transform  Task → Task  Task → Task
colapply  mlr3pipelines  data transform  Task → Task  Task → Task
collapsefactors  mlr3pipelines  data transform  Task → Task  Task → Task
colroles  mlr3pipelines  data transform  Task → Task  Task → Task
datefeatures  mlr3pipelines  data transform  Task → Task  Task → Task
histbin  graphics, mlr3pipelines  data transform  Task → Task  Task → Task
ica  fastICA, mlr3pipelines  data transform  Task → Task  Task → Task
kernelpca  kernlab, mlr3pipelines  data transform  Task → Task  Task → Task
modelmatrix  mlr3pipelines, stats  data transform  Task → Task  Task → Task
mutate  mlr3pipelines  data transform  Task → Task  Task → Task
nmf  MASS, mlr3pipelines, NMF  data transform  Task → Task  Task → Task
pca  mlr3pipelines  data transform  Task → Task  Task → Task
quantilebin  mlr3pipelines, stats  data transform  Task → Task  Task → Task
randomprojection  mlr3pipelines  data transform  Task → Task  Task → Task
renamecolumns  mlr3pipelines  data transform  Task → Task  Task → Task
scale  mlr3pipelines  data transform  Task → Task  Task → Task
scalemaxabs  mlr3pipelines  data transform  Task → Task  Task → Task
scalerange  mlr3pipelines  data transform  Task → Task  Task → Task
spatialsign  mlr3pipelines  data transform  Task → Task  Task → Task
subsample  mlr3pipelines  data transform  Task → Task  Task → Task
textvectorizer  mlr3pipelines, quanteda, stopwords  data transform  Task → Task  Task → Task
yeojohnson  bestNormalize, mlr3pipelines  data transform  Task → Task  Task → Task
encode  mlr3pipelines, stats  encode, data transform  Task → Task  Task → Task
encodeimpact  mlr3pipelines  encode, data transform  Task → Task  Task → Task
encodelmer  lme4, mlr3pipelines, nloptr  encode, data transform  Task → Task  Task → Task
vtreat  mlr3pipelines, vtreat  encode, missings, data transform  Task → Task  Task → Task
classifavg  mlr3pipelines, stats  ensemble  NULL → NULL  PredictionClassif → PredictionClassif
featureunion  mlr3pipelines  ensemble  Task → Task  Task → Task
regravg  mlr3pipelines  ensemble  NULL → NULL  PredictionRegr → PredictionRegr
survavg  mlr3pipelines  ensemble  NULL → NULL  PredictionSurv → PredictionSurv
filter  mlr3pipelines  feature selection, data transform  Task → Task  Task → Task
select  mlr3pipelines  feature selection, data transform  Task → Task  Task → Task
classbalancing  mlr3pipelines  imbalanced data, data transform  TaskClassif → TaskClassif  TaskClassif → TaskClassif
classweights  mlr3pipelines  imbalanced data, data transform  TaskClassif → TaskClassif  TaskClassif → TaskClassif
smote  mlr3pipelines, smotefamily  imbalanced data, data transform  Task → Task  Task → Task
learner  mlr3pipelines  learner  TaskClassif → NULL  TaskClassif → PredictionClassif
learner_cv  mlr3pipelines  learner, ensemble, data transform  TaskClassif → TaskClassif  TaskClassif → TaskClassif
imputeconstant  mlr3pipelines  missings  Task → Task  Task → Task
imputehist  graphics, mlr3pipelines  missings  Task → Task  Task → Task
imputelearner  mlr3pipelines  missings  Task → Task  Task → Task
imputemean  mlr3pipelines  missings  Task → Task  Task → Task
imputemedian  mlr3pipelines, stats  missings  Task → Task  Task → Task
imputemode  mlr3pipelines  missings  Task → Task  Task → Task
imputeoor  mlr3pipelines  missings  Task → Task  Task → Task
imputesample  mlr3pipelines  missings  Task → Task  Task → Task
missind  mlr3pipelines  missings, data transform  Task → Task  Task → Task
multiplicityexply  mlr3pipelines  multiplicity  [*] → *  [*] → *
multiplicityimply  mlr3pipelines  multiplicity  * → [*]  * → [*]
ovrunite  mlr3pipelines  multiplicity, ensemble  [NULL] → NULL  [PredictionClassif] → PredictionClassif
replicate  mlr3pipelines  multiplicity  * → [*]  * → [*]
fixfactors  mlr3pipelines  robustify, data transform  Task → Task  Task → Task
removeconstants  mlr3pipelines  robustify, data transform  Task → Task  Task → Task
ovrsplit  mlr3pipelines  target transform, multiplicity  TaskClassif → [TaskClassif]  TaskClassif → [TaskClassif]
targetmutate  mlr3pipelines  target transform  Task → NULL, Task  Task → function, Task
targettrafoscalerange  mlr3pipelines  target transform  TaskRegr → NULL, TaskRegr  TaskRegr → function, TaskRegr
threshold  mlr3pipelines  target transform  NULL → NULL  PredictionClassif → PredictionClassif
tunethreshold  bbotk, mlr3pipelines  target transform  Task → NULL  Task → Prediction
branch  mlr3pipelines  meta  * → *  * → *
chunk  mlr3pipelines  meta  Task → Task  Task → Task
copy  mlr3pipelines  meta  * → *  * → *
nop  mlr3pipelines  meta  * → *  * → *
proxy  mlr3pipelines  meta  * → *  * → *
unbranch  mlr3pipelines  meta  * → *  * → *
compose_crank  distr6, mlr3pipelines  abstract  NULL → NULL  PredictionSurv → PredictionSurv
compose_distr  distr6, mlr3pipelines  abstract  NULL, NULL → NULL  PredictionSurv, PredictionSurv → PredictionSurv
compose_probregr  distr6, mlr3pipelines  abstract  NULL, NULL → NULL  PredictionRegr, PredictionRegr → PredictionRegr
crankcompose  distr6, mlr3pipelines  abstract  NULL → NULL  PredictionSurv → PredictionSurv
distrcompose  distr6, mlr3pipelines  abstract  NULL, NULL → NULL  PredictionSurv, PredictionSurv → PredictionSurv
randomresponse  mlr3pipelines  abstract  NULL → NULL  Prediction → Prediction
targetinvert  mlr3pipelines  abstract  NULL, NULL → NULL  function, Prediction → Prediction
trafopred_regrsurv  mlr3pipelines  abstract  NULL, NULL → NULL  PredictionRegr, * → PredictionSurv
trafopred_survregr  mlr3pipelines  abstract  NULL → NULL  PredictionSurv → PredictionRegr
trafotask_regrsurv  mlr3pipelines  abstract  TaskRegr, * → TaskSurv  TaskRegr, * → TaskSurv
trafotask_survregr  mlr3pipelines  abstract  TaskSurv, * → TaskRegr  TaskSurv, * → TaskRegr
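The Train and Predict columns describe what a PipeOp accepts and returns in each phase. As a small sketch, the same information can be read from the $input and $output fields of a constructed PipeOp; the learner PipeOp with classif.rpart is chosen here only as an illustration.

```r
library("mlr3")
library("mlr3pipelines")

op = po("learner", lrn("classif.rpart"))

# Types accepted during training and prediction
op$input

# Types returned during training and prediction
op$output
```

For this PipeOp the output is NULL during training and a PredictionClassif during prediction, which corresponds to the learner row of the table.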
10.5 Framework Comparison
Below, we collect examples that compare mlr3pipelines to other software packages, namely mlr, recipes and sklearn.
Before diving deeper, we give a short introduction to PipeOps.
10.5.1 An introduction to “PipeOp”s
In this example, we create a linear Pipeline. After scaling all input features, we rotate our data using principal component analysis. After this transformation, we use a simple Decision Tree learner for classification.
As exemplary data, we will use the “iris” classification task.
This object contains the famous iris dataset and some metainformation, such as the target variable.
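The code that creates the task is not shown in this text. A minimal sketch, assuming the task is retrieved from mlr3's built-in task dictionary and stored under the name task used in the code that follows:

```r
library("mlr3")

# Assumption: the task comes from mlr3's dictionary of example tasks
task = tsk("iris")

# Meta-information stored alongside the data
task$target_names
task$feature_names
task$nrow
```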
We quickly split our data into a train and a test set:
test.idx = sample(seq_len(task$nrow), 30)
train.idx = setdiff(seq_len(task$nrow), test.idx)
# Set task to only use train indexes
task$row_roles$use = train.idx
A Pipeline (or Graph) contains multiple pipeline operators (“PipeOp”s), where each PipeOp transforms the data when it flows through it.
For this use case, we require 3 transformations:
 A PipeOp that scales the data
 A PipeOp that performs PCA
 A PipeOp that contains the Decision Tree learner
A list of available PipeOps can be obtained from the mlr_pipeops dictionary:
## <DictionaryPipeOp> with 64 stored values
## Keys: boxcox, branch, chunk, classbalancing, classifavg, classweights,
## colapply, collapsefactors, colroles, copy, datefeatures, encode,
## encodeimpact, encodelmer, featureunion, filter, fixfactors, histbin,
## ica, imputeconstant, imputehist, imputelearner, imputemean,
## imputemedian, imputemode, imputeoor, imputesample, kernelpca,
## learner, learner_cv, missind, modelmatrix, multiplicityexply,
## multiplicityimply, mutate, nmf, nop, ovrsplit, ovrunite, pca, proxy,
## quantilebin, randomprojection, randomresponse, regravg,
## removeconstants, renamecolumns, replicate, scale, scalemaxabs,
## scalerange, select, smote, spatialsign, subsample, targetinvert,
## targetmutate, targettrafoscalerange, textvectorizer, threshold,
## tunethreshold, unbranch, vtreat, yeojohnson
First we define the required PipeOps:
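The definitions themselves are not reproduced in this text. A minimal sketch of what they could look like, using the keys scale, pca and learner from the dictionary listing; the variable names op_scale, op_pca and op_learner are illustrative choices, not names taken from the original:

```r
library("mlr3")
library("mlr3pipelines")

# Illustrative names; each PipeOp is constructed from its dictionary key
op_scale = po("scale")
op_pca = po("pca")
op_learner = po("learner", lrn("classif.rpart"))
```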
10.5.1.1 A quick glance into a PipeOp
In order to get a better understanding of what the respective PipeOps do, we quickly look at one of them in detail:
The most important slots in a PipeOp are:

 $train(): A function used to train the PipeOp.
 $predict(): A function used to predict with the PipeOp.
The $train() and $predict() functions define the core functionality of our PipeOp.
In many cases, in order to not leak information from the training set into the test set, it is imperative to treat train and test data separately.
For this we require a $train() function that learns the appropriate transformations from the training set and a $predict() function that applies the transformation on future data.
In the case of PipeOpPCA this means the following:
 $train() learns a rotation matrix from its input and saves this matrix to an additional slot, $state. It returns the rotated input data stored in a new Task.
 $predict() uses the rotation matrix stored in $state in order to rotate future, unseen data. It returns those in a new Task.
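The two phases described above can be sketched by calling a PCA PipeOp directly. This is a hedged illustration rather than the original example: the alternating row split and the names train_ids, test_ids, train_out and test_out are assumptions made for the sketch.

```r
library("mlr3")
library("mlr3pipelines")

task = tsk("iris")

# Illustrative split: odd rows for training, even rows for prediction
train_ids = seq(1, task$nrow, by = 2)
test_ids = setdiff(seq_len(task$nrow), train_ids)

op_pca = po("pca")

# $train() takes a list of inputs and returns a list of outputs
train_out = op_pca$train(list(task$clone()$filter(train_ids)))[[1]]

# The rotation matrix learned on the training rows is kept in $state
op_pca$state$rotation

# $predict() applies the stored rotation to unseen rows
test_out = op_pca$predict(list(task$clone()$filter(test_ids)))[[1]]
test_out$feature_names
```

Both calls return a new Task whose features are the principal components, while the rotation is estimated from the training rows only.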
10.5.1.2 Constructing the Pipeline
We can now connect the PipeOps constructed earlier to a Pipeline.
We can do this using the %>>% operator.
The result of this operation is a “Graph”.
A Graph connects the input and output of each PipeOp to the following PipeOp.
This allows us to specify linear processing pipelines.
In this case, we connect the output of the scaling PipeOp to the input of the PCA PipeOp and the output of the PCA PipeOp to the input of PipeOpLearner.
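The construction step is not shown in this text. A minimal sketch that builds the Graph under the name linear_pipeline used in the training call; the PipeOps are created inline here so that the block runs on its own:

```r
library("mlr3")
library("mlr3pipelines")

# Chain scaling, PCA and the decision tree learner into a linear Graph
linear_pipeline = po("scale") %>>%
  po("pca") %>>%
  po("learner", lrn("classif.rpart"))

# Ids of the PipeOps contained in the Graph
linear_pipeline$ids()
```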
We can now train the Graph using the iris Task.
linear_pipeline$train(task)
## $classif.rpart.output
## NULL
When we now train the graph, the data flows through the graph as follows:
 The Task flows into the PipeOpScale. The PipeOp scales each column in the data contained in the Task and returns a new Task that contains the scaled data to its output.
 The scaled Task flows into the PipeOpPCA. PCA transforms the data and returns a (possibly smaller) Task that contains the transformed data.
 This transformed data then flows into the learner, in our case classif.rpart. It is then used to train the learner, and as a result saves a model that can be used to predict new data.
In order to predict on new data, we need to save the relevant transformations our data went through while training.
As a result, each PipeOp saves a state, where information required to appropriately transform future data is stored.
In our case, this is the mean and standard deviation of each column for PipeOpScale, the PCA rotation matrix for PipeOpPCA and the learned model for PipeOpLearner.
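As a sketch of where these states live after training, the snippet below rebuilds the same kind of pipeline under the illustrative name graph, trains it on the full iris task, and reads the $state of each PipeOp. Training on all rows is a simplification made for this example.

```r
library("mlr3")
library("mlr3pipelines")

graph = po("scale") %>>% po("pca") %>>% po("learner", lrn("classif.rpart"))
graph$train(tsk("iris"))

# Column means and standard deviations learned by PipeOpScale
graph$pipeops$scale$state$center
graph$pipeops$scale$state$scale

# Rotation matrix learned by PipeOpPCA
graph$pipeops$pca$state$rotation

# Fitted decision tree stored by PipeOpLearner
graph$pipeops$classif.rpart$state$model
```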
# predict on test.idx
task$row_roles$use = test.idx
linear_pipeline$predict(task)
## $classif.rpart.output
## <PredictionClassif> for 30 observations:
## row_ids truth response
## 36 setosa setosa
## 24 setosa setosa
## 70 versicolor versicolor
## 
## 34 setosa setosa
## 130 virginica versicolor
## 100 versicolor versicolor
10.5.2 mlr3pipelines vs. mlr
In order to showcase the benefits of mlr3pipelines over mlr’s Wrapper mechanism, we compare the case of imputing missing values before filtering the top 2 features and then applying a learner.
While mlr wrappers are generally less verbose and require a little less code, this heavily inhibits flexibility. As an example, wrappers can generally not process data in parallel.
10.5.2.1 mlr
library("mlr")
# We first create a learner
lrn = makeLearner("classif.rpart")
# Wrap this learner in a FilterWrapper
lrn.wrp = makeFilterWrapper(lrn, fw.abs = 2L)
# And wrap the resulting wrapped learner into an ImputeWrapper.
lrn.wrp = makeImputeWrapper(lrn.wrp, classes = list(factor = imputeConstant("missing")))
# Afterwards, we can train the resulting learner on a task
train(lrn.wrp, iris.task)
10.5.2.2 mlr3pipelines
library("mlr3")
library("mlr3pipelines")
library("mlr3filters")
impute = po("imputeoor")
filter = po("filter", filter = flt("variance"), filter.nfeat = 2L)
rpart = po("learner", lrn("classif.rpart"))
# Assemble the Pipeline
pipeline = impute %>>% filter %>>% rpart
# And convert to a 'GraphLearner'
learner = as_learner(pipeline)
The fact that mlr’s wrappers have to be applied inside-out, i.e. in the reverse order, is often confusing.
This is much more straightforward in mlr3pipelines, where we simply chain the different methods using %>>%.
Additionally, mlr3pipelines offers far greater possibilities with respect to the kinds of Pipelines that can be constructed.
In mlr3pipelines, we allow for the construction of parallel and conditional pipelines.
This was previously not possible.
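A conditional pipeline can be sketched with the branch and unbranch PipeOps from the table of pipe operators. The example is an illustration under assumptions, not part of the original comparison: the option names "pca" and "nop" and the choice of the iris task are made up for this sketch.

```r
library("mlr3")
library("mlr3pipelines")

# Two alternative paths: apply PCA, or pass the data through unchanged
graph = po("branch", options = c("pca", "nop")) %>>%
  gunion(list(po("pca"), po("nop"))) %>>%
  po("unbranch", options = c("pca", "nop")) %>>%
  po("learner", lrn("classif.rpart"))

# The active path is an ordinary hyperparameter of the Graph
graph$param_set$values$branch.selection = "nop"

learner = as_learner(graph)
learner$train(tsk("iris"))
```

Because the selected path is a hyperparameter, it can be tuned like any other parameter of the pipeline.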
10.5.3 mlr3pipelines vs. sklearn.pipeline.Pipeline
In order to broaden the horizon, we compare to Python sklearn’s Pipeline methods.
sklearn.pipeline.Pipeline sequentially applies a list of transforms before fitting a final estimator.
Intermediate steps of the pipeline are transforms, i.e. steps that can learn from the data, but also transform the data while it flows through it.
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
For this, it enables setting parameters of the various steps.
It is thus conceptually very similar to mlr3pipelines.
Similarly to mlr3pipelines, we can tune over a full Pipeline using various tuning methods.
Pipeline mainly supports linear pipelines.
This means that it can execute parallel steps, such as Bagging, but it does not support conditional execution, i.e. PipeOpBranch.
At the same time, the different transforms in the pipeline can be cached, which makes tuning over the configuration space of a Pipeline more efficient, as executing some steps multiple times can be avoided.
We compare functionality available in both mlr3pipelines and sklearn.pipeline.Pipeline.
The following example, obtained from the sklearn documentation, showcases a Pipeline that first selects a feature and performs PCA on the original data, concatenates the resulting datasets and applies a Support Vector Machine.
10.5.3.1 sklearn
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
iris = load_iris()
X, y = iris.data, iris.target
# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)
# Maybe some original features were good, too?
selection = SelectKBest(k=1)
# Build estimator from PCA and univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
svm = SVC(kernel="linear")
# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])
param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=10)
grid_search.fit(X, y)
10.5.3.2 mlr3pipelines
library("mlr3verse")
iris = tsk("iris")
# Build the steps
copy = po("copy", 2)
pca = po("pca")
selection = po("filter", filter = flt("variance"))
union = po("featureunion", 2)
svm = po("learner", lrn("classif.svm", kernel = "linear", type = "C-classification"))
# Assemble the Pipeline
pipeline = copy %>>% gunion(list(pca, selection)) %>>% union %>>% svm
learner = as_learner(pipeline)
# For tuning, we define the resampling and the Parameter Space
resampling = rsmp("cv", folds = 5L)
library("paradox")
search_space = ps(
  classif.svm.cost = p_dbl(lower = 0.1, upper = 1),
  pca.rank. = p_int(lower = 1, upper = 3),
  variance.filter.nfeat = p_int(lower = 1, upper = 2)
)
instance = TuningInstanceSingleCrit$new(
  task = iris,
  learner = learner,
  resampling = resampling,
  measure = msr("classif.ce"),
  terminator = trm("none"),
  search_space = search_space
)
tuner = tnr("grid_search", resolution = 10)
tuner$optimize(instance)
# Set the learner to the optimal values and train
learner$param_set$values = instance$result_learner_param_vals
In summary, we can achieve similar results with a comparable number of lines, while at the same time offering greater flexibility with respect to which kinds of pipelines we want to optimize over.
At the same time, experiments using mlr3 can now be arbitrarily parallelized using futures.
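A minimal sketch of such a parallelized experiment, assuming the future package is installed and two local worker processes are available. The small scale-plus-rpart pipeline and the 3-fold cross-validation are illustrative choices, not part of the comparison above.

```r
library("mlr3")
library("mlr3pipelines")

# Assumption: run resampling iterations on two local R sessions
future::plan("multisession", workers = 2)

learner = as_learner(po("scale") %>>% po("learner", lrn("classif.rpart")))
rr = resample(tsk("iris"), learner, rsmp("cv", folds = 3))

# Mean classification error over the folds
rr$aggregate(msr("classif.ce"))

# Return to sequential execution
future::plan("sequential")
```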
10.5.4 mlr3pipelines vs recipes
recipes is a new package that covers some of the same application steps as mlr3pipelines.
Both packages feature the possibility to connect different pre- and post-processing methods using a pipe operator.
As the recipes package tightly integrates with the tidymodels ecosystem, much of the functionality integrated there can be used in recipes.
We compare recipes to mlr3pipelines using an example from the recipes vignette.
The aim of the analysis is to predict whether customers pay back their loans given some information on the customers. In order to do this, we build a model that does the following:
 It first imputes missing values using k-nearest neighbors
 All factor variables are converted to numerics using dummy encoding
 The data is first centered then scaled.
In order to validate the algorithm, data is first split into a train and test set using initial_split, training, testing.
The recipe trained on the train data (see steps above) is then applied to the test data.
10.5.4.1 recipes
library("tidymodels")
library("rsample")
data("credit_data", package = "modeldata")
set.seed(55)
train_test_split = initial_split(credit_data)
credit_train = training(train_test_split)
credit_test = testing(train_test_split)
rec = recipe(Status ~ ., data = credit_train) %>%
  step_knnimpute(all_predictors()) %>%
  step_dummy(all_predictors(), -all_numeric()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())
trained_rec = prep(rec, training = credit_train)
# Apply to train and test set
train_data <- bake(trained_rec, new_data = credit_train)
test_data <- bake(trained_rec, new_data = credit_test)
Afterwards, the transformed data can be used during train and predict:
10.5.4.2 mlr3pipelines
The same analysis can be performed in mlr3pipelines.
Note that for now we do not impute via knn but instead via sampling.
library("data.table")
library("mlr3")
library("mlr3learners")
library("mlr3pipelines")
data("credit_data", package = "modeldata")
set.seed(55)
# Create the task
task = as_task_classif(credit_data, target = "Status")
# Build up the Pipeline:
g = po("imputesample", id = "impute") %>>%
  po("encode", method = "onehot") %>>%
  po("scale") %>>%
  po("learner", lrn("classif.ranger", num.trees = 200, mtry = 12))
# We can visualize what happens to the data using the `plot` function:
g$plot()
# And we can use mlr3's full functionality by wrapping the Graph into a GraphLearner.
glrn = as_learner(g)
resample(task, glrn, rsmp("holdout"))