7 Sequential Pipelines
Martin Binder
Ludwig-Maximilians-Universität München, and Munich Center for Machine Learning (MCML)
Florian Pfisterer
Ludwig-Maximilians-Universität München
`mlr3` aims to provide a layer of abstraction for ML practitioners, allowing users to quickly swap one algorithm for another without needing expert knowledge of the underlying implementation. A unified interface for `Task`, `Learner`, and `Measure` objects means that complex benchmark and tuning experiments can be run in just a few lines of code for any off-the-shelf model, i.e., if you just want to run an experiment using the basic implementation of the underlying algorithm, we hope we have made this easy for you to do.
`mlr3pipelines` (Binder et al. 2021) takes this modularity one step further, extending it to workflows that may also include data preprocessing (Chapter 9), building ensemble models, or even more complicated meta-models. `mlr3pipelines` makes it possible to build individual steps within a `Learner` out of building blocks, which inherit from the `PipeOp` class. `PipeOp`s can be connected using directed edges to form a `Graph` or 'pipeline', which represents the flow of data between operations. During model training, the `PipeOp`s in a `Graph` transform a given `Task` and subsequent `PipeOp`s receive the transformed `Task` as input. As well as transforming data, `PipeOp`s generate a state, which is used to inform the `PipeOp`'s operation during prediction, similar to how learners learn and store model parameters/weights during training that go on to inform model prediction. This is visualized in Figure 7.1 using the "Scaling" `PipeOp`, which scales features during training and saves the scaling factors as a state to be used in predictions.
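To make this train-state-predict pattern concrete before we discuss it in detail, here is a minimal sketch using the scaling `PipeOp`; the choice of `tsk("mtcars")` (a task without missing values) is ours, and the exact element names inside `$state` differ between `PipeOp`s, which is why we only inspect `names()`:

library(mlr3)
library(mlr3pipelines)
po_scale = po("scale")
# $train() takes and returns a list of Tasks; training also fills $state
po_scale$train(list(tsk("mtcars")))
# the stored state is what $predict() will later apply to new data
names(po_scale$state)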
We refer to pipelines as either sequential or non-sequential. These terms should not be confused with "sequential" and "parallel" processing. In the context of pipelines, "sequential" refers to the movement of data through the pipeline from one `PipeOp` directly to the next, from start to finish. Sequential pipelines can be visualized in a straight line, as we will see in this chapter. In contrast, non-sequential pipelines see data being processed through `PipeOp`s that may have multiple inputs and/or outputs. Non-sequential pipelines are characterized by multiple branches, so data may be processed by different `PipeOp`s at different times. Visually, non-sequential pipelines will not be a straight line from start to finish, but a more complex graph. In this chapter, we will look at sequential pipelines; in the next, we will focus on non-sequential pipelines.
7.1 PipeOp: Pipeline Operators
The basic class of `mlr3pipelines` is the `PipeOp`, short for "pipeline operator". It represents a transformative operation on an input (for example, a training `Task`), resulting in some output. Similarly to a learner, it includes a `$train()` and a `$predict()` method. The training phase typically generates a particular model of the data, which is saved as the internal state. In the prediction phase, the `PipeOp` acts on the prediction `Task` using information from the saved state. Therefore, just like a learner, a `PipeOp` has "parameters" (i.e., the state) that are trained. As well as 'parameters', `PipeOp`s also have hyperparameters that can be set by the user when constructing the `PipeOp` or by accessing its `$param_set`. As with other classes, `PipeOp`s can be constructed with a sugar function, `po()`, or `pos()` for multiple `PipeOp`s, and all available `PipeOp`s are made available in the dictionary `mlr_pipeops`. An up-to-date list of `PipeOp`s contained in `mlr3pipelines` with links to their documentation can be found at https://mlr-org.com/pipeops.html; a small subset of these is printed below. If you want to extend `mlr3pipelines` with a `PipeOp` that has not been implemented, have a look at our vignette on extending `PipeOp`s by running `vignette("extending", package = "mlr3pipelines")`.
as.data.table(po())[1:6, 1:2]
key label
1: adas ADAS Balancing
2: blsmote BLSMOTE Balancing
3: boxcox Box-Cox Transformation of Numeric Features
4: branch Path Branching
5: chunk Chunk Input into Multiple Outputs
6: classbalancing Class Balancing
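For instance, a brief sketch of the constructors just mentioned; each line prints the constructed object(s) when run at the console:

po("pca")                # a single PipeOp via the sugar function
pos(c("pca", "scale"))   # a list of several PipeOps at once
mlr_pipeops$get("pca")   # direct lookup in the dictionary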
Let us now take a look at a `PipeOp` in practice using principal component analysis (PCA) as an example, which is implemented in `PipeOpPCA`. Below we construct the `PipeOp` using its ID `"pca"` and inspect it.
library(mlr3pipelines)
po_pca = po("pca", center = TRUE)
po_pca
PipeOp: <pca> (not trained)
values: <center=TRUE>
Input channels <name [train type, predict type]>:
input [Task,Task]
Output channels <name [train type, predict type]>:
output [Task,Task]
On printing, we can see that the `PipeOp` has not been trained and that we have changed some of the hyperparameters from their default values. The `Input channels` and `Output channels` lines provide information about the input and output types of this `PipeOp`. The PCA `PipeOp` takes one input (named "input") of type "`Task`", both during training and prediction ("`input [Task,Task]`"), and produces one called "output" that is also of type "`Task`" in both phases ("`output [Task,Task]`"). This highlights a key difference from the `Learner` class: `PipeOp`s can return results after the training phase.
A `PipeOp` can be trained using `$train()`, which can have multiple inputs and outputs. Both inputs and outputs are passed as elements in a single `list`. The `"pca"` `PipeOp` takes as input the original task and after training returns the task with features replaced by their principal components.
tsk_small = tsk("penguins_simple")$select(c("bill_depth", "bill_length"))
poin = list(tsk_small$clone()$filter(1:5))
poout = po_pca$train(poin) # poin: Task in a list
poout # list with a single element 'output'
$output
<TaskClassif:penguins> (5 x 3): Simplified Palmer Penguins
* Target: species
* Properties: multiclass
* Features (2):
- dbl (2): PC1, PC2
poout[[1]]$head()
species PC1 PC2
1: Adelie 0.1561 0.005716
2: Adelie 1.2677 0.789534
3: Adelie 1.5336 -0.174460
4: Adelie -2.1096 0.998977
5: Adelie -0.8478 -1.619768
During training, PCA transforms incoming data by rotating it in such a way that features become uncorrelated and are ordered by their contribution to the total variance. The rotation matrix is also saved in the internal `$state` field during training (shown in Figure 7.1), which is then used during prediction and applied to new data.
po_pca$state
Standard deviations (1, .., p=2):
[1] 1.513 1.034
Rotation (n x k) = (2 x 2):
PC1 PC2
bill_depth -0.6116 -0.7911
bill_length 0.7911 -0.6116
Once trained, the `$predict()` function can then access the saved state to operate on the test data, which again is passed as a `list`:
tsk_onepenguin = tsk_small$clone()$filter(42)
poin = list(tsk_onepenguin)
poout = po_pca$predict(poin)
poout[[1]]$data()
species PC1 PC2
1: Adelie 1.555 -1.455
7.2 Graph: Networks of PipeOps
`PipeOp`s represent individual computational steps in machine learning pipelines. These pipelines themselves are defined by `Graph` objects. A `Graph` is a collection of `PipeOp`s with "edges" that guide the flow of data.
The most convenient way of building a `Graph` is to connect a sequence of `PipeOp`s using the `%>>%` operator (read "double-arrow"). When given two `PipeOp`s, this operator creates a `Graph` that first executes the left-hand `PipeOp`, followed by the right-hand one. It can also be used to connect a `Graph` with a `PipeOp`, or with another `Graph`. The following example uses `po("mutate")` to add a new feature to the task, and `po("scale")` to then scale and center all numeric features.
po_mutate = po("mutate",
mutation = list(bill_ratio = ~bill_length / bill_depth)
)
po_scale = po("scale")
graph = po_mutate %>>% po_scale
graph
Graph with 2 PipeOps:
ID State sccssors prdcssors
mutate <<UNTRAINED>> scale
scale <<UNTRAINED>> mutate
The output provides information about the layout of the `Graph`. For each `PipeOp` (`ID`), we can see information about the state (`State`), as well as a list of its successors (`sccssors`), which are `PipeOp`s that come directly after the given `PipeOp`, and its predecessors (`prdcssors`), the `PipeOp`s that are connected to its input. In this simple `Graph`, the output of the `"mutate"` `PipeOp` is passed directly to the `"scale"` `PipeOp`, and neither takes inputs from nor sends outputs to any other `PipeOp`s. The `$plot()` method can be used to visualize the graph.
graph$plot(horizontal = TRUE)
The plot demonstrates how a `Graph` is simply a collection of `PipeOp`s that are connected by 'edges'. The collection of `PipeOp`s inside a `Graph` can be accessed through the `$pipeops` field. The `$edges` field can be used to access edges, which returns a `data.table` listing the "source" (`src_id`, `src_channel`) and "destination" (`dst_id`, `dst_channel`) of data flowing along each edge.
graph$pipeops
$mutate
PipeOp: <mutate> (not trained)
values: <mutation=<list>, delete_originals=FALSE>
Input channels <name [train type, predict type]>:
input [Task,Task]
Output channels <name [train type, predict type]>:
output [Task,Task]
$scale
PipeOp: <scale> (not trained)
values: <robust=FALSE>
Input channels <name [train type, predict type]>:
input [Task,Task]
Output channels <name [train type, predict type]>:
output [Task,Task]
graph$edges
src_id src_channel dst_id dst_channel
1: mutate output scale input
Instead of using `%>>%`, you can also create a `Graph` explicitly, using the `$add_pipeop()` and `$add_edge()` methods to add `PipeOp`s and the edges connecting them:
graph = Graph$new()$
add_pipeop(po_mutate)$
add_pipeop(po_scale)$
add_edge("mutate", "scale")
The `Graph` class represents an object similar to a directed acyclic graph (DAG), since the input of a `PipeOp` cannot depend on its output and hence cycles are not allowed. However, the resemblance to a DAG is not perfect, since the `Graph` class allows for multiple edges between nodes. A term such as "directed acyclic multigraph" would be more accurate, but we use "graph" for simplicity.
Once built, a `Graph` can be used by calling `$train()` and `$predict()` as if it were a `Learner` (though it still outputs a `list` during training and prediction):
result = graph$train(tsk_small)
result
$scale.output
<TaskClassif:penguins> (333 x 4): Simplified Palmer Penguins
* Target: species
* Properties: multiclass
* Features (3):
- dbl (3): bill_depth, bill_length, bill_ratio
result[[1]]$data()[1:3]
species bill_depth bill_length bill_ratio
1: Adelie 0.7796 -0.8947 -1.0421
2: Adelie 0.1194 -0.8216 -0.6804
3: Adelie 0.4241 -0.6753 -0.7435
result = graph$predict(tsk_onepenguin)
result[[1]]$head()
species bill_depth bill_length bill_ratio
1: Adelie 0.9319 -0.529 -0.8963
7.3 Sequential Learner-Pipelines
Possibly the most common application for `mlr3pipelines` is to use it to perform preprocessing tasks, such as missing value imputation or factor encoding, and to then feed the resulting data into a `Learner`; we will see more of this in practice in Chapter 9. A `Graph` representing this workflow manipulates data and fits a `Learner` model during training, ensuring that the data is processed the same way during the prediction stage. Conceptually, the process may look as shown in Figure 7.3.
7.3.1 Learners as PipeOps and Graphs as Learners
In Figure 7.3 the final `PipeOp` is a `Learner`. `Learner` objects can be converted to `PipeOp`s with `as_pipeop()`; however, this is only necessary if you choose to create a graph manually instead of using `%>>%`. With either method, internally `Learner`s are passed to `po("learner")`. The following code creates a `Graph` that uses `po("imputesample")` to impute missing values by sampling from observed values (Section 9.3), then fits a logistic regression on the transformed task:
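lrn_logreg = lrn("classif.log_reg")
# impute missing values by sampling, then fit the logistic regression
graph = po("imputesample") %>>% lrn_logreg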
We have seen how training and predicting with `Graph`s is possible but has a slightly different design to `Learner` objects, i.e., inputs and outputs during both training and predicting are `list` objects. To use a `Graph` as a `Learner` with an identical interface, it can be wrapped in a `GraphLearner` object with `as_learner()`. The `Graph` can then be used like any other `Learner`, so now we can benchmark our pipeline to decide whether we should impute by sampling or with the mode of observed values (`po("imputemode")`):
glrn_sample = as_learner(graph)
glrn_mode = as_learner(po("imputemode") %>>% lrn_logreg)
design = benchmark_grid(tsk("pima"), list(glrn_sample, glrn_mode),
rsmp("cv", folds = 3))
bmr = benchmark(design)
aggr = bmr$aggregate()[, .(learner_id, classif.ce)]
aggr
learner_id classif.ce
1: imputesample.classif.log_reg 0.2357
2: imputemode.classif.log_reg 0.2396
In this example, we can see that the sampling imputation method worked slightly better, although the difference is likely not significant.
In this book, we always use `as_learner()` to convert a `Graph` to a `Learner` explicitly for clarity. While this conversion is necessary when you want to use `Learner`-specific functions like `$predict_newdata()`, built-in `mlr3` methods like `resample()` and `benchmark_grid()` will make this conversion automatically, so it is not strictly needed there. In the above example, it is therefore also possible to pass the `Graph` objects to `benchmark_grid()` directly, as sketched below.
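A minimal sketch of this shortcut, reusing the objects created above; `benchmark_grid()` should wrap each `Graph` in a `GraphLearner` internally:

design = benchmark_grid(tsk("pima"),
  list(graph, po("imputemode") %>>% lrn_logreg),
  rsmp("cv", folds = 3))
bmr = benchmark(design)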
7.3.2 Inspecting Graphs
You may want to inspect pipelines and the flow of data to learn more about your pipeline or to debug it. We first need to set the `$keep_results` flag to `TRUE` so that intermediate results are retained; this flag is turned off by default to save memory.
glrn_sample$graph_model$keep_results = TRUE
glrn_sample$train(tsk("pima"))
The `Graph` can be accessed through the `$graph_model` field, and then `PipeOp`s can be accessed with `$pipeops` as before. In this example, we can see that our `Task` no longer has missing data after training the `"imputesample"` `PipeOp`. This can be used to access arbitrary intermediate results:
imputesample_output = glrn_sample$graph_model$pipeops$imputesample$.result
imputesample_output[[1]]$missings()
diabetes age pedigree pregnant glucose insulin mass pressure
0 0 0 0 0 0 0 0
triceps
0
We could also use `$pipeops` to access our underlying `Learner`; note that we need to use `$learner_model` to get the learner from the `PipeOpLearner`. We could use a similar method to peek at the state of any `PipeOp` in the graph:
pipeop_logreg = glrn_sample$graph_model$pipeops$classif.log_reg
learner_logreg = pipeop_logreg$learner_model
learner_logreg
<LearnerClassifLogReg:classif.log_reg>: Logistic Regression
* Model: glm
* Parameters: list()
* Packages: mlr3, mlr3learners, stats
* Predict Types: [response], prob
* Feature Types: logical, integer, numeric, character, factor,
ordered
* Properties: loglik, twoclass
In this example we could have used `glrn_sample$base_learner()` to access our trained learner immediately; however, this does not generalize to more complex pipelines that may contain multiple learners.
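Both access paths thus point at the same trained learner for our pipeline; compare:

# general: via the PipeOpLearner inside the trained graph
glrn_sample$graph_model$pipeops$classif.log_reg$learner_model
# shortcut, valid here because the pipeline contains exactly one learner
glrn_sample$base_learner()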
7.3.3 Configuring Pipeline Hyperparameters
`PipeOp` hyperparameters are collected together in the `$param_set` of a graph and prefixed with the ID of the `PipeOp` to avoid parameter name clashes. Below we use the same `PipeOp` twice but set the `id` to ensure their IDs are unique.
graph = po("scale", center = FALSE, scale = TRUE, id = "scale") %>>%
po("scale", center = TRUE, scale = FALSE, id = "center") %>>%
lrn("classif.rpart", cp = 1)
unlist(graph$param_set$values)
scale.center scale.scale scale.robust
0 1 0
center.center center.scale center.robust
1 0 0
classif.rpart.cp classif.rpart.xval
1 0
If you need to change the ID of a `PipeOp` in a `Graph`, use the `$set_names()` method from the `Graph` class, e.g., `some_graph$set_names(old = "old_name", new = "new_name")`. Do not change the ID of a `PipeOp` through `graph$pipeops$<old_id>$id = <new_id>`, as this will only alter the `PipeOp`'s record of its own ID, and not the `Graph`'s record, which will lead to errors.
Whether a pipeline is treated as a `Graph` or a `GraphLearner`, hyperparameters are updated and accessed in the same way:
graph$param_set$values$classif.rpart.maxdepth = 5
graph_learner = as_learner(graph)
graph_learner$param_set$values$classif.rpart.minsplit = 2
unlist(graph_learner$param_set$values)
scale.center scale.scale scale.robust
0 1 0
center.center center.scale center.robust
1 0 0
classif.rpart.cp classif.rpart.maxdepth classif.rpart.minsplit
1 5 2
classif.rpart.xval
0
7.4 Conclusion
In this chapter, we introduced `mlr3pipelines` and its building blocks: `Graph` and `PipeOp`. We saw how to create pipelines as `Graph` objects from multiple `PipeOp` objects and how to access `PipeOp`s from a `Graph`. We also saw how to treat a `Learner` as a `PipeOp` and how to treat a `Graph` as a `Learner`. In Chapter 8 we will take this functionality a step further and look at pipelines where `PipeOp`s are not executed sequentially, as well as looking at how you can use `mlr3tuning` to tune pipelines. A lot of practical examples that use sequential pipelines can be found in Chapter 9, where we look at pipelines for data preprocessing.
| Class | Constructor/Function | Fields/Methods |
| --- | --- | --- |
| `PipeOp` | `po()` | `$train()`; `$predict()`; `$state`; `$id`; `$param_set` |
| `Graph` | `%>>%` | `$add_pipeop()`; `$add_edge()`; `$pipeops`; `$edges`; `$train()`; `$predict()` |
| `GraphLearner` | `as_learner()` | `$graph` |
| `PipeOpLearner` | `as_pipeop()` | `$learner_model` |
7.5 Exercises
- Create a learner containing a `Graph` that first imputes missing values using `po("imputeoor")`, standardizes the data using `po("scale")`, and then fits a logistic linear model using `lrn("classif.log_reg")`.
- Train the learner created in the previous exercise on `tsk("pima")` and display the coefficients of the resulting model. What are two different ways to access the model?
- Verify that the `"age"` column of the input task of `lrn("classif.log_reg")` from the previous exercise is indeed standardized. One way to do this would be to look at the `$data` field of the `lrn("classif.log_reg")` model; however, that is specific to that particular learner and does not work in general. What would be a different, more general way to do this? Hint: use the `$keep_results` flag.
7.6 Citation
Please cite this chapter as:
Binder M, Pfisterer F. (2024). Sequential Pipelines. In Bischl B, Sonabend R, Kotthoff L, Lang M, (Eds.), Applied Machine Learning Using mlr3 in R. CRC Press. https://mlr3book.mlr-org.com/sequential_pipelines.html.
@incollection{citekey,
author = "Martin Binder and Florian Pfisterer",
title = "Sequential Pipelines",
booktitle = "Applied Machine Learning Using {m}lr3 in {R}",
publisher = "CRC Press", year = "2024",
editor = "Bernd Bischl and Raphael Sonabend and Lars Kotthoff and Michel Lang",
url = "https://mlr3book.mlr-org.com/sequential_pipelines.html"
}