7  Sequential Pipelines

Martin Binder
Ludwig-Maximilians-Universität München, and Munich Center for Machine Learning (MCML)

Florian Pfisterer
Ludwig-Maximilians-Universität München

mlr3 aims to provide a layer of abstraction for ML practitioners, allowing users to quickly swap one algorithm for another without needing expert knowledge of the underlying implementation. A unified interface for Task, Learner, and Measure objects means that complex benchmark and tuning experiments can be run in just a few lines of code for any off-the-shelf model, i.e., if you just want to run an experiment using the basic implementation from the underlying algorithm, we hope we have made this easy for you to do.

mlr3pipelines (Binder et al. 2021) takes this modularity one step further, extending it to workflows that may also include data preprocessing (Chapter 9), building ensemble models, or even more complicated meta-models. mlr3pipelines makes it possible to build individual steps within a Learner out of building blocks, which inherit from the PipeOp class. PipeOps can be connected using directed edges to form a Graph or ‘pipeline’, which represents the flow of data between operations. During model training, the PipeOps in a Graph transform a given Task and subsequent PipeOps receive the transformed Task as input. As well as transforming data, PipeOps generate a state, which is used to inform the PipeOp’s operation during prediction, similar to how learners learn and store model parameters/weights during training that go on to inform model prediction. This is visualized in Figure 7.1 using the “Scaling” PipeOp, which scales features during training and saves the scaling factors as a state to be used in predictions.

Figure 7.1: The $train() method of the “Scaling” PipeOp both transforms data (rectangles) as well as creates a state, which is the scaling factors necessary to transform data during prediction.

We refer to pipelines as either sequential or non-sequential. These terms should not be confused with “sequential” and “parallel” processing. In the context of pipelines, “sequential” refers to the movement of data through the pipeline from one PipeOp directly to the next from start to finish. Sequential pipelines can be visualized in a straight line – as we will see in this chapter. In contrast, non-sequential pipelines see data being processed through PipeOps that may have multiple inputs and/or outputs. Non-sequential pipelines are characterized by multiple branches so data may be processed by different PipeOps at different times. Visually, non-sequential pipelines will not be a straight line from start to finish, but a more complex graph. In this chapter, we will look at sequential pipelines and in the next we will focus on non-sequential pipelines.

7.1 PipeOp: Pipeline Operators

The basic class of mlr3pipelines is the PipeOp, short for “pipeline operator”. It represents a transformative operation on an input (for example, a training Task), resulting in some output. Similarly to a learner, it includes a $train() and a $predict() method. The training phase typically generates a particular model of the data, which is saved as the internal state. In the prediction phase, the PipeOp acts on the prediction Task using information from the saved state. Therefore, just like a learner, a PipeOp has “parameters” (i.e., the state) that are trained. As well as “parameters”, PipeOps also have hyperparameters that can be set by the user when constructing the PipeOp or by accessing its $param_set. As with other classes, PipeOps can be constructed with a sugar function, po(), or pos() for multiple PipeOps, and all are listed in the dictionary mlr_pipeops. An up-to-date list of PipeOps contained in mlr3pipelines with links to their documentation can be found at https://mlr-org.com/pipeops.html; a small subset of these is printed below. If you want to extend mlr3pipelines with a PipeOp that has not been implemented, have a look at our vignette on extending PipeOps by running: vignette("extending", package = "mlr3pipelines").

as.data.table(po())[1:6, 1:2]
              key                                      label
1:         boxcox Box-Cox Transformation of Numeric Features
2:         branch                             Path Branching
3:          chunk          Chunk Input into Multiple Outputs
4: classbalancing                            Class Balancing
5:     classifavg                   Majority Vote Prediction
6:   classweights         Class Weights for Sample Weighting
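
For illustration, a minimal sketch of the helpers mentioned above: pos() constructs several PipeOps in one call and $param_set exposes a PipeOp's hyperparameters (po_scale here is a throwaway example object).

library(mlr3pipelines)
# construct a single PipeOp by its ID and inspect its hyperparameters
po_scale = po("scale")
po_scale$param_set
# construct a list of PipeOps in one call
pos(c("scale", "pca"))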

Let us now take a look at a PipeOp in practice using principal component analysis (PCA) as an example, which is implemented in PipeOpPCA. Below we construct the PipeOp using its ID "pca" and inspect it.

library(mlr3pipelines)

po_pca = po("pca", center = TRUE)
po_pca
PipeOp: <pca> (not trained)
values: <center=TRUE>
Input channels <name [train type, predict type]>:
  input [Task,Task]
Output channels <name [train type, predict type]>:
  output [Task,Task]

On printing, we can see that the PipeOp has not been trained and that we have changed some of the hyperparameters from their default values. The Input channels and Output channels lines provide information about the input and output types of this PipeOp. The PCA PipeOp takes one input (named “input”) of type “Task”, both during training and prediction (“input [Task,Task]”), and produces one called “output” that is also of type “Task” in both phases (“output [Task,Task]”). This highlights a key difference from the Learner class: PipeOps can return results after the training phase.
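
The channel tables can also be read programmatically; a short sketch, assuming the $input and $output fields, which return these tables as data.tables:

po_pca$input   # channel name and train/predict types of the input
po_pca$output  # likewise for the output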

A PipeOp is trained by calling $train(), which can take multiple inputs and return multiple outputs; both are passed as elements in a single list. The "pca" PipeOp takes as input the original task and after training returns the task with features replaced by their principal components.

tsk_small = tsk("penguins_simple")$select(c("bill_depth", "bill_length"))
poin = list(tsk_small$clone()$filter(1:5))
poout = po_pca$train(poin) # poin: Task in a list
poout # list with a single element 'output'
$output
<TaskClassif:penguins> (5 x 3): Simplified Palmer Penguins
* Target: species
* Properties: multiclass
* Features (2):
  - dbl (2): PC1, PC2
poout[[1]]$head()
   species     PC1       PC2
1:  Adelie  0.1561  0.005716
2:  Adelie  1.2677  0.789534
3:  Adelie  1.5336 -0.174460
4:  Adelie -2.1096  0.998977
5:  Adelie -0.8478 -1.619768

During training, PCA transforms incoming data by rotating it in such a way that features become uncorrelated and are ordered by their contribution to the total variance. The rotation matrix is also saved in the internal $state field during training (shown in Figure 7.1), which is then used during predictions and applied to new data.

po_pca$state
Standard deviations (1, .., p=2):
[1] 1.513 1.034

Rotation (n x k) = (2 x 2):
                PC1     PC2
bill_depth  -0.6116 -0.7911
bill_length  0.7911 -0.6116

Once the PipeOp has been trained, $predict() can access the saved state to operate on the test data, which again is passed as a list:

tsk_onepenguin = tsk_small$clone()$filter(42)
poin = list(tsk_onepenguin)
poout = po_pca$predict(poin)
poout[[1]]$data()
   species   PC1    PC2
1:  Adelie 1.555 -1.455
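
To make the role of the state concrete, a hedged check (assuming the state stores the components returned by prcomp(), i.e., center and rotation as shown above): the prediction is simply the new data, centered with the training means and multiplied by the stored rotation matrix.

# align column order with the rotation matrix, then rotate manually
x = as.matrix(tsk_onepenguin$data(cols = rownames(po_pca$state$rotation)))
scale(x, center = po_pca$state$center, scale = FALSE) %*% po_pca$state$rotation
# should reproduce the PC1/PC2 values returned by $predict() above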

7.2 Graph: Networks of PipeOps

PipeOps represent individual computational steps in machine learning pipelines. These pipelines themselves are defined by Graph objects. A Graph is a collection of PipeOps with “edges” that guide the flow of data.

The most convenient way of building a Graph is to connect a sequence of PipeOps using the %>>% operator (read “double-arrow”). When given two PipeOps, this operator creates a Graph that first executes the left-hand PipeOp, followed by the right-hand one. It can also be used to connect a Graph with a PipeOp, or with another Graph. The following example uses po("mutate") to add a new feature to the task, and po("scale") to then scale and center all numeric features.

po_mutate = po("mutate",
  mutation = list(bill_ratio = ~bill_length / bill_depth)
)
po_scale = po("scale")
graph = po_mutate %>>% po_scale
graph
Graph with 2 PipeOps:
     ID         State sccssors prdcssors
 mutate <<UNTRAINED>>    scale          
  scale <<UNTRAINED>>             mutate

The output provides information about the layout of the Graph. For each PipeOp (ID), we can see information about the state (State), as well as a list of its successors (sccssors), which are PipeOps that come directly after the given PipeOp, and its predecessors (prdcssors), the PipeOps that are connected to its input. In this simple Graph, the output of the "mutate" PipeOp is passed directly to the "scale" PipeOp and neither takes any other inputs or outputs from other PipeOps. The $plot() method can be used to visualize the graph.

graph$plot(horizontal = TRUE)
Figure 7.2: A simple sequential pipeline: <INPUT> -> mutate -> scale -> <OUTPUT>.

The plot demonstrates how a Graph is simply a collection of PipeOps that are connected by ‘edges’. The collection of PipeOps inside a Graph can be accessed through the $pipeops field. The $edges field can be used to access edges, which returns a data.table listing the “source” (src_id, src_channel) and “destination” (dst_id, dst_channel) of data flowing along each edge.

graph$pipeops
$mutate
PipeOp: <mutate> (not trained)
values: <mutation=<list>, delete_originals=FALSE>
Input channels <name [train type, predict type]>:
  input [Task,Task]
Output channels <name [train type, predict type]>:
  output [Task,Task]

$scale
PipeOp: <scale> (not trained)
values: <robust=FALSE>
Input channels <name [train type, predict type]>:
  input [Task,Task]
Output channels <name [train type, predict type]>:
  output [Task,Task]
graph$edges
   src_id src_channel dst_id dst_channel
1: mutate      output  scale       input

Instead of using %>>%, you can also create a Graph explicitly, using the $add_pipeop() and $add_edge() methods to add PipeOps and the edges connecting them:

graph = Graph$new()$
  add_pipeop(po_mutate)$
  add_pipeop(po_scale)$
  add_edge("mutate", "scale")
Graphs and DAGs

The Graph class represents an object similar to a directed acyclic graph (DAG), since the input of a PipeOp cannot depend on its output and hence cycles are not allowed. However, the resemblance to a DAG is not perfect, since the Graph class allows for multiple edges between nodes. A term such as “directed acyclic multigraph” would be more accurate, but we use “graph” for simplicity.

Once built, a Graph can be used by calling $train() and $predict() as if it were a Learner (though it still outputs a list during training and prediction):

result = graph$train(tsk_small)
result
$scale.output
<TaskClassif:penguins> (333 x 4): Simplified Palmer Penguins
* Target: species
* Properties: multiclass
* Features (3):
  - dbl (3): bill_depth, bill_length, bill_ratio
result[[1]]$data()[1:3]
   species bill_depth bill_length bill_ratio
1:  Adelie     0.7796     -0.8947    -1.0421
2:  Adelie     0.1194     -0.8216    -0.6804
3:  Adelie     0.4241     -0.6753    -0.7435
result = graph$predict(tsk_onepenguin)
result[[1]]$head()
   species bill_depth bill_length bill_ratio
1:  Adelie     0.9319      -0.529    -0.8963
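
Because training also stored a state inside each PipeOp of the Graph, we can peek at what was learned; a small sketch, assuming po("scale") stores its centering and scaling factors in its state, analogous to the PCA state above:

graph$pipeops$scale$state  # factors learned by "scale" during training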

7.3 Sequential Learner-Pipelines

Possibly the most common application of mlr3pipelines is to perform preprocessing tasks, such as missing value imputation or factor encoding, and to then feed the resulting data into a Learner – we will see more of this in practice in Chapter 9. A Graph representing this workflow manipulates data and fits a Learner model during training, ensuring that the data is processed the same way during the prediction stage. Conceptually, the process may look as shown in Figure 7.3.

Figure 7.3: Conceptualization of training and prediction process inside a sequential learner-pipeline. During training (top row), the data is passed along the preprocessing operators, each of which modifies the data and creates a $state. Finally, the learner receives the data and a model is created. During prediction (bottom row), data is likewise transformed by preprocessing operators, using their respective $state (gray boxes) information in the process. The learner then receives data that has the same format as the data seen during training, and makes a prediction.

7.3.1 Learners as PipeOps and Graphs as Learners

In Figure 7.3 the final PipeOp is a Learner. Learner objects can be converted to PipeOps with as_pipeop(); however, this is only necessary if you choose to manually create a graph instead of using %>>%. With either method, internally Learners are passed to po("learner"). The following code creates a Graph that uses po("imputesample") to impute missing values by sampling from observed values (Section 9.3) and then fits a logistic regression on the transformed task.

lrn_logreg = lrn("classif.log_reg")
graph = po("imputesample") %>>% lrn_logreg
graph$plot(horizontal = TRUE)
Figure 7.4: The "imputesample" and "learner" PipeOps in a sequential pipeline: <INPUT> -> imputesample -> classif.log_reg -> <OUTPUT>.
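
For completeness, a minimal sketch of the explicit conversion mentioned above; it is only needed when assembling a Graph by hand, since %>>% performs it automatically:

# convert a Learner into a PipeOp (a PipeOpLearner) explicitly
po_logreg = as_pipeop(lrn("classif.log_reg"))
# equivalent construction via the dictionary
po("learner", lrn("classif.log_reg"))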

We have seen that Graphs can be trained and used for prediction, but with a slightly different design to Learner objects, i.e., inputs and outputs during both training and predicting are list objects. To use a Graph as a Learner with an identical interface, it can be wrapped in a GraphLearner object with as_learner(). The Graph can then be used like any other Learner, so now we can benchmark our pipeline to decide if we should impute by sampling or with the mode of observed values (po("imputemode")):

glrn_sample = as_learner(graph)
glrn_mode = as_learner(po("imputemode") %>>% lrn_logreg)

design = benchmark_grid(tsk("pima"), list(glrn_sample, glrn_mode),
  rsmp("cv", folds = 3))
bmr = benchmark(design)
aggr = bmr$aggregate()[, .(learner_id, classif.ce)]
aggr
                     learner_id classif.ce
1: imputesample.classif.log_reg     0.2357
2:   imputemode.classif.log_reg     0.2396

In this example, we can see that the sampling imputation method worked slightly better, although the difference is likely not significant.
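
If we wanted to probe this further, a rough follow-up sketch (informal only: with three folds the test has very little power, and the positional pairing relies on both learners sharing the same resampling instantiation from benchmark_grid()):

# compare per-fold scores of the two pipelines with a paired t-test
scores = bmr$score(msr("classif.ce"))
t.test(
  scores[learner_id == "imputesample.classif.log_reg", classif.ce],
  scores[learner_id == "imputemode.classif.log_reg", classif.ce],
  paired = TRUE
)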

Automatic Conversion to Learner

In this book, we always use as_learner() to convert a Graph to a Learner explicitly for clarity. While this conversion is necessary when you want to use Learner-specific functions like $predict_newdata(), built-in mlr3 methods like resample() and benchmark_grid() will make this conversion automatically, so it is not strictly needed there. In the above example, it is therefore also possible to use

design = benchmark_grid(tsk("pima"),
  list(graph, po("imputesample") %>>% lrn_logreg),
  rsmp("cv", folds = 3))

7.3.2 Inspecting Graphs

You may want to inspect a pipeline and the flow of data through it, either to learn more about the pipeline or to debug it. To retain intermediate results, we first need to set the $keep_results flag to TRUE; it is turned off by default to save memory.

glrn_sample$graph_model$keep_results = TRUE
glrn_sample$train(tsk("pima"))

The Graph can be accessed through the $graph_model field and then PipeOps can be accessed with $pipeops as before; the $.result field of a PipeOp holds its intermediate results, which can be used to inspect arbitrary steps of the pipeline. In this example, we can see that our Task no longer has missing data after training the "imputesample" PipeOp:

imputesample_output = glrn_sample$graph_model$pipeops$imputesample$
  .result
imputesample_output[[1]]$missings()
diabetes      age pedigree pregnant  glucose  insulin     mass pressure 
       0        0        0        0        0        0        0        0 
 triceps 
       0 

We could also use $pipeops to access our underlying Learner; note that we need to use $learner_model to extract the Learner from the PipeOpLearner. We could use a similar method to peek at the state of any PipeOp in the graph:

pipeop_logreg = glrn_sample$graph_model$pipeops$classif.log_reg
learner_logreg = pipeop_logreg$learner_model
learner_logreg
<LearnerClassifLogReg:classif.log_reg>
* Model: glm
* Parameters: list()
* Packages: mlr3, mlr3learners, stats
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, character, factor,
  ordered
* Properties: loglik, twoclass

In this example we could have used glrn_sample$base_learner() to immediately access our trained learner; however, this does not generalize to more complex pipelines, which may contain multiple learners.
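
The shortcut in action, for this simple pipeline:

# returns the innermost trained Learner directly
glrn_sample$base_learner()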

7.3.3 Configuring Pipeline Hyperparameters

PipeOp hyperparameters are collected in the $param_set of a Graph and prefixed with the ID of the PipeOp to avoid parameter name clashes. Below we use the same PipeOp type twice, setting the id argument to ensure their IDs are unique.

graph = po("scale", center = FALSE, scale = TRUE, id = "scale") %>>%
  po("scale", center = TRUE, scale = FALSE, id = "center") %>>%
  lrn("classif.rpart", cp = 1)
unlist(graph$param_set$values)
      scale.robust       scale.center        scale.scale 
                 0                  0                  1 
     center.robust      center.center       center.scale 
                 0                  1                  0 
classif.rpart.xval   classif.rpart.cp 
                 0                  1 
PipeOp IDs in Graphs

If you need to change the ID of a PipeOp in a Graph then use the $set_names method from the Graph class, e.g., some_graph$set_names(old = "old_name", new = "new_name"). Do not change the ID of a PipeOp through graph$pipeops$<old_id>$id = <new_id>, as this will only alter the PipeOp’s record of its own ID, and not the Graph’s record, which will lead to errors.
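
A quick hedged sketch of the safe renaming route (graph2 here is a throwaway example):

# rename the "scale" PipeOp without breaking the Graph's internal records
graph2 = po("scale") %>>% po("pca")
graph2$set_names(old = "scale", new = "standardize")
graph2$ids()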

Whether a pipeline is treated as a Graph or GraphLearner, hyperparameters are updated and accessed in the same way.

graph$param_set$values$classif.rpart.maxdepth = 5
graph_learner = as_learner(graph)
graph_learner$param_set$values$classif.rpart.minsplit = 2
unlist(graph_learner$param_set$values)
          scale.center            scale.scale           scale.robust 
                     0                      1                      0 
         center.center           center.scale          center.robust 
                     1                      0                      0 
      classif.rpart.cp classif.rpart.maxdepth classif.rpart.minsplit 
                     1                      5                      2 
    classif.rpart.xval 
                     0 

7.4 Conclusion

In this chapter, we introduced mlr3pipelines and its building blocks: Graph and PipeOp. We saw how to create pipelines as Graph objects from multiple PipeOp objects and how to access PipeOps from a Graph. We also saw how to treat a Learner as a PipeOp and how to treat a Graph as a Learner. In Chapter 8 we will take this functionality a step further and look at pipelines where PipeOps are not executed sequentially, as well as looking at how you can use mlr3tuning to tune pipelines. A lot of practical examples that use sequential pipelines can be found in Chapter 9 where we look at pipelines for data preprocessing.

Table 7.1: Important classes and functions covered in this chapter with underlying class (if applicable), class constructor or function, and important class fields and methods (if applicable).
Class Constructor/Function Fields/Methods
PipeOp po() $train(); $predict(); $state; $id; $param_set
Graph %>>% $add_pipeop(); $add_edge(); $pipeops; $edges; $train(); $predict()
GraphLearner as_learner() $graph
PipeOpLearner as_pipeop() $learner_model

7.5 Exercises

  1. Create a learner containing a Graph that first imputes missing values using po("imputeoor"), standardizes the data using po("scale"), and then fits a logistic regression using lrn("classif.log_reg").
  2. Train the learner created in the previous exercise on tsk("pima") and display the coefficients of the resulting model. What are two different ways to access the model?
  3. Verify that the "age" column of the input task of lrn("classif.log_reg") from the previous exercise is indeed standardized. One way to do this would be to look at the $data field of the lrn("classif.log_reg") model; however, that is specific to that particular learner and does not work in general. What would be a different, more general way to do this? Hint: use the $keep_results flag.

7.6 Citation

Please cite this chapter as:

Binder M, Pfisterer F. (2024). Sequential Pipelines. In Bischl B, Sonabend R, Kotthoff L, Lang M, (Eds.), Applied Machine Learning Using mlr3 in R. CRC Press. https://mlr3book.mlr-org.com/sequential_pipelines.html.

@incollection{citekey, 
  author = "Martin Binder and Florian Pfisterer", 
  title = "Sequential Pipelines",
  booktitle = "Applied Machine Learning Using {m}lr3 in {R}",
  publisher = "CRC Press", year = "2024",
  editor = "Bernd Bischl and Raphael Sonabend and Lars Kotthoff and Michel Lang", 
  url = "https://mlr3book.mlr-org.com/sequential_pipelines.html"
}