7  Sequential Pipelines

Martin Binder
Ludwig-Maximilians-Universität München, and Munich Center for Machine Learning (MCML)

Florian Pfisterer
Ludwig-Maximilians-Universität München

mlr3 aims to provide a layer of abstraction for ML practitioners, allowing users to quickly swap one algorithm for another without needing expert knowledge of the underlying implementation. A unified interface for Task, Learner, and Measure objects means that complex benchmark and tuning experiments can be run in just a few lines of code for any off-the-shelf model, i.e., if you just want to run an experiment using the basic implementation from the underlying algorithm, we hope we have made this easy for you to do.

mlr3pipelines (Binder et al. 2021) takes this modularity one step further, extending it to workflows that may also include data preprocessing (Chapter 9), building ensemble models, or even more complicated meta-models. mlr3pipelines makes it possible to build individual steps within a Learner out of building blocks, which inherit from the PipeOp class. PipeOps can be connected using directed edges to form a Graph or ‘pipeline’, which represents the flow of data between operations. During model training, the PipeOps in a Graph transform a given Task and subsequent PipeOps receive the transformed Task as input. As well as transforming data, PipeOps generate a state, which is used to inform the PipeOp’s operation during prediction, similar to how learners learn and store model parameters/weights during training that go on to inform model prediction. This is visualized in Figure 7.1 using the “Scaling” PipeOp, which scales features during training and saves the scaling factors as a state to be used in predictions.

Figure 7.1: The $train() method of the “Scaling” PipeOp both transforms data (rectangles) as well as creates a state, which is the scaling factors necessary to transform data during prediction.

We refer to pipelines as either sequential or non-sequential. These terms should not be confused with “sequential” and “parallel” processing. In the context of pipelines, “sequential” refers to the movement of data through the pipeline from one PipeOp directly to the next from start to finish. Sequential pipelines can be visualized in a straight line – as we will see in this chapter. In contrast, non-sequential pipelines see data being processed through PipeOps that may have multiple inputs and/or outputs. Non-sequential pipelines are characterized by multiple branches so data may be processed by different PipeOps at different times. Visually, non-sequential pipelines will not be a straight line from start to finish, but a more complex graph. In this chapter, we will look at sequential pipelines and in the next we will focus on non-sequential pipelines.

7.1 PipeOp: Pipeline Operators

The basic class of mlr3pipelines is the PipeOp, short for “pipeline operator”. It represents a transformative operation on an input (for example, a training Task), resulting in some output. Similarly to a learner, it includes a $train() and a $predict() method. The training phase typically generates a particular model of the data, which is saved as the internal state. In the prediction phase, the PipeOp acts on the prediction Task using information from the saved state. Therefore, just like a learner, a PipeOp has “parameters” (i.e., the state) that are trained. As well as “parameters”, PipeOps also have hyperparameters that can be set by the user when constructing the PipeOp or by accessing its $param_set. As with other classes, PipeOps can be constructed with a sugar function, po(), or pos() for multiple PipeOps, and all are listed in the dictionary mlr_pipeops. An up-to-date list of PipeOps contained in mlr3pipelines with links to their documentation can be found at https://mlr-org.com/pipeops.html; a small subset of these is printed below. If you want to extend mlr3pipelines with a PipeOp that has not been implemented, have a look at our vignette on extending PipeOps by running: vignette("extending", package = "mlr3pipelines").

as.data.table(po())[1:6, 1:2]
              key                                      label
1:         boxcox Box-Cox Transformation of Numeric Features
2:         branch                             Path Branching
3:          chunk          Chunk Input into Multiple Outputs
4: classbalancing                            Class Balancing
5:     classifavg                   Majority Vote Prediction
6:   classweights         Class Weights for Sample Weighting
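
For illustration, a minimal sketch of the helpers mentioned above: pos() constructs several PipeOps in one call and $param_set exposes a PipeOp's hyperparameters (po_scale here is a throwaway example object).

library(mlr3pipelines)
# construct a single PipeOp by its ID and inspect its hyperparameters
po_scale = po("scale")
po_scale$param_set
# construct a list of PipeOps in one call
pos(c("scale", "pca"))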

Let us now take a look at a PipeOp in practice using principal component analysis (PCA) as an example, which is implemented in PipeOpPCA. Below we construct the PipeOp using its ID "pca" and inspect it.

library(mlr3pipelines)

po_pca = po("pca", center = TRUE)
po_pca
PipeOp: <pca> (not trained)
values: <center=TRUE>
Input channels <name [train type, predict type]>:
  input [Task,Task]
Output channels <name [train type, predict type]>:
  output [Task,Task]

On printing, we can see that the PipeOp has not been trained and that we have changed some of the hyperparameters from their default values. The Input channels and Output channels lines provide information about the input and output types of this PipeOp. The PCA PipeOp takes one input (named “input”) of type “Task”, both during training and prediction (“input [Task,Task]”), and produces one called “output” that is also of type “Task” in both phases (“output [Task,Task]”). This highlights a key difference from the Learner class: PipeOps can return results after the training phase.
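
The channel tables can also be read programmatically; a short sketch, assuming the $input and $output fields, which return these tables as data.tables:

po_pca$input   # channel name and train/predict types of the input
po_pca$output  # likewise for the output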

A PipeOp is trained by calling $train(), which can take multiple inputs and return multiple outputs; both are passed as elements in a single list. The "pca" PipeOp takes as input the original task and after training returns the task with features replaced by their principal components.

tsk_small = tsk("penguins_simple")$select(c("bill_depth", "bill_length"))
poin = list(tsk_small$clone()$filter(1:5))
poout = po_pca$train(poin) # poin: Task in a list
poout # list with a single element 'output'
$output
<TaskClassif:penguins> (5 x 3): Simplified Palmer Penguins
* Target: species
* Properties: multiclass
* Features (2):
  - dbl (2): PC1, PC2
poout[[1]]$head()
   species     PC1       PC2
1:  Adelie  0.1561  0.005716
2:  Adelie  1.2677  0.789534
3:  Adelie  1.5336 -0.174460
4:  Adelie -2.1096  0.998977
5:  Adelie -0.8478 -1.619768

During training, PCA transforms incoming data by rotating it in such a way that features become uncorrelated and are ordered by their contribution to the total variance. The rotation matrix is also saved in the internal $state field during training (shown in Figure 7.1), which is then used during predictions and applied to new data.

po_pca$state
Standard deviations (1, .., p=2):
[1] 1.513 1.034

Rotation (n x k) = (2 x 2):
                PC1     PC2
bill_depth  -0.6116 -0.7911
bill_length  0.7911 -0.6116

Once the PipeOp has been trained, $predict() can access the saved state to operate on the test data, which again is passed as a list:

tsk_onepenguin = tsk_small$clone()$filter(42)
poin = list(tsk_onepenguin)
poout = po_pca$predict(poin)
poout[[1]]$data()
   species   PC1    PC2
1:  Adelie 1.555 -1.455
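
To make the role of the state concrete, a hedged check (assuming the state stores the components returned by prcomp(), i.e., center and rotation as shown above): the prediction is simply the new data, centered with the training means and multiplied by the stored rotation matrix.

# align column order with the rotation matrix, then rotate manually
x = as.matrix(tsk_onepenguin$data(cols = rownames(po_pca$state$rotation)))
scale(x, center = po_pca$state$center, scale = FALSE) %*% po_pca$state$rotation
# should reproduce the PC1/PC2 values returned by $predict() above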

7.2 Graph: Networks of PipeOps

PipeOps represent individual computational steps in machine learning pipelines. These pipelines themselves are defined by Graph objects. A Graph is a collection of PipeOps with “edges” that guide the flow of data.

The most convenient way of building a Graph is to connect a sequence of PipeOps using the %>>% operator (read “double-arrow”). When given two PipeOps, this operator creates a Graph that first executes the left-hand PipeOp, followed by the right-hand one. It can also be used to connect a Graph with a PipeOp, or with another Graph. The following example uses po("mutate") to add a new feature to the task, and po("scale") to then scale and center all numeric features.

po_mutate = po("mutate",
  mutation = list(bill_ratio = ~bill_length / bill_depth)
)
po_scale = po("scale")
graph = po_mutate %>>% po_scale
graph
Graph with 2 PipeOps:
     ID         State sccssors prdcssors
 mutate <<UNTRAINED>>    scale          
  scale <<UNTRAINED>>             mutate

The output provides information about the layout of the Graph. For each PipeOp (ID), we can see information about the state (State), as well as a list of its successors (sccssors), which are PipeOps that come directly after the given PipeOp, and its predecessors (prdcssors), the PipeOps that are connected to its input. In this simple Graph, the output of the "mutate" PipeOp is passed directly to the "scale" PipeOp and neither takes any other inputs or outputs from other PipeOps. The $plot() method can be used to visualize the graph.

graph$plot(horizontal = TRUE)
Figure 7.2: A simple sequential pipeline: <INPUT> -> mutate -> scale -> <OUTPUT>.

The plot demonstrates how a Graph is simply a collection of PipeOps that are connected by ‘edges’. The collection of PipeOps inside a Graph can be accessed through the $pipeops field. The $edges field can be used to access edges, which returns a data.table listing the “source” (src_id, src_channel) and “destination” (dst_id, dst_channel) of data flowing along each edge.

graph$pipeops
$mutate
PipeOp: <mutate> (not trained)
values: <mutation=<list>, delete_originals=FALSE>
Input channels <name [train type, predict type]>:
  input [Task,Task]
Output channels <name [train type, predict type]>:
  output [Task,Task]

$scale
PipeOp: <scale> (not trained)
values: <robust=FALSE>
Input channels <name [train type, predict type]>:
  input [Task,Task]
Output channels <name [train type, predict type]>:
  output [Task,Task]
graph$edges
   src_id src_channel dst_id dst_channel
1: mutate      output  scale       input

Instead of using %>>%, you can also create a Graph explicitly, using the $add_pipeop() and $add_edge() methods to add PipeOps and the edges connecting them:

graph = Graph$new()$
  add_pipeop(po_mutate)$
  add_pipeop(po_scale)$
  add_edge("mutate", "scale")
Graphs and DAGs

The Graph class represents an object similar to a directed acyclic graph (DAG), since the input of a PipeOp cannot depend on its output and hence cycles are not allowed. However, the resemblance to a DAG is not perfect, since the Graph class allows for multiple edges between nodes. A term such as “directed acyclic multigraph” would be more accurate, but we use “graph” for simplicity.

Once built, a Graph can be used by calling $train() and $predict() as if it were a Learner (though it still outputs a list during training and prediction):

result = graph$train(tsk_small)
result
$scale.output
<TaskClassif:penguins> (333 x 4): Simplified Palmer Penguins
* Target: species
* Properties: multiclass
* Features (3):
  - dbl (3): bill_depth, bill_length, bill_ratio
result[[1]]$data()[1:3]
   species bill_depth bill_length bill_ratio
1:  Adelie     0.7796     -0.8947    -1.0421
2:  Adelie     0.1194     -0.8216    -0.6804
3:  Adelie     0.4241     -0.6753    -0.7435
result = graph$predict(tsk_onepenguin)
result[[1]]$head()
   species bill_depth bill_length bill_ratio
1:  Adelie     0.9319      -0.529    -0.8963
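
Because training also stored a state inside each PipeOp of the Graph, we can peek at what was learned; a small sketch, assuming po("scale") stores its centering and scaling factors in its state, analogous to the PCA state above:

graph$pipeops$scale$state  # factors learned by "scale" during training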

7.3 Sequential Learner-Pipelines

Possibly the most common application of mlr3pipelines is to perform preprocessing tasks, such as missing value imputation or factor encoding, and to then feed the resulting data into a Learner – we will see more of this in practice in Chapter 9. A Graph representing this workflow manipulates data and fits a Learner model during training, ensuring that the data is processed the same way during the prediction stage. Conceptually, the process may look as shown in Figure 7.3.

Figure 7.3: Conceptualization of training and prediction process inside a sequential learner-pipeline. During training (top row), the data is passed along the preprocessing operators, each of which modifies the data and creates a $state. Finally, the learner receives the data and a model is created. During prediction (bottom row), data is likewise transformed by preprocessing operators, using their respective $state (gray boxes) information in the process. The learner then receives data that has the same format as the data seen during training, and makes a prediction.

7.3.1 Learners as PipeOps and Graphs as Learners

In Figure 7.3 the final PipeOp is a Learner. Learner objects can be converted to PipeOps with as_pipeop(); however, this is only necessary if you choose to manually create a graph instead of using %>>%. With either method, internally Learners are passed to po("learner"). The following code creates a Graph that uses po("imputesample") to impute missing values by sampling from observed values (Section 9.3) and then fits a logistic regression on the transformed task.

lrn_logreg = lrn("classif.log_reg")
graph = po("imputesample") %>>% lrn_logreg
graph$plot(horizontal = TRUE)
Figure 7.4: The "imputesample" and "learner" PipeOps in a sequential pipeline: <INPUT> -> imputesample -> classif.log_reg -> <OUTPUT>.
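
For completeness, a minimal sketch of the explicit conversion mentioned above; it is only needed when assembling a Graph by hand, since %>>% performs it automatically:

# convert a Learner into a PipeOp (a PipeOpLearner) explicitly
po_logreg = as_pipeop(lrn("classif.log_reg"))
# equivalent construction via the dictionary
po("learner", lrn("classif.log_reg"))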

We have seen that Graphs can be trained and used for prediction, but with a slightly different design to Learner objects, i.e., inputs and outputs during both training and predicting are list objects. To use a Graph as a Learner with an identical interface, it can be wrapped in a GraphLearner object with as_learner(). The Graph can then be used like any other Learner, so now we can benchmark our pipeline to decide if we should impute by sampling or with the mode of observed values (po("imputemode")):

glrn_sample = as_learner(graph)
glrn_mode = as_learner(po("imputemode") %>>% lrn_logreg)

design = benchmark_grid(tsk("pima"), list(glrn_sample, glrn_mode),
  rsmp("cv", folds = 3))
bmr = benchmark(design)
aggr = bmr$aggregate()[, .(learner_id, classif.ce)]
aggr
                     learner_id classif.ce
1: imputesample.classif.log_reg     0.2357
2:   imputemode.classif.log_reg     0.2396

In this example, we can see that the sampling imputation method worked slightly better, although the difference is likely not significant.
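
If we wanted to probe this further, a rough follow-up sketch (informal only: with three folds the test has very little power, and the positional pairing relies on both learners sharing the same resampling instantiation from benchmark_grid()):

# compare per-fold scores of the two pipelines with a paired t-test
scores = bmr$score(msr("classif.ce"))
t.test(
  scores[learner_id == "imputesample.classif.log_reg", classif.ce],
  scores[learner_id == "imputemode.classif.log_reg", classif.ce],
  paired = TRUE
)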

Automatic Conversion to Learner

In this book, we always use as_learner() to convert a Graph to a Learner explicitly for clarity. While this conversion is necessary when you want to use Learner-specific functions like $predict_newdata(), built-in mlr3 methods like resample() and benchmark_grid() will make this conversion automatically, so it is not strictly needed there. In the above example, it is therefore also possible to use

design = benchmark_grid(tsk("pima"),
  list(graph, po("imputesample") %>>% lrn_logreg),
  rsmp("cv", folds = 3))

7.3.2 Inspecting Graphs

You may want to inspect a pipeline and the flow of data through it, either to learn more about the pipeline or to debug it. To retain intermediate results, we first need to set the $keep_results flag to TRUE; it is turned off by default to save memory.

glrn_sample$graph_model$keep_results = TRUE
glrn_sample$train(tsk("pima"))

The Graph can be accessed through the $graph_model field and then PipeOps can be accessed with $pipeops as before; the $.result field of a PipeOp holds its intermediate results, which can be used to inspect arbitrary steps of the pipeline. In this example, we can see that our Task no longer has missing data after training the "imputesample" PipeOp:

imputesample_output = glrn_sample$graph_model$pipeops$imputesample$
  .result
imputesample_output[[1]]$missings()
diabetes      age pedigree pregnant  glucose  insulin     mass pressure 
       0        0        0        0        0        0        0        0 
 triceps 
       0 

We could also use $pipeops to access our underlying Learner; note that we need to use $learner_model to extract the Learner from the PipeOpLearner. We could use a similar method to peek at the state of any PipeOp in the graph:

pipeop_logreg = glrn_sample$graph_model$pipeops$classif.log_reg
learner_logreg = pipeop_logreg$learner_model
learner_logreg
<LearnerClassifLogReg:classif.log_reg>
* Model: glm
* Parameters: list()
* Packages: mlr3, mlr3learners, stats
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, character, factor,
  ordered
* Properties: loglik, twoclass

In this example we could have used glrn_sample$base_learner() to immediately access our trained learner; however, this does not generalize to more complex pipelines, which may contain multiple learners.
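
The shortcut in action, for this simple pipeline:

# returns the innermost trained Learner directly
glrn_sample$base_learner()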

7.3.3 Configuring Pipeline Hyperparameters

PipeOp hyperparameters are collected in the $param_set of a Graph and prefixed with the ID of the PipeOp to avoid parameter name clashes. Below we use the same PipeOp type twice, setting the id argument to ensure their IDs are unique.

graph = po("scale", center = FALSE, scale = TRUE, id = "scale") %>>%
  po("scale", center = TRUE, scale = FALSE, id = "center") %>>%
  lrn("classif.rpart", cp = 1)
unlist(graph$param_set$values)
      scale.robust       scale.center        scale.scale 
                 0                  0                  1 
     center.robust      center.center       center.scale 
                 0                  1                  0 
classif.rpart.xval   classif.rpart.cp 
                 0                  1 
PipeOp IDs in Graphs

If you need to change the ID of a PipeOp in a Graph then use the $set_names method from the Graph class, e.g., some_graph$set_names(old = "old_name", new = "new_name"). Do not change the ID of a PipeOp through graph$pipeops$<old_id>$id = <new_id>, as this will only alter the PipeOp’s record of its own ID, and not the Graph’s record, which will lead to errors.
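
A quick hedged sketch of the safe renaming route (graph2 here is a throwaway example):

# rename the "scale" PipeOp without breaking the Graph's internal records
graph2 = po("scale") %>>% po("pca")
graph2$set_names(old = "scale", new = "standardize")
graph2$ids()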

Whether a pipeline is treated as a Graph or GraphLearner, hyperparameters are updated and accessed in the same way.

graph$param_set$values$classif.rpart.maxdepth = 5
graph_learner = as_learner(graph)
graph_learner$param_set$values$classif.rpart.minsplit = 2
unlist(graph_learner$param_set$values)
          scale.center            scale.scale           scale.robust 
                     0                      1                      0 
         center.center           center.scale          center.robust 
                     1                      0                      0 
      classif.rpart.cp classif.rpart.maxdepth classif.rpart.minsplit 
                     1                      5                      2 
    classif.rpart.xval 
                     0 

7.4 Conclusion

In this chapter, we introduced mlr3pipelines and its building blocks: Graph and PipeOp. We saw how to create pipelines as Graph objects from multiple PipeOp objects and how to access PipeOps from a Graph. We also saw how to treat a Learner as a PipeOp and how to treat a Graph as a Learner. In Chapter 8 we will take this functionality a step further and look at pipelines where PipeOps are not executed sequentially, as well as looking at how you can use mlr3tuning to tune pipelines. A lot of practical examples that use sequential pipelines can be found in Chapter 9 where we look at pipelines for data preprocessing.

Table 7.1: Important classes and functions covered in this chapter with underlying class (if applicable), class constructor or function, and important class fields and methods (if applicable).
Class Constructor/Function Fields/Methods
PipeOp po() $train(); $predict(); $state; $id; $param_set
Graph %>>% $add_pipeop(); $add_edge(); $pipeops; $edges; $train(); $predict()
GraphLearner as_learner() $graph
PipeOpLearner as_pipeop() $learner_model

7.5 Exercises

  1. Create a learner containing a Graph that first imputes missing values using po("imputeoor"), standardizes the data using po("scale"), and then fits a logistic regression using lrn("classif.log_reg").
  2. Train the learner created in the previous exercise on tsk("pima") and display the coefficients of the resulting model. What are two different ways to access the model?
  3. Verify that the "age" column of the input task of lrn("classif.log_reg") from the previous exercise is indeed standardized. One way to do this would be to look at the $data field of the lrn("classif.log_reg") model; however, that is specific to that particular learner and does not work in general. What would be a different, more general way to do this? Hint: use the $keep_results flag.

7.6 Citation

Please cite this chapter as:

Binder M, Pfisterer F. (2024). Sequential Pipelines. In Bischl B, Sonabend R, Kotthoff L, Lang M, (Eds.), Applied Machine Learning Using mlr3 in R. CRC Press. https://mlr3book.mlr-org.com/sequential_pipelines.html.

@incollection{citekey, 
  author = "Martin Binder and Florian Pfisterer", 
  title = "Sequential Pipelines",
  booktitle = "Applied Machine Learning Using {m}lr3 in {R}",
  publisher = "CRC Press", year = "2024",
  editor = "Bernd Bischl and Raphael Sonabend and Lars Kotthoff and Michel Lang", 
  url = "https://mlr3book.mlr-org.com/sequential_pipelines.html"
}