5 Pipelines

mlr3pipelines is a dataflow programming toolkit for machine learning in R that builds on the mlr3 package. A more in-depth and technically oriented guide can be found in the mlr3pipelines vignette.

Machine learning workflows can be written as directed “Graphs”/“Pipelines” that represent data flows between preprocessing, model fitting, and ensemble learning units in an expressive and intuitive language.

We will most often use the term “Graph” in this manual, but it can be used interchangeably with “pipeline” or “workflow”.

Single computational steps can be represented as so-called PipeOps, which can then be connected with directed edges in a Graph. The scope of mlr3pipelines is still growing. Currently supported features are:

  • Simple data manipulation and preprocessing operations, e.g. PCA, feature filtering, imputation
  • Task subsampling for speed and outcome class imbalance handling
  • mlr3 Learner operations for prediction and stacking
  • Ensemble methods and aggregation of predictions
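
As a minimal sketch of how PipeOps are connected into a Graph, the example below chains a scaling step, a PCA step, and a learner. The choice of the `iris` task and the `classif.rpart` learner is an assumption made purely for illustration; any mlr3 task and learner could take their place.

```r
library(mlr3)
library(mlr3pipelines)

# Three PipeOps chained with %>>%: scale features, project with PCA, fit a tree.
graph = po("scale") %>>%
  po("pca") %>>%
  po("learner", lrn("classif.rpart"))

# The Graph stores its PipeOps as nodes and their connections as an edge table.
print(graph$ids())
print(graph$edges)

# Wrap the Graph so that it behaves like any other mlr3 Learner.
glrn = GraphLearner$new(graph)
task = tsk("iris")  # illustrative task, not prescribed by the text
glrn$train(task)
prediction = glrn$predict(task)
print(prediction$score(msr("classif.acc")))
```

Wrapping the Graph in a `GraphLearner` means the whole pipeline, preprocessing included, can be trained, used for prediction, and resampled like a single learner.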

Additionally, we implement several meta operators that can be used to construct powerful pipelines:

  • Simultaneous path branching (data going both ways)
  • Alternative path branching (data going one specific way, controlled by hyperparameters)
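
The two kinds of branching can be sketched as follows. This is an illustrative example rather than a canonical recipe: the `iris` task, the pairing of PCA with a pass-through step, and the variable names `both_paths` and `one_path` are all assumptions.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")  # illustrative task

# Simultaneous branching: copy the data, send it down both paths,
# then combine the resulting feature sets.
both_paths = po("copy", outnum = 2) %>>%
  gunion(list(po("pca"), po("nop"))) %>>%
  po("featureunion")
out_both = both_paths$train(task)[[1]]
print(out_both$feature_names)  # principal components plus original features

# Alternative branching: only the path named by the hyperparameter
# branch.selection receives data.
choices = c("pca", "nop")
one_path = po("branch", options = choices) %>>%
  gunion(list(po("pca"), po("nop"))) %>>%
  po("unbranch", options = choices)
one_path$param_set$values$branch.selection = "nop"
out_one = one_path$train(task)[[1]]
print(out_one$feature_names)  # unchanged, because the "nop" path was selected
```

Because `branch.selection` is an ordinary hyperparameter, the choice of path can be set by hand, as here, or left to a tuner.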

An extensive introduction to creating custom PipeOps (POs) can be found in the technical introduction.

Using methods from mlr3tuning, it is even possible to simultaneously optimize parameters of multiple processing units.
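
Joint optimization is possible because a Graph exposes the hyperparameters of all its PipeOps in a single parameter set, with each name prefixed by the PipeOp's id. The sketch below does not run a tuner. It only shows the shared parameter space that a tuner from mlr3tuning could search over, and the values set here (`rank. = 2`, `cp = 0.05`) are arbitrary illustrations.

```r
library(mlr3)
library(mlr3pipelines)

graph = po("pca") %>>% po("learner", lrn("classif.rpart"))
glrn = GraphLearner$new(graph)

# Hyperparameters of every unit appear in one set, prefixed by PipeOp id.
param_ids = glrn$param_set$ids()
print(param_ids)

# Set a preprocessing and a learner hyperparameter side by side
# (arbitrary example values).
glrn$param_set$values$pca.rank. = 2
glrn$param_set$values$classif.rpart.cp = 0.05
glrn$train(tsk("iris"))
```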

A predecessor to this package is the mlrCPO package, which works with mlr 2.x. Other packages that provide, to varying degrees, preprocessing functionality or a domain-specific language for machine learning are the caret package, the related recipes project, and the dplyr package.

An example of a Pipeline that can be constructed using mlr3pipelines is depicted below: