5.5 Non-Linear Graphs

The Graphs seen so far all have a linear structure. However, some POs have multiple input or output channels, which makes it possible to create non-linear Graphs in which the data can take alternative paths.

Possible types are:

  • Branching: Splitting of a node into several paths, useful for example when comparing multiple feature-selection methods (PCA, filters). Only one of the paths will be executed.
  • Copying: Splitting of a node into several paths; all paths are executed (sequentially). Parallel execution is not yet supported.
  • Stacking: Graphs are stacked on top of each other, i.e. the output of one Graph is the input for another. In machine learning this means that the prediction of one Graph is used as input for another Graph.

5.5.1 Branching & Copying

The PipeOpBranch and PipeOpUnbranch POs make it possible to specify multiple alternative paths. Only one path is actually executed; the others are ignored. The active path is determined by a hyperparameter. This concept makes it possible to tune alternative preprocessing paths (or learner models).

PipeOpBranch and PipeOpUnbranch are initialized either with the number of branches or with a character vector indicating the names of the branches. If names are given, the “branch-choosing” hyperparameter becomes more readable. In the following, we set up three options:

  1. Doing nothing (“null”)
  2. Applying a PCA
  3. Scaling the data

It is important to “unbranch” again after “branching”, so that the outputs are merged into one result object.

In the following, we first create the branched graph and then show what happens if the “unbranching” is not applied.
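A minimal sketch of the branched part, assuming the current mlr3pipelines interface (po(), gunion(), %>>%); the branch names mirror the three options above:

```r
library(mlr3)
library(mlr3pipelines)

# three named branches: do nothing, PCA, scaling
opts = c("null", "pca", "scale")
branch = po("branch", options = opts)

# one path per branch; po("nop") simply passes the data through unchanged
paths = gunion(list(po("nop"), po("pca"), po("scale")))
```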

Without “unbranching”:
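A sketch under the same assumptions as above:

```r
graph_open = branch %>>% paths
graph_open$plot()
# the graph now ends in three separate output channels,
# one per branch -- there is no single merged result
```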

With “unbranching”:
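With the unbranch step appended, the three paths are merged back into a single output. The hyperparameter name branch.selection below follows from the default PipeOp id "branch", and the task is a placeholder assumed for this sketch:

```r
graph = branch %>>% paths %>>% po("unbranch", options = opts)
graph$plot()

# the active path is chosen via the branch.selection hyperparameter
graph$param_set$values$branch.selection = "pca"
graph$train(tsk("iris"))
```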

5.5.2 Model Ensembles

We can leverage the different operations presented to connect POs to form powerful graphs.

Before we go into details, we split the task into train and test indices.
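For illustration we use a built-in task here (an assumption; any classification task works) and split off roughly 80% of the rows for training:

```r
library(mlr3)

task = tsk("iris")   # placeholder task, assumed for this sketch

set.seed(1)
train_ids = sample(task$nrow, floor(0.8 * task$nrow))
test_ids  = setdiff(seq_len(task$nrow), train_ids)
```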

5.5.2.1 Bagging

We first examine bagging, introduced by Breiman (1996). The basic idea is to create multiple predictors and then aggregate them into a single, more powerful predictor.

“… multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets” (Breiman 1996)

Bagging then aggregates the set of predictors by averaging (regression) or majority vote (classification). The idea behind bagging is that a set of weak but different predictors can be combined to arrive at a single, better predictor.

We can achieve this by subsampling our data before training a learner, repeating this, say, 10 times, and then performing a majority vote on the predictions.

First, we create a simple pipeline that uses PipeOpSubsample before a PipeOpLearner is trained:
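A sketch of this single bagging path; the subsampling fraction of 0.7 is an arbitrary choice:

```r
library(mlr3pipelines)

single_path = po("subsample", frac = 0.7) %>>%
  po("learner", lrn("classif.rpart"))
```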

We can now copy this operation 10 times using greplicate.
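In current mlr3pipelines releases this helper is exposed as ppl("greplicate") (pipeline_greplicate()); a sketch:

```r
# replicate the subsample + rpart path 10 times,
# giving 10 parallel, independently trained pipelines
graph_rep = ppl("greplicate", single_path, n = 10)
```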

Afterwards we need to aggregate the 10 pipelines to form a single model:
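One way to do this is a majority vote via PipeOpClassifAvg, which combines the 10 predictions (a sketch under the same assumptions as before):

```r
bagging = graph_rep %>>% po("classifavg", innum = 10)
```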

and plot again to see what happens:
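The plot should show the 10 parallel subsample/rpart paths feeding into the averaging step:

```r
bagging$plot()
```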

This pipeline can again be wrapped in a GraphLearner, so that the bagging ensemble can be used like any other Learner.
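A sketch of wrapping the graph and using it like a learner, reusing the train/test split from above:

```r
baglrn = as_learner(bagging)   # equivalently: GraphLearner$new(bagging)
baglrn$train(task, row_ids = train_ids)
baglrn$predict(task, row_ids = test_ids)$score(msr("classif.acc"))
```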

In conjunction with different Backends, this can be a very powerful tool. In cases where the data does not fully fit into memory, we can obtain a fraction of the data for each learner from a DataBackend and then aggregate the predictions over all learners.

5.5.2.2 Stacking

Stacking (Wolpert 1992) is another technique that can improve model performance. The basic idea behind stacking is the use of predictions from one model as features for a subsequent model to possibly improve performance.

As an example, we can train a decision tree and use its predictions, in conjunction with the original features, to train an additional model on top.

To limit overfitting, we do not use the learner's in-sample predictions, but instead its out-of-bag predictions. To do all this, we can use PipeOpLearnerCV.

PipeOpLearnerCV performs cross-validation on the training data, fitting a model in each fold. Each model is then used to predict on its held-out fold. As a result, we obtain out-of-fold predictions for every data point in our input data.

We first create a “level 0” learner, which is used to produce the lower-level predictions. We additionally clone() the learner object to obtain a copy of the learner, and set a custom id for the PipeOp.
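A sketch of the level-0 PipeOp; the id "rpart_cv" is an arbitrary choice:

```r
lrn_rpart = lrn("classif.rpart")
level0 = po("learner_cv", lrn_rpart$clone(), id = "rpart_cv")
```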

Additionally, we use a PipeOpNULL in parallel to the “level 0” learner, in order to send the unchanged Task to the next level, where it is then combined with the predictions from our decision tree learner.
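In current mlr3pipelines versions the pass-through operator is PipeOpNOP, constructed as po("nop"); a sketch of running it in parallel to the level-0 learner:

```r
# left path: cross-validated rpart predictions
# right path: the unchanged Task, forwarded as-is
level0_union = gunion(list(level0, po("nop")))
```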

Afterwards, we want to concatenate the predictions from PipeOpLearnerCV and the original Task using PipeOpFeatureUnion.
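With the two parallel channels above, PipeOpFeatureUnion joins the predictions and the original features into one Task (sketch):

```r
combined = level0_union %>>% po("featureunion")
combined$plot()
```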

We can now train another learner on top of the combined features.
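A sketch of the full stacking graph, trained via GraphLearner on the split from above:

```r
stack = combined %>>% lrn("classif.rpart")

stacklrn = as_learner(stack)
stacklrn$train(task, row_ids = train_ids)
stacklrn$predict(task, row_ids = test_ids)$score(msr("classif.acc"))
```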

In this vignette, we showed a very simple use case for stacking. In many real-world applications, stacking is done over multiple levels and on multiple representations of the dataset. On a lower level, different preprocessing methods can be combined with several learners. On a higher level, we can then combine those predictions to form a very powerful model.

5.5.2.3 Multilevel Stacking

In order to showcase the power of mlr3pipelines, we will show a more complicated stacking example.

In this case, we train a glmnet and two different rpart models (one of which transforms its inputs using PipeOpPCA) on our task at “level 0” and concatenate their outputs with the original features (via PipeOpNULL). The result is then passed on to “level 1”, where we copy the concatenated features three times and feed these copies into an rpart and a glmnet model. Additionally, we keep a version of the “level 0” output (via PipeOpNULL) and pass it on to “level 2”. At “level 2” we simply concatenate all “level 1” outputs and train a final decision tree.
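A sketch of this multilevel stack. It assumes mlr3learners for classif.glmnet, uses a placeholder task, and sets predict_type = "prob" so the stacked prediction features stay numeric for glmnet; all ids are arbitrary choices:

```r
library(mlr3)
library(mlr3learners)    # assumed for lrn("classif.glmnet")
library(mlr3pipelines)

task = tsk("iris")       # placeholder task, assumed for this sketch

# level 0: glmnet, plain rpart, rpart on PCA-transformed features,
# plus a pass-through of the original features
level0 = gunion(list(
  po("learner_cv", lrn("classif.glmnet", predict_type = "prob"), id = "glmnet_0"),
  po("learner_cv", lrn("classif.rpart", predict_type = "prob"), id = "rpart_0"),
  po("pca") %>>%
    po("learner_cv", lrn("classif.rpart", predict_type = "prob"), id = "rpart_pca_0"),
  po("nop", id = "nop_0")
)) %>>% po("featureunion", id = "union_0")

# level 1: copy the concatenated output three times, feed it into
# rpart and glmnet, and keep one unchanged copy for level 2
level1 = po("copy", outnum = 3, id = "copy_1") %>>% gunion(list(
  po("learner_cv", lrn("classif.rpart", predict_type = "prob"), id = "rpart_1"),
  po("learner_cv", lrn("classif.glmnet", predict_type = "prob"), id = "glmnet_1"),
  po("nop", id = "nop_1")
))

# level 2: concatenate all level-1 outputs and train a final decision tree
stack = level0 %>>% level1 %>>% po("featureunion", id = "union_2") %>>%
  lrn("classif.rpart")
stack$plot()

stacklrn = as_learner(stack)
stacklrn$train(task)
```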

References

Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2): 123–40.

Wolpert, David H. 1992. “Stacked Generalization.” Neural Networks 5 (2): 241–59. https://doi.org/10.1016/S0893-6080(05)80023-1.