9.5 Framework Comparison
Before diving deeper, we give a short introduction to PipeOps.
9.5.1 An introduction to “PipeOp”s
In this example, we create a linear Pipeline. After scaling all input features, we rotate our data using principal component analysis. After this transformation, we use a simple Decision Tree learner for classification.
As example data, we will use the “iris” classification task.
This object contains the famous iris dataset and some meta-information, such as the target variable.
We quickly split our data into a train and a test set:
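A minimal sketch of these two steps, assuming the `tsk()` shorthand from mlr3 and a randomly chosen hold-out set of 30 rows. The variable names and the seed are illustrative and not taken from the original text:

```r
library("mlr3")

set.seed(1)                        # arbitrary seed, only for reproducibility
task = tsk("iris")                 # iris classification task shipped with mlr3

# Hold out 30 randomly chosen rows for testing, train on the rest
test_idx  = sample(task$nrow, 30)
train_idx = setdiff(seq_len(task$nrow), test_idx)

# Clone before filtering so that the original task stays untouched
task_train = task$clone()$filter(train_idx)
task_test  = task$clone()$filter(test_idx)
```

Cloning matters here because `$filter()` modifies a Task in place.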
A Pipeline (or Graph) contains multiple pipeline operators (“PipeOp”s), where each PipeOp transforms the data when it flows through it.
For this use case, we require 3 transformations:
- A PipeOp that scales the data
- A PipeOp that performs PCA
- A PipeOp that contains the Decision Tree learner
A list of available PipeOps can be obtained from the mlr_pipeops dictionary, for example via mlr_pipeops$keys():
    ##  "boxcox" "branch" "chunk"
    ##  "classbalancing" "classifavg" "classweights"
    ##  "colapply" "collapsefactors" "colroles"
    ##  "copy" "datefeatures" "encode"
    ##  "encodeimpact" "encodelmer" "featureunion"
    ##  "filter" "fixfactors" "histbin"
    ##  "ica" "imputeconstant" "imputehist"
    ##  "imputelearner" "imputemean" "imputemedian"
    ##  "imputemode" "imputeoor" "imputesample"
    ##  "kernelpca" "learner" "learner_cv"
    ##  "missind" "modelmatrix" "multiplicityexply"
    ##  "multiplicityimply" "mutate" "nmf"
    ##  "nop" "ovrsplit" "ovrunite"
    ##  "pca" "proxy" "quantilebin"
    ##  "randomprojection" "randomresponse" "regravg"
    ##  "removeconstants" "renamecolumns" "replicate"
    ##  "scale" "scalemaxabs" "scalerange"
    ##  "select" "smote" "spatialsign"
    ##  "subsample" "targetinvert" "targetmutate"
    ##  "targettrafoscalerange" "textvectorizer" "threshold"
    ##  "tunethreshold" "unbranch" "vtreat"
    ##  "yeojohnson"
First we define the required PipeOps:
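One way to construct the three operators is sketched here with the `po()` shorthand and the `lrn()` learner constructor. The variable names are our own; the keys "scale", "pca" and "learner" come from the list above:

```r
library("mlr3")
library("mlr3pipelines")

scale   = po("scale")                          # centers and scales numeric features
pca     = po("pca")                            # rotates the data via PCA
learner = po("learner", lrn("classif.rpart"))  # wraps the decision tree learner

# Each PipeOp carries an id that is later used to address it inside a Graph
c(scale$id, pca$id, learner$id)
```

The id of the learner PipeOp is inherited from the wrapped learner, which is why it later shows up as `classif.rpart` in the output names.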
9.5.1.1 A quick glance into a PipeOp
In order to get a better understanding of what the respective PipeOps do, we quickly look at one of them in detail:
The most important slots in a PipeOp are:
- $train(): A function used to train the PipeOp.
- $predict(): A function used to predict with the PipeOp.
The $train() and $predict() functions define the core functionality of our PipeOp.
In many cases, in order to not leak information from the training set into the test set, it is imperative to treat train and test data separately.
For this we require a $train() function that learns the appropriate transformations from the training set and a $predict() function that applies the transformation to future data.
In the case of PipeOpPCA this means the following:
- $train() learns a rotation matrix from its input and saves this matrix to an additional slot, $state. It returns the rotated input data stored in a new Task.
- $predict() uses the rotation matrix stored in $state in order to rotate future, unseen data. It returns those in a new Task.
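The train/predict split of PipeOpPCA can be tried out directly. The following is a small sketch, assuming the `po("pca")` shorthand and an arbitrary split of iris into the first 120 and the last 30 rows, chosen purely for illustration:

```r
library("mlr3")
library("mlr3pipelines")

task = tsk("iris")
train_task = task$clone()$filter(1:120)    # hypothetical split for illustration
test_task  = task$clone()$filter(121:150)

pca = po("pca")

# $train() expects a list of inputs and returns a list of outputs
train_out = pca$train(list(train_task))
pca$state$rotation             # rotation matrix learned from the training rows

# $predict() reuses the stored rotation instead of re-estimating it
test_out = pca$predict(list(test_task))
test_out$output$head(3)
```

Because the rotation is read from `$state`, the test rows never influence the transformation that is applied to them.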
9.5.1.2 Constructing the Pipeline
We can now connect the PipeOps constructed earlier to a Pipeline.
We can do this using the %>>% operator.
The result of this operation is a “Graph”.
A Graph connects the input and output of each PipeOp to the following PipeOp.
This allows us to specify linear processing pipelines.
In this case, we connect the output of the scaling PipeOp to the input of the PCA PipeOp and the output of the PCA PipeOp to the input of PipeOpLearner.
We can now train the Graph using the iris Task:
    ## $classif.rpart.output
    ## NULL
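The construction and training steps described above can be sketched in a few lines. The operators are created inline with `po()`, and the full iris task is used here only to keep the example short:

```r
library("mlr3")
library("mlr3pipelines")

task = tsk("iris")

# Chain the three operators; %>>% wires each output to the next input
graph = po("scale") %>>% po("pca") %>>% po("learner", lrn("classif.rpart"))

graph$ids()                # ids of the operators contained in the Graph
out = graph$train(task)    # list with one entry per output of the Graph
out
```

During training the learner produces no prediction, which is why the only output entry is `NULL`.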
When we now train the graph, the data flows through the graph as follows:
- The Task flows into the PipeOpScale. The PipeOp scales each column in the data contained in the Task and returns a new Task that contains the scaled data to its output.
- The scaled Task flows into the PipeOpPCA. PCA transforms the data and returns a (possibly smaller) Task that contains the transformed data.
- This transformed data then flows into the learner, in our case classif.rpart. It is then used to train the learner, and as a result saves a model that can be used to predict new data.
In order to predict on new data, we need to save the relevant transformations our data went through while training.
As a result, each PipeOp saves a state, where information required to appropriately transform future data is stored.
In our case, this is the mean and standard deviation of each column for PipeOpScale, the PCA rotation matrix for PipeOpPCA and the learned model for PipeOpLearner.
    ## $classif.rpart.output
    ## <PredictionClassif> for 30 observations:
    ##  row_id      truth   response
    ##     119  virginica  virginica
    ##     125  virginica  virginica
    ##      12     setosa     setosa
    ## ---
    ##      45     setosa     setosa
    ##      65 versicolor versicolor
    ##      84 versicolor  virginica
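The stored states and the prediction step can be inspected as follows. This is a sketch under the assumption of a random 30-row hold-out set; the seed and variable names are illustrative, so the predicted rows will differ from the output shown above:

```r
library("mlr3")
library("mlr3pipelines")

set.seed(1)
task = tsk("iris")
test_idx   = sample(task$nrow, 30)
task_train = task$clone()$filter(setdiff(seq_len(task$nrow), test_idx))
task_test  = task$clone()$filter(test_idx)

graph = po("scale") %>>% po("pca") %>>% po("learner", lrn("classif.rpart"))
graph$train(task_train)

# States stored during training
graph$pipeops$scale$state$center          # column means used for centering
graph$pipeops$pca$state$rotation          # PCA rotation matrix
graph$pipeops$classif.rpart$state$model   # fitted rpart model

# Prediction pushes the test task through the stored transformations
pred = graph$predict(task_test)$classif.rpart.output
pred$score(msr("classif.acc"))
```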
9.5.2 mlr3pipelines vs. mlr
While mlr wrappers are generally less verbose and require a little less code, this heavily inhibits flexibility. As an example, wrappers can generally not process data in parallel.
    library("mlr")

    # We first create a learner
    lrn = makeLearner("classif.rpart")
    # Wrap this learner in a FilterWrapper
    lrn.wrp = makeFilterWrapper(lrn, fw.abs = 2L)
    # And wrap the resulting wrapped learner into an ImputeWrapper.
    lrn.wrp = makeImputeWrapper(lrn.wrp)
    # Afterwards, we can train the resulting (wrapped) learner on a task
    train(lrn.wrp, iris.task)
    library("mlr3")
    library("mlr3pipelines")
    library("mlr3filters")

    # PipeOpImpute is an abstract base class in current mlr3pipelines,
    # so a concrete imputer (here: mean imputation) is used instead
    impute = PipeOpImputeMean$new()
    filter = PipeOpFilter$new(filter = FilterVariance$new(), param_vals = list(filter.nfeat = 2L))
    rpart = PipeOpLearner$new(mlr_learners$get("classif.rpart"))

    # Assemble the Pipeline
    pipeline = impute %>>% filter %>>% rpart
    # And convert to a 'GraphLearner'
    learner = GraphLearner$new(pipeline, id = "Pipeline")
The fact that mlr’s wrappers have to be applied inside-out, i.e. in the reverse order of execution, is often confusing.
This is much more straightforward in mlr3pipelines, where we simply chain the different methods using %>>%.
In addition, mlr3pipelines offers far greater possibilities with respect to the kinds of Pipelines that can be constructed.
In mlr3pipelines, we allow for the construction of parallel and conditional pipelines.
This was previously not possible.
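To make the two terms concrete, here is a hedged sketch of one parallel and one conditional pipeline. It relies on the "nop", "featureunion", "branch" and "unbranch" operators from the list of available PipeOps; the particular combination is our own illustration:

```r
library("mlr3")
library("mlr3pipelines")

# Parallel: PCA features and the untouched originals are computed side by side
# and joined column-wise by "featureunion"
parallel = gunion(list(po("pca"), po("nop"))) %>>%
  po("featureunion") %>>%
  po("learner", lrn("classif.rpart"))

# Conditional: "branch" routes the data through exactly one of the two paths
choices = c("pca", "nop")
conditional = po("branch", choices) %>>%
  gunion(list(po("pca"), po("nop"))) %>>%
  po("unbranch", choices) %>>%
  po("learner", lrn("classif.rpart"))
conditional$param_set$values$branch.selection = "nop"   # skip PCA

parallel$train(tsk("iris"))
conditional$train(tsk("iris"))
```

Since `branch.selection` is an ordinary hyperparameter, the choice of path can itself be tuned.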
9.5.3 mlr3pipelines vs. sklearn.pipeline.Pipeline
In order to broaden the horizon, we compare to Python sklearn’s Pipeline methods.
sklearn.pipeline.Pipeline sequentially applies a list of transforms before fitting a final estimator.
Intermediate steps of the pipeline are transforms, i.e. steps that can learn from the data, but also transform the data while it flows through it.
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
For this, it enables setting parameters of the various steps.
It is thus conceptually very similar to mlr3pipelines.
Similarly to mlr3pipelines, we can tune over a full Pipeline using various tuning methods.
Pipeline mainly supports linear pipelines.
Parallel steps, such as Bagging, can be executed, but conditional execution, where the data takes only one of several alternative paths, is not supported.
At the same time, the different transforms in the pipeline can be cached, which makes tuning over the configuration space of a Pipeline more efficient, as executing some steps multiple times can be avoided.
To compare the two, we implement the same example in both sklearn.pipeline.Pipeline and mlr3pipelines.
The following example, obtained from the sklearn documentation, showcases a Pipeline that selects a feature, performs PCA on the original data, concatenates the resulting datasets and applies a Support Vector Machine.
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest

    iris = load_iris()
    X, y = iris.data, iris.target

    # This dataset is way too high-dimensional. Better do PCA:
    pca = PCA(n_components=2)

    # Maybe some original features were good, too?
    selection = SelectKBest(k=1)

    # Build estimator from PCA and Univariate selection:
    combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

    # Use combined features to transform dataset:
    X_features = combined_features.fit(X, y).transform(X)

    svm = SVC(kernel="linear")

    # Do grid search over k, n_components and C:
    pipeline = Pipeline([("features", combined_features), ("svm", svm)])

    param_grid = dict(features__pca__n_components=[1, 2, 3],
                      features__univ_select__k=[1, 2],
                      svm__C=[0.1, 1, 10])

    grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=10)
    grid_search.fit(X, y)
    library("mlr3verse")
    task = tsk("iris")

    # Build the steps
    copy = PipeOpCopy$new(2)
    pca = PipeOpPCA$new()
    selection = PipeOpFilter$new(filter = FilterVariance$new())
    union = PipeOpFeatureUnion$new(2)
    # 'cost' can only be set for C-classification, so the type is fixed explicitly
    svm = PipeOpLearner$new(lrn("classif.svm", type = "C-classification", kernel = "linear"))

    # Assemble the Pipeline
    pipeline = copy %>>% gunion(list(pca, selection)) %>>% union %>>% svm
    learner = GraphLearner$new(pipeline, id = "Pipeline")

    # For tuning, we define the resampling and the Parameter Space
    resampling = rsmp("cv", folds = 5L)
    param_set = ps(
      classif.svm.cost = p_dbl(lower = 0.1, upper = 1),
      pca.rank. = p_int(lower = 1, upper = 3),
      variance.filter.nfeat = p_int(lower = 1, upper = 2)
    )

    # Grid search with a budget of 60 evaluations
    instance = ti(task = task, learner = learner, resampling = resampling,
      search_space = param_set, terminator = trm("evals", n_evals = 60))
    tuner = tnr("grid_search", resolution = 10)
    tuner$optimize(instance)

    # Set the learner to the optimal values and train
    learner$param_set$values = instance$result_learner_param_vals
    learner$train(task)
In summary, we can achieve similar results with a comparable number of lines, while at the same time offering greater flexibility with respect to which kinds of pipelines we want to optimize over.
At the same time, experiments using mlr3 can now be arbitrarily parallelized using the future package.
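As a rough sketch of what this looks like in practice, the following resamples a small pipeline with two parallel workers. It assumes that parallelization is requested through `future::plan()`; the number of workers and folds is arbitrary:

```r
library("mlr3")
library("mlr3pipelines")

# Assumption: parallelization is requested through the future package
future::plan("multisession", workers = 2)

graph = po("scale") %>>% po("pca") %>>% po("learner", lrn("classif.rpart"))
glrn  = as_learner(graph)

# The three cross-validation folds can now be evaluated in parallel
rr = resample(tsk("iris"), glrn, rsmp("cv", folds = 3))
rr$aggregate(msr("classif.ce"))

future::plan("sequential")   # restore the default backend
```

The pipeline code itself does not change; only the plan determines how the resampling iterations are executed.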
9.5.4 mlr3pipelines vs recipes
recipes is a new package that covers some of the same application steps as mlr3pipelines.
Both packages feature the possibility to connect different pre- and post-processing methods using a pipe-operator.
As the recipes package tightly integrates with the tidymodels ecosystem, much of the functionality integrated there can be used in recipes.
We compare recipes to mlr3pipelines using an example from the recipes vignette.
The aim of the analysis is to predict whether customers pay back their loans given some information on the customers. In order to do this, we build a model that does the following:
- It first imputes missing values using k-nearest neighbors
- All factor variables are converted to numerics using dummy encoding
- The data is first centered then scaled.
In order to validate the algorithm, data is first split into a train and test set using initial_split().
The recipe trained on the train data (see steps above) is then applied to the test data.
    library("tidymodels")
    library("rsample")
    data("credit_data", package = "modeldata")

    set.seed(55)
    train_test_split = initial_split(credit_data)
    credit_train = training(train_test_split)
    credit_test = testing(train_test_split)

    # step_impute_knn() is the current name of the former step_knnimpute()
    rec = recipe(Status ~ ., data = credit_train) %>%
      step_impute_knn(all_predictors()) %>%
      step_dummy(all_predictors(), -all_numeric()) %>%
      step_center(all_numeric()) %>%
      step_scale(all_numeric())

    trained_rec = prep(rec, training = credit_train)

    # Apply to train and test set
    train_data = bake(trained_rec, new_data = credit_train)
    test_data = bake(trained_rec, new_data = credit_test)
Afterwards, the transformed data can be used during train and predict:
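A minimal sketch of this last step, assuming a random forest fitted with `ranger()` as the model; the model choice and its settings are our assumption, not part of the recipes vignette. The recipe is repeated so that the example is self-contained, and it uses `step_impute_knn()`, the current name of the k-nearest-neighbor imputation step:

```r
library("recipes")
library("rsample")
library("ranger")

data("credit_data", package = "modeldata")
set.seed(55)
split = initial_split(credit_data)
credit_train = training(split)
credit_test  = testing(split)

# Same preprocessing steps as in the recipe above
rec = recipe(Status ~ ., data = credit_train) %>%
  step_impute_knn(all_predictors()) %>%
  step_dummy(all_predictors(), -all_numeric()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())
trained_rec = prep(rec, training = credit_train)
train_data = bake(trained_rec, new_data = credit_train)
test_data  = bake(trained_rec, new_data = credit_test)

# Train on the preprocessed training data (ranger is an assumed model choice)
fit = ranger(Status ~ ., data = train_data, num.trees = 200)

# Predict on the preprocessed test data and compute the accuracy
preds = predict(fit, data = test_data)$predictions
mean(preds == test_data$Status)
```

Note that the test set is only ever passed through `bake()`, so all imputation, centering and scaling parameters come from the training data.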
The same analysis can be performed in mlr3pipelines.
Note that for now we do not impute via knn but instead via sampling.
    library("data.table")
    library("mlr3")
    library("mlr3learners")
    library("mlr3pipelines")
    data("credit_data", package = "modeldata")
    set.seed(55)

    # Create the task
    task = TaskClassif$new(id = "credit_task", target = "Status",
      backend = as_data_backend(data.table(credit_data)))

    # Build up the Pipeline:
    g = PipeOpImputeSample$new(id = "impute") %>>%
      PipeOpEncode$new(param_vals = list(method = "one-hot")) %>>%
      PipeOpScale$new() %>>%
      PipeOpLearner$new(lrn("classif.ranger", num.trees = 200, mtry = 12))

    # We can visualize what happens to the data using the `plot` function:
    g$plot()

    # And we can use `mlr3`'s full functionality by wrapping the Graph into a GraphLearner.
    glrn = GraphLearner$new(g)
    resample(task, glrn, mlr_resamplings$get("holdout"))