Now that we are familiar with the basics of how to create tasks and learners, how to fit models, and do some basic performance evaluation, let’s have a look at some of the details, and in particular how mlr3 makes it easy to perform many common machine learning steps.
We will cover the following topics:
Binary classification and ROC curves
Binary classification is a special case of classification where the target variable to predict has only two possible values. In this case, additional considerations apply; in particular ROC curves and the threshold of where to predict one class versus the other.
A resampling is a method to create training and test splits. We cover how to
- access and select resampling strategies,
- instantiate the split into training and test sets by applying the resampling, and
- execute the resampling to obtain results.
Benchmarking is used to compare the performance of different models, for example models trained with different learners, on different tasks, or with different resampling methods. This is usually done to get an overview of how different methods perform across different tasks. We cover how to
- create a benchmarking design,
- execute a design and aggregate results, and
- convert benchmarking objects to other types of objects that can be used for different purposes.
As we have seen before, binary classification is special because of the presence of a positive and negative class and a threshold probability to ditinguish between the two.
ROC Analysis, which stands for receiver operating characteristics, applies specifically to this case and allows to get a better picture of the trade offs when choosing between the two classes.
We saw earlier that one can retrieve the confusion matrix of a
Prediction by accessing the
data("Sonar", package = "mlbench") task = as_task_classif(Sonar, target = "Class", positive = "M") learner = lrn("classif.rpart", predict_type = "prob") pred = learner$train(task)$predict(task) C = pred$confusion print(C)
## truth ## response M R ## M 95 10 ## R 16 87
The upper left quandrant denotes the number of times our model predicted the positive class and was correct about it. Similarly, the lower right quadrant denotes the number of times our model predicted the negative class and was also correct about it. Together, the elements on the diagonal are called True Positives (TP) and True Negatives (TN). The upper right quadrant denotes the number of times we falsely predicted a positive label, and is called False Positives (FP). The lower left quadrant is called False Negatives (FN).
We can derive the following performance metrics from the confusion matrix:
- True Positive Rate (TPR): How many of the true positives did we predict as positive?
- True Negative Rate (TNR): How many of the true negatives did we predict as negative?
- Positive Predictive Value PPV: If we predict positive how likely is it a true positive?
- Negative Predictive Value NPV: If we predict negative how likely is it a true negative?
It is difficult to achieve a high TPR and low FPR at the same time. We can characterize the behavior of a binary classifier for different thresholds by plotting the TPR and FPR values – this is the ROC curve. The best classifier lies on the top-left corner – the TPR is 1 and the FPR is 0. The worst classifier lies at the diagonal. Classifiers on the diagonal produce labels essentially randomly (possibly with different proportions). For example, if each positive \(x\) will be randomly classified with 25% as “positive”, we get a TPR of 0.25. If we assign each negative \(x\) randomly to “positive” we get a FPR of 0.25. In practice, we should never obtain a classifier below the diagonal – ROC curves for different labels are symmetric with respect to the diagonal, so a curve below the diagonal would indicate that the positive and negative class labels have been switched by the classifier.
# Precision vs Recall autoplot(pred, type = "prc")
When evaluating the performance of a model, we are interested in its generalization performance – how well will it perform on new data that has not been seen during training?
We can estimate the generalization performance by evaluating a model on a test set, as we have done above, that was created to contain only observations that are not contained in the training set.
There are many different strategies for partitioning a data set into training and test; in
mlr3 we call these strategies “resampling”.
mlr3 includes the following predefined resampling strategies:
leave-one-out cross validation(
repeated cross validation(
In particular, it is often desirable to repeatedly split the entire data in different ways to ensure that a “lucky” or “unlucky” split does not bias the generalization performance estimate. Without resampling strategies like the ones we provide here, this is a tedious and error-prone process.
The following sections provide guidance on how to select a resampling strategy and how to use it.
Here is a graphical illustration of the resampling process in general:
When performing resampling with a dataset, we first need to define which approach should be used.
mlr3 resampling strategies and their parameters can be queried by looking at the
data.table output of the
mlr_resamplings dictionary; this also lists the parameters that can be changed to affect the behavior of each strategy:
## key params iters ## 1: bootstrap ratio,repeats 30 ## 2: custom NA ## 3: custom_cv NA ## 4: cv folds 10 ## 5: holdout ratio 1 ## 6: insample 1 ## 7: loo NA ## 8: repeated_cv folds,repeats 100 ## 9: subsampling ratio,repeats 30
Additional resampling methods for special use cases are available via extension packages, such as mlr3spatiotemporal for spatial data.
What we showed in the train/predict/score part is the equivalent of holdout resampling, done manually, so let’s consider this one first.
We can retrieve elements from the dictionary
$get() or the convenience function
## <ResamplingHoldout> with 1 iterations ## * Instantiated: FALSE ## * Parameters: ratio=0.6667
Note that the
$is_instantiated field is set to
This means we did not actually apply the strategy to a dataset yet.
By default we get a .66/.33 split of the data into training and test. There are two ways in which this ratio can be changed:
- Overwriting the slot in
$param_set$valuesusing a named list:
resampling$param_set$values = list(ratio = 0.8)
- Specifying the resampling parameters directly during construction:
rsmp("holdout", ratio = 0.8)
## <ResamplingHoldout> with 1 iterations ## * Instantiated: FALSE ## * Parameters: ratio=0.8
So far we have only chosen a resampling strategy; we now need to instantiate it with data.
To actually perform the splitting and obtain indices for the training and the test split, the resampling needs a
By calling the method
instantiate(), we split the indices of the data into indices for training and test sets.
These resulting indices are stored in the
## int [1:275] 1 2 3 4 7 9 11 13 15 16 ...
## int [1:69] 5 6 8 10 12 14 17 21 22 23 ...
Note that if you want to compare multiple Learners in a fair manner, using the same instantiated resampling for each learner is mandatory, such that each learner gets exactly the same training data and the performance of the trained model is evaluated in exactly the same test set. A way to greatly simplify the comparison of multiple learners is discussed in the section on benchmarking.
Learner, and a
Resampling object we can call
resample(), which fits the learner on the training set and evaluates it on the test set.
This may happen multiple times, depending on the given resampling strategy.
The result of running the
resample() function is a
We can tell
resample() to keep the fitted models (for example for later inspection) by setting the
store_models option to
TRUEand then starting the computation:
## <ResampleResult> of 3 iterations ## * Task: penguins ## * Learner: classif.rpart ## * Warnings: 0 in 0 iterations ## * Errors: 0 in 0 iterations
Here we use a three-fold cross-validation resampling, which trains and evaluates on three different training and test sets.
ResampleResult, stored as
rr, provides various getters to access and aggregate the stored information.
Here are a few examples:
Calculate the average performance across all resampling iterations, in terms of classification error:
## classif.ce ## 0.0523
Extract the performance for the individual resampling iterations:
## task task_id learner learner_id ## 1: <TaskClassif> penguins <LearnerClassifRpart> classif.rpart ## 2: <TaskClassif> penguins <LearnerClassifRpart> classif.rpart ## 3: <TaskClassif> penguins <LearnerClassifRpart> classif.rpart ## resampling resampling_id iteration prediction ## 1: <ResamplingCV> cv 1 <PredictionClassif> ## 2: <ResamplingCV> cv 2 <PredictionClassif> ## 3: <ResamplingCV> cv 3 <PredictionClassif> ## classif.ce ## 1: 0.06087 ## 2: 0.05217 ## 3: 0.04386
This is useful to check if one (or more) of the iterations are very different from the average.
Check for warnings or errors:
## Empty data.table (0 rows and 2 cols): iteration,msg
## Empty data.table (0 rows and 2 cols): iteration,msg
Extract and inspect the resampling splits; this allows to see in detail which observations were used for what purpose when:
## <ResamplingCV> with 3 iterations ## * Instantiated: TRUE ## * Parameters: folds=3
##  3
## int [1:115] 1 6 20 23 25 28 38 41 44 47 ...
## int [1:229] 4 7 12 15 26 27 29 32 34 35 ...
Retrieve the model trained in a specific iteration and inspect it, for example to investigate why the performance in this iteration was very different from the average:
lrn = rr$learners[] lrn$model
## n= 229 ## ## node), split, n, loss, yval, (yprob) ## * denotes terminal node ## ## 1) root 229 121 Adelie (0.47162 0.18777 0.34061) ## 2) flipper_length< 206.5 146 39 Adelie (0.73288 0.26712 0.00000) ## 4) bill_length< 43.35 109 3 Adelie (0.97248 0.02752 0.00000) * ## 5) bill_length>=43.35 37 1 Chinstrap (0.02703 0.97297 0.00000) * ## 3) flipper_length>=206.5 83 5 Gentoo (0.01205 0.04819 0.93976) ## 6) bill_depth>=17.15 7 3 Chinstrap (0.14286 0.57143 0.28571) * ## 7) bill_depth< 17.15 76 0 Gentoo (0.00000 0.00000 1.00000) *
Extract the individual predictions:
rr$prediction() # all predictions merged into a single Prediction object
## <PredictionClassif> for 344 observations: ## row_ids truth response prob.Adelie prob.Chinstrap prob.Gentoo ## 1 Adelie Adelie 0.97248 0.02752 0.00000 ## 6 Adelie Adelie 0.97248 0.02752 0.00000 ## 20 Adelie Chinstrap 0.02703 0.97297 0.00000 ## --- ## 336 Chinstrap Chinstrap 0.04348 0.93478 0.02174 ## 338 Chinstrap Chinstrap 0.04348 0.93478 0.02174 ## 341 Chinstrap Adelie 0.95918 0.04082 0.00000
rr$predictions()[] # predictions of first resampling iteration
## <PredictionClassif> for 115 observations: ## row_ids truth response prob.Adelie prob.Chinstrap prob.Gentoo ## 1 Adelie Adelie 0.97248 0.02752 0 ## 6 Adelie Adelie 0.97248 0.02752 0 ## 20 Adelie Chinstrap 0.02703 0.97297 0 ## --- ## 330 Chinstrap Chinstrap 0.02703 0.97297 0 ## 342 Chinstrap Chinstrap 0.02703 0.97297 0 ## 344 Chinstrap Chinstrap 0.02703 0.97297 0
Filter the result to only keep specified resampling iterations:
## <ResampleResult> of 2 iterations ## * Task: penguins ## * Learner: classif.rpart ## * Warnings: 0 in 0 iterations ## * Errors: 0 in 0 iterations
Sometimes it is necessary to perform resampling with custom splits, e.g. to reproduce results reported in a study.
A manual resampling instance can be created using the
##  1
##  1 2 3 4 5 6 7 8 9 10 51 52 53 54 55 56 57 58 59 ##  60 101 102 103 104 105 106 107 108 109 110
##  11 12 13 14 15 16 17 18 19 20 61 62 63 64 65 66 67 68 69 ##  70 111 112 113 114 115 116 117 118 119 120
In some cases, it is desirable to keep observations together, i.e. to not separate them into training and test set.
This can be defined through the column role
"group" during Task creation, i.e. a special column in the data specifies the groups (see also the help page on this column role).
mlr this was called “blocking”.
See also the mlr3gallery post on this topic for a practical example.
mlr3viz provides a
autoplot() method for resampling results.
As an example, we create a binary classification task with two features, perform a resampling with a 10-fold cross-validation and visualize the results:
# ROC curve, averaged over 10 folds autoplot(rr, type = "roc")
We can also plot the predictions of individual models:
# learner predictions for the first fold rr$filter(1) autoplot(rr, type = "prediction")
## Warning: Removed 2 rows containing missing values (geom_point).
All available plot types are listed on the manual page of
Comparing the performance of different learners on multiple tasks and/or different resampling schemes is a common task.
This operation is usually referred to as “benchmarking” in the field of machine learning.
The mlr3 package offers the
benchmark() convenience function that takes care of most of the work of repeatedly training and evaluating models under the same conditions.
Benchmark experiments in
mlr3 are specified through a design.
Such a design is essentially a table of scenarios to be evaluated; in particular unique combinations of
We use the
benchmark_grid() function to create an exhaustive design (that evaluates each learner on each task with each resampling) and instantiate the resampling properly, so that all learners are executed on the same train/test split for each tasks.
We set the learners to predict probabilities and also tell them to predict for the observations of the training set (by setting
Additionally, we use
rsmps() to retrieve lists of
Resampling in the same fashion as
library("mlr3verse") design = benchmark_grid( tasks = tsks(c("spam", "german_credit", "sonar")), learners = lrns(c("classif.ranger", "classif.rpart", "classif.featureless"), predict_type = "prob", predict_sets = c("train", "test")), resamplings = rsmps("cv", folds = 3) ) print(design)
## task learner resampling ## 1: <TaskClassif> <LearnerClassifRanger> <ResamplingCV> ## 2: <TaskClassif> <LearnerClassifRpart> <ResamplingCV> ## 3: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV> ## 4: <TaskClassif> <LearnerClassifRanger> <ResamplingCV> ## 5: <TaskClassif> <LearnerClassifRpart> <ResamplingCV> ## 6: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV> ## 7: <TaskClassif> <LearnerClassifRanger> <ResamplingCV> ## 8: <TaskClassif> <LearnerClassifRpart> <ResamplingCV> ## 9: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV>
design can be passed to
benchmark() to start the computation.
It is also possible to create a custom design manually, for example to exclude certain task-learner combinations.
However, if you create a custom task with
data.table(), the train/test splits will be different for each row of the design if you do not manually instantiate the resampling before creating the design.
See the help page on
benchmark_grid() for an example.
bmr = benchmark(design)
Note that we did not have to instantiate the resampling manually.
benchmark_grid() took care of it for us: each resampling strategy is instantiated once for each task during the construction of the exhaustive grid.
Once the benchmarking is done (and, depending on the size of your design, this can take quite some time), we can aggregate the performance with
We create two measures to calculate the area under the curve (AUC) for the training and the test set:
## nr resample_result task_id learner_id resampling_id ## 1: 1 <ResampleResult> spam classif.ranger cv ## 2: 2 <ResampleResult> spam classif.rpart cv ## 3: 3 <ResampleResult> spam classif.featureless cv ## 4: 4 <ResampleResult> german_credit classif.ranger cv ## 5: 5 <ResampleResult> german_credit classif.rpart cv ## 6: 6 <ResampleResult> german_credit classif.featureless cv ## 7: 7 <ResampleResult> sonar classif.ranger cv ## 8: 8 <ResampleResult> sonar classif.rpart cv ## 9: 9 <ResampleResult> sonar classif.featureless cv ## iters auc_train auc_test ## 1: 3 0.9995 0.9844 ## 2: 3 0.9078 0.9039 ## 3: 3 0.5000 0.5000 ## 4: 3 0.9987 0.7949 ## 5: 3 0.7981 0.7153 ## 6: 3 0.5000 0.5000 ## 7: 3 1.0000 0.9265 ## 8: 3 0.8900 0.7595 ## 9: 3 0.5000 0.5000
We can aggregate the results even further. For example, we might be interested to know which learner performed best across all tasks. Simply aggregating the performances with the mean is usually not statistically sound. Instead, we calculate the rank statistic for each learner, grouped by task. Then the calculated ranks, grouped by learner, are aggregated with the data.table package. As larger AUC scores are better, we multiply the values by \(-1\) such that the best learner has a rank of \(1\).
library("data.table") # group by levels of task_id, return columns: # - learner_id # - rank of col '-auc_train' (per level of learner_id) # - rank of col '-auc_test' (per level of learner_id) ranks = tab[, .(learner_id, rank_train = rank(-auc_train), rank_test = rank(-auc_test)), by = task_id] print(ranks)
## task_id learner_id rank_train rank_test ## 1: spam classif.ranger 1 1 ## 2: spam classif.rpart 2 2 ## 3: spam classif.featureless 3 3 ## 4: german_credit classif.ranger 1 1 ## 5: german_credit classif.rpart 2 2 ## 6: german_credit classif.featureless 3 3 ## 7: sonar classif.ranger 1 1 ## 8: sonar classif.rpart 2 2 ## 9: sonar classif.featureless 3 3
# group by levels of learner_id, return columns: # - mean rank of col 'rank_train' (per level of learner_id) # - mean rank of col 'rank_test' (per level of learner_id) ranks = ranks[, .(mrank_train = mean(rank_train), mrank_test = mean(rank_test)), by = learner_id] # print the final table, ordered by mean rank of AUC test ranks[order(mrank_test)]
## learner_id mrank_train mrank_test ## 1: classif.ranger 1 1 ## 2: classif.rpart 2 2 ## 3: classif.featureless 3 3
Unsurprisingly, the featureless learner is worse overall. The winner is the classification forest, which outperforms a single classification tree.
Such a plot gives a nice overview of the overall performance and how learners compare on different tasks in an intuitive way.
bmr_small = bmr$clone()$filter(task_id = "german_credit") autoplot(bmr_small, type = "roc")
All available plot types are listed on the manual page of
tab = bmr$aggregate(measures) rr = tab[task_id == "german_credit" & learner_id == "classif.ranger"]$resample_result[] print(rr)
## <ResampleResult> of 3 iterations ## * Task: german_credit ## * Learner: classif.ranger ## * Warnings: 0 in 0 iterations ## * Errors: 0 in 0 iterations
We can now investigate this resampling and even single resampling iterations using one of the approaches shown in the previous section:
measure = msr("classif.auc") rr$aggregate(measure)
## classif.auc ## 0.7949
## <LearnerClassifRanger:classif.ranger> ## * Model: - ## * Parameters: num.threads=1 ## * Packages: ranger ## * Predict Type: prob ## * Feature types: logical, integer, numeric, character, factor, ordered ## * Properties: importance, multiclass, oob_error, twoclass, weights
##  1 4 8 17 18 22
ResampleResult can be converted to a
BenchmarkResult with the function
We can also merge two
BenchmarkResults into a larger result object, for example for two related benchmarks that were done on different machines.
task = tsk("iris") resampling = rsmp("holdout")$instantiate(task) rr1 = resample(task, lrn("classif.rpart"), resampling) rr2 = resample(task, lrn("classif.featureless"), resampling) # Cast both ResampleResults to BenchmarkResults bmr1 = as_benchmark_result(rr1) bmr2 = as_benchmark_result(rr2) # Merge 2nd BMR into the first BMR bmr1$combine(bmr2) bmr1
## <BenchmarkResult> of 2 rows with 2 resampling runs ## nr task_id learner_id resampling_id iters warnings errors ## 1 iris classif.rpart holdout 1 0 0 ## 2 iris classif.featureless holdout 1 0 0