Appendix A — Solutions to exercises

A.1 Solutions to Chapter 2

  1. Set the seed to 124 then train a classification tree model with classif.rpart and default hyperparameters on 80% of the data in the predefined sonar task. Evaluate the model’s performance with the classification error measure on the remaining data. Also think about why we need to set the seed in this example.

Set the seed, load the sonar task, then load the classification tree learner with predict_type = "prob" (needed for exercise 3) and the required measure.

set.seed(124)
task = tsk("sonar")
learner = lrn("classif.rpart", predict_type = "prob")
measure = msr("classif.ce")

Use partition to split the dataset then train the model. We set the seed because partition introduces an element of randomness when splitting the data.

splits = partition(task, ratio = 0.8)
learner$train(task, splits$train)

Once the model is trained, generate the predictions on the test set and score them.

preds = learner$predict(task, splits$test)
preds$score(measure)
classif.ce 
 0.2195122 
  2. Calculate the true positive, false positive, true negative, and false negative rates of the predictions made by the model in exercise 1.

Using $confusion, we can generate a confusion matrix and extract the required statistics:

preds$confusion
        truth
response  M  R
       M 20  7
       R  2 12

Since the rows represent predictions (response) and the columns represent the ground truth values, the TP, FP, TN, and FN values are as follows (a short code sketch for reading them off programmatically follows the list):

  • True Positive (TP) = 20

  • False Positive (FP) = 7

  • True Negative (TN) = 12

  • False Negative (FN) = 2
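
These values can also be read off programmatically. A short sketch (not part of the original solution) using the built-in count measures, assuming the positive class of tsk("sonar") is "M":

# absolute counts of TP, FP, TN, and FN for the prediction object
preds$score(msrs(c("classif.tp", "classif.fp", "classif.tn", "classif.fn")))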

  3. Since in this case we want the model to predict the negative class more often, we will raise the threshold (note the predict_type for the learner must be prob for this to work).
# raise threshold from 0.5 default to 0.6
preds$set_threshold(0.6)

preds$confusion
        truth
response  M  R
       M 14  4
       R  8 15

One reason we might want the false positive rate to be lower than the false negative rate is if we consider it worse for a positive prediction to be incorrect (meaning the true label was the negative label) than for a negative prediction to be incorrect (meaning the true label was the positive label).
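
To see the effect of the threshold on these two rates directly, we could score the prediction object before and after changing the threshold. A sketch (not part of the original solution) using the built-in rate measures:

# false positive and false negative rates at the raised threshold of 0.6
preds$score(msrs(c("classif.fpr", "classif.fnr")))

# for comparison, reset to the default threshold of 0.5 and score again
preds$set_threshold(0.5)
preds$score(msrs(c("classif.fpr", "classif.fnr")))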

A.2 Solutions to Chapter 3

  1. Apply the “bootstrap” resampling strategy on the mtcars task and evaluate the performance of the regr.rpart decision tree learner. Use 100 replicates and a sampling ratio of 80%. Calculate the MSE for each iteration and visualize the result. Finally, calculate the aggregated performance score.
set.seed(3)
task = tsk("mtcars")
learner = lrn("regr.rpart")
resampling = rsmp("bootstrap", repeats = 100, ratio = 0.8)

rr = resample(task, learner, resampling)
rr$score(msr("regr.mse"))
     task_id learner_id resampling_id iteration regr.mse
  1:  mtcars regr.rpart     bootstrap         1 17.52956
  2:  mtcars regr.rpart     bootstrap         2 15.70574
  3:  mtcars regr.rpart     bootstrap         3 31.62666
  4:  mtcars regr.rpart     bootstrap         4 22.59376
  5:  mtcars regr.rpart     bootstrap         5 36.64000
 ---                                                    
 96:  mtcars regr.rpart     bootstrap        96 18.79482
 97:  mtcars regr.rpart     bootstrap        97 19.09911
 98:  mtcars regr.rpart     bootstrap        98 18.72434
 99:  mtcars regr.rpart     bootstrap        99 33.31609
100:  mtcars regr.rpart     bootstrap       100 25.86660
Hidden columns: task, learner, resampling, prediction
autoplot(rr)

# Alternatively: Histogram
autoplot(rr, type = "histogram")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

rr$aggregate(msr("regr.mse"))
regr.mse 
20.75009 
  2. Use the spam task and 5-fold cross-validation to benchmark Random Forest (classif.ranger), Logistic Regression (classif.log_reg), and XGBoost (classif.xgboost) with regard to AUC. Which learner appears to do best? How confident are you in your conclusion? How would you improve upon this?
set.seed(3)
grid = benchmark_grid(
  tasks = tsk("spam"),
  learners = lrns(c("classif.ranger", "classif.log_reg", "classif.xgboost"),
                  predict_type = "prob"),
  resamplings = rsmp("cv", folds = 5)
)

bmr = benchmark(grid)

mlr3viz::autoplot(bmr, measure = msr("classif.auc"))

This is only a small example of a benchmark workflow, but without tuning (see Chapter 4) the results are not suitable for making any broader statements about the superiority of any of these learners for this task.
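
To put numbers behind the plot, we could also inspect the aggregated AUC values, and a repeated cross-validation would increase confidence in the ranking. A sketch (not part of the original solution; the column selection assumes the usual aggregate() output):

# aggregated AUC per learner
bmr$aggregate(msr("classif.auc"))[, .(learner_id, classif.auc)]

# a more reliable comparison: repeated cross-validation instead of a single 5-fold CV
grid_rep = benchmark_grid(
  tasks = tsk("spam"),
  learners = lrns(c("classif.ranger", "classif.log_reg", "classif.xgboost"),
                  predict_type = "prob"),
  resamplings = rsmp("repeated_cv", folds = 5, repeats = 10)
)
# followed by benchmark(grid_rep) and aggregation as above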

  3. A colleague claims to have achieved a 93.1% classification accuracy using the classif.rpart learner on the penguins_simple task. You want to reproduce their results and ask them about their resampling strategy. They said they used a custom 3-fold cross-validation with folds assigned as factor(task$row_ids %% 3). See if you can reproduce their results.
task = tsk("penguins_simple")
r = rsmp("custom_cv")

r$instantiate(task = task, f = factor(task$row_ids %% 3))

rr = resample(
  task = task,
  learner = lrn("classif.rpart"),
  resampling = r
)

rr$aggregate(msr("classif.acc"))
classif.acc 
  0.9309309 

A.3 Solutions to Chapter 4

  1. Tune the mtry, sample.fraction, num.trees hyperparameters of a random forest model (regr.ranger) on the Motor Trend dataset (mtcars). Use a simple random search with 50 evaluations and select a suitable batch size. Evaluate with a 3-fold cross-validation and the root mean squared error.
set.seed(4)
learner = lrn("regr.ranger",
  mtry.ratio      = to_tune(0, 1),
  sample.fraction = to_tune(1e-1, 1),
  num.trees       = to_tune(1, 2000)
)

instance = ti(
  task = tsk("mtcars"),
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measures = msr("regr.rmse"),
  terminator = trm("evals", n_evals = 50)
)

tuner = tnr("random_search", batch_size = 10)

tuner$optimize(instance)
   mtry.ratio sample.fraction num.trees learner_param_vals  x_domain regr.rmse
1:  0.2764274       0.9771886       556          <list[4]> <list[3]>  2.542653
  2. Evaluate the performance of the model created in Question 1 with nested resampling. Use a holdout validation for the inner resampling and a 3-fold cross-validation for the outer resampling. Print the unbiased performance estimate of the model.
set.seed(4)
learner = lrn("regr.ranger",
  mtry.ratio      = to_tune(0, 1),
  sample.fraction = to_tune(1e-1, 1),
  num.trees       = to_tune(1, 2000)
)

at = auto_tuner(
  tuner = tnr("random_search", batch_size = 10),
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("regr.rmse"),
  terminator = trm("evals", n_evals = 50)
)

task = tsk("mtcars")
outer_resampling = rsmp("cv", folds = 3)
rr = resample(task, at, outer_resampling, store_models = TRUE)

rr$aggregate()
regr.mse 
8.322457 
  3. Tune and benchmark an XGBoost model against a logistic regression on the spam dataset and determine which has the best Brier score. Use mlr3tuningspaces and nested resampling.
library("mlr3tuningspaces")
Loading required package: mlr3tuning
learner_xgboost = lts(lrn("classif.xgboost", predict_type = "prob"))

at_xgboost = auto_tuner(
  tuner = tnr("random_search", batch_size = 1),
  learner = learner_xgboost,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.bbrier"),
  term_evals = 2
)

learner_logreg = lrn("classif.log_reg", predict_type = "prob")

at_logreg = auto_tuner(
  tuner = tnr("random_search", batch_size = 1),
  learner = learner_logreg,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.bbrier"),
  term_evals = 2
)

task = tsk("spam")
outer_resampling = rsmp("cv", folds = 3)

design = benchmark_grid(
  tasks = task,
  learners = list(at_xgboost, at_logreg),
  resamplings = outer_resampling
)

bmr = benchmark(design, store_models = TRUE)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
(this warning is repeated for each of the logistic regression fits)
bmr
<BenchmarkResult> of 6 rows with 2 resampling runs
 nr task_id            learner_id resampling_id iters warnings errors
  1    spam classif.xgboost.tuned            cv     3        0      0
  2    spam classif.log_reg.tuned            cv     3        0      0

A.4 Solutions to Chapter 5

  1. Calculate a correlation filter on the Motor Trend dataset (mtcars).
library("mlr3verse")
filter = flt("correlation")

task = tsk("mtcars")
filter$calculate(task)

as.data.table(filter)
    feature     score
 1:      wt 0.8676594
 2:     cyl 0.8521620
 3:    disp 0.8475514
 4:      hp 0.7761684
 5:    drat 0.6811719
 6:      vs 0.6640389
 7:      am 0.5998324
 8:    carb 0.5509251
 9:    gear 0.4802848
10:    qsec 0.4186840
  2. Use the filter from the first exercise to select the five best features in the mtcars dataset.
keep = names(head(filter$scores, 5))
task$select(keep)
task$feature_names
[1] "cyl"  "disp" "drat" "hp"   "wt"  
  3. Apply backward selection to the penguins dataset with the classification tree learner "classif.rpart" and holdout resampling, evaluated with the classification accuracy measure. Compare the results with those in Section 5.2.1.

library("mlr3fselect")

Attaching package: 'mlr3fselect'
The following object is masked from 'package:mlr3tuning':

    ContextEval
instance = fselect(
  fselector = fs("sequential", strategy = "sbs"),
  task =  tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc")
)
as.data.table(instance$result)[, .(bill_depth, bill_length, body_mass, classif.acc)]
   bill_depth bill_length body_mass classif.acc
1:      FALSE        TRUE      TRUE   0.9826087
instance$result_feature_set
[1] "bill_length" "body_mass"   "island"      "sex"         "year"       

Answer the following questions:

  a) Do the selected features differ?

Yes, the backward selection selects more features.

  b) Which feature selection method achieves a higher classification accuracy?

In this example, backward selection performs slightly better, but this depends heavily on the random seed and could look different in another run.

  c) Are the accuracy values in b) directly comparable? If not, what has to be changed to make them comparable?

No, they are not comparable because the holdout resampling called with rsmp("holdout") creates a different holdout set for the two runs. A fair comparison would create a single resampling instance and use it for both feature selections (see Chapter 3 for details):

resampling = rsmp("holdout")
resampling$instantiate(tsk("penguins"))

sfs = fselect(
  fselector = fs("sequential", strategy = "sfs"),
  task =  tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = resampling,
  measure = msr("classif.acc")
)
sbs = fselect(
  fselector = fs("sequential", strategy = "sbs"),
  task =  tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = resampling,
  measure = msr("classif.acc")
)
as.data.table(sfs$result)[, .(bill_depth, bill_length, body_mass, classif.acc)]
   bill_depth bill_length body_mass classif.acc
1:       TRUE        TRUE     FALSE   0.9391304
as.data.table(sbs$result)[, .(bill_depth, bill_length, body_mass, classif.acc)]
   bill_depth bill_length body_mass classif.acc
1:       TRUE        TRUE      TRUE   0.9565217

Alternatively, one could automate the feature selection and perform a benchmark between the two wrapped learners.
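
A sketch of this alternative (not part of the original solution; it assumes the same learner, inner resampling, and measure as above, and relies on auto_fselector's default terminator being sufficient for the self-terminating sequential strategies):

afs_sfs = auto_fselector(
  fselector = fs("sequential", strategy = "sfs"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc")
)
afs_sbs = auto_fselector(
  fselector = fs("sequential", strategy = "sbs"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc")
)

# outer 3-fold CV to compare the two wrapped learners fairly
design = benchmark_grid(tsk("penguins"), list(afs_sfs, afs_sbs), rsmp("cv", folds = 3))
bmr_fs = benchmark(design)
bmr_fs$aggregate(msr("classif.acc"))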

  4. Automate the feature selection as in Section 5.2.6 with the spam dataset and a logistic regression learner ("classif.log_reg"). Hint: Remember to call library("mlr3learners") for the logistic regression learner.
library("mlr3fselect")
library("mlr3learners")

at = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.log_reg"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  terminator = trm("evals", n_evals = 50)
)

grid = benchmark_grid(
  task = tsk("spam"),
  learner = list(at, lrn("classif.log_reg")),
  resampling = rsmp("cv", folds = 3)
)

bmr = benchmark(grid)

aggr = bmr$aggregate(msrs(c("classif.acc", "time_train")))
as.data.table(aggr)[, .(learner_id, classif.acc, time_train)]
                  learner_id classif.acc time_train
1: classif.log_reg.fselector   0.9239290  5.3936667
2:           classif.log_reg   0.9263204  0.1093333

A.5 Solutions to Chapter 6

A.6 Solutions to Chapter 8

  1. Run a benchmark experiment on the german_credit task with algorithms: featureless, log_reg, ranger. Tune the featureless model using tunethreshold and learner_cv. Use 2-fold CV and evaluate with msr("classif.costs", costs = costs) where you should construct the costs matrix so that the cost of a true positive is -10, the cost of a true negative is -1, the cost of a false positive is 2, and the cost of a false negative is 3. Use set.seed(11) to make sure you get the same results as us. Are your results surprising?
library(mlr3verse)
set.seed(11)

costs = matrix(c(-10, 3, 2, -1), nrow = 2, dimnames =
  list("Predicted Credit" = c("good", "bad"),
    Truth = c("good", "bad")))
cost_measure = msr("classif.costs", costs = costs)

gr = po("learner_cv", lrn("classif.featureless", predict_type = "prob")) %>>%
  po("tunethreshold", measure = cost_measure)
task = tsk("german_credit")
learners = list(as_learner(gr), lrn("classif.log_reg"), lrn("classif.ranger"))
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 2)))
bmr$aggregate(cost_measure)[, c(4, 7)]
                          learner_id classif.costs
1: classif.featureless.tunethreshold        -6.400
2:                   classif.log_reg        -5.420
3:                    classif.ranger        -5.923
  2. Train and predict a survival forest using rfsrc (from mlr3extralearners). Run this experiment using task = tsk("rats"); split = partition(task). Evaluate your model with the RCLL measure.
library(mlr3verse)
library(mlr3proba)
library(mlr3extralearners)
set.seed(11)

task = tsk("rats")
split = partition(task)

lrn("surv.rfsrc")$
  train(task, split$train)$
  predict(task, split$test)$
  score(msr("surv.rcll"))
surv.rcll 
 4.030926 
  3. Estimate the density of the tsk("precip") data using logspline (from mlr3extralearners). Run this experiment using task = tsk("precip"); split = partition(task). Evaluate your model with the logloss measure.
library(mlr3verse)
library(mlr3proba)
library(mlr3extralearners)
set.seed(11)

task = tsk("precip")
split = partition(task)

lrn("dens.logspline")$
  train(task, split$train)$
  predict(task, split$test)$
  score(msr("dens.logloss"))
dens.logloss 
    3.979233 
  4. Run a benchmark clustering experiment on the wine dataset without a label column. Compare the performance of the k-means learner with k equal to 2, 3 and 4 using the silhouette measure. Use the insample resampling technique. What value of k would you choose based on the silhouette scores?
library(mlr3)
library(mlr3cluster)
set.seed(11)
learners = list(
  lrn("clust.kmeans", centers = 2L, id = "k-means, k=2"),
  lrn("clust.kmeans", centers = 3L, id = "k-means, k=3"),
  lrn("clust.kmeans", centers = 4L, id = "k-means, k=4")
)

task = as_task_clust(tsk("wine")$data()[, -1])
measure = msr("clust.silhouette")
bmr = benchmark(benchmark_grid(task, learners, rsmp("insample")))
bmr$aggregate(measure)[, c(4, 7)]
     learner_id clust.silhouette
1: k-means, k=2        0.6568537
2: k-means, k=3        0.5711382
3: k-means, k=4        0.5605941

Based on the silhouette score, we can choose k = 2.

  5. Run a (spatially) unbiased classification benchmark experiment on the ecuador task with a featureless learner and xgboost, evaluate with the binary Brier score.

You can use any resampling method from mlr3spatiotempcv, in this solution we use 4-fold spatial environmental blocking.


library("mlr3spatiotempcv")

Attaching package: 'mlr3spatiotempcv'
The following objects are masked from 'package:mlr3spatial':

    as_task_classif_st, as_task_classif_st.data.frame,
    as_task_classif_st.DataBackend, as_task_classif_st.sf,
    as_task_classif_st.TaskClassifST, as_task_regr_st,
    as_task_regr_st.data.frame, as_task_regr_st.DataBackend,
    as_task_regr_st.sf, as_task_regr_st.TaskClassifST,
    as_task_regr_st.TaskRegrST, TaskClassifST, TaskRegrST
set.seed(11)
learners = lrns(paste0("classif.", c("xgboost", "featureless")),
  predict_type = "prob")
rsmp_sp = rsmp("spcv_env", folds = 4)
design = benchmark_grid(tsk("ecuador"), learners, rsmp_sp)
bmr = benchmark(design)
bmr$aggregate(msr("classif.bbrier"))[, c(4, 7)]
            learner_id classif.bbrier
1:     classif.xgboost      0.2302815
2: classif.featureless      0.2413113

A.7 Solutions to Chapter 9

Parallel

Not all CPUs would be utilized in the example. All 4 of them are occupied for the first 4 iterations of the cross-validation. The 5th iteration, however, only runs in parallel to the 6th fold, leaving 2 cores idle. This is supported by the elapsed time of roughly 6 seconds for 6 jobs, compared to also roughly 6 seconds for 8 jobs:

task = tsk("penguins")
learner = lrn("classif.debug", sleep_train = function() 3)

future::plan("multisession", workers = 4)

resampling = rsmp("cv", folds = 6)
system.time(resample(task, learner, resampling))

resampling = rsmp("cv", folds = 8)
system.time(resample(task, learner, resampling))

If possible, the number of resampling iterations should be an integer multiple of the number of workers. Therefore, a simple adaptation either increases the number of folds for improved accuracy of the error estimate or reduces the number of folds for improved runtime.
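
A small sketch of how one might match the number of folds to the available workers (not part of the original solution; future::nbrOfWorkers() queries the number of workers of the currently active plan):

# choose a fold count that is an integer multiple of the number of workers
n_workers = future::nbrOfWorkers()
resampling = rsmp("cv", folds = 2 * n_workers) # e.g. 8 folds on 4 workers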

Custom Measures

The rules can easily be translated to R code where we expect truth and prediction to be factor vectors of the same length with levels "A" and "B":

costsens = function(truth, prediction) {
    score = numeric(length(truth))
    score[truth == "A" & prediction == "B"] = 10
    score[truth == "B" & prediction == "A"] = 1

    mean(score)
}
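
As a quick sanity check, we can apply the function to toy vectors (a sketch, not part of the original solution):

# one costly error (truth A predicted as B) and one cheap error (truth B predicted as A)
truth = factor(c("A", "A", "B", "B"), levels = c("A", "B"))
prediction = factor(c("A", "B", "A", "B"), levels = c("A", "B"))
costsens(truth, prediction) # (0 + 10 + 1 + 0) / 4 = 2.75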

This function can be embedded in the Measure class accordingly.

MeasureCustom = R6::R6Class("MeasureCustom",
  inherit = mlr3::MeasureClassif, # classification measure
  public = list(
    initialize = function() { # initialize class
      super$initialize(
        id = "custom", # unique ID
        packages = character(), # no dependencies
        properties = character(), # no special properties
        predict_type = "response", # measures response prediction
        range = c(0, Inf), # scores range from 0 to Inf
        minimize = TRUE # smaller values are better
      )
    }
  ),

  private = list(
    .score = function(prediction, ...) { # define score as private method
      # define loss
      costsens = function(truth, prediction) {
        score = numeric(length(truth))
        score[truth == "A" & prediction == "B"] = 10
        score[truth == "B" & prediction == "A"] = 1

        mean(score)
      }

      # call loss function
      costsens(prediction$truth, prediction$response)
    }
  )
)

An alternative (as pointed to by the hint) can be constructed by first translating the rules to a matrix of misclassification costs, and then feeding this matrix to the constructor of the mlr_measures_classif.costs measure:

# truth in columns, prediction in rows
C = matrix(c(0, 10, 1, 0), nrow = 2)
rownames(C) = colnames(C) = c("A", "B")
print(C)
   A B
A  0 1
B 10 0
msr("classif.costs", costs = C)
<MeasureClassifCosts:classif.costs>: Cost-sensitive Classification
* Packages: mlr3
* Range: [0, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: normalize=TRUE
* Properties: -
* Predict type: response
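
As a plausibility check (a sketch reusing the toy vectors from above, not part of the original solution), with normalize = TRUE the matrix-based measure corresponds to the mean cost, matching costsens():

# mean cost computed from the cost matrix and a confusion table
sum(C * table(prediction, truth)) / length(truth) # 2.75, same as costsens(truth, prediction)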

A.8 Solutions to Chapter 10

Getting Data from OpenML

We access the AutoML benchmark suite with ID 269 using the mlr3oml::ocl() function.

library("mlr3oml")
automl_suite = ocl(id = 269)
automl_suite$task_ids

We can find all tasks with less than 4000 observations by specifying this constraint in mlr3oml::list_oml_tasks().

tbl = list_oml_tasks(
  task_id = automl_suite$task_ids, number_instances = c(0, 4000)
)

This returns a table which only contains matching tasks from the AutoML benchmark.

tbl[, .(task_id, data_id, name, NumberOfInstances)]
    task_id data_id                 name NumberOfInstances
 1:  167210   41021            Moneyball              1232
 2:  359930     550                quake              2178
 3:  359931     546              sensory               576
 4:  359932     541               socmob              1156
 5:  359933     507             space_ga              3107
 6:  359934     505              tecator               240
 7:  359945   42730             us_crime              1994
 8:  359950     531               boston               506
 9:  359951   42563 house_prices_nominal              1460
10:  360945   43071  MIP-2016-regression              1090

We can create mlr3 tasks from these OpenML IDs using tsk("oml").

tasks = lapply(tbl$task_id, function(id) tsk("oml", task_id = id))

Below, we define the robustified learners with a featureless fallback learner.

learner_ranger = as_learner(
  ppl("robustify", learner = lrn("regr.ranger")) %>>%
    po("learner", lrn("regr.ranger"))
)
learner_ranger$id = "ranger"
learner_ranger$fallback = lrn("regr.featureless")

learner_rpart = as_learner(
  ppl("robustify", learner = lrn("regr.rpart")) %>>%
    po("learner", lrn("regr.rpart"))
)
learner_rpart$id = "rpart"
learner_rpart$fallback = lrn("regr.featureless")

learners = list(learner_ranger, learner_rpart)

Finally, we create the experimental design using benchmark_grid().

# we set the seed, as benchmark_grid instantiates the resamplings
set.seed(123)
resampling = rsmp("cv", folds = 3)
design = benchmark_grid(tasks, learners, resampling)
design
                    task learner resampling
 1:            Moneyball  ranger         cv
 2:            Moneyball   rpart         cv
 3:                quake  ranger         cv
 4:                quake   rpart         cv
 5:              sensory  ranger         cv
 6:              sensory   rpart         cv
 7:               socmob  ranger         cv
 8:               socmob   rpart         cv
 9:             space_ga  ranger         cv
10:             space_ga   rpart         cv
11:              tecator  ranger         cv
12:              tecator   rpart         cv
13:             us_crime  ranger         cv
14:             us_crime   rpart         cv
15:               boston  ranger         cv
16:               boston   rpart         cv
17: house_prices_nominal  ranger         cv
18: house_prices_nominal   rpart         cv
19:  MIP-2016-regression  ranger         cv
20:  MIP-2016-regression   rpart         cv

Executing the Experiments using batchtools

We start by loading the relevant libraries and creating a registry. By setting the registry's file.dir to NA, we use a temporary directory.

library("mlr3batchmark")
library("batchtools")

reg = makeExperimentRegistry(
  file.dir = NA,
  seed = 1,
  packages = "mlr3verse"
)
No readable configuration file found
Created registry in '/tmp/RtmpgjSKSd/registry3f765e4a16d5' using cluster functions 'Interactive'

Then, we change the cluster function to “Multicore” (or “Socket” if you are on Windows).

# Mac and Linux
reg$cluster.functions = makeClusterFunctionsMulticore()
Auto-detected 2 CPUs
# Windows:
reg$cluster.functions = makeClusterFunctionsSocket()
Auto-detected 2 CPUs
[1] TRUE

The next two steps are to populate the registry with the experiments using batchmark() and to submit them. By specifying no IDs in submitJobs(), all jobs returned by findNotSubmitted() are queued, which in this case are all existing jobs.

batchmark(design, reg = reg)
submitJobs(reg = reg)
waitForJobs(reg = reg)

After the execution of the experiments has finished, we can collect the results as shown below.

bmr = reduceResultsBatchmark(reg = reg)
bmr
<BenchmarkResult> of 60 rows with 20 resampling runs
 nr              task_id learner_id resampling_id iters warnings errors
  1            Moneyball     ranger            cv     3        0      0
  2            Moneyball      rpart            cv     3        0      0
  3                quake     ranger            cv     3        0      0
  4                quake      rpart            cv     3        0      0
  5              sensory     ranger            cv     3        0      0
  6              sensory      rpart            cv     3        0      0
  7               socmob     ranger            cv     3        0      0
  8               socmob      rpart            cv     3        0      0
  9             space_ga     ranger            cv     3        0      0
 10             space_ga      rpart            cv     3        0      0
 11              tecator     ranger            cv     3        0      0
 12              tecator      rpart            cv     3        0      0
 13             us_crime     ranger            cv     3        0      0
 14             us_crime      rpart            cv     3        0      0
 15               boston     ranger            cv     3        0      0
 16               boston      rpart            cv     3        0      0
 17 house_prices_nominal     ranger            cv     3        0      0
 18 house_prices_nominal      rpart            cv     3        0      0
 19  MIP-2016-regression     ranger            cv     3        0      0
 20  MIP-2016-regression      rpart            cv     3        0      0

Analyzing the Results

As a first step, we load mlr3benchmark and create a benchmark aggregate using msr("regr.mse").

library("mlr3benchmark")
bma = as_benchmark_aggr(bmr, measures = msr("regr.mse"))
bma
<BenchmarkAggr> of 20 rows with 10 tasks, 2 learners and 1 measure
                 task_id learner_id          mse
 1:            Moneyball     ranger 6.167099e+02
 2:            Moneyball      rpart 1.356897e+03
 3:                quake     ranger 3.826625e-02
 4:                quake      rpart 3.567377e-02
 5:              sensory     ranger 5.072431e-01
 6:              sensory      rpart 6.095145e-01
 7:               socmob     ranger 4.475963e+02
 8:               socmob      rpart 5.984411e+02
 9:             space_ga     ranger 1.401465e-02
10:             space_ga      rpart 1.992342e-02
11:              tecator     ranger 1.956551e+01
12:              tecator      rpart 1.201766e+01
13:             us_crime     ranger 1.864804e-02
14:             us_crime      rpart 2.459386e-02
15:               boston     ranger 1.210087e+01
16:               boston      rpart 2.293306e+01
17: house_prices_nominal     ranger 9.020229e+08
18: house_prices_nominal      rpart 1.949476e+09
19:  MIP-2016-regression     ranger 7.504588e+08
20:  MIP-2016-regression      rpart 6.277064e+08

We can now use the $friedman_test() method of the benchmark aggregate to apply said test to the experiment results. Note that we don’t need a post-hoc test, because we are only comparing two algorithms.

bma$friedman_test()

    Friedman rank sum test

data:  mse and learner_id and task_id
Friedman chi-squared = 1.6, df = 1, p-value = 0.2059

This experimental design was not able to detect a significant difference at the 5% level.

Next, we inspect the ranks using the $rank_data() method of the BenchmarkAggr, where we set minimize = TRUE because a lower mean squared error is better.

bma$rank_data(minimize = TRUE)
       Moneyball quake sensory socmob space_ga tecator us_crime boston
ranger         1     2       1      1        1       2        1      1
rpart          2     1       2      2        2       1        2      2
       house_prices_nominal MIP-2016-regression
ranger                    1                   2
rpart                     2                   1

The random forest is better on 7 of the 10 tasks.
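
To complement the table of ranks, one could also visualize the mean rank per learner. A minimal sketch with ggplot2 (not part of the original solution; it only uses the rank matrix shown above):

library("ggplot2")

# mean rank per learner across the 10 tasks (lower is better)
ranks = bma$rank_data(minimize = TRUE)
rank_df = data.frame(learner = rownames(ranks), mean_rank = rowMeans(ranks))
ggplot(rank_df, aes(x = learner, y = mean_rank)) +
  geom_col() +
  labs(x = "", y = "Mean rank across tasks (lower is better)")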

A.9 Solutions to Chapter 11

  1. Prepare an mlr3 regression task for the fifa data. Select only variables describing the age and skills of footballers. Train any predictive model for this task, e.g. regr.ranger.
library("DALEX")
library("ggplot2")
data("fifa", package = "DALEX")
old_theme = set_theme_dalex("ema")

library("mlr3")
library("mlr3learners")
set.seed(1)

fifa20 <- fifa[,5:42]
task_fifa = as_task_regr(fifa20, target = "value_eur", id = "fifa20")

learner = lrn("regr.ranger")
learner$train(task_fifa)
learner$model
Ranger result

Call:
 ranger::ranger(dependent.variable.name = task$target_names, data = task$data(),      case.weights = task$weights$weight, num.threads = 1L) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      5000 
Number of independent variables:  37 
Mtry:                             6 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       1.022805e+13 
R squared (OOB):                  0.869943 
  2. Use the permutation importance method to calculate variable importance ranking. Which variable is the most important? Is it surprising?

With iml

library(iml)
model = Predictor$new(learner,
                data = fifa20,
                y = fifa$value_eur)

effect = FeatureImp$new(model,
                loss = "rmse")
effect$plot()

With DALEX

library("DALEX")
ranger_exp = DALEX::explain(learner,
  data = fifa20,
  y = fifa$value_eur,
  label = "Fifa 2020",
  verbose = FALSE)

ranger_effect = model_parts(ranger_exp, B = 5)
head(ranger_effect)
             variable mean_dropout_loss     label
1        _full_model_           1402526 Fifa 2020
2           value_eur           1402526 Fifa 2020
3           weight_kg           1471865 Fifa 2020
4 goalkeeping_kicking           1472795 Fifa 2020
5           height_cm           1474859 Fifa 2020
6    movement_balance           1475618 Fifa 2020
plot(ranger_effect)

  3. Use the Partial Dependence profile to draw the global behavior of the model for this variable. Is it aligned with your expectations?

With iml

num_features = c("movement_reactions", "skill_ball_control", "age")

effect = FeatureEffects$new(model)
plot(effect, features = num_features)

With DALEX

num_features = c("movement_reactions", "skill_ball_control", "age")

ranger_profiles = model_profile(ranger_exp, variables = num_features)
plot(ranger_profiles)

  4. Choose one of the football players. You can choose a well-known striker (e.g. Robert Lewandowski) or a well-known goalkeeper (e.g. Manuel Neuer). The following tasks are worth repeating for several different choices.

player_1 <- fifa["R. Lewandowski", 5:42]
  5. For the selected footballer, calculate and plot the Shapley values. Which variable is locally the most important and has the strongest influence on the valuation of the footballer?

With iml

shapley = Shapley$new(model, x.interest = player_1)
plot(shapley)

With DALEX

ranger_shap = predict_parts(ranger_exp,
             new_observation = player_1,
             type = "shap", B = 1)
plot(ranger_shap, show_boxplots = FALSE)

  6. For the selected footballer, calculate the Ceteris Paribus / Individual Conditional Expectation profiles to draw the local behavior of the model for this variable. Is it different from the global behavior?

With DALEX

num_features = c("movement_reactions", "skill_ball_control", "age")

ranger_ceteris = predict_profile(ranger_exp, player_1)
plot(ranger_ceteris, variables = num_features) +
  ggtitle("Ceteris paribus for R. Lewandowski", " ")

A.10 Solutions to Chapter 13

  1. We first construct the objective function and optimization instance:
rastrigin = function(xdt) {
  D = ncol(xdt)
  y = 10 * D + rowSums(xdt^2 - (10 * cos(2 * pi * xdt)))
  data.table(y = y)
}

objective = ObjectiveRFunDt$new(
  fun = rastrigin,
  domain = ps(x1 = p_dbl(lower = -5.12, upper = 5.12),
    x2 = p_dbl(lower = -5.12, upper = 5.12)),
  codomain = ps(y = p_dbl(tags = "minimize")),
  id = "rastrigin2D")

instance = OptimInstanceSingleCrit$new(
  objective = objective,
  terminator = trm("evals", n_evals = 40))

Based on the different surrogate models, we can construct two optimizers:

surrogate_gp = srlrn(lrn("regr.km", covtype = "matern5_2",
  optim.method = "BFGS", control = list(trace = FALSE)))

surrogate_rf = srlrn(lrn("regr.ranger", num.trees = 10L, keep.inbag = TRUE,
  se.method = "jack"))

acq_function = acqf("cb", lambda = 1)

acq_optimizer = acqo(opt("nloptr", algorithm = "NLOPT_GN_ORIG_DIRECT"),
  terminator = trm("stagnation", iters = 100, threshold = 1e-5))

optimizer_gp = opt("mbo",
  loop_function = bayesopt_ego,
  surrogate = surrogate_gp,
  acq_function = acq_function,
  acq_optimizer = acq_optimizer)

optimizer_rf = opt("mbo",
  loop_function = bayesopt_ego,
  surrogate = surrogate_rf,
  acq_function = acq_function,
  acq_optimizer = acq_optimizer)

We then evaluate the given initial design on the instance and optimize it with the first BO algorithm using a Gaussian Process as surrogate model:

initial_design = data.table(
  x1 = c(-3.95, 1.16, 3.72, -1.39, -0.11, 5.00, -2.67, 2.44),
  x2 = c(1.18, -3.93, 3.74, -1.37, 5.02, -0.09, -2.65, 2.46))

instance$eval_batch(initial_design)

optimizer_gp$optimize(instance)

gp_data = instance$archive$data
gp_data[, y_min := cummin(y)]
gp_data[, nr_eval := seq_len(.N)]
gp_data[, surrogate := "Gaussian Process"]

Afterwards, we clear the instance, evaluate the initial design again and optimize the instance with the second BO algorithm using a random forest as surrogate model:

instance$archive$clear()

instance$eval_batch(initial_design)

optimizer_rf$optimize(instance)

rf_data = instance$archive$data
rf_data[, y_min := cummin(y)]
rf_data[, nr_eval := seq_len(.N)]
rf_data[, surrogate := "Random Forest"]

We collect all data and visualize the anytime performance:

library(ggplot2)
library(viridisLite)
all_data = rbind(gp_data, rf_data)
ggplot(aes(x = nr_eval, y = y_min, colour = surrogate), data = all_data) +
  geom_step() +
  scale_colour_manual(values = viridis(2, end = 0.8)) +
  labs(y = "Best Observed Function Value", x = "Number of Function Evaluations",
       colour = "Surrogate Model") +
  theme_minimal() +
  theme(legend.position = "bottom")

  2. We first construct the non-parallelized objective function and the optimization instance:
schaffer1 = function(xss) {
  evaluations = lapply(xss, FUN = function(xs) {
    Sys.sleep(5)
    list(y1 = xs$x, y2 = (xs$x - 2)^2)
  })
  rbindlist(evaluations)
}

objective = ObjectiveRFunMany$new(
  fun = schaffer1,
  domain = ps(x = p_dbl(lower = -10, upper = 10)),
  codomain = ps(y1 = p_dbl(tags = "minimize"), y2 = p_dbl(tags = "minimize")),
  id = "schaffer1")

instance = OptimInstanceMultiCrit$new(
  objective = objective,
  terminator = trm("run_time", secs = 60))

Using the surrogate, acquisition function and acquisition function optimizer that are provided, we can proceed to optimize the instance via ParEGO:

surrogate = srlrn(lrn("regr.ranger", num.trees = 10L, keep.inbag = TRUE,
  se.method = "jack"))

acq_function = acqf("ei")

acq_optimizer = acqo(opt("random_search", batch_size = 100),
  terminator = trm("evals", n_evals = 100))

optimizer = opt("mbo",
  loop_function = bayesopt_parego,
  surrogate = surrogate,
  acq_function = acq_function,
  acq_optimizer = acq_optimizer,
  args = list(q = 4))
optimizer$optimize(instance)

We observe that 12 points were evaluated in total (which makes sense, as the objective function evaluation is not yet parallelized and each function evaluation takes 5 seconds). While the points are appropriately evaluated in batches of size q = 4 (with the initial design automatically constructed as the first batch), we do not experience any acceleration of the optimization process unless the function evaluation is explicitly parallelized.

nrow(instance$archive$data)
[1] 12
instance$archive$data[, c("x", "timestamp", "batch_nr")]
             x           timestamp batch_nr
 1: -7.4376562 2023-06-04 14:21:14        1
 2: -5.7781853 2023-06-04 14:21:14        1
 3:  7.5952006 2023-06-04 14:21:14        1
 4: -1.5296406 2023-06-04 14:21:14        1
 5:  0.8905768 2023-06-04 14:21:34        2
 6:  9.6750455 2023-06-04 14:21:34        2
 7:  3.4237868 2023-06-04 14:21:34        2
 8: -8.1013693 2023-06-04 14:21:34        2
 9: -3.3062102 2023-06-04 14:21:54        3
10: -2.6368993 2023-06-04 14:21:54        3
11: -5.1449362 2023-06-04 14:21:54        3
12: -3.5595275 2023-06-04 14:21:54        3

We now parallelize the evaluation of the objective function and proceed to optimize the instance again:

library(future)
library(future.apply)
plan("multisession", workers = 4)

schaffer1_parallel = function(xss) {
  evaluations = future_lapply(xss, FUN = function(xs) {
    Sys.sleep(5)
    list(y1 = xs$x, y2 = (xs$x - 2)^2)
  })
  rbindlist(evaluations)
}

objective_parallel = ObjectiveRFunMany$new(
  fun = schaffer1_parallel,
  domain = ps(x = p_dbl(lower = -10, upper = 10)),
  codomain = ps(y1 = p_dbl(tags = "minimize"), y2 = p_dbl(tags = "minimize")),
  id = "schaffer1_parallel")

instance_parallel = OptimInstanceMultiCrit$new(
  objective = objective_parallel,
  terminator = trm("run_time", secs = 60))
optimizer$optimize(instance_parallel)

By parallelizing the evaluation of the objective function we used our compute resources much more efficiently and evaluated many more points:

nrow(instance_parallel$archive$data)
[1] 44
instance_parallel$archive$data[, c("x", "timestamp", "batch_nr")]
            x           timestamp batch_nr
 1:  9.082860 2023-06-04 14:22:02        1
 2:  7.001501 2023-06-04 14:22:02        1
 3:  6.865060 2023-06-04 14:22:02        1
 4: -7.746894 2023-06-04 14:22:02        1
 5:  2.198796 2023-06-04 14:22:07        2
---                                       
40: -2.169467 2023-06-04 14:22:52       10
41: -8.252578 2023-06-04 14:22:58       11
42: -5.253845 2023-06-04 14:22:58       11
43:  9.456191 2023-06-04 14:22:58       11
44:  8.343001 2023-06-04 14:22:58       11

A.11 Solutions to Section 12.1

  1. Load the "adult_train" task and try to build a first model. Train a simple model and evaluate it on the "adult_test" task that is also available with mlr3fairness.

For now, we simply load and inspect the data.

library("mlr3")
library("mlr3fairness")
set.seed(8)

trn = tsk("adult_train")
tst = tsk("adult_test")
trn
<TaskClassif:adult_train> (30718 x 13)
* Target: target
* Properties: twoclass
* Features (12):
  - fct (7): education, marital_status, occupation, race, relationship,
    sex, workclass
  - int (5): age, capital_gain, capital_loss, education_num,
    hours_per_week
* Protected attribute: sex

We can now train a simple model, e.g. a decision tree, and evaluate its classification error.

l = lrn("classif.rpart")
l$train(trn)
prd = l$predict(tst)
prd$score()
classif.ce 
 0.1609533 
  2. Assume our goal is to achieve parity in false omission rates. Construct a fairness metric that encodes this and again evaluate your model. In order to get a deeper understanding, look at the groupwise_metrics function to obtain performance in each group.

The metric is available via the key "fairness.fomr". Note that evaluating our prediction now requires that we also provide the task.

m = msr("fairness.fomr")
prd$score(m, tst)
fairness.fomr 
   0.03533058 

The groupwise_metrics function creates a metric for each group specified in the pta column role:

tst$col_roles$pta
[1] "sex"
m2 = groupwise_metrics(base_measure = msr("classif.fomr"), task = tst)

We can then use this metric to evaluate our model again. This gives us the false omission rates for male and female individuals separately.

prd$score(m2, tst)
  subgroup.fomr_Male subgroup.fomr_Female 
           0.2441913            0.2088608 
  3. Now try to improve your model, e.g. by employing pipelines that use pre- or post-processing methods for fairness. Evaluate your model along the two metrics and visualize the results. Compare the different models using an appropriate visualization.

First we can again construct the learners above.

library("mlr3pipelines")
l1 = po("reweighing_wts") %>>% lrn("classif.rpart")
l2 = po("learner_cv", lrn("classif.rpart")) %>>%
  po("EOd")

Then we run a benchmark comparing all three learners. Note that we use 3-fold cross-validation this time for comparison.

lrns = list(l, l1, l2)
grd = benchmark_grid(trn, lrns, rsmp("cv", folds = 3L))
bmr = benchmark(grd)
bmr$aggregate(msrs(c("classif.acc", "fairness.fomr")))
   nr     task_id                   learner_id resampling_id iters classif.acc
1:  1 adult_train                classif.rpart            cv     3   0.8422424
2:  2 adult_train reweighing_wts.classif.rpart            cv     3   0.8407123
3:  3 adult_train            classif.rpart.EOd            cv     3   0.8302625
1 variable not shown: [fairness.fomr]
Hidden columns: resample_result

We can now visualize the result, for example by plotting accuracy against the fairness metric (see the sketch below).
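
A sketch (not part of the original solution; it assumes the fairness_accuracy_tradeoff() helper from mlr3fairness and its fairness_measure argument):

# accuracy vs. false omission rate parity for each learner across the CV folds
fairness_accuracy_tradeoff(bmr, fairness_measure = msr("fairness.fomr"))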

  4. Now we want to add a second sensitive attribute to your dataset, in this case "race". Add the information to your task and evaluate for the initial model again. What changes? Again study the groupwise_metrics.

This can be achieved by adding “race” to the "pta" col_role.

trn$set_col_roles("race", add_to = "pta")
trn
<TaskClassif:adult_train> (30718 x 13)
* Target: target
* Properties: twoclass
* Features (12):
  - fct (7): education, marital_status, occupation, race, relationship,
    sex, workclass
  - int (5): age, capital_gain, capital_loss, education_num,
    hours_per_week
* Protected attribute: sex, race
tst$set_col_roles("race", add_to = "pta")
prd$score(m, tst)
fairness.fomr 
    0.8888889 

If we now evaluate for the intersection of the two protected attributes, we obtain a large deviation from 0. Note that, in the non-binary case, the metric by default computes the maximum discrepancy between all groupwise metrics.

If we now obtain the groupwise_metrics, we get one metric for each intersecting group.

m3 = groupwise_metrics(msr("classif.fomr"),  trn)
unname(sapply(m3, function(x) x$id))
 [1] "subgroup.fomr_Male, White"               
 [2] "subgroup.fomr_Male, Black"               
 [3] "subgroup.fomr_Female, Black"             
 [4] "subgroup.fomr_Female, White"             
 [5] "subgroup.fomr_Male, Asian-Pac-Islander"  
 [6] "subgroup.fomr_Male, Amer-Indian-Eskimo"  
 [7] "subgroup.fomr_Female, Other"             
 [8] "subgroup.fomr_Female, Asian-Pac-Islander"
 [9] "subgroup.fomr_Female, Amer-Indian-Eskimo"
[10] "subgroup.fomr_Male, Other"               
prd$score(m3, tst)
               subgroup.fomr_Male, White 
                               0.2402402 
               subgroup.fomr_Male, Black 
                               0.2716049 
             subgroup.fomr_Female, Black 
                               0.2608696 
             subgroup.fomr_Female, White 
                               0.1918819 
  subgroup.fomr_Male, Asian-Pac-Islander 
                               0.3168317 
  subgroup.fomr_Male, Amer-Indian-Eskimo 
                               0.1666667 
             subgroup.fomr_Female, Other 
                               0.2500000 
subgroup.fomr_Female, Asian-Pac-Islander 
                               0.3529412 
subgroup.fomr_Female, Amer-Indian-Eskimo 
                               1.0000000 
               subgroup.fomr_Male, Other 
                               0.1111111 
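
As a quick check of the note above, the composite score should correspond to the largest gap between the groupwise values (a sketch, not part of the original solution):

# the composite fairness.fomr score equals the largest discrepancy between subgroups
scores = prd$score(m3, tst)
max(scores) - min(scores) # 0.8888889, matching the fairness.fomr value above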

We can see that the reason might be that the false omission rate for female Amer-Indian-Eskimo individuals is 1.0! We can investigate this further by looking at the actual counts:

table(tst$data(cols = c("race", "sex", "target")))
, , target = <=50K

                    sex
race                 Female Male
  Amer-Indian-Eskimo     56   74
  Asian-Pac-Islander    131  186
  Black                 654  619
  Other                  37   66
  White                3544 6176

, , target = >50K

                    sex
race                 Female Male
  Amer-Indian-Eskimo      3   16
  Asian-Pac-Islander     26  106
  Black                  41  133
  Other                   5   19
  White                 492 2931

One of the reasons might be that there are only 3 individuals in the “>50k” category! This is an often encountered problem, as error metrics have a large variance when samples are small. Note, that the pre- and post-processing methods in general do not all support multiple protected attributes.