Appendix A — Solutions to exercises

A.1 Solutions to Chapter 2

Train a classification model with the classif.rpart learner on the “Pima Indians Diabetes” dataset. Do this without using tsk("pima"), and instead by constructing a task from the dataset in the mlbench-package: data(PimaIndiansDiabetes2, package = "mlbench"). Make sure to define the pos outcome as positive class. Train the model on a random 80% subset of the given data and evaluate its performance with the classification error measure on the remaining data. (Note that the data set has NAs in its features. You can either rely on rpart‘s capability to handle them internally (’surrogate splits’) or remove them from the initial data.frame by using na.omit).

set.seed(1)

data(PimaIndiansDiabetes2, package = "mlbench")
task = as_task_classif(PimaIndiansDiabetes2, target = "diabetes", positive = "pos")
splits = partition(task, ratio = 0.8)
splits

$train
  [1]   1   3   7   9  10  12  15  16  18  20  23  24  25  26  32  38  40
 [18]  44  46  49  54  57  62  65  67  71  73  79  85  89  94 100 101 110
 [35] 111 112 115 116 117 121 125 129 130 131 132 144 156 160 166 171 172
 [52] 180 186 187 188 193 194 196 198 199 200 207 208 210 214 215 217 219
 [69] 220 221 222 228 232 236 237 238 239 244 255 256 260 262 265 267 270
 [86] 271 277 284 285 288 292 293 294 297 301 302 304 307 309 313 315 320
[103] 324 327 329 333 339 350 356 357 360 364 367 370 371 379 387 388 389
[120] 392 395 400 405 407 415 416 418 420 425 428 430 436 441 444 445 446
[137] 449 452 456 469 481 485 486 494 499 503 507 511 516 517 524 540 541
[154] 543 546 562 570 580 585 587 589 591 593 596 599 604 605 607 612 613
[171] 619 620 631 636 643 649 656 660 662 664 665 668 676 677 679 682 692
[188] 694 696 697 702 703 707 709 710 713 717 723 731 732 733 741 744 747
[205] 749 750 751 754 755 756 758 760 762 767   2   4   6   8  11  13  19
[222]  21  22  29  31  34  36  37  41  42  43  50  51  52  53  56  58  59
[239]  60  61  63  64  66  69  70  72  74  76  77  78  80  81  82  83  84
[256]  86  87  88  90  91  92  93  95  96  97  98  99 103 104 105 106 107
[273] 109 113 114 119 122 124 128 134 135 136 137 138 139 141 142 143 145
[290] 146 149 150 151 152 154 157 159 161 163 164 167 168 169 170 173 174
[307] 175 177 179 181 183 185 191 192 195 197 202 203 204 205 206 209 211
[324] 212 218 223 224 225 226 227 229 230 233 235 240 241 245 248 249 250
[341] 251 252 253 254 257 259 261 263 264 268 269 272 273 276 278 280 282
[358] 283 287 289 290 291 295 296 298 303 306 308 312 314 316 321 325 326
[375] 328 330 331 332 335 336 337 341 342 344 345 346 347 348 349 351 352
[392] 353 354 355 359 363 366 369 372 373 374 375 377 378 380 382 383 384
[409] 385 386 390 391 394 397 399 402 408 411 413 414 419 422 423 424 427
[426] 429 432 433 434 435 437 438 440 442 443 448 451 453 454 455 457 458
[443] 460 461 462 466 467 468 470 472 473 474 475 476 478 479 480 482 483
[460] 484 490 491 493 495 496 497 498 500 501 502 504 505 508 509 510 512
[477] 513 514 515 518 521 522 523 525 526 527 528 529 531 532 533 534 535
[494] 538 539 544 545 548 549 550 551 552 553 554 555 557 559 563 564 565
[511] 566 567 568 569 571 572 573 574 575 577 579 584 586 590 592 594 595
[528] 597 598 600 601 602 603 606 608 609 610 616 617 618 621 622 623 624
[545] 625 626 627 629 630 632 633 635 637 640 641 644 645 646 650 651 652
[562] 653 654 655 658 659 666 670 672 674 675 678 681 683 685 686 688 689
[579] 691 695 699 700 701 704 711 712 715 718 719 721 722 724 725 726 727
[596] 728 729 734 735 736 737 738 742 743 746 748 752 753 759 761 763 765
[613] 766 768

$test
  [1]   5  14  17  27  39 126 133 153 155 165 176 178 189 190 216 231 243
 [18] 246 281 299 310 318 322 323 338 340 358 361 376 398 401 403 409 410
 [35] 426 459 477 536 542 547 561 578 581 615 639 647 648 663 667 684 690
 [52] 716 720 740  28  30  33  35  45  47  48  55  68  75 102 108 118 120
 [69] 123 127 140 147 148 158 162 182 184 201 213 234 242 247 258 266 274
 [86] 275 279 286 300 305 311 317 319 334 343 362 365 368 381 393 396 404
[103] 406 412 417 421 431 439 447 450 463 464 465 471 487 488 489 492 506
[120] 519 520 530 537 556 558 560 576 582 583 588 611 614 628 634 638 642
[137] 657 661 669 671 673 680 687 693 698 705 706 708 714 730 739 745 757
[154] 764

learner = lrn("classif.rpart" , predict_type = "prob")
learner

<LearnerClassifRpart:classif.rpart>: Classification Tree
* Model: -
* Parameters: xval=0
* Packages: mlr3, rpart
* Predict Types:  response, [prob]
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, multiclass,
  selected_features, twoclass, weights

learner$train(task, row_ids = splits$train)
learner$model

n= 614 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

  1) root 614 214 neg (0.34853 0.65147)  
    2) glucose>=143.5 138  36 pos (0.73913 0.26087)  
      4) glucose>=157.5 90  15 pos (0.83333 0.16667) *
      5) glucose< 157.5 48  21 pos (0.56250 0.43750)  
       10) pedigree>=0.333 30   9 pos (0.70000 0.30000)  
         20) pregnant>=6.5 10   0 pos (1.00000 0.00000) *
         21) pregnant< 6.5 20   9 pos (0.55000 0.45000)  
           42) glucose< 150 13   3 pos (0.76923 0.23077) *
           43) glucose>=150 7   1 neg (0.14286 0.85714) *
       11) pedigree< 0.333 18   6 neg (0.33333 0.66667) *
    3) glucose< 143.5 476 112 neg (0.23529 0.76471)  
      6) age>=28.5 219  83 neg (0.37900 0.62100)  
       12) mass>=26.95 178  81 neg (0.45506 0.54494)  
         24) insulin>=109 140  65 pos (0.53571 0.46429)  
           48) pedigree>=0.2015 116  47 pos (0.59483 0.40517)  
             96) age< 56.5 106  38 pos (0.64151 0.35849)  
              192) glucose>=99.5 98  32 pos (0.67347 0.32653) *
              193) glucose< 99.5 8   2 neg (0.25000 0.75000) *
             97) age>=56.5 10   1 neg (0.10000 0.90000) *
           49) pedigree< 0.2015 24   6 neg (0.25000 0.75000) *
         25) insulin< 109 38   6 neg (0.15789 0.84211) *
       13) mass< 26.95 41   2 neg (0.04878 0.95122) *
      7) age< 28.5 257  29 neg (0.11284 0.88716) *

prediction = learner$predict(task, row_ids = splits$test)
as.data.table(prediction)

     row_ids truth response prob.pos prob.neg
  1:       5   pos      pos   0.6735   0.3265
  2:      14   pos      pos   0.8333   0.1667
  3:      17   pos      pos   0.6735   0.3265
  4:      27   pos      neg   0.3333   0.6667
  5:      39   pos      neg   0.1128   0.8872
 ---                                         
150:     730   neg      neg   0.1128   0.8872
151:     739   neg      neg   0.1128   0.8872
152:     745   neg      pos   1.0000   0.0000
153:     757   neg      pos   0.6735   0.3265
154:     764   neg      neg   0.2500   0.7500

measure = msr("classif.ce")
prediction$score(measure)

classif.ce 
    0.2662

Calculate the true positive, false positive, true negative, and false negative rates of the predictions made by the model in Exercise 1. Try to solve this in two ways: (a) Using mlr3measures-predefined measure objects, and (b) without using mlr3 tools by directly working on the ground truth and prediction vectors. Compare the results.

# true positive rate
prediction$score(msr("classif.tpr"))

classif.tpr 
     0.6111

# false positive rate
prediction$score(msr("classif.fpr"))

classif.fpr 
        0.2

# true negative rate
prediction$score(msr("classif.tnr"))

classif.tnr 
        0.8

# false negative rate
prediction$score(msr("classif.fnr"))

classif.fnr 
     0.3889

# true positives
TP = sum(prediction$truth == "pos" & prediction$response == "pos")

# false positives
FP = sum(prediction$truth == "neg" & prediction$response == "pos")

# true negatives
TN = sum(prediction$truth == "neg" & prediction$response == "neg")

# false negatives
FN = sum(prediction$truth == "pos" & prediction$response == "neg")

# true positive rate
TP / (TP + FN)

[1] 0.6111

# false positive rate
FP / (FP + TN)

[1] 0.2

# true negative rate
TN / (TN + FP)

[1] 0.8

# false negative rate
FN / (FN + TP)

[1] 0.3889

The results are the same.

Change the threshold of the model from Exercise 1 such that the false negative rate is lower. What is one reason you might do this in practice?

# confusion matrix with threshold 0.5
prediction$confusion

        truth
response pos neg
     pos  33  20
     neg  21  80

prediction$set_threshold(0.3)

# confusion matrix with threshold 0.3
prediction$confusion

        truth
response pos neg
     pos  35  23
     neg  19  77

# false positive rate
prediction$score(msr("classif.fpr"))

classif.fpr 
       0.23

# false negative rate
prediction$score(msr("classif.fnr"))

classif.fnr 
     0.3519

With a false negative rate of 0.38, we miss a lot of people who have diabetes but are predicted to not have it. This could give a false sense of security. By lowering the threshold, we can reduce the false negative rate.

A.2 Solutions to Chapter 3

Apply a repeated cross-validation resampling strategy on tsk("mtcars") and evaluate the performance of lrn("regr.rpart"). Use five repeats of three folds each. Calculate the MSE for each iteration and visualize the result. Finally, calculate the aggregated performance score.

We start by instantiating our task and learner as usual:

set.seed(3)
task = tsk("mtcars")
learner = lrn("regr.rpart")

We can instantiate a temporary resampling on the task to illustrate how it assigns observations across the 5 repeats (column rep) and 3 folds:

resampling = rsmp("repeated_cv", repeats = 5, folds = 3)
resampling$instantiate(task)
resampling$instance

     row_id rep fold
  1:      1   1    2
  2:      2   1    2
  3:      3   1    3
  4:      4   1    1
  5:      5   1    1
 ---                
156:     28   5    1
157:     29   5    3
158:     30   5    3
159:     31   5    1
160:     32   5    2

Note instantiating manually is not necessary when using resample(), as it automatically instantiates the resampling for us, so we pass it a new resampling which has not been instantiated:

resampling = rsmp("repeated_cv", repeats = 5, folds = 3)
rr = resample(task, learner, resampling)

Now we can $score() the resampling with the MSE measure across each of the 5x3 resampling iterations:

scores = rr$score(msr("regr.mse"))
scores

    task_id learner_id resampling_id iteration regr.mse
 1:  mtcars regr.rpart   repeated_cv         1    16.52
 2:  mtcars regr.rpart   repeated_cv         2    18.42
 3:  mtcars regr.rpart   repeated_cv         3    15.70
 4:  mtcars regr.rpart   repeated_cv         4    11.91
 5:  mtcars regr.rpart   repeated_cv         5    14.78
---                                                    
11:  mtcars regr.rpart   repeated_cv        11    20.60
12:  mtcars regr.rpart   repeated_cv        12    28.19
13:  mtcars regr.rpart   repeated_cv        13    25.86
14:  mtcars regr.rpart   repeated_cv        14    19.07
15:  mtcars regr.rpart   repeated_cv        15    19.88
Hidden columns: task, learner, resampling, prediction

We can manually calculate these scores since rr contains all the individual predictions. The $predictions() method returns a list of predictions for each iteration, which we can use to calculate the MSE for the first iteration:

preds = rr$predictions()
pred_1 = as.data.table(preds[[1]])
pred_1[, list(rmse = mean((truth - response)^2))]

    rmse
1: 16.52

To visualize the results, we can use ggplot2 directly on the scores object, which behaves like any other data.table:

library(ggplot2)
# Barchart of the per-iteration scores
ggplot(scores, aes(x = iteration, y = regr.mse)) +
  geom_col() +
  theme_minimal()

# Boxplot of the scores
ggplot(scores, aes(x = regr.mse)) +
  geom_boxplot() +
  scale_y_continuous(breaks = 0, labels = NULL) +
  theme_minimal()

Alternatively, the autoplot() function provides defaults for the ResampleResult object. Note that it internally scores the resampling using the MSE for regression tasks per default.

autoplot(rr)

autoplot(rr, type = "histogram")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The aggregate score is the mean of the MSE scores across all iterations, which we can calculate using $aggregate() or by manually averaging the scores we stored before:

mean(scores$regr.mse)

[1] 21.94

rr$aggregate(msr("regr.mse"))

regr.mse 
   21.94

Use tsk("spam") and five-fold CV to benchmark lrn("classif.ranger"), lrn("classif.log_reg"), and lrn("classif.xgboost", nrounds = 100) with respect to AUC. Which learner appears to perform best? How confident are you in your conclusion? Think about the stability of results and investigate this by re-rerunning the experiment with different seeds. What can be done to improve this?

First we instantiate our learners with their initial parameters, setting the predict_type = "prob" once for all of them using lrns(). We then set the nrounds parameter for XGBoost to 100 and construct a resampling object for 5-fold CV:

set.seed(3)

task = tsk("spam")
learners = lrns(c("classif.ranger", "classif.log_reg", "classif.xgboost"),
                predict_type = "prob")
learners$classif.xgboost$param_set$values$nrounds = 100
resampling = rsmp("cv", folds = 5)

We could have alternatively instantiated the learners like this, but would have needed to repeat the predict_type = "prob" argument multiple times.

learners = list(
  lrn("classif.ranger", predict_type = "prob"),
  lrn("classif.log_reg", predict_type = "prob"),
  lrn("classif.xgboost", nrounds = 100, predict_type = "prob")
)

Next we can construct a benchmark design grid with the instantiated objects using benchmark_grid():

design = benchmark_grid(
  tasks = task,
  learners = learners,
  resamplings = resampling
)
design

   task         learner resampling
1: spam  classif.ranger         cv
2: spam classif.log_reg         cv
3: spam classif.xgboost         cv

To perform the benchmark, we use the aptly named benchmark() function:

bmr = benchmark(design)
bmr

<BenchmarkResult> of 15 rows with 3 resampling runs
 nr task_id      learner_id resampling_id iters warnings errors
  1    spam  classif.ranger            cv     5        0      0
  2    spam classif.log_reg            cv     5        0      0
  3    spam classif.xgboost            cv     5        0      0

And visualize the results as a boxplot:

autoplot(bmr, measure = msr("classif.auc"))

In this example,lrn("classif.xgboost") outperforms lrn("classif.ranger"), and both outperform lrn("classif.log_reg"). Naturally this is only a visual inspection of the results — proper statistical testing of benchmark results can be conducted using the mlr3benchmark package, but for the purposes of this exercise a plot suffices.

When we re-run the same experiment with a different seed, we get a slightly different result.

set.seed(3235)
resampling = rsmp("cv", folds = 5)
design = benchmark_grid(
  tasks = task,
  learners = learners,
  resamplings = resampling
)
bmr = benchmark(design)
autoplot(bmr, measure = msr("classif.auc"))

The overall trend remains about the same, but do we trust these results? Note that we applied both lrn("classif.log_reg") and lrn("classif.ranger") with their initial parameters. While lrn("classif.log_reg") does not have any hyperparameters to tune, lrn("classif.ranger") does have several, at least one of which is usually tuned (mtry). In case of lrn("classif.xgboost") however, we arbitrarily chose nrounds = 100 rather than using the learner with its initial value of nrounds = 1, which would be equivalent to a single tree decision tree. To make any generalizations based on this experiment, we need to properly tune all relevant hyperparmeters in a systematic way. We cover this and more in Chapter 4.

A colleague reports a 93.1% classification accuracy using lrn("classif.rpart") on tsk("penguins_simple"). You want to reproduce their results and ask them about their resampling strategy. They said they used a custom three-fold CV with folds assigned as factor(task$row_ids %% 3). See if you can reproduce their results.

We make use of the custom_cv resampling strategy here:

task = tsk("penguins_simple")
rsmp_cv = rsmp("custom_cv")

We apply the rule to assign resampling folds we were provided with: Every third observation is assigned to the same fold:

rsmp_cv$instantiate(task = task, f = factor(task$row_ids %% 3))

str(rsmp_cv$instance)

List of 3
 $ 0: int [1:111] 3 6 9 12 15 18 21 24 27 30 ...
 $ 1: int [1:111] 1 4 7 10 13 16 19 22 25 28 ...
 $ 2: int [1:111] 2 5 8 11 14 17 20 23 26 29 ...

We are now ready to conduct the resampling and aggregate results:

rr = resample(
  task = task,
  learner = lrn("classif.rpart"),
  resampling = rsmp_cv
)

rr$aggregate(msr("classif.acc"))

classif.acc 
     0.9309

Converting to percentages and rounding to one decimal place, we get the same result as our colleague! Luckily they kept track of their resampling to ensure their results were reproducible.

(*) Program your own ROC plotting function without using mlr3’s autoplot() function. The signature of your function should be my_roc_plot(task, learner, train_indices, test_indices). Your function should use the $set_threshold() method of Prediction, as well as mlr3measures.

Here is a function to calculate the true positive rate (TPR, Sensitivity) and false positive rate (FPR, 1 - Specificity) in a loop across a grid of probabilities. These are set as thresholds with the $set_threshold() method of the PredictionClassif object. This way we construct the ROC curve by iteratively calculating its x and y values, after which we can use geom_step() to draw a step function. Note that we do not need to re-train the learner, we merely adjust the threshold applied to the predictions we made at the top of the function

my_roc_plot = function(task, learner, train_indices, test_indices) {
  # Train learner, predict once.
  learner$train(task, train_indices)
  pred = learner$predict(task, test_indices)
  # Positive class predictions from prediction matrix
  pos_pred = pred$prob[, which(colnames(pred$prob) == task$positive)]

  # Set a grid of probabilities to evaluate at.
  prob_grid = seq(0, 1, 0.001)

  # For each possible threshold, calculate TPR,
  # FPR + aggregate to data.table
  grid = data.table::rbindlist(lapply(prob_grid, \(thresh) {
    pred$set_threshold(thresh)
    data.table::data.table(
      thresh = thresh,
      # y axis == sensitivity == TPR
      tpr = mlr3measures::tpr(
        truth = pred$truth, response = pred$response,
        positive = task$positive),
      # x axis == 1 - specificity == 1 - TNR == FPR
      fpr = mlr3measures::fpr(
        truth = pred$truth, response = pred$response,
        positive = task$positive)
    )
  }))

  # Order descending by threshold to use ggplot2::geom_step
  data.table::setorderv(grid, cols = "thresh", order = -1)

  ggplot2::ggplot(grid, ggplot2::aes(x = fpr, y = tpr)) +
    # Step function starting with (h)orizontal, then (v)ertical
    ggplot2::geom_step(direction = "hv") +
    ggplot2::coord_equal() +
    ggplot2::geom_abline(linetype = "dashed") +
    ggplot2::theme_minimal() +
    ggplot2::labs(
      title = "My Custom ROC Curve",
      subtitle = sprintf("%s on %s", learner$id, task$id),
      x = "1 - Specificity", y = "Sensitivity",
      caption = sprintf("n = %i. Test set: %i", task$nrow, length(test_indices))
    )
}

We try our function using tsk("sonar") and lrn("classif.ranger") learner with 100 trees. We set predict_type = "prob" since we need probability predictions to apply thresholds, rather than hard class predictions.

set.seed(3)

# Setting up example task and learner for testing
task = tsk("sonar")
learner = lrn("classif.ranger", num.trees = 100, predict_type = "prob")
split = partition(task)

my_roc_plot(task, learner, split$train, split$test)

We can compare it with the pre-built plot function in mlr3viz:

learner$train(task, split$train)
pred = learner$predict(task, split$test)
autoplot(pred, type = "roc")

Note the slight discrepancy between the two curves. This is caused by some implementation differences used by the precrec which is used for this functionality in mlr3viz. There are different approaches to drawing ROC curves, and our implementation above is one of the simpler ones!

A.3 Solutions to Chapter 4

Tune the mtry, sample.fraction, and num.trees hyperparameters of lrn("regr.ranger") on tsk("mtcars"). Use a simple random search with 50 evaluations. Evaluate with a three-fold CV and the root mean squared error. Visualize the effects that each hyperparameter has on the performance via simple marginal plots, which plot a single hyperparameter versus the cross-validated MSE.

set.seed(1)

task = tsk("mtcars")

learner = lrn("regr.ranger",
  mtry = to_tune(1, 10),
  sample.fraction = to_tune(0.5, 1),
  num.trees = to_tune(100, 500)
)

instance = ti(
  learner = learner,
  task = task,
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.rmse"),
  terminator = trm("evals", n_evals = 50)
)

tuner = tnr("random_search", batch_size = 10)

tuner$optimize(instance)

   mtry sample.fraction num.trees learner_param_vals  x_domain regr.rmse
1:    5          0.9477       175          <list[4]> <list[3]>     2.502

# all evaluations
as.data.table(instance$archive)

    mtry sample.fraction num.trees regr.rmse x_domain_mtry
 1:    1          0.6194       231     3.330             1
 2:    7          0.6291       281     2.845             7
 3:   10          0.8647       300     2.637            10
 4:    6          0.7263       172     2.851             6
 5:    6          0.5876       312     2.985             6
---                                                       
46:    2          0.5961       155     3.132             2
47:    1          0.6286       228     3.304             1
48:    8          0.5906       162     2.999             8
49:    7          0.7387       153     2.704             7
50:    2          0.8854       188     2.690             2
8 variables not shown: [x_domain_sample.fraction, x_domain_num.trees, runtime_learners, timestamp, batch_nr, warnings, errors, resample_result]

# best configuration
instance$result

   mtry sample.fraction num.trees learner_param_vals  x_domain regr.rmse
1:    5          0.9477       175          <list[4]> <list[3]>     2.502

# incumbent plot
autoplot(instance, type = "incumbent")

# marginal plots
autoplot(instance, type = "marginal", cols_x = "mtry")

autoplot(instance, type = "marginal", cols_x = "sample.fraction")

autoplot(instance, type = "marginal", cols_x = "num.trees")

Evaluate the performance of the model created in Exercise 1 with nested resampling. Use a holdout validation for the inner resampling and a three-fold CV for the outer resampling.

set.seed(1)

task = tsk("mtcars")

learner = lrn("regr.ranger",
  mtry = to_tune(1, 10),
  sample.fraction = to_tune(0.5, 1),
  num.trees = to_tune(100, 500)
)

at = auto_tuner(
  tuner = tnr("random_search", batch_size = 50),
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("regr.rmse"),
  terminator = trm("evals", n_evals = 50)
)

rr = resample(task, at, rsmp("cv", folds = 3))

rr$aggregate(msr("regr.rmse"))

regr.rmse 
    2.552

The "rmse" is slightly higher than the one we obtained in Exercise 1. We see that the performance estimated while tuning overestimates the true performance

Tune and benchmark an XGBoost model against a logistic regression (without tuning the latter) and determine which has the best Brier score. Use mlr3tuningspaces and nested resampling, try to pick appropriate inner and outer resampling strategies that balance computational efficiency vs. stability of the results.

set.seed(1)

task = tsk("sonar")
lrn_log_reg = lrn("classif.log_reg", predict_type = "prob")

# load xgboost learner with search space
lrn_xgboost = lts(lrn("classif.xgboost", predict_type = "prob"))

# search space for xgboost
lrn_xgboost$param_set$search_space()

<ParamSet>
                  id    class  lower    upper nlevels        default
1:           nrounds ParamInt  1.000 5000.000    5000 <NoDefault[3]>
2:               eta ParamDbl -9.210    0.000     Inf <NoDefault[3]>
3:         max_depth ParamInt  1.000   20.000      20 <NoDefault[3]>
4:  colsample_bytree ParamDbl  0.100    1.000     Inf <NoDefault[3]>
5: colsample_bylevel ParamDbl  0.100    1.000     Inf <NoDefault[3]>
6:            lambda ParamDbl -6.908    6.908     Inf <NoDefault[3]>
7:             alpha ParamDbl -6.908    6.908     Inf <NoDefault[3]>
8:         subsample ParamDbl  0.100    1.000     Inf <NoDefault[3]>
1 variable not shown: [value]
Trafo is set.

at_xgboost = auto_tuner(
  tuner = tnr("random_search", batch_size = 50),
  learner = lrn_xgboost,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.bbrier"),
  terminator = trm("evals", n_evals = 50)
)

design = benchmark_grid(
  tasks = task,
  learners = list(lrn_log_reg, at_xgboost),
  resamplings = rsmp("cv", folds = 5)
)

bmr = benchmark(design)

bmr$aggregate(msr("classif.bbrier"))

   nr task_id            learner_id resampling_id iters classif.bbrier
1:  1   sonar       classif.log_reg            cv     5         0.2772
2:  2   sonar classif.xgboost.tuned            cv     5         0.1206
Hidden columns: resample_result

We use the lts() function from the mlr3tuningspaces package to load the lrn("classif.xgboost") with a search space. The learner is wrapped in an auto_tuner(), which is then benchmarked against the lrn("classif.log_reg").

(*) Write a function that implements an iterated random search procedure that drills down on the optimal configuration by applying random search to iteratively smaller search spaces. Your function should have seven inputs: task, learner, search_space, resampling, measure, random_search_stages, and random_search_size. You should only worry about programming this for fully numeric and bounded search spaces that have no dependencies. In pseudo-code:
1. Create a random design of size random_search_size from the given search space and evaluate the learner on it.
2. Identify the best configuration.
3. Create a smaller search space around this best config, where you define the new range for each parameter as: new_range[i] = (best_conf[i] - 0.25 * current_range[i], best_conf[i] + 0.25*current_range[i]). Ensure that this new_range respects the initial bound of the original search_space by taking the max() of the new and old lower bound, and the min() of the new and the old upper bound (“clipping”).
4. Iterate the previous steps random_search_stages times and at the end return the best configuration you have ever evaluated.

library(mlr3misc)

focus_search = function(task, learner, search_space, resampling, measure, random_search_stages, random_search_size) {

  repeat {

    # tune learner on random design
    instance = tune(
      tuner = tnr("random_search", batch_size = random_search_size),
      learner = learner,
      task = task,
      resampling = resampling,
      measure = measure,
      search_space = search_space,
      terminator = trm("evals", n_evals = random_search_size),
    )

    # identify the best configuration
    best_config = instance$result_x_search_space

    # narrow search space
    walk(search_space$ids(), function(id) {
      best = best_config[[id]]
      lower = search_space$params[[id]]$lower
      upper = search_space$params[[id]]$upper

      new_lower = best - 0.25 * lower
      new_upper = best + 0.25 * upper

      if ("ParamInt" %in% class(search_space$params[[id]])) {
        new_lower = round(new_lower)
        new_upper = round(new_upper)
      }

      search_space$params[[id]]$lower = max(new_lower, lower)
      search_space$params[[id]]$upper = min(new_upper, upper)
    })

    random_search_stages = random_search_stages - 1
    if (!random_search_stages) return(best_config)
  }
}

focus_search(
  task = tsk("mtcars"),
  learner = lrn("regr.xgboost"),
  search_space = ps(
    eta = p_dbl(lower = 0.01, upper = 0.5),
    max_depth = p_int(lower = 1, upper = 10),
    nrounds = p_int(lower = 10, upper = 100)
  ),
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.rmse"),
  random_search_stages = 2,
  random_search_size = 50
)

     eta max_depth nrounds
1: 0.254         6      60

As a stretch goal, look into mlr3tuning’s internal source code and turn your function into an R6 class inheriting from the Tuner class – test it out on a learner of your choice.

library(R6)
library(mlr3tuning)

TunerFocusSearch = R6Class("TunerFocusSearch",
  inherit = Tuner,
  public = list(

    initialize = function() {
      param_set = ps(
        random_search_stages = p_int(lower = 1L, tags = "required"),
        random_search_size = p_int(lower = 1L, tags = "required")
      )

      param_set$values = list(random_search_stages = 10L, random_search_size = 50L)

      super$initialize(
        id = "focus_search",
        param_set = param_set,
        param_classes = c("ParamLgl", "ParamInt", "ParamDbl", "ParamFct"),
        properties = c("dependencies", "single-crit", "multi-crit"),
        label = "Focus Search",
        man = "mlr3tuning::mlr_tuners_focus_search"
      )
    }
  ),

  private = list(
    .optimize = function(inst) {
      pv = self$param_set$values
      search_space = inst$search_space

       for (i in seq(pv$random_search_stages)) {

        # evaluate random design
        xdt = generate_design_random(search_space, pv$random_search_size)$data
        inst$eval_batch(xdt)

        # identify the best configuration
        best_config = inst$archive$best(batch = i)

        # narrow search space
        walk(search_space$ids(), function(id) {
          best = best_config[[id]]
          lower = search_space$params[[id]]$lower
          upper = search_space$params[[id]]$upper

          new_lower = best - 0.25 * lower
          new_upper = best + 0.25 * upper

          if ("ParamInt" %in% class(search_space$params[[id]])) {
            new_lower = round(new_lower)
            new_upper = round(new_upper)
          }

          search_space$params[[id]]$lower = max(new_lower, lower)
          search_space$params[[id]]$upper = min(new_upper, upper)
        })
      }
    }
  )
)

mlr_tuners$add("focus_search", TunerFocusSearch)

instance = ti(
  task = tsk("mtcars"),
  learner = lrn("regr.xgboost"),
  search_space = ps(
    eta = p_dbl(lower = 0.01, upper = 0.5),
    max_depth = p_int(lower = 1, upper = 10),
    nrounds = p_int(lower = 10, upper = 100)
  ),
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.rmse"),
  terminator = trm("none")
)

tuner = tnr("focus_search", random_search_stages = 2, random_search_size = 50)

tuner$optimize(instance)

      eta max_depth nrounds learner_param_vals  x_domain regr.rmse
1: 0.1913         1      99          <list[6]> <list[3]>     2.999

A.4 Solutions to Chapter 5

Tune the mtry, sample.fraction, and num.trees hyperparameters of lrn("regr.ranger") on tsk("mtcars") and evaluate this with a three-fold CV and the root mean squared error (same as Chapter 4, Exercise 1). Use tnr("mbo") with 50 evaluations. Compare this with the performance progress of a random search run from Chapter 4, Exercise 1. Plot the progress of performance over iterations and visualize the spatial distribution of the evaluated hyperparameter configurations for both algorithms.

We first construct the learner, task, resampling, measure and terminator and then the instance.

library(mlr3mbo)
library(bbotk)
library(data.table)
library(ggplot2)
library(viridisLite)

set.seed(5)

learner = lrn("regr.ranger",
  mtry = to_tune(1, 10),
  sample.fraction = to_tune(0.5, 1),
  num.trees = to_tune(100, 500)
)
task = tsk("mtcars")
resampling = rsmp("cv", folds = 3)
measure = msr("regr.rmse")
terminator = trm("evals", n_evals = 50)
instance_rs = ti(
  learner = learner,
  task = task,
  resampling = resampling,
  measure = measure,
  terminator = terminator
)

Using a random search results in the following final performance:

tuner = tnr("random_search", batch_size = 50)
tuner$optimize(instance_rs)

   mtry sample.fraction num.trees learner_param_vals  x_domain regr.rmse
1:    7          0.9071       465          <list[4]> <list[3]>     2.452

We then construct a new instance and optimize it via Bayesian Optimization (BO) using tnr("mbo") in its default configuration (see also mbo_defaults):

instance_bo = ti(
  learner = learner,
  task = task,
  resampling = resampling,
  measure = measure,
  terminator = terminator
)
tuner = tnr("mbo")
tuner$optimize(instance_bo)

   mtry sample.fraction num.trees learner_param_vals  x_domain regr.rmse
1:    7          0.9778       364          <list[4]> <list[3]>     2.157

We then add relevant information to the archives of the instances so that we can combine their data and use this data for generating the desired plots.

instance_rs$archive$data[, iteration := seq_len(.N)]
instance_rs$archive$data[, best_rmse := cummin(regr.rmse)]
instance_rs$archive$data[, method := "Random Search"]
instance_bo$archive$data[, iteration := seq_len(.N)]
instance_bo$archive$data[, best_rmse := cummin(regr.rmse)]
instance_bo$archive$data[, method := "BO"]

plot_data = rbind(instance_rs$archive$data[, c("iteration", "best_rmse", "method")],
  instance_bo$archive$data[, c("iteration", "best_rmse", "method")])

ggplot(aes(x = iteration, y = best_rmse, colour = method), data = plot_data) +
  geom_step() +
  scale_colour_manual(values = viridis(2, end = 0.8)) +
  labs(x = "Number of Configurations", y = "Best regr.rmse", colour = "Method") +
  theme_minimal() +
  theme(legend.position = "bottom")

We see that BO manages to slightly outperform the random search. Ideally, we would replicate running both optimizers multiple times with different random seeds and visualize their average performance along with a dispersion measure to properly take randomness of the optimization process into account. We could even use the same first few random samples as the initial design in BO to allow for a fairer comparison.

To visualize the spatial distribution of the evaluated hyperparameter configurations we will plot for each evaluated configuration the number of trees on the x-axis and the sample fraction on the y-axis. The label of each point corresponds to the mtry parameter directly.

relevant_columns = c("mtry", "sample.fraction", "num.trees", "iteration", "method")
plot_data_sampling = rbind(
  instance_rs$archive$data[, ..relevant_columns, with = FALSE],
  instance_bo$archive$data[, ..relevant_columns, with = FALSE])

ggplot(
    aes(x = num.trees, y = sample.fraction, colour = method, label = mtry),
    data = plot_data_sampling
  ) +
  scale_colour_manual(values = viridis(2, end = 0.8)) +
  geom_point(size = 0) +
  geom_text() +
  guides(colour = guide_legend(title = "Method", override.aes = aes(label = "", size = 2))) +
  theme_minimal() +
  theme(legend.position = "bottom")

We observe that the random search samples uniformly at random – as expected. BO, however, focuses on regions of the search space with a high number of trees between 350 and 400, a high sample fraction and mtry values of around 5 to 8. This is also the region where the final result returned by BO is located. Nevertheless, BO also explores the search space, i.e., along the line of a high sample fraction close to 1.

Minimize the 2D Rastrigin function $f: [-5.12, 5.12] \times [-5.12, 5.12] \rightarrow \mathbb{R}$, $\mathbf{x} \mapsto 10 D+\sum_{i=1}^D\left[x_i^2-10 \cos \left(2 \pi x_i\right)\right]$, $D = 2$ via BO (standard sequential single-objective BO via bayesopt_ego()) using the lower confidence bound with lambda = 1 as acquisition function and "NLOPT_GN_ORIG_DIRECT" via opt("nloptr") as acquisition function optimizer. Use a budget of 40 function evaluations. Run this with both the “default” Gaussian process surrogate model with Matérn 5/2 kernel, and the “default” random forest surrogate model. Compare their anytime performance (similarly as in Figure 5.7). You can construct the surrogate models with default settings using:

surrogate_gp = srlrn(default_gp())
surrogate_rf = srlrn(default_rf())

We first construct the function, making use of efficient evaluation operating on a data.table directly. We then wrap this function in the corresponding ObjectiveRFunDt objective class and construct the instance.

rastrigin = function(xdt) {
  D = ncol(xdt)
  y = 10 * D + rowSums(xdt^2 - (10 * cos(2 * pi * xdt)))
  data.table(y = y)
}

objective = ObjectiveRFunDt$new(
  fun = rastrigin,
  domain = ps(x1 = p_dbl(lower = -5.12, upper = 5.12),
    x2 = p_dbl(lower = -5.12, upper = 5.12)),
  codomain = ps(y = p_dbl(tags = "minimize")),
  id = "rastrigin2D")

instance = OptimInstanceSingleCrit$new(
  objective = objective,
  terminator = trm("evals", n_evals = 40))

We then construct the surrogates as well as the acquisition function and acquisition function optimizer (we will terminate the acquisition function optimization once optimization process stagnates by 1e-5 over the last 100 iterations) and construct the two BO optimizers.

surrogate_gp = srlrn(default_gp())
surrogate_rf = srlrn(default_rf())

acq_function = acqf("cb", lambda = 1)

acq_optimizer = acqo(opt("nloptr", algorithm = "NLOPT_GN_ORIG_DIRECT"),
  terminator = trm("stagnation", iters = 100, threshold = 1e-5))

optimizer_gp = opt("mbo",
  loop_function = bayesopt_ego,
  surrogate = surrogate_gp,
  acq_function = acq_function,
  acq_optimizer = acq_optimizer)

optimizer_rf = opt("mbo",
  loop_function = bayesopt_ego,
  surrogate = surrogate_rf,
  acq_function = acq_function,
  acq_optimizer = acq_optimizer
)

We will use the following initial design for both optimizers:

initial_design = data.table(
  x1 = c(-3.95, 1.16, 3.72, -1.39, -0.11, 5.00, -2.67, 2.44),
  x2 = c(1.18, -3.93, 3.74, -1.37, 5.02, -0.09, -2.65, 2.46)
)
instance$eval_batch(initial_design)

We then proceed to optimize the instance with each of the two optimizers and make sure to extract the relevant data from the archive of the instance.

optimizer_gp$optimize(instance)

        x1     x2  x_domain    y
1: -0.9942 0.9947 <list[2]> 1.99

gp_data = instance$archive$data
gp_data[, y_min := cummin(y)]
gp_data[, iteration := seq_len(.N)]
gp_data[, surrogate := "Gaussian Process"]

instance$archive$clear()

instance$eval_batch(initial_design)

optimizer_rf$optimize(instance)

       x1     x2  x_domain     y
1: 0.9411 -2.023 <list[2]> 5.755

rf_data = instance$archive$data
rf_data[, y_min := cummin(y)]
rf_data[, iteration := seq_len(.N)]
rf_data[, surrogate := "Random forest"]

We then combine the data and use it to generate the desired plot:

plot_data = rbind(gp_data, rf_data)
ggplot(aes(x = iteration, y = y_min, colour = surrogate), data = plot_data) +
  geom_step() +
  scale_colour_manual(values = viridis(2, end = 0.8)) +
  labs(y = "Best Observed Function Value", x = "Number of Function Evaluations",
       colour = "Surrogate Model") +
  theme_minimal() +
  theme(legend.position = "bottom")

As expected, we observe that the BO algorithm with the Gaussian Process surrogate appears to outperform the random forest surrogate counterpart. However, ideally we would replicate running each algorithm using different random seeds and visualize the average performance along with some dispersion measure to properly take randomness of the optimization process into account.

Minimize the following function: $f: [-10, 10] \rightarrow \mathbb{R}^2, x \mapsto \left(x^2, (x - 2)^2\right)$ with respect to both objectives. Use the ParEGO algorithm. Construct the objective function using the ObjectiveRFunMany class. Terminate the optimization after a runtime of 100 evals. Plot the resulting Pareto front and compare it to the analytical solution, $y_2 = \left(\sqrt{y_1}-2\right)^2$ with $y_1$ ranging from $0$ to $4$.

We first construct the function, wrap it in the objective and then create the instance.

fun = function(xss) {
  evaluations = lapply(xss, FUN = function(xs) {
    list(y1 = xs$x ^ 2, y2 = (xs$x - 2)^2)
  })
  rbindlist(evaluations)
}

objective = ObjectiveRFunMany$new(
  fun = fun,
  domain = ps(x = p_dbl(lower = -10, upper = 10)),
  codomain = ps(y1 = p_dbl(tags = "minimize"), y2 = p_dbl(tags = "minimize")),
  id = "schaffer1")

instance = OptimInstanceMultiCrit$new(
  objective = objective,
  terminator = trm("evals", n_evals = 100)
)

As a surrogate we will use a random forest. ParEGO is a scalarization based multi-objective BO algorithm and therefore we use the Expected Improvement as acquisition function. We will use the same acquisition functon optimizer as earlier.

surrogate = srlrn(default_rf())

acq_function = acqf("ei")

acq_optimizer = acqo(opt("nloptr", algorithm = "NLOPT_GN_ORIG_DIRECT"),
  terminator = trm("stagnation", iters = 100, threshold = 1e-5))

optimizer = opt("mbo",
  loop_function = bayesopt_parego,
  surrogate = surrogate,
  acq_function = acq_function,
  acq_optimizer = acq_optimizer
)

We then optimize the instance:

optimizer$optimize(instance)

         x  x_domain     y1      y2
 1: 0.5151 <list[1]> 0.2654 2.20484
 2: 0.1330 <list[1]> 0.0177 3.48554
 3: 0.3292 <list[1]> 0.1084 2.79151
 4: 1.5638 <list[1]> 2.4454 0.19028
 5: 1.8107 <list[1]> 3.2786 0.03583
---                                
38: 1.0791 <list[1]> 1.1645 0.84805
39: 1.5912 <list[1]> 2.5320 0.16710
40: 0.9785 <list[1]> 0.9575 1.04344
41: 1.7010 <list[1]> 2.8933 0.08942
42: 0.8413 <list[1]> 0.7078 1.34250

Finally, we visualize the resulting Pareto front (in black) and its analytical counterpart (in darkgrey).

true_pareto = data.table(y1 = seq(from = 0, to = 4, length.out = 1001))
true_pareto[, y2 := (sqrt(y1) - 2) ^2]

ggplot(aes(x = y1, y = y2), data = instance$archive$best()) +
  geom_point() +
  geom_line(data = true_pareto, colour = "darkgrey") +
  geom_step(direction = "hv") +
  labs(x = expression(y[1]), y = expression(y[2])) +
  theme_minimal()

A.5 Solutions to Chapter 6

Compute the correlation filter scores on tsk("mtcars") and use the filter to select the five features most strongly correlated with the target. Resample lrn("regr.kknn") on both the full dataset and the reduced one, and compare both performances based on 10-fold CV with respect to MSE. NB: Here, we have performed the feature filtering outside of CV, which is generally not a good idea as it biases the CV performance estimation. To do this properly, filtering should be embedded inside the CV via pipelines – try to come back to this exercise after you read Chapter 8 to implement this with less bias.

set.seed(1)

task = tsk("mtcars")

filter = flt("correlation")

filter$calculate(task)

# sorted filter scores
filter$scores

    wt    cyl   disp     hp   drat     vs     am   carb   gear   qsec 
0.8677 0.8522 0.8476 0.7762 0.6812 0.6640 0.5998 0.5509 0.4803 0.4187

# subset task to 5 most correlated features
task$select(names(head(filter$scores, 5)))
task

<TaskRegr:mtcars> (32 x 6): Motor Trends
* Target: mpg
* Properties: -
* Features (5):
  - dbl (5): cyl, disp, drat, hp, wt

design = benchmark_grid(
  tasks = list(task, tsk("mtcars")),
  learners = lrn("regr.kknn"),
  resamplings = rsmp("cv", folds = 10)
)

bmr = benchmark(design)
bmr$aggregate(msr("regr.mse"))

   nr task_id learner_id resampling_id iters regr.mse
1:  1  mtcars  regr.kknn            cv    10    6.678
2:  2  mtcars  regr.kknn            cv    10    9.083
Hidden columns: resample_result

The "mse" is much lower on the filtered task.

Apply backward selection to tsk("penguins") with lrn("classif.rpart") and holdout resampling by the classification accuracy measure. Compare the results with those in Section 6.2.1 by also running the forward selection from that section. Do the selected features differ? Which feature selection method reports a higher classification accuracy in its $result?

set.seed(1)

fselector_sbs = fs("sequential", strategy = "sbs")

instance_sbs = fsi(
  task =  tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  terminator = trm("none")
)

fselector_sbs$optimize(instance_sbs)

   bill_depth bill_length body_mass flipper_length island  sex year
1:       TRUE        TRUE      TRUE          FALSE  FALSE TRUE TRUE
2 variables not shown: [features, classif.acc]

# optimization path sbs
fselector_sbs$optimization_path(instance_sbs)

   bill_depth bill_length body_mass flipper_length island   sex  year
1:       TRUE        TRUE      TRUE           TRUE   TRUE  TRUE  TRUE
2:      FALSE        TRUE      TRUE           TRUE   TRUE  TRUE  TRUE
3:      FALSE        TRUE      TRUE          FALSE   TRUE  TRUE  TRUE
4:      FALSE        TRUE      TRUE          FALSE  FALSE  TRUE  TRUE
5:      FALSE        TRUE      TRUE          FALSE  FALSE FALSE  TRUE
6:      FALSE        TRUE      TRUE          FALSE  FALSE FALSE FALSE
7:      FALSE        TRUE     FALSE          FALSE  FALSE FALSE FALSE
2 variables not shown: [classif.acc, batch_nr]

instance_sbs$result

   bill_depth bill_length body_mass flipper_length island  sex year
1:       TRUE        TRUE      TRUE          FALSE  FALSE TRUE TRUE
2 variables not shown: [features, classif.acc]

fselector_sfs = fs("sequential", strategy = "sfs")

instance_sfs = fsi(
  task =  tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  terminator = trm("none")
)

fselector_sfs$optimize(instance_sfs)

   bill_depth bill_length body_mass flipper_length island   sex  year
1:       TRUE        TRUE     FALSE           TRUE  FALSE FALSE FALSE
2 variables not shown: [features, classif.acc]

# optimization path sfs
fselector_sfs$optimization_path(instance_sfs)

   bill_depth bill_length body_mass flipper_length island   sex  year
1:       TRUE       FALSE     FALSE          FALSE  FALSE FALSE FALSE
2:       TRUE        TRUE     FALSE          FALSE  FALSE FALSE FALSE
3:       TRUE        TRUE     FALSE           TRUE  FALSE FALSE FALSE
4:       TRUE        TRUE      TRUE           TRUE  FALSE FALSE FALSE
5:       TRUE        TRUE      TRUE           TRUE   TRUE FALSE FALSE
6:       TRUE        TRUE      TRUE           TRUE   TRUE  TRUE FALSE
7:       TRUE        TRUE      TRUE           TRUE   TRUE  TRUE  TRUE
2 variables not shown: [classif.acc, batch_nr]

instance_sfs$result

   bill_depth bill_length body_mass flipper_length island   sex  year
1:       TRUE        TRUE     FALSE           TRUE  FALSE FALSE FALSE
2 variables not shown: [features, classif.acc]

The sequential backward search selects 5 features, while the sequential forward search selects all features. The sequential backward search reports a higher classification accuracy.

There is a problem in the performance comparison in Exercise 2 as feature selection is performed on the test-set. Change the process by applying forward feature selection with auto_fselector(). Compare the performance to backward feature selection from Exercise 2 using nested resampling.

set.seed(1)

afs_sfs = auto_fselector(
  fselector = fs("sequential", strategy = "sfs"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc")
)

afs_sbs = auto_fselector(
  fselector = fs("sequential", strategy = "sbs"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc")
)

design = benchmark_grid(
  tasks = tsk("penguins"),
  learners = list(afs_sfs, afs_sbs),
  resamplings = rsmp("cv", folds = 5)
)

bmr = benchmark(design)
bmr$aggregate(msr("classif.acc"))

   nr  task_id              learner_id resampling_id iters classif.acc
1:  1 penguins classif.rpart.fselector            cv     5      0.9332
2:  2 penguins classif.rpart.fselector            cv     5      0.9216
Hidden columns: resample_result

Now the sequential forward search selects yields a slightly higher classification accuracy.

(*) Write a feature selection algorithm that is a hybrid of a filter and a wrapper method. This search algorithm should compute filter scores for all features and then perform a forward search. But instead of tentatively adding all remaining features to the current feature set, it should only stochastically try a subset of the available features. Features with high filter scores should be added with higher probability. Start by coding a stand-alone R method for this search (based on a learner, task, resampling, performance measure and some control settings).

library(mlr3verse)
library(data.table)

task = tsk("sonar")
learner = lrn("classif.rpart")
resampling = rsmp("cv", folds = 3)
measure = msr("classif.acc")
filter = flt("auc")
n = 5
max_features = 10


filter_forward_selection_search = function(task, learner, resampling, measure, filter, n, max_features) {
  features = task$feature_names

  # calculate filter scores
  filter$calculate(task)
  scores = filter$scores

  result_features = character(0)
  while(max_features > length(result_features)) {

    # select n features to try
    filter_features = sample(names(scores), size = min(n, length(scores)), prob = scores)

    # create feature matrix
    states = matrix(FALSE, ncol = length(features), nrow = length(filter_features))

    # add filter features to matrix
    for (i in seq_along(filter_features)) {
      states[i, which(features %in% filter_features[i])] = TRUE
    }

    # add already selected features to matrix
    states[, which(features %in% result_features)] = TRUE

    # convert matrix to design
    design = setnames(as.data.table(states), features)

    # evaluate feature combinations
    instance = fselect(
      fselector = fs("design_points", design = design, batch_size = nrow(design)),
      task =  task,
      learner = learner,
      resampling = resampling,
      measure = measure
    )

    # current best set
    result_features = instance$result_feature_set

    # remove selected features from scores
    scores = scores[!names(scores) %in% result_features]
  }

  result_features
}

filter_forward_selection_search(task, learner, resampling, measure, filter, n, max_features)

 [1] "V11" "V12" "V2"  "V34" "V4"  "V50" "V51" "V52" "V56" "V58"

Then, as a stretch goal, see if you can implement this as an R6 class inheriting from FSelector.

library(R6)
library(checkmate)
library(mlr3verse)
library(mlr3fselect)


Attaching package: 'mlr3fselect'

The following object is masked from 'package:mlr3tuning':

    ContextEval

library(data.table)

FSelectorSequentialFilter = R6Class("FSelectorSequentialFilter",
  inherit = FSelector,
  public = list(

    #' @description
    #' Creates a new instance of this [R6][R6::R6Class] class.`
    initialize = function() {
      param_set = ps(
        filter = p_uty(tags = "required"),
        n = p_int(lower = 1, tags = "required"),
        max_features = p_int(lower = 1)
      )

      super$initialize(
        id = "filter_sequential",
        param_set = param_set,
        properties = "single-crit",
        label = "Sequential Filter Search",
        man = "mlr3fselect::mlr_fselectors_sequential"
      )
    }
  ),
  private = list(
    .optimize = function(inst) {
      pv = self$param_set$values
      features = inst$archive$cols_x
      max_features = pv$max_features %??% length(features)

      # calculate filter scores
      pv$filter$calculate(inst$objective$task)
      scores = pv$filter$scores

      result_features = character(0)
      while(max_features > length(result_features)) {

        # select n features to try
        filter_features = sample(names(scores), size = min(pv$n, length(scores)), prob = scores)

        # create feature matrix
        states = matrix(FALSE, ncol = length(features), nrow = length(filter_features))

        # add filter features to matrix
        for (i in seq_along(filter_features)) {
          states[i, which(features %in% filter_features[i])] = TRUE
        }

        # add already selected features to matrix
        states[, which(features %in% result_features)] = TRUE

        # convert matrix to design
        design = setnames(as.data.table(states), features)

        # evaluate feature combinations
        inst$eval_batch(design)

        # current best set
        res = inst$archive$best(batch = inst$archive$n_batch)
        result_features = features[as.logical(res[, features, with = FALSE])]

        # remove selected features from scores
        scores = scores[!names(scores) %in% result_features]
      }
    }
  )
)

mlr_fselectors$add("sequential_filter", FSelectorSequentialFilter)

instance = fselect(
  fselector = fs(
    "sequential_filter",
    filter = flt("auc"),
    n = 5,
    max_features = 10),
  task =  tsk("sonar"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.acc")
)

A.6 Solutions to Chapter 7

Concatenate the PipeOps named in the exercise by using %>>%. The resulting Graph can then be converted to a Learner by using as_learner().

library(mlr3pipelines)
library(mlr3learners)

graph = po("imputeoor") %>>% po("scale") %>>% lrn("classif.log_reg")
graph_learner = as_learner(graph)

The GraphLearner can be trained like any other Learner object, thereby filling in its $model field. It is possible to access the $state of any PipeOp through this field: the states are named after the PipeOp’s $id. The logistic regression model can then be extracted from the state of the po("learner") that contains the lrn("classif.log_reg").

graph_learner$train(tsk("pima"))

# access the state of the po("learner") to get the model
model = graph_learner$model$classif.log_reg$model
coef(model)

(Intercept)         age     glucose     insulin        mass    pedigree 
   -0.88835     0.15584     1.13631    -0.17477     0.74383     0.32121 
   pregnant    pressure     triceps 
    0.39594    -0.24967     0.05599

Alternatively, the underlying lrn("classif.log_reg") can be accessed through the $base_learner() method:

model = graph_learner$base_learner()$model
coef(model)

(Intercept)         age     glucose     insulin        mass    pedigree 
   -0.88835     0.15584     1.13631    -0.17477     0.74383     0.32121 
   pregnant    pressure     triceps 
    0.39594    -0.24967     0.05599

As a third option, the trained PipeOp can be accessed through the $graph_model field of the GraphLearner. The trained PipeOp has a $learner_model field, which contains the trained Learner object, which contains the model.

pipeop = graph_learner$graph_model$pipeops$classif.log_reg
model = pipeop$learner_model$model
coef(model)

(Intercept)         age     glucose     insulin        mass    pedigree 
   -0.88835     0.15584     1.13631    -0.17477     0.74383     0.32121 
   pregnant    pressure     triceps 
    0.39594    -0.24967     0.05599

Set the $keep_results flag of the Graph to TRUE to keep the results of the individual PipeOps. Afterwards, the input of the lrn("classif.log_reg") can be accessed through the $.result field of its predecessor, the po("scale"). Note that the $.result is a list, we want to access its only element, named $output.

graph_learner$graph$keep_results = TRUE
graph_learner$train(tsk("pima"))

# access the input of the learner
scale_result = graph_learner$graph_model$pipeops$scale$.result

scale_output_task = scale_result$output

age_column = scale_output_task$data()$age

# check if the age column is standardized:
# 1. does it have mean 0? -- almost, up to numerical precision!
mean(age_column)

[1] 1.988e-16

# 2. does it have standard deviation 1? -- yes!
sd(age_column)

[1] 1

A.7 Solutions to Chapter 8

Use the po("pca") to replace numeric columns with their PCA transform. To restrict this operator to only columns without missing values, the affect_columns with a fitting Selector can be used: The selector_missing(), which selects columns with missing values, combined with selector_invert(), which inverts the selection. Since po("pca") only operates on numeric columns, it is not necessary to use a Selector to select numeric columns.

graph = as_graph(po("pca",
  affect_columns = selector_invert(selector_missing()))
)

# apply the graph to the pima task
graph_result = graph$train(tsk("pima"))

# we get the following features
graph_result[[1]]$feature_names

[1] "PC1"      "PC2"      "PC3"      "glucose"  "insulin"  "mass"    
[7] "pressure" "triceps"

# Compare with feature-columns of tsk("pima") with missing values:
selector_missing()(tsk("pima"))

[1] "glucose"  "insulin"  "mass"     "pressure" "triceps"

Alternatively, po("select") can be used to select the columns without missing values that are passed to po("pca"). Another po("select") can be used to select all the other columns. It is put in parallel with the first po("select") using gunion(). It is necessary to use different $id values for both po("select") to avoid a name clash in the Graph. To combine the output from both paths, po("featureunion") can be used.

path1 = po("select", id = "select_non_missing",
  selector = selector_invert(selector_missing())) %>>%
    po("pca")
path2 = po("select", id = "select_missing",
  selector = selector_missing())
graph = gunion(list(path1, path2)) %>>% po("featureunion")

# apply the graph to the pima task
graph_result = graph$train(tsk("pima"))
graph_result[[1]]$feature_names

[1] "PC1"      "PC2"      "PC3"      "glucose"  "insulin"  "mass"    
[7] "pressure" "triceps"

First, observe the feature names produced by the level 0 learners when applied to the tsk("wine") task:

lrn_rpart = lrn("classif.rpart", predict_type = "prob")
po_rpart_cv = po("learner_cv", learner = lrn_rpart,
  resampling.folds = 2, id = "rpart_cv"
)

lrn_knn = lrn("classif.kknn", predict_type = "prob")
po_knn_cv = po("learner_cv",
  learner = lrn_knn,
  resampling.folds = 2, id = "knn_cv"
)

# we restrict ourselves to two level 0 learners here to
# focus on the essentials.

gr_level_0 = gunion(list(po_rpart_cv, po_knn_cv))
gr_combined = gr_level_0 %>>% po("featureunion")

gr_combined$train(tsk("wine"))[[1]]$head()

   type rpart_cv.prob.1 rpart_cv.prob.2 rpart_cv.prob.3 knn_cv.prob.1
1:    1          0.9688         0.03125         0.00000             1
2:    1          0.9688         0.03125         0.00000             1
3:    1          0.9688         0.03125         0.00000             1
4:    1          1.0000         0.00000         0.00000             1
5:    1          0.0000         0.96667         0.03333             1
6:    1          0.9688         0.03125         0.00000             1
2 variables not shown: [knn_cv.prob.2, knn_cv.prob.3]

To use po("select") to remove, instead of keep, a feature based on a pattern, use selector_invert together with selector_grep. To remove the “1” class columns, i.e. all columns with names that end in “1”, the following po("select") could be used:

drop_one = po("select", selector = selector_invert(selector_grep("\\.1$")))

# Train it on the wine task with lrn("classif.multinom"):

gr_stack = gr_combined %>>% drop_one %>>%
  lrn("classif.multinom", trace = FALSE)

glrn_stack = as_learner(gr_stack)

glrn_stack$train(tsk("wine"))

glrn_stack$base_learner()$model

Call:
nnet::multinom(formula = type ~ ., data = task$data(), trace = FALSE)

Coefficients:
  (Intercept) rpart_cv.prob.2 rpart_cv.prob.3 knn_cv.prob.2
2      -6.559          -8.273           15.96        44.974
3     -29.359         -19.237           14.07        -9.424
  knn_cv.prob.3
2         14.64
3         64.42

Residual Deviance: 5.504 
AIC: 25.5

A solution that does not need to specify the target classes at all is to use a custom Selector, as was shown in Section 8.3.1:

selector_remove_one_prob_column = function(task) {
  class_removing = task$class_names[[1]]
  selector_use = selector_invert(selector_grep(paste0("\\.", class_removing ,"$")))
  selector_use(task)
}

Using this selector in Section 8.3.2, one could use the resulting stacking learner on any classification task with arbitrary target classes. It can be used as an alternative to the Selector used in exercise 2:

drop_one_alt = po("select", selector = selector_remove_one_prob_column)

# The same as above:
gr_stack = gr_combined %>>% drop_one_alt %>>%
  lrn("classif.multinom", trace = FALSE)
glrn_stack = as_learner(gr_stack)
glrn_stack$train(tsk("wine"))

# As before, the first class was dropped.
glrn_stack$base_learner()$model

Call:
nnet::multinom(formula = type ~ ., data = task$data(), trace = FALSE)

Coefficients:
  (Intercept) rpart_cv.prob.2 rpart_cv.prob.3 knn_cv.prob.2
2      -3.735         -0.1002          -10.62         21.81
3     -21.486        -17.2448          -21.62         40.16
  knn_cv.prob.3
2         18.13
3         53.81

Residual Deviance: 27.4 
AIC: 47.4

We choose to use the following options for imputation, factor encoding, and model training. Note the use of pos() and lrns(), which return lists of PipeOp and Learner objects, respectively.

imputing = pos(c("imputeoor", "imputesample"))

factor_encoding = pos(c("encode", "encodeimpact"))

models = lrns(c("classif.rpart", "classif.log_reg", "classif.svm"))

Use the ppl("branch") pipeline to get Graphs with alternative path branching, controlled by its own hyperparameter. We need to give the po("branch") operators that are created here individual prefixes to avoid nameclashes when we put everything together.

full_graph = ppl("branch",
    prefix_branchops = "impute_", graphs = imputing
  ) %>>% ppl("branch",
    prefix_branchops = "encode_", graphs = factor_encoding
  ) %>>% ppl("branch",
    prefix_branchops = "model_", graphs = models
  )

full_graph$plot()

The easiest way to set up the search space for this pipeline is to use to_tune(). It is necessary to record the dependencies of the hyperparameters of the preprocessing and model PipeOps on the branch hyperparameters. For this, to_tune() needs to be applied to a Domain object – p_dbl(), p_fct(), etc. – that has its dependency declared using the depends argument.

library("paradox")
full_graph$param_set$set_values(
  impute_branch.selection = to_tune(),
  encode_branch.selection = to_tune(),
  model_branch.selection = to_tune(),

  encodeimpact.smoothing = to_tune(p_dbl(1e-3, 1e3, logscale = TRUE,
    depends = encode_branch.selection == "encodeimpact")),
  encode.method = to_tune(p_fct(c("one-hot", "poly"),
    depends = encode_branch.selection == "encode")),

  classif.rpart.cp = to_tune(p_dbl(0.001, 0.3, logscale = TRUE,
    depends = model_branch.selection == "classif.rpart")),
  classif.svm.cost = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE,
    depends = model_branch.selection == "classif.svm")),
  classif.svm.gamma = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE,
    depends = model_branch.selection == "classif.svm"))
)

We also set a few SVM kernel hyperparameters record their dependency on the model selection branch hyperparameter. We could record these dependencies in the Graph, using the $add_dep() method of the ParamSet, but here we use the simpler approach of adding a single item search space component.

full_graph$param_set$set_values(
  classif.svm.type = to_tune(p_fct("C-classification",
    depends = model_branch.selection == "classif.svm")),
  classif.svm.kernel = to_tune(p_fct("radial",
    depends = model_branch.selection == "classif.svm"))
)

To turn this Graph into an AutoML-system, we use an AutoTuner. Here we use random search, but any other Tuner could be used.

library("mlr3tuning")
automl_at = auto_tuner(
  tuner = tnr("random_search"),
  learner = full_graph,
  resampling = rsmp("cv", folds = 4),
  measure = msr("classif.ce"),
  term_evals = 30
)

We can now benchmark this AutoTuner on a few tasks and compare it with the untuned random forest with out-of-range (OOR) imputation:

learners = list(
  automl_at,
  as_learner(po("imputeoor") %>>% lrn("classif.ranger"))
)
learners[[1]]$id = "automl"
learners[[2]]$id = "ranger"

tasks = list(
  tsk("breast_cancer"),
  tsk("pima"),
  tsk("sonar")
)

set.seed(123L)
design = benchmark_grid(tasks, learners = learners, rsmp("cv", folds = 3))
bmr = benchmark(design)

bmr$aggregate()

   nr       task_id learner_id resampling_id iters classif.ce
1:  1 breast_cancer     automl            cv     3    0.03513
2:  2 breast_cancer     ranger            cv     3    0.02489
3:  3          pima     automl            cv     3    0.22917
4:  4          pima     ranger            cv     3    0.23047
5:  5         sonar     automl            cv     3    0.22098
6:  6         sonar     ranger            cv     3    0.16812
Hidden columns: resample_result

The AutoTuner performs better than the untuned random forest on one task. This is, of course, a toy example to demonstrate the capabilities of mlr3pipelines in combination with the mlr3tuning package. To use this kind of setup on real world data, one would need to take care of making the process more robust, e.g. by using the ppl("robustify") pipeline, and by using fallback learners.

A.8 Solutions for Chapter 9

We will consider a prediction problem similar to the one from this chapter, but using the King County Housing regression data instead (available with tsk("kc_housing")). To evaluate the models, we again use 10-fold CV, mean absolute error and lrn("regr.glmnet"). For now we will ignore the date column and simply remove it:

set.seed(1)

library("mlr3data")
task = tsk("kc_housing")
task$select(setdiff(task$feature_names, "date"))

Have a look at the features, are there any features which might be problematic? If so, change or remove them. Check the dataset and learner properties to understand which preprocessing steps you need to do.

summary(task)

     price           bathrooms       bedrooms       condition   
 Min.   :  75000   Min.   :0.00   Min.   : 0.00   Min.   :1.00  
 1st Qu.: 321950   1st Qu.:1.75   1st Qu.: 3.00   1st Qu.:3.00  
 Median : 450000   Median :2.25   Median : 3.00   Median :3.00  
 Mean   : 540088   Mean   :2.12   Mean   : 3.37   Mean   :3.41  
 3rd Qu.: 645000   3rd Qu.:2.50   3rd Qu.: 4.00   3rd Qu.:4.00  
 Max.   :7700000   Max.   :8.00   Max.   :33.00   Max.   :5.00  
                                                                
     floors         grade            lat            long     
 Min.   :1.00   Min.   : 1.00   Min.   :47.2   Min.   :-123  
 1st Qu.:1.00   1st Qu.: 7.00   1st Qu.:47.5   1st Qu.:-122  
 Median :1.50   Median : 7.00   Median :47.6   Median :-122  
 Mean   :1.49   Mean   : 7.66   Mean   :47.6   Mean   :-122  
 3rd Qu.:2.00   3rd Qu.: 8.00   3rd Qu.:47.7   3rd Qu.:-122  
 Max.   :3.50   Max.   :13.00   Max.   :47.8   Max.   :-121  
                                                             
   sqft_above   sqft_basement    sqft_living    sqft_living15 
 Min.   : 290   Min.   :  10    Min.   :  290   Min.   : 399  
 1st Qu.:1190   1st Qu.: 450    1st Qu.: 1427   1st Qu.:1490  
 Median :1560   Median : 700    Median : 1910   Median :1840  
 Mean   :1788   Mean   : 742    Mean   : 2080   Mean   :1987  
 3rd Qu.:2210   3rd Qu.: 980    3rd Qu.: 2550   3rd Qu.:2360  
 Max.   :9410   Max.   :4820    Max.   :13540   Max.   :6210  
                NA's   :13126                                 
    sqft_lot         sqft_lot15          view       waterfront     
 Min.   :    520   Min.   :   651   Min.   :0.000   Mode :logical  
 1st Qu.:   5040   1st Qu.:  5100   1st Qu.:0.000   FALSE:21450    
 Median :   7618   Median :  7620   Median :0.000   TRUE :163      
 Mean   :  15107   Mean   : 12768   Mean   :0.234                  
 3rd Qu.:  10688   3rd Qu.: 10083   3rd Qu.:0.000                  
 Max.   :1651359   Max.   :871200   Max.   :4.000                  
                                                                   
    yr_built     yr_renovated      zipcode     
 Min.   :1900   Min.   :1934    Min.   :98001  
 1st Qu.:1951   1st Qu.:1987    1st Qu.:98033  
 Median :1975   Median :2000    Median :98065  
 Mean   :1971   Mean   :1996    Mean   :98078  
 3rd Qu.:1997   3rd Qu.:2007    3rd Qu.:98118  
 Max.   :2015   Max.   :2015    Max.   :98199  
                NA's   :20699

The zipcode should not be interpreted as a numeric value, so we cast it to a factor. We could argue to remove lat and long as handling them as linear effects is not necessarily a suitable, but we will keep them since glmnet performs internal feature selection anyways.

zipencode = po("mutate", mutation = list(zipcode = ~ as.factor(zipcode)), id = "zipencode")

Build a suitable pipeline that allows glmnet to be trained on the dataset. Construct a new glmnet model with ppl("robustify"). Compare the two pipelines in a benchmark experiment.

lrn_glmnet = lrn("regr.glmnet")

graph_preproc =
  zipencode %>>%
  po("fixfactors") %>>%
  po("encodeimpact") %>>%
  list(
    po("missind", type = "integer", affect_columns = selector_type("integer")),
    po("imputehist", affect_columns = selector_type("integer"))) %>>%
  po("featureunion") %>>%
  po("imputeoor", affect_columns = selector_type("factor")) %>>%
  lrn_glmnet

graph_preproc$plot()

glmnet does not support factors or missing values. So our pipeline needs to handle both. First we fix the factor levels to ensure that all 70 zipcodes are fixed. We can consider 70 levels high cardinality, so we use impact encoding. We use the same imputation strategy as in Chapter 9.

graph_robustify =
  pipeline_robustify(task = task, learner = lrn_glmnet) %>>%
  lrn_glmnet

graph_robustify$plot()

glrn_preproc = as_learner(graph_preproc, id = "glmnet_preproc")
glrn_robustify = as_learner(graph_robustify, id = "glmnet_robustify")

design = benchmark_grid(
  tasks = task,
  learners = list(glrn_preproc, glrn_robustify),
  resamplings = rsmp("cv", folds = 3)
)

bmr = benchmark(design)

Warning: Multiple lambdas have been fit. Lambda will be set to 0.01 (see parameter 's').
This happened PipeOp regr.glmnet's $predict()

Warning: Multiple lambdas have been fit. Lambda will be set to 0.01 (see parameter 's').
This happened PipeOp regr.glmnet's $predict()

Warning: Multiple lambdas have been fit. Lambda will be set to 0.01 (see parameter 's').
This happened PipeOp regr.glmnet's $predict()

Warning: Multiple lambdas have been fit. Lambda will be set to 0.01 (see parameter 's').
This happened PipeOp regr.glmnet's $predict()

Warning: Multiple lambdas have been fit. Lambda will be set to 0.01 (see parameter 's').
This happened PipeOp regr.glmnet's $predict()

Warning: Multiple lambdas have been fit. Lambda will be set to 0.01 (see parameter 's').
This happened PipeOp regr.glmnet's $predict()

bmr$aggregate(msr("regr.rmse"))

   nr    task_id
1:  1 kc_housing
2:  2 kc_housing
4 variables not shown: [learner_id, resampling_id, iters, regr.rmse]
Hidden columns: resample_result

Our preprocessing pipeline performs slightly better than the robustified one.

Now consider the date feature: How can you extract information from this feature in a way that glmnet can use? Does this improve the performance of your pipeline? Finally, consider the spatial nature of the dataset. Can you extract an additional feature from the lat / long coordinates? (Hint: Downtown Seattle has lat/long coordinates 47.605/122.334).

task = tsk("kc_housing")

graph_mutate =
  po("mutate", mutation = list(
    date = ~ as.numeric(date),
    distance_downtown = ~ sqrt((lat - 47.605)^2 + (long  + 122.334)^2))) %>>%
  zipencode %>>%
  po("encodeimpact") %>>%
  list(
    po("missind", type = "integer", affect_columns = selector_type("integer")),
    po("imputehist", affect_columns = selector_type("integer"))) %>>%
  po("featureunion") %>>%
  po("imputeoor", affect_columns = selector_type("factor")) %>>%
  lrn_glmnet

glrn_mutate = as_learner(graph_mutate)

design = benchmark_grid(
  tasks = task,
  learners = glrn_mutate,
  resamplings = rsmp("cv", folds = 3)
)

bmr_2 = benchmark(design)
bmr$combine(bmr_2)
bmr$aggregate(msr("regr.mae"))

   nr    task_id
1:  1 kc_housing
2:  2 kc_housing
3:  3 kc_housing
4 variables not shown: [learner_id, resampling_id, iters, regr.mae]
Hidden columns: resample_result

We simply convert the date feature into a numeric timestamp so that glmnet can handle the feature. We create one additional feature as the distance to downtown Seattle. This improves the average error of our model by a further 1400$.

A.9 Solutions to Chapter 10

Consider the following example where you resample a learner (debug learner, sleeps for 3 seconds during $train()) on 4 workers using the multisession backend:

task = tsk("penguins")
learner = lrn("classif.debug", sleep_train = function() 3)
resampling = rsmp("cv", folds = 6)

future::plan("multisession", workers = 4)
resample(task, learner, resampling)

<ResampleResult> with 6 resampling iterations
  task_id    learner_id resampling_id iteration warnings errors
 penguins classif.debug            cv         1        0      0
 penguins classif.debug            cv         2        0      0
 penguins classif.debug            cv         3        0      0
 penguins classif.debug            cv         4        0      0
 penguins classif.debug            cv         5        0      0
 penguins classif.debug            cv         6        0      0

Assuming that the learner would actually calculate something and not just sleep: Would all CPUs be busy?
Prove your point by measuring the elapsed time, e.g., using system.time().
What would you change in the setup and why?

Not all CPUs would be utilized for the whole duration. All 4 of them are occupied for the first 4 iterations of the cross-validation. The 5th iteration, however, only runs in parallel to the 6th fold, leaving 2 cores idle. This is supported by the elapsed time of roughly 6 seconds for 6 jobs compared to also roughly 6 seconds for 8 jobs:

task = tsk("penguins")
learner = lrn("classif.debug", sleep_train = function() 3)

future::plan("multisession", workers = 4)

resampling = rsmp("cv", folds = 6)
system.time(resample(task, learner, resampling))

   user  system elapsed 
  0.248   0.023   6.776

resampling = rsmp("cv", folds = 8)
system.time(resample(task, learner, resampling))

   user  system elapsed 
  0.309   0.003   6.874

If possible, the number of resampling iterations should be an integer multiple of the number of workers. Therefore, a simple adaptation either increases the number of folds for improved accuracy of the error estimate or reduces the number of folds for improved runtime.

Create a new custom binary classification measure which scores (“prob”-type) predictions. This measure should compute the absolute difference between the predicted probability for the positive class and a 0-1 encoding of the ground truth and then average these values across the test set. Test this with classif.log_reg on tsk("sonar").

The rules can easily be translated to R code where we first convert select the predicted probabilities for the positive class, 0-1 encode the truth vector and then calculate the mean absolute error between the two vectors.

mae_prob = function(truth, prob, task) {
  # retrieve positive class from task
  positive = task$positive
  # select positive class probabilities
  prob_positive = prob[, positive]
  # obtain 0-1 encoding of truth
  y = as.integer(truth == positive)
  # average the absolute difference
  mean(abs(prob_positive - y))
}

This function can be embedded in the Measure class accordingly.

MeasureMaeProb = R6::R6Class("MeasureMaeProb",
  inherit = mlr3::MeasureClassif, # classification measure
  public = list(
    initialize = function() { # initialize class
      super$initialize( # initialize method of parent class
        id = "mae_prob", # unique ID
        packages = character(), # no dependencies
        properties = "requires_task", # needs access to task for positive class
        predict_type = "prob", # measures probability prediction
        range = c(0, 1), # results in values between [0, 1]
        minimize = TRUE # smaller values are better
      )
    }
  ),

  private = list(
    .score = function(prediction, task, ...) { # define score as private method
      # call loss function
      mae_prob(prediction$truth, prediction$prob, task)
    }
  )
)

Because this is a custom class that is not available in the mlr_measures dictionary, we have to create a new instance using the $new() constructor.

msr_mae_prob = MeasureMaeProb$new()
msr_mae_prob

<MeasureMaeProb:mae_prob>
* Packages: mlr3
* Range: [0, 1]
* Minimize: TRUE
* Average: macro
* Parameters: list()
* Properties: requires_task
* Predict type: prob

To try this measure, we resample a logistic regression on the sonar task using five-fold cross-validation.

# predict_type is set to "prob", as otherwise our measure does not work
learner = lrn("classif.log_reg", predict_type = "prob")
task = tsk("sonar")
rr = resample(task, learner, rsmp("cv", folds = 5))

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

We now score the resample result using our custom measure and msr("classif.acc").

score = rr$score(list(msr_mae_prob, msr("classif.acc")))

In this case, there is a clear relationship between the classification accuracy and our custom measure, i.e. the higher the accuracy, the lower the mean absolute error of the predicted probabilities.

cor(score$mae_prob, score$classif.acc)

[1] -0.9982

“Tune” the error_train hyperparameter of the classif.debug learner on a continuous interval from 0 to 1, using a simple classification tree as the fallback learner and the penguins task. Tune for 50 iterations using random search and 10-fold cross-validation. Inspect the resulting archive and find out which evaluations resulted in an error, and which did not. Now do the same in the interval 0.3 to 0.7. Are your results surprising?

First, we create the learner that we want to tune, mark the relevant parameter for tuning and set the fallback learner to a classification tree.

lrn_debug = lrn("classif.debug",
  error_train = to_tune(0, 1),
  fallback = lrn("classif.rpart")
)
lrn_debug

<LearnerClassifDebug:classif.debug>: Debug Learner for Classification
* Model: -
* Parameters: error_train=<RangeTuneToken>
* Packages: mlr3
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, character, factor,
  ordered
* Properties: hotstart_forward, missings, multiclass, twoclass

This example is unusual, because we expect better results from the fallback classification tree than from the primary debug learner, which predicts the mode of the target distribution. Nonetheless it serves as a good example to illustrate the effects of training errors on the tuning results.

We proceed with optimizing the classification accuracy of the learner on the penguins task.

instance = tune(
  learner =  lrn_debug,
  task = tsk("penguins"),
  resampling = rsmp("cv"),
  tuner = tnr("random_search"),
  measure = msr("classif.acc"),
  term_evals = 50
)
instance

<TuningInstanceSingleCrit>
* State:  Optimized
* Objective: <ObjectiveTuning:classif.debug_on_penguins>
* Search Space:
            id    class lower upper nlevels
1: error_train ParamDbl     0     1     Inf
* Terminator: <TerminatorEvals>
* Result:
   error_train classif.acc
1:      0.6239      0.9066
* Archive:
    error_train classif.acc
 1:     0.87063      0.8278
 2:     0.82582      0.9007
 3:     0.03762      0.4658
 4:     0.05152      0.4071
 5:     0.23837      0.4165
---                        
46:     0.17506      0.4095
47:     0.62394      0.9066
48:     0.61325      0.8918
49:     0.58476      0.6005
50:     0.07254      0.3315

To find out which evaluations resulted in an error, we can inspect the $archive slot of the instance, which we convert to a data.table for easier filtering.

archive = as.data.table(instance$archive)
archive[, c("error_train", "classif.acc", "errors")]

    error_train classif.acc errors
 1:     0.87063      0.8278      8
 2:     0.82582      0.9007      9
 3:     0.03762      0.4658      1
 4:     0.05152      0.4071      0
 5:     0.23837      0.4165      2
---                               
46:     0.17506      0.4095      1
47:     0.62394      0.9066      9
48:     0.61325      0.8918      9
49:     0.58476      0.6005      4
50:     0.07254      0.3315      0

Below, we visualize the relationship between the error probabilty and the classification accuracy.

ggplot(data = archive, aes(x = error_train, y = classif.acc, color = errors)) +
  geom_point() +
  theme_minimal()

Higher values for error_train lead to more resampling iterations using the classification tree fallback learner and therefore to better classification accuracies. Therefore, the best found hyperparameter configurations will tend to have values of error_train close to 1. When multiple parameter configurations have the same test performance, the first one is chosen by $result_learner_param_vals.

instance$result_learner_param_vals

$error_train
[1] 0.6239

We repeat the same experiment for the tuning interval from 0.3 to 0.7.

lrn_debug$param_set$set_values(
  error_train = to_tune(0.3, 0.7)
)

instance2 = tune(
  learner =  lrn_debug,
  task = tsk("penguins"),
  resampling = rsmp("cv"),
  tuner = tnr("random_search"),
  measure = msr("classif.acc"),
  term_evals = 50
)

archive2 = as.data.table(instance2$archive)
instance2

<TuningInstanceSingleCrit>
* State:  Optimized
* Objective: <ObjectiveTuning:classif.debug_on_penguins>
* Search Space:
            id    class lower upper nlevels
1: error_train ParamDbl   0.3   0.7     Inf
* Terminator: <TerminatorEvals>
* Result:
   error_train classif.acc
1:      0.6668      0.9477
* Archive:
    error_train classif.acc
 1:      0.5358      0.7373
 2:      0.3158      0.5050
 3:      0.5598      0.8934
 4:      0.3856      0.6506
 5:      0.6456      0.8849
---                        
46:      0.6684      0.6590
47:      0.4290      0.6219
48:      0.6253      0.7330
49:      0.3790      0.4149
50:      0.3146      0.5502

As before, higher error probabilities during training lead to higher classification accuracies.

ggplot(data = archive2, aes(x = error_train, y = classif.acc, color = errors)) +
  geom_point() +
  theme_minimal()

However, the best found configurations for the error_train parameter, now tend to be close to 0.7 instead of 1 as before.

instance2$result_learner_param_vals

$error_train
[1] 0.6668

This demonstrates that when utilizing a fallback learner, the tuning results are influenced not only by the direct impact of the tuning parameters on the primary learner but also by their effect on its error probability. Therefore, it is always advisable to manually inspect the tuning results afterward. Note that in most real-world scenarios, the fallback learner performs worse than the primary learner, and thus the effects illustrated here are usually reversed.

A.10 Solutions to Chapter 11

Load the OpenML collection with ID 269, which contains regression tasks from the AutoML benchmark (Gijsbers et al. 2022). Peek into this suite to study the contained data sets and their characteristics. Then find all tasks with less than 4000 observations and convert them to mlr3 tasks.

We access the AutoML benchmark suite with ID 269 using the ocl() function.

library(mlr3oml)
automl_suite = ocl(id = 269)
automl_suite$task_ids

To create a summary of the underlying datasets, we pass their IDs to list_oml_data().

data_tbl = list_oml_data(automl_suite$data_ids)
data_tbl[, c("data_id", "name", "NumberOfInstances")]

    data_id                    name NumberOfInstances
 1:     201                     pol             15000
 2:     216               elevators             16599
 3:     287            wine_quality              6497
 4:     416               yprop_4_1              8885
 5:     422                topo_2_1              8885
---                                                  
29:   42728   Airlines_DepDelay_10M          10000000
30:   42729 nyc-taxi-green-dec-2016            581835
31:   42730                us_crime              1994
32:   42731             house_sales             21613
33:   43071     MIP-2016-regression              1090

To find those datasets with up to 4000 observations, we can simply filter the table.

data_tbl = data_tbl[NumberOfInstances < 4000, ]

Alternatively, the list_oml_tasks() also allows to filter OpenML tasks by their characteristics.

task_tbl = list_oml_tasks(
  task_id = automl_suite$task_ids, number_instances = c(0, 4000)
)

The resulting table contains matching OpenML tasks from the AutoML benchmark suite.

task_tbl[, .(task_id, data_id, name, NumberOfInstances)]

    task_id data_id                 name NumberOfInstances
 1:  167210   41021            Moneyball              1232
 2:  359930     550                quake              2178
 3:  359931     546              sensory               576
 4:  359932     541               socmob              1156
 5:  359933     507             space_ga              3107
 6:  359934     505              tecator               240
 7:  359945   42730             us_crime              1994
 8:  359950     531               boston               506
 9:  359951   42563 house_prices_nominal              1460
10:  360945   43071  MIP-2016-regression              1090

We create mlr3 tasks from these OpenML IDs using tsk("oml").

tasks = lapply(task_tbl$task_id, function(id) tsk("oml", task_id = id))

tasks[[1]]

<TaskRegr:Moneyball> (1232 x 15)
* Target: RS
* Properties: -
* Features (14):
  - fct (6): G, League, Playoffs, RankPlayoffs, RankSeason, Team
  - dbl (5): BA, OBP, OOBP, OSLG, SLG
  - int (3): RA, W, Year

Create an experimental design that compares lrn("regr.ranger") and lrn("regr.rpart") on those tasks. Use the robustify pipeline for both learners and a featureless fallback learner. You can use three-fold CV instead of the OpenML resamplings to save time. Run the comparison experiments with batchtools. Use default hyperparameter settings and do not perform any tuning to keep the experiments simple.

lrn_ranger = as_learner(
  ppl("robustify", learner = lrn("regr.ranger")) %>>%
    po("learner", lrn("regr.ranger"))
)
lrn_ranger$id = "ranger"
lrn_ranger$fallback = lrn("regr.featureless")

lrn_rpart = as_learner(
  ppl("robustify", learner = lrn("regr.rpart")) %>>%
    po("learner", lrn("regr.rpart"))
)
lrn_rpart$id = "rpart"
lrn_rpart$fallback = lrn("regr.featureless")

learners = list(lrn_ranger, lrn_rpart)

We set a seed before calling benchmark_grid() as this instantiates the resamplings, which is stochastic.

set.seed(123)
resampling = rsmp("cv", folds = 3)
design = benchmark_grid(tasks, learners, resampling)
design

                    task learner resampling
 1:            Moneyball  ranger         cv
 2:            Moneyball   rpart         cv
 3:                quake  ranger         cv
 4:                quake   rpart         cv
 5:              sensory  ranger         cv
---                                        
16:               boston   rpart         cv
17: house_prices_nominal  ranger         cv
18: house_prices_nominal   rpart         cv
19:  MIP-2016-regression  ranger         cv
20:  MIP-2016-regression   rpart         cv

To execute this benchmark design using mlr3batchmark we start by creating and configuring an experiment registry. We set file.dir = NA to use a temporary directory for the registry.

library(mlr3batchmark)
library(batchtools)


Attaching package: 'batchtools'

The following object is masked from 'package:mlr3misc':

    chunk

reg = makeExperimentRegistry(
  file.dir = NA,
  seed = 1,
  packages = "mlr3verse"
)

No readable configuration file found

Created registry in '/tmp/Rtmpparma2/registry3ad97eb777fc' using cluster functions 'Interactive'

The next two steps are to populate the registry with the experiments using batchmark() and to submit them. By specifying no IDs in submitJobs(), all jobs returned by findNotSubmitted() are queued, which in this case are all existing jobs.

batchmark(design, reg = reg)
submitJobs(reg = reg)
waitForJobs(reg = reg)

After the execution of the experiment finished we can load the results as a BenchmarkResult.

bmr = reduceResultsBatchmark(reg = reg)
bmr$aggregate(msr("regr.mse"))

    nr              task_id learner_id resampling_id iters  regr.mse
 1:  1            Moneyball     ranger            cv     3 6.167e+02
 2:  2            Moneyball      rpart            cv     3 1.357e+03
 3:  3                quake     ranger            cv     3 3.827e-02
 4:  4                quake      rpart            cv     3 3.567e-02
 5:  5              sensory     ranger            cv     3 5.072e-01
---                                                                 
16: 16               boston      rpart            cv     3 2.293e+01
17: 17 house_prices_nominal     ranger            cv     3 9.020e+08
18: 18 house_prices_nominal      rpart            cv     3 1.949e+09
19: 19  MIP-2016-regression     ranger            cv     3 7.505e+08
20: 20  MIP-2016-regression      rpart            cv     3 6.277e+08
Hidden columns: resample_result

Conduct a global Friedman test and, if appropriate, post hoc Friedman-Nemenyi tests, and interpret the results. As an evaluation measure, use the MSE.

First, we load the mlr3benchmark package and create a BenchmarkAggr from the benchmark result using msr("regr.mse").

library(mlr3benchmark)
bma = as_benchmark_aggr(bmr, measures = msr("regr.mse"))
bma

<BenchmarkAggr> of 20 rows with 10 tasks, 2 learners and 1 measure
                 task_id learner_id       mse
 1:            Moneyball     ranger 6.167e+02
 2:            Moneyball      rpart 1.357e+03
 3:                quake     ranger 3.827e-02
 4:                quake      rpart 3.567e-02
 5:              sensory     ranger 5.072e-01
---                                          
16:               boston      rpart 2.293e+01
17: house_prices_nominal     ranger 9.020e+08
18: house_prices_nominal      rpart 1.949e+09
19:  MIP-2016-regression     ranger 7.505e+08
20:  MIP-2016-regression      rpart 6.277e+08

We can also visualize this result using the autoplot() function.

autoplot(bma)

Below, we conduct a global Friedman test. Note that a post-hoc test is not needed because we are only comparing two algorithms.

bma$friedman_test()


    Friedman rank sum test

data:  mse and learner_id and task_id
Friedman chi-squared = 1.6, df = 1, p-value = 0.2

This experimental design was not able to detect a significant difference on the 5% level so we cannot reject our null hypothesis that the regression tree performs equally well as the random forest.

A.11 Solutions to Chapter 12

Prepare a mlr3 regression task for fifa data. Select only variables describing the age and skills of footballers. Train any predictive model for this task, e.g. lrn("regr.ranger").

library(DALEX)
library(ggplot2)
data("fifa", package = "DALEX")
old_theme = set_theme_dalex("ema")

library(mlr3)
library(mlr3learners)
set.seed(1)

feat_of_interest = c("age", "skill_ball_control", "skill_curve", "skill_dribbling",  "skill_fk_accuracy", "skill_long_passing", "value_eur")
fifa20 = fifa[,feat_of_interest]

task_fifa = as_task_regr(fifa20, target = "value_eur", id = "fifa20")

learner = lrn("regr.ranger")
learner$train(task_fifa)
learner$model

Ranger result

Call:
 ranger::ranger(dependent.variable.name = task$target_names, data = task$data(),      case.weights = task$weights$weight, num.threads = 1L) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      5000 
Number of independent variables:  6 
Mtry:                             2 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       3.404e+13 
R squared (OOB):                  0.5671

Use the permutation importance method to calculate feature importance ranking. Which feature is the most important? Do you find the results surprising?

With iml

library(iml)
model = Predictor$new(learner,
                data = fifa20,
                y = "value_eur")

effect = FeatureImp$new(model,
                loss = "rmse")
effect$plot()

With DALEX

library(DALEX)
ranger_exp = DALEX::explain(learner,
  data = fifa20[, setdiff(names(fifa20), "value_eur")],
  y = fifa$value_eur,
  label = "Fifa 2020",
  verbose = FALSE)

ranger_effect = model_parts(ranger_exp, B = 5)
head(ranger_effect)

            variable mean_dropout_loss     label
1       _full_model_           2897186 Fifa 2020
2                age           4658209 Fifa 2020
3        skill_curve           4842955 Fifa 2020
4  skill_fk_accuracy           5167775 Fifa 2020
5    skill_dribbling           5747754 Fifa 2020
6 skill_long_passing           5947734 Fifa 2020

plot(ranger_effect)

Use the partial dependence plot/profile to draw the global behavior of the model for this feature. Is it aligned with your expectations?

With iml

impfeat = c("skill_ball_control")

effect = FeatureEffects$new(model, features = impfeat)
plot(effect)

With DALEX

impfeat = c("skill_ball_control")

ranger_profiles = model_profile(ranger_exp, variables = impfeat)
plot(ranger_profiles)

Choose Robert Lewandowski as a specific example and calculate and plot the Shapley values. Which feature is locally the most important and has the strongest influence on his valuation as a soccer player?

player_1 = fifa20["R. Lewandowski",]

With iml

shapley = Shapley$new(model, x.interest = player_1)
plot(shapley)

With DALEX

ranger_shap = predict_parts(ranger_exp,
             new_observation = player_1,
             type = "shap", B = 1)
plot(ranger_shap, show_boxplots = FALSE)

A.12 Solutions to Chapter 13

Run a benchmark experiment on tsk("german_credit") with lrn("classif.featureless"), lrn("classif.log_reg"), and lrn("classif.ranger"). Tune the prediction thresholds of all learners by encapsulating them in a po("learner_cv") (with two-fold CV), followed by a po("tunethreshold"). Use msr("classif.costs", costs = costs), where the costs matrix is as follows: true positive is -10, true negative is -1, false positive is 2, and false negative is 3. Use this measure in po("tunethreshold") and when evaluating your benchmark experiment.

set.seed(1)
# Load task and learners
tsk_german = tsk("german_credit")
learners = lrns(c("classif.featureless", "classif.log_reg",
  "classif.ranger"), predict_type = "prob")

# Create costs matrix
costs = matrix(c(-10, 3, 2, -1), nrow = 2,
  dimnames = list("Predicted Credit" = c("good", "bad"),
                    Truth = c("good", "bad")))
costs

                Truth
Predicted Credit good bad
            good  -10   2
            bad     3  -1

Our cost matrix is as expected so we can plug it into our measure and setup our pipeline.

# Create measure
meas_costs = msr("classif.costs", costs = costs)

# Create a function to wrap learners in internal cross-validation
#  to tune the threshold
pipe = function(l) {
  po("learner_cv", l, resampling.folds = 2) %>>%
    po("tunethreshold", measure = meas_costs)
}

# Benchmark
learners = lapply(learners, pipe)
design = benchmark_grid(tsk_german, learners, rsmp("holdout"))
bmr = benchmark(design)$aggregate(meas_costs)

Now exploring our results…

bmr[, .(learner_id, classif.costs)]

                          learner_id classif.costs
1: classif.featureless.tunethreshold        -6.180
2:     classif.log_reg.tunethreshold        -6.216
3:      classif.ranger.tunethreshold        -6.180

Based on these results, the logistic regression performs the best with the greatest increase to costs, however the difference is only marginal compared to the other learners.

Train and test a survival forest using lrn("surv.rfsrc") (from mlr3extralearners). Run this experiment using tsk("rats") and partition(). Evaluate your model with the RCLL measure.

# Get learners
library(mlr3extralearners)
# Get survival models
library(mlr3proba)
set.seed(1)
# Use partition to split data and test our model
tsk_rats = tsk("rats")
splits = partition(tsk_rats)
learner = lrn("surv.rfsrc")
prediction = learner$train(tsk_rats, splits$train)$predict(tsk_rats, splits$test)
prediction$score(msr("surv.rcll"))

surv.rcll 
    3.754

The right-censored logloss provides a measure of predictive accuracy, but it is quite hard to interpret it without comparison to another model. To yield a more informative value, we could either compute the RCLL for an uninformed baseline like the Kaplan-Meier estimator, or we could use the ERV (explained residual variation) parameter in the measure, which returns the RCLL as a percentage increase in performance compared to an uninformed baseline (in this case the Kaplan-Meier estimator):

lrn("surv.kaplan")$
  train(tsk_rats, splits$train)$
  predict(tsk_rats, splits$test)$
  score(msr("surv.rcll"))

surv.rcll 
    3.827

prediction$score(msr("surv.rcll", ERV = TRUE),
  task = tsk_rats, train_set = splits$train)

surv.rcll 
  0.01893

Now we can see that our model is only marginally better than the Kaplan-Meier baseline (a 2% performance increase).

Estimate the density of the “precip” task from the mlr3proba package using lrn("dens.hist"), evaluate your estimation with the logloss measure. As a stretch goal, look into the documentation of distr6 to learn how to analyse your estimated distribution further.

# Get density models
library(mlr3proba)
set.seed(1)
# Run experiment
tsk_precip = tsk("precip")
learner = lrn("dens.hist")
prediction = learner$train(tsk_precip)$predict(tsk_precip)
prediction

<PredictionDens> for 70 observations:
    row_ids      pdf    cdf              distr
          1 0.001429 0.9957 <Distribution[39]>
          2 0.007143 0.9479 <Distribution[39]>
          3 0.005714 0.0400 <Distribution[39]>
---                                           
         68 0.007143 0.2507 <Distribution[39]>
         69 0.012857 0.1163 <Distribution[39]>
         70 0.007143 0.9800 <Distribution[39]>

prediction$score(msr("dens.logloss"))

dens.logloss 
       3.896

As before the logloss is not too informative by itself but as the Histogram is itself a baseline, we can use this value for comparison to more sophisticated models. To learn more about our predicted distribution, we could use distr6 to summarise the distribution and to compute values such as the pdf and cdf:

prediction$distr$summary()

Histogram Estimator
Support: [0,70]     Scientific Type: ℝ 

Traits:     continuous; univariate
Properties: asymmetric

# pdf evaluated at `50`
prediction$distr$pdf(50)

[1] 0.007143

Run a benchmark clustering experiment on the “wine” dataset without a label column. Compare the performance of k-means learner with k equal to 2, 3 and 4 using the silhouette measure and the insample resampling technique. What value of k would you choose based on the silhouette scores?

set.seed(1)
# Load clustering models and tasks
library(mlr3cluster)
# Create the clustering dataset by extracting only the features from the
#  wine task
tsk_wine = tsk("wine")
tsk_wine = as_task_clust(tsk_wine$data(cols = tsk_wine$feature_names))
# Create learners and ensure they have unique IDs
learners = c(
  lrn("clust.kmeans", centers = 2, id = "K=2"),
  lrn("clust.kmeans", centers = 3, id = "K=3"),
  lrn("clust.kmeans", centers = 4, id = "K=4")
)
# Benchmark
meas = msr("clust.silhouette")
design = benchmark_grid(tsk_wine, learners, rsmp("insample"))
benchmark(design)$aggregate(meas)[, .(learner_id, clust.silhouette)]

   learner_id clust.silhouette
1:        K=2           0.6569
2:        K=3           0.5711
3:        K=4           0.5606

We can see that we get the silhouette closest to 1 with K=2 so we might use this value for future experiments.

A.13 Solutions to Chapter 14

Train a model of your choice on tsk("adult_train") and test it on tsk("adult_test"), use any measure of your choice to evaluate your predictions. Assume our goal is to achieve parity in false omission rates across the protected ‘sex’ attribute. Construct a fairness metric that encodes this and evaluate your model. To get a deeper understanding, look at the groupwise_metrics function to obtain performance in each group.

For now we simply load the data and look at the data.

library(mlr3)
library(mlr3fairness)
set.seed(8)

tsk_adult_train = tsk("adult_train")
tsk_adult_test = tsk("adult_test")
tsk_adult_train

<TaskClassif:adult_train> (30718 x 13)
* Target: target
* Properties: twoclass
* Features (12):
  - fct (7): education, marital_status, occupation, race,
    relationship, sex, workclass
  - int (5): age, capital_gain, capital_loss, education_num,
    hours_per_week
* Protected attribute: sex

We can now train a simple model, e.g., a decision tree and evaluate for accuracy.

learner = lrn("classif.rpart")
learner$train(tsk_adult_train)
prediction = learner$predict(tsk_adult_test)
prediction$score()

classif.ce 
     0.161

The false omission rate parity metric is available via the key "fairness.fomr". Note, that evaluating our prediction now requires that we also provide the task.

msr_1 = msr("fairness.fomr")
prediction$score(msr_1, tsk_adult_test)

fairness.fomr 
      0.03533

In addition, we can look at false omission rates in each group. The groupwise_metrics function creates a metric for each group specified in the pta column role:

tsk_adult_test$col_roles$pta

[1] "sex"

We can then use this metric to evaluate our model again. This gives us the false omission rates for male and female individuals separately.

msr_2 = groupwise_metrics(base_measure = msr("classif.fomr"), task = tsk_adult_test)
prediction$score(msr_2, tsk_adult_test)

  subgroup.fomr_Male subgroup.fomr_Female 
              0.2442               0.2089

Improve your model by employing pipelines that use pre- or post-processing methods for fairness. Evaluate your model along the two metrics and visualize the resulting metrics. Compare the different models using an appropriate visualization.

First we can again construct the learners above.

library(mlr3pipelines)
lrn_1 = po("reweighing_wts") %>>% lrn("classif.rpart")
lrn_2 = po("learner_cv", lrn("classif.rpart")) %>>%
  po("EOd")

And run the benchmark again. Note, that we use three-fold CV this time for comparison.

learners = list(learner, lrn_1, lrn_2)
design = benchmark_grid(tsk_adult_train, learners, rsmp("cv", folds = 3L))
bmr = benchmark(design)
bmr$aggregate(msrs(c("classif.acc", "fairness.fomr")))

   nr     task_id                   learner_id resampling_id iters
1:  1 adult_train                classif.rpart            cv     3
2:  2 adult_train reweighing_wts.classif.rpart            cv     3
3:  3 adult_train            classif.rpart.EOd            cv     3
2 variables not shown: [classif.acc, fairness.fomr]
Hidden columns: resample_result

We can now again visualize the result.

library(ggplot2)
fairness_accuracy_tradeoff(bmr, msr("fairness.fomr")) +
  scale_color_viridis_d("Learner") +
  theme_minimal()

We can notice two main results:

We do not improve in the false omission rate by using fairness interventions. One reason might be, that the interventions chosen do not optimize for the false omission rate, but other metrics, e.g. equalized odds.
The spread between the different cross-validation iterations (small dots) is quite large, estimates might come with a considerable error.

Add “race” as a second sensitive attribute to your dataset. Add the information to your task and evaluate the initial model again. What changes? Again study the groupwise_metrics.

This can be achieved by adding “race” to the "pta" col_role.

tsk_adult_train$set_col_roles("race", add_to = "pta")
tsk_adult_train

<TaskClassif:adult_train> (30718 x 13)
* Target: target
* Properties: twoclass
* Features (12):
  - fct (7): education, marital_status, occupation, race,
    relationship, sex, workclass
  - int (5): age, capital_gain, capital_loss, education_num,
    hours_per_week
* Protected attribute: sex, race

tsk_adult_test$set_col_roles("race", add_to = "pta")
prediction$score(msr_1, tsk_adult_test)

fairness.fomr 
       0.8889

Evaluating for the intersection, we obtain a large deviation from 0. Note, that the metric by default computes the maximum discrepancy between all metrics for the non-binary case.

If we now compute the groupwise_metrics, we will get a metric for the intersection of each group.

msr_3 = groupwise_metrics(msr("classif.fomr"),  tsk_adult_train)
unname(sapply(msr_3, function(x) x$id))

 [1] "subgroup.fomr_Male, White"               
 [2] "subgroup.fomr_Male, Black"               
 [3] "subgroup.fomr_Female, Black"             
 [4] "subgroup.fomr_Female, White"             
 [5] "subgroup.fomr_Male, Asian-Pac-Islander"  
 [6] "subgroup.fomr_Male, Amer-Indian-Eskimo"  
 [7] "subgroup.fomr_Female, Other"             
 [8] "subgroup.fomr_Female, Asian-Pac-Islander"
 [9] "subgroup.fomr_Female, Amer-Indian-Eskimo"
[10] "subgroup.fomr_Male, Other"

prediction$score(msr_3, tsk_adult_test)

               subgroup.fomr_Male, White 
                                  0.2402 
               subgroup.fomr_Male, Black 
                                  0.2716 
             subgroup.fomr_Female, Black 
                                  0.2609 
             subgroup.fomr_Female, White 
                                  0.1919 
  subgroup.fomr_Male, Asian-Pac-Islander 
                                  0.3168 
  subgroup.fomr_Male, Amer-Indian-Eskimo 
                                  0.1667 
             subgroup.fomr_Female, Other 
                                  0.2500 
subgroup.fomr_Female, Asian-Pac-Islander 
                                  0.3529 
subgroup.fomr_Female, Amer-Indian-Eskimo 
                                  1.0000 
               subgroup.fomr_Male, Other 
                                  0.1111

And we can see, that the reason might be, that the false omission rate for female Amer-Indian-Eskimo is at 1.0!

In this chapter we were unable to reduce bias in our experiment. Using everything you have learned in this book, see if you can successfully reduce bias in your model. Critically reflect on this exercise, why might this be a bad idea?

Several problems with the existing metrics.

We’ll go through them one by one to deepen our understanding:

Metric and evaluation

In order for the fairness metric to be useful, we need to ensure that the data used for evaluation is representative and sufficiently large.

We can investigate this further by looking at actual counts:

  table(tsk_adult_test$data(cols = c("race", "sex", "target")))

, , target = <=50K

                    sex
race                 Female Male
  Amer-Indian-Eskimo     56   74
  Asian-Pac-Islander    131  186
  Black                 654  619
  Other                  37   66
  White                3544 6176

, , target = >50K

                    sex
race                 Female Male
  Amer-Indian-Eskimo      3   16
  Asian-Pac-Islander     26  106
  Black                  41  133
  Other                   5   19
  White                 492 2931

One of the reasons might be that there are only 3 individuals in the “>50k” category! This is an often encountered problem, as error metrics have a large variance when samples are small. Note, that the pre- and post-processing methods in general do not all support multiple protected attributes.

We should question whether comparing the metric between all groups actually makes sense for the question we are trying to answer. Instead, we might want to observe the metric between two specific subgroups, in this case between individuals with sex: Female and race: "Black" or "White.

First, we create a subset of only sex: Female and race: "Black", "White.

adult_subset = tsk_adult_test$clone()
df = adult_subset$data()
rows = seq_len(nrow(df))[df$race %in% c("Black", "White") & df$sex %in% c("Female")]
adult_subset$filter(rows)
adult_subset$set_col_roles("race", add_to = "pta")

And evaluate our measure again:

prediction$score(msr_3, adult_subset)

               subgroup.fomr_Male, White 
                                  0.2402 
               subgroup.fomr_Male, Black 
                                  0.2716 
             subgroup.fomr_Female, Black 
                                  0.2609 
             subgroup.fomr_Female, White 
                                  0.1919 
  subgroup.fomr_Male, Asian-Pac-Islander 
                                  0.3168 
  subgroup.fomr_Male, Amer-Indian-Eskimo 
                                  0.1667 
             subgroup.fomr_Female, Other 
                                  0.2500 
subgroup.fomr_Female, Asian-Pac-Islander 
                                  0.3529 
subgroup.fomr_Female, Amer-Indian-Eskimo 
                                  1.0000 
               subgroup.fomr_Male, Other 
                                  0.1111

We can see, that between women there is an even bigger discrepancy compared to men.

The bias mitigation strategies we employed do not optimize for the false omission rate metric, but other metrics instead. It might therefore be better to try to achieve fairness via other strategies, using different or more powerful models or tuning hyperparameters.