Appendix A — Solutions to exercises

A.1 Solutions to Chapter 2

Train a classification model with the classif.rpart learner on the “Pima Indians Diabetes” dataset. Do this without using tsk("pima"), and instead by constructing a task from the dataset in the mlbench-package: data(PimaIndiansDiabetes2, package = "mlbench"). Make sure to define the pos outcome as positive class. Train the model on a random 80% subset of the given data and evaluate its performance with the classification error measure on the remaining data. (Note that the data set has NAs in its features. You can either rely on rpart‘s capability to handle them internally (’surrogate splits’) or remove them from the initial data.frame by using na.omit).

set.seed(1)

data(PimaIndiansDiabetes2, package = "mlbench")
task = as_task_classif(PimaIndiansDiabetes2, target = "diabetes", positive = "pos")
splits = partition(task, ratio = 0.8)
splits

$train
  [1]   1   2   3   4   5   6   7   8  11  13  15  16  17  18  19  20  21
 [18]  22  25  26  27  28  29  30  31  32  33  34  35  36  37  39  40  41
 [35]  42  43  44  45  47  48  49  51  52  54  55  56  58  59  60  61  62
 [52]  65  66  67  69  71  72  73  77  79  81  82  83  84  85  86  87  89
 [69]  91  92  93  97  98  99 100 101 103 104 105 106 107 108 109 110 111
 [86] 112 114 115 116 117 118 119 121 122 123 124 125 126 127 128 129 130
[103] 131 132 133 134 135 136 137 138 139 140 141 143 145 146 147 148 150
[120] 153 158 159 161 162 163 164 166 167 168 169 170 173 175 176 177 178
[137] 179 180 181 182 183 184 185 186 187 189 190 191 192 193 194 197 198
[154] 201 203 204 205 206 207 208 209 212 213 214 215 216 217 218 219 220
[171] 221 222 227 228 229 230 232 233 234 235 236 238 240 241 242 243 244
[188] 247 248 249 250 252 253 254 255 256 257 259 260 261 262 263 264 265
[205] 266 267 268 269 270 271 272 273 276 277 279 280 281 282 283 284 285
[222] 286 287 289 290 291 293 294 295 296 297 299 300 301 302 303 304 305
[239] 307 309 310 311 314 315 316 318 320 321 322 323 324 325 326 327 328
[256] 329 330 332 333 334 335 336 337 338 339 340 343 345 346 349 350 351
[273] 352 354 355 356 357 358 359 360 361 363 365 367 368 369 371 373 374
[290] 375 376 377 378 379 380 381 382 383 384 385 388 389 390 391 392 394
[307] 395 397 400 401 402 403 404 405 406 407 408 409 411 412 413 415 416
[324] 418 419 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435
[341] 436 437 438 439 440 441 442 443 444 446 448 449 451 453 454 455 457
[358] 458 459 460 461 462 463 464 465 467 468 470 471 472 473 474 475 476
[375] 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493
[392] 494 495 496 498 499 500 501 502 503 504 505 506 507 508 509 510 511
[409] 512 513 514 516 517 519 521 522 525 526 527 529 530 531 532 533 534
[426] 535 536 537 539 540 541 543 544 545 546 547 548 549 551 553 554 555
[443] 556 557 558 560 561 562 565 566 567 568 569 570 572 573 575 576 577
[460] 578 579 580 581 582 583 585 586 587 588 590 591 592 593 594 595 597
[477] 598 599 600 601 602 603 604 605 606 607 610 612 613 614 615 616 619
[494] 620 621 622 623 624 626 627 628 630 631 632 633 635 636 637 639 640
[511] 641 642 643 644 646 647 648 649 650 651 652 654 655 657 658 660 661
[528] 662 663 664 666 668 669 671 672 674 675 676 677 678 679 680 681 682
[545] 683 684 685 686 687 688 690 691 692 694 695 696 697 700 701 702 703
[562] 704 705 707 708 710 711 712 714 716 717 718 720 721 722 723 724 725
[579] 726 728 729 730 731 732 733 734 735 736 737 739 740 741 742 743 744
[596] 745 746 747 748 750 751 752 753 755 756 757 758 759 760 763 764 765
[613] 767 768

$test
  [1]   9  10  12  14  23  24  38  46  50  53  57  63  64  68  70  74  75
 [18]  76  78  80  88  90  94  95  96 102 113 120 142 144 149 151 152 154
 [35] 155 156 157 160 165 171 172 174 188 195 196 199 200 202 210 211 223
 [52] 224 225 226 231 237 239 245 246 251 258 274 275 278 288 292 298 306
 [69] 308 312 313 317 319 331 341 342 344 347 348 353 362 364 366 370 372
 [86] 386 387 393 396 398 399 410 414 417 420 445 447 450 452 456 466 469
[103] 497 515 518 520 523 524 528 538 542 550 552 559 563 564 571 574 584
[120] 589 596 608 609 611 617 618 625 629 634 638 645 653 656 659 665 667
[137] 670 673 689 693 698 699 706 709 713 715 719 727 738 749 754 761 762
[154] 766

$validation
integer(0)

learner = lrn("classif.rpart" , predict_type = "prob")
learner

<LearnerClassifRpart:classif.rpart>: Classification Tree
* Model: -
* Parameters: xval=0
* Packages: mlr3, rpart
* Predict Types:  response, [prob]
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, multiclass,
  selected_features, twoclass, weights

learner$train(task, row_ids = splits$train)
learner$model

n= 614 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

  1) root 614 217 neg (0.35342 0.64658)  
    2) glucose>=139.5 154  47 pos (0.69481 0.30519)  
      4) glucose>=166.5 61  10 pos (0.83607 0.16393) *
      5) glucose< 166.5 93  37 pos (0.60215 0.39785)  
       10) pregnant>=6.5 31   6 pos (0.80645 0.19355) *
       11) pregnant< 6.5 62  31 pos (0.50000 0.50000)  
         22) age>=23.5 55  25 pos (0.54545 0.45455)  
           44) pedigree>=0.319 36  13 pos (0.63889 0.36111) *
           45) pedigree< 0.319 19   7 neg (0.36842 0.63158) *
         23) age< 23.5 7   1 neg (0.14286 0.85714) *
    3) glucose< 139.5 460 110 neg (0.23913 0.76087)  
      6) mass>=26.9 345 107 neg (0.31014 0.68986)  
       12) age>=29.5 162  75 neg (0.46296 0.53704)  
         24) glucose>=99.5 120  53 pos (0.55833 0.44167)  
           48) pedigree>=0.5485 42  11 pos (0.73810 0.26190) *
           49) pedigree< 0.5485 78  36 neg (0.46154 0.53846)  
             98) age< 54.5 70  34 pos (0.51429 0.48571)  
              196) pedigree>=0.2 59  26 pos (0.55932 0.44068)  
                392) pedigree< 0.416 46  17 pos (0.63043 0.36957)  
                  784) pressure< 85 35  10 pos (0.71429 0.28571) *
                  785) pressure>=85 11   4 neg (0.36364 0.63636) *
                393) pedigree>=0.416 13   4 neg (0.30769 0.69231) *
              197) pedigree< 0.2 11   3 neg (0.27273 0.72727) *
             99) age>=54.5 8   0 neg (0.00000 1.00000) *
         25) glucose< 99.5 42   8 neg (0.19048 0.80952) *
       13) age< 29.5 183  32 neg (0.17486 0.82514) *
      7) mass< 26.9 115   3 neg (0.02609 0.97391) *

prediction = learner$predict(task, row_ids = splits$test)
as.data.table(prediction)

     row_ids truth response prob.pos prob.neg
  1:       9   pos      pos  0.83607   0.1639
  2:      10   pos      neg  0.36364   0.6364
  3:      12   pos      pos  0.83607   0.1639
  4:      14   pos      pos  0.83607   0.1639
  5:      23   pos      pos  0.83607   0.1639
 ---                                         
150:     749   pos      pos  0.83607   0.1639
151:     754   pos      pos  0.83607   0.1639
152:     761   neg      neg  0.17486   0.8251
153:     762   pos      pos  0.83607   0.1639
154:     766   neg      neg  0.02609   0.9739

measure = msr("classif.ce")
prediction$score(measure)

classif.ce 
    0.1818

Calculate the true positive, false positive, true negative, and false negative rates of the predictions made by the model in Exercise 1. Try to solve this in two ways: (a) Using mlr3measures-predefined measure objects, and (b) without using mlr3 tools by directly working on the ground truth and prediction vectors. Compare the results.

# true positive rate
prediction$score(msr("classif.tpr"))

classif.tpr 
     0.6275

# false positive rate
prediction$score(msr("classif.fpr"))

classif.fpr 
    0.08738

# true negative rate
prediction$score(msr("classif.tnr"))

classif.tnr 
     0.9126

# false negative rate
prediction$score(msr("classif.fnr"))

classif.fnr 
     0.3725

# true positives
TP = sum(prediction$truth == "pos" & prediction$response == "pos")

# false positives
FP = sum(prediction$truth == "neg" & prediction$response == "pos")

# true negatives
TN = sum(prediction$truth == "neg" & prediction$response == "neg")

# false negatives
FN = sum(prediction$truth == "pos" & prediction$response == "neg")

# true positive rate
TP / (TP + FN)

[1] 0.6275

# false positive rate
FP / (FP + TN)

[1] 0.08738

# true negative rate
TN / (TN + FP)

[1] 0.9126

# false negative rate
FN / (FN + TP)

[1] 0.3725

The results are the same.

Change the threshold of the model from Exercise 1 such that the false negative rate is lower. What is one reason you might do this in practice?

# confusion matrix with threshold 0.5
prediction$confusion

        truth
response pos neg
     pos  32   9
     neg  19  94

prediction$set_threshold(0.3)

# confusion matrix with threshold 0.3
prediction$confusion

        truth
response pos neg
     pos  38  16
     neg  13  87

# false positive rate
prediction$score(msr("classif.fpr"))

classif.fpr 
     0.1553

# false negative rate
prediction$score(msr("classif.fnr"))

classif.fnr 
     0.2549

With a false negative rate of 0.38, we miss a lot of people who have diabetes but are predicted to not have it. This could give a false sense of security. By lowering the threshold, we can reduce the false negative rate.

A.2 Solutions to Chapter 3

Apply a repeated cross-validation resampling strategy on tsk("mtcars") and evaluate the performance of lrn("regr.rpart"). Use five repeats of three folds each. Calculate the MSE for each iteration and visualize the result. Finally, calculate the aggregated performance score.

We start by instantiating our task and learner as usual:

set.seed(3)
task = tsk("mtcars")
learner = lrn("regr.rpart")

We can instantiate a temporary resampling on the task to illustrate how it assigns observations across the 5 repeats (column rep) and 3 folds:

resampling = rsmp("repeated_cv", repeats = 5, folds = 3)
resampling$instantiate(task)
resampling$instance

     row_id rep fold
  1:      1   1    2
  2:      2   1    2
  3:      3   1    3
  4:      4   1    1
  5:      5   1    1
 ---                
156:     28   5    1
157:     29   5    3
158:     30   5    3
159:     31   5    1
160:     32   5    2

Note instantiating manually is not necessary when using resample(), as it automatically instantiates the resampling for us, so we pass it a new resampling which has not been instantiated:

resampling = rsmp("repeated_cv", repeats = 5, folds = 3)
rr = resample(task, learner, resampling)

Now we can $score() the resampling with the MSE measure across each of the 5x3 resampling iterations:

scores = rr$score(msr("regr.mse"))
scores

    task_id learner_id resampling_id iteration regr.mse
 1:  mtcars regr.rpart   repeated_cv         1    16.52
 2:  mtcars regr.rpart   repeated_cv         2    18.42
 3:  mtcars regr.rpart   repeated_cv         3    15.70
 4:  mtcars regr.rpart   repeated_cv         4    11.91
 5:  mtcars regr.rpart   repeated_cv         5    14.78
---                                                    
11:  mtcars regr.rpart   repeated_cv        11    20.60
12:  mtcars regr.rpart   repeated_cv        12    28.19
13:  mtcars regr.rpart   repeated_cv        13    25.86
14:  mtcars regr.rpart   repeated_cv        14    19.07
15:  mtcars regr.rpart   repeated_cv        15    19.88
Hidden columns: task, learner, resampling, prediction_test

We can manually calculate these scores since rr contains all the individual predictions. The $predictions() method returns a list of predictions for each iteration, which we can use to calculate the MSE for the first iteration:

preds = rr$predictions()
pred_1 = as.data.table(preds[[1]])
pred_1[, list(rmse = mean((truth - response)^2))]

    rmse
1: 16.52

To visualize the results, we can use ggplot2 directly on the scores object, which behaves like any other data.table:

library(ggplot2)
# Barchart of the per-iteration scores
ggplot(scores, aes(x = iteration, y = regr.mse)) +
  geom_col() +
  theme_minimal()

# Boxplot of the scores
ggplot(scores, aes(x = regr.mse)) +
  geom_boxplot() +
  scale_y_continuous(breaks = 0, labels = NULL) +
  theme_minimal()

Alternatively, the autoplot() function provides defaults for the ResampleResult object. Note that it internally scores the resampling using the MSE for regression tasks per default.

autoplot(rr)

autoplot(rr, type = "histogram")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The aggregate score is the mean of the MSE scores across all iterations, which we can calculate using $aggregate() or by manually averaging the scores we stored before:

mean(scores$regr.mse)

[1] 21.94

rr$aggregate(msr("regr.mse"))

regr.mse 
   21.94

Use tsk("spam") and five-fold CV to benchmark lrn("classif.ranger"), lrn("classif.log_reg"), and lrn("classif.xgboost", nrounds = 100) with respect to AUC. Which learner appears to perform best? How confident are you in your conclusion? Think about the stability of results and investigate this by re-rerunning the experiment with different seeds. What can be done to improve this?

First we instantiate our learners with their initial parameters, setting the predict_type = "prob" once for all of them using lrns(). We then set the nrounds parameter for XGBoost to 100 and construct a resampling object for 5-fold CV:

set.seed(3)

task = tsk("spam")
learners = lrns(c("classif.ranger", "classif.log_reg", "classif.xgboost"),
                predict_type = "prob")
learners$classif.xgboost$param_set$values$nrounds = 100
resampling = rsmp("cv", folds = 5)

We could have alternatively instantiated the learners like this, but would have needed to repeat the predict_type = "prob" argument multiple times.

learners = list(
  lrn("classif.ranger", predict_type = "prob"),
  lrn("classif.log_reg", predict_type = "prob"),
  lrn("classif.xgboost", nrounds = 100, predict_type = "prob")
)

Next we can construct a benchmark design grid with the instantiated objects using benchmark_grid():

design = benchmark_grid(
  tasks = task,
  learners = learners,
  resamplings = resampling
)
design

   task         learner resampling
1: spam  classif.ranger         cv
2: spam classif.log_reg         cv
3: spam classif.xgboost         cv

To perform the benchmark, we use the aptly named benchmark() function:

bmr = benchmark(design)
bmr

<BenchmarkResult> of 15 rows with 3 resampling runs
 nr task_id      learner_id resampling_id iters warnings errors
  1    spam  classif.ranger            cv     5        0      0
  2    spam classif.log_reg            cv     5        0      0
  3    spam classif.xgboost            cv     5        0      0

And visualize the results as a boxplot:

autoplot(bmr, measure = msr("classif.auc"))

In this example,lrn("classif.xgboost") outperforms lrn("classif.ranger"), and both outperform lrn("classif.log_reg"). Naturally this is only a visual inspection of the results — proper statistical testing of benchmark results can be conducted using the mlr3benchmark package, but for the purposes of this exercise a plot suffices.

When we re-run the same experiment with a different seed, we get a slightly different result.

set.seed(3235)
resampling = rsmp("cv", folds = 5)
design = benchmark_grid(
  tasks = task,
  learners = learners,
  resamplings = resampling
)
bmr = benchmark(design)
autoplot(bmr, measure = msr("classif.auc"))

The overall trend remains about the same, but do we trust these results? Note that we applied both lrn("classif.log_reg") and lrn("classif.ranger") with their initial parameters. While lrn("classif.log_reg") does not have any hyperparameters to tune, lrn("classif.ranger") does have several, at least one of which is usually tuned (mtry). In case of lrn("classif.xgboost") however, we arbitrarily chose nrounds = 100 rather than using the learner with its initial value of nrounds = 1, which would be equivalent to a single tree decision tree. To make any generalizations based on this experiment, we need to properly tune all relevant hyperparmeters in a systematic way. We cover this and more in Chapter 4.

A colleague reports a 93.1% classification accuracy using lrn("classif.rpart") on tsk("penguins_simple"). You want to reproduce their results and ask them about their resampling strategy. They said they used a custom three-fold CV with folds assigned as factor(task$row_ids %% 3). See if you can reproduce their results.

We make use of the custom_cv resampling strategy here:

task = tsk("penguins_simple")
rsmp_cv = rsmp("custom_cv")

We apply the rule to assign resampling folds we were provided with: Every third observation is assigned to the same fold:

rsmp_cv$instantiate(task = task, f = factor(task$row_ids %% 3))

str(rsmp_cv$instance)

List of 3
 $ 0: int [1:111] 3 6 9 12 15 18 21 24 27 30 ...
 $ 1: int [1:111] 1 4 7 10 13 16 19 22 25 28 ...
 $ 2: int [1:111] 2 5 8 11 14 17 20 23 26 29 ...

We are now ready to conduct the resampling and aggregate results:

rr = resample(
  task = task,
  learner = lrn("classif.rpart"),
  resampling = rsmp_cv
)

rr$aggregate(msr("classif.acc"))

classif.acc 
     0.9309

Converting to percentages and rounding to one decimal place, we get the same result as our colleague! Luckily they kept track of their resampling to ensure their results were reproducible.

(*) Program your own ROC plotting function without using mlr3’s autoplot() function. The signature of your function should be my_roc_plot(task, learner, train_indices, test_indices). Your function should use the $set_threshold() method of Prediction, as well as mlr3measures.

Here is a function to calculate the true positive rate (TPR, Sensitivity) and false positive rate (FPR, 1 - Specificity) in a loop across a grid of probabilities. These are set as thresholds with the $set_threshold() method of the PredictionClassif object. This way we construct the ROC curve by iteratively calculating its x and y values, after which we can use geom_step() to draw a step function. Note that we do not need to re-train the learner, we merely adjust the threshold applied to the predictions we made at the top of the function

my_roc_plot = function(task, learner, train_indices, test_indices) {
  # Train learner, predict once.
  learner$train(task, train_indices)
  pred = learner$predict(task, test_indices)
  # Positive class predictions from prediction matrix
  pos_pred = pred$prob[, which(colnames(pred$prob) == task$positive)]

  # Set a grid of probabilities to evaluate at.
  prob_grid = seq(0, 1, 0.001)

  # For each possible threshold, calculate TPR,
  # FPR + aggregate to data.table
  grid = data.table::rbindlist(lapply(prob_grid, \(thresh) {
    pred$set_threshold(thresh)
    data.table::data.table(
      thresh = thresh,
      # y axis == sensitivity == TPR
      tpr = mlr3measures::tpr(
        truth = pred$truth, response = pred$response,
        positive = task$positive),
      # x axis == 1 - specificity == 1 - TNR == FPR
      fpr = mlr3measures::fpr(
        truth = pred$truth, response = pred$response,
        positive = task$positive)
    )
  }))

  # Order descending by threshold to use ggplot2::geom_step
  data.table::setorderv(grid, cols = "thresh", order = -1)

  ggplot2::ggplot(grid, ggplot2::aes(x = fpr, y = tpr)) +
    # Step function starting with (h)orizontal, then (v)ertical
    ggplot2::geom_step(direction = "hv") +
    ggplot2::coord_equal() +
    ggplot2::geom_abline(linetype = "dashed") +
    ggplot2::theme_minimal() +
    ggplot2::labs(
      title = "My Custom ROC Curve",
      subtitle = sprintf("%s on %s", learner$id, task$id),
      x = "1 - Specificity", y = "Sensitivity",
      caption = sprintf("n = %i. Test set: %i", task$nrow, length(test_indices))
    )
}

We try our function using tsk("sonar") and lrn("classif.ranger") learner with 100 trees. We set predict_type = "prob" since we need probability predictions to apply thresholds, rather than hard class predictions.

set.seed(3)

# Setting up example task and learner for testing
task = tsk("sonar")
learner = lrn("classif.ranger", num.trees = 100, predict_type = "prob")
split = partition(task)

my_roc_plot(task, learner, split$train, split$test)

We can compare it with the pre-built plot function in mlr3viz:

learner$train(task, split$train)
pred = learner$predict(task, split$test)
autoplot(pred, type = "roc")

Note the slight discrepancy between the two curves. This is caused by some implementation differences used by the precrec which is used for this functionality in mlr3viz. There are different approaches to drawing ROC curves, and our implementation above is one of the simpler ones!

A.3 Solutions to Chapter 4

Tune the mtry, sample.fraction, and num.trees hyperparameters of lrn("regr.ranger") on tsk("mtcars"). Use a simple random search with 50 evaluations. Evaluate with a three-fold CV and the root mean squared error. Visualize the effects that each hyperparameter has on the performance via simple marginal plots, which plot a single hyperparameter versus the cross-validated MSE.

set.seed(1)

task = tsk("mtcars")

learner = lrn("regr.ranger",
  mtry = to_tune(1, 10),
  sample.fraction = to_tune(0.5, 1),
  num.trees = to_tune(100, 500)
)

instance = ti(
  learner = learner,
  task = task,
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.rmse"),
  terminator = trm("evals", n_evals = 50)
)

tuner = tnr("random_search", batch_size = 10)

tuner$optimize(instance)

   mtry num.trees sample.fraction learner_param_vals  x_domain regr.rmse
1:    9       151          0.9767          <list[4]> <list[3]>     2.541

# all evaluations
as.data.table(instance$archive)

    mtry num.trees sample.fraction regr.rmse runtime_learners
 1:    9       470          0.7231     2.705            0.037
 2:    7       340          0.8201     2.658            0.027
 3:    7       491          0.9959     2.603            0.043
 4:    4       393          0.7478     2.710            0.323
 5:    3       243          0.7422     2.715            0.021
---                                                          
46:    2       331          0.9665     2.687            0.024
47:    4       465          0.7353     2.644            0.031
48:    7       157          0.8018     2.634            0.020
49:    4       266          0.7425     2.625            0.025
50:    7       184          0.5544     2.936            0.018
6 variable(s) not shown: [timestamp, warnings, errors, x_domain, batch_nr, resample_result]

# best configuration
instance$result

   mtry num.trees sample.fraction learner_param_vals  x_domain regr.rmse
1:    9       151          0.9767          <list[4]> <list[3]>     2.541

# incumbent plot
autoplot(instance, type = "incumbent")

# marginal plots
autoplot(instance, type = "marginal", cols_x = "mtry")

autoplot(instance, type = "marginal", cols_x = "sample.fraction")

autoplot(instance, type = "marginal", cols_x = "num.trees")

Evaluate the performance of the model created in Exercise 1 with nested resampling. Use a holdout validation for the inner resampling and a three-fold CV for the outer resampling.

set.seed(1)

task = tsk("mtcars")

learner = lrn("regr.ranger",
  mtry = to_tune(1, 10),
  sample.fraction = to_tune(0.5, 1),
  num.trees = to_tune(100, 500)
)

at = auto_tuner(
  tuner = tnr("random_search", batch_size = 50),
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("regr.rmse"),
  terminator = trm("evals", n_evals = 50)
)

rr = resample(task, at, rsmp("cv", folds = 3))

rr$aggregate(msr("regr.rmse"))

regr.rmse 
    2.542

The "rmse" is slightly higher than the one we obtained in Exercise 1. We see that the performance estimated while tuning overestimates the true performance

Tune and benchmark an XGBoost model against a logistic regression (without tuning the latter) and determine which has the best Brier score. Use mlr3tuningspaces and nested resampling, try to pick appropriate inner and outer resampling strategies that balance computational efficiency vs. stability of the results.

set.seed(1)

task = tsk("sonar")
lrn_log_reg = lrn("classif.log_reg", predict_type = "prob")

# load xgboost learner with search space
lrn_xgboost = lts(lrn("classif.xgboost", predict_type = "prob"))

# search space for xgboost
lrn_xgboost$param_set$search_space()

<ParamSet(8)>
                  id    class  lower    upper nlevels        default
1:             alpha ParamDbl -6.908    6.908     Inf <NoDefault[0]>
2: colsample_bylevel ParamDbl  0.100    1.000     Inf <NoDefault[0]>
3:  colsample_bytree ParamDbl  0.100    1.000     Inf <NoDefault[0]>
4:               eta ParamDbl -9.210    0.000     Inf <NoDefault[0]>
5:            lambda ParamDbl -6.908    6.908     Inf <NoDefault[0]>
6:         max_depth ParamInt  1.000   20.000      20 <NoDefault[0]>
7:           nrounds ParamInt  1.000 5000.000    5000 <NoDefault[0]>
8:         subsample ParamDbl  0.100    1.000     Inf <NoDefault[0]>
1 variable(s) not shown: [value]
Trafo is set.

at_xgboost = auto_tuner(
  tuner = tnr("random_search", batch_size = 50),
  learner = lrn_xgboost,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.bbrier"),
  terminator = trm("evals", n_evals = 50)
)

design = benchmark_grid(
  tasks = task,
  learners = list(lrn_log_reg, at_xgboost),
  resamplings = rsmp("cv", folds = 5)
)

bmr = benchmark(design)

bmr$aggregate(msr("classif.bbrier"))

   nr task_id            learner_id resampling_id iters classif.bbrier
1:  1   sonar       classif.log_reg            cv     5         0.2423
2:  2   sonar classif.xgboost.tuned            cv     5         0.1024
Hidden columns: resample_result

We use the lts() function from the mlr3tuningspaces package to load the lrn("classif.xgboost") with a search space. The learner is wrapped in an auto_tuner(), which is then benchmarked against the lrn("classif.log_reg").

(*) Write a function that implements an iterated random search procedure that drills down on the optimal configuration by applying random search to iteratively smaller search spaces. Your function should have seven inputs: task, learner, search_space, resampling, measure, random_search_stages, and random_search_size. You should only worry about programming this for fully numeric and bounded search spaces that have no dependencies. In pseudo-code:
1. Create a random design of size random_search_size from the given search space and evaluate the learner on it.
2. Identify the best configuration.
3. Create a smaller search space around this best config, where you define the new range for each parameter as: new_range[i] = (best_conf[i] - 0.25 * current_range[i], best_conf[i] + 0.25*current_range[i]). Ensure that this new_range respects the initial bound of the original search_space by taking the max() of the new and old lower bound, and the min() of the new and the old upper bound (“clipping”).
4. Iterate the previous steps random_search_stages times and at the end return the best configuration you have ever evaluated.

library(mlr3misc)

focus_search = function(task, learner, search_space, resampling, measure, random_search_stages, random_search_size) {

  repeat {

    # tune learner on random design
    instance = tune(
      tuner = tnr("random_search", batch_size = random_search_size),
      learner = learner,
      task = task,
      resampling = resampling,
      measure = measure,
      search_space = search_space,
      terminator = trm("evals", n_evals = random_search_size),
    )

    # identify the best configuration
    best_config = instance$result_x_search_space

    # narrow search space
    params = map(search_space$subspaces(), function(subspace) {
      best = best_config[[subspace$ids()]]
      lower = subspace$lower
      upper = subspace$upper

      new_lower = best - 0.25 * lower
      new_upper = best + 0.25 * upper

      if ("ParamInt" %in% subspace$class) {
        new_lower = round(new_lower)
        new_upper = round(new_upper)

        p_int(max(new_lower, lower), min(new_upper, upper), tags = subspace$tags[[1]])
      } else {
        p_dbl(max(new_lower, lower), min(new_upper, upper), tags = subspace$tags[[1]])
      }
    })
    search_space = invoke(ps, .args = params)

    random_search_stages = random_search_stages - 1
    if (!random_search_stages) return(best_config)
  }
}

focus_search(
  task = tsk("mtcars"),
  learner = lrn("regr.xgboost"),
  search_space = ps(
    eta = p_dbl(lower = 0.01, upper = 0.5),
    max_depth = p_int(lower = 1, upper = 10),
    nrounds = p_int(lower = 10, upper = 100)
  ),
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.rmse"),
  random_search_stages = 2,
  random_search_size = 50
)

      eta max_depth nrounds
1: 0.4341         2      84

As a stretch goal, look into mlr3tuning’s internal source code and turn your function into an R6 class inheriting from the TunerBatch class – test it out on a learner of your choice.

library(R6)
library(mlr3tuning)

TunerBatchFocusSearch = R6Class("TunerFocusSearch",
  inherit = TunerBatch,
  public = list(

    initialize = function() {
      param_set = ps(
        random_search_stages = p_int(lower = 1L, tags = "required"),
        random_search_size = p_int(lower = 1L, tags = "required")
      )

      param_set$values = list(random_search_stages = 10L, random_search_size = 50L)

      super$initialize(
        id = "focus_search",
        param_set = param_set,
        param_classes = c("ParamLgl", "ParamInt", "ParamDbl", "ParamFct"),
        properties = c("dependencies", "single-crit", "multi-crit"),
        label = "Focus Search",
        man = "mlr3tuning::mlr_tuners_focus_search"
      )
    }
  ),

  private = list(
    .optimize = function(inst) {
      pv = self$param_set$values
      search_space = inst$search_space

       for (i in seq(pv$random_search_stages)) {

        # evaluate random design
        xdt = generate_design_random(search_space, pv$random_search_size)$data
        inst$eval_batch(xdt)

        # identify the best configuration
        best_config = inst$archive$best(batch = i)

        # narrow search space
        params = map(search_space$subspaces(), function(subspace) {
          best = best_config[[subspace$ids()]]
          lower = subspace$lower
          upper = subspace$upper

          new_lower = best - 0.25 * lower
          new_upper = best + 0.25 * upper

          if ("ParamInt" %in% subspace$class) {
            new_lower = round(new_lower)
            new_upper = round(new_upper)

            p_int(max(new_lower, lower), min(new_upper, upper), tags = subspace$tags[[1]])
          } else {
            p_dbl(max(new_lower, lower), min(new_upper, upper), tags = subspace$tags[[1]])
          }
        })
        search_space = invoke(ps, .args = params)

        xdt = generate_design_random(search_space, pv$random_search_size)$data
        inst$eval_batch(xdt)
      }
    }
  )
)

mlr_tuners$add("focus_search", TunerBatchFocusSearch)

instance = ti(
  task = tsk("mtcars"),
  learner = lrn("regr.xgboost"),
  search_space = ps(
    eta = p_dbl(lower = 0.01, upper = 0.5),
    max_depth = p_int(lower = 1, upper = 10),
    nrounds = p_int(lower = 10, upper = 100)
  ),
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.rmse"),
  terminator = trm("none")
)

tuner = tnr("focus_search", random_search_stages = 2, random_search_size = 50)

tuner$optimize(instance)

      eta max_depth nrounds learner_param_vals  x_domain regr.rmse
1: 0.3131         1      80          <list[5]> <list[3]>     2.897

A.4 Solutions to Chapter 5

Tune the mtry, sample.fraction, and num.trees hyperparameters of lrn("regr.ranger") on tsk("mtcars") and evaluate this with a three-fold CV and the root mean squared error (same as Chapter 4, Exercise 1). Use tnr("mbo") with 50 evaluations. Compare this with the performance progress of a random search run from Chapter 4, Exercise 1. Plot the progress of performance over iterations and visualize the spatial distribution of the evaluated hyperparameter configurations for both algorithms.

We first construct the learner, task, resampling, measure and terminator and then the instance.

library(mlr3mbo)
library(bbotk)
library(data.table)
library(ggplot2)
library(viridisLite)

set.seed(5)

learner = lrn("regr.ranger",
  mtry = to_tune(1, 10),
  sample.fraction = to_tune(0.5, 1),
  num.trees = to_tune(100, 500)
)
task = tsk("mtcars")
resampling = rsmp("cv", folds = 3)
measure = msr("regr.rmse")
terminator = trm("evals", n_evals = 50)
instance_rs = ti(
  learner = learner,
  task = task,
  resampling = resampling,
  measure = measure,
  terminator = terminator
)

Using a random search results in the following final performance:

tuner = tnr("random_search", batch_size = 50)
tuner$optimize(instance_rs)

   mtry num.trees sample.fraction learner_param_vals  x_domain regr.rmse
1:    8       487          0.9051          <list[4]> <list[3]>     2.415

We then construct a new instance and optimize it via Bayesian Optimization (BO) using tnr("mbo") in its default configuration (see also mbo_defaults):

instance_bo = ti(
  learner = learner,
  task = task,
  resampling = resampling,
  measure = measure,
  terminator = terminator
)
tuner = tnr("mbo")
tuner$optimize(instance_bo)

WARN  [10:29:58.302] [mlr3] train: Stopped because hard maximum generation limit was hit.
WARN  [10:30:03.553] [mlr3] train: Stopped because hard maximum generation limit was hit.

   mtry num.trees sample.fraction learner_param_vals  x_domain regr.rmse
1:    5       499          0.9956          <list[4]> <list[3]>     2.525

We then add relevant information to the archives of the instances so that we can combine their data and use this data for generating the desired plots.

instance_rs$archive$data[, iteration := seq_len(.N)]
instance_rs$archive$data[, best_rmse := cummin(regr.rmse)]
instance_rs$archive$data[, method := "Random Search"]
instance_bo$archive$data[, iteration := seq_len(.N)]
instance_bo$archive$data[, best_rmse := cummin(regr.rmse)]
instance_bo$archive$data[, method := "BO"]

plot_data = rbind(instance_rs$archive$data[, c("iteration", "best_rmse", "method")],
  instance_bo$archive$data[, c("iteration", "best_rmse", "method")])

ggplot(aes(x = iteration, y = best_rmse, colour = method), data = plot_data) +
  geom_step() +
  scale_colour_manual(values = viridis(2, end = 0.8)) +
  labs(x = "Number of Configurations", y = "Best regr.rmse", colour = "Method") +
  theme_minimal() +
  theme(legend.position = "bottom")

We see that BO manages to slightly outperform the random search. Ideally, we would replicate running both optimizers multiple times with different random seeds and visualize their average performance along with a dispersion measure to properly take randomness of the optimization process into account. We could even use the same first few random samples as the initial design in BO to allow for a fairer comparison.

To visualize the spatial distribution of the evaluated hyperparameter configurations we will plot for each evaluated configuration the number of trees on the x-axis and the sample fraction on the y-axis. The label of each point corresponds to the mtry parameter directly.

relevant_columns = c("mtry", "sample.fraction", "num.trees", "iteration", "method")
plot_data_sampling = rbind(
  instance_rs$archive$data[, ..relevant_columns, with = FALSE],
  instance_bo$archive$data[, ..relevant_columns, with = FALSE])

ggplot(
    aes(x = num.trees, y = sample.fraction, colour = method, label = mtry),
    data = plot_data_sampling
  ) +
  scale_colour_manual(values = viridis(2, end = 0.8)) +
  geom_point(size = 0) +
  geom_text() +
  guides(colour = guide_legend(title = "Method", override.aes = aes(label = "", size = 2))) +
  theme_minimal() +
  theme(legend.position = "bottom")

We observe that the random search samples uniformly at random – as expected. BO, however, focuses on regions of the search space with a high number of trees between 350 and 400, a high sample fraction and mtry values of around 5 to 8. This is also the region where the final result returned by BO is located. Nevertheless, BO also explores the search space, i.e., along the line of a high sample fraction close to 1.

Minimize the 2D Rastrigin function $f: [-5.12, 5.12] \times [-5.12, 5.12] \rightarrow \mathbb{R}$, $\mathbf{x} \mapsto 10 D+\sum_{i=1}^D\left[x_i^2-10 \cos \left(2 \pi x_i\right)\right]$, $D = 2$ via BO (standard sequential single-objective BO via bayesopt_ego()) using the lower confidence bound with lambda = 1 as acquisition function and "NLOPT_GN_ORIG_DIRECT" via opt("nloptr") as acquisition function optimizer. Use a budget of 40 function evaluations. Run this with both the “default” Gaussian process surrogate model with Matérn 5/2 kernel, and the “default” random forest surrogate model. Compare their anytime performance (similarly as in Figure 5.7). You can construct the surrogate models with default settings using:

surrogate_gp = srlrn(default_gp())
surrogate_rf = srlrn(default_rf())

We first construct the function, making use of efficient evaluation operating on a data.table directly. We then wrap this function in the corresponding ObjectiveRFunDt objective class and construct the instance.

rastrigin = function(xdt) {
  D = ncol(xdt)
  y = 10 * D + rowSums(xdt^2 - (10 * cos(2 * pi * xdt)))
  data.table(y = y)
}

objective = ObjectiveRFunDt$new(
  fun = rastrigin,
  domain = ps(x1 = p_dbl(lower = -5.12, upper = 5.12),
    x2 = p_dbl(lower = -5.12, upper = 5.12)),
  codomain = ps(y = p_dbl(tags = "minimize")),
  id = "rastrigin2D")

instance = OptimInstanceSingleCrit$new(
  objective = objective,
  terminator = trm("evals", n_evals = 40))

OptimInstanceSingleCrit is deprecated. Use OptimInstanceBatchSingleCrit instead.

We then construct the surrogates as well as the acquisition function and acquisition function optimizer (we will terminate the acquisition function optimization once optimization process stagnates by 1e-5 over the last 100 iterations) and construct the two BO optimizers.

surrogate_gp = srlrn(default_gp())
surrogate_rf = srlrn(default_rf())

acq_function = acqf("cb", lambda = 1)

acq_optimizer = acqo(opt("nloptr", algorithm = "NLOPT_GN_ORIG_DIRECT"),
  terminator = trm("stagnation", iters = 100, threshold = 1e-5))

optimizer_gp = opt("mbo",
  loop_function = bayesopt_ego,
  surrogate = surrogate_gp,
  acq_function = acq_function,
  acq_optimizer = acq_optimizer)

optimizer_rf = opt("mbo",
  loop_function = bayesopt_ego,
  surrogate = surrogate_rf,
  acq_function = acq_function,
  acq_optimizer = acq_optimizer
)

We will use the following initial design for both optimizers:

initial_design = data.table(
  x1 = c(-3.95, 1.16, 3.72, -1.39, -0.11, 5.00, -2.67, 2.44),
  x2 = c(1.18, -3.93, 3.74, -1.37, 5.02, -0.09, -2.65, 2.46)
)
instance$eval_batch(initial_design)

We then proceed to optimize the instance with each of the two optimizers and make sure to extract the relevant data from the archive of the instance.

optimizer_gp$optimize(instance)

       x1     x2  x_domain     y
1: -2.969 -1.017 <list[2]> 10.09

gp_data = instance$archive$data
gp_data[, y_min := cummin(y)]
gp_data[, iteration := seq_len(.N)]
gp_data[, surrogate := "Gaussian Process"]

instance$archive$clear()

instance$eval_batch(initial_design)

optimizer_rf$optimize(instance)

        x1     x2  x_domain     y
1: 0.01405 -3.975 <list[2]> 15.96

rf_data = instance$archive$data
rf_data[, y_min := cummin(y)]
rf_data[, iteration := seq_len(.N)]
rf_data[, surrogate := "Random forest"]

We then combine the data and use it to generate the desired plot:

plot_data = rbind(gp_data, rf_data)
ggplot(aes(x = iteration, y = y_min, colour = surrogate), data = plot_data) +
  geom_step() +
  scale_colour_manual(values = viridis(2, end = 0.8)) +
  labs(y = "Best Observed Function Value", x = "Number of Function Evaluations",
       colour = "Surrogate Model") +
  theme_minimal() +
  theme(legend.position = "bottom")

As expected, we observe that the BO algorithm with the Gaussian Process surrogate appears to outperform the random forest surrogate counterpart. However, ideally we would replicate running each algorithm using different random seeds and visualize the average performance along with some dispersion measure to properly take randomness of the optimization process into account.

Minimize the following function: $f: [-10, 10] \rightarrow \mathbb{R}^2, x \mapsto \left(x^2, (x - 2)^2\right)$ with respect to both objectives. Use the ParEGO algorithm. Construct the objective function using the ObjectiveRFunMany class. Terminate the optimization after a runtime of 100 evals. Plot the resulting Pareto front and compare it to the analytical solution, $y_2 = \left(\sqrt{y_1}-2\right)^2$ with $y_1$ ranging from $0$ to $4$.

We first construct the function, wrap it in the objective and then create the instance.

fun = function(xss) {
  evaluations = lapply(xss, FUN = function(xs) {
    list(y1 = xs$x ^ 2, y2 = (xs$x - 2)^2)
  })
  rbindlist(evaluations)
}

objective = ObjectiveRFunMany$new(
  fun = fun,
  domain = ps(x = p_dbl(lower = -10, upper = 10)),
  codomain = ps(y1 = p_dbl(tags = "minimize"), y2 = p_dbl(tags = "minimize")),
  id = "schaffer1")

instance = OptimInstanceMultiCrit$new(
  objective = objective,
  terminator = trm("evals", n_evals = 100)
)

OptimInstanceMultiCrit is deprecated. Use OptimInstanceBatchMultiCrit instead.

As a surrogate we will use a random forest. ParEGO is a scalarization based multi-objective BO algorithm and therefore we use the Expected Improvement as acquisition function. We will use the same acquisition functon optimizer as earlier.

surrogate = srlrn(default_rf())

acq_function = acqf("ei")

acq_optimizer = acqo(opt("nloptr", algorithm = "NLOPT_GN_ORIG_DIRECT"),
  terminator = trm("stagnation", iters = 100, threshold = 1e-5))

optimizer = opt("mbo",
  loop_function = bayesopt_parego,
  surrogate = surrogate,
  acq_function = acq_function,
  acq_optimizer = acq_optimizer
)

We then optimize the instance:

optimizer$optimize(instance)

WARN  [10:40:12.895] [bbotk] Error: Missing data in dependent variable.
WARN  [10:40:12.896] [bbotk] Could not update the surrogate a final time after the optimization process has terminated.

         x  x_domain      y1     y2
 1: 0.0000 <list[1]> 0.00000 4.0000
 2: 0.2469 <list[1]> 0.06097 3.0733
 3: 1.8381 <list[1]> 3.37874 0.0262
 4: 1.2346 <list[1]> 1.52416 0.5859
 5: 1.4815 <list[1]> 2.19479 0.2689
---                                
37: 0.4390 <list[1]> 0.19268 2.4369
38: 1.2955 <list[1]> 1.67841 0.4963
39: 0.7316 <list[1]> 0.53523 1.6088
40: 1.1797 <list[1]> 1.39169 0.6729
41: 1.2437 <list[1]> 1.54682 0.5720

Finally, we visualize the resulting Pareto front (in black) and its analytical counterpart (in darkgrey).

true_pareto = data.table(y1 = seq(from = 0, to = 4, length.out = 1001))
true_pareto[, y2 := (sqrt(y1) - 2) ^2]

ggplot(aes(x = y1, y = y2), data = instance$archive$best()) +
  geom_point() +
  geom_line(data = true_pareto, colour = "darkgrey") +
  geom_step(direction = "hv") +
  labs(x = expression(y[1]), y = expression(y[2])) +
  theme_minimal()

A.5 Solutions to Chapter 6

Compute the correlation filter scores on tsk("mtcars") and use the filter to select the five features most strongly correlated with the target. Resample lrn("regr.kknn") on both the full dataset and the reduced one, and compare both performances based on 10-fold CV with respect to MSE. NB: Here, we have performed the feature filtering outside of CV, which is generally not a good idea as it biases the CV performance estimation. To do this properly, filtering should be embedded inside the CV via pipelines – try to come back to this exercise after you read Chapter 8 to implement this with less bias.

set.seed(1)

task = tsk("mtcars")

filter = flt("correlation")

filter$calculate(task)

# sorted filter scores
filter$scores

    wt    cyl   disp     hp   drat     vs     am   carb   gear   qsec 
0.8677 0.8522 0.8476 0.7762 0.6812 0.6640 0.5998 0.5509 0.4803 0.4187

# subset task to 5 most correlated features
task$select(names(head(filter$scores, 5)))
task

<TaskRegr:mtcars> (32 x 6): Motor Trends
* Target: mpg
* Properties: -
* Features (5):
  - dbl (5): cyl, disp, drat, hp, wt

design = benchmark_grid(
  tasks = list(task, tsk("mtcars")),
  learners = lrn("regr.kknn"),
  resamplings = rsmp("cv", folds = 10)
)

bmr = benchmark(design)
bmr$aggregate(msr("regr.mse"))

   nr task_id learner_id resampling_id iters regr.mse
1:  1  mtcars  regr.kknn            cv    10    6.678
2:  2  mtcars  regr.kknn            cv    10    9.083
Hidden columns: resample_result

The "mse" is much lower on the filtered task.

Apply backward selection to tsk("penguins") with lrn("classif.rpart") and holdout resampling by the classification accuracy measure. Compare the results with those in Section 6.2.1 by also running the forward selection from that section. Do the selected features differ? Which feature selection method reports a higher classification accuracy in its $result?

set.seed(1)

fselector_sbs = fs("sequential", strategy = "sbs")

instance_sbs = fsi(
  task =  tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  terminator = trm("none")
)

fselector_sbs$optimize(instance_sbs)

   bill_depth bill_length body_mass flipper_length island   sex  year
1:       TRUE        TRUE      TRUE          FALSE  FALSE FALSE FALSE
3 variable(s) not shown: [features, n_features, classif.acc]

# optimization path sbs
fselector_sbs$optimization_path(instance_sbs)

   bill_depth bill_length body_mass flipper_length island   sex  year
1:       TRUE        TRUE      TRUE           TRUE   TRUE  TRUE  TRUE
2:      FALSE        TRUE      TRUE           TRUE   TRUE  TRUE  TRUE
3:      FALSE        TRUE      TRUE          FALSE   TRUE  TRUE  TRUE
4:      FALSE        TRUE      TRUE          FALSE  FALSE  TRUE  TRUE
5:      FALSE        TRUE      TRUE          FALSE  FALSE  TRUE FALSE
6:      FALSE        TRUE      TRUE          FALSE  FALSE FALSE FALSE
7:      FALSE        TRUE     FALSE          FALSE  FALSE FALSE FALSE
2 variable(s) not shown: [classif.acc, batch_nr]

instance_sbs$result

   bill_depth bill_length body_mass flipper_length island   sex  year
1:       TRUE        TRUE      TRUE          FALSE  FALSE FALSE FALSE
3 variable(s) not shown: [features, n_features, classif.acc]

fselector_sfs = fs("sequential", strategy = "sfs")

instance_sfs = fsi(
  task =  tsk("penguins"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  terminator = trm("none")
)

fselector_sfs$optimize(instance_sfs)

   bill_depth bill_length body_mass flipper_length island   sex  year
1:       TRUE        TRUE     FALSE           TRUE  FALSE FALSE FALSE
3 variable(s) not shown: [features, n_features, classif.acc]

# optimization path sfs
fselector_sfs$optimization_path(instance_sfs)

   bill_depth bill_length body_mass flipper_length island   sex  year
1:       TRUE       FALSE     FALSE          FALSE  FALSE FALSE FALSE
2:       TRUE       FALSE     FALSE           TRUE  FALSE FALSE FALSE
3:       TRUE        TRUE     FALSE           TRUE  FALSE FALSE FALSE
4:       TRUE        TRUE      TRUE           TRUE  FALSE FALSE FALSE
5:       TRUE        TRUE      TRUE           TRUE  FALSE FALSE  TRUE
6:       TRUE        TRUE      TRUE           TRUE   TRUE FALSE  TRUE
7:       TRUE        TRUE      TRUE           TRUE   TRUE  TRUE  TRUE
2 variable(s) not shown: [classif.acc, batch_nr]

instance_sfs$result

   bill_depth bill_length body_mass flipper_length island   sex  year
1:       TRUE        TRUE     FALSE           TRUE  FALSE FALSE FALSE
3 variable(s) not shown: [features, n_features, classif.acc]

The sequential backward search selects 5 features, while the sequential forward search selects all features. The sequential backward search reports a higher classification accuracy.

There is a problem in the performance comparison in Exercise 2 as feature selection is performed on the test-set. Change the process by applying forward feature selection with auto_fselector(). Compare the performance to backward feature selection from Exercise 2 using nested resampling.

set.seed(1)

afs_sfs = auto_fselector(
  fselector = fs("sequential", strategy = "sfs"),
  learner = lrn("classif.rpart", id = "classif.rpart.sfs"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc")
)

afs_sbs = auto_fselector(
  fselector = fs("sequential", strategy = "sbs"),
  learner = lrn("classif.rpart", id = "classif.rpart.sbs"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc")
)

design = benchmark_grid(
  tasks = tsk("penguins"),
  learners = list(afs_sfs, afs_sbs),
  resamplings = rsmp("cv", folds = 5)
)

bmr = benchmark(design)
bmr$aggregate(msr("classif.acc"))

   nr  task_id                  learner_id resampling_id iters
1:  1 penguins classif.rpart.sfs.fselector            cv     5
2:  2 penguins classif.rpart.sbs.fselector            cv     5
1 variable(s) not shown: [classif.acc]
Hidden columns: resample_result

Now the sequential forward search selects yields a slightly higher classification accuracy.

(*) Write a feature selection algorithm that is a hybrid of a filter and a wrapper method. This search algorithm should compute filter scores for all features and then perform a forward search. But instead of tentatively adding all remaining features to the current feature set, it should only stochastically try a subset of the available features. Features with high filter scores should be added with higher probability. Start by coding a stand-alone R method for this search (based on a learner, task, resampling, performance measure and some control settings).

library(mlr3verse)
library(data.table)

task = tsk("sonar")
learner = lrn("classif.rpart")
resampling = rsmp("cv", folds = 3)
measure = msr("classif.acc")
filter = flt("auc")
n = 5
max_features = 10


filter_forward_selection_search = function(task, learner, resampling, measure, filter, n, max_features) {
  features = task$feature_names

  # calculate filter scores
  filter$calculate(task)
  scores = filter$scores

  result_features = character(0)
  while(max_features > length(result_features)) {

    # select n features to try
    filter_features = sample(names(scores), size = min(n, length(scores)), prob = scores)

    # create feature matrix
    states = matrix(FALSE, ncol = length(features), nrow = length(filter_features))

    # add filter features to matrix
    for (i in seq_along(filter_features)) {
      states[i, which(features %in% filter_features[i])] = TRUE
    }

    # add already selected features to matrix
    states[, which(features %in% result_features)] = TRUE

    # convert matrix to design
    design = setnames(as.data.table(states), features)

    # evaluate feature combinations
    instance = fselect(
      fselector = fs("design_points", design = design, batch_size = nrow(design)),
      task =  task,
      learner = learner,
      resampling = resampling,
      measure = measure
    )

    # current best set
    result_features = instance$result_feature_set

    # remove selected features from scores
    scores = scores[!names(scores) %in% result_features]
  }

  result_features
}

filter_forward_selection_search(task, learner, resampling, measure, filter, n, max_features)

 [1] "V11" "V13" "V2"  "V24" "V27" "V34" "V36" "V52" "V56" "V58"

Then, as a stretch goal, see if you can implement this as an R6 class inheriting from FSelectorBatch.

library(R6)
library(checkmate)
library(mlr3verse)
library(mlr3fselect)
library(data.table)

FSelectorBatchSequentialFilter = R6Class("FSelectorBatchSequentialFilter",
  inherit = FSelectorBatch,
  public = list(

    #' @description
    #' Creates a new instance of this [R6][R6::R6Class] class.`
    initialize = function() {
      param_set = ps(
        filter = p_uty(tags = "required"),
        n = p_int(lower = 1, tags = "required"),
        max_features = p_int(lower = 1)
      )

      super$initialize(
        id = "filter_sequential",
        param_set = param_set,
        properties = "single-crit",
        label = "Sequential Filter Search",
        man = "mlr3fselect::mlr_fselectors_sequential"
      )
    }
  ),
  private = list(
    .optimize = function(inst) {
      pv = self$param_set$values
      features = inst$archive$cols_x
      max_features = pv$max_features %??% length(features)

      # calculate filter scores
      pv$filter$calculate(inst$objective$task)
      scores = pv$filter$scores

      result_features = character(0)
      while(max_features > length(result_features)) {

        # select n features to try
        filter_features = sample(names(scores), size = min(pv$n, length(scores)), prob = scores)

        # create feature matrix
        states = matrix(FALSE, ncol = length(features), nrow = length(filter_features))

        # add filter features to matrix
        for (i in seq_along(filter_features)) {
          states[i, which(features %in% filter_features[i])] = TRUE
        }

        # add already selected features to matrix
        states[, which(features %in% result_features)] = TRUE

        # convert matrix to design
        design = setnames(as.data.table(states), features)

        # evaluate feature combinations
        inst$eval_batch(design)

        # current best set
        res = inst$archive$best(batch = inst$archive$n_batch)
        result_features = features[as.logical(res[, features, with = FALSE])]

        # remove selected features from scores
        scores = scores[!names(scores) %in% result_features]
      }
    }
  )
)

mlr_fselectors$add("sequential_filter", FSelectorBatchSequentialFilter)

instance = fselect(
  fselector = fs(
    "sequential_filter",
    filter = flt("auc"),
    n = 5,
    max_features = 10),
  task =  tsk("sonar"),
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.acc")
)

A.6 Solutions to Chapter 7

Concatenate the PipeOps named in the exercise by using %>>%. The resulting Graph can then be converted to a Learner by using as_learner().

library(mlr3pipelines)
library(mlr3learners)

graph = po("imputeoor") %>>% po("scale") %>>% lrn("classif.log_reg")
graph_learner = as_learner(graph)

The GraphLearner can be trained like any other Learner object, thereby filling in its $model field. It is possible to access the $state of any PipeOp through this field: the states are named after the PipeOp’s $id. The logistic regression model can then be extracted from the state of the po("learner") that contains the lrn("classif.log_reg").

graph_learner$train(tsk("pima"))

# access the state of the po("learner") to get the model
model = graph_learner$model$classif.log_reg$model
coef(model)

(Intercept)         age     glucose     insulin        mass    pedigree 
   -0.88835     0.15584     1.13631    -0.17477     0.74383     0.32121 
   pregnant    pressure     triceps 
    0.39594    -0.24967     0.05599

Alternatively, the underlying lrn("classif.log_reg") can be accessed through the $base_learner() method:

model = graph_learner$base_learner()$model
coef(model)

(Intercept)         age     glucose     insulin        mass    pedigree 
   -0.88835     0.15584     1.13631    -0.17477     0.74383     0.32121 
   pregnant    pressure     triceps 
    0.39594    -0.24967     0.05599

As a third option, the trained PipeOp can be accessed through the $graph_model field of the GraphLearner. The trained PipeOp has a $learner_model field, which contains the trained Learner object, which contains the model.

pipeop = graph_learner$graph_model$pipeops$classif.log_reg
model = pipeop$learner_model$model
coef(model)

(Intercept)         age     glucose     insulin        mass    pedigree 
   -0.88835     0.15584     1.13631    -0.17477     0.74383     0.32121 
   pregnant    pressure     triceps 
    0.39594    -0.24967     0.05599

Set the $keep_results flag of the Graph to TRUE to keep the results of the individual PipeOps. Afterwards, the input of the lrn("classif.log_reg") can be accessed through the $.result field of its predecessor, the po("scale"). Note that the $.result is a list, we want to access its only element, named $output.

graph_learner$graph$keep_results = TRUE
graph_learner$train(tsk("pima"))

# access the input of the learner
scale_result = graph_learner$graph_model$pipeops$scale$.result

scale_output_task = scale_result$output

age_column = scale_output_task$data()$age

# check if the age column is standardized:
# 1. does it have mean 0? -- almost, up to numerical precision!
mean(age_column)

[1] 1.988e-16

# 2. does it have standard deviation 1? -- yes!
sd(age_column)

[1] 1

A.7 Solutions to Chapter 8

Use the po("pca") to replace numeric columns with their PCA transform. To restrict this operator to only columns without missing values, the affect_columns with a fitting Selector can be used: The selector_missing(), which selects columns with missing values, combined with selector_invert(), which inverts the selection. Since po("pca") only operates on numeric columns, it is not necessary to use a Selector to select numeric columns.

graph = as_graph(po("pca",
  affect_columns = selector_invert(selector_missing()))
)

# apply the graph to the pima task
graph_result = graph$train(tsk("pima"))

# we get the following features
graph_result[[1]]$feature_names

[1] "PC1"      "PC2"      "PC3"      "glucose"  "insulin"  "mass"    
[7] "pressure" "triceps"

# Compare with feature-columns of tsk("pima") with missing values:
selector_missing()(tsk("pima"))

[1] "glucose"  "insulin"  "mass"     "pressure" "triceps"

Alternatively, po("select") can be used to select the columns without missing values that are passed to po("pca"). Another po("select") can be used to select all the other columns. It is put in parallel with the first po("select") using gunion(). It is necessary to use different $id values for both po("select") to avoid a name clash in the Graph. To combine the output from both paths, po("featureunion") can be used.

path1 = po("select", id = "select_non_missing",
  selector = selector_invert(selector_missing())) %>>%
    po("pca")
path2 = po("select", id = "select_missing",
  selector = selector_missing())
graph = gunion(list(path1, path2)) %>>% po("featureunion")

# apply the graph to the pima task
graph_result = graph$train(tsk("pima"))
graph_result[[1]]$feature_names

[1] "PC1"      "PC2"      "PC3"      "glucose"  "insulin"  "mass"    
[7] "pressure" "triceps"

First, observe the feature names produced by the level 0 learners when applied to the tsk("wine") task:

lrn_rpart = lrn("classif.rpart", predict_type = "prob")
po_rpart_cv = po("learner_cv", learner = lrn_rpart,
  resampling.folds = 2, id = "rpart_cv"
)

lrn_knn = lrn("classif.kknn", predict_type = "prob")
po_knn_cv = po("learner_cv",
  learner = lrn_knn,
  resampling.folds = 2, id = "knn_cv"
)

# we restrict ourselves to two level 0 learners here to
# focus on the essentials.

gr_level_0 = gunion(list(po_rpart_cv, po_knn_cv))
gr_combined = gr_level_0 %>>% po("featureunion")

gr_combined$train(tsk("wine"))[[1]]$head()

   type rpart_cv.prob.1 rpart_cv.prob.2 rpart_cv.prob.3 knn_cv.prob.1
1:    1          1.0000          0.0000               0             1
2:    1          1.0000          0.0000               0             1
3:    1          1.0000          0.0000               0             1
4:    1          1.0000          0.0000               0             1
5:    1          0.2857          0.7143               0             1
6:    1          1.0000          0.0000               0             1
2 variable(s) not shown: [knn_cv.prob.2, knn_cv.prob.3]

To use po("select") to remove, instead of keep, a feature based on a pattern, use selector_invert together with selector_grep. To remove the “1” class columns, i.e. all columns with names that end in “1”, the following po("select") could be used:

drop_one = po("select", selector = selector_invert(selector_grep("\\.1$")))

# Train it on the wine task with lrn("classif.multinom"):

gr_stack = gr_combined %>>% drop_one %>>%
  lrn("classif.multinom", trace = FALSE)

glrn_stack = as_learner(gr_stack)

glrn_stack$train(tsk("wine"))

glrn_stack$base_learner()$model

Call:
nnet::multinom(formula = type ~ ., data = task$data(), trace = FALSE)

Coefficients:
  (Intercept) rpart_cv.prob.2 rpart_cv.prob.3 knn_cv.prob.2
2      -6.559          -8.273           15.96        44.974
3     -29.359         -19.237           14.07        -9.424
  knn_cv.prob.3
2         14.64
3         64.42

Residual Deviance: 5.504 
AIC: 25.5

A solution that does not need to specify the target classes at all is to use a custom Selector, as was shown in Section 8.3.1:

selector_remove_one_prob_column = function(task) {
  class_removing = task$class_names[[1]]
  selector_use = selector_invert(selector_grep(paste0("\\.", class_removing ,"$")))
  selector_use(task)
}

Using this selector in Section 8.3.2, one could use the resulting stacking learner on any classification task with arbitrary target classes. It can be used as an alternative to the Selector used in exercise 2:

drop_one_alt = po("select", selector = selector_remove_one_prob_column)

# The same as above:
gr_stack = gr_combined %>>% drop_one_alt %>>%
  lrn("classif.multinom", trace = FALSE)
glrn_stack = as_learner(gr_stack)
glrn_stack$train(tsk("wine"))

# As before, the first class was dropped.
glrn_stack$base_learner()$model

Call:
nnet::multinom(formula = type ~ ., data = task$data(), trace = FALSE)

Coefficients:
  (Intercept) rpart_cv.prob.2 rpart_cv.prob.3 knn_cv.prob.2
2      -3.735         -0.1002          -10.62         21.81
3     -21.486        -17.2448          -21.62         40.16
  knn_cv.prob.3
2         18.13
3         53.81

Residual Deviance: 27.4 
AIC: 47.4

We choose to use the following options for imputation, factor encoding, and model training. Note the use of pos() and lrns(), which return lists of PipeOp and Learner objects, respectively.

imputing = pos(c("imputeoor", "imputesample"))

factor_encoding = pos(c("encode", "encodeimpact"))

models = lrns(c("classif.rpart", "classif.log_reg", "classif.svm"))

Use the ppl("branch") pipeline to get Graphs with alternative path branching, controlled by its own hyperparameter. We need to give the po("branch") operators that are created here individual prefixes to avoid nameclashes when we put everything together.

full_graph = ppl("branch",
    prefix_branchops = "impute_", graphs = imputing
  ) %>>% ppl("branch",
    prefix_branchops = "encode_", graphs = factor_encoding
  ) %>>% ppl("branch",
    prefix_branchops = "model_", graphs = models
  )

full_graph$plot()

The easiest way to set up the search space for this pipeline is to use to_tune(). It is necessary to record the dependencies of the hyperparameters of the preprocessing and model PipeOps on the branch hyperparameters. For this, to_tune() needs to be applied to a Domain object – p_dbl(), p_fct(), etc. – that has its dependency declared using the depends argument.

library("paradox")
full_graph$param_set$set_values(
  impute_branch.selection = to_tune(),
  encode_branch.selection = to_tune(),
  model_branch.selection = to_tune(),

  encodeimpact.smoothing = to_tune(p_dbl(1e-3, 1e3, logscale = TRUE,
    depends = encode_branch.selection == "encodeimpact")),
  encode.method = to_tune(p_fct(c("one-hot", "poly"),
    depends = encode_branch.selection == "encode")),

  classif.rpart.cp = to_tune(p_dbl(0.001, 0.3, logscale = TRUE,
    depends = model_branch.selection == "classif.rpart")),
  classif.svm.cost = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE,
    depends = model_branch.selection == "classif.svm")),
  classif.svm.gamma = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE,
    depends = model_branch.selection == "classif.svm"))
)

We also set a few SVM kernel hyperparameters record their dependency on the model selection branch hyperparameter. We could record these dependencies in the Graph, using the $add_dep() method of the ParamSet, but here we use the simpler approach of adding a single item search space component.

full_graph$param_set$set_values(
  classif.svm.type = to_tune(p_fct("C-classification",
    depends = model_branch.selection == "classif.svm")),
  classif.svm.kernel = to_tune(p_fct("radial",
    depends = model_branch.selection == "classif.svm"))
)

To turn this Graph into an AutoML-system, we use an AutoTuner. Here we use random search, but any other Tuner could be used.

library("mlr3tuning")
automl_at = auto_tuner(
  tuner = tnr("random_search"),
  learner = full_graph,
  resampling = rsmp("cv", folds = 4),
  measure = msr("classif.ce"),
  term_evals = 30
)

We can now benchmark this AutoTuner on a few tasks and compare it with the untuned random forest with out-of-range (OOR) imputation:

learners = list(
  automl_at,
  as_learner(po("imputeoor") %>>% lrn("classif.ranger"))
)
learners[[1]]$id = "automl"
learners[[2]]$id = "ranger"

tasks = list(
  tsk("breast_cancer"),
  tsk("pima"),
  tsk("sonar")
)

set.seed(123L)
design = benchmark_grid(tasks, learners = learners, rsmp("cv", folds = 3))
bmr = benchmark(design)

bmr$aggregate()

   nr       task_id learner_id resampling_id iters classif.ce
1:  1 breast_cancer     automl            cv     3    0.03513
2:  2 breast_cancer     ranger            cv     3    0.02489
3:  3          pima     automl            cv     3    0.22917
4:  4          pima     ranger            cv     3    0.23047
5:  5         sonar     automl            cv     3    0.22098
6:  6         sonar     ranger            cv     3    0.16812
Hidden columns: resample_result

The AutoTuner performs better than the untuned random forest on one task. This is, of course, a toy example to demonstrate the capabilities of mlr3pipelines in combination with the mlr3tuning package. To use this kind of setup on real world data, one would need to take care of making the process more robust, e.g. by using the ppl("robustify") pipeline, and by using fallback learners.

A.8 Solutions for Chapter 9

We will consider a prediction problem similar to the one from this chapter, but using the King County Housing regression data instead (available with tsk("kc_housing")). To evaluate the models, we again use 10-fold CV, mean absolute error and lrn("regr.glmnet"). For now we will ignore the date column and simply remove it:

set.seed(1)

library("mlr3data")
task = tsk("kc_housing")
task$select(setdiff(task$feature_names, "date"))

Have a look at the features, are there any features which might be problematic? If so, change or remove them. Check the dataset and learner properties to understand which preprocessing steps you need to do.

summary(task)

     price           bathrooms       bedrooms       condition   
 Min.   :  75000   Min.   :0.00   Min.   : 0.00   Min.   :1.00  
 1st Qu.: 321950   1st Qu.:1.75   1st Qu.: 3.00   1st Qu.:3.00  
 Median : 450000   Median :2.25   Median : 3.00   Median :3.00  
 Mean   : 540088   Mean   :2.12   Mean   : 3.37   Mean   :3.41  
 3rd Qu.: 645000   3rd Qu.:2.50   3rd Qu.: 4.00   3rd Qu.:4.00  
 Max.   :7700000   Max.   :8.00   Max.   :33.00   Max.   :5.00  
                                                                
     floors         grade            lat            long     
 Min.   :1.00   Min.   : 1.00   Min.   :47.2   Min.   :-123  
 1st Qu.:1.00   1st Qu.: 7.00   1st Qu.:47.5   1st Qu.:-122  
 Median :1.50   Median : 7.00   Median :47.6   Median :-122  
 Mean   :1.49   Mean   : 7.66   Mean   :47.6   Mean   :-122  
 3rd Qu.:2.00   3rd Qu.: 8.00   3rd Qu.:47.7   3rd Qu.:-122  
 Max.   :3.50   Max.   :13.00   Max.   :47.8   Max.   :-121  
                                                             
   sqft_above   sqft_basement    sqft_living    sqft_living15 
 Min.   : 290   Min.   :  10    Min.   :  290   Min.   : 399  
 1st Qu.:1190   1st Qu.: 450    1st Qu.: 1427   1st Qu.:1490  
 Median :1560   Median : 700    Median : 1910   Median :1840  
 Mean   :1788   Mean   : 742    Mean   : 2080   Mean   :1987  
 3rd Qu.:2210   3rd Qu.: 980    3rd Qu.: 2550   3rd Qu.:2360  
 Max.   :9410   Max.   :4820    Max.   :13540   Max.   :6210  
                NA's   :13126                                 
    sqft_lot         sqft_lot15          view       waterfront     
 Min.   :    520   Min.   :   651   Min.   :0.000   Mode :logical  
 1st Qu.:   5040   1st Qu.:  5100   1st Qu.:0.000   FALSE:21450    
 Median :   7618   Median :  7620   Median :0.000   TRUE :163      
 Mean   :  15107   Mean   : 12768   Mean   :0.234                  
 3rd Qu.:  10688   3rd Qu.: 10083   3rd Qu.:0.000                  
 Max.   :1651359   Max.   :871200   Max.   :4.000                  
                                                                   
    yr_built     yr_renovated      zipcode     
 Min.   :1900   Min.   :1934    Min.   :98001  
 1st Qu.:1951   1st Qu.:1987    1st Qu.:98033  
 Median :1975   Median :2000    Median :98065  
 Mean   :1971   Mean   :1996    Mean   :98078  
 3rd Qu.:1997   3rd Qu.:2007    3rd Qu.:98118  
 Max.   :2015   Max.   :2015    Max.   :98199  
                NA's   :20699

The zipcode should not be interpreted as a numeric value, so we cast it to a factor. We could argue to remove lat and long as handling them as linear effects is not necessarily a suitable, but we will keep them since glmnet performs internal feature selection anyways.

zipencode = po("mutate", mutation = list(zipcode = ~ as.factor(zipcode)), id = "zipencode")

Build a suitable pipeline that allows glmnet to be trained on the dataset. Construct a new glmnet model with ppl("robustify"). Compare the two pipelines in a benchmark experiment.

lrn_glmnet = lrn("regr.glmnet")

graph_preproc =
  zipencode %>>%
  po("fixfactors") %>>%
  po("encodeimpact") %>>%
  list(
    po("missind", type = "integer", affect_columns = selector_type("integer")),
    po("imputehist", affect_columns = selector_type("integer"))) %>>%
  po("featureunion") %>>%
  po("imputeoor", affect_columns = selector_type("factor")) %>>%
  lrn_glmnet

graph_preproc$plot()

glmnet does not support factors or missing values. So our pipeline needs to handle both. First we fix the factor levels to ensure that all 70 zipcodes are fixed. We can consider 70 levels high cardinality, so we use impact encoding. We use the same imputation strategy as in Chapter 9.

graph_robustify =
  pipeline_robustify(task = task, learner = lrn_glmnet) %>>%
  lrn_glmnet

graph_robustify$plot()

glrn_preproc = as_learner(graph_preproc, id = "glmnet_preproc")
glrn_robustify = as_learner(graph_robustify, id = "glmnet_robustify")

design = benchmark_grid(
  tasks = task,
  learners = list(glrn_preproc, glrn_robustify),
  resamplings = rsmp("cv", folds = 3)
)

bmr = benchmark(design)

Warning: Multiple lambdas have been fit. Lambda will be set to 0.01 (see parameter 's').
This happened PipeOp regr.glmnet's $predict()
Warning: Multiple lambdas have been fit. Lambda will be set to 0.01 (see parameter 's').
This happened PipeOp regr.glmnet's $predict()
Warning: Multiple lambdas have been fit. Lambda will be set to 0.01 (see parameter 's').
This happened PipeOp regr.glmnet's $predict()
Warning: Multiple lambdas have been fit. Lambda will be set to 0.01 (see parameter 's').
This happened PipeOp regr.glmnet's $predict()
Warning: Multiple lambdas have been fit. Lambda will be set to 0.01 (see parameter 's').
This happened PipeOp regr.glmnet's $predict()
Warning: Multiple lambdas have been fit. Lambda will be set to 0.01 (see parameter 's').
This happened PipeOp regr.glmnet's $predict()

bmr$aggregate(msr("regr.rmse"))

   nr    task_id
1:  1 kc_housing
2:  2 kc_housing
4 variable(s) not shown: [learner_id, resampling_id, iters, regr.rmse]
Hidden columns: resample_result

Our preprocessing pipeline performs slightly better than the robustified one.

Now consider the date feature: How can you extract information from this feature in a way that glmnet can use? Does this improve the performance of your pipeline? Finally, consider the spatial nature of the dataset. Can you extract an additional feature from the lat / long coordinates? (Hint: Downtown Seattle has lat/long coordinates 47.605/122.334).

task = tsk("kc_housing")

graph_mutate =
  po("mutate", mutation = list(
    date = ~ as.numeric(date),
    distance_downtown = ~ sqrt((lat - 47.605)^2 + (long  + 122.334)^2))) %>>%
  zipencode %>>%
  po("encodeimpact") %>>%
  list(
    po("missind", type = "integer", affect_columns = selector_type("integer")),
    po("imputehist", affect_columns = selector_type("integer"))) %>>%
  po("featureunion") %>>%
  po("imputeoor", affect_columns = selector_type("factor")) %>>%
  lrn_glmnet

glrn_mutate = as_learner(graph_mutate)

design = benchmark_grid(
  tasks = task,
  learners = glrn_mutate,
  resamplings = rsmp("cv", folds = 3)
)

bmr_2 = benchmark(design)
bmr$combine(bmr_2)
bmr$aggregate(msr("regr.mae"))

   nr    task_id
1:  1 kc_housing
2:  2 kc_housing
3:  3 kc_housing
4 variable(s) not shown: [learner_id, resampling_id, iters, regr.mae]
Hidden columns: resample_result

We simply convert the date feature into a numeric timestamp so that glmnet can handle the feature. We create one additional feature as the distance to downtown Seattle. This improves the average error of our model by a further 1400$.

A.9 Solutions to Chapter 10

Consider the following example where you resample a learner (debug learner, sleeps for 3 seconds during $train()) on 4 workers using the multisession backend:

task = tsk("penguins")
learner = lrn("classif.debug", sleep_train = function() 3)
resampling = rsmp("cv", folds = 6)

future::plan("multisession", workers = 4)
resample(task, learner, resampling)

<ResampleResult> with 6 resampling iterations
  task_id    learner_id resampling_id iteration     prediction_test
 penguins classif.debug            cv         1 <PredictionClassif>
 penguins classif.debug            cv         2 <PredictionClassif>
 penguins classif.debug            cv         3 <PredictionClassif>
 penguins classif.debug            cv         4 <PredictionClassif>
 penguins classif.debug            cv         5 <PredictionClassif>
 penguins classif.debug            cv         6 <PredictionClassif>
2 variable(s) not shown: [warnings, errors]

Assuming that the learner would actually calculate something and not just sleep: Would all CPUs be busy?
Prove your point by measuring the elapsed time, e.g., using system.time().
What would you change in the setup and why?

Not all CPUs would be utilized for the whole duration. All 4 of them are occupied for the first 4 iterations of the cross-validation. The 5th iteration, however, only runs in parallel to the 6th fold, leaving 2 cores idle. This is supported by the elapsed time of roughly 6 seconds for 6 jobs compared to also roughly 6 seconds for 8 jobs:

task = tsk("penguins")
learner = lrn("classif.debug", sleep_train = function() 3)

future::plan("multisession", workers = 4)

resampling = rsmp("cv", folds = 6)
system.time(resample(task, learner, resampling))

   user  system elapsed 
  0.386   0.012   6.829

resampling = rsmp("cv", folds = 8)
system.time(resample(task, learner, resampling))

   user  system elapsed 
  0.447   0.008   6.718

If possible, the number of resampling iterations should be an integer multiple of the number of workers. Therefore, a simple adaptation either increases the number of folds for improved accuracy of the error estimate or reduces the number of folds for improved runtime.

Create a new custom binary classification measure which scores (“prob”-type) predictions. This measure should compute the absolute difference between the predicted probability for the positive class and a 0-1 encoding of the ground truth and then average these values across the test set. Test this with classif.log_reg on tsk("sonar").

The rules can easily be translated to R code where we first convert select the predicted probabilities for the positive class, 0-1 encode the truth vector and then calculate the mean absolute error between the two vectors.

mae_prob = function(truth, prob, task) {
  # retrieve positive class from task
  positive = task$positive
  # select positive class probabilities
  prob_positive = prob[, positive]
  # obtain 0-1 encoding of truth
  y = as.integer(truth == positive)
  # average the absolute difference
  mean(abs(prob_positive - y))
}

This function can be embedded in the Measure class accordingly.

MeasureMaeProb = R6::R6Class("MeasureMaeProb",
  inherit = mlr3::MeasureClassif, # classification measure
  public = list(
    initialize = function() { # initialize class
      super$initialize( # initialize method of parent class
        id = "mae_prob", # unique ID
        packages = character(), # no dependencies
        properties = "requires_task", # needs access to task for positive class
        predict_type = "prob", # measures probability prediction
        range = c(0, 1), # results in values between [0, 1]
        minimize = TRUE # smaller values are better
      )
    }
  ),

  private = list(
    .score = function(prediction, task, ...) { # define score as private method
      # call loss function
      mae_prob(prediction$truth, prediction$prob, task)
    }
  )
)

Because this is a custom class that is not available in the mlr_measures dictionary, we have to create a new instance using the $new() constructor.

msr_mae_prob = MeasureMaeProb$new()
msr_mae_prob

<MeasureMaeProb:mae_prob>
* Packages: mlr3
* Range: [0, 1]
* Minimize: TRUE
* Average: macro
* Parameters: list()
* Properties: requires_task
* Predict type: prob

To try this measure, we resample a logistic regression on the sonar task using five-fold cross-validation.

# predict_type is set to "prob", as otherwise our measure does not work
learner = lrn("classif.log_reg", predict_type = "prob")
task = tsk("sonar")
rr = resample(task, learner, rsmp("cv", folds = 5))

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

We now score the resample result using our custom measure and msr("classif.acc").

score = rr$score(list(msr_mae_prob, msr("classif.acc")))

In this case, there is a clear relationship between the classification accuracy and our custom measure, i.e. the higher the accuracy, the lower the mean absolute error of the predicted probabilities.

cor(score$mae_prob, score$classif.acc)

[1] -0.9982

“Tune” the error_train hyperparameter of the classif.debug learner on a continuous interval from 0 to 1, using a simple classification tree as the fallback learner and the penguins task. Tune for 50 iterations using random search and 10-fold cross-validation. Inspect the resulting archive and find out which evaluations resulted in an error, and which did not. Now do the same in the interval 0.3 to 0.7. Are your results surprising?

First, we create the learner that we want to tune, mark the relevant parameter for tuning and set the fallback learner to a classification tree.

lrn_debug = lrn("classif.debug",
  error_train = to_tune(0, 1)
)
lrn_debug$encapsulate("evaluate", fallback = lrn("classif.rpart"))
lrn_debug

<LearnerClassifDebug:classif.debug>: Debug Learner for Classification
* Model: -
* Parameters: error_train=<RangeTuneToken>
* Validate: NULL
* Packages: mlr3
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, character, factor,
  ordered
* Properties: hotstart_forward, internal_tuning, marshal,
  missings, multiclass, twoclass, validation

This example is unusual, because we expect better results from the fallback classification tree than from the primary debug learner, which predicts the mode of the target distribution. Nonetheless it serves as a good example to illustrate the effects of training errors on the tuning results.

We proceed with optimizing the classification accuracy of the learner on the penguins task.

instance = tune(
  learner =  lrn_debug,
  task = tsk("penguins"),
  resampling = rsmp("cv"),
  tuner = tnr("random_search"),
  measure = msr("classif.acc"),
  term_evals = 50
)

ERROR [10:46:48.145] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:49.439] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:49.525] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:50.039] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:50.138] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:50.263] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:50.380] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:50.502] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:52.270] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:52.603] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:52.725] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:52.872] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:53.103] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:53.357] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:53.480] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:53.700] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:53.824] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:53.947] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:54.075] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:54.453] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:55.535] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:56.134] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:56.257] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:56.486] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:56.875] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:57.585] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:57.819] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:57.942] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:58.309] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:58.652] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:58.765] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:58.873] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:58.983] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:59.105] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:59.228] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:59.350] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:59.472] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:59.596] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:59.720] [mlr3] train: Error from classif.debug->train()
ERROR [10:46:59.951] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:00.062] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:00.174] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:00.534] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:00.927] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:03.026] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:03.880] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:03.993] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:04.102] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:04.213] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:04.337] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:04.459] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:04.583] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:04.974] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:05.308] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:05.418] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:05.530] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:05.775] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:05.898] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:06.024] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:06.275] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:06.517] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:06.632] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:06.855] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:06.978] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:07.105] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:07.239] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:07.361] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:07.482] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:07.608] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:07.830] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:07.943] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:08.057] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:08.660] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:09.133] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:09.244] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:09.354] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:09.465] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:09.588] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:09.711] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:09.836] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:09.965] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:10.088] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:10.214] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:10.904] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:11.150] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:11.273] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:11.747] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:11.859] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:11.969] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:12.084] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:12.327] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:12.448] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:13.044] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:13.181] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:13.293] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:13.403] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:13.525] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:13.647] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:13.770] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:13.893] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:14.014] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:14.138] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:14.472] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:14.583] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:14.818] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:14.942] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:15.186] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:15.309] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:15.659] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:16.338] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:16.930] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:17.041] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:17.155] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:17.387] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:17.511] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:17.633] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:17.757] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:17.999] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:18.337] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:18.456] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:18.568] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:18.814] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:18.944] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:19.208] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:19.331] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:20.855] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:20.966] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:21.081] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:21.190] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:21.322] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:21.445] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:21.568] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:21.692] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:21.816] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:22.400] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:22.753] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:22.999] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:23.244] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:23.577] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:24.182] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:26.090] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:26.201] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:26.311] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:26.430] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:26.559] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:26.683] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:26.939] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:27.421] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:27.755] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:27.881] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:28.003] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:28.372] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:28.497] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:29.209] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:30.079] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:30.310] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:30.418] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:30.542] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:30.788] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:30.912] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:31.034] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:31.385] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:32.276] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:32.769] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:33.237] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:33.349] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:33.815] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:33.940] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:34.757] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:34.867] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:34.990] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:35.830] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:35.942] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:36.163] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:36.283] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:36.406] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:36.528] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:36.651] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:36.773] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:36.898] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:37.124] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:37.457] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:38.410] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:38.520] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:38.634] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:38.743] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:38.987] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:39.107] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:39.229] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:39.355] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:39.480] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:39.927] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:40.156] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:40.397] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:40.639] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:40.761] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:41.824] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:42.417] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:42.534] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:43.728] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:44.559] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:44.681] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:44.908] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:45.018] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:45.127] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:45.235] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:45.355] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:45.477] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:45.721] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:45.841] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:45.967] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:46.189] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:47.250] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:47.587] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:47.696] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:48.047] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:48.168] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:48.290] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:48.414] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:48.532] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:48.865] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:49.327] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:50.050] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:50.167] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:50.278] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:50.388] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:50.632] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:50.753] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:50.876] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:50.998] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:51.123] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:51.348] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:51.460] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:51.571] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:51.684] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:51.805] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:51.927] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:52.049] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:52.294] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:52.418] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:52.646] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:52.757] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:52.865] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:52.975] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:53.100] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:53.219] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:53.584] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:53.706] [mlr3] train: Error from classif.debug->train()

instance

<TuningInstanceBatchSingleCrit>
* State:  Optimized
* Objective: <ObjectiveTuningBatch:classif.debug_on_penguins>
* Search Space:
            id    class lower upper nlevels
1: error_train ParamDbl     0     1     Inf
* Terminator: <TerminatorEvals>
* Result:
   error_train classif.acc
1:      0.9832      0.9478
* Archive:
    error_train classif.acc
 1:     0.10572      0.3834
 2:     0.77347      0.7657
 3:     0.03411      0.3927
 4:     0.47503      0.6554
 5:     0.67638      0.7586
---                        
46:     0.67548      0.7661
47:     0.12612      0.4915
48:     0.75091      0.8743
49:     0.92625      0.9008
50:     0.93283      0.8008

To find out which evaluations resulted in an error, we can inspect the $archive slot of the instance, which we convert to a data.table for easier filtering.

archive = as.data.table(instance$archive)
archive[, c("error_train", "classif.acc", "errors")]

    error_train classif.acc errors
 1:     0.10572      0.3834      1
 2:     0.77347      0.7657      7
 3:     0.03411      0.3927      0
 4:     0.47503      0.6554      5
 5:     0.67638      0.7586      7
---                               
46:     0.67548      0.7661      7
47:     0.12612      0.4915      2
48:     0.75091      0.8743      9
49:     0.92625      0.9008      9
50:     0.93283      0.8008      8

Below, we visualize the relationship between the error probabilty and the classification accuracy.

ggplot(data = archive, aes(x = error_train, y = classif.acc, color = errors)) +
  geom_point() +
  theme_minimal()

Higher values for error_train lead to more resampling iterations using the classification tree fallback learner and therefore to better classification accuracies. Therefore, the best found hyperparameter configurations will tend to have values of error_train close to 1. When multiple parameter configurations have the same test performance, the first one is chosen by $result_learner_param_vals.

instance$result_learner_param_vals

$error_train
[1] 0.9832

We repeat the same experiment for the tuning interval from 0.3 to 0.7.

lrn_debug$param_set$set_values(
  error_train = to_tune(0.3, 0.7)
)

instance2 = tune(
  learner =  lrn_debug,
  task = tsk("penguins"),
  resampling = rsmp("cv"),
  tuner = tnr("random_search"),
  measure = msr("classif.acc"),
  term_evals = 50
)

ERROR [10:47:57.774] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:57.886] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:58.105] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:58.713] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:59.178] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:59.289] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:59.404] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:59.648] [mlr3] train: Error from classif.debug->train()
ERROR [10:47:59.893] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:00.139] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:00.474] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:00.584] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:01.194] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:01.317] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:01.444] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:01.890] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:02.120] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:02.242] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:02.365] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:02.951] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:03.062] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:03.279] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:03.400] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:03.524] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:03.770] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:03.893] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:04.240] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:04.358] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:04.468] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:04.699] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:05.066] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:05.188] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:05.311] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:05.659] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:06.130] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:06.378] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:07.465] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:07.586] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:08.185] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:08.410] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:08.645] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:08.768] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:08.889] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:09.014] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:09.135] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:09.259] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:09.935] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:10.057] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:10.302] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:10.423] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:10.546] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:11.105] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:11.227] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:11.349] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:11.834] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:12.275] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:12.750] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:12.874] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:12.992] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:13.338] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:14.030] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:14.271] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:14.411] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:14.856] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:15.209] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:15.572] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:15.982] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:16.092] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:16.205] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:16.316] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:16.438] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:16.559] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:16.685] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:17.293] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:17.404] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:17.518] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:17.633] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:17.755] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:17.875] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:17.998] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:18.120] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:18.914] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:19.155] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:19.278] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:19.642] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:20.197] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:20.315] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:20.559] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:20.680] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:20.924] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:21.151] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:21.727] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:21.849] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:22.090] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:22.434] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:22.542] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:23.003] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:23.489] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:23.722] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:24.055] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:24.672] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:24.793] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:25.017] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:25.359] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:25.482] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:25.608] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:25.739] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:25.858] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:25.982] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:26.544] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:26.896] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:27.018] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:27.259] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:27.380] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:27.604] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:27.712] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:27.825] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:27.935] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:28.057] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:28.178] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:28.539] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:28.660] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:28.993] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:29.215] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:29.337] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:29.457] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:29.581] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:29.942] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:30.168] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:30.274] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:30.386] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:30.736] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:30.859] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:30.980] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:31.101] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:31.223] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:31.683] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:31.917] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:32.041] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:32.285] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:32.410] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:32.531] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:32.871] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:32.988] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:33.349] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:33.471] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:33.592] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:33.714] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:33.836] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:34.057] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:34.278] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:34.633] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:34.876] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:35.340] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:35.450] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:35.792] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:35.910] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:36.032] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:36.400] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:36.622] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:36.730] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:36.951] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:37.073] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:37.314] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:37.676] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:38.033] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:38.982] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:39.220] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:39.329] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:39.553] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:39.799] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:40.045] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:40.168] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:40.289] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:40.632] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:40.973] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:41.216] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:41.339] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:41.459] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:41.579] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:41.806] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:42.510] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:42.873] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:43.318] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:43.918] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:44.388] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:44.687] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:45.424] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:45.655] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:45.875] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:45.985] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:46.228] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:46.360] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:46.488] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:46.611] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:46.736] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:46.998] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:47.112] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:47.462] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:47.584] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:48.312] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:48.885] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:49.009] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:49.131] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:49.251] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:49.602] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:49.820] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:50.051] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:50.420] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:51.112] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:51.221] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:51.465] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:51.830] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:52.174] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:52.286] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:52.396] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:52.505] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:52.626] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:52.869] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:53.582] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:53.805] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:54.055] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:54.180] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:54.306] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:54.427] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:54.549] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:54.777] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:54.886] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:55.112] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:55.232] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:55.354] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:55.596] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:56.534] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:57.025] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:57.149] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:57.837] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:57.958] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:58.202] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:58.324] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:58.448] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:58.677] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:58.892] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:59.249] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:59.368] [mlr3] train: Error from classif.debug->train()
ERROR [10:48:59.489] [mlr3] train: Error from classif.debug->train()
ERROR [10:49:00.083] [mlr3] train: Error from classif.debug->train()
ERROR [10:49:00.567] [mlr3] train: Error from classif.debug->train()
ERROR [10:49:00.810] [mlr3] train: Error from classif.debug->train()
ERROR [10:49:00.931] [mlr3] train: Error from classif.debug->train()
ERROR [10:49:01.402] [mlr3] train: Error from classif.debug->train()
ERROR [10:49:01.733] [mlr3] train: Error from classif.debug->train()
ERROR [10:49:01.982] [mlr3] train: Error from classif.debug->train()
ERROR [10:49:02.357] [mlr3] train: Error from classif.debug->train()

archive2 = as.data.table(instance2$archive)
instance2

<TuningInstanceBatchSingleCrit>
* State:  Optimized
* Objective: <ObjectiveTuningBatch:classif.debug_on_penguins>
* Search Space:
            id    class lower upper nlevels
1: error_train ParamDbl   0.3   0.7     Inf
* Terminator: <TerminatorEvals>
* Result:
   error_train classif.acc
1:      0.5705      0.8363
* Archive:
    error_train classif.acc
 1:      0.5166      0.5760
 2:      0.5456      0.6730
 3:      0.5298      0.6713
 4:      0.5146      0.6349
 5:      0.6054      0.7903
---                        
46:      0.5060      0.5566
47:      0.5110      0.6746
48:      0.3964      0.6745
49:      0.3376      0.5246
50:      0.5408      0.5779

As before, higher error probabilities during training lead to higher classification accuracies.

ggplot(data = archive2, aes(x = error_train, y = classif.acc, color = errors)) +
  geom_point() +
  theme_minimal()

However, the best found configurations for the error_train parameter, now tend to be close to 0.7 instead of 1 as before.

instance2$result_learner_param_vals

$error_train
[1] 0.5705

This demonstrates that when utilizing a fallback learner, the tuning results are influenced not only by the direct impact of the tuning parameters on the primary learner but also by their effect on its error probability. Therefore, it is always advisable to manually inspect the tuning results afterward. Note that in most real-world scenarios, the fallback learner performs worse than the primary learner, and thus the effects illustrated here are usually reversed.

A.10 Solutions to Chapter 11

Load the OpenML collection with ID 269, which contains regression tasks from the AutoML benchmark (Gijsbers et al. 2022). Peek into this suite to study the contained data sets and their characteristics. Then find all tasks with less than 4000 observations and convert them to mlr3 tasks.

We access the AutoML benchmark suite with ID 269 using the ocl() function.

library(mlr3oml)
automl_suite = ocl(id = 269)
automl_suite$task_ids

To create a summary of the underlying datasets, we pass their IDs to list_oml_data().

data_tbl = list_oml_data(automl_suite$data_ids)
data_tbl[, c("data_id", "name", "NumberOfInstances")]

    data_id                    name NumberOfInstances
 1:     201                     pol             15000
 2:     216               elevators             16599
 3:     287            wine_quality              6497
 4:     416               yprop_4_1              8885
 5:     422                topo_2_1              8885
---                                                  
29:   42728   Airlines_DepDelay_10M          10000000
30:   42729 nyc-taxi-green-dec-2016            581835
31:   42730                us_crime              1994
32:   42731             house_sales             21613
33:   43071     MIP-2016-regression              1090

To find those datasets with up to 4000 observations, we can simply filter the table.

data_tbl = data_tbl[NumberOfInstances < 4000, ]

Alternatively, the list_oml_tasks() also allows to filter OpenML tasks by their characteristics.

task_tbl = list_oml_tasks(
  task_id = automl_suite$task_ids, number_instances = c(0, 4000)
)

The resulting table contains matching OpenML tasks from the AutoML benchmark suite.

task_tbl[, .(task_id, data_id, name, NumberOfInstances)]

    task_id data_id                 name NumberOfInstances
 1:  167210   41021            Moneyball              1232
 2:  359930     550                quake              2178
 3:  359931     546              sensory               576
 4:  359932     541               socmob              1156
 5:  359933     507             space_ga              3107
 6:  359934     505              tecator               240
 7:  359945   42730             us_crime              1994
 8:  359950     531               boston               506
 9:  359951   42563 house_prices_nominal              1460
10:  360945   43071  MIP-2016-regression              1090

We create mlr3 tasks from these OpenML IDs using tsk("oml").

tasks = lapply(task_tbl$task_id, function(id) tsk("oml", task_id = id))

tasks[[1]]

<TaskRegr:Moneyball> (1232 x 15)
* Target: RS
* Properties: -
* Features (14):
  - fct (6): G, League, Playoffs, RankPlayoffs, RankSeason, Team
  - dbl (5): BA, OBP, OOBP, OSLG, SLG
  - int (3): RA, W, Year

Create an experimental design that compares lrn("regr.ranger") and lrn("regr.rpart") on those tasks. Use the robustify pipeline for both learners and a featureless fallback learner. You can use three-fold CV instead of the OpenML resamplings to save time. Run the comparison experiments with batchtools. Use default hyperparameter settings and do not perform any tuning to keep the experiments simple.

lrn_ranger = as_learner(
  ppl("robustify", learner = lrn("regr.ranger")) %>>%
    po("learner", lrn("regr.ranger"))
)
lrn_ranger$id = "ranger"
lrn_ranger$encapsulate("evaluate", fallback = lrn("regr.featureless"))

lrn_rpart = as_learner(
  ppl("robustify", learner = lrn("regr.rpart")) %>>%
    po("learner", lrn("regr.rpart"))
)
lrn_rpart$id = "rpart"
lrn_rpart$encapsulate("evaluate", fallback = lrn("regr.featureless"))

learners = list(lrn_ranger, lrn_rpart)

We set a seed before calling benchmark_grid() as this instantiates the resamplings, which is stochastic.

set.seed(123)
resampling = rsmp("cv", folds = 3)
design = benchmark_grid(tasks, learners, resampling)
design

                    task learner resampling
 1:            Moneyball  ranger         cv
 2:            Moneyball   rpart         cv
 3:                quake  ranger         cv
 4:                quake   rpart         cv
 5:              sensory  ranger         cv
---                                        
16:               boston   rpart         cv
17: house_prices_nominal  ranger         cv
18: house_prices_nominal   rpart         cv
19:  MIP-2016-regression  ranger         cv
20:  MIP-2016-regression   rpart         cv

To execute this benchmark design using mlr3batchmark we start by creating and configuring an experiment registry. We set file.dir = NA to use a temporary directory for the registry.

library(mlr3batchmark)

Loading required package: batchtools


Attaching package: 'batchtools'

The following object is masked from 'package:mlr3misc':

    chunk

library(batchtools)

reg = makeExperimentRegistry(
  file.dir = NA,
  seed = 1,
  packages = "mlr3verse"
)

No readable configuration file found

Created registry in '/tmp/RtmpK2u44B/registry157a1b090b1b' using cluster functions 'Interactive'

The next two steps are to populate the registry with the experiments using batchmark() and to submit them. By specifying no IDs in submitJobs(), all jobs returned by findNotSubmitted() are queued, which in this case are all existing jobs.

batchmark(design, reg = reg)
submitJobs(reg = reg)
waitForJobs(reg = reg)

After the execution of the experiment finished we can load the results as a BenchmarkResult.

bmr = reduceResultsBatchmark(reg = reg)
bmr$aggregate(msr("regr.mse"))

    nr              task_id learner_id resampling_id iters  regr.mse
 1:  1            Moneyball     ranger            cv     3 6.699e+02
 2:  2            Moneyball      rpart            cv     3 1.357e+03
 3:  3                quake     ranger            cv     3 3.827e-02
 4:  4                quake      rpart            cv     3 3.567e-02
 5:  5              sensory     ranger            cv     3 5.072e-01
---                                                                 
16: 16               boston      rpart            cv     3 2.293e+01
17: 17 house_prices_nominal     ranger            cv     3 1.045e+09
18: 18 house_prices_nominal      rpart            cv     3 1.949e+09
19: 19  MIP-2016-regression     ranger            cv     3 7.505e+08
20: 20  MIP-2016-regression      rpart            cv     3 6.277e+08
Hidden columns: resample_result

Conduct a global Friedman test and, if appropriate, post hoc Friedman-Nemenyi tests, and interpret the results. As an evaluation measure, use the MSE.

First, we load the mlr3benchmark package and create a BenchmarkAggr from the benchmark result using msr("regr.mse").

library(mlr3benchmark)
bma = as_benchmark_aggr(bmr, measures = msr("regr.mse"))
bma

<BenchmarkAggr> of 20 rows with 10 tasks, 2 learners and 1 measure
                 task_id learner_id       mse
 1:            Moneyball     ranger 6.699e+02
 2:            Moneyball      rpart 1.357e+03
 3:                quake     ranger 3.827e-02
 4:                quake      rpart 3.567e-02
 5:              sensory     ranger 5.072e-01
---                                          
16:               boston      rpart 2.293e+01
17: house_prices_nominal     ranger 1.045e+09
18: house_prices_nominal      rpart 1.949e+09
19:  MIP-2016-regression     ranger 7.505e+08
20:  MIP-2016-regression      rpart 6.277e+08

We can also visualize this result using the autoplot() function.

autoplot(bma)

Below, we conduct a global Friedman test. Note that a post-hoc test is not needed because we are only comparing two algorithms.

bma$friedman_test()


    Friedman rank sum test

data:  mse and learner_id and task_id
Friedman chi-squared = 1.6, df = 1, p-value = 0.2

This experimental design was not able to detect a significant difference on the 5% level so we cannot reject our null hypothesis that the regression tree performs equally well as the random forest.

A.11 Solutions to Chapter 12

Prepare a mlr3 regression task for fifa data. Select only variables describing the age and skills of footballers. Train any predictive model for this task, e.g. lrn("regr.ranger").

library(DALEX)
library(ggplot2)
data("fifa", package = "DALEX")
old_theme = set_theme_dalex("ema")

library(mlr3)
library(mlr3learners)
set.seed(1)

feat_of_interest = c("age", "skill_ball_control", "skill_curve", "skill_dribbling",  "skill_fk_accuracy", "skill_long_passing", "value_eur")
fifa20 = fifa[,feat_of_interest]

task_fifa = as_task_regr(fifa20, target = "value_eur", id = "fifa20")

learner = lrn("regr.ranger")
learner$train(task_fifa)
learner$model

Ranger result

Call:
 ranger::ranger(dependent.variable.name = task$target_names, data = task$data(),      case.weights = task$weights$weight, num.threads = 1L) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      5000 
Number of independent variables:  6 
Mtry:                             2 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       3.404e+13 
R squared (OOB):                  0.5671

Use the permutation importance method to calculate feature importance ranking. Which feature is the most important? Do you find the results surprising?

With iml

library(iml)
model = Predictor$new(learner,
                data = fifa20,
                y = "value_eur")

effect = FeatureImp$new(model,
                loss = "rmse")
effect$plot()

With DALEX

library(DALEX)
ranger_exp = DALEX::explain(learner,
  data = fifa20[, setdiff(names(fifa20), "value_eur")],
  y = fifa$value_eur,
  label = "Fifa 2020",
  verbose = FALSE)

ranger_effect = model_parts(ranger_exp, B = 5)
head(ranger_effect)

            variable mean_dropout_loss     label
1       _full_model_           2897186 Fifa 2020
2                age           4658209 Fifa 2020
3        skill_curve           4842955 Fifa 2020
4  skill_fk_accuracy           5167775 Fifa 2020
5    skill_dribbling           5747754 Fifa 2020
6 skill_long_passing           5947734 Fifa 2020

plot(ranger_effect)

Use the partial dependence plot/profile to draw the global behavior of the model for this feature. Is it aligned with your expectations?

With iml

impfeat = c("skill_ball_control")

effect = FeatureEffects$new(model, features = impfeat)
plot(effect)

With DALEX

impfeat = c("skill_ball_control")

ranger_profiles = model_profile(ranger_exp, variables = impfeat)
plot(ranger_profiles)

Choose Robert Lewandowski as a specific example and calculate and plot the Shapley values. Which feature is locally the most important and has the strongest influence on his valuation as a soccer player?

player_1 = fifa20["R. Lewandowski",]

With iml

shapley = Shapley$new(model, x.interest = player_1)
plot(shapley)

With DALEX

ranger_shap = predict_parts(ranger_exp,
             new_observation = player_1,
             type = "shap", B = 1)
plot(ranger_shap, show_boxplots = FALSE)

A.12 Solutions to Chapter 13

Run a benchmark experiment on tsk("german_credit") with lrn("classif.featureless"), lrn("classif.log_reg"), and lrn("classif.ranger"). Tune the prediction thresholds of all learners by encapsulating them in a po("learner_cv") (with two-fold CV), followed by a po("tunethreshold"). Use msr("classif.costs", costs = costs), where the costs matrix is as follows: true positive is -10, true negative is -1, false positive is 2, and false negative is 3. Use this measure in po("tunethreshold") and when evaluating your benchmark experiment.

set.seed(1)
# Load task and learners
tsk_german = tsk("german_credit")
learners = lrns(c("classif.featureless", "classif.log_reg",
  "classif.ranger"), predict_type = "prob")

# Create costs matrix
costs = matrix(c(-10, 3, 2, -1), nrow = 2,
  dimnames = list("Predicted Credit" = c("good", "bad"),
                    Truth = c("good", "bad")))
costs

                Truth
Predicted Credit good bad
            good  -10   2
            bad     3  -1

Our cost matrix is as expected so we can plug it into our measure and setup our pipeline.

# Create measure
meas_costs = msr("classif.costs", costs = costs)

# Create a function to wrap learners in internal cross-validation
#  to tune the threshold
pipe = function(l) {
  po("learner_cv", l, resampling.folds = 2) %>>%
    po("tunethreshold", measure = meas_costs)
}

# Benchmark
learners = lapply(learners, pipe)
design = benchmark_grid(tsk_german, learners, rsmp("holdout"))
bmr = benchmark(design)$aggregate(meas_costs)

OptimInstanceSingleCrit is deprecated. Use OptimInstanceBatchSingleCrit instead.
OptimInstanceSingleCrit is deprecated. Use OptimInstanceBatchSingleCrit instead.
OptimInstanceSingleCrit is deprecated. Use OptimInstanceBatchSingleCrit instead.

Now exploring our results…

bmr[, .(learner_id, classif.costs)]

                          learner_id classif.costs
1: classif.featureless.tunethreshold        -6.180
2:     classif.log_reg.tunethreshold        -6.216
3:      classif.ranger.tunethreshold        -6.180

Based on these results, the logistic regression performs the best with the greatest increase to costs, however the difference is only marginal compared to the other learners.

Train and test a survival forest using lrn("surv.rfsrc") (from mlr3extralearners). Run this experiment using tsk("rats") and partition(). Evaluate your model with the RCLL measure.

# Get learners
library(mlr3extralearners)
# Get survival models
library(mlr3proba)
set.seed(1)
# Use partition to split data and test our model
tsk_rats = tsk("rats")
splits = partition(tsk_rats)
learner = lrn("surv.rfsrc")
prediction = learner$train(tsk_rats, splits$train)$predict(tsk_rats, splits$test)
prediction$score(msr("surv.rcll"))

surv.rcll 
    2.726

The right-censored logloss provides a measure of predictive accuracy, but it is quite hard to interpret it without comparison to another model. To yield a more informative value, we could either compute the RCLL for an uninformed baseline like the Kaplan-Meier estimator, or we could use the ERV (explained residual variation) parameter in the measure, which returns the RCLL as a percentage increase in performance compared to an uninformed baseline (in this case the Kaplan-Meier estimator):

lrn("surv.kaplan")$
  train(tsk_rats, splits$train)$
  predict(tsk_rats, splits$test)$
  score(msr("surv.rcll"))

surv.rcll 
    2.789

prediction$score(msr("surv.rcll", ERV = TRUE),
  task = tsk_rats, train_set = splits$train)

surv.rcll 
   0.0226

Now we can see that our model is only marginally better than the Kaplan-Meier baseline (a 2% performance increase).

Estimate the density of the “precip” task from the mlr3proba package using lrn("dens.hist"), evaluate your estimation with the logloss measure. As a stretch goal, look into the documentation of distr6 to learn how to analyse your estimated distribution further.

# Get density models
library(mlr3proba)
set.seed(1)
# Run experiment
tsk_precip = tsk("precip")
learner = lrn("dens.hist")
prediction = learner$train(tsk_precip)$predict(tsk_precip)
prediction

<PredictionDens> for 70 observations:
 row_ids      pdf    cdf
       1 0.001429 0.9957
       2 0.007143 0.9479
       3 0.005714 0.0400
     ---      ---    ---
      68 0.007143 0.2507
      69 0.012857 0.1163
      70 0.007143 0.9800
1 variable(s) not shown: [distr]

prediction$score(msr("dens.logloss"))

dens.logloss 
       3.896

As before the logloss is not too informative by itself but as the Histogram is itself a baseline, we can use this value for comparison to more sophisticated models. To learn more about our predicted distribution, we could use distr6 to summarise the distribution and to compute values such as the pdf and cdf:

prediction$distr$summary()

Histogram Estimator
Support: [0,70]     Scientific Type: ℝ 

Traits:     continuous; univariate
Properties: asymmetric

# pdf evaluated at `50`
prediction$distr$pdf(50)

[1] 0.007143

Run a benchmark clustering experiment on the “wine” dataset without a label column. Compare the performance of k-means learner with k equal to 2, 3 and 4 using the silhouette measure and the insample resampling technique. What value of k would you choose based on the silhouette scores?

set.seed(1)
# Load clustering models and tasks
library(mlr3cluster)
# Create the clustering dataset by extracting only the features from the
#  wine task
tsk_wine = tsk("wine")
tsk_wine = as_task_clust(tsk_wine$data(cols = tsk_wine$feature_names))
# Create learners and ensure they have unique IDs
learners = c(
  lrn("clust.kmeans", centers = 2, id = "K=2"),
  lrn("clust.kmeans", centers = 3, id = "K=3"),
  lrn("clust.kmeans", centers = 4, id = "K=4")
)
# Benchmark
meas = msr("clust.silhouette")
design = benchmark_grid(tsk_wine, learners, rsmp("insample"))
benchmark(design)$aggregate(meas)[, .(learner_id, clust.silhouette)]

   learner_id clust.silhouette
1:        K=2           0.6569
2:        K=3           0.5711
3:        K=4           0.5606

We can see that we get the silhouette closest to 1 with K=2 so we might use this value for future experiments.

A.13 Solutions to Chapter 14

Train a model of your choice on tsk("adult_train") and test it on tsk("adult_test"), use any measure of your choice to evaluate your predictions. Assume our goal is to achieve parity in false omission rates across the protected ‘sex’ attribute. Construct a fairness metric that encodes this and evaluate your model. To get a deeper understanding, look at the groupwise_metrics function to obtain performance in each group.

For now we simply load the data and look at the data.

library(mlr3)
library(mlr3fairness)
set.seed(8)

tsk_adult_train = tsk("adult_train")
tsk_adult_test = tsk("adult_test")
tsk_adult_train

<TaskClassif:adult_train> (30718 x 13)
* Target: target
* Properties: twoclass
* Features (12):
  - fct (7): education, marital_status, occupation, race,
    relationship, sex, workclass
  - int (5): age, capital_gain, capital_loss, education_num,
    hours_per_week
* Protected attribute: sex

We can now train a simple model, e.g., a decision tree and evaluate for accuracy.

learner = lrn("classif.rpart")
learner$train(tsk_adult_train)
prediction = learner$predict(tsk_adult_test)
prediction$score()

classif.ce 
     0.161

The false omission rate parity metric is available via the key "fairness.fomr". Note, that evaluating our prediction now requires that we also provide the task.

msr_1 = msr("fairness.fomr")
prediction$score(msr_1, tsk_adult_test)

fairness.fomr 
      0.03533

In addition, we can look at false omission rates in each group. The groupwise_metrics function creates a metric for each group specified in the pta column role:

tsk_adult_test$col_roles$pta

[1] "sex"

We can then use this metric to evaluate our model again. This gives us the false omission rates for male and female individuals separately.

msr_2 = groupwise_metrics(base_measure = msr("classif.fomr"), task = tsk_adult_test)
prediction$score(msr_2, tsk_adult_test)

  subgroup.fomr_Male subgroup.fomr_Female 
              0.2442               0.2089

Improve your model by employing pipelines that use pre- or post-processing methods for fairness. Evaluate your model along the two metrics and visualize the resulting metrics. Compare the different models using an appropriate visualization.

First we can again construct the learners above.

library(mlr3pipelines)
lrn_1 = po("reweighing_wts") %>>% lrn("classif.rpart")
lrn_2 = po("learner_cv", lrn("classif.rpart")) %>>%
  po("EOd")

And run the benchmark again. Note, that we use three-fold CV this time for comparison.

learners = list(learner, lrn_1, lrn_2)
design = benchmark_grid(tsk_adult_train, learners, rsmp("cv", folds = 3L))
bmr = benchmark(design)
bmr$aggregate(msrs(c("classif.acc", "fairness.fomr")))

   nr     task_id                   learner_id resampling_id iters
1:  1 adult_train                classif.rpart            cv     3
2:  2 adult_train reweighing_wts.classif.rpart            cv     3
3:  3 adult_train            classif.rpart.EOd            cv     3
2 variable(s) not shown: [classif.acc, fairness.fomr]
Hidden columns: resample_result

We can now again visualize the result.

library(ggplot2)
fairness_accuracy_tradeoff(bmr, msr("fairness.fomr")) +
  scale_color_viridis_d("Learner") +
  theme_minimal()

We can notice two main results:

We do not improve in the false omission rate by using fairness interventions. One reason might be, that the interventions chosen do not optimize for the false omission rate, but other metrics, e.g. equalized odds.
The spread between the different cross-validation iterations (small dots) is quite large, estimates might come with a considerable error.

Add “race” as a second sensitive attribute to your dataset. Add the information to your task and evaluate the initial model again. What changes? Again study the groupwise_metrics.

This can be achieved by adding “race” to the "pta" col_role.

tsk_adult_train$set_col_roles("race", add_to = "pta")
tsk_adult_train

<TaskClassif:adult_train> (30718 x 13)
* Target: target
* Properties: twoclass
* Features (12):
  - fct (7): education, marital_status, occupation, race,
    relationship, sex, workclass
  - int (5): age, capital_gain, capital_loss, education_num,
    hours_per_week
* Protected attribute: sex, race

tsk_adult_test$set_col_roles("race", add_to = "pta")
prediction$score(msr_1, tsk_adult_test)

fairness.fomr 
       0.8889

Evaluating for the intersection, we obtain a large deviation from 0. Note, that the metric by default computes the maximum discrepancy between all metrics for the non-binary case.

If we now compute the groupwise_metrics, we will get a metric for the intersection of each group.

msr_3 = groupwise_metrics(msr("classif.fomr"),  tsk_adult_train)
unname(sapply(msr_3, function(x) x$id))

 [1] "subgroup.fomr_Male, White"               
 [2] "subgroup.fomr_Male, Black"               
 [3] "subgroup.fomr_Female, Black"             
 [4] "subgroup.fomr_Female, White"             
 [5] "subgroup.fomr_Male, Asian-Pac-Islander"  
 [6] "subgroup.fomr_Male, Amer-Indian-Eskimo"  
 [7] "subgroup.fomr_Female, Other"             
 [8] "subgroup.fomr_Female, Asian-Pac-Islander"
 [9] "subgroup.fomr_Female, Amer-Indian-Eskimo"
[10] "subgroup.fomr_Male, Other"

prediction$score(msr_3, tsk_adult_test)

               subgroup.fomr_Male, White 
                                  0.2402 
               subgroup.fomr_Male, Black 
                                  0.2716 
             subgroup.fomr_Female, Black 
                                  0.2609 
             subgroup.fomr_Female, White 
                                  0.1919 
  subgroup.fomr_Male, Asian-Pac-Islander 
                                  0.3168 
  subgroup.fomr_Male, Amer-Indian-Eskimo 
                                  0.1667 
             subgroup.fomr_Female, Other 
                                  0.2500 
subgroup.fomr_Female, Asian-Pac-Islander 
                                  0.3529 
subgroup.fomr_Female, Amer-Indian-Eskimo 
                                  1.0000 
               subgroup.fomr_Male, Other 
                                  0.1111

And we can see, that the reason might be, that the false omission rate for female Amer-Indian-Eskimo is at 1.0!

In this chapter we were unable to reduce bias in our experiment. Using everything you have learned in this book, see if you can successfully reduce bias in your model. Critically reflect on this exercise, why might this be a bad idea?

Several problems with the existing metrics.

We’ll go through them one by one to deepen our understanding:

Metric and evaluation

In order for the fairness metric to be useful, we need to ensure that the data used for evaluation is representative and sufficiently large.

We can investigate this further by looking at actual counts:

  table(tsk_adult_test$data(cols = c("race", "sex", "target")))

, , target = <=50K

                    sex
race                 Female Male
  Amer-Indian-Eskimo     56   74
  Asian-Pac-Islander    131  186
  Black                 654  619
  Other                  37   66
  White                3544 6176

, , target = >50K

                    sex
race                 Female Male
  Amer-Indian-Eskimo      3   16
  Asian-Pac-Islander     26  106
  Black                  41  133
  Other                   5   19
  White                 492 2931

One of the reasons might be that there are only 3 individuals in the “>50k” category! This is an often encountered problem, as error metrics have a large variance when samples are small. Note, that the pre- and post-processing methods in general do not all support multiple protected attributes.

We should question whether comparing the metric between all groups actually makes sense for the question we are trying to answer. Instead, we might want to observe the metric between two specific subgroups, in this case between individuals with sex: Female and race: "Black" or "White".

First, we create a subset of only sex: Female and race: "Black", "White".

adult_subset = tsk_adult_test$clone()
df = adult_subset$data()
rows = seq_len(nrow(df))[df$race %in% c("Black", "White") & df$sex %in% c("Female")]
adult_subset$filter(rows)
adult_subset$set_col_roles("race", add_to = "pta")

And evaluate our measure again:

prediction$score(msr_3, adult_subset)

               subgroup.fomr_Male, White 
                                  0.2402 
               subgroup.fomr_Male, Black 
                                  0.2716 
             subgroup.fomr_Female, Black 
                                  0.2609 
             subgroup.fomr_Female, White 
                                  0.1919 
  subgroup.fomr_Male, Asian-Pac-Islander 
                                  0.3168 
  subgroup.fomr_Male, Amer-Indian-Eskimo 
                                  0.1667 
             subgroup.fomr_Female, Other 
                                  0.2500 
subgroup.fomr_Female, Asian-Pac-Islander 
                                  0.3529 
subgroup.fomr_Female, Amer-Indian-Eskimo 
                                  1.0000 
               subgroup.fomr_Male, Other 
                                  0.1111

We can see, that between women there is an even bigger discrepancy compared to men.

The bias mitigation strategies we employed do not optimize for the false omission rate metric, but other metrics instead. It might therefore be better to try to achieve fairness via other strategies, using different or more powerful models or tuning hyperparameters.

A.14 Solutions to Chapter 15

Manually $train() a LightGBM classifier from mlr3extralearners on the pima task using $1/3$ of the training data for validation. As the pima task has missing values, select a method from mlr3pipelines to impute them. Explicitly set the evaluation metric to logloss ("binary_logloss"), the maximum number of boosting iterations to 1000, the patience parameter to 10, and the step size to 0.01. After training the learner, inspect the final validation scores as well as the early stopped number of iterations.

We start by loading the packages and creating the task.

library(mlr3)
library(mlr3extralearners)
library(mlr3pipelines)

tsk_pima = tsk("pima")
tsk_pima

<TaskClassif:pima> (768 x 9): Pima Indian Diabetes
* Target: diabetes
* Properties: twoclass
* Features (8):
  - dbl (8): age, glucose, insulin, mass, pedigree, pregnant,
    pressure, triceps

Below, we see that the task has five features with missing values.

tsk_pima$missings()

diabetes      age  glucose  insulin     mass pedigree pregnant pressure 
       0        0        5      374       11        0        0       35 
 triceps 
     227

Next, we create the LightGBM classifier, but don’t specify the validation data yet. We handle the missing values using a simple median imputation.

lrn_lgbm = lrn("classif.lightgbm",
  num_iterations = 1000,
  early_stopping_rounds = 10,
  learning_rate = 0.01,
  eval = "binary_logloss"
)

glrn = as_learner(po("imputemedian") %>>% lrn_lgbm)
glrn$id = "lgbm"

After constructing the graphlearner, we now configure the validation data using set_validate(). The call below sets the $validate field of the LightGBM pipeop to "predefined" and of the graphlearner to 0.3. Recall that only the graphlearner itself can specify how the validation data is generated. The individual pipeops can either use it ("predefined") or not (NULL).

set_validate(glrn, validate = 0.3, ids = "classif.lightgbm")
glrn$validate

[1] 0.3

glrn$graph$pipeops$classif.lightgbm$validate

[1] "predefined"

Finally, we train the learner and inspect the validation scores and internally tuned parameters.

glrn$train(tsk_pima)

glrn$internal_tuned_values

$classif.lightgbm.num_iterations
[1] 183

glrn$internal_valid_scores

$classif.lightgbm.binary_logloss
[1] 0.4782

Wrap the learner from exercise 1) in an AutoTuner using a three-fold CV for the tuning. Also change the rule for aggregating the different boosting iterations from averaging to taking the maximum across the folds. Don’t tune any parameters other than nrounds, which can be done using tnr("internal"). Use the internal validation metric as the tuning measure. Compare this learner with a lrn("classif.rpart") using a 10-fold outer cross-validation with respect to classification accuracy.

We start by setting the number of boosting iterations to an internal tune token where the maximum number of boosting iterations is 1000 and the aggregation function the maximum. Note that the input to the aggregation function is a list of integer values (the early stopped values for the different resampling iterations), so we need to unlist() it first before taking the maximum.

library(mlr3tuning)

glrn$param_set$set_values(
  classif.lightgbm.num_iterations = to_tune(
    upper = 1000, internal = TRUE, aggr = function(x) max(unlist(x))
  )
)

Now, we change the validation data from 0.3 to "test", where we can omit the ids specification as LightGBM is the base learner.

set_validate(glrn, validate = "test")

Next, we create the autotuner using the configuration given in the instructions. As the internal validation measures are calculated by lightgbm and not mlr3, we need to specify whether the metric should be minimized.

at_lgbm = auto_tuner(
  learner = glrn,
  tuner = tnr("internal"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("internal_valid_score",
    select = "classif.lightgbm.binary_logloss", minimize = TRUE)
)
at_lgbm$id = "at_lgbm"

Finally, we set up the benchmark design, run it, and evaluate the learners in terms of their classification accuracy.

design = benchmark_grid(
  task = tsk_pima,
  learners = list(at_lgbm, lrn("classif.rpart")),
  resamplings = rsmp("cv", folds = 10)
)

bmr = benchmark(design)

bmr$aggregate(msr("classif.acc"))

   nr task_id    learner_id resampling_id iters classif.acc
1:  1    pima       at_lgbm            cv    10      0.7735
2:  2    pima classif.rpart            cv    10      0.7462
Hidden columns: resample_result

Consider the code below:

branch_lrn = as_learner(
  ppl("branch", list(
    lrn("classif.ranger"),
    lrn("classif.xgboost",
      early_stopping_rounds = 10,
      eval_metric = "error",
      eta = to_tune(0.001, 0.1, logscale = TRUE),
      nrounds = to_tune(upper = 1000, internal = TRUE)))))

set_validate(branch_lrn, validate = "test", ids = "classif.xgboost")
branch_lrn$param_set$set_values(branch.selection = to_tune())

at = auto_tuner(
  tuner = tnr("grid_search"),
  learner = branch_lrn,
  resampling = rsmp("holdout", ratio = 0.8),
  # cannot use internal validation score because ranger does not have one
  measure = msr("classif.ce"),
  term_evals = 10L,
  store_models = TRUE
)

tsk_sonar = tsk("sonar")$filter(1:100)

rr = resample(
  tsk_sonar, at, rsmp("holdout", ratio = 0.8), store_models = TRUE
)

Answer the following questions (ideally without running the code):

3.1 During the hyperparameter optimization, how many observations are used to train the XGBoost algorithm (excluding validation data) and how many for the random forest? Hint: learners that cannot make use of validation data ignore it.

The outer resampling already removes 20 observations from the data (the outer test set), leaving only 80 data points (the outer train set) for the inner resampling. Then 16 (0.2 * 80; the test set of the inner holdout resampling) observations are used to evaluate the hyperparameter configurations. This leaves 64 (80 - 16) observations for training. For XGBoost, the 16 observations that make up the inner test set are also used for validation, so no more observations from the 64 training points are removed. Because the random forest does not support validation, the 16 observations from the inner test set will only be used for evaluation the hyperparameter configuration, but not simultanteously for internal validation. Therefore, both the random forest and XGBoost models use 64 observations for training.

3.2 How many observations would be used to train the final model if XGBoost was selected? What if the random forest was chosen?

In both cases, all 80 observations (the train set from the outer resampling) would be used. This is because during the final model fit no validation data is generated.

3.3 How would the answers to the last two questions change if we had set the $validate field of the graphlearner to 0.25 instead of "test"?

In this case, the validation data is no longer identical to the inner resampling test set. Instead, it is split from the 64 observations that make up the inner training set. Because this happens before the task enters the graphlearner, both the XGBoost model and the random forest only have access to 48 ((1 - 0.25) * 64) observations, and the remaining 16 are used to create the validation data. Note that the random forest will again ignore the validation data as it does not have the ‘validation’ property and therefore cannot use it. Also, the autotuner would now use a different set for tuning the step size and boosting iterations (which coincidentally both have size 16). Therefore, the answer to question 3.1 would be 48 instead of 64.

However, this does not change the answer to 3.2, as, again, no validation is performed during the final model fit.

Note that we would normally recommend setting the validation data to "test" when tuning, so this should be thought of as a illustrative example.

Look at the (failing) code below:

tsk_sonar = tsk("sonar")
glrn = as_learner(
  po("pca") %>>% lrn("classif.xgboost", validate = 0.3)
)

Error: Validate field of PipeOp 'classif.xgboost' must either be NULL or 'predefined'.
To configure how the validation data is created, set the $validate field of the GraphLearner, e.g. using set_validate().

Can you explain why the code fails? Hint: Should the data that xgboost uses for validation be preprocessed according to the train or predict logic?

If we set the $validate field of the XGBoost classifier to 0.3, the validation data would be generated from the output task of PipeOpOpPCA. However, this task has been exclusively preprocessed using the train logic, because the PipeOpPCA does not ‘know’ that the LightGBM classifier wants to do validation. Because validation performance is intended to measure how well a model would perform during prediction, the validation should be preprocessed according to the predict logic. For this reason, splitting of the 30% of the output from PipeOpPCA to use as validation data in the XGBoost classifier would be invalid. Therefore, it is not possible to set the $validate field of PipeOps to values other than predefined' orNULL’. Only the GraphLearner itself can dictate how the validation data is created before it enters the Graph, so the validation data is then preprocessed according to the predict logic.