10  Large-Scale Benchmark Experiments


Sebastian Fischer

Ludwig-Maximilians-Universität München

In the field of machine learning, benchmark experiments are used to evaluate and compare the performance of algorithms. Conducting such experiments involves evaluating algorithms on a large suite of diverse datasets, which can, e.g., be acquired through collaborative machine learning platforms such as OpenML. The first section of this chapter shows how to work with OpenML using the interface package mlr3oml. However, running large-scale experiments requires not only datasets, but also significant computational resources, and it can be beneficial to leverage High-Performance Computing (HPC) clusters to speed up the experiment execution. In the second section, we will show how the R package batchtools and its mlr3 integration mlr3batchmark can considerably simplify the process of working on HPC clusters. Once the experiments are complete, statistical analysis is required to draw conclusions from the results; the mlr3benchmark package provides tools for such post-hoc analyses.

In the world of machine learning, there are many methods that are hard to evaluate using mathematical analysis alone. Even if formal analysis is successful, it is often an open question whether real-world datasets satisfy the necessary assumptions for the theorems to hold. Consequently, researchers resort to conducting benchmark experiments to answer fundamental scientific questions. Such studies evaluate different algorithms on a wide range of datasets, with the aim of identifying which method performs best. These empirical investigations are essential for understanding the capabilities and limitations of existing methods and for developing new and improved approaches. To carry them out effectively, researchers need access to datasets spanning a wide range of domains and problem types, because conclusions can only be drawn about the kind of datasets on which the benchmark study was conducted. Fortunately, there are several online resources available for acquiring such datasets, with OpenML (Vanschoren et al. 2013) being a popular choice. In the first part of this chapter, we will show how to use OpenML through the connector package mlr3oml.

However, evaluating algorithms on such a large scale requires significant computational resources. For this reason, researchers often utilize High-Performance Computing (HPC) clusters, which allow experiments to be executed in a massively parallel fashion. The R package batchtools (Lang, Bischl, and Surmann 2017) is a tool for managing and executing experiments on such clusters. In the second section we will show how to use batchtools and its mlr3 connector for benchmarking: mlr3batchmark. Once the experiments are complete, visualizations and statistical tests from the mlr3benchmark package are used to draw conclusions from the benchmark experiments.

A common design for benchmarking is to compare a set of Learners \(L_1, \ldots, L_n\) by evaluating their performance on a set of tasks \(T_1, \ldots, T_k\) which each have an associated resampling \(R_1, \ldots, R_k\). We call such a task-resampling combination \(C_j = (T_j, R_j)\) a resampled task. Furthermore, we define \(M = (M_1, \ldots, M_m)\) to be the evaluation measures. We here focus on the most frequent case of a full experimental grid design, where each experiment is defined by a triple \(E_{i, j} = (L_i, C_j, M)\), which evaluates a learner \(L_i\) on a resampled task \(C_j\) using performance measures \(M\). Running the benchmark consists of evaluating all these triples. The execution of a single resample experiment can again be subdivided into computationally independent resampling iterations to achieve a finer granularity for parallelization (see Section 9.1.2).
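To make the size of such a grid concrete, the following count is a small worked example. The symbol \(n_j\) for the number of iterations of resampling \(R_j\) and the numbers plugged in below are illustrative assumptions, not results from this chapter.

\[
\bigl|\{E_{i,j}\}\bigr| = n \cdot k,
\qquad
\text{number of resampling iterations} = n \cdot \sum_{j=1}^{k} n_j .
\]

For instance, with \(n = 2\) learners and \(k = 3\) tasks that are each resampled with 10-fold cross-validation (\(n_j = 10\)), the grid contains \(2 \cdot 3 = 6\) resample experiments, which split into \(2 \cdot (10 + 10 + 10) = 60\) computationally independent resampling iterations.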

As a guiding example throughout this chapter, we will compare the random forest implementation mlr_learners_classif.ranger with the logistic regression learner mlr_learners_classif.log_reg. This question holds significant practical importance and has already been studied by, e.g., Couronné, Probst, and Boulesteix (2018).

The following example compares these two learners using a simple holdout resampling on three classification tasks that ship with mlr3. Chapter 3 already covered how to conduct such benchmark experiments, and we recommend revisiting this chapter if anything is unclear. The metric of choice is the classification accuracy. Note that we “robustify” both learners to work on a wide range of tasks using the respective preprocessing pipeline, cf. Chapter 6. Calling ppl("robustify") creates a preprocessing pipeline, i.e. a predefined Graph consisting of common operations such as missing value imputation or feature encoding. By specifying the learner during construction of the pipeline, some preprocessing operations can be omitted, e.g. when a learner can handle missing values itself. Moreover, we specify the fallback learner to be a featureless learner (compare Section 9.2.2) and we use "try" as encapsulation, which will be discussed later in this chapter.

library(mlr3verse)

# create logistic regression pipeline
learner_logreg = lrn("classif.log_reg")
learner_logreg = as_learner(
  ppl("robustify", learner = learner_logreg) %>>% learner_logreg)
learner_logreg$id = "logreg"
learner_logreg$fallback = lrn("classif.featureless")
learner_logreg$encapsulate = c(train = "try", predict = "try")

# create random forest pipeline
learner_ranger = lrn("classif.ranger")
learner_ranger = as_learner(
  ppl("robustify", learner = learner_ranger) %>>% learner_ranger)
learner_ranger$id = "ranger"
learner_ranger$fallback = lrn("classif.featureless")
learner_ranger$encapsulate = c(train = "try", predict = "try")

# create full grid design with holdout resampling
design = benchmark_grid(
  tsks(c("german_credit", "sonar", "spam")),
  list(learner_logreg, learner_ranger),
  rsmp("holdout")
)

# run the benchmark
bmr = benchmark(design)

# retrieve results
acc = bmr$aggregate(msr("classif.acc"))
acc[, .(task_id, learner_id, classif.acc)]
         task_id learner_id classif.acc
1: german_credit     logreg   0.7357357
2: german_credit     ranger   0.7837838
3:         sonar     logreg   0.6666667
4:         sonar     ranger   0.7681159
5:          spam     logreg   0.9367666
6:          spam     ranger   0.9556714

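Looking only at the aggregated performance measures, the random forest outperforms the logistic regression on all three datasets. However, this analysis is not conclusive because it is based on only three datasets and a single holdout split per dataset, and because no statistical test has been applied to the observed differences.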

In the subsequent sections, we will show the steps required to scale up this analysis using the tools mentioned earlier.

10.1 Getting Data with OpenML

In order to be able to draw meaningful conclusions from benchmark experiments, a good choice of datasets and tasks is essential. It is therefore helpful to

  1. have convenient access to a large collection of datasets and be able to filter them for specific properties, and
  2. be able to easily share datasets, tasks and collections with others, so that they can evaluate their methods on the same problems and thus allow cross-study comparisons.

OpenML is a platform that facilitates the sharing and dissemination of machine learning research data and satisfies these two desiderata. Like mlr3, it is free and open source. Unlike mlr3, it is not tied to a programming language and can for example also be used from its Python interface (Feurer et al. 2021) or from Java. Its goal is to make it easier for researchers to find the data they need to answer the questions they have. Its design was guided by the FAIR principles, which stand for Findability, Accessibility, Interoperability and Reusability. The purpose of these principles is to make scientific data more easily discoverable and reusable. More concretely, OpenML is a repository for storing, organising and retrieving datasets, algorithms and experimental results in a standardised way. Entities have unique identifiers and standardised (meta) data. Everything is accessible through a REST API or the web interface.

In this section we will cover some of the main features of OpenML and how to use them via the mlr3oml interface package. OpenML supports different types of objects and we will cover the following:

  • OpenML Dataset: (Usually tabular) data with additional metadata. The latter includes for example a description of the data and a licence. When accessed via mlr3oml, it can be converted to an mlr3::DataBackend. As most OpenML datasets also have a designated target column, they can often directly be converted to an mlr3::Task.
  • OpenML Task: A machine learning task, i.e. a concrete problem specification on an OpenML dataset. This includes splits into train and test sets, thereby differing from the mlr3 task definition, and corresponds to the notion of a resampled task defined in the introduction. Thus, it can be converted to both an mlr3::Task and a corresponding instantiated mlr3::Resampling.
  • OpenML Task Collection: A container object for grouping tasks. This allows the creation of benchmark suites, such as the OpenML CC-18 (Bischl et al. 2021), which is a curated collection of classification tasks.

While OpenML also supports other objects such as representations of algorithms (flows) and experiment results (runs), they are not covered in this chapter. For more information about these features, we refer to the OpenML website or the documentation of the mlr3oml package.

10.1.1 Dataset

To illustrate the OpenML dataset class, we will use the dataset with ID 1590, the well-known adult data. Such an ID can either be found by searching for objects on the OpenML website or through the REST API. This will be covered in more detail in Section 10.1.3.

We load the object into R using mlr3oml::odt(), which returns an object of class OMLData.

library(mlr3oml)

odata = odt(id = 1590)
<OMLData:1590:adult> (48842x15)
 * Default target: class

This dataset contains information about 48842 adults – such as their age or education – and the goal is usually to predict the class variable, which indicates whether a person has an income above 50K dollars per year. The OMLData object not only contains the data itself, but comes with additional metadata that is accessible through its fields.

odata$license
[1] "Public"

The actual data can be accessed through $data.

odata$data
       age    workclass fnlwgt    education education.num     marital.status
    1:  25      Private 226802         11th             7      Never-married
    2:  38      Private  89814      HS-grad             9 Married-civ-spouse
    3:  28    Local-gov 336951   Assoc-acdm            12 Married-civ-spouse
    4:  44      Private 160323 Some-college            10 Married-civ-spouse
    5:  18         <NA> 103497 Some-college            10      Never-married
48838:  27      Private 257302   Assoc-acdm            12 Married-civ-spouse
48839:  40      Private 154374      HS-grad             9 Married-civ-spouse
48840:  58      Private 151910      HS-grad             9            Widowed
48841:  22      Private 201490      HS-grad             9      Never-married
48842:  52 Self-emp-inc 287927      HS-grad             9 Married-civ-spouse
9 variables not shown: [occupation, relationship, race, sex, capital.gain, capital.loss, hours.per.week, native.country, class]

When working with OpenML objects, these are downloaded in a piecemeal fashion from the OpenML server. This way you can, e.g., access the metadata from the OMLData object without loading the dataset. When the $data field is accessed as in the above example, the download is automatically triggered, the data is imported into R, and the resulting data.frame is stored in the odata object. All subsequent accesses to $data will be transparently redirected to the in-memory data.frame. Additionally, many objects can be permanently cached on the local file system. This caching can be enabled by setting the option mlr3oml.cache to either TRUE or a specific path to be used as the cache folder.
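As a minimal sketch of the caching setup described above, the option can be set with base R's options() before any OpenML object is accessed. The directory used in the second call is a hypothetical example path, not a location prescribed by mlr3oml.

```r
# enable persistent caching in mlr3oml's default cache location
options(mlr3oml.cache = TRUE)

# alternatively, cache in a specific folder
# (hypothetical path; any writable directory would do)
options(mlr3oml.cache = file.path(tempdir(), "oml_cache"))

# inspect the current setting
getOption("mlr3oml.cache")
```

Because options() only affects the running session, the call would have to be repeated in every new R session, for example from a startup script.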

After we have loaded the data of interest into R, the next step is to convert it into a format usable with mlr3. The class that comes closest to the OpenML dataset is the mlr3::DataBackend (see Section 9.4) and it is possible to convert OMLData objects by calling as_data_backend(). This is a recurring theme throughout this section: OpenML and mlr3 objects interoperate well.

backend = as_data_backend(odata)
<DataBackendDataTable> (48842x16)
 age workclass fnlwgt    education education.num     marital.status
  25   Private 226802         11th             7      Never-married
  38   Private  89814      HS-grad             9 Married-civ-spouse
  28 Local-gov 336951   Assoc-acdm            12 Married-civ-spouse
  44   Private 160323 Some-college            10 Married-civ-spouse
  18      <NA> 103497 Some-college            10      Never-married
  34   Private 198693         10th             6      Never-married
10 variables not shown: [occupation, relationship, race, sex, capital.gain, capital.loss, hours.per.week, native.country, class, ..row_id]
[...] (48836 rows omitted)

We can create an mlr3 task from the backend via as_task_classif().

task = as_task_classif(backend, target = "class")

Some datasets on OpenML contain columns that should neither be used as a feature nor a target. The column names that are usually included as features are accessible through the field $feature_names and we assign them to the mlr3 task accordingly. Note that for the dataset at hand this would not have been necessary, as all non-target columns are to be treated as predictors, but we include it for clarity.

task$col_roles$feature = odata$feature_names
<TaskClassif:backend> (48842 x 15)
* Target: class
* Properties: twoclass
* Features (14):
  - fct (8): education, marital.status, native.country, occupation,
    race, relationship, sex, workclass
  - int (6): age, capital.gain, capital.loss, education.num, fnlwgt,
    hours.per.week

Alternatively, as the OpenML adult dataset comes with a default target, you can also directly convert it to a task with the appropriate type. This will also set the features of the task appropriately.

task = as_task(odata)

10.1.2 Task

OpenML tasks are built on top of OpenML datasets and additionally specify the target variable, the train-test splits to use for resampling, and more. Similar to mlr3, OpenML has different types of tasks, such as regression or classification. A task associated with the adult data from earlier has ID 359983. We can load the object using the mlr3oml::otsk() function, which returns an OMLTask object. Note that this task object is different from an mlr3 Task and cannot directly be used for machine learning. However, it contains all required information and there is a convenient converter, as shown below.

otask = otsk(id = 359983)
 * Type: Supervised Classification
 * Data: adult (id: 1590; dim: 48842x15)
 * Target: class
 * Estimation: crossvalidation (id: 1; repeats: 1, folds: 10)

The OMLData object associated with the underlying dataset can be accessed through the $data field.

otask$data
<OMLData:1590:adult> (48842x15)
 * Default target: class

The data splits associated with the estimation procedure are accessible through the field $task_splits. In mlr3 terms, these are the instantiation of an mlr3 Resampling on a specific Task.

otask$task_splits
         type rowid repeat. fold
     1: TRAIN 32427       0    0
     2: TRAIN 13077       0    0
     3: TRAIN 15902       0    0
     4: TRAIN 17703       0    0
     5: TRAIN 35511       0    0
488416:  TEST  8048       0    9
488417:  TEST 12667       0    9
488418:  TEST 43944       0    9
488419:  TEST 25263       0    9
488420:  TEST 43381       0    9

The OpenML task can be converted to both an mlr3 Task and a ResamplingCustom instantiated on the task. To convert to the former we can use as_task().

task = as_task(otask)
<TaskClassif:adult> (48842 x 15)
* Target: class
* Properties: twoclass
* Features (14):
  - fct (8): education, marital.status, native.country, occupation,
    race, relationship, sex, workclass
  - int (6): age, capital.gain, capital.loss, education.num, fnlwgt,
    hours.per.week

The accompanying resampling can be created using as_resampling().

resampling = as_resampling(otask)
<ResamplingCustom>: Custom Splits
* Iterations: 10
* Instantiated: TRUE
* Parameters: list()

As a shortcut, it is also possible to create these objects directly with the tsk() and rsmp() constructors by using the key "oml" and passing the data_id or task_id to query, e.g. tsk("oml", task_id = 359983).
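The shortcut can be sketched as follows, assuming the mlr3 and mlr3oml packages are installed and the OpenML server is reachable. The IDs are the ones used earlier in this section; the variable names are only illustrative.

```r
library(mlr3)
library(mlr3oml)

# mlr3 task and instantiated resampling for OpenML task 359983 (adult);
# both calls query the OpenML server and need an internet connection
task = tsk("oml", task_id = 359983)
resampling = rsmp("oml", task_id = 359983)

# a dataset can be addressed by its data ID instead
task_from_data = tsk("oml", data_id = 1590)
```

This yields the same kind of objects as the as_task() and as_resampling() converters shown above, without keeping the intermediate OMLTask object around.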

10.1.3 Filtering of Data and Tasks

Besides working with objects with known IDs, another important question is how to find IDs of relevant datasets or tasks. Because objects on OpenML have strict metadata, they can be filtered with respect to these properties. This is possible through either the website or the REST API. In addition, the website also supports targeted text queries to search for specific datasets such as the “adult” data from earlier.

The list_oml_data() function allows filtering datasets for specific properties. As an example, we might only be interested in comparing the random forest and the logistic regression on datasets with at most 4 features and 100 to 1000 observations. By setting number_classes to 2, we only receive datasets where the default target has two different values. To keep the output readable, we only show the first 5 results from that query.

odatasets = list_oml_data(
  limit = 5,
  number_features = c(1, 4),
  number_instances = c(100, 1000),
  number_classes = 2
)

The table below confirms that indeed only datasets with the specified properties were returned. We only show a subset of the columns for readability.

odatasets[,
  .(data_id, NumberOfClasses, NumberOfFeatures, NumberOfInstances)]
   data_id NumberOfClasses NumberOfFeatures NumberOfInstances
1:      43               2                4               306
2:     444               2                4               132
3:     448               2                4               120
4:     464               2                3               250
5:     724               2                4               468

Besides datasets, it is also possible to filter tasks. This can be done using list_oml_tasks() and works analogously to the previous example.

We could now start looking at the returned IDs in more detail in order to verify whether they are suitable for our purposes. This process can be tedious, as some datasets have hard-to-detect quirks to look out for. A solution to this problem is to use an existing curated task collection, which we will cover next.

10.1.4 Task Collection

The OpenML task collection is a container object bundling existing tasks. This allows for the creation of benchmark suites, which are curated collections of tasks satisfying certain quality criteria. One example of such a benchmark suite is the OpenML CC-18, which contains curated classification tasks (Bischl et al. 2021). Other collections available on OpenML include the AutoML benchmark (Gijsbers et al. 2022) or a benchmark for tabular deep learning (Grinsztajn, Oyallon, and Varoquaux 2022).

We can create an OMLCollection using mlr3oml::ocl(). Printing it shows that the CC-18 contains 72 classification tasks on different datasets.

otask_collection = ocl(id = 99)
otask_collection
<OMLCollection: 99>
 * data:  72
 * tasks: 72

The contained tasks can be accessed through $task_ids.

otask_collection$task_ids
 [1]      3      6     11     12     14     15     16     18     22     23
[11]     28     29     31     32     37     43     45     49     53    219
[21]   2074   2079   3021   3022   3481   3549   3560   3573   3902   3903
[31]   3904   3913   3917   3918   7592   9910   9946   9952   9957   9960
[41]   9964   9971   9976   9977   9978   9981   9985  10093  10101  14952
[51]  14954  14965  14969  14970 125920 125922 146195 146800 146817 146819
[61] 146820 146821 146822 146824 146825 167119 167120 167121 167124 167125
[71] 167140 167141

We will now define our experimental design using tasks from the CC-18. If we wanted to get all tasks and resamplings, we could achieve this using the converters as_tasks() and as_resamplings(). However, as the CC-18 does not only contain binary classification tasks, we use list_oml_tasks() to subset the collection further. We pass the task IDs from the CC-18 as argument task_id and request the number of classes to be 2.

binary_cc18 = list_oml_tasks(
  task_id = otask_collection$task_ids, number_classes = 2)

In order to keep the runtime reasonable in later examples, we only use the first six tasks.

binary_cc18[, .(task_id, name, NumberOfClasses)]
    task_id                             name NumberOfClasses
 1:       3                         kr-vs-kp               2
 2:      15                         breast-w               2
 3:      29                  credit-approval               2
 4:      31                         credit-g               2
 5:      37                         diabetes               2
31:  146819 climate-model-simulation-crashes               2
32:  146820                             wilt               2
33:  167120                      numerai28.6               2
34:  167125          Internet-Advertisements               2
35:  167141                            churn               2
ids = binary_cc18$task_id[1:6]

We now define the learners, tasks, and resamplings for the experiment. In addition to the random forest and the logistic regression, we also include a featureless learner as a baseline.

otasks = lapply(ids, otsk)

tasks = lapply(otasks, as_task)
resamplings = lapply(otasks, as_resampling)

learner_featureless = lrn("classif.featureless", id = "featureless")
learners = list(learner_logreg, learner_ranger, learner_featureless)

To define the design table, we use benchmark_grid() and set paired to TRUE. This option can be used in situations where each resampling is instantiated on a corresponding task (therefore the tasks and resamplings below must have the same length) and each learner should be evaluated on every resampled task.

large_design = benchmark_grid(
  tasks, learners, resamplings, paired = TRUE
)
               task     learner resampling
 1:        kr-vs-kp      logreg     custom
 2:        kr-vs-kp      ranger     custom
 3:        kr-vs-kp featureless     custom
 4:        breast-w      logreg     custom
 5:        breast-w      ranger     custom
 6:        breast-w featureless     custom
 7: credit-approval      logreg     custom
 8: credit-approval      ranger     custom
 9: credit-approval featureless     custom
10:        credit-g      logreg     custom
11:        credit-g      ranger     custom
12:        credit-g featureless     custom
13:        diabetes      logreg     custom
14:        diabetes      ranger     custom
15:        diabetes featureless     custom
16:        spambase      logreg     custom
17:        spambase      ranger     custom
18:        spambase featureless     custom
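The pairing logic behind paired = TRUE can be mimicked in base R. The sketch below uses plain character vectors as stand-ins for the mlr3 objects (all names are illustrative) and is not how benchmark_grid() is implemented. It only contrasts the paired design with a naive cross product of all three components.

```r
# toy stand-ins for the mlr3 objects (names only)
tasks = c("kr-vs-kp", "breast-w", "credit-approval")
resamplings = paste0("splits_", tasks)  # one instantiated resampling per task
learners = c("logreg", "ranger", "featureless")

# paired design: task j is only ever combined with resampling j ...
resampled_tasks = data.frame(task = tasks, resampling = resamplings)
# ... and every learner is crossed with every resampled task
paired_design = merge(
  data.frame(learner = learners), resampled_tasks, by = NULL)

# naive cross product, which would also combine tasks with splits
# that were instantiated on a different task
full_product = expand.grid(
  task = tasks, learner = learners, resampling = resamplings)

nrow(paired_design)  # 3 learners x 3 resampled tasks = 9
nrow(full_product)   # 3 x 3 x 3 = 27
```

The paired variant is the one that matches the resampled tasks \(C_j = (T_j, R_j)\) from the introduction, since splits defined on one dataset are meaningless for another.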

10.2 Experiment Execution on HPC Clusters

Once an experimental design is finalized, the next step is to run it. Parallelizing this step is conceptually straightforward, because we are facing an embarrassingly parallel problem (see Section 9.1). Not only are the resample experiments independent, but even their individual iterations are. However, if the experiment is large, parallelization on a local machine as shown in Section 9.1 often is not enough and using a distributed computing system, such as an HPC cluster, is required. While access to HPC clusters is widespread at universities and research-driven companies, the effort required to work on these systems is still considerable. The R package batchtools provides a framework to simplify running large batches of computational experiments in parallel from R. It is highly flexible, making it suitable for a wide range of computational experiments, including machine learning, optimisation, simulation, and more.


In Section 9.1.2 we have already touched upon different parallelization backends. The package future also comes with a plan for batchtools. However, for larger experiments, the additional control over the execution which batchtools offers is invaluable. Therefore, we recommend future’s "batchtools" plan only for moderately sized experiments which complete within a couple of hours. To estimate the total runtime, a subset of the benchmark is usually executed and measured, and then the runtime is extrapolated.
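The extrapolation mentioned above amounts to simple arithmetic, sketched below in base R. All numbers are hypothetical and only serve to illustrate the calculation. In practice the measured runtimes would come from actually running a subset of the benchmark.

```r
# runtimes in seconds measured on a small subset of jobs (hypothetical values)
subset_runtimes = c(2, 3, 4, 3)

n_jobs_total = 200  # hypothetical size of the full benchmark
n_workers = 20      # hypothetical number of jobs running concurrently

# extrapolate: mean runtime per job times the number of jobs
total_cpu_seconds = mean(subset_runtimes) * n_jobs_total
# idealised wall-clock time if the work is spread evenly across workers
wallclock_seconds = total_cpu_seconds / n_workers

total_cpu_seconds  # 600
wallclock_seconds  # 30
```

Such an estimate is rough: it assumes that the measured subset is representative and ignores queueing time and scheduling overhead on the cluster.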

10.2.1 HPC Basics

An HPC cluster is a collection of interconnected computers or servers providing computational power beyond what a single computer can achieve. They are used for solving complex problems in chemistry, physics, engineering, machine learning and other fields that require a large amount of computational resources. An HPC cluster typically consists of multiple compute nodes, each with multiple CPU/GPU cores, memory, and local storage. These nodes are connected by a high-speed network and a network file system, which enable the nodes to communicate and work together on a given task. These clusters are also designed to run parallel applications that can be split into smaller tasks and distributed across multiple compute nodes to be executed simultaneously. We will leverage this capacity to parallelize the execution of the benchmark experiment. The most important difference between such a cluster and a personal computer (PC) is that the nodes cannot be accessed directly. Instead, computational jobs are queued by a scheduling system like Slurm (Simple Linux Utility for Resource Management). A scheduling system is a software tool that orchestrates the allocation of computing resources to users or applications on the cluster. It ensures that multiple users and applications can access the resources of the cluster in a fair and efficient manner, and also helps to maximize the utilization of the computing resources.

Figure 10.1 contains a rough sketch of an HPC architecture. Multiple users can log in to the head node (typically via SSH) and add their computational workloads of the form “Execute Computation X using Resources Y for Z amount of time” to the queue. One such instruction will be referred to as a computational job. The scheduling system controls when these computational jobs are executed.

Figure 10.1: Illustration of an HPC cluster architecture. Ann and Bob both have access to the cluster and can log in to the head node. There, they can submit jobs to the scheduling system, which adds them to its queue and determines when they are run.

Common challenges when interacting with a scheduling system are that

  1. the code to submit a job depends on the scheduling system,
  2. code that runs locally has to be adapted to run on the cluster,
  3. the burden is on the user to account for seeding to ensure reproducibility, and
  4. it is cumbersome to query the status of jobs and debug failures.

In the following we will see how batchtools mitigates these problems.

10.2.2 General Setup and Experiment Registry

Our goal in this section is to run the benchmark design large_design shown in Section 10.1.4 on an HPC cluster. We use the packages batchtools and mlr3batchmark for this. The mlr3batchmark package assists in translating the machine learning problem defined with syntax and objects from mlr3 to the more general “apply some algorithm A on some problem P” approach in batchtools, which is briefly outlined in Section 10.2.6.

The central concept of batchtools is the experiment or job: One replication of an experiment (or job) is defined by applying a (parameterized) algorithm to a (parameterized) problem. A benchmark experiment in batchtools then consists of running many such experiments with different algorithms, algorithm parameters, problems, and problem parameters. Each such experiment is computationally independent of all other experiments and constitutes the basic level of computation batchtools can parallelize.

In the introduction of this chapter, we have defined a benchmark experiment as evaluating a number of triples \(E_{i, j} = (L_i, C_j, M)\) that are defined by a learner \(L_i\), a resampled task \(C_j = (T_j, R_j)\), and measures \(M\). While it might seem natural to define one such triple as a job, each resampled task \(C_j\) can be split up even further, namely into its resampling iterations \(C^1_j, \ldots, C^{n_j}_j\), where \(n_j\) is the number of iterations of resampling \(R_j\). This makes the computation more granular and gives you the opportunity to utilize more CPUs in parallel (cf. Section 9.1). Moreover, because the evaluation of the measures \(M\) is (usually) computationally negligible, it is common to only parallelize the execution of the resample experiments and evaluate the measures \(M\) afterwards. For these reasons, we will understand one pair \((L_i, C^l_j)\) as a single batchtools experiment. Note that such a job does not have to coincide with the notion of a computational job defined earlier; more on that in Section 10.2.4. The custom definition of problems and algorithms, which yields a different granularity, is demonstrated in Section 10.2.6.

The first step is always to create an experiment registry (or load an existing one) using the function batchtools::makeExperimentRegistry() (or batchtools::loadRegistry(), respectively). This function constructs the inter-communication object for all functions in batchtools and corresponds to a folder on the file system. Among other things, the experiment registry stores the

  • algorithms, problems, and job definitions
  • log outputs and status of submitted, running, and finished jobs
  • job results
  • cluster function, which defines the interaction with the scheduling system

While the first three bullet points should be relatively clear, the fourth needs some explanation. By configuring the scheduling system through a cluster function, it is possible to make the interaction with the scheduling system independent of the scheduling software. We will come back to this later and show how to change it to work on a Slurm cluster.

We create a registry in a subdirectory of our working directory. On a real cluster, make sure that this folder is stored on a shared network filesystem, otherwise the nodes cannot access it. Furthermore, we set the registry’s seed to 1 and the packages to mlr3verse, which will make the package available in all our experiments.

library(batchtools)

# create registry
reg = makeExperimentRegistry(
  file.dir = "./experiments",
  seed = 1,
  packages = "mlr3verse"
)
No readable configuration file found
Created registry in '/home/runner/work/mlr3book/mlr3book/book/experiments' using cluster functions 'Interactive'

When printing our newly created registry, we see that there are 0 problems, algorithms or jobs registered. Among other things, we are informed that the “Interactive” cluster function (see batchtools::makeClusterFunctionsInteractive()) is used and about the working directory for the experiments.

reg
Experiment Registry
  Backend   : Interactive
  File dir  : /home/runner/work/mlr3book/mlr3book/book/experiments
  Work dir  : /home/runner/work/mlr3book/mlr3book/book
  Jobs      : 0
  Problems  : 0
  Algorithms: 0
  Seed      : 1
  Writeable : TRUE
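Because a registry is just a folder on the file system, it can be re-opened later with batchtools::loadRegistry(). The sketch below is a self-contained illustration under stated assumptions: it uses a throwaway registry in a temporary folder (the names demo_dir and demo_reg are hypothetical) and sets make.default = FALSE so that the registry reg created above remains the default.

```r
library(batchtools)

# throwaway registry in a temporary folder, so the sketch is self-contained
demo_dir = tempfile("registry_demo_")
demo_reg = makeExperimentRegistry(
  file.dir = demo_dir, seed = 1, make.default = FALSE)

# later (e.g. in a new R session on the head node) the same folder is
# re-opened; writeable = TRUE is needed to add or submit jobs
demo_reg = loadRegistry(
  file.dir = demo_dir, writeable = TRUE, make.default = FALSE)

# the seed chosen at creation time is restored from disk
demo_reg$seed
```

For the registry of this chapter, the analogous call would point file.dir to "./experiments".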

10.2.3 Experiment Definition using mlr3batchmark

The next step is to populate the registry with problems and algorithms, which we will then use to define the jobs, i.e., the resampling iterations. This is the first step where mlr3batchmark comes into play. Doing this step with batchtools directly is also possible and gives you more flexibility; this is demonstrated in Section 10.2.6. By calling batchmark(), mlr3 tasks and mlr3 resamplings will be translated to batchtools problems, and mlr3 learners are mapped to batchtools algorithms. Then, jobs for all resampling iterations are created.

library(mlr3batchmark)

batchmark(large_design, reg = reg)

All batchtools functions that interoperate with a registry take a registry as an argument. By default, this argument is set to the last created registry, which is currently the reg object defined earlier. We nonetheless pass it explicitly in this section for clarity, but would not have to do so.

    When printing the registry, we confirm that six problems (one for each resampled task) but only a single algorithm are registered. While a one-to-one mapping of the 3 learners in the design to batchtools algorithms would also have been possible, mlr3batchmark instead uses a single algorithm parametrized with the learner identifiers for efficiency. Furthermore, \(180 = 3 \times 6 \times 10\) jobs, i.e. one for each resampling iteration, are registered.

    reg
    Experiment Registry
      Backend   : Interactive
      File dir  : /home/runner/work/mlr3book/mlr3book/book/experiments
      Work dir  : /home/runner/work/mlr3book/mlr3book/book
      Jobs      : 180
      Problems  : 6
      Algorithms: 1
      Seed      : 1
      Writeable : TRUE
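
    The job count can be checked on a much smaller design. The following sketch is not part of the benchmark above: it assumes the built-in mlr3 tasks "sonar" and "pima", two learners, 3-fold cross-validation, and a temporary registry (file.dir = NA) named tmp_reg, all chosen purely for illustration. It confirms that batchmark() registers one job per resampling iteration.

    ```r
    library(mlr3)
    library(batchtools)
    library(mlr3batchmark)

    # Small illustrative design (assumption: not the large_design used above).
    small_design = benchmark_grid(
      tasks = tsks(c("sonar", "pima")),
      learners = lrns(c("classif.featureless", "classif.rpart")),
      resamplings = rsmp("cv", folds = 3)
    )

    # file.dir = NA creates the registry in a temporary directory;
    # make.default = FALSE leaves the default registry untouched.
    tmp_reg = makeExperimentRegistry(
      file.dir = NA, seed = 1, packages = "mlr3", make.default = FALSE
    )
    batchmark(small_design, reg = tmp_reg)

    # 2 tasks x 2 learners x 3 folds = 12 jobs, i.e. 3 per task-learner pair.
    n_jobs = nrow(getJobTable(reg = tmp_reg))
    job_summary = summarizeExperiments(
      by = c("task_id", "learner_id"), reg = tmp_reg)
    job_summary
    ```

    The same arithmetic applies to the large design: the number of jobs is the number of learners times the number of resampled tasks times the number of resampling iterations.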

    We can summarize the defined experiments using batchtools::summarizeExperiments(). There are 10 jobs for each combination of a learner and resampled task, as 10-fold cross-validation is used as the resampling procedure.

    summarizeExperiments(
      by = c("task_id", "learner_id"), reg = reg)
                task_id  learner_id .count
     1:        kr-vs-kp      logreg     10
     2:        kr-vs-kp      ranger     10
     3:        kr-vs-kp featureless     10
     4:        breast-w      logreg     10
     5:        breast-w      ranger     10
     6:        breast-w featureless     10
     7: credit-approval      logreg     10
     8: credit-approval      ranger     10
     9: credit-approval featureless     10
    10:        credit-g      logreg     10
    11:        credit-g      ranger     10
    12:        credit-g featureless     10
    13:        diabetes      logreg     10
    14:        diabetes      ranger     10
    15:        diabetes featureless     10
    16:        spambase      logreg     10
    17:        spambase      ranger     10
    18:        spambase featureless     10

    The function batchtools::getJobTable() can be used to get more detailed information about the jobs. Here, we only show a few selected columns for readability and unpack the list columns algo.pars and prob.pars using batchtools::unwrap(). Among other things, we see that each job has a unique job.id. Each row in this job table represents one iteration (column repl) of a resample experiment.

    job_table = getJobTable(reg = reg)
    job_table = unwrap(job_table)
    job_table = job_table[,
      .(job.id, learner_id, task_id, resampling_id, repl)
    ]
    job_table
         job.id  learner_id  task_id resampling_id repl
      1:      1      logreg kr-vs-kp        custom    1
      2:      2      logreg kr-vs-kp        custom    2
      3:      3      logreg kr-vs-kp        custom    3
      4:      4      logreg kr-vs-kp        custom    4
      5:      5      logreg kr-vs-kp        custom    5
    176:    176 featureless spambase        custom    6
    177:    177 featureless spambase        custom    7
    178:    178 featureless spambase        custom    8
    179:    179 featureless spambase        custom    9
    180:    180 featureless spambase        custom   10

    10.2.4 Job Submission

    Once the experiments are defined, the next step is to submit them. Before doing so, it is recommended to test each algorithm individually using batchtools::testJob(). As an example, we test the job with job.id = 1 and specify external = TRUE to run the test in an external R session. The return value of testJob() (which is the return value of the "run_learner" algorithm) is somewhat technical: to reduce the communication overhead, only the essential parts of the objects are returned, namely a named list learner_state and a named list prediction. We do not describe the list elements in detail here, but we can, for example, access the training time through the learner_state as shown below.

    result = testJob(1, external = TRUE, reg = reg)
    Loading required package: mlr3verse
    Loading required package: mlr3
    ### [bt]: Generating problem instance for problem '62fd27845aadeb67' ...
    ### [bt]: Applying algorithm 'run_learner' on problem '62fd27845aadeb67' for job 1 (seed = 2) ...
    INFO  [13:48:49.264] [mlr3] Applying learner 'logreg' on task 'kr-vs-kp' (iter 1/10)
    Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
    This happened PipeOp classif.log_reg's $train()
    result$learner_state$train_time
    [1] 0.721

    In case something goes wrong, batchtools comes with a bunch of useful debugging utilities covered in Section 10.2.5. Once we are confident that the jobs are defined correctly, we can proceed with their submission, which requires

    1. specifying resource requirements for each computational job, and
    2. (optionally) grouping multiple jobs into one computational job

    Which resources can be configured depends on the cluster function that is set in the registry. We earlier left it at its default value, which is the “Interactive” cluster function. In the following we assume that we are working on a Slurm cluster. Accordingly, we initialize the cluster function with batchtools::makeClusterFunctionsSlurm() and a predefined slurm-simple template12. A template file is a shell script with placeholders filled in by batchtools and contains

    1. the command to start the computation via Rscript or R CMD batch, usually in the last line, and
    2. comments which serve as annotations for the scheduler, e.g. to communicate resources or paths on the file system.

    The exemplary template should work on many Slurm installations out-of-the-box, but can easily be customized to work with more advanced configurations.

    cf = makeClusterFunctionsSlurm(template = "slurm-simple")

    To proceed with the examples on a local machine, set the cluster functions to a Socket backend.

    cf = makeClusterFunctionsSocket()
    Auto-detected 2 CPUs

    It is possible to customize the cluster function. More information is available in the documentation of the batchtools package.

    We update the value of the $cluster.functions field and save the registry. Saving manually is only necessary when modifying fields explicitly. Functions like batchmark() internally save the registry when required.

    reg$cluster.functions = cf
    saveRegistry(reg = reg)
    [1] TRUE

    The jobs are submitted to the scheduler via batchtools::submitJobs(). The most important arguments of this function besides the registry are:

    • ids, which are either a vector of job IDs to submit, or a data frame with columns job.id and chunk, which allows grouping multiple jobs into one larger computational job and thereby controlling the granularity. This often makes sense on HPCs, as submitting and running computational jobs comes with a considerable overhead and there are often hard limits for the maximum runtime (walltime).
    • resources, which is a named list specifying the resource requirements for the submitted jobs.

    We will batchtools::chunk() the IDs in such a way that 5 iterations of one resample experiment are run sequentially in one computational job. The optimal grouping depends on the concrete experiment and scheduling system.

    ids = job_table$job.id
    chunks = data.table(
      job.id = ids, chunk = chunk(ids, chunk.size = 5, shuffle = FALSE)
    )
    chunks
         job.id chunk
      1:      1     1
      2:      2     1
      3:      3     1
      4:      4     1
      5:      5     1
    176:    176    36
    177:    177    36
    178:    178    36
    179:    179    36
    180:    180    36
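
    To see that this grouping never mixes iterations of different resample experiments, consider a toy job table. The sketch below assumes 20 hypothetical job IDs, where each block of 10 consecutive IDs belongs to one resample experiment, as is the case for the job table above. The names toy_jobs and experiment are made up for illustration.

    ```r
    library(batchtools)
    library(data.table)

    # Hypothetical job table: 2 resample experiments with 10 iterations each.
    toy_jobs = data.table(
      job.id = 1:20,
      experiment = rep(c("a", "b"), each = 10)
    )

    # With shuffle = FALSE, consecutive IDs end up in the same chunk.
    toy_jobs[, chunk := batchtools::chunk(job.id, chunk.size = 5, shuffle = FALSE)]

    # Number of distinct experiments per chunk; all ones means no chunk is mixed.
    experiments_per_chunk = toy_jobs[, uniqueN(experiment), by = chunk]$V1
    table(toy_jobs$chunk)
    ```

    This only holds because the chunk size divides the number of iterations per experiment and the IDs are ordered by experiment. For other layouts, chunking within groups of the job table may be the safer choice.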

    Furthermore, we set the number of CPUs per computational job to 1, the walltime to 1 hour (3600 seconds), and the RAM limit to 8 GB. The set of resources depends on your cluster and the corresponding template file. For a list of resource names that are standardised across most implementations, see batchtools::submitJobs(). If you are unsure about the resource requirements, you can start a subset of jobs with conservative resource constraints, e.g. the maximum runtime allowed for your computing site. Measured runtimes and memory usage can then be queried with batchtools::getJobTable() and used to better estimate the required resources for the remaining jobs.

    submitJobs(
      ids = chunks,
      resources = list(ncpus = 1, walltime = 3600, memory = 8000),
      reg = reg
    )

    # wait for all jobs to terminate
    waitForJobs(reg = reg)
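
    The measured runtimes mentioned above can be inspected once some jobs have finished. The following sketch uses a temporary registry with three toy jobs and derives a walltime estimate from the time.running column of the job table. The function slow_square(), its sleep time, and the 50% safety margin are assumptions made for illustration, not recommendations from batchtools.

    ```r
    library(batchtools)

    # Temporary toy registry, independent of the benchmark registry above.
    toy_reg = makeRegistry(file.dir = NA, seed = 1, make.default = FALSE)

    # Hypothetical job function that takes a short, measurable amount of time.
    slow_square = function(x) {
      Sys.sleep(0.2)
      x^2
    }
    batchMap(slow_square, x = 1:3, reg = toy_reg)
    submitJobs(reg = toy_reg)
    waitForJobs(reg = toy_reg)

    # time.running is a difftime; mem.used is typically NA unless memory
    # measurement was requested at submission.
    timings = getJobTable(reg = toy_reg)[, c("job.id", "time.running", "mem.used")]
    runtime_secs = as.numeric(timings$time.running, units = "secs")

    # Simple heuristic (an assumption): longest observed runtime plus 50%.
    walltime_estimate = ceiling(1.5 * max(runtime_secs))
    ```

    On a real cluster, the same query on a small pilot subset of jobs gives a first estimate for the walltime resource of the remaining jobs.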

    A good approach to submit computational jobs is to do this from an R session that is running persistently. One option is to use TMUX (Terminal Multiplexer)13 on the head node to continue job submission (or computation, depending on the cluster functions) in the background.

    10.2.5 Job Monitoring, Error Handling, and Result Collection

    After submitting the jobs, the next phase is to wait for them to finish. In case you terminated your running R session after job submission, you can load the experiment registry using loadRegistry() to continue where you left off.

    In any large-scale experiment many things can and will go wrong, even if we test our jobs beforehand using batchtools::testJob() as recommended earlier. The cluster might have an outage, jobs may run into resource limits or crash, subtle bugs in your code could be triggered, or any other error condition might arise. In these situations it is important to quickly determine what went wrong and to recompute only the minimal number of required jobs.

    The current status of the computation can be queried with getStatus(), which lists the number of jobs categorized in multiple status groups:

    getStatus(reg = reg)
    Status for 180 jobs at 2023-06-04 13:50:54:
      Submitted    : 180 (100.0%)
      -- Queued    :   0 (  0.0%)
      -- Started   : 180 (100.0%)
      ---- Running :   0 (  0.0%)
      ---- Done    : 180 (100.0%)
      ---- Error   :   0 (  0.0%)
      ---- Expired :   0 (  0.0%)

    To query the ids of jobs in the respective categories, see findJobs() and, e.g., findNotSubmitted() or findDone().

    In our case, all experiments finished and none expired or crashed, although we did take some countermeasures by extending our base learners with fallback learners and integrating them in the “robustify” pipeline. Nonetheless, it makes sense to check the logs for suspicious messages and warnings with grepLogs() before proceeding with the analysis of the results. For this purpose, we earlier set the encapsulation to "try": in contrast to "evaluate" or "callr" encapsulation, "try" does not capture messages, warnings, or errors and store them in the learner’s log. Instead, all output is printed to the console and redirected to a log file, so that batchtools can operate on it with functions like getLog() or grepLogs().
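
    How such a log inspection might look can be sketched with a toy registry. The job function noisy() below is hypothetical: it emits a message for one input and fails for another, so that grepLogs(), getLog(), and getErrorMessages() have something to find.

    ```r
    library(batchtools)

    # Temporary toy registry, independent of the benchmark registry above.
    toy_reg = makeRegistry(file.dir = NA, seed = 1, make.default = FALSE)

    # Hypothetical job function: job 2 prints a message, job 3 fails.
    noisy = function(x) {
      if (x == 2) message("suspicious value: ", x)
      if (x == 3) stop("cannot handle value: ", x)
      x
    }
    batchMap(noisy, x = 1:3, reg = toy_reg)
    submitJobs(reg = toy_reg)
    waitForJobs(reg = toy_reg)

    # Jobs whose log contains the pattern, together with the matching lines.
    suspicious = grepLogs(pattern = "suspicious", reg = toy_reg)

    # Error messages of the failed jobs and the full log of one of them.
    failed = findErrors(reg = toy_reg)
    errors = getErrorMessages(failed, reg = toy_reg)
    log_job_3 = getLog(3, reg = toy_reg)
    ```

    In the benchmark registry, the same calls with reg = reg would search the logs of the real jobs, e.g. for the glm.fit warning seen during testing.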

    In the following, we extend the design with the debug learner (see Section 9.2) erring with a 50% probability. By just calling batchmark() with the new design again, the new experiments will be added to the registry on top of the existing jobs. The tasks and resamplings below are once again those from the large_design from earlier.

    extra_design = benchmark_grid(
      learners = lrns("classif.debug", error_train = 0.5),
      tasks = tasks,
      resamplings = resamplings,
      paired = TRUE
    )
    batchmark(extra_design, reg = reg)

    We can check the new state and submit the jobs which have not been submitted yet (i.e., the newly created jobs):

    getStatus(reg = reg)
    Status for 240 jobs at 2023-06-04 13:50:55:
      Submitted    : 180 ( 75.0%)
      -- Queued    :   0 (  0.0%)
      -- Started   : 180 ( 75.0%)
      ---- Running :   0 (  0.0%)
      ---- Done    : 180 ( 75.0%)
      ---- Error   :   0 (  0.0%)
      ---- Expired :   0 (  0.0%)
    ids = findNotSubmitted(reg = reg)

    We queue these jobs as usual by passing their IDs to submitJobs():

    submitJobs(ids, reg = reg)
    waitForJobs(reg = reg)

    After these jobs have terminated, we can get a summary of those that failed:

    error_ids = findErrors(reg = reg)
    summarizeExperiments(
      error_ids, by = c("task_id", "learner_id"), reg = reg)
               task_id    learner_id .count
    1:        kr-vs-kp classif.debug      6
    2:        breast-w classif.debug      3
    3: credit-approval classif.debug      5
    4:        credit-g classif.debug      6
    5:        diabetes classif.debug      5
    6:        spambase classif.debug      2

    Unsurprisingly, all failed jobs have one thing in common: the debug learner.

    Finally, it is time to collect the experiment output. The results for benchmark experiments defined with batchmark() can be collected with reduceResultsBatchmark(), which constructs a regular BenchmarkResult. It is also possible to retrieve single results with loadResult(), but the returned values are hard to work with, as they are not optimized for usability, but efficiency. Here, we only collect results which were not produced by the debug learner.

    ids = findExperiments(
      algo.pars = learner_id != "classif.debug", reg = reg)
    bmr = reduceResultsBatchmark(ids, reg = reg)
    bmr$aggregate()
        nr         task_id  learner_id resampling_id iters classif.ce
     1:  1        kr-vs-kp      logreg        custom    10 0.02566027
     2:  2        kr-vs-kp      ranger        custom    10 0.01377253
     3:  3        kr-vs-kp featureless        custom    10 0.47778409
     4:  4        breast-w      logreg        custom    10 0.03577640
     5:  5        breast-w      ranger        custom    10 0.02863354
     6:  6        breast-w featureless        custom    10 0.34478261
     7:  7 credit-approval      logreg        custom    10 0.14782609
     8:  8 credit-approval      ranger        custom    10 0.12318841
     9:  9 credit-approval featureless        custom    10 0.44492754
    10: 10        credit-g      logreg        custom    10 0.24700000
    11: 11        credit-g      ranger        custom    10 0.23500000
    12: 12        credit-g featureless        custom    10 0.30000000
    13: 13        diabetes      logreg        custom    10 0.22404306
    14: 14        diabetes      ranger        custom    10 0.23180109
    15: 15        diabetes featureless        custom    10 0.34894053
    16: 16        spambase      logreg        custom    10 0.07215647
    17: 17        spambase      ranger        custom    10 0.04780911
    18: 18        spambase featureless        custom    10 0.39404461
    Hidden columns: resample_result

    10.2.6 Custom Experiment Definition

    This section covers advanced ML or technical details that can be skipped.

    While mlr3batchmark excels at conducting benchmarks on HPCs, there can be situations in which more fine-grained control over the experiment definition is beneficial or even required. Here, we will show how to define batchtools jobs that execute an mlr3 benchmark experiment without the help of mlr3batchmark. There is no single way to achieve this, and we show a solution that sacrifices efficiency for simplicity. Unless you have a specific reason to customize your experiment definition, we recommend using mlr3batchmark. Like before, the first step is to create an experiment registry.

    reg = makeExperimentRegistry(
      file.dir = "./experiments-custom",
      seed = 1,
      packages = "mlr3verse"
    )
    No readable configuration file found
    Created registry in '/home/runner/work/mlr3book/mlr3book/book/experiments-custom' using cluster functions 'Interactive'

    We can register a problem by calling batchtools::addProblem(), whose main arguments besides the registry are:

    • name to uniquely identify the problem,
    • data to represent the static data part of a problem, and
    • fun, which takes in the problem data, the problem parameters, and the job definition (see batchtools::makeJob()), and returns a problem instance.

    We register all task-resampling combinations of the large_design using the task ID as the name (note that the mlr3 task ID is not the same as the OpenML task ID). The problem fun takes in the static problem data and returns it as the problem instance as is. If we were using problem parameters, we could modify the problem instance depending on their values. In the code below, recall that the tasks and resamplings were originally used to define the large_design.

    for (i in seq_along(tasks)) {
      addProblem(
        name = tasks[[i]]$id,
        data = list(task = tasks[[i]], resampling = resamplings[[i]]),
        fun = function(data, job, ...) data,
        reg = reg
      )
    }

    When calling batchtools::addProblem(), not only is the problem added to the registry object from the active R session, but this information is also synced with the registry folder.

    The next step is to register the algorithm we want to run, which we achieve by calling batchtools::addAlgorithm(). Besides the registry, it takes in the arguments:

    • name to uniquely identify the algorithm and
    • fun, which takes in the problem instance, the algorithm parameters, and the job definition. It defines the computational steps of an experiment and its return value is the experiment result, i.e. what can later be retrieved using loadResult().

    The algorithm function receives a list containing the task and resampling as the problem instance, the learner as the algorithm parameter, and the job object. It then executes the resample experiment defined by these three objects using resample() and returns a ResampleResult. This differs from mlr3batchmark, where one resampling iteration corresponds to one job in batchtools. Here, one batchtools job represents a complete resample experiment.

    addAlgorithm(
      name = "run_learner",
      fun = function(instance, learner, job, ...) {
        resample(instance$task, learner, instance$resampling)
      },
      reg = reg
    )
    Adding algorithm 'run_learner'
    [1] "run_learner"

    As we have now defined the problems and the algorithm, we can define concrete experiments using batchtools::addExperiments(). This function has arguments

    • prob.designs, a named list of data frames. The name must match the problem name while the column names correspond to parameters of the problem.

    • algo.designs, a named list of data frames. The name must match the algorithm name while the column names correspond to parameters of the algorithm.

    In the code below, we add all resample experiments for the six tasks. By leaving prob.designs unspecified, experiments for all existing problems are created by default. We set the algorithm parameters to all possible learners, i.e. the logistic regression, random forest, and featureless learner from large_design. Note that whenever an experiment is added, the current seed is assigned to the experiment and then incremented.

    algorithm_design = list(run_learner = data.table(learner = learners))
    addExperiments(algo.designs = algorithm_design, reg = reg)

    We confirm that the algorithm, problems, and experiments (jobs) were added successfully.

    summarizeExperiments(reg = reg)
               problem   algorithm .count
    1:        kr-vs-kp run_learner      3
    2:        breast-w run_learner      3
    3: credit-approval run_learner      3
    4:        credit-g run_learner      3
    5:        diabetes run_learner      3
    6:        spambase run_learner      3

    Figure 10.2 summarizes the interplay between the batchtools problems, algorithms, and experiments.

    A problem consists of a static data part and applies the problem function to this data part (and potentially problem parameters) to return a problem instance. The algorithm function takes in a problem instance (and potentially algorithm parameters), executes one job and returns its result.

    Figure 10.2: Illustration of the batchtools problem, algorithm, and experiment.
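
    The role of problem and algorithm parameters can be illustrated without mlr3. The following toy sketch is an assumption-laden example and not part of the benchmark: the problem "subsample" returns the first n values of mtcars$mpg as its instance (n is a problem parameter), and the algorithm "summarize" applies a statistic chosen by the algorithm parameter stat. All names are hypothetical.

    ```r
    library(batchtools)
    library(data.table)

    toy_reg = makeExperimentRegistry(file.dir = NA, seed = 1, make.default = FALSE)

    # Problem: static data plus the problem parameter `n`, which controls
    # how many values the problem instance contains.
    addProblem(
      name = "subsample",
      data = mtcars$mpg,
      fun = function(data, job, n, ...) data[seq_len(n)],
      reg = toy_reg
    )

    # Algorithm: the algorithm parameter `stat` selects the computation.
    addAlgorithm(
      name = "summarize",
      fun = function(job, data, instance, stat, ...) {
        value = switch(stat, mean = mean(instance), median = median(instance))
        list(value = value)
      },
      reg = toy_reg
    )

    # Crossing 2 problem settings with 2 algorithm settings yields 4 jobs.
    addExperiments(
      prob.designs = list(subsample = data.table(n = c(10, 20))),
      algo.designs = list(summarize = data.table(stat = c("mean", "median"))),
      reg = toy_reg
    )
    submitJobs(reg = toy_reg)
    waitForJobs(reg = toy_reg)

    # Join the job parameters with the results via the job ID.
    pars = unwrap(getJobPars(reg = toy_reg))
    res = unwrap(reduceResultsDataTable(reg = toy_reg))
    tab = merge(pars, res, by = "job.id")
    tab
    ```

    In the mlr3 setting above, the learner plays the role of stat, while the problem has no parameters and simply passes the task and resampling through.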

    We are now ready to submit the jobs to the cluster. By specifying no job IDs, all experiments are submitted as independent jobs, i.e. one computational job executes one resample experiment.

    submitJobs(reg = reg)
    waitForJobs(reg = reg)

    When a cluster job finishes, it stores its return value in the registry folder. We can retrieve the job results using batchtools::loadResult(). It outputs the objects returned by the algorithm function, which in our case is a ResampleResult.

    rr = loadResult(1, reg = reg)
    rr
    <ResampleResult> with 10 resampling iterations
      task_id learner_id resampling_id iteration warnings errors
     kr-vs-kp     logreg        custom         1        0      0
     kr-vs-kp     logreg        custom         2        0      0
     kr-vs-kp     logreg        custom         3        0      0
     kr-vs-kp     logreg        custom         4        0      0
     kr-vs-kp     logreg        custom         5        0      0
     kr-vs-kp     logreg        custom         6        0      0
     kr-vs-kp     logreg        custom         7        0      0
     kr-vs-kp     logreg        custom         8        0      0
     kr-vs-kp     logreg        custom         9        0      0
     kr-vs-kp     logreg        custom        10        0      0

    In order to use mlr3’s post-processing tools, we need to convert all results into a BenchmarkResult. We can do this by combining all resample results using batchtools::reduceResults().

    bmr = reduceResults(c, reg = reg)
    bmr$aggregate()
        nr         task_id  learner_id resampling_id iters classif.ce
     1:  1        kr-vs-kp      logreg        custom    10 0.02566027
     2:  2        kr-vs-kp      ranger        custom    10 0.01283307
     3:  3        kr-vs-kp featureless        custom    10 0.47778409
     4:  4        breast-w      logreg        custom    10 0.03577640
     5:  5        breast-w      ranger        custom    10 0.02861284
     6:  6        breast-w featureless        custom    10 0.34478261
     7:  7 credit-approval      logreg        custom    10 0.14637681
     8:  8 credit-approval      ranger        custom    10 0.11739130
     9:  9 credit-approval featureless        custom    10 0.44492754
    10: 10        credit-g      logreg        custom    10 0.24700000
    11: 11        credit-g      ranger        custom    10 0.23500000
    12: 12        credit-g featureless        custom    10 0.30000000
    13: 13        diabetes      logreg        custom    10 0.22404306
    14: 14        diabetes      ranger        custom    10 0.23439850
    15: 15        diabetes featureless        custom    10 0.34894053
    16: 16        spambase      logreg        custom    10 0.07215647
    17: 17        spambase      ranger        custom    10 0.04715788
    18: 18        spambase featureless        custom    10 0.39404461
    Hidden columns: resample_result

    While we took a different route than in Section 10.2.3 to define the experiments and ran them at a different granularity, we arrived at the same result.

    10.3 Statistical Analysis

    Once we have successfully executed the benchmark experiment, we can proceed with its analysis. The package mlr3benchmark provides infrastructure for applying statistical significance tests on BenchmarkResult objects. Currently, Friedman tests and pairwise Friedman-Nemenyi tests (Demšar 2006) are supported to analyze benchmark experiments with at least two independent tasks and at least two learners. Before we can use these methods, we have to convert the benchmark result to an mlr3benchmark::BenchmarkAggr using as_benchmark_aggr(). We can then perform a pairwise comparison using $friedman_posthoc(). This method will first perform a global Friedman test and only conduct the post-hoc tests if the former is significant.

    library(mlr3benchmark)
    bma = as_benchmark_aggr(bmr, measures = msr("classif.ce"))
    bma$friedman_posthoc()
        Pairwise comparisons using Nemenyi-Wilcoxon-Wilcox all-pairs test for a two-way balanced complete block design
    data: ce and learner_id and task_id
                logreg ranger
    ranger      0.4804 -     
    featureless 0.1072 0.0043
    P value adjustment method: single-step

    These results would indicate a statistically significant difference between the "featureless" learner and "ranger", assuming a 95% confidence level. This table can be summarized in a critical difference plot, which typically shows the mean rank of a learning algorithm on the x-axis along with a thick horizontal line that connects learners which are pairwise not significantly different (while correcting for multiple tests):

    autoplot(bma, type = "cd")

    While our experiment did not show a significant difference between the random forest and the logistic regression (they are connected in the plot), the former has a lower rank on average. This is in line with the large benchmark study conducted by Couronné, Probst, and Boulesteix (2018), where the random forest outperformed the logistic regression in 69% of 243 real world datasets.
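
    The mean ranks and the global Friedman test can also be reproduced by hand. The sketch below uses only base R and the aggregated classification errors printed at the end of Section 10.2.6, copied into a matrix with tasks as rows and learners as columns. It is meant as an illustration of the idea and not as a replacement for mlr3benchmark, which additionally runs the post-hoc tests.

    ```r
    # Aggregated classification errors copied from the benchmark result above
    # (rows: tasks, columns: learners).
    ce = matrix(
      c(0.02566027, 0.01283307, 0.47778409,
        0.03577640, 0.02861284, 0.34478261,
        0.14637681, 0.11739130, 0.44492754,
        0.24700000, 0.23500000, 0.30000000,
        0.22404306, 0.23439850, 0.34894053,
        0.07215647, 0.04715788, 0.39404461),
      ncol = 3, byrow = TRUE,
      dimnames = list(
        c("kr-vs-kp", "breast-w", "credit-approval",
          "credit-g", "diabetes", "spambase"),
        c("logreg", "ranger", "featureless")
      )
    )

    # Rank the learners within each task (rank 1 = lowest error), then
    # average the ranks over the tasks.
    ranks = t(apply(ce, 1, rank))
    mean_ranks = colMeans(ranks)
    mean_ranks

    # Global Friedman test with tasks as blocks and learners as groups.
    ft = friedman.test(ce)
    ft
    ```

    The random forest is ranked first on five of the six tasks and second on diabetes, which explains why its mean rank is lower than that of the logistic regression even though the pairwise test does not separate the two.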

    As a final note, it is important to be careful when interpreting such test results. Because our datasets are not an iid sample from a population of datasets, we can only make inference about the data generating processes at hand, i.e. those that generated the datasets we used in the benchmark.

    10.4 Conclusion

    In this chapter, we have explored how to conduct large-scale machine learning experiments using mlr3. We have shown how to acquire diverse datasets from OpenML through the mlr3oml interface package. Furthermore, we have learned how to execute large-scale experiments using the batchtools package and its mlr3batchmark integration. Finally, we have demonstrated how to analyze the results using the mlr3benchmark package, thereby extracting meaningful insights from the experiments.

    The most important functions and classes we learned about are in Table 10.1 alongside their R6 classes (if applicable).

    Table 10.1: Core functions for large-scale benchmarking in mlr3 with the underlying R6 classes that are constructed when these functions are called (if applicable) and a summary of the purpose of the functions.

    S3 function               R6 Class       Summary
    odt()                     OMLData        Retrieve an OpenML Dataset
    otsk()                    OMLTask        Retrieve an OpenML Task
    ocl()                     OMLCollection  Retrieve an OpenML Collection
    list_oml_data()           -              Filter OpenML Datasets
    list_oml_tasks()          -              Filter OpenML Tasks
    makeExperimentRegistry()  -              Create a new registry
    loadRegistry()            -              Load an existing registry
    saveRegistry()            -              Save an existing registry
    batchmark()               -              Register problems, algorithms, and experiments from a design
    addProblem()              -              Register a new problem
    addAlgorithm()            -              Register a new algorithm
    addExperiments()          -              Register experiments using existing algorithms and problems
    submitJobs()              -              Submit jobs to the scheduler
    getJobTable()             -              Get an overview of all job definitions
    unwrap()                  -              Unpack the list columns of a job table
    getStatus()               -              Get the status of the computation
    reduceResultsBatchmark()  -              Load finished jobs as a benchmark result
    reduceResults()           -              Combine experiment results
    findExperiments()         -              Find specific experiments
    grepLogs()                -              Search the log files
    summarizeExperiments()    -              Summarize defined experiments
    getLog()                  -              Get a specific log file
    findErrors()              -              Find ids of failed jobs


    10.5 Exercises

    In this exercise we will conduct an empirical study that compares two machine learning algorithms. Our null hypothesis is that a single regression tree performs as well as a random forest.

    Getting Data from OpenML

    1. Load the OpenML collection with ID 269 (https://www.openml.org/search?type=study&study_type=task&id=269). It contains regression tasks from the AutoML benchmark (Gijsbers et al. 2022).
    2. Find all tasks with fewer than 4000 observations and convert them to mlr3 tasks.
    3. Create an experimental design that compares the random forest in ranger with the regression tree from rpart on those tasks. You can use 3-fold cross-validation instead of the OpenML resamplings to save time.

    Executing the Experiments using batchtools

    1. Create a registry and populate it with the experiments.
    2. (Optional) Change the cluster function to either “Socket” or “Multicore” (the latter does not work on Windows).
    3. Submit the jobs and once they are finished, collect the results.

    Analyzing the Results

    1. Conduct a global Friedman test and interpret the results. As an evaluation measure, use the mean squared error.
    2. Inspect the ranks of the results.