In the field of machine learning, benchmark experiments are used to evaluate and compare the performance of algorithms. Conducting such experiments involves evaluating algorithms on a large suite of diverse datasets, which can, e.g., be acquired through collaborative machine learning platforms such as OpenML. The first section of this chapter shows how to work with OpenML using the interface package mlr3oml. However, running large-scale experiments requires not only datasets but also significant computational resources, and it can be beneficial to leverage High-Performance Computing (HPC) clusters to speed up the experiment execution. In the second section, we will show how the R package batchtools and its mlr3 integration mlr3batchmark can considerably simplify the process of working on HPC clusters. Once the experiments are complete, the mlr3benchmark package provides post-hoc statistical analyses to draw conclusions from the results.
In the world of machine learning, there are many methods that are hard to evaluate using mathematical analysis alone. Even if formal analysis is successful, it is often an open question whether real-world datasets satisfy the necessary assumptions for the theorems to hold. Consequently, researchers resort to conducting benchmark experiments to answer fundamental scientific questions. Such studies evaluate different algorithms on a wide range of datasets, with the aim of identifying which method performs best. These empirical investigations are essential for understanding the capabilities and limitations of existing methods and for developing new and improved approaches. To carry them out effectively, researchers need access to datasets spanning a wide range of domains and problem types. This is because one can only draw conclusions about the kind of datasets on which the benchmark study was conducted. Fortunately, there are several online resources available for acquiring such datasets, with OpenML (Vanschoren et al. 2013) being a popular choice. In the first part of this chapter, we will show how to use OpenML through the connector package mlr3oml.
However, evaluating algorithms on such a large scale requires significant computational resources. For this reason, researchers often utilize High-Performance Computing (HPC) clusters, which allow experiments to be executed in a massively parallel fashion. The R package batchtools (Lang, Bischl, and Surmann 2017) is a tool for managing and executing experiments on such clusters. In the second section we will show how to use batchtools and its mlr3 connector for benchmarking, mlr3batchmark. Once the experiments are complete, visualizations and statistical tests are used to draw conclusions from the benchmark experiments using the mlr3benchmark package.
A common design for benchmarking is to compare a set of Learners \(L_1, \ldots, L_n\) by evaluating their performance on a set of tasks \(T_1, \ldots, T_k\) which each have an associated resampling \(R_1, \ldots, R_k\). We call such a task-resampling combination \(C_j = (T_j, R_j)\) a resampled task. Furthermore, we define \(M = (M_1, \ldots, M_m)\) to be the evaluation measures. We here focus on the most frequent case of a full experimental grid design, where each experiment is defined by a triple \(E_{i, j} = (L_i, C_j, M)\), which evaluates a learner \(L_i\) on a resampled task \(C_j\) using performance measures \(M\). Running the benchmark consists of evaluating all these triples. The execution of a single resample experiment can again be subdivided into computationally independent resampling iterations to achieve a finer granularity for parallelization (see Section 9.1.2).
As a guiding example throughout this chapter, we will compare the random forest implementation mlr_learners_classif.ranger with the logistic regression learner mlr_learners_classif.log_reg. This question holds significant practical importance and has already been studied by, e.g., Couronné, Probst, and Boulesteix (2018).
The following example compares these two learners using a simple holdout resampling on three classification tasks that ship with mlr3. Chapter 3 already covered how to conduct such benchmark experiments, and we recommend revisiting that chapter if anything is unclear. The metric of choice is the classification accuracy. Note that we “robustify” both learners to work on a wide range of tasks using the respective preprocessing pipeline, cf. Chapter 6. The ppl("robustify") creates a preprocessing pipeline, i.e. a predefined Graph consisting of common operations such as missing value imputation or feature encoding. By specifying the learner during construction of the pipeline, some preprocessing operations can be omitted, e.g. when a learner can handle missing values itself. Moreover, we specify the fallback learner to be a featureless learner (cf. Section 9.2.2) and we use "try" as encapsulation, which will be discussed later in this chapter.
# create logistic regression pipeline
learner_logreg = lrn("classif.log_reg")
learner_logreg = as_learner(
  ppl("robustify", learner = learner_logreg) %>>% learner_logreg)
learner_logreg$id = "logreg"
learner_logreg$fallback = lrn("classif.featureless")
learner_logreg$encapsulate = c(train = "try", predict = "try")

# create random forest pipeline
learner_ranger = lrn("classif.ranger")
learner_ranger = as_learner(
  ppl("robustify", learner = learner_ranger) %>>% learner_ranger)
learner_ranger$id = "ranger"
learner_ranger$fallback = lrn("classif.featureless")
learner_ranger$encapsulate = c(train = "try", predict = "try")

# create full grid design with holdout resampling
set.seed(123)
design = benchmark_grid(
  tsks(c("german_credit", "sonar", "spam")),
  list(learner_logreg, learner_ranger),
  rsmp("holdout")
)

# run the benchmark
bmr = benchmark(design)

# retrieve results
acc = bmr$aggregate(msr("classif.acc"))
acc[, .(task_id, learner_id, classif.acc)]
Looking only at the aggregated performance measures, the random forest outperforms the logistic regression on all three datasets. However, this analysis is not conclusive because
only three tasks are considered,
only a single holdout resampling is used, and
the difference between performances might not be statistically significant.
In the subsequent sections, we will show the steps required to scale up this analysis using the tools mentioned earlier.
10.1 Getting Data with OpenML
In order to be able to draw meaningful conclusions from benchmark experiments, a good choice of datasets and tasks is essential. It is therefore helpful to
have convenient access to a large collection of datasets and be able to filter them for specific properties, and
be able to easily share datasets, tasks and collections with others, so that they can evaluate their methods on the same problems and thus allow cross-study comparisons.
OpenML is a platform that facilitates the sharing and dissemination of machine learning research data and satisfies these two desiderata. Like mlr3, it is free and open source. Unlike mlr3, it is not tied to a programming language and can for example also be used from its Python interface (Feurer et al. 2021) or from Java. Its goal is to make it easier for researchers to find the data they need to answer the questions they have. Its design was guided by the FAIR principles, which stand for Findability, Accessibility, Interoperability and Reusability. The purpose of these principles is to make scientific data more easily discoverable and reusable. More concretely, OpenML is a repository for storing, organising and retrieving datasets, algorithms and experimental results in a standardised way. Entities have unique identifiers and standardised (meta) data. Everything is accessible through a REST API or the web interface.
In this section we will cover some of the main features of OpenML and how to use them via the mlr3oml interface package. OpenML supports different types of objects and we will cover the following:
OpenML Dataset: (Usually tabular) data with additional metadata. The latter includes for example a description of the data and a licence. When accessed via mlr3oml, it can be converted to a mlr3::DataBackend. As most OpenML datasets also have a designated target column, they can often directly be converted to an mlr3::Task.
OpenML Task: A machine learning task, i.e. a concrete problem specification on an OpenML dataset. This includes splits into train and test sets, thereby differing from the mlr3 task definition, and corresponds to the notion of a resampled task defined in the introduction. Thus, it can be converted to both a mlr3::Task and corresponding instantiated mlr3::Resampling.
OpenML Task Collection: A container object that allows grouping tasks. This enables the creation of benchmark suites, such as the OpenML CC-18 (Bischl et al. 2021), which is a curated collection of classification tasks.
While OpenML also supports other objects such as representations of algorithms (flows) and experiment results (runs), they are not covered in this chapter. For more information about these features, we refer to the OpenML website or the documentation of the mlr3oml package.
10.1.1 Dataset
To illustrate the OpenML dataset class, we will use the dataset with ID 1590 – the well-known adult data. Such an ID can either be found by searching for objects on the OpenML website or through the REST API. This will be covered in more detail in Section 10.1.3.
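We can load the object into R using mlr3oml::odt(), which returns an object of class OMLData. A minimal sketch of this step, producing the output shown below:

library("mlr3oml")

# create the OMLData object for the dataset with ID 1590 (adult)
odata = odt(id = 1590)
odata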
<OMLData:1590:adult> (48842x15)
* Default target: class
This dataset contains information about 48842 adults – such as their age or education – and the goal is usually to predict the class variable, which indicates whether a person has an income above 50K dollars per year. The OMLData object not only contains the data itself, but comes with additional metadata that is accessible through its fields.
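For example, the licence of the dataset and the data itself can be accessed as sketched below; accessing $data triggers the download of the dataset.

# the licence under which the dataset is distributed
odata$license

# the actual tabular data (downloaded on first access)
odata$data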
When working with OpenML objects, these are downloaded in a piecemeal fashion from the OpenML server. This way you can, e.g., access the metadata from the OMLData object without loading the dataset. When accessing the $data slot in the above example, the download is automatically triggered, the data is imported into R, and the data.frame is stored in the odata object. All subsequent accesses to $data are transparently redirected to the in-memory data.frame. Additionally, many objects can be permanently cached on the local file system. This caching can be enabled by setting the option mlr3oml.cache to either TRUE or a specific path to be used as the cache folder.
After we have loaded the data of interest into R, the next step is to convert it into a format usable with mlr3. The class that comes closest to the OpenML dataset is the mlr3::DataBackend (see Section 9.4) and it is possible to convert OMLData objects by calling as_data_backend(). This is a recurring theme throughout this section: OpenML and mlr3 objects are well interoperable.
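A minimal sketch of this conversion, continuing with the odata object from above; from the resulting backend, a classification task can then be constructed by naming the target column.

# convert the OpenML dataset to a DataBackend
backend = as_data_backend(odata)

# construct a classification task from the backend, using "class" as the target
task = as_task_classif(backend, target = "class")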
Some datasets on OpenML contain columns that should neither be used as a feature nor a target. The column names that are usually included as features are accessible through the field $feature_names and we assign them to the mlr3 task accordingly. Note that for the dataset at hand this would not have been necessary, as all non-target columns are to be treated as predictors, but we include it for clarity.
task$col_roles$feature = odata$feature_names
task
<TaskClassif:backend> (48842 x 15)
* Target: class
* Properties: twoclass
* Features (14):
- fct (8): education, marital.status, native.country, occupation,
race, relationship, sex, workclass
- int (6): age, capital.gain, capital.loss, education.num, fnlwgt,
hours.per.week
Alternatively, as the OpenML adult dataset comes with a default target, you can also directly convert it to a task with the appropriate type. This will also set the features of the task appropriately.
task = as_task(odata)
10.1.2 Task
OpenML tasks are built on top of OpenML datasets and additionally specify the target variable, the train-test splits to use for resampling, and more. Similar to mlr3, OpenML has different types of tasks, such as regression or classification. A task associated with the adult data from earlier has ID 359983. We can load the object using the mlr3oml::otsk() function, which returns an OMLTask object. Note that this task object is different from an mlr3 Task and cannot directly be used for machine learning. However, it contains all required information and there is a convenient converter, as shown below.
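A minimal sketch of loading this OpenML task:

# download the OpenML task with ID 359983, which is based on the adult data
otask = otsk(id = 359983)
otask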
The OMLData object associated with the underlying dataset can be accessed through the $data field.
otask$data
<OMLData:1590:adult> (48842x15)
* Default target: class
The data splits associated with the estimation procedure are accessible through the field $task_splits. In mlr3 terms, these are the instantiation of an mlr3 Resampling on a specific Task.
otask$task_splits
type rowid repeat. fold
1: TRAIN 32427 0 0
2: TRAIN 13077 0 0
3: TRAIN 15902 0 0
4: TRAIN 17703 0 0
5: TRAIN 35511 0 0
---
488416: TEST 8048 0 9
488417: TEST 12667 0 9
488418: TEST 43944 0 9
488419: TEST 25263 0 9
488420: TEST 43381 0 9
The OpenML task can be converted to both an mlr3 Task and a ResamplingCustom instantiated on the task. To convert to the former we can use as_task().
task = as_task(otask)
task
<TaskClassif:adult> (48842 x 15)
* Target: class
* Properties: twoclass
* Features (14):
- fct (8): education, marital.status, native.country, occupation,
race, relationship, sex, workclass
- int (6): age, capital.gain, capital.loss, education.num, fnlwgt,
hours.per.week
The accompanying resampling can be created using as_resampling().
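A short sketch, continuing with the otask object from above:

# instantiate the custom resampling defined by the OpenML task splits
resampling = as_resampling(otask)
resampling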
As a shortcut, it is also possible to create these objects directly with the "oml" task and resampling constructors via tsk() and rsmp(), passing the data_id or task_id to query, e.g. tsk("oml", task_id = 359983).
10.1.3 Filtering of Data and Tasks
Besides working with objects with known IDs, another important question is how to find the IDs of relevant datasets or tasks. Because objects on OpenML have strict metadata, they can be filtered w.r.t. these properties. This is possible through either the website or the REST API. In addition, the website also supports targeted text queries to search for specific datasets such as the “adult” data from earlier.
The list_oml_data() function allows filtering datasets for specific properties, as sketched below. As an example, we might only be interested in comparing the random forest and the logistic regression on datasets with less than 4 features and 100 to 1000 observations. By setting number_classes to 2, we only receive datasets where the default target has two different values. To keep the output readable, we only show the first 5 results from that query.
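A sketch of such a query; the filter arguments mirror the constraints described above and limit restricts the number of returned records.

# query OpenML for small binary classification datasets
odatasets = list_oml_data(
  limit = 5,
  number_features = c(1, 4),
  number_instances = c(100, 1000),
  number_classes = 2
)
odatasets[, .(data_id, NumberOfClasses, NumberOfFeatures, NumberOfInstances)]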
Besides datasets, it is also possible to filter tasks. This can be done using list_oml_tasks() and works analogously to the previous example.
We could now start looking at the returned IDs in more detail in order to verify whether they are suitable for our purposes. This process can be tedious, as some datasets have hard-to-detect quirks to look out for. A solution to this problem is to use an existing curated task collection, which we will cover next.
10.1.4 Task Collection
We will now define our experimental design using tasks from the CC-18, which can be loaded as a task collection via mlr3oml::ocl(). If we wanted to get all tasks and resamplings, we could achieve this using the converters as_tasks() and as_resamplings(). However, as the CC-18 contains not only binary classification tasks, we use list_oml_tasks() to subset the collection further: we pass the task IDs from the CC-18 as argument task_id and request the number of classes to be 2, as sketched below.
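A sketch of this step; we keep only the first six task IDs to keep the runtime of the following examples moderate.

# load the OpenML CC-18 benchmark suite (collection ID 99)
otask_collection = ocl(id = 99)

# restrict the collection to binary classification tasks
binary_cc18 = list_oml_tasks(
  task_id = otask_collection$task_ids,
  number_classes = 2
)

# use the first six task IDs for the remaining examples
ids = binary_cc18$task_id[1:6]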
We now define the learners, tasks, and resamplings for the experiment. In addition to the random forest and the logistic regression, we also include a featureless learner as a baseline.
otasks = lapply(ids, otsk)
tasks = lapply(otasks, as_task)
resamplings = lapply(otasks, as_resampling)
learner_featureless = lrn("classif.featureless", id = "featureless")
learners = list(learner_logreg, learner_ranger, learner_featureless)
To define the design table, we use benchmark_grid() and set paired to TRUE. This option can be used in situations where each resampling is instantiated on a corresponding task (therefore the tasks and resamplings below must have the same length) and each learner should be evaluated on every resampled task.
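A sketch of constructing the design table:

# full grid design: each learner is evaluated on each resampled task
large_design = benchmark_grid(
  tasks, learners, resamplings,
  paired = TRUE
)
large_design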
10.2 Experiment Execution on HPC Clusters
Once an experimental design is finalized, the next step is to run it. Parallelizing this step is conceptually straightforward, because we are facing an embarrassingly parallel problem (see Section 9.1). Not only are the resample experiments independent, but even their individual iterations are. However, if the experiment is large, parallelization on a local machine as shown in Section 9.1 is often not enough, and using a distributed computing system, such as an HPC cluster, is required. While access to HPC clusters is widespread at universities and research-driven companies, the effort required to work on these systems is still considerable. The R package batchtools provides a framework to simplify running large batches of computational experiments in parallel from R. It is highly flexible, making it suitable for a wide range of computational experiments, including machine learning, optimisation, simulation, and more.
Note
In Section 9.1.2 we have already touched upon different parallelization backends. The package future also comes with a plan for batchtools. However, for larger experiments, the additional control over the execution which batchtools offers is invaluable. Therefore, we recommend future’s "batchtools" plan only for moderately sized experiments which complete within a couple of hours. To estimate the total runtime, a subset of the benchmark is usually executed and measured, and then the runtime is extrapolated.
10.2.1 HPC Basics
An HPC cluster is a collection of interconnected computers or servers providing computational power beyond what a single computer can achieve. They are used for solving complex problems in chemistry, physics, engineering, machine learning and other fields that require a large amount of computational resources. An HPC cluster typically consists of multiple compute nodes, each with multiple CPU/GPU cores, memory, and local storage. These nodes are connected together by a high-speed network and network file system which enables the nodes to communicate and work together on a given task. These clusters are also designed to run parallel applications that can be split into smaller tasks and distributed across multiple compute nodes to be executed simultaneously. We will leverage this capacity to parallelize the execution of the benchmark experiment. The most important difference between such a cluster and a personal computer (PC) is that the nodes cannot be accessed directly, but instead computational jobs are queued by a scheduling system like Slurm (Simple Linux Utility for Resource Management). A scheduling system is a software tool that orchestrates the allocation of computing resources to users or applications on the cluster. It ensures that multiple users and applications can access the resources of the cluster in a fair and efficient manner, and also helps to maximize the utilization of the computing resources.
Figure 10.1 contains a rough sketch of an HPC architecture. Multiple users can log in to the head node (typically via SSH) and add their computational workloads of the form “Execute Computation X using Resources Y for Z amount of time” to the queue. One such instruction will be referred to as a computational job. The scheduling system controls when these computational jobs are executed.
Figure 10.1: Illustration of an HPC cluster architecture.
Common challenges when interoperating with a scheduling system are that
the code to submit a job depends on the scheduling system,
code that runs locally has to be adapted to run on the cluster,
the burden is on the user to account for seeding to ensure reproducibility, and
it is cumbersome to query the status of jobs and debug failures.
In the following we will see how batchtools mitigates these problems.
10.2.2 General Setup and Experiment Registry
Our goal in this section is to run the benchmark design large_design shown in Section 10.1.4 on an HPC cluster. We use the packages batchtools and mlr3batchmark for this. The mlr3batchmark package assists in translating the machine learning problem defined with syntax and objects from mlr3 to the more general “apply some algorithm A on some problem P” approach in batchtools, which is briefly outlined in Section 10.2.6.
The central concept of batchtools is the experiment or job: One replication of an experiment (or job) is defined by applying a (parameterized) algorithm to a (parameterized) problem. A benchmark experiment in batchtools then consists of running many such experiments with different algorithms, algorithm parameters, problems, and problem parameters. Each such experiment is computationally independent of all other experiments and constitutes the basic level of computation batchtools can parallelize.
In the introduction of this chapter, we have defined a benchmark experiment as evaluating a number of triples \(E_{i, j} = (L_i, C_j, M)\) that are defined by a learner \(L_i\), a resampled task \(C_j = (T_j, R_j)\), and measures \(M\). While it might seem natural to define one such triple as a job, each resampled task \(C_j\) can be split up even further, namely into its resampling iterations \(C^1_j, \ldots, C^{n_j}_j\), where \(n_j\) is the number of iterations of resampling \(R_j\). This makes the computation more granular and gives you the opportunity to utilize more CPUs in parallel (cf. Section 9.1). Moreover, because the evaluation of the measures \(M\) is (usually) computationally negligible, it is common to only parallelize the execution of the resample experiments and evaluate the measures \(M\) afterwards. For these reasons, we will understand one pair \((L_i, C^l_j)\) as a single batchtools experiment. The custom definition of problems and algorithms, which yields a different granularity, is demonstrated in Section 10.2.6.
Note that such a batchtools experiment does not have to coincide with the notion of a computational job defined earlier; more on that in Section 10.2.4.
The first step is always to create a new experiment registry (or load an existing one) using the function batchtools::makeExperimentRegistry() (or batchtools::loadRegistry(), respectively). This function constructs the inter-communication object for all functions in batchtools and corresponds to a folder on the file system. Among other things, the experiment registry stores the
algorithms, problems, and job definitions
log outputs and status of submitted, running, and finished jobs
job results
cluster function, which defines the interaction with the scheduling system
While the first three bullet points should be relatively clear, the fourth needs some explanation. By configuring the scheduling system through a cluster function, it is possible to make the interaction with the scheduling system independent of the scheduling software. We will come back to this later and show how to change it to work on a Slurm cluster.
We create a registry in a subdirectory of our working directory - on a real cluster, make sure that this folder is stored on a shared network filesystem, otherwise the nodes cannot access it. Furthermore, we set the registry’s seed to 1 and the packages to mlr3verse, which will make the package available in all our experiments.
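A minimal sketch of the registry creation, assuming the subdirectory "experiments" (matching the output shown next):

library("batchtools")

# create the experiment registry; the folder must not exist yet
reg = makeExperimentRegistry(
  file.dir = "experiments",
  seed = 1,
  packages = "mlr3verse"
)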
Created registry in '/home/runner/work/mlr3book/mlr3book/book/experiments' using cluster functions 'Interactive'
When printing our newly created registry, we see that there are 0 problems, algorithms or jobs registered. Among other things, we are informed that the “Interactive” cluster function (see batchtools::makeClusterFunctionsInteractive()) is used and about the working directory for the experiments.
reg
Experiment Registry
Backend : Interactive
File dir : /home/runner/work/mlr3book/mlr3book/book/experiments
Work dir : /home/runner/work/mlr3book/mlr3book/book
Jobs : 0
Problems : 0
Algorithms: 0
Seed : 1
Writeable : TRUE
10.2.3 Experiment Definition using mlr3batchmark
The next step is to populate the registry with problems and algorithms, which we will then use to define the jobs, i.e., the resampling iterations. This is the first step where mlr3batchmark comes into play. Doing this step with plain batchtools is also possible, gives you more flexibility, and is demonstrated in Section 10.2.6. By calling batchmark(), mlr3 tasks and mlr3 resamplings are translated to batchtools problems, and mlr3 learners are mapped to batchtools algorithms. Then, jobs for all resampling iterations are created.
All batchtools functions that interoperate with a registry take a registry as an argument. By default, this argument is set to the last created registry, which is currently the reg object defined earlier. We nonetheless pass it explicitly in this section for clarity, but would not have to do so.
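A minimal sketch of populating the registry with the design from Section 10.1.4:

library("mlr3batchmark")

# translate the mlr3 design into batchtools problems, algorithms, and jobs
batchmark(large_design, reg = reg)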
When printing the registry, we confirm that six problems (one for each resampled task) but only a single algorithm is registered. While a one-to-one mapping of the 3 algorithms in the design would have also been possible, the approach in mlr3batchmark instead uses a single algorithm parametrized with the learner identifiers for efficiency. Furthermore, \(180 = 3 \times 6 \times 10\) jobs, i.e. one for each resampling iteration, are registered.
reg
Experiment Registry
Backend : Interactive
File dir : /home/runner/work/mlr3book/mlr3book/book/experiments
Work dir : /home/runner/work/mlr3book/mlr3book/book
Jobs : 180
Problems : 6
Algorithms: 1
Seed : 1
Writeable : TRUE
We can summarize the defined experiments using batchtools::summarizeExperiments(). There are 10 jobs for each combination of a learner and resampled task, as 10-fold cross-validation is used as the resampling procedure.
The function batchtools::getJobTable() can be used to get more detailed information about the jobs. Here, we only show a few selected columns for readability and unpack the list columns algo.pars and prob.pars using batchtools::unwrap(). Among other things, we see that each job has a unique job.id. Each row in this job table represents one iteration (column repl) of a resample experiment.
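The following sketch shows both queries; for readability, we only select a few standard columns of the job table.

# count experiments per problem/algorithm combination
summarizeExperiments(reg = reg)

# detailed job table; unwrap() flattens the list columns prob.pars and algo.pars
job_table = unwrap(getJobTable(reg = reg))
job_table[1:3, .(job.id, problem, algorithm, repl)]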
Once the experiments are defined, the next step is to submit them. Before doing so, it is recommended to test each algorithm individually using batchtools::testJob(). As an example, we test the job with job.id = 1 and specify external = TRUE to run the test in an external R session. The return value of testJob() (which is the return value of the "run_learner" algorithm) is a bit technical: to reduce the communication overhead, not the complete objects but only their essential parts are returned (a named list learner_state and a named list prediction); we do not give a detailed description of the list elements here. For example, we can access the training time through the learner_state as shown below.
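A sketch of such a test run, producing the output below:

# run the first job in a separate R session to verify the setup
result = testJob(1, external = TRUE, reg = reg)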
Loading required package: mlr3verse
Loading required package: mlr3
### [bt]: Generating problem instance for problem '62fd27845aadeb67' ...
### [bt]: Applying algorithm 'run_learner' on problem '62fd27845aadeb67' for job 1 (seed = 2) ...
INFO [13:48:49.264] [mlr3] Applying learner 'logreg' on task 'kr-vs-kp' (iter 1/10)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
This happened PipeOp classif.log_reg's $train()
result$learner_state$train_time
[1] 0.721
In case something goes wrong, batchtools comes with a number of useful debugging utilities covered in Section 10.2.5.
10.2.4 Job Submission
Once we are confident that the jobs are defined correctly, we can proceed with their submission, which requires
specifying resource requirements for each computational job, and
(optionally) grouping multiple jobs into one computational job
Which resources can be configured depends on the cluster function that is set in the registry. We earlier left it at its default value, which is the “Interactive” cluster function. In the following we assume that we are working on a Slurm cluster. Accordingly, we initialize the cluster function with batchtools::makeClusterFunctionsSlurm() and a predefined slurm-simple template. A template file is a shell script with placeholders filled in by batchtools and contains
the command to start the computation via Rscript or R CMD batch, usually in the last line, and
comments which serve as annotations for the scheduler, e.g. to communicate resources or paths on the file system.
The exemplary template should work on many Slurm installations out-of-the-box, but can easily be customized to work with more advanced configurations.
It is possible to customize the cluster function. More information is available in the documentation of the batchtools package.
We update the value of the $cluster.functions and save the registry. This only has to be done manually when modifying the fields explicitly. Functions like batchmark() internally save the registry when required.
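A sketch of this step, assuming the slurm-simple template shipped with batchtools is found on the template search path:

# switch to the Slurm cluster functions using the slurm-simple template
reg$cluster.functions = makeClusterFunctionsSlurm(template = "slurm-simple")

# persist the modified registry on the file system
saveRegistry(reg = reg)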
The jobs are submitted to the scheduler via batchtools::submitJobs(). The most important arguments of this function besides the registry are:
ids, which are either a vector of job IDs to submit, or a data frame with columns job.id and chunk, which allows grouping multiple jobs into one larger computational job and thereby controlling the granularity. This often makes sense on HPCs, as submitting and running computational jobs comes with a considerable overhead and there are often hard limits for the maximum runtime (walltime).
resources, which is a named list specifying the resource requirements for the submitted jobs.
We will use batchtools::chunk() to group the job IDs in such a way that 5 iterations of one resample experiment are run sequentially in one computational job, as sketched below. The optimal grouping depends on the concrete experiment and scheduling system.
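A sketch of this chunking; chunk() assigns a chunk index to each job ID, which we combine with the IDs into the format expected by submitJobs().

library("data.table")

# all job IDs of the registry
ids = findJobs(reg = reg)$job.id

# chunks of 5 consecutive jobs; shuffle = FALSE keeps iterations of the same
# resample experiment together
chunks = data.table(
  job.id = ids,
  chunk = chunk(ids, chunk.size = 5, shuffle = FALSE)
)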
Furthermore, we specify the number of CPUs per computational job to 1, the walltime to 1 hour (3600 seconds), and the RAM limit to 8 GB. The set of resources depends on your cluster and the corresponding template file. For a list of resource names that are standardised across most implementations, see batchtools::submitJobs(). If you are unsure about the resource requirements, you can start a subset of jobs with conservative resource constraints, e.g. the maximum runtime allowed for your computing site. Measured runtimes and memory usage can be queried with batchtools::getJobTable() and finally used to better estimate the required resources for the remaining jobs.
submitJobs(
  ids = chunks,
  resources = list(ncpus = 1, walltime = 3600, memory = 8000),
  reg = reg
)

# wait for all jobs to terminate
waitForJobs(reg = reg)
Tip
A good approach to submit computational jobs is to do this from an R session that is running persistently. One option is to use TMUX (Terminal Multiplexer) on the head node to continue job submission (or computation, depending on the cluster functions) in the background.
10.2.5 Job Monitoring, Error Handling, and Result Collection
After submitting the jobs, the next phase is to wait for them to finish. In case you terminated your running R session after job submission, you can load the experiment registry using loadRegistry() to continue where you left off.
In any large scale experiment many things can and will go wrong, even if we test our jobs beforehand using batchtools::testJob() as recommended earlier. The cluster might have an outage, jobs may run into resource limits or crash, subtle bugs in your code could be triggered or any other error condition might arise. In these situations it is important to quickly determine what went wrong and to recompute only the minimal number of required jobs.
The current status of the computation can be queried with getStatus(), which lists the number of jobs categorized in multiple status groups:
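A minimal sketch:

# overview of submitted, started, done, errored, and expired jobs
getStatus(reg = reg)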
In our case, all experiments finished and none expired or crashed, although we also took countermeasures by equipping both learners with fallback learners and wrapping them in the “robustify” pipeline. Nonetheless, it makes sense to check the logs for suspicious messages and warnings with grepLogs() before proceeding with the analysis of the results. For this purpose, we set the encapsulation to "try" earlier: in contrast to "evaluate" or "callr" encapsulation, "try" does not capture the output from messages, warnings or errors and store them in the learner’s log. Instead, all output is printed to the console and redirected to a log file so that batchtools can operate on it with functions like getLog() or grepLogs().
In the following, we extend the design with the debug learner (see Section 9.2) erring with a 50% probability. By just calling batchmark() with the new design again, the new experiments will be added to the registry on top of the existing jobs. The tasks and resamplings below are once again those from the large_design from earlier.
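A sketch of this extension; the error probability is set via the debug learner's error_train parameter, and the new jobs are submitted afterwards.

# debug learner that errors during training with probability 0.5
learner_debug = lrn("classif.debug", error_train = 0.5)

# same resampled tasks as in large_design, but only the debug learner
extra_design = benchmark_grid(
  tasks, learner_debug, resamplings,
  paired = TRUE
)
batchmark(extra_design, reg = reg)

# submit only the newly added jobs
submitJobs(findNotSubmitted(reg = reg), reg = reg)
waitForJobs(reg = reg)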
Unsurprisingly, all failed jobs have one thing in common: the debug learner.
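Failed jobs and their error messages can be located as sketched below:

# IDs of jobs that terminated with an error
error_ids = findErrors(reg = reg)

# retrieve the corresponding error messages
getErrorMessages(error_ids, reg = reg)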
Finally, it is time to collect the experiment output. The results of benchmark experiments defined with batchmark() can be collected with reduceResultsBatchmark(), which constructs a regular BenchmarkResult. It is also possible to retrieve single results with loadResult(), but the returned values are harder to work with, as they are optimized for efficiency rather than usability. Below, we only collect the results which were not produced by the debug learner.
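A sketch of this step; that the learner identifier is stored as an algorithm parameter named learner_id is an assumption about how mlr3batchmark defines its jobs.

# select all experiments that do not belong to the debug learner
# (the parameter name learner_id is an assumption, see above)
keep = findExperiments(algo.pars = learner_id != "classif.debug", reg = reg)

# collect them into a regular BenchmarkResult
bmr = reduceResultsBatchmark(keep, reg = reg)
bmr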
10.2.6 Custom Experiments with batchtools
This section covers advanced ML or technical details that can be skipped.
While mlr3batchmark excels for conducting benchmarks on HPCs, there can be situations in which more fine-grained control over the experiment definition is beneficial or even required. Here, we will show how to define batchtools jobs that execute an mlr3 benchmark experiment without the help of mlr3batchmark. There is no single way to achieve this; we show a solution that sacrifices efficiency for simplicity. Unless you have a specific reason to customize your experiment definition, we recommend using mlr3batchmark. Like before, the first step is to create an experiment registry.
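A sketch of this second registry, assuming the subdirectory "experiments-custom" (matching the output below):

# a separate registry for the manually defined experiments
reg = makeExperimentRegistry(
  file.dir = "experiments-custom",
  seed = 1,
  packages = "mlr3verse"
)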
Created registry in '/home/runner/work/mlr3book/mlr3book/book/experiments-custom' using cluster functions 'Interactive'
We can register a problem by calling batchtools::addProblem(), whose main arguments beside the registry are:
name to uniquely identify the problem,
data to represent the static data part of a problem, and
fun, which takes the problem data, the problem parameters, and the job definition (see batchtools::makeJob()) and returns a problem instance.
We register all task-resampling combinations of the large_design using the task ID as the name. The problem fun takes in the static problem data and returns it as the problem instance as is. If we were using problem parameters, we could modify the problem instance depending on their values. In the code below, recall that the tasks and resamplings were originally used to define the large_design.
Note that the mlr3 task ID is not the same as the OpenML task ID.
for (i in seq_along(tasks)) {
  addProblem(
    name = tasks[[i]]$id,
    data = list(task = tasks[[i]], resampling = resamplings[[i]]),
    fun = function(data, job, ...) data,
    reg = reg
  )
}
When calling batchtools::addProblem(), not only is the problem added to the registry object from the active R session, but this information is also synced with the registry folder.
The next step is to register the algorithm we want to run, which we achieve by calling batchtools::addAlgorithm(). Besides the registry, it takes in the arguments:
name to uniquely identify the algorithm and
fun, which takes in the problem instance, the algorithm parameters, and the job definition. It defines the computational steps of an experiment and its return value is the experiment result, i.e. what can later be retrieved using loadResult().
The algorithm function receives a list containing the task and resampling as the problem instance, the learner as the algorithm parameter, and the job object. It then executes the resample experiment defined by these three objects using resample() and returns a ResampleResult. This differs from the mlr3batchmark approach, where one resampling iteration corresponds to one batchtools job; here, one batchtools job represents a complete resample experiment.
addAlgorithm("run_learner", fun =function(instance, learner, job, ...){resample(instance$task, learner, instance$resampling)}, reg =reg)
Adding algorithm 'run_learner'
reg$algorithms
[1] "run_learner"
As we have now defined the problems and the algorithm, we can define concrete experiments using batchtools::addExperiments(). This function has arguments
prob.designs, a named list of data frames. The name must match the problem name while the column names correspond to parameters of the problem.
algo.designs, a named list of data frames. The name must match the algorithm name while the column names correspond to parameters of the algorithm.
In the code below, we add the experiments for the six tasks, one per task-learner combination. By leaving prob.designs unspecified, experiments for all existing problems are created by default. We set the algorithm parameter learner to all learners from large_design, i.e. the logistic regression, random forest, and featureless learner. Note that whenever an experiment is added, the current seed is assigned to the experiment and then incremented.
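A sketch of this step; the column name learner in the algorithm design must match the corresponding argument of the algorithm function registered above.

library("data.table")

# one algorithm design row per learner, stored in a list column
algo_design = data.table(
  learner = list(learner_logreg, learner_ranger, learner_featureless)
)

# add experiments for all registered problems (prob.designs is left unspecified)
addExperiments(algo.designs = list(run_learner = algo_design), reg = reg)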
Figure 10.2 summarizes the interplay between the batchtools problems, algorithms, and experiments.
Figure 10.2: Illustration of the batchtools problem, algorithm, and experiment.
We are now ready to submit the jobs to the cluster. By specifying no job IDs, all experiments are submitted as independent jobs, i.e. one computational job executes one resample experiment.
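A minimal sketch:

# submit all experiments; each job executes one complete resample experiment
submitJobs(reg = reg)

# wait for all jobs to terminate
waitForJobs(reg = reg)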
When a cluster job finishes, it stores its return value in the registry folder. We can retrieve the job results using batchtools::loadResult(). It outputs the objects returned by the algorithm function, which in our case is a ResampleResult.
In order to use mlr3’s post-processing tools, we need to convert all results into a BenchmarkResult. We can do this by combining all resample results using batchtools::reduceResults(), as sketched below.
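A sketch of both steps; that c() combines ResampleResult objects into a BenchmarkResult is an assumption about mlr3's combine method for these objects.

# inspect the result of a single job: a ResampleResult
loadResult(1, reg = reg)

# combine all resample results into one BenchmarkResult
# (assumes c() dispatches to mlr3's combine method for ResampleResult objects)
bmr_custom = reduceResults(c, reg = reg)
bmr_custom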
While we took a different route than in Section 10.2.3 to define the experiments and ran them at a different granularity, we arrived at the same result.
10.3 Statistical Analysis
Once we have successfully executed the benchmark experiment, we can proceed with its analysis. The package mlr3benchmark provides infrastructure for applying statistical significance tests on BenchmarkResult objects. Currently, Friedman tests and pairwise Friedman-Nemenyi tests (Demšar 2006) are supported to analyze benchmark experiments with at least two independent tasks and at least two learners. Before we can use these methods, we have to convert the benchmark result to an mlr3benchmark::BenchmarkAggr using as_benchmark_aggr(). We can then perform a pairwise comparison using $friedman_posthoc(). This method first performs a global Friedman test and only conducts the post-hoc tests if the former is significant.
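A sketch of this analysis, assuming classification accuracy as the aggregated measure (as used earlier in this chapter):

library("mlr3benchmark")

# aggregate the benchmark result per task and learner
bma = as_benchmark_aggr(bmr, measures = msr("classif.acc"))

# global Friedman test followed by pairwise post-hoc comparisons
bma$friedman_posthoc()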
These results would indicate a statistically significant difference between the "featureless" learner and "ranger", assuming a 95% confidence level. This table can be summarized in a critical difference plot, which typically shows the mean rank of a learning algorithm on the x-axis along with a thick horizontal line that connects learners which are pairwise not significantly different (while correcting for multiple tests):
autoplot(bma, type ="cd")
While our experiment did not show a significant difference between the random forest and the logistic regression (they are connected in the plot), the former has a lower rank on average. This is in line with the large benchmark study conducted by Couronné, Probst, and Boulesteix (2018), where the random forest outperformed the logistic regression in 69% of 243 real-world datasets.
As a final note, it is important to be careful when interpreting such test results. Because our datasets are not an iid sample from a population of datasets, we can only make inference about the data generating processes at hand, i.e. those that generated the datasets we used in the benchmark.
10.4 Conclusion
In this chapter, we have explored how to conduct large scale machine learning experiments using mlr3. We have shown how to acquire diverse datasets from OpenML through the mlr3oml interface package. Furthermore, we have learned how to execute large-scale experiments using the batchtools package and its mlr3batchmark integration. Finally, we have demonstrated how to analyze the results using the mlr3benchmark package, thereby extracting meaningful insights from the experiments.
The most important functions and classes we learned about are in Table 10.1 alongside their R6 classes (if applicable).
Table 10.1: Core functions covered in this chapter, with the underlying R6 classes that are constructed when these functions are called (if applicable) and a summary of the purpose of the functions.
10.5 Exercises
In this exercise we will conduct an empirical study that compares two machine learning algorithms. Our null hypothesis is that a single regression tree performs as well as a random forest.
Find all OpenML tasks with fewer than 4000 observations and convert them to mlr3 tasks.
Create an experimental design that compares the random forest in ranger with the regression tree from rpart on those tasks. You can use 3-fold cross-validation instead of the OpenML resamplings to save time.
Create a registry and populate it with the experiments.
(Optional) Change the cluster function to either “Socket” or “Multicore” (the latter does not work on Windows).
Submit the jobs and once they are finished, collect the results.
Analyzing the Results
Conduct a global Friedman test and interpret the results. As an evaluation measure, use the mean squared error.
Inspect the ranks of the results.
Bischl, Bernd, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael Gomes Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. 2021. “OpenML Benchmarking Suites.” In Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=OCrD8ycKjG.
Couronné, Raphael, Philipp Probst, and Anne-Laure Boulesteix. 2018. “Random Forest Versus Logistic Regression: A Large-Scale Benchmark Experiment.” BMC Bioinformatics 19: 1–14.
Demšar, Janez. 2006. “Statistical Comparisons of Classifiers over Multiple Data Sets.” Journal of Machine Learning Research 7 (1): 1–30. https://jmlr.org/papers/v7/demsar06a.html.
Feurer, Matthias, Jan N. Van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Müller, Joaquin Vanschoren, and Frank Hutter. 2021. “OpenML-Python: An Extensible Python API for OpenML.” The Journal of Machine Learning Research 22 (1): 4573–77.
Gijsbers, Pieter, Marcos L. P. Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, and Joaquin Vanschoren. 2022. “AMLB: An AutoML Benchmark.” arXiv. https://doi.org/10.48550/ARXIV.2207.12560.
Grinsztajn, Leo, Edouard Oyallon, and Gael Varoquaux. 2022. “Why Do Tree-Based Models Still Outperform Deep Learning on Typical Tabular Data?” In Thirty-Sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=Fp7__phQszn.
Lang, Michel, Bernd Bischl, and Dirk Surmann. 2017. “batchtools: Tools for R to Work on Batch Systems.” The Journal of Open Source Software, no. 10 (February). https://doi.org/10.21105/joss.00135.
Vanschoren, Joaquin, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. “OpenML: Networked Science in Machine Learning.” SIGKDD Explorations 15 (2): 49–60. https://doi.org/10.1145/2641190.2641198.
---author: - name: Sebastian Fischer orcid: 0000-0002-9609-3197 email: sebastian.fischer@stat.uni-muenchen.de affiliations: - name: Ludwig-Maximilians-Universität Münchenabstract: In the field of machine learning, benchmark experiments are used to evaluate and compare the performance of algorithms. Conducting such experiments involves evaluating algorithms on a large suite of diverse datasets, which can, e.g., be acquired through collaborative machine learning platforms such as as [OpenML](http://www.openml.org/). The first section of this chapter shows how to work with OpenML using the interface package [`mlr3oml`](https://github.com/mlr-org/mlr3oml). However, running large-scale experiments requires not only datasets, but also significant computational resources, and it can be beneficial to leverage High-Performance Computing (HPC) clusters to speed up the experiment execution. In the second section, we will show how the R package [`batchtools`](https://github.com/mllg/batchtools) and its mlr3 integration [`mlr3batchmark`](https://github.com/mlr-org/mlr3batchmark) can considerably simplify the process of working on HPC clusters. Once the experiments are complete, statistical analysis is required to draw conclusions from the results using the [`mlr3benchmark`](https://github.com/mlr-org/mlr3benchmark) package for post-hoc analyses.---# Large-Scale Benchmark Experiments {#sec-large-benchmarking}{{< include _setup.qmd >}}```{r large_benchmarking-001}#| include: false#| cache: falselgr::get_logger("mlr3oml")$set_threshold("off")library("mlr3batchmark")library("batchtools")library("mlr3oml")if (!dir.exists(file.path("openml", "manual"))) {dir.create(file.path("openml", "manual"), recursive =TRUE)}options(mlr3oml.cache =file.path("openml", "cache"))```In the world of machine learning, there are many methods that are hard to evaluate using mathematical analysis alone.Even if formal analysis is successful, it is often an open question whether real-world datasets satisfy the necessary assumptions for the theorems to hold.Consequently, researchers resort to conducting benchmark experiments to answer fundamental scientific questions.Such studies evaluate different algorithms on a wide range of datasets, with the aim of identifying which method performs best.These empirical investigations are essential for understanding the capabilities and limitations of existing methods and for developing new and improved approaches.To carry them out effectively, researchers need access to datasets, spanning a wide range of domains and problem types.This is because one can only draw conclusions about the kind of datasets on which the benchmark study was conducted.Fortunately, there are several online resources available for acquiring such datasets, with `r link("https://www.openml.org/", "OpenML")`[@openml2013] being a popular choice.In the first part of this chapter, we will show how to use OpenML through the connector package `r ref_pkg("mlr3oml")`.However, evaluating algorithms on such a large scale requires significant computational resources.For this reason, researchers often utilize High-Performance Computing (HPC) clusters which allow executing experiments massively parallel.The R package `r ref_pkg("batchtools")`[@batchtools] is a tool for managing and executing experiments on such clusters.In the second section we will show how to use batchtools and its mlr3 connector for benchmarking: `r ref_pkg("mlr3batchmark")`.Once the experiments are complete, visualizations and statistical tests are used to draw 
conclusions from the benchmark experiments using the `r ref_pkg("mlr3benchmark")` package.A common design for benchmarking is to compare a set of Learners $L_1, \ldots, L_n$ by evaluating their performance on a set of tasks $T_1, \ldots, T_k$ which each have an associated resampling $R_1, \ldots, R_k$.We call such a task-resampling combination $C_j = (T_j, R_j)$ a `r index("resampled task")`.Furthermore, we define $M = (M_1, \ldots, M_m)$ to be the evaluation measures.We here focus on the most frequent case of a full experimental grid design, where each experiment is defined by a triple $E_{i, j} = (L_i, C_j, M)$, which evaluates a learner $L_i$ on a resampled task $C_j$ using performance measures $M$.Running the benchmark consists of evaluating all these triples.The execution of a single resample experiment can again be subdivided into computationally independent resampling iterations to achieve a finer granularity for parallelization (see @sec-parallel-resample).As a guiding example throughout this chapter, we will compare the random forest implementation `r ref("mlr_learners_classif.ranger")` with the logistic regression learner `r ref("mlr_learners_classif.log_reg")`.This question holds significant practical importance and has already been studied by, e.g., @couronne2018random.The following example compares these two learners using a simple holdout resampling on three classification tasks that ship with `r ref_pkg("mlr3")`.@sec-performance already covered how to conduct such benchmark experiments, and we recommend revisiting this chapter if anything is unclear.The metric of choice is the classification accuracy.Note that we "robustify" both learners to work on a wide range of tasks using the respective preprocessing pipeline, c.f. @sec-pipelines.The `ppl("robustify")` creates a preprocessing pipeline, i.e. a predefined `r ref("Graph")` consisting of common operations such as missing value imputation or feature encoding.By specifying the learner during construction of the pipeline, some preprocessing operations can be omitted, e.g. 
when a learner can handle missing values itself.Moreover, we specify the fallback learner to be a featureless learner (compare @sec-fallback) and we use `"try"` as encapsulation which will be discussed later in this chapter.<!-- FIXME: reference to robustify specifically when preprocessing chapter is ready -->```{r large_benchmarking-002}#| warning: false# create logistic regression pipelinelearner_logreg =lrn("classif.log_reg")learner_logreg =as_learner(ppl("robustify", learner = learner_logreg) %>>% learner_logreg)learner_logreg$id ="logreg"learner_logreg$fallback =lrn("classif.featureless")learner_logreg$encapsulate =c(train ="try", predict ="try")# create random forest pipelinelearner_ranger =lrn("classif.ranger")learner_ranger =as_learner(ppl("robustify", learner = learner_ranger) %>>% learner_ranger)learner_ranger$id ="ranger"learner_ranger$fallback =lrn("classif.featureless")learner_ranger$encapsulate =c(train ="try", predict ="try")# create full grid design with holdout resamplingset.seed(123)design =benchmark_grid(tsks(c("german_credit", "sonar", "spam")),list(learner_logreg, learner_ranger),rsmp("holdout"))# run the benchmarkbmr =benchmark(design)# retrieve resultsacc = bmr$aggregate(msr("classif.acc"))acc[, .(task_id, learner_id, classif.acc)]```Looking only at the aggregated performance measures, the random forest outperforms the logistic regression on all three datasets.However, this analysis is not conclusive because* only three tasks are considered,* only a single holdout resampling is used, and* the difference between performances might not be statistically significant.In the subsequent sections, we will show the steps required to scale up this analysis using the tools mentioned earlier.## Getting Data with OpenML {#sec-openml}In order to be able to draw meaningful conclusions from benchmark experiments, a good choice of datasets and tasks is essential.It is therefore helpful to1. have convenient access to a large collection of datasets and be able to filter them for specific properties, and1. be able to easily share datasets, tasks and collections with others, so that they can evaluate their methods on the same problems and thus allow cross-study comparisons.OpenML is a platform that facilitates the sharing and dissemination of machine learning research data and satisfies these two desiderata.Like mlr3, it is free and open source.Unlike mlr3, it is not tied to a programming language and can for example also be used from its Python interface [@feurer2021openml] or from Java.Its goal is to make it easier for researchers to find the data they need to answer the questions they have.Its design was guided by the `r link("https://www.go-fair.org/fair-principles/", "FAIR")` principles, which stand for **F**indability, **A**ccessibility, **I**nteroperability and **R**eusability.The purpose of these principles is to make scientific data more easily discoverable and reusable.More concretely, OpenML is a repository for storing, organising and retrieving datasets, algorithms and experimental results in a standardised way.Entities have unique identifiers and standardised (meta) data.Everything is accessible through a REST API or the web interface.In this section we will cover some of the main features of OpenML and how to use them via the `r ref_pkg("mlr3oml")` interface package.OpenML supports different types of objects and we will cover the following:* `r index("OpenML **Dataset**")`: (Usually tabular) data with additional metadata. 
The latter includes for example a description of the data and a licence. When accessed via `r ref_pkg("mlr3oml")`, it can be converted to a `r ref("mlr3::DataBackend")`. As most OpenML datasets also have a designated target column, they can often directly be converted to an `r ref("mlr3::Task")`.* `r index("OpenML **Task**")`: A machine learning task, i.e. a concrete problem specification on an OpenML dataset. This includes splits into train and test sets, thereby differing from the mlr3 task definition, and corresponds to the notion of a resampled task defined in the introduction. Thus, it can be converted to both a `r ref("mlr3::Task")` and corresponding instantiated `r ref("mlr3::Resampling")`.* `r index("OpenML **Task Collection**")`: A container object that allows to group tasks. This allows the creation of benchmark suites, such as the OpenML CC-18 [@bischl2021openml], which is a curated collection of classification tasks.While OpenML also supports other objects such as representations of algorithms (flows) and experiment results (runs), they are not covered in this chapter.For more information about these features, we refer to the OpenML `r link("htts://openml.org", "website")` or the documentation of the `r ref_pkg("mlr3oml")` package.### Dataset {#sec-openml-dataset}To illustrate the OpenML dataset class, we will use the dataset with `r link("https://openml.org/d/1590", "ID 1590")` -- the well-known adult data.Such an ID can either be found by searching for objects on the OpenML website or through the REST API.This will be covered in more detail in @sec-openml-filtering.We load the object into R using `r ref("mlr3oml::odt()")`, which returns an object of class `r ref("OMLData")`.```{r large_benchmarking-003}library("mlr3oml")odata =odt(id =1590)odata```This dataset contains information about `r odata$nrow` adults -- such as their age or education -- and the goal is usually to predict the *class* variable, which indicates whether a person has an income above 50K dollars per year.The `r ref("OMLData")` object not only contains the data itself, but comes with additional metadata that is accessible through its fields.```{r large_benchmarking-004}odata$license```The actual data can be accessed through `$data`.```{r large_benchmarking-005}odata$data```:::{.callout-tip}When working with OpenML objects, these are downloaded in a piecemeal fashion from the OpenML server.This way you can, e.g., access the metadata from the `OMLData` object without loading the dataset.While accessing the `$data` slot in the above example, the download is automatically triggered, imported in R, and the `data.frame` gets stored in the `odata` object.All subsequent accesses to `$data` will be transparently redirected to the in-memory `data.frame`.Additionally, many objects can be permanently cached on the local file system.This caching can be enabled by setting the option `mlr3oml.cache` to either `TRUE` or a specific path to be used as the cache folder.:::After we have loaded the data of interest into R, the next step is to convert it into a format usable with mlr3.The class that comes closest to the OpenML dataset is the `r ref("mlr3::DataBackend")` (see @sec-backends) and it is possible to convert `r ref("OMLData")` objects by calling `r ref("as_data_backend()")`.This is a recurring theme throughout this section: OpenML and mlr3 objects are well interoperable.```{r large_benchmarking-006}backend =as_data_backend(odata)backend```We can create an mlr3 task from the backend via `r 
ref("as_task_classif()")`.```{r large_benchmarking-007}task =as_task_classif(backend, target ="class")```Some datasets on OpenML contain columns that should neither be used as a feature nor a target.The column names that are usually included as features are accessible through the field `$feature_names` and we assign them to the `mlr3` task accordingly.Note that for the dataset at hand this would not have been necessary, as all non-target columns are to be treated as predictors, but we include it for clarity.```{r}task$col_roles$feature = odata$feature_namestask```Alternatively, as the OpenML adult dataset comes with a default target, you can also directly convert it to a task with the appropriate type.This will also set the features of the task appropriately.```{r large_benchmarking-008}task =as_task(odata)```### Task {#sec-openml-task}OpenML tasks are built on top of OpenML datasets and additionally specify the target variable, the train-test splits to use for resampling, and more.Similar to mlr3, OpenML has different types of tasks, such as regression or classification.A task associated with the adult data from earlier has ID `r link("https://openml.org/t/359983", "359983")`.We can load the object using the `r ref("mlr3oml::otsk()")` function, which returns an `r ref("OMLTask")` object.Note that this task object is different from an `mlr3``r ref("Task")` and cannot directly be used for machine learning.However, it contains all required information and there is a convenient converter, as shown below.```{r large_benchmarking-009}otask =otsk(id =359983)otask```The `r ref("OMLData")` object associated with the underlying dataset can be accessed through the `$data` field.```{r large_benchmarking-010}otask$data```The data splits associated with the estimation procedure are accessible through the field `$task_splits`.In mlr3 terms, these are the instantiation of an `mlr3``r ref("Resampling")` on a specific `r ref("Task")`.```{r large_benchmarking-011}otask$task_splits```The OpenML task can be converted to both an mlr3 `r ref("Task")` and a `r ref("ResamplingCustom")` instantiated on the task.To convert to the former we can use `r ref("as_task()")`.```{r large_benchmarking-012}task =as_task(otask)task```The accompanying resampling can be created using `r ref("as_resampling()")`.```{r large_benchmarking-013}resampling =as_resampling(otask)resampling```:::{.callout-tip}As a shortcut, it is also possible to create the objects using the `"oml"` task or resampling using the `r ref("tsk()")` and `r ref("rsmp()")` constructors and pass the `data_id` or `task_id` to query, e.g. 
`tsk("oml", task_id = 359983)`.:::### Filtering of Data and Tasks {#sec-openml-filtering}Besides working with objects with known IDs, another important question is how to find IDs of relevant datasets or tasks.Because objects on OpenML have strict metadata, they can be filtered w.r.t these properties.This is possible through either the website or the REST API.In addition, the website also supports targeted text queries to search for specific datasets such as the "adult" data from earlier.The `r ref("list_oml_data()")` function allows to filter datasets for specific properties.As an example, we might only be interested in comparing the random forest and the logistic regression on datasets with less than 4 features and 100 to 1000 observations.By setting `number_classes` to 2, we only receive datasets where the default target has two different values.To keep the output readable, we only show the first 5 results from that query.```{r}#| include: falsepath_odatasets =file.path("openml", "manual", "odatasets_filter.rds")``````{r large_benchmarking-014, eval = !file.exists(path_odatasets)}odatasets =list_oml_data(limit =5,number_features =c(1, 4),number_instances =c(100, 1000),number_classes =2)``````{r}#| include: falseif (file.exists(path_odatasets)) { odatasets =readRDS(path_odatasets)} else {saveRDS(odatasets, path_odatasets)}```The table below confirms that indeed only datasets with the specified properties were returned.We only show a subset of the columns for readability.```{r large_benchmarking-016}odatasets[, .(data_id, NumberOfClasses, NumberOfFeatures, NumberOfInstances)]```Besides datasets, it is also possible to filter tasks.This can be done using `r ref("list_oml_tasks()")` and works analogously to the previous example.We could now start looking at the returned IDs in more detail in order to verify whether they are suitable for our purposes.This process can be tedious, as some datasets have hard-to-detect quirks to look out for.A solution to this problem is to use an existing curated task collection, which we will cover next.### Task Collection {#sec-openml-collection}The OpenML task collection is a container object bundling existing tasks.This allows for the creation of `r index("benchmark suites")`, which are curated collections of tasks, satisfying certain quality criteria.One example for such a benchmark suite is the `r link("https://www.openml.org/search?type=study&study_type=task&id=99", "OpenML CC-18")`, which contains curated classification tasks [@bischl2021openml].Other collections available on OpenML include the `r link("https://www.openml.org/search?type=study&study_type=task&id=271", "AutoML benchmark")`[@amlb2022] or a `r link("https://www.openml.org/search?type=study&study_type=task&id=304", "benchmark for tabular deep learning")`[@grinsztajn2022why].```{r}#| include: falsepath_otask_collection =file.path("openml", "manual", "otask_collection99.rds")``````{r large_benchmarking-017, eval = !file.exists(path_otask_collection)}otask_collection =ocl(id =99)``````{r}#| include: falseif (file.exists(path_otask_collection)) { otask_collection =readRDS(path_otask_collection)} else {# need to trigger the download otask_collection$task_idssaveRDS(otask_collection, path_otask_collection)}```We can create an `r ref("OMLCollection")` using `r ref("mlr3oml::ocl()")`.We see that the CC-18 contains 72 classification tasks on different datasets.```{r large_benchmarking-019}otask_collection```The contained tasks can be accessed through `$task_ids`.```{r 
```{r large_benchmarking-020}
otask_collection$task_ids
```

We will now define our experimental design using tasks from the CC-18.
If we wanted to get all tasks and resamplings, we could achieve this using the converters `r ref("as_tasks()")` and `r ref("as_resamplings()")`.
However, as the CC-18 contains not only binary classification tasks, we use `r ref("list_oml_tasks()")` to subset the collection further.
We pass the task IDs from the CC-18 as argument `task_id` and request the number of classes to be 2.

```{r}
#| include: false
path_binary_cc18 = file.path("openml", "manual", "binary_cc18.rds")
```

```{r large_benchmarking-021, eval = !file.exists(path_binary_cc18)}
binary_cc18 = list_oml_tasks(
  task_id = otask_collection$task_ids, number_classes = 2)
```

```{r}
#| include: false
if (!file.exists(path_binary_cc18)) {
  saveRDS(binary_cc18, path_binary_cc18)
} else {
  binary_cc18 = readRDS(path_binary_cc18)
}
```

In order to keep the runtime reasonable in later examples, we only use the first six tasks.

```{r large_benchmarking-023}
binary_cc18[, .(task_id, name, NumberOfClasses)]
ids = binary_cc18$task_id[1:6]
```

We now define the learners, tasks, and resamplings for the experiment.
In addition to the random forest and the logistic regression, we also include a featureless learner as a baseline.

```{r large_benchmarking-024}
otasks = lapply(ids, otsk)
tasks = lapply(otasks, as_task)
resamplings = lapply(otasks, as_resampling)
learner_featureless = lrn("classif.featureless", id = "featureless")
learners = list(learner_logreg, learner_ranger, learner_featureless)
```

To define the design table, we use `r ref("benchmark_grid()")` and set `paired` to `TRUE`.
This option can be used in situations where each resampling is instantiated on a corresponding task (therefore the `tasks` and `resamplings` below must have the same length) and each learner should be evaluated on every resampled task.

```{r large_benchmarking-025}
large_design = benchmark_grid(
  tasks, learners, resamplings, paired = TRUE)
large_design
```

## Experiment Execution on HPC Clusters {#sec-hpc-exec}

Once an experimental design is finalized, the next step is to run it.
Parallelizing this step is conceptually straightforward, because we are facing an embarrassingly parallel problem (see @sec-parallelization).
Not only are the resample experiments independent of each other, but so are their individual iterations.
However, if the experiment is large, parallelization on a local machine as shown in @sec-parallelization is often not enough, and using a distributed computing system, such as an HPC cluster, is required.
While access to HPC clusters is widespread at universities and research-driven companies, the effort required to work on these systems is still considerable.
The R package `r ref_pkg("batchtools")` provides a framework to simplify running large batches of computational experiments in parallel from R.
It is highly flexible, making it suitable for a wide range of computational experiments, including machine learning, optimisation, simulation, and more.

:::{.callout-note}
In @sec-parallel-resample we have already touched upon different parallelization backends.
The package `r ref_pkg("future")` also comes with a plan for `r ref_pkg("batchtools")`.
However, for larger experiments, the additional control over the execution that `batchtools` offers is invaluable.
Therefore, we recommend future's `"batchtools"` plan only for moderately sized experiments which complete within a couple of hours (see the sketch below).
To estimate the total runtime, a subset of the benchmark is usually executed and measured, and then the runtime is extrapolated.
:::
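For such moderately sized designs, the benchmark from above could be parallelized through future's batchtools integration as sketched here; we assume the `r ref_pkg("future.batchtools")` package and a suitable Slurm template file are available.

```{r}
#| eval: false
library(future.batchtools)
# send each parallel unit of the benchmark to Slurm via the batchtools plan
plan(batchtools_slurm, template = "slurm-simple.tmpl")
bmr = benchmark(large_design)
```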
### HPC Basics {#sec-hpc-basics}

An HPC cluster is a collection of interconnected computers or servers that provides computational power beyond what a single computer can achieve.
HPC clusters are used for solving complex problems in chemistry, physics, engineering, machine learning, and other fields that require a large amount of computational resources.
An HPC cluster typically consists of multiple compute nodes, each with multiple CPU/GPU cores, memory, and local storage.
These nodes are connected by a high-speed network and a network file system, which enables them to communicate and work together on a given task.
Such clusters are designed to run parallel applications that can be split into smaller tasks and distributed across multiple compute nodes to be executed simultaneously.
We will leverage this capacity to parallelize the execution of the benchmark experiment.

The most important difference between such a cluster and a personal computer (PC) is that the nodes cannot be accessed directly; instead, computational jobs are queued by a `r index("scheduling system")` like `r link("https://slurm.schedmd.com", "Slurm")` (Simple Linux Utility for Resource Management).
A scheduling system is a software tool that orchestrates the allocation of computing resources to users or applications on the cluster.
It ensures that multiple users and applications can access the resources of the cluster in a fair and efficient manner, and it also helps to maximize the utilization of the computing resources.

@fig-hpc contains a rough sketch of an HPC architecture.
Multiple users can log in to the head node (typically via `r link("https://en.wikipedia.org/wiki/Secure_Shell", "SSH")`) and add their computational workloads of the form "Execute Computation X using Resources Y for Z amount of time" to the queue.
One such instruction will be referred to as a `r index("computational job")`.
The scheduling system controls when these computational jobs are executed.

```{r large_benchmarking-026}
#| label: fig-hpc
#| fig-cap: "Illustration of an HPC cluster architecture."
#| fig-align: "center"
#| fig-alt: "A rough sketch of the architecture of an HPC cluster. Ann and Bob both have access to the cluster and can log in to the head node. There, they can submit jobs to the scheduling system, which adds them to its queue and determines when they are run."
#| echo: false
knitr::include_graphics("Figures/hpc.drawio.png")
```

Common challenges when working with a scheduling system are that

1. the code to submit a job depends on the scheduling system,
1. code that runs locally has to be adapted to run on the cluster,
1. the burden is on the user to account for seeding to ensure reproducibility, and
1. it is cumbersome to query the status of jobs and to debug failures.

In the following, we will see how `r ref_pkg("batchtools")` mitigates these problems.

### General Setup and Experiment Registry {#sec-registry}

Our goal in this section is to run the benchmark design `large_design` shown in @sec-openml-collection on an HPC cluster.
We use the packages `r ref_pkg("batchtools")` and `r ref_pkg("mlr3batchmark")` for this.
The `r ref_pkg("mlr3batchmark")` package assists in translating the machine learning problem defined with syntax and objects from `r ref_pkg("mlr3")` to the more general "apply some algorithm A to some problem P" approach of `r ref_pkg("batchtools")`, which is briefly outlined in @sec-custom-experiment-definition.

The central concept of `r ref_pkg("batchtools")` is the experiment or `r index("job")`: one replication of an experiment (or job) is defined by applying a (parameterized) algorithm to a (parameterized) problem.
A benchmark experiment in batchtools then consists of running many such experiments with different algorithms, algorithm parameters, problems, and problem parameters.
Each such experiment is computationally independent of all other experiments and constitutes the basic unit of computation that batchtools can parallelize.

In the introduction of this chapter, we defined a benchmark experiment as evaluating a number of triples $E_{i, j} = (L_i, C_j, M)$ that are defined by a learner $L_i$, a resampled task $C_j = (T_j, R_j)$, and measures $M$.
While it might seem natural to define one such triple as a job, each resampled task $C_j$ can be split up even further, namely into its resampling iterations $C^1_j, \ldots, C^{n_j}_j$, where $n_j$ is the number of iterations of resampling $R_j$.
This makes the computation more granular and gives you the opportunity to utilize more CPUs in parallel (cf. @sec-parallelization).
Moreover, because the evaluation of the measures $M$ is (usually) computationally negligible, it is common to only parallelize the execution of the resample experiments and to evaluate the measures $M$ afterwards.
For these reasons, we will treat one pair $(L_i, C^l_j)$ as a single batchtools experiment.^[Note that such a job does not have to coincide with the notion of a computational job defined earlier; more on that in @sec-batchtools-submission.]
The custom definition of problems and algorithms, which allows a different granularity, is demonstrated in @sec-custom-experiment-definition.

The first step is always to create a new experiment registry (or load an existing one) using the function `r ref("batchtools::makeExperimentRegistry()")` (or `r ref("batchtools::loadRegistry()")`, respectively).
This function constructs the inter-communication object for all functions in `r ref_pkg("batchtools")` and corresponds to a folder on the file system.
Among other things, the experiment registry stores the

* algorithms, problems, and job definitions
* log outputs and status of submitted, running, and finished jobs
* job results
* `r index("cluster function", "Batchtools object that allows configuring the interaction with the scheduling system.")`, which defines the interaction with the scheduling system

While the first three bullet points should be relatively clear, the fourth needs some explanation.
By configuring the scheduling system through a cluster function, the interaction with the scheduling system is made independent of the concrete scheduling software.
We will come back to this later and show how to change it to work on a Slurm cluster.
We create a registry in a subdirectory of our working directory; on a real cluster, make sure that this folder is stored on a shared network filesystem, otherwise the nodes cannot access it.
Furthermore, we set the registry's `seed` to 1 and the `packages` to `r ref_pkg("mlr3verse")`, which will make the package available in all our experiments.

```{r include = FALSE}
#| cache: false
if (dir.exists("experiments")) unlink("experiments", recursive = TRUE)
```

```{r large_benchmarking-027}
#| cache: false
library("batchtools")

# create registry
reg = makeExperimentRegistry(
  file.dir = "./experiments",
  seed = 1,
  packages = "mlr3verse"
)
```

When printing our newly created registry, we see that no problems, algorithms, or jobs are registered yet.
Among other things, we are informed that the "Interactive" cluster function (see `r ref("batchtools::makeClusterFunctionsInteractive()")`) is used, and about the working directory for the experiments.

```{r large_benchmarking-028}
reg
```

### Experiment Definition using mlr3batchmark {#sec-mlr3batchmark}

The next step is to populate the registry with problems and algorithms, which we will then use to define the jobs, i.e., the resampling iterations.
This is the first step where `r ref_pkg("mlr3batchmark")` comes into play.
Doing this step with plain `r ref_pkg("batchtools")` is also possible, gives you more flexibility, and is demonstrated in @sec-custom-experiment-definition.
By calling `r ref("batchmark()")`, mlr3 tasks and resamplings are translated to batchtools problems, and mlr3 learners are mapped to batchtools algorithms.
Then, jobs for all resampling iterations are created.

```{r large_benchmarking-029}
#| cache: false
#| output: false
library("mlr3batchmark")
batchmark(large_design, reg = reg)
```

::: {.callout-tip}
All batchtools functions that interoperate with a registry take a registry as an argument.
By default, this argument is set to the last created registry, which is currently the `reg` object defined earlier.
We nonetheless pass it explicitly in this section for clarity, but would not have to do so.
:::

When printing the registry, we confirm that six problems (one for each resampled task) but only a single algorithm are registered.
While a one-to-one mapping of the three learners in the design would also have been possible, `r ref_pkg("mlr3batchmark")` instead uses a single algorithm parametrized with the learner identifiers for efficiency.
Furthermore, $180 = 3 \times 6 \times 10$ jobs, i.e. one for each resampling iteration, are registered.
```{r large_benchmarking-030}
reg
```

We can summarize the defined experiments using `r ref("batchtools::summarizeExperiments()")`.
There are 10 jobs for each combination of a learner and a resampled task, as 10-fold cross-validation is used as the resampling procedure.

```{r large_benchmarking-031}
summarizeExperiments(by = c("task_id", "learner_id"), reg = reg)
```

The function `r ref("batchtools::getJobTable()")` can be used to get more detailed information about the jobs.
Here, we only show a few selected columns for readability and unpack the list columns `algo.pars` and `prob.pars` using `r ref("batchtools::unwrap()")`.
Among other things, we see that each job has a unique `job.id`.
Each row in this job table represents one iteration (column `repl`) of a resample experiment.

```{r large_benchmarking-032}
job_table = getJobTable(reg = reg)
job_table = unwrap(job_table)
job_table = job_table[, .(job.id, learner_id, task_id, resampling_id, repl)]
job_table
```

### Job Submission {#sec-batchtools-submission}

Once the experiments are defined, the next step is to submit them.
Before doing so, it is recommended to test each algorithm individually using `r ref("batchtools::testJob()")`.
As an example, we test the job with `job.id = 1` and specify `external = TRUE` to run the test in an external R session.
The return value of `testJob()` (which is the return value of the `"run_learner"` algorithm) is somewhat technical: to reduce the communication overhead, not the complete objects but only their essential parts are returned, namely a named list `learner_state` and a named list `prediction`; we do not describe the list elements in detail here.
For example, we can access the training time through the `learner_state`, as shown below.

```{r large_benchmarking-033}
#| output: true
result = testJob(1, external = TRUE, reg = reg)
result$learner_state$train_time
```

In case something goes wrong, `r ref_pkg("batchtools")` comes with several useful debugging utilities covered in @sec-batchtools-monitoring.
Once we are confident that the jobs are defined correctly, we can proceed with their submission, which requires

1. specifying resource requirements for each computational job, and
1. (optionally) grouping multiple jobs into one computational job.

Which resources can be configured depends on the cluster function that is set in the registry.
We earlier left it at its default value, which is the "Interactive" cluster function.
In the following, we assume that we are working on a Slurm cluster.
Accordingly, we initialize the cluster function with `r ref("batchtools::makeClusterFunctionsSlurm()")` and the predefined `r link("https://github.com/mllg/batchtools/blob/master/inst/templates/slurm-simple.tmpl", "slurm-simple template")`.
A template file is a shell script with placeholders filled in by `batchtools` and contains

1. the command to start the computation via `Rscript` or `R CMD BATCH`, usually in the last line, and
1. comments which serve as annotations for the scheduler, e.g. to communicate resources or paths on the file system.
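The templates shipped with `r ref_pkg("batchtools")` (the linked `slurm-simple` template among them) can be listed directly from the installed package; the following short sketch only uses base R.

```{r}
#| eval: false
# list the scheduler templates shipped with batchtools
list.files(system.file("templates", package = "batchtools"))
```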
The exemplary template should work on many Slurm installations out of the box, but can also easily be customized for more advanced configurations.

```{r large_benchmarking-034}
cf = makeClusterFunctionsSlurm(template = "slurm-simple")
```

To proceed with the examples on a local machine, set the cluster function to a Socket backend instead.

```{r large_benchmarking-035}
cf = makeClusterFunctionsSocket()
```

```{r}
#| include: false
cf = makeClusterFunctionsInteractive()
```

:::{.callout-tip}
It is possible to customize the cluster function.
More information is available in the documentation of the `r ref_pkg("batchtools")` package.
:::

We update the registry's `$cluster.functions` field and save the registry.
This only has to be done manually when modifying fields explicitly; functions like `r ref("batchmark()")` internally save the registry when required.

```{r large_benchmarking-036}
#| eval: true
#| cache: false
reg$cluster.functions = cf
saveRegistry(reg = reg)
```

The jobs are submitted to the scheduler via `r ref("batchtools::submitJobs()")`.
The most important arguments of this function besides the registry are:

* `ids`, which is either a vector of job IDs to submit, or a data frame with columns `job.id` and `chunk`, which allows grouping multiple jobs into one larger computational job and thereby controlling the granularity. This often makes sense on HPCs, as submitting and running computational jobs comes with a considerable overhead, and there are often hard limits on the maximum runtime (walltime).
* `resources`, which is a named list specifying the resource requirements for the submitted jobs.

We will `r ref("batchtools::chunk()")` the IDs in such a way that 5 iterations of one resample experiment are run sequentially in one computational job.
The optimal grouping depends on the concrete experiment and scheduling system.

```{r large_benchmarking-037}
ids = job_table$job.id
chunks = data.table(
  job.id = ids, chunk = chunk(ids, chunk.size = 5, shuffle = FALSE)
)
chunks
```

Furthermore, we set the number of CPUs per computational job to 1, the walltime to 1 hour (3600 seconds), and the RAM limit to 8 GB.
The set of available resources depends on your cluster and the corresponding template file.
For a list of resource names that are standardised across most implementations, see `r ref("batchtools::submitJobs()")`.
If you are unsure about the resource requirements, you can start a subset of jobs with conservative resource constraints, e.g. the maximum runtime allowed at your computing site.
Measured runtimes and memory usage can then be queried with `r ref("batchtools::getJobTable()")` and used to better estimate the required resources for the remaining jobs.
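For example, after such a trial run, the measured resource usage could be inspected as sketched below; we assume the `time.running` and `mem.used` columns reported by `r ref("batchtools::getJobTable()")`.

```{r}
#| eval: false
# inspect measured runtime and memory usage of finished jobs
tab = unwrap(getJobTable(reg = reg))
tab[!is.na(done), .(job.id, learner_id, task_id, time.running, mem.used)]
```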
<!-- output to false otherwise we see the interactive cluster function -->

```{r large_benchmarking-038}
#| cache: false
#| output: false
submitJobs(
  ids = chunks,
  resources = list(ncpus = 1, walltime = 3600, memory = 8000),
  reg = reg
)

# wait for all jobs to terminate
waitForJobs(reg = reg)
```

:::{.callout-tip}
A good approach to submitting computational jobs is to do so from an R session that runs persistently.
One option is to use `r link("https://github.com/tmux/tmux/wiki", "TMUX (Terminal Multiplexer)")` on the head node to continue job submission (or computation, depending on the cluster functions) in the background.
:::

### Job Monitoring, Error Handling, and Result Collection {#sec-batchtools-monitoring}

After submitting the jobs, the next phase is to wait for them to finish.
In case you terminated your running R session after job submission, you can load the experiment registry using `r ref("loadRegistry()")` to continue where you left off.
In any large-scale experiment, many things can and will go wrong, even if we test our jobs beforehand using `r ref("batchtools::testJob()")` as recommended earlier.
The cluster might have an outage, jobs may run into resource limits or crash, subtle bugs in your code could be triggered, or any other error condition might arise.
In these situations, it is important to quickly determine what went wrong and to recompute only the minimal number of required jobs.

The current status of the computation can be queried with `r ref("getStatus()")`, which lists the number of jobs categorized into multiple status groups:

```{r large_benchmarking-041}
getStatus(reg = reg)
```

To query the IDs of jobs in the respective categories, see `r ref("findJobs()")` and, e.g., `findNotSubmitted()` or `findDone()`.
In our case, all experiments finished and none expired or crashed, although we took some countermeasures by extending our base learners with fallback learners and integrating them into the "robustify" pipeline.
Nonetheless, it makes sense to check the logs for suspicious messages and warnings with `r ref("grepLogs()")` before proceeding with the analysis of the results.
For this purpose, we earlier set the encapsulation to `"try"`: in contrast to `"evaluate"` or `"callr"` encapsulation, `"try"` does not capture the output of messages, warnings, or errors and store it in the learner's log.
Instead, all output is printed to the console and redirected to a log file, so that `batchtools` can operate on it with functions like `r ref("getLog()")` or `r ref("grepLogs()")`.

In the following, we extend the design with the debug learner (see @sec-error-handling), which fails during training with a probability of 50%.
By simply calling `r ref("batchmark()")` with the new design again, the new experiments are added to the registry on top of the existing jobs.
The `tasks` and `resamplings` below are once again those from the `large_design` from earlier.

```{r}
#| cache: false
#| output: false
extra_design = benchmark_grid(
  learners = lrns("classif.debug", error_train = 0.5),
  tasks = tasks,
  resamplings = resamplings,
  paired = TRUE
)
batchmark(extra_design, reg = reg)
```

We can check the new state and find the jobs which have not been submitted yet (i.e., the newly created jobs):

```{r}
#| cache: false
getStatus(reg = reg)
ids = findNotSubmitted(reg = reg)
```

We queue these jobs as usual by passing their IDs to `submitJobs()`:
```{r}
#| cache: false
#| output: false
submitJobs(ids, reg = reg)
waitForJobs(reg = reg)
```

After these jobs have terminated, we can get a summary of those that failed:

```{r}
error_ids = findErrors(reg = reg)
summarizeExperiments(error_ids, by = c("task_id", "learner_id"), reg = reg)
```

Unsurprisingly, all failed jobs have one thing in common: the debug learner.

Finally, it is time to collect the experiment output.
The results of benchmark experiments defined with `r ref("batchmark()")` can be collected with `r ref("reduceResultsBatchmark()")`, which constructs a regular `r ref("BenchmarkResult")`.
It is also possible to retrieve single results with `r ref("loadResult()")`, but the returned values are hard to work with, as they are optimized for efficiency rather than usability.
Here, we only collect results that were not produced by the debug learner.

```{r}
ids = findExperiments(algo.pars = learner_id != "classif.debug", reg = reg)
bmr = reduceResultsBatchmark(ids, reg = reg)
bmr$aggregate()
```

### Custom Experiment Definition {#sec-custom-experiment-definition}

{{< include _optional.qmd >}}

While `r ref_pkg("mlr3batchmark")` excels at conducting benchmarks on HPCs, there can be situations in which more fine-grained control over the experiment definition is beneficial or even required.
Here, we will show how to define batchtools jobs that execute an mlr3 benchmark experiment without the help of `r ref_pkg("mlr3batchmark")`.
There is no single way to achieve this; we show a solution that sacrifices efficiency for simplicity.
Unless you have a specific reason to customize your experiment definition, we recommend using `r ref_pkg("mlr3batchmark")`.

Like before, the first step is to create an experiment registry.

```{r include = FALSE}
#| cache: false
if (dir.exists("experiments-custom")) unlink("experiments-custom", recursive = TRUE)
```

```{r large_benchmarking-046}
#| cache: false
reg = makeExperimentRegistry(
  file.dir = "./experiments-custom",
  seed = 1,
  packages = "mlr3verse"
)
```

We can register a problem by calling `r ref("batchtools::addProblem()")`, whose main arguments besides the registry are:

* `name` to uniquely identify the problem,
* `data` to represent the static data part of a problem, and
* `fun`, which takes in the problem `data`, the problem parameters, and the `job` definition (see `r ref("batchtools::makeJob()")`) and returns a problem `instance`.

We register all task-resampling combinations of the `large_design`, using the task ID as the name.^[The mlr3 task ID is not the same as the OpenML task ID.]
The problem `fun` takes in the static problem `data` and returns it unchanged as the problem `instance`.
If we were using problem parameters, we could modify the problem instance depending on their values; a hypothetical sketch of this follows after the next code chunk.
In the code below, recall that the `tasks` and `resamplings` were originally used to define the `large_design`.

```{r large_benchmarking-047}
#| output: false
#| cache: false
for (i in seq_along(tasks)) {
  addProblem(
    name = tasks[[i]]$id,
    data = list(task = tasks[[i]], resampling = resamplings[[i]]),
    fun = function(data, job, ...) data,
    reg = reg
  )
}
```
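As an aside, the following hypothetical sketch (not part of our experiment) shows how a problem parameter could be used: `n_features` is a made-up parameter that would later be set via `prob.designs` in `r ref("batchtools::addExperiments()")`, and the problem function reduces the task to its first `n_features` features before returning the instance.

```{r}
#| eval: false
addProblem(
  name = "reduced_task",
  data = list(task = tasks[[1]], resampling = resamplings[[1]]),
  fun = function(data, job, n_features, ...) {
    # keep only the first n_features features of the task
    task = data$task$clone()
    task$select(head(task$feature_names, n_features))
    list(task = task, resampling = data$resampling)
  },
  reg = reg
)
```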
When calling `r ref("batchtools::addProblem()")`, not only is the problem added to the registry object in the active R session, but this information is also synced to the registry folder.

The next step is to register the algorithm we want to run, which we achieve by calling `r ref("batchtools::addAlgorithm()")`.
Besides the registry, it takes the arguments:

* `name` to uniquely identify the algorithm, and
* `fun`, which takes in the problem instance, the algorithm parameters, and the `job` definition. It defines the computational steps of an experiment, and its return value is the experiment result, i.e. what can later be retrieved using `r ref("loadResult()")`.

The algorithm function below receives a list containing the task and resampling as the problem `instance`, the `learner` as the algorithm parameter, and the `job` object.
It then executes the resample experiment defined by these three objects using `r ref("resample()")` and returns a `r ref("ResampleResult")`.
This differs from the `r ref_pkg("mlr3batchmark")` approach, where one resampling iteration corresponds to one batchtools job; here, one `r ref_pkg("batchtools")` job represents a complete resample experiment.

```{r large_benchmarking-048}
#| cache: false
addAlgorithm(
  "run_learner",
  fun = function(instance, learner, job, ...) {
    resample(instance$task, learner, instance$resampling)
  },
  reg = reg
)
reg$algorithms
```

As we have now defined the problems and the algorithm, we can define concrete experiments using `r ref("batchtools::addExperiments()")`.
This function has the arguments:

* `prob.designs`, a named list of data frames. The name must match the problem name, while the column names correspond to parameters of the problem.
* `algo.designs`, a named list of data frames. The name must match the algorithm name, while the column names correspond to parameters of the algorithm.

In the code below, we add the resample experiments for all six tasks as batchtools experiments.
By leaving `prob.designs` unspecified, experiments for all existing problems are created by default.
We set the algorithm parameter to all possible `learners`, i.e. the logistic regression, the random forest, and the featureless learner from the `large_design`.
Note that whenever an experiment is added, the current seed is assigned to the experiment and then incremented.

```{r large_benchmarking-049}
#| cache: false
#| output: false
library("data.table")
algorithm_design = list(run_learner = data.table(learner = learners))
print(algorithm_design$run_learner)
addExperiments(algo.designs = algorithm_design, reg = reg)
```

We confirm that the algorithm, problems, and experiments (jobs) were added successfully.

```{r large_benchmarking-050}
summarizeExperiments()
```

@fig-batchtools-illustration summarizes the interplay between the batchtools problems, algorithms, and experiments.

```{r large_benchmarking-051}
#| label: fig-batchtools-illustration
#| fig-cap: "Illustration of the batchtools problem, algorithm, and experiment."
#| fig-align: "center"
#| fig-alt: "A problem consists of a static data part and applies the problem function to this data part (and potentially problem parameters) to return a problem instance. The algorithm function takes in a problem instance (and potentially algorithm parameters), executes one job and returns its result."
#| echo: false
knitr::include_graphics("Figures/tikz_prob_algo_simple.png")
```

We are now ready to submit the jobs to the cluster.
By specifying no job IDs, all experiments are submitted as independent jobs, i.e. one computational job executes one resample experiment.
```{r}
#| output: false
#| cache: false
submitJobs(reg = reg)
waitForJobs(reg = reg)
```

When a cluster job finishes, it stores its return value in the registry folder.
We can retrieve the job results using `r ref("batchtools::loadResult()")`.
It outputs the object returned by the algorithm function, which in our case is a `r ref("ResampleResult")`.

```{r large_benchmarking-054}
rr = loadResult(1, reg = reg)
rr
```

In order to use mlr3's post-processing tools, we need to convert all results into a `r ref("BenchmarkResult")`.
We can do this by combining all resample results using `r ref("batchtools::reduceResults()")`.

```{r large_benchmarking-055}
bmr = reduceResults(c, reg = reg)
bmr$aggregate()
```

While we took a different route than in @sec-mlr3batchmark to define the experiments and ran them at a different granularity, we arrived at the same result.

## Statistical Analysis {#sec-benchmark-analysis}

Once we have successfully executed the benchmark experiment, we can proceed with its analysis.
The package `r ref("mlr3benchmark")` provides infrastructure for applying statistical significance tests to `r ref("BenchmarkResult")` objects.
Currently, Friedman tests and pairwise Friedman-Nemenyi tests [@demsar2006] are supported to analyze benchmark experiments with at least two independent tasks and at least two learners.
Before we can use these methods, we have to convert the benchmark result to a `r ref("mlr3benchmark::BenchmarkAggr")` using `r ref("as_benchmark_aggr()")`.
We can then perform a pairwise comparison using `$friedman_posthoc()`.
This method first performs a global Friedman test and only conducts the post-hoc tests if the former is significant.

```{r large_benchmarking-056}
library("mlr3benchmark")
bma = as_benchmark_aggr(bmr, measures = msr("classif.ce"))
bma$friedman_posthoc()
```

These results would indicate a statistically significant difference between the `"featureless"` learner and `"ranger"`, assuming a 95% confidence level.
This table can be summarized in a critical difference plot, which typically shows the mean rank of a learning algorithm on the x-axis, along with a thick horizontal line that connects learners which are pairwise not significantly different (while correcting for multiple tests):

```{r large_benchmarking-057}
autoplot(bma, type = "cd")
```

While our experiment did not show a significant difference between the random forest and the logistic regression (they are connected in the plot), the former has a lower, i.e. better, rank on average.
This is in line with the large benchmark study conducted by @couronne2018random, where the random forest outperformed the logistic regression on 69% of 243 real-world datasets.

As a final note, it is important to be careful when interpreting such test results.
Because our datasets are not an i.i.d. sample from a population of datasets, we can only make inferences about the data-generating processes at hand, i.e. those that generated the datasets we used in the benchmark.
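Related to the exercises below, the global test and the underlying ranks can also be inspected directly; this is a short sketch assuming the `$friedman_test()` and `$rank_data()` methods of `r ref("mlr3benchmark::BenchmarkAggr")`.

```{r}
#| eval: false
# global Friedman test and per-task ranks of the learners
bma$friedman_test()
bma$rank_data()
```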
## Conclusion

In this chapter, we explored how to conduct large-scale machine learning experiments using mlr3.
We showed how to acquire diverse datasets from OpenML through the `r ref_pkg("mlr3oml")` interface package.
Furthermore, we learned how to execute large-scale experiments using the `r ref_pkg("batchtools")` package and its `r ref_pkg("mlr3batchmark")` integration.
Finally, we demonstrated how to analyze the results using the `r ref_pkg("mlr3benchmark")` package, thereby extracting meaningful insights from the experiments.
The most important functions and classes we learned about are listed in @tbl-api-large-benchmarking alongside their R6 classes (if applicable).

| S3 function | R6 Class | Summary |
| ------------------------------------| ---------------------------| --------------------------------------------------------------------|
| `r ref("odt()")` | `r ref("OMLData")` | Retrieve an OpenML Dataset |
| `r ref("otsk()")` | `r ref("OMLTask")` | Retrieve an OpenML Task |
| `r ref("ocl()")` | `r ref("OMLCollection")` | Retrieve an OpenML Collection |
| `r ref("list_oml_data()")` | - | Filter OpenML Datasets |
| `r ref("list_oml_tasks()")` | - | Filter OpenML Tasks |
| `r ref("makeExperimentRegistry()")` | - | Create a new registry |
| `r ref("loadRegistry()")` | - | Load an existing registry |
| `r ref("saveRegistry()")` | - | Save an existing registry |
| `r ref("batchmark()")` | - | Register problems, algorithms, and experiments from a design |
| `r ref("addProblem()")` | - | Register a new problem |
| `r ref("addAlgorithm()")` | - | Register a new algorithm |
| `r ref("addExperiments()")` | - | Register experiments using existing algorithms and problems |
| `r ref("submitJobs()")` | - | Submit jobs to the scheduler |
| `r ref("getJobTable()")` | - | Get an overview of all job definitions |
| `r ref("unwrap()")` | - | Unnest list columns of a job table |
| `r ref("getStatus()")` | - | Get the status of the computation |
| `r ref("reduceResultsBatchmark()")` | - | Load finished jobs as a benchmark result |
| `r ref("reduceResults()")` | - | Combine experiment results |
| `r ref("findExperiments()")` | - | Find specific experiments |
| `r ref("grepLogs()")` | - | Search the log files |
| `r ref("summarizeExperiments()")` | - | Summarize defined experiments |
| `r ref("getLog()")` | - | Get a specific log file |
| `r ref("findErrors()")` | - | Find IDs of failed jobs |

: Core functions for large-scale benchmarking with mlr3, with the underlying R6 class that is constructed when these functions are called (if applicable) and a summary of the purpose of the functions. {#tbl-api-large-benchmarking}

### Resources {.unnumbered .unlisted}

- Look at the short `r link("https://doi.org/10.21105/joss.00135", "batchtools paper")`, and please cite it if you use the package.
- Read the `r link("https://www.jstatsoft.org/v64/i11", "paper on BatchJobs / BatchExperiments")`, the predecessors of batchtools; the concepts still hold, and most examples work analogously.
- Learn more in the `r link("https://mllg.github.io/batchtools/articles/batchtools.html", "batchtools vignette")`.
- Explore the `r link("https://docs.openml.org/", "OpenML documentation")`.

## Exercises

In this exercise, we will conduct an empirical study that compares two machine learning algorithms.
Our null hypothesis is that a single regression tree performs as well as a random forest.

### Getting Data from OpenML {.unnumbered .unlisted}
1. Load the OpenML collection with ID `r link("https://www.openml.org/search?type=study&study_type=task&id=269", "269")`. It contains regression tasks from the AutoML benchmark [@amlb2022].
2. Find all tasks with fewer than 4000 observations and convert them to mlr3 tasks.
3. Create an experimental design that compares the random forest from `r ref_pkg("ranger")` with the regression tree from `r ref_pkg("rpart")` on those tasks. You can use 3-fold cross-validation instead of the OpenML resamplings to save time.

### Executing the Experiments using batchtools {.unnumbered .unlisted}

1. Create a registry and populate it with the experiments.
1. (Optional) Change the cluster function to either "Socket" or "Multicore" (the latter does not work on Windows).
1. Submit the jobs and, once they are finished, collect the results.

### Analyzing the Results {.unnumbered .unlisted}

1. Conduct a global Friedman test and interpret the results. As an evaluation measure, use the mean squared error.
1. Inspect the ranks of the results.