Michel Lang, Research Center Trustworthy Data Science and Security, TU Dortmund University
In the previous chapters, we demonstrated how to turn ML concepts and ML methods into code. So far, we have covered ML concepts without going into a lot of technical detail, which can be important for more advanced uses of mlr3. This includes the following topics:
parallelization with the future framework (Section 9.1),
handling errors and troubleshooting (Section 9.2),
adjusting the logger to your needs (Section 9.3),
working with out-of-memory data, e.g., data stored in databases (Section 9.4), and
adding new classes to mlr3 (Section 9.5).
9.1 Parallelization
The term parallelization refers to running multiple algorithms in parallel, i.e., executing them simultaneously on multiple CPU cores, CPUs, or computational nodes. Not all algorithms can be parallelized, but when they can, parallelization allows significant savings in computation time.
In general, there are many possibilities to parallelize, depending on the hardware to run the computations: If you only have a single CPU with multiple cores, threads or processes are ways to utilize all cores on a local machine. If you have multiple machines on the other hand, the machines need a way to communicate and exchange information, e.g., via protocols like network sockets or the Message Passing Interface (MPI). Larger computational sites rely on a scheduler to orchestrate the computation for multiple users and offer a shared network file system all machines can access. Interacting with scheduling systems on compute clusters is covered in Section 10.2 using the R package batchtools. We do not want to delve too deep into such details here, but want to introduce some terminology which helps us to discuss parallelization on a more abstract level:
We call the hardware to parallelize on, together with the respective interface provided by an R package, the parallelization backend. As parallelization backends have different APIs, we use the future package as an abstraction layer for many of them. mlr3 just interfaces future, while the user controls how the code is executed by configuring a backend prior to starting the computations.
The R session or process which orchestrates the computational work is called main, and it starts computational jobs.
The R sessions, processes, or machines which receive the jobs, do the calculation and then send back the result are called workers.
An important step in parallel programming involves the identification of sections of the program flow which are both time-consuming (bottlenecks) and able to run independently of other sections. The key characteristic is that these sections do not depend on each other, i.e. section A can be run without waiting for results from section B. Fortunately, such sections are comparably easy to spot for machine learning experiments:
The training of a learning algorithm (or other computationally intensive parts of a machine learning pipeline, c.f. Chapter 6) may contain independent sections which can run in parallel, e.g.
A single decision tree iterates over all features to find the best split point, for each feature independently.
A random forest usually fits hundreds of trees independently.
Many feature filters work in a univariate fashion, i.e. calculate a numeric score for each feature independently.
The key principle that makes parallelization straightforward for these examples (and in general in many fields of statistics and ML) is called data parallelism: the same operation is performed concurrently on different elements of the input data. Parallelization of learning algorithms is covered in Section 9.1.1.
A resampling consists of independent repetitions of train-test-splits (Section 9.1.2).
A benchmark consists of multiple independent resamplings (Section 9.1.3).
Tuning (Chapter 4) is repeated benchmarking, embedded in a sequential procedure which determines the hyperparameter configuration to try next. In addition to parallelization of the benchmark, some tuners propose multiple configurations to evaluate independently in each sequential step, which provides a second level for parallelization discussed in Section 9.1.4.
The predictions of a single learner for multiple observations are independent of each other (Section 9.1.5).
When computational problems are as easy to parallelize as the examples listed in (1)-(4), they are often referred to as embarrassingly parallel. Whenever you can put the heavy lifting into a function and call it with a map-like function such as lapply(), you are facing an embarrassingly parallel problem. Such problems are straightforward to parallelize, e.g., in R with the furrr package which provides parallel counterparts for popular sequential map-like functions from the purrr package.
However, it does not make practical sense to execute every operation that can be parallelized in parallel. Starting and terminating workers, as well as possible communication between workers, comes at a price in the form of additionally required runtime, which is called parallelization overhead. This overhead varies strongly between parallelization backends and must be carefully weighed against the runtime of the sequential execution to determine if parallelization is worth the effort. If the sequential execution is comparably fast, enabling parallelization often just introduces additional complexity for very little runtime savings or can even slow down the execution.
Sometimes, it is possible to control the granularity of the parallelization to reduce the parallelization overhead. For example, if you want to parallelize a for-loop with 1000 iterations on 4 CPU cores, the overhead can be reduced by chunking the work of the 1000 jobs into 4 computational jobs performing 250 iterations each. So 4 bigger jobs are calculated instead of 1000 small ones.
This effect is illustrated in the following code chunk using a socket cluster. Note that this parallel backend already comes with an option to control the chunk size (chunk.size), but for other backends you must chunk manually which is also demonstrated:
# set up a socket cluster with 4 workers on the local machine
library(parallel)
cores = 4
cl = makeCluster(cores)
print(cl)
socket cluster with 4 nodes on host 'localhost'
# vector to operate on
x = 1:1000

# fast function to parallelize
f = function(x) sqrt(x + 1)

# unchunked approach: 1000 jobs
system.time({
  parSapply(cl, x, f, chunk.size = 1)
})

# chunked approach: 4 jobs
system.time({
  parSapply(cl, x, f, chunk.size = 250)
})

# manual chunking: 4 jobs
chunks = rep(1:cores, each = length(x) %/% cores)
jobs = split(x, chunks)
system.time({
  parLapply(cl, jobs, function(chunk, .fun) sapply(chunk, .fun),
    .fun = f, chunk.size = 1)
})
Whenever you have the option to control the granularity by setting the chunk size, you should aim for at least as many jobs as workers, and the runtime of each chunk should be at least several seconds. This ensures that you can fully utilize the system and that the parallelization overhead stays reasonable. If you have heterogeneous runtimes, also consider grouping jobs together so that the runtime of the chunks becomes homogeneous. If there is a good estimate for the runtime, batchtools::binpack() (create an arbitrary number of chunks, each with a specified maximum combined runtime) and batchtools::lpt() (pack a specified number of chunks, each with arbitrary but homogeneous runtime) can prove useful - both are documented together with the batchtools::chunk() helper. For unknown runtimes, randomizing the order of jobs sometimes helps if there is a systematic relationship between the order of the jobs and their runtime. This prevents the long jobs, for example, from all being executed at the end, which leads to avoidable underutilization. mlr3misc ships with the functions chunk() and chunk_vector() to conveniently chunk vectors, shuffling them by default. There are also options to control the chunk size for parallelization started from within mlr3 - these are described later in Section 9.1.2.
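As a quick illustration of the mlr3misc helper mentioned above, the following sketch splits 13 job indexes into 4 chunks (the numbers are arbitrary):

library(mlr3misc)

# split 13 job indexes into 4 chunks; elements are shuffled by default
chunk_vector(1:13, n_chunks = 4)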
9.1.1 Parallelization of Learners
The most atomic parts of mlr3 which can be parallelized are calls to external code, i.e., the execution of certain PipeOp, Filter, or Learner objects. For these objects, mlr3 merely provides a unified interface to control the execution. The parallelization itself is implemented by the respective package authors of the (external) algorithms mlr3 calls.
Most of these algorithms are parallelized via threading, e.g., the random forest implementation in ranger or the boosting implemented in xgboost. For example, while fitting a single decision tree, each split that divides the data into two disjoint partitions requires a search for the best cut point on all \(p\) features. So instead of iterating over all features sequentially, the search can be broken down into \(p\) threads, each searching for the best cut point on a single feature. These threads can easily be parallelized by the scheduler of the operating system, as there is no need for communication between the threads. After all threads have finished, the results are collected and merged before terminating the threads. I.e., for our example of the decision tree, (1) the \(p\) best cut points per feature are collected and then (2) aggregated to the single best cut point across all features by just iterating over the \(p\) results sequentially.
Note
Parallelization on GPUs is not covered in this book. mlr3 only distributes the fitting of multiple learners, e.g., during resampling, benchmarking, or tuning. On this rather abstract level, GPU parallelization does not work efficiently. However, some learning procedures can be compiled against CUDA/OpenCL to utilize the GPU while fitting a single model. We refer to the respective documentation of the learner’s implementation, e.g., https://xgboost.readthedocs.io/en/stable/gpu/ for XGBoost.
Threading is implemented in the compiled code of the package (e.g., in C or C++). The R interpreter calls the external code and waits for the results to be returned - without noticing that the computations are executed in parallel. Unfortunately, threading can conflict with certain parallel backends, causing the system to be overutilized in the best case and hangs or segfaults in the worst case. For this reason, we introduced the convention that threading parallelization is turned off by default. Hyperparameters that control the number of threads are tagged with the label "threads":
library("mlr3learners")# for the ranger learner# get the ranger learnerlearner=lrn("classif.ranger")# show all hyperparameters tagged with "threads"learner$param_set$ids(tags ="threads")
[1] "num.threads"
# The number of threads is initialized to 1
learner$param_set$values$num.threads
[1] 1
To enable the parallelization for this learner, mlr3 provides the helper function set_threads():
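A minimal sketch of its usage; set_threads(learner, n) sets the tagged hyperparameters, and when called without n the number of threads is determined heuristically:

# use 4 CPUs
set_threads(learner, n = 4)

# auto-detect the number of CPUs
set_threads(learner)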
In the last line, we did not set the number of threads, letting the package fall back to a heuristic to detect the correct number. This heuristic is sometimes flaky, and utilizing all available cores is occasionally counterproductive, as overburdening the system often has negative effects on the overall runtime. The function which determines the number of CPUs for mlr3 is implemented in parallelly::availableCores() and works well for many setups. See this blog post for some background information about the implemented heuristic. However, there are still some scenarios where it is better to reduce the number of utilized CPUs manually:
You want to simultaneously work on the same system, e.g., browse the web or watch a video.
You are on a multi-user system and want to spare some resources for other users.
You have a CPU with heterogeneous cores, for example, the energy-efficient “Icestorm” cores on a Mac M1 chip. These are comparably slower than the high-performance “Firestorm” cores and not well suited for heavy computations.
You have linked R to a threaded BLAS implementation like OpenBLAS, and your learners make heavy use of linear algebra.
You can manually set the number of CPUs to overrule the heuristic via option "mc.cores":
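For example, to restrict mlr3 (via parallelly::availableCores(), which respects this option) to at most 4 CPUs:

# limit the number of CPUs detected by the heuristic
options(mc.cores = 4)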
We recommend setting this in your system's .Rprofile file (cf. ?Startup).
There are some other approaches for parallelization of learners, e.g. by directly supporting one specific parallelization backend or a parallelization framework like foreach. If this is supported, parallelization must be explicitly activated, e.g. by setting a hyperparameter. If you need to parallelize on the learner level because a single model fit takes too much time, and you only fit a few of these models, consult the documentation of the respective learner. In many scenarios it makes more sense to parallelize on a different level like resampling or benchmarking which is covered in the following subsections.
9.1.2 Parallelization of Resamplings
In addition to parallel learners, most machine learning experiments include a very easy handle for parallelization: the resampling. By definition, resampling is performed to get an unbiased performance estimator by aggregating over independent repetitions of multiple train-test splits.
mlr3 has “marked” this loop of independent iterations as parallelizable and uses an additional abstraction layer to support a broad range of parallel backends: the future package. The loop is executed via the future parallelization framework, using the parallel backend configured by the user via the future::plan() function.
In this section, we will use the spam task and a simple lrn("classif.rpart"). We use the future::multisession plan (which internally uses socket clusters from the parallel package, see parallel::makeCluster()) that should work on all operating systems.
# define objects to perform a resampling
task = tsk("spam")
learner = lrn("classif.rpart")
resampling = rsmp("cv", folds = 3)

# select the multisession backend to use
future::plan("multisession")

# run the resampling in parallel and measure runtime
system.time({
  resample(task, learner, resampling)
})
user system elapsed
0.171 0.005 0.928
By default, all CPUs of your machine are used unless you specify the argument workers in future::plan() (possible problems with the value returned by the heuristic have already been discussed in the previous Section 9.1.1). If you compare runtimes between the parallel backend and sequential execution (plan("sequential")) here, you should see a decrease in the reported elapsed time. However, in practice, you cannot expect the runtime to fall linearly as the number of cores increases (Amdahl's law). In contrast to threads, the technical overhead for starting workers, communicating objects, sending back results, and shutting down the workers is quite large for the multisession backend. The multicore backend (plan("multicore")) comes with more overhead than threading, but considerably less overhead than the multisession backend. In fact, with the multicore backend, R objects are copied only when they are modified (copy-on-write), while with the multisession backend, objects are always copied to the respective session prior to any computation. The multicore backend has the major disadvantage that it is not supported on Windows systems - for this reason, we will stick with the multisession backend for all examples here. In general, it is advised to only consider parallelization for resamplings where each iteration runs at least a few seconds. Note that there are two mlr3 options to control the execution and granularity:
If mlr3.exec_random is set to TRUE (default), the order of jobs is randomized in resamplings and benchmarks. This can help if you run a benchmark or tuning with heterogeneous runtimes, e.g., to avoid that all the expensive learners get started last.
The option mlr3.exec_chunk_size can be used to control how many jobs are mapped to a single future and defaults to 1. The value of this option is passed to future.apply::future_mapply(), and future.scheduling is always set to TRUE.
Tuning the chunk size can help in some rare cases to mitigate the parallelization overhead. For larger problems and longer runtimes, however, this plays a subordinate role.
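Both are plain R options and can be set before starting the resampling; the values shown here are the defaults described above:

options(mlr3.exec_random = TRUE, mlr3.exec_chunk_size = 1)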
Figure 9.1 illustrates the parallelization from the above example. From left to right:
The computational task is split into 3 parts for the 3-fold cross-validation.
The folds are passed to 3 workers, each fitting a model on the respective subset of the task and predicting on the left-out observations.
The predictions (and trained models) are communicated back to the main process, which combines them into a ResampleResult.
Figure 9.1: Parallelization of a resampling using a 3-fold cross-validation
9.1.3 Parallelization of Benchmarks
Benchmarks can be seen as a collection of multiple independent resamplings, where a combination of a task, a learner, and a resampling strategy defines one resampling to perform. In pseudo-code, the calculation can be written down as two nested loops: an outer loop over these combinations and an inner loop over the respective resampling iterations.
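A conceptual sketch of these loops (not mlr3's actual internals; benchmark_design and resampling_iterations are hypothetical names used only for illustration):

# outer loop: one entry per (task, learner, resampling) combination
for (combination in benchmark_design) {
  # inner loop: the resampling iterations of this combination
  for (iteration in combination$resampling_iterations) {
    # fit the learner on the training set and predict on the test set
  }
}

This leaves two options for parallelization: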
Parallelize over all resamplings, execute each resampling sequentially (parallelize outer loop).
Iterate over all resamplings, execute each resampling in parallel (parallelize inner loop).
If you are transitioning from mlr, you might be used to selecting one of these parallelization levels before benchmarking. One major drawback of this approach becomes clear when both the outer and inner loop have fewer iterations than there are available workers, resulting in an underutilized system. In mlr3, the choice of level is no longer required (except occasionally for nested resampling, briefly described in the following Section 9.1.4). All experiments are rolled out on the same level, i.e., benchmark() iterates over the elements of the Cartesian product of the iterations of the outer and inner loops. Therefore, there is no need to decide whether you want to parallelize the tuning or the resampling, you always parallelize both. This approach makes the computation fine-grained and gives the future backend the opportunity to group the jobs into chunks of suitable size (depending on the number of workers).
Parallelization of benchmarks works analogously to resampling:
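A minimal sketch; the tasks, learners, and resampling chosen here are placeholders:

# enable parallelization, then run the benchmark as usual
future::plan("multisession")

design = benchmark_grid(
  tasks = tsks(c("spam", "sonar")),
  learners = lrns(c("classif.rpart", "classif.featureless")),
  resamplings = rsmp("cv", folds = 3)
)
bmr = benchmark(design)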
For larger benchmarks with a cumulative runtime of weeks, months or even years, see Section 10.2 which covers parallelization on high-performance computing clusters.
9.1.4 Nested Resampling Parallelization
Like in benchmarking, nested resampling for tuning also translates into two nested resampling loops. But unlike benchmarking, the outer loop iterations are not necessarily independent of each other: depending on the result of the resampling in the first outer loop, different hyperparameters are suggested for the second iteration. Therefore, nested loops cannot be flattened, and the user instead has to choose which of the loops to parallelize. Let us consider the following example: You want to tune the minsplit argument of a classification tree using the AutoTuner of mlr3tuning (simplified version taken from Section 4.1):
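A sketch of such an AutoTuner; the tuning range for minsplit and the budget of 20 evaluations are illustrative choices:

library(mlr3tuning)

# tune minsplit of a classification tree
learner = lrn("classif.rpart",
  minsplit = to_tune(2, 128))

at = auto_tuner(
  tuner = tnr("random_search"),
  learner = learner,
  resampling = rsmp("cv", folds = 2), # inner 2-fold cross-validation
  measure = msr("classif.ce"),
  term_evals = 20
)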
To evaluate the performance on an independent test set, resampling is used:
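For example (the task is a placeholder; any classification task works here):

task = tsk("penguins")

# outer 5-fold cross-validation of the tuned learner
rr = resample(task, at, rsmp("cv", folds = 5))

This nested resampling offers three opportunities for parallelization: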
the inner cross-validation of the auto tuner with 2 folds,
the outer cross-validation of the resampling with 5 folds, and
evaluating all configurations proposed by the random search in a single batch (parameter batch_size of TunerRandomSearch, defaulting to 1).
Because the third opportunity is not always applicable, especially for many advanced tuning algorithms which are only capable of proposing a single configuration in each iteration, we will here focus on the first two opportunities. Furthermore, we assume that we have a single CPU with four cores available.
If we opt to parallelize the outer CV, all four cores would be utilized first with the computation of the first 4 resampling iterations. The computation of the fifth iteration has to wait. The resulting CPU utilization of the nested resampling example on 4 CPUs is visualized in two Figures:
Figure 9.2 as an example for parallelizing the outer 5-fold cross-validation.
# Runs the outer loop in parallel and the inner loop sequentially
future::plan(list("multisession", "sequential"))
We assume that each fit during the inner resampling takes 4 seconds to compute and that there is no other significant overhead. First, each of the four workers starts with the computation of an inner 2-fold cross-validation. As there are more jobs than workers, the remaining fifth iteration of the outer resampling is queued and starts on CPU1 after the first four iterations have finished after 8 seconds. During the computation of the fifth outer resampling iteration, only CPU1 is utilized; the other 3 CPUs are idling. Note that just setting up the parallelization with a simple future::plan("multisession") has the same effect - the outermost loop is parallelized while all subsequent loops default to sequential execution.
Figure 9.3 as an example for parallelizing the inner 2-fold cross-validation.
# Runs the outer loop sequentially and the inner loop in parallel
future::plan(list("sequential", "multisession"))
Here, the outer loop runs sequentially and distributes the 2 computations for the inner resampling on 2 CPUs. Meanwhile, CPU3 and CPU4 are idling.
Figure 9.2: CPU utilization for 4 CPUs while parallelizing the outer 5-fold cross-validation with a sequential 2-fold cross-validation inside. Jobs are labeled as [iteration outer]-[iteration inner].
Figure 9.3: CPU utilization for 4 CPUs while parallelizing the inner 2-fold cross-validation with a sequential 5-fold cross-validation outside. Jobs are labeled as [iteration outer]-[iteration inner].
Neither option exploits the full potential of the 4 CPUs. With parallelization of the outer loop, all results are computed after 16 seconds, in contrast to parallelization of the inner loop, where the results are only available after 20 seconds.
If possible, the number of iterations can be adapted to the available hardware. There is no law set in stone that you have to select, e.g., 10 folds in cross-validation. If you have 4 CPUs and a reasonable variance, 8 iterations are often sufficient, or you do 12 iterations because you get the last two iterations basically for free.
Alternatively, you can also enable parallelization for both loops for nested parallelization, even on different parallelization backends. While nesting real parallelization backends is often unintended and causes unnecessary overhead, it is useful in some distributed computing setups. In this case, the number of workers must be manually tweaked so that the system does not get overburdened:
# Runs both loops in parallel
future::plan(list(
  future::tweak("multisession", workers = 2),
  future::tweak("multisession", workers = 4)
))
This example would run on 8 cores (= 2 * 4) on the local machine, parallelizing the outer resampling on 2, and the inner resampling on 4 workers. The vignette of the future package gives more insight into nested parallelization. For more background information about parallelization during tuning, see Section 6.7 of Bischl et al. (2023).
Important
During tuning with mlr3tuning, you can often adjust the batch size of the Tuner, i.e., control how many hyperparameter configurations are evaluated in parallel. If you want full parallelization, make sure that the batch size multiplied by the number of (inner) resampling iterations is at least equal to the number of available workers. If you expect homogeneous runtimes, i.e., you are tuning over a single learner or linear pipeline, and you have no hyperparameter which is likely to influence the runtime, aim for a multiple of the number of workers.
In general, larger batches allow for more parallelization, while smaller batches imply a more frequent evaluation of the termination criteria. We default to a batch_size of 1, which ensures that all Terminators work as intended, i.e., you cannot exceed the computational budget.
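For example, the batch size can be set directly when constructing the tuner; the value 8 is only illustrative:

# propose and evaluate 8 configurations per batch
tuner = tnr("random_search", batch_size = 8)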
Heterogeneous runtimes add an extra layer of complexity to parallelization. This occurs frequently, especially in tuning, when a hyperparameter strongly influences the runtime of the learning procedure. Examples are the number of trees for random forests or the number of regularization values to be tested in penalized regression.
How efficient the parallelization turns out to be depends in particular on the scheduling strategy of the backend. After the first batch of jobs is sent to the workers, the next jobs are started either
(a) as soon as all results have been collectively reported back to the main process, or
(b) as soon as the first job reports back.
Method (a) usually comes with less synchronization overhead and is best suited for short jobs with homogeneous runtimes. Method (b) is faster if the runtimes are heterogeneous, especially if the parallelization overhead is negligible in comparison with the runtime of the computation. E.g., for parallel::mclapply(), the behavior of the scheduler can be controlled with the mc.preschedule option; parallel::parSapply() implements method (a), while parallel::parSapplyLB() implements load balancing according to method (b).
9.1.5 Parallelization of Predictions
Finally, the predictions of a single learner can also be parallelized, as the predictions for any two observations are independent. For most learners, training is the bottleneck and parallelizing the prediction is not a worthwhile endeavor, but of course there are exceptions.
Technically, the test data is first split into multiple groups and the predict method of the learner is applied to each group in parallel, using the active backend configured via future::plan(). The resulting predictions are then combined internally in a second step. However, to avoid predicting in parallel accidentally, parallel predictions must first be enabled in the learner via the parallel_predict field:
# train random forest on spam task
task = tsk("spam")
learner = lrn("classif.ranger")
learner$train(task)

# set up parallel predict on 4 workers
future::plan("multisession", workers = 4)
learner$parallel_predict = TRUE

# perform prediction
prediction = learner$predict(task)
The resulting Prediction is identical to the one computed sequentially.
9.1.6 Reproducibility
Usually, reproducibility is a major concern during parallelization because special pseudorandom number generators (PRNGs) are required (see this blog post). A simple set.seed() is not sufficient when parallelization is involved. One general recommendation is switching from the default Mersenne Twister to Pierre L'Ecuyer's RngStreams (see ?Random and parallel::RNGstreams). However, even this parallel PRNG comes with pitfalls w.r.t. reproducibility: if you change the number of workers, you still get different results after setting the seed with set.seed().
Luckily, this and many other problems are already addressed by the excellent future parallelization framework which mlr3 uses under the hood. future ensures that all workers receive exactly the same PRNG streams, independent of the number of workers. Although correct seeding alone does not guarantee full reproducibility, it is one problem less to worry about. You can find more details about the used PRNG in this blog post.
The issue of reproducibility is very complex, and complete reproducibility is very difficult to achieve - this ultimately also depends on the computational accuracy of the hardware, the processor instructions used, compiler versions and optimisation flags, or the BLAS library for linear algebra R links to. But since at least parallel PRNGs are not a problem, you should get the same or at least very similar results with a simple set.seed() before you run the experiments.
9.2 Error Handling
In large ML experiments, it is not uncommon that a model fit or prediction fails with an error. This is because the algorithms have to process arbitrary data, and not all eventualities can always be handled. While we try to identify obvious problems before execution, for example missing values passed to a learner that cannot handle them, other problems are far more complex to detect. Examples include correlations or collinearity that make model fitting impossible, outliers that lead to numerical problems, or new levels of categorical variables appearing in the predict step. The learners behave quite differently when encountering such problems: some models signal a warning during the train step that they failed to fit but return a baseline model, while other models stop the execution. During prediction, some learners just refuse to predict the response for observations they cannot handle, while others predict a missing value. How to deal with these problems, even in more complex setups like benchmarking or tuning, is the topic of this section.
For illustration (and internal testing) of error handling, mlr3 ships with the learners "classif.debug" and "regr.debug". Here, we use the debug learner for classification to demonstrate the error handling:
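It can be retrieved from the learner dictionary like any other learner:

learner = lrn("classif.debug")
print(learner)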
This learner comes with special parameters that let us simulate problems frequently encountered in ML. E.g., the debug learner comes with hyperparameters to control
what conditions should be signaled (message, warning, error, segfault) with what probability,
during which stage the conditions should be signaled (train or predict), and
the ratio of predictions being NA (predict_missing).
We now set a hyperparameter to let the debug learner signal an error during the train step. By default, mlr3 does not catch conditions such as warnings or errors raised while calling learners:
# set probability to signal an error to 1
learner$param_set$values = list(error_train = 1)
learner$train(tsk("penguins"))
Error in .__LearnerClassifDebug__.train(self = self, private = private, : Error from classif.debug->train()
If this had been a regular learner, we could now start debugging with traceback() (or create a minimal reproducible example (MRE) to track the problem down or file a bug report upstream). However, due to the nature of the problem, it is likely that the cause of the error cannot be fixed - so you have to learn how to deal with such errors.
If you start debugging, make sure you have disabled parallelization to avoid various pitfalls related to parallelization. It may also be helpful to set the option mlr3.debug to TRUE. If this flag is set, mlr3 does not call into the future package, resulting in an easier-to-interpret program flow and traceback().
9.2.1 Encapsulation
Since ML algorithms are confronted with arbitrary, often messy data, errors are not uncommon here, and we often just need to move on during benchmarking or tuning. Thus, we need mechanisms to
capture all signaled conditions such as messages, warnings and errors so that we can analyze them post-hoc (called encapsulation, covered in this section),
deal with algorithms which do not terminate in a reasonable time, and
proceed in a statistically sound way while being able to aggregate over partial results (Section 9.2.2).
Encapsulation ensures that signaled conditions (such as messages, warnings and errors) are intercepted: all conditions raised during the training or predict step are logged into the learner, and errors do not interrupt the program flow. I.e., the execution of the calling function or package (here: mlr3) continues as if there had been no error, though the result (fitted model during train(), predictions during predict()) are missing. Each Learner has a field encapsulate to control how the train or predict steps are wrapped. The easiest way to encapsulate the execution is provided by the package evaluate which evaluates R expressions while tracking conditions such as outputs, messages, warnings or errors (see the documentation of the encapsulate() helper function for more details):
task=tsk("penguins")learner=lrn("classif.debug")# this learner throws a warning and then stops with an error during train()learner$param_set$values=list(warning_train =1, error_train =1)# enable encapsulation for train() and predict()learner$encapsulate=c(train ="evaluate", predict ="evaluate")learner$train(task)
After training the learner, one can access the recorded log via the fields log, warnings and errors:
learner$log
stage class msg
1: train warning Warning from classif.debug->train()
2: train error Error from classif.debug->train()
learner$warnings
[1] "Warning from classif.debug->train()"
learner$errors
[1] "Error from classif.debug->train()"
Another method for encapsulation is implemented in the callr package. In contrast to evaluate, the computation is carried out in a separate R process. This guards the calling session against segmentation faults which otherwise would tear down the complete main R session. On the downside, starting new processes comes with comparably more computational overhead.
With either of these encapsulation methods, we can catch errors and post-hoc analyze the messages, warnings and error messages. Additionally, a timeout can be set so that learners do not run for an indefinite time but are terminated after a specified time. Interrupting learners works differently depending on the encapsulation method (see mlr3misc::encapsulate()); viewed from the outside, a learner that reaches the timeout behaves as if it had signaled an error. The timeout can be set separately for training and prediction and must be provided in seconds:
# 5 minute timeout for training, no timeout for predict
learner$timeout = c(train = 5 * 60, predict = Inf)
Unfortunately, catching errors and ensuring an upper time limit is only half the battle. Without a model, it is not possible to get predictions:
learner$predict(task)
Error: Cannot predict, Learner 'classif.debug' has not been trained yet
To handle the missing predictions gracefully during resample(), benchmark() or tuning, fallback learners are introduced next.
9.2.2 Fallback Learners
Fallback learners have the purpose of being able to score results in cases where a Learner completely failed to fit a model or refuses to provide predictions for some or all observations.
We will first handle the case that a learner fails to fit a model during training, e.g., if some convergence criterion is not met or the learner ran out of memory. There are in general three possibilities to proceed:
Ignore iterations with failed model fits. Although this is arguably the most frequent approach in practice, it is not statistically sound. For example, consider the case where a researcher wants a specific learner to look better in a benchmark study. To do this, the researcher takes an existing learner but introduces a small adaptation: If an internal goodness-of-fit measure is not achieved, an error is thrown. In other words, the learner only fits a model if the model can be reasonably well learned on the given training data. In comparison with the learning procedure without this adaptation and a good threshold, however, we now compare the mean over only the “easy” splits with the mean over all splits - an unfair advantage.
Penalize failing learners. Instead of ignoring failed iterations, we can simply impute the worst possible score (as defined by the Measure) and thereby heavily penalize the learner for failing. However, this often seems too harsh for many problems, and for some measures there is no reasonable value to impute.
Impute a value that corresponds to a (weak) baseline. Instead of imputing with the worst possible score, impute with a reasonable baseline, e.g., by just predicting the majority class or the mean of the response in the training data. Such simple baselines are implemented as featureless learners (mlr_learners_classif.featureless or mlr_learners_regr.featureless). Note that a reasonable baseline value is different in different training splits. Retrieving these values after a larger benchmark study has been conducted is possible, but tedious.
We strongly recommend option (3): it is statistically sound and very flexible. To make this procedure convenient during resampling and benchmarking, we support fitting a proper baseline with a fallback learner. In the next example, we attach a simple featureless learner as fallback to the debug learner. So whenever the debug learner fails (which is every single time with the given parametrization) and encapsulation is enabled, mlr3 internally falls back to the predictions of the featureless learner:
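A sketch of this setup; the hyperparameter value is chosen so that training always fails, and the task is a placeholder:

learner = lrn("classif.debug", error_train = 1)
learner$fallback = lrn("classif.featureless")

learner$train(tsk("penguins"))
print(learner)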
Note that encapsulation is not enabled explicitly; it is automatically set to "evaluate" for the train and predict steps when a fallback learner is assigned to a learner without encapsulation enabled. Furthermore, the log contains the captured error (which is also included in the print output), and although no model is stored, we can still get predictions:
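For example, continuing with the penguins task from above (the accuracy measure is an arbitrary choice):

prediction = learner$predict(tsk("penguins"))
prediction$score(msr("classif.acc"))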
In this stepwise train-predict procedure, the fallback learner is of limited use. However, it is invaluable for larger benchmark studies.
In the following snippet, we compare the previously created debug learner with a simple classification tree. We re-parametrize the debug learner to fail in roughly 30% of the resampling iterations during the training step:
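A sketch of such a benchmark; the task and the 10-fold cross-validation are illustrative choices:

# fail in roughly 30% of the training steps
learner$param_set$values = list(error_train = 0.3)

design = benchmark_grid(
  tasks = tsk("penguins"),
  learners = list(learner, lrn("classif.rpart")),
  resamplings = rsmp("cv", folds = 10)
)
bmr = benchmark(design)
bmr$aggregate(msr("classif.ce"))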
Even though the debug learner occasionally failed to provide predictions, we still got a statistically sound aggregated performance value which we can compare to the aggregated performance of the classification tree. It is also possible to split the benchmark up into separate ResampleResult objects which sometimes helps to get more context. E.g., if we only want to have a closer look into the debug learner, we can extract the errors from the corresponding resample results:
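One way to do this (a sketch, assuming the debug learner is the first entry of the benchmark design):

rr = bmr$resample_result(1)
rr$errors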
iteration msg
1: 1 Error from classif.debug->train()
2: 4 Error from classif.debug->train()
3: 5 Error from classif.debug->train()
A problem similar to failed model fits emerges when a learner predicts only a subset of the observations in the test set (and predicts NA or no value for others). A typical case is, e.g., when new and unseen factor levels are encountered in the test data. Imagine again that our goal is to benchmark two algorithms using cross-validation on some binary classification task:
Algorithm A is an ordinary logistic regression.
Algorithm B is also an ordinary logistic regression, but with a twist: if the logistic regression is rather certain about the predicted label (> 90% probability), it returns the label, and a missing value otherwise.
At its core, this is the same problem as outlined before. If we measured the performance using only the non-missing predictions, algorithm B would likely outperform algorithm A. However, this approach does not factor in that you cannot generate predictions for all observations. Long story short, if a fallback learner is specified, missing predictions of the base learner will automatically be replaced with predictions from the fallback learner. This is illustrated in the following example:
task=tsk("penguins")learner=lrn("classif.debug")# this hyperparameter sets the ratio of missing predictionslearner$param_set$values=list(predict_missing =0.5)# without fallbackp=learner$train(task)$predict(task)table(p$response, useNA ="always")
Adelie Chinstrap Gentoo <NA>
172 0 0 172
# with fallback
learner$fallback = lrn("classif.featureless")
p = learner$train(task)$predict(task)
table(p$response, useNA = "always")
Adelie Chinstrap Gentoo <NA>
172 172 0 0
To sum up, by combining encapsulation and fallback learners, it is possible to benchmark even quite unreliable or unstable learning algorithms in a convenient and statistically sound fashion.
9.3 Logging
mlr3 internally uses the lgr package to control the verbosity of the output, e.g., suppress messages or enable additional debugging messages.
All log messages have an associated level which encodes a number, e.g., informative messages have the level "info", associated with the number 400, while debugging messages have the level "debug" with number 500. A message is only displayed if its level does not exceed the global logging threshold.
To change the setting for mlr3 for the current session, you need to retrieve the logger (which is an R6 object) from lgr, and then change the threshold of the mlr3 logger like this:
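For example, to suppress all messages below the warning level (the mlr3 logger is registered under the name "mlr3"):

# retrieve the mlr3 logger and lower the verbosity to warnings and errors
logger = lgr::get_logger("mlr3")
logger$set_threshold("warn")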
lgr comes with a global option called "lgr.default_threshold" which can be set via options() to make your choice permanent across sessions.
Also note that the optimization packages such as mlr3tuning or mlr3fselect use the logger of their base package bbotk. If you want to disable logging from mlr3 but keep the output from mlr3tuning, reduce the verbosity of the mlr3 logger and change the bbotk logger to the desired level.
Redirecting output is already extensively covered in the documentation and vignette of lgr. Here is just a short example that adds an appender to log events additionally to a temporary file using the JSON format:
tf=tempfile("mlr3log_", fileext =".json")# get the logger as R6 objectlogger=lgr::get_logger("mlr")# add Json appenderlogger$add_appender(lgr::AppenderJson$new(tf), name ="json")# signal a warninglogger$warn("this is a warning from mlr3")
WARN [13:46:37.591] this is a warning from mlr3
# print the contents of the file
cat(readLines(tf))
{"level":300,"timestamp":"2023-06-04 13:46:37","logger":"mlr","caller":"eval","msg":"this is a warning from mlr3"}
# remove the appender again
logger$remove_appender("json")
9.3.3 Immediate Log Feedback
mlr3 uses future and encapsulation to make evaluations fast, stable, and reproducible. However, this may lead to logs being delayed, out of order, or, in case of some errors, not present at all.
When it is necessary to have immediate access to log messages, for example to investigate problems, one may therefore choose to disable future and encapsulation. This can be done by enabling the debug mode using options(mlr3.debug = TRUE); the $encapsulate slot of learners should also be set to "none" (default) or "evaluate", but not "callr". Enabling the debug mode should only be done to investigate problems, however, and not for production use, because
this disables parallelization, and
this leads to different RNG behavior and therefore to results that are not fully reproducible.
9.4 Data Backends
This section covers advanced ML or technical details that can be skipped.
In mlr3, Task objects store their data in an abstract data object, the DataBackend. A data backend provides a unified API to retrieve subsets of the data or query information about it, regardless of how the data is actually stored on the system. The default backend uses data.table via the DataBackendDataTable as a very fast and efficient in-memory database. For example, we can query the dimensions of the penguins task:
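The backend is accessible via the task's $backend field:

task = tsk("penguins")
backend = task$backend

# number of rows and columns stored in the backend
backend$nrow
backend$ncol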
While storing the Task’s data in memory is most efficient w.r.t. accessing it for model fitting, this has two major disadvantages:
Although often only a small proportion of the data is required at any time, the complete data frame sits in memory and consumes RAM. This is especially a problem if you work with large tasks or many tasks simultaneously, e.g., for benchmarking.
During parallelization, the complete data needs to be transferred to the workers which can increase the overhead.
To avoid these drawbacks, especially for larger data, it can be necessary to interface out-of-memory data to reduce the memory requirements. This way, only the part of the data which is currently required by the learners will be placed in main memory to operate on. There are multiple options to achieve this:
DataBackendDplyr, which interfaces many popular database systems via the dbplyr package, e.g., the SQLite database used in Section 9.4.1.
DataBackendDuckDB for the impressive DuckDB database connected via duckdb: a fast, zero-configuration alternative to SQLite.
DataBackendDuckDB, again, but for Parquet files. The data does not need to be converted to DuckDB's native storage format; you can work directly on directories containing one or multiple files stored in the popular Parquet format.
In the following, we will show how to work with data backends that are available through mlr3db.
9.4.1 Databases with DataBackendDplyr
To demonstrate the mlr3db::DataBackendDplyr, we use the NYC flights data set from the nycflights13 package and move it into a SQLite database. Although as_sqlite_backend() provides a convenient way to perform this step, we construct the database manually here.
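A sketch of this manual construction using DBI and RSQLite; the table name, the added row_id key column, and the temporary file location are choices made for this example:

library(DBI)
library(RSQLite)
library(nycflights13)

# location of the database file
path = tempfile("flights", fileext = ".sqlite")

# add an integer primary key column which the backend will need later
data = as.data.frame(flights)
data$row_id = seq_len(nrow(data))

# write the table to the SQLite database and close the connection again
con = dbConnect(SQLite(), path)
dbWriteTable(con, "flights", data)
dbDisconnect(con)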
With the SQLite database stored in file path, we now re-establish a connection and switch to dplyr/dbplyr for some essential preprocessing. If you had a real database management system (DBMS), this would be the first step now:
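Continuing the sketch from above, we re-connect to the database and create a lazy dplyr table referencing the "flights" table:

library(dplyr)
library(dbplyr)

# re-establish the connection and reference the table without loading it
con = DBI::dbConnect(RSQLite::SQLite(), path)
tbl = tbl(con, "flights")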
Attaching package: 'dplyr'
The following objects are masked from 'package:data.table':
between, first, last
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':
ident, sql
As databases are intended to store large volumes of data, a natural first step is to slice and dice the data to suitable dimensions. Therefore, we build up an SQL query in a step-wise fashion using dplyr verbs and start by selecting a subset of columns to work on:
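A sketch of such preprocessing, followed by the construction of the backend; the selected columns are an arbitrary numeric subset, rows with a missing target are dropped, and "row_id" is the key column added when the database was created above:

# select a subset of columns and remove observations with missing target
tbl = select(tbl, row_id, year, month, day, hour, dep_delay,
  arr_delay, air_time, distance)
tbl = filter(tbl, !is.na(arr_delay))

# construct the DataBackendDplyr on the lazy table
library(mlr3db)
b = DataBackendDplyr$new(tbl, primary_key = "row_id")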
Note that the DataBackendDplyr does not know about any rows or columns we have filtered out with dplyr before; it just operates on the view we provided.
As we now have constructed a backend, we can switch over to mlr3 for model fitting on a task based on the previously created mlr3db::DataBackendDplyr:
task = as_task_regr(b, id = "flights_sqlite", target = "arr_delay")
learner = lrn("regr.rpart")
resampling = rsmp("subsampling", ratio = 0.02, repeats = 3)
We pass all these objects to resample() to perform a subsampling on 2% of the observations three times. In each iteration, only the required subset of the data is queried from the SQLite database and passed to rpart::rpart():
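For example (the performance measure is an arbitrary choice):

rr = resample(task, learner, resampling)
rr$aggregate(msr("regr.rmse"))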
Note that we still have an active connection to the database. To properly close it, we remove the tbl object referencing the connection and then close the connection.
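Continuing the sketch from above:

# drop the reference to the table, then close the connection
rm(tbl)
DBI::dbDisconnect(con)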
We have already demonstrated how to operate on a SQLite database. DuckDB databases (using DataBackendDuckDB) provide a modern alternative to SQLite, tailored to the needs of machine learning. To convert a data.frame to DuckDB, we provide the helper function as_duckdb_backend(). Only two arguments are required: the data.frame to convert, and a path to store the data.
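A minimal sketch, assuming the second argument is named path; the data and the temporary location are placeholders:

library(mlr3db)

data = tsk("penguins")$data()
backend = as_duckdb_backend(data, path = tempfile("duckdb_"))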
While this is useful when working with many tasks simultaneously in order to keep the memory requirements reasonable, the more frequent use case for DuckDB is nowadays Parquet files. Parquet is a popular column-oriented data storage format supporting efficient compression, making it far superior to other popular data exchange formats such as CSV.
To demonstrate working with Parquet files, we first query the location of an example dataset shipped with mlr3db:
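For example (the exact file name shipped with the package is an assumption of this sketch):

path = system.file(file.path("extdata", "spam.parquet"), package = "mlr3db")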
We can then create a DataBackendDuckDB based on this file and convert the backend to a classification task, all without loading the dataset into memory:
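Continuing the sketch, and assuming the file contains the spam data whose target column is named "type":

backend = as_duckdb_backend(path)
task = as_task_classif(backend, target = "type")
print(task)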
Accessing the data internally triggers a query and data is fetched to be stored in an in-memory data.frame, but only the required subsets. After the retrieved data is processed, the garbage collector can release the occupied memory. The backend can also operate on a folder with multiple parquet files, which is documented in as_duckdb_backend().
9.5 Extending mlr3
Hopefully having read the rest of this book you are now on the way to being an mlr3 expert. Maybe you will even want to extend the universe with new classes for more learners, measures, tuners, pipeops, and more; if so, read on.
This section covers advanced ML or technical details that can be skipped.
In this chapter we will show how to extend mlr3 using the simple example of creating a custom Measure. If you are interested in implementing new learners, pipeops, and tuners, then check out the vignettes in the respective packages: mlr3extralearners, mlr3pipelines, or mlr3tuning. Or if you are considering adding a new machine learning task then please contact us on GitHub, email, or Mattermost. This section assumes good knowledge of R6, see Section 1.8.2 for a brief introduction and references to further resources.
We welcome contributions from developers of all levels, and if you want to add any of your new classes to our universe then please make pull requests to the corresponding packages; for example, tuners and filters would go to mlr3tuning and mlr3filters respectively, a new survival measure would go to mlr3proba, and all new learners go to mlr3extralearners. Do not worry if you make a PR to the wrong repository, we will transfer it to the right one. Please read the mlr3 Wiki for the coding conventions we use if you want to add code to our organisation.
We will now turn to extending the Measure class to implement new metrics. As an example, let us consider a regression measure that scores a prediction as \(1\) if the difference between the true and predicted values is less than one standard deviation of the truth, and as \(0\) otherwise. In maths this would be defined as \(f(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}(|y_i - \hat{y}_i| < \sigma(y))\), where \(y\) contains the true values and \(\hat{y}\) the predicted values for observations \(i = 1, ..., n\). The function \(\mathbb{I}(C)\) denotes the indicator function with \(\mathbb{I}(C) = 1\) if condition \(C\) is true and \(\mathbb{I}(C) = 0\) otherwise. In code, the measure may be written as:
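This is the same loss function that will be used inside the measure class below:

threshold_acc = function(truth, response) {
  mean(ifelse(abs(truth - response) < sd(truth), 1, 0))
}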
This measure is then bounded in \([0, 1]\) and a larger score is better.
To use this measure in mlr3, we need to create a new R6::R6Class, which will inherit from Measure, in this case specifically from MeasureRegr. We will now demonstrate what the final code for this new measure looks like and then explain each line; this can be used as a template for most performance measures.
MeasureRegrThresholdAcc = R6::R6Class("MeasureRegrThresholdAcc",
  inherit = mlr3::MeasureRegr, # regression measure
  public = list(
    initialize = function() { # initialize class
      super$initialize(
        id = "thresh_acc", # unique ID
        packages = character(), # no dependencies
        properties = character(), # no special properties
        predict_type = "response", # measures response prediction
        range = c(0, 1), # results in values between (0, 1)
        minimize = FALSE # larger values are better
      )
    }
  ),
  private = list(
    .score = function(prediction, ...) { # define score as private method
      # define loss
      threshold_acc = function(truth, response) {
        mean(ifelse(abs(truth - response) < sd(truth), 1, 0))
      }
      # call loss function
      threshold_acc(prediction$truth, prediction$response)
    }
  )
)
In the first two lines we name the class, here MeasureRegrThresholdAcc, and then state this is a regression measure that inherits from MeasureRegr.
We initialize the class by stating its unique ID is “thresh_acc”, that it does not require any external packages (packages = character()) and that it has no special properties (properties = character()).
We then pass specific details of the loss function which are: it evaluates a "response" type prediction, its values range between (0, 1), and larger values are better (minimize = FALSE).
Finally, we define the score itself as a private method called .score and simply pass the predictions to the function we defined earlier.
Sometimes measures require data from the training set, the task, or the learner. These are usually more complex edge cases, so we will not go into detail here; for working examples we suggest looking at the code for mlr3proba::MeasureSurvSongAUC and mlr3proba::MeasureSurvAUC. You can also consult the manual page of Measure for an overview of other properties and metadata that can be specified.
Once you have defined your measure, you can either use it with the R6 constructor or add it to the mlr_measures dictionary:
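A sketch of the first option; the task, learner, and data split here are placeholders, so the score will differ from the one shown below:

task = tsk("mtcars")
learner = lrn("regr.rpart")$train(task)
prediction = learner$predict(task)

# use the measure via its R6 constructor
prediction$score(MeasureRegrThresholdAcc$new())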
# or add to dictionary
mlr3::mlr_measures$add("regr.thresh_acc", MeasureRegrThresholdAcc)
prediction$score(msr("regr.thresh_acc"))
thresh_acc
0.6363636
Even though we only showed how to create a custom measure, the process of adding other objects is in essence the same:
Find the right class to inherit from
Add methods that:
Initialize the object with the correct properties ($initialize()).
Implement the public and private methods that do the actual computation. In the above example this was the private $.score() method.
As a lot of classes already exist in mlr3, we recommend looking at similar classes that are already available.
9.6 Conclusion
This chapter describes many advanced topics that are not relevant to each and every user and cannot be covered in full detail here. Specifically, we covered parallelization, error handling, logging, and working with databases - all these topics might not be needed if you work on rather small and clean datasets. Additionally, we demonstrated how to extend mlr3 with additional measures.
The sections presented in this chapter deal with problems that many users will encounter with some regularity. Therefore, at least superficial knowledge is necessary to identify such a problem when it occurs and to return to the correct section, or, for more background information, to the following resources.
Resources
Schmidberger et al. (2009) and Eddelbuettel (2020) give a more systematic and in-depth overview of the possibilities to parallelize with R.
The vignette of the lgr package demonstrates advanced logging capabilities, e.g., logging to JSON files or retrieving logged objects for debugging.
Extending and customizing objects is covered in the documentation of the respective packages, e.g., mlr3extralearners for learners, mlr3pipelines for pipeline operators, and mlr3tuning for tuners.
For an overview of available DBMS in R, see the CRAN task view on databases, and in particular the vignettes of the dbplyr package for DBMS readily available in mlr3. For working directly with a SQL database, we recommend a general purpose SQL tutorial.
Bischl, Bernd, Martin Binder, Michel Lang, Tobias Pielok, Jakob Richter, Stefan Coors, Janek Thomas, et al. 2023. "Hyperparameter Optimization: Foundations, Algorithms, Best Practices, and Open Challenges." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, e1484.
Eddelbuettel, Dirk. 2020. "Parallel Computing with R: A Brief Review." WIREs Computational Statistics 13 (2). https://doi.org/10.1002/wics.1515.
Schmidberger, Markus, Martin Morgan, Dirk Eddelbuettel, Hao Yu, Luke Tierney, and Ulrich Mansmann. 2009. "State of the Art in Parallel Computing with R." Journal of Statistical Software 31 (1). https://doi.org/10.18637/jss.v031.i01.
# Technical {#sec-technical}{{< include _setup.qmd >}}`r authors("Technical")`In the previous chapters, we demonstrated how to turn ML concepts and ML methods into code.So far, we have covered ML concepts without going into a lot of technical detail, which can be important for more advanced uses of mlr3.This includes the following topics:* Parallelization with the `r ref_pkg("future")` framework (@sec-parallelization),* how to handle errors and troubleshoot (@sec-error-handling),* adjust the logger to your needs (@sec-logging),* working with out-of-memory data, e.g., data stored in databases (@sec-backends), and* adding new classes to mlr3 (@sec-extending).## Parallelization {#sec-parallelization}The term `r index("parallelization")` refers to running multiple algorithms in parallel, i.e., executing them simultaneously on multiple CPU cores, CPUs, or computational nodes.Not all algorithms can be parallelized, but when they can, parallelization allows significant savings in computation time.In general, there are many possibilities to parallelize, depending on the hardware to run the computations: If you only have a single CPU with multiple cores, `r link("https://en.wikipedia.org/wiki/Thread_(computing)", "threads")` or `r link("https://en.wikipedia.org/wiki/Fork_(system_call)", "processes")` are ways to utilize all cores on a local machine.If you have multiple machines on the other hand, the machines need a way to communicate and exchange information, e.g. via protocols like network sockets or the `r link("https://en.wikipedia.org/wiki/Message_Passing_Interface", "Message Passing Interface (MPI)")`.Larger computational sites rely on a scheduler to orchestrate the computation for multiple users and offer a shared network file system all machines can access.Interacting with scheduling systems on compute clusters is covered in @sec-hpc-exec using the R package `r ref_pkg("batchtools")`.We do not want to delve too deep into such details here, but want to introduce some terminology which helps us to discuss parallelization on a more abstract level:* We call the hardware to parallelize on together with the respective interface provided by an R package the `r index("parallelization backend", aside = TRUE)`. As many parallelization backends have different APIs, we are using the `r ref_pkg("future")` package as an abstraction layer for many parallelization backends.`r mlr3` just interfaces `r ref_pkg("future")` while the user can control how the code is executed by configuring a backend prior to starting the computations.* The R session or process which orchestrates the computational work is called `r index("main", aside = TRUE)`, and it starts computational `r index("jobs", aside = TRUE)`.* The R sessions, processes, or machines which receive the jobs, do the calculation and then send back the result are called `r index("workers", aside = TRUE)`.An important step in parallel programming involves the identification of sections of the program flow which are both time-consuming (`r index("bottlenecks", aside = TRUE)`) and also can run independently of a different section.The key characteristic is that these sections do not depend on each other, i.e. section A can be ran without waiting for results from section B.Fortunately, these sections are comparably easy to spot for machine learning experiments:1. The training of a learning algorithm (or other computationally intensive parts of a machine learning pipeline, c.f. @sec-pipelines) *may* contain independent sections which can run in parallel, e.g. 
* A single decision tree iterates over all features to find the best split point, for each feature independently. * A random forest usually fits hundreds of trees independently. * Many feature filters work in a univariate fashion, i.e. calculate a numeric score for each feature independently. The key principle that makes parallelization for these examples (and in general in many fields of statistics and ML) is called `r define("data parallelism")`: The same operation is performed concurrently on different elements of the input data. Parallelization of learning algorithms is covered in @sec-parallel-learner.1. A resampling consists of independent repetitions of train-test-splits (@sec-parallel-resample).1. A benchmark consists of multiple independent resamplings (@sec-parallel-benchmark).1. Tuning (@sec-optimization) is repeated benchmarking, embedded in a sequential procedure which determines the hyperparameter configuration to try next. In addition to parallelization of the benchmark, some tuners propose multiple configurations to evaluate independently in each sequential step, which provides a second level for parallelization discussed in @sec-nested-resampling-parallelization.1. The predictions of a single learner for multiple observations is independent (@sec-parallel-predict).When computational problems are so easy to parallelize like the examples listed in (1)-(4), they are often referred to as `r index("embarrassingly parallel", aside = TRUE)`.Whenever you can put the heavy lifting into a function and call it with a map-like function like `lapply()`, you are facing an embarrassingly parallel problem.Such problems are straightforward to parallelize, e.g., in R with the `r ref_pkg("furrr")` package which provides parallel counterparts for popular sequential map-like functions from the `r ref_pkg("purrr")` package.However, it does not make practical sense to actually execute in parallel every operation that can be parallelized.Starting and terminating workers as well as possible communication between workers comes at a price in the form of additionally required runtime which is called `r index("parallelization overhead", aside = TRUE)`.The overhead strongly varies from parallelization backend to parallelization backend and must be carefully weighed against the runtime of the sequential execution to determine if parallelization is worth the effort.If the sequential execution is comparably fast, enabling parallelization often just introduces additional complexity for very little runtime savings or can even slow down the execution.Sometimes, it is possible to control the `r index("granularity", aside = TRUE)` of the parallelization to reduce the parallelization overhead.For example, if you want to parallelize a `for`-loop with 1000 iterations on 4 CPU cores, the overhead can be reduced by chunking the work of the 1000 jobs into 4 computational jobs performing 250 iterations each.So 4 bigger jobs are calculated instead of 1000 small ones.This effect is illustrated in the following code chunk using a socket cluster.Note that this parallel backend already comes with an option to control the chunk size (`chunk.size`), but for other backends you must chunk manually which is also demonstrated:```{r technical-001, eval = TRUE}# set up a socket cluster with 4 workers on the local machinelibrary(parallel)cores =4cl =makeCluster(cores)print(cl)# vector to operate onx =1:1000# fast function to parallelizef =function(x) sqrt(x +1)# unchunked approach: 1000 jobssystem.time({parSapply(cl, x, f, 
chunk.size = 1)})

# chunked approach: 4 jobs
system.time({parSapply(cl, x, f, chunk.size = 250)})

# manual chunking: 4 jobs
chunks = rep(1:cores, each = length(x) %/% cores)
jobs = split(x, chunks)
system.time({parLapply(cl, jobs,
  function(chunk, .fun) sapply(chunk, .fun),
  .fun = f, chunk.size = 1)})
```
Whenever you have the option to control the granularity by setting the chunk size, you should aim for at least as many jobs as workers, and the runtime of each worker should be at least several seconds.
This ensures that you can fully utilize the system and that the parallelization overhead stays reasonable.
If you have heterogeneous runtimes, also consider grouping jobs together so that the runtimes of the chunks become more homogeneous.
If there is a good estimate for the runtime, `batchtools::binpack()` (create an arbitrary number of chunks, each with a specified maximum combined runtime) and `batchtools::lpt()` (pack a specified number of chunks, each with arbitrary but homogeneous runtime) can prove useful - both are documented together with the `r ref("batchtools::chunk()")` helper.
For unknown runtimes, randomizing the order of jobs sometimes helps if there is a systematic relationship between the order of the jobs and their runtime.
This prevents, for example, the long jobs from all being executed at the end, which would lead to avoidable underutilization.
`r mlr3misc` ships with the functions `r ref("chunk()")` and `r ref("chunk_vector()")` to conveniently chunk vectors; both also shuffle the elements by default.
There are also options to control the chunk size for parallelization started from within mlr3 - these are described later in @sec-parallel-resample.
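To illustrate the helpers mentioned above, here is a minimal sketch of such a chunking step (the concrete grouping you get will differ because of the shuffling):

```{r, eval = FALSE}
library(mlr3misc)

# split 10 hypothetical jobs into 4 chunks of roughly equal size;
# elements are shuffled by default to break systematic runtime patterns
chunk_vector(1:10, n_chunks = 4)
```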
### Parallelization of Learners {#sec-parallel-learner}

The most atomic parts of `r mlr3` which can be parallelized are calls to external code, i.e., the execution of certain `r ref("PipeOp")` objects, `r ref("Filter")` objects or `r ref("Learner")` objects.
For these objects, `r mlr3` merely provides a unified interface to control the execution.
The parallelization itself is implemented by the respective package authors of the (external) algorithms `r mlr3` calls.
Most of these algorithms are parallelized via threading, e.g., the random forest implementation in `r ref_pkg("ranger")` or the boosting implemented in `r ref_pkg("xgboost")`.
For example, while fitting a single decision tree, each split that divides the data into two disjoint partitions requires a search for the best cut point on all $p$ features.
So instead of iterating over all features sequentially, the search can be broken down into $p$ threads, each searching for the best cut point on a single feature.
These threads can easily be parallelized by the scheduler of the operating system, as there is no need for communication between the threads.
After all threads have finished, the results are collected and merged before terminating the threads.
I.e., for our example of the decision tree, (1) the $p$ best cut points per feature are collected and then (2) aggregated to the single best cut point across all features by just iterating over the $p$ results sequentially.

:::{.callout-note}
Parallelization on GPUs is not covered in this book.
`mlr3` only distributes the fitting of multiple learners, e.g., during resampling, benchmarking, or tuning.
On this rather abstract level, GPU parallelization does not work efficiently.
However, some learning procedures can be compiled against CUDA/OpenCL to utilize the GPU while fitting a single model.
We refer to the respective documentation of the learner's implementation, e.g., `r link("https://xgboost.readthedocs.io/en/stable/gpu/")` for XGBoost.
:::

Threading is implemented in the compiled code of the package (e.g., in C or C++).
The R interpreter calls the external code and waits for the results to be returned - without noticing that the computations are executed in parallel.
Unfortunately, threading conflicts with certain parallel backends, causing the system to be overutilized in the best case, and hangs or segfaults in the worst case.
For this reason, we introduced the convention that threading parallelization is turned off by default.
Hyperparameters that control the number of threads are tagged with the label `"threads"`:

```{r technical-002}
library("mlr3learners") # for the ranger learner

# get the ranger learner
learner = lrn("classif.ranger")

# show all hyperparameters tagged with "threads"
learner$param_set$ids(tags = "threads")

# the number of threads is initialized to 1
learner$param_set$values$num.threads
```

To enable the parallelization for this learner, `r mlr3` provides the helper function `r ref("set_threads()")`:

```{r technical-003}
# use 4 CPUs
set_threads(learner, n = 4)

# auto-detect cores on the local machine
set_threads(learner)
```

In the last line, we did not set the number of threads, letting the package fall back to a heuristic to detect the correct number.
This heuristic is sometimes flaky, and utilizing all available cores is occasionally counterproductive as overburdening the system often has negative effects on the overall runtime.
The function which determines the number of CPUs for `r mlr3` is implemented in `r ref("parallelly::availableCores()")` and works well for many setups.
See `r link("https://www.jottr.org/2022/12/05/avoid-detectcores", "this blog post")` for some background information about the implemented heuristic.
However, there
are still some scenarios where it is better to reduce the number of utilized CPUs manually:* You want to simultaneously work on the same system, e.g., browse the web or watch a video.* You are on a multi-user system and want to spare some resources for other users.* You have a CPU with heterogeneous cores, for example, the energy-efficient "Icestorm" cores on a Mac M1 chip. These are comparably slower than the high-performance "Firestorm" cores and not well suited for heavy computations.* You have linked R to a threaded [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) implementation like [OpenBLAS](https://www.openblas.net/), and your learners make heavy use of linear algebra.You can manually set the number of CPUs to overrule the heuristic via option `"mc.cores"`:```{r technical-004, eval = FALSE}options(mc.cores =4)```We recommend setting this in your system's `.Rprofile` file, c.f. `r ref("Startup")`.There are some other approaches for parallelization of learners, e.g. by directly supporting one specific parallelization backend or a parallelization framework like `r ref_pkg("foreach")`.If this is supported, parallelization must be explicitly activated, e.g. by setting a hyperparameter.If you need to parallelize on the learner level because a single model fit takes too much time, and you only fit a few of these models, consult the documentation of the respective learner.In many scenarios it makes more sense to parallelize on a different level like resampling or benchmarking which is covered in the following subsections.### Parallelization of Resamplings {#sec-parallel-resample}In addition to parallel learners, most machine learning experiments include a very easy handle for parallelization: the resampling.By definition, resampling is performed to get an unbiased performance estimator by aggregating over **independent** repetitions of multiple train-test splits.`r mlr3` has "marked" this loop of independent iterations as parallelizable and uses an additional abstraction layer to support a broad range of parallel backends: the `r ref_pkg("future")` package.The loop is executed via the `r ref_pkg("future")` parallelization framework, using the parallel backend configured by the user via the `r ref("future::plan()")` function.In this section, we will use the spam task and a simple `lrn("classif.rpart")`.We use the `r ref("future::multisession")` plan (which internally uses socket clusters from the `parallel` package, see `r ref("parallel::makeCluster()")`) that should work on all operating systems.```{r technical-005}# query the currently active planfuture::plan()# define objects to perform a resamplingtask =tsk("spam")learner =lrn("classif.rpart")resampling =rsmp("cv", folds =3)# select the multisession backend to usefuture::plan("multisession")# run the resampling in parallel and measure runtimesystem.time({resample(task, learner, resampling)})```By default, all CPUs of your machine are used unless you specify the argument `workers` in `r ref("future::plan()")` (possible problems with the value returned by the heuristic have already been discussed in the previous @sec-parallel-learner).If you compare runtimes between the parallel backend and sequential execution (`plan("sequential")`) here, you should see a decrease in the reported elapsed time.However, in practice, you cannot expect the runtime to fall linearly as the number of cores increases (`r link("https://en.wikipedia.org/wiki/Amdahl%2527s_law", "Amdahl's law")`).In contrast to threads, the technical overhead 
for starting workers, communicating objects, sending back results, and shutting down the workers is quite large for the multisession backend.The multicore backend (`plan("multicore")`) comes with more overhead than threading, but considerably less overhead in comparison with the multisession backend.In fact, with the multicore backend, R objects are copied only when they are modified (copy-on-write), while with the multisession backend, objects are always copied to the respective session prior to any computation.The multicore backend has the major disadvantage that it is not supported on Windows systems - for this reason, we will stick with the multisession backend for all examples here.In general, it is advised to only consider parallelization for resamplings where each iteration runs at least a few seconds.Note that there are two mlr3 options to control the execution and granularity:* If `mlr3.exec_random` is set to `TRUE` (default), the order of jobs is randomized in resamplings and benchmarks. This can help if you run a benchmark or tuning with heterogeneous runtimes, e.g., to avoid that all the expensive learners get started last.* Option `mlr3.exec_chunk_size` can be used to control how many jobs are mapped to a single `future` and defaults to 1. The value of this option is passed to `r ref("future.apply::future_mapply()")` and `future.scheduling` is constantly set to `TRUE`.Tuning the chunk size can help in some rare cases to mitigate the parallelization overhead.For larger problems and longer runtimes, however, this plays a subordinate role.@fig-parallel-overview illustrates the parallelization from the above example. From left to right:1. The main process calls the `resample()` function.2. The computational task is split into 3 parts for the 3-fold cross-validation.3. The folds are passed to 3 workers, each fitting a model on the respective subset of the task and predicting on the left-out observations.4. The predictions (and trained models) are communicated back to main process which combines them into a `ResampleResult`.```{mermaid}%%| label: fig-parallel-overview%%| fig-cap: Parallelization of a resampling using a 3-fold cross-validationgraph LR M[fa:fa-server Main] S{"resample()"} C{ResampleResult} M --> S S -->|Fold 1| W1[fa:fa-microchip Worker 1] S -->|Fold 2| W2[fa:fa-microchip Worker 2] S -->|Fold 3| W3[fa:fa-microchip Worker 3] W1 -->|Prediction 1| C W2 -->|Prediction 2| C W3 -->|Prediction 3| C```### Parallelization of Benchmarks {#sec-parallel-benchmark}Benchmarks can be seen as a collection of multiple independent resamplings where a combination of a task, a learner, and a resampling strategy defines one resampling to perform.In pseudo-code, the calculation can be written down as```foreach combination of (task, learner, resampling strategy) { foreach resampling iteration { execute(resampling, j) }}```For parallelization, there are now two options:1. Parallelize over all resamplings, execute each resampling sequentially (parallelize outer loop).2. 
Iterate over all resamplings, execute each resampling in parallel (parallelize inner loop).If you are transitioning from `r ref_pkg("mlr")`, you might be used to selecting one of these parallelization levels before benchmarking.One major drawback of this approach becomes clear when both the outer and inner loop have fewer iterations than there are available workers, resulting in an underutilized system.In `r mlr3`, the choice of level is no longer required (except occasionally for nested resampling, briefly described in the following @sec-nested-resampling-parallelization).All experiments are rolled out on the same level, i.e., `benchmark()` iterates over the elements of the Cartesian product of the iterations of the outer and inner loops.Therefore, there is no need to decide whether you want to parallelize the tuning *or* the resampling, you always parallelize both.This approach makes the computation fine-grained and gives the `r ref_pkg("future")` backend the opportunity to group the jobs into chunks of suitable size (depending on the number of workers).Parallelization of benchmarks works analogously to resampling:```{r technical-006, echo = FALSE}# simple benchmark designdesign =benchmark_grid(tsks(c("sonar", "penguins")),lrns(c("classif.featureless", "classif.rpart")),rsmp("cv", folds =3))# enable parallelizationfuture::plan("multisession")# run benchmark in parallelbenchmark(design)```For larger benchmarks with a cumulative runtime of weeks, months or even years, see @sec-hpc-exec which covers parallelization on high-performance computing clusters.### Nested Resampling Parallelization {#sec-nested-resampling-parallelization}Like in benchmarking, [nested resampling](#nested-resampling) for tuning also translates into two nested resampling loops.But unlike benchmarking, the outer loop iterations are not necessarily independent of each other: depending on the result of the resampling in the first outer loop, different hyperparameters are suggested for the second iteration.Therefore, nested loops cannot be flattened, and the user instead has to choose which of the loops to parallelize.Let us consider the following example:You want to tune the `minsplit` argument of a classification tree using the `r ref("AutoTuner")` of `r mlr3tuning` (simplified version taken from @sec-model-tuning):```{r technical-007, echo = FALSE}library("mlr3tuning")learner =lrn("classif.rpart",minsplit =to_tune(2, 128, logscale =TRUE))at =auto_tuner(tuner =tnr("random_search"),learner = learner,resampling =rsmp("cv", folds =2), # inner CVmeasure =msr("classif.ce"),term_evals =20,)```To evaluate the performance on an independent test set, resampling is used:```{r technical-008}resample(task =tsk("penguins"),learner = at,resampling =rsmp("cv", folds =5) # outer CV)```Here, we have three opportunities to parallelize:1. the inner cross-validation of the auto tuner with 2 folds,2. the outer cross-validation of the resampling with 5 folds, and3. 
evaluating all configurations proposed by the random search in a single batch (parameter `batch_size` of `r ref("TunerRandomSearch")`, defaulting to 1).Because the third opportunity is not always applicable, especially for many advanced tuning algorithms which are only capable of proposing a single configuration in each iteration, we will here focus on the first two opportunities.Furthermore, we assume that we have a single CPU with four cores available.If we opt to parallelize the outer CV, all four cores would be utilized first with the computation of the first 4 resampling iterations.The computation of the fifth iteration has to wait.The resulting CPU utilization of the nested resampling example on 4 CPUs is visualized in two Figures:* @fig-parallel-outer as an example for parallelizing the outer 5-fold cross-validation.```{r technical-009, eval = FALSE}# Runs the outer loop in parallel and the inner loop sequentially future::plan(list("multisession", "sequential"))``` We assume that each fit during the inner resampling takes 4 seconds to compute and that there is no other significant overhead. First, each of the four workers starts with the computation of an inner 2-fold cross-validation. As there are more jobs than workers, the remaining fifth iteration of the outer resampling is queued on CPU1 **after** the first 4 iterations are finished after 8 secs. During the computation of the 5th outer resampling iteration, only CPU1 is utilized, the other 3 CPUs are idling. Note that just setting up the parallelization with a simple `future::plan("multisession")` has the same effect - the most outer loop is parallelized while all subsequent loops default to sequential execution.* @fig-parallel-inner as an example for parallelizing the inner 2-fold cross-validation.```{r technical-010, eval = FALSE}# Runs the outer loop sequentially and the inner loop in parallel future::plan(list("sequential", "multisession"))``` Here, the outer loop runs sequentially and distributes the 2 computations for the inner resampling on 2 CPUs. Meanwhile, CPU3 and CPU4 are idling.```{mermaid}%%| label: fig-parallel-outer%%| fig-cap: CPU utilization for4 CPUs while parallelizing the outer 5-fold cross-validation with a sequential 2-fold cross-validation inside. Jobs are labeled as [iteration outer]-[iteration inner].gantt title CPU Utilization dateFormat s axisFormat %S section CPU1 Iteration 1-1 :0, 4s Iteration 1-2 :4, 4s Iteration 5-1 :8, 4s Iteration 5-2 :12, 4s section CPU2 Iteration 2-1 :0, 4s Iteration 2-2 :4, 4s Idle :crit, 8, 8s section CPU3 Iteration 3-1 :0, 4s Iteration 3-2 :4, 4s Idle :crit, 8, 8s section CPU4 Iteration 4-1 :0, 4s Iteration 4-2 :4, 4s Idle :crit, 8, 8s``````{mermaid}%%| label: fig-parallel-inner%%| fig-cap: CPU utilization for4 CPUs while parallelizing the inner 2-fold cross-validation with a sequential 5-fold cross-validation outside. 
Jobs are labeled as [iteration outer]-[iteration inner].gantt title CPU Utilization dateFormat s axisFormat %S section CPU1 Iteration 1-1 :0, 4s Iteration 2-1 :4, 4s Iteration 3-1 :8, 4s Iteration 4-1 :12, 4s Iteration 5-1 :16, 4s section CPU2 Iteration 1-2 :0, 4s Iteration 2-2 :4, 4s Iteration 3-2 :8, 4s Iteration 4-2 :12, 4s Iteration 5-2 :16, 4s section CPU3 Idle :crit, 0, 20s section CPU4 Idle :crit, 0, 20s```Both possibilities for parallelization are not exploiting the full potential of the 4 CPUs.With parallelization of the outer loop, all results are computed after 16s, in contrast to parallelization of the inner loop where the results are only available after 20s.If possible, the number of iterations can be adapted to the available hardware. There is no law set in stone that you have to select, e.g., 10 folds in cross-validation.If you have 4 CPUs and a reasonable variance, 8 iterations are often sufficient, or you do 12 iterations because you get the last two iterations basically for free.Alternatively, you can also enable parallelization for both loops for nested parallelization, even on different parallelization backends.While nesting real parallelization backends is often unintended and causes unnecessary overhead, it is useful in some distributed computing setups.In this case, the number of workers must be manually tweaked so that the system does not get overburdened:```{r technical-011, eval = FALSE}# Runs both loops in parallelfuture::plan(list( future::tweak("multisession", workers =2), future::tweak("multisession", workers =4)))```This example would run on 8 cores (`= 2 * 4`) on the local machine, parallelizing the outer resampling on 2, and the inner resampling on 4 workers.The [vignette](https://cran.r-project.org/web/packages/future/vignettes/future-3-topologies.html) of the `r ref_pkg("future")` package gives more insight into nested parallelization.For more background information about parallelization during tuning, see Section 6.7 of @hpo_practical.:::{.callout-important}During tuning with `r mlr3tuning`, you can often adjust the **batch size** of the `r ref("Tuner")`, i.e., control how many hyperparameter configurations are evaluated in parallel.If you want full parallelization, make sure that the batch size multiplied by the number of (inner) resampling iterations is at least equal to the number of available workers.If you expect homogeneous runtimes, i.e., you are tuning over a single learner or linear pipeline, and you have no hyperparameter which is likely to influence the runtime, aim for a multiple of the number of workers.In general, larger batches allow for more parallelization, while smaller batches imply a more frequent evaluation of the termination criteria.We default to a `batch_size` of 1 that ensures that all `r ref("Terminator")`s work as intended, i.e., you cannot exceed the computational budget.:::Heterogeneous runtimes add an extra layer of complexity to parallelization.This occurs frequently, especially in tuning, when a hyperparameter strongly influences the runtime of the learning procedure.Examples are the number of trees for random forests or the number of regularization values to be tested in penalized regression.How efficient the parallelization turns out depends in particular on the scheduling strategy of the backend.After the first batch of jobs is sent to the worker, the next jobs are either started(a) as soon as all results have been collectively reported back to the main process, or(b) as soon as the first job reports back.Method (a) 
usually comes with less synchronization overhead and is best suited for short jobs with homogeneous runtimes.
Method (b) is faster if the runtimes are heterogeneous, especially if the parallelization overhead is negligible in comparison with the runtime of the computation.
E.g., for `r ref("parallel::mclapply()")`, the behavior of the scheduler can be controlled with the `mc.preschedule` option; `r ref("parallel::parSapply()")` implements method (a), while `r ref("parallel::parSapplyLB()")` implements load balancing according to method (b).
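The difference can be seen in a small sketch with artificial runtimes (the numbers are made up purely for illustration):

```{r, eval = FALSE}
library(parallel)
cl = makeCluster(2)

# hypothetical jobs with very unequal runtimes (in seconds)
runtimes = c(2, 2, 2, 2, 0.1, 0.1, 0.1, 0.1)

# method (a): static scheduling, jobs are split into equal-sized blocks up front
system.time(parSapply(cl, runtimes, Sys.sleep))

# method (b): load balancing, an idle worker immediately fetches the next job
system.time(parSapplyLB(cl, runtimes, Sys.sleep))

stopCluster(cl)
```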
### Parallelization of Predictions {#sec-parallel-predict}

Finally, the predictions of a single learner can also be parallelized, as the predictions for different observations are independent.
For most learners, training is the bottleneck and parallelizing the prediction is not a worthwhile endeavor, but of course there are exceptions.
Technically, the test data is first split into multiple groups and the predict-method of the learner is applied to each group in parallel, using the active backend configured via `r ref("future::plan()")`.
The resulting predictions are then combined internally in a second step.
However, to avoid predicting in parallel accidentally, parallel predictions must first be enabled in the learner via the `parallel_predict` field:

```{r technical-012}
# train random forest on spam task
task = tsk("spam")
learner = lrn("classif.ranger")
learner$train(task)

# set up parallel predict on 4 workers
future::plan("multisession", workers = 4)
learner$parallel_predict = TRUE

# perform prediction
prediction = learner$predict(task)
```

The resulting `r ref("Prediction")` is identical to the one computed sequentially.

### Reproducibility

Usually, reproducibility is a major concern during parallelization because special parallel `r link("https://en.wikipedia.org/wiki/Pseudorandom_number_generator", "pseudo-random number generators (PRNGs)")` are required (see `r link("https://www.jottr.org/2020/09/22/push-for-statistical-sound-rng/", "this blog post")`).
A simple `r ref("set.seed()")` is not sufficient when parallelization is involved.
One general recommendation here is switching from the default `r link("https://en.wikipedia.org/wiki/Mersenne_Twister", "Mersenne Twister")` to Pierre L'Ecuyer's RngStreams (see `r ref("Random")` and `r ref("parallel::RNGstreams")`).
However, even this parallel PRNG comes with pitfalls w.r.t. reproducibility: if you change the number of workers, you still get different results after setting the seed with `r ref("set.seed()")`.
Luckily, this and many other problems are already addressed by the excellent `r ref_pkg("future")` parallelization framework which `r mlr3` uses under the hood.
`r ref_pkg("future")` ensures that all workers receive exactly the same PRNG streams, independent of the number of workers.
Although correct seeding alone does not guarantee full reproducibility, it is one problem less to worry about.
You can find more details about the used PRNG in `r link("https://www.jottr.org/2020/09/22/push-for-statistical-sound-rng/", "this blog post")`.
The issue of reproducibility is very complex, and complete reproducibility is very difficult to achieve - it ultimately also depends on the computational accuracy of the hardware, the processor instructions used, compiler versions and optimization flags, or the `r index("BLAS")` library for linear algebra R links to.
But since at least parallel PRNGs are not a problem, you should get the same or at least very similar results with a simple `set.seed()` before you run the experiments.
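As a small sketch of what this means in practice (assuming the multisession backend from above), seeding the main session before two identical parallel resamplings should yield the same aggregated score:

```{r, eval = FALSE}
future::plan("multisession", workers = 2)

set.seed(1)
rr1 = resample(tsk("penguins"), lrn("classif.rpart"), rsmp("cv", folds = 3))

set.seed(1)
rr2 = resample(tsk("penguins"), lrn("classif.rpart"), rsmp("cv", folds = 3))

# both runs should produce the same aggregated performance
rr1$aggregate() == rr2$aggregate()
```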
## Error Handling {#sec-error-handling}

In large ML experiments, it is not uncommon that a model fit or prediction fails with an error.
This is because the algorithms have to process arbitrary data, and not all eventualities can always be handled.
While we try to identify obvious problems before execution, e.g., when missing values occur but a learner cannot handle them, other problems are far more complex to detect.
Examples include correlations or collinearity that make model fitting impossible, outliers that lead to numerical problems, or new levels of categorical variables appearing in the predict step.
The learners behave quite differently when encountering such problems: some models signal a warning during the train step that they failed to fit but return a baseline model, while other models stop the execution.
During prediction, some learners just refuse to predict the response for observations they cannot handle, while others predict a missing value.
How to deal with these problems even in more complex setups like benchmarking or tuning is the topic of this section.

For illustration (and internal testing) of error handling, `r mlr3` ships with the learners `"classif.debug"` and `"regr.debug"`.
Here, we use the debug learner for classification to demonstrate the error handling:

```{r technical-013}
task = tsk("penguins")
learner = lrn("classif.debug")
print(learner)
```

This learner comes with special parameters that let us simulate problems frequently encountered in ML.
E.g., the debug learner comes with hyperparameters to control

1. what conditions should be signaled (message, warning, error, segfault) with what probability,
1. during which stage the conditions should be signaled (train or predict), and
1. the ratio of predictions being `NA` (`predict_missing`).

For a detailed description of all hyperparameters, see the manual page of `r ref("mlr_learners_classif.debug")`.

```{r technical-014}
learner$param_set
```

With the learner's default settings, the learner will do nothing special: it remembers a random label and constantly predicts this label:

```{r technical-015}
task = tsk("penguins")
learner$train(task)$predict(task)$confusion
```

We now set a hyperparameter to let the debug learner signal an error during the train step.
By default, `r mlr3` does not catch conditions such as warnings or errors raised while calling learners:

```{r technical-016, error = TRUE}
# set probability to signal an error to 1
learner$param_set$values = list(error_train = 1)

learner$train(tsk("penguins"))
```

If this were a regular learner, we could now start debugging with `r ref("traceback()")` (or create a `r link("https://stackoverflow.com/help/minimal-reproducible-example", "Minimal Reproducible Example (MRE)")` to track the problem down or file a bug report upstream).
However, due to the nature of the problem, it is likely that the cause of the error cannot be fixed - so you have to learn how to deal with such errors.

:::{.callout-note}
If you start debugging, make sure you have disabled parallelization to avoid the various pitfalls related to it.
It may also be helpful to set the option `mlr3.debug` to `TRUE`.
If this flag is set, `r mlr3` does not call into the `r ref_pkg("future")` package, resulting in an easier-to-interpret program flow and `traceback()`.
:::
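For reference, a minimal sketch of the settings mentioned in the note above (remember to revert them once you are done debugging):

```{r, eval = FALSE}
# bypass future so that traceback() stays easy to read
options(mlr3.debug = TRUE)

# run everything sequentially
future::plan("sequential")

# optionally, get more verbose output while investigating (see @sec-logging)
lgr::get_logger("mlr3")$set_threshold("debug")
```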
### Encapsulation {#sec-encapsulation}

Since ML algorithms are confronted with arbitrary, often messy data, errors are not uncommon here, and we often just need to move on during benchmarking or tuning.
Thus, we need a mechanism to

1. capture all signaled conditions such as messages, warnings and errors so that we can analyze them post-hoc (called `r index("encapsulation")`, covered in this section),
1. deal with algorithms which do not terminate in a reasonable time, and
1. proceed in a statistically sound way while being able to aggregate over partial results (next @sec-fallback).

Encapsulation ensures that signaled conditions (such as messages, warnings and errors) are intercepted: all conditions raised during the training or predict step are logged into the learner, and errors do not interrupt the program flow.
I.e., the execution of the calling function or package (here: `r mlr3`) continues as if there had been no error, though the result (fitted model during `train()`, predictions during `predict()`) is missing.
Each `r ref("Learner")` has a field `encapsulate` to control how the train or predict steps are wrapped.
The easiest way to encapsulate the execution is provided by the package `r ref_pkg("evaluate")`, which evaluates R expressions while tracking conditions such as outputs, messages, warnings or errors (see the documentation of the `r ref("encapsulate()")` helper function for more details):

```{r technical-017}
task = tsk("penguins")
learner = lrn("classif.debug")

# this learner throws a warning and then stops with an error during train()
learner$param_set$values = list(warning_train = 1, error_train = 1)

# enable encapsulation for train() and predict()
learner$encapsulate = c(train = "evaluate", predict = "evaluate")

learner$train(task)
```

After training the learner, one can access the recorded log via the fields `log`, `warnings` and `errors`:

```{r technical-018}
learner$log
learner$warnings
learner$errors
```

Another method for encapsulation is implemented in the `r ref_pkg("callr")` package.
In contrast to `r ref_pkg("evaluate")`, the computation is carried out in a separate R process.
This guards the calling session against segmentation faults which otherwise would tear down the complete main R session.
On the downside, starting new processes comes with comparably more computational overhead.

```{r technical-019}
learner$encapsulate = c(train = "callr", predict = "callr")
learner$param_set$values = list(segfault_train = 1)
learner$train(task = task)
learner$errors
```

With either of these encapsulation methods, we can now catch errors and analyze the messages, warnings and errors post-hoc.
Additionally, a timeout can be set so that learners do not run for an indefinite time but are terminated after a specified time.
Interrupting learners works differently depending on the encapsulation (see `r ref("mlr3misc::encapsulate()")`); viewed from the outside, a learner that hits the timeout behaves as if it had signaled an error.
The timeout can be set separately for training and prediction and must be provided in seconds:

```{r technical-020}
# 5 minute timeout for training, no timeout for predict
learner$timeout = c(train = 5 * 60, predict = Inf)
```

Unfortunately, catching errors and ensuring an upper time limit is only half the battle.
Without a model, it is not possible to get predictions:

```{r technical-021, error = TRUE}
learner$predict(task)
```

To handle the missing predictions gracefully during `r ref("resample()")`, `r ref("benchmark()")` or tuning, fallback learners are introduced next.

### Fallback learners {#sec-fallback}

`r index("Fallback learners")` have the purpose of being able to score results in cases where a `r ref("Learner")` completely failed to fit a model or refuses to provide predictions for some or all observations.
We will first handle the case that a learner fails to fit a model during training, e.g., if some convergence criterion is not met or the learner ran out of memory.
There are in general three possibilities to proceed:

1. Ignore iterations with failed model fits.
Although this is arguably the most frequent approach in practice, it is **not** statistically sound. For example, consider the case where a researcher wants a specific learner to look better in a benchmark study. To do this, the researcher takes an existing learner but introduces a small adaptation: If an internal goodness-of-fit measure is not achieved, an error is thrown. In other words, the learner only fits a model if the model can be reasonably well learned on the given training data. In comparison with the learning procedure without this adaptation and a good threshold, however, we now compare the mean over only the "easy" splits with the mean over all splits - an unfair advantage.2. Penalize failing learners. Instead of ignoring failed iterations, we can simply impute the worst possible score (as defined by the `r ref("Measure")`) and thereby heavily penalize the learner for failing. However, this often seems too harsh for many problems, and for some measures there is no reasonable value to impute.3. Impute a value that corresponds to a (weak) baseline. Instead of imputing with the worst possible score, impute with a reasonable baseline, e.g., by just predicting the majority class or the mean of the response in the training data. Such simple baselines are implemented as featureless learners (`r ref("mlr_learners_classif.featureless")` or `r ref("mlr_learners_regr.featureless")`). Note that a reasonable baseline value is different in different training splits. Retrieving these values after a larger benchmark study has been conducted is possible, but tedious.We strongly recommend option (3): it is statistically sound and very flexible.To make this procedure very convenient during resampling and benchmarking, we support fitting a proper baseline with a fallback learner.In the next example, in addition to the debug learner, we attach a simple featureless learner to the debug learner.So whenever the debug learner fails (which is every single time with the given parametrization) and encapsulation is enabled, `r mlr3` falls back to the predictions of the featureless learner internally:```{r technical-022}task =tsk("penguins")learner =lrn("classif.debug")learner$param_set$values =list(error_train =1)learner$fallback =lrn("classif.featureless")learner$train(task)learner```Note that encapsulation is not enabled explicitly; it is automatically set to `"evaluate"` for the training and the predict step while setting a fallback learner for a learner without encapsulation enabled.Furthermore, the log contains the captured error (which is also included in the print output), and although no model is stored, we can still get predictions:```{r technical-023}learner$modelprediction = learner$predict(task)prediction$score()```In this stepwise train-predict procedure, the fallback learner is of limited use.However, it is invaluable for larger benchmark studies.In the following snippet, we compare the previously created debug learner with a simple classification tree.We re-parametrize the debug learner to fail in roughly 30% of the resampling iterations during the training step:```{r technical-024}learner$param_set$values =list(error_train =0.3)bmr =benchmark(benchmark_grid(tsk("penguins"), list(learner, lrn("classif.rpart")), rsmp("cv")))aggr = bmr$aggregate(conditions =TRUE)aggr[, .(learner_id, warnings, errors, classif.ce)]```Even though the debug learner occasionally failed to provide predictions, we still got a statistically sound aggregated performance value which we can compare to the aggregated 
performance of the classification tree.
It is also possible to split the benchmark up into separate `ResampleResult` objects, which sometimes helps to get more context.
E.g., if we only want to have a closer look into the debug learner, we can extract the errors from the corresponding resample results:

```{r technical-025}
rr = aggr[learner_id == "classif.debug"]$resample_result[[1L]]
rr$errors
```

A problem similar to failed model fits emerges when a learner predicts only a subset of the observations in the test set (and predicts `NA` or no value for the others).
A typical case is, e.g., when new and unseen factor levels are encountered in the test data.
Imagine again that our goal is to benchmark two algorithms using cross-validation on some binary classification task:

* Algorithm A is an ordinary logistic regression.
* Algorithm B is also an ordinary logistic regression, but with a twist: If the logistic regression is rather certain about the predicted label (> 90% probability), it returns the label and returns a missing value otherwise.

At its core, this is the same problem as outlined before.
If we measure the performance using only the non-missing predictions, Algorithm B would likely outperform Algorithm A.
However, this approach does not factor in that you cannot generate predictions for all observations.
Long story short, if a fallback learner is specified, missing predictions of the base learner will be automatically replaced with predictions from the fallback learner.
This is illustrated in the following example:

```{r technical-026}
task = tsk("penguins")
learner = lrn("classif.debug")

# this hyperparameter sets the ratio of missing predictions
learner$param_set$values = list(predict_missing = 0.5)

# without fallback
p = learner$train(task)$predict(task)
table(p$response, useNA = "always")

# with fallback
learner$fallback = lrn("classif.featureless")
p = learner$train(task)$predict(task)
table(p$response, useNA = "always")
```

To sum up, by combining encapsulation and fallback learners, it is possible to benchmark even quite unreliable or unstable learning algorithms in a convenient and statistically sound fashion.
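To wrap up this section, here is a sketch of how these building blocks are typically combined for a learner that should survive a large, unattended benchmark (the concrete timeout is an arbitrary choice for illustration):

```{r, eval = FALSE}
learner = lrn("classif.rpart")

# record conditions instead of stopping the program flow
learner$encapsulate = c(train = "evaluate", predict = "evaluate")

# terminate runs that take longer than 10 minutes
learner$timeout = c(train = 600, predict = 600)

# impute predictions of a simple baseline whenever the fit or prediction fails
learner$fallback = lrn("classif.featureless")
```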
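Beyond the dimensions, the backend's unified retrieval API can be used directly; a short sketch using the objects created above (column names as in the penguins task):

```{r, eval = FALSE}
# peek at the first rows, regardless of how the data is stored
backend$head(n = 3)

# retrieve a subset by row ids and column names
backend$data(rows = 1:3, cols = c("species", "island"))
```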
technical-039}tbl =mutate(tbl, carrier =case_when( carrier %in%c("OO", "HA", "YV", "F9", "AS", "FL", "VX", "WN") ~"other",TRUE~ carrier))```Next, the processed table is used to create a `r ref("mlr3db::DataBackendDplyr")` from `r mlr3db`:```{r technical-040}library("mlr3db")b =as_data_backend(tbl, primary_key ="row_id")```We can now use the interface of `r ref("DataBackend")` to query some basic information about the data:```{r technical-041}b$nrowb$ncolb$head()```Note that the `r ref("DataBackendDplyr")` does not know about any rows or columns we have filtered out with `r ref_pkg("dplyr")` before, it just operates on the view we provided.As we now have constructed a backend, we can switch over to `r mlr3` for model fitting on a task based on the previously created `r ref("mlr3db::DataBackendDplyr")`:```{r technical-042}task =as_task_regr(b, id ="flights_sqlite", target ="arr_delay")learner =lrn("regr.rpart")resampling =rsmp("subsampling", ratio =0.02, repeats =3)```We pass all these objects to `r ref("resample()")` to perform a subsampling on 2% of the observations three times.In each iteration, only the required subset of the data is queried from the SQLite database and passed to `r ref("rpart::rpart()")`:```{r technical-043}rr =resample(task, learner, resampling)print(rr)measures =msrs(c("regr.mse", "time_train", "time_predict"))rr$aggregate(measures)```Note that we still have an active connection to the database.To properly close it, we remove the `tbl` object referencing the connection and then close the connection.```{r technical-044}rm(tbl)DBI::dbDisconnect(con)```### Parquet Files with DataBackendDuckDBWe have already demonstrated how to operate on a SQLite database.`r index("DuckDB")` databases (using `r ref("DataBackendDuckDB")`) provide a modern alternative to SQLite, tailored to the needs of machine learning.To convert a `data.frame` to DuckDB, we provide the helper function `r ref("as_duckdb_backend()")`.Only two arguments are required: the `data.frame` to convert, and a `path` to store the data.While this is useful while working with many tasks simultaneously in order to keep the memory requirements reasonable, the more frequent use case for DuckDB are nowadays [`r index("Parquet")` files](https://en.wikipedia.org/wiki/Apache_Parquet).Parquet is a popular column-oriented data storage format supporting efficient compression, making it far superior to other popular data exchange formats such as CSV.To demonstrate working with Parquet files, we first query the location of an example dataset shipped with `r ref_pkg("mlr3db")`:```{r technical-045}path =system.file(file.path("extdata", "spam.parquet"), package ="mlr3db")```We can then create a `r ref("DataBackendDuckDB")` based on this file and convert the backend to a classification task, all without loading the dataset into memory:```{r technical-046}backend =as_duckdb_backend(path)task =as_task_classif(backend, target ="type")print(task)```Accessing the data internally triggers a query and data is fetched to be stored in an in-memory `data.frame`, but only the required subsets.After the retrieved data is processed, the garbage collector can release the occupied memory.The backend can also operate on a folder with multiple parquet files, which is documented in `r ref("as_duckdb_backend()")`.## Extending mlr3 {#sec-extending}Hopefully having read the rest of this book you are now on the way to being an `r mlr3` expert.Maybe you will even want to extend the universe with new classes for more learners, measures, tuners, pipeops, and 
more; if so, read on.{{< include _optional.qmd >}}In this chapter we will show how to extend `r mlr3` using the simple example of creating a custom `r ref("Measure")`.If you are interested in implementing new learners, pipeops, and tuners, then check out the vignettes in the respective packages: `r mlr3extralearners`, `r mlr3pipelines`, or `r mlr3tuning`.Or if you are considering adding a new machine learning task then please contact us on GitHub, email, or Mattermost.This section assumes good knowledge of `R6`, see @sec-r6 for a brief introduction and references to further resources.We welcome contributions from all levels of developers and if you want to add any of your new classes to our universe then please make pull requests to aligning packages, for example tuners and filters would go to `r mlr3tuning` and `r mlr3filters` respectively, a new survival measure would go to `r mlr3proba`, and *all* new learners go to `r mlr3extralearners`.Do not worry if you make a PR to the wrong repository, we will transfer it to the right one.Please read the `r link("https://github.com/mlr-org/mlr3/wiki", "mlr3 Wiki")` for coding conventions that we use if you want to add code to our organisation.We will now turn to extending the `r ref("Measure")` class to implement new metrics.As an example, let us consider a regression measure that scores a prediction as $1$ if the difference between the true and predicted values are less than one standard deviation of the truth, or scores the prediction as $0$ otherwise.In maths this would be defined as $f(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}(|y_i - \hat{y}_i| < \sigma(y))$, where $y$ contains the true values and $\hat{y}$ the predicted values for observations $i = 1, ..., n$.The function $\mathbb{I(C)}$ denotes the `r link("https://en.wikipedia.org/wiki/Indicator_function", "indicator function")` with $\mathbb{I(C)} = 1$ if condition $C$ is true and $\mathbb{I(C)} = 0$ otherwise.In code the measure may be written as:```{r technical-047}threshold_acc =function(truth, response) {mean(ifelse(abs(truth - response) <sd(truth), 1, 0))}threshold_acc(c(100, 0, 1), c(1, 11, 6))```This measure is then bounded in $[0, 1]$ and a larger score is better.To use this measure in mlr3, we need to create a new `r ref("R6::R6Class")`, which will inherit from `r ref("Measure")` and in this case specifically inheriting from `r ref("MeasureRegr")`.We will now demonstrate what the final code for this new measure would look like and then explain each line, this can be used as a template for most performance measures.```{r technical-048}MeasureRegrThresholdAcc = R6::R6Class("MeasureRegrThresholdAcc",inherit = mlr3::MeasureRegr, # regression measurepublic =list(initialize =function() { # initialize class super$initialize(id ="thresh_acc", # unique IDpackages =character(), # no dependenciesproperties =character(), # no special propertiespredict_type ="response", # measures response predictionrange =c(0, 1), # results in values between (0, 1)minimize =FALSE# larger values are better ) } ),private =list(.score =function(prediction, ...) { # define score as private method# define loss threshold_acc =function(truth, response) {mean(ifelse(abs(truth - response) <sd(truth), 1, 0)) }# call loss functionthreshold_acc(prediction$truth, prediction$response) } ))```1. In the first two lines we name the class, here `MeasureRegrThresholdAcc`, and then state this is a regression measure that inherits from `r ref("MeasureRegr")`.2. 
We initialize the class by stating that its unique ID is `"thresh_acc"`, that it does not require any external packages (`packages = character()`) and that it has no special properties (`properties = character()`).
3. We then pass specific details of the loss function, which are: it measures the quality of a `"response"` type prediction, its values range between 0 and 1, and larger values are better (`minimize = FALSE`).
4. Finally, we define the score itself as a private method called `.score` and simply pass the predictions to the function we defined earlier.

Sometimes measures require data from the training set, the task, or the learner.
These are usually more complex edge cases, so we will not go into detail here; for working examples we suggest looking at the code for `r ref("mlr3proba::MeasureSurvSongAUC")` and `r ref("mlr3proba::MeasureSurvAUC")`.
You can also consult the manual page of `r ref("Measure")` for an overview of other properties and meta-data that can be specified.

Once you have defined your measure, you can either use it with the `R6` constructor, or by adding it to the `r ref("mlr_measures")` dictionary:

```{r technical-049}
library(mlr3verse)

task = tsk("mtcars")
split = partition(task)
learner = lrn("regr.featureless")$train(task, split$train)
prediction = learner$predict(task, split$test)
prediction$score(MeasureRegrThresholdAcc$new())

# or add to dictionary
mlr3::mlr_measures$add("regr.thresh_acc", MeasureRegrThresholdAcc)
prediction$score(msr("regr.thresh_acc"))
```

Even though we only showed how to create a custom measure, the process of adding other objects is in essence the same:

1. Find the right class to inherit from.
2. Add methods that:
   a) Initialize the object with the correct properties (`$initialize()`).
   b) Implement the public and private methods that do the actual computation.
      In the above example this was the private `$.score()` method.
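As a quick sanity check (a sketch, assuming the dictionary entry `"regr.thresh_acc"` registered above), the new measure can be used like any built-in measure, e.g., to aggregate a resampling:

```{r, eval = FALSE}
rr = resample(tsk("mtcars"), lrn("regr.rpart"), rsmp("cv", folds = 3))
rr$aggregate(msr("regr.thresh_acc"))
```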
As a lot of classes already exist in `r mlr3`, we recommend looking at similar classes that are already available.

## Conclusion

This chapter describes many advanced topics that are not relevant to each and every user and cannot be covered in full detail here.
That is, we covered parallelization, error handling, logging, and working with databases - all topics that might not be needed if you work on rather small and clean datasets.
Additionally, we demonstrated how to extend mlr3 with additional measures.
The sections presented in this chapter deal with problems that many users will encounter with some regularity.
Therefore, at least superficial knowledge is necessary to be able to identify a problem specifically and to return to the correct section or, for more background information, to the following resources.

### Resources {.unnumbered .unlisted}

- @Schmidberger2009 and @Eddelbuettel2020 give a more systematic and in-depth overview about the possibilities to parallelize with R.
- The `r link("https://cran.r-project.org/web/packages/lgr/vignettes/lgr.html", text = "vignette")` of the `r ref_pkg("lgr")` package demonstrates advanced logging capabilities, e.g., logging to JSON files or retrieving logged objects for debugging.
- Extending and customizing objects is covered in the documentation of the respective packages:
  - for learners see the vignette of `r ref_pkg("mlr3learners")`
  - for pipe operators see the vignette of `r ref_pkg("mlr3pipelines")`
- For an overview of available DBMS in R, see the `r link("https://cran.r-project.org/view=Databases", text = "CRAN task view on databases")`, and in particular the vignettes of the `r ref_pkg("dbplyr")` package for DBMS readily available in mlr3.
  For working directly with a SQL database, we recommend a general purpose `r link("https://www.w3schools.com/sql/", text = "SQL Tutorial")`.

## Exercises

### Parallelization {.unnumbered .unlisted}

Consider the following example where you resample a learner (debug learner, sleeps for 3 seconds during train) on 4 workers using the multisession backend:

```{r technical-050, eval = FALSE}
task = tsk("penguins")
learner = lrn("classif.debug", sleep_train = function() 3)
resampling = rsmp("cv", folds = 6)

future::plan("multisession", workers = 4)
resample(task, learner, resampling)
```

* Assuming that the learner would actually calculate something and not just sleep: Would all CPUs be busy?
* Prove your point by measuring the elapsed time, e.g., using `r ref("system.time()")`.
* What would you change in the setup and why?

### Custom Measures {.unnumbered .unlisted}

Create a new custom classification measure which scores predictions using the mean over the following classification costs:

1. If the learner predicted label "A" and the truth is "A", assign score 0
1. If the learner predicted label "B" and the truth is "B", assign score 0
1. If the learner predicted label "A" and the truth is "B", assign score 1
1. If the learner predicted label "B" and the truth is "A", assign score 10

Hint: You can implement it yourself as demonstrated in @sec-extending or use the measure `r ref("mlr_measures_classif.costs")`.