9  Technical

Michel Lang
Research Center Trustworthy Data Science and Security, and TU Dortmund University

In the previous chapters, we demonstrated how to turn ML concepts and methods into code. So far, we have covered these concepts without going into much technical detail, although such detail can become important for more advanced uses of mlr3. This chapter covers several of these technical topics: parallelization, error handling, logging, and data backends.

9.1 Parallelization

The term parallelization refers to running multiple algorithms in parallel, i.e., executing them simultaneously on multiple CPU cores, CPUs, or computational nodes. Not all algorithms can be parallelized, but when they can, parallelization allows significant savings in computation time.

In general, there are many possibilities to parallelize, depending on the hardware available to run the computations: If you only have a single CPU with multiple cores, threads1 or processes2 are ways to utilize all cores on a local machine. If, on the other hand, you have multiple machines, they need a way to communicate and exchange information, e.g., via protocols like network sockets or the Message Passing Interface (MPI)3. Larger computational sites rely on a scheduler to orchestrate the computation for multiple users and offer a shared network file system all machines can access. Interacting with scheduling systems on compute clusters is covered in Section 10.2 using the R package batchtools. We do not want to delve too deep into such details here, but want to introduce some terminology which helps us to discuss parallelization on a more abstract level:

  • We call the hardware to parallelize on, together with the respective interface provided by an R package, the parallelization backend. As parallelization backends have different APIs, we use the future package as an abstraction layer for them. mlr3 just interfaces future, while the user controls how the code is executed by configuring a backend prior to starting the computations.
  • The R session or process which orchestrates the computational work is called main, and it starts computational jobs.
  • The R sessions, processes, or machines which receive the jobs, do the calculation and then send back the result are called workers.

An important step in parallel programming is the identification of sections of the program flow which are both time-consuming (bottlenecks) and able to run independently of each other. The key characteristic is that these sections do not depend on each other, i.e., section A can be run without waiting for results from section B. Fortunately, such sections are comparably easy to spot for machine learning experiments:

  1. The training of a learning algorithm (or other computationally intensive parts of a machine learning pipeline, c.f. Chapter 6) may contain independent sections which can run in parallel, e.g.
    • A single decision tree iterates over all features to find the best split point, for each feature independently.
    • A random forest usually fits hundreds of trees independently.
    • Many feature filters work in a univariate fashion, i.e. calculate a numeric score for each feature independently.
    The key principle that makes parallelization possible for these examples (and in general in many fields of statistics and ML) is called data parallelism: the same operation is performed concurrently on different elements of the input data. Parallelization of learning algorithms is covered in Section 9.1.1.
  2. A resampling consists of independent repetitions of train-test-splits (Section 9.1.2).
  3. A benchmark consists of multiple independent resamplings (Section 9.1.3).
  4. Tuning (Chapter 4) is repeated benchmarking, embedded in a sequential procedure which determines the hyperparameter configuration to try next. In addition to parallelization of the benchmark, some tuners propose multiple configurations to evaluate independently in each sequential step, which provides a second level for parallelization discussed in Section 9.1.4.
  5. The predictions of a single learner for multiple observations is independent (Section 9.1.5).

When computational problems are as easy to parallelize as the examples listed above, they are often referred to as embarrassingly parallel. Whenever you can put the heavy lifting into a function and call it with a map-like function such as lapply(), you are facing an embarrassingly parallel problem. Such problems are straightforward to parallelize, e.g., in R with the furrr package which provides parallel counterparts for popular sequential map-like functions from the purrr package.
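
For illustration, here is a minimal sketch of such a map-style parallelization with furrr (the worker count and the toy function are arbitrary choices for this example):

# parallel counterpart of purrr::map_dbl()
library(furrr)
future::plan("multisession", workers = 2)
future_map_dbl(1:8, function(i) sqrt(i + 1))
future::plan("sequential") # reset to sequential execution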


However, it does not make practical sense to execute every operation that can be parallelized in parallel. Starting and terminating workers, as well as possible communication between workers, comes at a price in the form of additionally required runtime, the so-called parallelization overhead. This overhead varies strongly between parallelization backends and must be carefully weighed against the runtime of the sequential execution to determine whether parallelization is worth the effort. If the sequential execution is comparably fast, enabling parallelization often just introduces additional complexity for very little runtime savings, or can even slow down the execution.


Sometimes, it is possible to control the granularity of the parallelization to reduce the parallelization overhead. For example, if you want to parallelize a for-loop with 1000 iterations on 4 CPU cores, the overhead can be reduced by chunking the 1000 iterations into 4 computational jobs performing 250 iterations each. So 4 bigger jobs are computed instead of 1000 small ones.


This effect is illustrated in the following code chunk using a socket cluster. Note that this parallel backend already comes with an option to control the chunk size (chunk.size), but for other backends you must chunk manually which is also demonstrated:

# set up a socket cluster with 4 workers on the local machine
library(parallel)
cores = 4
cl = makeCluster(cores)
print(cl)
socket cluster with 4 nodes on host 'localhost'
# vector to operate on
x = 1:1000

# fast function to parallelize
f = function(x) sqrt(x + 1)


# unchunked approach: 1000 jobs
system.time({
  parSapply(cl, x, f, chunk.size = 1)
})
   user  system elapsed 
  0.166   0.011  11.013 
# chunked approach: 4 jobs
system.time({
  parSapply(cl, x, f, chunk.size = 250)
})
   user  system elapsed 
  0.002   0.000   0.129 
# manual chunking: 4 jobs
chunks = rep(1:cores, each = length(x) %/% cores)
jobs = split(x, chunks)
system.time({
  parLapply(cl, jobs, function(chunk, .fun) sapply(chunk, .fun),
    .fun = f, chunk.size = 1)
})
   user  system elapsed 
  0.001   0.000   0.044 

Whenever you have the option to control the granularity by setting the chunk size, you should aim for at least as many jobs as workers, and the runtime of each chunk should be at least several seconds. This ensures that you can fully utilize the system while the parallelization overhead stays reasonable. If you have heterogeneous runtimes, also consider grouping jobs together so that the runtimes of the chunks become homogeneous. If there is a good estimate of the runtime, batchtools::binpack() (create an arbitrary number of chunks, each with a specified maximum combined runtime) and batchtools::lpt() (pack a specified number of chunks, each with arbitrary but homogeneous runtime) can prove useful - both are documented together with the batchtools::chunk() helper. For unknown runtimes, randomizing the order of jobs sometimes helps if there is a systematic relationship between the order of the jobs and their runtime. This prevents, for example, all the long jobs from being executed at the end, which would lead to avoidable underutilization. mlr3misc ships with the functions chunk() and chunk_vector() to conveniently chunk vectors; both shuffle the jobs by default. There are also options to control the chunk size for parallelization started from within mlr3 - these are described later in Section 9.1.2.
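
As a small illustration of these helpers, the following sketch splits 10 hypothetical job indices into chunks for 4 workers (the numbers are arbitrary):

library(mlr3misc)

# split 10 job indices into 4 chunks; the order is shuffled by default
chunk_vector(1:10, n_chunks = 4)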

9.1.1 Parallelization of Learners

The most atomic parts of mlr3 which can be parallelized are calls to external code, i.e., the execution of certain PipeOp, Filter, or Learner objects. For these objects, mlr3 merely provides a unified interface to control the execution; the parallelization itself is implemented by the authors of the respective (external) algorithms that mlr3 calls.

Most of these algorithms are parallelized via threading, e.g., the random forest implementation in ranger or the boosting implemented in xgboost. For example, while fitting a single decision tree, each split that divides the data into two disjoint partitions requires a search for the best cut point on all \(p\) features. So instead of iterating over all features sequentially, the search can be broken down into \(p\) threads, each searching for the best cut point on a single feature. These threads can easily be parallelized by the scheduler of the operating system, as there is no need for communication between the threads. After all threads have finished, the results are collected and merged before terminating the threads. I.e., for our example of the decision tree, (1) the \(p\) best cut points per feature are collected and then (2) aggregated to the single best cut point across all features by just iterating over the \(p\) results sequentially.
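
As a toy illustration of this split-and-aggregate pattern (this is not how rpart or ranger are implemented - they use threads in compiled code - and all names below are made up for this sketch), one could search for the best cut point per feature in parallel and then reduce the results sequentially:

library(parallel)

# toy data: three numeric features and a numeric target
X = data.frame(x1 = rnorm(100), x2 = runif(100), x3 = rpois(100, 3))
y = rnorm(100)

# score a candidate cut point by the summed within-partition variance
score_cut = function(feature, y, cut) {
  left = y[feature <= cut]; right = y[feature > cut]
  sum(var(left) * (length(left) - 1), var(right) * (length(right) - 1),
    na.rm = TRUE)
}

# best cut point for a single feature (sequential search over candidates)
best_cut = function(feature, y) {
  candidates = head(sort(unique(feature)), -1)
  scores = vapply(candidates, score_cut, numeric(1), feature = feature, y = y)
  c(cut = candidates[which.min(scores)], score = min(scores))
}

# step 1: one job per feature, executed in parallel on a socket cluster
cl = makeCluster(2)
clusterExport(cl, "score_cut")
per_feature = parLapply(cl, X, best_cut, y = y)
stopCluster(cl)

# step 2: aggregate sequentially to the single best split over all features
per_feature[[which.min(vapply(per_feature, `[[`, numeric(1), "score"))]]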

Note

Parallelization on GPUs is not covered in this book. mlr3 only distributes the fitting of multiple learners, e.g., during resampling, benchmarking, or tuning. On this rather abstract level, GPU parallelization does not work efficiently. However, some learning procedures can be compiled against CUDA/OpenCL to utilize the GPU while fitting a single model. We refer to the respective documentation of the learner’s implementation, e.g., https://xgboost.readthedocs.io/en/stable/gpu/ for XGBoost.

Threading is implemented in the compiled code of the package (e.g., in C or C++). The R interpreter calls the external code and waits for the results to be returned, without noticing that the computations are executed in parallel. Unfortunately, threading conflicts with certain parallel backends, leading to an overutilized system in the best case and hangs or segfaults in the worst case. For this reason, we introduced the convention that threading parallelization is turned off by default. Hyperparameters that control the number of threads are tagged with the label "threads":

library("mlr3learners") # for the ranger learner

# get the ranger learner
learner = lrn("classif.ranger")

# show all hyperparameters tagged with "threads"
learner$param_set$ids(tags = "threads")
[1] "num.threads"
# The number of threads is initialized to 1
learner$param_set$values$num.threads
[1] 1

To enable the parallelization for this learner, mlr3 provides the helper function set_threads():

# use 4 CPUs
set_threads(learner, n = 4)
<LearnerClassifRanger:classif.ranger>
* Model: -
* Parameters: num.threads=4
* Packages: mlr3, mlr3learners, ranger
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_backward, importance, multiclass, oob_error,
  twoclass, weights
# auto-detect cores on the local machine
set_threads(learner)
<LearnerClassifRanger:classif.ranger>
* Model: -
* Parameters: num.threads=2
* Packages: mlr3, mlr3learners, ranger
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_backward, importance, multiclass, oob_error,
  twoclass, weights

In the last line, we did not set the number of threads, letting the package fall back to a heuristic to detect the correct number. This heuristic is sometimes flaky, and utilizing all available cores is occasionally counterproductive as overburdening the system often has negative effects on the overall runtime. The function which determines the number of CPUs for mlr3 is implemented in parallelly::availableCores() and works well for many setups. See this blog post4 for some background information about the implemented heuristic. However, there are still some scenarios where it is better to reduce the number of utilized CPUs manually:

  • You want to simultaneously work on the same system, e.g., browse the web or watch a video.
  • You are on a multi-user system and want to spare some resources for other users.
  • You have a CPU with heterogeneous cores, for example, the energy-efficient “Icestorm” cores on a Mac M1 chip. These are comparably slower than the high-performance “Firestorm” cores and not well suited for heavy computations.
  • You have linked R to a threaded BLAS implementation like OpenBLAS, and your learners make heavy use of linear algebra.

You can manually set the number of CPUs to overrule the heuristic via option "mc.cores":

options(mc.cores = 4)

We recommend setting this option in your system's .Rprofile file, cf. ?Startup.

There are some other approaches to parallelize learners, e.g., by directly supporting one specific parallelization backend or a parallelization framework like foreach. If this is supported, parallelization usually must be activated explicitly, e.g., by setting a hyperparameter. If you need to parallelize on the learner level because a single model fit takes too long and you only fit a few of these models, consult the documentation of the respective learner. In many scenarios it makes more sense to parallelize on a different level, such as resampling or benchmarking, which is covered in the following subsections.

9.1.2 Parallelization of Resamplings

In addition to parallel learners, most machine learning experiments include a very easy handle for parallelization: the resampling. By definition, resampling is performed to get an unbiased performance estimator by aggregating over independent repetitions of multiple train-test splits.

mlr3 has “marked” this loop of independent iterations as parallelizable and uses an additional abstraction layer to support a broad range of parallel backends: the future package. The loop is executed via the future parallelization framework, using the parallel backend configured by the user via the future::plan() function.

In this section, we will use the spam task and a simple lrn("classif.rpart"). We use the future::multisession plan (which internally uses socket clusters from the parallel package, see parallel::makeCluster()) that should work on all operating systems.

# query the currently active plan
future::plan()
sequential:
- args: function (..., envir = parent.frame())
- tweaked: FALSE
- call: NULL
# define objects to perform a resampling
task = tsk("spam")
learner = lrn("classif.rpart")
resampling = rsmp("cv", folds = 3)

# select the multisession backend to use
future::plan("multisession")

# run the resampling in parallel and measure runtime
system.time({
  resample(task, learner, resampling)
})
   user  system elapsed 
  0.171   0.005   0.928 

By default, all CPUs of your machine are used unless you specify the argument workers in future::plan() (possible problems with the value returned by the heuristic have already been discussed in the previous Section 9.1.1). If you compare runtimes between the parallel backend and sequential execution (plan("sequential")) here, you should see a decrease in the reported elapsed time. However, in practice, you cannot expect the runtime to decrease linearly with the number of cores (Amdahl's law5). In contrast to threads, the technical overhead for starting workers, communicating objects, sending back results, and shutting down the workers is quite large for the multisession backend.

The multicore backend (plan("multicore")) comes with more overhead than threading, but considerably less than the multisession backend. With the multicore backend, R objects are copied only when they are modified (copy-on-write), while with the multisession backend, objects are always copied to the respective session prior to any computation. The multicore backend has the major disadvantage that it is not supported on Windows systems - for this reason, we stick with the multisession backend for all examples here. In general, it is advised to only consider parallelization for resamplings where each iteration runs for at least a few seconds. Note that there are two mlr3 options to control the execution and granularity:

  • If mlr3.exec_random is set to TRUE (default), the order of jobs is randomized in resamplings and benchmarks. This can help if you run a benchmark or tuning with heterogeneous runtimes, e.g., to avoid all the expensive learners being started last.
  • The option mlr3.exec_chunk_size controls how many jobs are mapped to a single future and defaults to 1. The value of this option is passed to future.apply::future_mapply(), while future.scheduling is always set to TRUE.

Tuning the chunk size can help in some rare cases to mitigate the parallelization overhead. For larger problems and longer runtimes, however, this plays a subordinate role.
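
For completeness, here is a minimal sketch of how these options could be set before starting a resampling (the values are purely illustrative, not recommendations):

# randomize the job order and send 8 resampling iterations per future
options(mlr3.exec_random = TRUE, mlr3.exec_chunk_size = 8)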

Figure 9.1 illustrates the parallelization from the above example. From left to right:

  1. The main process calls the resample() function.
  2. The computational task is split into 3 parts for the 3-fold cross-validation.
  3. The folds are passed to 3 workers, each fitting a model on the respective subset of the task and predicting on the left-out observations.
  4. The predictions (and trained models) are communicated back to the main process, which combines them into a ResampleResult.
Figure 9.1: Parallelization of a resampling using a 3-fold cross-validation

9.1.3 Parallelization of Benchmarks

Benchmarks can be seen as a collection of multiple independent resamplings where a combination of a task, a learner, and a resampling strategy defines one resampling to perform. In pseudo-code, the calculation can be written down as

foreach combination of (task, learner, resampling strategy) {
    foreach resampling iteration j {
        execute(combination, j)
    }
}

For parallelization, there are now two options:

  1. Parallelize over all resamplings, execute each resampling sequentially (parallelize outer loop).
  2. Iterate over all resamplings, execute each resampling in parallel (parallelize inner loop).

If you are transitioning from mlr, you might be used to selecting one of these parallelization levels before benchmarking. One major drawback of this approach becomes clear when both the outer and inner loop have fewer iterations than there are available workers, resulting in an underutilized system. In mlr3, the choice of level is no longer required (except occasionally for nested resampling, briefly described in the following Section 9.1.4). All experiments are rolled out on the same level, i.e., benchmark() iterates over the elements of the Cartesian product of the iterations of the outer and inner loops. Therefore, there is no need to decide whether you want to parallelize the tuning or the resampling, you always parallelize both. This approach makes the computation fine-grained and gives the future backend the opportunity to group the jobs into chunks of suitable size (depending on the number of workers).

Parallelization of benchmarks works analogously to resampling:
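
The code producing the result below is omitted here; a sketch consistent with the printed output (tasks, learners, and fold count are reconstructed from it) could look like this:

future::plan("multisession")

design = benchmark_grid(
  tasks = tsks(c("sonar", "penguins")),
  learners = lrns(c("classif.featureless", "classif.rpart")),
  resamplings = rsmp("cv", folds = 3)
)
bmr = benchmark(design)
bmr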

<BenchmarkResult> of 12 rows with 4 resampling runs
 nr  task_id          learner_id resampling_id iters warnings errors
  1    sonar classif.featureless            cv     3        0      0
  2    sonar       classif.rpart            cv     3        0      0
  3 penguins classif.featureless            cv     3        0      0
  4 penguins       classif.rpart            cv     3        0      0

For larger benchmarks with a cumulative runtime of weeks, months or even years, see Section 10.2 which covers parallelization on high-performance computing clusters.

9.1.4 Nested Resampling Parallelization

Like in benchmarking, nested resampling for tuning also translates into two nested resampling loops. But unlike benchmarking, not all iterations are independent of each other: the hyperparameter configurations a tuner proposes depend on the results of previously evaluated configurations. Therefore, the nested loops cannot be flattened, and the user instead has to choose which of the loops to parallelize. Let us consider the following example: You want to tune the minsplit argument of a classification tree using the AutoTuner of mlr3tuning (simplified version taken from Section 4.1):
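
The construction of the AutoTuner is not shown above; a sketch consistent with the surrounding description (the measure, the tuning range, and the number of evaluations are assumptions made for this illustration) could look like this:

library(mlr3tuning)

at = auto_tuner(
  tuner = tnr("random_search"),
  learner = lrn("classif.rpart", minsplit = to_tune(2, 128)),
  resampling = rsmp("cv", folds = 2), # inner CV
  measure = msr("classif.ce"),
  term_evals = 20
)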

To evaluate the performance on an independent test set, resampling is used:

resample(
  task = tsk("penguins"),
  learner = at,
  resampling = rsmp("cv", folds = 5) # outer CV
)
<ResampleResult> with 5 resampling iterations
  task_id          learner_id resampling_id iteration warnings errors
 penguins classif.rpart.tuned            cv         1        0      0
 penguins classif.rpart.tuned            cv         2        0      0
 penguins classif.rpart.tuned            cv         3        0      0
 penguins classif.rpart.tuned            cv         4        0      0
 penguins classif.rpart.tuned            cv         5        0      0

Here, we have three opportunities to parallelize:

  1. the inner cross-validation of the auto tuner with 2 folds,
  2. the outer cross-validation of the resampling with 5 folds, and
  3. evaluating all configurations proposed by the random search in a single batch (parameter batch_size of TunerRandomSearch, defaulting to 1).

Because the third opportunity is not always applicable, especially for many advanced tuning algorithms which are only capable of proposing a single configuration in each iteration, we will here focus on the first two opportunities. Furthermore, we assume that we have a single CPU with four cores available.

If we opt to parallelize the outer CV, all four cores are first utilized by the computation of the first 4 resampling iterations, while the computation of the fifth iteration has to wait. The resulting CPU utilization of the nested resampling example on 4 CPUs is visualized in two figures:

  • Figure 9.2 as an example for parallelizing the outer 5-fold cross-validation.

    # Runs the outer loop in parallel and the inner loop sequentially
    future::plan(list("multisession", "sequential"))

    We assume that each fit during the inner resampling takes 4 seconds to compute and that there is no other significant overhead. First, each of the four workers starts with the computation of an inner 2-fold cross-validation. As there are more jobs than workers, the remaining fifth iteration of the outer resampling is queued on CPU1 once the first 4 iterations have finished after 8s. During the computation of the fifth outer resampling iteration, only CPU1 is utilized while the other 3 CPUs are idle. Note that just setting up the parallelization with a simple future::plan("multisession") has the same effect - the outermost loop is parallelized while all nested loops default to sequential execution.

  • Figure 9.3 as an example for parallelizing the inner 2-fold cross-validation.

    # Runs the outer loop sequentially and the inner loop in parallel
    future::plan(list("sequential", "multisession"))

    Here, the outer loop runs sequentially and distributes the 2 computations for the inner resampling on 2 CPUs. Meanwhile, CPU3 and CPU4 are idling.

Figure 9.2: CPU utilization for 4 CPUs while parallelizing the outer 5-fold cross-validation with a sequential 2-fold cross-validation inside. Jobs are labeled as [iteration outer]-[iteration inner].
Figure 9.3: CPU utilization for 4 CPUs while parallelizing the inner 2-fold cross-validation with a sequential 5-fold cross-validation outside. Jobs are labeled as [iteration outer]-[iteration inner].

Neither option exploits the full potential of the 4 CPUs: with parallelization of the outer loop, all results are computed after 16s, while with parallelization of the inner loop, the results are only available after 20s.

If possible, the number of iterations can be adapted to the available hardware. There is no law set in stone that you have to select, e.g., 10 folds in cross-validation. If you have 4 CPUs and the variance of the performance estimate is acceptable, 8 iterations are often sufficient; alternatively, you can do 12 iterations instead of 10, because the two additional iterations come basically for free.

Alternatively, you can also enable parallelization for both loops for nested parallelization, even on different parallelization backends. While nesting real parallelization backends is often unintended and causes unnecessary overhead, it is useful in some distributed computing setups. In this case, the number of workers must be manually tweaked so that the system does not get overburdened:

# Runs both loops in parallel
future::plan(list(
  future::tweak("multisession", workers = 2),
  future::tweak("multisession", workers = 4)
))

This example would run on 8 cores (= 2 * 4) on the local machine, parallelizing the outer resampling on 2, and the inner resampling on 4 workers. The vignette of the future package gives more insight into nested parallelization. For more background information about parallelization during tuning, see Section 6.7 of Bischl et al. (2023).

Important

During tuning with mlr3tuning, you can often adjust the batch size of the Tuner, i.e., control how many hyperparameter configurations are evaluated in parallel. If you want full parallelization, make sure that the batch size multiplied by the number of (inner) resampling iterations is at least equal to the number of available workers. If you expect homogeneous runtimes, i.e., you are tuning over a single learner or linear pipeline, and you have no hyperparameter which is likely to influence the runtime, aim for a multiple of the number of workers.

In general, larger batches allow for more parallelization, while smaller batches imply a more frequent evaluation of the termination criteria. We default to a batch_size of 1 that ensures that all Terminators work as intended, i.e., you cannot exceed the computational budget.
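
For example, a random search that proposes and evaluates four configurations per batch could be constructed as follows (the value 4 is arbitrary):

tuner = tnr("random_search", batch_size = 4)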

Heterogeneous runtimes add an extra layer of complexity to parallelization. This occurs frequently, especially in tuning, when a hyperparameter strongly influences the runtime of the learning procedure. Examples are the number of trees for random forests or the number of regularization values to be tested in penalized regression.

How efficient the parallelization turns out to be depends in particular on the scheduling strategy of the backend. After the first batch of jobs has been sent to the workers, the next jobs are started either

  1. as soon as all results have been collectively reported back to the main process, or
  2. as soon as the first job reports back.

Method (a) usually comes with less synchronization overhead and is best suited for short jobs with homogeneous runtimes. Method (b) is faster if the runtimes are heterogeneous, especially if the parallelization overhead is negligible in comparison with the runtime of the computation. For parallel::mclapply(), for example, the behavior of the scheduler can be controlled with the mc.preschedule option; parallel::parSapply() implements method (a), while parallel::parSapplyLB() implements scheduling according to method (b).
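
The practical difference between the two strategies can be illustrated with a toy example of heterogeneous job runtimes (the sleep times are arbitrary):

library(parallel)
cl = makeCluster(2)

# jobs with very unequal runtimes (in seconds)
sleep_times = c(4, 1, 1, 1, 1)

# (a) prescheduled: jobs are split into chunks up front
system.time(parSapply(cl, sleep_times, Sys.sleep))

# (b) load balancing: an idle worker immediately receives the next job
system.time(parSapplyLB(cl, sleep_times, Sys.sleep))

stopCluster(cl)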

9.1.5 Parallelization of Predictions

Finally, the prediction step of a single learner can also be parallelized, as the predictions for any two observations are independent of each other. For most learners, training is the bottleneck and parallelizing the prediction is not a worthwhile endeavor, but of course there are exceptions.

Technically, the test data is first split into multiple groups and the predict method of the learner is applied to each group in parallel, using the active backend configured via future::plan(). The resulting predictions are then combined internally in a second step. However, to avoid accidentally predicting in parallel, parallel predictions must first be enabled in the learner via the parallel_predict field:

# train random forest on spam task
task = tsk("spam")
learner = lrn("classif.ranger")
learner$train(task)

# set up parallel predict on 4 workers
future::plan("multisession", workers = 4)
learner$parallel_predict = TRUE

# perform prediction
prediction = learner$predict(task)

The resulting Prediction is identical to the one computed sequentially.

9.1.6 Reproducibility

Usually, reproducibility is a major concern during parallelization because special pseudorandom number generators (PRNGs6) are required (see this blog post7). A simple set.seed() is not sufficient when parallelization is involved. One general recommendation is switching from the default Mersenne Twister8 to Pierre L'Ecuyer's RngStreams (see Random and parallel::RNGstreams). However, even this parallel PRNG comes with pitfalls w.r.t. reproducibility: if you change the number of workers, you still get different results after setting the seed with set.seed().

Luckily, this and many other problems are already addressed by the excellent future parallelization framework which mlr3 uses under the hood. future ensures that all workers receive exactly the same PRNG streams, independent of the number of workers. Although correct seeding alone does not guarantee full reproducibility, it is one less problem to worry about. You can find more details about the PRNG used in this blog post9.

The issue of reproducibility is very complex, and complete reproducibility is very difficult to achieve - this ultimately also depends on the computational accuracy of the hardware, the processor instructions used, compiler versions and optimization flags, or the BLAS library R links to for linear algebra. But since at least parallel PRNGs are not a problem, you should get the same or at least very similar results with a simple set.seed() before you run the experiments.
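
A minimal sketch to check this for a parallel resampling (the aggregated scores should be identical across the two runs, and also when the number of workers is changed):

future::plan("multisession", workers = 2)

set.seed(42)
rr1 = resample(tsk("penguins"), lrn("classif.rpart"), rsmp("cv", folds = 3))

set.seed(42)
rr2 = resample(tsk("penguins"), lrn("classif.rpart"), rsmp("cv", folds = 3))

identical(rr1$aggregate(), rr2$aggregate())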

9.2 Error Handling

In large ML experiments, it is not uncommon that a model fit or prediction fails with an error. This is because the algorithms have to process arbitrary data, and not all eventualities can be handled. While we try to identify obvious problems before execution, such as missing values that a learner cannot handle, other problems are far more complex to detect. Examples include correlations or collinearity that make model fitting impossible, outliers that lead to numerical problems, or new levels of categorical variables appearing in the predict step. Learners behave quite differently when encountering such problems: some models signal a warning during the train step that they failed to fit but return a baseline model, while other models stop the execution entirely. During prediction, some learners just refuse to predict the response for observations they cannot handle, while others predict a missing value. How to deal with these problems, even in more complex setups like benchmarking or tuning, is the topic of this section.

For illustration (and internal testing) of error handling, mlr3 ships with the learners "classif.debug" and "regr.debug". Here, we use the debug learner for classification to demonstrate the error handling:

task = tsk("penguins")
learner = lrn("classif.debug")
print(learner)
<LearnerClassifDebug:classif.debug>: Debug Learner for Classification
* Model: -
* Parameters: list()
* Packages: mlr3
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_forward, missings, multiclass, twoclass

This learner comes with special hyperparameters that let us simulate problems frequently encountered in ML, e.g., to control

  1. what conditions should be signaled (message, warning, error, segfault) with what probability,
  2. during which stage the conditions should be signaled (train or predict), and
  3. the ratio of predictions being NA (predict_missing).

For a detailed description of all hyperparameters, see the manual page of mlr_learners_classif.debug.

learner$param_set
<ParamSet>
                      id    class lower upper nlevels        default value
 1:        error_predict ParamDbl     0     1     Inf              0      
 2:          error_train ParamDbl     0     1     Inf              0      
 3:      message_predict ParamDbl     0     1     Inf              0      
 4:        message_train ParamDbl     0     1     Inf              0      
 5:      predict_missing ParamDbl     0     1     Inf              0      
 6: predict_missing_type ParamFct    NA    NA       2             na      
 7:           save_tasks ParamLgl    NA    NA       2          FALSE      
 8:     segfault_predict ParamDbl     0     1     Inf              0      
 9:       segfault_train ParamDbl     0     1     Inf              0      
10:          sleep_train ParamUty    NA    NA     Inf <NoDefault[3]>      
11:        sleep_predict ParamUty    NA    NA     Inf <NoDefault[3]>      
12:              threads ParamInt     1   Inf     Inf <NoDefault[3]>      
13:      warning_predict ParamDbl     0     1     Inf              0      
14:        warning_train ParamDbl     0     1     Inf              0      
15:                    x ParamDbl     0     1     Inf <NoDefault[3]>      
16:                 iter ParamInt     1   Inf     Inf              1      

With its default settings, the learner does nothing special: it remembers a random label during training and constantly predicts this label:

task = tsk("penguins")
learner$train(task)$predict(task)$confusion
           truth
response    Adelie Chinstrap Gentoo
  Adelie       152        68    124
  Chinstrap      0         0      0
  Gentoo         0         0      0

We now set a hyperparameter to let the debug learner signal an error during the train step. By default, mlr3 does not catch conditions such as warnings or errors raised while calling learners:

# set probability to signal an error to 1
learner$param_set$values = list(error_train = 1)

learner$train(tsk("penguins"))
Error in .__LearnerClassifDebug__.train(self = self, private = private, : Error from classif.debug->train()

If this were a regular learner, we could now start debugging with traceback() (or create a Minimal Reproducible Example (MRE)10 to track the problem down or file a bug report upstream). However, due to the nature of the problem, it is likely that the cause of the error cannot be fixed - so you have to learn how to deal with such errors.

Note

If you start debugging, make sure you have disabled parallelization to avoid the additional pitfalls it introduces. It may also be helpful to set the option mlr3.debug to TRUE. If this flag is set, mlr3 does not call into the future package, resulting in an easier-to-interpret program flow and traceback().

9.2.1 Encapsulation

Since ML algorithms are confronted with arbitrary, often messy data, errors are not uncommon here, and we often just need to move on during benchmarking or tuning. Thus, we need a mechanism to

  1. capture all signaled conditions such as messages, warnings and errors so that we can analyze them post-hoc (called encapsulation, covered in this section),
  2. deal with algorithms which do not terminate in a reasonable time, and
  3. proceed in a statistically sound way while being able to aggregate over partial results (next Section 9.2.2).

Encapsulation ensures that signaled conditions (such as messages, warnings and errors) are intercepted: all conditions raised during the training or predict step are logged into the learner, and errors do not interrupt the program flow. I.e., the execution of the calling function or package (here: mlr3) continues as if there had been no error, though the result (the fitted model during train(), the predictions during predict()) is missing. Each Learner has a field encapsulate to control how the train or predict steps are wrapped. The easiest way to encapsulate the execution is provided by the package evaluate, which evaluates R expressions while tracking conditions such as outputs, messages, warnings or errors (see the documentation of the encapsulate() helper function for more details):

task = tsk("penguins")
learner = lrn("classif.debug")

# this learner throws a warning and then stops with an error during train()
learner$param_set$values = list(warning_train = 1, error_train = 1)

# enable encapsulation for train() and predict()
learner$encapsulate = c(train = "evaluate", predict = "evaluate")

learner$train(task)

After training the learner, one can access the recorded log via the fields log, warnings and errors:

learner$log
   stage   class                                 msg
1: train warning Warning from classif.debug->train()
2: train   error   Error from classif.debug->train()
learner$warnings
[1] "Warning from classif.debug->train()"
learner$errors
[1] "Error from classif.debug->train()"

Another method for encapsulation is implemented in the callr package. In contrast to evaluate, the computation is carried out in a separate R process. This guards the calling session against segmentation faults, which would otherwise tear down the complete main R session. On the downside, starting new processes comes with comparably more computational overhead.

learner$encapsulate = c(train = "callr", predict = "callr")
learner$param_set$values = list(segfault_train = 1)
learner$train(task = task)
learner$errors
[1] "callr process exited with status -11"

With either of these encapsulation methods, we can catch errors and analyze the messages, warnings and error messages post-hoc. Additionally, a timeout can be set so that learners do not run for an indefinite time but are terminated after a specified time. Interrupting learners works differently depending on the encapsulation method (see mlr3misc::encapsulate()); viewed from the outside, a learner that hits the timeout behaves as if it had signaled an error. The timeout can be set separately for training and prediction and must be provided in seconds:

# 5 minute timeout for training, no timeout for predict
learner$timeout = c(train = 5 * 60, predict = Inf)

Unfortunately, catching errors and ensuring an upper time limit is only half the battle. Without a model, it is not possible to get predictions:

learner$predict(task)
Error: Cannot predict, Learner 'classif.debug' has not been trained yet

To handle the missing predictions gracefully during resample(), benchmark() or tuning, fallback learners are introduced next.

9.2.2 Fallback learners

Fallback learners make it possible to score results in cases where a Learner either completely failed to fit a model or refuses to provide predictions for some or all observations.

We will first handle the case that a learner fails to fit a model during training, e.g., because some convergence criterion is not met or the learner runs out of memory. In general, there are three possibilities to proceed:

  1. Ignore iterations with failed model fits. Although this is arguably the most frequent approach in practice, it is not statistically sound. For example, consider the case where a researcher wants a specific learner to look better in a benchmark study. To do this, the researcher takes an existing learner but introduces a small adaptation: If an internal goodness-of-fit measure is not achieved, an error is thrown. In other words, the learner only fits a model if the model can be reasonably well learned on the given training data. In comparison with the learning procedure without this adaptation and a good threshold, however, we now compare the mean over only the “easy” splits with the mean over all splits - an unfair advantage.
  2. Penalize failing learners. Instead of ignoring failed iterations, we can simply impute the worst possible score (as defined by the Measure) and thereby heavily penalize the learner for failing. However, this often seems too harsh for many problems, and for some measures there is no reasonable value to impute.
  3. Impute a value that corresponds to a (weak) baseline. Instead of imputing with the worst possible score, impute with a reasonable baseline, e.g., by just predicting the majority class or the mean of the response in the training data. Such simple baselines are implemented as featureless learners (mlr_learners_classif.featureless or mlr_learners_regr.featureless). Note that a reasonable baseline value is different in different training splits. Retrieving these values after a larger benchmark study has been conducted is possible, but tedious.

We strongly recommend option (3): it is statistically sound and very flexible. To make this procedure convenient during resampling and benchmarking, we support fitting a proper baseline with a fallback learner. In the next example, we attach a simple featureless learner as fallback to the debug learner. So whenever the debug learner fails (which, with the given parametrization, is every single time) and encapsulation is enabled, mlr3 internally falls back to the predictions of the featureless learner:

task = tsk("penguins")

learner = lrn("classif.debug")
learner$param_set$values = list(error_train = 1)
learner$fallback = lrn("classif.featureless")

learner$train(task)
learner
<LearnerClassifDebug:classif.debug>: Debug Learner for Classification
* Model: -
* Parameters: error_train=1
* Packages: mlr3
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_forward, missings, multiclass, twoclass
* Errors: Error from classif.debug->train()

Note that encapsulation does not have to be enabled explicitly; it is automatically set to "evaluate" for both the train and predict steps when a fallback learner is assigned to a learner without encapsulation enabled. Furthermore, the log contains the captured error (which is also included in the print output), and although no model is stored, we can still get predictions:

learner$model
NULL
prediction = learner$predict(task)
prediction$score()
classif.ce 
 0.5581395 

In this stepwise train-predict procedure, the fallback learner is of limited use. However, it is invaluable for larger benchmark studies.

In the following snippet, we compare the previously created debug learner with a simple classification tree. We re-parametrize the debug learner to fail in roughly 30% of the resampling iterations during the training step:

learner$param_set$values = list(error_train = 0.3)

bmr = benchmark(benchmark_grid(
  tsk("penguins"),
  list(learner, lrn("classif.rpart")),
  rsmp("cv")
))
aggr = bmr$aggregate(conditions = TRUE)
aggr[, .(learner_id, warnings, errors, classif.ce)]
      learner_id warnings errors classif.ce
1: classif.debug        0      3 0.58974790
2: classif.rpart        0      0 0.05226891

Even though the debug learner occasionally failed to provide predictions, we still got a statistically sound aggregated performance value which we can compare to the aggregated performance of the classification tree. It is also possible to split the benchmark up into separate ResampleResult objects which sometimes helps to get more context. E.g., if we only want to have a closer look into the debug learner, we can extract the errors from the corresponding resample results:

rr = aggr[learner_id == "classif.debug"]$resample_result[[1L]]
rr$errors
   iteration                               msg
1:         1 Error from classif.debug->train()
2:         4 Error from classif.debug->train()
3:         5 Error from classif.debug->train()

A problem similar to failed model fits emerges when a learner predicts only a subset of the observations in the test set (and predicts NA or no value for others). A typical case is, e.g., when new and unseen factor levels are encountered in the test data. Imagine again that our goal is to benchmark two algorithms using cross-validation on some binary classification task:

  • Algorithm A is an ordinary logistic regression.
  • Algorithm B is also an ordinary logistic regression, but with a twist: if the logistic regression is rather certain about the predicted label (> 90% probability), it returns the label; otherwise, it returns a missing value.

At its core, this is the same problem as outlined before. If we measure the performance using only the non-missing predictions, algorithm B would likely outperform algorithm A. However, this approach does not factor in that B cannot generate predictions for all observations. Long story short, if a fallback learner is specified, missing predictions of the base learner are automatically replaced with predictions from the fallback learner. This is illustrated in the following example:

task = tsk("penguins")
learner = lrn("classif.debug")

# this hyperparameter sets the ratio of missing predictions
learner$param_set$values = list(predict_missing = 0.5)

# without fallback
p = learner$train(task)$predict(task)
table(p$response, useNA = "always")

   Adelie Chinstrap    Gentoo      <NA> 
      172         0         0       172 
# with fallback
learner$fallback = lrn("classif.featureless")
p = learner$train(task)$predict(task)
table(p$response, useNA = "always")

   Adelie Chinstrap    Gentoo      <NA> 
      172       172         0         0 

To sum up, by combining encapsulation and fallback learners, it is possible to benchmark even quite unreliable or unstable learning algorithms in a convenient and statistically sound fashion.

9.3 Logging

mlr3 internally uses the lgr package to control the verbosity of the output, e.g., suppress messages or enable additional debugging messages.

9.3.1 Changing mlr3 logging levels

All log messages have an associated level which encodes a number, e.g., informative messages have the level "info" (associated with the number 400), while debugging messages have the level "debug" (number 500). A message is only displayed if its level does not exceed the threshold of the respective logger.

To change the setting for mlr3 for the current session, you need to retrieve the logger (which is an R6 object) from lgr and then change its threshold like this:

requireNamespace("lgr")

logger = lgr::get_logger("mlr3")
logger$set_threshold("<level>")

The default log level is "info". All available levels can be listed as follows:

getOption("lgr.log_levels")
fatal error  warn  info debug trace 
  100   200   300   400   500   600 

To increase verbosity, set the log level to a higher value, e.g. to "debug" with:

lgr::get_logger("mlr3")$set_threshold("debug")

To reduce the verbosity, reduce the log level to warn:

lgr::get_logger("mlr3")$set_threshold("warn")

lgr comes with a global option called "lgr.default_threshold" which can be set via options() to make your choice permanent across sessions.
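
For example, the following line could be added to the .Rprofile to make a reduced verbosity permanent (a sketch; here the threshold is given as the numeric value corresponding to "warn"):

options(lgr.default_threshold = 300) # 300 corresponds to "warn"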

Also note that optimization packages such as mlr3tuning or mlr3fselect use the logger of their base package bbotk. If you want to disable logging from mlr3 but keep the output from mlr3tuning, reduce the verbosity of the mlr3 logger and set the bbotk logger to the desired level.

lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("info")

9.3.2 Redirecting output

Redirecting output is already extensively covered in the documentation and vignette of lgr. Here is just a short example that adds an additional appender to log events to a temporary file in JSON format:

tf = tempfile("mlr3log_", fileext = ".json")

# get the logger as R6 object
logger = lgr::get_logger("mlr")

# add Json appender
logger$add_appender(lgr::AppenderJson$new(tf), name = "json")

# signal a warning
logger$warn("this is a warning from mlr3")
WARN  [13:46:37.591] this is a warning from mlr3
# print the contents of the file
cat(readLines(tf))
{"level":300,"timestamp":"2023-06-04 13:46:37","logger":"mlr","caller":"eval","msg":"this is a warning from mlr3"}
# remove the appender again
logger$remove_appender("json")

9.3.3 Immediate Log Feedback

mlr3 uses future and encapsulation to make evaluations fast, stable, and reproducible. However, this may lead to logs being delayed, out of order, or, in case of some errors, not present at all.

When it is necessary to have immediate access to log messages, for example to investigate problems, one may therefore choose to disable future and encapsulation. This can be done by enabling the debug mode using options(mlr3.debug = TRUE); the $encapsulate slot of learners should also be set to "none" (default) or "evaluate", but not "callr". Enabling the debug mode should only be done to investigate problems, however, and not for production use, because

  1. this disables parallelization, and
  2. this leads to different RNG behavior and therefore to results that are not fully reproducible.

9.4 Data Backends

This section covers advanced ML or technical details that can be skipped.

In mlr3, Task objects store their data in an abstract data object, the DataBackend. A data backend provides a unified API to retrieve subsets of the data or query information about it, regardless of how the data is actually stored on the system. The default backend uses data.table via the DataBackendDataTable as a very fast and efficient in-memory database. For example, we can query the dimensions of the penguins task:

task = tsk("penguins")
backend = task$backend
backend$nrow
[1] 344
backend$ncol
[1] 9

While storing the Task’s data in memory is most efficient w.r.t. accessing it for model fitting, this has two major disadvantages:

  1. Although often only a small proportion of the data is required at any given time, the complete data frame sits in memory and consumes it. This is especially a problem if you work with large tasks or many tasks simultaneously, e.g., for benchmarking.
  2. During parallelization, the complete data needs to be transferred to the workers, which can increase the overhead.

To avoid these drawbacks, especially for larger data, it can be necessary to interface out-of-memory data to reduce the memory requirements. This way, only the part of the data which is currently required by the learners is placed in main memory to operate on. There are multiple options to achieve this:

  1. DataBackendDplyr which interfaces the R package dbplyr, extending dplyr to work on many popular SQL databases like MariaDB11, PostgreSQL12 or SQLite13.
  2. DataBackendDuckDB for the impressive DuckDB14 database connected via duckdb: a fast, zero-configuration alternative to SQLite.
  3. DataBackendDuckDB, again, but for Parquet files15. The data does not need to be converted to DuckDB’s native storage format; you can work directly on directories containing one or multiple files stored in the popular Parquet format.

In the following, we will show how to work with data backends that are available through mlr3db.

9.4.1 Databases with DataBackendDplyr

To demonstrate the mlr3db::DataBackendDplyr, we use the NYC flights data set from the nycflights13 package and move it into an SQLite database. Although the helper function as_sqlite_backend() conveniently performs this step, we construct the database manually here.

# load data
requireNamespace("DBI")
requireNamespace("RSQLite")
requireNamespace("nycflights13")
data("flights", package = "nycflights13")
str(flights)
tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
 $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
 $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
 $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
 $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
 $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
 $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
 $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
 $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
 $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
 $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
 $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
 $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# add column of unique row ids
flights$row_id = 1:nrow(flights)

# create sqlite database in temporary file
path = tempfile("flights", fileext = ".sqlite")
con = DBI::dbConnect(RSQLite::SQLite(), path)
tbl = DBI::dbWriteTable(con, "flights", as.data.frame(flights))
DBI::dbDisconnect(con)

# remove in-memory data
rm(flights)

With the SQLite database stored in the file path, we now re-establish a connection and switch to dplyr/dbplyr for some essential preprocessing. If you had a real database management system (DBMS), this would be the first step of the analysis:
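
The code for this step was omitted from the rendered output above; a sketch consistent with the attach messages below and with the later use of the con and tbl objects (the exact calls are assumptions):

library("dplyr")
library("dbplyr")

# re-establish the connection and create a lazy reference to the table
con = DBI::dbConnect(RSQLite::SQLite(), path)
tbl = tbl(con, "flights")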


Attaching package: 'dplyr'
The following objects are masked from 'package:data.table':

    between, first, last
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':

    ident, sql

As databases are intended to store large volumes of data, a natural first step is to slice and dice the data to suitable dimensions. Therefore, we build up an SQL query in a step-wise fashion using dplyr verbs and start by selecting a subset of columns to work on:

keep = c("row_id", "year", "month", "day", "hour", "minute", "dep_time",
  "arr_time", "carrier", "flight", "air_time", "distance", "arr_delay")
tbl = select(tbl, all_of(keep))

Additionally, we remove those observations where the arrival delay (arr_delay) has a missing value:

tbl = filter(tbl, !is.na(arr_delay))

To reduce the runtime for this example, we also filter the data to only use every second row:

tbl = filter(tbl, row_id %% 2 == 0)

The factor levels of the feature carrier are merged so that infrequent carriers are replaced by level “other”:

tbl = mutate(tbl, carrier = case_when(
  carrier %in% c("OO", "HA", "YV", "F9", "AS", "FL", "VX", "WN") ~ "other",
  TRUE ~ carrier))

Next, the processed table is used to create an mlr3db::DataBackendDplyr:

library("mlr3db")
b = as_data_backend(tbl, primary_key = "row_id")

We can now use the interface of DataBackend to query some basic information about the data:

b$nrow
[1] 163707
b$ncol
[1] 13
b$head()
   row_id year month day hour minute dep_time arr_time carrier flight air_time
1:      2 2013     1   1    5     29      533      850      UA   1714      227
2:      4 2013     1   1    5     45      544     1004      B6    725      183
3:      6 2013     1   1    5     58      554      740      UA   1696      150
4:      8 2013     1   1    6      0      557      709      EV   5708       53
5:     10 2013     1   1    6      0      558      753      AA    301      138
6:     12 2013     1   1    6      0      558      853      B6     71      158
2 variables not shown: [distance, arr_delay]

Note that the DataBackendDplyr is not aware of any rows or columns we have filtered out with dplyr beforehand; it simply operates on the view we provided.
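To check that the preprocessing steps are merely recorded as a lazy query and not yet executed, you can inspect the SQL statement dbplyr has built up so far, e.g. with dplyr::show_query() (a quick sketch; output not shown):

# print the SQL translation of the lazy table
show_query(tbl)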

Having constructed a backend, we can now switch over to mlr3 and fit a model on a task based on the previously created mlr3db::DataBackendDplyr:

task = as_task_regr(b, id = "flights_sqlite", target = "arr_delay")
learner = lrn("regr.rpart")
resampling = rsmp("subsampling", ratio = 0.02, repeats = 3)

We pass all these objects to resample() to perform subsampling on 2% of the observations with three repetitions. In each iteration, only the required subset of the data is queried from the SQLite database and passed to rpart::rpart():

rr = resample(task, learner, resampling)
Loading required package: RSQLite
Loading required package: RSQLite
Loading required package: RSQLite
print(rr)
<ResampleResult> with 3 resampling iterations
        task_id learner_id resampling_id iteration warnings errors
 flights_sqlite regr.rpart   subsampling         1        0      0
 flights_sqlite regr.rpart   subsampling         2        0      0
 flights_sqlite regr.rpart   subsampling         3        0      0
measures = msrs(c("regr.mse", "time_train", "time_predict"))
rr$aggregate(measures)
    regr.mse   time_train time_predict 
 1168.456270     1.744667    21.559000 

Note that we still have an active connection to the database. To properly close it, we remove the tbl object referencing the connection and then close the connection.

rm(tbl)
DBI::dbDisconnect(con)

9.4.2 Parquet Files with DataBackendDuckDB

We have already demonstrated how to operate on a SQLite database. DuckDB databases (using DataBackendDuckDB) provide a modern alternative to SQLite, tailored to the needs of machine learning. To convert a data.frame to DuckDB, we provide the helper function as_duckdb_backend(). Only two arguments are required: the data.frame to convert, and a path to store the data.
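As a brief illustration, the following is a minimal sketch of such a conversion, using the built-in mtcars data and a temporary file as storage location (both chosen purely for demonstration, assuming the two-argument interface described above):

library("mlr3db")

# convert an in-memory data.frame to a DuckDB-backed DataBackend;
# the data is written to disk at the supplied path (a temporary file here)
backend_mtcars = as_duckdb_backend(mtcars, tempfile(fileext = ".duckdb"))
backend_mtcars$nrow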

While this is useful when working with many tasks simultaneously in order to keep memory requirements reasonable, the more frequent use case for DuckDB nowadays is working with Parquet files. Parquet is a popular column-oriented data storage format that supports efficient compression, making it far superior to other popular data exchange formats such as CSV.

To demonstrate working with Parquet files, we first query the location of an example dataset shipped with mlr3db:

path = system.file(file.path("extdata", "spam.parquet"), package = "mlr3db")

We can then create a DataBackendDuckDB based on this file and convert the backend to a classification task, all without loading the dataset into memory:

backend = as_duckdb_backend(path)
task = as_task_classif(backend, target = "type")
print(task)
<TaskClassif:backend> (4601 x 58)
* Target: type
* Properties: twoclass
* Features (57):
  - dbl (57): address, addresses, all, business, capitalAve,
    capitalLong, capitalTotal, charDollar, charExclamation, charHash,
    charRoundbracket, charSemicolon, charSquarebracket, conference,
    credit, cs, data, direct, edu, email, font, free, george, hp, hpl,
    internet, lab, labs, mail, make, meeting, money, num000, num1999,
    num3d, num415, num650, num85, num857, order, original, our, over,
    parts, people, pm, project, re, receive, remove, report, table,
    technology, telnet, will, you, your

Accessing the data internally triggers a query, and only the required subsets are fetched and stored in an in-memory data.frame. After the retrieved data has been processed, the garbage collector can release the occupied memory. The backend can also operate on a folder with multiple Parquet files, which is documented in as_duckdb_backend().
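For example, pointing the helper at a directory instead of a single file might look like the following sketch (the directory path here is purely hypothetical):

# hypothetical directory containing multiple Parquet files with a common schema
backend = as_duckdb_backend("/path/to/parquet_folder")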

9.5 Extending mlr3

Hopefully, having read the rest of this book, you are now well on your way to becoming an mlr3 expert. Maybe you will even want to extend the universe with new classes for more learners, measures, tuners, pipeops, and more; if so, read on.

This section covers advanced ML or technical details that can be skipped.

In this section we will show how to extend mlr3 using the simple example of creating a custom Measure. If you are interested in implementing new learners, pipeops, or tuners, then check out the vignettes in the respective packages: mlr3extralearners, mlr3pipelines, or mlr3tuning. If you are considering adding a new machine learning task, please contact us on GitHub, email, or Mattermost. This section assumes good knowledge of R6; see Section 1.8.2 for a brief introduction and references to further resources.

We welcome contributions from developers of all levels, and if you want to add any of your new classes to our universe then please make pull requests to the corresponding packages; for example, tuners and filters would go to mlr3tuning and mlr3filters respectively, a new survival measure would go to mlr3proba, and all new learners go to mlr3extralearners. Do not worry if you make a PR to the wrong repository; we will transfer it to the right one. Please read the mlr3 Wiki16 for the coding conventions we use if you want to add code to our organisation.

We will now turn to extending the Measure class to implement new metrics. As an example, let us consider a regression measure that scores a prediction as \(1\) if the absolute difference between the true and predicted values is less than one standard deviation of the truth, and as \(0\) otherwise. In mathematical notation this is defined as \(f(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}(|y_i - \hat{y}_i| < \sigma(y))\), where \(y\) contains the true values and \(\hat{y}\) the predicted values for observations \(i = 1, ..., n\). The function \(\mathbb{I}(C)\) denotes the indicator function17, with \(\mathbb{I}(C) = 1\) if condition \(C\) is true and \(\mathbb{I}(C) = 0\) otherwise. In code, the measure may be written as:

threshold_acc = function(truth, response) {
  mean(ifelse(abs(truth - response) < sd(truth), 1, 0))
}

threshold_acc(c(100, 0, 1), c(1, 11, 6))
[1] 0.6666667

This measure is then bounded in \([0, 1]\) and a larger score is better.

To use this measure in mlr3, we need to create a new R6::R6Class which inherits from Measure, in this case specifically from MeasureRegr. We will now show what the final code for this new measure looks like and then explain each part; this can be used as a template for most performance measures.

MeasureRegrThresholdAcc = R6::R6Class("MeasureRegrThresholdAcc",
  inherit = mlr3::MeasureRegr, # regression measure
  public = list(
    initialize = function() { # initialize class
      super$initialize(
        id = "thresh_acc", # unique ID
        packages = character(), # no dependencies
        properties = character(), # no special properties
        predict_type = "response", # measures response prediction
        range = c(0, 1), # scores lie in [0, 1]
        minimize = FALSE # larger values are better
      )
    }
  ),

  private = list(
    .score = function(prediction, ...) { # define score as private method
      # define loss
      threshold_acc = function(truth, response) {
        mean(ifelse(abs(truth - response) < sd(truth), 1, 0))
      }
      # call loss function
      threshold_acc(prediction$truth, prediction$response)
    }
  )
)

  1. In the first two lines we name the class, here MeasureRegrThresholdAcc, and then state this is a regression measure that inherits from MeasureRegr.
  2. We initialize the class by stating its unique ID is “thresh_acc”, that it does not require any external packages (packages = character()) and that it has no special properties (properties = character()).
  3. We then pass specific details of the loss function, which are: it measures the quality of a "response" type prediction, its values lie in \([0, 1]\), and larger values are better (hence minimize = FALSE).
  4. Finally, we define the score itself as a private method called .score and simply pass the predictions to the function we defined earlier.

Sometimes measures require data from the training set, the task, or the learner. These are usually complex edge cases, so we will not go into detail here; for working examples, we suggest looking at the code of mlr3proba::MeasureSurvSongAUC and mlr3proba::MeasureSurvAUC. You can also consult the manual page of Measure for an overview of other properties and metadata that can be specified.
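As a rough illustration of the mechanism, a measure can declare that it needs the task via its properties, in which case the task is passed on to the private $.score() method. The following is an untested sketch under these assumptions, with a toy score chosen only for demonstration:

MeasureRegrTaskDemo = R6::R6Class("MeasureRegrTaskDemo",
  inherit = mlr3::MeasureRegr,
  public = list(
    initialize = function() {
      super$initialize(
        id = "demo_task_measure",
        packages = character(),
        properties = "requires_task", # request access to the task during scoring
        predict_type = "response",
        range = c(0, Inf),
        minimize = TRUE
      )
    }
  ),
  private = list(
    # because of "requires_task", the task is available as an extra argument
    .score = function(prediction, task, ...) {
      # toy score: mean absolute error divided by the number of task features
      mean(abs(prediction$truth - prediction$response)) / length(task$feature_names)
    }
  )
)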

Once you have defined your measure you can either use it with the R6 constructor, or by adding it to the mlr_measures dictionary:

library(mlr3verse)

task = tsk("mtcars")
split = partition(task)
learner = lrn("regr.featureless")$train(task, split$train)
prediction = learner$predict(task, split$test)
prediction$score(MeasureRegrThresholdAcc$new())
thresh_acc 
 0.6363636 
# or add to dictionary
mlr3::mlr_measures$add("regr.thresh_acc", MeasureRegrThresholdAcc)
prediction$score(msr("regr.thresh_acc"))
thresh_acc 
 0.6363636 

Even though we only showed how to create a custom measure, the process of adding other objects is in essence the same:

  1. Find the right class to inherit from
  2. Add methods that:
    1. Initialize the object with the correct properties ($initialize()).
    2. Implement the public and private methods that do the actual computation. In the above example this was the private $.score() method.

As many classes already exist in mlr3, we recommend looking at a similar, already available class and using it as a reference.

9.6 Conclusion

This chapter describes many advanced topics that are not relevant to each and every user and that cannot be covered in full detail here. We covered parallelization, error handling, logging, and working with databases; these topics might not be needed if you work on rather small and clean datasets. Additionally, we demonstrated how to extend mlr3 with additional measures.

The sections presented in this chapter deal with problems that many users will encounter with some regularity. Therefore, at least superficial knowledge is necessary to identify a problem when it occurs and to return to the correct section or, for more background information, to the following resources.

Resources

  • Schmidberger et al. (2009) and Eddelbuettel (2020) give a more systematic and in-depth overview of the possibilities for parallelization with R.
  • The vignette18 of the lgr package demonstrates advanced logging capabilities, e.g., logging to JSON files or retrieving logged objects for debugging.
  • Extending and customizing objects is covered in the documentation of the respective packages, e.g. mlr3extralearners, mlr3pipelines, and mlr3tuning.
  • For an overview of the DBMS available in R, see the CRAN task view on databases19, and in particular the vignettes of the dbplyr package for the DBMS that are readily usable with mlr3. For working directly with an SQL database, we recommend a general-purpose SQL Tutorial20.

9.7 Exercises

Parallelization

Consider the following example where you resample a learner (debug learner, sleeps for 3 seconds during train) on 4 workers using the multisession backend:

task = tsk("penguins")
learner = lrn("classif.debug", sleep_train = function() 3)
resampling = rsmp("cv", folds = 6)

future::plan("multisession", workers = 4)
resample(task, learner, resampling)
  • Assuming that the learner would actually calculate something and not just sleep: Would all CPUs be busy?
  • Prove your point by measuring the elapsed time, e.g., using system.time().
  • What would you change in the setup and why?

Custom Measures

Create a new custom classification measure which scores predictions using the mean over the following classification costs:

  1. If the learner predicted label “A” and the truth is “A”, assign score 0
  2. If the learner predicted label “B” and the truth is “B”, assign score 0
  3. If the learner predicted label “A” and the truth is “B”, assign score 1
  4. If the learner predicted label “B” and the truth is “A”, assign score 10

Hint: You can implement it yourself as demonstrated in Section 9.5 or use the measure mlr_measures_classif.costs.