# 9  Technical

Authors
Affiliations
Michel Lang

Research Center Trustworthy Data Science and Security

TU Dortmund University


This chapter provides an overview of the technical details of the mlr3 framework. Among the topics covered are parallelization (Section 9.1), error handling (Section 9.2), and data backends (Section 9.3).

## 9.1 Parallelization

Parallelization refers to running multiple jobs in parallel, i.e., executing them simultaneously on multiple CPU cores, CPUs, or computational nodes. This can significantly reduce the overall runtime.

In general, there are many possibilities to parallelize, depending on the hardware to run the computations: If you only have a single CPU with multiple cores, threads or forks are ways to utilize all cores. If you have multiple machines, they need a way to communicate and exchange information, e.g. via protocols like network sockets or the Message Passing Interface (MPI). We don’t want to delve too deep into such details here, but want to introduce some terminology:

• We call the parallelization platform together with its implementation for R the parallelization backend. As many parallelization backends have a different API, we are using the future package as an additional abstraction layer. mlr3 just interfaces future while the user can control how the code is executed.
• The R session or process which orchestrates the computational work is called main, and it starts computational jobs.
• The R sessions, processes, forks or machines which receive the jobs, do the calculation and then send back the result are called workers.

We distinguish between implicit parallelism and explicit parallelism. For the former, no special directives are required to enable the parallelization, everything works fully automatically. For the latter, parallelization has to be manually configured. On the one hand, this gives you full control over the execution, but on the other hand, this poses a greater obstacle for non-experts.

Note

We don’t cover parallelization on GPUs here. mlr3 only distributes the fitting of multiple learners, e.g., during resampling, benchmarking, or tuning. On this rather abstract level, GPU parallelization doesn’t work efficiently. Some learning procedures can be compiled against CUDA/OpenCL to utilize the GPU while fitting a single model. We refer to the respective documentation of the learner’s implementation, e.g., the GPU documentation of xgboost.

### 9.1.1 Implicit Parallelization

We talk about implicit parallelization in the context of mlr3, if mlr3 calls external code (i.e., code from foreign CRAN packages which implements a Learner) that itself runs in parallel. Note that this definition includes GPU acceleration.

Many machine learning algorithms can parallelize their model fit using threading, e.g., the random forest implementation in ranger or the boosting implemented in xgboost. During threading, the implementation instructs some sequential parts of the code to be executed independently of the other parts in the same process.

For example, while fitting a decision tree, each split that divides the data into two disjoint partitions requires a search for the best cut point on all $p$ features. So instead of iterating over all features sequentially, the search can be broken down into $p$ threads, each searching for the best cut point on a single feature. These threads can easily be parallelized by the scheduler of the operating system, as there is no need for communication between the threads. After all threads have finished, the results are collected and merged before terminating the threads. That is, for our example of the decision tree, (1) the $p$ best cut points per feature are collected and then (2) aggregated to the single best cut point across all features by just iterating over the $p$ results sequentially.
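The feature-wise search described above can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions: a numeric response, squared-error loss, and a hypothetical helper `best_cut()` that is not part of mlr3, ranger or rpart. Real implementations run the per-feature searches as threads in compiled code; here a plain `lapply()` stands in for them.

```r
# Hypothetical helper: best cut point for one numeric feature, judged by the
# residual sum of squares (RSS) of the two resulting partitions
best_cut = function(x, y) {
  values = sort(unique(x))
  # candidate cut points: midpoints between consecutive distinct values
  cuts = (head(values, -1) + tail(values, -1)) / 2
  rss = vapply(cuts, function(cut) {
    left = y[x <= cut]
    right = y[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  list(cut = cuts[which.min(rss)], rss = min(rss))
}

features = mtcars[, c("disp", "hp", "wt")]
target = mtcars$mpg

# (1) one independent search per feature; each call could run in its own thread
per_feature = lapply(features, best_cut, y = target)

# (2) sequential aggregation of the p results into the single best split
rss = vapply(per_feature, function(res) res$rss, numeric(1))
best_feature = names(which.min(rss))
best_split = per_feature[[best_feature]]
```

Because the calls in step (1) share no state, they need no communication; only step (2) has to wait for all of them.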

Warning

It does not make practical sense to actually execute in parallel every operation that can be parallelized. Starting and terminating workers (here: threads) as well as possible communication between workers comes at a price in the form of additionally required runtime which is called (parallelization) overhead. The overhead must be related to the runtime of the sequential execution. If the sequential execution is comparably fast, enabling parallelization often just introduces additional complexity and slows down the execution.

Unfortunately, threading conflicts with certain parallel backends used during explicit parallelization, causing the system to be overutilized in the best case and causing hangs or segfaults in the worst case. For this reason, we introduced the convention that implicit parallelization is turned off by default. Hyperparameters that control the number of threads are tagged with the label "threads". Currently, controlling the number of threads is possible for some learners and filters from the mlr3filters package:

library("mlr3learners") # for the ranger learner

learner = lrn("classif.ranger")
learner$param_set$ids(tags = "threads")
[1] "num.threads"

To enable parallelization for this learner, we provide the helper function set_threads():

# use 4 CPUs
set_threads(learner, n = 4)
<LearnerClassifRanger:classif.ranger>
* Model: -
* Packages: mlr3, mlr3learners, ranger
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_backward, importance, multiclass, oob_error,
twoclass, weights
# auto-detect cores on the local machine
set_threads(learner)
<LearnerClassifRanger:classif.ranger>
* Model: -
* Packages: mlr3, mlr3learners, ranger
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_backward, importance, multiclass, oob_error,
twoclass, weights
Danger

Automatic detection of the number of CPUs is sometimes flaky, and utilizing all available cores is occasionally counterproductive as overburdening the system often has negative effects on the overall runtime. The function which determines the number of CPUs for mlr3 is implemented in parallelly::availableCores() and comes with reasonable heuristics for many setups. See this blog post for some background information about the heuristic. However, there are still some scenarios where it is better to reduce the number of utilized CPUs manually:

• You want to simultaneously work on the same system, e.g., browse the web or watch a video.
• You are on a multi-user system and want to spare some resources for other users.
• You have energy-efficient CPU cores, for example, the “Icestorm” cores on a Mac M1 chip. These are comparably slower than the high-performance “Firestorm” cores and not well suited for heavy computations.
• You have linked R to a threaded BLAS implementation like OpenBLAS, and your learners make heavy use of linear algebra.

You can manually set the number of CPUs to overrule the heuristic via option "mc.cores":

options(mc.cores = 4)

We recommend setting this in your system’s .Rprofile file, cf. the help page ?Startup.

### 9.1.2 Explicit Parallelization

Here, we talk about explicit parallelization if mlr3 starts and controls the parallelization itself. For this purpose, an additional abstraction layer is used to be able to operate on a unified interface for a broad range of parallel backends: the future package. There are two operations where mlr3 calls the future package: while performing resampling via resample() and while benchmarking via benchmark(). During resampling, because all resampling iterations are independent of each other, all iterations can be executed in parallel. The same holds for benchmarking, where, in addition to the independent model fits of a single resampling, all combinations in the provided design are also independent. These iterations are performed by future using the parallel backend configured with future::plan(). Extension packages like mlr3tuning internally call benchmark() during tuning and thus work in parallel, too.

Tip

When computational problems are so easy to parallelize, they are often referred to as “embarrassingly parallel”.

Whenever you loop over elements with a map-like function (e.g., lapply(), sapply(), mapply(), vapply() or a function from package purrr), you are facing an embarrassingly parallel problem. Such problems are straightforward to parallelize with R, e.g., with the furrr package providing map-like functions executed in parallel via the future framework. The same holds for for-loops with independent iterations, i.e., loops where the current iteration does not rely on the results of previous iterations.
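As a minimal sketch of such a map-like loop, the following compares a sequential `lapply()` with `furrr::future_map()`. The function `slow_square()` is a hypothetical stand-in for an expensive, independent computation, and the choice of two workers is an assumption made for illustration.

```r
library(future)
library(furrr)

slow_square = function(x) {
  Sys.sleep(0.1) # stand-in for an expensive, independent computation
  x^2
}

# sequential map
res_seq = lapply(1:4, slow_square)

# the same map, distributed over two background R sessions
plan("multisession", workers = 2)
res_par = future_map(1:4, slow_square)

# restore sequential processing
plan("sequential")

# both variants compute the same list of results
identical(res_seq, res_par)
```

Only the call to the map function changes; the function applied to each element stays untouched, which is what makes these problems easy to parallelize.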

In this section, we will use the spam task and a simple classification tree to showcase the explicit parallelization. We use the future::multisession parallel backend that should work on all systems.

# select the multisession backend to use
future::plan("multisession")

# define objects to perform a resampling
task = tsk("spam")
learner = lrn("classif.rpart")
resampling = rsmp("cv", folds = 3)

# measure the elapsed time of the resampling
time = proc.time()[3]
rr = resample(task, learner, resampling)
diff = proc.time()[3] - time

By default, all CPUs of your machine are used unless you specify the argument workers in future::plan() (possible problems with this default have already been discussed for implicit parallelization). You should see a decrease in the reported elapsed time, but in practice, you cannot expect the runtime to fall linearly as the number of cores increases (Amdahl’s law). In contrast to threads, the technical overhead for starting workers, communicating objects, sending back results, and shutting down the workers is quite large for the multisession backend. Therefore, it is advised to only consider parallelization for resamplings where each iteration runs at least several seconds.
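To get a feeling for Amdahl's law, the theoretical speedup with n workers is 1 / ((1 - f) + f / n), where f is the fraction of the runtime that can be parallelized. The helper `amdahl_speedup()` below is a hypothetical illustration, and f = 0.9 is an arbitrary assumption; the formula also ignores any parallelization overhead, so real speedups are lower.

```r
# theoretical speedup according to Amdahl's law
amdahl_speedup = function(workers, parallel_fraction) {
  1 / ((1 - parallel_fraction) + parallel_fraction / workers)
}

workers = c(1, 2, 4, 8, 16)
round(amdahl_speedup(workers, parallel_fraction = 0.9), 2)
# [1] 1.00 1.82 3.08 4.71 6.40
```

Even with 90% of the work being parallelizable, 16 workers yield a speedup of only 6.4, and the speedup can never exceed 1 / (1 - f) = 10.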

Figure 9.1 illustrates the parallelization from the above example. From left to right:

1. The main process calls the resample() function.
2. The task is split into 3 folds.
3. The folds are passed to three workers, each fitting a model on the respective subset of the task and predicting on the left-out observations.
4. The predictions (and trained models) are communicated back to the main process which combines them into a ResampleResult.
Note

If you are transitioning from mlr, you might be used to selecting different parallelization levels, e.g., for resampling, benchmarking, or tuning. In mlr3, this is no longer required (except for nested resampling, briefly described in the following section). All kinds of experiments are rolled out on the same level. Therefore, there is no need to decide whether you want to parallelize the tuning OR the resampling.

Just lean back and let the machine do the work :-)

### 9.1.3 Reproducibility

Usually, reproducibility is a major concern during parallelization as special pseudorandom number generators (PRNGs) are required. Luckily, this problem is already solved for us by the excellent future package, which mlr3 calls under the hood. future ensures that all workers receive exactly the same PRNG streams. Although this alone does not guarantee full reproducibility, it is one problem less to worry about.

You can find more details about the used pseudo RNG in this blog post.
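The mechanism can be illustrated with the future.apply package, which builds on future. This is a sketch of the general idea and not a description of mlr3's internals; the seed value 42 and the toy function `draw()` are assumptions made for illustration.

```r
library(future)
library(future.apply)

# each element draws one random number
draw = function(i) rnorm(1)

# sequential execution with a fixed seed
plan("sequential")
res_seq = future_lapply(1:4, draw, future.seed = 42L)

# parallel execution on two workers with the same seed
plan("multisession", workers = 2)
res_par = future_lapply(1:4, draw, future.seed = 42L)

plan("sequential")

# the random numbers do not depend on the backend or the number of workers
identical(res_seq, res_par)
```

Each element gets its own pre-generated random number stream, so it does not matter which worker processes which element.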

### 9.1.4 Nested Resampling Parallelization

Nested resampling results in two nested resampling loops, and the user can choose which of them should be parallelized. Let’s consider the following example: You want to tune the minsplit argument of a classification tree using the AutoTuner of mlr3tuning (simplified version taken from the nested resampling section):

library("mlr3tuning")
Loading required package: paradox
learner = lrn("classif.rpart",
  minsplit = to_tune(2, 128, logscale = TRUE)
)

at = auto_tuner(
  method = tnr("random_search"),
  learner = learner,
  resampling = rsmp("cv", folds = 2), # inner CV
  measure = msr("classif.ce"),
  term_evals = 20
)

To evaluate the performance on an independent test set, resampling is used:

resample(
  task = tsk("penguins"), # resample() requires a task; penguins is assumed here
  learner = at,
  resampling = rsmp("cv", folds = 5) # outer CV
)
<ResampleResult> of 5 iterations
* Learner: classif.rpart.tuned
* Warnings: 0 in 0 iterations
* Errors: 0 in 0 iterations

Here, we have two opportunities to parallelize: the inner cross-validation of the auto tuner with 2 folds, or the outer cross-validation of the resampling with 5 folds. Let’s say that we have a single CPU with four cores available.

If we opt to parallelize the outer CV, all four cores would first be utilized for the computation of the first 4 resampling iterations. The computation of the fifth iteration has to wait. Depending on the parallelization backend and its scheduling strategy, it starts

1. after all four iterations have finished and the results have been collectively reported back to the main process, or
2. as soon as any one of the four workers has finished: the first worker reporting back gets the new job as soon as possible.
Note

The former method usually comes with less synchronization overhead and is best suited for short jobs with homogeneous runtimes. The latter yields better runtimes if the runtimes are heterogeneous, especially if the parallelization overhead is negligible in comparison with the runtime for the computation. E.g., for parallel::mclapply(), the behavior of the scheduler can be controlled with the mc.preschedule option. For many backends, you cannot control the scheduling. However, future allows you to first chunk jobs together, which combines multiple tasks into blocks that run sequentially on a worker, avoiding the intermediate synchronization steps.

The resulting CPU utilization of the nested resampling example on 4 CPUs is visualized in two figures:

• Figure 9.2 as an example for parallelizing the outer 5-fold cross-validation.

# Runs the outer loop in parallel and the inner loop sequentially
future::plan(list("multisession", "sequential"))

We assume that each fit during the inner resampling takes 4 seconds to compute and that there is no other significant overhead. First, each of the four workers starts with the computation of an inner 2-fold cross-validation. As there are more jobs than workers, the remaining fifth iteration of the outer resampling is queued on CPU1 once the first 4 iterations have finished after 8 seconds. During the computation of the 5th outer resampling iteration, only CPU1 is utilized while the other 3 CPUs are idling.

• Figure 9.3 as an example for parallelizing the inner 2-fold cross-validation.

# Runs the outer loop sequentially and the inner loop in parallel
future::plan(list("sequential", "multisession"))

Here, the outer loop runs sequentially and distributes the 2 computations for the inner resampling on 2 CPUs. Meanwhile, CPU3 and CPU4 are idling.

Neither possibility for parallelization exploits the full potential of the 4 CPUs. With parallelization of the outer loop, all results are computed after 16s; with parallelization of the inner loop, the results are only available after 20s.
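The two runtimes can be verified with a short calculation. The sketch below uses the same simplifying assumptions as the text (4 seconds per fit, one fit per inner fold, no overhead); all variable names are made up for illustration.

```r
fit_time = 4      # seconds per model fit (assumption from the text)
outer_folds = 5
inner_folds = 2
workers = 4

# outer loop parallel: each outer iteration runs its inner CV sequentially
time_per_outer = inner_folds * fit_time                     # 8 seconds
rounds_outer = ceiling(outer_folds / workers)               # 2 rounds
time_outer_parallel = rounds_outer * time_per_outer         # 16 seconds

# inner loop parallel: outer iterations run one after another
rounds_inner = ceiling(inner_folds / workers)               # 1 round
time_inner_parallel = outer_folds * rounds_inner * fit_time # 20 seconds

# fraction of the reserved worker time that is spent on actual model fits
total_work = outer_folds * inner_folds * fit_time           # 40 seconds
util_outer = total_work / (workers * time_outer_parallel)   # 0.625
util_inner = total_work / (workers * time_inner_parallel)   # 0.5
```

In both setups, a considerable share of the available worker time is spent idling, which motivates adapting the number of iterations to the hardware.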

If possible, the number of iterations can be adapted to the available hardware. There is no law set in stone that you have to do, e.g., 10 folds in cross-validation. If you have 4 CPUs and a reasonable variance, 8 iterations are often sufficient, or you do 12 iterations because you get the last two iterations basically for free.

Alternatively, you can also enable parallelization for both loops for nested parallelization, even on different parallelization backends. While nesting real parallelization backends is often unintended and causes unnecessary overhead, it is useful in some distributed computing setups. In this case, the number of workers must be manually tweaked so that the system does not get overburdened:

# Runs both loops in parallel
future::plan(list(
  future::tweak("multisession", workers = 2),
  future::tweak("multisession", workers = 4)
))

This example would run on 8 cores (= 2 * 4) on the local machine. The vignette of the future package gives more insight into nested parallelization. For more background information about parallelization during tuning, see Section 6.7 of Bischl et al. (2021).

Danger

During tuning with mlr3tuning, you can often adjust the batch size of the Tuner, i.e., control how many hyperparameter configurations are evaluated in parallel. If you want full parallelization, make sure that the batch size multiplied by the number of (inner) resampling iterations is at least equal to the number of available workers. If you expect homogeneous runtimes, i.e., you are tuning over a single learner or linear pipeline and you have no hyperparameter which is likely to influence the runtime, aim for a multiple of the number of workers.

In general, larger batches mean more parallelization, while smaller batches imply a more frequent evaluation of termination criteria. We default to a batch_size of 1 that ensures that all Terminators work as intended, i.e., you cannot exceed the computational budget.
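As a small worked example of this rule of thumb, the following sketch derives a batch size for random search. The number of workers and inner folds are hypothetical values chosen for illustration.

```r
library(mlr3tuning)

workers = 8       # hypothetical number of available workers
inner_folds = 2   # folds of the inner resampling

# smallest batch size so that batch_size * inner_folds >= workers
batch_size = ceiling(workers / inner_folds)

# random search evaluates batch_size configurations per batch
tuner = tnr("random_search", batch_size = batch_size)
tuner$param_set$values$batch_size
```

With these values, each batch produces 4 * 2 = 8 independent model fits, one for each worker. Keep in mind that termination criteria are only checked between batches.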

## 9.2 Error Handling

In ML, it is not uncommon for something to break. This is because the algorithms have to process arbitrary data, and not all eventualities can always be handled. While we try to identify obvious problems before execution, such as when missing values occur, but a learner can’t handle them, other problems are far more complex to detect. Examples include correlations or collinearity that make model fitting impossible, outliers that lead to numerical problems, or new levels of categorical variables emerging in the predict step. The learners behave quite differently when encountering such problems: some models signal a warning during the train step that they failed to fit but return a baseline model while other models stop the execution. During prediction, some learners just refuse to predict the response for observations they cannot handle while others predict a missing value. How to deal with these problems even in more complex setups like benchmarking or tuning is the topic of this section.

For illustration (and internal testing) of error handling, mlr3 ships with the learners classif.debug and regr.debug. Here, we will concentrate on the debug learner for classification:

task = tsk("penguins")
learner = lrn("classif.debug")
print(learner)
<LearnerClassifDebug:classif.debug>: Debug Learner for Classification
* Model: -
* Parameters: list()
* Packages: mlr3
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_forward, missings, multiclass, twoclass

This learner comes with special hyperparameters that let us simulate problems frequently encountered in ML. E.g., the debug learner comes with hyperparameters to control

1. what conditions should be signaled (message, warning, error, segfault) with what probability,
2. during which stage the conditions should be signaled (train or predict), and
3. the ratio of predictions being NA (predict_missing).
learner$param_set
<ParamSet>
                      id    class lower upper nlevels        default value
 1:        error_predict ParamDbl     0     1     Inf              0
 2:          error_train ParamDbl     0     1     Inf              0
 3:      message_predict ParamDbl     0     1     Inf              0
 4:        message_train ParamDbl     0     1     Inf              0
 5:      predict_missing ParamDbl     0     1     Inf              0
 6: predict_missing_type ParamFct    NA    NA       2             na
 7:           save_tasks ParamLgl    NA    NA       2          FALSE
 8:     segfault_predict ParamDbl     0     1     Inf              0
 9:       segfault_train ParamDbl     0     1     Inf              0
10:          sleep_train ParamUty    NA    NA     Inf <NoDefault[3]>
11:        sleep_predict ParamUty    NA    NA     Inf <NoDefault[3]>
12:              threads ParamInt     1   Inf     Inf <NoDefault[3]>
13:      warning_predict ParamDbl     0     1     Inf              0
14:        warning_train ParamDbl     0     1     Inf              0
15:                    x ParamDbl     0     1     Inf <NoDefault[3]>
16:                 iter ParamInt     1   Inf     Inf              1

With the learner’s default settings, the learner will do nothing special: The learner remembers a random label and constantly predicts this label:

task = tsk("penguins")
learner$train(task)$predict(task)$confusion
           truth
response    Adelie Chinstrap Gentoo
  Adelie         0         0      0
  Chinstrap      0         0      0
  Gentoo       152        68    124

We now set a hyperparameter to let the debug learner signal an error during the train step. By default, mlr3 does not catch conditions such as warnings or errors raised while calling learners:

# set probability to signal an error to 1
learner$param_set$values = list(error_train = 1)

learner$train(tsk("penguins"))
Error in .__LearnerClassifDebug__.train(self = self, private = private, :
  Error from classif.debug->train()

If this had been a regular learner, we could now start debugging with traceback() (or create a Minimal Reproducible Example (MRE) to file a bug report upstream).

Note

If you start debugging, make sure you have disabled parallelization to avoid various pitfalls related to parallelization. It may also be helpful to set the option mlr3.debug to TRUE. If this flag is set, mlr3 does not call into the future package, resulting in an easier-to-interpret program flow and traceback().

### 9.2.1 Encapsulation

Since ML algorithms are confronted with arbitrary, often messy data, errors are not uncommon here, and we often just need to move on during benchmarking or tuning. Thus, we need

1. a mechanism to capture all signaled conditions such as messages, warnings and errors so that we can analyze them post-hoc (called “encapsulation”, covered in this section), and
2. a statistically sound way to proceed while being able to aggregate over partial results (next Section 9.2.2).

Encapsulation ensures that signaled conditions (such as messages, warnings and errors) are intercepted: all conditions raised during the training or predict step are logged into the learner, and errors do not interrupt the program flow. I.e., the execution of the calling function or package (here: mlr3) continues as if there had been no error, though the result (fitted model during train(), predictions during predict()) is missing. Each Learner has a field encapsulate to control how the train or predict steps are wrapped.

The easiest way to encapsulate the execution is provided by the package evaluate which evaluates R expressions while tracking conditions such as outputs, messages, warnings or errors (see the documentation of the encapsulate() helper function for more details):

task = tsk("penguins")
learner = lrn("classif.debug")

# this learner throws a warning and then stops with an error during train()
learner$param_set$values = list(warning_train = 1, error_train = 1)

# enable encapsulation for train() and predict()
learner$encapsulate = c(train = "evaluate", predict = "evaluate")

learner$train(task)

After training the learner, one can access the recorded log via the fields log, warnings and errors:

learner$log
   stage   class                                 msg
1: train warning Warning from classif.debug->train()
2: train   error   Error from classif.debug->train()
learner$warnings
[1] "Warning from classif.debug->train()"
learner$errors
[1] "Error from classif.debug->train()"

Another method for encapsulation is implemented in the callr package. In contrast to evaluate, the computation is carried out in a separate R process. This guards the calling session against segfaults which otherwise would tear down the complete R session. On the downside, starting new processes comes with comparably more computational overhead.

learner$encapsulate = c(train = "callr", predict = "callr")
learner$param_set$values = list(segfault_train = 1)
learner$train(task = task)
learner$errors
[1] "callr process exited with status -11"

With either of these encapsulation methods, we can now catch errors and post-hoc analyze the messages, warnings and error messages. Unfortunately, this is only half the battle. Without a model, it is not possible to get predictions:

learner$predict(task)
Error: Cannot predict, Learner 'classif.debug' has not been trained yet

To handle the missing predictions gracefully during resample(), benchmark() or tuning, fallback learners are introduced next.

### 9.2.2 Fallback Learners

Fallback learners have the purpose of allowing scoring results in cases where a Learner failed to fit a model, refuses to provide predictions for all observations or predicts missing values.

We will first handle the case that a learner fails to fit a model during training, e.g., if some convergence criterion is not met or the learner ran out of memory. There are in general three possibilities to proceed:

1. Ignore missing scores. Although this is arguably the most frequent approach in practice, it is not statistically sound. For example, consider the case where a researcher wants a specific learner to look better in a benchmark study. To do this, the researcher takes an existing learner but introduces a small adaptation: If an internal goodness-of-fit measure is not achieved, an error is thrown. In other words, the learner only fits a model if the model can be reasonably well learned on the given training data. In comparison with the learning procedure without this adaptation and a good threshold, however, we now compare the mean over only the “easy” splits with the mean over all splits - an unfair advantage.
2. Penalize failing learners. If a score is missing, we can simply impute the worst possible score (as defined by the Measure) and thereby heavily penalize the learner for failing. However, this often seems too harsh for many problems, and for some measures there is no reasonable value to impute.
3. Impute a value that corresponds to a (weak) baseline. Instead of imputing with the worst possible score, impute with a reasonable baseline, e.g., by just predicting the majority class or the mean of the response in the training data. Such simple baselines are implemented as featureless learners (mlr_learners_classif.featureless or mlr_learners_regr.featureless). Note that a reasonable baseline value is different in different training splits. Retrieving these values after a larger benchmark study has been conducted is possible, but tedious.

We strongly recommend option (3): it is statistically sound and very flexible. To make this procedure very convenient during resampling and benchmarking, we support fitting a proper baseline with a fallback learner. In the next example, we attach a simple featureless learner as a fallback to the debug learner. So whenever the debug learner fails (which is every single time with the given parametrization) and encapsulation is enabled, mlr3 falls back to the predictions of the featureless learner internally:

task = tsk("penguins")

learner = lrn("classif.debug")
learner$param_set$values = list(error_train = 1)
learner$fallback = lrn("classif.featureless")
learner$train(task)
learner
<LearnerClassifDebug:classif.debug>: Debug Learner for Classification
* Model: -
* Parameters: error_train=1
* Packages: mlr3
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_forward, missings, multiclass, twoclass
* Errors: Error from classif.debug->train()

Note that we don’t have to enable encapsulation explicitly; it is automatically set to "evaluate" for the training and the predict step while setting a fallback learner for a learner without encapsulation enabled. Furthermore, the log contains the captured error (which is also included in the print output), and although we don’t have a model, we can still get predictions:

learner$model
NULL
prediction = learner$predict(task)
prediction$score()
classif.ce
 0.5581395

In this stepwise train-predict procedure, the fallback learner is of limited use. However, it is invaluable for larger benchmark studies. In the following snippet, we compare the previously created debug learner with a simple classification tree. We re-parametrize the debug learner to fail in roughly 30% of the resampling iterations during the training step:

learner$param_set$values = list(error_train = 0.3)

bmr = benchmark(benchmark_grid(tsk("penguins"), list(learner, lrn("classif.rpart")), rsmp("cv")))
aggr = bmr$aggregate(conditions = TRUE)
aggr[, .(learner_id, warnings, errors, classif.ce)]
      learner_id warnings errors classif.ce
1: classif.debug        0      8 0.56915966
2: classif.rpart        0      0 0.06403361

Even though the debug learner occasionally failed to provide predictions, we still got a statistically sound aggregated performance value which we can compare to the aggregated performance of the classification tree. It is also possible to split the benchmark up into separate ResampleResult objects which sometimes helps to get more context. E.g., if we only want to have a closer look into the debug learner, we can extract the errors from the corresponding resample results:

rr = aggr[learner_id == "classif.debug"]$resample_result[[1L]]
rr$errors
   iteration                               msg
1:         1 Error from classif.debug->train()
2:         4 Error from classif.debug->train()
3:         5 Error from classif.debug->train()
4:         6 Error from classif.debug->train()
5:         7 Error from classif.debug->train()
6:         8 Error from classif.debug->train()
7:         9 Error from classif.debug->train()
8:        10 Error from classif.debug->train()

A similar problem to failed model fits emerges when a learner predicts only a subset of the observations in the test set (and predicts NA or no value for others). A typical case is, e.g., when new and unseen factor levels are encountered in the test data. Imagine again that our goal is to benchmark two algorithms using cross-validation on some binary classification task:

• Algorithm A is an ordinary logistic regression.
• Algorithm B is also an ordinary logistic regression, but with a twist: If the logistic regression is rather certain about the predicted label (> 90% probability), it returns the label and returns a missing value otherwise.

Clearly, at its core, this is the same problem as outlined before. Algorithm B would easily outperform algorithm A, but you have not factored in that you cannot generate predictions for all observations. Long story short, if a fallback learner is involved, missing predictions of the base learner will be automatically replaced with predictions from the fallback learner. This is illustrated in the following example:

task = tsk("penguins")
learner = lrn("classif.debug")

# this hyperparameter sets the ratio of missing predictions
learner$param_set$values = list(predict_missing = 0.5)

# without fallback
p = learner$train(task)$predict(task)
table(p$response, useNA = "always")

   Adelie Chinstrap    Gentoo      <NA>
      172         0         0       172

# with fallback
learner$fallback = lrn("classif.featureless")
p = learner$train(task)$predict(task)
table(p$response, useNA = "always")

   Adelie Chinstrap    Gentoo      <NA>
      172         0       172         0

Summed up, by combining encapsulation and fallback learners, it is possible to benchmark even quite unreliable or unstable learning algorithms in a convenient and statistically sound fashion.

### 9.2.3 Actionable Errors

All problems demonstrated so far are artificial and non-actionable. The usefulness of encapsulation and error logging usually only really becomes apparent in large benchmarks, especially in combination with parallelization. For a fair comparison, you need to distinguish between the following cases:

1. You have made a mistake, e.g., forgot a required preprocessing step in your pipeline. Action: Fix problems, restart computation.
2. Temporary problems related to the executing system, e.g., network hiccups. Action: Restart computation.
3. Intrinsic, deterministic and reproducible problem with the model fitting. Action: Impute with fallback learner.

The package mlr3batchmark provides functionality to map jobs of a benchmark to computational jobs for the package batchtools. This provides a convenient way to get fine-grained control over the execution of each single resampling iteration and then combine the results afterwards to a BenchmarkResult again to proceed with the analysis.

## 9.3 Data Backends

In mlr3, Tasks store their data in an abstract data object, the DataBackend. A backend provides a unified API to retrieve subsets of the data or query information about it, regardless of how the data is actually stored. The default backend uses data.table via the DataBackendDataTable as a very fast and efficient in-memory database.

For example, we can query some information of the mlr_tasks_penguins task:

task = tsk("penguins")
backend = task$backend
backend$nrow
[1] 344
backend$ncol
[1] 9
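
The same interface is available for any data.frame. As a minimal sketch (assuming mlr3 is installed; the built-in iris data and the variable names are purely illustrative), as_data_backend() wraps a data.frame in an in-memory backend, and $data() retrieves only the requested rows and columns:

```r
library("mlr3")

# wrap a plain data.frame; if no primary key is given, a key column is added
toy_backend = as_data_backend(iris)

# the reported dimensions include the automatically added key column
n_rows = toy_backend$nrow
n_cols = toy_backend$ncol
print(c(n_rows, n_cols))

# retrieve a subset: rows are addressed by primary key, columns by name
toy_subset = toy_backend$data(rows = 1:3, cols = c("Sepal.Length", "Species"))
print(toy_subset)

# column names and the first rows of the backend
print(toy_backend$colnames)
print(toy_backend$head(2))
```

Because learners only ever talk to this interface, the storage behind it can be swapped without changing the modelling code.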

For bigger data, or when working with many tasks simultaneously in the same R session, it can be necessary to interface out-of-memory data to reduce the memory requirements. This way, only the part of the data which is currently required by the learners will be placed in the main memory to operate on. There are multiple options to achieve this:

1. DataBackendDplyr which interfaces the R package dbplyr, extending dplyr to work on many popular databases like MariaDB, PostgreSQL or SQLite.
2. DataBackendDuckDB for the impressive DuckDB database connected via duckdb: a fast, zero-configuration alternative to SQLite.
3. DataBackendDuckDB, again, but for Parquet files. The data does not need to be converted to DuckDB’s native storage format, you can work directly on directories containing one or multiple files stored in the popular Parquet format.

### 9.3.1 Databases with DataBackendDplyr

To demonstrate the DataBackendDplyr we use the NYC flights data set from the nycflights13 package and move it into a SQLite database. Although as_sqlite_backend() provides a convenient function to perform this step, we construct the database manually here.

# load data
requireNamespace("DBI")
requireNamespace("RSQLite")
requireNamespace("nycflights13")
data("flights", package = "nycflights13")
str(flights)
tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
 $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
 $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
 $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
 $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
 $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
 $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
 $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
 $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
 $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
 $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
 $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
 $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

# add column of unique row ids
flights$row_id = 1:nrow(flights)

# create sqlite database in temporary file
path = tempfile("flights", fileext = ".sqlite")
con = DBI::dbConnect(RSQLite::SQLite(), path)
tbl = DBI::dbWriteTable(con, "flights", as.data.frame(flights))
DBI::dbDisconnect(con)

# remove in-memory data
rm(flights)

With the SQLite database stored in file path, we now re-establish a connection and switch to dplyr/dbplyr for some essential preprocessing.

# establish connection
con = DBI::dbConnect(RSQLite::SQLite(), path)

# select the "flights" table, enter dplyr
library("dplyr")

Attaching package: 'dplyr'
The following objects are masked from 'package:data.table':

between, first, last
The following objects are masked from 'package:stats':

filter, lag
The following objects are masked from 'package:base':

intersect, setdiff, setequal, union
library("dbplyr")

Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':

ident, sql
tbl = tbl(con, "flights")

First, we select a subset of columns to work on:

keep = c("row_id", "year", "month", "day", "hour", "minute", "dep_time",
"arr_time", "carrier", "flight", "air_time", "distance", "arr_delay")
tbl = select(tbl, all_of(keep))

Additionally, we remove those observations where the arrival delay (arr_delay) has a missing value:

tbl = filter(tbl, !is.na(arr_delay))

To keep runtime reasonable for this toy example, we filter the data to only use every second row:

tbl = filter(tbl, row_id %% 2 == 0)

The factor levels of the feature carrier are merged so that infrequent carriers are replaced by level “other”:

tbl = mutate(tbl, carrier = case_when(
carrier %in% c("OO", "HA", "YV", "F9", "AS", "FL", "VX", "WN") ~ "other",
TRUE ~ carrier))
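
The dplyr verbs used above do not touch the data immediately: dbplyr translates them into SQL and only runs the query when the result is requested. The following sketch illustrates this on a small made-up table (the table name toy and its values are assumptions for illustration only; DBI, RSQLite, dplyr and dbplyr need to be installed):

```r
library("dplyr")

# in-memory SQLite database with a small illustrative table
toy_con = DBI::dbConnect(RSQLite::SQLite(), ":memory:")
toy = data.frame(
  row_id = 1:10,
  arr_delay = c(NA, 5, -3, NA, 12, 0, 7, -1, 4, 9)
)
DBI::dbWriteTable(toy_con, "toy", toy)

# build the query step by step; nothing is executed yet
lazy = tbl(toy_con, "toy")
lazy = filter(lazy, !is.na(arr_delay))
lazy = filter(lazy, row_id %% 2 == 0)

# inspect the SQL that would be sent to the database
show_query(lazy)

# collect() runs the query and pulls the result into memory
toy_result = collect(lazy)
print(toy_result)

DBI::dbDisconnect(toy_con)
```

This laziness is what allows a backend built on such a table to fetch only the rows and columns a learner actually asks for.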

Next, the processed table is used to create a mlr3db::DataBackendDplyr from mlr3db:

library("mlr3db")
b = as_data_backend(tbl, primary_key = "row_id")

We can now use the interface of DataBackend to query some basic information about the data:

b$nrow
[1] 163707
b$ncol
[1] 13
b$head()
   row_id year month day hour minute dep_time arr_time carrier flight air_time
1:      2 2013     1   1    5     29      533      850      UA   1714      227
2:      4 2013     1   1    5     45      544     1004      B6    725      183
3:      6 2013     1   1    5     58      554      740      UA   1696      150
4:      8 2013     1   1    6      0      557      709      EV   5708       53
5:     10 2013     1   1    6      0      558      753      AA    301      138
6:     12 2013     1   1    6      0      558      853      B6     71      158
2 variables not shown: [distance, arr_delay]

Note that the DataBackendDplyr does not know about any rows or columns we have filtered out with dplyr before; it just operates on the view we provided.

As we now have constructed a backend, we can switch over to mlr3 for model fitting and create the following mlr3 objects:

task = as_task_regr(b, id = "flights_sqlite", target = "arr_delay")
learner = lrn("regr.rpart")
measures = mlr_measures$mget(c("regr.mse", "time_train", "time_predict"))
resampling = rsmp("subsampling", repeats = 3, ratio = 0.02)

We pass all these objects to resample() to perform a simple resampling with three iterations. In each iteration, only the required subset of the data is queried from the SQLite database and passed to rpart::rpart():

rr = resample(task, learner, resampling)
print(rr)
<ResampleResult> of 3 iterations
* Learner: regr.rpart
* Warnings: 0 in 0 iterations
* Errors: 0 in 0 iterations
rr$aggregate(measures)
    regr.mse   time_train time_predict
    1246.077        0.291        1.921

Note that we still have an active connection to the database. To properly close it, we remove the tbl object referencing the connection and then close the connection.

rm(tbl)
DBI::dbDisconnect(con)

### 9.3.2 Parquet Files with DataBackendDuckDB

While storing the Task’s data in memory is most efficient w.r.t. accessing it for model fitting, this has two major disadvantages:

1. Although you might only need a small proportion of the data, the complete data frame sits in and consumes memory. This is especially a problem if you work with many tasks simultaneously.
2. During parallelization, the complete data needs to be transferred to the workers, which can cause significant overhead.

A very simple way to avoid this is given by just converting the DataBackendDataTable to a DataBackendDuckDB. We have already demonstrated how to operate on a SQLite database, and DuckDB is no different in that regard. To convert a data.frame to DuckDB, we provide the helper function as_duckdb_backend(). Only two arguments are required: the data.frame to convert, and a path to store the data.

While this is useful when working with many tasks simultaneously in order to keep the memory requirements reasonable, the more frequent use case for DuckDB nowadays is Parquet files. Parquet is a popular column-oriented data storage format supporting efficient compression, making it far superior to other popular data exchange formats such as CSV.

To demonstrate working with Parquet files, we first query the location of an example data set shipped with mlr3db:

path = system.file(file.path("extdata", "spam.parquet"), package = "mlr3db")

We can then create a DataBackendDuckDB based on this file and convert the backend to a classification task, all without loading the dataset into memory:

backend = as_duckdb_backend(path)
task = as_task_classif(backend, target = "type")
print(task)
<TaskClassif:backend> (4601 x 58)
* Target: type
* Properties: twoclass
* Features (57):
  - dbl (57): address, addresses, all, business, capitalAve, capitalLong, capitalTotal, charDollar, charExclamation, charHash, charRoundbracket, charSemicolon, charSquarebracket, conference, credit, cs, data, direct, edu, email, font, free, george, hp, hpl, internet, lab, labs, mail, make, meeting, money, num000, num1999, num3d, num415, num650, num85, num857, order, original, our, over, parts, people, pm, project, re, receive, remove, report, table, technology, telnet, will, you, your

Accessing the data internally triggers a query and data is fetched to be stored in an in-memory data.frame, but only the required subsets.

## 9.4 Parameters (using paradox)

Note

We are currently revising the book. This section contains a lot of legacy code. You can find the new paradox syntax in the Defining Tuning Spaces section (Section 9.4.5).

The paradox package offers a language for the description of parameter spaces, as well as tools for useful operations on these parameter spaces.
A parameter space is often useful when describing:

• A set of sensible input values for an R function
• The set of possible values that slots of a configuration object can take
• The search space of an optimization process

The tools provided by paradox therefore relate to:

• Parameter checking: Verifying that a set of parameters satisfies the conditions of a parameter space
• Parameter sampling: Generating parameter values that lie in the parameter space for systematic exploration of program behavior depending on these parameters

paradox is, by nature, an auxiliary package that derives its usefulness from other packages that make use of it. It is heavily utilized in other mlr-org packages such as mlr3, mlr3pipelines, and mlr3tuning.

### 9.4.1 Reference Based Objects

paradox is the spiritual successor to the ParamHelpers package and was written from scratch using the R6 class system. The most important consequence of this is that all objects created in paradox are “reference-based”, unlike most other objects in R. When a change is made to a ParamSet object, for example by adding a parameter using the $add() function, all variables that point to this ParamSet will contain the changed object. To create an independent copy of a ParamSet, the $clone() method needs to be used:

library("paradox")
ps = ParamSet$new()
ps2 = ps
ps3 = ps$clone(deep = TRUE)
print(ps) # the same for ps2 and ps3
<ParamSet>
Empty.
ps$add(ParamLgl$new("a"))
print(ps) # ps was changed
<ParamSet>
   id    class lower upper nlevels        default value
1:  a ParamLgl    NA    NA       2 <NoDefault[3]>
print(ps2) # contains the same reference as ps
<ParamSet>
   id    class lower upper nlevels        default value
1:  a ParamLgl    NA    NA       2 <NoDefault[3]>
print(ps3) # is a "clone" of the old (empty) ps
<ParamSet>
Empty.

### 9.4.2 Defining a Parameter Space

#### 9.4.2.1 Single Parameters

The basic building block for describing parameter spaces is the Param class. It represents a single parameter, which usually can take a single atomic value. Consider, for example, trying to configure the rpart package’s rpart.control object. It has various components (minsplit, cp, …) that all take a single value, and that would all be represented by a different instance of a Param object.

The Param class has various subclasses that represent different value types:

A particular instance of a parameter is created by calling the attached $new() function:

library("paradox")
parA = ParamLgl$new(id = "A")
parB = ParamInt$new(id = "B", lower = 0, upper = 10, tags = c("tag1", "tag2"))
parC = ParamDbl$new(id = "C", lower = 0, upper = 4, special_vals = list(NULL))
parD = ParamFct$new(id = "D", levels = c("x", "y", "z"), default = "y")
parE = ParamUty$new(id = "E", custom_check = function(x) checkmate::checkFunction(x))

Every parameter has the following fields, of which only id must be given (parA above sets nothing else):

• id - A name for the parameter within the parameter set
• default - A default value
• special_vals - A list of values that are accepted even if they do not conform to the type
• tags - Tags that can be used to organize parameters

The numeric (Int and Dbl) parameters furthermore allow for specification of a lower and upper bound. Meanwhile, the Fct parameter must be given a vector of levels that define the possible states its parameter can take. The Uty parameter can also have a custom_check function that must return TRUE when a value is acceptable and may return a character(1) error description otherwise. The example above defines parE as a parameter that only accepts functions.

All values which are given to the constructor are then accessible from the object for inspection using $. Although all these values can be changed for a parameter after construction, this can be a bad idea and should be avoided when possible. Instead, a new parameter should be constructed. Besides the possible values that can be given to a constructor, there are also the $class, $nlevels, $is_bounded, $has_default, $storage_type, $is_number and $is_categ slots that give information about a parameter. A list of all slots can be found in ?Param.

parB$lower
[1] 0
parA$levels
[1]  TRUE FALSE
parE$class
[1] "ParamUty"

It is also possible to get all information of a Param as data.table by calling as.data.table().

as.data.table(parA)
   id    class lower upper      levels nlevels is_bounded special_vals
1:  A ParamLgl    NA    NA  TRUE,FALSE       2       TRUE    <list[0]>
3 variables not shown: [default, storage_type, tags]
##### 9.4.2.1.1 Type / Range Checking

A Param object offers the possibility to check whether a value satisfies its condition, i.e. is of the right type, and also falls within the range of allowed values, using the $test(), $check(), and $assert() functions. test() should be used within conditional checks and returns TRUE or FALSE, while check() returns an error description when a value does not conform to the parameter (and thus plays well with the checkmate::assert() function). assert() will throw an error whenever a value does not fit.

parA$test(FALSE)
[1] TRUE
parA$test("FALSE")
[1] FALSE
parA$check("FALSE")
[1] "Must be of type 'logical flag', not 'character'"

Instead of testing single parameters, it is often more convenient to check a whole set of parameters using a ParamSet.

#### 9.4.2.2 Parameter Sets

The ordered collection of parameters is handled in a ParamSet. (Although the name is suggestive of a “Set”-valued Param, this is unrelated to the other objects that follow the ParamXxx naming scheme.) It is initialized using the $new() function and optionally takes a list of Params as argument. Parameters can also be added to the constructed ParamSet using the $add() function. It is even possible to add whole ParamSets to other ParamSets.

ps = ParamSet$new(list(parA, parB))
ps$add(parC)
ps$add(ParamSet$new(list(parD, parE)))
print(ps)
<ParamSet>
id    class lower upper nlevels        default value
1:  A ParamLgl    NA    NA       2 <NoDefault[3]>
2:  B ParamInt     0    10      11 <NoDefault[3]>
3:  C ParamDbl     0     4     Inf <NoDefault[3]>
4:  D ParamFct    NA    NA       3              y
5:  E ParamUty    NA    NA     Inf <NoDefault[3]>      

The individual parameters can be accessed through the $params slot. It is also possible to get information about all parameters in a vectorized fashion using mostly the same slots as for individual Params (i.e. $class, $levels etc.), see ?ParamSet for details. It is possible to reduce ParamSets using the $subset method. Be aware that it modifies a ParamSet in-place, so a “clone” must be created first if the original ParamSet should not be modified.

psSmall = ps$clone()
psSmall$subset(c("A", "B", "C"))
print(psSmall)
<ParamSet>
id    class lower upper nlevels        default value
1:  A ParamLgl    NA    NA       2 <NoDefault[3]>
2:  B ParamInt     0    10      11 <NoDefault[3]>
3:  C ParamDbl     0     4     Inf <NoDefault[3]>      

Just as for Params, and much more useful, it is possible to get the ParamSet as a data.table using as.data.table(). This makes it easy to subset parameters on certain conditions and aggregate information about them, using the variety of methods provided by data.table.

as.data.table(ps)
   id    class lower upper      levels nlevels is_bounded special_vals
1:  A ParamLgl    NA    NA  TRUE,FALSE       2       TRUE    <list[0]>
2:  B ParamInt     0    10                  11       TRUE    <list[0]>
3:  C ParamDbl     0     4                 Inf       TRUE    <list[1]>
4:  D ParamFct    NA    NA       x,y,z       3       TRUE    <list[0]>
5:  E ParamUty    NA    NA                 Inf      FALSE    <list[0]>
3 variables not shown: [default, storage_type, tags]
##### 9.4.2.2.1 Type / Range Checking

Similar to individual Params, the ParamSet provides $test(), $check() and $assert() functions that allow for type and range checking of parameters. Their argument must be a named list with values that are checked against the respective parameters. It is possible to check only a subset of parameters.

ps$check(list(A = TRUE, B = 0, E = identity))
[1] TRUE
ps$check(list(A = 1))
[1] "A: Must be of type 'logical flag', not 'double'"
ps$check(list(Z = 1))
[1] "Parameter 'Z' not available. Did you mean 'A' / 'B' / 'C'?"
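
The difference between the three functions is easiest to see side by side. The sketch below is an illustration under assumptions: it builds a small space with the ps() shortcut described in Section 9.4.5 instead of the legacy $new() constructors, and the names toy_space, msg and res are hypothetical.

```r
library("paradox")

# small illustrative space: one logical and one bounded integer parameter
toy_space = ps(A = p_lgl(), B = p_int(lower = 0, upper = 10))

# $test() returns a plain TRUE/FALSE, convenient inside if () conditions
ok = toy_space$test(list(A = TRUE, B = 3))
bad = toy_space$test(list(B = 11))

# $check() returns TRUE on success and an error description otherwise
msg = toy_space$check(list(B = 11))
print(msg)

# $assert() throws an error for invalid values; catch it for demonstration
res = tryCatch(
  {
    toy_space$assert(list(B = 11))
    "no error"
  },
  error = function(e) "assert failed"
)
print(res)
```

In interactive use, $check() is usually the most informative of the three, while $assert() is the natural choice at the top of a function that must not continue with invalid input.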
##### 9.4.2.2.2 Values in a ParamSet

Although a ParamSet fundamentally represents a value space, it also has a slot $values that can contain a point within that space. This is useful because many things that define a parameter space need similar operations (like parameter checking) that can be simplified. The $values slot contains a named list that is always checked against parameter constraints. When trying to set parameter values, e.g. for mlr3 Learners, it is the $values slot of its $param_set that needs to be used.

ps$values = list(A = TRUE, B = 0)
ps$values$B = 1
print(ps$values)
$A
[1] TRUE

$B
[1] 1

The parameter constraints are automatically checked:

ps$values$B = 100
Error in self$assert(xs): Assertion on 'xs' failed: B: Element 1 is not <= 10.

##### 9.4.2.2.3 Dependencies

It is often the case that certain parameters are irrelevant or should not be given depending on values of other parameters. An example would be a parameter that switches a certain algorithm feature (for example regularization) on or off, combined with another parameter that controls the behavior of that feature (e.g. a regularization parameter). The second parameter would be said to depend on the first parameter having the value TRUE.

A dependency can be added using the $add_dep method, which takes both the ids of the “depender” and “dependee” parameters as well as a Condition object. The Condition object represents the check to be performed on the “dependee”. Currently it can be created using CondEqual$new() and CondAnyOf$new(). Multiple dependencies can be added, and parameters that depend on others can again be depended on, as long as no cyclic dependencies are introduced.

The consequences of dependencies are twofold: For one, the $check(), $test() and $assert() tests will not accept the presence of a parameter if its dependency is not met. Furthermore, when sampling or creating grid designs from a ParamSet, the dependencies will be respected (see Parameter Sampling, in particular Hierarchical Sampler).

The following example makes parameter D depend on parameter A being FALSE, and parameter B depend on parameter D being one of "x" or "y". This introduces an implicit dependency of B on A being FALSE as well, because D does not take any value if A is TRUE.

ps$add_dep("D", "A", CondEqual$new(FALSE))
ps$add_dep("B", "D", CondAnyOf$new(c("x", "y")))
ps$check(list(A = FALSE, D = "x", B = 1))          # OK: all dependencies met
[1] TRUE
ps$check(list(A = FALSE, D = "z", B = 1))          # B's dependency is not met
[1] "The parameter 'B' can only be set if the following condition is met 'D ∈ {x, y}'. Instead the current parameter value is: D=z"
ps$check(list(A = FALSE, B = 1))                   # B's dependency is not met
[1] "The parameter 'B' can only be set if the following condition is met 'D ∈ {x, y}'. Instead the parameter value for 'D' is not set at all. Try setting 'D' to a value that satisfies the condition"
ps$check(list(A = FALSE, D = "z"))                 # OK: B is absent
[1] TRUE
ps$check(list(A = TRUE))                           # OK: neither B nor D present
[1] TRUE
ps$check(list(A = TRUE, D = "x", B = 1))           # D's dependency is not met
[1] "The parameter 'D' can only be set if the following condition is met 'A = FALSE'. Instead the current parameter value is: A=TRUE"
ps$check(list(A = TRUE, B = 1))                    # B's dependency is not met
[1] "The parameter 'B' can only be set if the following condition is met 'D ∈ {x, y}'. Instead the parameter value for 'D' is not set at all. Try setting 'D' to a value that satisfies the condition"
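
The same dependency structure can also be written with the ps() shortcut and the depends argument of the domain constructors (see Section 9.4.5). The following is a sketch under that assumption rather than a replacement for the code above; dep_space and the result variables are hypothetical names.

```r
library("paradox")

# D is only allowed when A is FALSE; B only when D is "x" or "y"
dep_space = ps(
  A = p_lgl(),
  D = p_fct(c("x", "y", "z"), depends = A == FALSE),
  B = p_int(0, 10, depends = D %in% c("x", "y"))
)

# all dependencies met
all_met = dep_space$check(list(A = FALSE, D = "x", B = 1))

# B is given although D = "z" does not satisfy B's condition
b_unmet = dep_space$check(list(A = FALSE, D = "z", B = 1))

# D is given although A is TRUE
d_unmet = dep_space$check(list(A = TRUE, D = "x"))

# neither B nor D present, so no dependency is violated
none_given = dep_space$check(list(A = TRUE))

print(all_met)
print(b_unmet)
print(d_unmet)
print(none_given)
```

Declaring dependencies next to the parameter they restrict keeps the definition in one place, which tends to be easier to read than a sequence of separate $add_dep() calls.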

Internally, the dependencies are represented as a data.table, which can be accessed via the $deps slot. This data.table can even be mutated, e.g. to remove dependencies. No sanity checks are done when the $deps slot is changed this way, so caution is advised.

ps$deps
   id on           cond
1:  D  A <CondEqual[9]>
2:  B  D <CondAnyOf[9]>

#### 9.4.2.3 Vector Parameters

Unlike in the old ParamHelpers package, there are no more vectorial parameters in paradox. Instead, it is now possible to create multiple copies of a single parameter using the $rep function. This creates a ParamSet consisting of multiple copies of the parameter, which can then (optionally) be added to another ParamSet.

ps2d = ParamDbl$new("x", lower = 0, upper = 1)$rep(2)
print(ps2d)
<ParamSet>
id    class lower upper nlevels        default value
1: x_rep_1 ParamDbl     0     1     Inf <NoDefault[3]>
2: x_rep_2 ParamDbl     0     1     Inf <NoDefault[3]>      
ps$add(ps2d)
print(ps)
<ParamSet>
        id    class lower upper nlevels        default parents value
1:       A ParamLgl    NA    NA       2 <NoDefault[3]>          TRUE
2:       B ParamInt     0    10      11 <NoDefault[3]>       D     1
3:       C ParamDbl     0     4     Inf <NoDefault[3]>
4:       D ParamFct    NA    NA       3              y       A
5:       E ParamUty    NA    NA     Inf <NoDefault[3]>
6: x_rep_1 ParamDbl     0     1     Inf <NoDefault[3]>
7: x_rep_2 ParamDbl     0     1     Inf <NoDefault[3]>

It is also possible to use a ParamUty to accept vectorial parameters, which also works for parameters of variable length. A ParamSet containing a ParamUty can be used for parameter checking, but not for sampling. To sample values for a method that needs a vectorial parameter, it is advised to use a parameter transformation function that creates a vector from atomic values.

Assembling a vector from repeated parameters is aided by the parameter’s $tags: Parameters that were generated by the $rep() command automatically get tagged as belonging to a group of repeated parameters.

ps$tags
$A
character(0)

$B
[1] "tag1" "tag2"

$C
character(0)

$D
character(0)

$E
character(0)

$x_rep_1
[1] "x_rep"

sampA$sample(5)
<Design> with 5 rows:
       A
1:  TRUE
2:  TRUE
3: FALSE
4: FALSE
5: FALSE

##### 9.4.3.4.2 Hierarchical Sampler

The SamplerHierarchical sampler is an auxiliary sampler that combines many 1D-Samplers to get a combined distribution. Its name “hierarchical” implies that it is able to respect parameter dependencies. This means that parameters only get sampled when their dependencies are met.

The following example shows how this works: The Int parameter B depends on the Lgl parameter A being TRUE. A is sampled to be TRUE in about half the cases, in which case B takes a value between 0 and 10. In the cases where A is FALSE, B is set to NA.

psSmall$add_dep("B", "A", CondEqual$new(TRUE))
sampH = SamplerHierarchical$new(psSmall,
  list(Sampler1DCateg$new(parA),
    Sampler1DUnif$new(parB),
    Sampler1DUnif$new(parC))
)
sampled = sampH$sample(1000)
table(sampled$data[, c("A", "B")], useNA = "ifany")
       B
A         0   1   2   3   4   5   6   7   8   9  10 <NA>
  FALSE   0   0   0   0   0   0   0   0   0   0   0  493
  TRUE   51  49  50  42  47  47  44  52  35  49  41    0

##### 9.4.3.4.3 Joint Sampler

Another way of combining samplers is the SamplerJointIndep. SamplerJointIndep also makes it possible to combine Samplers that are not 1D. However, SamplerJointIndep currently cannot handle ParamSets with dependencies.

sampJ = SamplerJointIndep$new(
list(Sampler1DUnif$new(ParamDbl$new("x", 0, 1)),
Sampler1DUnif$new(ParamDbl$new("y", 0, 1)))
)
sampJ$sample(5)
<Design> with 5 rows:
           x          y
1: 0.9980689 0.29325189
2: 0.5696531 0.14916164
3: 0.7999309 0.87583963
4: 0.9762024 0.06155586
5: 0.4334808 0.60938294

##### 9.4.3.4.4 SamplerUnif

The Sampler used in generate_design_random() is the SamplerUnif sampler, which corresponds to a SamplerHierarchical of Sampler1DUnif for all parameters.

### 9.4.4 Parameter Transformation

While the different Samplers allow for a wide specification of parameter distributions, there are cases where the simplest way of getting a desired distribution is to sample parameters from a simple distribution (such as the uniform distribution) and then transform them. This can be done by assigning a function to the $trafo slot of a ParamSet. The $trafo function is called with two parameters:

• The list of parameter values to be transformed as x
• The ParamSet itself as param_set

The $trafo function must return a list of transformed parameter values.

The transformation is performed when calling the $transpose function of the Design object returned by a Sampler with the trafo argument set to TRUE (the default). The following, for example, creates a parameter that is exponentially distributed:

psexp = ParamSet$new(list(ParamDbl$new("par", 0, 1)))
psexp$trafo = function(x, param_set) {
x$par = -log(x$par)
x
}
design = generate_design_random(psexp, 2)
print(design)
<Design> with 2 rows:
par
1: 0.07302144
2: 0.15870426
design$transpose() # trafo is TRUE
[[1]]
[[1]]$par
[1] 2.617002


[[2]]
[[2]]$par
[1] 1.840713

Compare this to $transpose() without transformation:

design$transpose(trafo = FALSE)
[[1]]
[[1]]$par
[1] 0.07302144


[[2]]
[[2]]$par
[1] 0.1587043

#### 9.4.4.1 Transformation between Types

Usually the design created with one ParamSet is then used to configure other objects that themselves have a ParamSet which defines the values they take. The ParamSets which can be used for random sampling, however, are restricted in some ways: They must have finite bounds, and they may not contain “untyped” (ParamUty) parameters. $trafo provides the glue for these situations. There is relatively little constraint on the trafo function’s return value, so it is possible to return values that have different bounds or even types than the original ParamSet. It is even possible to remove some parameters and add new ones.

Suppose, for example, that a certain method requires a function as a parameter. Let’s say a function that summarizes its data in a certain way. The user can pass functions like median() or mean(), but could also pass quantiles or something completely different. This method would probably use the following ParamSet:

methodPS = ParamSet$new(
  list(
    ParamUty$new("fun",
      custom_check = function(x) checkmate::checkFunction(x, nargs = 1))
  )
)
print(methodPS)
<ParamSet>
id    class lower upper nlevels        default value
1: fun ParamUty    NA    NA     Inf <NoDefault[3]>      

If one wanted to sample this method, using one of four functions, a way to do this would be:

samplingPS = ParamSet$new(
  list(
    ParamFct$new("fun", c("mean", "median", "min", "max"))
  )
)

samplingPS$trafo = function(x, param_set) {
  # x$fun is a character(1),
  # in particular one of 'mean', 'median', 'min', 'max'.
  # We want to turn it into a function!
  x$fun = get(x$fun, mode = "function")
  x
}
design = generate_design_random(samplingPS, 2)
print(design)
<Design> with 2 rows:
fun
1: max
2: min

Note that the Design only contains the column “fun” as a character column. To get a single value as a function, the $transpose function is used.

xvals = design$transpose()
print(xvals[[1]])
$fun
function (..., na.rm = FALSE)  .Primitive("max")

We can now check that it fits the requirements set by methodPS, and that fun is in fact a function:

methodPS$check(xvals[[1]])
[1] TRUE
xvals[[1]]$fun(1:10)
[1] 10

Imagine now that a different kind of parametrization of the function is desired: The user wants to give a function that selects a certain quantile, where the quantile is set by a parameter. In that case the $transpose function could generate a function in a different way. For interpretability, the parameter is called “quantile” before transformation, and the “fun” parameter is generated on the fly.

samplingPS2 = ParamSet$new(
  list(
    ParamDbl$new("quantile", 0, 1)
  )
)

samplingPS2$trafo = function(x, param_set) {
  # x$quantile is a numeric(1) between 0 and 1.
  # We want to turn it into a function!
  list(fun = function(input) quantile(input, x$quantile))
}
design = generate_design_random(samplingPS2, 2)
print(design)
<Design> with 2 rows:
     quantile
1: 0.03543698
2: 0.65563727

The Design now contains the column “quantile” that will be used by the $transpose function to create the fun parameter. We also check that it fits the requirement set by methodPS, and that it is a function.

xvals = design$transpose()
print(xvals[[1]])
$fun
function(input) quantile(input, x$quantile)
<environment: 0x5632364900c8>

methodPS$check(xvals[[1]])
[1] TRUE
xvals[[1]]$fun(1:10)
3.543698%
 1.318933

### 9.4.5 Defining Tuning Spaces

When running an optimization, it is important to inform the tuning algorithm about what hyperparameters are valid. Here the names, types, and valid ranges of each hyperparameter are important. All this information is communicated with objects of the class ParamSet, which is defined in paradox. While it is possible to create ParamSet objects using the $new constructor, it is much shorter and more readable to use the ps() shortcut, which will be presented here. For an in-depth description of paradox and its classes, see Section 9.4.

Note that ParamSet objects exist in two contexts. First, ParamSet objects are used to define the space of valid parameter settings for a learner (and other objects). Second, they are used to define a search space for tuning. We are mainly interested in the latter. For example, consider the minsplit parameter of the classif.rpart Learner. The ParamSet associated with the learner has a lower but no upper bound. However, for tuning the value, a lower and upper bound must be given because tuning search spaces need to be bounded. For Learner or PipeOp objects, typically “unbounded” ParamSets are used. Here, however, we will mainly focus on creating “bounded” ParamSets that can be used for tuning. See Section 9.4 for more details on using ParamSets to define parameter ranges for use-cases besides tuning.

#### 9.4.5.1 Creating ParamSets

An empty ParamSet (not yet very useful) can be constructed using just the ps() call:

search_space = ps()
print(search_space)
<ParamSet>
Empty.

ps takes named Domain arguments that are turned into parameters. A possible search space for the "classif.svm" learner could for example be:

search_space = ps(
cost = p_dbl(lower = 0.1, upper = 10),
kernel = p_fct(levels = c("polynomial", "radial"))
)
print(search_space)
<ParamSet>
id    class lower upper nlevels        default value
1:   cost ParamDbl   0.1    10     Inf <NoDefault[3]>
2: kernel ParamFct    NA    NA       2 <NoDefault[3]>      

There are five domain constructors that produce a parameter when given to ps():

| Constructor | Description | Is bounded? | Underlying Class |
|---|---|---|---|
| p_dbl | Real valued parameter (“double”) | When upper and lower are given | ParamDbl |
| p_int | Integer parameter | When upper and lower are given | ParamInt |
| p_fct | Discrete valued parameter (“factor”) | Always | ParamFct |
| p_lgl | Logical / Boolean parameter | Always | ParamLgl |
| p_uty | Untyped parameter | Never | ParamUty |

These domain constructors each take some of the following arguments:

• lower, upper: lower and upper bound of numerical parameters (p_dbl and p_int). These need to be given to get bounded parameter spaces valid for tuning.
• levels: Allowed categorical values for p_fct parameters. Required argument for p_fct. See below for more details on this parameter.
• trafo: transformation function, see below.
• depends: dependencies, see below.
• tags: Further information about a parameter, used for example by the hyperband tuner.
• default: Value corresponding to default behavior when the parameter is not given. Not used for tuning search spaces.
• special_vals: Valid values besides the normally accepted values for a parameter. Not used for tuning search spaces.
• custom_check: Function that checks whether a value given to p_uty is valid. Not used for tuning search spaces.

The lower and upper arguments are always in the first and second position, respectively, except for p_fct, where levels is in the first position. It is preferred to omit the argument names (e.g., upper = 10 becomes just 10), which makes the definition of a ParamSet more concise than the equivalent definition above. Preferred:

search_space = ps(cost = p_dbl(0.1, 10), kernel = p_fct(c("polynomial", "radial")))

#### 9.4.5.2 Transformations (trafo)

We can use the paradox function generate_design_grid() to look at the values that would be evaluated by grid search. (We are using rbindlist() here because the result of $transpose() is a list that is harder to read. If we didn’t use $transpose(), on the other hand, the transformations that we investigate here would not be applied.) In generate_design_grid(search_space, 3), search_space is the ParamSet argument and 3 is the specified resolution in the parameter space. The resolution for categorical parameters is ignored; these parameters always produce a grid over all of their valid levels. For numerical parameters, the endpoints of the parameter range are always included in the grid. So if there were 3 levels for the kernel instead of 2, there would be 9 rows, and if the resolution were 4 in this example, there would be 8 rows in the resulting table.

library("data.table")
rbindlist(generate_design_grid(search_space, 3)$transpose())
    cost     kernel
1:  0.10 polynomial
2:  0.10     radial
3:  5.05 polynomial
4:  5.05     radial
5: 10.00 polynomial
6: 10.00     radial

We notice that the cost parameter is taken on a linear scale. We assume, however, that the difference of cost between 0.1 and 1 should have a similar effect as the difference between 1 and 10. Therefore it makes more sense to tune it on a logarithmic scale. This is done by using a transformation (trafo). This is a function that is applied to a parameter after it has been sampled by the tuner. We can tune cost on a logarithmic scale by sampling on the linear scale [-1, 1] and computing 10^x from that value.

search_space = ps(
  cost = p_dbl(-1, 1, trafo = function(x) 10^x),
  kernel = p_fct(c("polynomial", "radial"))
)
rbindlist(generate_design_grid(search_space, 3)$transpose())
   cost     kernel
1:  0.1 polynomial
2:  0.1     radial
3:  1.0 polynomial
4:  1.0     radial
5: 10.0 polynomial
6: 10.0     radial
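The arithmetic behind the two cost grids can be reproduced in base R, without paradox. This is only a sketch that assumes an equidistant grid between the bounds; the variable names are chosen for illustration and are not part of the paradox API.

```r
# Resolution-3 grid taken directly on the original (linear) scale of cost.
linear_grid = seq(0.1, 10, length.out = 3)

# Resolution-3 grid on the exponent scale [-1, 1], mapped through 10^x.
exponent_grid = seq(-1, 1, length.out = 3)
log_grid = 10^exponent_grid

print(linear_grid)
print(log_grid)
```

On the linear scale, two of the three grid points are larger than 5. After the transformation, the middle point sits at 1, so each step multiplies cost by 10.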

It is even possible to attach another transformation to the ParamSet as a whole that gets executed after the individual parameters’ transformations have been performed. It is given through the .extra_trafo argument and should be a function with parameters x and param_set that takes a list of parameter values in x and returns a modified list. This transformation can access all parameter values of an evaluation and modify them with interactions. It is even possible to add or remove parameters. (The following is a bit of a silly example.)

search_space = ps(
  cost = p_dbl(-1, 1, trafo = function(x) 10^x),
  kernel = p_fct(c("polynomial", "radial")),
  .extra_trafo = function(x, param_set) {
    if (x$kernel == "polynomial") {
      x$cost = x$cost * 2
    }
    x
  }
)
rbindlist(generate_design_grid(search_space, 3)$transpose())
   cost     kernel
1:  0.2 polynomial
2:  0.1     radial
3:  2.0 polynomial
4:  1.0     radial
5: 20.0 polynomial
6: 10.0     radial
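To see what such a function does to a single configuration, it can be called by hand on a plain list. The sketch below uses the same function body as the .extra_trafo above; the two input lists are hypothetical configurations as they might look after the per-parameter trafo, and param_set is passed as NULL only because this particular function never uses it.

```r
# Same function body as the .extra_trafo above; param_set is accepted but unused.
extra_trafo = function(x, param_set) {
  if (x$kernel == "polynomial") {
    x$cost = x$cost * 2
  }
  x
}

# Hypothetical configurations as they would look after the per-parameter trafo.
poly_config = list(cost = 10^-1, kernel = "polynomial")
radial_config = list(cost = 10^-1, kernel = "radial")

poly_result = extra_trafo(poly_config, param_set = NULL)
radial_result = extra_trafo(radial_config, param_set = NULL)

str(poly_result)
str(radial_result)
```

Only the polynomial configuration has its cost doubled; the radial one is returned unchanged.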

The available types of search space parameters are limited: continuous, integer, discrete, and logical scalars. There are many machine learning algorithms, however, that take parameters of other types, for example vectors or functions. These cannot be defined in a search space ParamSet, and they are often given as ParamUty in the Learner’s ParamSet. When trying to tune over these hyperparameters, it is necessary to perform a transformation that changes the type of a parameter.

An example is the class.weights parameter of the Support Vector Machine (SVM), which takes a named vector of class weights with one entry for each target class. The trafo that would tune class.weights for the spam task, tsk("spam"), could be:

search_space = ps(
  class.weights = p_dbl(0.1, 0.9, trafo = function(x) c(spam = x, nonspam = 1 - x))
)
generate_design_grid(search_space, 3)$transpose()
[[1]]
[[1]]$class.weights
   spam nonspam
    0.1     0.9

[[2]]
[[2]]$class.weights
   spam nonspam
    0.5     0.5

[[3]]
[[3]]$class.weights
   spam nonspam
    0.9     0.1

(We are omitting rbindlist() in this example because it breaks the vector valued return elements.)

### 9.4.6 Automatic Factor Level Transformation

A common use-case is the necessity to specify a list of values that should all be tried (or sampled from). It may be the case that a hyperparameter accepts function objects as values and a certain list of functions should be tried. Or it may be that a choice of special numeric values should be tried. For this, the p_fct constructor’s levels argument may be a value that is not a character vector, but something else. If, for example, only the values 0.1, 3, and 10 should be tried for the cost parameter, even when doing random search, then the following search space would achieve that:

search_space = ps(
  cost = p_fct(c(0.1, 3, 10)),
  kernel = p_fct(c("polynomial", "radial"))
)
rbindlist(generate_design_grid(search_space, 3)$transpose())
   cost     kernel
1:  0.1 polynomial
2:  0.1     radial
3:  3.0 polynomial
4:  3.0     radial
5: 10.0 polynomial
6: 10.0     radial

This is equivalent to the following:

search_space = ps(
  cost = p_fct(c("0.1", "3", "10"),
    trafo = function(x) list(`0.1` = 0.1, `3` = 3, `10` = 10)[[x]]),
  kernel = p_fct(c("polynomial", "radial"))
)
rbindlist(generate_design_grid(search_space, 3)$transpose())
   cost     kernel
1:  0.1 polynomial
2:  0.1     radial
3:  3.0 polynomial
4:  3.0     radial
5: 10.0 polynomial
6: 10.0     radial

Note: Though the resolution is 3 here, in this case it doesn’t matter because both cost and kernel are factors (the resolution for categorical variables is ignored, these parameters always produce a grid over all their valid levels).

This may seem silly, but makes sense when considering that factorial tuning parameters are always character values:

search_space = ps(
  cost = p_fct(c(0.1, 3, 10)),
  kernel = p_fct(c("polynomial", "radial"))
)
typeof(search_space$params$cost$levels)
[1] "character"

Be aware that this results in an “unordered” hyperparameter, however. Tuning algorithms that make use of ordering information of parameters, like genetic algorithms or model based optimization, will perform worse when this is done. For these algorithms, it may make more sense to define a p_dbl or p_int with a more fitting trafo.

The class.weights case from above can also be implemented like this, if there are only a few candidates of class.weights vectors that should be tried. Note that the levels argument of p_fct must be named if there is no easy way for as.character() to create names:

search_space = ps(
  class.weights = p_fct(
    list(
      candidate_a = c(spam = 0.5, nonspam = 0.5),
      candidate_b = c(spam = 0.3, nonspam = 0.7)
    )
  )
)
generate_design_grid(search_space)$transpose()
[[1]]
[[1]]$class.weights
   spam nonspam
    0.5     0.5

[[2]]
[[2]]$class.weights
   spam nonspam
    0.3     0.7
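The idea of the implicit factor level trafo can be mimicked in base R as a lookup from character levels to the actual values. This is a sketch of the concept only; paradox’s internal implementation may differ, and the names numeric_lookup and vector_lookup are made up for illustration.

```r
# Numeric candidates: as.character() yields usable level names.
numeric_values = c(0.1, 3, 10)
numeric_lookup = setNames(as.list(numeric_values), as.character(numeric_values))

# Vector-valued candidates: names have to be supplied explicitly.
vector_lookup = list(
  candidate_a = c(spam = 0.5, nonspam = 0.5),
  candidate_b = c(spam = 0.3, nonspam = 0.7)
)

# A tuner only ever sees the character levels ...
print(names(numeric_lookup))
print(names(vector_lookup))

# ... and the lookup maps a sampled level back to the actual value.
cost_value = numeric_lookup[["3"]]
weights_value = vector_lookup[["candidate_b"]]
print(cost_value)
print(weights_value)
```

Because the tuner works on the level names, the numeric ordering of 0.1, 3, and 10 is not visible to it, which is why the resulting hyperparameter counts as unordered.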

#### 9.4.6.1 Parameter Dependencies (depends)

Some parameters are only relevant when another parameter has a certain value, or one of several values. The Support Vector Machine (SVM), for example, has the degree parameter that is only valid when kernel is "polynomial". This can be specified using the depends argument. It is an expression that must involve other parameters and be of the form <param> == <scalar>, <param> %in% <vector>, or multiple of these chained by &&. To tune the degree parameter, one would need to do the following:

search_space = ps(
  cost = p_dbl(-1, 1, trafo = function(x) 10^x),
  kernel = p_fct(c("polynomial", "radial")),
  degree = p_int(1, 3, depends = kernel == "polynomial")
)
rbindlist(generate_design_grid(search_space, 3)$transpose(), fill = TRUE)
    cost     kernel degree
 1:  0.1 polynomial      1
 2:  0.1 polynomial      2
 3:  0.1 polynomial      3
 4:  0.1     radial     NA
 5:  1.0 polynomial      1
 6:  1.0 polynomial      2
 7:  1.0 polynomial      3
 8:  1.0     radial     NA
 9: 10.0 polynomial      1
10: 10.0 polynomial      2
11: 10.0 polynomial      3
12: 10.0     radial     NA

#### 9.4.6.2 Creating Tuning ParamSets from other ParamSets

Having to define a tuning ParamSet for a Learner that already has parameter set information may seem unnecessarily tedious, and there is indeed a way to create tuning ParamSets from a Learner’s ParamSet, making use of as much information as already available. This is done by setting values of a Learner’s ParamSet to so-called TuneTokens, constructed with a to_tune() call. This can be done in the same way that other hyperparameters are set to specific values. It can be understood as the hyperparameters being tagged for later tuning. The resulting ParamSet used for tuning can be retrieved using the $search_space() method.

learner = lrn("classif.svm")
learner$param_set$values$kernel = "polynomial" # for example
learner$param_set$values$degree = to_tune(lower = 1, upper = 3)

print(learner$param_set$search_space())
<ParamSet>
       id    class lower upper nlevels        default value
1: degree ParamInt     1     3       3 <NoDefault[3]>
rbindlist(generate_design_grid(
  learner$param_set$search_space(), 3)$transpose()
)
   degree
1:      1
2:      2
3:      3

It is possible to omit lower here, because it can be inferred from the lower bound of the degree parameter itself. For other parameters that are already bounded, it is possible to give no bounds at all. An example is the logical shrinking hyperparameter:

learner$param_set$values$shrinking = to_tune()

print(learner$param_set$search_space())
<ParamSet>
          id    class lower upper nlevels        default value
1:    degree ParamInt     1     3       3 <NoDefault[3]>
2: shrinking ParamLgl    NA    NA       2           TRUE
rbindlist(generate_design_grid(
  learner$param_set$search_space(), 3)$transpose()
)
   degree shrinking
1:      1      TRUE
2:      1     FALSE
3:      2      TRUE
4:      2     FALSE
5:      3      TRUE
6:      3     FALSE

to_tune() can also be constructed with a Domain object, i.e. something constructed with a p_***() call. This way it is possible to tune continuous parameters with discrete values, or to give trafos or dependencies. One could, for example, tune the cost as above on given special values, and introduce a dependency of shrinking on it. Notice that to_tune(<levels>) is a short form of to_tune(p_fct(<levels>)).

Note

When introducing the dependency, we need to use the cost value from before the implicit trafo, which is the name or as.character() of the respective value, here "val2"!

learner$param_set$values$type = "C-classification" # needs to be set because of a bug in paradox
learner$param_set$values$cost = to_tune(c(val1 = 0.3, val2 = 0.7))
learner$param_set$values$shrinking = to_tune(p_lgl(depends = cost == "val2"))

print(learner$param_set$search_space())
<ParamSet>
          id    class lower upper nlevels        default parents value
1:      cost ParamFct    NA    NA       2 <NoDefault[3]>
2:    degree ParamInt     1     3       3 <NoDefault[3]>
3: shrinking ParamLgl    NA    NA       2 <NoDefault[3]>    cost
Trafo is set.
rbindlist(generate_design_grid(learner$param_set$search_space(), 3)$transpose(), fill = TRUE)
   degree cost shrinking
1:      1  0.3        NA
2:      1  0.7      TRUE
3:      1  0.7     FALSE
4:      2  0.3        NA
5:      2  0.7      TRUE
6:      2  0.7     FALSE
7:      3  0.3        NA
8:      3  0.7      TRUE
9:      3  0.7     FALSE

The search_space() method picks up dependencies from the underlying ParamSet automatically. So if the kernel is tuned, then degree automatically gets the dependency on it, without us having to specify that. (Here we reset cost and shrinking to NULL for the sake of clarity of the generated output.)

learner$param_set$values$cost = NULL
learner$param_set$values$shrinking = NULL
learner$param_set$values$kernel = to_tune(c("polynomial", "radial"))

print(learner$param_set$search_space())
<ParamSet>
       id    class lower upper nlevels        default parents value
1: degree ParamInt     1     3       3 <NoDefault[3]>  kernel
2: kernel ParamFct    NA    NA       2 <NoDefault[3]>
rbindlist(generate_design_grid(learner$param_set$search_space(), 3)$transpose(), fill = TRUE)
       kernel degree
1: polynomial      1
2: polynomial      2
3: polynomial      3
4:     radial     NA

It is even possible to define whole ParamSets that get tuned over for a single parameter. This may be especially useful for vector hyperparameters that should be searched along multiple dimensions. This ParamSet must, however, have an .extra_trafo that returns a list with a single element, because it corresponds to a single hyperparameter that is being tuned. Suppose the class.weights hyperparameter should be tuned along two dimensions:

learner$param_set$values$class.weights = to_tune(
  ps(spam = p_dbl(0.1, 0.9), nonspam = p_dbl(0.1, 0.9),
    .extra_trafo = function(x, param_set) list(c(spam = x$spam, nonspam = x$nonspam))
))
head(generate_design_grid(learner$param_set$search_space(), 3)$transpose(), 3)
[[1]]
[[1]]$kernel
[1] "polynomial"

[[1]]$degree
[1] 1

[[1]]$class.weights
   spam nonspam
    0.1     0.1


[[2]]
[[2]]$kernel
[1] "polynomial"

[[2]]$degree
[1] 1

[[2]]$class.weights
   spam nonspam
    0.1     0.5


[[3]]
[[3]]$kernel
[1] "polynomial"

[[3]]$degree
[1] 1

[[3]]$class.weights
   spam nonspam
    0.1     0.9
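The requirement that the .extra_trafo returns a list with a single element can be checked by calling the function by hand. This is a sketch with hypothetical input values; param_set is passed as NULL only because the function does not use it.

```r
# Same function as the .extra_trafo above; param_set is accepted but unused.
extra_trafo = function(x, param_set) {
  list(c(spam = x$spam, nonspam = x$nonspam))
}

# Hypothetical sampled values for the two inner dimensions.
inner_values = list(spam = 0.1, nonspam = 0.5)
result = extra_trafo(inner_values, param_set = NULL)

# One unnamed list element holding the named vector for class.weights.
print(length(result))
print(result[[1]])
```

The two inner dimensions spam and nonspam are collapsed into one named vector, which is the single value that ends up in the class.weights hyperparameter.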

## 9.5 Logging

We use the lgr package for logging and progress output.

### 9.5.1 Changing mlr3 logging levels

To change the setting for mlr3 for the current session, you need to retrieve the logger (which is an R6 object) from lgr, and then change the threshold of the logger like this:

requireNamespace("lgr")

logger = lgr::get_logger("mlr3")
logger$set_threshold("<level>")

The default log level is "info". All available levels can be listed as follows:

getOption("lgr.log_levels")
fatal error  warn  info debug trace
  100   200   300   400   500   600

To increase verbosity, set the log level to a higher value, e.g. to "debug" with:

lgr::get_logger("mlr3")$set_threshold("debug")

To reduce the verbosity, reduce the log level to warn:

lgr::get_logger("mlr3")$set_threshold("warn")

lgr comes with a global option called "lgr.default_threshold" which can be set via options() to make your choice permanent across sessions.

Also note that the optimization packages such as mlr3tuning or mlr3fselect use the logger of their base package bbotk. To disable the output from mlr3, but keep the output from mlr3tuning, reduce the verbosity for the logger mlr3 and optionally change the logger bbotk to the desired level.

lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("info")

### 9.5.2 Redirecting output

Redirecting output is already extensively covered in the documentation and vignette of lgr. Here is just a short example that adds an additional appender to log events into a temporary file in JSON format:

tf = tempfile("mlr3log_", fileext = ".json")

# get the logger as R6 object
logger = lgr::get_logger("mlr")

# add Json appender
logger$add_appender(lgr::AppenderJson$new(tf), name = "json")

# signal a warning
logger$warn("this is a warning from mlr3")
WARN  [23:07:29.436] this is a warning from mlr3
# print the contents of the file
cat(readLines(tf))
{"level":300,"timestamp":"2023-02-06 23:07:29","logger":"mlr","caller":"eval","msg":"this is a warning from mlr3"}
# remove the appender again
logger$remove_appender("json")

### 9.5.3 Immediate Log Feedback

mlr3 uses future and encapsulation to make evaluations fast, stable, and reproducible. However, this may lead to logs being delayed, out of order, or, in case of some errors, not present at all. When it is necessary to have immediate access to log messages, for example to investigate problems, one may therefore choose to disable future and encapsulation. This can be done by enabling the debug mode using options(mlr3.debug = TRUE); the $encapsulate slot of learners should also be set to "none" (default) or "evaluate", but not "callr". This should only be done to investigate problems, however, and not for production use, because

1. this disables parallelization, and
2. this leads to different RNG behavior and therefore to results that are not reproducible when the debug mode is set.
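Because the debug mode should only be active while investigating a problem, one way to use it is to switch it on temporarily and restore the previous state afterwards. The following is a minimal sketch in base R; the option name mlr3.debug follows the mlr3 package documentation and should be treated as an assumption if your installed version documents a different name.

```r
# Enable debug mode and remember the previous state of the option.
old_options = options(mlr3.debug = TRUE)
debug_enabled = isTRUE(getOption("mlr3.debug"))

# ... rerun the failing resample() or benchmark() call here ...

# Restore the previous state so that later runs use parallelization again.
options(old_options)
```

options() returns the old values of the options it changes, so passing that list back restores whatever was set before, including the case where the option was not set at all.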