5.1 Parallelization
Parallelization refers to running multiple jobs simultaneously. This can considerably reduce the elapsed (wall-clock) time of a computation.
mlr3 uses the future backends for parallelization. Make sure you have installed the required packages future and future.apply.
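One way to check that both packages are available before continuing is sketched below. The variable names are illustrative, and the install command is left as a comment so that nothing is installed unintentionally.

```r
# Packages named in the text above
required = c("future", "future.apply")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed = vapply(required, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Install whatever is missing (uncomment to run):
# install.packages(required[!installed])
```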
mlr3 is capable of parallelizing a variety of different scenarios.
One of the most common use cases is parallelizing the resampling iterations.
See Section Resampling for a detailed introduction to resampling.
In the following, we use the spam task and a simple classification tree ("classif.rpart") to showcase parallelization.
We use the future package to parallelize the resampling by selecting a backend via the function future::plan().
We use the future::multiprocess backend here, which uses forks (cf. parallel::mcparallel()) on UNIX-based systems and a socket cluster on Windows or when running in RStudio:
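library("mlr3")

# Select the parallelization backend.
# Note: newer releases of future deprecate "multiprocess" in favour of
# "multisession" (all platforms) or "multicore" (forks, UNIX only).
future::plan("multiprocess")

task = tsk("spam")
learner = lrn("classif.rpart")
resampling = rsmp("subsampling")

# Measure the elapsed time of the resampling
time = Sys.time()
resample(task, learner, resampling)
Sys.time() - time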
By default, all CPUs of your machine are used unless you specify the argument workers in future::plan().
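As a minimal sketch of limiting the number of workers: the value 2 below is an arbitrary choice, and "multisession" is used as the backend here only because it is available on all platforms.

```r
# Number of cores that future detects on this machine
print(future::availableCores())

# Restrict the backend to two workers (2 is an arbitrary example value)
future::plan("multisession", workers = 2)
n_workers = future::nbrOfWorkers()
print(n_workers)

# Switch back to sequential execution when done
future::plan("sequential")
```

Setting workers below the number of available cores leaves capacity for other processes on a shared machine.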
On most systems you should see a decrease in the reported elapsed time. On some systems (e.g. Windows), however, the overhead of parallelization is quite large. It is therefore advisable to enable parallelization only for resamplings where each iteration runs for at least 10 seconds.
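Whether parallelization pays off on a given machine can be checked by timing the same resampling under both plans. The following is a rough sketch rather than a rigorous benchmark; the variable names t_seq and t_par and the choice of two workers are assumptions made for illustration.

```r
library("mlr3")

task = tsk("spam")
learner = lrn("classif.rpart")
resampling = rsmp("subsampling")

# Fix the train/test splits so that both runs do exactly the same work
resampling$instantiate(task)

# Sequential baseline
future::plan("sequential")
t_seq = system.time(resample(task, learner, resampling))[["elapsed"]]

# Parallel run with two workers (arbitrary example value)
future::plan("multisession", workers = 2)
t_par = system.time(resample(task, learner, resampling))[["elapsed"]]

future::plan("sequential")
print(c(sequential = t_seq, parallel = t_par))
```

For a fast learner such as a classification tree, the parallel run may well be slower, because the time to start workers and transfer data can outweigh the short model fits.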
Choosing the parallelization level
If you are transitioning from mlr, you might be used to selecting different parallelization levels, e.g. for resampling, benchmarking or tuning. In mlr3 this is no longer required. All kinds of events are rolled out on the same level. Therefore, there is no need to decide whether you want to parallelize the tuning OR the resampling.
Just lean back and let the machine do the work :-)
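To illustrate, the sketch below sets a backend once and then runs a small benchmark without naming a parallelization level anywhere. The second learner ("classif.featureless"), the 3-fold cross-validation and the two workers are arbitrary choices made for this example.

```r
library("mlr3")

# Select the backend once; 2 workers is an arbitrary example value
future::plan("multisession", workers = 2)

# Benchmark design: one task, two learners, 3-fold cross-validation
design = benchmark_grid(
  tasks = tsk("spam"),
  learners = list(lrn("classif.rpart"), lrn("classif.featureless")),
  resamplings = rsmp("cv", folds = 3)
)

# The resampling iterations of both learners are distributed by the same plan
bmr = benchmark(design)
scores = bmr$aggregate(msr("classif.ce"))
print(scores)

future::plan("sequential")
```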
5.1.1 Nested Resampling Parallelization
Nested resampling results in two nested resampling loops. We can choose different parallelization backends for the inner and outer resampling loop, respectively. We just have to pass a list of future backends:
# Runs the outer loop in parallel and the inner loop sequentially
future::plan(list("multisession", "sequential"))
# Runs the outer loop sequentially and the inner loop in parallel
future::plan(list("sequential", "multisession"))
While nesting real parallelization backends is often unintended and causes unnecessary overhead, it is useful in some distributed computing setups. It can be achieved with future by forcing a fixed number of workers for each loop:
# Runs both loops in parallel
future::plan(list(future::tweak("multisession", workers = 2),
  future::tweak("multisession", workers = 4)))
This example would run on 8 cores (= 2 * 4) on the local machine.
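The worker counts at each level can be inspected directly with future. This is a small sketch, independent of mlr3, that uses 2 * 2 workers instead of 2 * 4 to keep the number of spawned processes low; the variable names are illustrative.

```r
library("future")

# Two outer workers, each allowed two inner workers (at most 2 * 2 = 4 in total)
plan(list(
  tweak("multisession", workers = 2),
  tweak("multisession", workers = 2)
))

# Workers available at the outer level
outer_workers = nbrOfWorkers()

# Workers visible from inside an outer future, i.e. at the inner level
inner_workers = value(future(nbrOfWorkers()))

print(c(outer = outer_workers, inner = inner_workers))

# Switch back to sequential execution when done
plan("sequential")
```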
The vignette of the future package gives more insight into nested parallelization.