8 Parallelization

First, make sure you have installed the required packages future and future.apply:

if (!requireNamespace("future", quietly = TRUE))
  install.packages("future")
if (!requireNamespace("future.apply", quietly = TRUE))
  install.packages("future.apply")

mlr3 can parallelize several operations. One of the most obvious candidates is resampling.

Each resampling iteration trains a learner on a subset of a task, predicts on a different subset of the task and then evaluates the performance by comparing true and predicted labels. As all iterations are independent of each other, this loop is called embarrassingly parallel.
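To make the independence of the iterations concrete, here is a minimal sketch of such a loop written by hand, without mlr3. It is an illustration only, not how mlr3 implements resampling: the helper one_iteration(), the use of the iris data, the 0.67 split ratio and the 10 repeats are assumptions made for this example. It needs the rpart package in addition to future.apply.

```r
library("future.apply")

# One resampling iteration: train on a random subset, predict on the
# remaining rows, and return the misclassification rate.
# (one_iteration() is a hypothetical helper for this sketch.)
one_iteration = function(i, data, ratio = 0.67) {
  n = nrow(data)
  train_ids = sample(n, size = floor(ratio * n))
  model = rpart::rpart(Species ~ ., data = data[train_ids, ])
  predicted = predict(model, newdata = data[-train_ids, ], type = "class")
  mean(predicted != data$Species[-train_ids])
}

# No iteration depends on the result of another one, so the loop can be
# handed to any future backend; "sequential" runs it in the current session.
future::plan("sequential")
errors = unlist(future.apply::future_lapply(
  1:10, one_iteration, data = iris, future.seed = TRUE
))
mean(errors)
```

Because each call only needs its own index and the data, replacing the sequential plan with a parallel one distributes the iterations without changing the loop itself. The argument future.seed = TRUE requests statistically sound random numbers across workers.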

In the following, we will consider the spam task and a simple classification tree ("classif.rpart") to illustrate the parallelization.

First, the loop without any parallelization:

library("mlr3")

task = mlr_tasks$get("spam")
learner = mlr_learners$get("classif.rpart")
resampling = mlr_resamplings$get("subsampling")

system.time(
  resample(task, learner, resampling)
)[3L]

Now, we use the future package to parallelize the resampling: we select a backend via the function plan() and then repeat the resampling. We use the "multiprocess" backend here, which uses forked R processes on Linux/macOS and a socket cluster of background R sessions on Windows:

future::plan("multiprocess")
system.time(
  resample(task, learner, resampling)
)[3L]

On most systems you should see a decrease in the reported elapsed time. On some systems (e.g. Windows), however, the overhead of parallelization is quite large. We therefore advise enabling parallelization only for experiments that each run for more than 10 seconds.
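The backend can also be named explicitly, which makes the number of workers and the type of parallelism visible in the code. The following is a sketch under assumptions: the choice of two workers is arbitrary, and the "multisession" backend is picked only because it is available on all platforms. Recent releases of future deprecate the "multiprocess" alias in favour of naming the backend directly.

```r
library("future")

# Number of cores that future detects on this machine.
n_cores = availableCores()

# Explicit backend: background R sessions connected via sockets.
# workers = 2 is an arbitrary choice for illustration.
plan(multisession, workers = 2)
n_workers = nbrOfWorkers()

# Switch back to sequential execution when the parallel part is done.
plan("sequential")
```

Starting the background sessions is part of the overhead mentioned above, so for very short jobs the sequential plan may well be faster.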

Benchmarking is another example of embarrassingly parallel execution. The following code sends 48 jobs (3 tasks * 1 learner * 16 resampling repeats) to the future backend:

tasks = mlr_tasks$mget(c("iris", "spam", "pima"))
learners = mlr_learners$mget("classif.rpart")
resamplings = mlr_resamplings$mget("subsampling", param_vals = list(ratio = 0.8, repeats = 16))

future::plan("multiprocess")
system.time(
  benchmark(expand_grid(tasks, learners, resamplings))
)[3L]
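The number of jobs follows from crossing tasks, learners and resampling repeats. The sketch below enumerates the jobs for the three tasks, one learner and 16 repeats used in the code above and dispatches them through future.apply. It only illustrates the idea and is not mlr3's internal implementation; run_job() is a hypothetical placeholder that returns a label instead of training a model.

```r
library("future.apply")

# Every combination of task, learner and resampling repeat is one job.
jobs = expand.grid(
  task = c("iris", "spam", "pima"),
  learner = "classif.rpart",
  iteration = 1:16,
  stringsAsFactors = FALSE
)
n_jobs = nrow(jobs)  # 3 tasks * 1 learner * 16 repeats

# Hypothetical placeholder: a real job would train and predict here.
run_job = function(i) {
  paste(jobs$task[i], jobs$learner[i], jobs$iteration[i])
}

# The jobs are independent, so any future backend can process them.
future::plan("sequential")
results = future.apply::future_lapply(seq_len(n_jobs), run_job)
```

With a parallel plan, the same call spreads the jobs across the available workers.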