13 Bayesian Optimization
Recall that in hyperparameter optimization (HPO, see Chapter 4), we configure a learner with a hyperparameter configuration and evaluate the learner via a resampling technique to estimate its generalization performance, with the goal of finding the optimal hyperparameter configuration. In general, no analytical description of this mapping from a hyperparameter configuration to performance exists, and gradient information is not available either. HPO is therefore a prime example of black-box optimization, which considers the optimization of a function whose structure and analytical description is unknown, unexploitable or non-existent. As a result, the only observable information is the output value of the function given an input value. Moreover, evaluating the performance of a learner can take a substantial amount of time, making HPO an expensive black-box optimization problem. Other examples of black-box optimization problems are real-life experiments, e.g., crash tests or chemical reactions, or expensive computer simulations of such processes.
Many classes of algorithms exist that can be used for black-box optimization, differing in how they tackle this problem. In Chapter 4, we showed how to use a grid or random search to find the optimal hyperparameter configuration for a learner on a given task. However, more sophisticated black-box optimizers such as evolutionary strategies or Bayesian optimization can find better-performing configurations far more efficiently. This chapter is therefore of interest to users concerned with sample-efficient black-box optimization and HPO.
In general, most black-box optimizers work iteratively, i.e., they propose new points for evaluation by making use of the information collected during the evaluation of previous points. Evolutionary strategies, for example, maintain a population and generate new points for evaluation by choosing parents from that population and applying variation operators such as mutation and crossover to generate the offspring of the next generation. Evolutionary strategies are generally well suited for black-box optimization; however, if the cost of evaluating the black-box function becomes large (e.g., as in HPO), the sample efficiency of an optimizer becomes highly relevant.
Bayesian optimization (BO) – sometimes also called Model-Based Optimization (MBO) – refers to a class of sample-efficient global black-box optimization algorithms that rely on a surrogate model trained on observed data to model the black-box function. This surrogate model is typically a non-linear regression model that tries to capture the unknown function using limited observed data. To select the next candidate point for evaluation, BO algorithms use an acquisition function that quantifies the expected ‘utility’ of each point in the search space based on the surrogate model. The algorithm selects the candidate point with the best acquisition function value, evaluates the black-box function at that point, and then updates the surrogate model. This iterative process continues until a stopping criterion is met, such as reaching a pre-specified maximum number of evaluations or achieving a desired level of performance. BO often results in very good optimization performance, especially if the cost of black-box evaluations is large and the optimization budget is tight.
In the following, we will give a brief general introduction to black-box optimization making use of the bbotk package. We then show how BO can be performed within the mlr3 ecosystem making use of the mlr3mbo package. Detailed introductions to black-box optimization and BO are given in Bischl et al. (2023), Feurer and Hutter (2019) and Garnett (2022).
13.1 Black-Box Optimization
The bbotk (black-box optimization toolkit) package is the workhorse package for black-box optimization within the mlr3 ecosystem. At the heart of the package are the R6 classes:

- OptimInstanceSingleCrit and OptimInstanceMultiCrit, which are used to construct an optimization instance that describes the optimization problem and stores the results; and
- Optimizer, which is used to get and set optimization algorithms.

This should sound very familiar. Indeed, in Chapter 4, the classes TuningInstanceSingleCrit, TuningInstanceMultiCrit and Tuner were already introduced, which are special cases of the OptimInstanceSingleCrit, OptimInstanceMultiCrit and Optimizer classes. The latter are suited for general black-box optimization, not only HPO.
Our running example will be to optimize the following sinusoidal function (Figure 13.1), which is characterized by two local minima and one global minimum: \(f: [0, 1] \rightarrow \mathbb{R}, x \mapsto 2x \cdot \sin(14x)\).
At the core of an OptimInstanceSingleCrit lies an Objective function wrapping the actual function mapping a domain to a codomain (see also Section 4.4). The domain of a function refers to the set of all possible input values for which the function is defined and can produce a valid output. The codomain of a function refers to the set of all possible output values that the function can produce. Objective functions can be created using different classes: ObjectiveRFun wraps a custom R function that takes a list describing a single configuration as input, ObjectiveRFunMany wraps a custom R function that takes a list of configurations as input, and ObjectiveRFunDt wraps a custom R function that operates on a data.table. The choice of which class to use mostly depends on efficiency considerations. For instance, if your function can process data.table objects directly, using ObjectiveRFunDt can enable efficient vectorized or batched evaluations. This approach avoids unnecessary data type conversions and offers straightforward operations on data.table objects. Similarly, if the function evaluation can be parallelized, it is preferable to use ObjectiveRFunMany instead of ObjectiveRFun. With ObjectiveRFunMany, the parallel evaluation of multiple points can be directly implemented within the wrapped function. We provide an example in the exercises section.
An Objective always requires the specification of the domain and codomain in the form of a ParamSet. Moreover, by tagging the codomain with "minimize" or "maximize", we specify the optimization direction:
sinus_1D = function(xdt) {
  y = 2 * xdt$x * sin(14 * xdt$x)
  data.table(y = y)
}

domain = ps(x = p_dbl(lower = 0, upper = 1))
codomain = ps(y = p_dbl(tags = "minimize"))
objective = ObjectiveRFunDt$new(sinus_1D,
  domain = domain, codomain = codomain)
We can now visualize the sinusoidal function (Figure 13.1) by generating a grid of points on which we evaluate the function (in a batched manner):
xydt = generate_design_grid(domain, resolution = 1001)$data
xydt[, y := objective$eval_dt(xydt)$y]
xydt
x y
1: 0.000 0.000000e+00
2: 0.001 2.799909e-05
3: 0.002 1.119854e-04
4: 0.003 2.519259e-04
5: 0.004 4.477659e-04
---
997: 0.996 1.954951e+00
998: 0.997 1.962081e+00
999: 0.998 1.968836e+00
1000: 0.999 1.975215e+00
1001: 1.000 1.981215e+00
The global minimum is located at around x = 0.792.
xydt[y == min(y), ]
x y
1: 0.792 -1.577239
By wrapping the objective function in an OptimInstanceSingleCrit, we can proceed to optimize it. Note that an OptimInstance also allows for explicitly specifying a search space. While the domain describes the set of all possible input values for which the function is defined and can produce a valid output, the search space is usually a subset or a transformation of the domain and describes the set of all input values we want to optimize over. Note that in black-box optimization, it is common for the domain, and therefore also the search space, to have finite box constraints. Often, it is useful to apply transformation functions to the domain, resulting in a search space that can be optimized more efficiently (see also Section 4.1.5). When constructing an OptimInstance without specifying the search_space explicitly, the whole domain of the objective function will be used by default. In the following, we use a simple random search to optimize the sinusoidal function over the whole domain.
instance = OptimInstanceSingleCrit$new(objective,
search_space = domain,
terminator = trm("evals", n_evals = 20))
optimizer = opt("random_search", batch_size = 20)
optimizer$optimize(instance)
instance$archive$best()
x y timestamp batch_nr
1: 0.7883051 -1.575305 2023-06-04 14:02:35 1
Similarly to tuning (see Chapter 4), where we used the tune() helper function, we can also use the bb_optimize() helper function for optimization. Internally, this creates an OptimInstance, starts the optimization and returns the result together with the instance.
instance = bb_optimize(objective, method = "random_search",
max_evals = 20)
To see all available optimizers that can be used for black-box optimization, you can inspect the following dictionary. The key mbo refers to BO as introduced in this chapter.
as.data.table(mlr_optimizers)
key label
1: cmaes Covariance Matrix Adaptation Evolution Strategy
2: design_points Design Points
3: focus_search Focus Search
4: gensa Generalized Simulated Annealing
5: grid_search Grid Search
6: hyperband Hyperband
7: irace Iterated Racing
8: mbo Model Based Optimization
9: nloptr Non-linear Optimization
10: random_search Random Search
11: successive_halving Successive Halving
3 variables not shown: [param_classes, properties, packages]
Note that evolutionary strategies are available within the mlr3 ecosystem via the miesmuschel package, which we will not cover in this chapter.
13.2 Building Blocks of Bayesian Optimization
Local optima (as in Figure 13.1) pose a significant challenge for many optimization algorithms as they can trap the algorithm, preventing it from escaping to potentially better solutions. Bayesian optimization (BO) is an iterative global optimization algorithm that makes use of a surrogate model to model the unknown black-box function. After an initial design has been observed, the surrogate model is trained on all data points observed so far. An acquisition function is then used to determine which points of the search space are promising candidate(s) that should be evaluated next. The acquisition function relies on the mean and standard deviation prediction of the surrogate model and requires no evaluation of the true black-box function, making it comparably cheap to optimize. The acquisition function should balance the exploration and exploitation of the BO algorithm: we want to exploit knowledge about regions where the observed performance is good and the surrogate model has low uncertainty, but we also want to make sure that we do not miss crucial regions and therefore explore regions where we have not yet evaluated points and the uncertainty of the surrogate model is consequently high. After the next candidate(s) have been evaluated, the process repeats itself until a given termination criterion is met.
Most BO flavors therefore follow a simple loop:
- Fit the surrogate model on all observations made so far.
- Optimize the acquisition function to find the next candidate(s) that should be evaluated.
- Evaluate the next candidate(s) and update the archive of all observations made so far.
Note that in the following we will often speak of BO flavors, as BO is highly modular, i.e., users can choose different types of surrogate models, acquisition functions and acquisition function optimizers to their liking.
mlr3mbo makes BO available within the mlr3 ecosystem. The core design principle is high modularity based on straightforward-to-use building blocks, allowing users to easily write custom BO algorithms. At the heart of the package are the two R6 classes OptimizerMbo and TunerMbo, which can be configured with respect to the general loop structure of the BO algorithm (loop_function), the surrogate model (Surrogate), the acquisition function (AcqFunction) and the acquisition function optimizer (AcqOptimizer). We highly encourage readers to also take a look at the source code of a standard single-objective BO loop such as bayesopt_ego() (mlr_loop_functions_ego), demonstrating that custom loop functions can be written in a few lines of code.
In the following, we will illustrate the steps involved in such a standard BO loop in more detail, before we move on to using an OptimizerMbo or TunerMbo directly.
13.2.1 The Initial Design
Before we can fit a surrogate model, we need data. The initial set of points that is evaluated before a surrogate model can be fit is referred to as the initial design.
mlr3mbo offers two different ways of specifying an initial design:

- One can simply evaluate points manually on the OptimInstance that is to be optimized prior to using an OptimizerMbo. In this case, the loop_function should skip the construction and evaluation of its own initial design. For example, if one wants to use a custom design given in the form of a data.table, instance$eval_batch(design) will evaluate it so that it can be used within the optimization process.
- If no points were already evaluated manually on the OptimInstance, the loop_function should construct an initial design itself and evaluate it.
Functions for creating different initial designs are part of the paradox package, e.g.:

- generate_design_random(): uniformly at random
- generate_design_grid(): uniformly sized grid
- generate_design_lhs(): Latin hypercube sampling (Stein 1987)
- generate_design_sobol(): Sobol sequence (Niederreiter 1988)
For illustrative purposes, we will briefly compare these samplers on a two-dimensional domain where we want to construct an initial design of size 9:
sample_domain = ps(x1 = p_dbl(lower = 0, upper = 1),
x2 = p_dbl(lower = 0, upper = 1))
random_design = generate_design_random(sample_domain, n = 9)$data
grid_design = generate_design_grid(sample_domain, resolution = 3)$data
lhs_design = generate_design_lhs(sample_domain, n = 9)$data
sobol_design = generate_design_sobol(sample_domain, n = 9)$data
We observe that a random design does not necessarily cover the search space well and, in this example simply due to bad luck, samples points close to each other, leaving large areas unexplored. A grid design results in points being equidistant from their nearest neighbours but also does not cover the search space well (areas between points remain unexplored). In contrast, an LHS design provides a good space-filling property, as it ensures that each interval of each input variable (spanned by the horizontal and vertical dotted lines in Figure 13.2) is represented by exactly one sample point. This usually results in a more even coverage of the search space and a better representation of the distribution of the input variables. A Sobol design has a similar goal in mind but does not guarantee this property for a small number of samples. However, constructing a Sobol design can be done much more efficiently than an LHS design, especially as the number of samples and dimensions grows. Moreover, a Sobol design has better coverage properties than an LHS design if the number of dimensions grows large.
If a specific initial design is desired, it needs to be evaluated on the OptimInstance before running an OptimizerMbo. The loop function within the OptimizerMbo will then recognize the evaluations in the instance's archive and consider them as the initial design, omitting the need to construct and evaluate a new initial design. Note that the same mechanism also allows for specifying a custom initial design (i.e., points designed by the user) in the form of a data.table. Coming back to our running example of minimizing the sinusoidal function, we will now use the following custom design and evaluate it:
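A minimal sketch of this step follows; the concrete design points are illustrative, and the instance is re-created so that its archive contains only this custom design:

instance = OptimInstanceSingleCrit$new(objective,
  search_space = domain,
  terminator = trm("evals", n_evals = 20))

# custom initial design; the exact points are illustrative
design = data.table(x = c(0.1, 0.34, 0.65, 1))
instance$eval_batch(design)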
13.2.2 Loop Function
The loop_function determines the behavior of the BO algorithm on a global level, i.e., what the subroutine performed at each iteration should look like.
To get an overview of readily available loop functions, the following dictionary can be inspected:
as.data.table(mlr_loop_functions)
key label instance
1: bayesopt_ego Efficient Global Optimization single-crit
2: bayesopt_emo Multi-Objective EGO multi-crit
3: bayesopt_mpcl Multipoint Constant Liar single-crit
4: bayesopt_parego ParEGO multi-crit
5: bayesopt_smsego SMS-EGO multi-crit
1 variable not shown: [man]
Technically, all loop functions are instances of the S3 class loop_function (and are actually simply custom R functions with some attributes set).
For sequential single-objective black-box optimization, the Efficient Global Optimization (EGO) algorithm (Jones, Schonlau, and Welch 1998) can be considered the reference algorithm. In mlr3mbo, the EGO algorithm is implemented via the bayesopt_ego() loop function (mlr_loop_functions_ego). After having made some assertions and safety checks, and having evaluated the initial design, bayesopt_ego() essentially repeatedly performs the following steps:

- acq_function$surrogate$update(): update the surrogate model
- acq_function$update(): update the acquisition function
- acq_optimizer$optimize(): optimize the acquisition function to yield a new candidate
- instance$eval_batch(candidate): evaluate the candidate and add it to the archive
For completeness, we also provide the actual code (slightly modified and compressed to increase readability) of the bayesopt_ego() loop function (mlr_loop_functions_ego) below. The function first performs some assertions and checks the validity of the input arguments. It sets up the necessary components, including the surrogate model, acquisition function, and acquisition function optimizer. If the archive of evaluated points is empty, it generates an initial design of points uniformly at random within the search space and evaluates them (either of size \(4D\), where \(D\) is the dimensionality of the search space, or of the size specified via the init_design_size argument). The function then enters a loop where it repeatedly performs the steps described above. Note that updating the surrogate model and the acquisition function, as well as optimizing the acquisition function, are wrapped in an error-catching mechanism with a fallback that proposes the next candidate uniformly at random. For more details on this mechanism, see Section 13.8.
bayesopt_ego = function(
  instance,
  surrogate,
  acq_function,
  acq_optimizer,
  init_design_size = NULL
) {
  # assertions
  assert_r6(instance, "OptimInstanceSingleCrit")
  assert_int(init_design_size, lower = 1L, null.ok = TRUE)
  assert_r6(surrogate, classes = "Surrogate")
  assert_r6(acq_function, classes = "AcqFunction")
  assert_r6(acq_optimizer, classes = "AcqOptimizer")

  # initial design
  search_space = instance$search_space
  if (is.null(init_design_size) && instance$archive$n_evals == 0L) {
    init_design_size = 4L * search_space$length
  }
  if (!is.null(init_design_size) && init_design_size > 0L) {
    design = generate_design_random(search_space, n = init_design_size)$data
    instance$eval_batch(design)
  }

  # completing initialization
  surrogate$archive = instance$archive
  acq_function$surrogate = surrogate
  acq_optimizer$acq_function = acq_function

  # actual loop
  repeat {
    candidate = tryCatch({
      acq_function$surrogate$update()
      acq_function$update()
      acq_optimizer$optimize()
    }, mbo_error = function(mbo_error_condition) {
      lg$info(paste0(class(mbo_error_condition), collapse = " / "))
      lg$info("Proposing a randomly sampled point")
      generate_design_random(search_space, n = 1L)$data
    })
    instance$eval_batch(candidate)
    if (instance$is_terminated) break
  }

  return(invisible(instance))
}
13.2.3 Surrogate Model
A surrogate model wraps a regression learner that models the unknown black-box function based on observed data. In mlr3mbo, SurrogateLearner and SurrogateLearnerCollection are the higher-level R6 classes which should be used to construct a surrogate model, both inheriting from the base Surrogate class.
As a learner, any regression learner (LearnerRegr) from mlr3 can be used; however, most acquisition functions require both a mean and a standard deviation prediction (therefore, not all learners are suitable for all scenarios). Typical choices include:

- A Gaussian Process (mlr3learners::mlr_learners_regr.km) for low-dimensional numeric search spaces
- A random forest (e.g., mlr3learners::mlr_learners_regr.ranger) for higher-dimensional mixed (and/or hierarchical) search spaces
A detailed introduction to Gaussian Processes can be found in Williams and Rasmussen (2006), and an in-depth treatment of Gaussian Processes in the context of surrogate models in BO is given in Garnett (2022).
A SurrogateLearner can be constructed via:
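surrogate = srlrn(lrn("regr.km", covtype = "matern5_2",
  optim.method = "BFGS", control = list(trace = FALSE)))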
Here, we use a Gaussian Process with a Matérn 5/2 kernel, which uses BFGS as an optimizer to find the optimal kernel parameters, and we set trace = FALSE to prevent too much output during fitting.
When using a Surrogate interactively, i.e., outside of an OptimizerMbo or TunerMbo as below, the archive of the instance must be specified:
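surrogate$archive = instance$archive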
The wrapped learner can be accessed via the $learner field:
surrogate$learner
<LearnerRegrKM:regr.km>
* Model: -
* Parameters: covtype=matern5_2, optim.method=BFGS, control=<list>
* Packages: mlr3, mlr3learners, DiceKriging
* Predict Types: response, [se]
* Feature Types: logical, integer, numeric
* Properties: -
Internally, the learner is fitted on a regression task (mlr3::TaskRegr) constructed from the bbotk::Archive of the bbotk::OptimInstance that is to be optimized. Features are given by the variables of the domain, whereas the target is given by the variable of the codomain. Depending on the choice of the loop function, multiple targets must be modelled by multiple surrogates, in which case a SurrogateLearnerCollection should be used, see Section 13.5.
Updating the surrogate model results in the fields of the $learner being populated as expected:
surrogate$update()
surrogate$learner$model
Call:
DiceKriging::km(design = data, response = task$truth(), covtype = "matern5_2",
optim.method = "BFGS", control = pv$control)
Trend coeff.:
Estimate
(Intercept) 0.7899
Covar. type : matern5_2
Covar. coeff.:
Estimate
theta(x) 0.3014
Variance estimate: 1.069737
Figure 13.3 visualizes the mean and uncertainty prediction of the surrogate model. Note that when using a Gaussian Process that interpolates the training data, the standard deviation prediction is zero at the training points.
prediction = surrogate$predict(xydt[, surrogate$x_cols, with = FALSE])
xydt[, c("mean", "se") := prediction]
ggplot() +
geom_point(aes(x = x, y = y), size = 2, data = instance$archive$data) +
geom_line(aes(x = x, y = y), data = xydt) +
geom_line(aes(x = x, y = mean), colour = "steelblue", linetype = 2,
data = xydt) +
geom_ribbon(aes(x = x, ymin = mean - se, ymax = mean + se),
fill = "steelblue", colour = NA, alpha = 0.1, data = xydt) +
theme_minimal()
13.2.4 Acquisition Function
Roughly speaking, an acquisition function relies on the predictions of a surrogate model and quantifies the expected ‘utility’ of each point of the search space if it were to be evaluated in the next iteration.
A popular example is given by the Expected Improvement (Jones, Schonlau, and Welch 1998). The Expected Improvement tells us how much we can expect a candidate point to improve over the current best function value observed so far, given the performance prediction of the surrogate model: \[ \alpha_{\mathrm{EI}}(\mathbf{x}) = \mathbb{E} \left[ \max \left( f_{\mathrm{min}} - Y(\mathbf{x}), 0 \right) \right] \] Here, \(Y(\mathbf{x})\) is the surrogate model prediction (a random variable) for a given point \(\mathbf{x}\) (which, when using a Gaussian Process, follows a normal distribution) and \(f_{\mathrm{min}}\) is the current best function value observed so far (when assuming minimization) – also called the incumbent.
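When the prediction at \(\mathbf{x}\) is normal with mean \(\mu\) and standard deviation \(\sigma\), this expectation has a well-known closed form; the following is a minimal sketch of that formula, not mlr3mbo's internal implementation:

# closed-form Expected Improvement for a normal prediction N(mu, sigma^2)
# under minimization, with incumbent f_min
ei = function(mu, sigma, f_min) {
  z = (f_min - mu) / sigma
  ifelse(sigma > 0, (f_min - mu) * pnorm(z) + sigma * dnorm(z), 0)
}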
To get an overview of other available acquisition functions, the following dictionary can be inspected:
as.data.table(mlr_acqfunctions)
key label
1: aei Augmented Expected Improvement
2: cb Lower / Upper Confidence Bound
3: ehvi Expected Hypervolume Improvement
4: ehvigh Expected Hypervolume Improvement via GH Quadrature
5: ei Expected Improvement
6: eips Expected Improvement Per Second
7: mean Posterior Mean
8: pi Probability Of Improvement
9: smsego SMS-EGO
1 variable not shown: [man]
Technically, all acquisition functions inherit from the R6 class AcqFunction, which itself simply inherits from the base bbotk::Objective class.
Construction is straightforward via:
acq_function = acqf("ei")
When working interactively, i.e., outside of an OptimizerMbo or TunerMbo as below, the surrogate on which the acquisition function operates must be specified:
acq_function = acqf("ei", surrogate = surrogate)
We now want to use the Expected Improvement to choose the next candidate for evaluation. First, we have to update the acquisition function. For the Expected Improvement, this means updating the incumbent to make sure that it is actually set to the current best function value observed so far.
acq_function$update()
Afterwards, we can evaluate the acquisition function for every point of the domain, e.g.:
acq_function$eval_dt(data.table(x = 0.25))
acq_ei
1: 0.01370754
Figure 13.4 shows that the Expected Improvement is high in regions where the mean prediction of the Gaussian Process is low or where the standard deviation prediction suggests high uncertainty. As a result, the Expected Improvement is often multi-modal.
ei_values = acq_function$eval_dt(xydt[, surrogate$x_cols, with = FALSE])
xydt[, ei := ei_values]
ggplot() +
geom_point(aes(x = x, y = y), size = 2, data = instance$archive$data) +
geom_line(aes(x = x, y = y), data = xydt) +
geom_line(aes(x = x, y = mean), colour = "steelblue", linetype = 2,
data = xydt) +
geom_ribbon(aes(x = x, ymin = mean - se, ymax = mean + se),
fill = "steelblue", colour = NA, alpha = 0.1, data = xydt) +
geom_line(aes(x = x, y = ei * 40), linewidth = 1, colour = "darkred",
linetype = 1, data = xydt) +
scale_y_continuous("y",
sec.axis = sec_axis(~ . * 0.025,
name = expression(alpha[EI]),
breaks = c(0, 0.025, 0.05))) +
theme_minimal()
13.2.5 Acquisition Function Optimizer
In practice, evaluating all potential candidates and selecting the best one is computationally infeasible (and for continuous search spaces theoretically impossible). To overcome this challenge, an optimizer is used to efficiently search the space of potential candidates. The optimizer's objective is to identify the most promising points for evaluation by optimizing the acquisition function within a limited computational budget.
Internally, an OptimInstance is constructed using the acquisition function as a bbotk::Objective.
An acquisition function optimizer is then used to solve this optimization problem. Technically, this optimizer is a member of the AcqOptimizer R6 class.
Construction requires specifying a bbotk::Optimizer as well as a bbotk::Terminator:
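acq_optimizer = acqo(opt("nloptr", algorithm = "NLOPT_GN_ORIG_DIRECT"),
  terminator = trm("stagnation", iters = 100, threshold = 1e-5))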
When working interactively, i.e., outside of an OptimizerMbo or TunerMbo as below, the acq_function must be specified as well:
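acq_optimizer$acq_function = acq_function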
In this example, we use the DIRECT algorithm provided by the nloptr package to optimize the Expected Improvement. We terminate the acquisition function optimization if we no longer improve by at least 1e-5 for 100 iterations.
candidate = acq_optimizer$optimize()
candidate
x x_domain acq_ei .already_evaluated
1: 0.417289 <list[1]> 0.06074387 FALSE
Having introduced the building blocks and their usage, we just illustrated what the standard loop function of a BO algorithm would do during optimization. The BO algorithm would then go on to evaluate the candidate and continue with the next iteration of the loop.
instance$eval_batch(candidate)
13.3 Bayesian Optimization for Black-Box Optimization
Of course, users usually do not want to perform all steps of the BO loop manually. Instead, one can simply construct an OptimizerMbo and use it to optimize the instance. Again, note that the loop_function specifies the overall behavior of the BO algorithm, dictating the structure of the subroutine executed at each iteration as explained above. Here, we use the bayesopt_ego() (mlr_loop_functions_ego) loop function as we are performing standard single-objective optimization. Moreover, note that when not using mlr3mbo interactively, i.e., when relying on an OptimizerMbo or TunerMbo, passing arguments such as the archive to the surrogate is not required:
surrogate = srlrn(lrn("regr.km", covtype = "matern5_2",
optim.method = "BFGS", control = list(trace = FALSE)))
acq_function = acqf("ei")
acq_optimizer = acqo(opt("nloptr", algorithm = "NLOPT_GN_ORIG_DIRECT"),
terminator = trm("stagnation", iters = 100, threshold = 1e-5))
optimizer = opt("mbo",
loop_function = bayesopt_ego,
surrogate = surrogate,
acq_function = acq_function,
acq_optimizer = acq_optimizer)
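optimizer$optimize(instance)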
instance$archive$best()
x y timestamp batch_nr x_domain acq_ei
1: 0.7921811 -1.577224 2023-06-04 14:03:00 11 <list[1]> 0.0009251674
1 variable not shown: [.already_evaluated]
We see that BO comes close to the true global optimum using few function evaluations.
Visualizing the sequential decision making process of the BO algorithm (i.e., the sampling trajectory of points) shows that focus is given more and more to regions around the global optimum (Figure 13.5). Nevertheless, even in later optimization stages, exploration is performed, illustrating that the Expected Improvement indeed balances exploration and exploitation.
If we replicate running our BO algorithm ten times (with random initial designs and varying random seeds) and compare it to a random search, we can see that BO indeed performs much better and on average reaches the global optimum after around 15 function evaluations (Figure 13.6). As expected, the performance during the initial design phase is close to that of the random search.
13.4 Bayesian Optimization for Hyperparameter Optimization
mlr3mbo can be used out of the box for HPO (tuning) within the mlr3 ecosystem using TunerMbo. For illustrative purposes, we revisit the tuning example of Section 4.1.3 and perform BO instead of a grid search.
First, we construct the tuning instance:
resampling = rsmp("cv", folds = 3)
measure = msr("classif.ce")
learner = lrn("classif.svm",
cost = to_tune(1e-5, 1e5, logscale = TRUE),
gamma = to_tune(1e-5, 1e5, logscale = TRUE),
kernel = "radial",
type = "C-classification"
)
instance = ti(
task = tsk("sonar"),
learner = learner,
resampling = rsmp("cv", folds = 3),
measures = msr("classif.ce"),
terminator = trm("evals", n_evals = 25)
)
We can then simply construct a TunerMbo and use it to optimize the instance. Note that TunerMbo is simply a lightweight wrapper around OptimizerMbo.
tuner = tnr("mbo",
loop_function = bayesopt_ego,
surrogate = surrogate,
acq_function = acq_function,
acq_optimizer = acq_optimizer)
tuner$optimize(instance)
cost gamma learner_param_vals x_domain classif.ce
1: 6.050056 -4.286433 <list[4]> <list[2]> 0.1535542
We see that BO finds a substantially better hyperparameter configuration than a grid search (using the same budget):
instance = ti(
task = tsk("sonar"),
learner = learner,
resampling = rsmp("cv", folds = 3),
measures = msr("classif.ce"),
terminator = trm("evals", n_evals = 25)
)
tuner = tnr("grid_search", resolution = 5)
tuner$optimize(instance)
cost gamma learner_param_vals x_domain classif.ce
1: 5.756463 -5.756463 <list[4]> <list[2]> 0.1829538
13.5 Multi-Objective Bayesian Optimization
BO can be used to optimize not only single-objective black-box functions but also multi-objective black-box functions (recall that we already introduced multi-objective optimization in Section 4.5). Multi-objective BO algorithms can differ in many design choices, for example, whether they use a scalarization approach over the objectives and rely on only a single surrogate model, or fit a separate surrogate model for each objective. More details on multi-objective BO can be found, for example, in Horn et al. (2015) or Morales-Hernández, Van Nieuwenhuyse, and Rojas Gonzalez (2022).
We will now illustrate how ParEGO (Knowles 2006) can be used for multi-objective HPO and revisit the example of Section 4.5:
learner = lrn("classif.rpart",
cp = to_tune(1e-04, 1e-1),
minsplit = to_tune(2, 64),
maxdepth = to_tune(1, 30)
)
measures = msrs(c("classif.ce", "selected_features"))
instance = ti(
task = tsk("sonar"),
learner = learner,
resampling = rsmp("cv", folds = 5),
measures = measures,
terminator = trm("evals", n_evals = 30),
store_models = TRUE
)
ParEGO, implemented via the bayesopt_parego() loop function (mlr_loop_functions_parego), tackles multi-objective BO via a scalarization approach: it models a single scalarized objective function (note that the scalarization itself is reparameterized in every iteration) via a single surrogate model and then proceeds to find the next candidate for evaluation using a standard single-objective acquisition function such as the Expected Improvement:
surrogate = srlrn(lrn("regr.km", covtype = "matern5_2",
optim.method = "BFGS", control = list(trace = FALSE)))
acq_function = acqf("ei")
acq_optimizer = acqo(opt("random_search", batch_size = 1000),
terminator = trm("evals", n_evals = 1000))
tuner = tnr("mbo",
loop_function = bayesopt_parego,
surrogate = surrogate,
acq_function = acq_function,
acq_optimizer = acq_optimizer)
tuner$optimize(instance)
The Pareto front is visualized in Figure 13.7. Note that the number of selected features can be fractional, as in this example, it is determined through resampling and calculated as an average across the number of selected features per cross-validation fold.
An alternative approach to multi-objective BO is to model each objective function via a separate surrogate model and to find the next candidate for evaluation via an aggregating acquisition function. An example of this approach is given by the SMS-EGO algorithm (Ponweiser et al. 2008), implemented via the bayesopt_smsego() loop function (mlr_loop_functions_smsego).
Using SMS-EGO for the multi-objective tuning problem above would look like the following (specifying one surrogate model for each measure, changing the acquisition function, and changing the loop function):
surrogate = srlrnc(list(
lrn("regr.km", covtype = "matern5_2",
optim.method = "BFGS", control = list(trace = FALSE)),
lrn("regr.km", covtype = "matern5_2",
optim.method = "BFGS", control = list(trace = FALSE))))
acq_function = acqf("smsego")
acq_optimizer = acqo(opt("random_search", batch_size = 1000),
terminator = trm("evals", n_evals = 1000))
tuner = tnr("mbo",
loop_function = bayesopt_smsego,
surrogate = surrogate,
acq_function = acq_function,
acq_optimizer = acq_optimizer)
13.6 Noisy Bayesian Optimization
So far, we implicitly assumed that the black-box function we are trying to optimize is deterministic, i.e., repeatedly evaluating the same point will always return the same objective function value. Real-world black-box functions, however, are often noisy, i.e., the true signal of the black-box function is augmented by some noise, and repeatedly evaluating the same point will return different objective function values. In bbotk, the noisiness of an objective function can be indicated via the properties field of the Objective class. This makes it possible to treat such objectives differently.
sinus_1D_noisy = function(xdt) {
y = 2 * xdt$x * sin(14 * xdt$x) + rnorm(nrow(xdt), mean = 0, sd = 0.1)
data.table(y = y)
}
domain = ps(x = p_dbl(lower = 0, upper = 1))
codomain = ps(y = p_dbl(tags = "minimize"))
objective = ObjectiveRFunDt$new(sinus_1D_noisy,
domain = domain, codomain = codomain, properties = "noisy")
mlr3mbo allows for several ways in which the noisiness of objectives can be respected during BO:

- A surrogate model can be used that can model the noisiness of observations
- An acquisition function can be used that properly respects noisiness
- The final best point(s) after optimization (i.e., the $result field of the instance) can be chosen in a way that reflects noisiness
For example, instead of using an interpolating Gaussian Process, Gaussian Process regression that estimates the measurement error can be used:
srlrn(lrn("regr.km", nugget.estim = TRUE))
This will result in the Gaussian Process not perfectly interpolating the training data, and the standard deviation prediction associated with the training data will be non-zero, reflecting the uncertainty in the observed function values due to measurement error. A more in-depth discussion of noise-free vs. noisy observations in the context of Gaussian Processes can be found in Chapter 2 of Williams and Rasmussen (2006).
An example of an acquisition function that properly respects the noisiness of observations is the Augmented Expected Improvement (Huang et al. 2012), which essentially rescales the Expected Improvement, taking measurement error into account:
acqf("aei")
Finally, mlr3mbo allows for explicitly specifying how the final result after optimization is assigned to the instance (i.e., what will be written to instance$result) via a result assigner.
as.data.table(mlr_result_assigners)
key label man
1: archive Archive mlr3mbo::mlr_result_assigners_archive
2: surrogate Mean Surrogate Prediction mlr3mbo::mlr_result_assigners_surrogate
For example, ResultAssignerSurrogate (mlr_result_assigners_surrogate) will not simply pick the best point according to the evaluations logged in the archive, but will instead use a surrogate model to predict the mean of all evaluated points and choose the point with the best mean prediction as the final result.
13.7 Parallelizing Evaluations
The standard behavior of most BO algorithms is to sequentially propose a single candidate that should be evaluated next. Still, users may want to use compute resources more efficiently via parallelization. mlr3mbo offers two ways to do this:
- In the case of hyperparameter optimization, it is usually best to parallelize the evaluation of a model, i.e., the resampling. This is straightforward, as explained in Section 9.1.2.
- If the actual proposal mechanism of a BO algorithm should be parallelized, in the sense that the loop proposes a batch of candidates that should be evaluated synchronously in the next iteration, the evaluation of the objective function itself must be parallelized. Moreover, the loop_function must be able to support batch proposals of candidates; e.g., bayesopt_mpcl() (mlr_loop_functions_mpcl) and bayesopt_parego() (mlr_loop_functions_parego) support this by setting the q argument to a value larger than 1. This will result in q candidates being proposed in each iteration, which should be evaluated (synchronously) as a batch in parallel. As a result, when using, for example, an ObjectiveRFunDt as the Objective inside an OptimInstance, the data.table will contain the q candidates as rows and users must parallelize the evaluation of these candidates manually. This can easily be done relying on the future package. We provide an example in the exercises section.
13.8 Robustness
Optimization is an automatic process that should ideally not rely on manual intervention. The robustness of an optimization algorithm is almost as important as good performance. In the context of BO, there is plenty of room for individual building blocks to fail, which can result in failure of the whole algorithm. For example, if two points in the training data are too close to each other, fitting the Gaussian Process can fail.
mlr3mbo has several built-in safety nets that ensure that all kinds of errors can be caught and handled appropriately within the BO algorithm. Most importantly, every Surrogate has the catch_errors configuration parameter:
surrogate = srlrn(lrn("regr.km", covtype = "matern5_2",
optim.method = "BFGS", control = list(trace = FALSE)))
surrogate$param_set$params$catch_errors
id class lower upper levels default
1: catch_errors ParamLgl NA NA TRUE,FALSE <NoDefault[3]>
If set to TRUE, all errors that occur during the training or updating of the surrogate model are caught. The standard behavior of any loop_function is then to trigger a fallback, i.e., proposing the next candidate uniformly at random.
Similarly, every AcqOptimizer has the catch_errors configuration parameter:
acq_optimizer = acqo(opt("nloptr", algorithm = "NLOPT_GN_ORIG_DIRECT"),
terminator = trm("stagnation", iters = 100, threshold = 1e-5))
acq_optimizer$param_set$params$catch_errors
id class lower upper levels default
1: catch_errors ParamLgl NA NA TRUE,FALSE TRUE
If set to TRUE, all errors that occur during the acquisition function optimization (either due to the surrogate model failing to predict, or the acquisition function or the acquisition function optimizer erroring) are caught. Again, the standard behavior of any loop_function is then to trigger a fallback, i.e., proposing the next candidate uniformly at random. Note that when setting catch_errors = TRUE for the AcqOptimizer, it is usually not necessary to also explicitly set catch_errors = TRUE for the Surrogate. Still, this may be useful when debugging.
In the worst case (all iterations erroring), the BO algorithm will therefore simply perform a random search. Ideally, the learner wrapped within the surrogate model makes use of encapsulation and can rely on a fallback learner (see Section 4.7.1) that will jump into action before this final safety net of proposing the next candidate uniformly at random is triggered. Note that the value of the acquisition function is always logged into the archive of the optimization instance. To make sure that the BO algorithm behaved as expected, users should always inspect the log of the optimization process by looking at the archive and checking whether the acquisition function column is populated as expected. This can be done by simply inspecting the data logged in the Archive of the OptimInstance (instance$archive$data).
13.9 Practical Considerations in Bayesian Optimization
mlr3mbo tries to use ‘intelligent’ defaults regarding the choice of surrogate model, acquisition function, acquisition function optimizer and even the loop function. For example, in the case of a purely numeric search space, mlr3mbo will by default use a Gaussian Process as the surrogate model with a random forest as the fallback learner, and additionally encapsulates (see Section 4.7.1) the learner via the evaluate package. In the case of a mixed or hierarchical search space, mlr3mbo will use a random forest as the surrogate model. As defaults exist for all building blocks, users can perform BO without specifying any building blocks themselves and can still expect decent optimization performance. To see an up-to-date overview of these defaults, users should inspect the following man page: ?mbo_defaults
In practice, users may prefer a more robust BO variant over a potentially better performing but unstable variant. For example, Gaussian Processes are highly sensitive to the choice of kernel and to the kernel parameters themselves, which are usually obtained via maximum likelihood estimation. Suboptimal parameter values can result in white-noise models with a constant mean and standard deviation prediction (except for the interpolation of training data). In this case, the surrogate model will not provide useful mean and standard deviation predictions, resulting in poor overall performance of the BO algorithm. Moreover, fitting a vanilla Gaussian Process scales cubically with the number of data points, and therefore the overhead of the BO algorithm grows with the number of iterations. Besides, vanilla Gaussian Processes natively cannot handle categorical input variables or dependencies in the search space (recall that in HPO we often deal with mixed hierarchical spaces). In contrast, a random forest – popularly used as a surrogate model in SMAC, see Lindauer et al. (2022) – is cheap to train, quite robust in the sense that it is not as sensitive to its hyperparameters as a Gaussian Process, and can easily handle mixed hierarchical spaces. On the downside, a random forest is not really Bayesian (i.e., there is no posterior predictive distribution) and suffers from poor uncertainty estimates and poor extrapolation. Nevertheless, random forests usually perform quite well as surrogate models in BO.
Warmstarting is a technique in optimization where previous optimization runs are used to improve the convergence rate and final solution of a new, related optimization run. In Bayesian optimization, warmstarting can be achieved by providing a set of likely well-performing configurations as part of the initial design. This approach can be particularly advantageous because it allows the surrogate model to start with prior knowledge of the optimization landscape in relevant regions. In mlr3mbo, warmstarting is straightforward: simply specify a custom initial design. Furthermore, a convenient feature of mlr3mbo is the ability to continue optimization in an online fashion even after an optimization run has been terminated. Both OptimizerMbo and TunerMbo support this feature, allowing optimization to resume on a given instance even if the optimization was previously interrupted or terminated.
Determining when to stop an optimization run is an important practical consideration. Common termination criteria include stopping after a fixed number of function evaluations or when a given walltime budget has been reached (see also Section 4.1.2). Another option is to stop the optimization when a certain performance level is achieved or when performance improvement stagnates. In the context of BO, it can also be sensible to stop the optimization if the best acquisition function value falls below a certain threshold. For instance, terminating the optimization if the Expected Improvement of the next candidate(s) is negligible can be a reasonable approach. More practical considerations in BO can also be found in Bischl et al. (2023).
13.10 Conclusion
In this chapter, we learned how to tackle black-box optimization with Bayesian optimization. mlr3mbo is built in a modular fashion, relying on a Surrogate, AcqFunction and AcqOptimizer as well as a general loop_function that together make up the actual optimizer or tuner, constructed in the form of an OptimizerMbo or TunerMbo. If you are interested in learning more about the underlying R6 classes to gain finer control of these methods, then you are invited to take a look at the online documentation.
| S3 function | R6 Class | Summary |
|---|---|---|
| srlrn or srlrnc | SurrogateLearner or SurrogateLearnerCollection | Construct a surrogate model |
| acqf | AcqFunction | Determines an acquisition function |
| acqo | AcqOptimizer | Determines an acquisition function optimizer |
| ras | ResultAssigner | Determines a result assigner |
| loop_function | - | General description of a loop function |
Resources

A more in-depth introduction to mlr3mbo can be found in its getting started vignette.
13.11 Exercises
- Minimize the 2D Rastrigin function \(f: [-5.12, 5.12] \times [-5.12, 5.12] \rightarrow \mathbb{R}\), \(\mathbf{x} \mapsto 10 D + \sum_{i=1}^D \left[x_i^2 - 10 \cos \left(2 \pi x_i\right)\right]\), \(D = 2\), via BO (standard sequential single-objective BO via bayesopt_ego()) using the lower confidence bound with lambda = 1 as the acquisition function and "NLOPT_GN_ORIG_DIRECT" via opt("nloptr") as the acquisition function optimizer (similarly as above). Specify a budget of 40 function evaluations. Use either a Gaussian Process with Matérn 5/2 kernel ("regr.km", similarly as above) or a random forest ("regr.ranger") as the surrogate model and compare the anytime performance (similarly as in Figure 13.6) of these two BO algorithms. As an initial design, use the following points:
You can use the following function skeleton as a starting point to construct the objective function (using the ObjectiveRFunDt class):
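A possible sketch (the vectorized evaluation over columns x1 and x2 is an assumption, not the book's original skeleton):

rastrigin = function(xdt) {
  D = 2
  # vectorized Rastrigin evaluation over the rows of the data.table
  y = 10 * D + (xdt$x1 ^ 2 - 10 * cos(2 * pi * xdt$x1)) +
    (xdt$x2 ^ 2 - 10 * cos(2 * pi * xdt$x2))
  data.table(y = y)
}

objective = ObjectiveRFunDt$new(rastrigin,
  domain = ps(x1 = p_dbl(lower = -5.12, upper = 5.12),
    x2 = p_dbl(lower = -5.12, upper = 5.12)),
  codomain = ps(y = p_dbl(tags = "minimize")))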
The different surrogate models should for example look like the following:
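For instance, mirroring the learners used earlier in this chapter (the object names are merely suggestions):

surrogate_gp = srlrn(lrn("regr.km", covtype = "matern5_2",
  optim.method = "BFGS", control = list(trace = FALSE)))
surrogate_rf = srlrn(lrn("regr.ranger"))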
- Minimize the following function: \(f: [-10, 10] \rightarrow \mathbb{R}^2, x \mapsto \left(x^2, (x - 2)^2\right)\). Use the ParEGO algorithm (mlr_loop_functions_parego) in a batch setting of four candidates (q = 4) and parallelize the actual objective function evaluation using the future package (using four workers in a multisession plan). Construct the objective function using the ObjectiveRFunMany class. For illustrative reasons, suspend the execution for five seconds every time a point is to be evaluated (making use of the Sys.sleep() function). Use the following surrogate model, acquisition function and acquisition function optimizer (recall that ParEGO uses a scalarization approach to multi-objective optimization), as sketched after this item:
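A sensible choice mirrors the ParEGO setup from Section 13.5:

surrogate = srlrn(lrn("regr.km", covtype = "matern5_2",
  optim.method = "BFGS", control = list(trace = FALSE)))
acq_function = acqf("ei")
acq_optimizer = acqo(opt("random_search", batch_size = 1000),
  terminator = trm("evals", n_evals = 1000))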
Terminate the optimization after a runtime of 60 seconds. How many points did the BO algorithm evaluate (including the initial design) when properly parallelizing the evaluation of the objective function? Compare this to the number of points the BO algorithm evaluated when not parallelizing the evaluation (but still using a batch of size q = 4). Note that q = 4 must be passed to the OptimizerMbo via the args argument. You can use the following (non-parallelized) function skeleton as a starting point to construct the objective function (note that future.apply might be useful to implement the parallelization):
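A possible (non-parallelized) sketch; swapping the lapply() for future.apply::future_lapply() would parallelize it:

objective_function = function(xss) {
  # xss is a list of configurations, each a named list with element x
  res = lapply(xss, function(xs) {
    Sys.sleep(5)  # simulate an expensive evaluation
    list(y1 = xs$x ^ 2, y2 = (xs$x - 2) ^ 2)
  })
  rbindlist(res)
}

objective = ObjectiveRFunMany$new(objective_function,
  domain = ps(x = p_dbl(lower = -10, upper = 10)),
  codomain = ps(y1 = p_dbl(tags = "minimize"),
    y2 = p_dbl(tags = "minimize")))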