Commit badba20

Fixing merge conflicts for acquisition/models docs
1 parent fe75f43 commit badba20

2 files changed: +365 -397 lines changed

docs/acquisition.md

+178 -207
---
id: acquisition
title: Acquisition Functions
---

Acquisition functions are heuristics employed to evaluate the usefulness of one
or more design points for achieving the objective of maximizing the underlying
black box function.

BoTorch supports both analytic and (quasi-) Monte-Carlo based acquisition
functions. It provides a generic
[`AcquisitionFunction`](../api/acquisition.html#acquisitionfunction) API that
abstracts away from the particular type, so that optimization can be performed
on the same objects.

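For concreteness, the following minimal sketch fits a `SingleTaskGP` surrogate on toy data and evaluates an acquisition function through this generic API. It assumes a reasonably recent BoTorch release in which `fit_gpytorch_mll` is the model-fitting entry point (older releases expose `fit_gpytorch_model` instead):

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from gpytorch.mlls import ExactMarginalLogLikelihood
from botorch.acquisition import UpperConfidenceBound

# Toy training data: 10 points in a d=2 feature space.
train_X = torch.rand(10, 2, dtype=torch.double)
train_Y = (train_X * 2 * torch.pi).sin().sum(dim=-1, keepdim=True)

# Fit a GP surrogate and construct an acquisition function on top of it.
gp = SingleTaskGP(train_X, train_Y)
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
fit_gpytorch_mll(mll)
ucb = UpperConfidenceBound(gp, beta=0.2)

# Evaluate the acquisition function at 20 candidate points (q=1 each).
X = torch.rand(20, 1, 2, dtype=torch.double)  # batch_shape=20, q=1, d=2
print(ucb(X).shape)  # torch.Size([20])
```
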
## Monte Carlo Acquisition Functions

Many common acquisition functions can be expressed as the expectation of some
real-valued function of the model output(s) at the design point(s):

$$
\alpha(X) = \mathbb{E}\bigl[ a(\xi) \mid
\xi \sim \mathbb{P}(f(X) \mid \mathcal{D}) \bigr]
$$

where $X = (x_1, \dotsc, x_q)$, and $\mathbb{P}(f(X) \mid \mathcal{D})$ is the
posterior distribution of the function $f$ at $X$ given the data $\mathcal{D}$
observed so far.

Evaluating the acquisition function thus requires evaluating an integral over
the posterior distribution. In most cases, this is analytically intractable. In
particular, analytic expressions generally do not exist for batch acquisition
functions that consider multiple design points jointly (i.e. $q > 1$).

An alternative is to use Monte-Carlo (MC) sampling to approximate the integrals.
An MC approximation of $\alpha$ at $X$ using $N$ MC samples is

$$
\alpha(X) \approx \frac{1}{N} \sum_{i=1}^N a(\xi_{i})
$$

where $\xi_i \sim \mathbb{P}(f(X) \mid \mathcal{D})$.

For instance, for q-Expected Improvement (qEI), we have:

$$
\text{qEI}(X) \approx \frac{1}{N} \sum_{i=1}^N \max_{j=1,..., q}
\bigl\{ \max(\xi_{ij} - f^*, 0) \bigr\},
\qquad \xi_{i} \sim \mathbb{P}(f(X) \mid \mathcal{D})
$$

where $f^*$ is the best function value observed so far (assuming noiseless
observations). Using the reparameterization trick ([^KingmaWelling2014],
[^Rezende2014]),

$$
\text{qEI}(X) \approx \frac{1}{N} \sum_{i=1}^N \max_{j=1,..., q}
\bigl\{ \max\bigl( \mu(X)\_j + (L(X) \epsilon_i)\_j - f^*, 0 \bigr) \bigr\},
\qquad \epsilon_{i} \sim \mathcal{N}(0, I)
$$

where $\mu(X)$ is the posterior mean of $f$ at $X$, and $L(X)L(X)^T = \Sigma(X)$
is a root decomposition of the posterior covariance matrix.

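This reparameterized estimator can be written out directly in PyTorch. The sketch below is purely illustrative (it is not BoTorch's implementation); `mu`, `L`, and `best_f` stand in for the posterior mean, a root factor of the posterior covariance, and the incumbent value $f^*$:

```python
import torch

def qei_mc_estimate(mu, L, best_f, num_samples=512):
    """MC estimate of qEI at a q-batch via the reparameterization trick.

    mu:     posterior mean, shape (q,)
    L:      root of the posterior covariance, L @ L.T = Sigma, shape (q, q)
    best_f: best function value observed so far (scalar)
    """
    eps = torch.randn(num_samples, mu.shape[-1], dtype=mu.dtype)  # eps_i ~ N(0, I)
    samples = mu + eps @ L.transpose(-1, -2)                      # xi_i = mu + L @ eps_i
    improvement = (samples - best_f).clamp_min(0.0)               # max(xi_ij - f*, 0)
    return improvement.max(dim=-1).values.mean()                  # average of the per-sample max over q

# Toy example with a two-point q-batch:
mu = torch.tensor([0.1, 0.3], dtype=torch.double)
Sigma = torch.tensor([[0.20, 0.05], [0.05, 0.10]], dtype=torch.double)
L = torch.linalg.cholesky(Sigma)
print(qei_mc_estimate(mu, L, best_f=0.25))
```
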
All MC-based acquisition functions in BoTorch are derived from
[`MCAcquisitionFunction`](../api/acquisition.html#mcacquisitionfunction).

Acquisition functions expect input tensors $X$ of shape
$\textit{batch\_shape} \times q \times d$, where $d$ is the dimension of the
feature space, $q$ is the number of points considered jointly, and
$\textit{batch\_shape}$ is the batch-shape of the input tensor. The output
$\alpha(X)$ will have shape $\textit{batch\_shape}$, with each element
corresponding to the respective $q \times d$ batch tensor in the input $X$.
Note that for analytic acquisition functions, it must be that $q=1$.

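Continuing the sketch above (reusing the fitted `gp` and `train_Y`), the batch semantics look like this; `qExpectedImprovement` is the MC counterpart of the analytic EI discussed below:

```python
import torch
from botorch.acquisition import qExpectedImprovement

# `gp` and `train_Y` come from the earlier SingleTaskGP sketch.
qEI = qExpectedImprovement(model=gp, best_f=train_Y.max())

# 3 q-batches, each evaluating q=4 points jointly in d=2 dimensions.
X = torch.rand(3, 4, 2, dtype=torch.double)  # batch_shape=3, q=4, d=2
print(qEI(X).shape)  # torch.Size([3]): one acquisition value per q-batch
```
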
### MC, q-MC, and Fixed Base Samples

BoTorch relies on the re-parameterization trick and (quasi)-Monte-Carlo sampling
for optimization and estimation of the batch acquisition functions [^Wilson2017].
The results below show the reduced variance when estimating an expected
improvement (EI) acquisition function using base samples obtained via quasi-MC
sampling versus standard MC sampling.

![MC_qMC](assets/EI_MC_qMC.png)

In the plots above, the base samples used to estimate each point are resampled.
As discussed in the [Overview](./overview), a single set of base samples can be
used for optimization when the re-parameterization trick is employed. What are the
trade-offs between using a fixed set of base samples versus re-sampling on every
MC evaluation of the acquisition function? Below, we show that fixing base samples
produces functions that are potentially much easier to optimize, without resorting to
stochastic optimization methods.

![resampling_fixed](assets/EI_resampling_fixed.png)

If the base samples are fixed, the problem of optimizing the acquisition function
is deterministic, allowing for conventional quasi-second order methods such as
L-BFGS or sequential least-squares programming (SLSQP) to be used. These have
faster convergence rates than first-order methods and can speed up acquisition
function optimization significantly.

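As a hedged sketch of how base samples are controlled in practice (reusing `gp` and `train_Y` from the earlier sketch; note that the sampler constructor has changed across BoTorch releases, with recent versions taking a `sample_shape` argument and older ones taking `num_samples` and a `resample` flag), one can pass an explicit quasi-MC sampler with a fixed seed so that repeated evaluations of the acquisition function reuse the same base samples:

```python
import torch
from botorch.acquisition import qExpectedImprovement
from botorch.sampling import SobolQMCNormalSampler

# Seeded Sobol base samples -> a deterministic acquisition surface that
# standard optimizers such as L-BFGS-B or SLSQP can handle.
sampler = SobolQMCNormalSampler(sample_shape=torch.Size([512]), seed=0)
qEI = qExpectedImprovement(model=gp, best_f=train_Y.max(), sampler=sampler)
```
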
One concern is that the approximated acquisition function is *biased* for any
fixed set of base samples, which may adversely affect the solution. However, we
find that in practice, both the optimal value and the optimal solution of these
biased problems for standard acquisition functions converge quite rapidly to
their true counterparts as more samples are used. Note that for evaluation of
the acquisition function we integrate over a $qo$-dimensional space (where
$q$ is the number of points in the q-batch and $o$ is the number of outputs
included in the objective). Therefore, the MC integration problem can be quite
low-dimensional even for models on high-dimensional feature spaces (large $d$).
Because using additional samples is relatively cheap computationally,
we default to 500 base samples in the MC acquisition functions.

On the other hand, when re-sampling is used in conjunction with a stochastic
optimization algorithm, the kind of bias noted above is no longer a concern.
The trade-off here is that the optimization may be less effective, as discussed
above.

## Analytic Acquisition Functions

BoTorch also provides implementations of analytic acquisition functions that
do not depend on MC sampling. These acquisition functions are subclasses of
[`AnalyticAcquisitionFunction`](../api/acquisition.html#analyticacquisitionfunction)
and only exist for the case of a single candidate point ($q = 1$). These
include classical acquisition functions such as Expected Improvement (EI),
Upper Confidence Bound (UCB), and Probability of Improvement (PI). An example
comparing [`ExpectedImprovement`](../api/acquisition.html#expectedimprovement),
the analytic version of EI, to its MC counterpart
[`qExpectedImprovement`](../api/acquisition.html#qexpectedimprovement)
can be found in
[this tutorial](../tutorials/compare_mc_analytic_acquisition).

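A minimal side-by-side sketch (again reusing `gp` and `train_Y` from the earlier sketches; the qEI value approaches the analytic EI value as the number of MC samples grows):

```python
import torch
from botorch.acquisition import ExpectedImprovement, qExpectedImprovement

EI = ExpectedImprovement(model=gp, best_f=train_Y.max())
qEI = qExpectedImprovement(model=gp, best_f=train_Y.max())

X = torch.rand(1, 1, 2, dtype=torch.double)  # a single q-batch with q=1
print(EI(X), qEI(X))  # analytic value vs. its MC approximation
```
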
Analytic acquisition functions allow for an explicit expression in terms of the
summary statistics of the posterior distribution at the evaluated point(s).
A popular acquisition function is Expected Improvement of a single point
for a Gaussian posterior, given by

$$
\text{EI}(x) = \mathbb{E}\bigl[
\max(y - f^*, 0) \mid y\sim \mathcal{N}(\mu(x), \sigma^2(x))
\bigr]
$$

where $\mu(x)$ and $\sigma(x)$ are the posterior mean and standard deviation of
$f$ at the point $x$, and $f^*$ is again the best function value observed so far
(assuming noiseless observations). It can be shown that

$$
\text{EI}(x) = \sigma(x) \bigl( z \Phi(z) + \varphi(z) \bigr)
$$

where $z = \frac{\mu(x) - f^*}{\sigma(x)}$ and $\Phi$ and $\varphi$ are
the cdf and pdf of the standard normal distribution, respectively.

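To make the closed form concrete, here is a small standalone sketch that evaluates this expression with `torch.distributions`, using plain tensors for $\mu(x)$, $\sigma(x)$, and $f^*$ (an illustration, not BoTorch's `ExpectedImprovement` implementation):

```python
import torch
from torch.distributions import Normal

def analytic_ei(mu, sigma, best_f):
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2) at a single point."""
    standard_normal = Normal(torch.zeros_like(mu), torch.ones_like(mu))
    z = (mu - best_f) / sigma
    return sigma * (z * standard_normal.cdf(z) + standard_normal.log_prob(z).exp())

print(analytic_ei(torch.tensor(0.30), torch.tensor(0.20), best_f=torch.tensor(0.25)))
```
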
With some additional work, it is also possible to express the gradient of
the Expected Improvement with respect to the design $x$. Classic Bayesian
Optimization software will implement this gradient function explicitly, so that
it can be used for numerically optimizing the acquisition function.

BoTorch, in contrast, harnesses PyTorch's automatic differentiation feature
("autograd") in order to obtain gradients of acquisition functions. This makes
implementing new acquisition functions much less cumbersome, as it does not
require analytically deriving gradients. All that is required is that the
operations performed in the acquisition function computation allow for the
back-propagation of gradient information through the posterior and the model.

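As an illustration of relying on autograd instead of hand-derived gradients, the hedged sketch below maximizes an acquisition function over a candidate location with a plain PyTorch optimizer, reusing the `ucb` object from the first sketch. In practice one would typically call `botorch.optim.optimize_acqf`, which also handles bounds, multiple restarts, and L-BFGS-B:

```python
import torch

X = torch.rand(1, 2, dtype=torch.double, requires_grad=True)  # a single q=1 candidate
optimizer = torch.optim.Adam([X], lr=0.01)

for _ in range(100):
    optimizer.zero_grad()
    loss = -ucb(X.unsqueeze(0)).sum()  # negate: we want to maximize the acquisition value
    loss.backward()                    # gradients come from autograd, not a hand-coded formula
    optimizer.step()
    with torch.no_grad():
        X.clamp_(0.0, 1.0)             # keep the candidate inside the unit cube
```
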
[^KingmaWelling2014]: D. P. Kingma, M. Welling. Auto-Encoding Variational Bayes.
ICLR, 2014.

[^Rezende2014]: D. J. Rezende, S. Mohamed, D. Wierstra. Stochastic
Backpropagation and Approximate Inference in Deep Generative Models. ICML, 2014.

[^Wilson2017]: J. T. Wilson, R. Moriconi, F. Hutter, M. P. Deisenroth.
The Reparameterization Trick for Acquisition Functions. NeurIPS Workshop on
Bayesian Optimization, 2017.

## Latent Information Gain

In the high-dimensional spatiotemporal domain, Expected Information Gain becomes
less informative for selecting useful observations, and it can be difficult to
calculate its parameters. To overcome these limitations, we propose a novel
acquisition function that computes the expected information gain in the latent
space rather than in the observational space. To design this acquisition
function, we prove the equivalence between the expected information gain in the
observational space and the expected KL divergence in the latent processes with
respect to a candidate parameter $\theta$, as illustrated by the following
proposition.

**Proposition 1.** The expected information gain (EIG) for a Neural
Process is equivalent to the KL divergence between the prior and
posterior in the latent process, that is

$$
\text{EIG}(\hat{x}_{1:T}, \theta) := \mathbb{E} \left[ H(\hat{x}_{1:T}) -
H(\hat{x}_{1:T} \mid z_{1:T}, \theta) \right]
= \mathbb{E}_{p(\hat{x}_{1:T} \mid \theta)}
\text{KL} \left( p(z_{1:T} \mid \hat{x}_{1:T}, \theta) \,\|\, p(z_{1:T}) \right)
$$

Inspired by this fact, we propose a novel acquisition function that computes the
expected KL divergence in the latent processes, and we name it Latent Information
Gain (LIG) [^Wu2023arxiv]. Specifically, the trained NP model produces a
variational posterior given the current dataset. For every parameter $\theta$
remaining in the search space, we can predict $\hat{x}_{1:T}$ with the decoder.
We use $\hat{x}_{1:T}$ and $\theta$ as input to the encoder to re-evaluate the
posterior. LIG computes the distributional difference with respect to the latent
process.

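The computation described above can be sketched as follows. This is only a schematic illustration of the idea, not a BoTorch API: `decoder`, `encoder`, `current_posterior`, and `prior` are placeholders for the trained Neural Process components and its latent distributions, assumed here to be `torch.distributions` objects:

```python
import torch
from torch.distributions import kl_divergence

def latent_information_gain(decoder, encoder, current_posterior, prior, theta, num_samples=64):
    """Schematic LIG: expected KL between the re-evaluated latent posterior and the
    latent prior for a candidate parameter `theta` (all components are placeholders
    for a trained Neural Process model)."""
    total = torch.zeros(())
    for _ in range(num_samples):
        z = current_posterior.rsample()        # latent path z_{1:T} from the current posterior
        x_hat = decoder(z, theta)              # predicted observations \hat{x}_{1:T} for theta
        new_posterior = encoder(x_hat, theta)  # re-evaluated latent posterior
        total = total + kl_divergence(new_posterior, prior).sum()
    return total / num_samples
```
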
[^Wu2023arxiv]: D. Wu, R. Niu, M. Chinazzi, A. Vespignani, Y.-A. Ma, R. Yu.
Deep Bayesian Active Learning for Accelerating Stochastic Simulation.
arXiv preprint arXiv:2106.02770, 2023. https://arxiv.org/abs/2106.02770
