---
pagetitle: Embedded Laplace Approximation
---

# Embedded Laplace Approximation

The embedded Laplace approximation can be used to approximate certain
marginal and conditional distributions that arise in latent Gaussian models.
A latent Gaussian model has the following hierarchical structure:
$$
  \phi \sim p(\phi), \ \ \theta \sim \text{MultiNormal}(0, K(\phi)), \ \
  y \sim p(y \mid \theta, \phi),
$$
where $K(\phi)$ denotes the prior covariance matrix parameterized by $\phi$.
To draw samples from the posterior $p(\phi, \theta \mid y)$, we can either
use a standard method, such as Markov chain Monte Carlo, or we can follow
a two-step procedure:

1. draw samples from the *marginal posterior* $p(\phi \mid y)$,
2. draw samples from the *conditional posterior* $p(\theta \mid y, \phi)$.

In practice, neither the marginal posterior nor the conditional posterior
is available in closed form, so both must be approximated.
It turns out that if we have an approximation of $p(\theta \mid y, \phi)$,
we immediately obtain an approximation of $p(\phi \mid y)$.
The embedded Laplace approximation returns
$\log \hat p(y \mid \phi) \approx \log p(y \mid \phi)$.
Evaluating this log density in the `model` block, we can then sample from
$p(\phi \mid y)$ using one of Stan's algorithms.

To obtain posterior draws for $\theta$, we generate samples from the Laplace
approximation to $p(\theta \mid y, \phi)$ in `generated quantities`.

## Specifying the likelihood function

The first step is to write down a function in the `functions` block that
returns $\log p(y \mid \theta, \phi)$. There are a few constraints on this
function:

* The function return type must be `real`.

* The first argument must be the latent Gaussian variable $\theta$ and must
have type `vector`.

* The operations in the function must support higher-order automatic
differentiation (AD). Most functions in Stan support higher-order AD.
The exceptions are functions with specialized calls for reverse-mode AD,
namely the higher-order functions (algebraic solvers, differential equation
solvers, and integrators) and the suite of hidden Markov model (HMM) functions.

The signature of the function is
```
real ll_function(vector theta, ...)
```
There are no type restrictions on the variadic arguments `...`, and each
argument can be passed as data or as a parameter. As always, users should
pass arguments as parameters only when necessary in order to speed up
differentiation. In general, we recommend marking data-only arguments with
the keyword `data`, for example,
```
real ll_function(vector theta, data vector x, ...)
```
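For example, a Poisson likelihood in which each count $y_i$ has log rate
$\theta_i + \log E_i$, for known exposures $E_i$, could be written as below.
This is only an illustrative sketch; the function and argument names are
placeholders, not built-in Stan functions.
```
real poisson_log_ll(vector theta, data array[] int y, data vector log_exposure) {
  // log p(y | theta, phi): conditionally independent Poisson counts,
  // with log rate theta[i] + log_exposure[i]
  return poisson_log_lpmf(y | theta + log_exposure);
}
```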

## Specifying the covariance function

We next need to specify a function that returns the prior covariance matrix
$K$ as a function of the hyperparameters $\phi$.
The only restriction is that this function returns a matrix of size
$n \times n$, where $n$ is the size of $\theta$. The signature is:
```
matrix K_function(...)
```
There are no type restrictions on the variadic arguments. The variable $\phi$
is implicitly defined as the collection of all non-data arguments passed to
`ll_function` (excluding $\theta$) and `K_function`.
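
For example, a squared exponential kernel over one-dimensional covariates `x`,
with hyperparameters `alpha` and `rho`, might be written as follows. The
function name, the argument names, and the small diagonal jitter are
illustrative choices, not requirements.
```
matrix se_K(data array[] real x, real alpha, real rho) {
  int n = size(x);
  // squared exponential (exponentiated quadratic) kernel
  matrix[n, n] K = gp_exp_quad_cov(x, alpha, rho);
  // small jitter on the diagonal for numerical stability
  for (i in 1:n) {
    K[i, i] += 1e-8;
  }
  return K;
}
```
With this covariance function and the Poisson likelihood sketched above,
$\phi$ consists of `alpha` and `rho`, provided both are declared as parameters.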


## Approximating the log marginal likelihood $\log p(y \mid \phi)$

In the `model` block, we increment `target` with `laplace_marginal`, a function
that approximates $\log p(y \mid \phi)$. This function takes in the
user-specified likelihood and covariance functions, as well as their arguments.
These arguments must be passed as tuples, which can be generated on the fly
using parentheses.
We also need to pass an argument $\theta_0$, which serves as an initial guess for
the optimization problem that underlies the Laplace approximation,
$$
  \underset{\theta}{\text{argmax}} \ \log p(\theta \mid y, \phi).
$$
The size of $\theta_0$ must be consistent with the size of the $\theta$ argument
passed to `ll_function`.

The signature of the function is:
```
target += laplace_marginal(function ll_function, tuple (...), vector theta_0,
                           function K_function, tuple (...));
```
The tuple `(...)` after `ll_function` contains the arguments that get passed
to `ll_function` *excluding $\theta$*. Likewise, the tuple `(...)` after
`K_function` contains the arguments that get passed to `K_function`.
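
Continuing the illustrative example, and assuming `theta_0` has been defined
(for instance as a vector of zeros in `transformed data`), the `model` block
might read:
```
model {
  // priors p(phi) for the hyperparameters phi = (alpha, rho);
  // these particular priors are only an example
  alpha ~ normal(0, 1);
  rho ~ inv_gamma(5, 5);
  // approximate log marginal likelihood log p(y | phi)
  target += laplace_marginal(poisson_log_ll, (y, log_exposure), theta_0,
                             se_K, (x, alpha, rho));
}
```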

It is also possible to specify control parameters, which can help improve the
optimization that underlies the Laplace approximation. Specifically:

* `tol`: the tolerance $\epsilon$ of the optimizer. Specifically, the optimizer
stops when $||\nabla \log p(\theta \mid y, \phi)|| \le \epsilon$. By default,
the value is $\epsilon = 10^{-6}$.

* `max_num_steps`: the maximum number of steps taken by the optimizer before
it gives up (in which case the Metropolis proposal gets rejected). The default
is 100 steps.

* `hessian_block_size`: the size of the blocks, assuming the Hessian
$\partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$ is block diagonal.
The structure of the Hessian is determined by the dependence structure of $y$
on $\theta$. By default, the Hessian is treated as diagonal
(`hessian_block_size=1`). If the Hessian is not block diagonal, then set
`hessian_block_size=n`, where `n` is the size of $\theta$.

* `solver`: choice of Newton solver. The optimizer used to compute the
Laplace approximation does one of three matrix decompositions to compute a
Newton step. The problem determines which decomposition is numerically stable.
By default (`solver=1`), the solver makes a Cholesky decomposition of the
negative Hessian, $-\partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$.
If `solver=2`, the solver makes a Cholesky decomposition of the covariance
matrix $K(\phi)$.
If the Cholesky decomposition cannot be computed for either the negative
Hessian or the covariance matrix, use `solver=3`, which uses a more expensive
but less specialized approach.

* `max_steps_linesearch`: the maximum number of steps in the linesearch. The
linesearch method tries to ensure that the Newton step leads to a decrease in
the objective function. If the Newton step does not improve the objective
function, the step is repeatedly halved until the objective function decreases
or the maximum number of steps in the linesearch is reached. By default,
`max_steps_linesearch=0`, meaning no linesearch is performed.

With these arguments at hand, we can call `laplace_marginal_tol` with the
following signature:
```
target += laplace_marginal_tol(function ll_function, tuple (...), vector theta_0,
                               function K_function, tuple (...),
                               real tol, int max_num_steps,
                               int hessian_block_size, int solver,
                               int max_steps_linesearch);
```
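For instance, the earlier call could request a tighter tolerance while keeping
the other controls at their defaults; the values below are only an example:
```
target += laplace_marginal_tol(poisson_log_ll, (y, log_exposure), theta_0,
                               se_K, (x, alpha, rho),
                               1e-8,  // tol
                               100,   // max_num_steps
                               1,     // hessian_block_size
                               1,     // solver
                               0);    // max_steps_linesearch
```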

## Draw approximate samples from the conditional $p(\theta \mid y, \phi)$

In `generated quantities`, it is possible to draw samples from the Laplace
approximation of $p(\theta \mid \phi, y)$ using `laplace_latent_rng`.
The process of iteratively drawing from $p(\phi \mid y)$ (say, with MCMC) and
then $p(\theta \mid y, \phi)$ produces samples from the joint posterior
$p(\theta, \phi \mid y)$. The signature for `laplace_latent_rng` closely
follows the signature for `laplace_marginal`:
```
vector theta =
  laplace_latent_rng(function ll_function, tuple (...), vector theta_0,
                     function K_function, tuple (...));
```
Once again, it is possible to specify control parameters:
```
vector theta =
  laplace_latent_tol_rng(function ll_function, tuple (...), vector theta_0,
                         function K_function, tuple (...),
                         real tol, int max_num_steps, int hessian_block_size,
                         int solver, int max_steps_linesearch);
```
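In the running example, with `n` the size of $\theta$ (assumed to be declared
in `data`), the `generated quantities` block might read:
```
generated quantities {
  // one approximate draw from p(theta | y, phi) per iteration
  vector[n] theta = laplace_latent_rng(poisson_log_ll, (y, log_exposure), theta_0,
                                       se_K, (x, alpha, rho));
}
```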

## Built-in likelihood functions for the embedded Laplace

Stan supports a narrow menu of built-in likelihood functions. These wrappers
exist for the user's convenience but are not more computationally efficient
than specifying log likelihoods in the `functions` block.

[...]

## Draw approximate samples for out-of-sample latent variables

In many applications, it is of interest to draw latent variables for
in-sample and out-of-sample predictions. We respectively denote these latent
variables $\theta$ and $\theta^*$. In a latent Gaussian model,
$(\theta, \theta^*)$ jointly follow a prior multivariate normal distribution:
$$
  \theta, \theta^* \sim \text{MultiNormal}(0, {\bf K}(\phi)),
$$
where $\bf K$ designates the joint covariance matrix over $\theta, \theta^*$.

We can break $\bf K$ into three components,
$$
{\bf K} = \begin{bmatrix}
  K & (K^*)^T \\
  K^* & K^{**}
\end{bmatrix},
$$
where $K$ is the prior covariance matrix for $\theta$, $K^{**}$ the prior
covariance matrix for $\theta^*$, and $K^*$ the covariance matrix between
$\theta$ and $\theta^*$.

Stan supports the case where $\theta$ is associated with an in-sample
covariate $X$ and $\theta^*$ with an out-of-sample covariate $X^*$.
Furthermore, the covariance function is written in such a way that
$$
K = f(..., X, X), \ \ K^{**} = f(..., X^*, X^*), \ \ K^* = f(..., X, X^*),
$$
as is typically the case in Gaussian process models.
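
A covariance function of this form might be written as below, again using a
squared exponential kernel; accepting two sets of covariates makes it possible
to compute $K$, $K^{**}$, and $K^*$ with the same code. As before, the names
are illustrative.
```
matrix joint_K(data array[] real x1, data array[] real x2, real alpha, real rho) {
  // K   = joint_K(x, x, ...),  K** = joint_K(x_pred, x_pred, ...),
  // K*  = joint_K(x, x_pred, ...)
  return gp_exp_quad_cov(x1, x2, alpha, rho);
}
```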

The function `laplace_latent_rng` produces samples from the Laplace
approximation and admits nearly the same arguments as `laplace_marginal`.
A key difference is that
```
vector laplace_latent_rng(function ll_function, tuple (...), vector theta_0,
                          function K_function, tuple (...));
```