Commit 6b47dd5

Merge pull request #762 from stan-dev/issue/761-effeciency-tuning

Issue/761 efficiency tuning

2 parents 058ffb3 + 99495ce commit 6b47dd5

File tree

3 files changed: +119 -117 lines changed

src/reference-manual/transforms.qmd

Lines changed: 5 additions & 1 deletion

@@ -14,7 +14,9 @@ with lower and/or upper bounds. Vectors may alternatively be
 constrained to be ordered, positive ordered, or simplexes. Matrices
 may be constrained to be correlation matrices or covariance matrices.
 This chapter provides a definition of the transforms used for each
-type of variable.
+type of variable. For examples of how to declare and define these
+variables in a Stan program, see section
+[Variable declaration](types.qmd#variable-declaration.section).
 
 Stan converts models to C++ classes which define probability
 functions with support on all of $\mathbb{R}^K$, where $K$ is the number

@@ -335,6 +337,8 @@ p_Y(y)
 \cdot \sigma.
 $$
 
+For an example of how to code this in Stan, see section [Affinely Transformed Real](types.qmd#affine-transform.section).
+
 
 ## Ordered vector
 
src/reference-manual/types.qmd

Lines changed: 2 additions & 2 deletions

@@ -402,7 +402,7 @@ positive infinity are ignored. Stan provides constants
 be used for this purpose, or they may be supplied as data.
 
 
-### Affinely transformed real {-}
+### Affinely transformed real {#affine-transform.section}
 
 Real variables may be declared on a space that has been transformed using an
 affine transformation $x\mapsto \mu + \sigma * x$ with offset $\mu$ and

@@ -1518,7 +1518,7 @@ vectors.
 
 
 
-## Variable declaration
+## Variable declaration {#variable-declaration.section}
 
 Variables in Stan are declared by giving a type and a name. For example

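The affine transformation $x \mapsto \mu + \sigma * x$ named in this hunk maps an unconstrained value onto the offset/multiplier scale; the log absolute Jacobian, $\log \sigma$, is the correction applied to the target density. A minimal Python sketch of the transform, its inverse, and the Jacobian term (the helper names here are illustrative, not part of Stan's API):

```python
import math

def affine(x, mu, sigma):
    """Affine transform x -> mu + sigma * x (offset mu, multiplier sigma)."""
    return mu + sigma * x

def affine_inv(y, mu, sigma):
    """Inverse transform: recover the unconstrained value."""
    return (y - mu) / sigma

def affine_log_jacobian(sigma):
    """Log absolute Jacobian, log |d(mu + sigma * x)/dx| = log sigma."""
    return math.log(sigma)

y = affine(1.5, 10.0, 2.0)     # 13.0
x = affine_inv(y, 10.0, 2.0)   # back to 1.5
```

For $\sigma = 1$ the Jacobian term vanishes, which is why a plain `real` declaration needs no correction.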
src/stan-users-guide/efficiency-tuning.qmd

Lines changed: 112 additions & 114 deletions

@@ -78,8 +78,8 @@ better or optimization algorithms require less adaptation.
 
 ## Model conditioning and curvature
 
-Because Stan's algorithms (other than Riemannian Hamiltonian Monte
-Carlo) rely on step-based gradient-based approximations of the density
+Because Stan's algorithms rely on step-based gradient-based
+approximations of the density
 (or penalized maximum likelihood) being fitted, posterior curvature
 not captured by this first-order approximation plays a central role in
 determining the statistical efficiency of Stan's algorithms.
@@ -120,19 +120,16 @@ scale and so that posterior correlation is reduced; together, these
 properties mean that there is no rotation or scaling required for
 optimal performance of Stan's algorithms. For Hamiltonian Monte
 Carlo, this implies a unit mass matrix, which requires no adaptation
-as it is where the algorithm initializes. Riemannian Hamiltonian
-Monte Carlo performs this conditioning on the fly at every step, but
-such conditioning is expensive computationally.
+as it is where the algorithm initializes.
 
 ### Varying curvature {-}
 
 In all but very simple models (such as multivariate normals), the
 Hessian will vary as $\theta$ varies (an extreme example is Neal's
 funnel, as naturally arises in hierarchical models with little or no
 data). The more the curvature varies, the harder it is for all of the
-algorithms with fixed adaptation parameters (that is, everything but
-Riemannian Hamiltonian Monte Carlo) to find adaptations that cover the
-entire density well. Many of the variable transforms proposed are
+algorithms with fixed adaptation parameters to find adaptations that
+cover the entire density well. Many of the variable transforms proposed are
 aimed at improving the conditioning of the Hessian and/or making it
 more consistent across the relevant portions of the density (or
 penalized maximum likelihood function) being fit.
@@ -339,7 +336,7 @@ transformed parameter.
 
 Sampling from heavy tailed distributions such as the Cauchy is
 difficult for Hamiltonian Monte Carlo, which operates within a
-Euclidean geometry.^[Riemannian Manifold Hamiltonian Monte Carlo (RMHMC) overcomes this difficulty by simulating the Hamiltonian dynamics in a space with a position-dependent metric; see @GirolamiCalderhead:2011 and @Betancourt:2012.]
+Euclidean geometry.
 
 The practical problem is that the tail of the Cauchy
 requires a relatively large step size compared to the trunk. With a
@@ -453,7 +450,7 @@ $$
 \times
 \textsf{Gamma}\left(\tau \middle| \nu/2, \nu/2\right)
 \
-\textsf{d} \tau.
+\text{d} \tau.
 $$
 
 
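The integral touched by this hunk is the scale-mixture representation of the Student-t distribution: mixing a normal over a $\textsf{Gamma}(\nu/2, \nu/2)$ precision yields a t density with $\nu$ degrees of freedom. As a sanity check for the standard case ($\mu = 0$, $\sigma = 1$, an assumption not shown in the hunk itself), the mixture can be integrated numerically in Python and compared against the closed-form density:

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of normal(mu, sigma) at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gamma_pdf(tau, alpha, beta):
    """Density of Gamma(alpha, beta) (shape/rate parameterization) at tau > 0."""
    return beta ** alpha * tau ** (alpha - 1.0) * math.exp(-beta * tau) / math.gamma(alpha)

def student_t_pdf(y, nu):
    """Closed-form standard Student-t density with nu degrees of freedom."""
    c = math.gamma((nu + 1.0) / 2.0) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2.0))
    return c * (1.0 + y * y / nu) ** (-(nu + 1.0) / 2.0)

def mixture_pdf(y, nu, steps=50000, upper=60.0):
    """Trapezoidal approximation of the integral over tau of
    normal(y | 0, tau^{-1/2}) * Gamma(tau | nu/2, nu/2) dtau.
    The integrand vanishes at tau = 0, so that endpoint contributes nothing."""
    h = upper / steps
    total = 0.0
    for i in range(1, steps + 1):
        tau = i * h
        f = normal_pdf(y, 0.0, 1.0 / math.sqrt(tau)) * gamma_pdf(tau, nu / 2.0, nu / 2.0)
        total += 0.5 * f if i == steps else f
    return total * h

approx = mixture_pdf(1.0, 4.0)
exact = student_t_pdf(1.0, 4.0)   # both come out near 0.2147
```

This is the identity the surrounding section exploits when reparameterizing a Student-t as a normal with a gamma-distributed precision.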
@@ -523,8 +520,8 @@ distribution approaches a normal distribution. Thus the parameter
 
 Unfortunately, the usual situation in applied Bayesian modeling
 involves complex geometries and interactions that are not known
-analytically. Nevertheless, reparameterization can still be
-effective for separating parameters.
+analytically. Nevertheless, the non-centered parameterization
+can still be effective for separating parameters.
 
 #### Centered parameterization {-}
 
@@ -577,8 +574,11 @@ effective sample size when there is not much data (see
 @Betancourt-Girolami:2013), and in more extreme cases will be
 necessary to achieve convergence.
 
+
 ```stan
 parameters {
+  real mu_beta;
+  real<lower=0> sigma_beta;
   vector[K] beta_raw;
   // ...
 }
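The non-centered parameterization in this hunk samples a standard-normal `beta_raw` and reconstructs `beta = mu_beta + sigma_beta * beta_raw` rather than sampling `beta` directly. A quick change-of-variables check in Python (illustrative values only) confirms the two parameterizations define the same density on `beta`, up to the constant Jacobian term $\log \sigma$:

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of normal(mu, sigma) at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

mu_beta, sigma_beta = 3.0, 2.5
beta_raw = 0.7                          # a draw on the "raw" (standardized) scale
beta = mu_beta + sigma_beta * beta_raw  # non-centered reconstruction of beta

# centered log density of beta ...
centered = normal_lpdf(beta, mu_beta, sigma_beta)
# ... equals the standard-normal log density of beta_raw plus the
# log absolute Jacobian of beta = mu + sigma * beta_raw, i.e. -log(sigma_beta)
non_centered = normal_lpdf(beta_raw, 0.0, 1.0) - math.log(sigma_beta)

assert abs(centered - non_centered) < 1e-12
```

Because the Jacobian term does not depend on `beta_raw`, Stan can drop it, which is why the non-centered model needs no explicit correction.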
@@ -596,6 +596,23 @@ model {
 Any priors defined for `mu_beta` and `sigma_beta` remain as
 defined in the original model.
 
+Alternatively, Stan's
+[affine transform](https://mc-stan.org/docs/reference-manual/types.html#affinely-transformed-real)
+can be used to decouple `sigma_beta` and `beta`:
+
+```stan
+parameters {
+  real mu_beta;
+  real<lower=0> sigma_beta;
+  vector<offset=mu_beta, multiplier=sigma_beta>[K] beta;
+  // ...
+}
+model {
+  beta ~ normal(mu_beta, sigma_beta);
+  // ...
+}
+```
+
 Reparameterization of hierarchical models is not limited to the normal
 distribution, although the normal distribution is the best candidate
 for doing so. In general, any distribution of parameters in the
@@ -1279,14 +1296,21 @@ To make the model even more efficient, a transformed data variable
 defined to be `sum(y)` could be used in the place of `sum(y)`.
 
 
-## Standardizing predictors and outputs
+## Standardizing predictors
+
+Standardizing the data so that all predictors have a zero sample mean and
+unit sample variance has the following potential benefits:
+
+* It helps in faster convergence of MCMC chains.
+* It makes the model less sensitive to the specifics of the parameterization.
+* It aids in the interpretation and comparison of the importance of coefficients across different predictors.
 
-Stan programs will run faster if the input is standardized to have a
-zero sample mean and unit sample variance. This section illustrates
-the principle with a simple linear regression.
+When there are large differences between the units and scales of the predictors,
+standardizing the predictors is especially useful.
+This section illustrates the principle for a simple linear regression.
 
-Suppose that $y = (y_1,\dotsc,y_N)$ is a sequence of $N$ outcomes and
-$x = (x_1,\dotsc,x_N)$ a parallel sequence of $N$ predictors. A
+Suppose that $y = (y_1,\dotsc,y_N)$ is a vector of $N$ outcomes and
+$x = (x_1,\dotsc,x_N)$ the corresponding vector of $N$ predictors. A
 simple linear regression involving an intercept coefficient $\alpha$
 and slope coefficient $\beta$ can be expressed as
 $$
@@ -1297,39 +1321,43 @@ $$
 \epsilon_n \sim \textsf{normal}(0,\sigma).
 $$
 
-If either vector $x$ or $y$ has very large or very small values or if the
-sample mean of the values is far away from 0 (on the scale of the values),
-then it can be more efficient to standardize the outputs $y_n$ and
-predictors $x_n$. The data are first centered by subtracting the
-sample mean, and then scaled by dividing by the sample deviation.
-Thus a data point $u$ is standardized with respect to
-a vector $y$ by the function $\textsf{z}_y$, defined by
+If $x$ has very large or very small values
+or if the mean of the values is far away from 0 (on the scale of the values),
+then it can be more efficient to standardize the predictor values $x_n$.
+First the elements of $x$ are zero-centered by subtracting the mean,
+then scaled by dividing by the standard deviation.
+
+The mean of $x$ is given by:
+
 $$
-\textsf{z}_y(u) = \frac{u - \bar{y}}{\texttt{sd}(y)}
+\textrm{mean}_x = \frac{1}{N} \sum_{n=1}^{N} x_n
 $$
-where the sample mean of $y$ is
+
+The standard deviation of $x$ is calculated as:
 $$
-\bar{y}
-= \frac{1}{N} \sum_{n=1}^N y_n,
+\textrm{sd}_x = \left(\frac{1}{N} \sum_{n=1}^{N} (x_n - \textrm{mean}_x)^2\right)^{1/2}
 $$
-and the sample standard deviation of $y$ is
+
+With these, we compute $z$, the standardized predictors,
+
 $$
-\texttt{sd}(y)
-= \left(
-\frac{1}{N} \sum_{n=1}^N (y_n - \bar{y})^2
-\right)^{1/2}.
+z_n = \frac{x_n - \textrm{mean}_x}{\textrm{sd}_x}
 $$
+
+where $z_n$ is the standardized value corresponding to $x_n$.
+
 The inverse transform is
 defined by reversing the two normalization steps, first rescaling by
-the same deviation and relocating by the sample mean,
+the same deviation and relocating by the sample mean.
+
 $$
-\textrm{z}_y^{-1}(v) = \texttt{sd}(y) v + \bar{y}.
+x_n = z_n \, \textrm{sd}_x + \textrm{mean}_x
 $$
 
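The standardization and inverse transform defined in this hunk can be checked with a few lines of Python (illustrative data, using the $1/N$ variance convention of the formulas above):

```python
import math

x = [1.2, 5.0, 3.3, 10.1, 7.4]
N = len(x)

mean_x = sum(x) / N
sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / N)  # 1/N convention, matching the text

z = [(v - mean_x) / sd_x for v in x]        # standardized predictors
x_back = [zn * sd_x + mean_x for zn in z]   # inverse transform

# standardized values have zero mean and unit variance, and the inverse round-trips
assert abs(sum(z) / N) < 1e-12
assert abs(sum(zn * zn for zn in z) / N - 1.0) < 1e-12
assert all(abs(a - b) < 1e-12 for a, b in zip(x, x_back))
```

Note that Stan's built-in `sd()` divides by $N - 1$ rather than $N$; either convention works here since the same `sd_x` is used for both directions of the transform.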
-To standardize a regression problem, the predictors and outcomes are
-standardized. This changes the scale of the variables, and hence
-changes the scale of the priors. Consider the following initial
-model.
+Standardizing the predictors changes the scale of the variables,
+and hence the scale of the priors.
+
+Consider the following initial model.
 
 ```stan
 data {
@@ -1346,18 +1374,15 @@ model {
   // priors
   alpha ~ normal(0, 10);
   beta ~ normal(0, 10);
-  sigma ~ cauchy(0, 5);
+  sigma ~ normal(0, 5);
   // likelihood
-  for (n in 1:N) {
-    y[n] ~ normal(alpha + beta * x[n], sigma);
-  }
+  y ~ normal(x * beta + alpha, sigma);
 }
 ```
 
-
-The data block for the standardized model is identical. The
-standardized predictors and outputs are defined in the transformed
-data block.
+The data block for the standardized model is identical.
+The mean and standard deviation of the data are defined
+in the transformed data block, along with the standardized predictors.
 
 ```stan
 data {
@@ -1366,10 +1391,9 @@ data {
   vector[N] x;
 }
 transformed data {
-  vector[N] x_std;
-  vector[N] y_std;
-  x_std = (x - mean(x)) / sd(x);
-  y_std = (y - mean(y)) / sd(y);
+  real mean_x = mean(x);
+  real sd_x = sd(x);
+  vector[N] x_std = (x - mean_x) / sd_x;
 }
 parameters {
   real alpha_std;
@@ -1379,89 +1403,63 @@ parameters {
 model {
   alpha_std ~ normal(0, 10);
   beta_std ~ normal(0, 10);
-  sigma_std ~ cauchy(0, 5);
-  for (n in 1:N) {
-    y_std[n] ~ normal(alpha_std + beta_std * x_std[n],
-                      sigma_std);
-  }
+  sigma_std ~ normal(0, 5);
+  y ~ normal(x_std * beta_std + alpha_std, sigma_std);
 }
 ```
 
-The parameters are renamed to indicate that they aren't the
-"natural" parameters, but the model is otherwise identical. In
-particular, the fairly diffuse priors on the coefficients and error
-scale are the same. These could have been transformed as well, but
+The parameters are renamed to indicate that they aren't the "natural" parameters.
+The transformed data `x_std` is defined in terms of variables `mean_x` and `sd_x`;
+by declaring these variables in the `transformed data` block, they will be available
+in all following blocks, and therefore can be used in the `generated quantities` block
+to record the "natural" parameters `alpha` and `beta`.
+
+The fairly diffuse priors on the coefficients are the same.
+These could have been transformed as well, but
 here they are left as is, because the scales make sense as
-diffuse priors for standardized data; the priors could be made more
-informative. For instance, because the outputs $y$ have been
-standardized, the error $\sigma$ should not be greater than 1, because
-that's the scale of the noise for predictors $\alpha = \beta = 0$.
+diffuse priors for standardized data.
 
 The original regression
 $$
-y_n
-= \alpha + \beta x_n + \epsilon_n
+y_n = \alpha + \beta x_n + \epsilon_n
 $$
-has been transformed to a regression on the standardized variables,
+has been transformed to a regression on the standardized data variable $z$,
+
 $$
-\textrm{z}_y(y_n)
-= \alpha'
-+ \beta' \textrm{z}_x(x_n)
-+ \epsilon'_n.
+y_n = \alpha' + \beta' z_n + \epsilon_n.
 $$
-The original parameters can be recovered with a little algebra,
+
+The likelihood is specified in terms of the standardized parameters.
+The original slope $\beta$ is the standardized slope $\beta'$ scaled by the inverse of the standard deviation of $x$.
+The original intercept $\alpha$ is the intercept from the standardized model $\alpha'$, corrected for the effect of scaling and centering $x$.
+Thus, the formulas to retrieve $\alpha$ and $\beta$ from $\alpha'$ and $\beta'$ are:
+
 \begin{align*}
-y_n &= \textrm{z}_y^{-1}(\textrm{z}_y(y_n)) \\
-&= \textrm{z}_y^{-1}
-\left( \alpha' + \beta' \textrm{z}_x(x_n) + \epsilon_n' \right) \\
-&= \textrm{z}_y^{-1}
-\left( \alpha' + \beta' \left( \frac{x_n - \bar{x}}{\texttt{sd}(x)} \right) + \epsilon_n' \right) \\
-&= \texttt{sd}(y)
-\left( \alpha' + \beta' \left( \frac{x_n - \bar{x}}{\texttt{sd}(x)} \right) + \epsilon_n' \right) + \bar{y} \\
-&=
-\left( \texttt{sd}(y) \left( \alpha' - \beta' \frac{\bar{x}}{\texttt{sd}(x)} \right) + \bar{y} \right)
-+ \left( \beta' \frac{\texttt{sd}(y)}{\texttt{sd}(x)} \right) x_n
-+ \texttt{sd}(y) \epsilon'_n,
+\beta &= \frac{\beta'}{\textrm{sd}_x} \\
+\alpha &= \alpha' - \beta' \frac{\textrm{mean}_x}{\textrm{sd}_x}
 \end{align*}
-from which the original scale parameter values can be read off,
-$$
-\alpha
-=
-\texttt{sd}(y)
-\left(
-\alpha'
-- \beta' \frac{\bar{x}}{\texttt{sd}(x)}
-\right)
-+ \bar{y};
-\qquad
-\beta = \beta' \frac{\texttt{sd}(y)}{\texttt{sd}(x)};
-\qquad
-\sigma = \texttt{sd}(y) \sigma'.
-$$
 
 These recovered parameter values on the original scales can be
 calculated within Stan using a generated quantities block following
 the model block,
+
 ```stan
 generated quantities {
-  real alpha;
-  real beta;
-  real<lower=0> sigma;
-  alpha = sd(y) * (alpha_std - beta_std * mean(x) / sd(x))
-          + mean(y);
-  beta = beta_std * sd(y) / sd(x);
-  sigma = sd(y) * sigma_std;
+  real beta = beta_std / sd_x;
+  real alpha = alpha_std - beta_std * mean_x / sd_x;
 }
 ```
 
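The recovery formulas in this hunk can be verified numerically: for any fitted `alpha_std` and `beta_std`, the recovered `alpha` and `beta` reproduce exactly the same predictions on the original scale of $x$. A Python sketch (the data and coefficient values are made up; note that Stan's `sd()` divides by $N - 1$):

```python
import math

x = [0.5, 2.0, 3.5, 8.0]
N = len(x)
mean_x = sum(x) / N
sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (N - 1))  # Stan's sd() divides by N - 1

alpha_std, beta_std = 0.3, 1.7   # made-up values standing in for posterior draws

# recovery formulas from the generated quantities block
beta = beta_std / sd_x
alpha = alpha_std - beta_std * mean_x / sd_x

# the recovered parameters give identical predictions on the original scale
for xn in x:
    zn = (xn - mean_x) / sd_x
    assert abs((alpha + beta * xn) - (alpha_std + beta_std * zn)) < 1e-12
```

Because the check holds pointwise for every draw, it holds for the full posterior as well, so `alpha` and `beta` can be recorded per iteration in `generated quantities` without changing the fit.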
-It is inefficient to compute all of the means and standard
-deviations every iteration; for more efficiency, these can be
-calculated once and stored as transformed data. Furthermore, the
-model sampling statement can be easily vectorized, for instance, in
-the transformed model, to
-```stan
-y_std ~ normal(alpha_std + beta_std * x_std, sigma_std);
-```
+When there are multiple real-valued predictors, i.e.,
+when `K` is the number of predictors, `x` is an $N \times K$ matrix,
+and `beta` is a $K$-vector of coefficients,
+then `x * beta` is an $N$-vector of predictions, one for each of the $N$ data items.
+When $K \ll N$,
+the [QR reparameterization](regression.qmd#QR-reparameterization.section)
+is recommended for linear and generalized linear models
+unless there is an informative prior on the location of $\beta$.
 
 
 ### Standard normal distribution {-}
 