@@ -78,8 +78,8 @@ better or optimization algorithms require less adaptation.

## Model conditioning and curvature

- Because Stan's algorithms (other than Riemannian Hamiltonian Monte
- Carlo) rely on step-based gradient-based approximations of the density
+ Because Stan's algorithms rely on step-based gradient-based
+ approximations of the density
(or penalized maximum likelihood) being fitted, posterior curvature
not captured by this first-order approximation plays a central role in
determining the statistical efficiency of Stan's algorithms.
@@ -120,19 +120,16 @@ scale and so that posterior correlation is reduced; together, these
properties mean that there is no rotation or scaling required for
optimal performance of Stan's algorithms. For Hamiltonian Monte
Carlo, this implies a unit mass matrix, which requires no adaptation
- as it is where the algorithm initializes. Riemannian Hamiltonian
- Monte Carlo performs this conditioning on the fly at every step, but
- such conditioning is expensive computationally.
+ as it is where the algorithm initializes.

### Varying curvature {-}

In all but very simple models (such as multivariate normals), the
Hessian will vary as $\theta$ varies (an extreme example is Neal's
funnel, as naturally arises in hierarchical models with little or no
data). The more the curvature varies, the harder it is for all of the
- algorithms with fixed adaptation parameters (that is, everything but
- Riemannian Hamiltonian Monte Carlo) to find adaptations that cover the
- entire density well. Many of the variable transforms proposed are
+ algorithms with fixed adaptation parameters to find adaptations that
+ cover the entire density well. Many of the variable transforms proposed are
aimed at improving the conditioning of the Hessian and/or making it
more consistent across the relevant portions of the density (or
penalized maximum likelihood function) being fit.
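
As a concrete illustration, here is a minimal sketch of Neal's funnel (the names `y` and `x` and the choice of nine components follow the usual presentation; the funnel is treated in detail in the reparameterization discussion):

```stan
parameters {
  real y;
  vector[9] x;
}
model {
  // the scale of x depends on y, so the posterior curvature in x
  // varies sharply across different values of y
  y ~ normal(0, 3);
  x ~ normal(0, exp(y / 2));
}
```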
@@ -339,7 +336,7 @@ transformed parameter.
Sampling from heavy tailed distributions such as the Cauchy is
difficult for Hamiltonian Monte Carlo, which operates within a
- Euclidean geometry.^[Riemannian Manifold Hamiltonian Monte Carlo (RMHMC) overcomes this difficulty by simulating the Hamiltonian dynamics in a space with a position-dependent metric; see @GirolamiCalderhead:2011 and @Betancourt:2012.]
+ Euclidean geometry.

The practical problem is that the tail of the Cauchy
requires a relatively large step size compared to the trunk. With a
\times
\textsf{Gamma}\left(\tau \middle| \nu/2, \nu/2\right)
\
- \textsf{d} \tau.
+ \text{d} \tau.
$$
@@ -523,8 +520,8 @@ distribution approaches a normal distribution. Thus the parameter
Unfortunately, the usual situation in applied Bayesian modeling
involves complex geometries and interactions that are not known
- analytically. Nevertheless, reparameterization can still be
- effective for separating parameters.
+ analytically. Nevertheless, the non-centered parameterization
+ can still be effective for separating parameters.

#### Centered parameterization {-}
@@ -577,8 +574,11 @@ effective sample size when there is not much data (see
@Betancourt-Girolami:2013), and in more extreme cases will be
necessary to achieve convergence.

+
```stan
parameters {
+   real mu_beta;
+   real<lower=0> sigma_beta;
  vector[K] beta_raw;
  // ...
}
@@ -596,6 +596,23 @@ model {
Any priors defined for `mu_beta` and `sigma_beta` remain as
defined in the original model.

+ Alternatively, Stan's
+ [affine transform](https://mc-stan.org/docs/reference-manual/types.html#affinely-transformed-real)
+ can be used to decouple `beta` from `mu_beta` and `sigma_beta`:
+
+ ```stan
+ parameters {
+   real mu_beta;
+   real<lower=0> sigma_beta;
+   vector<offset=mu_beta, multiplier=sigma_beta>[K] beta;
+   // ...
+ }
+ model {
+   beta ~ normal(mu_beta, sigma_beta);
+   // ...
+ }
+ ```
+
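With the `offset` and `multiplier` declaration, the quantity the sampler actually explores is the unconstrained value `(beta - mu_beta) / sigma_beta`, so this is the non-centered parameterization written in centered form; it should behave equivalently to the explicit `beta_raw` construction above.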
Reparameterization of hierarchical models is not limited to the normal
distribution, although the normal distribution is the best candidate
for doing so. In general, any distribution of parameters in the
@@ -1279,14 +1296,21 @@ To make the model even more efficient, a transformed data variable
defined to be `sum(y)` could be used in the place of `sum(y)`.
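
For instance, a minimal sketch of that change (assuming, as in the preceding example, that `y` is declared as an array of integer outcomes in the `data` block):

```stan
transformed data {
  // computed once before sampling rather than at every log density evaluation
  int sum_y = sum(y);
}
```

`sum_y` can then be used wherever `sum(y)` appears in the model block.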

- ## Standardizing predictors and outputs
+ ## Standardizing predictors
+
+ Standardizing the data so that all predictors have a zero sample mean and
+ unit sample variance has the following potential benefits:
+
+ * It helps MCMC chains converge faster.
+ * It makes the model less sensitive to the specifics of the parameterization.
+ * It makes it easier to interpret and compare the importance of coefficients across different predictors.

- Stan programs will run faster if the input is standardized to have a
- zero sample mean and unit sample variance. This section illustrates
- the principle with a simple linear regression.
+ When there are large differences between the units and scales of the predictors,
+ standardizing the predictors is especially useful.
+ This section illustrates the principle for a simple linear regression.

- Suppose that $y = (y_1,\dotsc,y_N)$ is a sequence of $N$ outcomes and
- $x = (x_1,\dotsc,x_N)$ a parallel sequence of $N$ predictors. A
+ Suppose that $y = (y_1,\dotsc,y_N)$ is a vector of $N$ outcomes and
+ $x = (x_1,\dotsc,x_N)$ the corresponding vector of $N$ predictors. A
simple linear regression involving an intercept coefficient $\alpha$
and slope coefficient $\beta$ can be expressed as
$$
@@ -1297,39 +1321,43 @@ $$
\epsilon_n \sim \textsf{normal}(0,\sigma).
$$

- If either vector $x$ or $y$ has very large or very small values or if the
- sample mean of the values is far away from 0 (on the scale of the values),
- then it can be more efficient to standardize the outputs $y_n$ and
- predictors $x_n$. The data are first centered by subtracting the
- sample mean, and then scaled by dividing by the sample deviation.
- Thus a data point $u$ is standardized with respect to
- a vector $y$ by the function $\textsf{z}_y$, defined by
+ If $x$ has very large or very small values
+ or if the mean of the values is far away from 0 (on the scale of the values),
+ then it can be more efficient to standardize the predictor values $x_n$.
+ First the elements of $x$ are zero-centered by subtracting the mean,
+ and then scaled by dividing by the standard deviation.
+
+ The mean of $x$ is given by:
+
$$
- \textsf{z}_y(u) = \frac{u - \bar{y}}{\texttt{sd}(y)}
+ \textrm{mean}_x = \frac{1}{N} \sum_{n=1}^{N} x_n
$$
- where the sample mean of $y$ is
+
+ The standard deviation of $x$ is calculated as:
$$
- \bar{y}
- = \frac{1}{N} \sum_{n=1}^N y_n,
+ \textrm{sd}_x = \left( \frac{1}{N} \sum_{n=1}^{N} (x_n - \textrm{mean}_x)^2 \right)^{1/2}
$$
- and the sample standard deviation of $y$ is
+
+ With these, we compute $z$, the vector of standardized predictors:
+
$$
- \texttt{sd}(y)
- = \left(
-   \frac{1}{N} \sum_{n=1}^N (y_n - \bar{y})^2
-   \right)^{1/2}.
+ z_n = \frac{x_n - \textrm{mean}_x}{\textrm{sd}_x}
$$
+
+ where $z_n$ is the standardized value corresponding to $x_n$.
+

The inverse transform is
defined by reversing the two normalization steps, first rescaling by
- the same deviation and relocating by the sample mean,
+ the same standard deviation and then relocating by the sample mean.
+
$$
- \textrm{z}_y^{-1}(v) = \texttt{sd}(y) v + \bar{y}.
+ x_n = z_n \, \textrm{sd}_x + \textrm{mean}_x
$$

- To standardize a regression problem, the predictors and outcomes are
- standardized. This changes the scale of the variables, and hence
- changes the scale of the priors. Consider the following initial
- model.
+ Standardizing the predictors changes the scale of the variables,
+ and hence the scale of the priors.
+
+ Consider the following initial model.

```stan
data {
@@ -1346,18 +1374,15 @@ model {
  // priors
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
-   sigma ~ cauchy(0, 5);
+   sigma ~ normal(0, 5);
  // likelihood
-   for (n in 1:N) {
-     y[n] ~ normal(alpha + beta * x[n], sigma);
-   }
+   y ~ normal(x * beta + alpha, sigma);
}
```

-
- The data block for the standardized model is identical. The
- standardized predictors and outputs are defined in the transformed
- data block.
+ The data block for the standardized model is identical.
+ The mean and standard deviation of the data are defined
+ in the transformed data block, along with the standardized predictors.

```stan
data {
@@ -1366,10 +1391,9 @@ data {
  vector[N] x;
}
transformed data {
-   vector[N] x_std;
-   vector[N] y_std;
-   x_std = (x - mean(x)) / sd(x);
-   y_std = (y - mean(y)) / sd(y);
+   real mean_x = mean(x);
+   real sd_x = sd(x);
+   vector[N] x_std = (x - mean_x) / sd_x;
}
parameters {
  real alpha_std;
@@ -1379,89 +1403,63 @@ parameters {
model {
  alpha_std ~ normal(0, 10);
  beta_std ~ normal(0, 10);
-   sigma_std ~ cauchy(0, 5);
-   for (n in 1:N) {
-     y_std[n] ~ normal(alpha_std + beta_std * x_std[n],
-                       sigma_std);
-   }
+   sigma_std ~ normal(0, 5);
+   y ~ normal(x_std * beta_std + alpha_std, sigma_std);
}
```

- The parameters are renamed to indicate that they aren't the
- "natural" parameters, but the model is otherwise identical. In
- particular, the fairly diffuse priors on the coefficients and error
- scale are the same. These could have been transformed as well, but
+ The parameters are renamed to indicate that they aren't the "natural" parameters.
+ The transformed data variable `x_std` is defined in terms of the variables `mean_x` and `sd_x`;
+ by declaring these variables in the `transformed data` block, they will be available
+ in all following blocks, and therefore can be used in the `generated quantities` block
+ to record the "natural" parameters `alpha` and `beta`.
+
+ The fairly diffuse priors on the coefficients are the same.
+ These could have been transformed as well, but
here they are left as is, because the scales make sense as
- diffuse priors for standardized data; the priors could be made more
- informative. For instance, because the outputs $y$ have been
- standardized, the error $\sigma$ should not be greater than 1, because
- that's the scale of the noise for predictors $\alpha = \beta = 0$.
+ diffuse priors for standardized data.

The original regression
$$
- y_n
- = \alpha + \beta x_n + \epsilon_n
+ y_n = \alpha + \beta x_n + \epsilon_n
$$
- has been transformed to a regression on the standardized variables,
+ has been transformed to a regression on the standardized predictors $z$,
+
$$
- \textrm{z}_y(y_n)
- = \alpha'
- + \beta' \textrm{z}_x(x_n)
- + \epsilon'_n.
+ y_n = \alpha' + \beta' z_n + \epsilon_n.
$$
- The original parameters can be recovered with a little algebra,
+
+ The likelihood is specified in terms of the standardized parameters.
+ The original slope $\beta$ is the standardized slope $\beta'$ scaled by the inverse of the standard deviation of $x$.
+ The original intercept $\alpha$ is the intercept from the standardized model $\alpha'$, corrected for the effect of scaling and centering $x$.
+ Thus, the formulas to retrieve $\alpha$ and $\beta$ from $\alpha'$ and $\beta'$ are:
+
\begin{align*}
- y_n &= \textrm{z}_y^{-1}(\textrm{z}_y(y_n)) \\
- &= \textrm{z}_y^{-1}
-    \left( \alpha' + \beta' \textrm{z}_x(x_n) + \epsilon_n' \right) \\
- &= \textrm{z}_y^{-1}
-    \left( \alpha' + \beta' \left( \frac{x_n - \bar{x}}{\texttt{sd}(x)} \right) + \epsilon_n' \right) \\
- &= \texttt{sd}(y)
-    \left( \alpha' + \beta' \left( \frac{x_n - \bar{x}}{\texttt{sd}(x)} \right) + \epsilon_n' \right) + \bar{y} \\
- &=
-    \left( \texttt{sd}(y) \left( \alpha' - \beta' \frac{\bar{x}}{\texttt{sd}(x)} \right) + \bar{y} \right)
-    + \left( \beta' \frac{\texttt{sd}(y)}{\texttt{sd}(x)} \right) x_n
-    + \texttt{sd}(y) \epsilon'_n,
+ \beta &= \frac{\beta'}{\textrm{sd}_x} \\
+ \alpha &= \alpha' - \beta' \, \frac{\textrm{mean}_x}{\textrm{sd}_x}
\end{align*}
- from which the original scale parameter values can be read off,
- $$
- \alpha
- =
- \texttt{sd}(y)
-   \left(
-     \alpha'
-     - \beta' \frac{\bar{x}}{\texttt{sd}(x)}
-   \right)
-   + \bar{y};
- \qquad
- \beta = \beta' \frac{\texttt{sd}(y)}{\texttt{sd}(x)};
- \qquad
- \sigma = \texttt{sd}(y) \sigma'.
- $$

These recovered parameter values on the original scales can be
calculated within Stan using a generated quantities block following
the model block,
+
```stan
generated quantities {
-   real alpha;
-   real beta;
-   real<lower=0> sigma;
-   alpha = sd(y) * (alpha_std - beta_std * mean(x) / sd(x))
-           + mean(y);
-   beta = beta_std * sd(y) / sd(x);
-   sigma = sd(y) * sigma_std;
+   real beta = beta_std / sd_x;
+   real alpha = alpha_std - beta_std * mean_x / sd_x;
+
}
```
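
Because only the predictors are standardized here (the outcomes `y` stay on their original scale), `sigma_std` already plays the role of $\sigma$ in the original model, so no recovery step is needed for the error scale.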

- It is inefficient to compute all of the means and standard
- deviations every iteration; for more efficiency, these can be
- calculated once and stored as transformed data. Furthermore, the
- model sampling statement can be easily vectorized, for instance, in
- the transformed model, to
- ```stan
- y_std ~ normal(alpha_std + beta_std * x_std, sigma_std);
- ```
+ When there are multiple real-valued predictors, that is,
+ when `K` is the number of predictors, `x` is an $N \times K$ matrix,
+ and `beta` is a $K$-vector of coefficients,
+ then `x * beta` is an $N$-vector of predictions, one for each of the $N$ data items.
+ When $K \ll N$,
+ the [QR reparameterization](regression.qmd#QR-reparameterization.section)
+ is recommended for linear and generalized linear models
+ unless there is an informative prior on the location of $\beta$.
+
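
For reference, a minimal sketch along the lines of the referenced section (the names `Q_ast`, `R_ast`, and `theta` are illustrative, not part of the model above):

```stan
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;
  vector[N] y;
}
transformed data {
  // thin QR decomposition of x, rescaled so the columns of Q_ast
  // are on roughly unit scale
  matrix[N, K] Q_ast = qr_thin_Q(x) * sqrt(N - 1);
  matrix[K, K] R_ast = qr_thin_R(x) / sqrt(N - 1);
  matrix[K, K] R_ast_inverse = inverse(R_ast);
}
parameters {
  real alpha;
  vector[K] theta;      // coefficients on the orthogonalized predictors
  real<lower=0> sigma;
}
model {
  y ~ normal(Q_ast * theta + alpha, sigma);
}
generated quantities {
  // coefficients on the original predictors x
  vector[K] beta = R_ast_inverse * theta;
}
```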
### Standard normal distribution {-}