@@ -1199,8 +1199,14 @@ should be faster.
In some cases, models can be recoded to exploit sufficient statistics
in estimation. This can lead to large efficiency gains compared to an
- expanded model. For example, consider the following Bernoulli
- sampling model.
+ expanded model. This section provides examples for the Bernoulli,
+ normal, and Poisson distributions, but the same approach can be
+ applied to other members of the exponential family.
+
+
+ ### Bernoulli sufficient statistics {-}
+
+ Consider the following Bernoulli sampling model.

``` stan
data {
@@ -1257,6 +1263,94 @@ the PMF and simply amount to an alternative, more efficient coding of
the same likelihood. For efficiency, the frequencies `f[k]`
should be counted once in the transformed data block and stored.

+
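+ As a minimal sketch of that advice (hypothetical; the indexing of `f`
+ is illustrative and assumes binary data `y` and size `N` declared as
+ in the model above):
+
+ ``` stan
+ transformed data {
+   // count the outcomes once so the likelihood can reuse the frequencies
+   array[2] int<lower=0> f = rep_array(0, 2);
+   for (n in 1:N) {
+     f[y[n] + 1] += 1;  // f[1] counts zeros, f[2] counts ones
+   }
+ }
+ ```
+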
+ The same trick works for combining multiple binomial observations.
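+
+ As a sketch of that combination (the names `J`, `K`, and `theta` are
+ illustrative, not from the original): independent binomials with a
+ shared success probability add, so `J` observations can be collapsed
+ into one.
+
+ ``` stan
+ data {
+   int<lower=0> J;           // number of binomial observations
+   array[J] int<lower=0> K;  // trials per observation
+   array[J] int<lower=0> y;  // successes per observation
+ }
+ transformed data {
+   int sum_K = sum(K);  // total trials
+   int sum_y = sum(y);  // total successes: the sufficient statistic
+ }
+ parameters {
+   real<lower=0, upper=1> theta;
+ }
+ model {
+   // equivalent, up to constants, to y[j] ~ binomial(K[j], theta)
+   sum_y ~ binomial(sum_K, theta);
+ }
+ ```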
+
+
+ ### Normal sufficient statistics {-}
+
+ Consider the following Stan model for fitting a normal distribution to data.
+
+ ``` stan
+ data {
+   int N;
+   vector[N] y;
+ }
+ parameters {
+   real mu;
+   real<lower=0> sigma;
+ }
+ model {
+   y ~ normal(mu, sigma);
+ }
+ ```
+
+ With the vectorized form used for `y`, Stan is clever enough to
+ evaluate `log(sigma)` only once, but it still has to evaluate the
+ normal for all of `y[1]` to `y[N]`, which involves adding up all the
+ squared differences from the mean and then dividing by `sigma`
+ squared.
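+
+ In symbols, the vectorized statement computes, up to an additive
+ constant that does not depend on the parameters,
+
+ $$
+ \log p(y \mid \mu, \sigma)
+   = -N \log \sigma
+     - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - \mu)^2
+     + \textrm{const},
+ $$
+
+ so the cost of the summation grows linearly in $N$.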
+
+ A density equivalent to the one above (up to normalizing constants
+ that do not depend on the parameters) is given in the following Stan
+ program.
+
+ ``` stan
+ data {
+   int N;
+   vector[N] y;
+ }
+ transformed data {
+   real mean_y = mean(y);
+   real<lower=0> var_y = variance(y);
+   real nm1_over2 = 0.5 * (N - 1);
+   real sqrt_N = sqrt(N);
+ }
+ parameters {
+   real mu;
+   real<lower=0> sigma;
+ }
+ model {
+   mean_y ~ normal(mu, sigma / sqrt_N);
+   var_y ~ gamma(nm1_over2, nm1_over2 / sigma^2);
+ }
+ ```
+
+ The data and parameters are the same in this program as in the first.
+ The second version adds a transformed data block to compute the mean
+ and variance of the data, which are the sufficient statistics here.
+ These are stored along with two other useful constants. The model
+ block then defines sampling distributions for the mean and variance,
+ both of which are scalars.
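+
+ The recoding rests on two standard sampling-distribution facts for
+ independent and identically distributed normal data: the sample mean
+ $\bar{y}$ and sample variance $s^2$ satisfy
+
+ $$
+ \bar{y} \sim \textsf{normal}\!\left(\mu, \frac{\sigma}{\sqrt{N}}\right),
+ \qquad
+ s^2 \sim \textsf{gamma}\!\left(\frac{N-1}{2}, \frac{N-1}{2\sigma^2}\right),
+ $$
+
+ where the gamma form is just a rescaling of
+ $(N - 1) \, s^2 / \sigma^2 \sim \chi^2_{N-1}$.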
+
+ The original Stan program and this one define the same model in the
+ sense that they define the same log density up to a constant additive
+ term that does not depend on the parameters. The priors on `mu` and
+ `sigma` are both improper, but proper priors could be added as
+ additional statements in the model block without affecting
+ sufficiency.
+
+ This transform explicitly relies on the aggregated quantities being
+ data. Applying the same trick to parameters would require more
+ computation than just evaluating the normal log density, even before
+ accounting for the nonlinear change of variables in the variance.
+
+ ### Poisson sufficient statistics {-}
+
+ The Poisson distribution is the easiest case, because the sum of
+ observations is sufficient. Specifically, we can replace
+
+ ``` stan
+ y ~ poisson(lambda);
+ ```
+
+ with
+
+ ``` stan
+ sum(y) ~ poisson(size(y) * lambda);
+ ```
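+
+ Assembled into a complete program (a minimal sketch; computing the
+ sufficient statistic once in transformed data avoids re-summing on
+ every evaluation):
+
+ ``` stan
+ data {
+   int<lower=0> N;
+   array[N] int<lower=0> y;
+ }
+ transformed data {
+   int sum_y = sum(y);  // sufficient statistic, computed once
+ }
+ parameters {
+   real<lower=0> lambda;
+ }
+ model {
+   // sum of N independent Poisson(lambda) variates is Poisson(N * lambda)
+   sum_y ~ poisson(N * lambda);
+ }
+ ```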
+
+ This will work even if `y` is a parameter vector because no Jacobian
+ adjustment is required for summation.
+
## Aggregating common subexpressions
@@ -1306,7 +1400,7 @@ unit sample variance has the following potential benefits:
* It aids in the interpretation and comparison of the importance of coefficients across different predictors.
When there are large differences between the units and scales of the predictors,
- standardizating the predictors is especially useful.
+ standardizing the predictors is especially useful.
This section illustrates the principle for a simple linear regression.
Suppose that $y = (y_1,\dotsc,y_N)$ is a vector of $N$ outcomes and