
Commit 24d1d7e

Merge pull request #836 from stan-dev/issue/835-new-diagnostics
update CmdStan Guide and Reference Manual for updated `stansummary` and explain rank-normalized split R-hat.
2 parents: 1782f1a + 78d628f

File tree

2 files changed: +102 −53 lines changed


src/cmdstan-guide/stansummary.qmd

Lines changed: 44 additions & 40 deletions
@@ -12,34 +12,38 @@ diagnostic statistics on the sampler chains, reported in the following order:
 
 - Mean - sample mean
 - MCSE - Monte Carlo Standard Error, a measure of the amount of noise in the sample
-- StdDev - sample standard deviation
+- StdDev - sample standard deviation - the standard deviation around the sample mean.
+- MAD - Median Absolute Deviation - the median absolute deviation around the sample median.
 - Quantiles - default 5%, 50%, 95%
-- N_eff - effective sample size - the number of independent draws in the sample
-- N_eff/S - the number of independent draws per second
-- R_hat - $\hat{R}$ statistic, a measure of chain equilibrium, must be within $0.05$ of $1.0$.
+- ESS_bulk - bulk effective sample size
+- ESS_tail - tail effective sample size
+- R_hat - $\hat{R}$ statistic, an MCMC convergence diagnostic
 
 When reviewing the `stansummary` output, it is important to check the final three
-output columns first - these are the diagnostic statistics on chain convergence and
-number of independent draws in the sample.
-A $\hat{R}$ statistic of greater than $1.05$ indicates that the chain has not converged and
-therefore the sample is not drawn from the posterior, thus the estimates of the mean and
-all other summary statistics are invalid.
+output columns first - these are the diagnostic statistics on MCMC convergence and
+effective sample size.
+An $\hat{R}$ statistic greater than $1$ indicates potential convergence problems and that the sample is not representative of the target posterior; thus the estimates of the mean and all other summary statistics are likely to be invalid. A value of $1.01$ can be used as a generic threshold to decide whether more iterations or further convergence analysis is needed, but other thresholds can be used depending on the specific use case.
 
 Estimation by sampling produces an approximate value for the model parameters;
-the MCSE statistic indicates the amount of noise in the estimate.
+the MCSE statistic indicates the amount of uncertainty in the estimate.
 Therefore the MCSE column is placed next to the sample mean column,
 in order to make it easy to compare this sample with others.
 
 For more information, see the
 [Posterior Analysis](https://mc-stan.org/docs/reference-manual/analysis.html)
 chapter of the Stan Reference Manual which describes both the theory and practice of MCMC
 estimation techniques.
+
+The statistics - Mean, StdDev, MAD, and Quantiles - are computed directly from all draws across all chains.
+The diagnostic statistics - ESS_bulk, ESS_tail, and R_hat - are computed from the rank-normalized,
+folded, and split chains according to the definitions in @Vehtari+etal:2021:Rhat.
+The MCSE statistic is computed using the split chain R_hat and autocorrelations.
 The summary statistics and the algorithms used to compute them are described in sections
 [Notation for samples](https://mc-stan.org/docs/reference-manual/analysis.html#notation-for-samples-chains-and-draws)
 and
 [Effective Sample Size](https://mc-stan.org/docs/reference-manual/analysis.html#effective-sample-size.section).
 
-## Building the stansummary command
+## Building the `stansummary` command
 
 The CmdStan makefile task `build` compiles the `stansummary` utility
 into the `bin` directory.
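
To make the distinction above concrete: Mean, StdDev, MAD, and the quantiles come straight from the pooled draws, with no chain-splitting or rank normalization. A minimal NumPy sketch, illustrative only — the function name and array layout are assumptions, not `stansummary`'s C++ implementation:

```python
import numpy as np

def basic_summary(draws):
    """Summary columns for one quantity; draws has shape (n_iterations, n_chains)."""
    pooled = draws.reshape(-1)  # pool the draws from all chains together
    median = np.median(pooled)
    q5, q50, q95 = np.quantile(pooled, [0.05, 0.50, 0.95])
    return {
        "Mean": pooled.mean(),
        "StdDev": pooled.std(ddof=1),               # spread around the sample mean
        "MAD": np.median(np.abs(pooled - median)),  # spread around the sample median
        "5%": q5, "50%": q50, "95%": q95,           # (some tools rescale MAD by 1.4826)
    }
```

The diagnostic columns (ESS_bulk, ESS_tail, R_hat) cannot be computed this way; they require the rank-normalized, folded, and split chains described above.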
@@ -69,38 +73,38 @@ output files to get the following report:
 
 ```
 > bin/stansummary eight_*.csv
-Input files: eight_1.csv, eight_2.csv, eight_3.csv, eight_4.csv
 Inference for Stan model: eight_schools_model
-4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.
-
-Warmup took (0.048, 0.060, 0.047, 0.045) seconds, 0.20 seconds total
-Sampling took (0.057, 0.058, 0.061, 0.066) seconds, 0.24 seconds total
-
-                 Mean     MCSE  StdDev      5%    50%    95%  N_Eff  N_Eff/s    R_hat
-
-lp__              -18     0.33     5.1     -26    -19   -9.1    233      963      1.0
-accept_stat__    0.88  1.6e-02    0.23    0.21   0.98   1.00    203      838  1.0e+00
-stepsize__       0.18  2.2e-02   0.031    0.14   0.20   0.22    2.0      8.3  3.9e+13
-treedepth__       3.8  5.9e-02    0.78     2.0    4.0    5.0    175      724  1.0e+00
-n_leapfrog__       18  1.3e+00     9.4     7.0     15     31     51      212  1.0e+00
-divergent__     0.015  4.1e-03    0.12    0.00   0.00   0.00    865     3576  1.0e+00
-energy__           23  3.4e-01     5.5      13     23     32    258     1066  1.0e+00
-
-mu                7.9     0.16     5.1   -0.23    7.9     16   1021     4221      1.0
-theta[1]           12     0.30     8.6   -0.48     11     28    837     3459      1.0
-theta[2]          7.8     0.15     6.4    -2.7    7.7     18   1717     7096     1.00
-theta[3]          6.1     0.19     7.7    -6.5    6.5     18   1684     6958      1.0
-theta[4]          7.5     0.15     6.7    -3.1    7.4     18   2026     8373      1.0
-theta[5]          4.7     0.17     6.4    -6.7    5.3     15   1391     5747     1.00
-theta[6]          5.9     0.16     6.7    -5.8    6.2     16   1673     6915     1.00
-theta[7]           11     0.22     7.0   0.057     10     23   1069     4419      1.0
-theta[8]          8.3     0.20     7.9    -4.2    8.0     22   1503     6209     1.00
-tau               7.2     0.26     5.2     1.5    5.9     17    401     1657      1.0
+4 chains: each with iter=1000; warmup=1000; thin=1; 1000 iterations saved.
+
+Warmup took (0.031, 0.031, 0.038, 0.031) seconds, 0.13 seconds total
+Sampling took (0.032, 0.039, 0.026, 0.029) seconds, 0.13 seconds total
+
+                 Mean   MCSE  StdDev    MAD      5%    50%   95%  ESS_bulk  ESS_tail  R_hat
+
+lp__              -18   0.36     5.3    5.5     -26    -18  -8.7       214       217    1.0
+accept_stat__    0.80  0.021    0.29  0.068   0.033   0.95  1.00       268       314    1.0
+stepsize__       0.20    nan   0.040  0.029    0.14   0.21  0.25       nan       nan    nan
+treedepth__       3.7   0.22    0.88    1.5     2.0    4.0   5.0        26        18    1.1
+n_leapfrog__       17    3.1      10     12     3.0     15    31        30        21    1.1
+divergent__     0.022    nan    0.15   0.00    0.00   0.00  0.00       nan       nan    nan
+energy__           23   0.38     5.7    5.8      13     23    32       223       286    1.0
+
+mu                7.9   0.16     5.3    4.9   -0.35    7.9    16       977       472    1.0
+theta[1]           12   0.32     8.8    7.4   -0.19     11    28       727       525    1.0
+theta[2]          7.8   0.17     6.4    6.0    -2.8    7.8    18      1298      2137    1.0
+theta[3]          5.9   0.20     8.1    6.8    -8.0    6.5    18      1493      1720    1.0
+theta[4]          7.5   0.17     6.9    6.3    -3.4    7.6    18      1517      1935    1.0
+theta[5]          4.9   0.20     6.3    6.1    -6.1    5.1    15       998      1357    1.0
+theta[6]          5.8   0.18     6.8    6.1    -6.3    6.2    16      1434      1431    1.0
+theta[7]           11   0.25     7.0    6.4    0.42     10    23       741       562    1.0
+theta[8]          8.6   0.21     8.1    7.2    -3.6    8.3    22      1444      1850    1.0
+tau               7.3   0.31     5.8    4.4     1.5    5.7    18       226      1230    1.0
 
 Samples were drawn using hmc with nuts.
-For each parameter, N_Eff is a crude measure of effective sample size,
-and R_hat is the potential scale reduction factor on split chains (at
-convergence, R_hat=1).
+For each parameter, ESS_bulk and ESS_tail measure the effective sample size for the entire sample
+(bulk) and for the .05 and .95 tails (tail),
+and R_hat measures the potential scale reduction on split chains. At convergence R_hat will be
+very close to 1.00.
 ```
 
 The console output information consists of

src/reference-manual/analysis.qmd

Lines changed: 58 additions & 13 deletions
@@ -5,7 +5,7 @@ pagetitle: Posterior Analysis
 # Posterior Analysis {#analysis.chapter}
 
 Stan uses Markov chain Monte Carlo (MCMC) techniques to generate
-samples from the posterior distribution for full Bayesian inference.
+draws from the posterior distribution for full Bayesian inference.
 Markov chain Monte Carlo (MCMC) methods were developed for situations
 in which it is not straightforward to make independent draws
 @Metropolis:1953.
@@ -17,7 +17,7 @@ despite the fact that it is not actually a Markov chain.
 Stan's Laplace algorithm produces a sample from a normal approximation
 centered at the mode of a distribution in the unconstrained space.
 If the mode is a maximum a posteriori (MAP) estimate,
-the samples provide an estimate of the mean and standard deviation
+the sample provides an estimate of the mean and standard deviation
 of the posterior distribution.
 If the mode is a maximum likelihood estimate (MLE),
 the sample provides an estimate of the standard error of the likelihood.
@@ -66,15 +66,15 @@ applies. These problems are addressed in the next two sections.
 
 Stan's posterior analysis tools compute a number of summary
-statistics, estimates, and diagnostics for Markov chain Monte Carlo
-(MCMC) samples. Stan's estimators and diagnostics are more robust in
+statistics, estimates, and diagnostics for a Markov chain Monte Carlo
+(MCMC) sample. Stan's estimators and diagnostics are more robust in
 the face of non-convergence, antithetical sampling, and long-term
 Markov chain correlations than most of the other tools available. The
 algorithms Stan uses to achieve this are described in this chapter.
 
 
 ## Convergence
 
-By definition, a Markov chain generates samples from the target
+By definition, a Markov chain samples from the target
 distribution only after it has converged to equilibrium (i.e.,
 equilibrium is defined as being achieved when $p(\theta^{(n)})$ is the
 target density). The following point cannot be expressed strongly
@@ -105,21 +105,23 @@ One way to monitor whether a chain has converged to the equilibrium
 distribution is to compare its behavior to other randomly initialized
 chains. This is the motivation for the @GelmanRubin:1992 potential
 scale reduction statistic, $\hat{R}$. The $\hat{R}$ statistic
-measures the ratio of the average variance of samples within each
-chain to the variance of the pooled samples across chains; if all
+measures the ratio of the average variance of draws within each
+chain to the variance of the pooled draws across chains; if all
 chains are at equilibrium, these will be the same and $\hat{R}$ will
 be one. If the chains have not converged to a common distribution,
 the $\hat{R}$ statistic will be greater than one.
 
 Gelman and Rubin's recommendation is that the independent Markov
 chains be initialized with diffuse starting values for the parameters
-and sampled until all values for $\hat{R}$ are below 1.1. Stan
+and sampled until all values for $\hat{R}$ are below some threshold.
+@Vehtari+etal:2021:Rhat suggest a threshold of $1.01$ in general, but
+other thresholds can be used depending on the use case. Stan
 allows users to specify initial values for parameters and it is also
 able to draw diffuse random initializations automatically satisfying
 the declared parameter constraints.
 
 The $\hat{R}$ statistic is defined for a set of $M$ Markov chains,
-$\theta_m$, each of which has $N$ samples $\theta^{(n)}_m$. The
+$\theta_m$, each of which has $N$ draws $\theta^{(n)}_m$. The
 *between-chain variance* estimate is
 
 $$
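
For intuition about the definitions this hunk leads into, plain split-$\hat{R}$ (without the rank normalization introduced later in this commit) fits in a few lines of NumPy. A rough sketch under assumed array conventions — not Stan's implementation:

```python
import numpy as np

def split_rhat(draws):
    """Basic split R-hat; draws has shape (n_draws, n_chains), n_draws assumed even."""
    half = draws.shape[0] // 2
    # Split every chain in half so a drifting chain disagrees with itself.
    split = np.concatenate([draws[:half], draws[half:2 * half]], axis=1)
    n = split.shape[0]
    chain_means = split.mean(axis=0)
    between = n * chain_means.var(ddof=1)       # B: between-chain variance estimate
    within = split.var(axis=0, ddof=1).mean()   # W: mean within-chain variance
    var_plus = (n - 1) / n * within + between / n
    return float(np.sqrt(var_plus / within))    # ~1 at equilibrium, > 1 otherwise
```

Splitting is what lets the statistic flag a chain whose first and second halves occupy different regions of parameter space, which whole-chain $\hat{R}$ can miss.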
@@ -190,6 +192,49 @@ because the first half of each chain has not mixed with the second
 half.
 
 
+### Rank normalization helps when there are heavy tails {-}
+
+Split R-hat and the effective sample size (ESS) are well defined only if
+the marginal posteriors have finite mean and variance.
+Therefore, following @Vehtari+etal:2021:Rhat, we compute the rank-normalized
+parameter values and then feed them into the formulas for split R-hat and ESS.
+
+Rank normalization proceeds as follows:
+
+* First, replace each value $\theta^{(nm)}$ by its rank $r^{(nm)}$ within the pooled
+draws from all chains. Average ranks are used for ties to conserve
+the number of unique values of discrete quantities.
+
+* Second, transform ranks to normal scores using the inverse normal transformation
+and a fractional offset:
+
+$$
+z^{(nm)} = \Phi^{-1} \left( \frac{r^{(nm)} - 3/8}{S - 1/4} \right),
+$$
+
+where $S$ is the total number of draws across all chains.
+
+To further improve sensitivity to chains having different scales,
+rank normalized R-hat is computed also for the corresponding *folded*
+draws $\zeta^{(nm)}$, the absolute deviations from the median,
+$$
+\zeta^{(nm)} = \left|\theta^{(nm)} - \mathrm{median}(\theta)\right|.
+$$
+The rank normalized split-$\widehat{R}$ measure computed on the
+$\zeta^{(nm)}$ values is called *folded-split*-$\widehat{R}$.
+This measures convergence in the
+tails rather than in the bulk of the distribution.
+
+To obtain a single conservative $\widehat{R}$ estimate, we propose
+to report the maximum of rank normalized split-$\widehat{R}$ and
+rank normalized folded-split-$\widehat{R}$ for each parameter.
+
+Bulk-ESS is defined as ESS for rank normalized split chains. Tail-ESS
+is defined as the minimum ESS for the 5% and 95% quantiles. See
+the [Effective Sample Size section](#effective-sample-size.section) for
+details on how ESS is estimated.
+
 ### Convergence is global {-}
 
 A question that often arises is whether it is acceptable to monitor
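
The rank-normalization and folding steps just described are mechanical, so a sketch may help. A minimal NumPy/SciPy illustration, reusing the hypothetical `split_rhat` helper from the earlier sketch (again an assumption, not the code Stan or `stansummary` actually runs):

```python
import numpy as np
from scipy import stats

def rank_normalize(draws):
    """Map draws (n_draws, n_chains) to normal scores of their pooled ranks."""
    ranks = stats.rankdata(draws, method="average").reshape(draws.shape)  # ties -> average rank
    s = draws.size  # total number of draws across all chains
    return stats.norm.ppf((ranks - 3 / 8) / (s - 1 / 4))  # inverse normal + fractional offset

def rank_normalized_rhat(draws):
    """Conservative R-hat: max over bulk (raw) and tails (folded)."""
    bulk = split_rhat(rank_normalize(draws))
    folded = np.abs(draws - np.median(draws))  # fold around the pooled median
    tail = split_rhat(rank_normalize(folded))
    return max(bulk, tail)
```

Because ranks are always finite, this version stays well defined even when a marginal posterior has no finite mean or variance, which is exactly the failure mode the section opens with.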
@@ -256,7 +301,7 @@ analytically.
 So what we do in practice is hope that the finite number of draws is
 large enough for the expectations to be reasonably accurate. Removing
 warmup iterations improves the accuracy of the expectations but there
-is no guarantee that removing any finite number of samples will be
+is no guarantee that removing any finite number of draws will be
 enough.
 
 
@@ -278,7 +323,7 @@ distribution for the log posterior density (`lp__`) almost
 never looks Gaussian, instead it features long tails that can lead to
 large $\hat{R}$ even in the large $N$ limit. Tweaks to $\hat{R}$,
 such as using quantiles in place of raw values, have the flavor of
-making the samples of interest more Gaussian and hence the $\hat{R}$
+making the sample of interest more Gaussian and hence the $\hat{R}$
 statistic more accurate.
 
 

@@ -297,7 +342,7 @@ and can apply the standard tests.
 ## Effective sample size {#effective-sample-size.section}
 
 The second technical difficulty posed by MCMC methods is that the
-samples will typically be autocorrelated (or anticorrelated) within a
+draws will typically be autocorrelated (or anticorrelated) within a
 chain. This increases (or reduces) the uncertainty of the estimation of posterior
 quantities of interest, such as means, variances, or quantiles; see
 @Geyer:2011.
@@ -352,7 +397,7 @@ $$
 \, d\theta - \frac{\mu^2}{\sigma^2}.
 $$
 
-The effective sample size of $N$ samples generated by a process with
+The effective sample size of $N$ draws generated by a process with
 autocorrelations $\rho_t$ is defined by
 $$
 N_{\mathrm{eff}}
@@ -378,7 +423,7 @@ correlation.
 In practice, the probability function in question cannot be tractably
 integrated and thus the autocorrelation cannot be calculated, nor the
 effective sample size. Instead, these quantities must be estimated
-from the samples themselves. The rest of this section describes a
+from the draws themselves. The rest of this section describes an
 autocorrelations and split-$\hat{R}$ based effective sample
 size estimator, based on multiple chains. As before, each chain
 $\theta_m$ will be assumed to be of length $N$.
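
As a closing illustration of the definition above, $N_{\mathrm{eff}} = N / \left(1 + 2 \sum_{t} \rho_t\right)$ can be estimated for a single chain by truncating the empirical autocorrelation sum where noise takes over (here, a Geyer-style paired truncation). A sketch only — the estimator the manual goes on to describe combines multiple chains via split-$\hat{R}$:

```python
import numpy as np

def ess_single_chain(chain):
    """Crude effective sample size for a 1-D array of draws."""
    n = len(chain)
    x = chain - chain.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n  # lag 0..n-1 autocovariances
    rho = acov / acov[0]                                # autocorrelations
    tau = 1.0
    for t in range(1, n - 1, 2):
        pair = rho[t] + rho[t + 1]  # adjacent-pair sums stay positive for a chain
        if pair < 0:                # in equilibrium; stop once noise dominates
            break
        tau += 2 * pair
    return n / tau
```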
