updated sum-to-zero doc

bob-carpenter · bob-carpenter · commit 0daba60d07b6 · 2024-11-15T15:14:22.000-05:00
diff --git a/src/bibtex/all.bib b/src/bibtex/all.bib
@@ -1862,4 +1862,12 @@ @book{filzmoser+etal:2018
   pages={35--68},
   year={2018},
   publisher={Springer}
+}
+
+@misc{seyboldt:2024,
+  author="Seyboldt, Adrian",
+  title="Add ZeroSumNormal distribution",
+  note="pyro-ppl/numpyro GitHub repository pull request \#1751",
+  year = "2024",
+  url ="https://github.com/pyro-ppl/numpyro/pull/1751#issuecomment-1980569811"
 }
diff --git a/src/reference-manual/transforms.qmd b/src/reference-manual/transforms.qmd
@@ -475,9 +475,15 @@ $$
 $$
 
 For the transform, Stan uses the first part of an isometric log ratio
-transform;  see [@egozcue+etal:2003] for the basic
-definitions and Chapter 3 of [@filzmoser+etal:2018] for the pivot
-coordinate version used here.
+transform; see [@egozcue+etal:2003] for the basic definitions and
+Chapter 3 of [@filzmoser+etal:2018] for the pivot coordinate version
+used here.  Stan uses the isometric log ratio transform because it
+induces a geometry with zero correlation among the dimensions, making
+it easier for HMC to explore than simpler alternatives such as setting
+the final element to the negative sum of the first elements; see, e.g.,
+[@seyboldt:2024].
+
+
 
 
 ### Zero sum transform {-}
diff --git a/src/stan-users-guide/regression.qmd b/src/stan-users-guide/regression.qmd
@@ -521,27 +521,43 @@ centered around zero, as is typical for regression coefficients.
 
 ## Parameterizing centered vectors
 
-It is often convenient to define a parameter vector $\beta$ that is
-centered in the sense of satisfying the sum-to-zero constraint,
-$$
-\sum_{k=1}^K \beta_k = 0.
-$$
-
-Such a parameter vector may be used to identify a multi-logit
-regression parameter vector (see the [multi-logit
-section](#multi-logit.section) for details), or may be used for
-ability or difficulty parameters (but not both) in an IRT model (see
-the [item-response model section](#item-response-models.section) for
-details).
-
-
-### $K-1$ degrees of freedom {-}
-
-As of Stan 2.36, there is a built in `sum_to_zero_vector`
-type which constrains $K-1$ free parameters into a length-$K$
-vector that sums to zero. This is using a more sophisticated
-transform than the previously recommended form of setting
-the final element of the vector to the negative sum of the previous elements.
+When there are varying effects in a regression, the resulting
+likelihood is not identified unless further steps are taken.  For
+example, we might have a global intercept $\alpha$ and then a varying
+effect $\beta_k$ for age group $k$ to make a linear predictor $\alpha +
+\beta_k$.  With this predictor, we can add a constant to $\alpha$ and
+subtract it from each $\beta_k$ and get exactly the same likelihood.
+
+The traditional approach to identifying such a model is to pin the
+first varying effect to zero, i.e., $\beta_1 = 0$.  With one of the
+varying effects fixed, you can no longer add a constant to all of them
+and the model's likelihood is identified.  In addition to the
+difficulty in specifying such a model in Stan, it is awkward to
+formulate priors because the other coefficients are all interpreted
+relative to $\beta_1$.  
+
+In a Bayesian setting, a proper prior on each of the $\beta$ is enough
+to identify the model.  Unfortunately, this can lead to inefficiency
+during sampling as the model is still only weakly identified through
+the prior---there is a very simple example of the difference in
+the discussion of collinearity in @collinearity.section.
+
+An alternative identification strategy that allows a symmetric prior
+is to enforce a sum-to-zero constraint on the varying effects, i.e.,
+$\sum_{k=1}^K \beta_k = 0.$
+
+A parameter vector constrained to sum to zero may also be used to
+identify a multi-logit regression parameter vector (see the
+[multi-logit section](#multi-logit.section) for details), or may be
+used for ability or difficulty parameters (but not both) in an IRT
+model (see the [item-response model
+section](#item-response-models.section) for details).
+
+
+### Built-in sum-to-zero vector {-}
+
+As of Stan 2.36, there is a built-in `sum_to_zero_vector` type, which
+can be used as follows.
 
 ```stan
 parameters {
@@ -550,10 +566,34 @@ parameters {
 }
 ```
 
-Placing a prior on `beta` in this parameterization leads to
-a subtly different posterior than that resulting from the same prior
-on `beta` in the original parameterization without the
-sum-to-zero constraint.
+This produces a vector of size `K` such that `sum(beta) = 0`.  The
+unconstrained representation requires only `K - 1` values because the
+last is determined by the first `K - 1`.
+
+Placing a prior on `beta` in this parameterization, for example,
+
+```stan
+  beta ~ normal(0, 1);
+```
+
+leads to a subtly different posterior than what you would get with the
+same prior on an unconstrained size-`K` vector.  As explained below,
+the variance is reduced.
+
+The sum-to-zero constraint can be implemented naively by setting the
+last element to the negative sum of the first elements, i.e., $\beta_K
+= -\sum_{k=1}^{K-1} \beta_k.$ But that leads to high correlation among
+the $\beta_k$.
+
+The transform used in Stan eliminates these correlations by
+constructing an orthogonal basis and applying it to the
+zero-sum constraint; @seyboldt:2024 provides an explanation.  The
+*Stan Reference Manual* provides the details in the chapter on
+transforms.  Although any orthogonal basis can be used, Stan uses the
+inverse isometric log ratio transform because it is convenient to
+describe and simplifies to efficient scalar operations rather than
+more expensive matrix operations.
+
 
 #### Marginal distribution of sum-to-zero components {-}
 
@@ -568,53 +608,27 @@ model {
 }
 ```
 
-The components are not independent, as they must sum zero.  No
-Jacobian is required because the transform uses only linear
-operations (and thus have constant Jacobians).
+The scale component can be multiplied by `sigma` to produce a
+`normal(0, sigma)` prior marginally.
 
 To generate distributions with marginals other than standard normal,
 the resulting `beta` may be scaled by some factor `sigma` and
 translated to some new location `mu`.
 
-### Translated and scaled simplex {-}
-
-An alternative approach that's less efficient, but amenable to a
-symmetric prior, is to offset and scale a simplex.
-
-```stan
-parameters {
-  simplex[K] beta_raw;
-  real beta_scale;
-  // ...
-}
-transformed parameters {
-  vector[K] beta;
-  beta = beta_scale * (beta_raw - inv(K));
-  // ...
-}
-```
-
-Here `inv(K)` is just a short way to write `1.0 / K`.  Given
-that `beta_raw` sums to 1 because it is a simplex, the
-elementwise subtraction of `inv(K)` is guaranteed to sum to zero.
-Because the magnitude of the elements of the simplex is bounded, a
-scaling factor is required to provide `beta` with $K$ degrees of
-freedom necessary to take on every possible value that sums to zero.
-
-With this parameterization, a Dirichlet prior can be placed on
-`beta_raw`, perhaps uniform, and another prior put on
-`beta_scale`, typically for "shrinkage."
-
 
 ### Soft centering {-}
 
-Adding a prior such as $\beta \sim \textsf{normal}(0,\sigma)$ will provide a kind
-of soft centering of a parameter vector $\beta$ by preferring, all
-else being equal, that $\sum_{k=1}^K \beta_k = 0$.  This approach is only
-guaranteed to roughly center  if $\beta$ and the elementwise addition $\beta + c$
-for a scalar constant $c$ produce the same likelihood (perhaps by
-another vector $\alpha$ being transformed to $\alpha - c$, as in the
-IRT models).  This is another way of achieving a symmetric prior.
+Adding a prior such as $\beta \sim \textsf{normal}(0,\epsilon)$ for a
+small $\epsilon$ will provide a kind of soft centering of a parameter
+vector $\beta$ by preferring, all else being equal, that $\sum_{k=1}^K
+\beta_k = 0$.  This approach is only guaranteed to roughly center if
+$\beta$ and the elementwise addition $\beta + c$ for a scalar constant
+$c$ produce the same likelihood (perhaps by another vector $\alpha$
+being transformed to $\alpha - c$, as in the IRT models).  This is
+another way of achieving a symmetric prior, though it requires
+choosing an $\epsilon$.  If $\epsilon$ is too large, there won't be a
+strong enough centering effect, and if it is too small, it will add
+high curvature to the target density and impede sampling.
 
 
 ## Ordered logistic and probit regression {#ordered-logistic.section}