Commit 0daba60

committed
updated sum-to-zero doc
1 parent 71cea94 commit 0daba60

3 files changed

+95
-67
lines changed

src/bibtex/all.bib

Lines changed: 8 additions & 0 deletions
@@ -1862,4 +1862,12 @@ @book{filzmoser+etal:2018
18621862
pages={35--68},
18631863
year={2018},
18641864
publisher={Springer}
1865+
}
1866+
1867+
@misc{seyboldt:2024,
1868+
author="Seyboldt, Adrian",
1869+
title="Add ZeroSumNormal distribution",
1870+
note="pyro-ppl/numpyro GitHub repository pull request \#1751",
1871+
year = "2024",
1872+
url ="https://github.com/pyro-ppl/numpyro/pull/1751#issuecomment-1980569811"
18651873
}

src/reference-manual/transforms.qmd

Lines changed: 9 additions & 3 deletions
@@ -475,9 +475,15 @@ $$
475475
$$
476476

477477
For the transform, Stan uses the first part of an isometric log ratio
478-
transform; see [@egozcue+etal:2003] for the basic
479-
definitions and Chapter 3 of [@filzmoser+etal:2018] for the pivot
480-
coordinate version used here.
478+
transform; see [@egozcue+etal:2003] for the basic definitions and
479+
Chapter 3 of [@filzmoser+etal:2018] for the pivot coordinate version
480+
used here. Stan uses the isometric log ratio transform because it
481+
induces a geometry with zero correlation among the dimensions, making
482+
it easier for HMC to explore than simpler alternatives such as setting
483+
the final element to the negative sum of the first elements; see, e.g.,
484+
[@seyboldt:2024].
485+
486+
481487

482488

483489
### Zero sum transform {-}

src/stan-users-guide/regression.qmd

Lines changed: 78 additions & 64 deletions
@@ -521,27 +521,43 @@ centered around zero, as is typical for regression coefficients.
521521

522522
## Parameterizing centered vectors
523523

524-
It is often convenient to define a parameter vector $\beta$ that is
525-
centered in the sense of satisfying the sum-to-zero constraint,
526-
$$
527-
\sum_{k=1}^K \beta_k = 0.
528-
$$
529-
530-
Such a parameter vector may be used to identify a multi-logit
531-
regression parameter vector (see the [multi-logit
532-
section](#multi-logit.section) for details), or may be used for
533-
ability or difficulty parameters (but not both) in an IRT model (see
534-
the [item-response model section](#item-response-models.section) for
535-
details).
536-
537-
538-
### $K-1$ degrees of freedom {-}
539-
540-
As of Stan 2.36, there is a built in `sum_to_zero_vector`
541-
type which constrains $K-1$ free parameters into a length-$K$
542-
vector that sums to zero. This is using a more sophisticated
543-
transform than the previously recommended form of setting
544-
the final element of the vector to the negative sum of the previous elements.
524+
When there are varying effects in a regression, the resulting
525+
likelihood is not identified unless further steps are taken. For
526+
example, we might have a global intercept $\alpha$ and then a varying
527+
effect $\beta_k$ for age group $k$ to make a linear predictor $\alpha +
528+
\beta_k$. With this predictor, we can add a constant to $\alpha$ and
529+
subtract from each $\beta_k$ and get exactly the same likelihood.
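
To make the non-identifiability concrete, for any scalar constant $c$ the shifted parameters give the same linear predictor in every group,
$$
(\alpha + c) + (\beta_k - c) = \alpha + \beta_k
\quad \text{for all } k \in \{1, \ldots, K\},
$$
so the likelihood alone cannot distinguish $(\alpha, \beta)$ from $(\alpha + c, \beta - c)$.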
530+
531+
The traditional approach to identifying such a model is to pin the
532+
first varying effect to zero, i.e., $\beta_1 = 0$. With one of the
533+
varying effects fixed, you can no longer add a constant to all of them
534+
and the model's likelihood is identified. In addition to the
535+
difficulty in specifying such a model in Stan, it is awkward to
536+
formulate priors because the other coefficients are all interpreted
537+
relative to $\beta_1$.
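
A minimal sketch of the pinning approach follows, to show the bookkeeping it involves. The names `beta_free` and `alpha` are illustrative assumptions, not taken from the Stan documentation.

```stan
data {
  int<lower=2> K;  // number of groups
}
parameters {
  real alpha;               // global intercept
  vector[K - 1] beta_free;  // effects for groups 2..K, relative to group 1
}
transformed parameters {
  // pin the first effect to zero, then append the free effects
  vector[K] beta = append_row(0.0, beta_free);
}
```

Any prior placed on `beta_free` is a prior on differences from group 1, which is why the pinned parameterization is asymmetric across groups.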
538+
539+
In a Bayesian setting, a proper prior on each of the $\beta$ is enough
540+
to identify the model. Unfortunately, this can lead to inefficiency
541+
during sampling as the model is still only weakly identified through
542+
the prior---there is a very simple example of the difference in
543+
the discussion of collinearity in @collinearity.section.
544+
545+
An alternative identification strategy that allows a symmetric prior
546+
is to enforce a sum-to-zero constraint on the varying effects, i.e.,
547+
$\sum_{k=1}^K \beta_k = 0.$
548+
549+
A parameter vector constrained to sum to zero may also be used to
550+
identify a multi-logit regression parameter vector (see the
551+
[multi-logit section](#multi-logit.section) for details), or may be
552+
used for ability or difficulty parameters (but not both) in an IRT
553+
model (see the [item-response model
554+
section](#item-response-models.section) for details).
555+
556+
557+
### Built-in sum-to-zero vector {-}
558+
559+
As of Stan 2.36, there is a built-in `sum_to_zero_vector` type, which
560+
can be used as follows.
545561

546562
```stan
547563
parameters {
@@ -550,10 +566,34 @@ parameters {
550566
}
551567
```
552568

553-
Placing a prior on `beta` in this parameterization leads to
554-
a subtly different posterior than that resulting from the same prior
555-
on `beta` in the original parameterization without the
556-
sum-to-zero constraint.
569+
This produces a vector of size `K` such that `sum(beta) = 0`. The
570+
unconstrained representation requires only `K - 1` values because the
571+
last is determined by the first `K - 1`.
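
As a sketch of how the type might fit into the age-group example above, the following program combines a global intercept with sum-to-zero varying effects. The data names (`N`, `group`, `y`), the normal likelihood, and the priors are assumptions chosen for illustration, not recommendations.

```stan
data {
  int<lower=1> N;                        // number of observations
  int<lower=2> K;                        // number of age groups
  array[N] int<lower=1, upper=K> group;  // age group of each observation
  vector[N] y;                           // outcome
}
parameters {
  real alpha;                  // global intercept
  sum_to_zero_vector[K] beta;  // varying effects constrained to sum to zero
  real<lower=0> sigma;         // residual scale
}
model {
  alpha ~ normal(0, 5);        // illustrative priors only
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  y ~ normal(alpha + beta[group], sigma);
}
```

Because `beta` sums to zero, adding a constant to `alpha` can no longer be offset by shifting every element of `beta`.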
572+
573+
Placing a prior on `beta` in this parameterization, for example,
574+
575+
```stan
576+
beta ~ normal(0, 1);
577+
```
578+
579+
leads to a subtly different posterior than what you would get with the
580+
same prior on an unconstrained size-`K` vector. As explained below,
581+
the variance is reduced.
582+
583+
The sum-to-zero constraint can be implemented naively by setting the
584+
last element to the negative sum of the first elements, i.e., $\beta_K
585+
= -\sum_{k=1}^{K-1} \beta_k.$ But that leads to high correlation among
586+
the $\beta_k$.
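
For comparison, the naive construction can be sketched as follows. The name `beta_raw` is an illustrative assumption.

```stan
data {
  int<lower=2> K;
}
parameters {
  vector[K - 1] beta_raw;  // K - 1 free values
}
transformed parameters {
  // the last element is the negative sum of the others, so sum(beta) = 0
  vector[K] beta = append_row(beta_raw, -sum(beta_raw));
}
```

Here the final element depends on every free value, so it is treated differently from the other elements.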
587+
588+
The transform used in Stan eliminates these correlations by
589+
constructing an orthogonal basis and applying it to the
590+
zero-sum constraint; @seyboldt:2024 provides an explanation. The
591+
*Stan Reference Manual* provides the details in the chapter on
592+
transforms. Although any orthogonal basis can be used, Stan uses the
593+
inverse isometric log ratio transform because it is convenient to describe
594+
and the transform simplifies to efficient scalar operations rather
595+
than more expensive matrix operations.
596+
557597

558598
#### Marginal distribution of sum-to-zero components {-}
559599

@@ -568,53 +608,27 @@ model {
568608
}
569609
```
570610

571-
The components are not independent, as they must sum zero. No
572-
Jacobian is required because the transform uses only linear
573-
operations (and thus have constant Jacobians).
611+
The scale component can be multiplied by `sigma` to produce a
612+
`normal(0, sigma)` prior marginally.
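
A sketch of this scaling follows. It assumes the prior behaves as an isotropic normal with scale $s$ restricted to the sum-to-zero subspace, in which case each component has marginal variance $s^2 (1 - 1/K)$. Setting $s = \sigma / \sqrt{1 - 1/K}$ then gives marginal scale $\sigma$. Passing `sigma` as data is a choice made here to keep the sketch minimal.

```stan
data {
  int<lower=2> K;
  real<lower=0> sigma;  // desired marginal scale (data in this sketch)
}
parameters {
  sum_to_zero_vector[K] beta;
}
model {
  // inflate the scale so that each beta[k] is marginally normal(0, sigma)
  beta ~ normal(0, sigma * inv(sqrt(1 - inv(K))));
}
```

For example, with `K = 4` the inflation factor is $1 / \sqrt{0.75} \approx 1.155$.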
574613

575614
To generate distributions with marginals other than standard normal,
576615
the resulting `beta` may be scaled by some factor `sigma` and
577616
translated to some new location `mu`.
578617

579-
### Translated and scaled simplex {-}
580-
581-
An alternative approach that's less efficient, but amenable to a
582-
symmetric prior, is to offset and scale a simplex.
583-
584-
```stan
585-
parameters {
586-
simplex[K] beta_raw;
587-
real beta_scale;
588-
// ...
589-
}
590-
transformed parameters {
591-
vector[K] beta;
592-
beta = beta_scale * (beta_raw - inv(K));
593-
// ...
594-
}
595-
```
596-
597-
Here `inv(K)` is just a short way to write `1.0 / K`. Given
598-
that `beta_raw` sums to 1 because it is a simplex, the
599-
elementwise subtraction of `inv(K)` is guaranteed to sum to zero.
600-
Because the magnitude of the elements of the simplex is bounded, a
601-
scaling factor is required to provide `beta` with $K$ degrees of
602-
freedom necessary to take on every possible value that sums to zero.
603-
604-
With this parameterization, a Dirichlet prior can be placed on
605-
`beta_raw`, perhaps uniform, and another prior put on
606-
`beta_scale`, typically for "shrinkage."
607-
608618

609619
### Soft centering {-}
610620

611-
Adding a prior such as $\beta \sim \textsf{normal}(0,\sigma)$ will provide a kind
612-
of soft centering of a parameter vector $\beta$ by preferring, all
613-
else being equal, that $\sum_{k=1}^K \beta_k = 0$. This approach is only
614-
guaranteed to roughly center if $\beta$ and the elementwise addition $\beta + c$
615-
for a scalar constant $c$ produce the same likelihood (perhaps by
616-
another vector $\alpha$ being transformed to $\alpha - c$, as in the
617-
IRT models). This is another way of achieving a symmetric prior.
621+
Adding a prior such as $\beta \sim \textsf{normal}(0,\epsilon)$ for a
622+
small $\epsilon$ will provide a kind of soft centering of a parameter
623+
vector $\beta$ by preferring, all else being equal, that $\sum_{k=1}^K
624+
\beta_k = 0$. This approach is only guaranteed to roughly center if
625+
$\beta$ and the elementwise addition $\beta + c$ for a scalar constant
626+
$c$ produce the same likelihood (perhaps by another vector $\alpha$
627+
being transformed to $\alpha - c$, as in the IRT models). This is
628+
another way of achieving a symmetric prior, though it requires
629+
choosing an $\epsilon$. If $\epsilon$ is too large, there won't be a
630+
strong enough centering effect and if it is too small, it will add
631+
high curvature to the target density and impede sampling.
618632

619633

620634
## Ordered logistic and probit regression {#ordered-logistic.section}
