@@ -521,27 +521,43 @@ centered around zero, as is typical for regression coefficients.
## Parameterizing centered vectors
- It is often convenient to define a parameter vector $\beta$ that is
- centered in the sense of satisfying the sum-to-zero constraint,
- $$
- \sum_{k=1}^K \beta_k = 0.
- $$
-
- Such a parameter vector may be used to identify a multi-logit
- regression parameter vector (see the [multi-logit
- section](#multi-logit.section) for details), or may be used for
- ability or difficulty parameters (but not both) in an IRT model (see
- the [item-response model section](#item-response-models.section) for
- details).
-
-
- ### $K-1$ degrees of freedom {-}
-
- As of Stan 2.36, there is a built-in `sum_to_zero_vector`
- type which constrains $K-1$ free parameters into a length-$K$
- vector that sums to zero. This uses a more sophisticated
- transform than the previously recommended form of setting
- the final element of the vector to the negative sum of the previous elements.
+ When there are varying effects in a regression, the resulting
+ likelihood is not identified unless further steps are taken. For
+ example, we might have a global intercept $\alpha$ and then a varying
+ effect $\beta_k$ for age group $k$ to make a linear predictor $\alpha +
+ \beta_k$. With this predictor, we can add a constant to $\alpha$ and
+ subtract it from each $\beta_k$ and get exactly the same likelihood.
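This shift invariance is easy to check numerically; the following sketch (using numpy, and not part of the guide itself) verifies that the shifted parameterization produces identical linear predictors, and hence an identical likelihood.

```python
import numpy as np

# Linear predictor alpha + beta[k] for K = 3 age groups.
alpha = 1.5
beta = np.array([0.3, -0.2, 0.7])

# Shift alpha up by any constant c and every beta[k] down by c.
c = 10.0
eta1 = alpha + beta              # original parameterization
eta2 = (alpha + c) + (beta - c)  # shifted parameterization

# Identical linear predictors imply an identical likelihood.
assert np.allclose(eta1, eta2)
```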
+
+ The traditional approach to identifying such a model is to pin the
+ first varying effect to zero, i.e., $\beta_1 = 0$. With one of the
+ varying effects fixed, you can no longer add a constant to all of
+ them, and the model's likelihood is identified. In addition to the
+ difficulty of specifying such a model in Stan, it is awkward to
+ formulate priors because the other coefficients are all interpreted
+ relative to $\beta_1$.
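In Stan, pinning the first effect might be sketched as follows; this is an illustrative fragment, not code from the guide, and the name `beta_rest` is an assumption.

``` stan
parameters {
  vector[K - 1] beta_rest;  // free effects for groups 2, ..., K
}
transformed parameters {
  // pin the first varying effect to zero to identify the likelihood
  vector[K] beta = append_row(0, beta_rest);
}
```

The priors on `beta_rest` then describe contrasts with group 1, which is what makes them awkward to formulate.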
+
+ In a Bayesian setting, a proper prior on each of the $\beta_k$ is
+ enough to identify the model. Unfortunately, this can lead to
+ inefficiency during sampling because the model is still only weakly
+ identified through the prior---there is a very simple example of the
+ difference in the discussion of collinearity in the
+ [collinearity section](#collinearity.section).
+
+ An alternative identification strategy that allows a symmetric prior
+ is to enforce a sum-to-zero constraint on the varying effects, i.e.,
+ $\sum_{k=1}^K \beta_k = 0.$
+
+ A parameter vector constrained to sum to zero may also be used to
+ identify a multi-logit regression parameter vector (see the
+ [multi-logit section](#multi-logit.section) for details), or may be
+ used for ability or difficulty parameters (but not both) in an IRT
+ model (see the [item-response model
+ section](#item-response-models.section) for details).
+
556
+
557
+ ### Built-in sum-to-zero vector {-}
558
+
559
+ As of Stan 2.36, there is a built in ` sum_to_zero_vector ` type, which
560
+ can be used as follows.
``` stan
parameters {
@@ -550,10 +566,34 @@ parameters {
}
```
- Placing a prior on `beta` in this parameterization leads to
- a subtly different posterior than that resulting from the same prior
- on `beta` in the original parameterization without the
- sum-to-zero constraint.
+ This produces a vector of size `K` such that `sum(beta) = 0`. The
+ unconstrained representation requires only `K - 1` values because the
+ last is determined by the first `K - 1`.
+
+ Placing a prior on `beta` in this parameterization, for example,
+
+ ``` stan
+ beta ~ normal(0, 1);
+ ```
+
+ leads to a subtly different posterior than what you would get with the
+ same prior on an unconstrained size-`K` vector. As explained below,
+ the variance is reduced.
+
+ The sum-to-zero constraint can be implemented naively by setting the
+ last element to the negative sum of the first elements, i.e., $\beta_K
+ = -\sum_{k=1}^{K-1} \beta_k.$ But that leads to high correlation among
+ the $\beta_k$.
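The induced correlation can be seen in a quick simulation (a numpy sketch, not part of the guide): with standard normal draws for the first $K-1$ elements, the last element has variance $K-1$ and correlation $-1/\sqrt{K-1}$ with each of the others.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 5, 200_000

# Naive construction: last element is the negative sum of the first K-1.
beta_free = rng.standard_normal((n, K - 1))
beta = np.hstack([beta_free, -beta_free.sum(axis=1, keepdims=True)])
assert np.allclose(beta.sum(axis=1), 0)  # sums to zero by construction

corr = np.corrcoef(beta, rowvar=False)
# For K = 5: Corr(beta_1, beta_K) = -1/sqrt(K-1) = -0.5, and the last
# component's standard deviation is sqrt(K-1) = 2, twice the others'.
print(corr[0, K - 1], beta[:, K - 1].std())
```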
+
+ The transform used in Stan eliminates these correlations by
+ constructing an orthogonal basis for the zero-sum subspace and
+ mapping the unconstrained parameters through it; @seyboldt:2024
+ provides an explanation. The *Stan Reference Manual* provides the
+ details in the chapter on transforms. Although any orthogonal basis
+ can be used, Stan uses the inverse isometric log ratio transform
+ because it is convenient to describe and because it simplifies to
+ efficient scalar operations rather than more expensive matrix
+ operations.
+
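The following numpy sketch (again not from the guide; Stan's actual basis comes from the isometric log ratio construction rather than the QR factorization used here) shows the key property: mapping $K-1$ unconstrained values through any orthonormal basis of the zero-sum subspace yields a vector that sums to zero, and standard normal inputs give every component the same marginal variance $(K-1)/K$ with pairwise correlation $-1/(K-1)$.

```python
import numpy as np

K = 5
# Orthonormal basis for the zero-sum subspace {x : sum(x) = 0}: the
# last K-1 columns of a complete QR factorization of the ones vector.
Q, _ = np.linalg.qr(np.ones((K, 1)), mode="complete")
basis = Q[:, 1:]                     # K x (K-1), each column sums to zero

y = np.array([0.3, -1.2, 0.5, 2.0])  # K-1 unconstrained values
beta = basis @ y                     # constrained length-K vector
assert abs(beta.sum()) < 1e-9        # sums to zero

# If y ~ normal(0, I), then cov(beta) = basis @ basis.T = I - J/K:
# marginal variance (K-1)/K for each component, correlation -1/(K-1).
cov = basis @ basis.T
assert np.allclose(cov, np.eye(K) - 1.0 / K)
```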
#### Marginal distribution of sum-to-zero components {-}
@@ -568,53 +608,27 @@ model {
}
```
- The components are not independent, as they must sum to zero. No
- Jacobian is required because the transform uses only linear
- operations (and thus has constant Jacobians).
+ The scale component can be multiplied by `sigma` to produce a
+ `normal(0, sigma)` prior marginally.
To generate distributions with marginals other than standard normal,
the resulting `beta` may be scaled by some factor `sigma` and
translated to some new location `mu`.
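For example, such a shift and rescaling might be written as follows; this is an illustrative sketch, and the names `beta_raw`, `mu`, and `sigma` are assumptions.

``` stan
parameters {
  sum_to_zero_vector[K] beta_raw;
  real mu;               // new location
  real<lower=0> sigma;   // new scale
}
transformed parameters {
  // scale and translate the standard sum-to-zero vector
  vector[K] beta = mu + sigma * beta_raw;
}
```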
- ### Translated and scaled simplex {-}
-
- An alternative approach that's less efficient, but amenable to a
- symmetric prior, is to offset and scale a simplex.
-
- ``` stan
- parameters {
-   simplex[K] beta_raw;
-   real beta_scale;
-   // ...
- }
- transformed parameters {
-   vector[K] beta;
-   beta = beta_scale * (beta_raw - inv(K));
-   // ...
- }
- ```
-
- Here `inv(K)` is just a short way to write `1.0 / K`. Given
- that `beta_raw` sums to 1 because it is a simplex, the
- elementwise subtraction of `inv(K)` is guaranteed to sum to zero.
- Because the magnitude of the elements of the simplex is bounded, a
- scaling factor is required to provide `beta` with the $K$ degrees of
- freedom necessary to take on every possible value that sums to zero.
-
- With this parameterization, a Dirichlet prior can be placed on
- `beta_raw`, perhaps uniform, and another prior put on
- `beta_scale`, typically for "shrinkage."
-
### Soft centering {-}
- Adding a prior such as $\beta \sim \textsf{normal}(0,\sigma)$ will provide a kind
- of soft centering of a parameter vector $\beta$ by preferring, all
- else being equal, that $\sum_{k=1}^K \beta_k = 0$. This approach is only
- guaranteed to roughly center if $\beta$ and the elementwise addition $\beta + c$
- for a scalar constant $c$ produce the same likelihood (perhaps by
- another vector $\alpha$ being transformed to $\alpha - c$, as in the
- IRT models). This is another way of achieving a symmetric prior.
+ Adding a prior such as $\beta \sim \textsf{normal}(0,\epsilon)$ for a
+ small $\epsilon$ will provide a kind of soft centering of a parameter
+ vector $\beta$ by preferring, all else being equal, that $\sum_{k=1}^K
+ \beta_k = 0$. This approach is only guaranteed to roughly center if
+ $\beta$ and the elementwise addition $\beta + c$ for a scalar constant
+ $c$ produce the same likelihood (perhaps by another vector $\alpha$
+ being transformed to $\alpha - c$, as in the IRT models). This is
+ another way of achieving a symmetric prior, though it requires
+ choosing an $\epsilon$. If $\epsilon$ is too large, there won't be a
+ strong enough centering effect, and if it is too small, it will add
+ high curvature to the target density and impede sampling.
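A minimal sketch of soft centering (illustrative only; `epsilon` is a fixed value the modeler must choose and is passed as data here):

``` stan
data {
  int<lower=1> K;
  real<lower=0> epsilon;  // soft-centering scale
}
parameters {
  vector[K] beta;         // unconstrained, centered only softly
}
model {
  beta ~ normal(0, epsilon);  // prefers sum(beta) near zero
}
```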
## Ordered logistic and probit regression {#ordered-logistic.section}