
Commit 1e2267c

Adds a discussion of the new preenrollment bias / covariate adjustment work (#680)
* first draft
* added header
* removed package-lock.json
* renamed screenshot
* reworded final sentence in section
* Update docs/deep-dives/data/preenrollment_bias.md
* update link so it's to a specific commit and always correct
* Update docs/deep-dives/data/preenrollment_bias.md
* updated phrasing
* Update docs/deep-dives/data/preenrollment_bias.md
* updated period

Co-authored-by: Mike Williams <102263964+mikewilli@users.noreply.github.com>
1 parent a57598e commit 1e2267c

File tree

5 files changed (+1422, -1299 lines)

docs/deep-dives/data/preenrollment_bias.md

Lines changed: 116 additions & 0 deletions

@@ -0,0 +1,116 @@
---
id: preenrollment_bias
title: Preenrollment Bias
slug: /preenrollment-bias
---

# Automatically Countering Preenrollment Bias

TL;DR: Nimbus can adjust metrics to account for preenrollment bias (natural randomization variability) and, when possible, to improve the precision of inferences. This is currently enabled by default for guardrails (averages only) but can also be used for custom analyses. We expect this to reduce the frequency of false positives, many of which we believe were caused by natural randomization variability.

## Preenrollment Bias

In order to generate evidence for a causal hypothesis, we must guarantee that all sources of [confounding](https://en.wikipedia.org/wiki/Confounding) are accounted for. We can either do this by manually controlling for all confounders (which is quite difficult), or we can use randomized experiments, in which the randomization process guarantees that, _on average_, units in each treatment branch are balanced on all confounders.

Randomization provides a guarantee of balance on average and across large numbers of experiments, but in practice, for any given experiment and confounding dimension, there is the possibility of imbalance. This imbalance (or rather, confounding) presents a challenge to our goal of gathering causal evidence: an imbalance observed during the treatment period is indistinguishable from a treatment effect. The next section (Retrospective A/A Tests) describes a method for detecting these situations.

## Retrospective A/A tests

User behavior tends to be consistent over time: we've found week-to-week correlations of up to 80% for our key guardrail metrics. Given this strong correlation, we can look for evidence of imbalance during the _pre-experiment period_. We've been using the term preenrollment bias, though [others](https://www.statsig.com/blog/pre-experiment-bias-detection-statsig) use pre-experiment bias.

During the pre-experiment period, our experimental cohorts should show no statistically significant differences across the dimensions (metrics) of interest. If there are statistically significant differences during the pre-experiment period, this is evidence of bias. This technique is called a [Retrospective A/A test](https://www.microsoft.com/en-us/research/articles/patterns-of-trustworthy-experimentation-pre-experiment-stage/).
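
As a toy illustration (simulated data; not how Jetstream implements it), a retrospective A/A check amounts to testing for a branch difference in a metric measured before enrollment:

```python
# Toy retrospective A/A check: test for a branch difference in a
# pre-enrollment metric. Data is simulated for illustration.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
pre_control = rng.normal(loc=10.0, size=5_000)    # pre-enrollment metric, control branch
pre_treatment = rng.normal(loc=10.0, size=5_000)  # pre-enrollment metric, treatment branch

result = ttest_ind(pre_control, pre_treatment, equal_var=False)
print(result.pvalue)  # a small p-value is evidence of preenrollment bias
```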

We now automatically run Retrospective A/A tests for all Nimbus experiments to test for imbalance in guardrail metrics. These can be found alongside the other statistical results in the [Jetstream data products](https://docs.telemetry.mozilla.org/datasets/jetstream.html#statistics-tables).

In short, you can find them in the `moz-fx-data-experiments.mozanalysis.statistics_<slug>_<period>_1` tables, where `<slug>` is the (snake-case) experiment slug or ID (which can be found in the Experimenter UI). We run analyses over two periods: the week prior to enrollment (`<period>` = `preenrollment_week`) and the 28-day period prior to enrollment (`<period>` = `preenrollment_days_28`).

For example:

```sql
SELECT *
FROM `moz-fx-data-experiments.mozanalysis.statistics_fake_experiment_slug_preenrollment_week_1`
WHERE 1=1
  AND comparison = 'relative_uplift'
  AND comparison_to_branch = 'control'
  AND statistic != 'deciles'
  AND analysis_basis = 'exposures'
ORDER BY metric, branch, statistic
```

## Covariate Adjustment & CUPED

What if the Retrospective A/A test flags evidence of an imbalance? Do we have to discard that metric from our analysis completely? Luckily, there exist techniques to adjust for pre-experiment information, the most popular of which is [CUPED](https://www.statsig.com/blog/cuped). We have implemented a CUPED-like technique using linear models.

Inferences for the [average treatment effect](https://en.wikipedia.org/wiki/Average_treatment_effect) (ATE) are most commonly made by computing the average (mean) in each treatment branch and then computing the difference. However, the ATE can also be estimated using linear models.

As an example, we can fit a model of the form:

$$y_i = \beta_0 + \beta_t t_i$$

where $t_i$ is the treatment indicator (0 if control, 1 if treated) for the $i$-th unit. Inferences on the ATE can be drawn from the $\beta_t$ parameter: the point estimate and confidence interval are identical ([ref](https://www.refsmmat.com/courses/727/lecture-notes/linear-models.html#sec-ols-framework)) to the point estimate and confidence interval of the absolute difference in means between the branches. Computing confidence intervals for relative differences is more complex, but can be done using post-estimation marginal effects ([ref](https://stats.stackexchange.com/questions/646454/inferences-on-ratio-of-branch-means-in-randomized-experiment/646462#646462)).
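
To make this concrete, here is a minimal sketch using simulated data and `statsmodels` (not Jetstream's implementation), verifying that the OLS coefficient on the treatment indicator reproduces the difference in branch means:

```python
# Minimal sketch: the OLS coefficient on the treatment indicator equals
# the difference in branch means. Data is simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({"t": rng.integers(0, 2, size=n)})  # 0 = control, 1 = treatment
df["y"] = 1.0 + 0.2 * df["t"] + rng.normal(size=n)    # simulated metric

fit = smf.ols("y ~ t", data=df).fit()
ate_ols = fit.params["t"]
ate_means = df.loc[df["t"] == 1, "y"].mean() - df.loc[df["t"] == 0, "y"].mean()

assert np.isclose(ate_ols, ate_means)  # identical point estimates
print(fit.conf_int().loc["t"])         # CI for the absolute difference in means
```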

Using this framework, it's quite simple to extend our experiment analysis to account for pre-experiment data: we simply include it as a covariate in the model. That is, we instead estimate:

$$y_i = \beta_0 + \beta_t t_i + \beta_y z_i$$

where $z_i$ is the metric of interest ($y$) for the $i$-th unit as measured during the pre-experiment period. As before, we're interested in inferences on $\beta_t$, but now these inferences will be:

1. adjusted to account for pre-experiment information, and
2. more precise.
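
Continuing the sketch above (again with simulated data, not Jetstream code), including the pre-experiment measurement as a covariate typically yields a noticeably narrower confidence interval on $\beta_t$:

```python
# Minimal sketch of covariate adjustment: include the pre-experiment
# measurement z as a regressor. Data is simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1_000
z = rng.normal(size=n)           # pre-experiment metric
t = rng.integers(0, 2, size=n)   # treatment indicator
y = 1.0 + 0.2 * t + 0.8 * z + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"y": y, "t": t, "z": z})

unadjusted = smf.ols("y ~ t", data=df).fit()
adjusted = smf.ols("y ~ t + z", data=df).fit()

# The adjusted interval on the treatment coefficient is narrower because
# z explains much of the variance in y.
print(unadjusted.conf_int().loc["t"])
print(adjusted.conf_int().loc["t"])
```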

### Configuring covariate adjustment

#### Adjusting a new metric for preenrollment bias

To perform adjustment for a new metric, you can write or edit the [custom config](../jetstream/configuration.md#custom-experiment-configurations) to do two things: 1) configure your metric to be calculated over the preenrollment period (that is, perform the Retrospective A/A test) and 2) configure the adjustment.

To ensure that your metric is computed during the pre-enrollment period, simply add it to the desired period's metric list:

```toml
preenrollment_weekly = [
    'my_new_metric'
]
```

To configure the adjustment, first specify that inferences on the mean should be made using linear models, then configure that statistic to adjust using the period chosen above. For example:

```toml
[metrics.my_new_metric.statistics.linear_model_mean] # estimate the mean using linear models
[metrics.my_new_metric.statistics.linear_model_mean.covariate_adjustment] # adjust that estimate
period = "preenrollment_week" # adjust using the same metric calculated during the week prior to enrollment
```

For reference, you can see how adjustment is configured for guardrail metrics ([example](https://github.com/mozilla/metric-hub/blob/57cd56a2fee4ed441a172a7c6cfac10a45d3fb3e/jetstream/defaults/firefox_desktop.toml#L33-L67)).

:::note
Currently, the custom configs only support adjusting a during-treatment metric using the pre-experiment version of that same metric. Adjusting a metric using a different metric, or using during-experiment data, is not supported; to accomplish either of those tasks, you'll need to do so manually.
:::

:::info
As of February 2025, the execution order of analysis periods is not guaranteed. This means that, when rerunning an analysis for an experiment, it's possible for the during-treatment analysis to execute before the preenrollment analysis has finished. This will result in the adjustment not being performed; that is, Jetstream will automatically fall back to unadjusted inferences. You can determine whether Jetstream fell back by examining the logs (see [dashboard](https://mozilla.cloud.looker.com/dashboards/246?Experiment=&Timestamp+Date=14+day&Log+Level=ERROR%2CWARNING)) or by comparing to the unadjusted confidence intervals, which will be identical if adjustment was not performed (see the sketch below).
:::
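
For example, a query along the following lines compares the unadjusted and adjusted intervals. This is a sketch: the statistic names `mean` and `linear_model_mean` and the `point`/`lower`/`upper` columns are assumptions to verify against your statistics table.

```sql
-- Sketch: compare unadjusted ('mean') and adjusted ('linear_model_mean')
-- confidence intervals for the during-treatment period. Statistic and
-- column names are assumptions; verify against your table's schema.
SELECT metric, branch, statistic, point, lower, upper
FROM `moz-fx-data-experiments.mozanalysis.statistics_fake_experiment_slug_week_1`
WHERE comparison = 'difference'
  AND comparison_to_branch = 'control'
  AND statistic IN ('mean', 'linear_model_mean')
ORDER BY metric, branch, statistic
```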

#### Custom adjustments

We have built tooling to perform these calculations, and data scientists can use these methods manually to perform custom adjusted inferences. For example, suppose one wanted to control for machine type (number of cores) in an experiment aiming to improve performance:

```python
from mozanalysis.frequentist_stats.linear_models import compare_branches_lm

ref_branch = 'control'
df = ...  # one row per experimental unit, with `branch`, `performance`, and `cores` as columns

output = compare_branches_lm(df, 'performance', covariate_col_label='cores')
```

To dig even deeper, or for something more custom, we expose our own linear model class ([MozOLS](https://github.com/mozilla/mozanalysis/blob/main/src/mozanalysis/frequentist_stats/linear_models/classes.py)), which is optimized for analyzing experimental data, as well as functions to extract absolute and relative confidence intervals ([usage example](https://github.com/mozilla/mozanalysis/blob/main/src/mozanalysis/frequentist_stats/linear_models/functions.py#L505-L526)).

:::warning
HERE BE DRAGONS. It's possible to accidentally leak during-experiment information when performing these custom analyses, potentially invalidating any causal evidence. Similarly, confidence intervals may be incorrect or misleading for complex adjustments.
:::

## Impact and Effectiveness

The effectiveness of the correction varies with the design of the experiment. For example, onboarding experiments have no pre-experiment data, so no adjustment can be made. Adjustments are most effective when user behavior has strong temporal correlations. See [here](https://docs.google.com/document/d/19iyqEidsEOYCPxHWi-3azqtlEqXL-46pttJ2GA7jSW8/edit?tab=t.0) for an internal summary of the effectiveness of this methodology. As a quick primer: in the experiment below, the adjusted inferences (dotted lines) are more precise and more powerful, and remove a spurious false positive (the green metric).

![example](../../../static/img/preenrollment_example.png)

docusaurus.config.js

Lines changed: 16 additions & 1 deletion
@@ -1,3 +1,6 @@
+const math = require('remark-math');
+const katex = require('rehype-katex');
+
 module.exports = {
   title: "Experimenter Docs",
   tagline: "Documentation souce for Data scientists, Product Managers and Engineers",
@@ -18,7 +21,7 @@ module.exports = {
   ],
   themeConfig: {
     prism: {
-      additionalLanguages: ["kotlin", "swift", "rust", "toml"]
+      additionalLanguages: ["kotlin", "swift", "rust", "toml", "sql"]
     },
     docs: {
       sidebar: {
@@ -68,13 +71,25 @@ module.exports = {
         routeBasePath: "/",
         sidebarPath: require.resolve("./sidebars.js"),
         editUrl: "https://github.com/mozilla/experimenter-docs/edit/main/",
+        remarkPlugins: [math],
+        rehypePlugins: [katex],
       },
       theme: {
         customCss: require.resolve("./src/css/custom.css"),
       },
     },
   ],
+
   ],
+  stylesheets: [
+    {
+      href: 'https://cdn.jsdelivr.net/npm/katex@0.13.24/dist/katex.min.css',
+      type: 'text/css',
+      integrity:
+        'sha384-odtC+0UGzzFL/6PNoE8rX/SPcQDXBJ+uRepguP4QkPCm2LBxH3FA3y+fKSiJ+AmM',
+      crossorigin: 'anonymous',
+    },
+  ],
   markdown: {
     mermaid: true,
   },

package.json

Lines changed: 4 additions & 1 deletion
@@ -17,8 +17,11 @@
     "@docusaurus/preset-classic": "^2.4.3",
     "@docusaurus/theme-mermaid": "^2.4.3",
     "clsx": "^2.1.1",
+    "hast-util-is-element": "^1.1.0",
     "react": "^17.0.2",
-    "react-dom": "^17.0.2"
+    "react-dom": "^17.0.2",
+    "rehype-katex": "^5.0.0",
+    "remark-math": "^3.0.1"
   },
   "browserslist": {
     "production": [

static/img/preenrollment_example.png

145 KB
