REFERENCE.md (+2 -2)
@@ -223,7 +223,7 @@ This last estimand is recommended by Li et al. (2018) in case of poor overlap (i
Even when there is treatment effect heterogeneity across the population or a sub-population, the average treatment effect may still be zero. To assess whether an estimated CATE function `tau(x) = E[Y(1) - Y(0) | X = x]` does well in identifying personalised effects, another summary measure is useful. One such proposed measure is the Rank-Weighted Average Treatment Effect (RATE), which uses the relative ranking of the estimated CATEs to gauge whether the estimator can effectively target individuals with high treatment effects on a separate evaluation data set (and can thus be used to test for the presence of heterogeneous treatment effects).
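To make the ranking idea concrete, here is a hypothetical sketch with synthetic data (Python; the names `tau_true` and `tau_hat` are illustrative stand-ins for true and estimated CATEs, not grf's implementation — in grf this is handled by `rank_average_treatment_effect`):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(size=n)
tau_true = 2.0 * x                                   # true CATE increases in x
tau_hat = tau_true + rng.normal(scale=0.5, size=n)   # noisy CATE estimates

# Rank units by estimated CATE and compare the average true effect
# among the top 20% to the overall average treatment effect.
top = np.argsort(-tau_hat)[: n // 5]
ate = tau_true.mean()
ate_top = tau_true[top].mean()
print(round(ate, 2), round(ate_top, 2))  # top-ranked group has a larger average effect
```

If the estimated ranking is informative, the top-ranked group's average effect exceeds the ATE; a flat difference across ranks would indicate no detectable heterogeneity.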
- This approach is implemented in the function `rank_average_treatment_effect`, which provides valid bootstrapped errors of the RATE, along with a Targeting Operator Characteristic (TOC) curve which can be used to visually inspect how well a CATE estimator performs in ordering observations according to treatment benefit. For more details on the RATE metric see Yadlowsky et al., 2021.
+ This approach is implemented in the function `rank_average_treatment_effect`, which provides valid bootstrapped errors of the RATE, along with a Targeting Operator Characteristic (TOC) curve which can be used to visually inspect how well a CATE estimator performs in ordering observations according to treatment benefit. For more details on the RATE metric see Yadlowsky et al. (2025).
### Best Linear Projection of the CATE
@@ -440,4 +440,4 @@ Van Der Laan, Mark J., and Daniel Rubin. Targeted maximum likelihood learning. *
Wager, Stefan, and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. *Journal of the American Statistical Association*, 2018.
- Yadlowsky, Steve, Scott Fleming, Nigam Shah, Emma Brunskill, and Stefan Wager. Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects. *arXiv preprint arXiv:2111.07966*, 2021.
+ Yadlowsky, Steve, Scott Fleming, Nigam Shah, Emma Brunskill, and Stefan Wager. Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects. *Journal of the American Statistical Association*, 2025.
experiments/README.md (+5 -3)
@@ -16,7 +16,7 @@ This directory contains replication code for
* Wager and Athey (2018): This paper is not based on GRF, but on the deprecated `causalForest`. For replication code see https://github.com/swager/causalForest
- * Yadlowsky, Fleming, Shah, Brunskill, and Wager (2021): The method is available in the GRF function `rank_average_treatment_effect`. For replication code see https://github.com/som-shahlab/RATE-experiments
+ * Yadlowsky, Fleming, Shah, Brunskill, and Wager (2025): The method is available in the GRF function `rank_average_treatment_effect`. For replication code see https://github.com/som-shahlab/RATE-experiments
Looking at the best linear projection (BLP), it appears that students with a high "financial autonomy index" benefit less from treatment. The RCT authors write that this is a "psychology-based financial autonomy index that aggregated a series of questions that measured whether students felt empowered, confident, and capable of making independent financial decisions and influencing the financial decisions of the households". The BLP thus suggests that students who are already financially comfortable, as measured by this index, don't benefit much from the training course.
### Evaluating CATE estimates with RATE
- Causal inference is fundamentally more challenging than the typical predictive use of machine learning algorithms that have a well-defined scoring metric, such as a prediction error. Treatment effects are fundamentally unobserved, so we need alternative metrics to assess performance. The *R-loss* discussed in Nie & Wager (2021) is one such metric, and could, for example, be used as a cross-validation criterion, however, it does not tell us anything about whether there are HTEs present. Even though the true treatment effects are unobserved, we can use suitable *estimates* of treatment effects on held out data to evaluate models. The *Rank-Weighted Average Treatment Effect* ([RATE](https://grf-labs.github.io/grf/reference/rank_average_treatment_effect.html)) (Yadlowsky et al., 2022) is a metric that assesses how well a CATE estimator does in ranking units according to estimated treatment benefit. It can be thought of as an Area Under the Curve (AUC) measure for heterogeneity, where a larger number is better.
+ Causal inference is fundamentally more challenging than the typical predictive use of machine learning algorithms that have a well-defined scoring metric, such as a prediction error. Treatment effects are fundamentally unobserved, so we need alternative metrics to assess performance. The *R-loss* discussed in Nie & Wager (2021) is one such metric, and could, for example, be used as a cross-validation criterion; however, it does not tell us anything about whether HTEs are present. Even though the true treatment effects are unobserved, we can use suitable *estimates* of treatment effects on held-out data to evaluate models. The *Rank-Weighted Average Treatment Effect* ([RATE](https://grf-labs.github.io/grf/reference/rank_average_treatment_effect.html)) (Yadlowsky et al., 2025) is a metric that assesses how well a CATE estimator does in ranking units according to estimated treatment benefit. It can be thought of as an Area Under the Curve (AUC) measure for heterogeneity, where a larger number is better.
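To make the AUC analogy concrete, here is a hypothetical numerical sketch of the TOC curve underlying the RATE (Python, with simulated `priorities` and effect-score estimates `gamma`; these names are illustrative — grf computes this via `rank_average_treatment_effect` in R):

```python
import numpy as np

def toc_curve(priorities, gamma):
    """TOC(q): mean score among the top-q fraction (ranked by priority)
    minus the overall mean score, for q = 1/n, ..., 1."""
    order = np.argsort(-priorities)            # highest priority first
    sorted_scores = gamma[order]
    running_mean = np.cumsum(sorted_scores) / np.arange(1, len(gamma) + 1)
    return running_mean - gamma.mean()

rng = np.random.default_rng(1)
n = 5_000
priorities = rng.uniform(size=n)
gamma = 3.0 * priorities + rng.normal(size=n)  # effects aligned with priorities
toc = toc_curve(priorities, gamma)
print(round(toc[n // 10], 2))  # TOC at q = 0.1 is positive when the ranking is informative
```

The curve starts high when the top-ranked units benefit most and, by construction, equals zero at q = 1; the RATE is a weighted area under this curve.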
The RATE has an appealing visual component, in that it is the area under the curve that traces out the following difference in expected values while varying the treated fraction $q \in [0, 1]$:
@@ -444,7 +444,7 @@ Zheng, Wenjing, and Mark J. van der Laan. "Cross-validated targeted minimum-loss
Wager, Stefan, and Susan Athey. "Estimation and inference of heterogeneous treatment effects using random forests." Journal of the American Statistical Association 113.523 (2018): 1228-1242.
- Yadlowsky, Steve, Scott Fleming, Nigam Shah, Emma Brunskill, and Stefan Wager. "Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects." arXiv preprint arXiv:2111.07966 (2021).
+ Yadlowsky, Steve, Scott Fleming, Nigam Shah, Emma Brunskill, and Stefan Wager. "Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects." Journal of the American Statistical Association, 120(549), 2025.
Zeileis, Achim, Torsten Hothorn, and Kurt Hornik. "Model-based recursive partitioning." Journal of Computational and Graphical Statistics 17, no. 2 (2008): 492-514.
- Sverdrup, Erik, Han Wu, Susan Athey, and Stefan Wager. Qini Curves for Multi-Armed Treatment Rules. _arXiv preprint arXiv:2306.11979_ ([arxiv](https://arxiv.org/abs/2306.11979))
+ Sverdrup, Erik, Han Wu, Susan Athey, and Stefan Wager. Qini Curves for Multi-Armed Treatment Rules. _Journal of Computational and Graphical Statistics_, 2025. ([arxiv](https://arxiv.org/abs/2306.11979))
- Yadlowsky, Steve, Scott Fleming, Nigam Shah, Emma Brunskill, and Stefan Wager. Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects. _arXiv preprint arXiv:2111.07966_ ([arxiv](https://arxiv.org/abs/2111.07966))
+ Yadlowsky, Steve, Scott Fleming, Nigam Shah, Emma Brunskill, and Stefan Wager. Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects. _Journal of the American Statistical Association_, 120(549), 2025. ([arxiv](https://arxiv.org/abs/2111.07966))
[^r]: There are many approaches to estimating CATEs in the single-armed setting that can be adopted in the multi-armed case by simply fitting separate CATE functions for the different treatment arms. In this vignette, we use GRF's `multi_arm_causal_forest`, which jointly estimates treatment effects for the arms.
r-package/grf/vignettes/rate.Rmd (+4 -4)
@@ -127,10 +127,10 @@ The overview in the previous section gave a stylized introduction where we imagi
\end{split}
\end{equation*}
- where $\hat F$ is the empirical distribution function of $\hat \tau^{train}(X^{test})$. The `rank_average_treatment_effect` function delivers a AIPW-style[^1] doubly robust estimator of the TOC and RATE using a forest trained on a separate evaluation set. For details on the derivation of the doubly robust estimator and the associated central limit theorem, see Yadlowsky et al. (2021).
+ where $\hat F$ is the empirical distribution function of $\hat \tau^{train}(X^{test})$. The `rank_average_treatment_effect` function delivers an AIPW-style[^1] doubly robust estimator of the TOC and RATE using a forest trained on a separate evaluation set. For details on the derivation of the doubly robust estimator and the associated central limit theorem, see Yadlowsky et al. (2025).
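For intuition, an AIPW score combines a regression estimate with an inverse-propensity correction, so its mean is doubly robust for the ATE. Below is a self-contained Python sketch with *known* nuisance functions, purely for illustration — grf builds these scores from forest-based nuisance estimates, and the function and variable names here are hypothetical:

```python
import numpy as np

def aipw_scores(y, w, mu1, mu0, e):
    """AIPW / doubly robust scores whose mean estimates the ATE:
    Gamma_i = mu1 - mu0 + w/e * (y - mu1) - (1 - w)/(1 - e) * (y - mu0)."""
    return mu1 - mu0 + w / e * (y - mu1) - (1 - w) / (1 - e) * (y - mu0)

rng = np.random.default_rng(2)
n = 20_000
x = rng.uniform(size=n)
w = rng.binomial(1, 0.5, size=n)              # randomized assignment, e = 0.5
y = x + w * (1.0 + x) + rng.normal(size=n)    # true CATE = 1 + x, ATE = 1.5
gamma = aipw_scores(y, w, mu1=2 * x + 1, mu0=x, e=0.5)
print(round(gamma.mean(), 2))  # close to the true ATE of 1.5
```

Ranking these per-unit scores by the held-out priorities is what yields the doubly robust TOC and RATE estimates.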
131
131
132
132
## An application to SPRINT and ACCORD
- To illustrate RATE we consider an example application from a medical setting. Two large randomized trials *ACCORD* (ACCORD Study Group, 2010) and *SPRINT* (SPRINT Research Group, 2015) conducted on similar populations and designed to measure the effectiveness of a hypertension treatment reach different conclusions. SPRINT found the treatment was effective, ACCORD found that the treatment was not effective. Various explanations for this finding have been proposed, we'll focus on one in particular here: the hypothesis that the difference is due to *heterogeneity in treatment effects* (see Yadlowsky et al., 2021 for references).
+ To illustrate RATE we consider an example application from a medical setting. Two large randomized trials, *ACCORD* (ACCORD Study Group, 2010) and *SPRINT* (SPRINT Research Group, 2015), conducted on similar populations and designed to measure the effectiveness of a hypertension treatment, reached different conclusions: SPRINT found the treatment was effective; ACCORD found it was not. Various explanations for this finding have been proposed; we'll focus on one in particular here: the hypothesis that the difference is due to *heterogeneity in treatment effects* (see Yadlowsky et al., 2025, for references).
This hypothesis has a testable implication, given the previous section: if significant heterogeneity is present and we can estimate it effectively with a powerful CATE estimator, then an estimated RATE on ACCORD and SPRINT should be positive and significant. In particular, our setup implies the following recipe:
@@ -194,7 +194,7 @@ plot(rate.sprint, xlab = "Treated fraction", main = "TOC evaluated on SPRINT\n t
plot(rate.accord, xlab = "Treated fraction", main = "TOC evaluated on ACCORD\n tau(X) estimated from SPRINT")
```
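The train/rank/evaluate recipe behind these plots is language-agnostic; below is a hypothetical Python version with two synthetic "trials" and a simple per-arm linear T-learner standing in for the causal forest (the data and estimator here are simulated placeholders, not SPRINT/ACCORD or grf itself):

```python
import numpy as np

rng = np.random.default_rng(3)

def make_trial(n):
    x = rng.uniform(-1, 1, size=n)
    w = rng.binomial(1, 0.5, size=n)
    y = w * x + rng.normal(scale=0.5, size=n)    # true CATE = x, ATE = 0
    return x, w, y

# Step 1: estimate a CATE function on the "training" trial
# (per-arm linear fits, a stand-in for a causal forest).
x_tr, w_tr, y_tr = make_trial(5_000)
b1 = np.polyfit(x_tr[w_tr == 1], y_tr[w_tr == 1], deg=1)
b0 = np.polyfit(x_tr[w_tr == 0], y_tr[w_tr == 0], deg=1)

# Step 2: use it only to *rank* units in the held-out "evaluation" trial.
x_te, w_te, y_te = make_trial(5_000)
priority = np.polyval(b1, x_te) - np.polyval(b0, x_te)

# Step 3: estimate the AUTOC from simple IPW scores (e = 0.5 by design).
gamma = (2 * w_te - 1) * 2 * y_te
order = np.argsort(-priority)
toc = np.cumsum(gamma[order]) / np.arange(1, len(gamma) + 1) - gamma.mean()
autoc = toc.mean()
print(round(autoc, 2))  # positive: the ranking targets true heterogeneity
```

Note that here the ATE is zero yet the AUTOC is positive, which is exactly the situation a RATE test is designed to detect.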
- In this semi-synthetic example both AUTOCs are insignificant at conventional levels, suggesting there is no evidence of significant HTEs in the two trials. Note: this can also be attributed to a) low power, as perhaps the sample size is not large enough to detect HTEs, b) that the HTE estimator does not detect them, or c) the heterogeneity in the treatment effects along observable predictor variables are negligible. For a broader analysis comparing different prioritization strategies on the SPRINT and ACCORD datasets, see Yadlowsky et al. (2021).
+ In this semi-synthetic example both AUTOCs are insignificant at conventional levels, suggesting there is no evidence of significant HTEs in the two trials. Note: this could also be attributed to (a) low power, as the sample size may not be large enough to detect HTEs, (b) an HTE estimator that fails to detect them, or (c) negligible heterogeneity in treatment effects along observable predictor variables. For a broader analysis comparing different prioritization strategies on the SPRINT and ACCORD datasets, see Yadlowsky et al. (2025).
For a discussion of alternatives to estimating RATEs that do not rely on a single train/test split, we refer to [this vignette](https://grf-labs.github.io/grf/articles/rate_cv.html).
@@ -211,6 +211,6 @@ Radcliffe, Nicholas. Using control groups to target on predicted lift: Building
SPRINT Research Group. A Randomized Trial of Intensive Versus Standard Blood-Pressure Control. _New England Journal of Medicine_, 373(22):2103–2116, 2015.
- Yadlowsky, Steve, Scott Fleming, Nigam Shah, Emma Brunskill, and Stefan Wager. Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects. _arXiv preprint arXiv:2111.07966_ ([arxiv](https://arxiv.org/abs/2111.07966))
+ Yadlowsky, Steve, Scott Fleming, Nigam Shah, Emma Brunskill, and Stefan Wager. Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects. _Journal of the American Statistical Association_, 120(549), 2025. ([arxiv](https://arxiv.org/abs/2111.07966))