A two-component model of post-implementation U.S. federal legislative spillovers via GPT-4o-mini active distillation and Random Forest prediction of the Gini index.

cfengmvt/Dual-Phase-Legislative-Spillover-Estimator

Development Notes

This project emerged out of an immigrant healthcare advocacy campaign I began in October 2023. While petitioning for an end to the five-year bar on new Lawful Permanent Residents' (LPRs) Medicaid eligibility, I began to ask what effects such legislation might have on public health, hospitals, federal appropriations, and so on. For that matter, what might be the higher-order effects of any piece of legislation? Those questions led to this project, which I began in July 2024. The following are the Methods and Results sections extracted from the paper I submitted for my Directed Independent Study course in June 2025, the culmination of a year's worth of self-study and research with the phenomenal help of a PS/PP PhD student at the Ford School and my DIS advisor at Northville High School.

Note: AI (LLMs) were used to scaffold and debug errors in code, check syntax/grammar in prose, and as a core part of this project in generating the initial zero-shot annotations on legislation. Data collection, cleaning, and interpolation; choice of target variable; and interpretation and evaluation of results were all completed manually without AI assistance.

Methods

A dual-phase approach separating quantitative and qualitative estimation was chosen.

Qualitative Classification

For qualitative analysis, Large Language Model (LLM) fine-tuning was used. First, a training corpus comprising the full texts of 68 passed U.S. Public Laws from 1999–2020 was curated from those publicly released through the Congress.gov website. The laws were chosen for diversity of scope and length; the shortest renamed a post office, while the longest established a new cabinet-level agency. The composition of the corpus did not directly reflect the composition of actual laws passed; "landmark" laws were intentionally overrepresented. This ensures that the model receives adequate exposure to the denser linguistic structures and real-world impacts characteristic of landmark legislation, as opposed to the pro forma administrative measures that comprise the majority of laws actually passed.

Substantive Framework: INTENTS and PESTLE

A composite INTENTS and PESTLE (Political, Economic, Social, Technological, Legal, Environmental) framework was chosen for the prompt used for initial zero-shot annotations. INTENTS serves as the overarching heuristic in the prompt; a spillover's causal mechanism may best be identified by classifying spillovers according to whether individuals were or were not intended recipients of a policy intervention, as INTENTS does (Francetic et al., 2022). Further, INTENTS first queries for the intended impact of an intervention and for the entities directly or indirectly affected, guiding a policy analyst (whether human or LLM) through the steps necessary to determine how a spillover occurs (Francetic et al., 2022). The PESTLE heuristic, by contrast, is conventionally used by corporate firms to assess incoming factors that affect their ability to produce and deliver a product (Makos, 2024). Given that spillover analysis instead attempts to estimate outgoing impacts on broader society, a Quality of Life dimension was added to PESTLE, which does not otherwise consider impacts on citizen well-being beyond traditional economic and social metrics. As such, PESTLE and Quality of Life were used to create seven sub-categories of spillover effects within the three broad categories of INTENTS, forcing analysis of a wide breadth of possible aftereffects that a law may produce.

Identification of and commentary on seven aspects of each spillover effect was further required: its precise impact, its causal mechanism (justified with a citation of a specific operative section within the law), its location of impact, the magnitude of the impact, the effect's time frame, its relevance to the intervention's goals, and the model's confidence in the effect actually transpiring. In doing so, a comprehensive postulation of each spillover may be developed with consistency across spillovers of any domain.
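To make the resulting schema concrete, below is a minimal sketch of how the annotation structure could be represented in code. The category labels and field names paraphrase the Methods text; they are not the exact headings used in the project's prompt.

```python
# Illustrative sketch of the INTENTS/PESTLE+Q annotation schema.
# The three broad INTENTS categories (intended/non-intended,
# targeted/non-targeted recipients; see Francetic et al., 2022)
# each contain the seven PESTLE+Q subcategories below.
from dataclasses import dataclass

PESTLE_Q_SUBCATEGORIES = [
    "Political", "Economic", "Social", "Technological",
    "Legal", "Environmental", "Quality of Life",
]

@dataclass
class SpilloverAnnotation:
    """One spillover effect, with the seven prompted aspects."""
    precise_impact: str    # what exactly changes
    causal_mechanism: str  # cites a specific operative section of the law
    location: str          # where the impact lands
    magnitude: str         # how large the impact is
    time_frame: str        # over what period the effect unfolds
    relevance: str         # relation to the intervention's goals
    confidence: float      # model's confidence the effect transpires (0-1)
```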

LLM Methodology

The full texts of Public Laws from the corpus, along with the INTENTS/PESTLE+Q framework, were input to the parent LLM for initial zero-shot annotations on the spillovers it believed had occurred. By forcing the model to identify impacts it believes to have actually materialized in the real world rather than hypotheticals, this past-tense approach is inferred to activate retrieval pathways that prioritize empirical recall over generative synthesis, reducing the possibility of the model hallucinating. o3-mini, OpenAI's most recent reasoning model at the time of experimentation, was found to produce the most qualitatively sound zero-shot annotations under the aforementioned framework. Reasoning models first generate intermediate text deliberating over how to respond to user input (i.e., "thinking"), making them well-suited to the complex task of spillover analysis. o3-mini was accessed via the consumer ChatGPT website to allow for grounding via the internet search function.

o3-mini was further prompted to adopt the persona of an "objective expert" and a "clear, plain communicator". It was also instructed to vary the length of its report with the scope of the law (i.e., a more wide-reaching law produces a more detailed report). OpenAI espouses both of these prompt engineering techniques, and they proved useful in eliciting a particular response style (OpenAI, n.d.).
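A hypothetical reconstruction of this prompting setup is sketched below using the OpenAI Python SDK. The system prompt text is illustrative, not the project's actual prompt; note also that the project used the consumer ChatGPT website specifically for its search grounding, which a bare API call like this would lack.

```python
# Hypothetical persona/length prompt; illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are an objective expert on U.S. federal policy and a clear, "
    "plain communicator. Analyze the law below under the INTENTS/PESTLE+Q "
    "framework, scaling the length of your report to the scope of the law."
)

law_full_text = open("law.txt").read()  # full Public Law text (assumed local file)

response = client.chat.completions.create(
    model="o3-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": law_full_text},
    ],
)
print(response.choices[0].message.content)
```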

After initial annotation with the custom prompt, the raw annotations were cleaned based on the author's best impression of what was likely to have occurred. In line with this, generated spillovers not substantiated by a distinct causal mechanism were cut. Additionally, the annotations were revised to use plainer English terminology and stable formatting and headings. A final editing step shifted the annotations from the past to the conditional tense (i.e., from what had occurred to what would occur) to simulate the input of novel legislation that Congress has not yet passed.

LLM Fine-tuning

The cleaned inputs (full texts of legislation) and outputs (spillover reports) were then compiled into a JSON dictionary for training. Laws that exceeded OpenAI's 65,536-token fine-tuning limit per example were divided, with separate reports for each division. When this was insufficient, certain non-operative clauses deemed likely insignificant by the author were excluded; only clauses within sections not cited by the generated report as having caused a spillover were subject to deletion. The final JSON dictionary was then used to fine-tune a smaller second-stage student model under a 90-10 train-validation split. Beyond creating a model specially trained for INTENTS/PESTLE+Q spillover analysis, the student model's fewer parameters render it faster and less computationally expensive. This process, in which human-edited outputs of a larger model are used to fine-tune a smaller one, is referred to as active or guided distillation. The two most readily available small models for fine-tuning, OpenAI's GPT-4o-mini and Meta's Llama-3.2 3B, were tested as the student model, and training and validation losses were measured for both. Additionally, GPT-4o-mini reports the log probability of each output token (i.e., the model's confidence in its response), which was also measured.
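A sketch of assembling such a training file is below, assuming OpenAI's chat fine-tuning format, which expects one JSON object per line (JSONL). The variable `pairs` is assumed to hold the cleaned (law text, edited report) tuples described above.

```python
# Sketch: write (law text, edited spillover report) pairs into the
# JSONL chat format that OpenAI fine-tuning consumes.
import json

SYSTEM_PROMPT = "..."  # the INTENTS/PESTLE+Q instruction prompt

with open("spillover_train.jsonl", "w") as f:
    for law_text, report in pairs:  # `pairs` assumed from the cleaning step
        example = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": law_text},
                {"role": "assistant", "content": report},
            ]
        }
        f.write(json.dumps(example) + "\n")
```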

Figure 1

Diagram of the Active Distillation Pipeline for Qualitative Spillover Prediction

Note. Created using Google Slides.

Quantitative Regression

Quantitative spillover analysis was confined to federal appropriations acts, given that only budgetary provisions contain numeric values. However, adequate training data cannot be directly obtained from the Congressional acts themselves; while the Congressional Budget Act of 1974 mandates that Congress pass twelve appropriations acts every fiscal year, it has done so on time only four times since fiscal year 1977, with all other fiscal years requiring Continuing Resolutions followed by conglomerate "Omnibus" appropriations acts (Saturno et al., 2016). Hence, yearly federal outlay data from govinfo.gov was used instead; these figures can be considered the sum of all budgetary appropriations passed for a given fiscal year. The dataset covers 5,632 government spending categories from 1962 onward (United States Government Publishing Office, 2024).

The Gini index, a standard measure of income inequality, was chosen as the target variable. The index is scaled from 0–100 (0 represents perfect equality and 100 perfect inequality), and this bounded, normalized scale makes it a statistically stable prediction target. Additionally, population-level income inequality can theoretically respond to every domain of government spending, and it can further be used to estimate social stratification and mobility.

Initial Data Preprocessing

To correct for inflation, each data point was divided by that year's GDP deflator to obtain real dollars. Each data point was further divided by that year's real GDP to adjust for GDP growth, then multiplied by 1,000 to adjust for scale. The data points, each representing a government account's outlay for a year, were then summed into 86 root departments, agencies, and bureaus; these are the input features used in the final pipeline. While the features do not cover all government outlays, outlays for cabinet-level agencies and for the bureaus and commissions deemed most influential on the American public (by the author) were included. Beyond agency funding, the year was also included as an input feature to correct for output changes caused solely by the progression of time.
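A minimal sketch of this normalization in pandas follows; `outlays`, `gdp_deflator`, `real_gdp`, and the `account_to_agency` mapping are assumed inputs, not artifacts from the repository.

```python
# Sketch: inflation/GDP normalization and aggregation to root agencies.
# `outlays` is a DataFrame indexed by year, one column per account;
# `gdp_deflator` and `real_gdp` are Series indexed by the same years.
import pandas as pd

real_outlays = outlays.div(gdp_deflator, axis=0)    # nominal -> real dollars
scaled = real_outlays.div(real_gdp, axis=0) * 1000  # adjust for GDP growth and scale

# Collapse account columns into their root departments/agencies using an
# account -> agency mapping (assumed), yielding the 86 input features.
agency_outlays = scaled.T.groupby(account_to_agency).sum().T
agency_outlays["year"] = agency_outlays.index       # year as an input feature
```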

Data Interpolation

Given that only 62 years of outlay data are available, more data points were necessary for effective model training, so monthly outlays were interpolated from the cleaned yearly data. Each yearly data point was divided by twelve and duplicated twelve times. To simulate month-to-month spending volatility, the data points were then augmented with Gaussian noise, in which a random value drawn from a normal distribution with a set mean and standard deviation (SD) is added to each data point. The mean was set to zero so that the noise does not systematically bias the values. The SD used for each data point was found by taking the coefficient of variation (CV) of the total quarterly outlay sums in that data point's year and multiplying it by the mean monthly agency value within that year (as all monthly values are equivalent before Gaussian noise, all twelve data points in a year equal that mean; see (2) and (3)).

Equations

$CV_{FQO} = \frac{\sigma_{FQO}}{\mu_{FQO}}$, where $FQO$ denotes full quarterly outlays. (2)

$\sigma_{MAO\ (estimated)} = CV_{FQO} \cdot \mu_{MAO}$, where $MAO$ denotes monthly agency outlays. (3)

The underlying hypothesis behind this interpolation is that differences in total quarterly outlay values within a year can serve as a relative proxy for month-to-month volatility in agency outlay values.

Afterward, each data point was multiplied by a correction factor: the sum of the actual agency outlays in a year divided by the sum of the noisy monthly outlays in that year. This ensures that each year's total outlay value across all agencies measured remains unchanged by the noise.
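A sketch of the whole interpolation step under these equations is below; `yearly` (a years x agencies array) and `cv_year` (each year's CV of total quarterly outlays) are assumed inputs.

```python
# Sketch: duplicate yearly outlays into months, add Gaussian noise with
# the SD from eq. (3), then rescale so annual totals are preserved.
import numpy as np

rng = np.random.default_rng(0)
n_years, n_agencies = yearly.shape
monthly = np.repeat(yearly / 12.0, 12, axis=0)  # each year duplicated 12x

for y in range(n_years):
    rows = slice(12 * y, 12 * (y + 1))
    mu = yearly[y] / 12.0              # mean monthly agency value
    sigma = cv_year[y] * mu            # estimated monthly SD, eq. (3)
    noisy = monthly[rows] + rng.normal(0.0, sigma, size=(12, n_agencies))
    # Correction factor: actual yearly outlay / noisy yearly sum
    # (applied per agency here, one reading of the text).
    monthly[rows] = noisy * (yearly[y] / noisy.sum(axis=0))
```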

A similar process was completed to interpolate a monthly Gini index, with each year's Gini index duplicated twelve times. Again, only quarterly data was available to estimate month-to-month variability, with quarterly national real wages utilized here. The same process of using the CV as an intermediary to estimate the SD was used to augment the data points with Gaussian noise. As the Bureau of Labor Statistics only publishes wage data from 1979 onward, the mean of the within-year wage CVs from 1979–2023 was used to calculate the SD for the Gaussian noise applied to the monthly Gini indices from 1963–1978 (U.S. Bureau of Labor Statistics, 2025). No correction factor was needed here, as the Gini index is not a summable metric.

As the World Bank only publishes the Gini index up to 2022, the 2023 Gini index was estimated using U.S. Census data on the percentage change in equivalence-adjusted income (World Bank Group, 2022; U.S. Census Bureau, 2023).

Preprocessing Continued

The input and output variables were created by pairing sixty months (five years) of outlay data with the sixty subsequent months of the Gini index; this sliding-window construction is necessary because classical ML models are not natively applicable to time-series data (Brownlee, 2020). Model predictions can thus be interpreted as the predicted next five years of the Gini index given the previous five years' government outlays. For feature scaling, standardization (z-score normalization) was applied to each feature.
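A sketch of the window construction, assuming a monthly feature matrix `X_monthly` and an aligned monthly `gini` array:

```python
# Sketch: frame the time series as supervised pairs of
# (past 60 months of outlays) -> (next 60 months of Gini index).
import numpy as np
from sklearn.preprocessing import StandardScaler

WINDOW = 60
X_scaled = StandardScaler().fit_transform(X_monthly)  # z-score each feature

X_windows, y_windows = [], []
for t in range(len(X_scaled) - 2 * WINDOW + 1):
    X_windows.append(X_scaled[t : t + WINDOW].ravel())   # previous 5 years
    y_windows.append(gini[t + WINDOW : t + 2 * WINDOW])  # following 5 years
X_windows = np.asarray(X_windows)
y_windows = np.asarray(y_windows)
```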

Random Forest and Support Vector Regression (SVR) were tested as predictive models; both are known to produce sound results with sparse time-series data, and both are classical models capable of capturing nonlinearity. Given limited data availability, deep learning was not tested. For SVR, Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) were tested for dimensionality reduction. The original features were kept for Random Forest so that each feature's relative contribution to the output could be measured with SHAP (SHapley Additive exPlanation) values. In addition to being tested as the prediction model, Random Forest was therefore also used as the inferential explainer model, regardless of predictive performance, via the calculated SHAP values.
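A minimal sketch of that explainer path follows; the hyperparameters are illustrative, not the tuned values.

```python
# Sketch: fit the forest on the windowed data and compute SHAP values.
# scikit-learn's RandomForestRegressor handles the 60-output target natively.
import shap
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=300, random_state=0)  # illustrative settings
rf.fit(X_windows, y_windows)

explainer = shap.TreeExplainer(rf)              # exact SHAP for tree ensembles
shap_values = explainer.shap_values(X_windows)  # per-feature contributions
```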

For both model types, Walk-Forward Mean Absolute Error (WF-MAE) and Walk-Forward Root Mean Squared Error (WF-RMSE) were used as the primary metrics. These are measured by training and testing on a subset of the data, calculating conventional MAE and RMSE scores, and repeating with subsequent subsets of the data folded into training. Additionally, full-model conventional MAE, RMSE, and R² scores were calculated as secondary metrics.
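A sketch of the walk-forward loop, assuming scikit-learn's TimeSeriesSplit (which always trains on earlier windows and tests on later ones):

```python
# Sketch: expanding-window walk-forward MAE/RMSE.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

maes, rmses = [], []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X_windows):
    rf.fit(X_windows[train_idx], y_windows[train_idx])  # train on the past
    pred = rf.predict(X_windows[test_idx])              # test on the future
    maes.append(mean_absolute_error(y_windows[test_idx], pred))
    rmses.append(np.sqrt(mean_squared_error(y_windows[test_idx], pred)))

wf_mae, wf_rmse = np.mean(maes), np.mean(rmses)
```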

Hyperparameter Tuning

The parameter shuffle was set to False to preserve temporal relationships; this ensures the model offers a forecast of future points rather than a generalization of past points. Each remaining hyperparameter in Random Forest, SVR, PCA, and UMAP was tuned until the primary and/or secondary metrics were optimized.
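For the conventional full-model metrics, this corresponds to an unshuffled hold-out split; a sketch assuming scikit-learn's train_test_split (the 80-20 proportion is illustrative):

```python
# Unshuffled split: the test set sits strictly after the training set
# in time, so evaluation is a forecast rather than interpolation.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_windows, y_windows, test_size=0.2, shuffle=False
)
```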

Results & Evaluation

LLM Distillation Results

Table 1

| Model | Training Loss | Validation Loss |
| --- | --- | --- |
| GPT-4o-mini | 0.5562 ± 0.04 | 1.2357 ± 0.04 |
| Llama 3.2-3B | 1.0026 ± 0.1 | 0.9777 ± 0.07 |

Note. Only one iteration of fine-tuning was performed for both models due to fine-tuning costs. Thus, the confidence intervals were calculated by taking two Standard Errors of the Mean (SEMs) of the final 30 steps of training/validation loss.
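A sketch of that interval calculation, where `losses` is an assumed list of per-step loss values from the fine-tuning logs:

```python
# Sketch: mean and two-SEM margin over the final 30 logged steps.
import numpy as np

def two_sem_interval(losses, tail_steps=30):
    tail = np.asarray(losses[-tail_steps:])
    sem = tail.std(ddof=1) / np.sqrt(len(tail))  # standard error of the mean
    return tail.mean(), 2 * sem                  # point estimate, +/- margin
```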

Given that loss values range from 0 to ∞, the low losses exhibited by both base models indicate they generalized relatively well to the manner of response dictated by fine-tuning. However, GPT-4o-mini demonstrates some overfitting to the training data, evident in its validation loss being significantly greater than its training loss (shown more clearly in Figure 3). In comparison, Llama 3.2-3B appears to have generalized better beyond the training data, as its training and validation losses, while still significantly different, are substantially closer in value. Indeed, Llama 3.2-3B's losses appear to converge somewhat as training progresses, while GPT-4o-mini's losses diverge: another hallmark of overfitting (see Figures 3 and 4).

Figure 3

Graph of GPT-4o-mini Distillation Loss over Training Steps

Note. Taken from OpenAI Playground (2025).

Figure 4

Graph of Llama-3.2 3B Distillation Loss over Training Steps

Note. Created using Google Sheets.

GPT-4o-mini further attained a mean token log probability of -0.6755 ± 0.21, equating to a mean token probability of roughly 0.51 (OpenAI log probabilities are natural logarithms, so $e^{-0.6755} \approx 0.51$). Given that the model can describe the same concept in an uncountable number of ways, this indicates moderate model confidence in its responses.

Despite Llama reaching a lower validation loss, its responses consistently fail to identify and comment on the specific aspects of spillovers asked of it in the prompt, instead providing only a terse summary of each spillover (see Table 2). Indeed, Llama even occasionally references non-existent sections of bills. GPT-4o-mini, on the other hand, offers detailed commentary on each prompted aspect, nearly matching the depth of responses achieved with zero-shot prompting of o3-mini. Additionally, GPT-4o-mini accurately cites bona fide bill sections in the "Causal Mechanism" portion of its spillover analyses. Hence, in the author's opinion, the fine-tuned GPT-4o-mini is the more qualitatively robust model. This may be attributed to GPT-4o-mini having more trainable parameters than Llama-3.2 3B, allowing it to produce more nuanced outputs.

Table 2

Excerpted Sample Fine-tuned GPT-4o-mini and Llama-3.2 3B Responses

GPT-4o-mini:
B6. Environmental

3A2.6.1. Precise Qualitative Impact: The law will indirectly encourage environmental risk assessments in projects where elevation or gravity-related risks overlap with environmental concerns, such as water runoff management in elevated structures.

3A2.6.2. Causal Mechanism: Improved safety and risk management protocols (Sec. 2(b)) will lead to more comprehensive environmental planning as part of overall project risk assessments.

3A2.6.3. Location: This effect will be particularly relevant in regions with environmentally sensitive infrastructure, such as coastal or mountainous areas.

3A2.6.4. Magnitude: The impact will be modest; while it will not be the primary driver of environmental policy, it will contribute to incremental improvements in project-specific environmental management.

3A2.6.5. Time Frame: The environmental spillover will be observed over several years as project planning and execution incorporate enhanced risk management protocols.

3A2.6.6. Relevance: It will be indirectly relevant to the intervention’s broader goal of ensuring sustainable and safe infrastructure development.

3A2.6.7. Confidence: 65% confident, given the variable nature of environmental impacts across different project types and locations.
Llama-3.2 3B:

Environmental: While not directly targeted, the intervention may indirectly influence environmental practices by encouraging more stringent safety protocols and risk assessments during project development. (Environmental Subcategory 6: Emissions Reduction)

Note. Both models were prompted with the full text of H.R.3548 - Infrastructure Expansion Act of 2025, a bill introduced by Representative Nicholas Langworthy [R-NY-23] that Congress has not passed as of June 2025. Both models were set to temperature = 0.3; shown are both models' analyses of the possible between-units environmental spillover arising from bill implementation. While both models identify roughly the same spillover effect, GPT-4o-mini's analysis is markedly more detailed.

Ultimately, given that this portion of spillover estimation is qualitative in nature, GPT-4o-mini should be considered the superior model for spillover analysis, despite its higher validation loss, as it produces the most qualitatively sound analyses (in the author's opinion). Llama likely achieves its lower validation loss by offering vague generalizations that avoid penalization; in the real world, such responses would have little use.

Regression Results

Table 3

| Model | WF-MAE | WF-RMSE | R² | MAE | RMSE |
| --- | --- | --- | --- | --- | --- |
| Random Forest | **0.5696 ± 0.05** | **0.7922 ± 0.12** | -0.5227 ± 0.15 | 0.739 ± 0.08 | 0.9769 ± 0.08 |
| SVR-PCA | 0.5931 ± 0.05 | 0.8242 ± 0.11 | **-0.2277 ± 0.06** | 0.6868 ± 0.05 | **0.9394 ± 0.11** |
| SVR-UMAP | 0.6718 ± 0.05 | 0.9243 ± 0.10 | -0.2969 ± 0.19 | **0.6518 ± 0.06** | 0.9644 ± 0.11 |

Note. Confidence intervals were calculated by taking two SEMs. SVR-PCA and SVR-UMAP each ran five iterations with training data subjected to different Gaussian noise draws, while Random Forest ran only three iterations due to its longer training time. The most optimized value for each metric is in bold.

The R² values below zero indicate that all model types fit the full dataset worse than a constant prediction of the mean. However, this can be safely disregarded for a time-series model, given that the intention is to forecast future values, not generalize to past ones. Additionally, the RMSE values (both WF and conventional) exceeding the corresponding MAE values beyond the margin of error indicate that all models experience moderate variance in their errors; in other words, model losses are driven more by occasional anomalously large errors than by consistent small ones.
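For reference, R² compares the model's squared residuals to the variance of the observed values around their mean, so it falls below zero exactly when the model predicts worse than a constant mean baseline:

$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$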

SVR-UMAP's lowest MAE indicates it best minimized absolute errors of the three model types, while SVR-PCA's lowest RMSE indicates it best minimized larger prediction errors. However, given that Random Forest's WF-error scores are the lowest of any model, it should be considered the superior predictive model. Since both of its conventional error scores exceed its WF-error scores, Random Forest appears to capture local patterns better than global structure. This locality bias is acceptable given the model's purpose: predicting the future Gini index from only the most recent government outlays.

Nonetheless, Random Forest still presents some points of concern. The distribution of its residual errors appears negatively skewed, meaning overprediction occurs more often than underprediction but underpredictions tend to be more severe. This is likely due to latent confounding variables influencing the output in the real world, which is expected given the plethora of variables outside government funding that can influence the Gini index.

Figure 5

Random Forest Distribution of Residual Errors

Note. Created using Matplotlib.

Figure 6

Bar Graph of Random Forest SHAP Values with Highest Impact on Predictions

Note. Shown are the input features (agencies) found to have the five highest and five lowest SHAP values; the former denote features where a value-increase correlates with an increase in output value, and the latter denote features where a value-increase correlates with a decrease in output value. Error bars represent 2 SEMs. Created using Google Sheets.

SHAP values, as displayed in Figure 6, indicate that increased funding for the CIA has the strongest correlation with rising income inequality, while increased funding for the Federal Bureau of Prisons (FBP) has the strongest correlation with falling income inequality. Pursuing the latter further, the three inputs with the lowest SHAP values are all associated with the U.S. justice system, while the two inputs with the highest SHAP values are both associated with foreign policy.

References

(Citations for full paper; not all are used here)

6, P. (2014). Explaining unintended and unexpected consequences of policy decisions: Comparing three British governments, 1959–74. Public Administration, 92, 673–691.

Awan, A. A. (2023, June 28). An Introduction to SHAP Values and Machine Learning Interpretability. DataCamp. Retrieved May 30, 2025, from www.datacamp.com/tutorial/introduction-to-shap-values-machine-learning-interpretability

Brownlee, J. (2020, November 1). Random Forest for Time Series Forecasting. Machine Learning Mastery. Retrieved May 29, 2025, from machinelearningmastery.com/random-forest-for-time-series-forecasting/

CTOL Editors. (2025, January 1). Microsoft Paper slip: GPT-4O-Mini’s 8B size could unlock IPhone’s AI future. CTOL Digital Solutions. www.ctol.digital/news/microsoft-paper-slip-gpt4o-mini-8b-size-could-unlock-iphone-ai-future/

Francetic, I., Meacock, R., Elliott, J., Kristensen, S. R., Britteon, P., Lugo-Palacios, D. G., Wilson, P., & Sutton, M. (2022). Framework for identification and measurement of spillover effects in policy implementation: intended non-intended targeted non-targeted spillovers (INTENTS). Implementation science communications, 3(1), 30. doi.org/10.1186/s43058-022-00280-8

GeeksforGeeks. (2025, May 25). Principal Component Analysis(PCA). GeeksforGeeks. www.geeksforgeeks.org/principal-component-analysis-pca/

H.R.3548 - 119th Congress (2025-2026): Infrastructure Expansion Act of 2025. (2025, May 21). www.congress.gov/bill/119th-congress/house-bill/3548

Jelveh, Z., Kogut, B., & Naidu, S. (2024). Political Language in Economics. The Economic Journal, 134(662), 2439–2469. doi.org/10.1093/ej/ueae026

Makos, J. (2024, September 13). What is PESTLE Analysis? (Free Template). PESTLE Analysis. www.pestleanalysis.com/what-is-pestle-analysis/

OpenAI. (2024). Is ChatGPT biased? OpenAI Help Center. Retrieved May 30, 2025, from help.openai.com/en/articles/8313359-is-chatgpt-biased

OpenAI. (n.d.). Prompt engineering. OpenAI API. Retrieved May 28, 2025, from platform.openai.com/docs/guides/prompt-engineering/six-strategies-for-getting-better-results

OpenAI. (n.d.). What are tokens and how to count them? OpenAI Help Center. Retrieved May 28, 2025, from help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them

Peters, G. (2025, March 31). Cementing Michigan’s Leadership in National Security, Advanced Manufacturing, and Innovation [Q&A]. Detroit Economic Club, Masonic Temple, Detroit, MI, United States.

Policy Analyst Trends. (2024, July 25). Retrieved June 2, 2025, from www.zippia.com/policy-analyst-jobs/trends/

RandomForestRegressor. (n.d.). Scikit-learn. www.scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

Rasifaghihi, N. (2023, November 14). From Theory to practice: Implementing support vector regression for predictions in Python. Medium. www.medium.com/@niousha.rf/support-vector-regressor-theory-and-coding-exercise-in-python-ca6a7dfda927

Saturno, J. V., Heniff, B. Jr., Lynch, M. S., & Congressional Research Service. (2016). The Congressional Appropriations Process: An Introduction (Report No. R42388). Congressional Research Service. www.sgp.fas.org/crs/misc/R42388.pdf

SVR. (n.d.). Scikit-learn. www.scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction — umap 0.5.8 documentation. (n.d.). www.umap-learn.readthedocs.io/en/latest/

United States Government Publishing Office. (2024, March 11). GovInfo. Budget FY 2025 - Outlays. Retrieved March 12, 2025, from www.govinfo.gov/app/details/BUDGET-2025-DB/BUDGET-2025-DB-2/context

U.S. Census Bureau. (2024). Income distribution measures and percent change using money Income and Equivalence-Adjusted Income. United States Department of Commerce. www.census.gov/content/dam/Census/library/visualizations/2024/demo/p60-282/figure3.pdf

U.S. Bureau of Labor Statistics. (2025, June 2). Employed full time: Median usual weekly real earnings: Wage and salary workers: 16 years and over [LES1252881600Q]. FRED, Federal Reserve Bank of St. Louis. Retrieved June 2, 2025 from fred.stlouisfed.org/series/LES1252881600Q

Wikimedia contributors. (2025, May 25). Shapley value. Wikipedia. Retrieved May 30, 2025, from en.wikipedia.org/wiki/Shapley_value

World Bank Group. (2022). Gini Index - United States. World Bank Open Data. Retrieved June 2, 2025, from data.worldbank.org/indicator/SI.POV.GINI?locations=US

Ziametskaya, A. (n.d.). regression graph machine learning color icon illustration 57813445. Vecteezy. Retrieved June 3, 2025, from www.vecteezy.com/vector-art/57813445-regression-graph-machine-learning-color-icon-illustration
