Health Outcomes Modeled Over Multiple CalEnviroScreen Reporting Periods

UPDATE: May 1, 2022

LA TIMES Op-Ed: We mapped the warehouse takeover of the Inland Empire by Dr. Susan Phillips (professor of Environmental Analysis), on the team's work at the Robert Redford Conservancy for Southern California Sustainability at Pitzer College.

Images for presentation slides mapping the increase of warehouses are screencaps from this Tableau visualization (built to show the changes in density); they complement the ongoing work of researchers in the Inland Empire.

SUMMARY

We aggregated data from four CA OEHHA CalEnviroScreen reports, appended warehouse counts extracted from US Census business data, and trained numerous estimators. None reliably modeled health outcomes using the aggregated data.

Background

The California Office of Environmental Health Hazard Assessment (CA OEHHA) has compiled data from various government agencies to create a mapping tool used to identify the communities most affected by various pollution sources. It has produced four reports in total, in 2013, 2014, 2018 and 2021, with scores for each identified disadvantaged community that was disproportionately facing environmental impacts. Because transportation of goods and people continues to be the primary source of emissions, and because increased commerce requires increased warehousing and associated transportation, we sought to create models to answer the following questions:

Is the effect of increased warehouse presence on health outcomes quantifiable?
What are the primary mitigating factors the State can address in response to increased warehouse density?
How well do the CalEnviroScreen scores reflect emergency healthcare counts?
What indicators from the CalEnviroScreen dataset best determine the number of emergency healthcare visits?

Data Acquisition & Cleaning

Data: CalEnviroScreen

Description: CA OEHHA compiles data specifically regarding pollutants and the communities affected by them.

Source	File	Report Date	Shape
CalEnviroScreen 1	Data (xlsx)	April 2013	(1500, 49)
CalEnviroScreen 2	Data (xlsx)	Oct 2014	(8035, 51)
CalEnviroScreen 3	Data (xlsx)	June 2018	(8035, 51)
CalEnviroScreen 4	Data, Dictionary ZIP	Oct 2021	(8035, 51)
		Combined Data	(25444, 62)
		Used Data	(14912, 59)
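
A minimal sketch of how the per-report tables might be stacked into the combined table, assuming each report has already been loaded into a pandas DataFrame. The column names (`tract`, `ozone`, `pm2_5`) and the tiny frames are hypothetical stand-ins, not the actual CalEnviroScreen schema. Columns that are missing from an earlier report come through as NaN.

```python
import pandas as pd

def combine_reports(reports):
    """Stack per-report DataFrames, tagging each row with its report year.

    `reports` maps report year -> DataFrame. Columns present in only some
    reports are kept and filled with NaN where they are absent.
    """
    tagged = [df.assign(report_year=year) for year, df in reports.items()]
    return pd.concat(tagged, ignore_index=True, sort=False)

# Hypothetical miniature reports (the real files have thousands of rows).
ces_2013 = pd.DataFrame({"tract": [1, 2], "ozone": [0.04, 0.05]})
ces_2021 = pd.DataFrame(
    {"tract": [1, 2], "ozone": [0.05, 0.06], "pm2_5": [11.0, 12.5]}
)

combined = combine_reports({2013: ces_2013, 2021: ces_2021})
print(combined.shape)          # rows add up; columns are the union
print(combined.isna().sum())   # shows which columns older reports lack
```

Keeping a `report_year` column makes it possible to check later how much of the combined table each reporting period contributes.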

Data: US Census Business Survey: Warehouse Counts, Density

Description: The US Census data provides business counts by county, zip code and business type.

name	description
est total	total number of warehouses in class
est ag	total number of warehouses for agricultural storage
est cold	total number of warehouses for cold storage
est gen	total number of warehouses for general storage
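
Appending warehouse counts to the CalEnviroScreen rows could look like the sketch below. It assumes both tables share `zip` and `year` keys (hypothetical column names), and it assumes that a zip code absent from the business data has zero warehouses, which may not hold for the real files. All values are made up for illustration.

```python
import pandas as pd

# Hypothetical CalEnviroScreen rows, one per zip code and report year.
ces = pd.DataFrame({
    "zip": ["91761", "92408", "93001"],
    "year": [2021, 2021, 2021],
    "asthma": [52.1, 78.4, 30.2],
})

# Hypothetical warehouse counts; the third zip code has no entry.
warehouses = pd.DataFrame({
    "zip": ["91761", "92408"],
    "year": [2021, 2021],
    "est total": [120, 85],
    "est cold": [6, 3],
})

# Left merge keeps every community, even those without warehouse rows.
# validate= raises if the warehouse table has duplicate (zip, year) keys.
merged = ces.merge(
    warehouses, on=["zip", "year"], how="left", validate="many_to_one"
)

# Assumption: a missing count means no warehouses were reported.
count_cols = ["est total", "est cold"]
merged[count_cols] = merged[count_cols].fillna(0).astype(int)
print(merged)
```

A left merge is used so that communities without warehouses stay in the data; an inner merge would silently drop them.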

Data: Model Features

variable name	type	description
diesel pm	numeric	particulate matter, spatially modelled
ozone	numeric	concentration
traffic	numeric	volume: vehicles per length of time over fixed distance

Target

Building on the CalEnviroScreen reports, we trained multiple estimators to model the health outcomes those reports record:

variable name	type	description
asthma hospitalization	numeric	incidence rate, cases/10k population
heart attack hospitalization	numeric	incidence rate, cases/10k population
low birth weight	numeric	% newborns weighing `< 2.5 kg` (#/100 live births)
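
The units in the table can be made concrete with a small sketch. The function name and the counts below are illustrative assumptions; the CalEnviroScreen reports publish these rates already computed.

```python
def rate_per(cases, population, per=10_000):
    """Return the number of cases per `per` people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * per / population

# Hypothetical counts, for illustration only.
asthma_rate = rate_per(45, 9_000)        # hospitalizations per 10,000 people
lbw_percent = rate_per(6, 120, per=100)  # low-weight births per 100 live births
print(asthma_rate, lbw_percent)
```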

Exploratory Data Analyses

Outliers in features were not dropped. Because the data in the reports are meticulously collected and averaged, eliminating observations outside some range of central tendency would eliminate the communities that are most affected by the selected factors.

With health outcomes on the vertical axis, the plots suggest that locations with high warehouse density have higher rates of negative health outcomes.

With health outcomes on the vertical axis, the plots suggest that locations with high $PM_{2.5}$ have higher rates of negative health outcomes.

With health outcomes on the vertical axis, the plots suggest that locations with high traffic concentrations have higher rates of negative health outcomes.
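
To illustrate the reasoning, the sketch below applies a conventional 1.5 × IQR rule to a made-up `diesel pm` column. The values are hypothetical, and the rule is a common default rather than anything used in this project. The only row it flags is the most heavily exposed one, which is exactly the kind of community the analysis is meant to keep.

```python
import pandas as pd

# Hypothetical diesel PM values for seven communities.
diesel = pd.Series([0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 2.5], name="diesel pm")

# Conventional upper fence: third quartile plus 1.5 times the IQR.
q1, q3 = diesel.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Rows an outlier filter would drop.
flagged = diesel[diesel > upper_fence]
print(flagged)
```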

Modeling

David: Linear Models

The linear models give us a sense of which of the features to focus on if we want to address these health issues.

Coefficients for these models were scaled and used to mark the most important features for addressing the health data. Ultimately, we found that the socioeconomic factors of unemployment, housing burden, linguistic isolation, and poverty were as an ensemble more effective at predicting the health data than the ensemble of pollution and warehouse data.

In fact, we found that warehouse counts have nearly no predictive value.

Model	Features	Target	$R^2$ (train)
1	selected: no time, space or CES scores	Asthma	0.29
2	selected: no time, space or CES scores	LBW	0.14
3	selected: no time, space or CES scores	CVD	0.23
4	warehouse counts only	Asthma	0.017
5	warehouse counts only	LBW	0.002
6	warehouse counts only	CVD	0.013
7	socio-economic only	Asthma	0.263
8	socio-economic only	LBW	0.1337
9	socio-economic only	CVD	0.11
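
The comparison of feature groups can be sketched as follows. This is not the code from the notebooks: the data are synthetic, the target is constructed to depend only on the socioeconomic columns, and the column names are stand-ins. It shows the pattern of standardizing features so coefficients are comparable, then comparing training $R^2$ across feature subsets.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic features (assumed names, not the real dataset).
X = pd.DataFrame({
    "poverty": rng.normal(size=n),
    "unemployment": rng.normal(size=n),
    "est total": rng.poisson(3, size=n).astype(float),
})
# Synthetic target driven by the socioeconomic columns only, by construction.
y = 2.0 * X["poverty"] + 1.0 * X["unemployment"] + rng.normal(size=n)

def train_r2(columns):
    """Fit a scaled linear model on `columns`; return (train R^2, coefficients)."""
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X[columns], y)
    return model.score(X[columns], y), model[-1].coef_

r2_socio, coef_socio = train_r2(["poverty", "unemployment"])
r2_wh, coef_wh = train_r2(["est total"])

print(f"socio-economic only: R^2 = {r2_socio:.3f}, coefficients = {coef_socio}")
print(f"warehouse counts only: R^2 = {r2_wh:.3f}")
```

Because the features are standardized inside the pipeline, each coefficient is the change in the target per standard deviation of that feature, which is what makes the magnitudes comparable.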

The models that yielded these metrics are developed in Notebook 7: Linear Regression. The procedure for processing the raw data leading up to this notebook begins with Notebook 1: EDA on CAES 4 Data.

Giovanna: Random Forest & SVR

Random forest regression tuned with a grid search returned the following feature importances and low performance scores, indicating that the indicators we used to model health outcomes are insufficient and that socioeconomic factors are much more predictive. Specifically, the number of warehouses in a given zip code does not reflect the outcomes in that zip code.

Model	Estimator	Features	Target	$R^2$
1	SVR	all	asthma	0.29

Random Forest Regression Model Tuning over GridSearch
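
A minimal sketch of this kind of tuning with scikit-learn's `GridSearchCV`, on synthetic data. The parameter grid, the split, and the data are assumptions for illustration; the grid actually searched is in the project notebooks.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import max_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(42)

# Synthetic data: only the first of five features carries signal.
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] + rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hypothetical grid; cross-validated R^2 selects the best combination.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42), param_grid, scoring="r2", cv=3
)
search.fit(X_train, y_train)

# Evaluate the refit best model on held-out data.
pred = search.predict(X_test)
test_r2 = r2_score(y_test, pred)
test_max_error = max_error(y_test, pred)
importances = search.best_estimator_.feature_importances_

print(search.best_params_)
print(f"R^2 = {test_r2:.2f}, max error = {test_max_error:.2f}")
print(importances)
```

Scoring on a held-out split rather than the training data is what lets $R^2$ drop to zero or below when the features carry little signal.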

Using all Features:

Model	Estimator	Target	$R^2$	Max Error	Scale
1	RF	Asthma	-0.35	10.7	#/10,000
2	RF	Low Birth Weight	-0.25	7	#/100
3	RF	Cardiovascular Disease	0.01	30	#/10,000

Dropping SocioEconomic Features:

Model	Estimator	Target	$R^2$	Max Error	Scale
1	RF	Asthma	-0.01	180	#/10,000
2	RF	Low Birth Weight	-0.25	8	#/100
3	RF	Cardiovascular Disease	-0.05	30	#/10,000

None of these models reliably predicts health outcomes.

Feature importances with all features; model target is asthma.

Feature importances with all features; model target is heart attack hospitalization.

Feature importances with all features; model target is low birth weight.

Feature importances without socio-economic features; model target is asthma.

Feature importances without socio-economic features; model target is heart attack hospitalization.

Feature importances without socio-economic features; model target is low birth weight.
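
A negative $R^2$, as in several rows above, means the model does worse on the evaluated data than always predicting the mean of the target, since $R^2 = 1 - SS_{res}/SS_{tot}$. A minimal illustration with made-up numbers, using scikit-learn's `r2_score` and `max_error` (the two metrics reported in the tables):

```python
import numpy as np
from sklearn.metrics import max_error, r2_score

# Made-up targets and deliberately bad predictions (reversed order).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([4.0, 3.0, 2.0, 1.0])

# By hand: residual and total sums of squares.
ss_res = float(np.sum((y_true - y_pred) ** 2))         # 9 + 1 + 1 + 9 = 20
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # 2.25 + 0.25 + 0.25 + 2.25 = 5
r2_by_hand = 1.0 - ss_res / ss_tot                     # 1 - 20/5 = -3

r2 = r2_score(y_true, y_pred)
worst_miss = max_error(y_true, y_pred)  # largest absolute residual: 3
print(r2_by_hand, r2, worst_miss)
```

Predicting the mean (2.5) for every row would give $R^2 = 0$, so any value below zero is a sign that the model has not captured the relationship.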

Marshall: XGBoost

Features used: total population, ozone, pm2.5, diesel pm, pesticides, traffic, cleanup sites, groundwater threats, haz. waste, imp. water bodies, solid waste, pollution burden, low birth weight, education, linguistic isolation, poverty, pop. char., drinking water, tox. release, unemployment, ces_per, cardiovascular disease, housing burden, est total, est gen, est cold, est farm, est other

Model	Type	Evaluation metric	Train Accuracy	Test Accuracy	RMSE score	MAE test score
XGBoost	gradient boosting supervised regression	R2, RMSE, & MAE	0.9139	0.7853	13.6915	9.3296
Random Forest	meta estimator regression	R2, RMSE, & MAE	0.9634	0.7503	14.7307	9.9952
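
The four numbers reported per model can be computed as in the sketch below, assuming the accuracy columns are $R^2$ values as the metric column suggests. It uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost so that the example has no extra dependency, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic regression problem with a nonlinear term and noise.
X = rng.normal(size=(400, 6))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(scale=3, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=7)

# Stand-in for XGBoost: scikit-learn's gradient boosting regressor.
model = GradientBoostingRegressor(random_state=7).fit(X_tr, y_tr)

def report(model, X_tr, y_tr, X_te, y_te):
    """Return train/test R^2 plus test RMSE and MAE for a fitted regressor."""
    pred_te = model.predict(X_te)
    return {
        "r2_train": r2_score(y_tr, model.predict(X_tr)),
        "r2_test": r2_score(y_te, pred_te),
        "rmse_test": float(np.sqrt(mean_squared_error(y_te, pred_te))),
        "mae_test": mean_absolute_error(y_te, pred_te),
    }

metrics = report(model, X_tr, y_tr, X_te, y_te)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A train score well above the test score, as in both rows of the table, is the usual sign of overfitting, so the test column is the one to compare across models.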

Conclusion

We saw no meaningful relationship that would support robust models of health outcomes from our warehouse-aggregated data, for any of the models we tried to fit. CalEnviroScreen scores strongly reflect asthma and pollution burden but not hospitalization rates. The socioeconomic factors aggregated in CalEnviroScreen produced the best predictive models for negative health outcomes, highlighting the need for the State to address the root causes of pollution burden.

Next Steps

We will continue to aggregate more data at finer granularity and explore the raw data from which CalEnviroScreen was sourced and modeled. Furthermore, spatial and temporal analyses should provide more robust models for projecting outcomes and identifying communities to support.

Background: From the Press

EPI: Warehouses Do Not Generate Broad-Based Employment (cites the source I have for fulfillment center locations)
CBRE: 2022 North America Industrial Big Box Review & Outlook: Los Angeles County

