LA TIMES Op-Ed: We mapped the warehouse takeover of the Inland Empire by Dr. Susan Phillips (professor of Environmental Analysis), on the team's work at the Robert Redford Conservancy for Southern California Sustainability at Pitzer College.
Images for presentation slides mapping the increase in warehouses are screen captures from this Tableau visualization (completed to visualize the changes in density) and complement the ongoing work of researchers in the Inland Empire.
We aggregated data from four CA OEHHA CalEnviroScreen reports, appended warehouse counts extracted from US Census business data, and trained numerous estimators; however, none reliably modelled health outcomes using the aggregated data.
The California Office of Environmental Health Hazard Assessment (CA OEHHA) has compiled data from various government agencies to create a mapping tool used to identify the communities most affected by various pollution sources. It has produced four reports in total (2013, 2014, 2018 and 2021), with scores for each identified disadvantaged community that disproportionately faced environmental impacts. Because transportation of goods and people continues to be the primary source of emissions, and because increased commerce requires increased warehousing and associated transportation, we sought to create models to answer the following questions:
- Is the effect of increased warehouse presence on health outcomes quantifiable?
- What are the primary mitigating factors the State can address in response to increased warehouse density?
- How well do the CalEnviroScreen scores reflect emergency healthcare counts?
- What indicators from the CalEnviroScreen dataset best determine the number of emergency healthcare visits?
Description: CA OEHHA compiles data specifically regarding pollutants and the communities affected by them.
Source | File | Report Date | Shape |
---|---|---|---|
CalEnviroScreen 1 | Data (xlsx) | April 2013 | (1500, 49) |
CalEnviroScreen 2 | Data (xlsx) | Oct 2014 | (8035, 51) |
CalEnviroScreen 3 | Data (xlsx) | June 2018 | (8035, 51) |
CalEnviroScreen 4 | Data, Dictionary ZIP | Oct 2021 | (8035, 51) |
Combined Data | | | (25444, 62) |
Used Data | | | (14912, 59) |
Description: The US Census data is business counts by county, zip code and business type.
name | description |
---|---|
est total | total number of warehouses in class |
est ag | total number of warehouses for agricultural storage |
est cold | total number of warehouses for cold storage |
est gen | total number of warehouses for general storage |
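The aggregation described earlier (stacking the reports, then appending warehouse counts by zip code) can be sketched roughly as follows. This is a minimal, library-free illustration rather than the project's notebook code: the column names `zip`, `report_year`, `asthma` and `lead`, the helper names, and the toy rows are all assumptions made for the example.

```python
def stack_reports(reports):
    """Stack rows from several reports, tagging each row with its report year.

    `reports` maps a report year to a list of row dicts. Columns missing from
    a report are filled with None so every output row has the same keys.
    """
    all_columns = []
    for rows in reports.values():
        for row in rows:
            for col in row:
                if col not in all_columns:
                    all_columns.append(col)
    stacked = []
    for year, rows in reports.items():
        for row in rows:
            full = {col: row.get(col) for col in all_columns}
            full["report_year"] = year
            stacked.append(full)
    return stacked


def append_warehouse_counts(rows, counts_by_zip):
    """Left-join warehouse counts onto each row by zip code.

    Assumption: a zip code with no census entry is treated as zero warehouses.
    """
    count_cols = ["est total", "est ag", "est cold", "est gen"]
    joined = []
    for row in rows:
        counts = counts_by_zip.get(row["zip"], {})
        joined.append({**row, **{c: counts.get(c, 0) for c in count_cols}})
    return joined


# Toy data, invented for illustration only.
reports = {
    2018: [{"zip": "92335", "asthma": 60.1}],
    2021: [{"zip": "92335", "asthma": 58.4, "lead": 41.0},
           {"zip": "91711", "asthma": 30.2, "lead": 12.5}],
}
counts_by_zip = {
    "92335": {"est total": 12, "est ag": 0, "est cold": 2, "est gen": 10},
}

combined = append_warehouse_counts(stack_reports(reports), counts_by_zip)
```

Taking the union of columns is one way a combined table can end up wider than any single report, as the shapes in the table above suggest; whether the notebooks do exactly this is not stated here.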
variable name | type | description |
---|---|---|
diesel pm | numeric | particulate matter, spatially modelled |
ozone | numeric | concentration |
traffic | numeric | volume: vehicles per length of time over fixed distance |
We trained multiple estimators to model the outcomes reported in the CalEnviroScreen reports to build upon that work:
variable name | type | description |
---|---|---|
asthma hospitalization | numeric | incidence rate, cases/ 10k population |
heart attack hospitalization | numeric | incidence rate, cases/ 10k population |
low birth weight | numeric | % newborns weighing < 2.5 kg (#/100 live births) |
- Outliers in features were not dropped: because the data in the reports is meticulously collected and averaged, eliminating observations outside some measure of central tendency would remove the communities that are most affected by the selected factors.
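To illustrate the reasoning above, the sketch below applies a conventional 1.5 × IQR filter to a handful of made-up diesel PM values. The rule, the function name and the numbers are assumptions chosen for the example; the point is only that such a filter would remove exactly the highest-burden observation.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values that a k * IQR rule would flag as outliers."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up diesel PM values for eight hypothetical tracts.
diesel_pm = [8.0, 9.5, 10.0, 11.0, 12.0, 12.5, 13.0, 55.0]
flagged = iqr_outliers(diesel_pm)  # [55.0]
```

Dropping the flagged value would discard the one tract with by far the highest exposure, which is the kind of community the analysis is meant to surface.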
With health outcomes on the vertical axis, the plots suggest that locations with high warehouse density have higher rates of negative health outcomes.
With health outcomes on the vertical axis, the plots suggest that locations with high $PM_{2.5}$ have higher rates of negative health outcomes.
With health outcomes on the vertical axis, the plots suggest that locations with high traffic concentrations have higher rates of negative health outcomes.
The linear models give us a sense of which of the features to focus on if we want to address these health issues.
Coefficients for these models were scaled and used to identify the features that matter most for the health outcomes. Ultimately, we found that the socioeconomic factors of unemployment, housing burden, linguistic isolation, and poverty were, as an ensemble, more effective at predicting the health outcomes than the ensemble of pollution and warehouse data.
In fact, we found that warehouse counts have nearly no predictive value.
Model | Features | Target | $R^2$ |
---|---|---|---|
1 | selected: no time, space or CES scores | Asthma | 0.29 |
2 | selected: no time, space or CES scores | LBW | 0.14 |
3 | selected: no time, space or CES scores | CVD | 0.23 |
4 | warehouse counts only | Asthma | 0.017 |
5 | warehouse counts only | LBW | 0.002 |
6 | warehouse counts only | CVD | 0.013 |
7 | socio-economic only | Asthma | 0.263 |
8 | socio-economic only | LBW | 0.1337 |
9 | socio-economic only | CVD | 0.11 |
The models that yielded these metrics are developed in Notebook 7: Linear Regression. The procedure for processing the raw data leading up to this notebook begins with Notebook 1: EDA on CAES 4 Data.
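For readers unfamiliar with scaled coefficients, one common approach is to standardize each feature before fitting, so that coefficient magnitudes are comparable across features measured in different units. This is an assumption about the general technique; the notebooks define the exact procedure used.

$$
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad \hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j z_{ij}
$$

Here $x_{ij}$ is feature $j$ for observation $i$, and $\bar{x}_j$ and $s_j$ are that feature's mean and standard deviation. Under this reading, $\beta_j$ is the change in the predicted outcome per one-standard-deviation change in feature $j$, so a larger $|\beta_j|$ marks a more influential feature.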
Random Forest Regression over a grid search to find optimal hyperparameter values returned the following importances and low performance scores, indicating that the metrics we used to model health outcomes are insufficient and highlighting that socioeconomic factors are much more predictive. Specifically, the number of warehouses in a given zip code does not reflect the outcomes in that zip code.
Model | Estimator | Features | Target | $R^2$ |
---|---|---|---|---|
1 | SVR | all | asthma | 0.29 |

Model | Estimator | Target | $R^2$ | Max Error | Scale |
---|---|---|---|---|---|
1 | RF | Asthma | -0.35 | 10.7 | #/10,000 |
2 | RF | Low Birth Weight | -0.25 | 7 | #/100 |
2 | RF | Cardiovascular Disease | 0.01 | 30 | #/10,000 |

Model | Estimator | Target | $R^2$ | Max Error | Scale |
---|---|---|---|---|---|
1 | RF | Asthma | -0.01 | 180 | #/10,000 |
2 | RF | Low Birth Weight | -0.25 | 8 | #/100 |
2 | RF | Cardiovascular Disease | -0.05 | 30 | #/10,000 |
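The negative $R^2$ values above can look surprising. The sketch below, written from the standard definitions rather than taken from the project notebooks, shows how $R^2$ and max error are computed and why $R^2$ drops below zero when a model predicts worse than the mean of the observed values. The toy numbers are invented for illustration; scikit-learn's `r2_score` and `max_error` compute the same quantities.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def max_error(y_true, y_pred):
    """Largest absolute difference between an observation and its prediction."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


# Toy asthma rates (cases per 10,000), invented for illustration.
observed = [20.0, 40.0, 60.0, 80.0]
mean_only = [50.0] * 4             # always predict the observed mean
worse = [80.0, 60.0, 40.0, 20.0]   # predictions that run the wrong way

r2_mean = r2_score(observed, mean_only)   # 0.0
r2_worse = r2_score(observed, worse)      # -3.0
worst_miss = max_error(observed, worse)   # 60.0
```

An $R^2$ near zero therefore means the model does about as well as guessing the average rate everywhere, and a negative value means it does worse.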
Feature importances with all features; model target is asthma.
Feature importances with all features; model target is heart-attack hospitalization.
Feature importances with all features; model target is low birth weight.
Feature importances without socio-economic features; model target is asthma.
Feature importances without socio-economic features; model target is heart-attack hospitalization.
Feature importances without socio-economic features; model target is low birth weight.
Features used: total population, ozone, pm2.5, diesel pm, pesticides, traffic, cleanup sites, groundwater threats, haz. waste, imp. water bodies, solid waste, pollution burden, low birth weight, education, linguistic isolation, poverty, pop. char., drinking water, tox. release, unemployment, ces_per, cardiovascular disease, housing burden, est total, est gen, est cold, est farm, est other

Model | Type | Evaluation metrics | Train Accuracy | Test Accuracy | RMSE score | MAE test score |
---|---|---|---|---|---|---|
XGBoost | gradient boosting supervised regression | R2, RMSE, & MAE | 0.9139 | 0.7853 | 13.6915 | 9.3296 |
Random Forest | meta estimator regression | R2, RMSE, & MAE | 0.9634 | 0.7503 | 14.7307 | 9.9952 |
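For reference, the RMSE and MAE columns follow the standard definitions sketched below. This is an illustrative, dependency-free version with made-up numbers, not the notebook code; the function and variable names are assumptions.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: the average size of a miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented observations and predictions, for illustration only.
observed = [30.0, 50.0, 70.0, 90.0]
predicted = [32.0, 46.0, 70.0, 100.0]

rmse_value = rmse(observed, predicted)
mae_value = mae(observed, predicted)
```

RMSE is never smaller than MAE, and the gap widens when a few predictions miss badly. A gap like the one in the table above (roughly 13.7 versus 9.3 for XGBoost) is consistent with some large individual errors, though the table alone cannot confirm that.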
We saw no meaningful relationship that would support robust models of health outcomes from our warehouse-aggregated data, in any of the models we tried to fit. CalEnviroScreen scores strongly reflect asthma and pollution burden but not hospitalization rates. Socioeconomic factors aggregated in CalEnviroScreen produced the best predictive models for negative health outcomes, highlighting the need for the State to address root causes of pollution burden.
We will continue to aggregate more data at finer granularity and to explore the raw data from which CalEnviroScreen was sourced and modeled. Furthermore, spatial and temporal analyses will provide more robust models for projecting outcomes and addressing the communities that need support.
- EPI: Warehouses Do Not Generate Broad-Based Employment (cites the source I have for fulfillment center locations)
- CBRE: 2022 North America Industrial Big Box Review & Outlook: Los Angeles County