This project performs a Canonical Correlation Analysis (CCA) on socio-economic and demographic data from Our World in Data. The primary objective is to investigate the relationships between objectively measurable factors (e.g., meat supply, university enrollment) and subjective, self-reported indicators (e.g., happiness, trust levels) across 22 countries.
The project aims to uncover correlations between seemingly unrelated datasets. CCA finds linear combinations of each variable set (the canonical variates) whose mutual correlation is as large as possible, which helps show how objective societal conditions relate to subjective well-being and perceptions. Note that CCA measures association only; it does not by itself establish that one set of variables influences the other.
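
As a sketch of what "as large as possible" means here, the first pair of canonical variates can be written as below. The notation is an assumption introduced for illustration and is not taken from the project code: $X$ and $Y$ are the two standardized data matrices, $a$ and $b$ are weight vectors, and $\Sigma_{XX}$, $\Sigma_{YY}$, $\Sigma_{XY}$ are the within-set and between-set covariance matrices.

$$
(a_1, b_1) = \arg\max_{a,\,b} \operatorname{corr}(Xa,\, Yb)
           = \arg\max_{a,\,b} \frac{a^\top \Sigma_{XY}\, b}{\sqrt{a^\top \Sigma_{XX}\, a}\;\sqrt{b^\top \Sigma_{YY}\, b}}
$$

Each later pair $(a_k, b_k)$ maximizes the same ratio subject to its variates being uncorrelated with all earlier ones. With $p$ variables in one set and $q$ in the other, there are at most $\min(p, q)$ such pairs.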
The analysis utilizes two primary datasets, `data1.csv` and `data2.csv`, downloaded from `ourworldindata.org`. These files contain statistical survey results, with the latest available data before 2020 for 22 countries.
`data1.csv` includes:

- `happiness`: Self-reported life satisfaction.
- `trust_level`: Share of people who agree with "most people can be trusted."
- `chocolate`: Per capita consumption of cocoa beans (in kg).

`data2.csv` includes:

- `annual_work`: Average number of annual work hours.
- `food_cost`: Share of income spent on food.
- `meat_yearly`: Yearly supply of meat per person.
- `overweight`: Share of the adult population that is overweight or obese.
- `articles_per_million`: Number of research articles published in a year per million of population.
- `create_research`: Share of professionals in research and development per million of population.
- `university_enrolment`: Gross enrollment ratio in tertiary education.
- `electdem`: Electoral democracy index.

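
The column lists above can be checked at load time. The following is a minimal sketch, assuming both files are plain comma-separated tables whose headers match the names listed here; `load_variables` and `plot_histograms` are hypothetical helper names, and any extra columns in the files (such as a country label) are simply ignored.

```python
import pandas as pd

# Column names exactly as listed in this README.
DATA1_COLUMNS = ["happiness", "trust_level", "chocolate"]
DATA2_COLUMNS = [
    "annual_work", "food_cost", "meat_yearly", "overweight",
    "articles_per_million", "create_research",
    "university_enrolment", "electdem",
]


def load_variables(source, expected_columns):
    """Read a CSV and return only the expected columns.

    Raises ValueError if any expected column is absent, so a renamed
    or missing variable is caught before the analysis starts.
    """
    frame = pd.read_csv(source)
    missing = [name for name in expected_columns if name not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return frame[expected_columns]


def plot_histograms(frame):
    """Draw one histogram per variable (needs matplotlib installed)."""
    import matplotlib.pyplot as plt

    frame.hist(bins=10, figsize=(10, 6))
    plt.tight_layout()
    plt.show()


# Example usage (assumes the files sit in the working directory):
# data1 = load_variables("data1.csv", DATA1_COLUMNS)
# data2 = load_variables("data2.csv", DATA2_COLUMNS)
# plot_histograms(data1)
```

Selecting the columns explicitly keeps the two variable sets in a known order, which makes the canonical weights easier to read later.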
Additionally, the project involves integrating a third dataset of my choice from `ourworldindata.org`, selected to be available for all 22 countries for the year 2019, further expanding the scope of the CCA.
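
One way to attach such a dataset is sketched below. It rests on several assumptions that are not confirmed by this README: that the downloaded table is in long format with `Entity` and `Year` columns, that the existing data has a key column called `country`, and that country names match between the two tables. `add_variable` is a hypothetical helper name.

```python
import pandas as pd


def add_variable(base, owid_frame, value_column, year=2019, key="country"):
    """Attach one new variable for a single year to an existing table.

    base         : existing table with one row per country (assumed key column).
    owid_frame   : long-format table with "Entity", "Year" and the value column.
    value_column : name of the column holding the new variable.
    """
    # Keep only the requested year and the two columns that are needed.
    rows = owid_frame.loc[owid_frame["Year"] == year, ["Entity", value_column]]
    rows = rows.rename(columns={"Entity": key})

    # A left join keeps every country already in the analysis;
    # validate= guards against duplicate country rows on either side.
    merged = base.merge(rows, on=key, how="left", validate="one_to_one")

    # The new variable must exist for every country, so fail loudly otherwise.
    missing = merged.loc[merged[value_column].isna(), key].tolist()
    if missing:
        raise ValueError(f"no {year} value for: {missing}")
    return merged
```

Failing on missing countries, rather than silently dropping them, keeps the sample at the same 22 rows so the before and after results stay comparable.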
The project follows a structured approach:
- Data Import and Visualization: Both `data1.csv` and `data2.csv` are imported into Python. Initial histograms are generated to visualize the distribution of each variable.
- Data Preprocessing: All variables undergo necessary standardisation to prepare them for CCA, ensuring that differences in scale do not disproportionately influence the analysis.
- Canonical Correlation Analysis (CCA) Implementation:
  - CCA is implemented and applied to the standardized datasets.
  - The initial results are interpreted to identify the canonical variates and their correlations, providing insights into the relationships between the objective and subjective variable sets.
- Expanded Analysis: A new, relevant dataset is downloaded from `ourworldindata.org` (e.g., [You'd specify your chosen dataset and explain the rationale for choosing it here in the code or a separate documentation]). This new data is merged with the existing datasets.
- Re-run CCA and Interpretation: CCA is re-run with the augmented dataset. The results are then re-interpreted, comparing them to the initial findings and discussing any new insights or changes in correlation patterns.
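
The standardisation and CCA steps above can be sketched in plain NumPy as follows. This is an illustrative version under stated assumptions, not necessarily the implementation used in this project: it computes CCA through a QR decomposition of each data block followed by an SVD, and it assumes more rows than columns and no perfectly collinear columns. The function names `standardize` and `canonical_correlations` are hypothetical. scikit-learn's `CCA` class in `sklearn.cross_decomposition` is a possible alternative.

```python
import numpy as np


def standardize(values):
    """Center each column and scale it to unit sample standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean(axis=0)) / values.std(axis=0, ddof=1)


def canonical_correlations(x, y):
    """Canonical correlations and weights for two column-centered blocks.

    x : array of shape (n, p), y : array of shape (n, q), same rows.
    Returns (rho, a, b) where rho[i] is the correlation between the
    canonical variates x @ a[:, i] and y @ b[:, i].
    """
    # Orthonormal bases for the column spaces of x and y.
    qx, rx = np.linalg.qr(x)
    qy, ry = np.linalg.qr(y)

    # Singular values of Qx^T Qy are the canonical correlations,
    # returned in descending order.
    u, s, vt = np.linalg.svd(qx.T @ qy, full_matrices=False)

    # Map the rotations back to weights on the original columns.
    a = np.linalg.solve(rx, u)
    b = np.linalg.solve(ry, vt.T)

    # Clip guards against values like 1.0000000002 from rounding.
    return np.clip(s, 0.0, 1.0), a, b


# Example usage with the two variable sets as NumPy arrays:
# rho, a, b = canonical_correlations(standardize(set1), standardize(set2))
```

Because the inputs are standardized first, the size of each entry in `a` and `b` is comparable across variables, which is what makes the weights usable for interpretation. With only 22 countries and up to 11 variables, high sample correlations are expected even by chance, so the first canonical correlation should be read with caution.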