This project focuses on clustering data into disjoint sets, where the items within a set are more similar to one another than to items in other sets. Clustering, a form of unsupervised learning discussed in Chapter 12 of the class reference text, lets me uncover meaningful patterns in the data without predefined labels. Specifically, I implement k-means clustering as outlined in Chapter 12.4 of the text and supported by various instructional videos.
For this project, I analyze two datasets:
- Locations of Political Violence in India (2016-2024): A dataset detailing incidents of political violence.
- Locations of Police Shootings in the Continental United States: A dataset detailing incidents of police shootings across the US.
To measure distances between incidents, I use the Geopy library, which calculates geodesic distances between latitude and longitude coordinates; these distances are essential for the clustering analysis.
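To make the distance step concrete, here is a minimal sketch of a great-circle (haversine) distance using only the Python standard library. It is a spherical stand-in, not the project's actual code: Geopy's `geopy.distance.geodesic` works on an ellipsoidal Earth model and will give slightly different values. The function name `haversine_km` and the sample coordinates are assumptions for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; assumes a perfectly spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-centre coordinates, chosen for illustration (not rows from the datasets).
los_angeles = (34.05, -118.24)
houston = (29.76, -95.37)
distance_km = haversine_km(*los_angeles, *houston)
```

With Geopy installed, the corresponding call would be `geodesic(los_angeles, houston).km` after `from geopy.distance import geodesic`.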
The analysis of police shooting data reveals several critical insights:
- State Incidents: I examined which states exhibit the highest number of police shooting incidents, comparing results across different cluster configurations (2, 3, 4, 5, and 6 clusters).
- Age and Death Count Relationship: I investigated the relationship between age, manner of death, and the number of fatalities in the top five states for each of the six clusters.
- Top States Over Time: I analyzed trends in police shootings in the top five states for each year from 2015 to 2022.
- Race and City Correlation: I assessed correlations between the top five cities, racial demographics, and the number of police shootings.
Cluster Analysis:
- In my k-means clustering analysis, Cluster 2 was frequently the largest cluster across the models, with Florida notably represented in it, which suggests a regional concentration of incidents. In the six-cluster model, California stood out in Cluster 1.
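The comparison of cluster sizes across models can be sketched as follows. This is a plain-Python version of Lloyd's algorithm (the standard k-means iteration), used here as a stand-in for scikit-learn's `KMeans` so that the example has no dependencies. The points are synthetic, generated around three hypothetical centres, and latitude and longitude are treated as flat Euclidean coordinates, which is a simplification.

```python
import random
from collections import Counter


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels


# Synthetic (lat, lon) points around three made-up centres; not the real dataset.
rng = random.Random(42)
centres = [(34.0, -118.0), (30.0, -95.0), (28.0, -82.0)]
points = [(lat + rng.gauss(0, 1), lon + rng.gauss(0, 1)) for lat, lon in centres for _ in range(50)]

# For each model (2 to 6 clusters), record which cluster is largest and its size.
largest_by_k = {}
for k in range(2, 7):
    _, labels = kmeans(points, k)
    cluster_id, size = Counter(labels).most_common(1)[0]
    largest_by_k[k] = (cluster_id, size)
```

Cluster numbers from k-means are arbitrary labels, so a name like "Cluster 2" is only meaningful within a single fitted model. With scikit-learn, `KMeans(n_clusters=k).fit_predict(X)` would replace the hypothetical `kmeans` helper.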
Top States for Incidents:
- From 2015 to 2021, California consistently recorded the highest number of incidents, followed closely by Texas and Florida, indicating persistent challenges related to police violence.
Yearly Trends:
- California recorded the most incidents each year, with Texas and Florida regularly appearing among the top three states, highlighting ongoing policing issues.
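A per-year ranking like this one can be produced by counting incidents per state within each year. The sketch below uses only the standard library; the rows are made-up toy data, and the helper name `top_states_by_year` is hypothetical. In pandas, a similar count would typically come from `df.groupby(["year", "state"]).size()`, assuming the `year` and `state` columns are present.

```python
from collections import Counter, defaultdict

# Made-up toy rows of (year, state), for illustration only; not taken from the dataset.
records = [
    (2015, "CA"), (2015, "CA"), (2015, "TX"), (2015, "FL"),
    (2016, "CA"), (2016, "TX"), (2016, "TX"), (2016, "AZ"),
]


def top_states_by_year(rows, n=5):
    """Return {year: [(state, count), ...]} with the n most frequent states per year."""
    counts = defaultdict(Counter)
    for year, state in rows:
        counts[year][state] += 1
    return {year: counter.most_common(n) for year, counter in sorted(counts.items())}


top = top_states_by_year(records, n=2)
```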
Racial Disparities:
- The analysis indicated that police shooting incidents disproportionately affect Black individuals, particularly in cities like Los Angeles, Phoenix, and Houston, raising serious concerns about systemic discrimination in law enforcement.
Urban Concentration:
- Most police shooting incidents were concentrated in urban areas, suggesting that policing in these densely populated regions poses distinct challenges.
Data Collection:
- The dataset was obtained from the link specified for the class; the analysis uses the 'fatal-police-shootings-data-continental-US.csv' file.
Data Preparation:
- I examined the dataset thoroughly for completeness and accuracy and documented every step for reproducibility.
Variable Creation:
- Key columns for correlation analysis included age, race, state, year, manner_of_death, and gender.
Libraries Used:
- Pandas: For data manipulation and handling missing values.
- Folium: For creating interactive maps and visualizing data points.
- Scikit-learn: For applying the K-means clustering algorithm.
- Seaborn: For generating a color palette for clusters.
Analytic Method:
- The analysis proceeded through data preprocessing, descriptive statistics to understand the distribution of incidents, application of the K-means algorithm for clustering, and visualization of the clustered data on maps (with legends added for clarity) to identify spatial patterns across the United States. """