This project explores the Titanic dataset using Python and popular data science libraries. The goal of this EDA is to understand the structure of the data, identify relationships between features, and gain insights that can inform predictive modeling.
- Loaded the dataset into a Pandas DataFrame.
- Checked the shape and data types.
- Used
.info()
and.describe()
for summary statistics.
- Dropped and handled missing columns (e.g., dropped
'Cabin'
, handled missing'Age'
replacing it with median).
- Plotted distributions of features like:
Age
Fare
Survived
- Used bar plots and box plots to examine survival rates against:
Sex
Pclass
- Used boxplot to detect outliers in:
Fare
Age
- Selected only numerical columns.
- Plotted a heatmap of the correlation matrix to identify strong relationships.
- Notable findings:
Fare
is positively correlated withSurvived
.Pclass
is negatively correlated with bothSurvived
andFare
.
- Notable findings:
pandas
numpy
matplotlib
seaborn
- Gender significantly impacted survival: Female passengers consistently showed a much higher survival rate compared to males.
- Passenger class correlated with survival: Individuals in 1st class were considerably more likely to survive than those in 2nd or 3rd class.
- Fare and survival had a positive link: A higher fare paid for a ticket appears to correlate with a greater chance of survival. This might be due to the correlation between fare and passenger class or potentially better cabin locations.
- Age's impact on survival was weak: While there was a slight negative correlation between age and survival, it suggests that children might have been prioritized in rescue efforts.