This workflow demonstrates data exploration, visualization, and classification model training for heart disease prediction. Below is an outline of the steps taken:
- Load the dataset: A CSV file named `heart.csv` is read into a pandas DataFrame.
- Initial exploration: Display the first few rows (`head()`), view descriptive statistics (`describe()`), and confirm there are no missing values (`isnull().values.any()`).
- Basic checks: Investigate the distribution of specific features (e.g., `sex`).
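The loading and exploration steps above can be sketched as follows. The inline CSV here is a tiny stand-in for the real `heart.csv` (column names are taken from the outline; the actual file has more rows and columns):

```python
import io
import pandas as pd

# Tiny stand-in for heart.csv; the real dataset has ~300 rows and more columns.
csv_data = """age,sex,trtbps,thalachh,output
63,1,145,150,1
37,1,130,187,1
56,0,120,178,0
57,0,140,148,0
"""

df = pd.read_csv(io.StringIO(csv_data))

print(df.head())                    # first few rows
print(df.describe())                # descriptive statistics
print(df.isnull().values.any())     # False -> no missing values
print(df["sex"].value_counts())     # distribution of a specific feature
```

In the notebook itself, `pd.read_csv("heart.csv")` replaces the inline sample.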
- Subset the data: Separate records for patients with heart disease (`output == 1`) and those without (`output == 0`).
- Histograms: Plot the distribution of features like `age` and `thalachh` (max heart rate) for each group.
- Boxplots: Compare the distributions of numeric features (e.g., `thalachh`) between the two groups via boxplots.
- Correlation: Generate a correlation matrix to identify relationships between variables. Visualize it using heatmaps (e.g., Seaborn, Plotly).
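A minimal, plot-free sketch of the subsetting and correlation steps (synthetic data; the column names are assumed from the outline). The `groupby` summary is the numeric counterpart of what the histograms and boxplots show:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in with the columns the outline mentions.
df = pd.DataFrame({
    "age": rng.integers(29, 78, n),
    "thalachh": rng.integers(90, 202, n),
    "output": rng.integers(0, 2, n),
})

# Subset by class label.
disease = df[df["output"] == 1]
healthy = df[df["output"] == 0]

# Per-group distribution summary (what the histograms/boxplots visualize).
print(df.groupby("output")["thalachh"].describe())

# Correlation matrix (what the heatmap visualizes).
corr = df.corr()
print(corr)
```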
- Seaborn plots: Violin plots, pairplots, and strip plots to understand data patterns and potential class separability.
- Plotly: Interactive histograms and boxplots (e.g., for `output`, colored by `sex`) provide insights into how different demographics relate to heart disease; boxplots for features like `age`, `thalachh`, and `trtbps` grouped by `output`.
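As one illustration of the Seaborn step, a violin plot of `thalachh` by `output` might look like the sketch below. The data is synthetic, and the non-interactive Agg backend is used so the script runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "thalachh": rng.normal(150, 20, 300).round(),
    "output": rng.integers(0, 2, 300),
})

# Violin plot: distribution of max heart rate per class.
ax = sns.violinplot(data=df, x="output", y="thalachh")
ax.set_title("thalachh by output")
plt.savefig("violin_thalachh.png")
```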
- Information check: Review data types and non-null counts (`df.info()`).
- Splitting: Assign the target variable (`output`) to `y` and the remaining columns to `X`, then perform a train–test split (e.g., 80/20) to separate the dataset for training and validation.
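The splitting step can be sketched with scikit-learn's `train_test_split` (synthetic features; stratification is an assumption here, added to keep class balance similar in both sets):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 100
df = pd.DataFrame({
    "age": rng.integers(29, 78, n),
    "thalachh": rng.integers(90, 202, n),
    "trtbps": rng.integers(94, 200, n),
    "output": rng.integers(0, 2, n),
})

X = df.drop(columns=["output"])   # feature columns
y = df["output"]                  # target variable

# 80/20 split; stratify keeps the class ratio similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```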
- Scaling: Use a standard scaler (`StandardScaler`) to normalize numeric features in both training and test sets.
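A sketch of the scaling step with `StandardScaler`, fit on the training data only and then applied to both sets (the data here is synthetic):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X_train = rng.normal(140, 25, size=(80, 3))   # synthetic training features
X_test = rng.normal(140, 25, size=(20, 3))    # synthetic test features

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics

print(X_train_scaled.mean(axis=0).round(6))  # ~0 per column
print(X_train_scaled.std(axis=0).round(6))   # ~1 per column
```

Fitting on the training set alone avoids leaking test-set statistics into the model.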
- Define classifiers:
- Logistic Regression
- Decision Tree
- Random Forest
- K-Nearest Neighbors
- Support Vector Classifier
- XGBoost
- Fitting: Train each model on the scaled training data.
- Predictions: Generate predictions on the test set.
- Evaluation: Calculate accuracy scores (`accuracy_score`) to compare performance across models.
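The define/fit/predict/evaluate loop can be sketched as below. XGBoost is left out so the snippet needs only scikit-learn, and the data is synthetic rather than the real `heart.csv`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the heart-disease features/target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Classifier": SVC(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)        # train on scaled data
    preds = model.predict(X_test)      # predict on the test set
    scores[name] = accuracy_score(y_test, preds)

for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```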
- Parameter grid: Define candidate hyperparameters (e.g., `n_neighbors`, `weights`, `metric` for KNN).
- GridSearchCV: Perform a grid or randomized search across these parameters to find the best combination.
- Refit model: Using the best parameters, retrain and evaluate on the test set to check for improvements.
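The KNN tuning steps above can be sketched with `GridSearchCV`. The parameter values are illustrative, not the notebook's exact grid, and the data is synthetic; with `refit=True` (the default), the search automatically retrains on the full training set with the best parameters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate hyperparameters (illustrative values).
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
# Evaluate the refit model on the held-out test set.
print("test accuracy:", search.score(X_test, y_test))
```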
- Accuracy comparison: Identify the top-performing classifiers for heart disease prediction.
- Visual insights: From EDA plots, note key differentiating features (e.g., `thalachh` and `age` across different `output` groups).
- Refinement: Use hyperparameter tuning results to optimize models like KNN for better performance.
This project is open-source. You are free to use, modify, and distribute the code. If you use it in your own work, a reference or link back to this repository is appreciated.
Feel free to adapt this single-notebook guide to your preferences or add deeper domain-specific analysis.