This repository contains a Jupyter Notebook that performs an unsupervised learning analysis, specifically clustering, on a dataset containing various physical measurements of penguins. The primary objective is to group penguins into distinct clusters based on their physical characteristics, leveraging machine learning techniques to uncover any inherent groupings within the data.
The dataset comprises several physical measurements of penguins, including:
culmen_length_mm: Length of the penguin's culmen (bill) in millimeters.culmen_depth_mm: Depth of the penguin's culmen in millimeters.flipper_length_mm: Length of the penguin's flipper in millimeters.body_mass_g: Body mass of the penguin in grams.sex: Sex of the penguin (Male/Female).
Data Source: [ ]
The initial phase involves loading the dataset and conducting preliminary data cleaning, which includes handling null values and identifying outliers. This ensures the quality and integrity of the data used for clustering.
We conduct a thorough EDA to understand the distributions and relationships among the different features. This includes generating pairwise plots and correlation matrices to visually inspect the data and identify any significant patterns or correlations.
The dataset undergoes feature engineering to prepare it for machine learning models. This involves creating dummy variables for categorical features and scaling the numerical features to standardize their ranges.
PCA is applied to reduce the dimensionality of the data, enhancing the efficiency of the clustering process. The number of components is chosen based on the explained variance ratio, ensuring that the selected components capture the majority of the variance in the data.
The K-Means algorithm is employed to cluster the data. We utilize the Elbow Method to determine the optimal number of clusters, ensuring that the chosen number provides coherent and interpretable clusters.
Once the clusters are defined, we perform an analysis to characterize each cluster. This includes examining the mean values of the original features within each cluster, which provides insights into the distinguishing characteristics of each group.
The analysis is conducted in a Jupyter Notebook environment with the following key Python packages:
- Pandas
- NumPy
- Matplotlib
- Seaborn
- scikit-learn
Ensure these packages are installed in your environment to successfully run the notebook.
To replicate this analysis, download or clone this repository and open the Jupyter Notebook in a compatible environment. Ensure the dataset is located in the specified path within the notebook, or adjust the path accordingly.
This is an exposure project, with no further goal that would entail contributions. Feedback are always welcome.
Free