Skip to content

jhonadc/ML-penguins_cluster

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Penguin Clustering Analysis

Overview

This repository contains a Jupyter Notebook that performs an unsupervised learning analysis, specifically clustering, on a dataset containing various physical measurements of penguins. The primary objective is to group penguins into distinct clusters based on their physical characteristics, leveraging machine learning techniques to uncover any inherent groupings within the data.

Dataset Description

The dataset comprises several physical measurements of penguins, including:

  • culmen_length_mm: Length of the penguin's culmen (bill) in millimeters.
  • culmen_depth_mm: Depth of the penguin's culmen in millimeters.
  • flipper_length_mm: Length of the penguin's flipper in millimeters.
  • body_mass_g: Body mass of the penguin in grams.
  • sex: Sex of the penguin (Male/Female).

Data Source: [ ]

Analysis Workflow

1. Data Preprocessing

The initial phase involves loading the dataset and conducting preliminary data cleaning, which includes handling null values and identifying outliers. This ensures the quality and integrity of the data used for clustering.

2. Exploratory Data Analysis (EDA)

We conduct a thorough EDA to understand the distributions and relationships among the different features. This includes generating pairwise plots and correlation matrices to visually inspect the data and identify any significant patterns or correlations.

3. Feature Engineering

The dataset undergoes feature engineering to prepare it for machine learning models. This involves creating dummy variables for categorical features and scaling the numerical features to standardize their ranges.

4. Principal Component Analysis (PCA)

PCA is applied to reduce the dimensionality of the data, enhancing the efficiency of the clustering process. The number of components is chosen based on the explained variance ratio, ensuring that the selected components capture the majority of the variance in the data.

5. Clustering with K-Means

The K-Means algorithm is employed to cluster the data. We utilize the Elbow Method to determine the optimal number of clusters, ensuring that the chosen number provides coherent and interpretable clusters.

6. Cluster Analysis

Once the clusters are defined, we perform an analysis to characterize each cluster. This includes examining the mean values of the original features within each cluster, which provides insights into the distinguishing characteristics of each group.

Requirements

The analysis is conducted in a Jupyter Notebook environment with the following key Python packages:

  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • scikit-learn

Ensure these packages are installed in your environment to successfully run the notebook.

Usage

To replicate this analysis, download or clone this repository and open the Jupyter Notebook in a compatible environment. Ensure the dataset is located in the specified path within the notebook, or adjust the path accordingly.

Contributions

This is an exposure project, with no further goal that would entail contributions. Feedback are always welcome.

License

Free

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published