K-Means Clustering Customer Segmentation is a user-friendly, interactive web application built with Python, scikit-learn, and Streamlit. It enables businesses and data enthusiasts to segment customers based on annual income and spending score using the K-Means clustering algorithm. The app provides real-time predictions, clear visualizations, and supports easy retraining with new data or features. Ideal for marketing, retail, banking, and more, it helps identify high-value or at-risk customer groups, personalize offers, and optimize business strategies. The modular codebase and comprehensive documentation make it easy to customize, extend, and deploy in various environments.
- Overview
- Business Use Cases
- Quick Start
- Features
- Project Structure
- Technical Architecture
- Dataset
- K-Means Clustering
- How It Works: Step-by-Step
- Behind the Scenes: Code Structure
- Customization & Extensibility
- Sample Input/Output
- Installation & Requirements
- Usage
- Advanced Usage
- Troubleshooting
- Best Practices
- Security & Privacy
- Documentation
- Contributing
- FAQ
- Support
- Community & Social
- Changelog
- Roadmap
- Glossary
- References & Acknowledgements
- License
- Citation
K-Means Clustering Customer Segmentation is an end-to-end, interactive web application for segmenting customers using unsupervised machine learning. Built with Python, scikit-learn, and Streamlit, this project enables businesses and data enthusiasts to:
- Identify distinct customer groups based on spending patterns and income
- Visualize clusters for actionable business insights
- Experiment with new data and retrain models easily
Business Value:
- Target marketing campaigns to specific customer segments
- Personalize offers and improve customer retention
- Discover high-value or at-risk customer groups
- Retail: Segment shoppers to tailor promotions and loyalty programs.
- Banking: Identify high-value clients for premium services.
- E-commerce: Personalize recommendations and offers.
- Hospitality: Group guests for targeted experiences.
- Telecom: Detect churn-prone customers and upsell opportunities.
- Education: Cluster students for personalized learning paths.
- Clone the repository:
git clone <repo-url>
cd KMeans-Clustering-Customer-Segmentation
- Install dependencies:
pip install -r requirements.txt
- Launch the app:
streamlit run app/main.py
- Open your browser: Visit http://localhost:8501
Feature | Description |
---|---|
Interactive Web UI | User-friendly Streamlit interface for input and results |
Real-time Prediction | Instantly predicts customer segment from input values |
Visualizations | Cluster plots, Elbow method, and more (add your screenshots!) |
Easy Retraining | Jupyter notebook for model retraining with new data/features |
Modular Codebase | Clean separation of UI, model, and logic for easy customization |
Deployment Ready | Simple to deploy on Streamlit Cloud, Heroku, or Docker |
Documentation | Extensive docs for dataset, clustering, and deployment |
KMeans-Clustering-Customer-Segmentation/
├── app/
│   ├── main.py              # Streamlit app entry point
│   ├── model.py             # Model loading and prediction logic
│   └── ui.py                # Streamlit UI components
├── dataset/
│   └── mall_customers.csv   # Customer data
├── model/
│   ├── model_training.ipynb # Jupyter notebook for training
│   └── model.pkl            # Trained KMeans model
├── docs/
│   ├── dataset.md
│   ├── kmeans-clustering.md
│   └── streamlit.md
├── requirements.txt
└── README.md
flowchart TD
    A["User Input (Streamlit UI)"] --> B["Model Loader (app/model.py)"]
    B --> C["Trained KMeans Model (model/model.pkl)"]
    A --> D["UI Logic (app/ui.py)"]
    B --> E[Prediction Output]
    D --> E
    E --> F["Visualization (matplotlib/seaborn)"]
    F --> G[Display Results in Streamlit]
    subgraph Data Science
        C
        F
    end
- File: `dataset/mall_customers.csv`
- Source: Kaggle Mall Customers Dataset
- Columns:
  - `CustomerID`: Unique identifier
  - `Gender`: Male/Female
  - `Age`: Customer age
  - `Annual Income (k$)`: Annual income in thousands of dollars
  - `Spending Score (1-100)`: Score assigned by the mall based on customer behavior and spending
Note: The default model uses only `Annual Income (k$)` and `Spending Score (1-100)` for clustering.
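The feature selection described in the note can be sketched roughly as follows. This is an illustration rather than the project's actual code: `select_features` is a hypothetical helper, and the three rows are made-up values standing in for `dataset/mall_customers.csv` so that the snippet runs on its own.

```python
import pandas as pd

# Column names as listed above for dataset/mall_customers.csv.
FEATURE_COLUMNS = ["Annual Income (k$)", "Spending Score (1-100)"]


def select_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the two columns the default model clusters on (hypothetical helper)."""
    missing = [col for col in FEATURE_COLUMNS if col not in df.columns]
    if missing:
        raise KeyError(f"Dataset is missing expected columns: {missing}")
    return df[FEATURE_COLUMNS].copy()


# Made-up rows for illustration; in the project you would instead call
# pd.read_csv("dataset/mall_customers.csv").
sample = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3],
        "Gender": ["Male", "Female", "Female"],
        "Age": [19, 35, 52],
        "Annual Income (k$)": [15, 60, 120],
        "Spending Score (1-100)": [39, 42, 79],
    }
)

features = select_features(sample)
print(features)
```

Failing early on missing columns is a design choice here: it surfaces a mismatched custom dataset before any training happens.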
K-Means is an unsupervised algorithm that partitions data into `k` clusters, grouping similar data points together. It is widely used for customer segmentation due to its simplicity and effectiveness.
- How it works:
  - Choose `k` cluster centers (centroids)
  - Assign each data point to the nearest centroid
  - Update centroids as the mean of assigned points
  - Repeat until assignments stabilize
- Why K-Means?
  - Fast and scalable
  - Intuitive results for business users
  - Well-suited for numerical features
For more, see `docs/kmeans-clustering.md`.
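To make the four steps concrete, here is a minimal NumPy sketch of the standard K-Means loop. It is not the implementation this project uses (the app relies on scikit-learn's `KMeans`); the function name `kmeans`, its parameters, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def kmeans(points, k, max_iter=100, seed=0):
    """Plain NumPy K-Means (Lloyd's algorithm); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.full(len(points), -1)

    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)

        # Step 4: stop once the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)

    return centroids, labels


# Two synthetic blobs (made-up data, not the mall dataset).
rng = np.random.default_rng(42)
low = rng.normal(loc=[20.0, 20.0], scale=2.0, size=(50, 2))
high = rng.normal(loc=[80.0, 80.0], scale=2.0, size=(50, 2))
points = np.vstack([low, high])

centroids, labels = kmeans(points, k=2)
print(np.round(centroids, 1))
```

The result of K-Means depends on the initial centroids, which is why scikit-learn's `KMeans` can run several initializations (`n_init`) and keep the best one.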
- Data Preparation:
  - Load and explore the dataset
  - Select relevant features (default: income & spending score)
- Model Training:
  - Use the Elbow Method to find optimal `k`
  - Train KMeans on selected features
  - Save the trained model as `model/model.pkl`
- Web Application:
  - User enters income and spending score
  - App loads the trained model and predicts the segment
  - Results and (optionally) cluster visualizations are displayed
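The Model Training steps above might look roughly like this in code. It is a sketch under stated assumptions, not the contents of the training notebook: the data is synthetic, `k = 5` is simply what fits this toy data, and saving with `pickle` is assumed from the `.pkl` extension.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the two default features (income, spending score):
# five made-up blobs, NOT the real mall_customers.csv values.
rng = np.random.default_rng(0)
centers = np.array([[25, 20], [25, 80], [55, 50], [90, 15], [90, 85]])
X = np.vstack([rng.normal(loc=c, scale=4.0, size=(40, 2)) for c in centers])

# Elbow Method: fit KMeans for a range of k and record the inertia
# (within-cluster sum of squared distances) for each value.
inertias = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k:2d}  inertia={inertia:10.1f}")

# Train the final model with the k read off the elbow (5 for this toy data).
final_model = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

# Save the model; a temporary directory stands in for model/model.pkl here.
model_path = Path(tempfile.mkdtemp()) / "model.pkl"
with model_path.open("wb") as f:
    pickle.dump(final_model, f)
print(f"Saved model to {model_path}")
```

The "elbow" is the value of `k` after which the printed inertia stops dropping sharply; plotting `inertias` with matplotlib makes it easier to spot.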
- `app/main.py`: Streamlit entry point; initializes app, loads model, and handles routing
- `app/model.py`: Handles model loading and prediction logic
- `app/ui.py`: Contains Streamlit UI components for input and output
- `model/model_training.ipynb`: Jupyter notebook for data exploration, training, and saving the model
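As an illustration of the model loading and prediction logic attributed to `app/model.py`, here is a minimal sketch. The function names `load_model` and `predict_segment` are hypothetical, pickle-based loading is an assumption, and the demo uses a throwaway model fitted on made-up points instead of the real `model/model.pkl`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans


def load_model(path):
    """Load a pickled KMeans model from disk (hypothetical helper)."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def predict_segment(model, annual_income, spending_score):
    """Return the cluster index for one customer as a plain int."""
    # The feature order must match the order used during training.
    features = np.array([[annual_income, spending_score]], dtype=float)
    return int(model.predict(features)[0])


# Demo: fit a throwaway model on made-up points and save it to a temp file.
# The real app would call load_model("model/model.pkl") instead.
toy_points = np.array([[15, 20], [18, 25], [85, 80], [90, 90]], dtype=float)
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_points)
demo_path = Path(tempfile.mkdtemp()) / "model.pkl"
with demo_path.open("wb") as f:
    pickle.dump(toy_model, f)

model = load_model(demo_path)
segment = predict_segment(model, annual_income=88, spending_score=85)
print(f"Predicted Segment: {segment}")
```

Only unpickle model files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.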
- Add More Features:
  - Edit `model/model_training.ipynb` to include more columns (e.g., Age, Gender)
  - Update the app UI in `app/ui.py` to accept new inputs
- Use Your Own Data:
  - Replace `dataset/mall_customers.csv` with your dataset (same or similar format)
  - Retrain the model using the notebook
- Change Number of Clusters:
  - Adjust `k` in the notebook and retrain
- Deploy Anywhere:
  - See `docs/streamlit.md` for deployment guides (Streamlit Cloud, Docker, etc.)
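If you add Age and Gender as features, the columns no longer share a scale and Gender is not numeric, so some preprocessing is usually needed before clustering. One possible approach, sketched below, wraps scaling, one-hot encoding, and KMeans in a scikit-learn `Pipeline`. The rows are made up, `n_clusters=2` is chosen only to suit this tiny sample, and nothing here is taken from the project's notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
CATEGORICAL = ["Gender"]

# Made-up rows in the same format as mall_customers.csv.
customers = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
        "Age": [19, 21, 23, 45, 50, 60],
        "Annual Income (k$)": [15, 16, 17, 80, 85, 90],
        "Spending Score (1-100)": [39, 81, 6, 20, 75, 40],
    }
)

# Scale numeric columns so no single feature dominates the distance,
# and one-hot encode Gender so it becomes numeric.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ]
)

pipeline = Pipeline(
    [
        ("preprocess", preprocess),
        ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=42)),
    ]
)

labels = pipeline.fit_predict(customers)
print(labels)

# New inputs go through the same preprocessing before prediction.
new_customer = pd.DataFrame(
    [{"Gender": "Female", "Age": 30, "Annual Income (k$)": 60, "Spending Score (1-100)": 42}]
)
new_label = int(pipeline.predict(new_customer)[0])
print(new_label)
```

Saving the whole pipeline rather than only the KMeans step keeps preprocessing and model together, so the app applies identical transformations at prediction time.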
Sample Input:
- Annual Income (k$): 60
- Spending Score (1-100): 42
Sample Output:
Predicted Segment: 3
This customer belongs to the "Average Income, Average Spending" group.
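K-Means cluster indices are arbitrary and can change between training runs, so a descriptive name such as the one in the sample output is more robust when derived from the cluster's centroid than from a hard-coded index. The sketch below shows one way to do that. The helper names and the Low/Average/High cut-offs are hypothetical and would need tuning to your data.

```python
def level(value, low, high):
    """Bucket a value into Low / Average / High using two cut-offs."""
    if value < low:
        return "Low"
    if value > high:
        return "High"
    return "Average"


def describe_centroid(income, score, income_cuts=(40, 70), score_cuts=(40, 60)):
    """Build a readable segment name from a centroid (cut-offs are assumptions)."""
    income_level = level(income, *income_cuts)
    score_level = level(score, *score_cuts)
    return f"{income_level} Income, {score_level} Spending"


# With a fitted scikit-learn model you would read the centroid like this:
#   income, score = model.cluster_centers_[segment]
# Here a made-up centroid is used so the snippet runs on its own.
name = describe_centroid(55.0, 50.0)
print(f'This customer belongs to the "{name}" group.')
```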
- Python: 3.7 or higher
- Install all dependencies:
pip install -r requirements.txt
- requirements.txt includes:
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- streamlit
- jupyter, ipykernel (optional, for notebook)
- Run the Streamlit app:
streamlit run app/main.py
- Open your browser: Go to http://localhost:8501
- Interact:
- Enter "Annual Income (k$)" and "Spending Score (1-100)"
- Click "Predict" to see the customer segment
- Retrain the Model:
  - Open `model/model_training.ipynb` in Jupyter
  - Modify code or data as needed
  - Run all cells to retrain and save a new model
  - Restart the app to use the updated model
- Deploy Online:
  - See `docs/streamlit.md` for deployment instructions
Problem | Solution |
---|---|
ModuleNotFoundError | Run pip install -r requirements.txt |
Streamlit not launching | Check Python version and Streamlit installation |
Model file not found | Retrain model using the notebook |
Port 8501 already in use | Use streamlit run app/main.py --server.port <other_port> |
UI not updating after retrain | Restart Streamlit app |
- Always explore your data before training
- Use the Elbow Method to select the best `k`
- Document any changes to the dataset or features
- Test the app after retraining the model
- Use virtual environments for dependency management
- Add screenshots to the README for better engagement
- No personal data is stored by the app; all predictions are in-memory
- If using real customer data, ensure compliance with GDPR or local privacy laws
- Do not upload sensitive data to public repositories
- `docs/dataset.md`: Dataset details and schema
- `docs/kmeans-clustering.md`: K-Means theory and implementation
- `docs/streamlit.md`: Streamlit and deployment guides
Contributions are welcome! To contribute:
- Fork the repository
- Create a new branch (`git checkout -b feature/your-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to your branch (`git push origin feature/your-feature`)
- Open a Pull Request
Best Practices:
- Write clear, concise commit messages
- Add docstrings and comments
- Test your code before submitting
Q: Can I use a different dataset?
A: Yes! Replace `dataset/mall_customers.csv` and retrain the model.
Q: How do I add more features?
A: Update feature selection in the notebook and app UI.
Q: The app doesn't start or throws an error. What should I do?
A: Ensure all dependencies are installed and Python version is compatible. Check error messages for details.
Q: How do I deploy this app online?
A: See `docs/streamlit.md` for deployment instructions.
- Open an issue for bugs or feature requests
- Email: ptnhanit230104@gmail.com
- Discussions (ask questions, share ideas)
- Contributors
- Suggest a Slack/Discord channel for real-time help!
- v1.0: Initial release with Streamlit app, model training notebook, and documentation
- v1.1: Improved modularity, added advanced usage and deployment docs
- v1.2: Enhanced README, added FAQ and troubleshooting
- Add more clustering algorithms (DBSCAN, Hierarchical)
- Add user authentication for private deployments
- Enable export of cluster assignments
- Add more visualizations (3D plots, interactive charts)
- Docker Compose for multi-service deployment
- Add REST API for programmatic access
- Internationalization (i18n) support
- K-Means: Unsupervised clustering algorithm
- Cluster: Group of similar data points
- Centroid: Center of a cluster
- Elbow Method: Technique to find optimal number of clusters
- Streamlit: Python library for building web apps
- scikit-learn: Python ML library
This project is licensed under the MIT License. See the LICENSE file for details.
If you use this project in your research, please cite as:
@misc{KMeansClusteringCustomerSegmentation,
author = {Nhan Pham Thanh},
title = {K-Means Clustering Customer Segmentation},
year = {2024},
howpublished = {\url{https://github.com/NhanPhamThanh-IT/KMeans-Clustering-Customer-Segmentation}}
}
For more information, see the documentation in the `docs/` folder.
Add your screenshots to the `docs/` folder and reference them above for a more visual README!