Decision Tree Classifier | Jupyter Notebook | Supervised Learning | Machine Learning | Python | Scikit-learn
This project explores the Decision Tree Classifier, implementing it both from scratch (using Python and NumPy) and via Scikit-learn for comparison. The goal is to demonstrate how decision trees work under the hood while benchmarking performance against a well-optimized library.
✔ Hands-on Implementation: Builds a decision tree from scratch using the Gini Index as the splitting criterion.
✔ Scikit-learn Comparison: Evaluates the custom implementation against Scikit-learn’s DecisionTreeClassifier for validation.
✔ Diverse Datasets: Tested on the Iris, Wine, and Car Evaluation datasets to assess performance across different problem types.
✔ Step-by-Step Walkthrough: The notebook breaks down training, prediction, and evaluation, making it easy to follow along.
By the end, readers will understand:

- How decision trees split nodes and make predictions.
- Why the Gini Index is used for measuring impurity (illustrated in the sketch below).
- The trade-offs between custom vs. library-based implementations.
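As a quick illustration of the Gini Index (a toy example, not code taken from the notebook): a node containing a single class has impurity 0, while a 50/50 two-class node has impurity 1 − (0.5² + 0.5²) = 0.5, the worst case for two classes.

```python
import numpy as np

def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_index(["setosa"] * 50))                        # 0.0, a pure node
print(gini_index(["setosa"] * 25 + ["versicolor"] * 25))  # 0.5, maximally mixed
```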
This serves as a practical guide for beginners learning ML fundamentals and intermediate practitioners refining their understanding of tree-based models.
- File Structure 📂
- Requirements 📦
- Installation Guide 🛠
- Dataset Information 📊
- Decision Tree Algorithm 🧠
- Key Findings 📈
- Contributing 🚀
- Contact 📬
The repo contains the following file structure:
📦 Decision Tree repo
│-- 📜 DecisionTree.ipynb # Jupyter Notebook with implementation
│-- 📜 requirements.txt # List of dependencies
│-- 📜 iris.tmls # Iris Flower dataset
│-- 📜 wine.tmls # Wine dataset
│-- 📜 car.tmls # Car Evaluation dataset
│-- 📜 README.md # Project documentation
- Python Version: 3.10 or higher
- External Dependencies: Managed through requirements.txt
  - Jupyter Notebook (to run the notebook)
  - NumPy
  - Pandas
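For reference, a minimal requirements.txt consistent with the list above might look like this (version pins are illustrative, not copied from the repo's actual file):

```
notebook>=7.0
numpy>=1.24
pandas>=2.0
scikit-learn>=1.3   # used for the DecisionTreeClassifier comparison
```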
Follow the steps below to set up and run the project:
git clone https://github.com/adexoxo13/Naive-Bayes.git
cd Naive-Bayes
conda create --name myenv python=3.10
# When conda asks you to proceed, type y:
proceed ([y]/n)?
# Verify that the new environment was installed correctly:
conda env list
# Activate the new environment:
conda activate myenv
pip install -r requirements.txt
jupyter notebook
Open DecisionTree.ipynb in Jupyter and run the cells to see the model in action.
- The Iris Dataset 🌸 consists of 150 samples, with the following attributes:

| Feature      | Description                  |
|--------------|------------------------------|
| Sepal Length | Length of the sepal (cm)     |
| Sepal Width  | Width of the sepal (cm)      |
| Petal Length | Length of the petal (cm)     |
| Petal Width  | Width of the petal (cm)      |
| Species      | Type of iris flower (target) |
- The Wine Dataset consists of 178 samples, with the following attributes:

| Feature                      | Description                                            |
|------------------------------|--------------------------------------------------------|
| Alcohol                      | Alcohol in the wine (percentage)                       |
| Malic Acid                   | Measure of acidity in the wine                         |
| Ash                          | Measure of ash content in the wine                     |
| Alcalinity of Ash            | Measure of alkalinity in the wine                      |
| Magnesium                    | Measure of magnesium content in the wine               |
| Total Phenols                | Measure of total phenolic compounds in the wine        |
| Flavanoids                   | Measure of flavonoid compounds in the wine             |
| Nonflavanoid Phenols         | Measure of nonflavanoid phenolic compounds in the wine |
| Proanthocyanins              | Measure of proanthocyanin compounds in the wine        |
| Color Intensity              | Measure of color depth and richness in the wine        |
| Hue                          | Measure of color tone and variation in the wine        |
| OD280/OD315 of Diluted Wines | Measure of absorbance ratios in the wine               |
| Proline                      | Measure of amino acid content in the wine              |
| Class                        | Type of wine (target)                                  |
- The Car Dataset consists of 1728 samples, with the following attributes:

| Feature             | Description                                                  |
|---------------------|--------------------------------------------------------------|
| Buying              | Buying price (e.g., low, medium, high, ...)                  |
| Maint (Maintenance) | Price of the maintenance (e.g., low, medium, high)           |
| Doors               | Number of doors in the car (e.g., two, four)                 |
| Persons             | Number of persons that can be seated in the car              |
| Lug Boot            | The size of the luggage boot (e.g., small, medium, ...)      |
| Safety              | Estimated safety of the car (e.g., low, medium, high)        |
| Class               | Evaluation level (unacceptable, acceptable, good, very good) |
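Note that all of the car features are categorical, which is why the learned car tree in the findings below tests membership conditions like X[3] in [0, 1, 2] rather than numeric thresholds. A sketch of one way to integer-encode them with pandas (the file path and column names are assumptions, not the notebook's actual code):

```python
import pandas as pd

# Hypothetical path and column names; adjust to match the repo's car.tmls file.
cols = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
df = pd.read_csv("car.tmls", names=cols)

# Map each categorical column to integer codes (0, 1, 2, ...).
encoded = df.apply(lambda col: col.astype("category").cat.codes)

X = encoded.drop(columns="class").to_numpy()
y = encoded["class"].to_numpy()
print(X[:3], y[:3])
```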
The Jupyter notebook contains two implementations of a Decision Tree classifier:

1. From Scratch
   - Built using a recursive algorithm (a condensed sketch follows this list)
   - Splits nodes based on Gini impurity for optimal decision boundaries

2. Scikit-learn Version
   - Uses the DecisionTreeClassifier from sklearn
   - Also employs Gini impurity for splits, ensuring consistency with the manual implementation
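The notebook's own code is more complete, but a condensed sketch of the from-scratch approach might look like the following (the function names, depth cap, and dict-based node type are illustrative assumptions, not the notebook's exact code):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array (as sketched earlier)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return the (feature, threshold) pair minimizing weighted child impurity."""
    best_f, best_t, best_score = None, None, np.inf
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            mask = X[:, f] <= t
            if mask.all() or not mask.any():
                continue  # split puts every sample on one side; skip it
            score = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
            if score < best_score:
                best_f, best_t, best_score = f, t, score
    return best_f, best_t

def build_tree(X, y, depth=0, max_depth=5):
    """Recursively grow the tree; leaves hold the majority class label."""
    if depth == max_depth or gini(y) == 0.0:
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]  # leaf: majority label
    f, t = best_split(X, y)
    if f is None:  # no informative split found
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]
    mask = X[:, f] <= t
    return {"feature": f, "threshold": t,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth)}

def predict_one(node, x):
    """Walk from the root until a leaf (a non-dict value) is reached."""
    while isinstance(node, dict):
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node
```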
From-scratch implementation on the Iris dataset (test accuracy per run):
Run 0 : 1.0
Run 1 : 0.9333333333333333
Run 2 : 0.9333333333333333
Run 3 : 0.8333333333333334
Run 4 : 1.0
The Scikit-learn version consistently achieved 100% accuracy across all runs:
Accuracy: 1.0
Accuracy: 1.0
Accuracy: 1.0
Accuracy: 1.0
Accuracy: 1.0
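Results in this style can be reproduced with a few lines of scikit-learn. The snippet below is a sketch, not the notebook's exact evaluation loop (it loads Iris from sklearn.datasets rather than the repo's iris.tmls, and the split size and seeds are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
for run in range(5):
    # A fresh random train/test split per run, mirroring the "Run 0..4" output.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=run)
    clf = DecisionTreeClassifier(criterion="gini").fit(X_tr, y_tr)
    print("Accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```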
Example of a learned tree (Iris):
X[2] <= 3.727 ?  # Gini impurity: 0.271
  left: X[2] <= 1.726 ?  # Gini: 0.162
    left: Iris-setosa
    right: X[2] <= 3.0 ?  # Gini: 0.375
      left: Iris-setosa
      right: Iris-versicolor
  right: X[3] <= 1.705 ?  # Gini: 0.355
    left: X[3] <= 1.4 ?  # Gini: 0.021
      left: Iris-versicolor
      right: X[2] <= 4.738 ?  # Gini: 0.067
        left: Iris-versicolor
        right: Iris-versicolor
    right: X[1] <= 3.012 ?  # Gini: 0.002
      left: Iris-virginica
      right: X[3] <= 2.193 ?  # Gini: 0.014
        left: Iris-virginica
        right: Iris-virginica
From-scratch implementation on the Wine dataset (test accuracy per run):
Run 0 : 0.8055555555555556
Run 1 : 0.8888888888888888
Run 2 : 0.9444444444444444
Run 3 : 0.8333333333333334
Run 4 : 0.8055555555555556
The Scikit-learn version consistently achieved ~94–96% accuracy:
Accuracy: 0.9629629629629629
Accuracy: 0.9444444444444444
Accuracy: 0.9629629629629629
Accuracy: 0.9444444444444444
Accuracy: 0.9444444444444444
Example of a learned tree (Wine):
X[12] <= 762.0 ?  # Gini: 0.265
  left: X[9] <= 4.850 ?  # Gini: 0.374
    left: X[1] <= 1.939 ?  # Gini: 0.0017
      left: Class 2
      right: X[1] <= 3.180 ?  # Gini: 0.013
        left: Class 2
        right: Class 2
    right: X[6] <= 1.099 ?  # Gini: 0.046
      left: Class 3
      right: X[10] <= 0.759 ?  # Gini: 0.258
        left: Class 3
        right: Class 2
  right: X[1] <= 2.054 ?  # Gini: 0.045
    left: X[0] <= 13.664 ?  # Gini: 0.0135
      left: X[0] <= 13.152 ?  # Gini: 0.087
        left: Class 1
        right: Class 1
      right: Class 1
    right: X[5] <= 2.302 ?  # Gini: 0.473
      left: Class 3
      right: Class 1
From-scratch implementation on the Car Evaluation dataset (test accuracy per run):
Run 0 : 0.7745664739884393
Run 1 : 0.7514450867052023
Run 2 : 0.8121387283236994
Run 3 : 0.7890173410404624
Run 4 : 0.7543352601156069
The Scikit-learn version consistently achieved ~95–96% accuracy:
Accuracy: 0.9595375722543352
Accuracy: 0.9556840077071291
Accuracy: 0.9556840077071291
Accuracy: 0.9614643545279383
Accuracy: 0.9556840077071291
Example of a learned tree (Car Evaluation):
X[3] in [0, 1, 2] ?  # Gini: 0.072
  left: "unacc"
  right: X[5] in [0, 1, 2] ?  # Gini: 0.074
    left: X[0] in [0, 1, 2, 3] ?  # Gini: 0.049
      left: X[0] in [0, 1, 2] ?  # Gini: 0.064
        left: "acc"
        right: "acc"
      right: X[1] in [0, 1, 2, 3] ?  # Gini: 0.150
        left: "acc"
        right: "unacc"
    right: X[5] in [1, 2] ?  # Gini: 0.151
      left: "unacc"
      right: X[4] in [0, 1, 2] ?  # Gini: 0.057
        left: "acc"
        right: "unacc"
Contributions are welcome! Feel free to fork the repository and submit a pull request.
The datasets used in this project are from the UCI Machine Learning Repository:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
Feel free to reach out or connect with me:
- 📧 Email: adenabrehama@gmail.com
- 💼 LinkedIn: linkedin.com/in/aden
- 🎨 CodePen: codepen.io/adexoxo
📌 Star this repository if you found it useful! ⭐