This GitHub repository contains the code for implementing decision trees from scratch and then using them for ensemble learning in Python.
Decision trees are a popular machine learning technique for classification and regression problems. The decision tree algorithm here is written in plain Python rather than taken from a library, and the resulting trees are then used for ensemble learning to improve model accuracy.
To use the code in this repository, you need to have Python installed on your system. You can download and install Python from the official website: https://www.python.org/downloads/
Once you have Python installed, you can clone this repository using the following command: git clone https://github.com/ArindamRoy23/DSBA_T2_Ensemble_Learning_DT_Scratch.git
After cloning the repository, you can run the code by navigating to the directory where the code is stored and running the following command: python test.py
Classification Tree from Scratch: Build and train a decision tree for classification tasks.
Classification Tree: binary classification tree object
Node: node object
- max_depth: max depth of the tree (stop criterion)
- min_samples_leaf: minimum number of samples per leaf (stop criterion)
- min_samples_split: minimum number of samples required to split an internal node (also a stop criterion)
- n_classes_: number of classes in training set
- n_features_: number of features in training set and test set
- n_samples_: number of samples
- criterion: criterion to split a node. Chosen from "gini", "crossentropy" and "misclassification_error"
- tree_: tree class
- init(max_depth, min_samples_leaf, min_samples_split, criterion = "gini"): constructor, builds a Classification Tree object
- fit(X,y): fit the training set data
- predict(X): predict classes for test data using trained parameters
- predict_probability(X): predict probabilities for test data
- acc_score: print the accuracy score of the tree model
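The three values accepted by criterion are standard node impurity measures. The following is a minimal sketch of how they are usually defined, not the repository's own code: the function names and the CRITERIA lookup are hypothetical, and the base-2 logarithm in the cross-entropy is an assumption (the source does not state the base).

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def cross_entropy(labels):
    """Cross-entropy in bits: -sum(p_k * log2(p_k))."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def misclassification_error(labels):
    """Misclassification error: 1 - max(p_k)."""
    return 1.0 - max(class_proportions(labels))


# Hypothetical lookup keyed by the criterion names listed above.
CRITERIA = {
    "gini": gini,
    "crossentropy": cross_entropy,
    "misclassification_error": misclassification_error,
}

# A node holding three samples of class 0 and one sample of class 1.
node_labels = [0, 0, 0, 1]
for name, impurity in CRITERIA.items():
    print(name, round(impurity(node_labels), 4))
```

All three measures are zero for a pure node and grow as the classes in the node become more mixed, which is why any of them can drive the choice of split.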
This is an implementation of the Decision Tree Regressor algorithm in Python. The Decision Tree Regressor is a machine learning algorithm that creates a decision tree model for regression problems.
The class DecisionTreeRegressor contains several methods to build and train the decision tree. The constructor method init sets the maximum depth of the tree and the minimum number of samples required to split. The fit method fits the data to the decision tree, and the tree_build method builds the decision tree recursively.
The best_split method finds the best feature and threshold for splitting the data based on the mean squared error cost function. The meansqerror method calculates the mean squared error of the data.
Note that this implementation does not handle categorical data. Support for categorical features is left as a future improvement.
Regression Tree: binary regression tree object
Node: node object
- max_depth: an integer that represents the maximum depth of the decision tree. If None, the tree will be grown until all leaves are pure or until each leaf has fewer samples than min_samples_split.
- min_samples_split: an integer that represents the minimum number of samples required to split an internal node. If a node has fewer samples than min_samples_split, it will not be split, and the algorithm will terminate for that node.
- root: a reference to the root node of the decision tree. This attribute is set to None when an instance of DecisionTreeRegressor is initialized, and it is set to the root node of the trained decision tree when the fit method is called.
- init(self, max_depth=None, min_samples_split=2): The constructor method that initializes the hyperparameters of the decision tree regressor.
- fit(self, X, y): This method fits the decision tree to the input data X and the target values y.
- tree_build(self, X, y, depth=0): This is a recursive method that builds the decision tree. It first checks whether a stopping criterion has been met (i.e., the maximum depth has been reached or the node has fewer samples than min_samples_split). If not, it finds the best feature and threshold to split on using the best_split method, and splits the data into left and right subsets. It then recursively builds the left and right subtrees by calling itself with the left and right subsets. Finally, it returns a Node object that stores the feature, threshold, and left and right child nodes.
- best_split(self, X, y): This method finds the best feature and threshold for splitting the data X and labels y based on the mean squared error (MSE) cost function. It loops over all features and thresholds in X to find the best split that minimizes the sum of the MSE of the left and right subsets. If no good split is found, best_feature and best_threshold are set to None.
- meansqerror(self, y): This method calculates the mean squared error (MSE) of the input target labels y.
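The methods above can be sketched as standalone functions. This is an illustrative reading of the description, not the repository's code: it uses plain lists instead of arrays, the value field on Node and the predict_one helper are hypothetical, and storing the mean target in each leaf is an assumption. The split cost is the unweighted sum of the left and right MSE, as described above; many implementations weight each side by its sample count instead.

```python
class Node:
    """Internal nodes store a split; leaves store a prediction in `value`."""

    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value  # assumption: leaf prediction is the mean target


def mean(values):
    return sum(values) / len(values)


def meansqerror(y):
    """MSE of y around its own mean."""
    mu = mean(y)
    return sum((v - mu) ** 2 for v in y) / len(y)


def best_split(X, y):
    """Try every feature and every observed value as a threshold."""
    best_feature, best_threshold, best_cost = None, None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left = [t for row, t in zip(X, y) if row[feature] <= threshold]
            right = [t for row, t in zip(X, y) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must leave samples on both sides
            cost = meansqerror(left) + meansqerror(right)
            if cost < best_cost:
                best_feature, best_threshold, best_cost = feature, threshold, cost
    return best_feature, best_threshold


def tree_build(X, y, depth=0, max_depth=None, min_samples_split=2):
    """Recursively grow the tree until a stopping criterion is met."""
    depth_reached = max_depth is not None and depth >= max_depth
    if depth_reached or len(y) < min_samples_split:
        return Node(value=mean(y))
    feature, threshold = best_split(X, y)
    if feature is None:  # no valid split, e.g. all rows are identical
        return Node(value=mean(y))
    left_idx = [i for i, row in enumerate(X) if row[feature] <= threshold]
    right_idx = [i for i, row in enumerate(X) if row[feature] > threshold]
    left = tree_build([X[i] for i in left_idx], [y[i] for i in left_idx],
                      depth + 1, max_depth, min_samples_split)
    right = tree_build([X[i] for i in right_idx], [y[i] for i in right_idx],
                       depth + 1, max_depth, min_samples_split)
    return Node(feature, threshold, left, right)


def predict_one(node, row):
    """Walk from the root to a leaf and return the leaf's stored value."""
    while node.value is None:
        node = node.left if row[node.feature] <= node.threshold else node.right
    return node.value


X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = [1.0, 1.5, 0.5, 5.0, 5.5, 4.5]
root = tree_build(X, y, max_depth=1)
print(root.feature, root.threshold, predict_one(root, [2.5]), predict_one(root, [11.5]))
```

With max_depth=1 the toy data is split once between the two clusters, and each leaf predicts the mean target of its side.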
The data set used in this project showcases the Airbnb listings and their features in New York City in 2019.
It includes a comprehensive collection of information pertaining to hosts, geographical locations, room features, and pricing.
The primary objective of this task is to employ ensemble learning techniques to predict listing prices.
In this report, we first perform a descriptive analysis of the data set, then prepare the data frame, and finally explore various ensemble learning methods and their application to this regression problem.
- Filling missing values
- Converting date type
- Adding new features derived from the skewed features
- One-hot encoding
- Normalization
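The steps above can be sketched on a few toy rows. This is one plausible reading of each step, not the project's pipeline: the column names and values are assumptions, the fill values are arbitrary choices, a log transform stands in for the new features on skewed columns, and min-max scaling stands in for normalization.

```python
import math
from datetime import date, datetime
from statistics import median

# Toy rows; the column names are assumptions, not taken from the project.
rows = [
    {"room_type": "Private room", "last_review": "2019-05-21",
     "reviews_per_month": 0.5, "minimum_nights": 1},
    {"room_type": "Entire home/apt", "last_review": None,
     "reviews_per_month": None, "minimum_nights": 30},
    {"room_type": "Private room", "last_review": "2018-11-02",
     "reviews_per_month": 1.5, "minimum_nights": 3},
]


def fill_missing(rows, column, value):
    """Replace None in `column` with `value`."""
    for row in rows:
        if row[column] is None:
            row[column] = value


def add_days_since(rows, column, reference, new_column):
    """Turn date strings into a day gap; missing dates get the median gap."""
    gaps = [
        (reference - datetime.strptime(row[column], "%Y-%m-%d").date()).days
        if row[column] is not None else None
        for row in rows
    ]
    fallback = median(g for g in gaps if g is not None)
    for row, gap in zip(rows, gaps):
        row[new_column] = gap if gap is not None else fallback
        del row[column]


def add_log_feature(rows, column):
    """Add log(1 + x) of a skewed, non-negative column as a new feature."""
    for row in rows:
        row[f"log_{column}"] = math.log1p(row[column])


def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    for row in rows:
        value = row.pop(column)
        for category in categories:
            row[f"{column}_{category}"] = 1 if value == category else 0


def min_max_normalize(rows, column):
    """Rescale a numeric column to the range [0, 1]."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero on constant columns
    for row in rows:
        row[column] = (row[column] - low) / span


fill_missing(rows, "reviews_per_month", 0.0)
add_days_since(rows, "last_review", date(2019, 12, 31), "days_since_review")
add_log_feature(rows, "minimum_nights")
one_hot_encode(rows, "room_type")
for column in ("reviews_per_month", "days_since_review",
               "minimum_nights", "log_minimum_nights"):
    min_max_normalize(rows, column)

print(rows[1])
```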
The models used in this project are Decision Tree, Random Forest, AdaBoost, Gradient Boosting, and XGBoost.
The hyperparameters are tuned using 3-fold cross-validation. The main evaluation metrics are MAE (Mean Absolute Error), RMSE (Root Mean Squared Error), and the R2 score.
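For reference, the three metrics and a 3-fold split can be written in a few lines. This is a minimal sketch under stated assumptions, not the project's evaluation code: the function names are hypothetical, the prices are made up for illustration, and the folds are formed without shuffling.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """R2 score: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples, k=3):
    """Yield (train_indices, validation_indices) pairs for k-fold CV."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # no shuffling in this sketch
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Made-up prices and predictions, for illustration only.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), round(r2(y_true, y_pred), 3))

for train, validation in k_fold_indices(7, k=3):
    print(train, validation)
```

During tuning, each candidate hyperparameter setting would be fitted on the train indices and scored on the validation indices of every fold, and the scores averaged across the three folds.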
Arindam Roy, Chara Vega Brown, Jiayi Wu, Taolue CHEN, Marouan Jouaidi