This school project is an introduction to machine learning. The goal is to predict the price of a car based on its mileage. The project is divided into two parts: a training part and a prediction part.
Linear regression is the one of the founding principles of machine learning.
Machine learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to "learn" (i.e., progressively improve performance on a specific task) from data, without being explicitly programmed.
This project aims at producing a model that can predict the price of a car based on its mileage.
The project is made of two parts: a training part and a prediction part.
The training part is based on the following hypothesis:
Where
This hypothesis is a cost function that measures the error of the model. The goal of the training part is to find the values of the parameters
The values of the parameters theta.json
that will be used in the prediction part.
The prediction part is based on the following hypothesis:
This hypothesis is based on the assumption that the price of a car is a linear function of its mileage. The goal of the prediction program is to predict the price of the car based on its mileage.
This algorithm is detailed in code inside the jupyter notebook inside this repo.
The dataset used to train the model is located in the data.csv
file. This file contains two columns: mileage
and price
.
To train the model, you need to run the following command:
python train.py
This will generate a file called theta.csv
that contains the values of the parameters of the model.
I chose to implement Robust Scaling as a normalization method. This method is robust to outliers1 and is based on the following formula:
Where
To predict the price of a car based on its mileage, you need to run the following command:
python predict.py
You will be prompted to enter the mileage of the car you want to predict the price.
To install the dependencies, you need to run the following command:
pip install -r requirements.txt
- Gradient Descent
- Coefficient of Determination
- Theta
- Feature Scaling
- Normlization(statistics)
- Programmatic Fitting
Footnotes
-
outliers are values that are significantly different from the rest of the data and can distort the model. Robust scaling is a normalization method that is robust to outliers. My main goal in implementing this normalization method was to prevent this issue from happening. ↩