This project applies machine learning techniques to analyze road accidents in the US and predict their severity. It includes data preprocessing, natural language processing (NLP) on accident descriptions, and training a machine learning model to classify accident severity.
- `ML_Engineer_Task.ipynb`: Preprocesses raw accident data by handling missing values, feature engineering, and scaling. Outputs a processed dataset in `.csv` or `.pkl` format.
- `ML_Engineer_Task_Colab.ipynb`: Loads the preprocessed data and applies NLP techniques to accident descriptions. Trains a Random Forest Classifier with SMOTE and outputs a trained model (`Random_Forest_Classifier_SMOTE.pkl`).
The dataset consists of the following attributes. The Nullable column indicates whether a field may contain missing values.
Attribute | Description | Nullable |
---|---|---|
ID | Unique identifier for each accident record. | No |
Severity | Severity level of the accident (1 to 4), where 1 indicates minor impact and 4 indicates significant impact. | No |
Start_Time | Start time of the accident in the local timezone. | No |
End_Time | End time of the accident's impact on traffic flow. | No |
Start_Lat | Latitude of the starting point of the accident. | No |
Start_Lng | Longitude of the starting point of the accident. | No |
End_Lat | Latitude of the ending point of the accident. | Yes |
End_Lng | Longitude of the ending point of the accident. | Yes |
Distance(mi) | Length of the road segment affected by the accident. | No |
Description | Natural language description of the accident. | No |
Number | Street number in the address field. | Yes |
Street | Street name in the address field. | Yes |
Side | Side of the street (Right/Left). | Yes |
City | City name. | Yes |
County | County name. | Yes |
State | State name. | Yes |
Zipcode | Zip code of the location. | Yes |
Country | Country name (e.g., US). | Yes |
Timezone | Timezone of the accident's location (e.g., Eastern, Central). | Yes |
Airport_Code | Nearest airport-based weather station to the accident location. | Yes |
Weather_Timestamp | Timestamp of the weather observation for the accident. | Yes |
Temperature(F) | Temperature at the time of the accident (in Fahrenheit). | Yes |
Wind_Chill(F) | Wind chill at the time of the accident (in Fahrenheit). | Yes |
Humidity(%) | Humidity percentage at the time of the accident. | Yes |
Pressure(in) | Atmospheric pressure at the time of the accident (in inches). | Yes |
Visibility(mi) | Visibility at the time of the accident (in miles). | Yes |
Wind_Direction | Wind direction at the time of the accident. | Yes |
Wind_Speed(mph) | Wind speed at the time of the accident (in miles per hour). | Yes |
Precipitation(in) | Precipitation amount at the time of the accident (in inches). | Yes |
Weather_Condition | Weather conditions (e.g., Rain, Snow, Fog). | Yes |
Amenity | Indicates the presence of nearby amenities. | No |
Bump | Indicates the presence of nearby speed bumps. | No |
Crossing | Indicates the presence of a nearby crossing. | No |
Give_Way | Indicates the presence of a nearby "Give Way" sign. | No |
Junction | Indicates the presence of a nearby junction. | No |
No_Exit | Indicates the presence of a nearby "No Exit" sign. | No |
Railway | Indicates the presence of nearby railways. | No |
Roundabout | Indicates the presence of a nearby roundabout. | No |
Station | Indicates the presence of a nearby station. | No |
Stop | Indicates the presence of a nearby stop sign. | No |
Traffic_Calming | Indicates the presence of nearby traffic-calming measures. | No |
Traffic_Signal | Indicates the presence of nearby traffic signals. | No |
Turning_Loop | Indicates the presence of a nearby turning loop. | No |
Sunrise_Sunset | Period of the day (Day/Night) based on sunrise and sunset. | Yes |
Civil_Twilight | Period of the day based on civil twilight. | Yes |
Nautical_Twilight | Period of the day based on nautical twilight. | Yes |
Astronomical_Twilight | Period of the day based on astronomical twilight. | Yes |
- Handles missing values using group-based imputations.
- Groups `Weather_Condition` categories into broader groups (e.g., `Rain`, `Snow`, `Fog`).
- Encodes categorical variables and scales numerical features.
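
The preprocessing steps above can be sketched on a toy frame. This is a minimal illustration, not the notebook's actual code: the grouping key (`State`), the median fill, the `WEATHER_GROUPS` keyword mapping, and the `Weather_Group` column name are all assumptions chosen for the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data: column names follow the attribute table, values are made up.
df = pd.DataFrame({
    "State": ["CA", "CA", "CA", "TX", "TX", "TX"],
    "Temperature(F)": [60.0, None, 70.0, 90.0, 80.0, None],
    "Weather_Condition": ["Light Rain", "Heavy Snow", "Fog", "Fair", "Rain Showers", None],
})

# Group-based imputation: fill a missing temperature with the median of its state.
# (Assumption: the notebook may group by a different key or use a different statistic.)
df["Temperature(F)"] = df["Temperature(F)"].fillna(
    df.groupby("State")["Temperature(F)"].transform("median")
)

# Hypothetical keyword-to-group mapping; the notebook's actual groups may differ.
WEATHER_GROUPS = {"rain": "Rain", "snow": "Snow", "fog": "Fog"}


def group_weather(condition):
    """Map a raw Weather_Condition string to a broader group."""
    if pd.isna(condition):
        return "Unknown"
    lowered = condition.lower()
    for keyword, group in WEATHER_GROUPS.items():
        if keyword in lowered:
            return group
    return "Other"


df["Weather_Group"] = df["Weather_Condition"].apply(group_weather)

# One-hot encode the categorical columns and standardize the numerical one.
encoded = pd.get_dummies(df[["State", "Weather_Group"]])
scaled_temp = StandardScaler().fit_transform(df[["Temperature(F)"]])

print(df)
```

Filling within groups keeps imputed values closer to local conditions than a single global statistic would, which is the usual motivation for group-based imputation.
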
- Cleans, tokenizes, and lemmatizes accident descriptions.
- Vectorizes text data using `TfidfVectorizer` for feature extraction.
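
A rough sketch of the text pipeline follows, assuming a simple regex-based cleaner. The function `clean_and_tokenize` and the sample descriptions are hypothetical. Lemmatization is left as a pluggable hook that defaults to a no-op, because NLTK's `WordNetLemmatizer` needs the WordNet corpus downloaded first; passing `WordNetLemmatizer().lemmatize` as the hook would add that step.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def clean_and_tokenize(text, lemmatize=lambda token: token):
    """Lowercase, replace non-letters with spaces, split, then lemmatize each token.

    `lemmatize` defaults to the identity function (an assumption made so the
    sketch runs without downloading any NLTK corpora).
    """
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [lemmatize(token) for token in text.split()]


# Made-up descriptions in the style of the Description column.
descriptions = [
    "Accident on I-95 Northbound at Exit 12.",
    "Lane blocked due to accident on Main St.",
    "Road closed due to flooding.",
]

# token_pattern=None avoids the warning about the default pattern being unused
# when a custom tokenizer is supplied.
vectorizer = TfidfVectorizer(tokenizer=clean_and_tokenize, token_pattern=None)
tfidf = vectorizer.fit_transform(descriptions)

print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))
```

The result is a sparse matrix with one row per description and one column per token, which can be combined with the tabular features before training.
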
- Uses a Random Forest Classifier.
- Employs SMOTE to handle imbalanced class distribution.
- Outputs a model file (`Random_Forest_Classifier_SMOTE.pkl`) for future use.
- Clone the repository: `git clone https://github.com/your-username/your-repo-name.git`
- Open and run the notebook `ML_Engineer_Task.ipynb`:
  - Loads the raw dataset.
  - Cleans and preprocesses data.
  - Saves the output as `accident_data.pkl` or `accident_data.csv`.
- Open `ML_Engineer_Task_Colab.ipynb` in Google Colab or your preferred environment.
- Load the preprocessed file (`.pkl` or `.csv`).
- Run the notebook to:
  - Perform NLP on accident descriptions.
  - Train the model and save it as `Random_Forest_Classifier_SMOTE.pkl`.
- Processed Dataset: `accident_data.pkl` or `accident_data.csv`
- Trained Model: `Random_Forest_Classifier_SMOTE.pkl`
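
Both artifacts could be loaded back along these lines. The helpers `load_processed` and `load_model` are hypothetical, and the sketch assumes the dataset was written with pandas and the model with `pickle`; if the notebook used `joblib`, `joblib.load` would be the counterpart.

```python
import pickle
from pathlib import Path

import pandas as pd


def load_processed(path):
    """Load the processed dataset from either a .pkl or a .csv file."""
    path = Path(path)
    if path.suffix == ".pkl":
        return pd.read_pickle(path)
    if path.suffix == ".csv":
        return pd.read_csv(path)
    raise ValueError(f"Unsupported file type: {path.suffix}")


def load_model(path="Random_Forest_Classifier_SMOTE.pkl"):
    """Load a pickled model. Only unpickle files from a source you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

The `.pkl` form preserves column dtypes exactly, while the `.csv` form is easier to inspect but re-infers dtypes on load.
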
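Install the required Python libraries:
`pip install pandas scikit-learn nltk matplotlib seaborn`
SMOTE is not part of scikit-learn. It is commonly provided by the `imbalanced-learn` package, so install that as well if the notebook imports `imblearn`.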
- Validation Accuracy: ~96.9%
- Class Imbalance Handling: SMOTE improves performance on underrepresented severity classes.
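
On an imbalanced target, overall accuracy can look strong even when rare classes are missed, so per-class metrics are worth checking alongside it. The sketch below uses made-up labels, not results from this project, to show how scikit-learn reports them.

```python
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Made-up labels: severity 2 dominates, severity 4 is rare.
y_true = [2] * 8 + [4] * 2
# A degenerate model that always predicts the majority class.
y_pred = [2] * 10

accuracy = accuracy_score(y_true, y_pred)
# average=None returns one recall value per label, in the order given by `labels`.
per_class_recall = recall_score(
    y_true, y_pred, labels=[2, 4], average=None, zero_division=0
)

print("accuracy:", accuracy)
print("recall per class [2, 4]:", per_class_recall)
print(classification_report(y_true, y_pred, zero_division=0))
```

Here the accuracy is high while recall for the rare class is zero, which is the failure mode that oversampling methods such as SMOTE aim to reduce.
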
- Experiment with other ML algorithms like Gradient Boosting or neural networks.
- Integrate real-time accident data for dynamic predictions.
- Enhance feature engineering for improved accuracy.
Contributions are welcome! If you have ideas for improvement:
- Fork this repository.
- Create a new branch (`feature-branch-name`).
- Commit changes and push.
- Open a Pull Request.