Analysis and modeling of the US Accidents dataset to predict accident severity and identify contributing factors. Includes data cleaning, feature engineering, exploratory data analysis, and machine learning model development. The project aims to provide insights into road safety and accident prevention strategies.

AishwaryaHoysal24/TechStaX_ML_Engineer_Task

🚗 Road Accident Severity Prediction

This project applies machine learning techniques to analyze road accidents in the US and predict their severity. It covers data preprocessing, natural language processing (NLP) on accident descriptions, and training a classifier for accident severity.


📁 Repository Structure

  • ML_Engineer_Task.ipynb
    Preprocesses the raw accident data: handles missing values, engineers features, and scales numerical columns. Outputs a processed dataset in .csv or .pkl format.

  • ML_Engineer_Task_Colab.ipynb
    Loads the preprocessed data and applies NLP techniques to accident descriptions. Trains a Random Forest Classifier with SMOTE and outputs a trained model (Random_Forest_Classifier_SMOTE.pkl).


📊 Dataset Attributes

The dataset contains the following attributes. The Nullable column indicates whether a field may be missing.

| Attribute | Description | Nullable |
| --- | --- | --- |
| ID | Unique identifier for each accident record. | No |
| Severity | Severity level of the accident (1 to 4), where 1 indicates minor impact and 4 indicates significant impact. | No |
| Start_Time | Start time of the accident in the local timezone. | No |
| End_Time | End time of the accident's impact on traffic flow. | No |
| Start_Lat | Latitude of the starting point of the accident. | No |
| Start_Lng | Longitude of the starting point of the accident. | No |
| End_Lat | Latitude of the ending point of the accident. | Yes |
| End_Lng | Longitude of the ending point of the accident. | Yes |
| Distance(mi) | Length of the road segment affected by the accident. | No |
| Description | Natural language description of the accident. | No |
| Number | Street number in the address field. | Yes |
| Street | Street name in the address field. | Yes |
| Side | Side of the street (Right/Left). | Yes |
| City | City name. | Yes |
| County | County name. | Yes |
| State | State name. | Yes |
| Zipcode | Zip code of the location. | Yes |
| Country | Country name (e.g., US). | Yes |
| Timezone | Timezone of the accident's location (e.g., Eastern, Central). | Yes |
| Airport_Code | Nearest airport-based weather station to the accident location. | Yes |
| Weather_Timestamp | Timestamp of the weather observation for the accident. | Yes |
| Temperature(F) | Temperature at the time of the accident (in Fahrenheit). | Yes |
| Wind_Chill(F) | Wind chill at the time of the accident (in Fahrenheit). | Yes |
| Humidity(%) | Humidity percentage at the time of the accident. | Yes |
| Pressure(in) | Atmospheric pressure at the time of the accident (in inches). | Yes |
| Visibility(mi) | Visibility at the time of the accident (in miles). | Yes |
| Wind_Direction | Wind direction at the time of the accident. | Yes |
| Wind_Speed(mph) | Wind speed at the time of the accident (in miles per hour). | Yes |
| Precipitation(in) | Precipitation amount at the time of the accident (in inches). | Yes |
| Weather_Condition | Weather conditions (e.g., Rain, Snow, Fog). | Yes |
| Amenity | Indicates the presence of nearby amenities. | No |
| Bump | Indicates the presence of nearby speed bumps. | No |
| Crossing | Indicates the presence of a nearby crossing. | No |
| Give_Way | Indicates the presence of a nearby "Give Way" sign. | No |
| Junction | Indicates the presence of a nearby junction. | No |
| No_Exit | Indicates the presence of a nearby "No Exit" sign. | No |
| Railway | Indicates the presence of nearby railways. | No |
| Roundabout | Indicates the presence of a nearby roundabout. | No |
| Station | Indicates the presence of a nearby station. | No |
| Stop | Indicates the presence of a nearby stop sign. | No |
| Traffic_Calming | Indicates the presence of nearby traffic-calming measures. | No |
| Traffic_Signal | Indicates the presence of nearby traffic signals. | No |
| Turning_Loop | Indicates the presence of a nearby turning loop. | No |
| Sunrise_Sunset | Period of the day (Day/Night) based on sunrise and sunset. | Yes |
| Civil_Twilight | Period of the day based on civil twilight. | Yes |
| Nautical_Twilight | Period of the day based on nautical twilight. | Yes |
| Astronomical_Twilight | Period of the day based on astronomical twilight. | Yes |
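
The Nullable column can be verified against a loaded frame before any modeling. The following is a minimal sketch, assuming pandas; the helper `check_required_columns` and the two-row sample frame are hypothetical and only stand in for the real dataset.

```python
import pandas as pd

# A subset of the attributes listed as non-nullable above.
REQUIRED_COLUMNS = [
    "ID", "Severity", "Start_Time", "End_Time",
    "Start_Lat", "Start_Lng", "Distance(mi)", "Description",
]


def check_required_columns(df: pd.DataFrame) -> dict:
    """Return {column: missing_count} for required columns that are absent or contain nulls.

    Hypothetical helper for illustration; it is not part of the notebooks.
    """
    problems = {}
    for col in REQUIRED_COLUMNS:
        if col not in df.columns:
            problems[col] = len(df)  # the whole column is missing
        elif df[col].isna().any():
            problems[col] = int(df[col].isna().sum())
    return problems


# Toy stand-in for the real dataset, with one missing End_Time.
sample = pd.DataFrame({
    "ID": ["A-1", "A-2"],
    "Severity": [2, 3],
    "Start_Time": pd.to_datetime(["2021-01-01 08:00", "2021-01-01 09:30"]),
    "End_Time": pd.to_datetime(["2021-01-01 08:45", None]),
    "Start_Lat": [39.1, 40.2],
    "Start_Lng": [-84.5, -83.0],
    "Distance(mi)": [0.5, 1.2],
    "Description": ["Accident on I-75.", "Lane blocked."],
})

print(check_required_columns(sample))
```

An empty dictionary means every checked column is present and fully populated.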

🛠️ Features

Data Preprocessing

  • Handles missing values using group-based imputations.
  • Groups Weather_Condition categories into broader groups (e.g., Rain, Snow, Fog).
  • Encodes categorical variables and scales numerical features.
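
The three preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual code: the grouping key (`State`), the keyword rules in `group_weather`, and the helper names are all hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def impute_by_group(df, column, group):
    """Fill missing values in `column` with the median of rows sharing the same `group` value."""
    filled = df[column].fillna(df.groupby(group)[column].transform("median"))
    # Fall back to the overall median when an entire group has no observed value.
    return filled.fillna(df[column].median())


def group_weather(condition):
    """Map a raw Weather_Condition value to a broader bucket (illustrative keyword rules)."""
    if pd.isna(condition):
        return "Unknown"
    text = str(condition).lower()
    for keyword, bucket in [("snow", "Snow"), ("rain", "Rain"),
                            ("drizzle", "Rain"), ("fog", "Fog"), ("mist", "Fog")]:
        if keyword in text:
            return bucket
    return "Other"


# Toy stand-in for the real dataset.
sample = pd.DataFrame({
    "State": ["OH", "OH", "OH", "CA", "CA"],
    "Temperature(F)": [30.0, None, 50.0, 70.0, None],
    "Weather_Condition": ["Light Rain", "Heavy Snow", "Fog", None, "Clear"],
})

# 1. Group-based imputation.
sample["Temperature(F)"] = impute_by_group(sample, "Temperature(F)", "State")
imputed = sample["Temperature(F)"].tolist()

# 2. Collapse weather conditions into broader groups.
sample["Weather_Group"] = [group_weather(c) for c in sample["Weather_Condition"]]

# 3. One-hot encode categoricals and standardize the numeric column.
encoded = pd.get_dummies(sample, columns=["State", "Weather_Group"])
encoded[["Temperature(F)"]] = StandardScaler().fit_transform(encoded[["Temperature(F)"]])

print(imputed)
print(sample["Weather_Group"].tolist())
```

In a real pipeline the scaler would be fitted on the training split only and reused on validation data, so that validation statistics do not leak into training.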

📊 Natural Language Processing

  • Cleans, tokenizes, and lemmatizes accident descriptions.
  • Vectorizes text data using TfidfVectorizer for feature extraction.
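
A minimal sketch of the text pipeline, assuming scikit-learn. The cleaning rules and the helper `clean_description` are hypothetical; lemmatization is left as an optional callable so the example runs without downloading NLTK corpora. After `nltk.download("wordnet")`, `nltk.stem.WordNetLemmatizer().lemmatize` could be passed as `lemmatize`.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

TOKEN_RE = re.compile(r"[a-z]+")


def clean_description(text, lemmatize=None):
    """Lowercase the text, keep alphabetic tokens, and optionally lemmatize each token."""
    tokens = TOKEN_RE.findall(str(text).lower())
    if lemmatize is not None:
        tokens = [lemmatize(tok) for tok in tokens]
    return " ".join(tokens)


# Made-up descriptions in the style of the Description attribute.
descriptions = [
    "Accident on I-75 Northbound at Exit 32. Right lane blocked.",
    "Two lanes blocked due to accident on US-23.",
]
cleaned = [clean_description(d) for d in descriptions]

# max_features caps the vocabulary size; 5000 is an arbitrary illustrative value.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(cleaned)

print(cleaned[0])
print(tfidf.shape)
```

The resulting sparse matrix has one row per description and can be combined with the numeric features before training.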

🧠 Model Training

  • Uses a Random Forest Classifier.
  • Employs SMOTE to handle imbalanced class distribution.
  • Outputs a model file (Random_Forest_Classifier_SMOTE.pkl) for future use.

🚀 Getting Started

1️⃣ Preprocess Data

  1. Clone the repository:
    git clone https://github.com/AishwaryaHoysal24/TechStaX_ML_Engineer_Task.git
  2. Open and run the notebook ML_Engineer_Task.ipynb:
    • Loads the raw dataset.
    • Cleans and preprocesses data.
    • Saves the output as accident_data.pkl or accident_data.csv.

2️⃣ Train the Model

  1. Open ML_Engineer_Task_Colab.ipynb in Google Colab or your preferred environment.
  2. Load the preprocessed file (.pkl or .csv).
  3. Run the notebook to:
    • Perform NLP on accident descriptions.
    • Train the model and save it as Random_Forest_Classifier_SMOTE.pkl.
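
Because the preprocessed file may be either a pickle or a CSV, the loading step can branch on the file extension. A minimal sketch, assuming pandas; `load_processed` is a hypothetical helper, and the temporary files only imitate the output of the first notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_processed(path):
    """Load the preprocessed dataset from a .pkl or .csv file."""
    path = Path(path)
    if path.suffix == ".pkl":
        return pd.read_pickle(path)
    if path.suffix == ".csv":
        return pd.read_csv(path)
    raise ValueError(f"Unsupported file type: {path.suffix}")


# Write a tiny frame in both formats, then read each one back.
with tempfile.TemporaryDirectory() as tmp:
    frame = pd.DataFrame({
        "Severity": [2, 3],
        "Description": ["lane blocked", "accident on ramp"],
    })
    frame.to_pickle(Path(tmp) / "accident_data.pkl")
    frame.to_csv(Path(tmp) / "accident_data.csv", index=False)

    from_pkl = load_processed(Path(tmp) / "accident_data.pkl")
    from_csv = load_processed(Path(tmp) / "accident_data.csv")

print(from_pkl.shape, from_csv.shape)
```

The pickle preserves column dtypes exactly, while the CSV re-infers them on load, so datetime columns would need to be parsed again. Pickle files should only be loaded from trusted sources.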

📦 Outputs

  • Processed Dataset: accident_data.pkl or accident_data.csv
  • Trained Model: Random_Forest_Classifier_SMOTE.pkl

🛑 Prerequisites

Install the required Python libraries:

pip install pandas scikit-learn imbalanced-learn nltk matplotlib seaborn

📊 Results

  • Validation Accuracy: ~96.9%
  • Class Imbalance Handling: SMOTE improves performance on underrepresented severity classes.
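
On an imbalanced dataset, overall accuracy can look strong while minority classes are rarely predicted correctly, so per-class metrics are a useful complement. The sketch below uses made-up labels, not the project's results, to show how scikit-learn reports both.

```python
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Made-up labels: eight accidents of severity 2, one of severity 3, one of severity 4.
y_true = [2, 2, 2, 2, 2, 2, 2, 2, 3, 4]
# A degenerate model that always predicts the majority class.
y_pred = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

accuracy = accuracy_score(y_true, y_pred)
per_class_recall = recall_score(
    y_true, y_pred, labels=[2, 3, 4], average=None, zero_division=0,
)

print("accuracy:", accuracy)
print("recall per class:", per_class_recall.tolist())
print(classification_report(y_true, y_pred, labels=[2, 3, 4], zero_division=0))
```

Here the accuracy is 80% even though no accident of severity 3 or 4 is identified, which is the failure mode that oversampling with SMOTE is meant to reduce.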

📅 Future Enhancements

  • Experiment with other ML algorithms like Gradient Boosting or neural networks.
  • Integrate real-time accident data for dynamic predictions.
  • Enhance feature engineering for improved accuracy.

🤝 Contribution

Contributions are welcome! If you have ideas for improvement:

  1. Fork this repository.
  2. Create a new branch (feature-branch-name).
  3. Commit changes and push.
  4. Open a Pull Request.
