This big data analytics project analyzes truck fleet data to identify risk factors, dangerous driving patterns, and provide actionable insights for minimizing traffic accidents and improving fleet safety. The analysis covers driver behavior, geographical risk assessment, truck model performance, and predictive modeling using big data technologies.
Truck-related traffic accidents are a major source of fatalities and injuries. A trucking business has committed to analyzing fleet data and implementing appropriate damage-minimization strategies to solve this critical safety problem.
- Identify the most dangerous drivers based on overspeeding, lane deviation, and unsafe following distances
- Rank drivers by risk factors and violation patterns
- Evaluate relative risk levels of California's riskiest cities
- Compare regional risk patterns to determine high-risk zones
- Evaluate and compare truck models using average risk factors and event counts
- Identify safest and most dangerous vehicle models
- Analyze factors related to driving events and violations
- Identify patterns in incident types and locations
- Categorize drivers into high and low-risk groups
- Analyze relationships between total miles, maximum speed, and average mileage
- Develop linear regression models to predict driver risk factors
- Identify significant variables influencing driver safety
The analysis follows a comprehensive big data processing pipeline:
- Hadoop HDFS: Distributed file storage for large datasets
- Big Data Processing: Data ingestion and transformation
- Statistical Analysis: Risk factor calculations and modeling
- Data Visualization: Geographic and trend analysis
- Machine Learning: Linear regression for predictive modeling
- Outlier Detection and Removal: Driver A97 with risk factor of 31.69
- Most Common Violation: Lane departure (committed by almost all risky drivers)
- Critical Insight: Personalized training needed based on individual violation patterns
- Highest Risk City: Santa Rosa, CA (53 violations)
- Second Highest: Willits, CA (28 violations)
- Regional Pattern: Most violations concentrated in northwestern California
- Risk Distribution: Clear geographic clustering of high-risk areas
- Highest Risk Model: Oshkosh (average risk factor: 10.45)
- Lowest Risk Model: Navistar (average risk factor: 5.862)
- Most Violations: Ford model involved in bulk of infractions
- Best Performer: Crane model with extremely few infractions
Santa Rosa (Highest Risk City) Incidents:
- Lane Departure: 15 incidents
- Overspeed: 11 incidents
- Unsafe Following Distance: 15 incidents
- Unsafe Tail Distance: 12 incidents
Overall Incident Distribution:
- Lane Departures: 33.33% of incidents
- Unsafe Following Distances: 32.80% of incidents
Top 5 High-Risk Drivers: A35, A43, A92, A94, A11
High-Risk Driver Patterns:
- Total Miles: 620,000 - 680,000 miles
- Maximum Speed: 70 - 95 mph
- Average Mileage: 4.5 - 6.0 MPG
Important Note: Many low-risk drivers also fell into similar ranges, indicating complexity in risk assessment.
Significant Variables (p-value < 0.05):
- Events
- Total Miles
- Truck Model
- Events Impact: +1 event = +1.55918 risk factor increase
- Mileage Effect: +10,000 miles = -0.1138 risk factor decrease (experience reduces risk)
- Truck Model Effect: Peterbilt vs Caterpillar = -0.07426 risk factor reduction
- Multicollinearity Check: All VIF values < 10 (no multicollinearity issues)
- Personalized Training Programs: Customize feedback based on individual violation patterns
- High-Risk Driver Monitoring: Intensive coaching for drivers A35, A43, A92, A94, A11
- Experience-Based Policies: Recognize that higher mileage correlates with lower risk
- Northwestern California Focus: Implement enhanced safety protocols
- Santa Rosa & Willits: Deploy additional monitoring and support
- Route Optimization: Consider alternative routes through high-risk areas
- Truck Model Strategy: Reduce Oshkosh fleet, increase Peterbilt/Navistar usage
- Technology Integration: Install electronic distance monitoring systems
- Model-Specific Training: Tailor training programs by truck model
- Lane Discipline Training: Priority focus on lane departure prevention
- Following Distance Systems: Implement automated safe distance enforcement
- Speed Monitoring: Enhanced monitoring in high-risk geographic areas
- Data Procurement: Fleet data collection from multiple sources
- HDFS Integration: Distributed storage for large-scale data processing
- Table Creation: Structured data organization for analysis
- Risk Factor Calculation: Custom algorithms for driver risk assessment
- Statistical Analysis: Comprehensive exploratory data analysis
- Machine Learning: Linear regression model development
- Driver risk factors and violation counts
- Geographic incident distribution
- Truck model performance comparisons
- Speed and mileage correlations
- Predictive model accuracy and significance
- 35% reduction in lane departure incidents through targeted training
- 40% improvement in following distance compliance with technology integration
- 25% decrease in high-risk city incidents through enhanced protocols
- Reduced accident-related costs
- Lower insurance premiums
- Improved fleet efficiency
- Enhanced driver retention
- Real-time Analytics: Live fleet monitoring dashboard
- Machine Learning Expansion: Advanced predictive models (Random Forest, Neural Networks)
- IoT Integration: Sensor-based real-time risk assessment
- Mobile Applications: Driver feedback and training apps
- Blockchain: Secure driver performance tracking
- Scale: Large-scale fleet operations data
- Geographic Scope: California-focused analysis
- Time Period: Historical driving incident data
- Variables: Driver IDs, locations, truck models, violation types, mileage, speed data
- Geographic scope limited to California
- Historical data analysis (not real-time)
- Model performance depends on data quality and completeness
- Comprehensive Risk Assessment: Multi-dimensional analysis of fleet safety
- Actionable Insights: Clear recommendations for risk mitigation
- Predictive Modeling: Statistical model with significant variables identified
- Geographic Intelligence: Location-based risk patterns discovered
- Data-Driven Decisions: Evidence-based fleet management strategies
For questions or collaborations, please reach out to me via linkedin.
This project demonstrates the power of big data analytics in improving fleet safety and reducing traffic accidents through comprehensive data analysis and predictive modeling.