This collaborative data science project explores how directors, budgets, genres, and release timing influence movie success. Using data scraped from IMDb and machine learning techniques, our team of five analyzed performance trends to inform decision-making in the film industry.
The dataset includes:
- Director names
- Movie budgets
- Genres
- Release dates
- Box office earnings
- IMDb ratings
- Pinpoint which directors consistently drive commercial and critical success
- Model the impact of budget and genre on box office performance
- Analyze seasonal release trends
- Provide actionable insights for studios and producers
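The director and seasonality objectives above can be sketched with a couple of pandas aggregations. This is a minimal illustration, not the code from `main.ipynb`: the column names (`director`, `budget`, `gross`, `rating`, `release_date`), both helper functions, and the placeholder rows are assumptions made for the example.

```python
import pandas as pd


def summarize_directors(df, min_films=2):
    """Mean return on budget and mean rating per director.

    Directors with fewer than `min_films` titles are dropped so that a
    single hit does not count as "consistent" success.
    """
    scored = df.assign(roi=df["gross"] / df["budget"])
    summary = scored.groupby("director").agg(
        films=("roi", "size"),
        mean_roi=("roi", "mean"),
        mean_rating=("rating", "mean"),
    )
    return summary[summary["films"] >= min_films].sort_values(
        "mean_roi", ascending=False
    )


def median_gross_by_month(df):
    """Median box office gross for each calendar month of release."""
    months = pd.to_datetime(df["release_date"]).dt.month
    return df.groupby(months)["gross"].median().rename_axis("release_month")


# Placeholder rows for illustration only (hypothetical values, not project data).
movies = pd.DataFrame(
    {
        "director": ["Director A", "Director A", "Director B", "Director B", "Director C"],
        "budget": [10.0, 20.0, 50.0, 40.0, 5.0],
        "gross": [30.0, 40.0, 50.0, 20.0, 50.0],
        "rating": [7.0, 8.0, 6.0, 5.0, 9.0],
        "release_date": ["2020-06-15", "2021-12-20", "2020-06-01", "2021-01-10", "2022-12-25"],
    }
)

print(summarize_directors(movies))
print(median_gross_by_month(movies))
```

The median is used for the monthly view because box office figures tend to be skewed by a few very large releases. A mean would work the same way if that is preferred.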
├── extract_data.ipynb # IMDb scraping logic and raw data generation
├── raw_data.csv # Unprocessed dataset
├── main.ipynb # Data cleaning, EDA, visualizations, ML models
└── README.md # Project overview and documentation
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
- Methods:
  - Regression modeling
  - Correlation and trend analysis
  - Feature engineering & encoding
  - Train/test split and performance evaluation
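As a rough sketch of how the methods listed above fit together, the example below one-hot encodes categorical features, holds out a test set, and scores a linear regression with scikit-learn. It is an assumption-laden illustration rather than the notebook's actual model: the column names, the `make_placeholder_movies` and `fit_gross_model` helpers, and the synthetic data (including its built-in budget/gross relationship) are all made up so the block runs without `raw_data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def make_placeholder_movies(n_rows=200, seed=0):
    """Generate synthetic rows so the sketch runs without the real dataset."""
    rng = np.random.default_rng(seed)
    budget = rng.uniform(1e6, 2e8, size=n_rows)
    return pd.DataFrame(
        {
            "genre": rng.choice(["Action", "Comedy", "Drama"], size=n_rows),
            "budget": budget,
            "release_month": rng.integers(1, 13, size=n_rows),
            # Arbitrary made-up relationship, not a finding from the project.
            "gross": 2.0 * budget + rng.normal(0, 1e7, size=n_rows),
        }
    )


def fit_gross_model(df):
    """Fit a linear model for gross and return it with its held-out R^2."""
    X = df[["genre", "budget", "release_month"]]
    y = df["gross"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Encode genre and release month as categories; budget passes through as-is.
    preprocess = ColumnTransformer(
        transformers=[
            (
                "categorical",
                OneHotEncoder(handle_unknown="ignore"),
                ["genre", "release_month"],
            )
        ],
        remainder="passthrough",
        sparse_threshold=0,  # always return a dense feature matrix
    )
    model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
    model.fit(X_train, y_train)
    return model, r2_score(y_test, model.predict(X_test))


model, r2 = fit_gross_model(make_placeholder_movies())
print(f"Held-out R^2 on synthetic data: {r2:.3f}")
```

Keeping the encoder inside the pipeline means the categories are learned from the training split only, and `handle_unknown="ignore"` lets the model score a genre it has not seen before instead of raising an error.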
- Some directors demonstrate consistent profitability and audience acclaim.
- Genre choice and timing of release significantly influence box office outcomes.
- Budget optimization strategies can be inferred from predictive modeling results.
For a deeper narrative walkthrough and key takeaways, read the blog post.
This project was made possible by a team of five (myself, Bilal, Shaheer, Hasnain, and Ahad) who shared the scraping, cleaning, modeling, and presentation work. Collaboration was key in shaping this analytical story.
- Model tuning and validation with more complex ML architectures
- External datasets (Rotten Tomatoes, Metacritic) for enriched analysis
- Dashboard-style visualization using Plotly or Streamlit
Feel free to reach out on GitHub to discuss the project, data science, or potential collaborations!