This project explores fundamental data science concepts with the Python libraries NumPy and Pandas, using the MovieLens dataset as a case study.

Helenaden/Data-Science-Fundamentals

Data Science Fundamentals: NumPy & Pandas with MovieLens Case Study

Project Overview

This project is all about building a solid foundation in data science with Python.
Using the MovieLens dataset, I explored how to work with NumPy and Pandas to analyze data, uncover patterns, and draw meaningful insights.

The dataset is a good real-world example: it combines user demographics, movie information, and ratings, which makes it well suited to practicing data wrangling, analysis, and visualization.

Objective

The main goal was to analyze the MovieLens datasets (movies, users, and ratings) to:

  • Understand how movies are rated and identify rating trends.
  • Explore genre preferences and user behavior.
  • Investigate the connection between demographics (age, gender, occupation) and ratings.
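As a starting point, the three tables have to be read into Pandas. The sketch below assumes a pipe-delimited users file with no header row, which is how some MovieLens releases ship; the README does not state the file layout, so the separator, the column names, and the sample rows are all illustrative assumptions. An in-memory buffer stands in for the real file so the example runs on its own.

```python
import io

import pandas as pd

# Stand-in for a pipe-delimited users file. These rows are made up.
sample_users_file = io.StringIO(
    "1|30|M|engineer|12345\n"
    "2|41|F|teacher|02139\n"
    "3|19|F|student|V6T1Z\n"
)

# Hypothetical column names; the project may label them differently.
user_columns = ["user_id", "age", "gender", "occupation", "zip_code"]

users = pd.read_csv(
    sample_users_file,
    sep="|",
    names=user_columns,        # the file has no header row in this sketch
    dtype={"zip_code": str},   # keep zip codes as text so leading zeros survive
)

print(users.shape)
print(users.dtypes)
```

Reading the zip code as text is a deliberate choice here: parsed as a number, "02139" would lose its leading zero and a non-numeric code would force the whole column to a mixed type.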

Dataset Breakdown

Users

  • 943 users, each with details like age, gender, occupation, and zip code.
  • Key findings:
    • The average user age is 34 (range: 7–73).
    • Zip code values stood out as an area worth deeper investigation.
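The user findings above could be reproduced with a few lines of Pandas. This is a minimal sketch on made-up rows, not the project's actual code: the column names `age` and `zip_code` are assumptions, and the five-digit check is only one possible way to surface unusual zip codes.

```python
import pandas as pd

# Toy users table; values are illustrative, not taken from MovieLens.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [22, 35, 41, 58],
    "zip_code": ["12345", "02139", "V6T1Z", "90210"],
})

# Mean, minimum, and maximum age in one call.
age_summary = users["age"].agg(["mean", "min", "max"])
print(age_summary)

# Flag zip codes that are not exactly five digits.
is_five_digit = users["zip_code"].str.fullmatch(r"\d{5}")
odd_zip_codes = users.loc[~is_five_digit, "zip_code"]
print(odd_zip_codes.tolist())
```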

Movies

  • 1,680 movies with titles, release dates, and up to 18 genre tags.
  • Key findings:
    • Movies often belong to multiple genres.
    • Drama and Comedy were the most common.
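One way to arrive at findings like these, assuming the genre tags are stored as one 0/1 column per genre, is to sum the flags by column and by row. The table below is a toy example with three hypothetical genre columns; the real dataset has more, and its column names may differ.

```python
import pandas as pd

# Toy movies table with one 0/1 flag column per genre (assumed layout).
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Drama": [1, 0, 1, 1],
    "Comedy": [0, 1, 1, 0],
    "Thriller": [0, 0, 0, 1],
})
genre_columns = ["Drama", "Comedy", "Thriller"]

# Column sums give the number of movies tagged with each genre.
movies_per_genre = movies[genre_columns].sum().sort_values(ascending=False)

# Row sums give the number of genres attached to each movie.
genres_per_movie = movies[genre_columns].sum(axis=1)

# The mean of a boolean Series is the share of True values.
multi_genre_share = (genres_per_movie > 1).mean()

print(movies_per_genre)
print(f"Share of multi-genre movies: {multi_genre_share:.0%}")
```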

Ratings

  • 100,000 ratings linked to users and movies, each with a timestamp.
  • Key findings:
    • The average movie rating is 3.53 out of 5.
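A rough sketch of the ratings summary follows. It assumes the timestamp column holds Unix epoch seconds, which the README does not confirm, and it uses invented rows and hypothetical column names.

```python
import pandas as pd

# Toy ratings table; values are illustrative only.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating": [4, 3, 5, 2],
    "timestamp": [880000000, 880086400, 880172800, 880259200],
})

# Global average rating across all rows.
mean_rating = ratings["rating"].mean()

# Assumption: timestamps are seconds since 1970-01-01 UTC.
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s")

# How often each star value was given, ordered from lowest to highest.
rating_counts = ratings["rating"].value_counts().sort_index()

print(mean_rating)
print(ratings[["timestamp", "rated_at"]])
print(rating_counts)
```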

Insights & Discoveries

  • Genre Trends:

    • Movies are spread across 18 genres.
    • About half belong to more than one genre.
    • Drama and Comedy dominate in volume.
  • Genre Preferences:

    • Film-Noir had the highest average rating (3.92).
    • Fantasy scored the lowest (3.21).
    • Overall, 72% of genres received average ratings above the global average of 3.53.
  • Movie Favorites:

    • By average rating: "Great Day in Harlem, A" and "Prefontaine".
    • By popularity: Star Wars had the highest number of ratings.
  • Demographics & Ratings:

    • The dataset is 71% male.
    • Men and women gave almost the same average rating (~3.53).
    • Non-working users gave the highest ratings.
    • Healthcare workers gave the lowest, especially female healthcare workers.
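Insights of this kind usually come from joining the three tables and grouping the result. The following is a sketch of that pattern on toy data, not the project's own code; the key columns `user_id` and `movie_id`, the genre flag columns, and every value are assumptions made for illustration.

```python
import pandas as pd

# Toy tables; names and values are illustrative only.
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "gender": ["M", "F", "M"],
    "occupation": ["student", "healthcare", "none"],
})
movies = pd.DataFrame({
    "movie_id": [10, 20],
    "title": ["Movie A", "Movie B"],
    "Drama": [1, 0],
    "Comedy": [1, 1],
})
ratings = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2],
    "movie_id": [10, 10, 10, 20, 20],
    "rating": [4, 2, 5, 3, 1],
})
genre_columns = ["Drama", "Comedy"]

# One row per rating, enriched with user and movie attributes.
merged = ratings.merge(users, on="user_id").merge(movies, on="movie_id")

# Average rating per genre: keep rows where the genre flag is set.
genre_means = pd.Series({
    genre: merged.loc[merged[genre] == 1, "rating"].mean()
    for genre in genre_columns
}).sort_values(ascending=False)

# Per-movie average and number of ratings.
movie_stats = merged.groupby("title")["rating"].agg(["mean", "count"])
most_rated = movie_stats["count"].idxmax()

# Average rating by demographic group.
by_gender = merged.groupby("gender")["rating"].mean()
by_occupation = merged.groupby("occupation")["rating"].mean().sort_values()

print(genre_means)
print(movie_stats)
print(by_gender)
print(by_occupation)
```

Keeping the count next to the mean in `movie_stats` makes it easy to see when a top average rests on only a handful of ratings; filtering on a minimum count is a common way to handle that.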

Skills Applied

  • Data cleaning and preprocessing with NumPy and Pandas.
  • Exploring datasets with descriptive statistics and summaries.
  • Deriving insights from real-world data.
  • Understanding relationships between demographics, genres, and ratings.

Why This Project Matters

This case study shows how raw data can be transformed into meaningful insights.
It highlights:

  • How to clean and structure real-world datasets.
  • Ways to uncover hidden patterns in data.
  • The importance of combining technical skills with curiosity-driven exploration.

Most importantly, it lays the groundwork for more advanced machine learning and AI applications, where understanding the data is always the first step.
