A data cleaning project focused on preparing the Google Play Store app dataset by handling missing values, fixing data types, and formatting issues for accurate analysis.

Oluwakoya-ao/google-playstore-data-cleaning

📱 Google Play Store Data Cleaning

This project focuses on cleaning and preparing a dataset of mobile applications from the Google Play Store. The goal is to make the dataset analysis-ready by handling missing values, correcting data types, and removing inconsistencies.

📂 Repository Structure

📁 google-playstore-data-cleaning/
│
├── Google_Playstore.ipynb          # Jupyter Notebook with cleaning steps
├── Google_playstore_sampled.csv    # Original dataset
└── README.md                       # Project documentation

🧹 Cleaning Objectives

  • Handle missing values
  • Correct data types
  • Clean string formatting issues (e.g., price, installs)
  • Remove duplicates and outliers
  • Prepare the dataset for further analysis or visualization
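
As a rough illustration of the first few objectives, here is a minimal sketch on toy data. The column names (`Rating`, `Last Updated`), the 1 to 5 rating range, and the median-fill strategy are assumptions made for this example; the notebook may take a different approach.

```python
import pandas as pd

# Toy rows standing in for the real dataset; column names are assumptions.
df = pd.DataFrame({
    "Rating": ["4.5", "bad", None, "19", "3.8"],
    "Last Updated": ["2020-01-15", "2019-07-01", "not a date", "2021-03-09", None],
})

# Correct data types: unparseable values become NaN / NaT instead of raising.
df["Rating"] = pd.to_numeric(df["Rating"], errors="coerce")
df["Last Updated"] = pd.to_datetime(
    df["Last Updated"], format="%Y-%m-%d", errors="coerce"
)

# Outliers: treat ratings outside the 1-5 range as invalid (assumed rule).
df.loc[~df["Rating"].between(1, 5), "Rating"] = float("nan")

# Missing values: fill ratings with the median, drop rows lacking a date.
df["Rating"] = df["Rating"].fillna(df["Rating"].median())
df = df.dropna(subset=["Last Updated"]).reset_index(drop=True)

print(df)
```

Using `errors="coerce"` keeps the conversion from failing on a single bad value and turns every problem into a missing value that can be handled in one place.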

📊 Dataset Summary

The dataset includes app metadata such as:

  • App Name
  • Category
  • Rating
  • Number of Installs
  • Price
  • Size
  • Last Updated
  • Content Rating
  • And more...
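
A quick way to check which of these fields are present before cleaning is to load everything as text and compare the columns against an expected list. The sketch below uses an in-memory CSV as a stand-in for `Google_playstore_sampled.csv`; the exact column headers and the `EXPECTED` list are assumptions, so adjust them to match the real file.

```python
import io
import pandas as pd

# Stand-in for Google_playstore_sampled.csv; the real file's columns may differ.
RAW_CSV = io.StringIO(
    "App Name,Category,Rating,Installs,Price\n"
    'Example App,Tools,4.5,"1,000+",$4.99\n'
    "Other App,Games,,500+,0\n"
)

# Read every column as text so values like "1,000+" are kept exactly as stored.
df = pd.read_csv(RAW_CSV, dtype=str)

# Hypothetical list of expected columns, based on the summary above.
EXPECTED = ["App Name", "Category", "Rating", "Installs", "Price", "Size"]
missing_columns = [c for c in EXPECTED if c not in df.columns]

print(df.shape)
print("Columns not found:", missing_columns)
```

With the real file, replace `RAW_CSV` with the path to the CSV.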

🛠️ Tools Used

  • Python
  • Pandas
  • NumPy
  • Regex
  • Jupyter Notebook

🚀 How to Use

  1. Clone the repository:

    git clone https://github.com/oluwakoya-ao/google-playstore-data-cleaning.git
  2. Navigate into the project folder:

    cd google-playstore-data-cleaning
  3. Open the notebook in Jupyter:

    jupyter notebook Google_Playstore.ipynb
  4. Run the cells from top to bottom to step through the data cleaning process.

📌 Key Takeaways

  • The raw dataset contained inconsistencies such as:
    • Commas and symbols in numeric fields (e.g., "1,000+", "$4.99")
    • Missing values across several columns
    • Duplicated entries
  • These issues were resolved through:
    • Type conversion
    • Regex-based cleaning
    • Null value handling
    • Deduplication
  • The cleaned dataset is now ready for exploration, analysis, or modeling.
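
The regex-based cleaning and deduplication described above could look roughly like this. The helper names `clean_installs` and `clean_price` are hypothetical, as are the column names and the choice to map "Free" to 0.0; this is a sketch of the technique, not the notebook's actual code.

```python
import re
import pandas as pd

def clean_installs(value):
    """Turn '1,000+' into 1000; return None when no digits are present."""
    digits = re.sub(r"[^\d]", "", str(value))
    return int(digits) if digits else None

def clean_price(value):
    """Turn '$4.99' into 4.99 and 'Free' into 0.0; None if nothing parses."""
    text = str(value).strip()
    if text.lower() == "free":
        return 0.0
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

# Toy rows with the kinds of issues listed above (assumed column names).
raw = pd.DataFrame({
    "App Name": ["Alpha", "Alpha", "Beta", "Gamma"],
    "Installs": ["1,000+", "1,000+", "500+", None],
    "Price": ["$4.99", "$4.99", "0", "Free"],
})

cleaned = raw.drop_duplicates().copy()                   # deduplication
cleaned["Installs"] = cleaned["Installs"].map(clean_installs)
cleaned["Price"] = cleaned["Price"].map(clean_price)
cleaned = cleaned.dropna(subset=["Installs"]).reset_index(drop=True)
cleaned["Installs"] = cleaned["Installs"].astype(int)    # type conversion

print(cleaned)
```

Dropping duplicates before converting means identical raw rows are compared as-is; deduplicating after cleaning would also catch rows that differ only in formatting.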

📸 Suggested Improvements

Optional additions to enhance this project:

  • Add visual summaries of missing data using seaborn or missingno
  • Include before/after samples of cleaned fields
  • Export and include the final cleaned CSV dataset
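
As a starting point for these additions, the sketch below builds a plain pandas table of missing values (a text-only precursor to a seaborn or missingno plot, not a replacement for one), shows a before/after sample of one field, and writes the cleaned result as CSV. The `missing_report` helper, the column names, and the toy data are all assumptions for illustration.

```python
import io
import pandas as pd

def missing_report(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return report.sort_values("missing", ascending=False)

before = pd.DataFrame({
    "Installs": ["1,000+", "500+", None, "10,000+"],
    "Rating": [4.5, None, None, 3.9],
})

# Drop incomplete rows, then strip commas and plus signs from Installs.
after = before.dropna().assign(
    Installs=lambda d: d["Installs"].str.replace(r"[,+]", "", regex=True).astype(int)
)

print(missing_report(before))

# Before/after sample of a cleaned field, aligned on the surviving rows.
print(pd.DataFrame({
    "before": before.loc[after.index, "Installs"],
    "after": after["Installs"],
}))

# Export; the StringIO buffer stands in for a file path such as "cleaned.csv".
buffer = io.StringIO()
after.to_csv(buffer, index=False)
```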

🙌 Acknowledgments

Dataset sourced from Kaggle: Google Play Store Apps

👨‍💻 Author

Oluwakoya Oluwafemi
📧 ooluwakoyafavour@gmail.com
🔗 LinkedIn


Feel free to fork, star, or contribute to this project!
