This project focuses on cleaning and preparing a dataset of mobile applications from the Google Play Store. The goal is to make the dataset analysis-ready by handling missing values, correcting data types, and removing inconsistencies.
google-playstore-data-cleaning/
│
├── Google_Playstore.ipynb          # Jupyter Notebook with cleaning steps
├── Google_playstore_sampled.csv    # Original dataset
└── README.md                       # Project documentation
- Handle missing values
- Correct data types
- Clean string formatting issues (e.g., price, installs)
- Remove duplicates and outliers
- Prepare the dataset for further analysis or visualization
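The objectives mention outlier removal without naming a method. As a minimal sketch, assuming an interquartile-range (IQR) rule is acceptable, the snippet below filters a numeric column. The sample values and the 1.5 multiplier are illustrative assumptions, not taken from the notebook.

```python
import pandas as pd

# Hypothetical ratings; 19.0 is a deliberately implausible value.
ratings = pd.Series([4.1, 4.3, 3.9, 4.5, 4.0, 19.0], name="Rating")

# Compute the IQR fences (1.5 * IQR is a common convention, assumed here).
q1, q3 = ratings.quantile(0.25), ratings.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values that fall inside the fences.
filtered = ratings[ratings.between(lower, upper)]
print(filtered.tolist())
```

For a bounded field such as a star rating, a simple range check (for example 0 to 5) may be a better fit than a statistical rule.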
The dataset includes app metadata such as:
- App Name
- Category
- Rating
- Number of Installs
- Price
- Size
- Last Updated
- Content Rating
- And more...
- Python
- Pandas
- NumPy
- Regex
- Jupyter Notebook
- Clone the repository:
  git clone https://github.com/oluwakoya-ao/google-playstore-data-cleaning.git
- Navigate into the project folder:
  cd google-playstore-data-cleaning
- Open the notebook in Jupyter:
  jupyter notebook Google_Playstore.ipynb
- Run each cell to see the data cleaning process.
- The raw dataset contained inconsistencies such as:
  - Commas and symbols in numeric fields (e.g., "1,000+", "$4.99")
  - Missing values across several columns
  - Duplicated entries
- These issues were resolved through:
  - Type conversion
  - Regex-based cleaning
  - Null value handling
  - Deduplication
- The cleaned dataset is now ready for exploration, analysis, or modeling.
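The four techniques listed above can be sketched roughly as follows. This is an illustration, not the notebook's actual code: the column names, the sample rows, the function name `clean_playstore`, and the null-handling policy (drop rows without an install count, fill missing ratings with the median) are all assumptions.

```python
import pandas as pd

# Hypothetical raw rows; column names are assumed from the field list in this README.
raw = pd.DataFrame(
    {
        "App Name": ["Alpha", "Alpha", "Beta", "Gamma"],
        "Installs": ["1,000+", "1,000+", "500+", None],
        "Price": ["$4.99", "$4.99", "0", "$0.99"],
        "Rating": ["4.5", "4.5", None, "3.9"],
    }
)


def clean_playstore(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass: regex cleanup, typing, nulls, duplicates."""
    out = df.copy()

    # Regex-based cleaning: keep digits only (plus the decimal point for prices).
    installs = out["Installs"].str.replace(r"[^\d]", "", regex=True)
    price = out["Price"].str.replace(r"[^\d.]", "", regex=True)

    # Type conversion: values that cannot be parsed become NaN instead of raising.
    out["Installs"] = pd.to_numeric(installs, errors="coerce")
    out["Price"] = pd.to_numeric(price, errors="coerce")
    out["Rating"] = pd.to_numeric(out["Rating"], errors="coerce")

    # Null handling (assumed policy): drop rows with no install count,
    # then fill missing ratings with the median rating.
    out = out.dropna(subset=["Installs"]).copy()
    out["Rating"] = out["Rating"].fillna(out["Rating"].median())

    # Deduplication: keep the first copy of fully identical rows.
    out = out.drop_duplicates().reset_index(drop=True)
    out["Installs"] = out["Installs"].astype("int64")
    return out


cleaned = clean_playstore(raw)
print(cleaned)
```

Converting before deduplicating means rows that differ only in formatting (for example "1,000+" versus "1000+") collapse into one.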
Optional additions to enhance this project:
- Add visual summaries of missing data using seaborn or missingno
- Include before/after samples of cleaned fields
- Export and include the final cleaned CSV dataset
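As a starting point for the first two ideas, the sketch below computes a per-column missing-value share and builds a before/after view of one field. It uses pandas only; the sample values and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample; values and column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "Rating": [4.1, None, 3.8, None],
        "Size": ["19M", "8.7M", None, "25M"],
        "Price": ["$4.99", "0", "$0.99", "0"],
    }
)

# Share of missing values per column, highest first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Before/after view of one cleaned field.
comparison = pd.DataFrame(
    {
        "Price (raw)": df["Price"],
        "Price (clean)": pd.to_numeric(
            df["Price"].str.replace("$", "", regex=False), errors="coerce"
        ),
    }
)
print(comparison)
```

The boolean frame from `df.isna()` is also a natural input for a visual summary, for example `seaborn.heatmap(df.isna())`, if seaborn is installed.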
Dataset sourced from Kaggle: Google Play Store Apps
Oluwakoya Oluwafemi
Email: ooluwakoyafavour@gmail.com
LinkedIn
Feel free to fork, star, or contribute to this project!