📊 A comprehensive pandas mastery project with 10 modular Jupyter notebooks covering data loading, cleaning, grouping, merging, time series, visualization, and performance profiling. Includes real-world workflows, Docker, Streamlit, and reusable utils. Ideal for data scientists and analysts to learn, practice, and refer. Practice-ready and modular.

SatvikPraveen/PandasPlayground

📊 PandasPlayground – A Comprehensive Data Manipulation Project

Python License: GPL v3 Dockerized Notebooks

Master data manipulation with pandas, from fundamentals to advanced performance tuning, using realistic synthetic datasets and modular notebooks.


🧠 Project Purpose

PandasPlayground is designed to help aspiring data scientists and analysts master core pandas workflows through hands-on, progressive, and fully documented Jupyter notebooks. Each module targets a specific capability, from data loading and cleaning to advanced transformations and memory profiling, so the set works as both a learning path and a review reference.

Preview Dashboard

🔍 Preview: Explore datasets, visualize trends, and profile performance, all in one playground!


📁 Project Structure

PandasPlayground/
├── assets/               # Charts, exports, and visual output images
├── cheatsheets/          # Markdown-based reference sheets (e.g., pandas_cheatsheet.md)
├── data/                 # Raw datasets (CSV, Excel, JSON, Parquet)
├── exports/              # Final output files (CSV, Excel, styled reports)
├── notebooks/            # All 10 learning notebooks (01–10)
├── pages/                # Streamlit multipage app (expanded)
├── pandas_env/           # Local virtual environment (⚠️ add to .gitignore)
├── scripts/              # Modular reusable utility functions
├── Dockerfile            # Docker support for reproducible environments
├── LICENSE.md
├── README.md             # You're here!
├── requirements.txt      # Minimal dependencies to run the project
├── requirements_dev.txt  # Full dev environment
└── STREAMLIT_App.py      # Interactive dashboard using Streamlit

🧾 Datasets Used

This project uses artificially generated datasets designed to replicate common real-world scenarios. Each file highlights a unique aspect of data handling and analysis using pandas.

| Dataset File | Format | Purpose |
| --- | --- | --- |
| superstore_sales.csv | CSV | Simulated retail sales data for grouping, time series |
| weather_data.json | JSON | Unstructured data for parsing, cleaning, and visualization |
| bank_loans.xlsx | Excel | Tabular data for filtering, EDA, and feature engineering |
| bank_loans_multisheet.xlsx | Excel | Multi-sheet structure for advanced Excel parsing |
| covid_data.parquet | Parquet | Efficient columnar data for joins and time-based analysis |

🛠 These datasets are not from public sources and were created to demonstrate the versatility of pandas across different formats and data challenges. You can find them in the data/ folder.
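As a minimal sketch of how the formats in the table map onto pandas readers, the helper below picks a reader from the file extension. The name `load_dataset` is hypothetical and is not taken from the project's scripts/ folder; Excel support assumes openpyxl is installed and Parquet support assumes pyarrow, both of which the project lists among its tools.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Load a file into a DataFrame, choosing the reader from its extension.

    Hypothetical helper for illustration; not part of the project's scripts/.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".json":
        return pd.read_json(path)
    if suffix in {".xlsx", ".xls"}:
        # Reads the first sheet only; passing sheet_name=None to read_excel
        # would instead return a dict with one DataFrame per sheet.
        return pd.read_excel(path)
    if suffix == ".parquet":
        return pd.read_parquet(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

For example, `load_dataset("data/superstore_sales.csv")` would return the simulated retail sales table, assuming the code runs from the repository root.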


✅ Modules and Concepts

| Notebook | Concepts |
| --- | --- |
| 01_data_loading.ipynb | Load data, inspect structure, parse dates |
| 02_data_cleaning.ipynb | Handle missing values, type conversion, string ops |
| 03_aggregation_grouping.ipynb | GroupBy, pivot, window functions |
| 04_merging_joining.ipynb | Merge, concat, index joins |
| 05_time_series.ipynb | Resample, rolling, timezone handling |
| 06_advanced_pandas.ipynb | .apply(), .map(), method chaining, memory tuning |
| 07_visualization_with_pandas.ipynb | Bar, line, box, grouped plots |
| 08_final_pipeline.ipynb | End-to-end data workflow pipeline |
| 09_reporting_exporting.ipynb | Export to Excel/CSV/Parquet, styled reports |
| 10_performance_diagnostics.ipynb | Profiling, eval(), categorical, Dask |
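To give a feel for three of the topics listed above (grouping, merging, and resampling), here is a short sketch on a tiny inline table. The data, column names, and manager names are invented for illustration and do not come from the project's datasets or notebooks.

```python
import pandas as pd

# Hypothetical sales records; column names are illustrative only
sales = pd.DataFrame(
    {
        "order_date": pd.to_datetime(
            ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-14"]
        ),
        "region": ["East", "West", "East", "West"],
        "amount": [100.0, 250.0, 50.0, 300.0],
    }
)
managers = pd.DataFrame({"region": ["East", "West"], "manager": ["Ana", "Raj"]})

# Aggregation (notebook 03 topic): total amount per region
by_region = sales.groupby("region", as_index=False)["amount"].sum()

# Merging (notebook 04 topic): attach the manager for each region
report = by_region.merge(managers, on="region", how="left")

# Time series (notebook 05 topic): monthly totals from a DatetimeIndex
monthly = sales.set_index("order_date")["amount"].resample("MS").sum()

print(report)
print(monthly)
```

The left join keeps every region from the aggregated table even if a manager is missing, which is usually the safer default when enriching a report.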

📚 Learning Outcomes

  • Develop fluency with pandas core APIs
  • Build modular, reusable data pipelines
  • Understand performance bottlenecks in large datasets
  • Practice version-controlled and containerized data science
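One common memory bottleneck is a text column with only a few distinct values. The sketch below, on synthetic data, compares the footprint of such a column before and after conversion to the `category` dtype; exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Synthetic column: four distinct city names repeated many times
cities = pd.Series(["Austin", "Boston", "Chicago", "Denver"] * 25_000)

# deep=True counts the memory held by the string values themselves
as_text = cities.memory_usage(deep=True)
as_category = cities.astype("category").memory_usage(deep=True)

print(f"text dtype: {as_text:,} bytes")
print(f"category:   {as_category:,} bytes")
```

The categorical version stores each distinct string once plus a small integer code per row, which is why it tends to be much smaller for low-cardinality columns.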


📦 Installation

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

๐Ÿณ Run with Docker (Optional)

# Build image
docker build -t pandasplayground .

# Run container on http://localhost:8899
docker run -pd 8899:8888 -v $(pwd):/app pandasplayground

💻 Using the Project

  • 🔄 Want to use this on your own data? Start with 08_final_pipeline.ipynb
  • 🧩 Reuse functions from scripts/ for your own ETL workflows
  • 🐳 Use Dockerfile to run in an isolated, reproducible environment

🧰 Tools & Libraries

  • pandas
  • numpy
  • matplotlib, seaborn
  • Jupyter, JupyterLab
  • openpyxl, pyarrow
  • memory_profiler, psutil
  • Streamlit (for interactive dashboards)

🔗 Related Projects

  • 🧮 NumPyMasterPro – Master NumPy with modular walkthroughs


🤝 How to Contribute or Fork

Whether you're fixing a bug, suggesting an enhancement, or adding new learning notebooks, contributions are welcome and appreciated!

🔀 Fork & Clone the Repository

# Step 1: Fork this repository on GitHub
# Step 2: Clone your fork locally (replace <your-username> with your GitHub username)
git clone https://github.com/<your-username>/PandasPlayground.git
cd PandasPlayground

🌱 Create a Feature Branch

Always create a new branch for your changes instead of working on main:

git checkout -b feature/your-feature-name

🛠 Make Your Changes

  • Add your improvements (e.g., a new notebook, function in scripts/, or fixes in requirements.txt)
  • Follow consistent formatting, naming, and markdown style as used across the project
  • Update the README.md or cheatsheets if your change impacts the documentation
  • Test your code locally (if it includes logic)
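For changes that include logic, a local test can be as small as the sketch below. The helper `fill_missing_with_median` is a hypothetical example of a function one might add to scripts/; `pandas.testing.assert_frame_equal` is the standard pandas utility for comparing two DataFrames.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def fill_missing_with_median(df, column):
    """Hypothetical helper: fill NaN in `column` with that column's median."""
    out = df.copy()
    out[column] = out[column].fillna(out[column].median())
    return out


def test_fill_missing_with_median():
    df = pd.DataFrame({"score": [1.0, None, 3.0]})
    expected = pd.DataFrame({"score": [1.0, 2.0, 3.0]})
    assert_frame_equal(fill_missing_with_median(df, "score"), expected)
    # The input frame should be left untouched
    assert df["score"].isna().sum() == 1


test_fill_missing_with_median()
```

A function written this way can also be collected by a test runner such as pytest, assuming one is installed in the development environment.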

✅ Commit and Push

git add .
git commit -m "✨ Added: Short summary of your feature"
git push origin feature/your-feature-name

📩 Submit a Pull Request

  • Go to your fork on GitHub
  • Click "Compare & pull request"
  • Provide a clear and concise description of your changes
  • If applicable, reference any related issue (e.g., Fixes #12)
  • Wait for review or feedback

🧪 Contribution Tips

  • Keep changes modular and atomic: one feature or fix per pull request

  • Be sure to sync your fork with the upstream repository periodically:

    git remote add upstream https://github.com/SatvikPraveen/PandasPlayground.git
    git pull upstream main
  • If your feature involves code, prefer writing reusable functions in scripts/ and importing them in your notebooks


🙏 Thank You

Every contribution, no matter how small, helps improve this resource for the entire data science community. Let's build this playground together! 🎉


📜 License

This project is licensed under the GNU General Public License v3.0. See the LICENSE.md file for more details.


🙋‍♂️ Author

Built with 💻 and ☕ by Satvik Praveen. Drop a ⭐ if you find this project helpful!
