Master data manipulation with pandas, from fundamentals to advanced performance tuning, using realistic synthetic datasets and modular notebooks.
PandasPlayground is designed to help aspiring data scientists and analysts master the pandas ecosystem through hands-on, progressive, fully documented Jupyter notebooks. Each module targets a specific capability, from data loading and cleaning to advanced transformations and memory profiling, so the collection works as both a learning path and a review reference.
Preview: Explore datasets, visualize trends, and profile performance, all in one playground!
PandasPlayground/
├── assets/               # Charts, exports, and visual output images
├── cheatsheets/          # Markdown-based reference sheets (e.g., pandas_cheatsheet.md)
├── data/                 # Raw datasets (CSV, Excel, JSON, Parquet)
├── exports/              # Final output files (CSV, Excel, styled reports)
├── notebooks/            # All 10 learning notebooks (01–10)
├── pages/                # Streamlit multipage app (expanded)
├── pandas_env/           # Local virtual environment (⚠️ add to .gitignore)
├── scripts/              # Modular reusable utility functions
├── Dockerfile            # Docker support for reproducible environments
├── LICENSE.md
├── README.md             # You're here!
├── requirements.txt      # Minimal dependencies to run the project
├── requirements_dev.txt  # Full dev environment
└── STREAMLIT_App.py      # Interactive dashboard using Streamlit
This project uses artificially generated datasets designed to replicate common real-world scenarios. Each file highlights a different aspect of data handling and analysis with pandas.
Dataset File | Format | Purpose |
---|---|---|
superstore_sales.csv | CSV | Simulated retail sales data for grouping and time series |
weather_data.json | JSON | Unstructured data for parsing, cleaning, and visualization |
bank_loans.xlsx | Excel | Tabular data for filtering, EDA, and feature engineering |
bank_loans_multisheet.xlsx | Excel | Multi-sheet structure for advanced Excel parsing |
covid_data.parquet | Parquet | Efficient columnar data for joins and time-based analysis |
These datasets are not from public sources; they were created to demonstrate the versatility of pandas across different formats and data challenges. You can find them in the data/ folder.
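As a minimal sketch of how formats like these are typically loaded, the example below reads CSV and nested JSON from in-memory samples. The column names and values are hypothetical stand-ins and are not taken from the files in data/.

```python
import io

import pandas as pd

# Hypothetical stand-in for a few rows of a retail sales CSV;
# the column names are illustrative, not taken from the project files.
csv_text = """order_date,region,sales
2023-01-05,West,120.50
2023-01-06,East,80.00
2023-02-01,West,42.25
"""

# parse_dates converts the named column to a datetime dtype at load time.
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date"])

# Hypothetical nested JSON records, similar in spirit to weather readings.
records = [
    {"city": "Austin", "reading": {"temp_c": 31.0, "humidity": 40}},
    {"city": "Boston", "reading": {"temp_c": 18.5, "humidity": 65}},
]

# json_normalize flattens nested keys into dotted column names.
weather = pd.json_normalize(records)

print(sales.dtypes)
print(weather.columns.tolist())
```

For the Excel and Parquet files, `pd.read_excel(path, sheet_name=None)` returns a dict of DataFrames keyed by sheet name (it needs openpyxl for .xlsx), and `pd.read_parquet(path)` needs an engine such as pyarrow.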
Notebook | Concepts |
---|---|
01_data_loading.ipynb | Load data, inspect structure, parse dates |
02_data_cleaning.ipynb | Handle missing values, type conversion, string ops |
03_aggregation_grouping.ipynb | GroupBy, pivot, window functions |
04_merging_joining.ipynb | Merge, concat, index joins |
05_time_series.ipynb | Resample, rolling, timezone handling |
06_advanced_pandas.ipynb | .apply(), .map(), method chaining, memory tuning |
07_visualization_with_pandas.ipynb | Bar, line, box, grouped plots |
08_final_pipeline.ipynb | End-to-end data workflow pipeline |
09_reporting_exporting.ipynb | Export to Excel/CSV/Parquet, styled reports |
10_performance_diagnostics.ipynb | Profiling, eval(), categorical, Dask |
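To give a flavor of several concepts from the table (merge, GroupBy, resample, and a grouped window function), here is a small sketch on a made-up orders table. The data and column names are assumptions for illustration and do not come from the notebooks.

```python
import pandas as pd

# Hypothetical orders table; names and values are illustrative only.
orders = pd.DataFrame(
    {
        "order_date": pd.to_datetime(
            ["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-17"]
        ),
        "region_id": [1, 2, 1, 2],
        "sales": [100.0, 50.0, 70.0, 30.0],
    }
)
regions = pd.DataFrame({"region_id": [1, 2], "region": ["West", "East"]})

# Merge: attach region names with a left join on the shared key.
merged = orders.merge(regions, on="region_id", how="left")

# GroupBy: total and mean sales per region.
by_region = merged.groupby("region")["sales"].agg(["sum", "mean"])

# Resample: monthly totals on a DatetimeIndex ("MS" = month start).
monthly = merged.set_index("order_date")["sales"].resample("MS").sum()

# Window function: running total of sales within each region.
merged["running_sales"] = merged.groupby("region")["sales"].cumsum()

print(by_region)
print(monthly)
```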
- Develop fluency with pandas core APIs
- Build modular, reusable data pipelines
- Understand performance bottlenecks in large datasets
- Practice version-controlled and containerized data science
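One common memory bottleneck is a string column with few distinct values. The sketch below, using a hypothetical status column, compares memory use before and after converting it to the categorical dtype. Exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical low-cardinality column: 4 distinct labels repeated many times.
labels = ["approved", "rejected", "pending", "review"] * 25_000
df = pd.DataFrame({"status": labels})

# deep=True includes the memory of the string values themselves.
before = df["status"].memory_usage(deep=True)

# Categorical stores each distinct label once plus small integer codes.
df["status"] = df["status"].astype("category")
after = df["status"].memory_usage(deep=True)

print(f"strings: {before:,} bytes, category: {after:,} bytes")
```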
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Build image
docker build -t pandasplayground .
# Run container on http://localhost:8899
docker run -d -p 8899:8888 -v "$(pwd)":/app pandasplayground
- Want to use this on your own data? Start with 08_final_pipeline.ipynb
- 🧩 Reuse functions from scripts/ for your own ETL workflows
- 🐳 Use the Dockerfile to run in an isolated, reproducible environment
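Reusable ETL steps of the kind mentioned above can be written as plain functions and chained with `DataFrame.pipe`. This is only a sketch: `standardize_columns` and `drop_missing` are hypothetical helpers, not functions that exist in scripts/.

```python
import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lowercase, underscore-separated column names."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    return out


def drop_missing(df: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    """Drop rows that lack a value in any of the required columns."""
    return df.dropna(subset=required)


# Hypothetical raw input standing in for your own data.
raw = pd.DataFrame(
    {"Loan ID": [1, 2, 3], " Loan Amount": [5000.0, None, 12000.0]}
)

# .pipe keeps each step a plain function, so steps can be reused and tested.
clean = raw.pipe(standardize_columns).pipe(drop_missing, required=["loan_amount"])

print(clean)
```

Keeping each step free of side effects (each returns a new DataFrame) makes the steps easy to reorder or reuse across notebooks.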
- pandas
- numpy
- matplotlib, seaborn
- Jupyter, JupyterLab
- openpyxl, pyarrow
- memory_profiler, psutil
- Streamlit (for interactive dashboards)
- 🧮 NumPyMasterPro: Master NumPy with modular walkthroughs
Whether you're fixing a bug, suggesting an enhancement, or adding new learning notebooks, contributions are welcome and appreciated!
# Step 1: Fork this repository on GitHub
# Step 2: Clone your fork locally
git clone https://github.com/<your-username>/PandasPlayground.git
cd PandasPlayground
Always create a new branch for your changes instead of working on main:
git checkout -b feature/your-feature-name
- Add your improvements (e.g., a new notebook, a function in scripts/, or fixes in requirements.txt)
- Follow the formatting, naming, and markdown style used across the project
- Update README.md or the cheatsheets if your change affects the documentation
- Test your code locally (if it includes logic)
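For contributions that include logic, a local test can be as small as the sketch below. `add_sales_share` is a hypothetical helper used only for illustration, and the project may prefer a different test layout or runner.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def add_sales_share(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add each row's share of total sales."""
    out = df.copy()
    out["share"] = out["sales"] / out["sales"].sum()
    return out


def test_add_sales_share() -> None:
    df = pd.DataFrame({"sales": [25.0, 75.0]})
    expected = pd.DataFrame({"sales": [25.0, 75.0], "share": [0.25, 0.75]})
    result = add_sales_share(df)
    # assert_frame_equal checks values, dtypes, index, and column labels.
    assert_frame_equal(result, expected)
    # The helper should not mutate its input.
    assert list(df.columns) == ["sales"]


test_add_sales_share()
```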
git add .
git commit -m "✨ Added: Short summary of your feature"
git push origin feature/your-feature-name
- Go to your fork on GitHub
- Click "Compare & pull request"
- Provide a clear and concise description of your changes
- If applicable, reference any related issue (e.g., Fixes #12)
- Wait for review or feedback
- Keep changes modular and atomic: one feature or fix per pull request
- Sync your fork with the upstream repository periodically:
  git remote add upstream https://github.com/SatvikPraveen/PandasPlayground.git
  git pull upstream main
- If your feature involves code, prefer writing reusable functions in scripts/ and importing them in your notebooks
Every contribution, no matter how small, helps improve this resource for the entire data science community. Let's build this playground together!
This project is licensed under the GNU General Public License v3.0. See the LICENSE.md file for more details.
Built with 💻 and ☕ by Satvik Praveen. Drop a ⭐ if you find this project helpful!