This project transforms and cleans transaction data from a cafe by removing redundant records, improving data quality, and persisting intermediate results. It leverages data manipulation techniques in Pandas, with a focus on enhancing data integrity and storing cleaned snapshots for future analysis.
- Detect and remove duplicate or inconsistent records
- Convert invalid data types to numeric where applicable
- Replace or handle missing values
- Improve overall data structure and clarity
- Python
- Pandas
- Jupyter Notebook (for development and visualization)
The dataset was cleaned and transformed incrementally, with each step saved as a `.pkl` file for reproducibility and version control.
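A minimal sketch of that workflow, assuming the raw export is a CSV named `cafe_sales.csv` (a hypothetical filename, not confirmed by the project):

```python
import pandas as pd

# Load the raw export (hypothetical filename).
df = pd.read_csv("cafe_sales.csv")

# ... apply the transformations for a given step ...

# Persist the intermediate result; the next step resumes from this snapshot.
df.to_pickle("data_step1.pkl")
df = pd.read_pickle("data_step1.pkl")
```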
- `data_step1.pkl` (sketched after this list)
  - Set all `Price Per Unit` values for each item correctly
  - Converted `Quantity`, `Price Per Unit`, and `Total Spent` to numeric types
  - Replaced non-numeric values with `NaN` (using `pd.to_numeric` with `errors='coerce'`)
  - Imputed missing values in `Quantity` by dividing `Total Spent` by `Price Per Unit`
  - Updated missing values in `Total Spent` by multiplying `Quantity` and `Price Per Unit`
- `data_step2.pkl` (sketched after this list)
  - Identified and replaced invalid `Item` values (`UNKNOWN`, `ERROR`, `NaN`)
  - Used `Quantity` and `Price Per Unit` to infer the most likely `Item` based on frequency
  - Removed redundant or duplicate rows after corrections
- `data_step3.pkl` (sketched after this list)
  - Cleaned the `Payment Method` column by replacing `UNKNOWN`, `ERROR`, and missing values
  - Used `Item`, `Quantity`, and `Price Per Unit` to infer the most likely payment method
  - Ensured consistency based on historical purchase patterns
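The core of step 1 (the numeric conversion and imputation bullets) can be sketched as below. This is a simplified illustration assuming the working DataFrame is named `df`; the notebook's actual code may differ.

```python
import pandas as pd

def clean_numeric_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Step 1 sketch: coerce numeric columns, then fill gaps from the other two."""
    df = df.copy()

    # Convert to numeric; anything non-numeric (e.g. 'ERROR') becomes NaN.
    for col in ["Quantity", "Price Per Unit", "Total Spent"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # Quantity = Total Spent / Price Per Unit where Quantity is missing.
    missing_qty = df["Quantity"].isna()
    df.loc[missing_qty, "Quantity"] = (
        df.loc[missing_qty, "Total Spent"] / df.loc[missing_qty, "Price Per Unit"]
    )

    # Total Spent = Quantity * Price Per Unit where Total Spent is missing.
    missing_total = df["Total Spent"].isna()
    df.loc[missing_total, "Total Spent"] = (
        df.loc[missing_total, "Quantity"] * df.loc[missing_total, "Price Per Unit"]
    )
    return df
```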
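A rough sketch of the step 2 inference, keyed on `Price Per Unit` alone for brevity (the notebook also considers `Quantity`); the helper name and structure are illustrative, not the project's exact code:

```python
import numpy as np
import pandas as pd

def infer_items(df: pd.DataFrame) -> pd.DataFrame:
    """Step 2 sketch: replace invalid Item values with the most frequent item
    observed at the same Price Per Unit, then drop duplicate rows."""
    df = df.copy()

    # Treat placeholder strings as missing.
    df["Item"] = df["Item"].replace(["UNKNOWN", "ERROR"], np.nan)

    # Most frequent (mode) item for each price point, computed from valid rows only.
    mode_by_price = (
        df.dropna(subset=["Item"])
        .groupby("Price Per Unit")["Item"]
        .agg(lambda s: s.mode().iloc[0])
    )

    # Fill missing items from the lookup, then remove redundant rows.
    missing = df["Item"].isna()
    df.loc[missing, "Item"] = df.loc[missing, "Price Per Unit"].map(mode_by_price)
    return df.drop_duplicates()
```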
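Step 3 follows the same mode-based pattern, this time grouping by `Item` to infer `Payment Method` (again a simplified sketch under the same assumptions):

```python
import numpy as np
import pandas as pd

def infer_payment_method(df: pd.DataFrame) -> pd.DataFrame:
    """Step 3 sketch: fill invalid Payment Method values with the method most
    often seen for the same Item in the historical data."""
    df = df.copy()
    df["Payment Method"] = df["Payment Method"].replace(["UNKNOWN", "ERROR"], np.nan)

    # Most common payment method per item, from rows with a valid value.
    mode_by_item = (
        df.dropna(subset=["Payment Method"])
        .groupby("Item")["Payment Method"]
        .agg(lambda s: s.mode().iloc[0])
    )

    missing = df["Payment Method"].isna()
    df.loc[missing, "Payment Method"] = df.loc[missing, "Item"].map(mode_by_item)
    return df
```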
Each step is saved as `data_stepN.pkl`, where `N` indicates the transformation phase.
- Data validation and type conversion using `pd.to_numeric()`
- Filtering rows with conditions (`isna()`, `notna()`)
- Creating new DataFrames from cleaned Series
- Good practices in data preprocessing for analysis
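For example, the filtering and rebuilding pattern looks roughly like this (assuming a working DataFrame `df`; `spend_summary` is a made-up name used only for illustration):

```python
import pandas as pd

# Keep only rows whose Total Spent survived the numeric conversion...
valid_rows = df[df["Total Spent"].notna()]

# ...and inspect the ones that did not, to decide how to impute them.
broken_rows = df[df["Total Spent"].isna()]

# Build a fresh DataFrame from a cleaned Series when only one column is needed.
spend_summary = pd.DataFrame({"Total Spent": valid_rows["Total Spent"].round(2)})
```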
The final cleaned DataFrame is ready for further use in dashboards, analysis, or machine learning tasks.
Feel free to fork or use it as a reference in your own data projects!