This project demonstrates real-world data sourcing and cleaning techniques using structured and unstructured data. It includes profiling, validation, transformation, and storage across different formats such as CSV, TXT, and SQL.
- Sourced and cleaned raw text into structured CSV
- Analyzed and visualized using charts and summaries
- Stored cleaned data in .db and .csv formats
- Performed cleaning and transformation on books.csv
- Included outlier detection, column standardization, and data checks
- Python + Jupyter Notebooks (pandas, matplotlib)
- SQL for data storage and retrieval
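
The cleaning steps listed above (column standardization, outlier detection, and data checks) can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the sample columns, the helper names, and the choice of the IQR rule for outliers are all assumptions.

```python
import pandas as pd


def standardize_columns(df):
    """Trim, lower-case, and snake_case the column names (hypothetical helper)."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
    )
    return out


def flag_iqr_outliers(series, k=1.5):
    """Return a boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


def basic_checks(df):
    """Summarize row count, missing values per column, and duplicate rows."""
    return {
        "rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Made-up sample standing in for books.csv; the real columns may differ.
raw = pd.DataFrame({
    " Title ": ["A", "B", "C", "D", "E", "E"],
    "Num Pages": [200, 220, 210, 190, 5000, 5000],
})

clean = standardize_columns(raw).drop_duplicates()
clean["pages_outlier"] = flag_iqr_outliers(clean["num_pages"])
report = basic_checks(clean)
```

Flagging outliers in a separate column, rather than dropping them, keeps the decision visible and reversible. Whether the notebooks flag or remove such rows is not stated here.
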
- Clone the repo:
  git clone https://github.com/arun-data-analyst/Data-Sourcing-and-Cleaning.git
  cd Data-Sourcing-and-Cleaning
- Install the Python packages:
  pip install -r requirements.txt
- Open the notebooks in Jupyter Lab or any Python IDE
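
For a quick feel of the storage and retrieval step before opening the notebooks, here is a minimal sketch that writes the same rows to a .db file and a .csv file, then reads them back with SQL. It assumes the .db file is a SQLite database and uses a hypothetical books table with made-up rows; the notebooks may do this differently, for example through pandas.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Made-up cleaned rows; the real data and columns may differ.
HEADER = ("title", "author", "year")
rows = [
    ("Dune", "Frank Herbert", 1965),
    ("Emma", "Jane Austen", 1815),
]


def save_to_sqlite(db_path, records):
    """Create (or replace) a books table and insert the records."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DROP TABLE IF EXISTS books")
            conn.execute(
                "CREATE TABLE books (title TEXT, author TEXT, year INTEGER)"
            )
            conn.executemany("INSERT INTO books VALUES (?, ?, ?)", records)
    finally:
        conn.close()


def save_to_csv(csv_path, records):
    """Write a header row followed by the records."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(HEADER)
        writer.writerows(records)


def titles_before(db_path, year):
    """Retrieve titles published before the given year, oldest first."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT title FROM books WHERE year < ? ORDER BY year", (year,)
        )
        return [title for (title,) in cur.fetchall()]
    finally:
        conn.close()


# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    db_file = Path(tmp) / "cleaned.db"
    csv_file = Path(tmp) / "cleaned.csv"
    save_to_sqlite(db_file, rows)
    save_to_csv(csv_file, rows)
    older = titles_before(db_file, 1900)
    csv_lines = csv_file.read_text(encoding="utf-8").splitlines()
```

The parameterized query (the `?` placeholder) lets SQLite handle quoting of the year value, which is a safer habit than building SQL strings by hand.
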
Arun Acharya
Data Analyst in training | Willis College