This project allows you to clean and transform datasets from CSV and JSON files using Apache Spark.
To build and run the pipeline you will need:

- Java 11 or later
- Apache Spark 3.x
- SBT (Scala Build Tool)
Clone the repository:

```bash
git clone https://github.com/kooroshkz/data-cleaning-and-transformation-pipeline.git
cd data-cleaning-and-transformation-pipeline
```
Then run the following to resolve the project dependencies:

```bash
sbt clean update
```
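For reference, a minimal `build.sbt` sketch that would pull in Spark looks like the following; the version numbers here are illustrative, so check the repository's actual `build.sbt`:

```scala
// Hypothetical build.sbt sketch; the repository's real build file may differ.
name := "data-cleaning-and-transformation-pipeline"

scalaVersion := "2.12.18" // Spark 3.x is published for Scala 2.12/2.13

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0"
```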
- Place your CSV/JSON files in the `data/` folder.
- Run the pipeline using:

```bash
sbt run
```

The program will automatically pick up the dataset from the `data/` folder, clean it, and save the output as `original_filename_cleaned.csv`.
To clean a specific file, pass its path as an argument (quoted, so that sbt forwards the path to the program instead of treating it as a separate command):

```bash
sbt "run /path/to/your/file.csv"
```

The program will clean that file and save the result as `file_cleaned.csv`.
If there are multiple CSV/JSON files in the `data/` folder, the program will ask you which file you want to clean; if only one file is found, it is processed automatically. A sketch of this selection logic, including the explicit-path case above, follows.
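The sketch below shows one way this file selection could be implemented. It is an illustration rather than the repository's exact code, and the `chooseInput` helper is hypothetical:

```scala
import java.io.File
import scala.io.StdIn

object InputSelection {
  // Hypothetical helper: pick the input file from the program arguments,
  // or from the data/ folder if no path was given.
  def chooseInput(args: Array[String]): File =
    if (args.nonEmpty) new File(args(0)) // an explicit path always wins
    else {
      val candidates = new File("data")
        .listFiles()
        .filter(f => f.getName.endsWith(".csv") || f.getName.endsWith(".json"))
      candidates match {
        case Array(only) => only // exactly one file: process it automatically
        case several =>
          several.zipWithIndex.foreach { case (f, i) => println(s"[$i] ${f.getName}") }
          several(StdIn.readLine("Choose a file to clean: ").trim.toInt)
      }
    }
}
```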
- Loads data: it can load both CSV and JSON files.
- Fills missing values: for numeric columns (e.g., `Age`, `Fare`), missing values are filled with the mean of that column.
- Converts data types: it casts columns such as `Age` and `Fare` to numeric types (`double`).
- Saves cleaned data: the cleaned dataset is saved in CSV format as `original_filename_cleaned.csv`.
For example, if you have `data/titanic.csv`, the program will clean the data and save it as `data/titanic_cleaned.csv` (see the sketch below).
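To make the steps above concrete, here is a minimal Spark sketch of this kind of cleaning pass. It illustrates the approach rather than reproducing the repository's code; the column names `Age` and `Fare` are taken from the example above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

object CleaningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-cleaning-sketch")
      .master("local[*]") // run locally for the sketch
      .getOrCreate()

    // Load the CSV (a .json input would use spark.read.json instead).
    val raw = spark.read.option("header", "true").csv("data/titanic.csv")

    // Cast the numeric columns to double.
    val typed = raw
      .withColumn("Age", col("Age").cast("double"))
      .withColumn("Fare", col("Fare").cast("double"))

    // Fill missing numeric values with each column's mean.
    val means = typed.select(avg("Age"), avg("Fare")).first()
    val cleaned = typed.na.fill(
      Map("Age" -> means.getDouble(0), "Fare" -> means.getDouble(1)))

    // Save the result; note that Spark writes a directory of part files
    // rather than a single CSV file.
    cleaned.write.option("header", "true").csv("data/titanic_cleaned.csv")

    spark.stop()
  }
}
```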
The repository is laid out as follows:

```
.
├── data/                  # Folder containing CSV/JSON files
│   ├── titanic.csv        # Example dataset (CSV)
│   └── another_file.json  # Another example dataset (JSON)
├── src/                   # Source code
│   └── main/scala/        # Scala files for data processing
├── target/                # Output folder for compiled code and results
└── build.sbt              # Build configuration
```