This project demonstrates how to perform data analysis using Databricks, focusing on Apple product sales data. It implements extract, transform, load (ETL) workflows that analyze customer transactions, in particular customers who bought AirPods after purchasing iPhones.
The project works with three input files:

- `Customer_Updated.csv`
- `Products_Updated.csv`
- `Transaction_Updated.csv`
- Defines the basic structure for loading data to various destinations (see the sketch after this list).
- Loads data to the Databricks File System (DBFS).
- Loads data to DBFS with partitioning.
- Loads data to a Delta Table on Databricks.
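A minimal sketch of how these loaders might look, assuming a small `DataSink` base class; the class names, parameters, and write modes here are illustrative, not the repository's exact code:

```python
from pyspark.sql import DataFrame


class DataSink:
    """Abstract sink: holds a DataFrame, a destination, and a write mode."""

    def __init__(self, df: DataFrame, path: str, method: str, params=None):
        self.df = df
        self.path = path
        self.method = method  # e.g. "overwrite" or "append"
        self.params = params or {}

    def load_data_frame(self):
        raise NotImplementedError("Subclasses must implement load_data_frame")


class LoadToDBFS(DataSink):
    """Writes the DataFrame to a DBFS path."""

    def load_data_frame(self):
        self.df.write.mode(self.method).save(self.path)


class LoadToDBFSWithPartition(DataSink):
    """Writes the DataFrame to DBFS, partitioned by the given columns."""

    def load_data_frame(self):
        partition_cols = self.params.get("partitionByColumns", [])
        self.df.write.mode(self.method).partitionBy(*partition_cols).save(self.path)


class LoadToDeltaTable(DataSink):
    """Writes the DataFrame to a managed Delta table on Databricks."""

    def load_data_frame(self):
        self.df.write.format("delta").mode(self.method).saveAsTable(self.path)
```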
- Defines the basic structure for reading data from various sources (sketched below).
- Reads data from CSV files.
- Reads data from Parquet files.
- Reads data from ORC files.
- Reads data from Delta Tables.
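The readers can follow the same pattern, with a small factory that maps a format name to the matching class. A hedged sketch, assuming a `DataSource` base class (on Databricks the `spark` session is created for you):

```python
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in Databricks notebooks


class DataSource:
    """Abstract reader: holds a source path and produces a DataFrame."""

    def __init__(self, path: str):
        self.path = path

    def get_data_frame(self) -> DataFrame:
        raise NotImplementedError("Subclasses must implement get_data_frame")


class CSVDataSource(DataSource):
    def get_data_frame(self):
        return spark.read.format("csv").option("header", True).load(self.path)


class ParquetDataSource(DataSource):
    def get_data_frame(self):
        return spark.read.format("parquet").load(self.path)


class ORCDataSource(DataSource):
    def get_data_frame(self):
        return spark.read.format("orc").load(self.path)


class DeltaDataSource(DataSource):
    def get_data_frame(self):
        # For Delta, the path is interpreted as a table name.
        return spark.read.table(self.path)


def get_data_source(data_type: str, path: str) -> DataSource:
    """Factory mapping a format name to the matching reader."""
    sources = {
        "csv": CSVDataSource,
        "parquet": ParquetDataSource,
        "orc": ORCDataSource,
        "delta": DeltaDataSource,
    }
    if data_type not in sources:
        raise ValueError(f"Not implemented for data_type: {data_type}")
    return sources[data_type](path)
```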
- Defines the basic structure for extracting data.
- Extracts the transaction and customer data required for analysis (sketched below).
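A sketch of the extractor, built on the reader factory above; the DBFS paths and dictionary keys are assumptions for illustration:

```python
class AirpodsAfterIphoneExtractor:
    """Reads the transaction and customer CSVs needed by all three analyses."""

    def extract(self):
        transaction_df = get_data_source(
            "csv", "dbfs:/FileStore/tables/Transaction_Updated.csv"
        ).get_data_frame()
        customer_df = get_data_source(
            "csv", "dbfs:/FileStore/tables/Customer_Updated.csv"
        ).get_data_frame()
        return {"transactionDF": transaction_df, "customerDF": customer_df}
```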
- Defines the basic structure for transforming data.
- Identifies customers who bought AirPods after buying an iPhone (sketched after this list).
- Identifies customers who bought only AirPods and iPhones.
- Computes the average time delay between buying an iPhone and buying AirPods.
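The first transformation is a classic "next purchase" pattern: order each customer's transactions by date and check whether an iPhone row is immediately followed by an AirPods row. A sketch, assuming columns named `customer_id`, `transaction_date`, and `product_name`:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F


class FirstTransformer:
    """Customers whose next purchase after an iPhone was AirPods."""

    def transform(self, input_dfs):
        transaction_df = input_dfs["transactionDF"]
        window = Window.partitionBy("customer_id").orderBy("transaction_date")
        with_next = transaction_df.withColumn(
            "next_product", F.lead("product_name").over(window)
        )
        airpods_after_iphone = with_next.filter(
            (F.lower(F.col("product_name")) == "iphone")
            & (F.lower(F.col("next_product")) == "airpods")
        )
        # Join back to customers so the output carries customer details.
        return airpods_after_iphone.join(
            input_dfs["customerDF"], on="customer_id", how="inner"
        )
```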
- Objective: Identifies customers who bought AirPods after buying an iPhone.
- Steps (wired together as sketched below):
  - Extracts the required data using `AirpodsAfterIphoneExtractor`.
  - Transforms the data using `FirstTransformer`.
  - Loads the results into the destination using `AirpodsAfterIphoneLoader`.
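Put together, a workflow is just extract, transform, load in sequence. A sketch of the first one, reusing the pieces above (the loader is sketched later in this README):

```python
class FirstWorkFlow:
    """ETL pipeline for AirPods bought immediately after an iPhone."""

    def runner(self):
        input_dfs = AirpodsAfterIphoneExtractor().extract()   # extract
        result_df = FirstTransformer().transform(input_dfs)   # transform
        AirpodsAfterIphoneLoader(result_df).sink()            # load
```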
- Objective: Identifies customers who bought only AirPods and iPhones.
- Steps:
  - Extracts the required data using `AirpodsAfterIphoneExtractor`.
  - Transforms the data using `SecondTransformer` (sketched below).
  - Loads the results into the destination using `OnlyAirpodsandIphoneLoader`.
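The second transformation collects each customer's distinct products and keeps only those whose product set is exactly {iPhone, AirPods}. A sketch under the same column-name assumptions as before:

```python
from pyspark.sql import functions as F


class SecondTransformer:
    """Customers who bought only AirPods and iPhones, nothing else."""

    def transform(self, input_dfs):
        transaction_df = input_dfs["transactionDF"]
        grouped = transaction_df.groupBy("customer_id").agg(
            F.collect_set("product_name").alias("products")
        )
        only_both = grouped.filter(
            F.array_contains("products", "iPhone")
            & F.array_contains("products", "AirPods")
            & (F.size("products") == 2)
        )
        return only_both.join(input_dfs["customerDF"], on="customer_id", how="inner")
```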
- Objective: Computes the average time delay between buying an iPhone and buying AirPods.
- Steps:
  - Extracts the required data using `AirpodsAfterIphoneExtractor`.
  - Transforms the data using `ThirdTransformer` (sketched below).
  - Loads the results into the destination using `AveragetimeLoader`.
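One way to compute the delay is to pair each iPhone purchase with the customer's earliest later AirPods purchase, then average the gap in days. A sketch; it assumes `transaction_date` is a date or timestamp column:

```python
from pyspark.sql import functions as F


class ThirdTransformer:
    """Average delay, in days, between an iPhone purchase and the next AirPods purchase."""

    def transform(self, input_dfs):
        tx = input_dfs["transactionDF"]
        iphones = tx.filter(F.col("product_name") == "iPhone").select(
            "customer_id", F.col("transaction_date").alias("iphone_date")
        )
        airpods = tx.filter(F.col("product_name") == "AirPods").select(
            "customer_id", F.col("transaction_date").alias("airpods_date")
        )
        # Keep AirPods bought after the iPhone, take the earliest gap per
        # iPhone purchase, then average across all customers.
        pairs = iphones.join(airpods, on="customer_id").filter(
            F.col("airpods_date") > F.col("iphone_date")
        )
        delays = pairs.groupBy("customer_id", "iphone_date").agg(
            F.min(F.datediff("airpods_date", "iphone_date")).alias("delay_days")
        )
        return delays.agg(F.avg("delay_days").alias("avg_delay_days"))
```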
- Objective: Runs the specified workflow.
- Usage (see the sketch below):
  - Instantiate `WorkFlowRunner` with the workflow name (e.g., "FirstWorkFlow", "SecondWorkFlow").
  - Call the `runner` method to execute the workflow.
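A sketch of the runner as a simple name-to-class dispatcher, assuming `SecondWorkFlow` and `ThirdWorkFlow` classes built the same way as the `FirstWorkFlow` sketch above:

```python
class WorkFlowRunner:
    """Dispatches a workflow name to the matching workflow class."""

    def __init__(self, name: str):
        self.name = name

    def runner(self):
        workflows = {
            "FirstWorkFlow": FirstWorkFlow,
            "SecondWorkFlow": SecondWorkFlow,
            "ThirdWorkFlow": ThirdWorkFlow,  # name assumed by analogy
        }
        if self.name not in workflows:
            raise ValueError(f"Not implemented for workflow: {self.name}")
        return workflows[self.name]().runner()
```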
- Defines the basic structure for sinking transformed data.
- Sinks the data into the DBFS path for AirPods-after-iPhone purchases.
- Sinks the data into the DBFS path and a Delta Table for only-AirPods-and-iPhone purchases.
- Sinks the data into the DBFS path for the average time delay (concrete loaders sketched below).
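A sketch of these concrete loaders on top of the sink classes from earlier; the output paths, table name, and partition column are placeholders, not the project's actual locations:

```python
class AirpodsAfterIphoneLoader:
    def __init__(self, df):
        self.df = df

    def sink(self):
        LoadToDBFS(
            self.df, "dbfs:/FileStore/output/airpodsAfterIphone", "overwrite"
        ).load_data_frame()


class OnlyAirpodsandIphoneLoader:
    def __init__(self, df):
        self.df = df

    def sink(self):
        # Written twice: partitioned files on DBFS plus a Delta table.
        LoadToDBFSWithPartition(
            self.df,
            "dbfs:/FileStore/output/onlyAirpodsAndIphone",
            "overwrite",
            params={"partitionByColumns": ["location"]},  # partition column assumed
        ).load_data_frame()
        LoadToDeltaTable(
            self.df, "default.only_airpods_and_iphone", "overwrite"
        ).load_data_frame()


class AveragetimeLoader:
    def __init__(self, df):
        self.df = df

    def sink(self):
        LoadToDBFS(
            self.df, "dbfs:/FileStore/output/averageTimeDelay", "overwrite"
        ).load_data_frame()
```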
- Ensure all data files (`Customer_Updated.csv`, `Products_Updated.csv`, `Transaction_Updated.csv`) are available in DBFS.
- Run the appropriate workflow by instantiating `WorkFlowRunner` with the desired workflow name (see the example below).
- Check the output in the specified DBFS path or Delta Table.
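For example, in a Databricks notebook cell (workflow names as listed above):

```python
WorkFlowRunner("FirstWorkFlow").runner()
WorkFlowRunner("SecondWorkFlow").runner()
```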
- The code uses Spark on Databricks for distributed data processing.
- Ensure that the necessary libraries are installed and that the Spark session is properly configured.
- Implement more complex transformation logic.
- Add support for other data formats (e.g., Avro).
- Integrate with external databases for data loading.