# Automating Processes with Dataproc on Google Cloud Platform to Leverage Its Computing Power
This repository contains a simplified, demonstration version of a real-world GCP project. It illustrates the structure, configuration, and code required to automate processes using the Google Cloud Platform Dataproc service.
## Repository Structure

| File/Folder | Purpose |
|---|---|
| `scripts/` | PySpark scripts that extract data from multiple sources and load it into BigQuery target tables (see the sketch below the table). Note: the production environment contains 10+ scripts; this demo includes only a subset for clarity. |
| `jars/` | In the real project, this folder contains `gcs-connector-hadoop2-2.1.1.jar` and `spark-3.2-bigquery-0.30.0.jar`, which are required by the process. Note: the jars are not included in the repository to reduce its size; users can download them from the original websites indicated in the .txt documents in this folder. |
| `miniconda3/` | In the real project, this folder contains `Miniconda3-py39_23.5.2-0-Linux-x86_64.sh`, which is required by the process. Note: the Miniconda installer is not included in the repository to reduce its size; users can download it from the original website indicated in the .txt document in this folder. |
| `accp.yaml` | Pipeline configuration file defining build instructions for the container image, including the image name, build context, and other ACCP pipeline parameters. Referenced during pipeline initialization (see the sketch below the table). |
| `Dockerfile` | Image build specification defining the base image, system dependencies, and Python packages required to execute the scripts. The Miniconda installation step is defined here. Referenced in `accp.yaml` (see the sketch below the table). |
| `dataproc_serverless_01_dag.py` | Apache Airflow DAG that orchestrates and schedules the end-to-end process. It references the container image (via the `container_image` parameter) built from the files above and defines the execution logic and scheduling parameters (see the sketch after the workflow steps below). |
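For context, here is a minimal sketch of what one of the scripts in `scripts/` might look like. The bucket, dataset, and table names are hypothetical placeholders, and the real scripts, sources, and transformations are not included here; this only illustrates the extract-and-load pattern using the two connector jars listed above.

```python
# Minimal PySpark sketch: extract from a GCS source, load into BigQuery.
# All bucket, dataset, and table names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("extract_and_load_demo")
    .getOrCreate()
)

# Extract: read a source file from GCS (requires the GCS connector jar).
df = (
    spark.read
    .option("header", True)
    .csv("gs://example-source-bucket/input/sales.csv")
)

# Transform: a trivial example transformation.
df = df.withColumn("load_ts", F.current_timestamp())

# Load: write to a BigQuery target table (requires the Spark BigQuery connector jar).
(
    df.write
    .format("bigquery")
    .option("table", "example_dataset.sales_target")
    .option("temporaryGcsBucket", "example-temp-bucket")
    .mode("overwrite")
    .save()
)
```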
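ACCP appears to be an internal build pipeline, so its actual schema is not shown here. The YAML below is only an illustrative sketch of the kind of fields described above (image name, build context); every key and value is a hypothetical stand-in, not ACCP's real syntax.

```yaml
# Illustrative sketch only: ACCP's real schema is internal, so every key
# and value here is a hypothetical stand-in for the fields described above.
image:
  name: dataproc-serverless-demo   # name of the container image to build
  tag: latest
build:
  context: .                       # build context passed to the image build
  dockerfile: Dockerfile           # build specification referenced by the pipeline
```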
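A simplified Dockerfile sketch along the lines described above: install Miniconda from the local installer and add the Python packages the scripts need. The base image, package list, and paths are assumptions; Google's documentation for Dataproc Serverless custom containers covers additional requirements omitted here.

```dockerfile
# Simplified sketch: base image, Miniconda install, Python dependencies.
# Base image, package list, and paths are assumptions for illustration.
FROM debian:11-slim

ENV CONDA_HOME=/opt/miniconda3
ENV PATH=${CONDA_HOME}/bin:${PATH}

# Install Miniconda from the installer shipped in the miniconda3/ folder.
COPY miniconda3/Miniconda3-py39_23.5.2-0-Linux-x86_64.sh /tmp/miniconda.sh
RUN bash /tmp/miniconda.sh -b -p ${CONDA_HOME} && rm /tmp/miniconda.sh

# Python packages required by the PySpark scripts (illustrative list).
RUN pip install --no-cache-dir google-cloud-bigquery pandas

# Connector jars from the jars/ folder would be added here as well (illustrative path).
COPY jars/ /opt/spark/jars/
```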
## How It Works

- **Code and Configuration:** Python scripts, configuration files, and dependencies are stored in this repository.
- **Image Build:** The ACCP pipeline uses `accp.yaml` and `Dockerfile` to build a container image with all required dependencies.
- **Data Processing:** The container runs the Python scripts to fetch, transform, and load data into BigQuery.
- **Orchestration:** Airflow triggers the container execution according to the schedule defined in `dataproc_serverless_01_dag.py` (a sketch follows this list).
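Below is a minimal sketch of what the DAG might look like, assuming the Google provider's `DataprocCreateBatchOperator` is used to submit a Dataproc Serverless batch running on the custom image. The project, region, URIs, and schedule are hypothetical placeholders, not values from the real project.

```python
# Minimal Airflow DAG sketch: schedule a Dataproc Serverless batch that runs
# on the custom container image. Project, region, URIs, and the schedule
# below are hypothetical placeholders, not values from the real project.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateBatchOperator,
)

with DAG(
    dag_id="dataproc_serverless_01_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",  # daily at 06:00
    catchup=False,
) as dag:
    run_batch = DataprocCreateBatchOperator(
        task_id="run_pyspark_batch",
        project_id="example-project",
        region="us-central1",
        batch_id="dataproc-serverless-demo-{{ ds_nodash }}",
        batch={
            "pyspark_batch": {
                "main_python_file_uri": "gs://example-bucket/scripts/extract_and_load.py",
            },
            # The custom image built by the ACCP pipeline is referenced here
            # via the container_image parameter mentioned above.
            "runtime_config": {
                "container_image": "gcr.io/example-project/dataproc-serverless-demo:latest",
            },
        },
    )
```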
