
# Automated Processes in Google Cloud Platform Dataproc to Leverage Its Computing Power

This repository contains a simplified, demonstration version of a real-world GCP project. It illustrates the structure, configuration, and code required to automate processes in the Google Cloud Platform Dataproc service.


## Repository Structure

| File/Folder | Purpose |
| --- | --- |
| `scripts/` | PySpark scripts for extracting data from multiple sources and loading it into BigQuery target tables (a minimal sketch appears after this table). *Note: the production environment contains 10+ scripts; this demo includes only a subset for clarity.* |
| `jars/` | In the real project, holds `gcs-connector-hadoop2-2.1.1.jar` and `spark-3.2-bigquery-0.30.0.jar`, which the process requires. *Note: the jars are not included in this repo to keep it small; download them from the original websites indicated in the txt documents in this folder.* |
| `miniconda3/` | In the real project, holds `Miniconda3-py39_23.5.2-0-Linux-x86_64.sh`, which the process requires. *Note: the Miniconda installer is not included in this repo to keep it small; download it from the original website indicated in the txt document in this folder.* |
| `accp.yaml` | Pipeline configuration file defining build instructions for the container image, including the image name, build context, and other ACCP pipeline parameters. Referenced during pipeline initialization. |
| `Dockerfile` | Image build specification defining the base image, system dependencies, and Python packages required to run the scripts; the Miniconda installation step lives here (see the Dockerfile sketch after this table). Referenced in `accp.yaml`. |
| `dataproc_serverless_01_dag.py` | Apache Airflow DAG that orchestrates and schedules the end-to-end process. It references the container image (via the `container_image` parameter) built from the files above and defines the execution logic and scheduling parameters (see the DAG sketch after the Workflow Overview). |
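
The scripts in `scripts/` follow an extract-transform-load pattern. Below is a minimal sketch of one such script; the GCS path, dataset, and table names are hypothetical placeholders, and the BigQuery write relies on the spark-bigquery connector listed in `jars/`.

```python
# extract_and_load_demo.py -- minimal sketch of a scripts/ job.
# All paths and table names are hypothetical; the real scripts read from
# multiple sources and target production BigQuery tables.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("dataproc-demo-extract-load")
    .getOrCreate()
)

# Extract: read raw CSV from GCS (the gcs-connector jar provides gs:// support).
raw = (
    spark.read
    .option("header", "true")
    .csv("gs://example-bucket/raw/orders/*.csv")  # hypothetical source
)

# Transform: light cleanup before loading.
clean = (
    raw.withColumn("order_ts", F.to_timestamp("order_ts"))
       .filter(F.col("order_id").isNotNull())
)

# Load: write to a BigQuery target table via the spark-bigquery connector.
(
    clean.write.format("bigquery")
    .option("table", "example_dataset.orders")           # hypothetical target
    .option("temporaryGcsBucket", "example-tmp-bucket")  # staging bucket
    .mode("overwrite")
    .save()
)

spark.stop()
```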

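As a rough illustration of what the Dockerfile describes, the sketch below installs Miniconda on a slim base image and adds Python packages for the scripts. The base image, package list, and installer version are assumptions for demonstration and may differ from the real project's Dockerfile.

```dockerfile
# Sketch only: base image, Miniconda version, and package list are illustrative.
FROM debian:bullseye-slim

# System dependencies needed to download and run the Miniconda installer.
RUN apt-get update && apt-get install -y --no-install-recommends \
        wget ca-certificates procps \
    && rm -rf /var/lib/apt/lists/*

# Install Miniconda (downloaded at build time rather than stored in the repo,
# mirroring the note in the miniconda3/ folder).
ENV CONDA_HOME=/opt/miniconda3
RUN wget -q https://repo.anaconda.com/miniconda/Miniconda3-py39_23.5.2-0-Linux-x86_64.sh \
        -O /tmp/miniconda.sh \
    && bash /tmp/miniconda.sh -b -p ${CONDA_HOME} \
    && rm /tmp/miniconda.sh
ENV PATH=${CONDA_HOME}/bin:${PATH}

# Python packages required by the scripts (illustrative list).
RUN pip install --no-cache-dir pyspark google-cloud-bigquery
```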
## Workflow Overview

1. **Code and Configuration** — Python scripts, configuration files, and dependencies are stored in this repository.
2. **Image Build** — The ACCP pipeline uses `accp.yaml` and the `Dockerfile` to build a container image with all required dependencies.
3. **Data Processing** — The container runs the PySpark scripts to fetch, transform, and load data into BigQuery.
4. **Orchestration** — Airflow triggers the container execution according to the schedule defined in `dataproc_serverless_01_dag.py` (see the DAG sketch below).
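
As a hedged sketch of step 4, the DAG below submits a Dataproc Serverless batch that runs one of the PySpark scripts inside the custom container image. The operator comes from the Airflow Google provider; the project ID, region, image URI, and schedule are hypothetical placeholders rather than the real project's values.

```python
# dataproc_serverless_01_dag.py -- illustrative sketch, not the production DAG.
# Project, region, image URI, batch ID, and schedule are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateBatchOperator,
)

BATCH_CONFIG = {
    "pyspark_batch": {
        "main_python_file_uri": "gs://example-bucket/scripts/extract_and_load_demo.py",
    },
    "runtime_config": {
        # Custom image built by the ACCP pipeline from accp.yaml + Dockerfile.
        "container_image": "gcr.io/example-project/dataproc-demo:latest",
    },
}

with DAG(
    dag_id="dataproc_serverless_01",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",  # hypothetical daily schedule
    catchup=False,
) as dag:
    run_batch = DataprocCreateBatchOperator(
        task_id="run_extract_and_load",
        project_id="example-project",
        region="us-central1",
        batch=BATCH_CONFIG,
        batch_id="dataproc-demo-batch-{{ ds_nodash }}",
    )
```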

## Dataproc Processes Diagram
