Mini Project 2: Containerized ETL Data Pipeline

This project demonstrates a fundamental ETL (Extract, Transform, Load) data pipeline pattern using Python and a PostgreSQL database, all running as isolated services orchestrated by Docker Compose. The key focus is on managing a stateful service (the database) and ensuring data persistence across container lifecycles using Docker Volumes.


Core Concepts Demonstrated

  • Stateful Services in Docker: Managed a PostgreSQL database, a stateful service where data persistence is critical.
  • Data Persistence with Docker Volumes: Utilized a named Docker Volume (db-data) to store the PostgreSQL data on the host machine, ensuring that data survives container restarts and re-creations (see the compose sketch after this list).
  • Containerized Microservices: Built a multi-container application simulating a real-world pipeline with distinct services: a data producer, a database, and a data consumer.
  • Service-to-Service Networking: Established reliable communication between Python application containers and the database container using Docker's internal networking and service names.
  • Resilient Application Design: Implemented a connection retry loop in the Python scripts, making them robust against timing issues where an application might start before the database is fully initialized (see the Python sketch after the Technologies list).
  • Dependency Management: Managed Python library dependencies (psycopg2, Faker) using a requirements.txt file for reproducible builds.
  • Idempotent Database Setup: Used the CREATE TABLE IF NOT EXISTS SQL command to ensure the database schema setup can be run multiple times without causing errors.
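
The volume and networking ideas above meet in the Compose file itself. The sketch below shows the general shape such a docker-compose.yml can take; apart from the db-data volume and the consumer-app service name, which appear elsewhere in this README, the service names, image tag, and credentials are illustrative assumptions, not the repository's actual configuration:

    version: "3.8"

    services:
      db:
        image: postgres:15                    # official PostgreSQL image (tag assumed)
        environment:
          POSTGRES_USER: pipeline             # assumed credentials
          POSTGRES_PASSWORD: pipeline
          POSTGRES_DB: pipeline
        volumes:
          - db-data:/var/lib/postgresql/data  # named volume: data outlives the container

      producer-app:                           # service name assumed
        build: .
        command: python producer.py
        depends_on:
          - db                                # start order only; the retry loop handles readiness

      consumer-app:
        build: .
        command: python consumer.py
        depends_on:
          - db

    volumes:
      db-data:                                # declares the named volume

Inside the Compose network, the Python containers reach PostgreSQL simply as host db: Docker's internal DNS resolves each service name to its container.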

Technologies Used

  • Containerization: Docker, Docker Compose
  • Database: PostgreSQL (Official Docker Image)
  • Programming Language: Python 3
  • Key Python Libraries: psycopg2-binary (for PostgreSQL connection), Faker (for test data generation)
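
These two libraries, combined with the retry loop and idempotent schema setup from the Core Concepts list, cover most of what the producer needs. The following is a minimal sketch of that pattern; the host name db, the credentials, and the users table are assumptions carried over from the compose sketch above, not code copied from the repository:

    import time

    import psycopg2
    from faker import Faker

    def connect_with_retry(retries=10, delay=3):
        """Retry until PostgreSQL accepts connections; Compose may start us first."""
        for attempt in range(1, retries + 1):
            try:
                return psycopg2.connect(
                    host="db",            # the Compose service name, not an IP or localhost
                    dbname="pipeline",    # assumed credentials
                    user="pipeline",
                    password="pipeline",
                )
            except psycopg2.OperationalError:
                print(f"Database not ready (attempt {attempt}/{retries}); retrying...")
                time.sleep(delay)
        raise RuntimeError("could not connect to the database")

    conn = connect_with_retry()
    fake = Faker()
    with conn, conn.cursor() as cur:        # the connection context commits on success
        # Idempotent schema setup: safe to run on every start.
        cur.execute(
            "CREATE TABLE IF NOT EXISTS users "
            "(id SERIAL PRIMARY KEY, name TEXT, email TEXT)"
        )
        for _ in range(5):                  # one 5-user batch per run, as in the steps below
            cur.execute(
                "INSERT INTO users (name, email) VALUES (%s, %s)",
                (fake.name(), fake.email()),
            )
    conn.close()

Connecting by the service name rather than localhost is what makes the container-to-container networking work, and the retry loop absorbs the window in which the PostgreSQL container is running but the server is still initializing.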

How to Run the Pipeline

Prerequisites: Docker and Docker Compose must be installed.

  1. Clone the repository:

    git clone https://github.com/YogeshT22/local-docker-data-pipeline
    cd local-docker-data-pipeline
  2. Run the entire pipeline in detached mode: This command will build the Python application image and start all three services (producer, consumer, and database) in the correct order.

    docker-compose up --build -d
  3. Wait for the pipeline to execute: Allow about 20-30 seconds for the producer and consumer scripts to complete their work in the background.

  4. Check the consumer's output: To see the final report generated by the consumer.py script, view its logs:

    docker-compose logs consumer-app

    On the first run, you should see a report of 5 users found in the database. (A minimal sketch of such a consumer script appears after these steps.)

  5. Test Data Persistence: Run the up command again to add a second batch of data; the named volume preserves the data from the first run.

    docker-compose up -d

    Now, check the consumer logs again after waiting a few seconds. You should see a report of 10 users (the original 5 plus 5 new ones), proving that the data was successfully persisted in the Docker Volume.

  6. Clean Up: To stop and remove the containers, network, and persisted data volume (omit the -v flag to keep the volume and its data for future runs), run:

    docker-compose down -v
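
For context on the reports read in steps 4 and 5 above, the consumer only has to connect the same way and count rows. A minimal sketch under the same assumed names as the producer sketch (retry loop omitted for brevity):

    import psycopg2

    conn = psycopg2.connect(
        host="db", dbname="pipeline", user="pipeline", password="pipeline"
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM users")
        (total,) = cur.fetchone()
        print(f"Report: {total} users found in the database")
        cur.execute("SELECT name, email FROM users ORDER BY id")
        for name, email in cur.fetchall():
            print(f"  - {name} <{email}>")
    conn.close()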

Architecture Diagram

See architecture-diagram.png in the repository for a visual overview of the producer, database, and consumer services and the db-data volume.

License

This project is licensed under the MIT License - see the LICENSE file for details.
