This project demonstrates a fundamental ETL (Extract, Transform, Load) data pipeline pattern using Python and a PostgreSQL database, all running as isolated services orchestrated by Docker Compose. The key focus is on managing a stateful service (the database) and ensuring data persistence across container lifecycles using Docker Volumes.
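At a high level, the Compose file for such a layout looks roughly like the sketch below. This is a minimal illustration rather than the project's actual `docker-compose.yml`: the `producer-app` and `db` service names, image tag, environment variables, and build context are assumptions, while the named `db-data` volume, the official PostgreSQL image, and the producer/consumer split follow the description in this README.

```yaml
# Illustrative docker-compose.yml sketch -- names and values are assumptions.
services:
  db:
    image: postgres:16                      # official PostgreSQL image (tag is an assumption)
    environment:
      POSTGRES_USER: pipeline
      POSTGRES_PASSWORD: pipeline
      POSTGRES_DB: pipeline_db
    volumes:
      - db-data:/var/lib/postgresql/data    # named volume so data outlives the container

  producer-app:
    build: .                                # shared Python application image
    command: python producer.py
    depends_on:
      - db

  consumer-app:
    build: .
    command: python consumer.py
    depends_on:
      - db
      - producer-app

volumes:
  db-data:                                  # the named volume declared for persistence
```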
- Stateful Services in Docker: Successfully managed a PostgreSQL database, a stateful service where data persistence is critical.
- Data Persistence with Docker Volumes: Utilized a named Docker Volume (`db-data`) to store the PostgreSQL data on the host machine, ensuring that data survives container restarts and re-creations.
- Containerized Microservices: Built a multi-container application simulating a real-world pipeline with distinct services: a data `producer`, a `database`, and a data `consumer`.
- Service-to-Service Networking: Established reliable communication between Python application containers and the database container using Docker's internal networking and service names.
- Resilient Application Design: Implemented a connection retry loop in the Python scripts, making them robust against timing issues where an application might start before the database is fully initialized.
- Dependency Management: Managed Python library dependencies (`psycopg2`, `Faker`) using a `requirements.txt` file for reproducible builds.
- Idempotent Database Setup: Used the `CREATE TABLE IF NOT EXISTS` SQL command to ensure the database schema setup can be run multiple times without causing errors. (A combined sketch of the connection retry loop and this idempotent setup follows this list.)
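As referenced above, a producer script combining the connection retry loop and the idempotent schema setup could look roughly like this. It is a sketch, not the repository's actual `producer.py`: the table name, columns, environment-variable names, and retry parameters are all assumptions.

```python
# Illustrative producer sketch -- table, columns, and connection settings are assumptions.
import os
import time

import psycopg2
from faker import Faker


def connect_with_retry(retries=10, delay=3):
    """Keep retrying so the script tolerates starting before PostgreSQL is ready."""
    for attempt in range(1, retries + 1):
        try:
            return psycopg2.connect(
                host=os.getenv("DB_HOST", "db"),  # the Compose service name doubles as the hostname
                dbname=os.getenv("POSTGRES_DB", "pipeline_db"),
                user=os.getenv("POSTGRES_USER", "pipeline"),
                password=os.getenv("POSTGRES_PASSWORD", "pipeline"),
            )
        except psycopg2.OperationalError:
            print(f"Database not ready (attempt {attempt}/{retries}), retrying...")
            time.sleep(delay)
    raise RuntimeError("Could not connect to the database")


def main():
    fake = Faker()
    conn = connect_with_retry()
    with conn, conn.cursor() as cur:
        # Idempotent schema setup: safe to execute on every run.
        cur.execute(
            "CREATE TABLE IF NOT EXISTS users ("
            " id SERIAL PRIMARY KEY,"
            " name TEXT NOT NULL,"
            " email TEXT NOT NULL)"
        )
        # Insert a small batch of fake users generated with Faker.
        for _ in range(5):
            cur.execute(
                "INSERT INTO users (name, email) VALUES (%s, %s)",
                (fake.name(), fake.email()),
            )
    conn.close()


if __name__ == "__main__":
    main()
```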
- Containerization: Docker, Docker Compose
- Database: PostgreSQL (Official Docker Image)
- Programming Language: Python 3
- Key Python Libraries: `psycopg2-binary` (for PostgreSQL connection), `Faker` (for test data generation)
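Since both libraries are installed from `requirements.txt` inside the application image, the file itself can be as small as the following sketch (shown unpinned here; the actual file may pin specific versions for reproducibility):

```text
psycopg2-binary
Faker
```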
Prerequisites: Docker and Docker Compose must be installed.
- Clone the repository:
  `git clone https://github.com/YogeshT22/local-docker-data-pipeline`
  `cd local-docker-data-pipeline`
- Run the entire pipeline in detached mode: This command will build the Python application image and start all three services (producer, consumer, and database) in the correct order.
  `docker-compose up --build -d`
- Wait for the pipeline to execute: Allow about 20-30 seconds for the producer and consumer scripts to complete their work in the background.
- Check the consumer's output: To see the final report generated by the `consumer.py` script, view its logs:
  `docker-compose logs consumer-app`
  On the first run, you should see a report of 5 users found in the database.
- Test Data Persistence: Run the `up` command again to simulate a second data batch; the named volume preserves the data from the first run.
  `docker-compose up -d`
  After waiting a few seconds, check the consumer logs again. You should see a report of 10 users (the original 5 plus 5 new ones), proving that the data was successfully persisted in the Docker Volume.
- Clean Up: To stop and remove the containers, the network, and the persisted data volume, run:
  `docker-compose down -v`
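Before removing anything, you can optionally confirm that the named volume exists on the host with Docker's volume commands. Docker Compose prefixes the volume name with the project name (by default the directory name), so the exact name below is an assumption and may differ on your machine:

```bash
# List all volumes; look for one ending in "_db-data".
docker volume ls

# Inspect it to see where the data lives on the host (name is illustrative).
docker volume inspect local-docker-data-pipeline_db-data
```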
This project is licensed under the MIT License - see the LICENSE file for details.