This project sets up a scalable, containerized data pipeline for ingesting, processing, storing, and visualizing sales data. It leverages MinIO (an S3-compatible object store) for raw data storage, ClickHouse for analytics, and Grafana for visualization.
- Sales Raw Data: Collected from various sources and stored in S3.
- Preprocessing and Metric Extraction: Data is cleaned, processed, and transformed into metrics.
- ClickHouse: Stores processed metrics and provides fast analytical queries.
- Grafana: Visualizes the aggregated data using bar charts and dashboards.
- API Interaction: The system can interact with APIs for data input/output.
- Docker: Containerization for the entire pipeline.
- Docker Compose: Orchestration for multiple containers.
- Python: For preprocessing and interacting with S3.
- ClickHouse: Analytical database for storing and querying metrics.
- Grafana: Dashboard tool for data visualization.
- MinIO: S3-compatible object storage for raw data.
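As a quick orientation, the snippet below shows how the Python layer might talk to the two storage services. The endpoint, MinIO API port (9000; the console shown later is 9001), and `minioadmin` credentials are assumptions about the docker-compose defaults, not confirmed settings; adjust them to your setup.

```python
import boto3
import clickhouse_connect

# MinIO speaks the S3 API, so boto3 works against it once the
# endpoint is pointed at the local container (API port, not the console).
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # assumed MinIO API port
    aws_access_key_id="minioadmin",         # assumed default credentials
    aws_secret_access_key="minioadmin",
)

# ClickHouse exposes its HTTP interface on 8123 (see "Verify the services").
ch = clickhouse_connect.get_client(host="localhost", port=8123)

print(s3.list_buckets()["Buckets"])
print(ch.query("SELECT version()").result_rows)
```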
Ensure you have Docker and Docker Compose installed.
project-root/
├── docker-compose.yml
├── backend.Dockerfile          # Custom image for the API server
├── run_app.py # Flask API for interacting with S3 and ClickHouse
├── requirements.txt
├── file.csv # Raw sales data (optional)
├── sales_dashboard.json        # Grafana dashboard
└── diagram.jpg # Architecture diagram
- Build the containers and start the services:
  docker-compose up -d
- Verify the services:
  - MinIO: accessible at http://localhost:9001 (credentials in docker-compose.yml)
  - ClickHouse: accessible at http://localhost:8123
  - Grafana: accessible at http://localhost:3000 (default credentials: admin/admin)
  - API documentation: accessible at http://localhost:5000/swagger-ui/#/
- Create a bucket to start uploading files:
  curl -X POST -H "Content-Type: application/json" -d '{"bucket_name": "bucket-name"}' http://localhost:5000/bucket
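For reference, here is a minimal sketch of what the bucket-creation endpoint in run_app.py might look like, assuming Flask and boto3; the actual implementation may differ, and the MinIO settings are the same assumptions as above.

```python
from flask import Flask, request, jsonify
import boto3

app = Flask(__name__)

# Assumed MinIO connection settings; in the real app these would
# come from environment variables or the .env file.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

@app.route("/bucket", methods=["POST"])
def create_bucket():
    # Matches the curl call above: {"bucket_name": "bucket-name"}
    bucket_name = request.get_json()["bucket_name"]
    s3.create_bucket(Bucket=bucket_name)
    return jsonify({"message": f"bucket '{bucket_name}' created"}), 201

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```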
- Upload raw sales data to the S3 bucket using the API:
  curl -X POST -F "file=@file.csv" 'http://localhost:5000/ingest/sales?bucket_name=bucket-name'
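The same upload can be done from Python with requests, which is handy in scripts. This mirrors the curl call above, assuming the same endpoint and parameters:

```python
import requests

# Upload file.csv as multipart form data, mirroring the curl call above.
with open("file.csv", "rb") as f:
    resp = requests.post(
        "http://localhost:5000/ingest/sales",
        params={"bucket_name": "bucket-name"},
        files={"file": ("file.csv", f)},
    )
print(resp.status_code, resp.text)
```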
- Verify the upload using:
  curl -X GET 'http://localhost:5000/buckets'
  curl -X GET 'http://localhost:5000/objects?bucket_name=bucket-name'
- Start preprocessing. The following endpoint reads the raw data from S3, processes it, and inserts metrics into ClickHouse (a sketch of this step follows the command):
  curl -X POST 'http://localhost:5000/transform/sales?bucket_name=bucket-name&file_name=file.csv'
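Internally, the transform step boils down to: download the CSV from S3, aggregate it into metrics, and insert the result into ClickHouse. Below is a minimal sketch of that flow. The target column names (product_id, total_sales_sum) follow the example query in the next step; everything else, including the raw CSV column names and the libraries used, is an assumption about the implementation:

```python
import io

import boto3
import clickhouse_connect
import pandas as pd

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)
ch = clickhouse_connect.get_client(host="localhost", port=8123)

# 1. Read the raw CSV out of the bucket.
obj = s3.get_object(Bucket="bucket-name", Key="file.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# 2. Clean and aggregate into metrics. The raw column names here
#    ("product_id", "amount") are assumptions about the CSV layout.
df = df.dropna(subset=["product_id", "amount"])
metrics = df.groupby("product_id", as_index=False)["amount"].sum()
metrics = metrics.rename(columns={"amount": "total_sales_sum"})

# 3. Insert into the ClickHouse table queried in the next step.
ch.insert(
    "sales",
    metrics.values.tolist(),
    column_names=["product_id", "total_sales_sum"],
)
```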
- Use ClickHouse SQL queries to analyze the data. Example query:
  SELECT product_id, SUM(total_sales_sum) AS sum_sales
  FROM sales
  GROUP BY product_id
  ORDER BY sum_sales DESC;
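The same query can be run from Python over ClickHouse's HTTP interface. This is a small sketch using the clickhouse-connect client; any ClickHouse client would do:

```python
import clickhouse_connect

ch = clickhouse_connect.get_client(host="localhost", port=8123)
result = ch.query(
    "SELECT product_id, SUM(total_sales_sum) AS sum_sales "
    "FROM sales GROUP BY product_id ORDER BY sum_sales DESC"
)
for product_id, sum_sales in result.result_rows:
    print(product_id, sum_sales)
```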
- Log in to Grafana at http://localhost:3000.
- Import sales_dashboard.json to load the prebuilt dashboard.
- Check logs:
  docker-compose logs <service_name>
- Grafana Not Connecting: Verify ClickHouse is accessible and configured as a data source in Grafana (see the connectivity check after this list).
- Data Not Processing: Ensure the S3 endpoint and ClickHouse host settings are correct.
- Permission Issues: Verify file permissions and ensure all containers have access to the required volumes.
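For the connectivity issues above, a quick programmatic check against each service's health endpoint can narrow things down. ClickHouse exposes /ping on its HTTP port, MinIO exposes /minio/health/live on its API port, and Grafana exposes /api/health; the ports below match the defaults used earlier in this README (MinIO's API port 9000 is assumed, since only the console port 9001 appears above):

```python
import requests

# ClickHouse answers "Ok." on its HTTP interface when healthy.
print("clickhouse:", requests.get("http://localhost:8123/ping").text.strip())

# MinIO's liveness probe returns HTTP 200 when the server is up.
print("minio:", requests.get("http://localhost:9000/minio/health/live").status_code)

# Grafana's health endpoint reports database status as JSON.
print("grafana:", requests.get("http://localhost:3000/api/health").json())
```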
- ClickHouse Queries: Modify or add custom queries in Grafana to fit your analysis needs.
- Grafana Dashboards: Customize dashboards by editing JSON or using the Grafana UI.
- Preprocessing: Adjust the preprocessing endpoint to include additional transformations or validation rules.
- S3 Storage: Configure S3 credentials in the .env file or pass them as environment variables.
- CI/CD Integration: Add CI/CD pipelines for automated deployments and testing.
- Security Enhancements: Implement role-based access controls (RBAC) and secure API endpoints.
- Scalability Improvements: Configure a Kubernetes (K8s) cluster to run these services in a distributed setup that can handle large datasets.
- Monitoring and Alerts: Integrate monitoring tools to track the performance of the pipeline and set up alerts for failures.