Welcome to the Distributed Log Aggregation and Analytics System project! This repository showcases a system that collects logs from multiple Linux virtual machines (VMs) within a custom Google Cloud Platform (GCP) Virtual Private Cloud (VPC), processes them using Python, stores and analyzes the data with GCP services, and provides insights through a Flask API. The setup is automated with Bash scripts and demonstrates advanced GCP networking concepts.
This project is perfect for those looking to explore cloud computing, Linux, Python, and GCP services in depth. However, it’s not suitable for beginners without a strong foundation in these areas. Please ensure you fully understand the concepts before attempting to set up or modify this project.
- Project Overview
- Architecture
- Prerequisites
- Setup Instructions
- API Endpoints
- Testing and Validation
- Google Cloud Resources for New Users
- ⚠️ Important Notice for New Users
This project builds a distributed system to:
- Generate logs on two VMs: `web-server` and `app-server`.
- Transfer logs to a central `processor-server` VM for processing.
- Process logs with Python, store them in Cloud Storage, and publish events to Pub/Sub.
- Load processed data into BigQuery using a Cloud Function.
- Provide log analytics through a Flask API hosted on `processor-server`.
The system runs within a custom VPC with isolated subnets and firewall rules, highlighting GCP’s networking capabilities.
The architecture is illustrated below, showing the flow from log generation to analytics.
Project Architecture
The above architecture represents a robust, secure, and scalable log processing pipeline on Google Cloud Platform (GCP), leveraging a custom VPC with isolated Web, App, and Processor subnets to ensure security and high availability. Logs from `web-server` and `app-server` are securely transferred to `processor-server` over SSH (port 22), with firewall rules restricting access to trusted IPs only. The `processor-server` processes logs using `process_logs.py` and uploads the transformed data to Cloud Storage, while also publishing structured logs to Pub/Sub for downstream analytics. Cloud Functions automate workflows, and BigQuery enables real-time log analysis. The Flask API provides controlled external access. With network segmentation, even if one subnet is compromised, the others remain protected, ensuring strong isolation, security, and fault tolerance in this cloud-native design. 🚀
Key Components:
- VMs: `web-server` and `app-server` generate logs; `processor-server` handles processing and hosts the API.
- GCP Services: Cloud Storage for storage, Pub/Sub for messaging, BigQuery for analytics, and Cloud Functions for serverless data loading.
- Networking: A custom VPC with three subnets for resource isolation.
Click here for the project architecture flow chart.
To get started, you’ll need:
- A Google Cloud account with billing enabled.
- Familiarity with Linux, Bash, and Python.
- Basic knowledge of GCP services: Compute Engine, Cloud Storage, Pub/Sub, BigQuery, and Cloud Functions.
- The Google Cloud SDK (`gcloud`) installed locally (see the quick check below).
- Understanding of VPC networking and IAM permissions.
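
A quick way to confirm the SDK is ready, as a minimal sketch (your authentication flow may differ, e.g. if you use a service account):

```bash
# Confirm the SDK is installed and authenticate your account
gcloud --version
gcloud auth login
```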
1. Set Up Project and Enable APIs:
   - Create a GCP project or use an existing one.
   - Enable the necessary APIs:

     ```bash
     gcloud services enable compute.googleapis.com storage.googleapis.com pubsub.googleapis.com bigquery.googleapis.com cloudfunctions.googleapis.com
     ```
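
   If you are starting from a brand-new project, a minimal sketch of creating and selecting it could look like this (`YOUR_PROJECT_ID` and the zone are placeholder assumptions):

   ```bash
   # Create and select a project (make sure billing is linked before enabling paid services)
   gcloud projects create YOUR_PROJECT_ID
   gcloud config set project YOUR_PROJECT_ID
   gcloud config set compute/zone us-central1-a
   ```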
2. Configure Networking:
   - Create a custom VPC with three subnets.
   - Add firewall rules for SSH and HTTP access.
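
   A rough sketch of this step (the network and subnet names, CIDR ranges, region, and source IP ranges are placeholder assumptions; adapt them to your own design):

   ```bash
   # Custom-mode VPC with one subnet per tier
   gcloud compute networks create log-vpc --subnet-mode=custom
   gcloud compute networks subnets create web-subnet --network=log-vpc --region=us-central1 --range=10.0.1.0/24
   gcloud compute networks subnets create app-subnet --network=log-vpc --region=us-central1 --range=10.0.2.0/24
   gcloud compute networks subnets create processor-subnet --network=log-vpc --region=us-central1 --range=10.0.3.0/24

   # Restrict SSH and the Flask API port to a trusted IP
   gcloud compute firewall-rules create allow-ssh --network=log-vpc --allow=tcp:22 --source-ranges=YOUR_TRUSTED_IP/32
   gcloud compute firewall-rules create allow-flask-api --network=log-vpc --allow=tcp:8080 --source-ranges=YOUR_TRUSTED_IP/32
   ```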
3. Launch VMs:
   - Deploy `web-server`, `app-server`, and `processor-server` in their respective subnets.
   - Assign a service account to `processor-server` with appropriate IAM roles.
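
   For example, the processor VM might be created along these lines (the zone, machine type, image, and service account email are placeholder assumptions):

   ```bash
   # processor-server in its subnet, with a service account attached
   gcloud compute instances create processor-server \
     --zone=us-central1-a \
     --machine-type=e2-small \
     --image-family=debian-12 --image-project=debian-cloud \
     --subnet=processor-subnet \
     --service-account=YOUR_SERVICE_ACCOUNT_EMAIL \
     --scopes=cloud-platform
   ```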
4. Generate Logs:
   - Install Bash scripts on `web-server` and `app-server` to generate logs.
   - Configure systemd to run these scripts as services.
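
   A hypothetical systemd setup for such a script (the unit name and the path `/opt/log-gen/generate_logs.sh` are assumptions, not part of this repo):

   ```bash
   # /etc/systemd/system/log-generator.service could contain, for example:
   #   [Unit]
   #   Description=Synthetic log generator
   #   [Service]
   #   ExecStart=/opt/log-gen/generate_logs.sh
   #   Restart=always
   #   [Install]
   #   WantedBy=multi-user.target

   # Then reload systemd and enable the service so it survives reboots
   sudo systemctl daemon-reload
   sudo systemctl enable --now log-generator.service
   ```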
5. Process Logs:
   - Use `gcloud compute scp` to transfer logs to `processor-server`.
   - Run a Python script to process logs, upload them to Cloud Storage, and publish to Pub/Sub.
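
   Run from `processor-server`, the transfer and processing might look roughly like this (the remote log paths and zone are placeholder assumptions):

   ```bash
   # Pull the latest logs from the generator VMs into /tmp/
   gcloud compute scp web-server:/var/log/webapp/web.log /tmp/web.log --zone=us-central1-a
   gcloud compute scp app-server:/var/log/webapp/app.log /tmp/app.log --zone=us-central1-a

   # Process the logs, upload CSVs to Cloud Storage, and publish events to Pub/Sub
   python3 process_logs.py
   ```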
6. Store and Analyze Data:
   - Deploy a Cloud Function to load data from Pub/Sub into BigQuery.
   - Set up a Flask API on `processor-server` to query BigQuery.
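
   Deploying the Pub/Sub-triggered function could look roughly like this (the function name, entry point, runtime, source directory, and topic are placeholder assumptions):

   ```bash
   # Cloud Function triggered by the Pub/Sub topic, loading messages into BigQuery
   gcloud functions deploy load-logs-to-bq \
     --runtime=python311 \
     --trigger-topic=YOUR_TOPIC_NAME \
     --entry-point=load_to_bigquery \
     --source=./cloud-function \
     --region=us-central1
   ```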
7. Automate:
   - Create a Bash script (`orchestrate.sh`) on `processor-server` to automate log transfer and processing.
   - Schedule it with cron to run every 5 minutes.
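
   The schedule can be added with `crontab -e`, for example (the script path and log file are placeholder assumptions):

   ```bash
   # Run the orchestration script every 5 minutes and capture its output
   */5 * * * * /home/YOUR_USER/orchestrate.sh >> /tmp/orchestrate.log 2>&1
   ```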
Refer to the project’s detailed guide for specific commands.
- GET `/errors/<log_type>`: Fetches recent errors from BigQuery.
  - `<log_type>`: `web` or `app`.
  - Example: `curl http://[PROCESSOR-SERVER-IP]:8080/errors/web`
- Log Generation: Confirm logs are created on `web-server` and `app-server`.
- Log Transfer: Check that logs reach `/tmp/` on `processor-server`.
- Processing: Verify processed CSVs in Cloud Storage.
- BigQuery: Query the `logs` table to ensure data is loaded.
- API: Test the Flask API with `curl` to retrieve errors.
- Network: Optionally, monitor VPC flow logs for traffic analysis.
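
A few spot checks along these lines can help (the bucket, project, and dataset names are placeholders; the `logs` table name comes from the steps above):

```bash
# Processed CSVs landed in Cloud Storage?
gsutil ls gs://YOUR_BUCKET_NAME/

# Rows loaded into BigQuery?
bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM `YOUR_PROJECT.YOUR_DATASET_NAME.logs`'

# Flask API reachable?
curl http://[PROCESSOR-SERVER-IP]:8080/errors/web
```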
To prevent unnecessary charges, ensure you delete all resources after use. Use the following commands as a guideline, modifying them based on your specific project setup:
```bash
# Delete Compute Engine instances; in our case: app-server, web-server, processor-server
gcloud compute instances delete INSTANCE_NAME_1 INSTANCE_NAME_2 INSTANCE_NAME_3 --zone=YOUR_ZONE

# Remove the Cloud Storage bucket
gsutil rm -r gs://YOUR_BUCKET_NAME

# Delete Pub/Sub topics
gcloud pubsub topics delete YOUR_TOPIC_NAME

# Remove BigQuery datasets
bq rm -r YOUR_DATASET_NAME

# Delete Cloud Functions
gcloud functions delete YOUR_FUNCTION_NAME

# Remove VPC networks
gcloud compute networks delete YOUR_VPC_NAME
```
Note: Replace the placeholders (YOUR_ZONE, YOUR_BUCKET_NAME, etc.) with your actual resource names, and always double-check dependencies before deletion to avoid unintended disruptions.
If you’re new to cloud computing, explore these resources first:
- Google Cloud Free Tier: Learn about free usage and costs.
- GCP Fundamentals: Get started with GCP basics.
- Compute Engine Documentation: Manage VMs.
- Cloud Storage Documentation: Handle data storage.
- Pub/Sub Documentation: Understand messaging.
- BigQuery Documentation: Analyze data.
- Cloud Functions Documentation: Use serverless computing.
- VPC Networking: Master GCP networking.
This project involves multiple Google Cloud Platform (GCP) services and complex networking configurations. If not handled properly, it can lead to significant costs and operational challenges.
- Carefully assess the billing implications of each service.
- Follow proper cleanup procedures to avoid unnecessary expenses.
- Ensure you have a solid understanding of Linux, Bash, Python, and GCP fundamentals.
- Use the GCP Pricing Calculator to estimate costs accurately.
This is not a step-by-step tutorial. Some essential configurations—such as service accounts, permissions, and resource allocations—are not explicitly detailed here. You must adapt the project to suit your specific requirements, including:
- Server locations and regions
- Memory and storage capacities
- Other necessary infrastructure settings
If you're unsure about any aspect of this project, do not proceed without gaining the required knowledge. For further guidance, feel free to reach out via my blog.
This project is licensed under the MIT License. See the LICENSE file for details.