This project is a server log analysis system that uses Apache Spark, Kafka, and geolocation data to process, clean, and analyze server logs in real time.
- Real-time log ingestion from Kafka
- Log data cleaning and validation
- Geolocation enrichment
- Processing with Apache Spark
- Log Parsing and Cleaning
- Server Log Analysis Dashboard
- Web Log Anomaly Detection
  - Detects suspicious status codes
  - Identifies unusual HTTP methods
  - Tracks high-traffic IP addresses
  - Monitors suspicious user agents
- scripts/: Core processing scripts
  - main.py: Main entry point
  - kafka_producer.py: Kafka log producer
  - processing/: Data cleaning modules
  - analysis/: Log analytics
  - helper/: Utility functions
- web/: Web interface components
- data/: Processed log data storage
- GeoLite_data/: Geolocation database
- Apache Spark
- Apache Kafka
- Python
- Geolocation Analysis
- Streaming Data Processing
- Python 3.8+
- Java 8 or higher
- Apache Kafka
- Apache Spark
- Apache Hadoop
- GeoLite2 City database
git clone https://github.com/yassirsalmi/server-log-analysis.git
cd server-log-analysis
python3 -m venv server-analysis-env
source server-analysis-env/bin/activate
pip install -r requirements.txt
- Download the GeoLite2 City database from MaxMind
- Place the database at GeoLite_data/GeoLite2-City.mmdb
Modify the following parameters in scripts/main.py:
- kafka_bootstrap_servers: Kafka broker address
- kafka_topic: Kafka topic for log ingestion
- hdfs_output_path: Output path for cleaned logs
- geoip_db_path: Path to GeoLite2 City database
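As a rough illustration, the four parameters might be set as shown below. Every value is a placeholder assumed for a local single-node setup: the topic name and HDFS path are hypothetical, and the `validate_config` helper is not part of the project. How scripts/main.py actually stores these settings may differ.

```python
# Hypothetical values for a local, single-node setup; adjust to your environment.
kafka_bootstrap_servers = "localhost:9092"                # 9092 is Kafka's default port
kafka_topic = "server-logs"                               # assumed topic name
hdfs_output_path = "hdfs://localhost:9000/logs/cleaned"   # assumed HDFS location
geoip_db_path = "GeoLite_data/GeoLite2-City.mmdb"         # path from the setup step


def validate_config(bootstrap_servers, topic, output_path, db_path):
    """Return a list of problems found; an empty list means the values look sane.

    This helper is illustrative only and does not exist in the project.
    """
    problems = []
    # Kafka accepts a comma-separated list of host:port pairs.
    for entry in bootstrap_servers.split(","):
        host, sep, port = entry.strip().rpartition(":")
        if not sep or not host or not port.isdigit():
            problems.append(f"bad broker address: {entry!r}")
    if not topic:
        problems.append("kafka_topic is empty")
    if not output_path:
        problems.append("hdfs_output_path is empty")
    if not db_path.endswith(".mmdb"):
        problems.append("geoip_db_path should point to a .mmdb file")
    return problems


if __name__ == "__main__":
    print(validate_config(kafka_bootstrap_servers, kafka_topic,
                          hdfs_output_path, geoip_db_path))
```

Checking the values before launching the pipeline surfaces a malformed broker address or a wrong database path early, rather than partway through startup.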
./startup.sh
This script will:
- Start Kafka and Zookeeper services
- Start Hadoop services
- Launch Kafka producer
- Start main log processing
- Initialize web application
The /analysis/anomalies endpoint provides a log anomaly detection service.
Query Parameters:
- log_path (optional): Path to the log file. Defaults to the project's default log file.
Response Example:
{
"total_anomalies": 5,
"anomalies": [
{
"type": "Suspicious Status Code",
"ip": "192.168.1.100",
"status_code": 403,
"endpoint": "/admin",
"timestamp": "2024-01-15T10:30:45+00:00"
},
...
]
}
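A minimal client sketch for this endpoint, using only the Python standard library. It assumes the endpoint answers GET requests and that the web application listens on `http://localhost:5000`; neither the HTTP method nor the host and port is stated here, so treat both as placeholders. The function names are illustrative.

```python
import json
from collections import Counter
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # assumption: host and port are not documented here


def build_anomalies_url(base_url: str, log_path: Optional[str] = None) -> str:
    """Build the endpoint URL, adding log_path only when the caller supplies it."""
    url = base_url.rstrip("/") + "/analysis/anomalies"
    if log_path is not None:
        url += "?" + urlencode({"log_path": log_path})
    return url


def fetch_anomalies(base_url: str = BASE_URL, log_path: Optional[str] = None) -> dict:
    """Call the endpoint (assumed to accept GET) and decode the JSON body."""
    with urlopen(build_anomalies_url(base_url, log_path)) as response:
        return json.load(response)


def count_by_type(payload: dict) -> Counter:
    """Tally anomalies by their "type" field, as found in the response."""
    return Counter(item["type"] for item in payload.get("anomalies", []))
```

With the service running, `count_by_type(fetch_anomalies())` would give a quick per-type summary of the anomalies reported for the default log file.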
- Suspicious Status Codes (401, 403, 500, etc.)
- Unusual HTTP Methods (DELETE, PUT)
- High Request Rate per IP
- Suspicious User Agents
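The four checks above can be sketched as simple rules over parsed log records. This is not the project's implementation: the record field names, the user-agent markers, the per-IP threshold, and every type label except "Suspicious Status Code" are assumptions made for illustration.

```python
from collections import Counter

# Codes and methods named in this README; the "etc." is deliberately not expanded.
SUSPICIOUS_STATUS_CODES = {401, 403, 500}
UNUSUAL_METHODS = {"DELETE", "PUT"}
# Assumptions for illustration only: marker substrings and the per-IP threshold.
SUSPICIOUS_AGENT_MARKERS = ("sqlmap", "nikto")
MAX_REQUESTS_PER_IP = 100


def detect_anomalies(records, max_requests_per_ip=MAX_REQUESTS_PER_IP):
    """Apply four rule-based checks to parsed log records (a list of dicts)."""
    anomalies = []
    for record in records:
        base = {
            "ip": record["ip"],
            "endpoint": record["endpoint"],
            "timestamp": record["timestamp"],
        }
        if record["status_code"] in SUSPICIOUS_STATUS_CODES:
            anomalies.append({"type": "Suspicious Status Code",
                              "status_code": record["status_code"], **base})
        if record["method"].upper() in UNUSUAL_METHODS:
            anomalies.append({"type": "Unusual HTTP Method",
                              "method": record["method"], **base})
        agent = record.get("user_agent", "").lower()
        if any(marker in agent for marker in SUSPICIOUS_AGENT_MARKERS):
            anomalies.append({"type": "Suspicious User Agent",
                              "user_agent": record["user_agent"], **base})

    # Request rate is a property of the whole batch, so it is checked once per IP.
    for ip, count in Counter(r["ip"] for r in records).items():
        if count > max_requests_per_ip:
            anomalies.append({"type": "High Request Rate",
                              "ip": ip, "request_count": count})

    return {"total_anomalies": len(anomalies), "anomalies": anomalies}
```

The return value mirrors the shape of the response example (`total_anomalies` plus an `anomalies` list), so the same consumer code could handle either. A single record can trigger more than one rule and is then reported once per rule.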