A Retrieval-Augmented Generation (RAG) system for automating Security Operations Center (SOC) log analysis.
This project implements a RAG system that combines semantic log retrieval with generative response capabilities to automate SOC log analysis. The system applies NLP techniques (tokenization, named entity recognition, and sentence embeddings) to process, analyze, and generate insights from security logs.
- Log Preprocessing: Tokenization and Named Entity Recognition (NER) via spaCy
- Semantic Search: Vectorization with Sentence Transformers and scalable storage in pgVector
- Response Generation: Query processing with LangChain and DeepSeek
- Interactive Interface: Streamlit application for SOC analysts
- Visualization: Elasticsearch integration for Kibana dashboards
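As a rough illustration of the preprocessing step, the sketch below tokenizes a raw log line and extracts coarse entities with regular expressions. The real pipeline uses spaCy's tokenizer and NER model; the regexes and the sample log line here are made-up stand-ins.

```python
import re

# Regex stand-ins for the spaCy tokenization/NER used by the real pipeline.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
USER_RE = re.compile(r"user[= ](\w+)", re.IGNORECASE)

def preprocess(log_line: str) -> dict:
    """Tokenize a log line and extract coarse entities (IPs, usernames)."""
    return {
        "tokens": log_line.split(),
        "ips": IP_RE.findall(log_line),
        "users": USER_RE.findall(log_line),
    }

sample = "Failed password for user=admin from 203.0.113.7 port 22 ssh2"
entities = preprocess(sample)
print(entities["ips"], entities["users"])
```

The structured output (tokens plus extracted entities) is what feeds the vectorization stage.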
Log Ingestion → Preprocessing → Vectorization → Query Processing → Interface/Visualization
- NLP: spaCy, Sentence Transformers
- Vector Storage: pgVector
- Query Processing: LangChain, DeepSeek with OpenAI SDK
- Interface: Streamlit
- Visualization: Elasticsearch, Fluentd, Kibana
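To show the shape of the semantic-search step, the dependency-free sketch below ranks a toy corpus by cosine similarity. The 3-d vectors are fabricated stand-ins for Sentence Transformer embeddings, and pgVector performs the equivalent ranking in SQL with its cosine-distance operator.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-d vectors standing in for Sentence Transformer embeddings.
corpus = {
    "failed ssh login from external ip": [0.9, 0.1, 0.2],
    "database backup completed":         [0.1, 0.8, 0.3],
    "repeated authentication failures":  [0.8, 0.2, 0.1],
}
query_vec = [0.85, 0.15, 0.15]  # pretend embedding of "why are logins failing?"

ranked = sorted(corpus, key=lambda doc: cosine(query_vec, corpus[doc]), reverse=True)
print(ranked[0])
```

In the actual system the query embedding is compared against stored log embeddings inside Postgres, so the ranking happens in the database rather than in Python.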
# Clone the repository
git clone https://github.com/MuhamedAyoub/RealTime-RAG-CyberSecurity_Analyst.git
cd RealTime-RAG-CyberSecurity_Analyst
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Start the services defined in compose.yml
docker compose up -d
Create a config.ini file in the project root with the following variables:
PGVECTOR_CONNECTION_STRING=postgresql://user:password@localhost:5432/soc_logs
ELASTICSEARCH_HOST=http://localhost:9200
DEEPSEEK_API_KEY=your_api_key_here
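A note on loading these values: the file as shown has no `[section]` header, which Python's configparser requires, so a sketch like the one below prepends a default section before parsing (the values are the placeholders from above).

```python
import configparser

# Inline copy of the config.ini contents shown above (placeholder values).
raw = """\
PGVECTOR_CONNECTION_STRING=postgresql://user:password@localhost:5432/soc_logs
ELASTICSEARCH_HOST=http://localhost:9200
DEEPSEEK_API_KEY=your_api_key_here
"""

# configparser needs a section header, so prepend a default one.
parser = configparser.ConfigParser()
parser.read_string("[DEFAULT]\n" + raw)

es_host = parser["DEFAULT"]["ELASTICSEARCH_HOST"]
api_key = parser["DEFAULT"]["DEEPSEEK_API_KEY"]
print(es_host)
```

In the application itself, `raw` would come from reading the config.ini file on disk rather than an inline string.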
streamlit run app.py --server.fileWatcherType none
Enter queries like:
- "What caused the recent login failures?"
- "Show me all failed SSH attempts from external IPs"
- "Analyze access patterns for the database server"
For visualization queries, the system will:
- Process the query
- Retrieve relevant logs
- Index them in Elasticsearch
- Return structured data for Kibana visualization
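The indexing step can be pictured as building a body for Elasticsearch's Bulk API: one NDJSON action line followed by one document line per log. The sketch below only constructs that payload (no cluster connection); the `soc-logs` index name and the document fields are assumptions for illustration.

```python
import json

def bulk_payload(docs: list[dict], index: str = "soc-logs") -> str:
    """Build an NDJSON body for the Elasticsearch Bulk API:
    one action line followed by one document line per log."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the Bulk API requires a trailing newline

docs = [{"message": "failed ssh login", "src_ip": "203.0.113.7"}]
payload = bulk_payload(docs)
print(payload)
```

Once the documents are indexed, Kibana dashboards can be built directly on top of the `soc-logs` index pattern.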
soc-log-rag/
├── app.py # Streamlit application
├── src/
│ ├── preprocessing.py # Log preprocessing
│ ├── tokenizing.py # Tokenization for incoming logs
│ ├── embedding.py # Embedding and storing logs in pgVector
│ ├── llm.py # DeepSeek LLM and LangChain RetrievalQA
├── config.ini
├── logs/
│ └── logs.md # Sample SOC logs for testing
├── reports/
│ └── report.pdf # Project report
├── compose.yml # Docker Compose file
├── requirements.txt # Python dependencies
└── README.md # This file
- Real-time log streaming with Fluentd
- Multi-modal analysis (combining logs with network data)
- Data anonymization for sensitive information
- Secure storage with encrypted connections
- Bias-aware response generation
Ameri Mohamed Ayoub · LinkedIn · Email
The Higher School of Computer Science Engineering ESI-SBA, May 2025