An LLM-powered agent that helps with Kubernetes debugging based on alerts received from monitoring systems.
This application provides an API that can:
- Receive Kubernetes alert webhooks (e.g., from Prometheus Alertmanager).
- Utilize a LangGraph-based agent (`OncallmAgent`) to analyze these alerts.
- The agent uses various tools, including direct Kubernetes API access via `KubernetesService`, to gather context.
- Analyze the gathered information to determine potential root causes and generate recommendations.
- Store analysis reports and provide API endpoints to retrieve them.
- Alert Webhook Integration: Receives alerts from Prometheus Alertmanager.
- LangGraph Powered Analysis: Uses a ReAct agent built with LangGraph for intelligent analysis.
- Direct Kubernetes API Integration: `KubernetesService` interacts directly with your Kubernetes cluster to fetch information about pods, services, deployments, and logs (a rough sketch of this pattern follows this list).
- Automated Analysis: The LLM agent analyzes alert data and cluster information to identify potential root causes.
- Recommendation Generation: Provides actionable recommendations to resolve issues.
- Report Storage & API: Stores analysis reports and offers endpoints to list all reports or fetch specific ones by ID.
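As a rough illustration of this pattern (not the project's actual implementation; the `get_pod_logs` tool and the prompt below are invented for the example), a LangGraph ReAct agent with a Kubernetes-backed tool can be assembled roughly like this:

```python
# Minimal sketch of a ReAct agent with a Kubernetes log tool, built on LangGraph.
# Illustrative only: OncallmAgent and KubernetesService in this repo define their
# own tools, prompts, and report generation.
from kubernetes import client, config
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


@tool
def get_pod_logs(namespace: str, pod_name: str) -> str:
    """Return the last 100 log lines of a pod via the Kubernetes API."""
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    core_v1 = client.CoreV1Api()
    return core_v1.read_namespaced_pod_log(
        name=pod_name, namespace=namespace, tail_lines=100
    )


llm = ChatOpenAI(model="gpt-4-turbo")
agent = create_react_agent(llm, tools=[get_pod_logs])

result = agent.invoke(
    {"messages": [("user", "Pod web-abc in namespace prod is in CrashLoopBackOff. Investigate.")]}
)
print(result["messages"][-1].content)
```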
- Python 3.8+
- Access to a Kubernetes cluster (the agent will use the standard kubeconfig resolution, e.g., `~/.kube/config`, or an in-cluster service account).
- OpenAI API key (or another compatible LLM provider configured in `oncallm/llm_service.py`).
- Docker installed and running.
- Kind installed.
- kubectl installed and configured.
- Helm (if OncaLLM deployment uses Helm - TBD).
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd <repository-directory-name>  # e.g., oncallm-agent
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file by copying from `.env.example` (if provided, otherwise create a new one) with your configuration:

  ```env
  # FastAPI settings
  APP_HOST=0.0.0.0
  APP_PORT=8001  # Default port for oncallm.main

  # Kubernetes settings (optional, if not using default kubeconfig resolution or in-cluster auth)
  # KUBECONFIG_PATH=/path/to/your/kubeconfig

  # LLM settings
  OPENAI_API_KEY=your_openai_api_key
  LLM_MODEL=gpt-4-turbo  # Or your preferred model
  # LLM_API_BASE=your_llm_api_base_if_not_openai_default  # Optional, for self-hosted or proxy

  # Langfuse Observability (Optional)
  # LANGFUSE_PUBLIC_KEY=pk-lf-...
  # LANGFUSE_SECRET_KEY=sk-lf-...
  # LANGFUSE_HOST=https://cloud.langfuse.com  # or your self-hosted instance
  ```
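If you want to check these values from Python while developing, a minimal sketch using `python-dotenv` looks like the following (illustrative only; the variable names match the `.env` above, but the actual settings handling inside the `oncallm` package may differ):

```python
# Illustrative sketch: load the .env values above into the process environment.
# Not the project's actual configuration code.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

app_host = os.getenv("APP_HOST", "0.0.0.0")
app_port = int(os.getenv("APP_PORT", "8001"))
openai_api_key = os.environ["OPENAI_API_KEY"]  # required for LLM calls
llm_model = os.getenv("LLM_MODEL", "gpt-4-turbo")

print(f"Would serve on {app_host}:{app_port} with model {llm_model}")
```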
Run the server with:

```bash
python -m oncallm.main
```

The server will start on the configured host and port (default: `0.0.0.0:8001`).
- GET /: Root endpoint with API information.
- GET /health: Health check endpoint.
- POST /webhook: Endpoint to receive alerts from Alertmanager.
- GET /reports: Lists all analysis reports.
- GET /reports/{report_id}: Retrieves a specific analysis report by its ID.
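Once the server is running, you can exercise these endpoints from Python, for example (a quick smoke test assuming the default host and port; the exact response schema is defined by the service):

```python
# Quick smoke test against a locally running OncaLLM instance.
import requests

BASE_URL = "http://localhost:8001"  # adjust if you changed APP_HOST/APP_PORT

# Health check.
print(requests.get(f"{BASE_URL}/health").json())

# List all stored analysis reports.
reports = requests.get(f"{BASE_URL}/reports").json()
print(reports)

# A specific report can then be fetched with GET /reports/{report_id},
# using an ID taken from the listing above.
```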
To configure Prometheus Alertmanager to send alerts to this service, add the following to your `alertmanager.yml` (and make sure your `route:` section directs the relevant alerts to the `oncallm-webhook` receiver):

```yaml
receivers:
  - name: 'oncallm-webhook'
    webhook_configs:
      - url: 'http://<your-oncallm-service-url>:8001/webhook'  # Replace with actual URL
        send_resolved: true
```
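To exercise the webhook without a live Alertmanager, you can post a request in Alertmanager's webhook payload format yourself. The alert below is entirely made up for illustration; how the service parses it is defined in the `oncallm` code:

```python
# Simulate an Alertmanager webhook delivery to POST /webhook.
# Payload follows Alertmanager's webhook format (version 4); values are illustrative.
import requests

payload = {
    "version": "4",
    "groupKey": "{}:{alertname=\"KubePodCrashLooping\"}",
    "status": "firing",
    "receiver": "oncallm-webhook",
    "groupLabels": {"alertname": "KubePodCrashLooping"},
    "commonLabels": {"alertname": "KubePodCrashLooping", "namespace": "default"},
    "commonAnnotations": {},
    "externalURL": "http://alertmanager.example.com",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "KubePodCrashLooping",
                "namespace": "default",
                "pod": "web-5d9c7b6c9d-abcde",
                "severity": "warning",
            },
            "annotations": {
                "description": "Pod default/web-5d9c7b6c9d-abcde is crash looping.",
            },
            "startsAt": "2024-01-01T00:00:00Z",
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "http://prometheus.example.com/graph",
        }
    ],
}

resp = requests.post("http://localhost:8001/webhook", json=payload)
print(resp.status_code, resp.text)
```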
The project uses `pytest` for unit and API testing.

- To run all unit tests:

  ```bash
  pytest tests/unit
  ```

- To run all API tests:

  ```bash
  pytest tests/api
  ```

- To run all tests (unit and API):

  ```bash
  pytest tests/
  ```

Ensure you have installed the necessary dependencies, including `pytest`, from `requirements.txt`.
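As a rough illustration of what an API test can look like (this is not the project's actual test code, and it assumes the FastAPI instance is exposed as `app` in `oncallm.main`):

```python
# Hypothetical API test sketch using FastAPI's TestClient.
# Assumes oncallm.main exposes the FastAPI application as `app`; adjust if it differs.
from fastapi.testclient import TestClient

from oncallm.main import app

client = TestClient(app)


def test_health_endpoint_returns_ok():
    # The /health endpoint should respond successfully when the service is up.
    response = client.get("/health")
    assert response.status_code == 200
```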
Contributions are welcome! Please feel free to submit a Pull Request.