AI-Powered Site Reliability Engineering Agent
The ITBench SRE Agent is an open-source, AI-powered Site Reliability Engineering agent that automates incident response in Kubernetes and OpenShift environments. Built on the CrewAI framework and driven by large language models, the agent diagnoses complex system failures, traces root causes, and implements remediation strategies in real-world-inspired incident scenarios on the ITBench platform.
- Automated Incident Response: Diagnose and resolve incidents in Kubernetes environments
- Real-world Scenarios: Works with ITBench's collection of realistic SRE incident scenarios
- Observability Integration: Integrates with Prometheus, Jaeger, and ClickHouse
- Containerized Execution: Runs safely in containers to prevent harmful commands on host systems
Always run the agent in a container, so that any harmful command it issues cannot affect the host machine.
- Clone the repository
    git clone https://github.com/IBM/ITBench-SRE-Agent
    cd ITBench-SRE-Agent
- Prepare your kubeconfig
  Move the provided kubeconfig file into the root directory of this repo and rename it to config:
    mv /path/to/your/kubeconfig ./config
- Configure environment
  Move the provided .env file to the root directory of this repo.
- Build the container image
    docker build -t itbench-sre-agent --no-cache .
- Run the agent
    # macOS
    docker run --mount type=bind,src="$(pwd)",target=/app/lumyn -e KUBECONFIG=/app/lumyn/config -it itbench-sre-agent /bin/bash
    # Linux
    docker run --network=host --mount type=bind,src="$(pwd)",target=/app/lumyn -e KUBECONFIG=/app/lumyn/config -it itbench-sre-agent /bin/bash
- Get the observability URL
  Inside the Docker container, run:
    kubectl get ingress -n prometheus
  You should see output like:
    NAME         CLASS   HOSTS   ADDRESS                                                                   PORTS   AGE
    prometheus   nginx   *       ad54bc930b7ec40c38f06be1a1ed0758-1859094179.us-west-2.elb.amazonaws.com   80      10h
  Copy the value in the ADDRESS column. This is your <observability-url>.
- Update environment variables
  Open the .env file in a text editor and update the following values:
  - API_KEY_AGENTS: Your provided API key
  - API_KEY_TOOLS: Your provided API key
  - OBSERVABILITY_STACK_URL: http://<observability-url>
  - TOPOLOGY_URL: http://<observability-url>/topology
- Start the agent
    crewai run
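The "Get the observability URL" and "Update environment variables" steps above can be partly scripted. The following is a minimal sketch, not part of the ITBench tooling: the function names `ingress_address` and `env_url_lines` are hypothetical, and it assumes that the ingress row is named `prometheus` (as in the sample output) and that the .env file uses `KEY=VALUE` lines. It only prints the two URL values; the API keys still have to be entered by hand.

```shell
#!/usr/bin/env bash
# Sketch: derive the two URL entries for .env from the ingress ADDRESS.
# Function names are hypothetical helpers, not ITBench commands.

# Read `kubectl get ingress -n prometheus` output on stdin and print the
# ADDRESS column (4th whitespace-separated field) of the "prometheus" row.
ingress_address() {
  awk '$1 == "prometheus" { print $4 }'
}

# Print the two URL settings for the address given as the first argument.
# Assumption: .env uses KEY=VALUE lines.
env_url_lines() {
  printf 'OBSERVABILITY_STACK_URL=http://%s\n' "$1"
  printf 'TOPOLOGY_URL=http://%s/topology\n' "$1"
}

# Usage inside the container (requires access to the cluster):
#   address="$(kubectl get ingress -n prometheus | ingress_address)"
#   env_url_lines "$address"
```

Review the printed lines before copying them into .env, since an empty ADDRESS (for example, while the load balancer is still provisioning) would produce unusable URLs.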
Please see our Developer Guide for detailed information on:
- Local development setup
- Configuration options
- Customization and extension
- ITBench: Central repository providing an overview of the ITBench ecosystem, related announcements, and publications.
- CISO-CAA Agent: CISO (Chief Information Security Officer) agents that automate compliance assessments by generating policies from natural language, collecting evidence, integrating with GitOps workflows, and deploying policies for assessment.
- SRE Agent: SRE (Site Reliability Engineering) agents designed to diagnose and remediate problems in Kubernetes-based environments. They leverage logs, metrics, traces, and Kubernetes states/events from the IT environment.
- ITBench Leaderboard: Service that handles scenario deployment, agent evaluation, and maintains a public leaderboard for comparing agent performance on ITOps use cases.
- ITBench Utilities: Collection of supporting tools and utilities for participants in the ITBench ecosystem and leaderboard challenges.
- ITBench Tutorials: Repository containing the latest tutorials, workshops, and educational content for getting started with ITBench.
- Noah Zheutlin - @noahzibm
@misc{jha2025itbench,
title={ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks},
author={Jha, Saurabh and Arora, Rohan and Watanabe, Yuji and others},
year={2025},
url={https://github.com/IBM/itbench-sample-scenarios/blob/main/it_bench_arxiv.pdf}
}