Skip to content

Commit 81946fe

Browse files
Merge pull request #137 from anveshmuppeda/dev
Adding New Guides on Website
2 parents 619e80d + 0ee53ed commit 81946fe

23 files changed

+4784
-1
lines changed

README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -150,6 +150,8 @@ Kubernetes, also known as K8s, is an open-source container orchestration platfor
150150
| 46 | 2025-04-27 | [⎈ Containerized Helm: Zero-Install Cluster Management ️](https://medium.com/@muppedaanvesh/containerized-helm-zero-install-cluster-management-%EF%B8%8F-1ea8393da3bf?source=rss-15b2de10f77d------2) |
151151
| 47 | 2025-04-27 | [⎈ K8s Tools Docker Images — kubectl ️](https://medium.com/@muppedaanvesh/k8s-tools-docker-images-kubectl-%EF%B8%8F-acd446b5c079?source=rss-15b2de10f77d------2) |
152152
| 48 | 2025-04-29 | [⎈ Containerized FluxCD: Zero-Install Cluster Management ️](https://medium.com/@muppedaanvesh/containerized-fluxcd-zero-install-cluster-management-%EF%B8%8F-4f2ace623eb4?source=rss-15b2de10f77d------2) |
153+
| 49 | 2025-05-11 | [⎈ kubectl-ai: Speak, Don’t Script ️](https://medium.com/@muppedaanvesh/kubectl-ai-speak-dont-script-%EF%B8%8F-f16e79b0fdaa?source=rss-15b2de10f77d------2) |
154+
153155

154156

155157

Lines changed: 259 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,259 @@
1+
---
2+
// filepath: /Users/anveshmuppeda/Desktop/anvesh/tech/git/kubernetes/docs/ai/eks-troubleshooting-assistant.md
3+
sidebar_label: "EKS Troubleshooting Assistant"
4+
sidebar_id: "eks-troubleshooting-assistant"
5+
sidebar_position: 1
6+
---
7+
8+
# Building an Intelligent EKS Troubleshooting Assistant: AI-Driven Kubernetes Operations
9+
10+
## **Introduction: Simplifying Kubernetes Troubleshooting with AI**
11+
12+
### [GitHub Repository | Source Code ](https://github.com/anveshmuppeda/EKS-Troubleshooting-AI-Assistant)
13+
14+
It's 2 AM. Your phone buzzes - production is down. You SSH into the cluster, only to find:
15+
- 12 pods in `CrashLoopBackOff`
16+
- Mysterious `CreateContainerConfigError`s
17+
- A flood of alerts but no clear root cause
18+
19+
Sound familiar? You're not alone. Kubernetes troubleshooting is **hard** because:
20+
1. **Logs are too noisy**: 78% of logs never get analyzed (CNCF 2023)
21+
2. **Context is fragmented**: Pod events ≠ application logs ≠ resource metrics
22+
3. **Solutions are tribal knowledge**: Relies on institutional memory
23+
24+
Kubernetes is the backbone of modern cloud-native applications, but troubleshooting issues in a complex EKS (Elastic Kubernetes Service) cluster can be overwhelming. With thousands of logs, events, and metrics generated every second, engineers often find themselves:
25+
26+
1. Manually searching through logs using `grep`
27+
2. Running multiple `kubectl` commands to gather data
28+
3. Scouring documentation and community forums for solutions
29+
4. Trying to correlate unrelated data points to find the root cause
30+
31+
This traditional troubleshooting approach is not only time-consuming but also requires deep Kubernetes expertise. To address this challenge, we’ve built an **AI-powered Kubernetes Troubleshooting Assistant**, leveraging cutting-edge technologies such as:
32+
33+
- **Natural Language Processing (NLP)**: Enables intuitive querying via Claude 3 Sonnet
34+
- **Semantic Log Search**: Uses OpenSearch vector search for context-aware retrieval
35+
- **Safe Command Execution**: Automates Kubernetes commands while ensuring security
36+
37+
By integrating AI-driven automation, we can **drastically reduce Mean Time to Resolution (MTTR)** and empower engineers to focus on solutions rather than searching for errors.
38+
39+
---
40+
41+
## **Architectural Deep Dive**
42+
![EKS Assistant](./img/eks.assistant.png)
43+
44+
### **1. Log Collection: FluentBit for Efficient Data Streaming**
45+
46+
**Why FluentBit?**
47+
- **Lightweight**: Uses only ~450KB memory, compared to Logstash’s 1GB+
48+
- **Kubernetes-Native**: Automatically enriches logs with pod names and namespaces
49+
- **AWS Optimization**: Built-in support for Amazon Kinesis via the `kinesis_streams` output plugin
50+
51+
FluentBit efficiently gathers logs from Kubernetes pods and forwards them to a streaming pipeline for further analysis.
52+
53+
---
54+
55+
### **2. Streaming Pipeline: Amazon Kinesis**
56+
57+
**Why Kinesis over Kafka?**
58+
- **Serverless Scaling**: No need to manage brokers, unlike Kafka
59+
- **Lambda Integration**: Seamless event-driven processing
60+
61+
**Terraform Configuration for Kinesis Data Stream:**
62+
```terraform
63+
resource "aws_kinesis_stream" "log_stream" {
64+
name = "${var.name}-eks-logs"
65+
stream_mode_details {
66+
stream_mode = "ON_DEMAND"
67+
}
68+
}
69+
```
70+
71+
Kinesis streams logs in real-time to an AWS Lambda function, where logs are transformed into vector embeddings.
72+
73+
---
74+
75+
### **3. Vector Processing: Titan Embeddings v2**
76+
77+
**Why Amazon Titan?**
78+
- **Lower Cost**: $0.0004/1k tokens vs OpenAI’s $0.002/1k tokens
79+
- **Optimized Vectors**: 1024-dimension embeddings balance accuracy and storage
80+
- **Seamless AWS Integration**: Works with IAM roles, no API keys needed
81+
82+
**Generating Embeddings with Amazon Titan:**
83+
```python
84+
def get_embedding(text):
85+
"""Generate embedding using Amazon Titan Embeddings V2 model"""
86+
try:
87+
body = json.dumps({ "inputText": text })
88+
response = bedrock_runtime.invoke_model(
89+
modelId=model, contentType="application/json", accept="application/json", body=body
90+
)
91+
response_body = json.loads(response.get('body').read())
92+
return response_body.get('embedding')
93+
except Exception as e:
94+
logger.error(f"Error generating embedding: {str(e)}")
95+
raise
96+
```
97+
98+
This converts logs into numerical vectors, enabling fast similarity searches.
99+
100+
---
101+
102+
### **4. Vector Database: OpenSearch Serverless**
103+
104+
**Why OpenSearch over Pinecone/Chroma?**
105+
- **Kubernetes Metadata Handling**: Native JSON field support for pod names/timestamps
106+
- **AWS Security**: Uses IAM authentication instead of API keys
107+
108+
**Index Optimization for Faster Search:**
109+
```python
110+
"knn_vector": {
111+
"dimension": 1024,
112+
"method": {
113+
"name": "hnsw",
114+
"space_type": "l2",
115+
"engine": "faiss",
116+
"parameters": {
117+
"ef_construction": 128,
118+
"m": 24
119+
}
120+
}
121+
}
122+
```
123+
124+
With this setup, engineers can perform **semantic searches** on logs and identify related issues instantly.
125+
126+
---
127+
128+
## **Build Your Own in 3 Steps**
129+
130+
### **1. Deploy Infrastructure**
131+
```bash
132+
# Clone & Initialize
133+
https://github.com/anveshmuppeda/EKS-Troubleshooting-AI-Assistant
134+
cd infra/terraform
135+
./install.sh
136+
```
137+
138+
**What Gets Created**:
139+
- **EKS Cluster**: With 3 nodes
140+
- **OpenSearch**: Vector database for log search
141+
- **Lambda**: Converts logs to AI-readable format
142+
143+
## **Real-World Error Simulation & Debugging**
144+
145+
### **Scenario 1: CrashLoopBackOff Due to OOMKill**
146+
```bash
147+
# Create OOM errors
148+
./oom-kill.sh
149+
```
150+
151+
Service A: Queries Service B continuously
152+
153+
Service B: Runs stress-ng and exceeds memory limits
154+
155+
Expected Outcome: Pod crashes with OOMKilled status
156+
157+
```bash
158+
interdependent-services service-b-787d6b5857-6vr2l 0/1 CrashLoopBackOff 29 (2m58s ago) 130m
159+
```
160+
161+
**Chatbot Interaction:**
162+
Prompt:
163+
```plaintext
164+
why service-b-787d6b5857-6vr2l pod is failing with CrashLoopBackOff?
165+
```
166+
167+
Chatbot Logs:
168+
```plaintext
169+
[INFO] 2025-03-31 00:28:28,191 - Received user query for date: 2025-03-30 00:00:00, model: Claude Sonnet, and user input:
170+
why service-b-787d6b5857-6vr2l pod is failing with CrashLoopBackOff?
171+
172+
[INFO] 2025-03-31 00:28:36,517 - Parsed commands:
173+
['kubectl logs service-b-787d6b5857-6vr2l -n interdependent-services -c service-b', 'kubectl describe pod service-b-787d6b5857-6vr2l -n interdependent-services']
174+
```
175+
176+
**Chatbot Diagnosis & Fix:**
177+
```plaintext
178+
Root Cause: The container is running stress-ng, which exceeds its memory limit (128Mi).
179+
Fix: Increase the memory limit or adjust the stress-ng memory usage.
180+
```
181+
![OOMSimulation](./img/oom-kill-pod.png)
182+
183+
---
184+
185+
### **Scenario 2: Configuration Error (CreateContainerConfigError)**
186+
```bash
187+
./provision-delete-error-pods.sh -p cache-service
188+
```
189+
190+
The pod attempts to mount a non-existent ConfigMap, causing a CreateContainerConfigError
191+
```bash
192+
prod-apps cache-service-pod 0/1 CreateContainerConfigError 0 8m30s
193+
```
194+
195+
**Chatbot Interaction:**
196+
Prompt:
197+
```plaintext
198+
why cache-service-pod pod is failing with CreateContainerConfigError?
199+
```
200+
201+
Chatbot Logs:
202+
```plaintext
203+
[INFO] 2025-03-31 00:26:38,705 - Received user query for date: 2025-03-30 00:00:00, model: Claude Sonnet, and user input:
204+
why cache-service-pod pod is failing with CreateContainerConfigError?
205+
206+
[INFO] 2025-03-31 00:26:47,303 - Parsed commands:
207+
['kubectl describe pod web-app-pod -n prod-apps', 'kubectl get service api.internal-service -n prod-apps', 'kubectl get endpoints api.internal-service -n prod-apps', 'kubectl describe pod cache-service-pod -n prod-apps']
208+
```
209+
210+
**Chatbot Diagnosis & Fix:**
211+
```plaintext
212+
Root Cause: ConfigMap non-existent-cache-config not found.
213+
Fix: Create the missing ConfigMap and verify the pod references it correctly.
214+
```
215+
![Cache-Service-Pod](./img/cache-service-pod.png)
216+
---
217+
218+
## **Why This Beats Traditional Tools**
219+
220+
### **1. Finds Needles in Haystacks**
221+
Traditional Search | AI Search
222+
-------------------|---------
223+
"error" → 10,000 results | "memory crash" → Top 5 relevant
224+
225+
### **2. Safe Automation**
226+
```python
227+
# Only allow read commands
228+
ALLOWED_COMMANDS = {'get', 'describe', 'logs'}
229+
if command.split()[1] not in ALLOWED_COMMANDS:
230+
block("Dangerous command!")
231+
```
232+
233+
### **3. Cost-Effective**
234+
235+
### **Key Benefits:**
236+
1. **Accuracy & Speed**
237+
- **Vector Search**: Finds contextually related logs instantly
238+
- **K8s Metadata Filtering**: Focuses on the most relevant issues
239+
- **89% F1-score vs 67% with traditional keyword search**
240+
241+
2. **Secure & Controlled Execution**
242+
- **IAM-based Authentication** (no stored credentials)
243+
- **Command Allow List**: Limits execution to get, describe, and logs
244+
245+
3. **Cost-Effective & Scalable**
246+
- **Serverless Streaming**: Auto-scales with demand
247+
- **Embedding Cache**: Reduces API call costs by 40%
248+
249+
This system allows engineers to diagnose and fix Kubernetes issues faster, safer, and at a fraction of the cost of traditional enterprise tools.
250+
251+
252+
253+
## **Conclusion: AI-Powered Future for Kubernetes Ops**
254+
255+
By integrating AI with Kubernetes troubleshooting, we have:
256+
- **Reduced MTTR from hours to minutes**
257+
- **Simplified troubleshooting for engineers**
258+
- **Enabled proactive issue detection before outages**
259+

docs/ai/img/cache-service-pod.png

469 KB
Loading

docs/ai/img/eks.assistant.png

191 KB
Loading

docs/ai/img/oom-kill-pod.png

571 KB
Loading

0 commit comments

Comments
 (0)