An intelligent REST API that enables natural language queries about Kubernetes cluster resources and provides AI-powered debugging for pod crashes and failures. Query your cluster state, monitor resources, get insights through conversational AI, and debug issues in seconds instead of hours.
Natural Language Interface: Ask questions about your cluster in plain English
AI-Powered Pod Debugging: Instantly diagnose crashed pods with root cause analysis and actionable fixes
Real-time Cluster Analysis: Fetch live information about pods, services, and deployments
Pattern Detection: Automatically identifies common issues (CrashLoopBackOff, OOMKilled, ImagePullBackOff)
Actionable Fixes: Get exact kubectl commands to investigate and resolve issues
Multi-namespace Support: Query and debug resources across different namespaces
Prometheus Metrics: Built-in monitoring with Prometheus-compatible metrics
Problem: Developers waste 15-30 minutes debugging common Kubernetes issues.
Solution: Get instant root cause analysis with specific fix commands in under 3 seconds.
# Before: Manual debugging (15-30 minutes)
kubectl get pods
kubectl describe pod my-app
kubectl logs my-app --previous
# ... trial and error ...
# After: AI-powered debugging (30 seconds)
curl -X POST http://localhost:8000/debug/pod-crash \
-H "Content-Type: application/json" \
-d '{"pod_name": "my-app"}' | jq

Detects:
- CrashLoopBackOff
- OOMKilled (Out of Memory)
- ImagePullBackOff
- Configuration errors
- Network connectivity issues
- Permission errors
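The detection step can be illustrated with a small sketch that maps a pod's container status fields to one of the issue types above. This is a hypothetical helper for illustration only, not the project's actual implementation:

```python
# Illustrative sketch of issue-pattern detection (hypothetical helper,
# not the project's actual code).

def detect_issue(container_status: dict) -> str:
    """Map a container's status fields to a known failure pattern."""
    waiting_reason = (container_status.get("state", {})
                      .get("waiting", {}).get("reason"))
    last_reason = (container_status.get("last_state", {})
                   .get("terminated", {}).get("reason"))

    if waiting_reason == "CrashLoopBackOff":
        return "CrashLoopBackOff"
    if waiting_reason in ("ImagePullBackOff", "ErrImagePull"):
        return "ImagePullBackOff"
    if last_reason == "OOMKilled":
        return "OOMKilled"
    return "Unknown"

# Example: a pod killed for exceeding its memory limit
status = {"last_state": {"terminated": {"reason": "OOMKilled"}}}
print(detect_issue(status))  # OOMKilled
```

The real analyzer additionally inspects events and logs, which is how configuration, network, and permission errors are surfaced.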
- Architecture
- Prerequisites
- Installation
- Configuration
- Usage
- API Reference
- Testing
- Deployment
- Monitoring
- Troubleshooting
┌─────────────┐
│ Client │
└──────┬──────┘
│ HTTP Request
▼
┌─────────────────────┐
│ Flask API │
│ (main.py) │
└──────┬──────────────┘
│
├────────────────┬──────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ K8s Client │ │ K8s Analyzer │ │ AI Service │
│ │ │ (Deep Debug) │ │ │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
│ ├──────────────────┤
│ ▼ ▼
│ ┌──────────────────┐ ┌──────────────┐
│ │ Debug Assistant │ │ OpenAI API │
│ │ • Pattern detect │ │ (GPT-4o) │
│ │ • AI analysis │ │ │
│ └──────────────────┘ └──────────────┘
▼
┌──────────────┐
│ Kubernetes │
│ Cluster │
└──────────────┘
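In outline, a debug request flows through these components as sketched below. The function names and return shapes here are assumptions for illustration; the real components live under src/ and call the Kubernetes API and OpenAI:

```python
# Simplified request flow matching the diagram above. Each component is
# a stub; the real services call the Kubernetes API and OpenAI.

def fetch_pod_state(pod_name: str, namespace: str) -> dict:
    # Stub for the K8s client: would query the Kubernetes API here.
    return {"pod": pod_name, "namespace": namespace,
            "events": ["Back-off restarting failed container"]}

def analyze(pod_state: dict) -> list:
    # Stub for the analyzer: pattern detection over events/logs.
    return [e for e in pod_state["events"] if "Back-off" in e]

def ai_explain(patterns: list) -> str:
    # Stub for the AI service: would call OpenAI here.
    return f"Detected {len(patterns)} crash pattern(s); likely CrashLoopBackOff."

def debug_pod(pod_name: str, namespace: str = "default") -> dict:
    # The Flask layer wires the three components together in this order.
    state = fetch_pod_state(pod_name, namespace)
    patterns = analyze(state)
    return {"pod_name": pod_name, "detected_patterns": patterns,
            "explanation": ai_explain(patterns)}

print(debug_pod("my-app")["explanation"])
```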
- Python 3.10 or higher
- Kubernetes cluster (local or remote)
- Minikube for local development
- kubectl configured with cluster access
- OpenAI API key (required for debugging features)
- Docker (for containerized deployment)
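As a quick sanity check, the items above can be verified with a short stdlib-only script. This is an illustrative convenience sketch, not part of the repository:

```python
# Quick prerequisite check (illustrative; not shipped with the project).
import os
import shutil
import sys

def preflight() -> list:
    """Return a list of human-readable problems; empty means all clear."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    if shutil.which("kubectl") is None:
        problems.append("kubectl not found on PATH")
    if not os.environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems

for problem in preflight():
    print(f"MISSING: {problem}")
```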
git clone https://github.com/johnwroge/K8_AI_Query_Agent.git
cd K8_AI_Query_Agent

# Create virtual environment
python3.10 -m venv venv
# Activate on macOS/Linux
source venv/bin/activate
# Activate on Windows
.\venv\Scripts\activate

pip install -r requirements.txt

Create a .env file in the project root:
OPENAI_API_KEY=your-openai-api-key-here
OPENAI_MODEL=gpt-4o-mini
LOG_LEVEL=INFO
APP_PORT=8000

The application can be configured through environment variables or the .env file:
| Variable | Description | Default |
|---|---|---|
| OPENAI_API_KEY | Your OpenAI API key | Required |
| OPENAI_MODEL | GPT model to use | gpt-4o-mini |
| OPENAI_TEMPERATURE | Model temperature | 0.0 |
| APP_HOST | Server host | 0.0.0.0 |
| APP_PORT | Server port | 8000 |
| LOG_LEVEL | Logging level | INFO |
| K8S_NAMESPACE_FILTER | Filter namespaces | None |
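Internally, configuration of this shape is typically read with os.getenv, falling back to the defaults from the table. A minimal sketch (the variable names match the table above; the loader itself is illustrative, not the project's actual code):

```python
# Minimal config loader mirroring the table above (illustrative sketch).
import os

def load_config() -> dict:
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required")
    return {
        "api_key": api_key,
        "model": os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
        "temperature": float(os.getenv("OPENAI_TEMPERATURE", "0.0")),
        "host": os.getenv("APP_HOST", "0.0.0.0"),
        "port": int(os.getenv("APP_PORT", "8000")),
        "log_level": os.getenv("LOG_LEVEL", "INFO"),
        "namespace_filter": os.getenv("K8S_NAMESPACE_FILTER"),  # None = all
    }
```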
1. Start the Application
# From project root
python run.py
# Or using module syntax
python -m src.main

2. Debug a Crashed Pod
curl -X POST http://localhost:8000/debug/pod-crash \
-H "Content-Type: application/json" \
-d '{"pod_name": "my-crashing-pod", "namespace": "default"}'

3. Make a Natural Language Query
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "How many pods are running in the default namespace?"}'

4. Check Health
curl http://localhost:8000/health

1. Build Image
docker build -t k8s-ai-agent:latest .

2. Run Container
docker run -d \
-p 8000:8000 \
-e OPENAI_API_KEY=your-key \
--name k8s-agent \
k8s-ai-agent:latest

Debug a crashed or failing pod with AI-powered analysis.
Request Body:
{
"pod_name": "my-app-deployment-xyz",
"namespace": "default"
}

Response:
{
"success": true,
"pod_name": "my-app-deployment-xyz",
"namespace": "default",
"issue_type": "CrashLoopBackOff",
"root_cause": "Container exits with code 1 - Database connection failed",
"explanation": "The container is repeatedly crashing because it cannot connect to the database at postgres:5432...",
"detected_patterns": ["CrashLoopBackOff", "Database connection error in logs"],
"likely_causes": [
"Database connection string incorrect in environment variables",
"Service 'postgres' not accessible from this namespace",
"Database credentials are missing or invalid"
],
"suggested_fixes": [
{
"action": "Verify DATABASE_URL environment variable",
"command": "kubectl get pod my-app-xyz -o jsonpath='{.spec.containers[0].env}'",
"why": "Check if the database connection string is correctly configured"
}
],
"severity": "high",
"quick_fix_available": false,
"confidence": "high",
"processing_time_ms": 2347.82
}

Status Codes:
- 200: Success - pod analyzed
- 404: Pod not found
- 400: Invalid request
- 500: Server error
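A response like the one above is easy to post-process on the client side. For example, a small helper (hypothetical, not part of the project) can pull out just the suggested kubectl commands:

```python
# Extract actionable commands from a /debug/pod-crash response
# (hypothetical client-side helper, shown with a sample response).

def extract_fix_commands(response: dict) -> list:
    """Return the kubectl commands from suggested_fixes, in order."""
    return [fix["command"] for fix in response.get("suggested_fixes", [])
            if "command" in fix]

response = {
    "issue_type": "CrashLoopBackOff",
    "suggested_fixes": [
        {"action": "Verify DATABASE_URL environment variable",
         "command": "kubectl get pod my-app-xyz -o jsonpath='{.spec.containers[0].env}'"},
    ],
}
for cmd in extract_fix_commands(response):
    print(cmd)
```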
Process a natural language query about the cluster.
Request Body:
{
"query": "What pods are running?",
"namespace": "default"
}

Response:
{
"query": "What pods are running?",
"answer": "nginx, mongodb, prometheus, k8s-agent",
"processing_time_ms": 1234.56
}

Health check endpoint.
Response:
{
"status": "healthy",
"components": {
"kubernetes": "connected",
"ai_service": "connected",
"model": "gpt-4o-mini",
"debug_assistant": "ready"
}
}

List all namespaces in the cluster.
Response:
{
"namespaces": ["default", "kube-system", "kube-public"]
}

Prometheus metrics endpoint for monitoring.
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test file
pytest tests/test_k8s_agent.py -v
pytest tests/test_debug_assistant.py -v

# Deploy test pods with intentional issues
kubectl apply -f deployment/test-broken-pods.yaml
# Wait for pods to enter crash states
sleep 15
# Debug each scenario
curl -X POST http://localhost:8000/debug/pod-crash \
-H "Content-Type: application/json" \
-d '{"pod_name": "crash-loop-test"}' | jq
curl -X POST http://localhost:8000/debug/pod-crash \
-H "Content-Type: application/json" \
-d '{"pod_name": "oom-test"}' | jq
# Cleanup
kubectl delete -f deployment/test-broken-pods.yaml

# Test health endpoint
pytest tests/test_k8s_agent.py::TestFlaskApp::test_health_check_success
# Test query processing
pytest tests/test_k8s_agent.py::TestFlaskApp::test_query_success
# Test debug pattern detection
pytest tests/test_debug_assistant.py::TestDebugAssistant::test_detect_crashloopbackoff

# Create secret from example
cp deployment/openai-secret.example.yaml deployment/openai-secret.yaml
# Edit the file with your actual API key (no encoding needed)
vim deployment/openai-secret.yaml
# Apply the secret (Kubernetes automatically converts stringData to base64-encoded data when you apply it)
kubectl apply -f deployment/openai-secret.yaml
# Verify it was created
kubectl get secret openai-secret

# Build image with Minikube's Docker daemon
eval $(minikube docker-env)
docker build -t k8s-ai-agent:latest .
# Apply deployment
kubectl apply -f deployment/deployment.yaml
# Verify deployment
kubectl get pods -l app=k8s-agent

# Port forward for testing
kubectl port-forward service/k8s-agent-service 8000:80
# Or create an ingress for production
kubectl apply -f deployment/ingress.yaml

The application exposes Prometheus metrics at /metrics:
Existing Metrics:
- k8s_agent_queries_total: Total queries processed
- k8s_agent_query_duration_seconds: Query processing latency
- k8s_agent_errors_total: Total errors by type
- k8s_agent_cluster_info: Current cluster information
Debug Metrics:
- k8s_agent_debug_requests_total{issue_type}: Debug requests by issue type
- k8s_agent_debug_duration_seconds: Debug processing time
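For reference, Prometheus scrapes these metrics as plain text. The stdlib-only sketch below shows how a counter such as k8s_agent_queries_total appears in the text exposition format; a real application would normally produce this via the prometheus_client library rather than by hand:

```python
# Stdlib-only illustration of the Prometheus text exposition format for
# a counter. Shown for explanation; the application itself would use a
# metrics library to generate /metrics output.

def render_counter(name, help_text, value, labels=None):
    label_str = ""
    if labels:
        # Labels render as {key="value",...}, sorted for stable output.
        inner = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + inner + "}"
    return (f"# HELP {name} {help_text}\n"
            f"# TYPE {name} counter\n"
            f"{name}{label_str} {value}\n")

print(render_counter("k8s_agent_debug_requests_total",
                     "Debug requests by issue type",
                     5, {"issue_type": "CrashLoopBackOff"}))
```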
curl http://localhost:8000/metrics

Import the provided Grafana dashboard configuration:
kubectl apply -f deployment/prometheus.yaml

# Debug a CrashLoopBackOff
curl -X POST http://localhost:8000/debug/pod-crash \
-H "Content-Type: application/json" \
-d '{"pod_name": "failing-app"}'
# Debug in specific namespace
curl -X POST http://localhost:8000/debug/pod-crash \
-H "Content-Type: application/json" \
-d '{"pod_name": "backend-service", "namespace": "production"}'

# Count pods
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "How many pods are running?"}'
# List deployments
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "What deployments exist?"}'
# Check pod status
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "What is the status of nginx pod?"}'

# Query specific namespace
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "List all services", "namespace": "kube-system"}'

Issue: "Failed to connect to Kubernetes cluster"
- Verify kubectl is configured: kubectl cluster-info
- Check kubeconfig: echo $KUBECONFIG
- Ensure cluster is running: minikube status
Issue: "OpenAI API key is required"
- Verify API key is set: echo $OPENAI_API_KEY
- Check .env file exists and contains key
- Ensure key has sufficient credits at OpenAI
Issue: "OpenAI API error"
- Verify API key is correct in .env
- Check API key has sufficient credits
- Verify network connectivity to OpenAI
- Check OpenAI service status
Issue: "Pod is not running"
- Check pod logs: kubectl logs -f <pod-name>
- Describe pod: kubectl describe pod <pod-name>
- Verify secret is created: kubectl get secret openai-secret
Issue: "Debug endpoint returns fallback response"
- This is normal - pattern detection still works without AI
- Check logs for specific OpenAI errors: tail -f agent.log
- Verify OPENAI_API_KEY is valid
- Check OpenAI API quota/limits
Enable debug logging:
LOG_LEVEL=DEBUG python run.py

View logs:
tail -f agent.log

- Setup & Test Guide - Comprehensive setup and testing instructions
- Quick Reference - Quick command reference card
- Model Configuration - AI model options and cost comparison
- Roadmap - Feature roadmap and future plans
Contributions are welcome! Please follow these steps:
1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Make your changes
4. Run tests (pytest)
5. Commit your changes (git commit -m 'Add amazing feature')
6. Push to the branch (git push origin feature/amazing-feature)
7. Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Built with: