| 1 | +# Redis Monitoring and Alerting |
| 2 | + |
| 3 | +This document describes the Redis monitoring and alerting setup for the database service. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The database service exposes Prometheus metrics for Redis connection status and per-operation success and failure counts. When Redis goes down, the application continues to function without caching, and alerts notify operators. |
| 8 | + |
| 9 | +## Metrics |
| 10 | + |
| 11 | +The following Prometheus metrics are exposed at `/metrics`: |
| 12 | + |
| 13 | +### `database_redis_connection_status` |
| 14 | +- **Type**: Gauge |
| 15 | +- **Values**: `1` (connected) or `0` (disconnected) |
| 16 | +- **Description**: Current Redis connection status |
| 17 | + |
| 18 | +### `database_redis_operation_success_total` |
| 19 | +- **Type**: Counter |
| 20 | +- **Labels**: `operation` (get, set, remove) |
| 21 | +- **Description**: Total number of successful Redis operations |
| 22 | + |
| 23 | +### `database_redis_operation_failures_total` |
| 24 | +- **Type**: Counter |
| 25 | +- **Labels**: `operation` (get, set, remove) |
| 26 | +- **Description**: Total number of failed Redis operations |
| 27 | + |
| 28 | +## Alerts |
| 29 | + |
| 30 | +### DatabaseRedisDown |
| 31 | +- **Severity**: Critical |
| 32 | +- **Condition**: Redis connection is down for more than 1 minute |
| 33 | +- **Description**: The database service has lost connection to Redis. Cache is unavailable but the service continues to operate. |
| 34 | +- **Action**: Check Redis pod status, network connectivity, and Redis logs. |
| 35 | + |
| 36 | +### DatabaseRedisOperationFailures |
| 37 | +- **Severity**: Warning |
| 38 | +- **Condition**: Redis operations failing at rate > 0.1/second for 2 minutes |
| 39 | +- **Description**: Redis operations are experiencing failures |
| 40 | +- **Action**: Check Redis health, network latency, and error logs. |
| 41 | + |
| 42 | +### DatabaseRedisHighFailureRate |
| 43 | +- **Severity**: Critical |
| 44 | +- **Condition**: Redis operations failing at rate > 1/second for 1 minute |
| 45 | +- **Description**: Critical failure rate - service is degraded |
| 46 | +- **Action**: Investigate immediately. Check Redis status, restart if necessary. |
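
The three alerts above could be expressed roughly as the following `PrometheusRule`, assuming the Prometheus Operator CRDs are installed. This is a sketch, not the chart's actual manifest: the resource name, group name, 5-minute rate window, and annotation text are illustrative assumptions; only the alert names, thresholds, durations, severities, and metric names come from this document.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: database-redis-alerts        # hypothetical name
spec:
  groups:
    - name: database-redis           # hypothetical group name
      rules:
        - alert: DatabaseRedisDown
          # Gauge is 0 while the service is disconnected from Redis
          expr: database_redis_connection_status == 0
          for: 1m
          labels:
            severity: critical
            component: database
            service: redis
          annotations:
            summary: Database service lost its Redis connection   # assumed wording
        - alert: DatabaseRedisOperationFailures
          # Failures per second, summed over all operation labels.
          # The [5m] window is an assumption; the document does not state one.
          expr: sum(rate(database_redis_operation_failures_total[5m])) > 0.1
          for: 2m
          labels:
            severity: warning
            component: database
            service: redis
        - alert: DatabaseRedisHighFailureRate
          expr: sum(rate(database_redis_operation_failures_total[5m])) > 1
          for: 1m
          labels:
            severity: critical
            component: database
            service: redis
```

With a 30-second scrape interval, a rate window shorter than about 2 minutes would leave `rate()` with very few samples, which is why the sketch uses a wider window.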
| 47 | + |
| 48 | +## Grafana Dashboard |
| 49 | + |
| 50 | +A dedicated Grafana dashboard "Database Redis Monitoring" provides: |
| 51 | + |
| 52 | +1. **Redis Connection Status** - Real-time connection state |
| 53 | +2. **Operation Success Rate** - Rate of successful operations by type |
| 54 | +3. **Operation Failure Rate** - Rate of failed operations by type |
| 55 | +4. **Success Rate %** - Overall success percentage |
| 56 | +5. **Connection History** - Timeline of connection up/down events |
| 57 | + |
| 58 | +Import the dashboard from: `Simulator/grafana-dashboards/database-redis-monitoring.json` |
| 59 | + |
| 60 | +## Deployment |
| 61 | + |
| 62 | +The monitoring stack is deployed automatically with the database Helm chart: |
| 63 | + |
| 64 | +- **ServiceMonitor**: Scrapes the `/metrics` endpoint every 30 seconds |
| 65 | +- **PrometheusRule**: Defines alert rules |
| 66 | +- **Service**: Labeled for Prometheus discovery |
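
For orientation, a `ServiceMonitor` matching the description above might look like the following sketch. The resource name, the `app: database` selector label, and the `http` port name are hypothetical; the chart's real values may differ. Only the `/metrics` path and the 30-second interval are taken from this document.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: database                 # hypothetical name
spec:
  selector:
    matchLabels:
      app: database              # hypothetical: must match the Service's discovery label
  endpoints:
    - port: http                 # hypothetical: name of the Service port serving metrics
      path: /metrics
      interval: 30s
```

The `selector` has to match the labels on the Service, which is why the Service is labeled for Prometheus discovery.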
| 67 | + |
| 68 | +## Testing Alerting |
| 69 | + |
| 70 | +To test the alerting system: |
| 71 | + |
| 72 | +1. Deploy to the staging environment |
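| 73 | +2. Stop the Redis pod: `kubectl delete pod -l app=redis`. If a controller recreates the pod before the 1-minute alert window elapses, scale the Redis workload to zero replicas instead |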
| 74 | +3. Verify metrics show `database_redis_connection_status = 0` |
| 75 | +4. Wait at least 1 minute (plus scrape and rule-evaluation delay) for the `DatabaseRedisDown` alert to fire |
| 76 | +5. Check Alertmanager UI for active alerts |
| 77 | +6. Restart Redis and verify recovery |
| 78 | + |
| 79 | +## Configuration |
| 80 | + |
| 81 | +Alert routing and notification channels are configured in Alertmanager. Ensure the following labels are routed appropriately: |
| 82 | + |
| 83 | +- `severity: critical` → PagerDuty / immediate notifications |
| 84 | +- `severity: warning` → Slack / email notifications |
| 85 | +- `component: database` |
| 86 | +- `service: redis` |
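
One way the routing above could be written in the Alertmanager configuration is sketched below. The receiver names, the Slack channel, and the placeholder credentials are all hypothetical and must be replaced with real values; the document only specifies which severities go to which kind of channel.

```yaml
route:
  receiver: default                    # fallback for alerts matching no child route
  routes:
    - matchers:
        - severity="critical"
        - service="redis"
      receiver: redis-pagerduty        # hypothetical receiver name
    - matchers:
        - severity="warning"
        - service="redis"
      receiver: redis-slack            # hypothetical receiver name

receivers:
  - name: default
  - name: redis-pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_WITH_PAGERDUTY_ROUTING_KEY   # placeholder
  - name: redis-slack
    slack_configs:
      - api_url: REPLACE_WITH_SLACK_WEBHOOK_URL           # placeholder
        channel: '#redis-alerts'                          # hypothetical channel
```

Matching on both `severity` and `service` keeps these routes from capturing alerts that belong to other components.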
| 87 | + |