Commit 290d221

Add Prometheus metrics and monitoring for Redis
- Add prom-client dependency for metrics
- Create /metrics endpoint exposing Prometheus metrics
- Track Redis connection status (database_redis_connection_status)
- Track Redis operation success/failure counts by operation type
- Update RedisCache to report connection events and operation metrics
- Add comprehensive monitoring documentation

Metrics exposed:
- database_redis_connection_status: 1=connected, 0=disconnected
- database_redis_operation_success_total: successful operations by type
- database_redis_operation_failures_total: failed operations by type
1 parent a29acc4 commit 290d221

5 files changed, +164 -1 lines changed

REDIS_MONITORING.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
+# Redis Monitoring and Alerting
+
+This document describes the Redis monitoring and alerting setup for the database service.
+
+## Overview
+
+The database service now includes comprehensive monitoring for Redis connection status and operation health. When Redis goes down, the application continues to function without caching, and alerts are triggered to notify operators.
+
+## Metrics
+
+The following Prometheus metrics are exposed at `/metrics`:
+
+### `database_redis_connection_status`
+- **Type**: Gauge
+- **Values**: `1` (connected) or `0` (disconnected)
+- **Description**: Current Redis connection status
+
+### `database_redis_operation_success_total`
+- **Type**: Counter
+- **Labels**: `operation` (get, set, remove)
+- **Description**: Total number of successful Redis operations
+
+### `database_redis_operation_failures_total`
+- **Type**: Counter
+- **Labels**: `operation` (get, set, remove)
+- **Description**: Total number of failed Redis operations
+
+## Alerts
+
+### DatabaseRedisDown
+- **Severity**: Critical
+- **Condition**: Redis connection is down for more than 1 minute
+- **Description**: The database service has lost connection to Redis. Cache is unavailable but the service continues to operate.
+- **Action**: Check Redis pod status, network connectivity, and Redis logs.
+
+### DatabaseRedisOperationFailures
+- **Severity**: Warning
+- **Condition**: Redis operations failing at rate > 0.1/second for 2 minutes
+- **Description**: Redis operations are experiencing failures
+- **Action**: Check Redis health, network latency, and error logs.
+
+### DatabaseRedisHighFailureRate
+- **Severity**: Critical
+- **Condition**: Redis operations failing at rate > 1/second for 1 minute
+- **Description**: Critical failure rate - service is degraded
+- **Action**: Investigate immediately. Check Redis status, restart if necessary.
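The two failure-rate alerts differ only in threshold and hold time. As a rough illustration (this is not the PrometheusRule itself, which this commit view does not show), the sketch below derives a per-second rate from two samples of a failure counter and maps it to the severities listed above. `CounterSample`, `ratePerSecond`, and `classifyFailureRate` are hypothetical names, and the sketch deliberately ignores the hold durations (2 minutes and 1 minute) that Prometheus would also enforce.

```typescript
// Hypothetical helper type; not part of the commit.
interface CounterSample {
  value: number;       // cumulative counter value at scrape time
  timestampMs: number; // scrape time in milliseconds
}

type Severity = 'none' | 'warning' | 'critical';

// Per-second increase between two samples of a monotonically increasing counter.
function ratePerSecond(earlier: CounterSample, later: CounterSample): number {
  const seconds = (later.timestampMs - earlier.timestampMs) / 1000;
  if (seconds <= 0) return 0;
  // A drop in value means the counter was reset (e.g. process restart);
  // in that case treat the new value as the whole increase.
  const increase = later.value >= earlier.value ? later.value - earlier.value : later.value;
  return increase / seconds;
}

// Thresholds taken from the alert conditions: > 0.1/s is a warning, > 1/s is critical.
function classifyFailureRate(rate: number): Severity {
  if (rate > 1) return 'critical';
  if (rate > 0.1) return 'warning';
  return 'none';
}
```

Both comparisons are strict, so a rate of exactly 1/second maps to the warning level, matching the `>` in the conditions above.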
+
+## Grafana Dashboard
+
+A dedicated Grafana dashboard "Database Redis Monitoring" provides:
+
+1. **Redis Connection Status** - Real-time connection state
+2. **Operation Success Rate** - Rate of successful operations by type
+3. **Operation Failure Rate** - Rate of failed operations by type
+4. **Success Rate %** - Overall success percentage
+5. **Connection History** - Timeline of connection up/down events
+
+Import the dashboard from: `Simulator/grafana-dashboards/database-redis-monitoring.json`
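The dashboard JSON is not part of this view, so the exact query behind the "Success Rate %" panel is unknown. A minimal sketch of the arithmetic such a panel presumably performs, assuming it divides successes by the sum of successes and failures over the same window (`successRatePercent` is a hypothetical name):

```typescript
// Returns the share of successful operations as a percentage,
// or null when no operations were recorded (avoids dividing by zero).
function successRatePercent(successes: number, failures: number): number | null {
  const total = successes + failures;
  if (total === 0) return null;
  return (100 * successes) / total;
}
```

Returning `null` for an idle window keeps "no traffic" distinct from "0% success", which a dashboard would otherwise render identically.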
+
+## Deployment
+
+The monitoring stack is deployed automatically with the database Helm chart:
+
+- **ServiceMonitor**: Scrapes `/metrics` endpoint every 30 seconds
+- **PrometheusRule**: Defines alert rules
+- **Service**: Labeled for Prometheus discovery
+
+## Testing Alerting
+
+To test the alerting system:
+
+1. Deploy to staging environment
+2. Stop the Redis pod: `kubectl delete pod -l app=redis`
+3. Verify metrics show `database_redis_connection_status = 0`
+4. Wait 1 minute for `DatabaseRedisDown` alert to fire
+5. Check Alertmanager UI for active alerts
+6. Restart Redis and verify recovery
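Step 3 can be checked by reading the gauge out of the `/metrics` response body. The sketch below is one way to do that, assuming the body is in the Prometheus text exposition format; `readMetricValue` is a hypothetical helper, and its simple pattern does not handle label values that themselves contain `}`.

```typescript
// Returns the value of the first sample line for `metricName`, or null if absent.
function readMetricValue(expositionText: string, metricName: string): number | null {
  for (const rawLine of expositionText.split('\n')) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue; // skip blank lines and HELP/TYPE comments
    // A sample line looks like: <name>[{labels}] <value> [timestamp]
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)/.exec(line);
    if (match !== null && match[1] === metricName) {
      return Number(match[3]);
    }
  }
  return null;
}
```

The text would come from the service's `/metrics` endpoint; the host and port depend on the deployment and are not shown in this commit.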
+
+## Configuration
+
+Alert routing and notification channels are configured in Alertmanager. Ensure the following labels are routed appropriately:
+
+- `severity: critical` → PagerDuty / immediate notifications
+- `severity: warning` → Slack / email notifications
+- `component: database`
+- `service: redis`
+

package.json

Lines changed: 2 additions & 1 deletion
@@ -16,7 +16,8 @@
     "@google-cloud/storage": "^6.9.3",
     "fastify": "^4.9.2",
     "firebase-admin": "^11.2.0",
-    "ioredis": "^5.2.3"
+    "ioredis": "^5.2.3",
+    "prom-client": "^14.2.0"
   },
   "devDependencies": {
     "@types/argparse": "^2.0.10",

src/RedisCache.ts

Lines changed: 29 additions & 0 deletions
@@ -2,6 +2,7 @@ import Cache from './Cache';
 
 import Redis, { RedisOptions } from 'ioredis';
 import Selector from './model/Selector';
+import { redisConnectionGauge, redisFailureCounter, redisSuccessCounter } from './metrics';
 
 class RedisCache implements Cache {
   private static DEFAULT_TTL = 60 * 60 * 24 * 7;
@@ -22,8 +23,29 @@ class RedisCache implements Cache {
       enableOfflineQueue: false, // Don't queue commands when disconnected
     });
 
+    this.redis_.on('connect', () => {
+      console.log('Redis connected');
+      redisConnectionGauge.set(1);
+    });
+
+    this.redis_.on('ready', () => {
+      console.log('Redis ready');
+      redisConnectionGauge.set(1);
+    });
+
     this.redis_.on('error', (err) => {
       console.error('Redis error (app will continue without cache):', err.message);
+      redisConnectionGauge.set(0);
+    });
+
+    this.redis_.on('close', () => {
+      console.warn('Redis connection closed');
+      redisConnectionGauge.set(0);
+    });
+
+    this.redis_.on('end', () => {
+      console.warn('Redis connection ended');
+      redisConnectionGauge.set(0);
     });
   }
 
@@ -34,10 +56,12 @@ class RedisCache implements Cache {
   async get(selector: Selector): Promise<object | null> {
     try {
       const data = await this.redis_.get(RedisCache.key_(selector));
+      redisSuccessCounter.inc({ operation: 'get' });
       if (!data) return null;
       return JSON.parse(data);
     } catch (err) {
       console.error('Redis GET failed, continuing without cache:', err);
+      redisFailureCounter.inc({ operation: 'get' });
       return null;
     }
   }
@@ -46,19 +70,24 @@ class RedisCache implements Cache {
     try {
       if (!value) {
         await this.redis_.del(RedisCache.key_(selector));
+        redisSuccessCounter.inc({ operation: 'set' });
         return;
       }
       await this.redis_.setex(RedisCache.key_(selector), RedisCache.DEFAULT_TTL, JSON.stringify(value));
+      redisSuccessCounter.inc({ operation: 'set' });
     } catch (err) {
       console.error('Redis SET failed, continuing without cache:', err);
+      redisFailureCounter.inc({ operation: 'set' });
     }
   }
 
   async remove(selector: Selector): Promise<void> {
     try {
       await this.redis_.del(RedisCache.key_(selector));
+      redisSuccessCounter.inc({ operation: 'remove' });
     } catch (err) {
       console.error('Redis DEL failed, continuing without cache:', err);
+      redisFailureCounter.inc({ operation: 'remove' });
     }
   }
 }
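Each method above repeats the same try/count/catch/count pattern. In `get`, the success counter is incremented before `JSON.parse` runs, so a cached value that fails to parse would appear to increment both counters; whether that is intended is not stated. A minimal sketch of one way to centralize the pattern so each call records exactly one outcome follows. `LabeledCounter` and `instrumented` are hypothetical names, and the interface is only a structural stand-in for the prom-client counters used in the diff.

```typescript
// Minimal structural stand-in for a labeled counter; hypothetical, for illustration only.
interface LabeledCounter {
  inc(labels: { operation: string }): void;
}

// Runs `fn` and counts exactly one success or one failure for `operation`.
// On failure the error is logged and `fallback` is returned, mirroring the
// "continue without cache" behaviour in the diff.
async function instrumented<T>(
  operation: string,
  success: LabeledCounter,
  failure: LabeledCounter,
  fallback: T,
  fn: () => Promise<T>,
): Promise<T> {
  try {
    const result = await fn();
    success.inc({ operation }); // counted only after the whole operation completed
    return result;
  } catch (err) {
    console.error(`Redis ${operation} failed, continuing without cache:`, err);
    failure.inc({ operation });
    return fallback;
  }
}
```

With a helper like this, the body passed as `fn` for `get` could include both the Redis read and the JSON parse, so a parse error would count as a single failure.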

src/index.ts

Lines changed: 7 additions & 0 deletions
@@ -12,6 +12,7 @@ import authorize, { AuthorizeResult } from './authorize';
 import { CHALLENGE_COMPLETION_COLLECTION, USER_COLLECTION } from './model/constants';
 
 import bigStore from './big-store';
+import { register as metricsRegister } from './metrics';
 
 const UNAUTHORIZED_RESULT = { message: 'Unauthorized' };
 const NOT_FOUND_RESULT = { message: 'Not Found' };
@@ -47,6 +48,12 @@ app.get('/', async (request, reply) => {
   reply.send({ database: 'alive' });
 });
 
+// Prometheus metrics endpoint
+app.get('/metrics', async (request, reply) => {
+  reply.header('Content-Type', metricsRegister.contentType);
+  reply.send(await metricsRegister.metrics());
+});
+
 app.get('/:collection/:id', async (request, reply) => {
   const token = await authenticate(request);
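The `/metrics` route returns whatever the registry serializes. As a rough picture of that payload, the sketch below hand-renders one metric in the Prometheus text exposition format. It is an illustration only: prom-client produces this text internally, and `Sample`, `escapeLabelValue`, and `renderMetric` are hypothetical names.

```typescript
// Hypothetical shape for one labeled sample of a metric.
interface Sample {
  labels: Record<string, string>;
  value: number;
}

// Label values escape backslash, double quote, and newline in the text format.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Renders HELP and TYPE comment lines followed by one line per sample.
function renderMetric(
  metricName: string,
  help: string,
  type: 'counter' | 'gauge',
  samples: Sample[],
): string {
  const lines = [`# HELP ${metricName} ${help}`, `# TYPE ${metricName} ${type}`];
  for (const sample of samples) {
    const pairs = Object.entries(sample.labels).map(
      ([key, value]) => `${key}="${escapeLabelValue(value)}"`,
    );
    const labelText = pairs.length > 0 ? `{${pairs.join(',')}}` : '';
    lines.push(`${metricName}${labelText} ${sample.value}`);
  }
  return lines.join('\n') + '\n';
}
```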

src/metrics.ts

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+import { Registry, Gauge, Counter } from 'prom-client';
+
+// Create a custom registry
+export const register = new Registry();
+
+// Redis connection status gauge (1 = connected, 0 = disconnected)
+export const redisConnectionGauge = new Gauge({
+  name: 'database_redis_connection_status',
+  help: 'Redis connection status (1 = connected, 0 = disconnected)',
+  registers: [register],
+});
+
+// Redis operation failures counter
+export const redisFailureCounter = new Counter({
+  name: 'database_redis_operation_failures_total',
+  help: 'Total number of failed Redis operations',
+  labelNames: ['operation'], // 'get', 'set', 'remove'
+  registers: [register],
+});
+
+// Redis operation success counter
+export const redisSuccessCounter = new Counter({
+  name: 'database_redis_operation_success_total',
+  help: 'Total number of successful Redis operations',
+  labelNames: ['operation'],
+  registers: [register],
+});
+
+// HTTP request counter
+export const httpRequestCounter = new Counter({
+  name: 'database_http_requests_total',
+  help: 'Total number of HTTP requests',
+  labelNames: ['method', 'route', 'status_code'],
+  registers: [register],
+});
+
+// Initialize Redis status as disconnected
+redisConnectionGauge.set(0);
+
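`httpRequestCounter` is declared in this file, but none of the hunks shown in this commit increment it. The sketch below shows one way it might be fed, assuming a per-response callback such as a Fastify `onResponse` hook. `HttpLabels`, `HttpCounter`, and `recordHttpRequest` are hypothetical names; the interfaces are structural stand-ins so the example runs on its own.

```typescript
// Label set matching the labelNames declared for database_http_requests_total.
interface HttpLabels {
  method: string;
  route: string;
  status_code: number;
}

// Structural stand-in for the counter; hypothetical, for illustration only.
interface HttpCounter {
  inc(labels: HttpLabels): void;
}

// Records one finished request. `routePattern` should be the registered route
// (e.g. '/:collection/:id') rather than the raw URL, so that the number of
// distinct label values stays bounded.
function recordHttpRequest(
  counter: HttpCounter,
  method: string,
  routePattern: string | undefined,
  statusCode: number,
): void {
  counter.inc({ method, route: routePattern ?? 'unmatched', status_code: statusCode });
}
```

Using the route pattern instead of the concrete path avoids creating a new time series for every document id requested.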
