
Conversation

@bmcdorman
Member

Problem

Redis recently went down and broke the entire application. All database operations were failing because Redis cache operations were hanging and timing out.

Solution

This PR adds graceful degradation when Redis is unavailable:

Changes Made

  1. Added try-catch blocks in RedisCache - All Redis operations (get/set/remove) now handle errors gracefully

  2. Aggressive timeout configuration (see the sketch after this list):

    • Command timeout: 200ms (fails fast when Redis is unresponsive)
    • Connect timeout: 500ms (quick connection attempts)
    • No retries per request (immediate failure)
    • Offline queue disabled (don't queue when disconnected)
    • Only 2 reconnection attempts with 50ms delays
  3. Cleaned up db.ts - Removed scattered try-catch blocks since error handling is now centralized in RedisCache
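A minimal sketch of how this looks with ioredis (the class shape, key scheme, and environment variable are illustrative, not the exact diff; the option values match the list above):

```ts
import Redis from 'ioredis';

// Sketch of the degraded-gracefully cache; get/set/remove all follow the same pattern.
class RedisCache {
  private static readonly DEFAULT_TTL = 3600; // seconds

  private redis_ = new Redis({
    host: process.env.REDIS_HOST,     // illustrative; actual connection config may differ
    commandTimeout: 200,              // fail each command after 200ms
    connectTimeout: 500,              // quick connection attempts
    maxRetriesPerRequest: 0,          // no per-command retries
    enableOfflineQueue: false,        // don't queue commands while disconnected
    retryStrategy: times => (times > 2 ? null : 50), // 2 reconnect attempts, 50ms apart
  });

  private static key_(selector: string): string {
    return `cache:${selector}`;       // illustrative key scheme
  }

  async get(selector: string): Promise<unknown | null> {
    try {
      const raw = await this.redis_.get(RedisCache.key_(selector));
      return raw ? JSON.parse(raw) : null;
    } catch (err) {
      console.error('Redis GET failed, continuing without cache:', err);
      return null;                    // caller falls through to the database
    }
  }
}
```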

Behavior

  • ✅ When Redis is available: Normal caching with full performance benefits
  • ✅ When Redis is down: Operations fail in ~200ms, app continues without cache
  • ✅ Automatic reconnection when Redis becomes available again

Impact

  • App stays operational during Redis outages
  • Fast-fail prevents request timeouts
  • Database operations continue normally, just without caching performance boost

⚠️ Warning

This PR is UNTESTED - it needs to be verified in a staging environment before merging to production.

Testing Recommendations

  1. Deploy to staging
  2. Simulate Redis failure (stop Redis container/service)
  3. Verify app continues to function
  4. Check response times are acceptable (~200ms penalty per cache operation while Redis is down)
  5. Restart Redis and verify it reconnects automatically
  6. Monitor logs for error messages

- Wrap all Redis operations in try-catch blocks to prevent app crashes
- Add aggressive timeout configurations (200ms command, 500ms connect)
- Disable retries and offline queue for fast-fail behavior
- Remove unnecessary try-catch blocks from db.ts (now handled in RedisCache)
- App continues functioning without cache when Redis is unavailable

⚠️  UNTESTED - needs verification in staging environment
@bmcdorman bmcdorman requested review from navzam and tcorbly October 9, 2025 14:17
commandTimeout: 200, // 200ms per command
retryStrategy: (times) => {
// Stop retrying after 2 attempts
if (times > 2) return null;
@navzam (Member) commented:

Why is this check needed if maxRetriesPerRequest is 0?

}
await this.redis_.setex(RedisCache.key_(selector), RedisCache.DEFAULT_TTL, JSON.stringify(value));
} catch (err) {
console.error('Redis SET failed, continuing without cache:', err);
@navzam (Member) commented:

This is a new risk since we use a write-through strategy. If redis is temporarily unavailable but stays alive, the DB writes will still succeed and the cache will get outdated. Then when redis is available again, we'll serve stale data

Changes based on @navzam review comments:

1. Clarify retry configuration (line 18 comment)
   - Added comments explaining that retryStrategy controls connection retries,
     while maxRetriesPerRequest controls command retries
   - These are independent settings for different purposes

2. Fix stale cache risk (line 53 comment)
   - Changed to cache-aside pattern to prevent stale data
   - On writes: only invalidate cache (DEL), don't populate (SET)
   - On reads: populate cache after fetching from Firestore
   - Reduced TTL from 7 days to 1 hour to limit stale data window
   - If invalidation fails during Redis issues, stale data expires in 1hr max

This ensures we never serve stale cached data that's >1 hour old,
even if Redis is flaky during writes.
@bmcdorman
Member Author

Thanks for the review @navzam! I've addressed both issues:

1. Line 18 - Retry configuration clarification:
Added comments to explain that retryStrategy and maxRetriesPerRequest control different things:

  • retryStrategy: Controls reconnection attempts when connection is lost
  • maxRetriesPerRequest: Controls retries for individual Redis commands (GET/SET/etc)

Both are needed - we want to attempt reconnection, but individual commands should fail fast.
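In code-comment form, the distinction looks roughly like this (same values as the config above; a sketch, not the exact diff):

```ts
import Redis from 'ioredis';

const redis = new Redis({
  // Per-command: a GET/SET either completes within 200ms or fails immediately.
  commandTimeout: 200,
  maxRetriesPerRequest: 0,   // a failed/timed-out command is never retried

  // Per-connection: if the socket drops, try to reconnect twice, 50ms apart.
  retryStrategy: times => (times > 2 ? null : 50), // returning null stops reconnecting
});
```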

2. Line 53 - Stale cache risk:
Great catch! I've switched to a cache-aside pattern to prevent this:

  • On writes: Only invalidate cache (DEL), never populate (SET)
  • On reads: Populate cache after fetching from Firestore
  • Reduced TTL: From 7 days → 1 hour

This ensures that even if cache invalidation fails during Redis issues, stale data will expire within 1 hour max. The cache is now only populated on reads with fresh data from Firestore, eliminating the write-through stale data risk.
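A rough sketch of the cache-aside flow described above, assuming an ioredis client and the Firestore Node client; the collection name, key scheme, and function names are illustrative:

```ts
import Redis from 'ioredis';
import { Firestore } from '@google-cloud/firestore';

const redis = new Redis(/* timeouts as configured in this PR */);
const db = new Firestore();
const TTL_SECONDS = 3600;                              // 1 hour, down from 7 days
const key = (selector: string) => `cache:${selector}`; // illustrative key scheme

// Write path: write to Firestore, then only invalidate the cache entry.
async function setItem(selector: string, value: Record<string, unknown>): Promise<void> {
  await db.collection('items').doc(selector).set(value);
  try {
    await redis.del(key(selector));                    // never SET on the write path
  } catch (err) {
    console.error('Redis DEL failed; any stale entry expires within 1 hour:', err);
  }
}

// Read path: try the cache, fall back to Firestore, then repopulate with fresh data.
async function getItem(selector: string): Promise<unknown> {
  try {
    const cached = await redis.get(key(selector));
    if (cached) return JSON.parse(cached);
  } catch (err) {
    console.error('Redis GET failed, falling back to Firestore:', err);
  }
  const doc = await db.collection('items').doc(selector).get();
  const value = doc.data();
  try {
    await redis.setex(key(selector), TTL_SECONDS, JSON.stringify(value));
  } catch {
    // Cache population is best-effort; the read still succeeds without it.
  }
  return value;
}
```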

Let me know if you have any other concerns!

@bmcdorman bmcdorman force-pushed the graceful-redis-failure branch from 290d221 to 1f0231f on October 9, 2025 15:28
bmcdorman added a commit that referenced this pull request Oct 9, 2025
- Add prom-client dependency for metrics
- Create /metrics endpoint exposing Prometheus metrics
- Track Redis connection status (database_redis_connection_status)
- Track Redis operation success/failure counts by operation type
- Update RedisCache to report connection events and operation metrics
- Add comprehensive monitoring documentation

Metrics exposed:
- database_redis_connection_status: 1=connected, 0=disconnected
- database_redis_operation_success_total: successful operations by type
- database_redis_operation_failures_total: failed operations by type

Note: This PR is independent and can be deployed alongside PR #10 for
graceful Redis failure handling, or standalone for monitoring only.
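
A minimal sketch of the prom-client wiring for these metrics; the metric names come from the commit message, while the Express app and the points where RedisCache reports events are assumptions:

```ts
import express from 'express';
import { Counter, Gauge, register } from 'prom-client';

// 1 = connected, 0 = disconnected
export const redisConnectionStatus = new Gauge({
  name: 'database_redis_connection_status',
  help: 'Redis connection status (1=connected, 0=disconnected)',
});

export const redisOperationSuccess = new Counter({
  name: 'database_redis_operation_success_total',
  help: 'Successful Redis operations by operation type',
  labelNames: ['operation'],
});

export const redisOperationFailures = new Counter({
  name: 'database_redis_operation_failures_total',
  help: 'Failed Redis operations by operation type',
  labelNames: ['operation'],
});

// RedisCache would report into these (illustrative call sites):
//   redisConnectionStatus.set(1) on 'ready', .set(0) on 'close'/'error'
//   redisOperationSuccess.inc({ operation: 'get' }) after a successful GET
//   redisOperationFailures.inc({ operation: 'set' }) after a failed SET

const app = express();
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
app.listen(3000); // illustrative port
```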