
Conversation

@bmcdorman
Member

Problem

Redis recently went down and broke the entire application. All database operations were failing because Redis cache operations were hanging and timing out.

Solution

This PR adds graceful degradation when Redis is unavailable:

Changes Made

  1. Added try-catch blocks in RedisCache - All Redis operations (get/set/remove) now handle errors gracefully

  2. Aggressive timeout configuration (see the sketch after this list):

    • Command timeout: 200ms (fails fast when Redis is unresponsive)
    • Connect timeout: 500ms (quick connection attempts)
    • No retries per request (immediate failure)
    • Offline queue disabled (don't queue when disconnected)
    • Only 2 reconnection attempts with 50ms delays
  3. Cleaned up db.ts - Removed scattered try-catch blocks since error handling is now centralized in RedisCache
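A minimal sketch of how this looks with ioredis (the class shape, key scheme, and environment variable are illustrative, not the exact diff; the option values match the list above):

```ts
import Redis from 'ioredis';

// Sketch of the degraded-gracefully cache; get/set/remove all follow the same pattern.
class RedisCache {
  private static readonly DEFAULT_TTL = 3600; // seconds

  private redis_ = new Redis({
    host: process.env.REDIS_HOST,     // illustrative; actual connection config may differ
    commandTimeout: 200,              // fail each command after 200ms
    connectTimeout: 500,              // quick connection attempts
    maxRetriesPerRequest: 0,          // no per-command retries
    enableOfflineQueue: false,        // don't queue commands while disconnected
    retryStrategy: times => (times > 2 ? null : 50), // 2 reconnect attempts, 50ms apart
  });

  private static key_(selector: string): string {
    return `cache:${selector}`;       // illustrative key scheme
  }

  async get(selector: string): Promise<unknown | null> {
    try {
      const raw = await this.redis_.get(RedisCache.key_(selector));
      return raw ? JSON.parse(raw) : null;
    } catch (err) {
      console.error('Redis GET failed, continuing without cache:', err);
      return null;                    // caller falls through to the database
    }
  }
}
```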

Behavior

  • ✅ When Redis is available: Normal caching with full performance benefits
  • ✅ When Redis is down: Operations fail in ~200ms, app continues without cache
  • ✅ Automatic reconnection when Redis becomes available again

Impact

  • App stays operational during Redis outages
  • Fast-fail prevents request timeouts
  • Database operations continue normally, just without caching performance boost

⚠️ Warning

This PR is UNTESTED - it needs to be verified in a staging environment before merging to production.

Testing Recommendations

  1. Deploy to staging
  2. Simulate Redis failure (stop Redis container/service)
  3. Verify app continues to function
  4. Check response times are acceptable (~200ms penalty per cache operation while Redis is down)
  5. Restart Redis and verify it reconnects automatically
  6. Monitor logs for error messages

- Wrap all Redis operations in try-catch blocks to prevent app crashes
- Add aggressive timeout configurations (200ms command, 500ms connect)
- Disable retries and offline queue for fast-fail behavior
- Remove unnecessary try-catch blocks from db.ts (now handled in RedisCache)
- App continues functioning without cache when Redis is unavailable

⚠️  UNTESTED - needs verification in staging environment
@bmcdorman bmcdorman requested review from navzam and tcorbly October 9, 2025 14:17
commandTimeout: 200, // 200ms per command
retryStrategy: (times) => {
// Stop retrying after 2 attempts
if (times > 2) return null;
@navzam (Member) commented:

Why is this check needed if maxRetriesPerRequest is 0?

}
await this.redis_.setex(RedisCache.key_(selector), RedisCache.DEFAULT_TTL, JSON.stringify(value));
} catch (err) {
console.error('Redis SET failed, continuing without cache:', err);
@navzam (Member) commented:

This is a new risk since we use a write-through strategy. If redis is temporarily unavailable but stays alive, the DB writes will still succeed and the cache will get outdated. Then when redis is available again, we'll serve stale data

Changes based on @navzam review comments:

1. Clarify retry configuration (line 18 comment)
   - Added comments explaining that retryStrategy controls connection retries,
     while maxRetriesPerRequest controls command retries
   - These are independent settings for different purposes

2. Fix stale cache risk (line 53 comment)
   - Changed to cache-aside pattern to prevent stale data
   - On writes: only invalidate cache (DEL), don't populate (SET)
   - On reads: populate cache after fetching from Firestore
   - Reduced TTL from 7 days to 1 hour to limit stale data window
   - If invalidation fails during Redis issues, stale data expires in 1hr max

This ensures we never serve stale cached data that's >1 hour old,
even if Redis is flaky during writes.
@bmcdorman
Member Author

Thanks for the review @navzam! I've addressed both issues:

1. Line 18 - Retry configuration clarification:
Added comments to explain that retryStrategy and maxRetriesPerRequest control different things:

  • retryStrategy: Controls reconnection attempts when connection is lost
  • maxRetriesPerRequest: Controls retries for individual Redis commands (GET/SET/etc)

Both are needed - we want to attempt reconnection, but individual commands should fail fast.
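In code-comment form, the distinction looks roughly like this (same values as the config above; a sketch, not the exact diff):

```ts
import Redis from 'ioredis';

const redis = new Redis({
  // Per-command: a GET/SET either completes within 200ms or fails immediately.
  commandTimeout: 200,
  maxRetriesPerRequest: 0,   // a failed/timed-out command is never retried

  // Per-connection: if the socket drops, try to reconnect twice, 50ms apart.
  retryStrategy: times => (times > 2 ? null : 50), // returning null stops reconnecting
});
```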

2. Line 53 - Stale cache risk:
Great catch! I've switched to a cache-aside pattern to prevent this:

  • On writes: Only invalidate cache (DEL), never populate (SET)
  • On reads: Populate cache after fetching from Firestore
  • Reduced TTL: From 7 days → 1 hour

This ensures that even if cache invalidation fails during Redis issues, stale data will expire within 1 hour max. The cache is now only populated on reads with fresh data from Firestore, eliminating the write-through stale data risk.
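A rough sketch of the cache-aside flow described above, assuming an ioredis client and the Firestore Node client; the collection name, key scheme, and function names are illustrative:

```ts
import Redis from 'ioredis';
import { Firestore } from '@google-cloud/firestore';

const redis = new Redis(/* timeouts as configured in this PR */);
const db = new Firestore();
const TTL_SECONDS = 3600;                              // 1 hour, down from 7 days
const key = (selector: string) => `cache:${selector}`; // illustrative key scheme

// Write path: write to Firestore, then only invalidate the cache entry.
async function setItem(selector: string, value: Record<string, unknown>): Promise<void> {
  await db.collection('items').doc(selector).set(value);
  try {
    await redis.del(key(selector));                    // never SET on the write path
  } catch (err) {
    console.error('Redis DEL failed; any stale entry expires within 1 hour:', err);
  }
}

// Read path: try the cache, fall back to Firestore, then repopulate with fresh data.
async function getItem(selector: string): Promise<unknown> {
  try {
    const cached = await redis.get(key(selector));
    if (cached) return JSON.parse(cached);
  } catch (err) {
    console.error('Redis GET failed, falling back to Firestore:', err);
  }
  const doc = await db.collection('items').doc(selector).get();
  const value = doc.data();
  try {
    await redis.setex(key(selector), TTL_SECONDS, JSON.stringify(value));
  } catch {
    // Cache population is best-effort; the read still succeeds without it.
  }
  return value;
}
```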

Let me know if you have any other concerns!

@bmcdorman bmcdorman force-pushed the graceful-redis-failure branch from 290d221 to 1f0231f on October 9, 2025 15:28
bmcdorman added a commit that referenced this pull request Oct 9, 2025
- Add prom-client dependency for metrics
- Create /metrics endpoint exposing Prometheus metrics
- Track Redis connection status (database_redis_connection_status)
- Track Redis operation success/failure counts by operation type
- Update RedisCache to report connection events and operation metrics
- Add comprehensive monitoring documentation

Metrics exposed:
- database_redis_connection_status: 1=connected, 0=disconnected
- database_redis_operation_success_total: successful operations by type
- database_redis_operation_failures_total: failed operations by type

Note: This PR is independent and can be deployed alongside PR #10 for
graceful Redis failure handling, or standalone for monitoring only.
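
A minimal sketch of the prom-client wiring for these metrics; the metric names come from the commit message, while the Express app and the points where RedisCache reports events are assumptions:

```ts
import express from 'express';
import { Counter, Gauge, register } from 'prom-client';

// 1 = connected, 0 = disconnected
export const redisConnectionStatus = new Gauge({
  name: 'database_redis_connection_status',
  help: 'Redis connection status (1=connected, 0=disconnected)',
});

export const redisOperationSuccess = new Counter({
  name: 'database_redis_operation_success_total',
  help: 'Successful Redis operations by operation type',
  labelNames: ['operation'],
});

export const redisOperationFailures = new Counter({
  name: 'database_redis_operation_failures_total',
  help: 'Failed Redis operations by operation type',
  labelNames: ['operation'],
});

// RedisCache would report into these (illustrative call sites):
//   redisConnectionStatus.set(1) on 'ready', .set(0) on 'close'/'error'
//   redisOperationSuccess.inc({ operation: 'get' }) after a successful GET
//   redisOperationFailures.inc({ operation: 'set' }) after a failed SET

const app = express();
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});
app.listen(3000); // illustrative port
```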