Add graceful Redis failure handling with aggressive timeouts #10
base: master
Conversation
- Wrap all Redis operations in try-catch blocks to prevent app crashes
- Add aggressive timeout configurations (200ms command, 500ms connect)
- Disable retries and the offline queue for fast-fail behavior
- Remove unnecessary try-catch blocks from db.ts (now handled in RedisCache)
- App continues functioning without the cache when Redis is unavailable

⚠️ UNTESTED - needs verification in a staging environment
commandTimeout: 200, // 200ms per command
retryStrategy: (times) => {
  // Stop retrying after 2 attempts
  if (times > 2) return null;
Why is this check needed if maxRetriesPerRequest is 0?
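The distinction the author later draws is that these are two independent knobs: `retryStrategy` governs *connection*-level reconnect attempts, while `maxRetriesPerRequest` governs how often a single *command* is retried. A small sketch of both (the option values mirror the diff above; the combined options object is an assumption, not the exact PR code):

```typescript
// Connection-level retry policy for ioredis: return a delay in ms to
// retry the connection, or null to stop reconnecting entirely.
function retryStrategy(times: number): number | null {
  if (times > 2) return null; // give up reconnecting after 2 attempts
  return Math.min(times * 100, 500); // hypothetical backoff: 100ms, then 200ms
}

// Assumed combined options: maxRetriesPerRequest: 0 means an individual
// command fails immediately while the connection is down, independent of
// how many times the client keeps trying to reconnect.
const redisOptions = {
  commandTimeout: 200, // 200ms per command
  connectTimeout: 500, // 500ms to establish a connection
  maxRetriesPerRequest: 0, // never re-queue a failed command
  enableOfflineQueue: false, // fail fast instead of buffering commands
  retryStrategy,
};
```

So the answer to the question above is that the `times > 2` check limits reconnection attempts, which `maxRetriesPerRequest: 0` does not affect.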
src/RedisCache.ts
  }
  await this.redis_.setex(RedisCache.key_(selector), RedisCache.DEFAULT_TTL, JSON.stringify(value));
} catch (err) {
  console.error('Redis SET failed, continuing without cache:', err);
This is a new risk since we use a write-through strategy. If Redis is temporarily unavailable while the app stays alive, the DB writes will still succeed but the cache won't be updated. Then when Redis is available again, we'll serve stale data.
Changes based on @navzam review comments:

1. Clarify retry configuration (line 18 comment)
   - Added a comment explaining that retryStrategy controls connection retries, while maxRetriesPerRequest controls command retries
   - These are independent settings for different purposes

2. Fix stale cache risk (line 53 comment)
   - Changed to a cache-aside pattern to prevent stale data
   - On writes: only invalidate the cache (DEL), don't populate it (SET)
   - On reads: populate the cache after fetching from Firestore
   - Reduced TTL from 7 days to 1 hour to limit the stale-data window
   - If invalidation fails during Redis issues, stale data expires in 1 hour at most

This ensures we never serve cached data that is more than 1 hour old, even if Redis is flaky during writes.
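The cache-aside fix described above can be sketched with plain in-memory maps standing in for Redis and Firestore (class and method names here are hypothetical; the real code lives in src/RedisCache.ts and uses DEL/SETEX with a 1-hour TTL):

```typescript
type Store = Map<string, string>;

// Cache-aside sketch: writes only invalidate, reads populate.
class CacheAside {
  constructor(private cache: Store, private db: Store) {}

  // Write path: update the source of truth, then DEL the cache entry.
  // Even if the DEL fails while Redis is flaky, the entry expires via
  // its TTL, bounding how long stale data can be served.
  write(key: string, value: string): void {
    this.db.set(key, value);
    this.cache.delete(key); // invalidate only, never SET here
  }

  // Read path: serve from cache on a hit, otherwise fetch from the DB
  // and populate the cache with fresh data (SETEX with TTL in the real code).
  read(key: string): string | undefined {
    const hit = this.cache.get(key);
    if (hit !== undefined) return hit;
    const value = this.db.get(key);
    if (value !== undefined) this.cache.set(key, value);
    return value;
  }
}
```

The design choice: since the cache is only ever populated from a just-completed DB read, a write can never leave a newer DB value hidden behind an older cached one for longer than the TTL.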
Thanks for the review @navzam! I've addressed both issues:

1. Line 18 - Retry configuration clarification: Both are needed - we want to attempt reconnection, but individual commands should fail fast.

2. Line 53 - Stale cache risk: This ensures that even if cache invalidation fails during Redis issues, stale data will expire within 1 hour at most. The cache is now only populated on reads with fresh data from Firestore, eliminating the write-through stale-data risk.

Let me know if you have any other concerns!
Force-pushed 290d221 to 1f0231f
- Add prom-client dependency for metrics
- Create a /metrics endpoint exposing Prometheus metrics
- Track Redis connection status (database_redis_connection_status)
- Track Redis operation success/failure counts by operation type
- Update RedisCache to report connection events and operation metrics
- Add comprehensive monitoring documentation

Metrics exposed:
- database_redis_connection_status: 1=connected, 0=disconnected
- database_redis_operation_success_total: successful operations by type
- database_redis_operation_failures_total: failed operations by type

Note: This PR is independent and can be deployed alongside PR #10 for graceful Redis failure handling, or standalone for monitoring only.
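To show the shape of the three metrics listed above without pulling in the prom-client dependency, here is a hand-rolled gauge/counter sketch (the metric names come from the PR; the Gauge/Counter classes and call sites are stand-ins for what the real code does with prom-client):

```typescript
// Minimal stand-in for a Prometheus gauge: a single settable value.
class Gauge {
  private value = 0;
  constructor(public readonly name: string) {}
  set(v: number): void { this.value = v; }
  get(): number { return this.value; }
}

// Minimal stand-in for a labeled Prometheus counter: monotonically
// increasing counts keyed by a label (here, the operation type).
class Counter {
  private counts = new Map<string, number>();
  constructor(public readonly name: string) {}
  inc(label: string): void { this.counts.set(label, (this.counts.get(label) ?? 0) + 1); }
  get(label: string): number { return this.counts.get(label) ?? 0; }
}

const connectionStatus = new Gauge("database_redis_connection_status");
const opSuccess = new Counter("database_redis_operation_success_total");
const opFailures = new Counter("database_redis_operation_failures_total");

// Assumed call sites: RedisCache would update these from its connection
// event handlers and from the try-catch blocks around each operation.
connectionStatus.set(1); // e.g. on the client's "ready" event
opSuccess.inc("get");    // after a successful GET
opFailures.inc("set");   // in the catch block around SETEX
```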
Problem
Redis recently went down and broke the entire application: all database operations failed because Redis cache operations hung until they timed out.
Solution
This PR adds graceful degradation when Redis is unavailable:
Changes Made
Added try-catch blocks in RedisCache - All Redis operations (get/set/remove) now handle errors gracefully
Aggressive timeout configuration - 200ms command timeout, 500ms connect timeout, with retries and the offline queue disabled for fast-fail behavior
Cleaned up db.ts - Removed scattered try-catch blocks since error handling is now centralized in RedisCache
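The centralized error handling described above can be sketched as follows (the interface and field names are assumptions; the real class in src/RedisCache.ts wraps get/set/remove the same way):

```typescript
// Minimal slice of the Redis client surface used by the sketch.
interface MinimalRedis {
  get(key: string): Promise<string | null>;
}

class RedisCache {
  constructor(private redis: MinimalRedis) {}

  // On any Redis failure, log and return null so callers fall back to
  // the database instead of crashing; with the aggressive timeouts, the
  // failure surfaces within ~200ms rather than hanging.
  async get(key: string): Promise<string | null> {
    try {
      return await this.redis.get(key);
    } catch (err) {
      console.error("Redis GET failed, continuing without cache:", err);
      return null;
    }
  }
}
```

Because every operation degrades to a cache miss, db.ts no longer needs its own scattered try-catch blocks.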
Behavior
When Redis is available, caching works as before. When Redis is unavailable, cache operations fail fast and the app continues functioning without the cache.
Impact
This PR is UNTESTED - it needs to be verified in a staging environment before merging to production.
Testing Recommendations