feat: detect multiple fatal error through health event analyzer #160
base: main
Conversation
Greptile Overview
Greptile Summary
This PR extends the health-events-analyzer to detect patterns of multiple fatal errors across time windows, rather than just single events. The core change refactors the analyzer's reconciler to process all matching rules per event (instead of stopping at the first match) and adds two new TOML rule types: one detecting any 5 remediated faults within 7 days regardless of error code, and another detecting 5 repeated instances of the same XID within 24 hours. The publisher now creates new health events tagged with the matched rule name as the CheckName and marks them as fatal with REPORT_ISSUE or CONTACT_SUPPORT actions. The MongoDB change-stream filter was changed from isfatal=false to agent != health-events-analyzer to allow processing of fatal events from other agents while avoiding infinite loops. New test helpers enable direct MongoDB queries and parameterized event injection via error-code placeholders. The changes span the health-events-analyzer's reconciler, publisher, config parser, and rule definitions, plus E2E tests, Tilt configuration, and test infrastructure.
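For orientation, below is a minimal sketch of what such a change-stream filter can look like with the official Go MongoDB driver. The collection wiring is omitted, and the field paths (fullDocument.healthevent.agent, fullDocument.healthevent.ishealthy) are illustrative assumptions rather than the PR's exact schema.

```go
// Sketch only: watch inserts from any agent except the analyzer itself, so the
// analyzer can see fatal events from monitors without re-processing its own
// published events (the infinite-loop concern noted below).
package watcher

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func watchHealthEvents(ctx context.Context, coll *mongo.Collection) (*mongo.ChangeStream, error) {
	pipeline := mongo.Pipeline{
		{{Key: "$match", Value: bson.D{
			{Key: "operationType", Value: "insert"},
			// Previously the filter excluded fatal events (isfatal=false); now it
			// only excludes events the analyzer published itself.
			{Key: "fullDocument.healthevent.agent", Value: bson.D{{Key: "$ne", Value: "health-events-analyzer"}}},
			{Key: "fullDocument.healthevent.ishealthy", Value: false},
		}}},
	}
	return coll.Watch(ctx, pipeline, options.ChangeStream().SetFullDocument(options.UpdateLookup))
}
```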
PR Description Notes:
- The PR description template was not filled out—no summary, component checkboxes, or test status information was provided.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-events-analyzer/pkg/config/rules.go | 1/5 | Added omitempty to SequenceStep.Criteria but file declares wrong package name (reconciler instead of config), causing import failures |
| health-events-analyzer/pkg/reconciler/reconciler.go | 2/5 | Refactored to process all rules per event with goroutines and multierror; unsafe array access on ErrorCode[0] without bounds check |
| health-events-analyzer/pkg/reconciler/parser.go | 2/5 | New reflection-based BSON query builder; nil-pointer risks on GeneratedTimestamp/EntitiesImpacted, no bounds check on negative indices |
| health-events-analyzer/main.go | 3/5 | Changed MongoDB filter from isfatal=false to agent!=health-events-analyzer; risk of infinite loop if published events re-enter stream |
| tilt/Tiltfile | 3/5 | Enabled health-events-analyzer in Tilt; kwok-fake-nodes resource_deps missing analyzer, may cause race condition during cluster bringup |
| health-events-analyzer/go.mod | 3/5 | Added go-multierror and klog v1.0.0; v1/v2 klog duplication creates maintenance burden and potential import confusion |
| tests/Makefile | 3/5 | Added MongoDB port-forward and cert extraction for CI; globally disables TLS verification (MONGODB_TLS_SKIP_VERIFY=true) which masks cert issues |
| tests/helpers/healthevent.go | 3/5 | Added errorCode param to SendHealthEventsToNodes (breaking signature change), MongoDB query helper, TestCleanUp; no connection timeout on MongoDB client |
| tests/helpers/kube.go | 4/5 | Added WaitForNodeConditionWithCheckName and CleanupNodeConditionAndUncordon; substring matching via Contains() may false-positive on non-unique names |
| tests/multiple_fatal_event_test.go | 3/5 | New E2E test for aggregated fatal errors; unused klog import, race on shared GpuNodeName variable, 85s of hard-coded sleeps, comment mismatch |
| tests/repeated_xid_rule_test.go | 4/5 | New E2E test for repeated XID detection; unused klog import, repetitive event injection (5 identical calls), unnecessary context indirection |
| health-events-analyzer/pkg/reconciler/reconciler_test.go | 3/5 | Expanded tests for multi-error detection; trailing tab in criteria string (line 99), global test data mutation (line 306) causes order dependency |
| health-events-analyzer/pkg/publisher/publisher.go | 4/5 | Refactored Publish to create new event with ruleName as CheckName; double metric increment on non-retryable errors (lines 77, 84) |
| health-events-analyzer/pkg/publisher/metrics.go | 5/5 | Renamed FatalEventPublishingError to unexported fatalEventPublishingError for better encapsulation; no external impact |
| health-events-analyzer/pkg/reconciler/metrics.go | 5/5 | Renamed entity_value label to node_name, broadened fatalEventsPublishedTotal to newEventsPublishedTotal, added ruleMatchedTotal and databaseQueryDuration metrics |
| distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml | 4/5 | Updated rules to detect 5 remediated faults in 7 days and any repeated XID with exclusion filter; relies on healtheventstatus.faultremediated schema field |
| distros/kubernetes/nvsentinel/values-tilt.yaml | 5/5 | Enabled health-events-analyzer in Tilt values; safe config change for local dev |
| tests/data/fatal-health-event.json | 5/5 | Changed XID from "79" to "ERROR_CODE" placeholder and normalized entityType to uppercase "GPU" for template reuse |
| tests/data/healthy-event.json | 5/5 | Added errorCode array field with placeholder "ERROR_CODE" and normalized entityType to uppercase "GPU" |
| tests/go.mod | 5/5 | Added MongoDB driver and Kubernetes client libraries as direct dependencies for E2E tests |
| tests/go.sum | 5/5 | Added checksums for MongoDB driver and transitive dependencies |
| health-events-analyzer/go.sum | 4/5 | Added checksums for go-multierror, errwrap, and older klog/go-logr versions; older transitive dependencies should be audited |
| tests/scale_test.go | 4/5 | Updated SendHealthEventsToNodes calls to pass explicit XID "79"; hardcoded value must match test data and rule config |
| tests/smoke_test.go | 5/5 | Minor changes (not detailed in provided context) |
Confidence score: 2/5
- This PR introduces significant architectural changes with multiple critical issues that could cause runtime failures, infinite loops, or import errors if merged without fixes.
- Score reflects package declaration mismatch (rules.go), unsafe array indexing without bounds checks (reconciler.go line 257, parser.go line 165), nil-pointer risks in reflection code (parser.go lines 177-192), potential infinite event loop (main.go line 133), concurrency issues (unbounded goroutines, shared test variables), double metric increments, and breaking test helper signature changes.
- Pay close attention to health-events-analyzer/pkg/config/rules.go (package name), health-events-analyzer/pkg/reconciler/reconciler.go (goroutine spawning, unsafe array access), health-events-analyzer/pkg/reconciler/parser.go (nil checks, array bounds), health-events-analyzer/main.go (event filter logic), tests/helpers/healthevent.go (signature breakage), and tests/Makefile (TLS verification bypass).
Sequence Diagram
sequenceDiagram
participant User
participant GPU Health Monitor
participant Platform Connector
participant MongoDB
participant Health Events Analyzer
participant Publisher
participant Fault Quarantine
participant Node
User->>GPU Health Monitor: Fatal GPU error occurs (XID 13/48/etc)
GPU Health Monitor->>Platform Connector: Send HealthEvent via gRPC
Platform Connector->>MongoDB: Insert health event document
Platform Connector->>Node: Update node condition
MongoDB->>Health Events Analyzer: Change stream notification (insert event)
Health Events Analyzer->>Health Events Analyzer: Filter: operationType=insert, agent≠health-events-analyzer, isHealthy=false
Health Events Analyzer->>Health Events Analyzer: Parse TOML rules (MultipleFatalError, RepeatedXid13)
Health Events Analyzer->>Health Events Analyzer: Check if rule applies to event XID
alt Rule applies to event
Health Events Analyzer->>MongoDB: Run aggregation pipeline with $facet
MongoDB-->>Health Events Analyzer: Return sequence match results
Health Events Analyzer->>Health Events Analyzer: Evaluate if all sequences met errorCount threshold
alt All sequences matched
Health Events Analyzer->>Publisher: Create new fatal event with agent="health-events-analyzer"
Publisher->>Platform Connector: Publish transformed HealthEvent via gRPC (with retry)
Platform Connector->>MongoDB: Insert new fatal event
Platform Connector->>Node: Update node condition with rule name (MultipleFatalError/RepeatedXid13)
Node->>Fault Quarantine: Detect new fatal condition
Fault Quarantine->>Node: Cordon and quarantine node
else Sequences not matched
Health Events Analyzer->>Health Events Analyzer: Log "rule criteria didn't match"
end
else Rule doesn't apply to XID
Health Events Analyzer->>Health Events Analyzer: Skip rule evaluation
end
Additional Comments (2)
- health-events-analyzer/pkg/config/rules.go, line 15 (link) logic: package name is `reconciler` but the file is in the `config/` directory. Should this be `package config` to match the directory structure?
- tilt/Tiltfile, lines 141-143 (link) logic: `nvsentinel-health-events-analyzer` is missing from the resource_deps list. Since the health-events-analyzer is now enabled (line 82) and watches MongoDB change streams, kwok-fake-nodes should wait for it to be ready.
24 files reviewed, 22 comments
healthEventWithStatus storeconnector.HealthEventWithStatus) bool {
	// If no target XIDs specified, evaluate for all events
	for _, seq := range rule.Sequence {
		eventXID := healthEventWithStatus.HealthEvent.ErrorCode[0]
logic: array access without bounds check will panic if ErrorCode is empty. Add length validation before accessing index 0.
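A minimal sketch of the guard this comment asks for, using a stand-in struct rather than the project's generated HealthEvent type:

```go
// Sketch only: never index ErrorCode[0] without checking the slice first.
package reconciler

import "log"

type HealthEvent struct {
	ErrorCode []string
}

// firstErrorCode returns the first error code, or false when none is present,
// so callers can skip rule evaluation instead of panicking on an empty slice.
func firstErrorCode(ev *HealthEvent) (string, bool) {
	if ev == nil || len(ev.ErrorCode) == 0 {
		log.Println("health event has no error codes; skipping rule evaluation")
		return "", false
	}
	return ev.ErrorCode[0], true
}
```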
f0be25d to bdcf2e4
Greptile Overview
Greptile Summary
This PR enables the health-events-analyzer module to detect long-term patterns of multiple fatal GPU errors, marking a major evolution from the previous XID-sequence-based detection. The feature introduces two new rule types: MultipleFatalError (5 remediated faults within 7 days) and RepeatedXid13 (5 occurrences of any fatal XID within 24 hours). The analyzer now evaluates all rules per event instead of stopping at the first match, using go-multierror to aggregate partial failures. Test fixtures were refactored to use template placeholders (ERROR_CODE, NODE_NAME) enabling parameterized multi-XID testing, and the test suite gained MongoDB integration via go.mongodb.org/mongo-driver to directly query stored events and validate pattern detection. The changes align with NVSentinel's architecture where the analyzer consumes MongoDB change-streams, applies TOML rule sets, and publishes synthetic fatal events with its own metadata (Agent: "health-events-analyzer", CheckName: ruleName) back to the platform-connector for downstream remediation.
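As a rough illustration of evaluating every rule per event while bounding concurrency and aggregating failures with go-multierror, here is a sketch; the Rule type, evaluateRule helper, and the worker limit of 4 are assumptions, not the PR's actual reconciler code.

```go
// Sketch only: bounded fan-out over rules with aggregated errors.
package reconciler

import (
	"context"
	"sync"

	"github.com/hashicorp/go-multierror"
)

type Rule struct{ Name string }

func evaluateRule(ctx context.Context, r Rule) error { return nil } // placeholder

func evaluateAllRules(ctx context.Context, rules []Rule) error {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		merged *multierror.Error
	)
	sem := make(chan struct{}, 4) // cap in-flight evaluations instead of one goroutine per rule

	for _, r := range rules {
		r := r
		wg.Add(1)
		sem <- struct{}{}
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			if err := evaluateRule(ctx, r); err != nil {
				mu.Lock()
				merged = multierror.Append(merged, err)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return merged.ErrorOrNil() // nil-safe, avoiding the nil multiErr dereference flagged below
}
```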
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-events-analyzer/pkg/reconciler/reconciler.go | 2/5 | Refactored to evaluate all rules per event with unbounded goroutine spawning and unprotected array access at line 257 that will panic if ErrorCode is empty |
| tests/smoke_test.go | 2/5 | Added error code parameter to SendHealthEventsToNodes but TestFatalUnsupportedHealthEvent has critical mismatch: sends XID 79 but asserts ErrorCode:145 |
| tests/helpers/kube.go | 3/5 | Added WaitForNodeConditionWithCheckName and CleanupNodeConditionAndUncordon helpers; cleanup logs but ignores all errors, always returning nil even if critical steps fail |
| tests/helpers/healthevent.go | 3/5 | Added MongoDB integration with getMongoCollection and QueryHealthEventByFilter; TLS verification can be disabled via env var, and TestCleanUp returns nil even if cleanup steps fail |
| distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml | 3/5 | Replaced short-window XID sequence rules with 7-day/24-hour pattern detection; MultipleFatalError checks faultremediated=true which matches resolved faults, not active failures |
| tests/repeated_xid_rule_test.go | 3/5 | New e2e test for RepeatedXid13 rule; unused klog import and context key collision risk with string variable used as context key instead of typed constant |
| health-events-analyzer/pkg/reconciler/reconciler_test.go | 3/5 | Added rule1 for multiple-fatal-error detection; typo on line 99 ("GPU\t" trailing tab) will cause production rule mismatch, and mutates global healthEvent_13 state |
| health-events-analyzer/pkg/reconciler/parser.go | 4/5 | New Parser struct extracts dynamic MongoDB query templating logic from reconciler; centralizes sequence string parsing with proper error handling |
| tests/multiple_fatal_event_test.go | 4/5 | New e2e test for MultipleFatalError rule; unused klog import and comment-code mismatch at line 122 (comment says XID 48, sends error code 31) |
| health-events-analyzer/pkg/publisher/publisher.go | 4/5 | Refactored Publish to create new HealthEvent with Agent=health-events-analyzer and CheckName=ruleName; shallow-copies slices/maps which could cause issues if original event is reused |
| tilt/Tiltfile | 4/5 | Enabled health-events-analyzer and added MongoDB port-forward for local dev testing; safe configuration change for development environment |
| tests/data/fatal-health-event.json | 4/5 | Replaced hardcoded error code 79 with ERROR_CODE placeholder for parameterized testing; tests must now substitute the placeholder or events will be rejected |
| health-events-analyzer/pkg/config/rules.go | 5/5 | Added omitempty to SequenceStep.Criteria field enabling rules to count total errors without filtering by specific properties; minimal, safe change |
| health-events-analyzer/go.mod | 3/5 | Added go-multierror for aggregating multiple errors; problematic addition of k8s.io/klog v1.0.0 alongside existing klog/v2 dependency creates import path conflict |
| tests/go.mod | 5/5 | Added mongodb driver and promoted k8s.io/client-go to direct dependency for new e2e tests; all versions align with Kubernetes 0.34.1 ecosystem |
| health-events-analyzer/pkg/reconciler/metrics.go | 5/5 | Renamed entity_value label to node_name, added rule_matched_total and databaseQueryDuration metrics; breaking change for existing Prometheus dashboards |
| distros/kubernetes/nvsentinel/values-tilt.yaml | 5/5 | Enabled health-events-analyzer in Tilt dev environment; safe, isolated configuration change for local testing |
| health-events-analyzer/pkg/publisher/metrics.go | 4/5 | Made FatalEventPublishingError metric unexported; proper encapsulation but breaking change for external packages accessing the variable directly |
| tests/scale_test.go | 4/5 | Added explicit error code parameter (79) to SendHealthEventsToNodes; hardcoded value may not match JSON data files |
| tests/data/healthy-event.json | 3/5 | Added errorCode field with ERROR_CODE placeholder; tests must substitute placeholder or events may not clear matching error codes |
| health-events-analyzer/go.sum | 5/5 | Added checksums for go-multierror and transitive dependencies; standard go.sum update with no direct issues |
| tests/go.sum | 5/5 | Added checksums for MongoDB driver and transitive dependencies; consistent with go.mod changes |
Confidence score: 2/5
- This PR introduces critical bugs (panic on empty ErrorCode array, unbounded goroutine spawning) and multiple test mismatches that will cause immediate failures
- Score reflects unsafe concurrency patterns in production code, unprotected array access that will panic, test fixture mismatches between sent XIDs and assertions, and error-swallowing cleanup logic that masks test failures
- Pay close attention to health-events-analyzer/pkg/reconciler/reconciler.go (array bounds check, goroutine limiting), tests/smoke_test.go (XID mismatch line 304/337), and tests/helpers/healthevent.go (error handling in cleanup)
Sequence Diagram
sequenceDiagram
participant User
participant HealthMonitor as Health Monitor<br/>(GPU/Syslog/CSP)
participant SimpleClient as Simple Health Client
participant PlatformConn as Platform Connector
participant MongoDB
participant Analyzer as Health Events Analyzer
participant Publisher as Analyzer Publisher
participant Quarantine as Fault Quarantine
participant K8sAPI as Kubernetes API
User->>SimpleClient: POST /health-event<br/>(fatal error, e.g. XID 13/48)
SimpleClient->>PlatformConn: HealthEventOccuredV1(gRPC)
PlatformConn->>MongoDB: Insert HealthEvent document
PlatformConn->>K8sAPI: Add/Update Node Condition
MongoDB->>Analyzer: Change Stream Event (insert)
Note over Analyzer: Filter: operationType=insert<br/>agent≠health-events-analyzer<br/>isHealthy=false
Analyzer->>Analyzer: Parse event through Parser<br/>Extract field values
loop For each configured rule
Analyzer->>Analyzer: shouldEvaluateRuleForEvent()<br/>Check if rule applies to XID
alt Rule matches XID
Analyzer->>MongoDB: Aggregate Query<br/>Build facet pipeline<br/>Check sequences in time window
Note over MongoDB: $facet with multiple sequences<br/>Count matching docs per sequence<br/>Compare to errorCount threshold
MongoDB->>Analyzer: Aggregation result<br/>{ruleMatched: true/false}
alt All sequences satisfied
Analyzer->>Publisher: Publish(ctx, event, action, ruleName)
Note over Publisher: Transform event:<br/>agent="health-events-analyzer"<br/>checkName=ruleName<br/>isFatal=true, isHealthy=false
Publisher->>Publisher: sendHealthEventWithRetry()<br/>(max 5 retries, exp backoff)
Publisher->>PlatformConn: HealthEventOccuredV1(gRPC)
PlatformConn->>MongoDB: Insert new HealthEvent<br/>(agent=health-events-analyzer)
PlatformConn->>K8sAPI: Add Node Condition<br/>(Type=ruleName, Status=True)
K8sAPI->>Quarantine: Watch: Node Condition Added
Quarantine->>K8sAPI: Cordon Node, Add Taint
end
end
end
Note over Analyzer: Emit Prometheus metrics:<br/>events_received_total<br/>rule_matched_total<br/>new_events_published_total
Additional Comments (1)
- tilt/Tiltfile, line 142 (link) logic: health-events-analyzer is missing from the resource_deps list - fake nodes will be created before the analyzer is ready, potentially causing startup race conditions during Tilt development
24 files reviewed, 16 comments
if idx, err := strconv.Atoi(parts[0]); err == nil && idx < len(event.EntitiesImpacted) {
	entity := event.EntitiesImpacted[idx]
	subField := strings.ToLower(parts[1])

	entityVal := reflect.ValueOf(entity).Elem()
logic: will panic if entity (a pointer) is nil - add nil check before calling .Elem()
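A small sketch of the suggested nil check around the reflection lookup, on illustrative types:

```go
// Sketch only: bail out instead of calling .Elem() on a nil pointer.
package parser

import "reflect"

// fieldByName returns the named field of a struct pointer, or false when the
// value is nil, not a pointer, or the field does not exist.
func fieldByName(entity interface{}, name string) (reflect.Value, bool) {
	v := reflect.ValueOf(entity)
	if !v.IsValid() || v.Kind() != reflect.Ptr || v.IsNil() {
		return reflect.Value{}, false
	}
	f := v.Elem().FieldByName(name)
	if !f.IsValid() {
		return reflect.Value{}, false
	}
	return f, true
}
```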
/ok to test bdcf2e4
Merging this branch will decrease overall coverage
Coverage by file
Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Changed unit test files
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The latest changes focus on cleanup and bug fixes across the health-events-analyzer module and test suite. The developer removed the outdated k8s.io/klog v1.0.0 dependency, fixing import conflicts with the v2 API now used throughout the codebase. In the publisher, manual struct field copying was replaced with proto.Clone() to ensure deep copies of protobuf messages and eliminate shallow-copy race conditions. Test infrastructure was improved by replacing the race-prone package-level GpuNodeName variable with context-based storage, ensuring safe data passing between test steps. Error handling in test cleanup was corrected to properly propagate failures instead of silently swallowing them. The Criteria field in TOML rule parsing was made mandatory by removing the omitempty tag, enforcing explicit declaration of filtering logic for each sequence step. Finally, a smoke test was updated to align healthy event error codes with their corresponding fatal events, and a typo (trailing tab in test data) was fixed. These changes collectively improve test reliability, dependency hygiene, and data safety in the analyzer's event publishing pipeline.
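For context, a hedged sketch of the proto.Clone pattern described above; the field names mentioned in the comments are assumptions about the generated HealthEvent type, not verified against the PR.

```go
// Sketch only: proto.Clone performs a deep copy, so mutating the clone cannot
// race with or corrupt the original event that other goroutines may still read.
package publisher

import "google.golang.org/protobuf/proto"

func transformForPublish(original proto.Message, ruleName string) proto.Message {
	clone := proto.Clone(original)
	// With the real generated type the clone would be asserted back and its
	// fields overridden, for example (names assumed):
	//   ev := clone.(*protos.HealthEvent)
	//   ev.Agent = "health-events-analyzer"
	//   ev.CheckName = ruleName
	//   ev.IsFatal, ev.IsHealthy = true, false
	_ = ruleName
	return clone
}
```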
PR Description Notes:
- All checkboxes in the PR description remain unchecked (type of change, components affected, testing status, and checklist items).
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-events-analyzer/go.mod | 5/5 | Removed obsolete k8s.io/klog v1.0.0 dependency to eliminate import conflicts with v2 |
| health-events-analyzer/go.sum | 5/5 | Cleaned up go.sum checksums for removed dependencies (klog v1, go-logr v0.1.0) |
| health-events-analyzer/pkg/publisher/publisher.go | 5/5 | Replaced manual struct copying with proto.Clone() for safe deep-copy of protobuf messages |
| health-events-analyzer/pkg/config/rules.go | 4/5 | Removed omitempty from Criteria field, making it mandatory in TOML rules (breaking change) |
| tests/multiple_fatal_event_test.go | 5/5 | Eliminated package-level mutable variable by storing GPU node name in test context, fixing race condition |
| tests/helpers/healthevent.go | 4/5 | Fixed error handling in TestCleanUp to properly propagate failures instead of silently ignoring them |
| tests/smoke_test.go | 3/5 | Updated healthy event error code to "145" but fatal event still sends "79", creating a mismatch |
| health-events-analyzer/pkg/reconciler/reconciler_test.go | 5/5 | Fixed typo by removing trailing tab character from "GPU" in test rule criteria |
| distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml | 3/5 | Renamed rule from RepeatedXid13 to RepeatedXidError and added MultipleFatalError rule with faultremediated=true criteria |
Confidence score: 3/5
- This PR contains several quality improvements but introduces a breaking change and leaves a test inconsistency unresolved.
- Score reflects three concerns: (1) removing `omitempty` from the `Criteria` field breaks backward compatibility with existing TOML rules that omit criteria, (2) the smoke test still has a mismatch where the fatal event sends XID "79" but expects error code "145" in assertions, and (3) the MultipleFatalError rule checks `faultremediated=true`, which counts successful remediations rather than active faults and may not align with the rule's stated purpose.
- Pay close attention to health-events-analyzer/pkg/config/rules.go (breaking TOML schema change), tests/smoke_test.go (error code mismatch between lines 304 and 376 vs the assertion at line 337), and distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml (verify the `faultremediated=true` logic matches the intended alert behavior).
Additional Comments (1)
- health-events-analyzer/pkg/config/rules.go, line 24 (link) style: removing `omitempty` from the `criteria` field means rules must always define criteria, even if empty. If any existing rules in production have `sequence` steps without criteria fields, TOML parsing will fail or inject an empty map. Verify that all deployed rule configs explicitly include `criteria` in every sequence step, or add validation to reject rules with nil criteria maps. Are there any existing rules in production where sequence steps intentionally omit the criteria field, relying on the omitempty behavior?
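A sketch of what the stricter schema plus an explicit validation pass could look like; the type and field names mirror the discussion (SequenceStep, Criteria) but are illustrative rather than the PR's exact definitions.

```go
// Sketch only: make missing criteria a load-time error instead of a silent
// match-everything rule.
package config

import (
	"errors"
	"fmt"
)

type SequenceStep struct {
	ErrorCount int               `toml:"errorCount"`
	Criteria   map[string]string `toml:"criteria"` // no omitempty: every step must declare it
}

type Rule struct {
	Name     string         `toml:"name"`
	Sequence []SequenceStep `toml:"sequence"`
}

// Validate rejects rules whose sequence steps omit criteria entirely, so a
// misconfigured TOML file fails fast at startup.
func (r Rule) Validate() error {
	if len(r.Sequence) == 0 {
		return errors.New("rule has no sequence steps")
	}
	for i, step := range r.Sequence {
		if step.Criteria == nil {
			return fmt.Errorf("rule %q: sequence step %d has no criteria", r.Name, i)
		}
	}
	return nil
}
```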
12 files reviewed, 3 comments
4fbe7be to 85ab0e9
Greptile Overview
Greptile Summary
This PR implements a comprehensive multi-fatal-error detection system for the health-events-analyzer. The core change shifts the analyzer from processing only non-fatal events to evaluating all health events (fatal and non-fatal) from monitors, applying TOML-defined rules that can aggregate patterns across multiple events. The MongoDB change-stream filter now excludes only analyzer-published events (agent != "health-events-analyzer") rather than all fatal events, allowing the analyzer to detect patterns like "5 repeated XID 13 errors on a single node" or "5 remediated faults indicating recurring failures." A new Parser module dynamically resolves event fields into BSON queries using reflection, enabling rules to reference the current event's fields (e.g., "this.healthevent.nodename"). The reconciler was refactored to evaluate all rules per event (rather than early-returning after the first match) using HashiCorp's go-multierror library to accumulate results. Test infrastructure was expanded with MongoDB connectivity, parameterized error-code injection, and two new end-to-end tests (multiple_fatal_event_test.go, repeated_xid_rule_test.go) that validate rule evaluation, event publishing, node condition creation, and database persistence. Configuration changes enable the analyzer in the Tilt development environment.
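To make the "this.<field>" idea concrete, here is a simplified sketch of resolving such references into a BSON filter; the placeholder syntax, field set, and types are assumptions for illustration, while the PR's Parser does this with reflection over the real event types.

```go
// Sketch only: substitute "this.*" references from the incoming event before
// running the aggregation query.
package parser

import (
	"fmt"
	"strings"

	"go.mongodb.org/mongo-driver/bson"
)

type HealthEvent struct {
	NodeName string
	Agent    string
}

func resolve(event HealthEvent, ref string) (string, error) {
	switch strings.ToLower(ref) {
	case "this.healthevent.nodename":
		return event.NodeName, nil
	case "this.healthevent.agent":
		return event.Agent, nil
	default:
		return "", fmt.Errorf("unknown event reference %q", ref)
	}
}

// buildFilter turns criteria such as
// {"healthevent.nodename": "this.healthevent.nodename"} into a concrete query
// scoped to the node that produced the current event.
func buildFilter(event HealthEvent, criteria map[string]string) (bson.M, error) {
	filter := bson.M{}
	for field, val := range criteria {
		if strings.HasPrefix(val, "this.") {
			resolved, err := resolve(event, val)
			if err != nil {
				return nil, err
			}
			val = resolved
		}
		filter[field] = val
	}
	return filter, nil
}
```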
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-events-analyzer/pkg/reconciler/parser.go | 2/5 | Adds BSON query parser with reflection-based field lookups, but multiple nil pointer dereferences on .Elem() calls without nil checks will panic |
| health-events-analyzer/pkg/reconciler/reconciler.go | 2/5 | Refactors to multi-rule evaluation with go-multierror and goroutines, but has array access panic (line 240), potential nil multiErr dereference (line 167), and unbounded goroutine spawning |
| tests/smoke_test.go | 3/5 | Updates all event sends with explicit XID parameters, but line 304 sends XID "79" while line 337 expects "145" in the node condition message, likely causing test failure |
| health-events-analyzer/pkg/reconciler/reconciler_test.go | 3/5 | Expands test coverage for multi-fatal-error detection with new fixtures and rules, but mutates global state (line 306) and adds unnecessary BSON marshaling overhead in mock cursor |
| tests/helpers/healthevent.go | 4/5 | Adds MongoDB query support and TestCleanUp helper with TLS/X509 auth, but hardcoded 5s sleep and /tmp/mongo-certs paths may cause issues in different environments |
| tests/helpers/kube.go | 4/5 | Adds node-condition validation and cleanup helpers with hardcoded message formats, and increases global WaitInterval from 10s to 30s which slows all tests |
| tests/repeated_xid_rule_test.go | 4/5 | New test for RepeatedXid13 rule with event injection and MongoDB verification, but uses non-deterministic node selection and hardcoded 5s sleep instead of polling |
| tests/data/healthy-event.json | 4/5 | Adds errorCode field with "ERROR_CODE" placeholder and standardizes entity type, but placeholder won't match actual XIDs ("79", "145") without runtime substitution |
| health-events-analyzer/main.go | 5/5 | Changes MongoDB filter from excluding fatal events to excluding analyzer-published events, enabling multi-fatal-error pattern detection |
| distros/kubernetes/nvsentinel/values-tilt.yaml | 5/5 | Enables health-events-analyzer in Tilt development environment with clean boolean flag change |
| tests/scale_test.go | 5/5 | Updates SendHealthEventsToNodes calls to pass explicit error code "79" matching new function signature |
| health-events-analyzer/go.mod | 5/5 | Adds go-multierror v1.1.1 and errwrap v1.0.0 dependencies for multi-error handling with no version conflicts |
| health-events-analyzer/go.sum | 5/5 | Adds checksums for HashiCorp error handling dependencies with no security concerns |
| health-events-analyzer/pkg/reconciler/metrics.go | 5/5 | Renames/refactors Prometheus metrics from fatalEventsPublishedTotal to newEventsPublishedTotal and adds ruleMatchedTotal, databaseQueryDuration with standardized labels |
| health-events-analyzer/pkg/publisher/metrics.go | 5/5 | Makes FatalEventPublishingError metric unexported for proper encapsulation with no functional changes |
| tests/data/fatal-health-event.json | 5/5 | Converts fixture to parameterized template with "ERROR_CODE" placeholder and normalizes entity type to uppercase "GPU" |
| tests/go.mod | 5/5 | Adds mongo-driver v1.17.4 and promotes k8s.io/client-go to direct dependency for MongoDB integration in tests |
| tests/go.sum | 5/5 | Adds checksums for MongoDB driver and transitive dependencies with standard, well-maintained packages |
| tests/multiple_fatal_event_test.go | 0/5 | New test file for multi-fatal-error rule validation (contents not analyzed) |
| health-events-analyzer/pkg/publisher/publisher.go | 0/5 | Publisher implementation changes (contents not analyzed) |
| tilt/Tiltfile | 0/5 | Tilt development configuration changes (contents not analyzed) |
| distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml | 0/5 | Helm chart values changes (contents not analyzed) |
| tests/Makefile | 0/5 | Test makefile changes (contents not analyzed) |
Confidence score: 2/5
- This PR requires careful review before merging due to critical runtime panics in the parser and reconciler, plus test infrastructure issues that will cause failures
- Score reflects multiple nil pointer dereferences in parser.go (lines 106-116, 177-192), array access without bounds check in reconciler.go (line 240), potential nil multiErr dereference (line 167), unbounded goroutine spawning, and test mismatches (smoke_test.go lines 304/337 XID discrepancy)
- Pay close attention to `health-events-analyzer/pkg/reconciler/parser.go` and `health-events-analyzer/pkg/reconciler/reconciler.go`, which contain panics that will crash the analyzer pod when triggered by nil event fields or empty ErrorCode arrays
Sequence Diagram
sequenceDiagram
participant User
participant HealthClient as Health Event Client
participant PlatformConnector as Platform Connector
participant MongoDB
participant Analyzer as Health Events Analyzer
participant Publisher as Event Publisher
participant K8s as Kubernetes API
User->>HealthClient: Send health event (XID error)
HealthClient->>PlatformConnector: HealthEventOccuredV1(event)
PlatformConnector->>MongoDB: Insert health event
PlatformConnector->>K8s: Update node condition
MongoDB->>Analyzer: Change stream notification (insert event)
Analyzer->>Analyzer: processEvent(event)
Analyzer->>Analyzer: handleEvent - evaluate rules
alt Rule: MultipleFatalError or RepeatedXidError
Analyzer->>Analyzer: validateAllSequenceCriteria
Analyzer->>MongoDB: Aggregate query (facets for sequence criteria)
MongoDB-->>Analyzer: Query results
alt Criteria matched (e.g., 5 remediated faults)
Analyzer->>Publisher: Publish(event, recommendedAction, ruleName)
Publisher->>Publisher: Clone event and override fields
Note over Publisher: agent="health-events-analyzer"<br/>checkName=ruleName<br/>isHealthy=false, isFatal=true
Publisher->>PlatformConnector: HealthEventOccuredV1(transformed event)
PlatformConnector->>MongoDB: Insert new fatal event
PlatformConnector->>K8s: Update node condition (e.g., "MultipleFatalError")
else Criteria not matched
Analyzer->>Analyzer: Log: rule not matched
end
end
Analyzer->>MongoDB: Mark event as processed (update resume token)
Additional Comments (1)
- tilt/Tiltfile, lines 141-143 (link) logic: fake nodes now depend on `nvsentinel-mongodb` but not `nvsentinel-health-events-analyzer`; if the analyzer publishes events that affect node state (e.g., quarantine), this dependency list may be incomplete and cause race conditions during stack startup
23 files reviewed, 13 comments
tests/Makefile
Outdated
test-ci:
	nohup kubectl port-forward -n monitoring svc/prometheus-operator-kube-p-prometheus 9090:9090 > /dev/null 2>&1 &
	nohup kubectl port-forward -n nvsentinel svc/simple-health-client 8080:8080 > /dev/null 2>&1 &
	nohup kubectl port-forward -n nvsentinel pod/nvsentinel-mongodb-0 27017:27017 > /dev/null 2>&1 &
style: port-forward to a specific pod (nvsentinel-mongodb-0) will fail if the pod is rescheduled or renamed (e.g., after a StatefulSet restart). Consider using svc/nvsentinel-mongodb instead to target the service endpoint, which remains stable across pod restarts. Is there a specific reason to target the pod directly instead of the service? Does the service endpoint not expose port 27017?
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The most recent updates focus on improving test stability and consistency: the MongoDB port-forward now targets the stable headless service rather than a specific pod, preventing test failures during pod restarts; a cleanup step was added to remove temporary MongoDB certificates after test runs; and timeout constants for Kubernetes resource polling were reduced from 10 minutes to 1 minute to speed up test feedback loops. These changes complement the PR's main feature—enabling the health events analyzer to detect patterns across multiple different fatal error codes—by ensuring the test infrastructure is more reliable and faster, though the reduced timeouts may introduce flakiness in slower CI environments.
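Since these constants govern how long the test helpers poll for node conditions, a minimal sketch of the poll-with-timeout pattern (as opposed to fixed sleeps) is shown below; the constant values and helper name are assumptions, not the repository's actual code.

```go
// Sketch only: retry a condition until it is met or the timeout elapses, so
// slower CI environments fail late rather than flake on a hard-coded sleep.
package helpers

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

const (
	WaitInterval = 10 * time.Second
	WaitTimeout  = 1 * time.Minute // the value this review flags as possibly too aggressive
)

func waitForCondition(ctx context.Context, check func(ctx context.Context) (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, WaitInterval, WaitTimeout, true, check)
}
```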
PR Description Notes:
- The PR template sections (Type of Change, Component(s) Affected, Testing, Checklist) are completely empty—none of the checkboxes are marked and no description is provided in the Summary field, making it difficult for reviewers to understand the scope and intent of this change without reading the code diff.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| tests/Makefile | 4/5 | Changed MongoDB port-forward from pod to headless service and added certificate cleanup to improve test stability |
| tests/helpers/kube.go | 3/5 | Reduced wait timeouts from 10min to 1min globally, which may cause flakiness in slower environments |
| health-events-analyzer/pkg/reconciler/reconciler.go | 4/5 | Removed XID pre-filter to enable cross-error-code pattern detection, increasing database load for every event |
| health-events-analyzer/pkg/reconciler/reconciler_test.go | 5/5 | Removed redundant unit test and fixed typo; no functional risk |
| tests/helpers/healthevent.go | 5/5 | Renamed parameter from PascalCase to camelCase for Go convention compliance; no functional change |
Confidence score: 3/5
- This PR includes critical test infrastructure improvements but introduces risk by globally reducing timeouts that affect all test helpers, which could cause intermittent failures in CI/CD environments that are slower than the developer's local setup.
- Score reflects concerns about the aggressive timeout reduction (10min → 1min) in `tests/helpers/kube.go`, which applies to all polling operations across the test suite, and the potential performance impact of removing the XID pre-filter in the reconciler, which now forces a MongoDB aggregation query for every event regardless of rule applicability.
- Pay close attention to `tests/helpers/kube.go` (timeout constants may be too aggressive for slower CI runners) and `health-events-analyzer/pkg/reconciler/reconciler.go` (removal of the pre-filter will increase MongoDB query load; monitor database metrics after deployment).
5 files reviewed, no comments
/ok to test 3800b87
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR.
The most recent changes address previous review feedback by moving test cleanup logic from deferred functions inside assess steps to proper feature.Teardown callbacks in both repeated_xid_rule_test.go and multiple_fatal_event_test.go. The cleanup now runs reliably even when assess steps fail, and includes safety checks for missing context values. The test timeout in tests/helpers/kube.go was increased from 1 minute to 10 minutes to accommodate the longer-running multi-cycle fatal event tests, and informational logging was added to cleanup operations. However, a critical syntax error was introduced in health-events-analyzer/pkg/parser/parser.go: the package declaration remains package reconciler even though the file was moved to the parser/ directory, which will cause compilation to fail. Additionally, two new logic errors were added to the parser: accessing parts[1] without validating the slice length (lines 178-180) and dereferencing event.GeneratedTimestamp without a nil check (lines 182-184).
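A short sketch of the two guards the summary calls for, on illustrative types (the real GeneratedTimestamp is presumably a protobuf timestamp rather than *time.Time):

```go
// Sketch only: validate the split length before indexing parts[1], and nil-check
// the timestamp before dereferencing it.
package parser

import (
	"fmt"
	"strings"
	"time"
)

type HealthEvent struct {
	GeneratedTimestamp *time.Time
}

func subField(path string) (string, error) {
	parts := strings.SplitN(path, ".", 2)
	if len(parts) < 2 {
		return "", fmt.Errorf("field path %q has no sub-field component", path)
	}
	return parts[1], nil
}

func generatedTime(ev *HealthEvent) (time.Time, error) {
	if ev == nil || ev.GeneratedTimestamp == nil {
		return time.Time{}, fmt.Errorf("event has no generated timestamp")
	}
	return *ev.GeneratedTimestamp, nil
}
```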
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-events-analyzer/pkg/parser/parser.go | 0/5 | Package declaration not updated after file move; will not compile. Also added array bounds and nil dereference bugs. |
| health-events-analyzer/pkg/reconciler/reconciler.go | 4/5 | Updated imports and method calls to use exported Parser API after package refactor; error wrapping improved. |
| tests/helpers/kube.go | 5/5 | Increased test timeout from 1 to 10 minutes to support multi-cycle tests; added informational logging. |
| tests/repeated_xid_rule_test.go | 5/5 | Moved cleanup from defer to feature.Teardown callback; introduced ERRORCODE_13 constant; removed hard-coded sleep. |
| tests/multiple_fatal_event_test.go | 4.5/5 | Moved cleanup from defer to feature.Teardown callback; introduced ERRORCODE_31 constant. Test still requires 6 errors despite PR description stating 5 cycles. |
| health-events-analyzer/pkg/publisher/publisher.go | N/A | No summary available (file marked undefined). |
Confidence score: 0/5
- This PR is not safe to merge and will cause immediate compilation failure in production.
- Score reflects a critical syntax error (`package reconciler` in `pkg/parser/parser.go` when it should be `package parser`) that will break the build, plus two new runtime panic risks (array out-of-bounds at lines 178-180 and nil dereference at lines 182-184 in parser.go) that were not present in the previous review. The previous negative-index bugs (lines 138-139, 165-169) also remain unresolved.
- Immediate attention required for `health-events-analyzer/pkg/parser/parser.go` to fix the package declaration and add bounds/nil checks before lines 178-184. After fixing the parser, carefully review `health-events-analyzer/pkg/reconciler/reconciler.go` to ensure all Parser method calls are correct. The test files are well-structured but cannot be validated until the parser compiles.
6 files reviewed, 2 comments
98bed98 to 5201183
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR.
The latest changes address previous review feedback by refactoring test structure and improving observability. The two test files (multiple_fatal_event_test.go and repeated_xid_rule_test.go) were updated to move cleanup logic from deferred functions within assess steps into dedicated feature.Teardown blocks, following e2e-framework best practices. Error code constants (ERRORCODE_13, ERRORCODE_31) were introduced to eliminate magic strings. The test helper kube.go was updated to increase the global WaitTimeout from 1 minute to 10 minutes (necessary for multi-cycle remediation flows) and add informational logging to the CleanupNodeConditionAndUncordon function. These changes improve test reliability, maintainability, and debuggability without altering the core test logic or the analyzer's functional behavior.
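For reference, a hedged sketch of the Teardown pattern with sigs.k8s.io/e2e-framework; the context key, node name, and cleanup call are illustrative stand-ins for the tests' real helpers.

```go
// Sketch only: cleanup lives in Teardown, which runs even when an Assess step
// fails, unlike a defer placed inside an Assess function.
package tests

import (
	"context"
	"testing"

	"sigs.k8s.io/e2e-framework/pkg/envconf"
	"sigs.k8s.io/e2e-framework/pkg/features"
)

type nodeNameKey struct{} // typed key avoids the context-key collisions flagged earlier

func multipleFatalErrorFeature() features.Feature {
	return features.New("multiple fatal error detection").
		Assess("inject fatal events", func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
			// ... pick a GPU node and send the parameterized events ...
			return context.WithValue(ctx, nodeNameKey{}, "example-gpu-node")
		}).
		Teardown(func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
			name, ok := ctx.Value(nodeNameKey{}).(string)
			if !ok || name == "" {
				t.Log("no node name recorded; skipping condition cleanup")
				return ctx
			}
			// helpers.CleanupNodeConditionAndUncordon(ctx, cfg, name, ...) would run here.
			return ctx
		}).
		Feature()
}
```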
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| tests/repeated_xid_rule_test.go | 5/5 | Refactored cleanup from deferred function to Teardown phase and introduced ERRORCODE_13 constant |
| tests/multiple_fatal_event_test.go | 4/5 | Moved cleanup to Teardown but introduced dependency on context value being set correctly in earlier assess steps |
| tests/helpers/kube.go | 4/5 | Increased global test timeout to 10 minutes and added logging to cleanup function |
Confidence score: 4/5
- This PR is safe to merge with low risk; changes are scoped to test structure improvements and observability enhancements.
- Score reflects solid refactoring that follows framework best practices, but deducts one point because `multiple_fatal_event_test.go` cleanup relies on context state from a previous assess step; if that step fails before storing the node name, cleanup will be skipped silently.
- Pay close attention to tests/multiple_fatal_event_test.go (lines 156-170): ensure the first assess step always sets `gpuNodeName` in context before any failure path, or add fallback logic to guarantee cleanup runs even if the context value is missing.
6 files reviewed, 1 comment
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
5201183 to d065e4f
/ok to test d065e4f
Signed-off-by: Tanisha goyal <[email protected]>
167309e to 8431c23
/ok to test 8431c23
Merging this branch will decrease overall coverage
Coverage by file
Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Greptile Overview
Greptile Summary
This review covers only the most recent changes made since the last review, not the entire PR. The latest updates include critical bug fixes to the health-events-analyzer reconciler and parser, modifications to deployment templates, and refinements to test infrastructure. Key changes address goroutine lifecycle management, context cancellation, duplicate environment variable issues, and template syntax errors that would prevent deployments from succeeding.
Recent Changes Overview:
The most recent changes focus on:
- Parser package fixes: Corrected package declaration from `reconciler` to `parser` (previously flagged)
- Error handling improvements: Removed double error wrapping in retry logic
- Test constants usage: Consolidated XID error codes to use declared constants throughout tests
- Fault-remediation deployment template: Fixed duplicate environment variable definitions (critical deployment blocker)
- Commons go.mod: Fixed invalid YAML v2 import path (`go.yaml.in` → `gopkg.in`)
- Docker Makefile: Removed most module targets, migrating to ko-based builds
- Reconciler refactor: Fixed goroutine lifecycle and context propagation issues
- Test refinements: Improved cleanup logic and error code consistency
The changes address previously identified issues (package naming, duplicate definitions, error wrapping) while introducing new infrastructure improvements. The most critical fix is in the fault-remediation deployment template where duplicate environment variables would cause Kubernetes to reject the manifest.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml | 1/5 | Critical: Contains duplicate environment variable definitions (lines 42-47 and 82-87) that will cause deployment failures; also updates image repository path |
| commons/go.mod | 1/5 | Critical syntax error on line 20: invalid import path go.yaml.in/yaml/v2 should be gopkg.in/yaml.v2 - will cause build failures |
| docker/Makefile | 1/5 | Removes 7+ critical module build targets from the Makefile; health-events-analyzer and other core modules no longer built via Docker, breaking CI unless ko migration is complete |
| health-events-analyzer/pkg/parser/parser.go | 2/5 | Multiple nil pointer dereference risks (lines 106, 169, 184) and index out-of-bounds panic in getValueFromGeneratedTimestamp; silent failure modes return nil instead of errors |
| platform-connectors/pkg/connectors/store/storeConnectorImpl.go | 2/5 | Critical defer ordering bug: defer cancel() executes before defer session.EndSession(cleanupCtx), canceling the cleanup context before it's used |
| janitor/pkg/csp/client.go | 2/5 | GCP and Azure clients returned as uninitialized struct pointers will cause runtime panics; inconsistent initialization pattern vs AWS/OCI |
| fault-remediation-module/pkg/reconciler/reconciler.go | 2/5 | Unreachable code in collection-client error path (lines 84-91); deferred watcher cleanup may be invoked on never-started watcher |
| health-events-analyzer/pkg/reconciler/reconciler.go | 3/5 | Unbounded goroutine spawning, removal of retry logic, and potential duplicate event publishing due to race with watcher.MarkProcessed |
| janitor/pkg/csp/kind.go | 3/5 | Three incorrect timeout checks use parent ctx.Err() instead of dockerCtx.Err(), causing timeouts to not be detected |
| janitor/pkg/csp/gcp.go | 3/5 | Regex parsing bug with empty capture-group validation; op.Status dereferenced without nil-check will panic on unexpected GCP API responses |
| janitor/pkg/csp/aws.go | 3/5 | Off-by-one error in provider ID parsing; 5-minute hardcoded cooldown tightly coupled to message format; AWS_REGION read without validation |
| health-events-analyzer/pkg/reconciler/reconciler_test.go | 3/5 | Global test fixtures mutated without reset causing order dependencies; mocks don't validate pipeline content allowing incorrect rule matching |
| commons/pkg/server/server.go | 3/5 | Resource leak if listener creation fails in TLS path; race condition where running flag may not reflect actual socket binding state |
| janitor/pkg/controller/rebootnode_controller_test.go | 3.5/5 | AfterEach hook will panic if testRebootNode deleted during test; race condition test modifies wrong object copy |
| tests/helpers/healthevent.go | 4/5 | MongoDB client recreated on every query (no pooling); X509 auth unconditionally applied may conflict with MONGODB_URI auth settings |
| tests/repeated_xid_rule_test.go | 4/5 | Rapid-fire event injection assumes immediate analyzer processing; non-deterministic node selection may cause flakiness |
| tests/multiple_fatal_event_test.go | 4/5 | Race condition: gpuNodeName stored in context after first assess starts, breaking cleanup if assess fails early; cleanup only targets error code 31 |
| fault-quarantine-module/main.go | 4/5 | Metrics server errors swallowed (returns nil); context cancellation from reconciler won't propagate to metrics server |
| platform-connectors/main.go | 3/5 | Metrics server always returns nil preventing context cancellation propagation; gRPC server still calls os.Exit bypassing shutdown; stopCh closed after cleanup starts |
| tests/uat/install-apps.sh | 3/5 | Empty array expansion "${extra_set_args[@]}" will fail for non-AWS CSPs; hardcoded 15m/20m timeouts may be insufficient |
| health-monitors/syslog-health-monitor/main.go | 3/5 | Flag initialization bug: nodeNameEnv default uses os.Getenv at package init; loadConfigWithRetry blocks shutdown for 15s without context check |
| health-monitors/csp-health-monitor/cmd/csp-health-monitor/main.go | 3/5 | Potential deadlock if monitor doesn't respect context; metrics server failures silently swallowed masking startup issues |
Confidence score: 2/5
- This PR contains multiple critical deployment-blocking bugs that will cause immediate failures if merged
- Score reflects severe issues: duplicate environment variables in Kubernetes manifests, invalid Go import paths, nil pointer dereferences, and broken build systems that will prevent successful deployment and operation
- Primary concerns: (1) fault-remediation deployment template has duplicate env vars that Kubernetes will reject, (2) commons/go.mod has invalid import path causing build failures, (3) docker/Makefile removes all critical module targets breaking CI, (4) multiple CSP clients have uninitialized pointers causing runtime panics, (5) server cleanup context ordering bug defeats the purpose of timeout protection (see the defer-ordering sketch below)
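A minimal sketch of the defer-ordering point (concern 5 above and the storeConnectorImpl.go comment in the list that follows): deferred calls run last-in-first-out, so cancel must be deferred first, i.e. executed last, for the cleanup context to still be valid when EndSession runs. Names are illustrative.

```go
// Sketch only: correct defer order for a cleanup context used by EndSession.
package store

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
)

func endSessionWithTimeout(sess mongo.Session) {
	cleanupCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()                    // deferred first, runs last
	defer sess.EndSession(cleanupCtx) // runs before cancel, while the context is still valid
}
```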
Additional Comments (92)
-
platform-connectors/Dockerfile, line 1 (link)logic: entire Dockerfile deleted but platform-connectors module is still listed in the changed files and appears in the PR context. If this module is still active, the build will fail. Verify that platform-connectors now uses
.ko.yamlexclusively (visible in changed files list) and thatdocker/Makefilehas removed any references to this Dockerfile. If platform-connectors is being removed entirely, confirm all dependent modules, Helm charts, and tests have been updated. Is the platform-connectors module switching from Docker builds to ko-based builds, or is the entire module being removed? Are there any remaining references to this Dockerfile in the build system or Helm charts? -
health-monitors/gpu-health-monitor/Tiltfile, line 16 (link)logic: image name change from
ghcr.io/nvidia/nvsentinel-gpu-health-monitortoghcr.io/nvidia/nvsentinel/gpu-health-monitor(dash replaced with slash) will cause deployment failures if image references in Helm charts,.ko.yaml, or other Tiltfiles aren't updated consistently. Have all image references in Helm values,.ko.yaml, other Tiltfiles, and CI workflows been updated to match this new image path structure? -
.github/actions/setup-ci-tools/action.yml, line 197-201 (link)logic: empty string check will fail after
-ztest passes, causing workflow to exit before printing the error context. Move the echo statements before the exit. -
tests/uat/aws/delete-eks-cluster.sh, line 102 (link)style: double blank line
-
tests/uat/aws/README.md, line 26 (link)logic: K8s version 1.34 doesn't exist yet - latest stable is 1.31. Will cause cluster creation to fail.
-
tests/uat/common.sh, line 17-19 (link)style: log function doesn't handle multi-line input or special characters correctly - if
$*contains backslashes or-n/-eflags they may be interpreted byecho. Useprintf '%s\n' "[$(date +'%Y-%m-%d %H:%M:%S')] $*"instead -
janitor/pkg/metrics/metrics.go, line 64-65 (link)style: Registering metrics in
init()(viaNewActionMetrics) can panic if this package is imported twice or if the metrics are already registered elsewhere. Consider usingprometheus.MustRegisterinside aninit()with a panic-recovery strategy, or checkRegisterreturn errors to avoid hard crashes at startup. -
commons/go.mod, line 20 (link)syntax: invalid import path
go.yaml.in/yaml/v2- should begopkg.in/yaml.v2 -
commons/pkg/logger/logger.go, line 56 (link)syntax: missing godoc parameter
withSource -
.github/actions/setup-build-env/action.yml, line 74 (link)style: missing newline at end of file
-
janitor/main.go, line 93-98 (link)logic: missing MetricsBindAddress in manager options - the
metricsAddrflag is parsed at line 77 but never applied, so Prometheus metrics will be disabled -
tests/uat/aws/gpu-operator-values.yaml, line 30 (link)logic:
node-feature-discovery.masternodeSelector targetsworkload-type: gpu, but NFD master typically runs on control-plane nodes to discover features across all workers. Placing it on GPU nodes means it may not schedule if GPU nodes are tainted/cordoned during fault quarantine. Should this targetworkload-type: systemlike the operator (line 25)? Is NFD master intentionally placed on GPU nodes for a specific architectural reason in this UAT setup, or should it run on system nodes to avoid scheduling issues during fault conditions? -
platform-connectors/pkg/connectors/store/storeConnectorImpl.go, line 228-230 (link)logic: defer ordering:
cancel()will fire beforesession.EndSession(), meaning cleanupCtx will be cancelled before EndSession tries to use it. Swap the defer statements sodefer cancel()comes first (will execute last). -
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml, line 41-87 (link)syntax: duplicate environment variable definitions -
MAINTENANCE_NAMESPACE,MAINTENANCE_VERSION, andMAINTENANCE_API_GROUPare set at lines 42-47 and again at lines 82-87 -
Makefile, line 583-591 (link)logic: identical targets - both ko-build and ko-publish call the same script without arguments. Should ko-publish pass a --publish flag or set KO_DOCKER_REPO? Does buildko.sh detect publish vs build mode via environment variables, or should ko-publish pass additional arguments?
-
health-monitors/csp-health-monitor/Tiltfile, line 17 (link)logic: image reference doesn't match the old reference at line 17: old was
ghcr.io/nvidia/nvsentinel-csp-health-monitor, new isghcr.io/nvidia/nvsentinel/csp-health-monitor. Verify that.ko.yaml, Helmvalues.yaml, and deployment manifests all use the updated path with the extra/nvsentinel/segment, or this image will fail to deploy. Have the corresponding Helm chart and.ko.yamlentries been updated to match this new image path? -
fault-remediation-module/main.go, line 289-296 (link)style: metrics server errors are swallowed - the goroutine logs the error but always returns nil, preventing graceful shutdown propagation. If metrics are critical, consider returning the error; if not, document why they're non-critical. Is the metrics server intentionally non-critical to operation? Should it block startup if it fails to bind the port, or only log errors during runtime?
-
fault-remediation-module/main.go, line 300 (link)style: if reconciler.Start returns an error immediately (e.g., initialization failure), the metrics server goroutine will continue running until gCtx is canceled by errgroup. Consider adding explicit cleanup or ensuring reconciler errors cause immediate context cancellation
-
janitor/pkg/controller/rebootnode_controller.go, line 233 (link)syntax: logger message format is incorrect - second parameter should be a key-value pair. Should be
"RebootNode", rebootNode.Name, "message", "not found, object assumed deleted" -
janitor/pkg/controller/test_utils.go, line 32-35 (link)style: fields are exported but never read outside the mock methods - should be unexported (lowercase) to prevent external access
-
fault-remediation-module/Tiltfile, line 17 (link)logic: image name changed from
ghcr.io/nvidia/nvsentinel-fault-remediation-moduletoghcr.io/nvidia/nvsentinel/fault-remediation-module(added slash before module name). This must match the image reference in the Helm chart and.ko.yaml, otherwise deployment will fail. Have the corresponding image references indistros/kubernetes/nvsentinel/charts/fault-remediation/values.yamland.ko.yamlbeen updated to use the new path with the slash? -
janitor/pkg/csp/client.go, line 50-54 (link)logic: GCP and Azure implementations return uninitialized struct pointers without error handling - these will panic when methods are called. Need to implement proper initialization functions like
NewGCPClientFromEnv()andNewAzureClientFromEnv()similar to AWS and OCI -
janitor/pkg/controller/suite_test.go, line 70 (link)logic: CRD path references
config/crd/basesbut janitor module uses Helm templates indistros/kubernetes/nvsentinel/charts/janitor/templates/. Theconfig/crd/basesdirectory doesn't exist in this repository structure. Tests will fail to find CRDs unless the path is corrected or the directory is created. Does the janitor module have aconfig/crd/basesdirectory, or should this point to the Helm template CRD files? -
janitor/pkg/controller/suite_test.go, line 71 (link)style:
ErrorIfCRDPathMissing: falsesilences errors when the CRD directory is missing, which masks misconfiguration. Tests may pass without actually loading the RebootNode CRD, causing unexpected behavior when testing reconciliation logic. -
labeler-module/main.go, line 98-106 (link)logic: metrics server goroutine swallows errors and always returns nil, which means if the labeler (line 108-110) fails first, g.Wait() won't return until the server goroutine also exits. Since srv.Serve blocks until context cancellation, a labeler error won't cause immediate shutdown unless gCtx is canceled by the labeler. Does labelerInstance.Run cancel gCtx on error, or should this use a separate cancelable context? Does labelerInstance.Run() cancel the errgroup context on error, or will the metrics server continue running after the labeler fails?
-
make/docker.mk, line 30-37 (link)style: indentation changed from tabs to spaces in docker buildx command arguments. This inconsistency with the rest of the Makefile may cause issues with Make's requirement for tabs. Verify that these are actual tab characters, not spaces.
-
nvsentinel-log-collector/Makefile, line 28 (link) style: cache reference still uses old naming `nvsentinel-buildcache:log-collector` without the `/nvsentinel/` path segment - this won't match the new image naming pattern and may cause cache misses or conflicts if cache keys are tied to image name patterns.
-
scripts/buildko.sh, line 26-31 (link) style: conditional platform flag will cause the build to fail if the `PLATFORMS` env var contains invalid values (e.g., trailing commas, unsupported architectures). Consider validating `PLATFORMS` or wrapping the ko build in a trap to log the error clearly.
-
DEVELOPMENT.md, line 499-537 (link) style: this documentation duplicates identical content that was already present in the file before the changes. The entire section on macOS/Docker Desktop socket requirements now appears twice.
-
DEVELOPMENT.md, line 1106-1108 (link)style: commented-out code left in documentation - line 1106 has "# Test the built images locally" followed by the actual command, then line 1107 repeats the same comment. One of these should be removed
-
tests/uat/run.sh, line 60 (link)style: trap only captures EXIT signal. Consider adding INT and TERM to handle Ctrl+C and kill during long-running cluster operations, otherwise cleanup may not run if the user interrupts the script
-
janitor/pkg/csp/gcp.go, line 57-61 (link)logic: empty name should not be an error - groupNames[i] will be empty for unnamed capture groups. The error should only be thrown if name is empty AND result is missing a required key after the loop
-
janitor/pkg/csp/gcp.go, line 43 (link)style: regex is compiled on every call to getNodeFields. Consider declaring this as a package-level variable (var providerIDRegex = regexp.MustCompile(...)) to avoid repeated compilation overhead
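A sketch of that suggestion - the pattern string and field names are illustrative, not the janitor's actual provider-ID format:

```go
package csp

import "regexp"

// Compiled once at package init instead of on every getNodeFields call.
var providerIDRegex = regexp.MustCompile(`^gce://(?P<project>[^/]+)/(?P<zone>[^/]+)/(?P<instance>[^/]+)$`)

func getNodeFields(providerID string) map[string]string {
	fields := map[string]string{}
	match := providerIDRegex.FindStringSubmatch(providerID)
	if match == nil {
		return fields
	}
	for i, name := range providerIDRegex.SubexpNames() {
		if i == 0 || name == "" {
			continue // skip the whole-match entry and unnamed capture groups
		}
		fields[name] = match[i]
	}
	return fields
}
```

Skipping unnamed groups inside the loop also avoids the empty-name false error mentioned in the previous comment.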
-
janitor/pkg/csp/gcp.go, line 133 (link)logic: dereferencing op.Status without nil check will panic if Status is nil. Validate op is non-nil and has a Status field before dereferencing
-
tests/uat/install-apps.sh, line 17 (link) logic: set -euo pipefail will cause helm repo add/update to abort the script if the repository is already added. Consider using `helm repo add ... || true` or checking existence first. Should the script be idempotent, allowing re-runs even if repos are already added?
-
tests/uat/install-apps.sh, line 129 (link) logic: if `extra_set_args` is empty (non-AWS CSP), bash will expand this as an empty string rather than an empty array, causing helm to fail with an argument parsing error. Wrap with a conditional: `${extra_set_args[@]+"${extra_set_args[@]}"}`
-
tests/uat/install-apps.sh, line 67 (link) style: missing cert-manager values file check - if `cert-manager-values.yaml` doesn't exist, helm will silently ignore it. Add file existence validation before each `helm upgrade` call.
-
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go, line 197-205 (link)logic: metrics server errors are swallowed by returning nil, preventing errgroup from propagating failures. If the server fails to start (e.g., port already in use), the service will continue running without metrics and no error will be logged to the caller. Should metrics server startup failures be fatal, or is silent degradation acceptable? If port binding fails, should the service abort or continue without observability?
-
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go, line 207-214 (link) logic: trigger engine goroutine returns nil unconditionally - if `engine.Start` panics or the engine terminates unexpectedly, errgroup won't detect failure. Should this capture and return errors from the engine? Does `engine.Start(gCtx)` return an error that should be propagated, or does it log errors internally and always succeed?
-
tests/uat/tests.sh, line 41 (link)logic: resetting elapsed to 0 after successful reboot will cause the uncordon wait loop (line 54) to never timeout - both loops share the same elapsed variable
-
node-drainer-module/main.go, line 137-157 (link) logic: shutdown triggered by criticalError calls stop() at line 145, but normal shutdown at line 139 does not. This asymmetry could cause the event watcher goroutine (line 99) to block indefinitely if the criticalError path is never taken and only gCtx.Done() fires. Should both paths call stop() to cancel ctx and unblock the event watcher?
-
node-drainer-module/main.go, line 159-169 (link)style: shutdown code duplicated in both branches (lines 147-156 and 160-169). Extract to a helper function to avoid drift if cleanup logic changes
-
janitor/pkg/csp/kind.go, line 90-95 (link) logic: checking `ctx.Err()` but using `dockerCtx` for the command - if the parent context times out, this check won't fire. Should check `dockerCtx.Err()` instead.
-
janitor/pkg/csp/kind.go, line 106-111 (link) logic: same issue: checking parent `ctx.Err()` but `dockerCtx` controls the command execution - timeout won't be detected correctly.
-
janitor/pkg/csp/kind.go, line 128-133 (link) logic: same issue: checking parent `ctx.Err()` but `dockerCtx` controls the command execution - timeout won't be detected correctly.
-
fault-quarantine-module/main.go, line 267-275 (link) logic: metrics server error suppression means context cancellation won't propagate - if `gCtx` is cancelled while the server is still starting/serving, the error will be logged but the goroutine will return nil, masking potential shutdown issues. Should the metrics server goroutine respect context cancellation and propagate it as an error, or is silent termination on context cancellation intentional?
-
fault-quarantine-module/main.go, line 278 (link)logic: if reconciler.Start returns an error, errgroup will cancel gCtx, but the metrics server goroutine (line 267-275) will ignore the cancellation and return nil. This means g.Wait() will only return the reconciler error, not both. Is this asymmetric error handling intentional? Should both goroutines propagate their errors (including context cancellation), or is it intentional that only the reconciler's error is returned to the caller?
-
docker/Makefile, line 22-31 (link) logic: removed almost all Go and Python modules from build definitions - modules like `platform-connectors`, `health-events-analyzer`, `fault-quarantine-module`, `labeler-module`, `node-drainer-module`, `fault-remediation-module`, and `janitor` are central to the PR's stated goal ("detect multiple fatal error through health event analyzer"). Removing them from the build system means they won't be built by CI or by local developers using `make build-all`, which will break CI pipelines and local development workflows. Were these modules intentionally migrated to a different build system (e.g., ko exclusively), or is this an accidental deletion?
-
docker/Makefile, line 143 (link) logic: changed image name filter from `nvsentinel-` to `nvsentinel/` - this breaks cleanup for images built with the old naming convention. If any images with the old pattern exist, they won't be removed.
-
scripts/ko-tilt-build.sh, line 35 (link) style: multi-arch build to ttl.sh with both linux/amd64 and linux/arm64 may be slower and unnecessary for local Tilt dev where most developers use a single architecture. Consider using `--platform=$(go env GOOS)/$(go env GOARCH)` for faster local iteration. Is building for both architectures during local Tilt development intentional, or would single-arch builds suffice?
-
scripts/ko-tilt-build.sh, line 35-38 (link) style: if `ko build` succeeds but `crane cp` fails (network issue, registry unavailable), the image remains on ttl.sh and the script exits with an error - the EXPECTED_REF won't be available, causing Tilt to fail. Consider adding error handling or a cleanup trap.
-
scripts/ko-tilt-build.sh, line 34 (link)style: TEMP_TAG uses timestamp + PID but concurrent invocations within the same second from the same parent process (e.g., parallel Tilt builds) could generate the same tag, causing conflicts. Consider adding $RANDOM or the module name to ensure uniqueness
-
commons/pkg/server/server_test.go, line 41 (link)style: type assertion without 'ok' check will panic if Addr() returns a non-TCPAddr (unlikely in practice, but good to be defensive in test helpers)
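A defensive version, sketched as a generic test helper (names are illustrative):

```go
package server_test

import (
	"net"
	"testing"
)

// listenerPort fails the test with a useful message instead of panicking
// when Addr() is not a *net.TCPAddr.
func listenerPort(t *testing.T, ln net.Listener) int {
	t.Helper()
	addr, ok := ln.Addr().(*net.TCPAddr)
	if !ok {
		t.Fatalf("expected *net.TCPAddr, got %T", ln.Addr())
	}
	return addr.Port
}
```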
-
janitor/pkg/csp/aws.go, line 124 (link)style: use logger.Error instead of fmt.Println for consistency - all other error paths use the structured logger
-
janitor/pkg/csp/aws.go, line 143-147 (link) logic: off-by-one: if the provider ID is exactly "aws:///us-west-2/i-xxx", Split produces 6 parts, not 5. `len(parts) < 5` allows 4-part IDs which would panic at `parts[4]`. Should be `< 6` or `!= 6`. Are there valid AWS provider ID formats that have fewer than 6 parts, or should this check enforce exactly 6 parts?
-
distros/kubernetes/nvsentinel/charts/janitor/templates/serviceaccount.yaml, line 20-49 (link) logic: conditional logic blocks for CSP annotations are stacked sequentially, which will merge annotations from all providers if multiple provider blocks have truthy conditions (e.g., if `.Values.csp.provider` is "aws" but `.Values.csp.gcp.project` is also set). Since `eq (.Values.csp.provider | default "kind") "aws"` only checks the provider, but lines 27-32, 33-39, and 40-46 all independently check their own provider strings, a misconfigured values file could render multiple provider annotations. Should these be `else if` blocks in the Helm template logic, or does the chart validate at install time that only one CSP provider's configuration fields are populated?
-
fault-remediation-module/pkg/reconciler/reconciler.go, line 71-76 (link) logic: returning an error from `Start` means the reconciler will stop entirely on initialization failure, but the deferred watcher cleanup at lines 78-82 will not run since the watcher was just assigned and hasn't been successfully created yet. The `defer` is safe because `watcher` is non-nil at that point, but if `GetCollectionClient` fails at line 84, the watcher is started at line 93 after the error return path, which means the watcher is never started but the defer will still call `Close` on it. Should the defer be moved to after line 93 (watcher.Start), or should the collection client initialization happen before the watcher is created to ensure no partial state?
-
fault-remediation-module/pkg/reconciler/reconciler.go, line 84-91 (link) syntax: the error path logs and returns the error at lines 86-90, but the `return` statement is unreachable because it comes after the `return` at line 90. This is dead code.
-
janitor/pkg/csp/oci.go, line 90 (link) logic: `node.Spec.ProviderID` is a string (format: `oci://<ocid>`), but `InstanceId` expects just the OCID without the `oci://` prefix. This will cause API call failures. Extract the OCID using `strings.TrimPrefix(node.Spec.ProviderID, "oci://")` before passing it.
-
janitor/pkg/csp/oci.go, line 89-92 (link)style: API response is discarded, but it contains the instance state and request ID. If the instance is already rebooting or in a stopped state, this call may succeed but the reboot may not happen. Consider validating the response state or returning the request ID for tracking. Should the function verify that the instance was in a 'RUNNING' state before the reboot, or is it acceptable to silently succeed even if the instance is already stopped/terminated?
-
janitor/pkg/csp/oci.go, line 112-113 (link) logic: always returns `false` before 5 minutes have elapsed, then unconditionally returns `true` after 5 minutes without checking the actual instance state. This assumes the node is always ready after 5 minutes, which may not be true if the reboot hangs or the instance enters a failed state. Should this query the OCI API to check the instance's actual lifecycle state, or is the hardcoded 5-minute delay intended as a placeholder for a future API check?
-
commons/pkg/server/server.go, line 462 (link) logic: `listener.Close()` is only called in the TLS error path, but if `net.Listen` fails at line 475 in the non-TLS path, or if `net.Listen` succeeds at line 454 but the function returns before wrapping with TLS (line 471), the listener will leak. Add deferred cleanup or ensure all early returns close the listener.
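One way to keep the error paths symmetric, sketched with illustrative names rather than the actual server.go structure:

```go
package server

import (
	"crypto/tls"
	"fmt"
	"net"
)

// newListener closes the raw listener on every early-return error path.
func newListener(addr string, tlsCfg *tls.Config) (net.Listener, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, fmt.Errorf("listen %s: %w", addr, err)
	}
	if tlsCfg == nil {
		return ln, nil
	}
	if len(tlsCfg.Certificates) == 0 && tlsCfg.GetCertificate == nil {
		ln.Close() // don't leak the socket when the TLS config is unusable
		return nil, fmt.Errorf("TLS config has no certificates")
	}
	return tls.NewListener(ln, tlsCfg), nil
}
```
-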
commons/pkg/server/server.go, line 488-490 (link) logic: race condition: `running` is set to true inside the server goroutine, but if the goroutine hasn't started yet when `IsRunning()` is called externally, it will return false even though `Serve()` is executing and the socket is bound. Consider setting `running = true` before launching the goroutine.
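A sketch of the suggested ordering using an atomic flag; the field and method names are illustrative:

```go
package server

import (
	"net"
	"net/http"
	"sync/atomic"
)

type Server struct {
	running atomic.Bool
	ln      net.Listener
	handler http.Handler
}

// start publishes the running state before launching the serve goroutine, so
// IsRunning() can never report false while the socket is already bound.
func (s *Server) start() {
	s.running.Store(true)
	go func() {
		defer s.running.Store(false)
		_ = http.Serve(s.ln, s.handler)
	}()
}

func (s *Server) IsRunning() bool { return s.running.Load() }
```
-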
tests/uat/aws/create-eks-cluster.sh, line 120 (link)logic: wildcard filter 'rivate' is fragile and could match unintended route tables (e.g., 'derivate-table'). Consider a more precise filter like 'tag:aws:cloudformation:logical-id=Private' or query by explicit association with private subnets
-
janitor/pkg/csp/azure.go, line 99 (link)style: using fmt.Println instead of logger for consistency - should use logger.Error(err, "Error parsing time")
-
janitor/pkg/csp/azure.go, line 86 (link)style: error already logged on previous line, this logs it twice. Remove the Sprintf formatting and just pass the error.
-
janitor/pkg/csp/azure.go, line 132 (link)style: error already logged via logger.Error, no need to include it in Sprintf message again
-
janitor/pkg/csp/azure.go, line 138 (link)logic: dereferencing status.Code without nil check will panic if Code is nil
-
janitor/pkg/controller/rebootnode_controller_test.go, line 169-174 (link) logic: `AfterEach` will fail if any test deletes the `testRebootNode` object, since `k8sClient.Get()` will return an error. Tests that result in failures or deletions will cause this hook to panic instead of validating conditions. Should this hook skip validation if the RebootNode no longer exists, or is deletion never expected in these tests?
-
janitor/pkg/controller/rebootnode_controller_test.go, line 476-477 (link) logic: creating `freshRebootNode` directly in the fake client but never updating it after calling `SetInitialConditions()` and `SetStartTime()` on lines 480-481. The in-memory object is modified but the fake client's stored copy is stale. Subsequent assertions on lines 485-502 pass because they check the in-memory object, but if the test fetched it from the client again, the status would be empty.
-
janitor/go.mod, line 8-13 (link)style: all five CSP SDKs (GCP, Azure, AWS, OCI) are now direct dependencies. This increases attack surface and binary size. If the janitor only interacts with one CSP per deployment, consider using build tags to conditionally compile each provider
-
janitor/go.mod, line 113 (link) syntax: duplicate `gopkg.in/yaml.v3` entries at lines 94 and 113 suggest a transitive dependency conflict or an incorrect merge. Run `go mod tidy` to deduplicate.
-
.github/workflows/helm-e2e-test.yml, line 205 (link) logic: `github.ref` for a tag push will look like `refs/tags/v1.2.3`, not `v1.2.3`. Helm images typically expect a clean tag name like `v1.2.3`. This will cause the validation script to look for images with the full ref path as the tag, which won't exist in the registry.
-
.github/workflows/helm-e2e-test.yml, line 233 (link) logic: same issue: `github.ref` includes the `refs/tags/` prefix, so this will set `global.image.tag=refs/tags/v1.2.3` instead of `v1.2.3`. Helm will try to pull images with the wrong tag.
-
.github/workflows/helm-e2e-test.yml, line 240 (link) logic: same issue: validation will search for images tagged `refs/tags/v1.2.3` instead of `v1.2.3`.
-
.github/workflows/helm-e2e-test.yml, line 267 (link)logic: same issue in debug artifacts collection
-
health-events-analyzer/pkg/reconciler/reconciler_test.go, line 236-277 (link) logic: the test name is "match multiple fatal error rule" but the mock returns `ruleMatched: true` for all Aggregate calls without distinguishing which rule is being evaluated. If the reconciler evaluates multiple rules, the mock will return true for each, causing false positives. Does this test intend to verify that rule1 specifically is matched (requiring the mock to inspect the pipeline criteria), or only that any matching rule triggers publish?
-
.github/workflows/publish.yml, line 169 (link)logic: comment says "Once validated, this will replace the container-publish job above" but both jobs run concurrently. This will double-build gpu-health-monitor and syslog-health-monitor images (once via Docker matrix, once via ko), pushing two different images with potentially conflicting tags to the same registry paths. Should the ko build-images job only build the components that were removed from the container-publish matrix (the 8 Go modules), or is the intent to eventually replace all Docker builds? Running both jobs for the same components will publish conflicting images.
-
.github/workflows/publish.yml, line 109-123 (link)logic: container_name changed from 'nvsentinel-<component>' to 'nvsentinel/<component>' for all entries. This changes the published image path in the registry (e.g., ghcr.io/nvidia/nvsentinel-gpu-health-monitor → ghcr.io/nvidia/nvsentinel/gpu-health-monitor). Existing deployments and Helm charts that reference the old path will fail to pull images after this change.
-
health-events-analyzer/pkg/parser/parser.go, line 82-92 (link) style: len check is insufficient - if `parts[0]` exists but is empty, the comparison will succeed yet `parts[1:]` may be empty or out of bounds, leading to a nil dereference in the reflection helpers.
-
platform-connectors/main.go, line 268-276 (link) logic: metrics server swallows all errors including context cancellation. If `srv.Serve(gCtx)` returns on context cancel, the `slog.Error` will log misleadingly. Return the error from `srv.Serve` so `g.Wait()` can distinguish between normal shutdown and actual failures. Should the metrics server error be propagated to cause overall shutdown, or is it acceptable for the platform-connectors to run without metrics/health endpoints?
-
platform-connectors/main.go, line 278-306 (link) logic: signal handler goroutine will never receive from `gCtx.Done()` unless the first goroutine (metrics server) fails AND returns a non-nil error (which it never does - line 275 always returns nil). This means SIGINT/SIGTERM is the only shutdown path. If the metrics server's context is cancelled, the signal handler won't know.
-
platform-connectors/main.go, line 295 (link) logic: closing `stopCh` after starting cleanup creates a race - `k8sConnector.FetchAndProcessHealthMetric` (line 104) and `storeConnector.FetchAndProcessHealthMetric` (line 121) may still be reading from `stopCh`. Close it before cleanup to give them time to wind down.
-
platform-connectors/main.go, line 303 (link) style: calling `cancel()` after cleanup means the connectors' contexts remain live during shutdown. Move this before `cleanupResources` so the connectors see cancellation immediately.
-
platform-connectors/main.go, line 141-147 (link) logic: grpcServer.Serve goroutine exits the entire process on error (line 145), bypassing graceful shutdown. Consider returning the error to the caller or using `errgroup` to coordinate shutdown with the metrics server.
-
health-events-analyzer/pkg/parser/parser.go, line 133-144 (link) style: the bounds check on line 139 covers a negative index but does not distinguish `err != nil` from out-of-bounds. If `parts[0]` is non-numeric, `strconv.Atoi` returns an error and the function returns nil silently. Consider logging or returning an error for invalid index strings. Should non-numeric array indices be logged as warnings or treated as valid "field not found" scenarios?
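A sketch of the stricter handling; `elementAt` is an illustrative helper, not one of the module's existing functions:

```go
package parser

import (
	"fmt"
	"strconv"
)

// elementAt distinguishes a malformed index from an out-of-range one instead
// of silently returning nil.
func elementAt(values []string, idxStr string) (string, error) {
	idx, err := strconv.Atoi(idxStr)
	if err != nil {
		return "", fmt.Errorf("invalid array index %q: %w", idxStr, err)
	}
	if idx < 0 || idx >= len(values) {
		return "", fmt.Errorf("index %d out of range (len %d)", idx, len(values))
	}
	return values[idx], nil
}
```
-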
tests/multiple_fatal_event_test.go, line 165 (link) logic: cleanup passes error code "31" to `TestCleanUp`, but lines 65-105 injected codes "13" and "48" five times. If those events persist in MongoDB or as node conditions, cleanup may fail to remove them because it only targets "31". Does TestCleanUp need to clear healthy events for error codes 13 and 48 in addition to 31, or does clearing 31 and the condition "MultipleFatalError" sufficiently reset the node state?
-
health-monitors/syslog-health-monitor/main.go, line 58 (link) logic: `flag.String` with `os.Getenv()` as the default captures the env value at package init, before flag parsing. If NODE_NAME is unset during startup, flag.Parse() will never update it. Set the default to empty string and read the env inside `run()` after parsing.
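A sketch of the suggested ordering; the flag and function names are illustrative:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

var nodeNameFlag = flag.String("node-name", "", "node name (falls back to $NODE_NAME)")

func run() error {
	// Resolve the node name after flag.Parse: an explicit --node-name wins,
	// otherwise the env var is read at run time rather than at package init.
	nodeName := *nodeNameFlag
	if nodeName == "" {
		nodeName = os.Getenv("NODE_NAME")
	}
	if nodeName == "" {
		return fmt.Errorf("node name not provided via --node-name or NODE_NAME")
	}
	fmt.Println("running for node", nodeName)
	return nil
}

func main() {
	flag.Parse()
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```
-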
health-monitors/syslog-health-monitor/main.go, line 171-178 (link) logic: metrics server errors are swallowed (logged but returned as nil), causing errgroup to ignore server failures. If this is intentional, document why; otherwise remove the inner error handling and return the error directly so errgroup can propagate it. Should the service continue if the metrics server fails to bind (e.g., port conflict), or is that a fatal condition?
-
health-monitors/syslog-health-monitor/main.go, line 322 (link) logic: `conn.Connect()` is non-blocking, so the first `GetState()` may return Idle. The subsequent `WaitForStateChange` will block until the state changes from Idle, but if the connection fails immediately, the loop may return an error before the state transitions. Add a check after `WaitForStateChange` returns `false` to verify the failure reason. Should this function also return if the connection enters TransientFailure or Shutdown states, or only on Ready?
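A sketch of a state loop that also fails fast on Shutdown and on context cancellation; this is an assumed shape, not the monitor's actual connection code:

```go
package monitor

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// waitForReady blocks until the connection is Ready, and returns early on
// Shutdown or when the context is cancelled instead of looping forever.
func waitForReady(ctx context.Context, conn *grpc.ClientConn) error {
	for {
		state := conn.GetState()
		switch state {
		case connectivity.Ready:
			return nil
		case connectivity.Shutdown:
			return fmt.Errorf("connection shut down before becoming ready")
		}
		if !conn.WaitForStateChange(ctx, state) {
			return fmt.Errorf("context done while waiting for connection: %w", ctx.Err())
		}
	}
}
```
-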
health-monitors/syslog-health-monitor/main.go, line 349-378 (link) logic: the retry loop does not honor `ctx` cancellation. If the process is interrupted during retries (e.g., SIGTERM), this will block for up to `maxConfigRetries * attempt` seconds. Add a `select` on `ctx.Done()` before or after `time.Sleep`.
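A cancellable variant of the retry sleep; `maxConfigRetries` and the `load` callback are illustrative stand-ins for the actual config-loading code:

```go
package monitor

import (
	"context"
	"fmt"
	"time"
)

// loadWithRetry makes the inter-attempt sleep cancellable so SIGTERM (via ctx)
// interrupts the backoff immediately instead of blocking through every retry.
func loadWithRetry(ctx context.Context, maxConfigRetries int, load func() error) error {
	var lastErr error
	for attempt := 1; attempt <= maxConfigRetries; attempt++ {
		if lastErr = load(); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("retry aborted: %w", ctx.Err())
		case <-time.After(time.Duration(attempt) * time.Second):
		}
	}
	return fmt.Errorf("config load failed after %d attempts: %w", maxConfigRetries, lastErr)
}
```
-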
health-monitors/syslog-health-monitor/main.go, line 205-212 (link) style: exponential backoff resets the exponent to 2s on first failure instead of using the attempt count. If the first error happens at tick 0, the backoff will be 2s, 4s, 8s, 16s, 30s. If this is intentional (to avoid penalizing the first failure), document it. Should backoff start at `1s * attempt` or always begin at 2s on the first error?
148 files reviewed, 96 comments
| "recommendedAction": 15, | ||
| "errorCode": [ | ||
| "79" | ||
| "ERROR_CODE" |
logic: this placeholder won't match the actual error codes used in the new test files ("13", "48", "31" in repeated_xid_rule_test.go and multiple_fatal_event_test.go). Tests that depend on this template must replace "ERROR_CODE" with a concrete XID or the healthy events won't clear the corresponding fatal events
for attempt := 1; attempt <= maxRetries; attempt++ {
    node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
    if err != nil {
        return fmt.Errorf("failed to get node %s: %w", nodeName, err)
    }
logic: retry logic fetches the node fresh on each attempt, but the stale node object from the outer scope is used for the uncordon block (lines 672-676). If the first update succeeds but uncordon logic starts before the break at 659, it operates on stale data. Move the uncordon Get call inside its own retry block or restructure to avoid scope confusion
clientOpts := options.Client().ApplyURI(mongoURI).SetTLSConfig(tlsConfig)
// Use X509 auth by default
clientOpts.SetAuth(options.Credential{AuthMechanism: "MONGODB-X509", AuthSource: "$external"})
logic: X509 auth is hardcoded even when MONGODB_URI is provided via env var. If the URI already specifies auth mechanism, this will conflict. Should only set auth when not present in URI. Should X509 auth be applied unconditionally, or only when the URI doesn't already specify an auth mechanism?
if err != nil {
    klog.Errorf("failed to send healthy events during cleanup: %v", err)
}
time.Sleep(5 * time.Second)
style: 5-second sleep is arbitrary and may not be long enough for all environments. Consider polling for condition removal or making the delay configurable
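A generic polling helper along those lines; `waitFor` is a sketch, not an existing helper in this repo:

```go
package helpers

import (
	"context"
	"fmt"
	"time"
)

// waitFor polls check until it reports true or the timeout elapses,
// replacing the fixed 5-second sleep with a bounded, cancellable wait.
func waitFor(ctx context.Context, timeout, interval time.Duration, check func(context.Context) (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if ok, err := check(ctx); err == nil && ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
	return fmt.Errorf("condition not met within %s", timeout)
}
```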
}

// getValueFromErrorCode safely returns event.ErrorCode[index] if present.
func getValueFromErrorCode(event *platform_connectors.HealthEvent, parts []string) any {
Should we rename all the functions getValueFromX -> getX? Would that be more readable?
Signed-off-by: Tanisha goyal <[email protected]>
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The most recent changes consist of three minimal updates: (1) resolving a merge conflict in health-events-analyzer/go.mod by removing github.com/kr/text v0.2.0 // indirect while keeping github.com/hashicorp/errwrap v1.0.0 // indirect, (2) adding an explicit but unreachable return nil statement at line 104 in reconciler.go to satisfy Go's return-path requirements and removing a blank line at line 181, and (3) adding the new test file tests/repeated_xid_rule_test.go. The changes integrate with the broader PR objective of detecting repeated XID failures and multiple fatal errors: the go.mod conflict resolution ensures the correct transitive dependencies (errwrap is used by go-multierror for aggregated error handling in rule evaluation), the reconciler return statement satisfies compiler requirements without changing runtime behavior (the event-processing loop blocks indefinitely), and the new test validates that after 5 repeated cycles of XID 13 followed by healthy events, the 6th occurrence triggers the RepeatedXid rule and applies a node-termination recovery action.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-events-analyzer/go.mod | 5/5 | Resolved merge conflict by removing github.com/kr/text and keeping github.com/hashicorp/errwrap // indirect |
| health-events-analyzer/pkg/reconciler/reconciler.go | 5/5 | Added unreachable return nil at line 104 and removed blank line at line 181 |
| tests/repeated_xid_rule_test.go | 3/5 | New test for repeated XID failure detection with 5-cycle remediation loop |
Confidence score: 3/5
- This PR includes only incremental changes since the last review; the new test file and minor reconciler update carry moderate risk due to timing dependencies and potential race conditions.
- Score reflects uncertainty about test reliability (hardcoded sleeps, randomized node selection, and potential ordering issues) and missing details on the new test file's implementation quality. The go.mod and reconciler changes are low-risk and safe.
- Pay close attention to `tests/repeated_xid_rule_test.go` - ensure it follows the same patterns as `multiple_fatal_event_test.go` reviewed previously, avoids hardcoded delays, properly cleans up test state, and does not introduce race conditions with `GpuNodeName`.
3 files reviewed, 1 comment
[[rules.sequence]]
criteria = { "healthevent.entitiesimpacted.0.entitytype" = "GPU", "healthevent.entitiesimpacted.0.entityvalue" = "this.healthevent.entitiesimpacted.0.entityvalue", "healthevent.errorcode.0" = "13", "healthevent.nodename" = "this.healthevent.nodename" }
errorCount = 3
criteria = { "healthevent.entitiesimpacted.0.entitytype" = "GPU","healthevent.entitiesimpacted.0.entityvalue" = "this.healthevent.entitiesimpacted.0.entityvalue","healthevent.errorcode.0" = "this.healthevent.errorcode.0","healthevent.nodename" = "this.healthevent.nodename","healthevent.checkname" = "{\"$ne\": \"HealthEventsAnalyzer\"}"}
Isn't "healthevent.checkname" = "{\"$ne\": \"HealthEventsAnalyzer\"}" needed for every rule, can we add it as a default in the code rather than having the rule specify?
I don't think so. We might not require it for multipleFatalErrorRule. In phase 2, when we will be taking actions on event-analyzer events, shouldn't we consider health-events-analyzer events as part of the remediation cycle if remediation was done?
And since we wanted to keep the query logic mostly configurable, it was added here instead of being kept in the code.
I don't think so, let's not do that; it will become recursive that way. To keep things simple we can only consider events that are not raised by the event-analyzer.
// getValueFromErrorCode safely returns event.ErrorCode[index] if present.
func getValueFromErrorCode(event *platform_connectors.HealthEvent, parts []string) any {
    if len(parts) == 0 {
When will this happen? Don't we already validate len(parts) == 0 in the caller? Can you please clarify?
}

func getValueFromHealthEvent(event *platform_connectors.HealthEvent, parts []string) any {
    if len(parts) == 0 {
When will parts be 0? If parts is 0, isn't it an error?
// getValueFromMetadata returns the value for a metadata key if it exists.
func getValueFromMetadata(event *platform_connectors.HealthEvent, parts []string) any {
    if len(parts) == 0 {
When will this happen? Don't we already validate len(parts) == 0 in the caller? Can you please clarify?
}

func getValueFromHealthEventStatus(event storeconnector.HealthEventStatus, parts []string) any {
    if len(parts) == 0 {
When will this happen? Don't we already validate len(parts) == 0 in the caller? Can you please clarify?
assert.NoError(t, err, "failed to create client")

// Check node condition for matched ruleset
helpers.WaitForNodeConditionWithCheckName(ctx, t, client, gpuNodeName, "MultipleFatalError")
Please check if the node condition was added; no need to poke into MongoDB.
feature.Assess("Inject multiple fatal errors", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
    gpuNodeName := ctx.Value(keyGpuNodeName).(string)
    if gpuNodeName == "" {
        t.Fatal("GPU node name not found in context - previous assess step may have failed")
How is this possible?
| "healthevent.checkname": "RepeatedXid13", | ||
| } | ||
|
|
||
| event, err := helpers.QueryHealthEventByFilter(ctx, filter) |
If we are already checking node condition, do we need this as well?
],
resource_deps=['cert-manager'],
)
k8s_resource(
Please don't connect to MongoDB in tests; rather, please check outcomes.
[[rules.sequence]]
criteria = { "healthevent.entitiesimpacted.0.entitytype" = "GPU", "healthevent.entitiesimpacted.0.entityvalue" = "this.healthevent.entitiesimpacted.0.entityvalue", "healthevent.errorcode.0" = "13", "healthevent.nodename" = "this.healthevent.nodename" }
errorCount = 3
criteria = { "healthevent.entitiesimpacted.0.entitytype" = "GPU","healthevent.entitiesimpacted.0.entityvalue" = "this.healthevent.entitiesimpacted.0.entityvalue","healthevent.errorcode.0" = "this.healthevent.errorcode.0","healthevent.nodename" = "this.healthevent.nodename","healthevent.checkname" = "{\"$ne\": \"HealthEventsAnalyzer\"}"}
Since XIDs can happen in bursts, how are we handling that scenario?
As we discussed on the call, we are not handling bursts of XIDs for now. I'll handle these changes in a new PR.
/ok to test 641b2f5
Summary
With this PR we are targeting the following requirements.
Repeated XID failures: We want to analyze the case where an XID keeps occurring even after its remediation has been applied for the nth time, and decide what the correct recovery action should be. For example: XID 120 has occurred and we have applied RESET_GPU (as recommended in column P), but if after applying this recovery action, say, 5 times XID 120 is still present, then we should opt for a different recovery action.
Multiple fatal errors: We want to analyze whether there have been multiple fatal errors on the same node within a 7-day window. If the node has gone through the recovery cycle (say 5 cycles) but fatal errors still occur, that means there is something wrong with the node and it needs to be submitted to the CSP for deeper analysis.
Type of Change
Component(s) Affected
Testing
Checklist
Testing