
Conversation


@tanishagoyal2 tanishagoyal2 commented Oct 26, 2025

Summary

With this PR we are targeting the following requirements.

  1. Repeated XID failures: We want to analyze what the correct recovery action should be when an XID keeps occurring even after its remediation has been applied n times. For example: XID 120 has occurred and we have applied RESET_GPU (as recommended in column P), but if XID 120 is still present after applying this recovery action, say, 5 times, then we should opt for a different recovery action.

  2. Multiple fatal errors: We want to analyze whether multiple fatal errors have occurred on the same node within a 7-day window. If the node has gone through the recovery cycle (say 5 cycles) and fatal errors still occur, something is wrong with the node and it needs to be submitted to the CSP for deeper analysis.
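
In both cases the trigger reduces to counting qualifying events inside a sliding time window. A minimal illustrative sketch of that idea (hypothetical names only, not the analyzer's actual API):

    package main

    import (
        "fmt"
        "time"
    )

    type event struct {
        xid  string
        seen time.Time
    }

    // countInWindow counts occurrences of xid within the window ending at now.
    func countInWindow(history []event, xid string, window time.Duration, now time.Time) int {
        n := 0
        for _, e := range history {
            if e.xid == xid && now.Sub(e.seen) <= window {
                n++
            }
        }
        return n
    }

    func main() {
        now := time.Now()
        var history []event
        // Simulate XID 120 recurring after each of five RESET_GPU attempts.
        for i := 1; i <= 5; i++ {
            history = append(history, event{xid: "120", seen: now.Add(-time.Duration(i) * time.Hour)})
        }
        if countInWindow(history, "120", 24*time.Hour, now) >= 5 {
            fmt.Println("XID 120 persists after 5 remediations; escalate to a different recovery action")
        }
    }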

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Health Monitors
  • Core Services
  • Fault Management
  • Documentation/CI
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Testing

  1. Tested on a GCP dev cluster
  2. Wrote unit tests for each case; all tests passed locally
  3. Wrote Tilt tests multiple_fatal_event_test.go and repeated_xid_rule_test.go


copy-pr-bot bot commented Oct 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR extends the health-events-analyzer to detect patterns of multiple fatal errors across time windows, rather than just single events. The core change refactors the analyzer's reconciler to process all matching rules per event (instead of stopping at the first match) and adds two new TOML rule types: one detecting any 5 remediated faults within 7 days regardless of error code, and another detecting 5 repeated instances of the same XID within 24 hours. The publisher now creates new health events tagged with the matched rule name as the CheckName and marks them as fatal with REPORT_ISSUE or CONTACT_SUPPORT actions. The MongoDB change-stream filter was changed from isfatal=false to agent != health-events-analyzer to allow processing of fatal events from other agents while avoiding infinite loops. New test helpers enable direct MongoDB queries and parameterized event injection via error-code placeholders. The changes span the health-events-analyzer's reconciler, publisher, config parser, and rule definitions, plus E2E tests, Tilt configuration, and test infrastructure.

PR Description Notes:

  • The PR description template was not filled out—no summary, component checkboxes, or test status information was provided.

Important Files Changed

Filename Score Overview
health-events-analyzer/pkg/config/rules.go 1/5 Added omitempty to SequenceStep.Criteria but file declares wrong package name (reconciler instead of config), causing import failures
health-events-analyzer/pkg/reconciler/reconciler.go 2/5 Refactored to process all rules per event with goroutines and multierror; unsafe array access on ErrorCode[0] without bounds check
health-events-analyzer/pkg/reconciler/parser.go 2/5 New reflection-based BSON query builder; nil-pointer risks on GeneratedTimestamp/EntitiesImpacted, no bounds check on negative indices
health-events-analyzer/main.go 3/5 Changed MongoDB filter from isfatal=false to agent!=health-events-analyzer; risk of infinite loop if published events re-enter stream
tilt/Tiltfile 3/5 Enabled health-events-analyzer in Tilt; kwok-fake-nodes resource_deps missing analyzer, may cause race condition during cluster bringup
health-events-analyzer/go.mod 3/5 Added go-multierror and klog v1.0.0; v1/v2 klog duplication creates maintenance burden and potential import confusion
tests/Makefile 3/5 Added MongoDB port-forward and cert extraction for CI; globally disables TLS verification (MONGODB_TLS_SKIP_VERIFY=true) which masks cert issues
tests/helpers/healthevent.go 3/5 Added errorCode param to SendHealthEventsToNodes (breaking signature change), MongoDB query helper, TestCleanUp; no connection timeout on MongoDB client
tests/helpers/kube.go 4/5 Added WaitForNodeConditionWithCheckName and CleanupNodeConditionAndUncordon; substring matching via Contains() may false-positive on non-unique names
tests/multiple_fatal_event_test.go 3/5 New E2E test for aggregated fatal errors; unused klog import, race on shared GpuNodeName variable, 85s of hard-coded sleeps, comment mismatch
tests/repeated_xid_rule_test.go 4/5 New E2E test for repeated XID detection; unused klog import, repetitive event injection (5 identical calls), unnecessary context indirection
health-events-analyzer/pkg/reconciler/reconciler_test.go 3/5 Expanded tests for multi-error detection; trailing tab in criteria string (line 99), global test data mutation (line 306) causes order dependency
health-events-analyzer/pkg/publisher/publisher.go 4/5 Refactored Publish to create new event with ruleName as CheckName; double metric increment on non-retryable errors (lines 77, 84)
health-events-analyzer/pkg/publisher/metrics.go 5/5 Renamed FatalEventPublishingError to unexported fatalEventPublishingError for better encapsulation; no external impact
health-events-analyzer/pkg/reconciler/metrics.go 5/5 Renamed entity_value label to node_name, broadened fatalEventsPublishedTotal to newEventsPublishedTotal, added ruleMatchedTotal and databaseQueryDuration metrics
distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml 4/5 Updated rules to detect 5 remediated faults in 7 days and any repeated XID with exclusion filter; relies on healtheventstatus.faultremediated schema field
distros/kubernetes/nvsentinel/values-tilt.yaml 5/5 Enabled health-events-analyzer in Tilt values; safe config change for local dev
tests/data/fatal-health-event.json 5/5 Changed XID from "79" to "ERROR_CODE" placeholder and normalized entityType to uppercase "GPU" for template reuse
tests/data/healthy-event.json 5/5 Added errorCode array field with placeholder "ERROR_CODE" and normalized entityType to uppercase "GPU"
tests/go.mod 5/5 Added MongoDB driver and Kubernetes client libraries as direct dependencies for E2E tests
tests/go.sum 5/5 Added checksums for MongoDB driver and transitive dependencies
health-events-analyzer/go.sum 4/5 Added checksums for go-multierror, errwrap, and older klog/go-logr versions; older transitive dependencies should be audited
tests/scale_test.go 4/5 Updated SendHealthEventsToNodes calls to pass explicit XID "79"; hardcoded value must match test data and rule config
tests/smoke_test.go 5/5 Minor changes (not detailed in provided context)

Confidence score: 2/5

  • This PR introduces significant architectural changes with multiple critical issues that could cause runtime failures, infinite loops, or import errors if merged without fixes.
  • Score reflects package declaration mismatch (rules.go), unsafe array indexing without bounds checks (reconciler.go line 257, parser.go line 165), nil-pointer risks in reflection code (parser.go lines 177-192), potential infinite event loop (main.go line 133), concurrency issues (unbounded goroutines, shared test variables), double metric increments, and breaking test helper signature changes.
  • Pay close attention to health-events-analyzer/pkg/config/rules.go (package name), health-events-analyzer/pkg/reconciler/reconciler.go (goroutine spawning, unsafe array access), health-events-analyzer/pkg/reconciler/parser.go (nil checks, array bounds), health-events-analyzer/main.go (event filter logic), tests/helpers/healthevent.go (signature breakage), and tests/Makefile (TLS verification bypass).

Sequence Diagram

sequenceDiagram
    participant User
    participant GPU Health Monitor
    participant Platform Connector
    participant MongoDB
    participant Health Events Analyzer
    participant Publisher
    participant Fault Quarantine
    participant Node

    User->>GPU Health Monitor: Fatal GPU error occurs (XID 13/48/etc)
    GPU Health Monitor->>Platform Connector: Send HealthEvent via gRPC
    Platform Connector->>MongoDB: Insert health event document
    Platform Connector->>Node: Update node condition
    
    MongoDB->>Health Events Analyzer: Change stream notification (insert event)
    Health Events Analyzer->>Health Events Analyzer: Filter: operationType=insert, agent≠health-events-analyzer, isHealthy=false
    
    Health Events Analyzer->>Health Events Analyzer: Parse TOML rules (MultipleFatalError, RepeatedXid13)
    Health Events Analyzer->>Health Events Analyzer: Check if rule applies to event XID
    
    alt Rule applies to event
        Health Events Analyzer->>MongoDB: Run aggregation pipeline with $facet
        MongoDB-->>Health Events Analyzer: Return sequence match results
        
        Health Events Analyzer->>Health Events Analyzer: Evaluate if all sequences met errorCount threshold
        
        alt All sequences matched
            Health Events Analyzer->>Publisher: Create new fatal event with agent="health-events-analyzer"
            Publisher->>Platform Connector: Publish transformed HealthEvent via gRPC (with retry)
            Platform Connector->>MongoDB: Insert new fatal event
            Platform Connector->>Node: Update node condition with rule name (MultipleFatalError/RepeatedXid13)
            
            Node->>Fault Quarantine: Detect new fatal condition
            Fault Quarantine->>Node: Cordon and quarantine node
        else Sequences not matched
            Health Events Analyzer->>Health Events Analyzer: Log "rule criteria didn't match"
        end
    else Rule doesn't apply to XID
        Health Events Analyzer->>Health Events Analyzer: Skip rule evaluation
    end

Additional Comments (2)

  1. health-events-analyzer/pkg/config/rules.go, line 15 (link)

    logic: package name is reconciler but file is in config/ directory. should this be package config to match the directory structure?

  2. tilt/Tiltfile, line 141-143 (link)

    logic: nvsentinel-health-events-analyzer is missing from the resource_deps list. Since the health-events-analyzer is now enabled (line 82) and watches MongoDB change streams, kwok-fake-nodes should wait for it to be ready.

24 files reviewed, 22 comments


healthEventWithStatus storeconnector.HealthEventWithStatus) bool {
// If no target XIDs specified, evaluate for all events
for _, seq := range rule.Sequence {
eventXID := healthEventWithStatus.HealthEvent.ErrorCode[0]

logic: array access without bounds check will panic if ErrorCode is empty. Add length validation before accessing index 0.
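
A sketch of the suggested guard (the helper name is hypothetical; assumes ErrorCode is a string slice, as the test fixtures suggest):

    // firstErrorCode safely extracts the first XID, returning false when the
    // slice is empty instead of panicking on index 0.
    func firstErrorCode(codes []string) (string, bool) {
        if len(codes) == 0 {
            return "", false
        }
        return codes[0], true
    }

    // Usage at the flagged site (sketch):
    //   xid, ok := firstErrorCode(healthEventWithStatus.HealthEvent.ErrorCode)
    //   if !ok {
    //       return false // nothing to match against; skip this rule
    //   }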


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR enables the health-events-analyzer module to detect long-term patterns of multiple fatal GPU errors, marking a major evolution from the previous XID-sequence-based detection. The feature introduces two new rule types: MultipleFatalError (5 remediated faults within 7 days) and RepeatedXid13 (5 occurrences of any fatal XID within 24 hours). The analyzer now evaluates all rules per event instead of stopping at the first match, using go-multierror to aggregate partial failures. Test fixtures were refactored to use template placeholders (ERROR_CODE, NODE_NAME) enabling parameterized multi-XID testing, and the test suite gained MongoDB integration via go.mongodb.org/mongo-driver to directly query stored events and validate pattern detection. The changes align with NVSentinel's architecture where the analyzer consumes MongoDB change-streams, applies TOML rule sets, and publishes synthetic fatal events with its own metadata (Agent: "health-events-analyzer", CheckName: ruleName) back to the platform-connector for downstream remediation.

Important Files Changed

Filename Score Overview
health-events-analyzer/pkg/reconciler/reconciler.go 2/5 Refactored to evaluate all rules per event with unbounded goroutine spawning and unprotected array access at line 257 that will panic if ErrorCode is empty
tests/smoke_test.go 2/5 Added error code parameter to SendHealthEventsToNodes, but TestFatalUnsupportedHealthEvent has a critical mismatch: it sends XID 79 but asserts ErrorCode:145
tests/helpers/kube.go 3/5 Added WaitForNodeConditionWithCheckName and CleanupNodeConditionAndUncordon helpers; cleanup logs but ignores all errors, always returning nil even if critical steps fail
tests/helpers/healthevent.go 3/5 Added MongoDB integration with getMongoCollection and QueryHealthEventByFilter; TLS verification can be disabled via env var, and TestCleanUp returns nil even if cleanup steps fail
distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml 3/5 Replaced short-window XID sequence rules with 7-day/24-hour pattern detection; MultipleFatalError checks faultremediated=true which matches resolved faults, not active failures
tests/repeated_xid_rule_test.go 3/5 New e2e test for RepeatedXid13 rule; unused klog import and context key collision risk with string variable used as context key instead of typed constant
health-events-analyzer/pkg/reconciler/reconciler_test.go 3/5 Added rule1 for multiple-fatal-error detection; typo on line 99 ("GPU\t" trailing tab) will cause production rule mismatch, and mutates global healthEvent_13 state
health-events-analyzer/pkg/reconciler/parser.go 4/5 New Parser struct extracts dynamic MongoDB query templating logic from reconciler; centralizes sequence string parsing with proper error handling
tests/multiple_fatal_event_test.go 4/5 New e2e test for MultipleFatalError rule; unused klog import and comment-code mismatch at line 122 (comment says XID 48, sends error code 31)
health-events-analyzer/pkg/publisher/publisher.go 4/5 Refactored Publish to create new HealthEvent with Agent=health-events-analyzer and CheckName=ruleName; shallow-copies slices/maps which could cause issues if original event is reused
tilt/Tiltfile 4/5 Enabled health-events-analyzer and added MongoDB port-forward for local dev testing; safe configuration change for development environment
tests/data/fatal-health-event.json 4/5 Replaced hardcoded error code 79 with ERROR_CODE placeholder for parameterized testing; tests must now substitute the placeholder or events will be rejected
health-events-analyzer/pkg/config/rules.go 5/5 Added omitempty to SequenceStep.Criteria field enabling rules to count total errors without filtering by specific properties; minimal, safe change
health-events-analyzer/go.mod 3/5 Added go-multierror for aggregating multiple errors; problematic addition of k8s.io/klog v1.0.0 alongside existing klog/v2 dependency creates import path conflict
tests/go.mod 5/5 Added mongodb driver and promoted k8s.io/client-go to direct dependency for new e2e tests; all versions align with Kubernetes 0.34.1 ecosystem
health-events-analyzer/pkg/reconciler/metrics.go 5/5 Renamed entity_value label to node_name, added rule_matched_total and databaseQueryDuration metrics; breaking change for existing Prometheus dashboards
distros/kubernetes/nvsentinel/values-tilt.yaml 5/5 Enabled health-events-analyzer in Tilt dev environment; safe, isolated configuration change for local testing
health-events-analyzer/pkg/publisher/metrics.go 4/5 Made FatalEventPublishingError metric unexported; proper encapsulation but breaking change for external packages accessing the variable directly
tests/scale_test.go 4/5 Added explicit error code parameter (79) to SendHealthEventsToNodes; hardcoded value may not match JSON data files
tests/data/healthy-event.json 3/5 Added errorCode field with ERROR_CODE placeholder; tests must substitute placeholder or events may not clear matching error codes
health-events-analyzer/go.sum 5/5 Added checksums for go-multierror and transitive dependencies; standard go.sum update with no direct issues
tests/go.sum 5/5 Added checksums for MongoDB driver and transitive dependencies; consistent with go.mod changes

Confidence score: 2/5

  • This PR introduces critical bugs (panic on empty ErrorCode array, unbounded goroutine spawning) and multiple test mismatches that will cause immediate failures
  • Score reflects unsafe concurrency patterns in production code, unprotected array access that will panic, test fixture mismatches between sent XIDs and assertions, and error-swallowing cleanup logic that masks test failures
  • Pay close attention to health-events-analyzer/pkg/reconciler/reconciler.go (array bounds check, goroutine limiting), tests/smoke_test.go (XID mismatch line 304/337), and tests/helpers/healthevent.go (error handling in cleanup)

Sequence Diagram

sequenceDiagram
    participant User
    participant HealthMonitor as Health Monitor<br/>(GPU/Syslog/CSP)
    participant SimpleClient as Simple Health Client
    participant PlatformConn as Platform Connector
    participant MongoDB
    participant Analyzer as Health Events Analyzer
    participant Publisher as Analyzer Publisher
    participant Quarantine as Fault Quarantine
    participant K8sAPI as Kubernetes API

    User->>SimpleClient: POST /health-event<br/>(fatal error, e.g. XID 13/48)
    SimpleClient->>PlatformConn: HealthEventOccuredV1(gRPC)
    PlatformConn->>MongoDB: Insert HealthEvent document
    PlatformConn->>K8sAPI: Add/Update Node Condition
    
    MongoDB->>Analyzer: Change Stream Event (insert)
    Note over Analyzer: Filter: operationType=insert<br/>agent≠health-events-analyzer<br/>isHealthy=false
    
    Analyzer->>Analyzer: Parse event through Parser<br/>Extract field values
    
    loop For each configured rule
        Analyzer->>Analyzer: shouldEvaluateRuleForEvent()<br/>Check if rule applies to XID
        
        alt Rule matches XID
            Analyzer->>MongoDB: Aggregate Query<br/>Build facet pipeline<br/>Check sequences in time window
            Note over MongoDB: $facet with multiple sequences<br/>Count matching docs per sequence<br/>Compare to errorCount threshold
            MongoDB->>Analyzer: Aggregation result<br/>{ruleMatched: true/false}
            
            alt All sequences satisfied
                Analyzer->>Publisher: Publish(ctx, event, action, ruleName)
                Note over Publisher: Transform event:<br/>agent="health-events-analyzer"<br/>checkName=ruleName<br/>isFatal=true, isHealthy=false
                Publisher->>Publisher: sendHealthEventWithRetry()<br/>(max 5 retries, exp backoff)
                Publisher->>PlatformConn: HealthEventOccuredV1(gRPC)
                PlatformConn->>MongoDB: Insert new HealthEvent<br/>(agent=health-events-analyzer)
                PlatformConn->>K8sAPI: Add Node Condition<br/>(Type=ruleName, Status=True)
                
                K8sAPI->>Quarantine: Watch: Node Condition Added
                Quarantine->>K8sAPI: Cordon Node, Add Taint
            end
        end
    end
    
    Note over Analyzer: Emit Prometheus metrics:<br/>events_received_total<br/>rule_matched_total<br/>new_events_published_total

Additional Comments (1)

  1. tilt/Tiltfile, line 142 (link)

    logic: health-events-analyzer is missing from the resource_deps list - fake nodes will be created before the analyzer is ready, potentially causing startup race conditions during Tilt development

24 files reviewed, 16 comments


Comment on lines 165 to 169
if idx, err := strconv.Atoi(parts[0]); err == nil && idx < len(event.EntitiesImpacted) {
entity := event.EntitiesImpacted[idx]
subField := strings.ToLower(parts[1])

entityVal := reflect.ValueOf(entity).Elem()

logic: will panic if entity (a pointer) is nil - add nil check before calling .Elem()
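
A sketch of the suggested guard (the helper name is hypothetical; assumes the reflect import already present in parser.go):

    // safeElem dereferences v only when it is a non-nil pointer, returning
    // false otherwise so the caller can skip the entity instead of panicking.
    func safeElem(v any) (reflect.Value, bool) {
        rv := reflect.ValueOf(v)
        if rv.Kind() != reflect.Pointer || rv.IsNil() {
            return reflect.Value{}, false
        }
        return rv.Elem(), true
    }

    // Usage at the flagged site (sketch):
    //   entityVal, ok := safeElem(entity)
    //   if !ok {
    //       // skip this entity (exact handling depends on the surrounding control flow)
    //   }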


dims commented Oct 26, 2025

/ok to test bdcf2e4

@github-actions

Merging this branch will decrease overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/health-events-analyzer 0.00% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/publisher 32.26% (-3.23%) 👎
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler 47.75% (-9.03%) 👎
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/health-events-analyzer/main.go 0.00% (ø) 182 0 182
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/config/rules.go 0.00% (ø) 7 0 7
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/publisher/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/publisher/publisher.go 32.26% (-3.23%) 62 20 (-2) 42 (+2) 👎
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/parser.go 36.59% (+36.59%) 82 (+82) 30 (+30) 52 (+52) 🌟
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/reconciler.go 54.29% (-2.49%) 140 (-15) 76 (-12) 64 (-3) 👎
github.com/nvidia/nvsentinel/tests/helpers/healthevent.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/tests/multiple_fatal_event_test.go
  • github.com/nvidia/nvsentinel/tests/repeated_xid_rule_test.go
  • github.com/nvidia/nvsentinel/tests/scale_test.go
  • github.com/nvidia/nvsentinel/tests/smoke_test.go


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest changes focus on cleanup and bug fixes across the health-events-analyzer module and test suite. The developer removed the outdated k8s.io/klog v1.0.0 dependency, fixing import conflicts with the v2 API now used throughout the codebase. In the publisher, manual struct field copying was replaced with proto.Clone() to ensure deep copies of protobuf messages and eliminate shallow-copy race conditions. Test infrastructure was improved by replacing the race-prone package-level GpuNodeName variable with context-based storage, ensuring safe data passing between test steps. Error handling in test cleanup was corrected to properly propagate failures instead of silently swallowing them. The Criteria field in TOML rule parsing was made mandatory by removing the omitempty tag, enforcing explicit declaration of filtering logic for each sequence step. Finally, a smoke test was updated to align healthy event error codes with their corresponding fatal events, and a typo (trailing tab in test data) was fixed. These changes collectively improve test reliability, dependency hygiene, and data safety in the analyzer's event publishing pipeline.

PR Description Notes:

  • All checkboxes in the PR description remain unchecked (type of change, components affected, testing status, and checklist items).

Important Files Changed

Filename Score Overview
health-events-analyzer/go.mod 5/5 Removed obsolete k8s.io/klog v1.0.0 dependency to eliminate import conflicts with v2
health-events-analyzer/go.sum 5/5 Cleaned up go.sum checksums for removed dependencies (klog v1, go-logr v0.1.0)
health-events-analyzer/pkg/publisher/publisher.go 5/5 Replaced manual struct copying with proto.Clone() for safe deep-copy of protobuf messages
health-events-analyzer/pkg/config/rules.go 4/5 Removed omitempty from Criteria field, making it mandatory in TOML rules (breaking change)
tests/multiple_fatal_event_test.go 5/5 Eliminated package-level mutable variable by storing GPU node name in test context, fixing race condition
tests/helpers/healthevent.go 4/5 Fixed error handling in TestCleanUp to properly propagate failures instead of silently ignoring them
tests/smoke_test.go 3/5 Updated healthy event error code to "145" but fatal event still sends "79", creating a mismatch
health-events-analyzer/pkg/reconciler/reconciler_test.go 5/5 Fixed typo by removing trailing tab character from "GPU" in test rule criteria
distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml 3/5 Renamed rule from RepeatedXid13 to RepeatedXidError and added MultipleFatalError rule with faultremediated=true criteria

Confidence score: 3/5

  • This PR contains several quality improvements but introduces a breaking change and leaves a test inconsistency unresolved.
  • Score reflects three concerns: (1) removing omitempty from the Criteria field breaks backward compatibility with existing TOML rules that omit criteria, (2) the smoke test still has a mismatch where the fatal event sends XID "79" but expects error code "145" in assertions, and (3) the MultipleFatalError rule checks faultremediated=true which counts successful remediations rather than active faults, which may not align with the rule's stated purpose.
  • Pay close attention to health-events-analyzer/pkg/config/rules.go (breaking TOML schema change), tests/smoke_test.go (error code mismatch between lines 304 and 376 vs assertion at line 337), and distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml (verify the faultremediated=true logic matches intended alert behavior).

Additional Comments (1)

  1. health-events-analyzer/pkg/config/rules.go, line 24 (link)

    style: removing omitempty from the criteria field means rules must always define criteria, even if empty. If any existing rules in production have sequence steps without criteria fields, TOML parsing will fail or inject an empty map. Verify that all deployed rule configs explicitly include criteria in every sequence step, or add validation to reject rules with nil criteria maps. Are there any existing rules in production where sequence steps intentionally omit the criteria field, relying on the omitempty behavior?
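
    A sketch of the suggested validation (illustrative; assumes Criteria is the step's map field and fmt is imported):

        // Validate rejects sequence steps whose criteria table was omitted,
        // surfacing the now-mandatory field as an explicit, early error.
        func (s SequenceStep) Validate() error {
            if s.Criteria == nil {
                return fmt.Errorf("sequence step is missing its criteria table")
            }
            return nil
        }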

12 files reviewed, 3 comments



@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR implements a comprehensive multi-fatal-error detection system for the health-events-analyzer. The core change shifts the analyzer from processing only non-fatal events to evaluating all health events (fatal and non-fatal) from monitors, applying TOML-defined rules that can aggregate patterns across multiple events. The MongoDB change-stream filter now excludes only analyzer-published events (agent != "health-events-analyzer") rather than all fatal events, allowing the analyzer to detect patterns like "5 repeated XID 13 errors on a single node" or "5 remediated faults indicating recurring failures." A new Parser module dynamically resolves event fields into BSON queries using reflection, enabling rules to reference the current event's fields (e.g., "this.healthevent.nodename"). The reconciler was refactored to evaluate all rules per event (rather than early-returning after the first match) using HashiCorp's go-multierror library to accumulate results. Test infrastructure was expanded with MongoDB connectivity, parameterized error-code injection, and two new end-to-end tests (multiple_fatal_event_test.go, repeated_xid_rule_test.go) that validate rule evaluation, event publishing, node condition creation, and database persistence. Configuration changes enable the analyzer in the Tilt development environment.

Important Files Changed

Filename Score Overview
health-events-analyzer/pkg/reconciler/parser.go 2/5 Adds BSON query parser with reflection-based field lookups, but multiple nil pointer dereferences on .Elem() calls without nil checks will panic
health-events-analyzer/pkg/reconciler/reconciler.go 2/5 Refactors to multi-rule evaluation with go-multierror and goroutines, but has array access panic (line 240), potential nil multiErr dereference (line 167), and unbounded goroutine spawning
tests/smoke_test.go 3/5 Updates all event sends with explicit XID parameters, but line 304 sends XID "79" while line 337 expects "145" in the node condition message, likely causing test failure
health-events-analyzer/pkg/reconciler/reconciler_test.go 3/5 Expands test coverage for multi-fatal-error detection with new fixtures and rules, but mutates global state (line 306) and adds unnecessary BSON marshaling overhead in mock cursor
tests/helpers/healthevent.go 4/5 Adds MongoDB query support and TestCleanUp helper with TLS/X509 auth, but hardcoded 5s sleep and /tmp/mongo-certs paths may cause issues in different environments
tests/helpers/kube.go 4/5 Adds node-condition validation and cleanup helpers with hardcoded message formats, and increases global WaitInterval from 10s to 30s which slows all tests
tests/repeated_xid_rule_test.go 4/5 New test for RepeatedXid13 rule with event injection and MongoDB verification, but uses non-deterministic node selection and hardcoded 5s sleep instead of polling
tests/data/healthy-event.json 4/5 Adds errorCode field with "ERROR_CODE" placeholder and standardizes entity type, but placeholder won't match actual XIDs ("79", "145") without runtime substitution
health-events-analyzer/main.go 5/5 Changes MongoDB filter from excluding fatal events to excluding analyzer-published events, enabling multi-fatal-error pattern detection
distros/kubernetes/nvsentinel/values-tilt.yaml 5/5 Enables health-events-analyzer in Tilt development environment with clean boolean flag change
tests/scale_test.go 5/5 Updates SendHealthEventsToNodes calls to pass explicit error code "79" matching new function signature
health-events-analyzer/go.mod 5/5 Adds go-multierror v1.1.1 and errwrap v1.0.0 dependencies for multi-error handling with no version conflicts
health-events-analyzer/go.sum 5/5 Adds checksums for HashiCorp error handling dependencies with no security concerns
health-events-analyzer/pkg/reconciler/metrics.go 5/5 Renames/refactors Prometheus metrics from fatalEventsPublishedTotal to newEventsPublishedTotal and adds ruleMatchedTotal, databaseQueryDuration with standardized labels
health-events-analyzer/pkg/publisher/metrics.go 5/5 Makes FatalEventPublishingError metric unexported for proper encapsulation with no functional changes
tests/data/fatal-health-event.json 5/5 Converts fixture to parameterized template with "ERROR_CODE" placeholder and normalizes entity type to uppercase "GPU"
tests/go.mod 5/5 Adds mongo-driver v1.17.4 and promotes k8s.io/client-go to direct dependency for MongoDB integration in tests
tests/go.sum 5/5 Adds checksums for MongoDB driver and transitive dependencies with standard, well-maintained packages
tests/multiple_fatal_event_test.go 0/5 New test file for multi-fatal-error rule validation (contents not analyzed)
health-events-analyzer/pkg/publisher/publisher.go 0/5 Publisher implementation changes (contents not analyzed)
tilt/Tiltfile 0/5 Tilt development configuration changes (contents not analyzed)
distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml 0/5 Helm chart values changes (contents not analyzed)
tests/Makefile 0/5 Test makefile changes (contents not analyzed)

Confidence score: 2/5

  • This PR requires careful review before merging due to critical runtime panics in the parser and reconciler, plus test infrastructure issues that will cause failures
  • Score reflects multiple nil pointer dereferences in parser.go (lines 106-116, 177-192), array access without bounds check in reconciler.go (line 240), potential nil multiErr dereference (line 167), unbounded goroutine spawning, and test mismatches (smoke_test.go lines 304/337 XID discrepancy)
  • Pay close attention to health-events-analyzer/pkg/reconciler/parser.go and health-events-analyzer/pkg/reconciler/reconciler.go which contain panics that will crash the analyzer pod when triggered by nil event fields or empty ErrorCode arrays

Sequence Diagram

sequenceDiagram
    participant User
    participant HealthClient as Health Event Client
    participant PlatformConnector as Platform Connector
    participant MongoDB
    participant Analyzer as Health Events Analyzer
    participant Publisher as Event Publisher
    participant K8s as Kubernetes API

    User->>HealthClient: Send health event (XID error)
    HealthClient->>PlatformConnector: HealthEventOccuredV1(event)
    PlatformConnector->>MongoDB: Insert health event
    PlatformConnector->>K8s: Update node condition
    
    MongoDB->>Analyzer: Change stream notification (insert event)
    Analyzer->>Analyzer: processEvent(event)
    Analyzer->>Analyzer: handleEvent - evaluate rules
    
    alt Rule: MultipleFatalError or RepeatedXidError
        Analyzer->>Analyzer: validateAllSequenceCriteria
        Analyzer->>MongoDB: Aggregate query (facets for sequence criteria)
        MongoDB-->>Analyzer: Query results
        
        alt Criteria matched (e.g., 5 remediated faults)
            Analyzer->>Publisher: Publish(event, recommendedAction, ruleName)
            Publisher->>Publisher: Clone event and override fields
            Note over Publisher: agent="health-events-analyzer"<br/>checkName=ruleName<br/>isHealthy=false, isFatal=true
            Publisher->>PlatformConnector: HealthEventOccuredV1(transformed event)
            PlatformConnector->>MongoDB: Insert new fatal event
            PlatformConnector->>K8s: Update node condition (e.g., "MultipleFatalError")
        else Criteria not matched
            Analyzer->>Analyzer: Log: rule not matched
        end
    end
    
    Analyzer->>MongoDB: Mark event as processed (update resume token)

Additional Comments (1)

  1. tilt/Tiltfile, line 141-143 (link)

    logic: fake nodes now depend on nvsentinel-mongodb but not nvsentinel-health-events-analyzer – if the analyzer publishes events that affect node state (e.g., quarantine), this dependency list may be incomplete and cause race conditions during stack startup

23 files reviewed, 13 comments


tests/Makefile Outdated
test-ci:
nohup kubectl port-forward -n monitoring svc/prometheus-operator-kube-p-prometheus 9090:9090 > /dev/null 2>&1 &
nohup kubectl port-forward -n nvsentinel svc/simple-health-client 8080:8080 > /dev/null 2>&1 &
nohup kubectl port-forward -n nvsentinel pod/nvsentinel-mongodb-0 27017:27017 > /dev/null 2>&1 &

style: port-forward to a specific pod (nvsentinel-mongodb-0) will fail if the pod is rescheduled or renamed (e.g., after a StatefulSet restart). Consider using svc/nvsentinel-mongodb instead to target the service endpoint, which remains stable across pod restarts. Is there a specific reason to target the pod directly instead of the service? Does the service endpoint not expose port 27017?


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The most recent updates focus on improving test stability and consistency: the MongoDB port-forward now targets the stable headless service rather than a specific pod, preventing test failures during pod restarts; a cleanup step was added to remove temporary MongoDB certificates after test runs; and timeout constants for Kubernetes resource polling were reduced from 10 minutes to 1 minute to speed up test feedback loops. These changes complement the PR's main feature—enabling the health events analyzer to detect patterns across multiple different fatal error codes—by ensuring the test infrastructure is more reliable and faster, though the reduced timeouts may introduce flakiness in slower CI environments.

PR Description Notes:

  • The PR template sections (Type of Change, Component(s) Affected, Testing, Checklist) are completely empty—none of the checkboxes are marked and no description is provided in the Summary field, making it difficult for reviewers to understand the scope and intent of this change without reading the code diff.

Important Files Changed

Filename Score Overview
tests/Makefile 4/5 Changed MongoDB port-forward from pod to headless service and added certificate cleanup to improve test stability
tests/helpers/kube.go 3/5 Reduced wait timeouts from 10min to 1min globally, which may cause flakiness in slower environments
health-events-analyzer/pkg/reconciler/reconciler.go 4/5 Removed XID pre-filter to enable cross-error-code pattern detection, increasing database load for every event
health-events-analyzer/pkg/reconciler/reconciler_test.go 5/5 Removed redundant unit test and fixed typo; no functional risk
tests/helpers/healthevent.go 5/5 Renamed parameter from PascalCase to camelCase for Go convention compliance; no functional change

Confidence score: 3/5

  • This PR includes critical test infrastructure improvements but introduces risk by globally reducing timeouts that affect all test helpers, which could cause intermittent failures in CI/CD environments that are slower than the developer's local setup.
  • Score reflects concerns about the aggressive timeout reduction (10min → 1min) in tests/helpers/kube.go that applies to all polling operations across the test suite, and the potential performance impact of removing the XID pre-filter in the reconciler, which now forces a MongoDB aggregation query for every event regardless of rule applicability.
  • Pay close attention to tests/helpers/kube.go (timeout constants may be too aggressive for slower CI runners) and health-events-analyzer/pkg/reconciler/reconciler.go (removal of pre-filter will increase MongoDB query load—monitor database metrics after deployment).

5 files reviewed, no comments



XRFXLP commented Oct 27, 2025

/ok to test 3800b87


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR.

The most recent changes address previous review feedback by moving test cleanup logic from deferred functions inside assess steps to proper feature.Teardown callbacks in both repeated_xid_rule_test.go and multiple_fatal_event_test.go. The cleanup now runs reliably even when assess steps fail, and includes safety checks for missing context values. The test timeout in tests/helpers/kube.go was increased from 1 minute to 10 minutes to accommodate the longer-running multi-cycle fatal event tests, and informational logging was added to cleanup operations. However, a critical syntax error was introduced in health-events-analyzer/pkg/parser/parser.go: the package declaration remains package reconciler even though the file was moved to the parser/ directory, which will cause compilation to fail. Additionally, two new logic errors were added to the parser: accessing parts[1] without validating the slice length (line 178-180) and dereferencing event.GeneratedTimestamp without a nil check (line 182-184).

Important Files Changed

Filename Score Overview
health-events-analyzer/pkg/parser/parser.go 0/5 Package declaration not updated after file move; will not compile. Also added array bounds and nil dereference bugs.
health-events-analyzer/pkg/reconciler/reconciler.go 4/5 Updated imports and method calls to use exported Parser API after package refactor; error wrapping improved.
tests/helpers/kube.go 5/5 Increased test timeout from 1 to 10 minutes to support multi-cycle tests; added informational logging.
tests/repeated_xid_rule_test.go 5/5 Moved cleanup from defer to feature.Teardown callback; introduced ERRORCODE_13 constant; removed hard-coded sleep.
tests/multiple_fatal_event_test.go 4.5/5 Moved cleanup from defer to feature.Teardown callback; introduced ERRORCODE_31 constant. Test still requires 6 errors despite PR description stating 5 cycles.
health-events-analyzer/pkg/publisher/publisher.go N/A No summary available (file marked undefined).

Confidence score: 0/5

  • This PR is not safe to merge and will cause immediate compilation failure in production.
  • Score reflects a critical syntax error (package reconciler in pkg/parser/parser.go when it should be package parser) that will break the build, plus two new runtime panic risks (array out-of-bounds at line 178-180 and nil dereference at line 182-184 in parser.go) that were not present in the previous review. The previous negative-index bugs (lines 138-139, 165-169) also remain unresolved.
  • Immediate attention required for health-events-analyzer/pkg/parser/parser.go to fix the package declaration and add bounds/nil checks before lines 178-184. After fixing the parser, carefully review health-events-analyzer/pkg/reconciler/reconciler.go to ensure all Parser method calls are correct. The test files are well-structured but cannot be validated until the parser compiles.

6 files reviewed, 2 comments



@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR.

The latest changes address previous review feedback by refactoring test structure and improving observability. The two test files (multiple_fatal_event_test.go and repeated_xid_rule_test.go) were updated to move cleanup logic from deferred functions within assess steps into dedicated feature.Teardown blocks, following e2e-framework best practices. Error code constants (ERRORCODE_13, ERRORCODE_31) were introduced to eliminate magic strings. The test helper kube.go was updated to increase the global WaitTimeout from 1 minute to 10 minutes (necessary for multi-cycle remediation flows) and add informational logging to the CleanupNodeConditionAndUncordon function. These changes improve test reliability, maintainability, and debuggability without altering the core test logic or the analyzer's functional behavior.

Important Files Changed

Filename Score Overview
tests/repeated_xid_rule_test.go 5/5 Refactored cleanup from deferred function to Teardown phase and introduced ERRORCODE_13 constant
tests/multiple_fatal_event_test.go 4/5 Moved cleanup to Teardown but introduced dependency on context value being set correctly in earlier assess steps
tests/helpers/kube.go 4/5 Increased global test timeout to 10 minutes and added logging to cleanup function

Confidence score: 4/5

  • This PR is safe to merge with low risk; changes are scoped to test structure improvements and observability enhancements.
  • Score reflects solid refactoring that follows framework best practices, but deducts one point because multiple_fatal_event_test.go cleanup relies on context state from a previous assess step—if that step fails before storing the node name, cleanup will be skipped silently.
  • Pay close attention to tests/multiple_fatal_event_test.go (lines 156-170): ensure the first assess step always sets gpuNodeName in context before any failure path, or add fallback logic to guarantee cleanup runs even if the context value is missing.

6 files reviewed, 1 comment



XRFXLP commented Oct 27, 2025

/ok to test d065e4f

Signed-off-by: Tanisha goyal <[email protected]>

XRFXLP commented Oct 27, 2025

/ok to test 8431c23

@github-actions

Merging this branch will decrease overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler 53.97% (-2.81%) 👎

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/reconciler.go 53.97% (-2.81%) 126 (-29) 68 (-20) 58 (-9) 👎

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the most recent changes made since the last review, not the entire PR. The latest updates include critical bug fixes to the health-events-analyzer reconciler and parser, modifications to deployment templates, and refinements to test infrastructure. Key changes address goroutine lifecycle management, context cancellation, duplicate environment variable issues, and template syntax errors that would prevent deployments from succeeding.

Recent Changes Overview:

The most recent changes focus on:

  1. Parser package fixes: Corrected package declaration from reconciler to parser (previously flagged)
  2. Error handling improvements: Removed double error wrapping in retry logic
  3. Test constants usage: Consolidated XID error codes to use declared constants throughout tests
  4. Fault-remediation deployment template: Fixed duplicate environment variable definitions (critical deployment blocker)
  5. Commons go.mod: Fixed invalid YAML v2 import path (go.yaml.in/yaml/v2 → gopkg.in/yaml.v2)
  6. Docker Makefile: Removed most module targets, migrating to ko-based builds
  7. Reconciler refactor: Fixed goroutine lifecycle and context propagation issues
  8. Test refinements: Improved cleanup logic and error code consistency

The changes address previously identified issues (package naming, duplicate definitions, error wrapping) while introducing new infrastructure improvements. The most critical fix is in the fault-remediation deployment template where duplicate environment variables would cause Kubernetes to reject the manifest.

Important Files Changed

Filename Score Overview
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml 1/5 Critical: Contains duplicate environment variable definitions (lines 42-47 and 82-87) that will cause deployment failures; also updates image repository path
commons/go.mod 1/5 Critical syntax error on line 20: invalid import path go.yaml.in/yaml/v2 should be gopkg.in/yaml.v2 - will cause build failures
docker/Makefile 1/5 Removes 7+ critical module build targets from the Makefile; health-events-analyzer and other core modules no longer built via Docker, breaking CI unless ko migration is complete
health-events-analyzer/pkg/parser/parser.go 2/5 Multiple nil pointer dereference risks (lines 106, 169, 184) and index out-of-bounds panic in getValueFromGeneratedTimestamp; silent failure modes return nil instead of errors
platform-connectors/pkg/connectors/store/storeConnectorImpl.go 2/5 Critical defer ordering bug: defer cancel() executes before defer session.EndSession(cleanupCtx), canceling the cleanup context before it's used
janitor/pkg/csp/client.go 2/5 GCP and Azure clients returned as uninitialized struct pointers will cause runtime panics; inconsistent initialization pattern vs AWS/OCI
fault-remediation-module/pkg/reconciler/reconciler.go 2/5 Unreachable code in collection-client error path (lines 84-91); deferred watcher cleanup may be invoked on never-started watcher
health-events-analyzer/pkg/reconciler/reconciler.go 3/5 Unbounded goroutine spawning, removal of retry logic, and potential duplicate event publishing due to race with watcher.MarkProcessed
janitor/pkg/csp/kind.go 3/5 Three incorrect timeout checks use parent ctx.Err() instead of dockerCtx.Err(), causing timeouts to not be detected
janitor/pkg/csp/gcp.go 3/5 Regex parsing bug with empty capture-group validation; op.Status dereferenced without nil-check will panic on unexpected GCP API responses
janitor/pkg/csp/aws.go 3/5 Off-by-one error in provider ID parsing; 5-minute hardcoded cooldown tightly coupled to message format; AWS_REGION read without validation
health-events-analyzer/pkg/reconciler/reconciler_test.go 3/5 Global test fixtures mutated without reset causing order dependencies; mocks don't validate pipeline content allowing incorrect rule matching
commons/pkg/server/server.go 3/5 Resource leak if listener creation fails in TLS path; race condition where running flag may not reflect actual socket binding state
janitor/pkg/controller/rebootnode_controller_test.go 3.5/5 AfterEach hook will panic if testRebootNode deleted during test; race condition test modifies wrong object copy
tests/helpers/healthevent.go 4/5 MongoDB client recreated on every query (no pooling); X509 auth unconditionally applied may conflict with MONGODB_URI auth settings
tests/repeated_xid_rule_test.go 4/5 Rapid-fire event injection assumes immediate analyzer processing; non-deterministic node selection may cause flakiness
tests/multiple_fatal_event_test.go 4/5 Race condition: gpuNodeName stored in context after first assess starts, breaking cleanup if assess fails early; cleanup only targets error code 31
fault-quarantine-module/main.go 4/5 Metrics server errors swallowed (returns nil); context cancellation from reconciler won't propagate to metrics server
platform-connectors/main.go 3/5 Metrics server always returns nil preventing context cancellation propagation; gRPC server still calls os.Exit bypassing shutdown; stopCh closed after cleanup starts
tests/uat/install-apps.sh 3/5 Empty array expansion "${extra_set_args[@]}" will fail for non-AWS CSPs; hardcoded 15m/20m timeouts may be insufficient
health-monitors/syslog-health-monitor/main.go 3/5 Flag initialization bug: nodeNameEnv default uses os.Getenv at package init; loadConfigWithRetry blocks shutdown for 15s without context check
health-monitors/csp-health-monitor/cmd/csp-health-monitor/main.go 3/5 Potential deadlock if monitor doesn't respect context; metrics server failures silently swallowed masking startup issues

Confidence score: 2/5

  • This PR contains multiple critical deployment-blocking bugs that will cause immediate failures if merged
  • Score reflects severe issues: duplicate environment variables in Kubernetes manifests, invalid Go import paths, nil pointer dereferences, and broken build systems that will prevent successful deployment and operation
  • Primary concerns: (1) fault-remediation deployment template has duplicate env vars that Kubernetes will reject, (2) commons/go.mod has invalid import path causing build failures, (3) docker/Makefile removes all critical module targets breaking CI, (4) multiple CSP clients have uninitialized pointers causing runtime panics, (5) server cleanup context ordering bug defeats the purpose of timeout protection

Additional Comments (92)

  1. platform-connectors/Dockerfile, line 1 (link)

    logic: entire Dockerfile deleted but platform-connectors module is still listed in the changed files and appears in the PR context. If this module is still active, the build will fail. Verify that platform-connectors now uses .ko.yaml exclusively (visible in changed files list) and that docker/Makefile has removed any references to this Dockerfile. If platform-connectors is being removed entirely, confirm all dependent modules, Helm charts, and tests have been updated. Is the platform-connectors module switching from Docker builds to ko-based builds, or is the entire module being removed? Are there any remaining references to this Dockerfile in the build system or Helm charts?

  2. health-monitors/gpu-health-monitor/Tiltfile, line 16 (link)

    logic: image name change from ghcr.io/nvidia/nvsentinel-gpu-health-monitor to ghcr.io/nvidia/nvsentinel/gpu-health-monitor (dash replaced with slash) will cause deployment failures if image references in Helm charts, .ko.yaml, or other Tiltfiles aren't updated consistently. Have all image references in Helm values, .ko.yaml, other Tiltfiles, and CI workflows been updated to match this new image path structure?

  3. .github/actions/setup-ci-tools/action.yml, line 197-201 (link)

    logic: empty string check will fail after -z test passes, causing workflow to exit before printing the error context. Move the echo statements before the exit.

  4. tests/uat/aws/delete-eks-cluster.sh, line 102 (link)

    style: double blank line

  5. tests/uat/aws/README.md, line 26 (link)

    logic: K8s version 1.34 doesn't exist yet - latest stable is 1.31. Will cause cluster creation to fail.

  6. tests/uat/common.sh, line 17-19 (link)

    style: log function doesn't handle multi-line input or special characters correctly - if $* contains backslashes or -n/-e flags they may be interpreted by echo. Use printf '%s\n' "[$(date +'%Y-%m-%d %H:%M:%S')] $*" instead

  7. janitor/pkg/metrics/metrics.go, line 64-65 (link)

    style: Registering metrics in init() (via NewActionMetrics) can panic if this package is imported twice or if the metrics are already registered elsewhere. Consider using prometheus.MustRegister inside an init() with a panic-recovery strategy, or check Register return errors to avoid hard crashes at startup.
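
    A sketch of the error-checking alternative, using client_golang's documented AlreadyRegisteredError pattern (the helper name is hypothetical):

        // registerCollector tolerates duplicate registration instead of
        // panicking at startup the way MustRegister would.
        func registerCollector(c prometheus.Collector) error {
            if err := prometheus.Register(c); err != nil {
                if _, ok := err.(prometheus.AlreadyRegisteredError); ok {
                    return nil // already registered elsewhere; safe to reuse
                }
                return err
            }
            return nil
        }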

  8. commons/go.mod, line 20 (link)

    syntax: invalid import path go.yaml.in/yaml/v2 - should be gopkg.in/yaml.v2

  9. commons/pkg/logger/logger.go, line 56 (link)

    syntax: missing godoc parameter withSource

  10. .github/actions/setup-build-env/action.yml, line 74 (link)

    style: missing newline at end of file

  11. janitor/main.go, line 93-98 (link)

    logic: missing MetricsBindAddress in manager options - the metricsAddr flag is parsed at line 77 but never applied, so Prometheus metrics will be disabled

  12. tests/uat/aws/gpu-operator-values.yaml, line 30 (link)

    logic: node-feature-discovery.master nodeSelector targets workload-type: gpu, but NFD master typically runs on control-plane nodes to discover features across all workers. Placing it on GPU nodes means it may not schedule if GPU nodes are tainted/cordoned during fault quarantine. Should this target workload-type: system like the operator (line 25)? Is NFD master intentionally placed on GPU nodes for a specific architectural reason in this UAT setup, or should it run on system nodes to avoid scheduling issues during fault conditions?

  13. platform-connectors/pkg/connectors/store/storeConnectorImpl.go, line 228-230 (link)

    logic: defer ordering: cancel() will fire before session.EndSession(), meaning cleanupCtx will be cancelled before EndSession tries to use it. Swap the defer statements so defer cancel() comes first (will execute last).
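
    A sketch of the corrected ordering (session comes from the surrounding code; the timeout value is illustrative):

        cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()                       // declared first, so it fires last
        defer session.EndSession(cleanupCtx) // fires before cancel, while cleanupCtx is still live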

  14. distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml, line 41-87 (link)

    syntax: duplicate environment variable definitions - MAINTENANCE_NAMESPACE, MAINTENANCE_VERSION, and MAINTENANCE_API_GROUP are set at lines 42-47 and again at lines 82-87

  15. Makefile, line 583-591 (link)

    logic: identical targets - both ko-build and ko-publish call the same script without arguments. Should ko-publish pass a --publish flag or set KO_DOCKER_REPO? Does buildko.sh detect publish vs build mode via environment variables, or should ko-publish pass additional arguments?

  16. health-monitors/csp-health-monitor/Tiltfile, line 17 (link)

    logic: image reference doesn't match the old reference at line 17: old was ghcr.io/nvidia/nvsentinel-csp-health-monitor, new is ghcr.io/nvidia/nvsentinel/csp-health-monitor. Verify that .ko.yaml, Helm values.yaml, and deployment manifests all use the updated path with the extra /nvsentinel/ segment, or this image will fail to deploy. Have the corresponding Helm chart and .ko.yaml entries been updated to match this new image path?

  17. fault-remediation-module/main.go, line 289-296 (link)

    style: metrics server errors are swallowed - the goroutine logs the error but always returns nil, preventing graceful shutdown propagation. If metrics are critical, consider returning the error; if not, document why they're non-critical. Is the metrics server intentionally non-critical to operation? Should it block startup if it fails to bind the port, or only log errors during runtime?

  18. fault-remediation-module/main.go, line 300 (link)

    style: if reconciler.Start returns an error immediately (e.g., initialization failure), the metrics server goroutine will continue running until gCtx is canceled by errgroup. Consider adding explicit cleanup or ensuring reconciler errors cause immediate context cancellation

  19. janitor/pkg/controller/rebootnode_controller.go, line 233 (link)

    syntax: logger message format is incorrect - second parameter should be a key-value pair. Should be "RebootNode", rebootNode.Name, "message", "not found, object assumed deleted"

  20. janitor/pkg/controller/test_utils.go, line 32-35 (link)

    style: fields are exported but never read outside the mock methods - should be unexported (lowercase) to prevent external access

  21. fault-remediation-module/Tiltfile, line 17 (link)

    logic: image name changed from ghcr.io/nvidia/nvsentinel-fault-remediation-module to ghcr.io/nvidia/nvsentinel/fault-remediation-module (added slash before module name). This must match the image reference in the Helm chart and .ko.yaml, otherwise deployment will fail. Have the corresponding image references in distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml and .ko.yaml been updated to use the new path with the slash?

  22. janitor/pkg/csp/client.go, line 50-54 (link)

    logic: GCP and Azure implementations return uninitialized struct pointers without error handling - these will panic when methods are called. Need to implement proper initialization functions like NewGCPClientFromEnv() and NewAzureClientFromEnv() similar to AWS and OCI

  23. janitor/pkg/controller/suite_test.go, line 70 (link)

    logic: CRD path references config/crd/bases but janitor module uses Helm templates in distros/kubernetes/nvsentinel/charts/janitor/templates/. The config/crd/bases directory doesn't exist in this repository structure. Tests will fail to find CRDs unless the path is corrected or the directory is created. Does the janitor module have a config/crd/bases directory, or should this point to the Helm template CRD files?

  24. janitor/pkg/controller/suite_test.go, line 71 (link)

    style: ErrorIfCRDPathMissing: false silences errors when the CRD directory is missing, which masks misconfiguration. Tests may pass without actually loading the RebootNode CRD, causing unexpected behavior when testing reconciliation logic.

  25. labeler-module/main.go, line 98-106 (link)

    logic: metrics server goroutine swallows errors and always returns nil, which means if the labeler (line 108-110) fails first, g.Wait() won't return until the server goroutine also exits. Since srv.Serve blocks until context cancellation, a labeler error won't cause immediate shutdown unless gCtx is canceled by the labeler. Does labelerInstance.Run() cancel the errgroup context on error, or will the metrics server continue running after the labeler fails? Should this use a separate cancelable context?

  26. make/docker.mk, line 30-37 (link)

    style: indentation changed from tabs to spaces in docker buildx command arguments. This inconsistency with the rest of the Makefile may cause issues with Make's requirement for tabs. Verify that these are actual tab characters, not spaces.

  27. nvsentinel-log-collector/Makefile, line 28 (link)

    style: cache reference still uses old naming nvsentinel-buildcache:log-collector without /nvsentinel/ path segment – this won't match the new image naming pattern and may cause cache misses or conflicts if cache keys are tied to image name patterns

  28. scripts/buildko.sh, line 26-31 (link)

    style: conditional platform flag will cause the build to fail if PLATFORMS env var contains invalid values (e.g., trailing commas, unsupported architectures). Consider validating PLATFORMS or wrapping the ko build in a trap to log the error clearly

  29. DEVELOPMENT.md, line 499-537 (link)

    style: documentation duplicates content already present at lines 499-537; this entire section on macOS/Docker Desktop socket requirements now appears twice in the file.

  30. DEVELOPMENT.md, line 1106-1108 (link)

    style: commented-out code left in documentation - line 1106 has "# Test the built images locally" followed by the actual command, then line 1107 repeats the same comment. One of these should be removed

  31. tests/uat/run.sh, line 60 (link)

    style: trap only captures EXIT signal. Consider adding INT and TERM to handle Ctrl+C and kill during long-running cluster operations, otherwise cleanup may not run if the user interrupts the script

  32. janitor/pkg/csp/gcp.go, line 57-61 (link)

    logic: empty name should not be an error - groupNames[i] will be empty for unnamed capture groups. The error should only be thrown if name is empty AND result is missing a required key after the loop
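
    A sketch of the suggested flow, with the required keys assumed for illustration: skip unnamed groups inside the loop, then validate the required keys afterwards:

        match := providerIDRegex.FindStringSubmatch(providerID)
        if match == nil {
            return nil, fmt.Errorf("provider ID %q did not match expected format", providerID)
        }
        result := map[string]string{}
        for i, name := range providerIDRegex.SubexpNames() {
            if i == 0 || name == "" { // index 0 is the whole match; "" marks an unnamed group
                continue
            }
            result[name] = match[i]
        }
        for _, key := range []string{"project", "zone", "instance"} { // assumed required keys
            if result[key] == "" {
                return nil, fmt.Errorf("provider ID %q missing %s", providerID, key)
            }
        }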

  33. janitor/pkg/csp/gcp.go, line 43 (link)

    style: regex is compiled on every call to getNodeFields. Consider declaring this as a package-level variable (var providerIDRegex = regexp.MustCompile(...)) to avoid repeated compilation overhead
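
    A sketch of the suggested hoisting; the pattern shown is illustrative, not the actual expression from getNodeFields:

        // Compiled once at package init instead of on every getNodeFields call.
        var providerIDRegex = regexp.MustCompile(
            `^gce://(?P<project>[^/]+)/(?P<zone>[^/]+)/(?P<instance>[^/]+)$`, // assumed pattern
        )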

  34. janitor/pkg/csp/gcp.go, line 133 (link)

    logic: dereferencing op.Status without nil check will panic if Status is nil. Validate op is non-nil and has a Status field before dereferencing
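
    A hedged sketch of the guard, assuming Status is a pointer field as the comment implies:

        if op == nil || op.Status == nil {
            return fmt.Errorf("compute operation returned no status")
        }
        // only now is *op.Status safe to read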

  35. tests/uat/install-apps.sh, line 17 (link)

    logic: set -euo pipefail will cause helm repo add/update to abort the script if the repository is already added. Consider using helm repo add ... || true or checking existence first. Should the script be idempotent, allowing re-runs even if repos are already added?

  36. tests/uat/install-apps.sh, line 129 (link)

    logic: if extra_set_args is empty (non-AWS CSP), bash will expand this as an empty string rather than an empty array, causing helm to fail with an argument parsing error. Wrap with conditional: ${extra_set_args[@]+"${extra_set_args[@]}"}

  37. tests/uat/install-apps.sh, line 67 (link)

    style: missing cert-manager values file check – if cert-manager-values.yaml doesn't exist, helm will silently ignore it. Add file existence validation before each helm upgrade call

  38. health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go, line 197-205 (link)

    logic: metrics server errors are swallowed by returning nil, preventing errgroup from propagating failures. If the server fails to start (e.g., port already in use), the service will continue running without metrics and no error will be logged to the caller. Should metrics server startup failures be fatal, or is silent degradation acceptable? If port binding fails, should the service abort or continue without observability?

  39. health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go, line 207-214 (link)

    logic: trigger engine goroutine returns nil unconditionally - if engine.Start panics or the engine terminates unexpectedly, errgroup won't detect failure. Should this capture and return errors from the engine? Does engine.Start(gCtx) return an error that should be propagated, or does it log errors internally and always succeed?

  40. tests/uat/tests.sh, line 41 (link)

    logic: resetting elapsed to 0 after successful reboot will cause the uncordon wait loop (line 54) to never timeout - both loops share the same elapsed variable

  41. node-drainer-module/main.go, line 137-157 (link)

    logic: shutdown triggered by criticalError calls stop() at line 145 but normal shutdown at line 139 does not. This asymmetry could cause the event watcher goroutine (line 99) to block indefinitely if the criticalError path is never taken and only gCtx.Done() fires. Should both paths call stop() to cancel ctx and unblock the event watcher?

  42. node-drainer-module/main.go, line 159-169 (link)

    style: shutdown code duplicated in both branches (lines 147-156 and 160-169). Extract to a helper function to avoid drift if cleanup logic changes

  43. janitor/pkg/csp/kind.go, line 90-95 (link)

    logic: checking ctx.Err() but using dockerCtx for the command - if parent context times out, this check won't fire. Should check dockerCtx.Err() instead
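
    A sketch of the fix with names assumed from the surrounding code; the context that actually governs the command is the one whose Err() reports the timeout:

        if err := cmd.Run(); err != nil { // cmd created with exec.CommandContext(dockerCtx, ...)
            if dockerCtx.Err() == context.DeadlineExceeded {
                return fmt.Errorf("docker command timed out: %w", dockerCtx.Err())
            }
            return fmt.Errorf("docker command failed: %w", err)
        }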

  44. janitor/pkg/csp/kind.go, line 106-111 (link)

    logic: same issue: checking parent ctx.Err() but dockerCtx controls the command execution - timeout won't be detected correctly

  45. janitor/pkg/csp/kind.go, line 128-133 (link)

    logic: same issue: checking parent ctx.Err() but dockerCtx controls the command execution - timeout won't be detected correctly

  46. fault-quarantine-module/main.go, line 267-275 (link)

    logic: metrics server error suppression means context cancellation won't propagate - if gCtx is cancelled while server is still starting/serving, the error will be logged but the goroutine will return nil, masking potential shutdown issues. Should the metrics server goroutine respect context cancellation and propagate that as an error, or is silent termination on context cancellation intentional?

  47. fault-quarantine-module/main.go, line 278 (link)

    logic: if reconciler.Start returns an error, errgroup will cancel gCtx, but the metrics server goroutine (line 267-275) will ignore the cancellation and return nil, so g.Wait() will only return the reconciler error, not both. Is this asymmetric error handling intentional, or should both goroutines propagate their errors (including context cancellation)?

  48. docker/Makefile, line 22-31 (link)

    logic: removed almost all Go and Python modules from build definitions - modules like platform-connectors, health-events-analyzer, fault-quarantine-module, labeler-module, node-drainer-module, fault-remediation-module, and janitor are central to the PR's stated goal ("detect multiple fatal error through health event analyzer"). Removing them from the build system means they won't be built by CI or by local developers using make build-all, which will break CI pipelines and local development workflows. Were these modules intentionally migrated to a different build system (e.g., ko exclusively), or is this an accidental deletion?

  49. docker/Makefile, line 143 (link)

    logic: changed image name filter from nvsentinel- to nvsentinel/ - this breaks cleanup for images built with the old naming convention. If any images with the old pattern exist, they won't be removed.

  50. scripts/ko-tilt-build.sh, line 35 (link)

    style: multi-arch build to ttl.sh with both linux/amd64 and linux/arm64 may be slower and unnecessary for local Tilt dev where most developers use a single architecture. Consider using --platform=$(go env GOOS)/$(go env GOARCH) for faster local iteration. Is building for both architectures during local Tilt development intentional, or would single-arch builds suffice?

  51. scripts/ko-tilt-build.sh, line 35-38 (link)

    style: if ko build succeeds but crane cp fails (network issue, registry unavailable), the image remains on ttl.sh and the script exits with error - the EXPECTED_REF won't be available, causing Tilt to fail. Consider adding error handling or a cleanup trap

  52. scripts/ko-tilt-build.sh, line 34 (link)

    style: TEMP_TAG uses timestamp + PID but concurrent invocations within the same second from the same parent process (e.g., parallel Tilt builds) could generate the same tag, causing conflicts. Consider adding $RANDOM or the module name to ensure uniqueness

  53. commons/pkg/server/server_test.go, line 41 (link)

    style: type assertion without 'ok' check will panic if Addr() returns a non-TCPAddr (unlikely in practice, but good to be defensive in test helpers)
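
    A defensive version of the assertion (listener name assumed):

        tcpAddr, ok := listener.Addr().(*net.TCPAddr)
        if !ok {
            t.Fatalf("expected *net.TCPAddr, got %T", listener.Addr())
        }
        port := tcpAddr.Port // continue using the checked value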

  54. janitor/pkg/csp/aws.go, line 124 (link)

    style: use logger.Error instead of fmt.Println for consistency - all other error paths use the structured logger

  55. janitor/pkg/csp/aws.go, line 143-147 (link)

    logic: off-by-one: if provider ID is exactly "aws:///us-west-2/i-xxx", Split produces 6 parts, not 5. len(parts) < 5 allows 4-part IDs which would panic at parts[4]. Should be < 6 or != 6. Are there valid AWS provider ID formats that have fewer than 6 parts, or should this check enforce exactly 6 parts?

  56. distros/kubernetes/nvsentinel/charts/janitor/templates/serviceaccount.yaml, line 20-49 (link)

    logic: conditional logic blocks for CSP annotations are stacked sequentially, which will merge annotations from all providers if multiple provider blocks have truthy conditions (e.g., if .Values.csp.provider is "aws" but .Values.csp.gcp.project is also set). Since eq (.Values.csp.provider | default "kind") "aws" only checks the provider, but lines 27-32, 33-39, and 40-46 all independently check their own provider strings, a misconfigured values file could render multiple provider annotations. Should these be else if blocks in Helm template logic, or is there validation elsewhere that ensures only one provider's fields are populated at install time?

  57. fault-remediation-module/pkg/reconciler/reconciler.go, line 71-76 (link)

    logic: returning an error from Start means the reconciler will stop entirely on initialization failure, but the deferred watcher cleanup at lines 78–82 will not run since the watcher was just assigned and hasn't been successfully created yet. The defer is safe because watcher is non-nil at that point, but if GetCollectionClient fails at line 84, the watcher is started at line 93 after the error return path, which means the watcher is never started but the defer will still call Close on it. Should the defer be moved to after line 93 (watcher.Start), or should the collection client initialization happen before the watcher is created to ensure no partial state?

  58. fault-remediation-module/pkg/reconciler/reconciler.go, line 84-91 (link)

    syntax: error path logs and returns the error at lines 86–90, but the return statement is unreachable because it comes after the return at line 90. This is dead code.

  59. janitor/pkg/csp/oci.go, line 90 (link)

    logic: node.Spec.ProviderID is a string (format: oci://<ocid>), but InstanceId expects just the OCID without the oci:// prefix. This will cause API call failures. Extract the OCID using strings.TrimPrefix(node.Spec.ProviderID, "oci://") before passing it.

  60. janitor/pkg/csp/oci.go, line 89-92 (link)

    style: API response is discarded, but it contains the instance state and request ID. If the instance is already rebooting or in a stopped state, this call may succeed but the reboot may not happen. Consider validating the response state or returning the request ID for tracking. Should the function verify that the instance was in a 'RUNNING' state before the reboot, or is it acceptable to silently succeed even if the instance is already stopped/terminated?

  61. janitor/pkg/csp/oci.go, line 112-113 (link)

    logic: always returns false before 5 minutes elapsed, then unconditionally returns true after 5 minutes without checking actual instance state. This assumes the node is always ready after 5 minutes, which may not be true if the reboot hangs or the instance enters a failed state. Should this query the OCI API to check the instance's actual lifecycle state? Is the 5-minute hardcoded delay a placeholder for future API checks, or is it intentional to assume the node is always ready after 5 minutes without verifying instance state?

  62. commons/pkg/server/server.go, line 462 (link)

    logic: listener.Close() is only called in the TLS error path, but if net.Listen fails at line 475 in the non-TLS path, or if net.Listen succeeds at line 454 and the function returns before wrapping with TLS (line 471), the listener will leak. Add deferred cleanup or ensure all early returns close the listener.
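
    One way to guarantee cleanup on every early return, sketched with assumed names:

        ln, err := net.Listen("tcp", addr)
        if err != nil {
            return err
        }
        ownedByServer := false
        defer func() {
            if !ownedByServer {
                ln.Close() // closes on any early return below
            }
        }()
        // ... TLS wrapping and other setup that may return early ...
        ownedByServer = true // ownership handed to the serving goroutine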

  63. commons/pkg/server/server.go, line 488-490 (link)

    logic: race condition: running is set to true inside the server goroutine, but if the goroutine hasn't started yet when IsRunning() is called externally, it will return false even though Serve() is executing and the socket is bound. Consider setting running=true before launching the goroutine.

  64. tests/uat/aws/create-eks-cluster.sh, line 120 (link)

    logic: wildcard filter 'rivate' is fragile and could match unintended route tables (e.g., 'derivate-table'). Consider a more precise filter like 'tag:aws:cloudformation:logical-id=Private' or query by explicit association with private subnets

  65. janitor/pkg/csp/azure.go, line 99 (link)

    style: using fmt.Println instead of logger for consistency - should use logger.Error(err, "Error parsing time")

  66. janitor/pkg/csp/azure.go, line 86 (link)

    style: error already logged on previous line, this logs it twice. Remove the Sprintf formatting and just pass the error.

  67. janitor/pkg/csp/azure.go, line 132 (link)

    style: error already logged via logger.Error, no need to include it in Sprintf message again

  68. janitor/pkg/csp/azure.go, line 138 (link)

    logic: dereferencing status.Code without nil check will panic if Code is nil

  69. janitor/pkg/controller/rebootnode_controller_test.go, line 169-174 (link)

    logic: AfterEach will fail if any test deletes the testRebootNode object, since k8sClient.Get() will return an error. Tests that result in failures or deletions will cause this hook to panic instead of validating conditions. Should this hook skip validation if the RebootNode no longer exists, or is deletion never expected in these tests?

  70. janitor/pkg/controller/rebootnode_controller_test.go, line 476-477 (link)

    logic: creating freshRebootNode directly in the fake client but never updating it after calling SetInitialConditions() and SetStartTime() on lines 480-481. The in-memory object is modified but the fake client's stored copy is stale. Subsequent assertions on lines 485-502 pass because they check the in-memory object, but if the test fetched it from the client again, the status would be empty.

  71. janitor/go.mod, line 8-13 (link)

    style: all four CSP SDKs (GCP, Azure, AWS, OCI) are now direct dependencies. This increases attack surface and binary size. If the janitor only interacts with one CSP per deployment, consider using build tags to conditionally compile each provider

  72. janitor/go.mod, line 113 (link)

    syntax: duplicate gopkg.in/yaml.v3 entries at lines 94 and 113 suggest a transitive dependency conflict or incorrect merge. Run go mod tidy to deduplicate

  73. .github/workflows/helm-e2e-test.yml, line 205 (link)

    logic: github.ref for a tag push will look like refs/tags/v1.2.3, not v1.2.3. Helm images typically expect a clean tag name like v1.2.3. This will cause the validation script to look for images with the full ref path as the tag, which won't exist in the registry.

  74. .github/workflows/helm-e2e-test.yml, line 233 (link)

    logic: same issue: github.ref includes refs/tags/ prefix, so this will set global.image.tag=refs/tags/v1.2.3 instead of v1.2.3. Helm will try to pull images with the wrong tag.

  75. .github/workflows/helm-e2e-test.yml, line 240 (link)

    logic: same issue: validation will search for images tagged refs/tags/v1.2.3 instead of v1.2.3

  76. .github/workflows/helm-e2e-test.yml, line 267 (link)

    logic: same issue in debug artifacts collection

  77. health-events-analyzer/pkg/reconciler/reconciler_test.go, line 236-277 (link)

    logic: test name is "match multiple fatal error rule" but the mock returns ruleMatched: true for all Aggregate calls without distinguishing which rule is being evaluated. If the reconciler evaluates multiple rules, the mock will return true for each, causing false positives. Does this test intend to verify that rule1 specifically is matched (requiring mock inspection of the pipeline), or just that any matching rule triggers publish?

  78. .github/workflows/publish.yml, line 169 (link)

    logic: comment says "Once validated, this will replace the container-publish job above" but both jobs run concurrently. This will double-build gpu-health-monitor and syslog-health-monitor images (once via Docker matrix, once via ko), pushing two different images with potentially conflicting tags to the same registry paths. Should the ko build-images job only build the components that were removed from the container-publish matrix (the 8 Go modules), or is the intent to eventually replace all Docker builds? Running both jobs for the same components will publish conflicting images.

  79. .github/workflows/publish.yml, line 109-123 (link)

    logic: container_name changed from 'nvsentinel-<component>' to 'nvsentinel/<component>' for all entries. This changes the published image path in the registry (e.g., ghcr.io/nvidia/nvsentinel-gpu-health-monitor → ghcr.io/nvidia/nvsentinel/gpu-health-monitor). Existing deployments and Helm charts that reference the old path will fail to pull images after this change.

  80. health-events-analyzer/pkg/parser/parser.go, line 82-92 (link)

    style: len check is insufficient – if parts[0] exists but is empty, comparison will succeed yet parts[1:] may be empty or out of bounds, leading to nil dereference in reflection helpers

  81. platform-connectors/main.go, line 268-276 (link)

    logic: metrics server swallows all errors including context cancellation. If srv.Serve(gCtx) returns on context cancel, the slog.Error will log misleadingly. Return the error from srv.Serve so g.Wait() can distinguish between normal shutdown and actual failures. Should the metrics server error be propagated to cause overall shutdown, or is it acceptable for the platform-connectors to run without metrics/health endpoints?

  82. platform-connectors/main.go, line 278-306 (link)

    logic: signal handler goroutine will never receive from gCtx.Done() unless the first goroutine (metrics server) fails AND returns a non-nil error (which it never does - line 275 always returns nil). This means SIGINT/SIGTERM is the only shutdown path. If the metrics server's context is cancelled, the signal handler won't know.

  83. platform-connectors/main.go, line 295 (link)

    logic: closing stopCh after starting cleanup creates a race - k8sConnector.FetchAndProcessHealthMetric (line 104) and storeConnector.FetchAndProcessHealthMetric (line 121) may still be reading from stopCh. Close it before cleanup to give them time to wind down

  84. platform-connectors/main.go, line 303 (link)

    style: calling cancel() after cleanup means the connectors' contexts remain live during shutdown. Move this before cleanupResources so the connectors see cancellation immediately

  85. platform-connectors/main.go, line 141-147 (link)

    logic: grpcServer.Serve goroutine exits the entire process on error (line 145), bypassing graceful shutdown. Consider returning the error to the caller or using errgroup to coordinate shutdown with the metrics server

  86. health-events-analyzer/pkg/parser/parser.go, line 133-144 (link)

    style: bounds check on line 139 covers negative index but does not distinguish err != nil from out-of-bounds. If parts[0] is non-numeric, strconv.Atoi returns error and function returns nil silently. Consider logging or returning error for invalid index strings. Should non-numeric array indices be logged as warnings or treated as valid "field not found" scenarios?
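
    A sketch that separates the two failure modes; the logger call is an assumption:

        idx, err := strconv.Atoi(parts[0])
        if err != nil {
            klog.Warningf("invalid array index %q in rule path", parts[0]) // malformed rule, surface it
            return nil
        }
        if idx < 0 || idx >= len(event.ErrorCode) {
            return nil // out of range: genuinely "field not found"
        }
        return event.ErrorCode[idx]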

  87. tests/multiple_fatal_event_test.go, line 165 (link)

    logic: cleanup passes error code "31" to TestCleanUp, but lines 65–105 injected codes "13" and "48" five times. If those events persist in MongoDB or as node conditions, cleanup may fail to remove them because it only targets "31". Should cleanup iterate over all three error codes (13, 48, 31), or does clearing 31 and the condition "MultipleFatalError" sufficiently reset the node state?

  88. health-monitors/syslog-health-monitor/main.go, line 58 (link)

    logic: flag.String with os.Getenv() as default captures the env value at package init, before flag parsing. If NODE_NAME is unset during startup, flag.Parse() will never update it. Set the default to empty string and read the env inside run() after parsing.
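
    A sketch of the suggested pattern (flag name assumed): keep the flag default empty and fall back to the environment after parsing:

        nodeName := flag.String("node-name", "", "node name (falls back to $NODE_NAME)")
        flag.Parse()
        if *nodeName == "" {
            *nodeName = os.Getenv("NODE_NAME")
        }
        if *nodeName == "" {
            log.Fatal("node name not set via --node-name or NODE_NAME")
        }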

  89. health-monitors/syslog-health-monitor/main.go, line 171-178 (link)

    logic: metrics server errors are swallowed (logged but returned as nil), causing errgroup to ignore server failures. If this is intentional, document why; otherwise remove the inner error handling and return the error directly so errgroup can propagate it. Should the service continue if the metrics server fails to bind (e.g., port conflict), or is that a fatal condition?

  90. health-monitors/syslog-health-monitor/main.go, line 322 (link)

    logic: conn.Connect() is non-blocking, so the first GetState() may return Idle. The subsequent WaitForStateChange will block until the state changes from Idle, but if connection fails immediately, the loop may return an error before the state transitions. Add a check after WaitForStateChange returns false to verify the failure reason. Should this function also return if the connection enters TransientFailure or Shutdown states, or only on Ready?

  91. health-monitors/syslog-health-monitor/main.go, line 349-378 (link)

    logic: retry loop does not honor ctx cancellation. If the process is interrupted during retries (e.g., SIGTERM), this will block for up to maxConfigRetries * attempt seconds. Add a select on ctx.Done() before or after time.Sleep.
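
    A context-aware replacement for the bare sleep, assuming the attempt-based delay from the existing loop:

        select {
        case <-ctx.Done():
            return fmt.Errorf("config retry aborted: %w", ctx.Err())
        case <-time.After(time.Duration(attempt) * time.Second):
        }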

  92. health-monitors/syslog-health-monitor/main.go, line 205-212 (link)

    style: exponential backoff resets the exponent to 2s on first failure instead of using the attempt count. If the first error happens at tick 0, the backoff will be 2s, 4s, 8s, 16s, 30s. If this is intentional (to avoid penalizing the first failure), document it. Should backoff start at 1s * attempt, or always begin at 2s on the first error?
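
    If attempt-based growth is the intent, a capped exponential sketch (constants assumed, attempt starting at 1):

        backoff := time.Duration(1<<uint(attempt-1)) * time.Second // 1s, 2s, 4s, 8s, ...
        if backoff > 30*time.Second {
            backoff = 30 * time.Second
        }
        time.Sleep(backoff)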

148 files reviewed, 96 comments


"recommendedAction": 15,
"errorCode": [
"79"
"ERROR_CODE"

logic: this placeholder won't match the actual error codes used in the new test files ("13", "48", "31" in repeated_xid_rule_test.go and multiple_fatal_event_test.go). Tests that depend on this template must replace "ERROR_CODE" with a concrete XID or the healthy events won't clear the corresponding fatal events

Comment on lines +621 to +625
for attempt := 1; attempt <= maxRetries; attempt++ {
    node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
    if err != nil {
        return fmt.Errorf("failed to get node %s: %w", nodeName, err)
    }

logic: retry logic fetches the node fresh on each attempt, but the stale node object from the outer scope is used for the uncordon block (lines 672-676). If the first update succeeds but uncordon logic starts before the break at 659, it operates on stale data. Move the uncordon Get call inside its own retry block or restructure to avoid scope confusion

Comment on lines +152 to +154
clientOpts := options.Client().ApplyURI(mongoURI).SetTLSConfig(tlsConfig)
// Use X509 auth by default
clientOpts.SetAuth(options.Credential{AuthMechanism: "MONGODB-X509", AuthSource: "$external"})

logic: X509 auth is hardcoded even when MONGODB_URI is provided via env var; if the URI already specifies an auth mechanism, this will conflict. Should X509 auth be applied unconditionally, or only when the URI doesn't already specify an auth mechanism?
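
A sketch of the conditional, relying on the mongo-driver behavior that ApplyURI populates ClientOptions.Auth when the URI carries credentials or an auth mechanism:

    clientOpts := options.Client().ApplyURI(mongoURI).SetTLSConfig(tlsConfig)
    if clientOpts.Auth == nil { // URI did not specify auth; default to X509
        clientOpts.SetAuth(options.Credential{
            AuthMechanism: "MONGODB-X509",
            AuthSource:    "$external",
        })
    }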

if err != nil {
    klog.Errorf("failed to send healthy events during cleanup: %v", err)
}
time.Sleep(5 * time.Second)

style: 5-second sleep is arbitrary and may not be long enough for all environments. Consider polling for condition removal or making the delay configurable
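
A polling sketch with a hypothetical helper in place of the fixed sleep:

    deadline := time.Now().Add(2 * time.Minute) // assumed budget
    for time.Now().Before(deadline) {
        if !nodeHasCondition(ctx, client, nodeName, "MultipleFatalError") { // hypothetical helper
            break // condition cleared; cleanup can proceed
        }
        time.Sleep(2 * time.Second)
    }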

}

// getValueFromErrorCode safely returns event.ErrorCode[index] if present.
func getValueFromErrorCode(event *platform_connectors.HealthEvent, parts []string) any {
Member

Should we rename all the functions getValueFromX -> getX? Would that be more readable?

Signed-off-by: Tanisha goyal <[email protected]>

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The most recent changes consist of three minimal updates: (1) resolving a merge conflict in health-events-analyzer/go.mod by removing github.com/kr/text v0.2.0 // indirect while keeping github.com/hashicorp/errwrap v1.0.0 // indirect, (2) adding an explicit but unreachable return nil statement at line 104 in reconciler.go to satisfy Go's return-path requirements and removing a blank line at line 181, and (3) adding the new test file tests/repeated_xid_rule_test.go. The changes integrate with the broader PR objective of detecting repeated XID failures and multiple fatal errors: the go.mod conflict resolution ensures the correct transitive dependencies (errwrap is used by go-multierror for aggregated error handling in rule evaluation), the reconciler return statement satisfies compiler requirements without changing runtime behavior (the event-processing loop blocks indefinitely), and the new test validates that after 5 repeated cycles of XID 13 followed by healthy events, the 6th occurrence triggers the RepeatedXid rule and applies a node-termination recovery action.

Important Files Changed

Filename Score Overview
health-events-analyzer/go.mod 5/5 Resolved merge conflict by removing github.com/kr/text and keeping github.com/hashicorp/errwrap // indirect
health-events-analyzer/pkg/reconciler/reconciler.go 5/5 Added unreachable return nil at line 104 and removed blank line at line 181
tests/repeated_xid_rule_test.go 3/5 New test for repeated XID failure detection with 5-cycle remediation loop

Confidence score: 3/5

  • This PR includes only incremental changes since the last review; the new test file and minor reconciler update carry moderate risk due to timing dependencies and potential race conditions.
  • Score reflects uncertainty about test reliability (hardcoded sleeps, randomized node selection, and potential ordering issues) and missing details on the new test file's implementation quality. The go.mod and reconciler changes are low-risk and safe.
  • Pay close attention to tests/repeated_xid_rule_test.go – ensure it follows the same patterns as multiple_fatal_event_test.go reviewed previously, avoids hardcoded delays, properly cleans up test state, and does not introduce race conditions with GpuNodeName.

3 files reviewed, 1 comment


[[rules.sequence]]
criteria = { "healthevent.entitiesimpacted.0.entitytype" = "GPU", "healthevent.entitiesimpacted.0.entityvalue" = "this.healthevent.entitiesimpacted.0.entityvalue", "healthevent.errorcode.0" = "13", "healthevent.nodename" = "this.healthevent.nodename" }
errorCount = 3
criteria = { "healthevent.entitiesimpacted.0.entitytype" = "GPU","healthevent.entitiesimpacted.0.entityvalue" = "this.healthevent.entitiesimpacted.0.entityvalue","healthevent.errorcode.0" = "this.healthevent.errorcode.0","healthevent.nodename" = "this.healthevent.nodename","healthevent.checkname" = "{\"$ne\": \"HealthEventsAnalyzer\"}"}
Collaborator

Isn't "healthevent.checkname" = "{\"$ne\": \"HealthEventsAnalyzer\"}" needed for every rule, can we add it as a default in the code rather than having the rule specify?

Author

I don't think so; we might not require it for the multipleFatalErrorRule. In phase 2, when we will be taking actions on event-analyzer events, shouldn't we consider a health-events-analyzer event as part of the remediation cycle if remediation was done?

And since we wanted to keep the query logic mostly configurable, we added it here instead of keeping it in code.

Collaborator

I don't think so; let's not do that, it will become recursive that way. To keep things simple, we can only consider events that are not raised by the event-analyzer.


// getValueFromErrorCode safely returns event.ErrorCode[index] if present.
func getValueFromErrorCode(event *platform_connectors.HealthEvent, parts []string) any {
    if len(parts) == 0 {
Collaborator

When will this happen? Don't we already validate len(parts) == 0 in the caller? Can you please clarify?

}

func getValueFromHealthEvent(event *platform_connectors.HealthEvent, parts []string) any {
    if len(parts) == 0 {
Collaborator

When will parts be empty? If parts is empty, isn't that an error?


// getValueFromMetadata returns the value for a metadata key if it exists.
func getValueFromMetadata(event *platform_connectors.HealthEvent, parts []string) any {
    if len(parts) == 0 {
Collaborator

When will this happen? Don't we already validate len(parts) == 0 in the caller? Can you please clarify?

}

func getValueFromHealthEventStatus(event storeconnector.HealthEventStatus, parts []string) any {
    if len(parts) == 0 {
Collaborator

When will this happen? Don't we already validate len(parts) == 0 in the caller? Can you please clarify?

assert.NoError(t, err, "failed to create client")

// Check node condition for matched ruleset
helpers.WaitForNodeConditionWithCheckName(ctx, t, client, gpuNodeName, "MultipleFatalError")
Collaborator

Please check whether the node condition was added; there is no need to poke into MongoDB.

feature.Assess("Inject multiple fatal errors", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
gpuNodeName := ctx.Value(keyGpuNodeName).(string)
if gpuNodeName == "" {
t.Fatal("GPU node name not found in context - previous assess step may have failed")
Collaborator

How is this possible?

"healthevent.checkname": "RepeatedXid13",
}

event, err := helpers.QueryHealthEventByFilter(ctx, filter)
Collaborator

If we are already checking the node condition, do we need this as well?

    ],
    resource_deps=['cert-manager'],
)
k8s_resource(
Collaborator

Please don't connect to MongoDB in tests; rather, check outcomes.

[[rules.sequence]]
criteria = { "healthevent.entitiesimpacted.0.entitytype" = "GPU", "healthevent.entitiesimpacted.0.entityvalue" = "this.healthevent.entitiesimpacted.0.entityvalue", "healthevent.errorcode.0" = "13", "healthevent.nodename" = "this.healthevent.nodename" }
errorCount = 3
criteria = { "healthevent.entitiesimpacted.0.entitytype" = "GPU","healthevent.entitiesimpacted.0.entityvalue" = "this.healthevent.entitiesimpacted.0.entityvalue","healthevent.errorcode.0" = "this.healthevent.errorcode.0","healthevent.nodename" = "this.healthevent.nodename","healthevent.checkname" = "{\"$ne\": \"HealthEventsAnalyzer\"}"}
Collaborator

Since XIDs can happen in bursts, how are we handling that scenario?

Author

As we discussed on call, we are not handling bursts of XIDs for now. I'll handle these changes in a new PR.

@tanishagoyal2
Author

/ok to test 641b2f5

1 similar comment
@XRFXLP
Copy link
Member

XRFXLP commented Oct 27, 2025

/ok to test 641b2f5
