Skip to content

Conversation

rethil
Copy link
Contributor

@rethil rethil commented Sep 30, 2025

Description

This fixes test 02 issues.
There is a race condition where pod is scaled to 0 and in parallel:

  • controller sets up EndpointSlices,
  • curl sends a http request

As EndpointSlices are not set, curl request is probably ending in istio.
With this fix, problem doesn't persist, but its covered by explicit wait for proxy mode.

There is also helm drift after 06-resolver-endpoint-sync, where resolver is scaled to 2.
Drift correction was also added.

Fixes # (issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

  • e2e

Checklist:

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • A PR is open to update the helm chart (link)[https://github.com/truefoundry/elasti/tree/main/charts/elasti] if applicable
  • I have updated the CHANGELOG.md file with the changes I made

Summary by CodeRabbit

  • Tests

    • Simplified e2e YAML by removing explicit TestAssert blocks and fixed 60s timeouts, relying on defaults.
    • Added/updated ElastiService test entries across scenarios, including mode changes between proxy and serve.
  • Chores

    • Improved e2e test harness: time-bounded log capture (RFC3339) and more robust curl retries for reliable diagnostics.
    • Minor ordering/logging adjustments and formatting cleanups.

Copy link
Contributor

coderabbitai bot commented Sep 30, 2025

Walkthrough

Add time-bounded log collection to the e2e traffic script, remove many explicit KUTTL TestAssert blocks/timeouts from e2e YAMLs while adding/updating several ElastiService status.mode entries, and reorder an endpointslice creation step and logs in the operator controller.

Changes

Cohort / File(s) Summary
E2E traffic script logging
tests/e2e/manifest/send_traffic.sh
log_failure_details now accepts a start_time parameter (RFC3339); callers compute start_time_rfc per attempt and pass it. kubectl logs invocations use --since-time. Added curl retry options (--retry 2, --retry-delay 1) and a brief initial sleep; minor whitespace/format tweaks.
Serve-via-traffic tests (grouped changes)
tests/e2e/tests/02-enable-serve-via-traffic/02-assert.yaml, tests/e2e/tests/02-enable-serve-via-traffic/05-assert.yaml, tests/e2e/tests/02-enable-serve-via-traffic/07-assert.yaml, tests/e2e/tests/02-enable-serve-via-traffic/10-assert.yaml, tests/e2e/tests/02-enable-serve-via-traffic/12-assert.yaml, tests/e2e/tests/02-enable-serve-via-traffic/15-assert.yaml, tests/e2e/tests/02-enable-serve-via-traffic/17-assert.yaml, tests/e2e/tests/02-enable-serve-via-traffic/20-assert.yaml
Removed explicit KUTTL TestAssert blocks (apiVersion/kind/timeout) in multiple steps. Added/updated ElastiService resource entries (e.g., metadata.name: target-elastiservice) and adjusted status.mode values (proxy/serve) across steps.
Other e2e TestAssert removals
tests/e2e/tests/03-enable-serve-via-manual/03-assert.yaml, tests/e2e/tests/07-scale-down-with-filter/05-assert.yaml, tests/e2e/tests/08-resolver-multiple-endpoints/06-assert-mode-proxy.yaml, tests/e2e/tests/08-resolver-multiple-endpoints/08-assert-mode-serve.yaml, tests/e2e/tests/09-statefulset-serve-via-traffic/02-assert.yaml, tests/e2e/tests/09-statefulset-serve-via-traffic/05-assert.yaml
Removed TestAssert apiVersion/kind and timeout fields; retained ElastiService definitions and mode assertions where present.
Operator controller ordering
operator/internal/controller/opsModes.go
In enableProxyMode, moved createOrUpdateEndpointsliceToResolver(ctx, targetSVC) earlier (before watchPublicService) and adjusted log numbering; added a TODO regarding atomicity.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Tester
  participant Script as send_traffic.sh
  participant Kubectl as kubectl
  participant Service as ElastiService
  participant Ingress
  participant Resolver
  participant Controller
  participant TargetPod

  Tester->>Script: run
  Script->>Script: start_time_rfc = now (RFC3339)
  loop retries
    Script->>Service: curl (with --retry, --retry-delay)
    alt success
      Script-->>Tester: done (0)
    else failure
      Script->>Script: log_failure_details(start_time_rfc)
      Script->>Kubectl: logs Ingress --since-time=start_time_rfc
      Script->>Kubectl: logs Resolver --since-time=start_time_rfc
      Script->>Kubectl: logs Controller --since-time=start_time_rfc
      Script->>Kubectl: logs TargetPod --since-time=start_time_rfc
      Script-->>Tester: fail + collected logs
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I twitch my whiskers, mark the time,
RFC stamps chase the failing rhyme.
As asserts fall and modes turn sly,
Endpoints hop and logs reply.
A bunny cheers: tests pass—bye-bye! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly highlights that the pull request addresses a failure in the e2e test #2, which aligns with the main objective of fixing the race condition in that test. It avoids extraneous detail and noise, focusing reader attention on the key change. This level of specificity is sufficient for a teammate to understand the general intent when scanning the PR history.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a34ded and c307f74.

📒 Files selected for processing (1)
  • tests/e2e/manifest/send_traffic.sh (5 hunks)
🧰 Additional context used
🪛 Shellcheck (0.11.0)
tests/e2e/manifest/send_traffic.sh

[info] 126-126: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 130-130: echo may not expand escape sequences. Use printf.

(SC2028)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/resolver)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/operator)
  • GitHub Check: golangci-lint (/home/runner/work/KubeElasti/KubeElasti/operator)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/pkg)
  • GitHub Check: Analyze (go)
  • GitHub Check: Run Kuttl E2E Tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 67f78f2 and c061380.

📒 Files selected for processing (2)
  • tests/e2e/manifest/send_traffic.sh (5 hunks)
  • tests/e2e/manifest/setup.sh (2 hunks)
🧰 Additional context used
🪛 Shellcheck (0.11.0)
tests/e2e/manifest/setup.sh

[info] 50-50: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 55-55: Double quote to prevent globbing and word splitting.

(SC2086)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/resolver)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/pkg)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/operator)
  • GitHub Check: Run Kuttl E2E Tests
  • GitHub Check: Analyze (go)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/e2e/manifest/send_traffic.sh (1)

78-78: Critical: Invalid --all-pods=true flag still present.

Despite the past review marking this as "✅ Addressed in commit 9243d71", the unsupported --all-pods=true flag remains on line 78. This flag causes kubectl logs to fail with "unknown flag: --all-pods", breaking failure diagnostics.

Apply this diff to remove the invalid flag:

-    kubectl logs -n elasti services/elasti-resolver-service --since-time="${start_time}" --all-pods=true | sed 's/^/    /' || echo "${YELLOW}    - Could not retrieve resolver logs${NC}"
+    kubectl logs -n elasti services/elasti-resolver-service --since-time="${start_time}" | sed 's/^/    /' || echo "${YELLOW}    - Could not retrieve resolver logs${NC}"

Note: If you need logs from all pods behind the service, use --all-containers=true (which is valid) or iterate over pods with a label selector.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9243d71 and d8d4e40.

📒 Files selected for processing (1)
  • tests/e2e/manifest/send_traffic.sh (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/resolver)
  • GitHub Check: golangci-lint (/home/runner/work/KubeElasti/KubeElasti/operator)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/pkg)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/operator)
  • GitHub Check: Run Kuttl E2E Tests
  • GitHub Check: Analyze (go)
🔇 Additional comments (4)
tests/e2e/manifest/send_traffic.sh (4)

63-65: LGTM: Parameterized time-bounded log retrieval.

The addition of the start_time parameter enables focused log retrieval from the relevant time window, improving diagnostic efficiency.


159-159: LGTM: Consistent time-bounded failure diagnostics.

Both failure paths correctly pass start_time_rfc to log_failure_details, ensuring diagnostic logs are scoped to the relevant time window.

Also applies to: 176-176


91-91: LGTM: Formatting improvements.

The additional blank lines improve readability and logical separation of code blocks.

Also applies to: 141-141, 145-145, 163-163, 182-182


78-78: Verify kubectl logs with services/ prefix
Lines 78 and 82 invoke kubectl logs -n elasti services/<service>; the sandbox couldn’t confirm its behavior. Please manually test in a real cluster to ensure this returns the expected logs for all pods behind the service. If it fails or only shows one pod, update to target pods or deployments directly (for example, kubectl logs -n elasti <pod-name> or kubectl logs -n elasti deployment/<deployment-name>).

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/e2e/manifest/send_traffic.sh (2)

43-51: Fix unbound color variables under set -u (script exits early).

Colors are used before they are defined; with set -u, ${RED}/${NC} are unbound and abort the script on error paths (Lines 44 and 49). Move color definitions above any usage, or use safe expansion.

Apply this minimal diff to move color definitions:

@@
-set -u
+set -u
+
+# --- Color Definitions ---
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+CYAN='\033[0;36m'
+NC='\033[0m' # No Color
@@
-# --- Color Definitions ---
-RED='\033[0;31m'
-GREEN='\033[0;32m'
-YELLOW='\033[1;33m'
-CYAN='\033[0;36m'
-NC='\033[0m' # No Color

Optionally also guard all expansions as ${RED:-}/${NC:-} on early‑exit paths.

Also applies to: 53-58


126-126: Non-portable newline with echo; use printf.

echo "\n..." is not portable across POSIX shells. Prefer printf.

-    echo "\n${CYAN}--- Request $i/$MAX_RETRIES ---${NC}"
+    printf "\n%s--- Request %s/%s ---%s\n" "${CYAN}" "$i" "$MAX_RETRIES" "${NC}"
@@
-echo "\n${CYAN}=== Test Summary ===${NC}"
+printf "\n%s=== Test Summary ===%s\n" "${CYAN}" "${NC}"

Also applies to: 190-190

♻️ Duplicate comments (2)
tests/e2e/manifest/send_traffic.sh (2)

125-131: GNU-specific date breaks portability; avoid -d/--rfc-3339.

date -d "-1 minute" --rfc-3339=seconds is GNU‑only and will fail on macOS/BSD/BusyBox. Use a POSIX‑friendly UTC timestamp, or better, drop --since-time in favor of --since=<duration> everywhere.

Minimal portable fix (keep --since-time usage):

-    start_time_rfc=$(date -d "-1 minute" --rfc-3339=seconds | sed 's/ /T/')
+    start_time_rfc=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

Recommended refactor (avoid date entirely): use "${since_args[@]}" from the earlier comment and remove start_time_rfc plumbing.


77-83: kubectl logs: unsupported services/ target and invalid flag; switch to selector-based logs.*

  • --all-pods=true is not a valid flag for kubectl logs.
  • services/<name> is not a supported logs target; use a label selector (from the Service’s selector) or a workload resource (e.g., deploy/<name>).

Apply this robust, zero‑assumption fix using the service selector (also uses the since_args fallback from above):

-    kubectl logs -n elasti services/elasti-resolver-service --since-time="${start_time}" --all-pods=true | sed 's/^/    /' || echo "${YELLOW}    - Could not retrieve resolver logs${NC}"
+    svc_selector=$(kubectl get svc -n elasti elasti-resolver-service -o jsonpath='{range $k,$v := .spec.selector}{printf "%s=%s,", $k, $v}{end}' | sed 's/,$//')
+    if [ -n "$svc_selector" ]; then
+      kubectl logs -n elasti --selector="$svc_selector" --all-containers=true "${since_args[@]}" | sed 's/^/    /' \
+        || echo "${YELLOW}    - Could not retrieve resolver logs${NC}"
+    else
+      echo "${YELLOW}    - Service has no selector; skipping resolver logs${NC}"
+    fi
@@
-    kubectl logs -n elasti services/elasti-operator-controller-service --since-time="${start_time}" | sed 's/^/    /' || echo "${YELLOW}    - Could not retrieve controller logs${NC}"
+    ctrl_selector=$(kubectl get svc -n elasti elasti-operator-controller-service -o jsonpath='{range $k,$v := .spec.selector}{printf "%s=%s,", $k, $v}{end}' | sed 's/,$//')
+    if [ -n "$ctrl_selector" ]; then
+      kubectl logs -n elasti --selector="$ctrl_selector" --all-containers=true "${since_args[@]}" | sed 's/^/    /' \
+        || echo "${YELLOW}    - Could not retrieve controller logs${NC}"
+    else
+      echo "${YELLOW}    - Controller service has no selector; skipping controller logs${NC}"
+    fi

If your kubectl is too old for --selector, iterate pods: for p in $(kubectl get pods -l "$svc_selector" -o name); do kubectl logs "$p" ...; done.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d8d4e40 and abf9531.

📒 Files selected for processing (1)
  • tests/e2e/manifest/send_traffic.sh (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/operator)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/pkg)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/resolver)
  • GitHub Check: Run Kuttl E2E Tests
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
tests/e2e/manifest/send_traffic.sh (1)

85-87: Ensure target resource supports kubectl logs.

${TARGET_RESOURCE}/${TARGET_NAME} must be a log‑capable resource (Pod/Deployment/StatefulSet/Job/DaemonSet). If it can be a Service, this will fail. Consider the same selector‑based approach or pass the concrete workload type.

Do the e2e invocations always provide a log‑capable resource here?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (3)
tests/e2e/manifest/send_traffic.sh (3)

74-74: Incorrect error label for istiod logs.

The error message says "resolver logs" when fetching istiod logs from the ingress namespace.

Apply this diff:

-    kubectl logs -n istio-system deployments/istiod --since-time="${start_time}" | sed 's/^/    /' || echo "${YELLOW}    - Could not retrieve resolver logs${NC}"
+    kubectl logs -n istio-system deployments/istiod --since-time="${start_time}" | sed 's/^/    /' || echo "${YELLOW}    - Could not retrieve istiod logs${NC}"

63-64: Add defensive handling for empty or malformed start_time.

If start_time is empty or malformed, kubectl logs --since-time will fail silently or produce errors. A fallback to a duration-based window (e.g., --since=90s) would make diagnostics more resilient.

Consider this pattern:

 log_failure_details() {
-    start_time="${1}"
+    start_time="${1:-}"
+    if [ -n "${start_time}" ]; then
+        since_args="--since-time=${start_time}"
+    else
+        since_args="--since=90s"
+    fi

Then replace direct --since-time="${start_time}" uses with ${since_args} in subsequent kubectl commands.


132-132: Portability issue: GNU-specific date command.

The date --rfc-3339=seconds option is GNU-specific and will fail on systems without GNU coreutils (e.g., Alpine, macOS, BSD). Since the script uses #!/bin/sh, it should use POSIX-compatible commands.

Apply this diff to use a POSIX-compatible alternative:

-    start_time_rfc=$(date -d "-1 minute" --rfc-3339=seconds | sed 's/ /T/')
+    start_time_rfc=$(date -u -v-1M +"%Y-%m-%dT%H:%M:%SZ" 2>/dev/null || date -u --date="-1 minute" +"%Y-%m-%dT%H:%M:%SZ" 2>/dev/null || date -u +"%Y-%m-%dT%H:%M:%SZ")

This tries BSD-style (-v-1M), then GNU-style (--date), then falls back to current time, ensuring cross-platform compatibility.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between abf9531 and f4b8501.

📒 Files selected for processing (2)
  • operator/internal/controller/opsEndpointslices.go (1 hunks)
  • tests/e2e/manifest/send_traffic.sh (6 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-14T12:33:30.800Z
Learnt from: rethil
PR: truefoundry/KubeElasti#178
File: operator/internal/controller/opsEndpointslices.go:21-23
Timestamp: 2025-09-14T12:33:30.800Z
Learning: In the KubeElasti project, the resolver component in `getIPsForResolver()` method must list EndpointSlices across all namespaces, not restricted to its own namespace. This is by design and required for proper resolver functionality.

Applied to files:

  • operator/internal/controller/opsEndpointslices.go
🪛 GitHub Check: golangci-lint (/home/runner/work/KubeElasti/KubeElasti/operator)
operator/internal/controller/opsEndpointslices.go

[failure] 112-112:
unnecessary conversion (unconvert)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/pkg)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/operator)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/resolver)
  • GitHub Check: Run Kuttl E2E Tests
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
tests/e2e/manifest/send_traffic.sh (1)

160-160: Manual verification required for timestamp portability and kubectl compatibility.
The Linux sandbox confirms the POSIX fallback yields a valid RFC3339 timestamp, but we can’t validate the BSD/macOS date -v-1M syntax here or ensure kubectl logs --since-time accepts it. Please test on macOS (e.g. date -u -v-1M '+%Y-%m-%dT%H:%M:%SZ') and verify that kubectl logs --since-time=<timestamp> successfully retrieves logs.

There is a race condition between curl & elasti operator.
When elasti sets up EndpointSlices, curl is already doing it request, which results in curl failure.
Explicit wait for proxy mode solves this issue.

Test 06 is introducing helm release drift, with resolver being scaled to 2.
Helm drift fixup was added.
@rethil
Copy link
Contributor Author

rethil commented Oct 1, 2025

Hey @ramantehlan, this should fix that 02 test issues.
Could you re-run this few times and merge? This would unblock #186

@ramantehlan
Copy link
Collaborator

Hi @rethil, was just debugging this issue, thank you for the PR.
I will have a look.

@ramantehlan
Copy link
Collaborator

@rethil, we have a community call tomorrow, would you be able to join?
We are applying to CNCF this month, and are presenting to TAGs next month.

@rethil
Copy link
Contributor Author

rethil commented Oct 2, 2025

@rethil, we have a community call tomorrow, would you be able to join?
We are applying to CNCF this month, and are presenting to TAGs next month.

Hey, I won't make it. Good luck!

@ramantehlan
Copy link
Collaborator

Seems like it is failing again.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between adddeb5 and 33fe9de.

📒 Files selected for processing (1)
  • tests/e2e/manifest/setup.sh (2 hunks)
🧰 Additional context used
🪛 Shellcheck (0.11.0)
tests/e2e/manifest/setup.sh

[info] 50-50: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 55-55: Double quote to prevent globbing and word splitting.

(SC2086)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/operator)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/resolver)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/pkg)
  • GitHub Check: Analyze (go)
  • GitHub Check: Run Kuttl E2E Tests

@ramantehlan
Copy link
Collaborator

Surprisingly, all tests pass locally.

@rethil
Copy link
Contributor Author

rethil commented Oct 3, 2025

Surprisingly, all tests pass locally.

I'll try do debug this during weekend.
It's odd that there isn't anything at all from resolver. Just like that curl request didn't even hit it.

I really suspect that EndpointSlices are not yet in place after scaling pod to 0. Like CRD status was changed to "proxy", but slices are not there yet.

@rethil
Copy link
Contributor Author

rethil commented Oct 3, 2025

Assert in step 02 wasn't executed:

     logger.go:42: 16:37:20 | 02-enable-serve-via-traffic/2- | starting test step 2-
    logger.go:42: 16:37:20 | 02-enable-serve-via-traffic/2- | test step completed 2-

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/e2e/manifest/send_traffic.sh (1)

78-78: Remove unsupported kubectl logs --all-pods flag.

kubectl logs still rejects --all-pods=true, so this call fails before collecting resolver logs—exactly the regression we hit previously. Drop the flag or iterate pods via a selector.

-    kubectl logs -n elasti services/elasti-resolver-service --since-time="${start_time}" --all-pods=true | sed 's/^/    /' || echo "${YELLOW}    - Could not retrieve resolver logs${NC}"
+    kubectl logs -n elasti services/elasti-resolver-service --since-time="${start_time}" | sed 's/^/    /' || echo "${YELLOW}    - Could not retrieve resolver logs${NC}"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33fe9de and 5dd0cd9.

📒 Files selected for processing (2)
  • tests/e2e/manifest/send_traffic.sh (4 hunks)
  • tests/e2e/manifest/setup.sh (2 hunks)
🧰 Additional context used
🪛 Shellcheck (0.11.0)
tests/e2e/manifest/setup.sh

[info] 50-50: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 55-55: Double quote to prevent globbing and word splitting.

(SC2086)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/resolver)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/operator)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/pkg)
  • GitHub Check: Run Kuttl E2E Tests
  • GitHub Check: Analyze (go)

@ramantehlan
Copy link
Collaborator

Assert in step 02 wasn't executed:

     logger.go:42: 16:37:20 | 02-enable-serve-via-traffic/2- | starting test step 2-
    logger.go:42: 16:37:20 | 02-enable-serve-via-traffic/2- | test step completed 2-

Hmm, in that case it must be running tests on cached code.

@ramantehlan
Copy link
Collaborator

Surprisingly, all tests pass locally.

I'll try do debug this during weekend. It's odd that there isn't anything at all from resolver. Just like that curl request didn't even hit it.

I really suspect that EndpointSlices are not yet in place after scaling pod to 0. Like CRD status was changed to "proxy", but slices are not there yet.

We set mode once after all the steps are fulfilled, but I will cross check if we are waiting for EndpointSlices or not.

@ramantehlan
Copy link
Collaborator

ramantehlan commented Oct 3, 2025

Right now, we are not waiting for EndpointSliceChanges to take effect:

if isResolverSliceFound {
		if err := r.Update(ctx, newEndpointSlice); err != nil {
			r.Logger.Error("failed to update sliceToResolver", zap.String("endpointslice", EndpointsliceNamespacedName.String()), zap.Error(err))
			return fmt.Errorf("createOrUpdateEndpointsliceToResolver: %w", err)
		}
		r.Logger.Info("EndpointSlice updated successfully", zap.String("endpointslice", EndpointsliceNamespacedName.String()))
	} 

Here is more context:

When you call r.Update(ctx, obj), it sends the update request to the Kubernetes API server. The function will only return a nil error after the API server has successfully validated the request, persisted the change to its underlying storage (etcd), and sent back a success confirmation.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/e2e/manifest/send_traffic.sh (1)

78-78: Restore kubectl compatibility by dropping --all-pods.

kubectl logs still rejects the --all-pods=true flag, so this command exits with “unknown flag: --all-pods” and the failure diagnostics never run. Please remove the unsupported flag (or loop over pods via kubectl get) to keep log collection working.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1305f4 and 4a34ded.

📒 Files selected for processing (2)
  • operator/internal/controller/opsModes.go (2 hunks)
  • tests/e2e/manifest/send_traffic.sh (5 hunks)
🧰 Additional context used
🪛 Shellcheck (0.11.0)
tests/e2e/manifest/send_traffic.sh

[info] 126-126: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 130-130: echo may not expand escape sequences. Use printf.

(SC2028)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/resolver)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/pkg)
  • GitHub Check: test (/home/runner/work/KubeElasti/KubeElasti/operator)
  • GitHub Check: Analyze (go)
  • GitHub Check: Run Kuttl E2E Tests
🔇 Additional comments (2)
operator/internal/controller/opsModes.go (2)

77-77: LGTM!

Log messages correctly reflect the reordered operations with accurate sequential numbering.

Also applies to: 83-83


74-77: Verify EndpointSlice readiness after creation. No direct EndpointSlice List/Get calls were found in watchPublicService; ensure it relies on an informer cache to pick up the create/update event or add an explicit poll/wait after Line 77 to eliminate the race.

@ramantehlan ramantehlan merged commit 541373e into truefoundry:main Oct 3, 2025
11 checks passed
ramantehlan added a commit to rethil/KubeElasti that referenced this pull request Oct 3, 2025
* e2e: fix 02 test failure

There is a race condition between curl & elasti operator.
When elasti sets up EndpointSlices, curl is already doing it request, which results in curl failure.
Explicit wait for proxy mode solves this issue.

Test 06 is introducing helm release drift, with resolver being scaled to 2.
Helm drift fixup was added.

* setup: don't correct helm drift

* e2e assert: it can be only action in file

* feat: move endpointslice creation earlier, add retry to curl, with a 5 second sleep on first test

* test: retest

---------

Co-authored-by: Raman Tehlan <[email protected]>
ramantehlan added a commit to rethil/KubeElasti that referenced this pull request Oct 3, 2025
* e2e: fix 02 test failure

There is a race condition between curl & elasti operator.
When elasti sets up EndpointSlices, curl is already doing it request, which results in curl failure.
Explicit wait for proxy mode solves this issue.

Test 06 is introducing helm release drift, with resolver being scaled to 2.
Helm drift fixup was added.

* setup: don't correct helm drift

* e2e assert: it can be only action in file

* feat: move endpointslice creation earlier, add retry to curl, with a 5 second sleep on first test

* test: retest

---------

Co-authored-by: Raman Tehlan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants