LCORE-334: Lightspeed core needs to fully support RHOAI LLM provider (prow) #747

Conversation
Walkthrough

Adds RHOAI (vLLM) provider entries and a complete e2e test harness: documentation, Kubernetes manifests (operators, ServingRuntime, InferenceService, pods), LCS/llama-stack configs, and orchestration/bootstrap/deploy/test scripts to install vLLM, expose services, and run end-to-end tests.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Prow
participant Pipeline as pipeline.sh
participant Bootstrap as bootstrap.sh
participant DeployVLLM as deploy-vllm.sh
participant Services as pipeline-services.sh
participant K8s
participant Llama as llama-stack
participant LCS as lightspeed-stack
participant TestPod as test-pod
Prow->>Pipeline: start pipeline
Pipeline->>K8s: create namespace, secrets, configmaps
Pipeline->>Bootstrap: apply operators & wait for CSVs
Bootstrap->>K8s: ensure operator CSVs ready
Pipeline->>DeployVLLM: wait for KServe CRDs/controller, apply vLLM manifests
DeployVLLM->>K8s: apply ServingRuntime & InferenceService
Pipeline->>Services: deploy llama-stack pod
Services->>K8s: wait Ready → expose service → create LLAMA_IP secret
Services->>K8s: deploy lightspeed-stack with LLAMA_IP
Pipeline->>K8s: deploy test-pod (spin-up)
TestPod->>LCS: hit /v1/models and run make test-e2e
TestPod-->>Prow: return exit status
```

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (20)
🚧 Files skipped from review as they are similar to previous changes (10)
🧰 Additional context used

🧠 Learnings (3)

📚 Learning: 2025-09-18T16:46:33.353Z. Applied to files:
📚 Learning: 2025-09-02T11:09:40.404Z. Applied to files:
📚 Learning: 2025-08-18T10:56:55.349Z. Applied to files:
🪛 Checkov (3.2.334)

tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml

[medium] 1-25: Containers should not run with allowPrivilegeEscalation (CKV_K8S_20)
[medium] 1-25: Minimize the admission of root containers (CKV_K8S_23)

tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml

[medium] 1-34: Containers should not run with allowPrivilegeEscalation (CKV_K8S_20)
[medium] 1-34: Minimize the admission of root containers (CKV_K8S_23)

tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml

[medium] 1-30: Containers should not run with allowPrivilegeEscalation (CKV_K8S_20)
[medium] 1-30: Minimize the admission of root containers (CKV_K8S_23)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
🔇 Additional comments (3)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 6
🧹 Nitpick comments (12)
tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml (1)
1-30: Add SecurityContext to restrict privileges in test pod.

The container lacks a SecurityContext and will run as root. Checkov flags this as a security concern. Even in test environments, add a minimal SecurityContext to follow Kubernetes security best practices.

Apply this diff to add basic security constraints:

```diff
   containers:
     - name: test-container
+      securityContext:
+        allowPrivilegeEscalation: false
+        runAsNonRoot: true
+        runAsUser: 1000
+        readOnlyRootFilesystem: false
       env:
         - name: LCS_IP
```

Alternatively, if the test script requires root or elevated privileges, document this in a comment explaining the rationale.
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml (1)
1-34: Add SecurityContext and consider limiting emptyDir size.

Two improvements for production readiness (though acceptable for a test environment):

- No SecurityContext is defined; the container runs as root by default.
- The app-root emptyDir has no size limit and could consume unbounded disk space.

Apply this diff:

```diff
       - name: llama-stack-container
+        securityContext:
+          allowPrivilegeEscalation: false
+          runAsNonRoot: true
+          runAsUser: 1000
+          readOnlyRootFilesystem: false
         env:
```

And add a size limit to the emptyDir:

```diff
   volumes:
     - name: app-root
-      emptyDir: {}
+      emptyDir:
+        sizeLimit: 1Gi
```

If the test workflow requires root or large storage, document the rationale in a comment.
tests/e2e-prow/rhoai/manifests/operators/operators.yaml (1)
1-35: Normalize YAML quoting style for consistency.

Lines 8 and 20 quote the channel value ("stable"), while line 32 does not (stable). Both are functionally equivalent, but normalizing to a consistent style improves maintainability.

Apply this diff to unify the style:

```yaml
spec:
  channel: "stable"
  name: "servicemeshoperator"
```

or:

```diff
 spec:
-  channel: "stable"
+  channel: stable
```

Either approach is fine; pick the one matching your project conventions (typically unquoted for simple strings).
tests/e2e-prow/rhoai/pipeline-test-pod.sh (1)
1-5: Pipeline script is straightforward and correct.

The script correctly computes BASE_DIR and applies the test pod manifest. For robustness, consider adding `set -e` at line 2 to exit on oc apply failure, but the current implementation is acceptable for this simple orchestration step.

Optionally add error handling:

```diff
 #!/bin/bash
+set -e
 BASE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
```

tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml (1)
16-17: Consider using an environment variable for the api_key placeholder.

The hardcoded api_key value "xyzzy" is a placeholder. For better security posture and test flexibility, use an environment variable pattern like the other values:

```diff
-  api_key: xyzzy
+  api_key: ${env.LLAMA_API_KEY:-xyzzy}
```

This allows overriding the placeholder via environment variable in production while maintaining a fallback for testing.
tests/e2e-prow/rhoai/scripts/deploy-vllm.sh (1)
1-3: Add strict error handling and argument validation.

The script lacks `set -euo pipefail` and doesn't validate the BASE_DIR argument, which could lead to applying manifests from incorrect paths.

```diff
 #!/bin/bash
+
+set -euo pipefail
+
+if [[ -z "${1:-}" ]]; then
+  echo "❌ BASE_DIR argument required"
+  exit 1
+fi
+
-BASE_DIR="$1"
+readonly BASE_DIR="$1"
```

tests/e2e-prow/rhoai/scripts/bootstrap.sh (1)
1-10: Add BASE_DIR validation for robustness.

The script uses BASE_DIR without validating it's provided, which could lead to manifests being applied from incorrect paths.

```diff
 #!/bin/bash
 set -euo pipefail

+if [[ -z "${1:-}" ]]; then
+  echo "❌ BASE_DIR argument required"
+  exit 1
+fi
+
-BASE_DIR="$1"
+readonly BASE_DIR="$1"
```

tests/e2e-prow/rhoai/pipeline.sh (4)
5-5: Consider parameterizing the hardcoded model name.

Line 5 hardcodes MODEL_NAME="meta-llama/Llama-3.2-1B-Instruct". For flexibility in future testing with different models, consider accepting this as an environment variable with a default fallback:

```bash
MODEL_NAME="${LLAMA_MODEL_ID:-meta-llama/Llama-3.2-1B-Instruct}"
```

This allows prow jobs to override the model without modifying the script.
100-100: Strengthen response validation to tolerate formatting variations.

Line 100 checks for the exact string '"object":"text_completion"' in the response body. This is fragile if the JSON response format or field order changes. Consider a more robust check using jq:

```bash
if [[ "$http_code" == "200" ]] && echo "$body" | jq -e '.object == "text_completion"' >/dev/null 2>&1; then
  echo "✅ API test passed."
  break
else
  echo "❌ API test failed (HTTP $http_code)"
  echo "$body" | jq . 2>/dev/null || echo "$body"
fi
```

This extracts and validates the field semantically rather than relying on string position.
Also applies to: 105-105
200-210: Add error handling for missing pod termination status.

Line 200 extracts the exit code using a jsonpath that assumes .status.containerStatuses[0].state.terminated.exitCode exists. If the pod terminates abnormally (e.g., OOMKilled, evicted) or the path doesn't exist, $TEST_EXIT_CODE will be empty or null, and the exit condition at line 204 will default to 2 (due to the ${TEST_EXIT_CODE:-2} fallback). While this is safe (fails closed), add explicit logging to surface the actual termination reason:

```bash
TEST_EXIT_CODE=$(oc get pod test-pod -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}' 2>/dev/null)
TERMINATION_REASON=$(oc get pod test-pod -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}' 2>/dev/null)

if [[ -z "$TEST_EXIT_CODE" ]]; then
  echo "❌ Unable to extract exit code. Termination reason: $TERMINATION_REASON"
  exit 1
fi
```

This provides visibility into unexpected terminations beyond exit code 0/non-zero.
12-15: Document or verify hardcoded credential paths and image mirror URLs.

Lines 12-15 read credentials from hardcoded prow mount paths (/var/run/...), and lines 26-29 mirror a specific AWS image to a specific quay.io organization path (quay.io/rh-ee-cpompeia/vllm-cpu:latest). These are tightly coupled to the prow environment and the Red Hat organization structure. Add inline comments documenting:
- Why these specific mount paths are used (prow convention)
- Why the quay.io target is organization-specific
- Whether these paths/URLs are managed by a central prow configuration or should be externalized
Example:
```bash
# Prow mounts secrets at these standardized paths (managed by prow infra)
export HUGGING_FACE_HUB_TOKEN=$(cat /var/run/huggingface/hf-token-ces-lcore-test || true)
...
# Mirror to quay.io repo managed by RHOAI release engineering
oc image mirror \
  public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest \
  quay.io/rh-ee-cpompeia/vllm-cpu:latest
```

This improves maintainability for future modifications.
Also applies to: 26-29
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1)
14-26: Consider parameterizing vLLM container arguments for flexibility.

Lines 14-26 hardcode critical vLLM launch parameters (model, tool parser, max model length). For future test variations, consider using environment variables or an init container to populate these dynamically. However, if this is intentionally a static CPU benchmark configuration for prow, add a comment clarifying the rationale:

```yaml
# Static configuration optimized for CPU-only prow environment
# To test alternative models or parameters, create a new ServingRuntime manifest
containers:
  - args:
      - --model
      - ${VLLM_MODEL_ID:-meta-llama/Llama-3.2-1B-Instruct}  # Override via env var
      ...
```

Alternatively, if environment variables are not supported in vLLM args, document this limitation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- README.md (1 hunks)
- docs/providers.md (1 hunks)
- tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/configs/run.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/operatorgroup.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/operators.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/vllm/inference-service.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1 hunks)
- tests/e2e-prow/rhoai/pipeline-services.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline-test-pod.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline-vllm.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline.sh (1 hunks)
- tests/e2e-prow/rhoai/run-tests.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/bootstrap.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/deploy-vllm.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/get-vllm-pod-info.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-18T16:46:33.353Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-18T16:46:33.353Z
Learning: Applies to tests/e2e/test_list.txt : Maintain the E2E test list in tests/e2e/test_list.txt
Applied to files:
tests/e2e-prow/rhoai/run-tests.sh
tests/e2e-prow/rhoai/pipeline.sh
📚 Learning: 2025-09-02T11:09:40.404Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 485
File: tests/e2e/features/environment.py:87-95
Timestamp: 2025-09-02T11:09:40.404Z
Learning: In the lightspeed-stack e2e tests, noop authentication tests use the default lightspeed-stack.yaml configuration, while noop-with-token tests use the Authorized tag to trigger a config swap to the specialized noop-with-token configuration file.
Applied to files:
tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml
🪛 Checkov (3.2.334)
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
[medium] 1-25: Containers should not run with allowPrivilegeEscalation (CKV_K8S_20)
[medium] 1-25: Minimize the admission of root containers (CKV_K8S_23)

tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml

[medium] 1-30: Containers should not run with allowPrivilegeEscalation (CKV_K8S_20)
[medium] 1-30: Minimize the admission of root containers (CKV_K8S_23)

tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml

[medium] 1-34: Containers should not run with allowPrivilegeEscalation (CKV_K8S_20)
[medium] 1-34: Minimize the admission of root containers (CKV_K8S_23)
🔇 Additional comments (14)
tests/e2e-prow/rhoai/manifests/operators/operatorgroup.yaml (1)
1-6: OperatorGroup manifest looks correct. The global-operators OperatorGroup with an empty spec correctly scopes the subscriptions in the openshift-operators namespace to cluster-wide availability, which is appropriate for the operator setup.
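For reference, a minimal sketch of what such an OperatorGroup looks like (reconstructed from the names in this comment, not copied from the PR):

```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: global-operators
  namespace: openshift-operators
spec: {}  # an empty spec targets all namespaces, i.e. cluster-wide operators
```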
tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml (1)
1-17: DataScienceCluster configuration is appropriate for the e2e test. Enabling kserve and service mesh while removing workbenches, dashboard, and pipelines is a focused setup for testing vLLM inference services. The configuration aligns with the operator subscriptions defined in operators.yaml.
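As a sketch, the described component selection maps to a DataScienceCluster of roughly this shape (field names follow the OpenDataHub/RHOAI API; the exact values in the PR's manifest may differ):

```yaml
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    kserve:
      managementState: Managed      # enabled for vLLM inference
    workbenches:
      managementState: Removed
    dashboard:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
```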
README.md (1)
127-127: LLM compatibility table entry looks good. The RHOAI (vLLM) entry is correctly formatted, follows the table conventions, and includes a valid link to the test configuration example.
docs/providers.md (1)
60-65: LGTM! The documentation entry for the RHOAI (vllm) provider is consistent with the existing table format and accurately reflects the provider details.
tests/e2e-prow/rhoai/manifests/vllm/inference-service.yaml (1)
1-13: LGTM! The KServe InferenceService manifest is correctly structured and appropriately references the vllm runtime and model storage configuration.
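For context, a minimal KServe InferenceService of this kind typically looks like the sketch below; the model format, runtime name, and storageUri here are illustrative assumptions, while the vllm-model name matches the default used by get-vllm-pod-info.sh:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: vllm-model
  namespace: e2e-rhoai-dsc
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM                  # must match the ServingRuntime's supported format
      runtime: vllm-runtime-cpu     # assumed name of the ServingRuntime
      storageUri: hf://meta-llama/Llama-3.2-1B-Instruct  # illustrative model source
```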
tests/e2e-prow/rhoai/scripts/bootstrap.sh (1)
1-47: Overall script quality is solid.

The bootstrap script demonstrates good practices with `set -euo pipefail`, clear progress logging, a well-structured `wait_for_operator` helper, and appropriate timeout values. The operator sequencing and readiness checks are well thought out for a complex orchestration scenario. Minor: add BASE_DIR validation (see comment above).
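For readers without the file at hand, a `wait_for_operator` helper of the kind praised here usually follows this pattern (a sketch under assumptions; the actual implementation in bootstrap.sh may differ):

```bash
# Poll OLM until the operator's CSV reports phase Succeeded, or give up
wait_for_operator() {
  local csv_prefix="$1" namespace="$2" timeout="${3:-600}" elapsed=0
  until oc get csv -n "$namespace" \
      -o jsonpath='{.items[?(@.status.phase=="Succeeded")].metadata.name}' \
      | grep -q "$csv_prefix"; do
    if (( elapsed >= timeout )); then
      echo "❌ timed out waiting for $csv_prefix CSV"
      return 1
    fi
    sleep 10
    (( elapsed += 10 ))
  done
  echo "✅ $csv_prefix is ready"
}
```

tests/e2e-prow/rhoai/pipeline-vllm.sh (1)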
1-7: Add error handling to prevent cascading failures.

The script chains sub-scripts without checking exit codes. If any script fails, the pipeline continues silently instead of stopping. This can lead to partial/corrupted state being passed to the test phase.

Apply this diff to add strict error handling:

```diff
 #!/bin/bash
+
+set -euo pipefail

 PIPELINE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

 "$PIPELINE_DIR/scripts/bootstrap.sh" "$PIPELINE_DIR"
 "$PIPELINE_DIR/scripts/deploy-vllm.sh" "$PIPELINE_DIR"
 "$PIPELINE_DIR/scripts/get-vllm-pod-info.sh"
```

Note: get-vllm-pod-info.sh expects an optional ISVC_NAME argument (defaults to "vllm-model"), not PIPELINE_DIR, so the current call without arguments is correct.

Likely an incorrect or invalid review comment.

tests/e2e-prow/rhoai/configs/run.yaml (1)
tests/e2e-prow/rhoai/configs/run.yaml (1)
62-69: Verify llama-stack supports${env.*}environment variable interpolation syntax.The inference provider configuration uses
${env.KSVC_URL}and${env.VLLM_API_KEY}for runtime variable substitution. Ensure that the llama-stack runtime (v2.25 per PR objectives) natively supports this interpolation syntax before container startup, or if pre-substitution is required, verify it's handled by the deployment manifests.tests/e2e-prow/rhoai/pipeline-services.sh (2)
5-5: Verify manifest paths and pod naming conventions are consistent.Line 5 applies
manifests/lightspeed/llama-stack.yamlwhile line 25 appliesmanifests/lightspeed/lightspeed-stack.yaml. Confirm both files exist at the expected paths and that the pod names (llama-stack-service,lightspeed-stack-service) match their respective manifest deployments, especially if the subdirectory structure differs from the manifest locations.Also applies to: 25-25
7-8: Verify pod/service naming and port alignment.The script waits for pod
llama-stack-serviceand then exposes it as servicellama-stack-service-svcon port 8321. Confirm that the llama-stack manifest defines a pod (not a Deployment/StatefulSet) namedllama-stack-service, and that port 8321 is the correct target port for the llama-stack service.Also applies to: 13-17
tests/e2e-prow/rhoai/pipeline.sh (2)
128-128: Verify run-tests.sh exists before creating configmap.Line 128 creates a configmap from
configs/run-tests.sh. Ensure this file exists in the repository at the path relative toBASE_DIR(likely$BASE_DIR/configs/run-tests.shortests/e2e-prow/rhoai/configs/run-tests.sh). If the file path is incorrect or the file is not committed, the configmap creation will fail silently or create an empty configmap.
168-168: Verify pipeline-test-pod.sh and ensure it deploys pod named 'test-pod'.Line 168 calls
./pipeline-test-pod.sh, which is not provided in the review. Confirm that this script:
- Exists and is executable
- Deploys a pod named
test-podin the$NAMESPACE(e2e-rhoai-dsc)- Does not fail silently on errors
The pipeline depends on
test-podexisting at line 174 (wait condition) and line 181 (phase check).tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (2)
29-29: Verify vLLM image is accessible and correctly mirrored.

Line 29 references quay.io/rh-ee-cpompeia/vllm-cpu:latest. This image should be mirrored by the pipeline.sh script (lines 26-29 in that file) from public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest. Confirm:
- The mirror source URL is correct and publicly accessible
- The target image path matches exactly (including registry, namespace, and tag)
- The image is available before this manifest is applied (verify dependency order in pipeline.sh)
If the mirror fails or the image is not pushed, pod startup will fail with ImagePullBackOff.
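A sketch of how to sanity-check both ends of the mirror before applying the manifest (assumes skopeo is available in the job image):

```bash
skopeo inspect docker://public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest >/dev/null \
  || echo "❌ mirror source unreachable"
skopeo inspect docker://quay.io/rh-ee-cpompeia/vllm-cpu:latest >/dev/null \
  || echo "❌ mirror target missing; expect ImagePullBackOff"
```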
11-12: Verify model loading timeout is adequate for the model size and environment.

The modelLoadingTimeoutMillis: 90000 (90 seconds) may be insufficient for downloading and initializing a 1B parameter model on CPU in prow, especially if HuggingFace API rate limits or network latency are involved. Consider:
- Running a test deployment to measure actual load time
- Increasing timeout to 180-300 seconds as a conservative margin
- Adding observability (logs, metrics) to track model loading duration for future troubleshooting
If the timeout is consistently exceeded, the pod will fail to become Ready, blocking the entire test pipeline.
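Concretely, the 180-300 second suggestion above would look like this in the ServingRuntime (value illustrative; measure actual load time before settling on it):

```yaml
spec:
  # raised from 90000 ms; CPU cold starts plus a HuggingFace download can exceed 90s
  modelLoadingTimeoutMillis: 300000
```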
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: lightspeed-stack-service
  namespace: e2e-rhoai-dsc
spec:
  containers:
    - name: lightspeed-stack-container
      env:
        - name: LLAMA_IP
          valueFrom:
            secretKeyRef:
              name: llama-stack-ip-secret
              key: key
      image: quay.io/lightspeed-core/lightspeed-stack:dev-latest
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: config
          mountPath: /app-root/lightspeed-stack.yaml
          subPath: lightspeed-stack.yaml
  volumes:
    - name: config
      configMap:
        name: lightspeed-stack-config
```
Add security context and resource limits to meet Kubernetes security best practices.
The Pod lacks security hardening and resource management:

- No securityContext (allows privilege escalation and root execution)
- No resource requests/limits (affects scheduling and prevents runaway consumption)

Apply this diff to address security and resource management:

```diff
 spec:
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
   containers:
     - name: lightspeed-stack-container
+      securityContext:
+        allowPrivilegeEscalation: false
+      resources:
+        requests:
+          memory: "512Mi"
+          cpu: "250m"
+        limits:
+          memory: "1Gi"
+          cpu: "500m"
       env:
         - name: LLAMA_IP
```

Adjust resource values based on actual performance requirements observed during testing.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: lightspeed-stack-service
  namespace: e2e-rhoai-dsc
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
  containers:
    - name: lightspeed-stack-container
      securityContext:
        allowPrivilegeEscalation: false
      resources:
        requests:
          memory: "512Mi"
          cpu: "250m"
        limits:
          memory: "1Gi"
          cpu: "500m"
      env:
        - name: LLAMA_IP
          valueFrom:
            secretKeyRef:
              name: llama-stack-ip-secret
              key: key
      image: quay.io/lightspeed-core/lightspeed-stack:dev-latest
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: config
          mountPath: /app-root/lightspeed-stack.yaml
          subPath: lightspeed-stack.yaml
  volumes:
    - name: config
      configMap:
        name: lightspeed-stack-config
```
🧰 Tools
🪛 Checkov (3.2.334)
[medium] 1-25: Containers should not run with allowPrivilegeEscalation (CKV_K8S_20)
[medium] 1-25: Minimize the admission of root containers (CKV_K8S_23)
```yaml
        - --max-model-len
        - "2048"
      image: quay.io/rh-ee-cpompeia/vllm-cpu:latest
      name: kserve-container
```
Add resource requests and limits to prevent scheduling issues and runaway resource consumption.
The manifest does not define resources.requests or resources.limits. On a CPU-only prow node, vLLM can consume unbounded CPU and memory, potentially evicting other pods or causing the entire node to become unschedulable. Add conservative limits:
```yaml
containers:
  - name: kserve-container
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
      limits:
        cpu: "4"
        memory: "8Gi"
```

Adjust these values based on the 1B model's actual memory footprint (likely 2-4 GB) and CPU availability in prow.
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml around line 30, the
container "kserve-container" lacks resources.requests and resources.limits which
can cause unbounded CPU/memory usage; add a resources block under that container
with conservative requests and limits (e.g., requests: cpu "2", memory "4Gi";
limits: cpu "4", memory "8Gi") and adjust values as needed for the 1B model and
prow node capacity to prevent scheduling and eviction issues.
```bash
#========================================
# 6. WAIT FOR POD & TEST API
#========================================
source pod.env
```
Add error handling for pod.env sourcing and validation of $KSVC_URL.
Line 74 sources pod.env without error checking. If the file is missing or malformed, $POD_NAME and $KSVC_URL will be empty or unset, causing silent failures in the wait command (line 75) and curl calls (line 90). Add validation after sourcing:
```bash
source pod.env || { echo "❌ Failed to source pod.env"; exit 1; }
[[ -n "$POD_NAME" ]] && [[ -n "$KSVC_URL" ]] || { echo "❌ Missing POD_NAME or KSVC_URL"; exit 1; }
```

Alternatively, verify that pipeline-vllm.sh unconditionally exports these variables to a file in a way that guarantees success.
Also applies to: 90-90
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/pipeline.sh around lines 74 and 90, the script currently
sources pod.env without checking for errors and then uses $POD_NAME and
$KSVC_URL later, which can lead to silent failures; after sourcing pod.env,
ensure the source succeeded and validate that both POD_NAME and KSVC_URL are
non-empty, printing an error and exiting non-zero if sourcing fails or either
variable is missing; alternatively, update pipeline-vllm.sh to reliably write
and export these variables to pod.env (or fail loudly) so sourcing cannot
silently produce unset values.
```bash
git clone https://github.com/lightspeed-core/lightspeed-stack.git
cd lightspeed-stack

echo "pod started"
echo $LCS_IP

curl -f http://$LCS_IP:8080/v1/models || {
  echo "❌ Basic connectivity failed - showing logs before running full tests"
  exit 1
}

echo "Installing test dependencies..."
pip install uv
uv sync

export E2E_LSC_HOSTNAME=$LCS_IP
export E2E_LLAMA_HOSTNAME=$LLAMA_IP

echo "Running comprehensive e2e test suite..."
make test-e2e
```
Add environment variable validation and explicit error handling.
The script assumes LCS_IP and LLAMA_IP are set but doesn't validate them. If missing, errors are cryptic. Also, make test-e2e failures may not propagate correctly.
Apply this diff to validate inputs and ensure proper error propagation:
```diff
 git clone https://github.com/lightspeed-core/lightspeed-stack.git
 cd lightspeed-stack
+
+# Validate required environment variables
+if [[ -z "${LCS_IP:-}" ]] || [[ -z "${LLAMA_IP:-}" ]]; then
+  echo "❌ Required environment variables not set: LCS_IP=$LCS_IP, LLAMA_IP=$LLAMA_IP"
+  exit 1
+fi

 echo "pod started"
 echo $LCS_IP
@@ -12,7 +20,7 @@ echo $LCS_IP
 echo "Installing test dependencies..."
 pip install uv
-uv sync
+uv sync || { echo "❌ Failed to sync dependencies"; exit 1; }

 export E2E_LSC_HOSTNAME=$LCS_IP
 export E2E_LLAMA_HOSTNAME=$LLAMA_IP

 echo "Running comprehensive e2e test suite..."
-make test-e2e
+make test-e2e || { echo "❌ E2E tests failed"; exit 1; }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
git clone https://github.com/lightspeed-core/lightspeed-stack.git
cd lightspeed-stack

# Validate required environment variables
if [[ -z "${LCS_IP:-}" ]] || [[ -z "${LLAMA_IP:-}" ]]; then
  echo "❌ Required environment variables not set: LCS_IP=$LCS_IP, LLAMA_IP=$LLAMA_IP"
  exit 1
fi

echo "pod started"
echo $LCS_IP

curl -f http://$LCS_IP:8080/v1/models || {
  echo "❌ Basic connectivity failed - showing logs before running full tests"
  exit 1
}

echo "Installing test dependencies..."
pip install uv
uv sync || { echo "❌ Failed to sync dependencies"; exit 1; }

export E2E_LSC_HOSTNAME=$LCS_IP
export E2E_LLAMA_HOSTNAME=$LLAMA_IP

echo "Running comprehensive e2e test suite..."
make test-e2e || { echo "❌ E2E tests failed"; exit 1; }
```
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/run-tests.sh around lines 1-20, add explicit environment
validation and stricter error handling: enable safe shell options (set -euo
pipefail) and a trap to print a helpful message on failure; check that LCS_IP
and LLAMA_IP are defined and non-empty and exit with descriptive error codes if
they are missing; after the curl health check, capture and print relevant logs
on failure before exiting; run make test-e2e and explicitly exit with the make
command's status so failures propagate to the caller.
```bash
oc apply -f "$BASE_DIR/manifests/vllm/vllm-runtime-cpu.yaml"

oc apply -f "$BASE_DIR/manifests/vllm/inference-service.yaml"
```
Add error handling to manifest application commands.
The oc apply commands at the end don't check for failures. If manifests are malformed or paths are wrong, the script succeeds silently.
```diff
 oc apply -f "$BASE_DIR/manifests/vllm/vllm-runtime-cpu.yaml"

-oc apply -f "$BASE_DIR/manifests/vllm/inference-service.yaml"
+oc apply -f "$BASE_DIR/manifests/vllm/inference-service.yaml" || {
+  echo "❌ Failed to apply vLLM manifests"
+  exit 1
+}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
oc apply -f "$BASE_DIR/manifests/vllm/vllm-runtime-cpu.yaml"

oc apply -f "$BASE_DIR/manifests/vllm/inference-service.yaml" || {
  echo "❌ Failed to apply vLLM manifests"
  exit 1
}
```
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/scripts/deploy-vllm.sh around lines 32 to 34, the two oc
apply commands run without checking for failures; update the script to handle
errors by verifying each command's exit status and exiting the script on
failure. After each oc apply -f "$BASE_DIR/..." command, add a check like: if
the command fails (non‑zero exit code) print a descriptive error to stderr
including the manifest path and exit with a non‑zero status (or use set -e at
the top and/or append || { echo "failed to apply <file>" >&2; exit 1; } to each
command) so malformed manifests or wrong paths cause the script to fail fast.
```bash
until [ -n "$POD_NAME" ] || [ $ELAPSED -ge $TIMEOUT ]; do
  POD_NAME=$(oc get pods -n "$NAMESPACE" \
    -l "serving.kserve.io/inferenceservice=$ISVC_NAME" \
    -o jsonpath="{.items[?(@.status.phase=='Running')].metadata.name}" 2>/dev/null)
  echo "Waiting for pod $POD_NAME in namespace $NAMESPACE"

  if [ -z "$POD_NAME" ]; then
    echo " -> Pod not running yet, waiting $INTERVAL seconds..."
    sleep $INTERVAL
    ELAPSED=$((ELAPSED + INTERVAL))
  fi
done

oc describe pod $POD_NAME -n $NAMESPACE || true
oc logs $POD_NAME -n $NAMESPACE || true

POD_NAME=$(oc get pods -n $NAMESPACE -o jsonpath='{.items[0].metadata.name}')

if [ -z "$POD_NAME" ]; then
  echo " -> Timeout reached after $TIMEOUT seconds. Pod is not running."
else
  echo " -> Pod is running: $POD_NAME"
fi
```
Critical bug: POD_NAME is overwritten after targeted lookup.
The until loop (lines 18–29) successfully locates the pod matching the InferenceService label. However, line 34 immediately overwrites POD_NAME with the first pod in the namespace, defeating the targeted lookup and potentially operating on the wrong pod.
Apply this diff to remove the erroneous line 34:
```diff
 oc describe pod $POD_NAME -n $NAMESPACE || true
 oc logs $POD_NAME -n $NAMESPACE || true

-POD_NAME=$(oc get pods -n $NAMESPACE -o jsonpath='{.items[0].metadata.name}')
-
 if [ -z "$POD_NAME" ]; then
   echo " -> Timeout reached after $TIMEOUT seconds. Pod is not running."
 else
```

This preserves the POD_NAME found by the targeted loop lookup.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
until [ -n "$POD_NAME" ] || [ $ELAPSED -ge $TIMEOUT ]; do
  POD_NAME=$(oc get pods -n "$NAMESPACE" \
    -l "serving.kserve.io/inferenceservice=$ISVC_NAME" \
    -o jsonpath="{.items[?(@.status.phase=='Running')].metadata.name}" 2>/dev/null)
  echo "Waiting for pod $POD_NAME in namespace $NAMESPACE"

  if [ -z "$POD_NAME" ]; then
    echo " -> Pod not running yet, waiting $INTERVAL seconds..."
    sleep $INTERVAL
    ELAPSED=$((ELAPSED + INTERVAL))
  fi
done

oc describe pod $POD_NAME -n $NAMESPACE || true
oc logs $POD_NAME -n $NAMESPACE || true

if [ -z "$POD_NAME" ]; then
  echo " -> Timeout reached after $TIMEOUT seconds. Pod is not running."
else
  echo " -> Pod is running: $POD_NAME"
fi
```
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/scripts/get-vllm-pod-info.sh around lines 18 to 40,
POD_NAME found by the targeted oc get pods lookup inside the until loop is being
overwritten by an unconditional oc get pods call (line 34) that selects the
first pod in the namespace; remove that extra POD_NAME assignment so the script
keeps and uses the pod discovered by the label-based lookup, ensuring subsequent
describe/logs and timeout checks operate on the intended pod.
LGTM
Actionable comments posted: 3
♻️ Duplicate comments (5)
tests/e2e-prow/rhoai/scripts/deploy-vllm.sh (1)
32-34: Add error handling to manifest application commands.

The oc apply commands lack failure checking. If manifests are malformed or paths incorrect, the script succeeds silently, causing downstream failures to be hard to diagnose.

```diff
-oc apply -f "$BASE_DIR/manifests/vllm/vllm-runtime-cpu.yaml"
+oc apply -f "$BASE_DIR/manifests/vllm/vllm-runtime-cpu.yaml" || {
+  echo "❌ Failed to apply vLLM runtime manifest"
+  exit 1
+}

-oc apply -f "$BASE_DIR/manifests/vllm/inference-service.yaml"
+oc apply -f "$BASE_DIR/manifests/vllm/inference-service.yaml" || {
+  echo "❌ Failed to apply inference service manifest"
+  exit 1
+}
```

Alternatively, add `set -e` at the top of the script to fail fast on any command error.

tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1)
1-25: Add security context and resource limits (known issue).

This Pod lacks securityContext and resource management, allowing root execution and privilege escalation. Apply the same hardening as recommended for test-pod:

```diff
 spec:
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
+    fsGroup: 1000
   containers:
     - name: lightspeed-stack-container
+      securityContext:
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: false
+        capabilities:
+          drop:
+            - ALL
+      resources:
+        requests:
+          memory: "512Mi"
+          cpu: "250m"
+        limits:
+          memory: "1Gi"
+          cpu: "500m"
       env:
```

Note: readOnlyRootFilesystem is set to false since lightspeed-stack likely needs write access at runtime.

tests/e2e-prow/rhoai/run-tests.sh (1)
1-20: Add environment validation and error handling (known issue).

The script assumes LCS_IP and LLAMA_IP are set but doesn't validate them. Additionally, command failures don't propagate. Apply this diff to address previous concerns:

```diff
+#!/bin/bash
+set -euo pipefail
+trap 'echo "❌ Test script failed at line $LINENO"; exit 1' ERR
+
+# Validate required environment variables
+if [[ -z "${LCS_IP:-}" ]] || [[ -z "${LLAMA_IP:-}" ]]; then
+  echo "❌ Required environment variables not set: LCS_IP=${LCS_IP:-empty}, LLAMA_IP=${LLAMA_IP:-empty}"
+  exit 1
+fi
+
 git clone https://github.com/lightspeed-core/lightspeed-stack.git
 cd lightspeed-stack

 echo "pod started"
 echo $LCS_IP

 curl -f http://$LCS_IP:8080/v1/models || {
   echo "❌ Basic connectivity failed - showing logs before running full tests"
   exit 1
 }

 echo "Installing test dependencies..."
 pip install uv
-uv sync
+uv sync || { echo "❌ Failed to sync dependencies"; exit 1; }

 export E2E_LSC_HOSTNAME=$LCS_IP
 export E2E_LLAMA_HOSTNAME=$LLAMA_IP

 echo "Running comprehensive e2e test suite..."
-make test-e2e
+make test-e2e || { echo "❌ E2E tests failed with exit code $?"; exit $?; }
```

tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml (1)
1-34: Add security context and resource limits (pattern across manifests).

This Pod lacks securityContext and resource management. Apply consistent hardening across all Kubernetes manifests:

```diff
 spec:
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
+    fsGroup: 1000
   containers:
     - name: llama-stack-container
+      securityContext:
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: false
+        capabilities:
+          drop:
+            - ALL
+      resources:
+        requests:
+          memory: "512Mi"
+          cpu: "250m"
+        limits:
+          memory: "2Gi"
+          cpu: "1000m"
       env:
```

Note: Adjust memory/CPU limits based on llama-stack workload requirements; it may need more resources than lightspeed-stack.
tests/e2e-prow/rhoai/pipeline.sh (1)
61-62: Add error handling for pod.env sourcing and variable validation.

Line 61 sources pod.env without error checking. If pipeline-vllm.sh fails or the file is missing, $POD_NAME and $KSVC_URL remain unset, causing silent failures downstream. Add validation:

```diff
-source pod.env
-oc wait --for=condition=Ready pod/$POD_NAME -n $NAMESPACE --timeout=300s
+source pod.env || { echo "❌ Failed to source pod.env from pipeline-vllm.sh"; exit 1; }
+
+# Validate critical variables from sourced file
+[[ -n "${POD_NAME:-}" ]] || { echo "❌ POD_NAME not set in pod.env"; exit 1; }
+[[ -n "${KSVC_URL:-}" ]] || { echo "❌ KSVC_URL not set in pod.env"; exit 1; }
+
+oc wait --for=condition=Ready pod/$POD_NAME -n $NAMESPACE --timeout=300s
```

Alternatively, verify that pipeline-vllm.sh unconditionally exports these variables and document the contract.
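If the second option is taken, the contract on the pipeline-vllm.sh side could be as small as this sketch (variable names from pipeline.sh; how the values are derived is assumed):

```bash
# Refuse to write pod.env unless both values are known, so sourcing can never
# silently yield unset variables downstream
[[ -n "${POD_NAME:-}" && -n "${KSVC_URL:-}" ]] \
  || { echo "❌ refusing to write pod.env with empty values"; exit 1; }
{
  echo "POD_NAME=${POD_NAME}"
  echo "KSVC_URL=${KSVC_URL}"
} > pod.env
```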
🧹 Nitpick comments (2)
tests/e2e-prow/rhoai/configs/run.yaml (1)
66-69: Document test-only security settings.

Line 68 disables TLS verification (tls_verify: false), which is acceptable for test environments but should be documented to prevent misuse in production. Consider adding an inline comment:

```diff
   inference:
     - provider_id: vllm
       provider_type: remote::vllm
       config:
         url: ${env.KSVC_URL}/v1/
         api_token: ${env.VLLM_API_KEY}
+        # Note: TLS verification disabled for e2e testing only
         tls_verify: false
         max_tokens: 1024
```

tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml (1)
9-17: Document test-only authentication settings.

The configuration uses noop authentication and a hardcoded test API key ("xyzzy"). These are appropriate for e2e testing but should be clearly marked to prevent accidental production use:

```diff
 llama_stack:
   # Uses a remote llama-stack service
   # The instance would have already been started with a llama-stack-run.yaml file
   use_as_library_client: false
   # Alternative for "as library use"
   # use_as_library_client: true
   # library_client_config_path: <path-to-llama-stack-run.yaml-file>
   url: http://${env.LLAMA_IP}:8321
-  api_key: xyzzy
+  api_key: xyzzy  # TEST VALUE ONLY - replace in production
 user_data_collection:
```

And at the authentication section:

```diff
 authentication:
-  module: "noop"
+  module: "noop"  # No-op auth: TEST/DEV ONLY
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- README.md (1 hunks)
- docs/providers.md (1 hunks)
- tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/configs/run.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/operatorgroup.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/operators.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/vllm/inference-service.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1 hunks)
- tests/e2e-prow/rhoai/pipeline-services.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline-test-pod.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline-vllm.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline.sh (1 hunks)
- tests/e2e-prow/rhoai/run-tests.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/bootstrap.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/deploy-vllm.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/get-vllm-pod-info.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (12)
- tests/e2e-prow/rhoai/pipeline-test-pod.sh
- README.md
- tests/e2e-prow/rhoai/pipeline-vllm.sh
- tests/e2e-prow/rhoai/scripts/get-vllm-pod-info.sh
- tests/e2e-prow/rhoai/manifests/operators/operators.yaml
- tests/e2e-prow/rhoai/manifests/vllm/inference-service.yaml
- tests/e2e-prow/rhoai/scripts/bootstrap.sh
- tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
- docs/providers.md
- tests/e2e-prow/rhoai/manifests/operators/operatorgroup.yaml
- tests/e2e-prow/rhoai/pipeline-services.sh
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-18T16:46:33.353Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-18T16:46:33.353Z
Learning: Applies to tests/e2e/test_list.txt : Maintain the E2E test list in tests/e2e/test_list.txt
Applied to files:
tests/e2e-prow/rhoai/pipeline.sh
tests/e2e-prow/rhoai/run-tests.sh
📚 Learning: 2025-09-02T11:09:40.404Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 485
File: tests/e2e/features/environment.py:87-95
Timestamp: 2025-09-02T11:09:40.404Z
Learning: In the lightspeed-stack e2e tests, noop authentication tests use the default lightspeed-stack.yaml configuration, while noop-with-token tests use the Authorized tag to trigger a config swap to the specialized noop-with-token configuration file.
Applied to files:
tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml
🪛 Checkov (3.2.334)
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
[medium] 1-25: Containers should not run with allowPrivilegeEscalation (CKV_K8S_20)
[medium] 1-25: Minimize the admission of root containers (CKV_K8S_23)

tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml

[medium] 1-34: Containers should not run with allowPrivilegeEscalation (CKV_K8S_20)
[medium] 1-34: Minimize the admission of root containers (CKV_K8S_23)

tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml

[medium] 1-30: Containers should not run with allowPrivilegeEscalation (CKV_K8S_20)
[medium] 1-30: Minimize the admission of root containers (CKV_K8S_23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: e2e_tests (ci)
- GitHub Check: e2e_tests (azure)
- GitHub Check: build-pr
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: e2e-rhoai-dsc
spec:
  containers:
    - name: test-container
      env:
        - name: LCS_IP
          valueFrom:
            secretKeyRef:
              name: lcs-ip-secret
              key: key
        - name: LLAMA_IP
          valueFrom:
            secretKeyRef:
              name: llama-stack-ip-secret
              key: key
      image: registry.access.redhat.com/ubi9/python-312
      command: ["/bin/sh", "/scripts/run-tests.sh"]
      volumeMounts:
        - name: script-volume
          mountPath: /scripts
  volumes:
    - name: script-volume
      configMap:
        name: test-script-cm
        defaultMode: 0755  # Make the script executable
  restartPolicy: Never
```
Add security context to harden the test Pod.
The Pod lacks securityContext and resource management, allowing root execution and privilege escalation. For test pods in shared environments, apply basic hardening:
```diff
 spec:
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
+    fsGroup: 1000
   containers:
     - name: test-container
+      securityContext:
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: true
+        capabilities:
+          drop:
+            - ALL
+      resources:
+        requests:
+          memory: "256Mi"
+          cpu: "100m"
+        limits:
+          memory: "512Mi"
+          cpu: "500m"
       env:
```

Adjust resource limits based on test runtime requirements.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: e2e-rhoai-dsc
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
  containers:
    - name: test-container
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
      resources:
        requests:
          memory: "256Mi"
          cpu: "100m"
        limits:
          memory: "512Mi"
          cpu: "500m"
      env:
        - name: LCS_IP
          valueFrom:
            secretKeyRef:
              name: lcs-ip-secret
              key: key
        - name: LLAMA_IP
          valueFrom:
            secretKeyRef:
              name: llama-stack-ip-secret
              key: key
      image: registry.access.redhat.com/ubi9/python-312
      command: ["/bin/sh", "/scripts/run-tests.sh"]
      volumeMounts:
        - name: script-volume
          mountPath: /scripts
  volumes:
    - name: script-volume
      configMap:
        name: test-script-cm
        defaultMode: 0755  # Make the script executable
  restartPolicy: Never
```
🧰 Tools
🪛 Checkov (3.2.334)
[medium] 1-30: Containers should not run with allowPrivilegeEscalation (CKV_K8S_20)
[medium] 1-30: Minimize the admission of root containers (CKV_K8S_23)
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml around lines 1 to 30,
the Pod lacks securityContext and resource requests/limits which permits root
execution and privilege escalation; add a Pod-level securityContext (e.g.,
runAsNonRoot: true, runAsUser: 1000) and a container-level securityContext
(allowPrivilegeEscalation: false, readOnlyRootFilesystem: true, runAsNonRoot:
true, runAsUser: 1000, capabilities: drop: ["ALL"]) and include sensible
resource requests/limits (cpu/memory) for the test workload so the container
cannot run as root or escalate privileges and has bounded resource usage.
Actionable comments posted: 6
♻️ Duplicate comments (3)
tests/e2e-prow/rhoai/pipeline.sh (3)
74-82: Add error handling for pod.env sourcing and validation of $POD_NAME and $KSVC_URL.

Line 81 sources pod.env without error checking. If the file is missing or malformed, $POD_NAME and $KSVC_URL will be empty or unset, causing silent failures in the wait command (line 82) and curl calls (line 97).

```diff
-source pod.env
+source pod.env || { echo "❌ Failed to source pod.env"; exit 1; }
+[[ -n "$POD_NAME" ]] && [[ -n "$KSVC_URL" ]] || { echo "❌ Missing POD_NAME or KSVC_URL"; exit 1; }
```

Alternatively, verify that pipeline-vllm.sh unconditionally exports these variables to a file in a way that guarantees success.
181-182: Verify wait command exit code to catch timeout or readiness failures.

Line 181 waits for the test pod to be Ready with a 900s timeout, but the exit code is not checked. If the pod fails to reach Ready state or the wait times out, the script continues to the polling loop, masking the failure.

```diff
-oc wait --for=condition=Ready=True pod/test-pod -n $NAMESPACE --timeout=900s
+oc wait --for=condition=Ready=True pod/test-pod -n $NAMESPACE --timeout=900s || {
+  echo "⏰ Timeout or error waiting for test-pod readiness"
+  oc describe pod test-pod -n $NAMESPACE || true
+  exit 1
+}
```
1-41: ⚠️ Missing strict error handling and shebang.

The script lacks a shebang and shell safety options. Add these at the very top to prevent silent failures and invalid state propagation:

```diff
+#!/bin/bash
+set -euo pipefail
+trap 'echo "❌ Pipeline failed at line $LINENO"; exit 1' ERR
+
 #========================================
 # 1. GLOBAL CONFIG
 #========================================
```

Additionally, lines 22-23's `podman login --get-login` and `buildah login --get-login` are status checks, not login operations. If the intent is to log in, replace them with actual `login` commands; if only checking status, clarify the intent with a comment. Line 28's `sudo podman || true` is also unclear; clarify or remove. Line 39's `ls -A || true` appears to be debug code.
🧹 Nitpick comments (3)
tests/e2e-prow/rhoai/pipeline.sh (3)
133-135: Verify configmap source files exist before creation.

Lines 133-135 create configmaps from files (configs/run.yaml, configs/lightspeed-stack.yaml, run-tests.sh) without checking if these files exist. If missing, oc create configmap will fail with a confusing error.

```diff
+[[ -f configs/run.yaml ]] || { echo "❌ configs/run.yaml not found"; exit 1; }
+[[ -f configs/lightspeed-stack.yaml ]] || { echo "❌ configs/lightspeed-stack.yaml not found"; exit 1; }
+[[ -f run-tests.sh ]] || { echo "❌ run-tests.sh not found"; exit 1; }
+
 oc create configmap llama-stack-config -n "$NAMESPACE" --from-file=configs/run.yaml
 oc create configmap lightspeed-stack-config -n "$NAMESPACE" --from-file=configs/lightspeed-stack.yaml
 oc create configmap test-script-cm -n "$NAMESPACE" --from-file=run-tests.sh
```

Once `set -euo pipefail` is in place, oc commands will fail automatically if files are missing, but explicit checks provide clearer error messages.
157-163: Add error handling for pod labeling and service exposure.

Lines 157 and 159-163 label a pod and expose it as a service without checking for success. If the pod doesn't exist or the commands fail, the script continues with an invalid $LCS_IP value.

```diff
-oc label pod lightspeed-stack-service pod=lightspeed-stack-service -n $NAMESPACE
+oc label pod lightspeed-stack-service pod=lightspeed-stack-service -n $NAMESPACE \
+  || { echo "❌ Failed to label pod"; exit 1; }

-oc expose pod lightspeed-stack-service \
+oc expose pod lightspeed-stack-service \
   --name=lightspeed-stack-service-svc \
   --port=8080 \
   --type=ClusterIP \
-  -n $NAMESPACE
+  -n $NAMESPACE \
+  || { echo "❌ Failed to expose service"; exit 1; }
```
207-217: Add defensive checks for exit code extraction and pod state verification.

Line 207 extracts the exit code from a JSON path that assumes the pod has terminated and has a containerStatus. If the pod is in an unexpected state (e.g., not fully terminated, missing containerStatuses), the path returns empty and ${TEST_EXIT_CODE:-2} defaults to 2, masking the actual pod state.

```diff
+# Verify pod is actually in a terminal state before extracting exit code
+POD_PHASE=$(oc get pod test-pod -n $NAMESPACE -o jsonpath='{.status.phase}')
+if [[ "$POD_PHASE" != "Succeeded" && "$POD_PHASE" != "Failed" ]]; then
+  echo "❌ Pod is in phase $POD_PHASE, not a terminal state"
+  oc describe pod test-pod -n $NAMESPACE
+  exit 1
+fi
+
 TEST_EXIT_CODE=$(oc get pod test-pod -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}')
+if [[ -z "$TEST_EXIT_CODE" ]]; then
+  echo "❌ Failed to extract exit code from pod. Pod may not have fully terminated."
+  oc describe pod test-pod -n $NAMESPACE
+  exit 1
+fi

 echo "===== E2E COMPLETE ====="
-if [ "${TEST_EXIT_CODE:-2}" -ne 0 ]; then
+if [ "$TEST_EXIT_CODE" -ne 0 ]; then
   echo "❌ E2E tests failed with exit code $TEST_EXIT_CODE (pod/test-pod failed)"
 else
   echo "✅ E2E tests succeeded"
 fi
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tests/e2e-prow/rhoai/pipeline.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/build_vllm.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-18T16:46:33.353Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-18T16:46:33.353Z
Learning: Applies to tests/e2e/test_list.txt : Maintain the E2E test list in tests/e2e/test_list.txt
Applied to files:
tests/e2e-prow/rhoai/pipeline.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-pr
- GitHub Check: e2e_tests (ci)
- GitHub Check: e2e_tests (azure)
tests/e2e-prow/rhoai/pipeline.sh
Outdated
```bash
chmod +x ./scripts/build_vllm.sh
./scripts/build_vllm.sh
```
Validate script existence and error handling before execution.
The script attempts to chmod and execute ./scripts/build_vllm.sh without checking if it exists or verifying the operation succeeded. If the file is missing or execution fails, the pipeline silently continues.
```diff
+[[ -f ./scripts/build_vllm.sh ]] || { echo "❌ build_vllm.sh not found"; exit 1; }
 chmod +x ./scripts/build_vllm.sh
 ./scripts/build_vllm.sh
+[[ $? -eq 0 ]] || { echo "❌ build_vllm.sh failed"; exit 1; }
```

Once set -euo pipefail is added to the script header (see prior comment), the explicit checks can be simplified.
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/pipeline.sh around lines 35 to 36, the script
unconditionally chmods and executes ./scripts/build_vllm.sh which can silently
fail if the file is missing or execution errors; add an existence check (e.g.,
test -f or [ -x ]) before chmod/execute and fail fast with a clear error message
if the script is absent or not executable, then run chmod +x and execute only
after the checks; since set -euo pipefail is enabled earlier, keep the checks
minimal and ensure any non-zero exit from the script propagates by not
suppressing errors.
```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
```
Add error handling for git operations.
The script will continue even if git clone fails or if the vllm directory does not exist after cloning.
Apply this diff to add error handling:

```diff
+set -eo pipefail
+
 git clone https://github.com/vllm-project/vllm.git
-cd vllm
+cd vllm || { echo "Failed to enter vllm directory"; exit 1; }
```

Additionally, consider pinning a specific version rather than cloning the unstable main branch:

```diff
-git clone https://github.com/vllm-project/vllm.git
+git clone --branch v0.6.0 https://github.com/vllm-project/vllm.git
```
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/scripts/build_vllm.sh around lines 1-2, the script
currently runs `git clone` and `cd vllm` without checking for errors; update it
to abort if `git clone` fails and to verify the repository directory exists
before changing into it. Specifically, after running git clone, check its exit
status and exit non‑zero with an explanatory message on failure; verify the vllm
directory exists (and is a git repo) before cd and exit with an error if not;
and optionally replace cloning main with a checkout of a pinned tag/commit
(clone then git checkout <tag-or-commit>) to avoid tracking unstable main.
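A minimal sketch of the pin-then-checkout approach the prompt describes; the ref v0.6.0 is carried over from the suggestion above and is an assumption, so pin whatever release the pipeline actually targets:

```bash
#!/bin/bash
set -euo pipefail

# Clone, then check out a pinned ref so the build does not track the
# moving main branch. VLLM_REF is an assumed value, not a requirement.
VLLM_REF="v0.6.0"
git clone https://github.com/vllm-project/vllm.git
git -C vllm checkout "$VLLM_REF"
```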
```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm

file="docker/Dockerfile.cpu"
sed 's|\(target=[^, ]*\)|\1,z|g' "$file" > "${file}.patched"

DOCKER_BUILDKIT=1 podman build . \
# --target cpu \
  --tag vllm-openai-lcore/cpu \
  --file docker/Dockerfile.cpu.patched \
  --build-arg max_jobs=1
```
🛠️ Refactor suggestion | 🟠 Major
Add overall script robustness: shebang, logging, and documentation.
The script lacks a shebang, structured logging, and inline documentation explaining the build strategy (particularly the patching and ,z suffix).
Add a shebang at the top and improve logging and documentation:
```diff
+#!/bin/bash
+set -eo pipefail
+
+# Build vLLM image with CPU target and SELinux context-sharing label
+# - Clones vLLM repository
+# - Patches Dockerfile.cpu to add ,z suffix to target for SELinux labeling
+# - Builds image with max_jobs=1 to reduce memory footprint in prow environment
+
 git clone --branch v0.6.0 https://github.com/vllm-project/vllm.git
 cd vllm || { echo "Failed to enter vllm directory"; exit 1; }

+echo "Patching docker/Dockerfile.cpu..."
 file="docker/Dockerfile.cpu"
 # ... rest of script with added error checks
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/scripts/build_vllm.sh around lines 1 to 11, the script
is missing a shebang, basic error handling, explanatory comments for the sed
patch that appends ",z" to the Docker target, and simple/logged steps to aid
debugging; add a bash shebang (#!/usr/bin/env bash) and set strict options (set
-euo pipefail), add a brief top-of-file comment explaining the patch purpose
(why sed appends ",z" and what the patched Dockerfile is for), wrap key
operations with simple timestamped echo/log messages (cloning, patching, build
start/end, and any failures), check that docker/Dockerfile.cpu exists before
running sed and verify sed produced docker/Dockerfile.cpu.patched, and echo the
final podman/docker build command and its build-args so failures are easier to
trace.
```bash
file="docker/Dockerfile.cpu"
sed 's|\(target=[^, ]*\)|\1,z|g' "$file" > "${file}.patched"
```
Verify sed patch correctness and add error handling.
The sed pattern assumes a specific Dockerfile format (target=[^, ]*). If the format differs, the patch silently fails. Additionally, no error handling exists if the file is missing or if the patch cannot be applied.
Apply this diff to add validation and error handling:
```diff
+if [[ ! -f "$file" ]]; then
+  echo "Error: $file not found"; exit 1
+fi
+
 sed 's|\(target=[^, ]*\)|\1,z|g' "$file" > "${file}.patched" || { echo "Patch failed"; exit 1; }
+
+if ! grep -q "target=.*,z" "${file}.patched"; then
+  echo "Warning: Patch may not have applied correctly (no target with ,z found)"
+fi
```

Also, clarify the purpose of the `,z` suffix in a comment (e.g., if it's for SELinux labeling):

```diff
+# Append ,z to target for SELinux labeling (context-sharing)
 sed 's|\(target=[^, ]*\)|\1,z|g' "$file" > "${file}.patched"
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/scripts/build_vllm.sh around lines 4 to 5, the sed
command blindly patches docker/Dockerfile.cpu and appends ",z" without checking
the file exists, verifying the patch applied, or explaining the suffix; add a
guard to fail early if docker/Dockerfile.cpu is missing, run sed and check its
exit status, validate the patched file contains the expected "target=... ,z"
change (or revert/remove the patched file and exit non-zero if not), and add an
inline comment explaining the purpose of the ",z" suffix (for example SELinux
label) so the intent is clear.
```bash
DOCKER_BUILDKIT=1 podman build . \
# --target cpu \
  --tag vllm-openai-lcore/cpu \
  --file docker/Dockerfile.cpu.patched \
  --build-arg max_jobs=1
```
Add error handling for podman build and clarify image naming.
The podman build can fail silently without feedback, and the image tag lacks a registry prefix which may cause issues when pushing to a registry (e.g., quay.io).
Apply this diff to add error handling and update the image name:
```diff
-DOCKER_BUILDKIT=1 podman build . \
+DOCKER_BUILDKIT=1 podman build . || { echo "Build failed"; exit 1; }
 # --target cpu \
   --tag vllm-openai-lcore/cpu \
   --file docker/Dockerfile.cpu.patched \
   --build-arg max_jobs=1
```

Update the image tag to include a registry prefix (adjust as needed for your infrastructure):

```diff
-  --tag vllm-openai-lcore/cpu \
+  --tag quay.io/vllm-openai-lcore/cpu \
```

Also, document why `--target cpu` is commented out and why `max_jobs=1` is used:

```diff
-# --target cpu \
+# NOTE: target cpu is handled implicitly; max_jobs=1 to reduce memory usage in prow
```
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/scripts/build_vllm.sh around lines 7 to 11, the podman
build may fail silently and the image tag lacks a registry prefix; update the
script to run the build command so failures are detected and cause the script to
exit non‑zero (check the podman build exit status and exit with an error message
if it fails), change the --tag value to include your registry/org prefix (e.g.,
quay.io/<your-org>/vllm-openai-lcore:cpu or similar) to avoid push/name
confusion, and add brief inline comments explaining why --target cpu is
commented out and why --build-arg max_jobs=1 is enforced so future maintainers
understand the choices.
Actionable comments posted: 1
♻️ Duplicate comments (6)
tests/e2e-prow/rhoai/scripts/build_vllm.sh (1)
1-11: Unresolved critical issues: add shebang, error handling, and validation.

Multiple critical issues from previous reviews remain unaddressed. The script lacks proper error handling for git operations, sed patching, and podman build, which could cause silent failures in the prow pipeline. Additionally, the image tag lacks a registry prefix.
Apply this consolidated diff to address all critical issues:
```diff
+#!/bin/bash
+set -eo pipefail
+
+# Build vLLM image with CPU target and SELinux context-sharing label
+# - Clones vLLM repository at a pinned version
+# - Patches docker/Dockerfile.cpu to append ,z suffix for SELinux labeling
+# - Builds image with max_jobs=1 to reduce memory footprint in prow environment
+
-git clone https://github.com/vllm-project/vllm.git
-cd vllm
+git clone --branch v0.6.0 https://github.com/vllm-project/vllm.git || { echo "Failed to clone vllm repository"; exit 1; }
+cd vllm || { echo "Failed to enter vllm directory"; exit 1; }

+echo "Patching docker/Dockerfile.cpu..."
 file="docker/Dockerfile.cpu"
-sed 's|\(target=[^, ]*\)|\1,z|g' "$file" > "${file}.patched"
+
+if [[ ! -f "$file" ]]; then
+  echo "Error: $file not found"; exit 1
+fi
+
+sed 's|\(target=[^, ]*\)|\1,z|g' "$file" > "${file}.patched" || { echo "Patch failed"; exit 1; }
+
+if ! grep -q "target=.*,z" "${file}.patched"; then
+  echo "Warning: Patch may not have applied correctly (no target with ,z found)"
+fi

+echo "Building vLLM CPU image..."
 DOCKER_BUILDKIT=1 podman build . \
-# --target cpu \
+# --target cpu \  # target cpu is handled implicitly via Dockerfile.cpu
-  --tag vllm-openai-lcore/cpu \
+  --tag quay.io/vllm-openai-lcore/cpu \
   --file docker/Dockerfile.cpu.patched \
-  --build-arg max_jobs=1
+  --build-arg max_jobs=1 || { echo "Build failed"; exit 1; }
+
+echo "vLLM image built successfully"
```

tests/e2e-prow/rhoai/pipeline.sh (5)
1-10: Enable strict error handling at script start.

For robustness in complex orchestration, add safety options before any logic:

```diff
+#!/bin/bash
+set -euo pipefail
+trap 'echo "❌ Pipeline failed at line $LINENO"; exit 1' ERR
+
 #========================================
 # 1. GLOBAL CONFIG
 #========================================
```

Without this, unhandled command failures allow the script to proceed with invalid state, and subsequent checks (e.g., variable validation) become brittle. This is a prerequisite for the fixes below.
37-38: Validate script existence and error handling for build_vllm.sh.

Check the file exists before chmod/execute and fail immediately on error:

```diff
 chmod +x ./scripts/build_vllm.sh
+[[ -f ./scripts/build_vllm.sh ]] || { echo "❌ build_vllm.sh not found"; exit 1; }
 ./scripts/build_vllm.sh
+[[ $? -eq 0 ]] || { echo "❌ build_vllm.sh failed with exit code $?"; exit 1; }
```

Alternatively, once `set -euo pipefail` is in place (see prior comment), the explicit exit code check can be simplified.
65-66: Add error handling for curl download.

Validate the download succeeds and the file exists:

```diff
 curl -sL -o tool_chat_template_llama3.2_json.jinja \
-  https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_llama3.2_json.jinja
+  https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_llama3.2_json.jinja \
+  || { echo "❌ Failed to download jinja template"; exit 1; }
+
+[[ -f tool_chat_template_llama3.2_json.jinja ]] \
+  || { echo "❌ Jinja template file not found after download"; exit 1; }
```

This prevents the script from proceeding to configmap creation with a missing file.
83-84: Add error handling for pod.env sourcing and validate required variables.

After sourcing pod.env, verify it succeeded and required variables are set:

```diff
-source pod.env
+source pod.env || { echo "❌ Failed to source pod.env"; exit 1; }
+[[ -n "$POD_NAME" ]] || { echo "❌ POD_NAME is unset after sourcing pod.env"; exit 1; }
+[[ -n "$KSVC_URL" ]] || { echo "❌ KSVC_URL is unset after sourcing pod.env"; exit 1; }
+
 oc wait --for=condition=Ready pod/$POD_NAME -n $NAMESPACE --timeout=300s
```

This ensures that unset or empty variables fail fast rather than causing cryptic failures downstream.
183-183: Verify wait command exit code to catch timeout or pod failures.

Add explicit exit code checking for the wait command:

```diff
-oc wait --for=condition=Ready=True pod/test-pod -n $NAMESPACE --timeout=900s
+oc wait --for=condition=Ready=True pod/test-pod -n $NAMESPACE --timeout=900s || {
+  echo "⏰ Timeout or error waiting for test-pod readiness"
+  oc describe pod test-pod -n $NAMESPACE || true
+  exit 1
+}
```

This prevents the script from proceeding with the polling loop if the pod fails to reach Ready state within 900s.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tests/e2e-prow/rhoai/pipeline.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/build_vllm.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-18T16:46:33.353Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-18T16:46:33.353Z
Learning: Applies to tests/e2e/test_list.txt : Maintain the E2E test list in tests/e2e/test_list.txt
Applied to files:
tests/e2e-prow/rhoai/pipeline.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-pr
- GitHub Check: e2e_tests (azure)
- GitHub Check: e2e_tests (ci)
Actionable comments posted: 2
♻️ Duplicate comments (5)
tests/e2e-prow/rhoai/pipeline.sh (5)
1-26: Add shell safety options and validate build_vllm.sh execution.

The script lacks foundational error-handling options and doesn't validate script execution, allowing failures to propagate silently throughout the pipeline orchestration.
Apply this diff to add shell safety and fix build_vllm.sh validation:
```diff
 #========================================
 # 1. GLOBAL CONFIG
 #========================================
+#!/bin/bash
+set -euo pipefail
+trap 'echo "❌ Pipeline failed at line $LINENO"; exit 1' ERR
+
 NAMESPACE="e2e-rhoai-dsc"
 MODEL_NAME="meta-llama/Llama-3.2-1B-Instruct"

 #========================================
 # 2. ENVIRONMENT SETUP
 #========================================
 echo "===== Setting up environment variables ====="
 export HUGGING_FACE_HUB_TOKEN=$(cat /var/run/huggingface/hf-token-ces-lcore-test || true)
 export VLLM_API_KEY=$(cat /var/run/vllm/vllm-api-key-lcore-test || true)

 [[ -n "$HUGGING_FACE_HUB_TOKEN" ]] && echo "✅ HUGGING_FACE_HUB_TOKEN is set" || { echo "❌ Missing HUGGING_FACE_HUB_TOKEN"; exit 1; }
 [[ -n "$VLLM_API_KEY" ]] && echo "✅ VLLM_API_KEY is set" || { echo "❌ Missing VLLM_API_KEY"; exit 1; }

-chmod +x ./scripts/build_vllm.sh
-./scripts/build_vllm.sh
+[[ -f ./scripts/build_vllm.sh ]] || { echo "❌ build_vllm.sh not found"; exit 1; }
+chmod +x ./scripts/build_vllm.sh
+./scripts/build_vllm.sh || { echo "❌ build_vllm.sh failed with exit code $?"; exit 1; }
```

With `set -euo pipefail` in place, subsequent command failures will automatically exit the script unless explicitly handled.
47-51: Add error handling for curl download and file validation.

The curl download has no error checking; network failures or URL issues cause silent file absence, leading to cryptic configmap creation errors downstream.
Apply this diff to add error checking:
```diff
 curl -sL -o tool_chat_template_llama3.2_json.jinja \
-  https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_llama3.2_json.jinja
+  https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_llama3.2_json.jinja \
+  || { echo "❌ Failed to download jinja template"; exit 1; }
+
+[[ -f tool_chat_template_llama3.2_json.jinja ]] \
+  || { echo "❌ Jinja template file not found after download"; exit 1; }

 oc create configmap vllm-chat-template -n "$NAMESPACE" \
   --from-file=tool_chat_template_llama3.2_json.jinja --dry-run=client -o yaml | oc apply -f -
```

Alternatively, once `set -euo pipefail` is in place, rely on the error trap for the curl command itself, but keep the file existence check.
65-66: Add error handling for pod.env sourcing and variable validation.

Line 65 sources pod.env without checking if the source succeeded, and subsequent code (lines 66, 81) uses `$POD_NAME` and `$KSVC_URL` without validation. If the file is missing or malformed, these variables will be empty or unset, causing silent failures in the wait command and curl calls.

Apply this diff to add validation:

```diff
-source pod.env
+source pod.env || { echo "❌ Failed to source pod.env"; exit 1; }
+[[ -n "$POD_NAME" ]] && [[ -n "$KSVC_URL" ]] || { echo "❌ Missing POD_NAME or KSVC_URL"; exit 1; }
```

Ensure that `pipeline-vllm.sh` unconditionally exports `POD_NAME` and `KSVC_URL` to pod.env and exits non-zero on failure, so sourcing here cannot silently produce unset values.
66-66: Check exit code of oc wait for pod readiness.

The oc wait command doesn't check for success. If the pod fails to reach Ready within the 300s timeout, the script proceeds with an unready pod, causing cryptic failures in the API test loop.
Apply this diff:
```diff
-oc wait --for=condition=Ready pod/$POD_NAME -n $NAMESPACE --timeout=300s
+oc wait --for=condition=Ready pod/$POD_NAME -n $NAMESPACE --timeout=300s || {
+  echo "❌ Timeout or error waiting for $POD_NAME readiness"
+  oc describe pod $POD_NAME -n $NAMESPACE || true
+  exit 1
+}
```
165-165: Check exit code of oc wait for test-pod readiness and add timeout context.

The oc wait command doesn't validate success. If the test pod fails to reach Ready, the subsequent polling loop will still execute but may produce confusing results.
Apply this diff:
```diff
-oc wait --for=condition=Ready=True pod/test-pod -n $NAMESPACE --timeout=900s
+oc wait --for=condition=Ready=True pod/test-pod -n $NAMESPACE --timeout=900s || {
+  echo "⏰ Timeout or error waiting for test-pod readiness"
+  oc describe pod test-pod -n $NAMESPACE || true
+  exit 1
+}
```

This prevents the polling loop from running on a pod that failed to become Ready.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/e2e-prow/rhoai/pipeline.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-18T16:46:33.353Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-18T16:46:33.353Z
Learning: Applies to tests/e2e/test_list.txt : Maintain the E2E test list in tests/e2e/test_list.txt
Applied to files:
tests/e2e-prow/rhoai/pipeline.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: e2e_tests (ci)
- GitHub Check: e2e_tests (azure)
- GitHub Check: build-pr
🔇 Additional comments (3)
tests/e2e-prow/rhoai/pipeline.sh (3)
58-58: Validate pipeline-vllm.sh execution.

The pipeline-vllm.sh script is executed without error checking. Once `set -euo pipefail` is enabled at the script header, this will automatically fail on non-zero exit; however, ensure the trap handler provides sufficient context.
121-121: Validate pipeline-services.sh execution.

The pipeline-services.sh script is executed without error checking. Once `set -euo pipefail` is enabled, this will automatically fail on non-zero exit; ensure the error trap provides context about which service deployment failed.
159-159: Validate pipeline-test-pod.sh execution.

The pipeline-test-pod.sh script is executed without error checking. Once `set -euo pipefail` is enabled, this will automatically fail on non-zero exit; ensure the error trap and any pod creation failures are logged clearly.
```bash
oc wait pod/lightspeed-stack-service pod/llama-stack-service \
  -n "$NAMESPACE" --for=condition=Ready --timeout=300s
sleep 30
```
Check exit code of oc wait for services and add timeout protection.
The oc wait command for lightspeed-stack-service and llama-stack-service doesn't validate success. If either pod fails to become Ready, the script continues with incomplete deployments, causing failures downstream.
Apply this diff:
```diff
 ./pipeline-services.sh
-oc wait pod/lightspeed-stack-service pod/llama-stack-service \
-  -n "$NAMESPACE" --for=condition=Ready --timeout=300s
+oc wait pod/lightspeed-stack-service pod/llama-stack-service \
+  -n "$NAMESPACE" --for=condition=Ready --timeout=300s || {
+  echo "❌ Timeout or error waiting for service pods"
+  oc get pods -n "$NAMESPACE" -o wide || true
+  oc describe pod lightspeed-stack-service -n "$NAMESPACE" || true
+  oc describe pod llama-stack-service -n "$NAMESPACE" || true
+  exit 1
+}
 sleep 30
```
sleep 30🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/pipeline.sh around lines 123 to 125, the oc wait
invocation and subsequent sleep do not verify that the pods actually became
Ready and lack protection if oc wait hangs or fails; modify the script to run oc
wait under a timeout (or use the existing --timeout but also wrap with a
process-level timeout command), capture its exit code, and if non-zero emit a
clear error message and exit the script with a non-zero status; only proceed to
sleep and subsequent steps when oc wait succeeds.
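A small sketch of the wait-with-diagnostics pattern these comments converge on, assuming `oc` is on PATH; the helper name and the namespace value are illustrative:

```bash
#!/bin/bash
set -euo pipefail

# Wait for pods to become Ready; on timeout, dump diagnostics and fail.
wait_ready() {
  local ns="$1"; shift
  if ! oc wait "$@" -n "$ns" --for=condition=Ready --timeout=300s; then
    echo "❌ Timeout waiting for: $*" >&2
    oc get pods -n "$ns" -o wide || true
    for pod in "$@"; do oc describe "$pod" -n "$ns" || true; done
    return 1
  fi
}

# Illustrative usage with the pipeline's two service pods:
wait_ready "e2e-rhoai-dsc" pod/lightspeed-stack-service pod/llama-stack-service
```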
```bash
TEST_EXIT_CODE=$(oc get pod test-pod -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}')

echo "===== E2E COMPLETE ====="

if [ "${TEST_EXIT_CODE:-2}" -ne 0 ]; then
  echo "❌ E2E tests failed with exit code $TEST_EXIT_CODE (pod/test-pod failed)"
else
  echo "✅ E2E tests succeeded"
fi

exit $TEST_EXIT_CODE
```
Validate test pod exit code extraction and handle missing/incomplete state.
Line 191 extracts the exit code from the test pod, but if the pod is still running, in an error state, or containerStatuses is empty, the jsonpath will return an empty string. Line 195 then compares this against 0, which silently treats missing exit codes as failures (via the fallback to 2).
Apply this diff to add explicit validation:
```diff
-TEST_EXIT_CODE=$(oc get pod test-pod -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}')
+TEST_EXIT_CODE=$(oc get pod test-pod -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}' 2>/dev/null || echo "")
+
+if [[ -z "$TEST_EXIT_CODE" ]]; then
+  echo "❌ Failed to extract test pod exit code (pod may not have terminated)"
+  oc describe pod test-pod -n $NAMESPACE || true
+  exit 1
+fi

 echo "===== E2E COMPLETE ====="

-if [ "${TEST_EXIT_CODE:-2}" -ne 0 ]; then
+if [[ "$TEST_EXIT_CODE" -ne 0 ]]; then
   echo "❌ E2E tests failed with exit code $TEST_EXIT_CODE (pod/test-pod failed)"
 else
   echo "✅ E2E tests succeeded"
 fi

 exit $TEST_EXIT_CODE
```

This ensures that incomplete or missing exit codes are caught explicitly rather than silently treated as failures.
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/pipeline.sh around lines 191 to 201, the script blindly
extracts the pod container terminated exitCode which can be empty if the pod is
still running, in an error state, or has no containerStatuses; update the logic
to explicitly validate that the containerStatuses array exists and that the
first container has a .state.terminated.exitCode before using it by: query the
pod JSON to check .status.phase and .status.containerStatuses length and the
presence of .state.terminated, emit a clear error and set a distinct non-zero
exit code (and fail fast) when those fields are missing or the pod hasn’t
terminated, otherwise assign TEST_EXIT_CODE to the extracted integer and proceed
with the existing success/failure message and exit with that code.
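A sketch of the terminal-state check the prompt describes, assuming the test pod runs a single container; the namespace value is illustrative:

```bash
#!/bin/bash
set -euo pipefail

NAMESPACE="e2e-rhoai-dsc"   # illustrative value

# Only trust the container exit code once the pod has actually finished.
phase=$(oc get pod test-pod -n "$NAMESPACE" -o jsonpath='{.status.phase}')
if [[ "$phase" != "Succeeded" && "$phase" != "Failed" ]]; then
  echo "❌ test-pod is in phase '$phase', not a terminal state" >&2
  exit 1
fi

code=$(oc get pod test-pod -n "$NAMESPACE" \
  -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}')
[[ -n "$code" ]] || { echo "❌ no terminated exit code reported" >&2; exit 1; }
exit "$code"
```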
Actionable comments posted: 3
♻️ Duplicate comments (12)
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1)
1-25: Add security hardening: securityContext and resource limits.

The Pod lacks fundamental security controls and resource management:

- No `securityContext` — container runs as root with privilege escalation allowed
- No resource `requests`/`limits` — can consume unbounded resources, affecting cluster stability

Apply this diff to add hardening (adjust resource values based on observed memory footprint):

```diff
 spec:
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
+    fsGroup: 1000
   containers:
     - name: lightspeed-stack-container
+      securityContext:
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: false
+      resources:
+        requests:
+          memory: "512Mi"
+          cpu: "250m"
+        limits:
+          memory: "1Gi"
+          cpu: "500m"
       env:
```

tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1)
14-62: Add resource requests and limits to prevent cluster resource exhaustion.

The vLLM container lacks `resources.requests` and `resources.limits`. On a CPU-only prow node, vLLM can consume unbounded resources, evicting other pods or making the node unschedulable. Add conservative resource bounds.

Apply this diff to add resource management (adjust based on the 1B model's actual footprint):

```diff
   - args:
       - --model
       # - /mnt/models/
       - meta-llama/Llama-3.2-1B-Instruct
       - --enable-auto-tool-choice
       - --tool-call-parser
       - llama3_json
       - --chat-template
       - /mnt/chat-template/tool_chat_template_llama3.2_json.jinja
       - --download-dir
       - /tmp/models-cache
       - --port
       - "8080"
       - --max-model-len
       - "2048"
     image: quay.io/rh-ee-cpompeia/vllm-cpu:latest
     name: kserve-container
+    resources:
+      requests:
+        cpu: "2"
+        memory: "3Gi"
+      limits:
+        cpu: "4"
+        memory: "6Gi"
     env:
```

tests/e2e-prow/rhoai/run-tests.sh (1)
1-20: Add environment validation and explicit error handling throughout script.

This script is missing critical safeguards noted in prior review:

- No validation that `LCS_IP` and `LLAMA_IP` are set (leads to cryptic failures)
- `make test-e2e` failure doesn't explicitly exit with error code
- No shell safety options (`set -euo pipefail`) to catch unhandled errors

Errors in e2e tests must propagate to the prow job failure path.

Apply this diff to add validation, error handling, and safety:

```diff
+#!/bin/bash
+set -euo pipefail
+
+# Validate required environment variables
+if [[ -z "${LCS_IP:-}" ]] || [[ -z "${LLAMA_IP:-}" ]]; then
+  echo "❌ Error: Required environment variables not set."
+  echo "   LCS_IP=${LCS_IP:-<not set>}"
+  echo "   LLAMA_IP=${LLAMA_IP:-<not set>}"
+  exit 1
+fi
+
 git clone https://github.com/lightspeed-core/lightspeed-stack.git
 cd lightspeed-stack

 echo "pod started"
 echo $LCS_IP

 curl -f http://$LCS_IP:8080/v1/models || {
   echo "❌ Basic connectivity failed - showing logs before running full tests"
   exit 1
 }

 echo "Installing test dependencies..."
 pip install uv
-uv sync
+uv sync || { echo "❌ Failed to sync dependencies"; exit 1; }

 export E2E_LSC_HOSTNAME=$LCS_IP
 export E2E_LLAMA_HOSTNAME=$LLAMA_IP

 echo "Running comprehensive e2e test suite..."
-make test-e2e
+make test-e2e || { echo "❌ E2E tests failed"; exit 1; }
```

tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml (1)
1-30: Add security hardening: securityContext and resource limits.

This test Pod lacks security controls, allowing root execution with privilege escalation:

- No `securityContext` — container runs as root
- No resource `requests`/`limits` — unbounded resource usage in test environment
- Root filesystem not read-only — unnecessary write access

Apply this diff to harden the Pod:

```diff
 spec:
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
+    fsGroup: 1000
   containers:
     - name: test-container
+      securityContext:
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: true
+        capabilities:
+          drop:
+            - ALL
+      resources:
+        requests:
+          memory: "256Mi"
+          cpu: "100m"
+        limits:
+          memory: "512Mi"
+          cpu: "500m"
       env:
```

tests/e2e-prow/rhoai/scripts/deploy-vllm.sh (1)
32-34: Add error handling to manifest application commands.

The `oc apply` commands lack failure checks. If manifests are malformed or paths incorrect, the script succeeds silently.

```diff
-oc apply -f "$BASE_DIR/manifests/vllm/vllm-runtime-cpu.yaml"
+oc apply -f "$BASE_DIR/manifests/vllm/vllm-runtime-cpu.yaml" || {
+  echo "❌ Failed to apply vLLM runtime manifest"
+  exit 1
+}

-oc apply -f "$BASE_DIR/manifests/vllm/inference-service.yaml"
+oc apply -f "$BASE_DIR/manifests/vllm/inference-service.yaml" || {
+  echo "❌ Failed to apply InferenceService manifest"
+  exit 1
+}
```

tests/e2e-prow/rhoai/pipeline.sh (7)
1-10: Enable strict error handling at script start.

The script lacks shell safety options (`set -euo pipefail` and an error trap). This allows unhandled failures to cascade through the orchestration pipeline.

```diff
+#!/bin/bash
+set -euo pipefail
+trap 'echo "❌ Pipeline failed at line $LINENO"; exit 1' ERR
+
 #========================================
 # 1. GLOBAL CONFIG
 #========================================
 NAMESPACE="e2e-rhoai-dsc"
```

Once this is in place, subsequent explicit error checks can be simplified by relying on automatic exit on non-zero status.
43-47: Add error handling for curl download and file existence check.

The jinja template download (lines 43–44) lacks error handling. A failed network call would leave no file, causing the configmap creation on line 46 to fail with a cryptic error.

```diff
 curl -sL -o tool_chat_template_llama3.2_json.jinja \
-  https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_llama3.2_json.jinja
+  https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_llama3.2_json.jinja \
+  || { echo "❌ Failed to download jinja template"; exit 1; }
+
+[[ -f tool_chat_template_llama3.2_json.jinja ]] \
+  || { echo "❌ Jinja template file not found after download"; exit 1; }

 oc create configmap vllm-chat-template -n "$NAMESPACE" \
```
54-54: Validate script existence and check execution status of `pipeline-vllm.sh`.

The script attempts to execute `./pipeline-vllm.sh` without verifying it exists or checking if it succeeded. Silent failures would allow the pipeline to proceed with incomplete setup.

```diff
+[[ -f ./pipeline-vllm.sh ]] || { echo "❌ pipeline-vllm.sh not found"; exit 1; }
 ./pipeline-vllm.sh
+[[ $? -eq 0 ]] || { echo "❌ pipeline-vllm.sh failed"; exit 1; }
```

With `set -euo pipefail` added at the script start, this simplifies to relying on automatic exit on error.
61-62: Validate `source pod.env` and `oc wait` exit codes.

Line 61 sources `pod.env` without error checking; if missing or malformed, `$POD_NAME` and `$KSVC_URL` remain unset. Line 62's `oc wait` doesn't validate success, so the script proceeds even if the pod fails to become Ready.

```diff
-source pod.env
-oc wait --for=condition=Ready pod/$POD_NAME -n $NAMESPACE --timeout=300s
+source pod.env || { echo "❌ Failed to source pod.env"; exit 1; }
+[[ -n "$POD_NAME" ]] && [[ -n "$KSVC_URL" ]] || { echo "❌ Missing POD_NAME or KSVC_URL"; exit 1; }
+
+oc wait --for=condition=Ready pod/$POD_NAME -n $NAMESPACE --timeout=300s || {
+  echo "❌ Timeout or error waiting for pod readiness"
+  oc describe pod "$POD_NAME" -n "$NAMESPACE" || true
+  exit 1
+}
```
119-121: Check exit code of `oc wait` for service pods and add diagnostic output on failure.

The wait command for `lightspeed-stack-service` and `llama-stack-service` (lines 119–120) doesn't validate success. If either pod fails to reach Ready, the script continues with incomplete deployments.

```diff
 ./pipeline-services.sh
 oc wait pod/lightspeed-stack-service pod/llama-stack-service \
-  -n "$NAMESPACE" --for=condition=Ready --timeout=300s
+  -n "$NAMESPACE" --for=condition=Ready --timeout=300s || {
+  echo "❌ Timeout or error waiting for service pods"
+  oc get pods -n "$NAMESPACE" -o wide || true
+  oc describe pod lightspeed-stack-service -n "$NAMESPACE" || true
+  oc describe pod llama-stack-service -n "$NAMESPACE" || true
+  exit 1
+}
 sleep 30
```
161-161: Verify `oc wait` exit code for test-pod readiness.

Line 161's `oc wait` command doesn't check for failure. If the pod times out or fails to become Ready, the script continues with an incomplete test pod, causing cryptic errors downstream.

```diff
 # Wait until tests are complete
-oc wait --for=condition=Ready=True pod/test-pod -n $NAMESPACE --timeout=900s
+oc wait --for=condition=Ready=True pod/test-pod -n $NAMESPACE --timeout=900s || {
+  echo "⏰ Timeout or error waiting for test-pod readiness"
+  oc describe pod test-pod -n $NAMESPACE || true
+  exit 1
+}

 start_time=$(date +%s)
```
187-197: Validate test pod exit code extraction and handle incomplete/missing state.

Line 187 extracts the exit code via jsonpath, but if the pod hasn't terminated or has no containerStatuses, the result is empty. Line 191's `${TEST_EXIT_CODE:-2}` fallback silently masks these states, treating missing exit codes as failures without clear diagnostics.

```diff
-TEST_EXIT_CODE=$(oc get pod test-pod -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}')
+TEST_EXIT_CODE=$(oc get pod test-pod -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}' 2>/dev/null || echo "")
+
+if [[ -z "$TEST_EXIT_CODE" ]]; then
+  echo "❌ Failed to extract test pod exit code (pod may not have terminated)"
+  oc describe pod test-pod -n $NAMESPACE || true
+  exit 1
+fi

 echo "===== E2E COMPLETE ====="

-if [ "${TEST_EXIT_CODE:-2}" -ne 0 ]; then
+if [[ "$TEST_EXIT_CODE" -ne 0 ]]; then
   echo "❌ E2E tests failed with exit code $TEST_EXIT_CODE (pod/test-pod failed)"
 else
   echo "✅ E2E tests succeeded"
 fi
```
🧹 Nitpick comments (1)
tests/e2e-prow/rhoai/pipeline-vllm.sh (1)
1-7: Add shell safety options for fail-fast behavior.

The orchestration script should stop execution if any called script fails. Add `set -euo pipefail` for robustness:

```diff
 #!/bin/bash
+set -euo pipefail

 PIPELINE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- README.md (1 hunks)
- docs/providers.md (1 hunks)
- tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/configs/run.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/operatorgroup.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/operators.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/vllm/inference-service.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1 hunks)
- tests/e2e-prow/rhoai/pipeline-services.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline-test-pod.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline-vllm.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline.sh (1 hunks)
- tests/e2e-prow/rhoai/run-tests.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/bootstrap.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/deploy-vllm.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/get-vllm-pod-info.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
- tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml
- tests/e2e-prow/rhoai/scripts/get-vllm-pod-info.sh
- tests/e2e-prow/rhoai/configs/run.yaml
- tests/e2e-prow/rhoai/manifests/operators/operatorgroup.yaml
- tests/e2e-prow/rhoai/manifests/vllm/inference-service.yaml
- docs/providers.md
- tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml
- tests/e2e-prow/rhoai/pipeline-services.sh
- tests/e2e-prow/rhoai/scripts/bootstrap.sh
- tests/e2e-prow/rhoai/pipeline-test-pod.sh
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-18T16:46:33.353Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-18T16:46:33.353Z
Learning: Applies to tests/e2e/test_list.txt : Maintain the E2E test list in tests/e2e/test_list.txt
Applied to files:
tests/e2e-prow/rhoai/run-tests.sh
🪛 Checkov (3.2.334)
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
[medium] 1-25: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-25: Minimize the admission of root containers
(CKV_K8S_23)
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml
[medium] 1-34: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-34: Minimize the admission of root containers
(CKV_K8S_23)
tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml
[medium] 1-30: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-30: Minimize the admission of root containers
(CKV_K8S_23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-pr
- GitHub Check: e2e_tests (ci)
- GitHub Check: e2e_tests (azure)
🔇 Additional comments (2)
README.md (1)
127-127: Clear and well-documented addition.

The RHOAI (vLLM) provider row is correctly formatted and includes a helpful example link to the new e2e configuration.
tests/e2e-prow/rhoai/manifests/operators/operators.yaml (1)
1-35: Operator subscriptions are correctly configured.

The three operator subscriptions (Service Mesh, Serverless, RHODS) are properly formatted, reference stable channels, and target the correct namespace and source.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llama-stack-service
  namespace: e2e-rhoai-dsc
spec:
  containers:
    - name: llama-stack-container
      env:
        - name: KSVC_URL
          valueFrom:
            secretKeyRef:
              name: api-url-secret
              key: key
        - name: VLLM_API_KEY
          valueFrom:
            secretKeyRef:
              name: vllm-api-key-secret
              key: key
      image: quay.io/opendatahub/llama-stack:rhoai-v2.25-latest
      ports:
        - containerPort: 8321
      volumeMounts:
        - name: app-root
          mountPath: /opt/app-root/src/.llama
        - name: config
          mountPath: /opt/app-root/run.yaml
          subPath: run.yaml
  volumes:
    - name: app-root
      emptyDir: {}
    - name: config
      configMap:
        name: llama-stack-config
```
Add security hardening: securityContext and resource limits.
This Pod lacks security controls and resource management like other manifests in the PR:
- No `securityContext` — container runs as root
- No resource `requests`/`limits` — unbounded resource usage
spec:
+ securityContext:
+ runAsNonRoot: true
+ runAsUser: 1000
+ fsGroup: 1000
containers:
- name: llama-stack-container
+ securityContext:
+ allowPrivilegeEscalation: false
+ resources:
+ requests:
+ memory: "2Gi"
+ cpu: "500m"
+ limits:
+ memory: "4Gi"
+ cpu: "2"
env:📝 Committable suggestion
🧰 Tools
🪛 Checkov (3.2.334)
[medium] 1-34: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-34: Minimize the admission of root containers
(CKV_K8S_23)
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml lines 1-34, the
Pod is missing security hardening and resource management; add a pod-level
securityContext (e.g., runAsNonRoot: true, runAsUser: 1000, fsGroup: 1000) and a
container-level securityContext (allowPrivilegeEscalation: false, privileged:
false, capabilities: drop: ["ALL"], and keep readOnlyRootFilesystem appropriate
for the mounted volume), and add resources.requests and resources.limits for CPU
and memory (example requests cpu: "100m", memory: "128Mi"; limits cpu: "500m",
memory: "1Gi") to the container spec so the Pod does not run as root and has
bounded resources.
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
labels:
  opendatahub.io/dashboard: "true"
metadata:
  annotations:
    openshift.io/display-name: vLLM
  name: vllm
  namespace: e2e-rhoai-dsc
spec:
  builtInAdapter:
    modelLoadingTimeoutMillis: 90000
  containers:
    - args:
        - --model
        # - /mnt/models/
        - meta-llama/Llama-3.2-1B-Instruct
        - --enable-auto-tool-choice
        - --tool-call-parser
        - llama3_json
        - --chat-template
        - /mnt/chat-template/tool_chat_template_llama3.2_json.jinja
        - --download-dir
        - /tmp/models-cache
        - --port
        - "8080"
        - --max-model-len
        - "2048"
      image: quay.io/rh-ee-cpompeia/vllm-cpu:latest
      name: kserve-container
      env:
        - name: HF_HOME
          value: /mnt/models-cache/hf_home
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        - name: VLLM_API_KEY
          valueFrom:
            secretKeyRef:
              name: vllm-api-key-secret
              key: key
      ports:
        - containerPort: 8080
          name: http1
          protocol: TCP
      volumeMounts:
        - name: chat-template
          mountPath: /mnt/chat-template
        - name: models-cache
          mountPath: /mnt/models-cache
  volumes:
    - name: chat-template
      configMap:
        name: vllm-chat-template
    - name: models-cache
      emptyDir: {}
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: pytorch
```
Add security hardening: securityContext for container.
Like other Pod manifests in this PR, the ServingRuntime container lacks a security context:
Apply this diff to add security hardening:
```diff
     image: quay.io/rh-ee-cpompeia/vllm-cpu:latest
     name: kserve-container
+    securityContext:
+      allowPrivilegeEscalation: false
     resources:
```

Committable suggestion skipped: line range outside the PR's diff.
```bash
# Wait for rollout to complete
echo "Waiting for kserve-controller-manager rollout..."
oc rollout status deployment/kserve-controller-manager -n redhat-ods-applications --timeout=300s
```
Check exit code of oc rollout status command.
The rollout status on line 23 may time out or fail, but the script continues regardless. Add error handling:
echo "Waiting for kserve-controller-manager rollout..."
-oc rollout status deployment/kserve-controller-manager -n redhat-ods-applications --timeout=300s
+oc rollout status deployment/kserve-controller-manager -n redhat-ods-applications --timeout=300s || {
+ echo "❌ Rollout failed or timed out"
+ exit 1
+}📝 Committable suggestion
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/scripts/deploy-vllm.sh around line 23, the oc rollout
status command's exit code is ignored so the script will continue even if
rollout times out or fails; modify the script to check the command's exit status
immediately after running it and, on non-zero exit, print a clear error message
and exit the script with a non-zero status (or implement a small retry
loop/backoff before failing) so failures stop the CI flow.
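A brief sketch of the check-and-retry variant the prompt mentions; the retry count and sleep interval are assumptions, and the deployment name comes from the script above:

```bash
#!/bin/bash
set -euo pipefail

for attempt in 1 2 3; do
  if oc rollout status deployment/kserve-controller-manager \
       -n redhat-ods-applications --timeout=300s; then
    exit 0  # rollout completed
  fi
  echo "rollout not ready yet (attempt $attempt of 3)" >&2
  sleep 10
done
echo "❌ Rollout failed or timed out" >&2
exit 1
```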
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml (1)
1-30: Add security context and resource limits to harden the test Pod.

This comment duplicates a previous review observation: the Pod lacks `securityContext` and resource constraints, allowing root execution and privilege escalation. For test pods in shared prow environments, apply basic hardening as suggested in the prior review.

Apply this diff to address the security gaps:

```diff
 spec:
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
+    fsGroup: 1000
   containers:
     - name: test-container
+      securityContext:
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: true
+        capabilities:
+          drop:
+            - ALL
+      resources:
+        requests:
+          memory: "256Mi"
+          cpu: "100m"
+        limits:
+          memory: "512Mi"
+          cpu: "500m"
       env:
```

Adjust memory/CPU limits based on your test runtime requirements.
🧹 Nitpick comments (4)
tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml (1)
15-17: Consider sourcing the API key from environment instead of hardcoding.

Line 17 uses a hardcoded placeholder API key `xyzzy`. For test environments, sourcing it from an environment variable would be more flexible and align with the pattern used for LLAMA_IP.

```diff
   url: http://${env.LLAMA_IP}:8321
-  api_key: xyzzy
+  api_key: ${env.LLAMA_STACK_API_KEY}
```

This also reduces the risk of accidentally exposing the pattern in documentation.
tests/e2e-prow/rhoai/scripts/bootstrap.sh (1)
5-5: Add validation for required BASE_DIR parameter.

The script accepts `BASE_DIR` without checking if it was provided or if it exists. This could cause confusing failures later. Consider adding validation:

```bash
BASE_DIR="${1:?ERROR: BASE_DIR is required as first argument}"
if [[ ! -d "$BASE_DIR" ]]; then
  echo "ERROR: BASE_DIR does not exist: $BASE_DIR" >&2
  exit 1
fi
```

tests/e2e-prow/rhoai/configs/run.yaml (1)
63-69: Add clarifying comments for test-specific settings.

The vLLM inference provider configuration includes test-specific settings like `tls_verify: false`. A comment would clarify this is for testing and prevent accidental use in production configurations:

```yaml
inference:
  - provider_id: vllm
    provider_type: remote::vllm
    config:
      url: ${env.KSVC_URL}/v1/
      api_token: ${env.VLLM_API_KEY}
      tls_verify: false  # Test environment only - do not use in production
      max_tokens: 1024
```

docs/providers.md (1)
64-64: Use consistent capitalization for vLLM across documentation.

The entry uses lowercase `vllm` here, but README.md (line 127) uses uppercase `vLLM`. For consistency, both should use the same capitalization. The standard capitalization is `vLLM`.

```diff
-| RHOAI (vllm) | latest operator | remote | `openai` | ✅ |
+| RHOAI (vLLM) | latest operator | remote | `openai` | ✅ |
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- README.md (1 hunks)
- docs/providers.md (1 hunks)
- tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/configs/run.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/operatorgroup.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/operators.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/vllm/inference-service.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1 hunks)
- tests/e2e-prow/rhoai/pipeline-services.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline-test-pod.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline-vllm.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline.sh (1 hunks)
- tests/e2e-prow/rhoai/run-tests.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/bootstrap.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/deploy-vllm.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/get-vllm-pod-info.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
- tests/e2e-prow/rhoai/scripts/get-vllm-pod-info.sh
- tests/e2e-prow/rhoai/run-tests.sh
- tests/e2e-prow/rhoai/manifests/operators/operatorgroup.yaml
- tests/e2e-prow/rhoai/pipeline-test-pod.sh
- tests/e2e-prow/rhoai/pipeline-vllm.sh
- tests/e2e-prow/rhoai/pipeline.sh
- tests/e2e-prow/rhoai/pipeline-services.sh
- tests/e2e-prow/rhoai/scripts/deploy-vllm.sh
- tests/e2e-prow/rhoai/manifests/vllm/inference-service.yaml
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
- tests/e2e-prow/rhoai/manifests/operators/operators.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-02T11:09:40.404Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 485
File: tests/e2e/features/environment.py:87-95
Timestamp: 2025-09-02T11:09:40.404Z
Learning: In the lightspeed-stack e2e tests, noop authentication tests use the default lightspeed-stack.yaml configuration, while noop-with-token tests use the Authorized tag to trigger a config swap to the specialized noop-with-token configuration file.
Applied to files:
tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml
🪛 Checkov (3.2.334)
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml
[medium] 1-34: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-34: Minimize the admission of root containers
(CKV_K8S_23)
tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml
[medium] 1-30: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-30: Minimize the admission of root containers
(CKV_K8S_23)
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
[medium] 1-25: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-25: Minimize the admission of root containers
(CKV_K8S_23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-pr
- GitHub Check: e2e_tests (azure)
- GitHub Check: e2e_tests (ci)
🔇 Additional comments (4)
README.md (1)
127-127: Documentation addition is clear and consistent.The new RHOAI (vLLM) provider entry follows the existing table format, includes appropriate metadata, and links to the new example configuration. This aligns well with the supporting documentation updates.
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1)
6-17: Add security context and resource limits to meet Kubernetes best practices.

The Pod specification lacks security hardening and resource management controls. For test environments in shared CI/CD clusters, these are important:

```diff
 spec:
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
   containers:
     - name: lightspeed-stack-container
+      securityContext:
+        allowPrivilegeEscalation: false
+      resources:
+        requests:
+          memory: "512Mi"
+          cpu: "250m"
+        limits:
+          memory: "1Gi"
+          cpu: "500m"
       env:
```

Was the security hardening intentionally deferred? If not, please apply these changes to meet K8s security standards.
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml (1)
6-22: Add security hardening: securityContext and resource limits.

This Pod lacks security controls and resource management. Even for ephemeral test environments, these controls prevent privilege escalation and resource exhaustion:

```diff
 spec:
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
+    fsGroup: 1000
   containers:
     - name: llama-stack-container
+      securityContext:
+        allowPrivilegeEscalation: false
+      resources:
+        requests:
+          memory: "2Gi"
+          cpu: "500m"
+        limits:
+          memory: "4Gi"
+          cpu: "2"
       env:
```

Adjust resource values based on actual performance requirements observed during testing.
Please apply security hardening and define resource requests/limits appropriate for your workload.
tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml (1)
1-17: Well-focused DataScienceCluster configuration for e2e testing.

The manifest appropriately enables only the required components (kserve for vLLM serving) and service mesh for routing, while removing unnecessary components (workbenches, dashboard, pipelines) for an inference-focused e2e test.
Actionable comments posted: 2
♻️ Duplicate comments (8)
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1)
1-25: Add security context and resource limits for Pod hardening.

This Pod manifest lacks security hardening and resource management (identical to prior review on this file). While this is test infrastructure, applying basic security controls aligns with Kubernetes best practices and mitigates privilege escalation risks.
Apply this diff to add hardening:
```diff
 spec:
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
   containers:
     - name: lightspeed-stack-container
+      securityContext:
+        allowPrivilegeEscalation: false
+      resources:
+        requests:
+          memory: "512Mi"
+          cpu: "250m"
+        limits:
+          memory: "1Gi"
+          cpu: "500m"
       env:
```
Adjust resource values based on actual performance during testing.
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml (1)
1-34: **Add security hardening: securityContext and resource limits.** This Pod manifest lacks the same security controls flagged in prior reviews. For a Pod serving model inference, security and resource bounds are important to prevent resource exhaustion.
Apply this diff to add hardening:
```diff
 spec:
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
+    fsGroup: 1000
   containers:
     - name: llama-stack-container
+      securityContext:
+        allowPrivilegeEscalation: false
+      resources:
+        requests:
+          memory: "2Gi"
+          cpu: "500m"
+        limits:
+          memory: "4Gi"
+          cpu: "2"
       env:
```
Adjust resource values based on observed memory and CPU footprint of llama-stack during test runs.
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1)
1-65: **Add resource limits and security context to prevent node exhaustion and privilege escalation.** Prior reviews flagged two unresolved issues in this ServingRuntime manifest:
- No `resources.requests`/`limits`: vLLM can consume unbounded CPU/memory, potentially evicting other prow workloads
- No container `securityContext`: allows privilege escalation

Apply this diff to add both:
```diff
 - args:
   ...
   image: quay.io/rh-ee-cpompeia/vllm-cpu:latest
   name: kserve-container
+  securityContext:
+    allowPrivilegeEscalation: false
+  resources:
+    requests:
+      cpu: "2"
+      memory: "4Gi"
+    limits:
+      cpu: "4"
+      memory: "8Gi"
   env:
```
Adjust CPU/memory limits based on the 1B model's actual footprint (likely 2–4 GB) and available prow node capacity.
tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml (1)
1-30: **Add security context and resource limits to harden the test Pod.** This test Pod lacks security controls and resource management, same as flagged in prior reviews. Test workloads in shared prow environments should be hardened against privilege escalation and resource exhaustion.
Apply this diff:
```diff
 spec:
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
+    fsGroup: 1000
   containers:
     - name: test-container
+      securityContext:
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop:
+            - ALL
+      resources:
+        requests:
+          memory: "256Mi"
+          cpu: "100m"
+        limits:
+          memory: "512Mi"
+          cpu: "500m"
       env:
```
Adjust resource limits based on observed test runtime requirements.
tests/e2e-prow/rhoai/pipeline.sh (3)
194-204: Validate test pod exit code extraction and handle incomplete termination state. Line 194 extracts the exit code from the test pod, but if the pod is still running or containerStatuses is empty, the jsonpath returns an empty string. Line 198 then uses `${TEST_EXIT_CODE:-2}`, which treats empty as 2 (failure), potentially masking the real issue.
Apply this diff to add explicit validation:
```diff
-TEST_EXIT_CODE=$(oc get pod test-pod -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}')
+TEST_EXIT_CODE=$(oc get pod test-pod -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}' 2>/dev/null || echo "")
+
+if [[ -z "$TEST_EXIT_CODE" ]]; then
+  echo "❌ Failed to extract test pod exit code (pod may not have terminated)"
+  echo "Pod status:"
+  oc describe pod test-pod -n $NAMESPACE || true
+  oc logs test-pod -n $NAMESPACE || true
+  exit 1
+fi

 echo "===== E2E COMPLETE ====="
-if [ "${TEST_EXIT_CODE:-2}" -ne 0 ]; then
+if [[ "$TEST_EXIT_CODE" -ne 0 ]]; then
   echo "❌ E2E tests failed with exit code $TEST_EXIT_CODE (pod/test-pod failed)"
 else
   echo "✅ E2E tests succeeded"
 fi
 exit $TEST_EXIT_CODE
```
This ensures incomplete or missing exit codes are caught explicitly rather than silently treated as failures.
126-128: Validate the `oc wait` exit code and add diagnostic output on timeout. Lines 126–128 wait for service Pods but don't check the exit code. If the wait times out or fails, the script continues with incomplete deployments, causing confusing downstream failures.
Apply this diff to add proper error handling:
```diff
 ./pipeline-services.sh

-oc wait pod/lightspeed-stack-service pod/llama-stack-service \
-  -n "$NAMESPACE" --for=condition=Ready --timeout=300s
+oc wait pod/lightspeed-stack-service pod/llama-stack-service \
+  -n "$NAMESPACE" --for=condition=Ready --timeout=300s || {
+  echo "❌ Timeout or error waiting for service pods to become Ready"
+  echo "Current pod status:"
+  oc get pods -n "$NAMESPACE" -o wide || true
+  echo "Lightspeed stack pod details:"
+  oc describe pod lightspeed-stack-service -n "$NAMESPACE" || true
+  echo "Llama stack pod details:"
+  oc describe pod llama-stack-service -n "$NAMESPACE" || true
+  exit 1
+}
+
 sleep 30
```
This prevents proceeding with an incomplete deployment.
68-69: Add error handling and validation for pod.env sourcing. Line 68 sources pod.env without checking if the file exists or if sourcing succeeded. If missing, $POD_NAME and $KSVC_URL will be empty, causing cryptic failures on line 69 and later at line 84.
Apply this diff to validate:
```diff
+if [[ ! -f pod.env ]]; then
+  echo "❌ pod.env not found"
+  exit 1
+fi
+
 source pod.env || { echo "❌ Failed to source pod.env"; exit 1; }
+
+if [[ -z "$POD_NAME" ]] || [[ -z "$KSVC_URL" ]]; then
+  echo "❌ Missing required variables: POD_NAME=$POD_NAME KSVC_URL=$KSVC_URL"
+  exit 1
+fi
+
 oc wait --for=condition=Ready pod/$POD_NAME -n $NAMESPACE --timeout=300s
```
tests/e2e-prow/rhoai/run-tests.sh (1)
1-20: Add strict error handling and environment variable validation. The script lacks safe shell options and doesn't validate critical environment variables. If LCS_IP or LLAMA_IP are missing, errors will be cryptic.
Apply this diff to harden the script:
```diff
+#!/bin/bash
+set -euo pipefail
+trap 'echo "❌ Script failed at line $LINENO"; exit 1' ERR
+
+# Validate required environment variables
+if [[ -z "${LCS_IP:-}" ]]; then
+  echo "❌ Required environment variable not set: LCS_IP"
+  exit 1
+fi
+if [[ -z "${LLAMA_IP:-}" ]]; then
+  echo "❌ Required environment variable not set: LLAMA_IP"
+  exit 1
+fi
+
 git clone https://github.com/lightspeed-core/lightspeed-stack.git
 cd lightspeed-stack
 echo "pod started"
 echo $LCS_IP
 curl -f http://$LCS_IP:8080/v1/models || {
   echo "❌ Basic connectivity failed - showing logs before running full tests"
   exit 1
 }
 echo "Installing test dependencies..."
 pip install uv
-uv sync
+uv sync || { echo "❌ Failed to sync dependencies"; exit 1; }
 export E2E_LSC_HOSTNAME=$LCS_IP
 export E2E_LLAMA_HOSTNAME=$LLAMA_IP
 echo "Running comprehensive e2e test suite..."
-make test-e2e
+make test-e2e || { echo "❌ E2E tests failed"; exit 1; }
```
🧹 Nitpick comments (6)
tests/e2e-prow/rhoai/configs/run.yaml (1)
95-100: Telemetry service name is hardcoded; consider making it configurable. The telemetry provider at line 96 hardcodes `service_name: 'lightspeed-stack-telemetry'`. For environments where this service name varies, consider using an environment variable or config override (sketched below). However, this is a low-priority nice-to-have for test infrastructure.
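A minimal sketch of that override, assuming the deploy script exports the variable and run.yaml is updated to read it through llama-stack's `${env.VAR}` substitution (the name TELEMETRY_SERVICE_NAME is hypothetical, not in the current config):
```bash
#!/bin/bash
set -euo pipefail

# Hypothetical override: let the pipeline choose the telemetry service name,
# falling back to the value that run.yaml currently hardcodes.
export TELEMETRY_SERVICE_NAME="${TELEMETRY_SERVICE_NAME:-lightspeed-stack-telemetry}"

# run.yaml would then reference it as:
#   service_name: ${env.TELEMETRY_SERVICE_NAME}
echo "telemetry service name: $TELEMETRY_SERVICE_NAME"
```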
tests/e2e-prow/rhoai/pipeline-test-pod.sh (1)
1-5: Script is correct but lacks inline error handling. The script applies a manifest without checking the exit code. While the parent pipeline (pipeline.sh) has `set -euo pipefail`, adding inline error handling makes this script more robust if called independently.
Consider this optional improvement:
```diff
 #!/bin/bash
 BASE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-oc apply -f "$BASE_DIR/manifests/test-pod/spin-up.yaml"
+oc apply -f "$BASE_DIR/manifests/test-pod/spin-up.yaml" || {
+  echo "❌ Failed to apply test pod manifest"
+  exit 1
+}
```
tests/e2e-prow/rhoai/scripts/bootstrap.sh (2)
38-38: Consider error handling for the informational `oc get csv` output. Line 38 runs `oc get csv` for informational purposes, but `set -e` will terminate the script if this command fails. Consider redirecting to stderr or wrapping with error handling to ensure debug output doesn't halt the script unexpectedly.
```diff
-oc get csv -n openshift-operators
+oc get csv -n openshift-operators 2>&1 || echo "Note: Unable to list CSVs"
```
5-5: Optional: Add defensive parameter validation for safety, but not critical. Verification shows bootstrap.sh is called only from pipeline-vllm.sh line 5, which always passes a valid directory via "$PIPELINE_DIR" (derived from `$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)`). The current implementation is safe with this caller. However, adding validation remains a good defensive practice to prevent issues if the script is later called from other contexts:
```diff
 BASE_DIR="$1"
+
+if [[ -z "$BASE_DIR" ]]; then
+  echo "ERROR: BASE_DIR not provided. Usage: $0 <BASE_DIR>" >&2
+  exit 1
+fi
+
+if [[ ! -d "$BASE_DIR" ]]; then
+  echo "ERROR: BASE_DIR does not exist: $BASE_DIR" >&2
+  exit 1
+fi
```
tests/e2e-prow/rhoai/pipeline-services.sh (2)
25-25: Add readiness wait for lightspeed-stack, consistent with the llama-stack pattern. Line 25 applies the lightspeed-stack manifest but doesn't wait for it to be ready. This is inconsistent with the llama-stack deployment (lines 5–8) and risks downstream tests running before lightspeed-stack is available.
Add a wait after line 25:
```diff
 oc apply -f "$BASE_DIR/manifests/lightspeed/lightspeed-stack.yaml"
+
+oc wait pod/lightspeed-stack-service \
+  -n e2e-rhoai-dsc --for=condition=Ready --timeout=300s
```
If the pod name differs or multiple pods are created, adjust the selector accordingly. Verify the expected pod name in lightspeed-stack.yaml.
19-19: Rename LLAMA_IP to reflect its content (a DNS hostname, not an IP); this requires coordinated updates across multiple files. The variable contains a DNS FQDN (llama-stack-service-svc.e2e-rhoai-dsc.svc.cluster.local), not an IP address, so the name LLAMA_IP is misleading. However, renaming requires updates in multiple locations:
- tests/e2e-prow/rhoai/pipeline-services.sh:19 (variable definition)
- tests/e2e-prow/rhoai/pipeline-services.sh:22 (secret creation)
- tests/e2e-prow/rhoai/run-tests.sh:17 (export and aliasing)
- tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml:10 (env var name)
- tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml:15 (env var name)
- tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml:16 (config reference)

Suggested approach: rename to LLAMA_ENDPOINT or LLAMA_HOST_URI and update all references consistently; a sketch of the sweep follows.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- README.md (1 hunks)
- docs/providers.md (1 hunks)
- tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/configs/run.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/operatorgroup.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/operators.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/vllm/inference-service.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1 hunks)
- tests/e2e-prow/rhoai/pipeline-services.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline-test-pod.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline-vllm.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline.sh (1 hunks)
- tests/e2e-prow/rhoai/run-tests.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/bootstrap.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/deploy-vllm.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/get-vllm-pod-info.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
- docs/providers.md
- tests/e2e-prow/rhoai/pipeline-vllm.sh
- tests/e2e-prow/rhoai/manifests/operators/operatorgroup.yaml
- tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml
- tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml
- tests/e2e-prow/rhoai/manifests/vllm/inference-service.yaml
- tests/e2e-prow/rhoai/manifests/operators/operators.yaml
- README.md
- tests/e2e-prow/rhoai/scripts/get-vllm-pod-info.sh
- tests/e2e-prow/rhoai/scripts/deploy-vllm.sh
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-18T16:46:33.353Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-18T16:46:33.353Z
Learning: Applies to tests/e2e/test_list.txt : Maintain the E2E test list in tests/e2e/test_list.txt
Applied to files:
- tests/e2e-prow/rhoai/run-tests.sh
- tests/e2e-prow/rhoai/pipeline.sh
🪛 Checkov (3.2.334)
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
[medium] 1-25: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-25: Minimize the admission of root containers
(CKV_K8S_23)
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml
[medium] 1-34: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-34: Minimize the admission of root containers
(CKV_K8S_23)
tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml
[medium] 1-30: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-30: Minimize the admission of root containers
(CKV_K8S_23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-pr
- GitHub Check: e2e_tests (azure)
- GitHub Check: e2e_tests (ci)
```bash
#!/bin/bash

BASE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
```
🛠️ Refactor suggestion | 🟠 Major
Add strict Bash error handling for consistency.
This script lacks set -euo pipefail, unlike bootstrap.sh. Without strict mode, unset variables and pipe failures won't halt execution, risking silent failures in orchestration.
Add after the shebang:
```diff
 #!/bin/bash
+
+set -euo pipefail
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
#!/bin/bash

set -euo pipefail

BASE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
```
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/pipeline-services.sh around lines 1 to 3, the script
lacks strict Bash error handling; insert the line "set -euo pipefail"
immediately after the shebang so the script exits on errors, treats unset
variables as failures, and fails pipelines on the first failing command.
```bash
oc create secret generic llama-stack-ip-secret \
  --from-literal=key="$LLAMA_IP" \
  -n e2e-rhoai-dsc || echo "Secret exists"
```
🛠️ Refactor suggestion | 🟠 Major
Replace overly permissive error suppression with explicit existence check.
The || echo "Secret exists" at line 23 masks all errors (permission denied, namespace not found, etc.), treating them as "already exists". This hides real failures.
Apply this diff to check explicitly:
```diff
-oc create secret generic llama-stack-ip-secret \
-  --from-literal=key="$LLAMA_IP" \
-  -n e2e-rhoai-dsc || echo "Secret exists"
+# Create secret only if it doesn't exist
+if ! oc get secret llama-stack-ip-secret -n e2e-rhoai-dsc >/dev/null 2>&1; then
+  oc create secret generic llama-stack-ip-secret \
+    --from-literal=key="$LLAMA_ENDPOINT" \
+    -n e2e-rhoai-dsc
+fi
```
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/pipeline-services.sh around lines 21 to 23, the current
use of "|| echo 'Secret exists'" after oc create secret masks all errors; change
this to explicitly check if the secret already exists and only suppress the
error in that case. Replace the permissive suppression with an if/else flow:
attempt to create the secret, on failure inspect the error or run "oc get secret
<name> -n <ns>" to confirm existence; if the secret exists, print a harmless
message and continue, otherwise propagate or exit with the original error code
so real failures (permissions, missing namespace) are not hidden.
Actionable comments posted: 1
♻️ Duplicate comments (5)
tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml (1)
1-30: Security context and resource limits still missing (flagged in prior review). This Pod still lacks the security hardening and resource management that were recommended in the previous review. The static analysis continues to flag CKV_K8S_20 and CKV_K8S_23.
If not yet addressed, apply the diff from the prior review to add securityContext (runAsNonRoot, runAsUser) and resource limits. Given the review mode is "chill," prioritize this if redeploying in production; for e2e testing in an ephemeral environment, it can be deferred.
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1)
1-25: Security context and resource limits still missing (flagged in prior review). This Pod lacks the recommended security hardening and resource management from the previous review. Static analysis continues to flag CKV_K8S_20 and CKV_K8S_23.
Apply the diff from the prior review to add securityContext and resource limits appropriate for the Lightspeed stack container (likely requiring more resources than the test pod). For e2e testing, this can be deferred; prioritize for production environments.
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml (1)
1-34: Security context and resource limits still missing (flagged in prior review). This Pod lacks the security hardening and resource management recommended in the previous review. Static analysis flags CKV_K8S_20 and CKV_K8S_23.
Apply the diff from the prior review, adjusting resource requests/limits as needed (llama-stack likely requires more CPU/memory than the test pod—previously suggested 2Gi memory, 500m CPU requests; 4Gi memory, 2 CPU limits). For e2e testing, defer if needed; prioritize for production.
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (2)
29-29: Add resource requests and limits to prevent scheduling issues. The container lacks `resources.requests` and `resources.limits`. On a CPU-only prow node, vLLM can consume unbounded CPU and memory, potentially evicting other pods or rendering the node unschedulable. This was flagged in prior reviews and remains unaddressed.
Apply this diff to add conservative resource constraints:
```diff
   image: quay.io/rh-ee-cpompeia/vllm-cpu:latest
   name: kserve-container
+  resources:
+    requests:
+      cpu: "2"
+      memory: "4Gi"
+    limits:
+      cpu: "4"
+      memory: "8Gi"
   env:
```
Adjust these values based on the 1B model's actual memory footprint (likely 2–4 GB) and available prow node capacity.
14-46: Add security hardening: securityContext for the container. The container specification lacks a `securityContext`, which was flagged in prior reviews. Add security constraints to restrict privilege escalation and follow the principle of least privilege.
Apply this diff to harden the security posture:
```diff
   image: quay.io/rh-ee-cpompeia/vllm-cpu:latest
   name: kserve-container
+  securityContext:
+    allowPrivilegeEscalation: false
+    readOnlyRootFilesystem: false
+    runAsNonRoot: false
   resources:
```
(Note: `runAsNonRoot: false` is kept here because the vLLM image may expect to run as root; binding to port 8080 itself does not require root, since only ports below 1024 do. If the image supports a non-root user, set `runAsNonRoot: true` and add a `runAsUser` field.)
🧹 Nitpick comments (3)
tests/e2e-prow/rhoai/pipeline-vllm.sh (1)
1-7: Add `set -euo pipefail` for consistency and safety. While this thin wrapper script may inherit error handling from the parent pipeline.sh, it should include `set -euo pipefail` at the top (after the shebang) for defensive robustness. If any subscript fails, the wrapper will then exit immediately rather than continuing.
```diff
 #!/bin/bash
+set -euo pipefail
 PIPELINE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
```
tests/e2e-prow/rhoai/scripts/deploy-vllm.sh (1)
1-2: Add strict error handling at script start. This script lacks `set -euo pipefail`. While the polling loops are robust, critical commands like `oc rollout status` and `oc apply` can fail silently. Add error handling:
```diff
 #!/bin/bash
+set -euo pipefail
+trap 'echo "❌ Deployment setup failed at line $LINENO"; exit 1' ERR
 BASE_DIR="$1"
```
This ensures any unhandled command failure stops the script immediately.
tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1)
28-28: Consider using a pinned image digest instead of "latest". The container image uses the `latest` tag (quay.io/rh-ee-cpompeia/vllm-cpu:latest), which reduces reproducibility and can lead to unexpected behavior changes between deployments. For better consistency, especially in e2e test environments, consider pinning to a specific image digest or versioned tag.
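One hedged way to capture a digest for pinning, assuming skopeo and jq are available wherever the manifest is maintained:
```bash
#!/bin/bash
set -euo pipefail

IMAGE="quay.io/rh-ee-cpompeia/vllm-cpu"

# Resolve the digest that the floating "latest" tag currently points at.
DIGEST=$(skopeo inspect "docker://${IMAGE}:latest" | jq -r '.Digest')

# The ServingRuntime image field can then use the immutable reference.
echo "pin the image to: ${IMAGE}@${DIGEST}"
```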
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- README.md (1 hunks)
- docs/providers.md (1 hunks)
- tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/configs/run.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/operatorgroup.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/operators/operators.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/vllm/inference-service.yaml (1 hunks)
- tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml (1 hunks)
- tests/e2e-prow/rhoai/pipeline-services.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline-test-pod.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline-vllm.sh (1 hunks)
- tests/e2e-prow/rhoai/pipeline.sh (1 hunks)
- tests/e2e-prow/rhoai/run-tests.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/bootstrap.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/deploy-vllm.sh (1 hunks)
- tests/e2e-prow/rhoai/scripts/get-vllm-pod-info.sh (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/e2e-prow/rhoai/manifests/operators/operators.yaml
🚧 Files skipped from review as they are similar to previous changes (10)
- tests/e2e-prow/rhoai/configs/lightspeed-stack.yaml
- docs/providers.md
- tests/e2e-prow/rhoai/scripts/get-vllm-pod-info.sh
- README.md
- tests/e2e-prow/rhoai/manifests/vllm/inference-service.yaml
- tests/e2e-prow/rhoai/run-tests.sh
- tests/e2e-prow/rhoai/pipeline-services.sh
- tests/e2e-prow/rhoai/manifests/operators/operatorgroup.yaml
- tests/e2e-prow/rhoai/pipeline-test-pod.sh
- tests/e2e-prow/rhoai/scripts/bootstrap.sh
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-18T16:46:33.353Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-stack PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-18T16:46:33.353Z
Learning: Applies to tests/e2e/test_list.txt : Maintain the E2E test list in tests/e2e/test_list.txt
Applied to files:
tests/e2e-prow/rhoai/pipeline.sh
🪛 Checkov (3.2.334)
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
[medium] 1-25: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-25: Minimize the admission of root containers
(CKV_K8S_23)
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml
[medium] 1-34: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-34: Minimize the admission of root containers
(CKV_K8S_23)
tests/e2e-prow/rhoai/manifests/test-pod/spin-up.yaml
[medium] 1-30: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 1-30: Minimize the admission of root containers
(CKV_K8S_23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-pr
- GitHub Check: e2e_tests (azure)
- GitHub Check: e2e_tests (ci)
🔇 Additional comments (9)
tests/e2e-prow/rhoai/pipeline.sh (4)
68-69: Validate pod.env sourcing and variable presence. Line 68 sources pod.env without checking if the source succeeded or if $POD_NAME and $KSVC_URL are set. If pod.env is missing or malformed, subsequent commands at line 69 and 84 will fail with confusing errors.
```diff
+source pod.env || { echo "❌ Failed to source pod.env from pipeline-vllm.sh"; exit 1; }
+[[ -n "$POD_NAME" && -n "$KSVC_URL" ]] || { echo "❌ pod.env missing POD_NAME or KSVC_URL"; exit 1; }
 oc wait --for=condition=Ready pod/$POD_NAME -n $NAMESPACE --timeout=300s
```
Alternatively, verify that pipeline-vllm.sh unconditionally writes and exports these variables to pod.env before this script sources it.
126-128: Check the exit code of the service readiness wait. The `oc wait` at line 126 lacks exit code validation. If it times out or fails, the script continues with incomplete deployments, causing cryptic failures downstream. Add explicit error handling:
```diff
 oc wait pod/lightspeed-stack-service pod/llama-stack-service \
-  -n "$NAMESPACE" --for=condition=Ready --timeout=300s
+  -n "$NAMESPACE" --for=condition=Ready --timeout=300s || {
+  echo "❌ Service pods did not reach Ready within 300s"
+  oc get pods -n "$NAMESPACE" -o wide || true
+  exit 1
+}
 sleep 30
```
168-169: Clarify and validate the dual-wait pattern. Line 168 uses an unusual pattern: wait for Ready=True and, if that times out or fails, wait for Ready=False. This is non-standard and could mask deployment issues. Clarify the intent:
- If the goal is to ensure the pod terminates (success or failure), use a single condition or add explicit validation.
- If testing pod failure scenarios, document this clearly.
Please verify the intent of this dual-wait pattern. If pods should reach Ready=True and the second condition is a fallback, consider restructuring for clarity:
```bash
# Example: wait for a terminal state with an explicit timeout
oc wait --for=condition=Ready=True pod/test-pod -n $NAMESPACE --timeout=900s && {
  echo "✅ Pod reached Ready state"
} || {
  echo "⚠️ Pod did not reach Ready; checking if it failed..."
  oc wait --for=condition=Ready=False pod/test-pod -n $NAMESPACE --timeout=60s || {
    echo "❌ Pod stuck in intermediate state"
    exit 1
  }
}
```
194-204: Validate the test pod exit code before using it. Line 194 extracts TEST_EXIT_CODE without checking if the pod has terminated or if containerStatuses exists. An empty or missing exit code can cause silent failures or incorrect exit logic. Add explicit validation:
```diff
-TEST_EXIT_CODE=$(oc get pod test-pod -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}')
+TEST_EXIT_CODE=$(oc get pod test-pod -n $NAMESPACE -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}' 2>/dev/null || echo "")
+
+if [[ -z "$TEST_EXIT_CODE" ]]; then
+  echo "❌ Test pod did not terminate or exitCode is missing"
+  oc describe pod test-pod -n $NAMESPACE || true
+  oc logs test-pod -n $NAMESPACE || true
+  exit 1
+fi

 echo "===== E2E COMPLETE ====="
-if [ "${TEST_EXIT_CODE:-2}" -ne 0 ]; then
+if [[ "$TEST_EXIT_CODE" -ne 0 ]]; then
   echo "❌ E2E tests failed with exit code $TEST_EXIT_CODE (pod/test-pod failed)"
 else
   echo "✅ E2E tests succeeded"
 fi
 exit $TEST_EXIT_CODE
```
tests/e2e-prow/rhoai/scripts/deploy-vllm.sh (2)
23-23: Check the exit code of `oc rollout status`. The rollout status command at line 23 doesn't validate success. If it times out or the deployment fails, the script continues with an incomplete rollout, causing silent failures. Add error handling:
```diff
 echo "Waiting for kserve-controller-manager rollout..."
-oc rollout status deployment/kserve-controller-manager -n redhat-ods-applications --timeout=300s
+oc rollout status deployment/kserve-controller-manager -n redhat-ods-applications --timeout=300s || {
+  echo "❌ KServe rollout failed or timed out"
+  exit 1
+}
```
Alternatively, add `set -euo pipefail` at the script top to catch this automatically.
32-34: Add error handling to manifest application. The `oc apply` commands at lines 32 and 34 lack error checking. If manifests are malformed or paths are incorrect, the commands fail silently. Add error handling:
```diff
-oc apply -f "$BASE_DIR/manifests/vllm/vllm-runtime-cpu.yaml"
+oc apply -f "$BASE_DIR/manifests/vllm/vllm-runtime-cpu.yaml" || { echo "❌ Failed to apply vllm-runtime-cpu.yaml"; exit 1; }
-oc apply -f "$BASE_DIR/manifests/vllm/inference-service.yaml"
+oc apply -f "$BASE_DIR/manifests/vllm/inference-service.yaml" || { echo "❌ Failed to apply inference-service.yaml"; exit 1; }
```
Or add `set -euo pipefail` at the script top to fail on any command error.
tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml (1)
1-17: Manifest looks good. This DataScienceCluster manifest is well-formed and properly configures KServe with service mesh. No issues identified.
tests/e2e-prow/rhoai/configs/run.yaml (1)
1-120: Configuration aligns well with the deployment topology. This llama-stack configuration correctly references the vLLM inference provider via environment variables and is well-structured for the e2e test scenario. The use of `${env.KSVC_URL}` and `${env.VLLM_API_KEY}` matches the secrets/ConfigMaps provisioned in pipeline.sh; a preflight sketch follows.
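A small preflight sketch that the pipeline could run before starting llama-stack (ordinary bash, not part of llama-stack itself; the variable list mirrors the substitutions named above):
```bash
#!/bin/bash
set -euo pipefail

# Fail fast if any env var that run.yaml substitutes is missing, instead of
# letting llama-stack start with an empty URL or API key.
for var in KSVC_URL VLLM_API_KEY; do
  if [[ -z "${!var:-}" ]]; then
    echo "❌ run.yaml expects \$${var}, but it is unset"
    exit 1
  fi
done
echo "✅ all env vars referenced by run.yaml are set"
```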
30-42: Verify that referenced secrets and ConfigMap exist in the deployment pipeline.The manifest references three Kubernetes objects that must be created before the ServingRuntime is deployed:
- Secret
hf-token-secret(line 36)- Secret
vllm-api-key-secret(line 41)- ConfigMap
vllm-chat-template(line 57)If these objects are not created beforehand, the pod will fail to start or fail at runtime.
Verify that the deployment pipeline (e.g., in
deploy-vllm.shor related scripts) creates these secrets and ConfigMap before applying this manifest. If they are not yet created, add creation steps to the pipeline.
```yaml
- args:
  # - /mnt/models/
  - meta-llama/Llama-3.2-1B-Instruct
  - --enable-auto-tool-choice
  - --tool-call-parser
  - llama3_json
  - --chat-template
  - /mnt/chat-template/tool_chat_template_llama3.2_json.jinja
  - --download-dir
  - /tmp/models-cache
  - --port
  - "8080"
  - --max-model-len
  - "2048"
```
Fix critical path mismatch in download-dir argument.
The container argument on line 23 specifies --download-dir /tmp/models-cache, but the volumeMount on line 51 mounts the models-cache volume to /mnt/models-cache. This path mismatch means vLLM will download models to /tmp/models-cache (inside the container's ephemeral filesystem) instead of the mounted emptyDir volume at /mnt/models-cache. Models will not persist across pod restarts.
Apply this diff to fix the path mismatch:
```diff
   - --download-dir
-  - /tmp/models-cache
+  - /mnt/models-cache
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
- args:
  # - /mnt/models/
  - meta-llama/Llama-3.2-1B-Instruct
  - --enable-auto-tool-choice
  - --tool-call-parser
  - llama3_json
  - --chat-template
  - /mnt/chat-template/tool_chat_template_llama3.2_json.jinja
  - --download-dir
  - /mnt/models-cache
  - --port
  - "8080"
  - --max-model-len
  - "2048"
```
🤖 Prompt for AI Agents
In tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml around lines 14 to
27, the --download-dir argument is set to /tmp/models-cache which does not match
the volumeMount at /mnt/models-cache; change the --download-dir value to
/mnt/models-cache so vLLM writes model files into the mounted models-cache
volume (ensuring persistence across pod restarts) and verify the path matches
exactly the container volumeMount name and path.
LGTM
vLLM served in OpenShift, llama-stack (2.25 RHOAI image) and lightspeed-stack (dev-latest) served in OpenShift as independent containers. Tests run on a fourth pod.
vLLM runs in CPU mode since GPUs are not available yet for our prow env.
Description
The main script pipeline.sh runs in prow and does the following:
Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing
Please refer to the prow logs of the latest run.
Summary by CodeRabbit
Documentation
Tests
Chores