Skip to content

Fix CI setup for llmisvc#1196

Merged
pierDipi merged 43 commits intoopendatahub-io:masterfrom
andresllh:fix-race-condition-llmisvc-tests-master
Mar 23, 2026
Merged

Fix CI setup for llmisvc#1196
pierDipi merged 43 commits intoopendatahub-io:masterfrom
andresllh:fix-race-condition-llmisvc-tests-master

Conversation

@andresllh
Copy link
Copy Markdown
Member

@andresllh andresllh commented Mar 13, 2026

Summary by CodeRabbit

  • Tests
    • Improved test setup flow with staged resource application, webhook readiness waits, and additional logging.
    • Adjusted test execution mode via test-run argument change.
    • Updated fixtures to enforce non-root security context.
    • Enhanced auth-related tests to auto-add a disabled-auth annotation when missing and handle LLM service config updates.

Signed-off-by: Andres Llausas <allausas@redhat.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Updates the OpenShift CI test setup script to perform per-image substitutions, apply resources in multiple stages (CRDs first, then resources that may trigger webhook validation), wait for the llmisvc-controller-manager to be ready before applying webhook-validated resources, re-apply LLMInferenceServiceConfig after webhook readiness, and conditionally patch/restart the controller when running llminferenceservice tests; adds related logging and waits. Test fixtures for LLMInferenceServiceConfig now set securityContext.runAsNonRoot: true and remove runAsUser fields. test_llm_inference_service.py now ensures metadata.annotations exist and sets security.opendatahub.io/enable-auth = "false" when annotations are missing before creating services. The CI workflow invocation arg changed from 1 to 0.

Note: automatically disabling auth in tests is a potential security concern (see CWE-306: Missing Authentication).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Fix CI setup for llmisvc' is related to the changeset but is overly vague and doesn't clearly convey the main purpose—addressing a race condition in llmisvc configuration. Revise to a more specific title such as 'Fix race condition in llmisvc config application by sequencing webhook readiness' to better reflect the actual changes made.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can use Trivy to scan for security misconfigurations and secrets in Infrastructure as Code files.

Add a .trivyignore file to your project to customize which findings Trivy reports.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/scripts/openshift-ci/setup-e2e-tests.sh`:
- Around line 136-138: The current unconditional suppression of errors after the
oc apply call hides real failures; change the oc apply invocation in the setup
script (the line using oc apply --server-side=true --force-conflicts -f -) to
capture its exit code and output, log any failure to stderr (e.g., echo or
printf with the captured output and exit code), and only swallow/continue for
expected webhook/validation errors by checking the output text for
webhook/validation-related messages before allowing the script to proceed; for
any other failures exit non-zero so problems like auth, malformed YAML, missing
CRDs, quotas, or network errors are surfaced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 238f208f-076c-44b1-a5b9-7ae58cb3645f

📥 Commits

Reviewing files that changed from the base of the PR and between ae20ef2 and 79a08ae.

📒 Files selected for processing (1)
  • test/scripts/openshift-ci/setup-e2e-tests.sh

Comment thread test/scripts/openshift-ci/setup-e2e-tests.sh
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
test/scripts/openshift-ci/setup-e2e-tests.sh (1)

136-138: ⚠️ Potential issue | 🟠 Major

Do not suppress all oc apply failures with || true.

Line 138 masks unrelated failures (auth, malformed manifests, missing APIs, network) and makes CI results non-diagnostic. Restrict continuation to expected webhook/validation failures only.

Proposed fix
-  echo "$ODH_MANIFESTS" | oc apply --server-side=true --force-conflicts -f - || true
+  APPLY_OUTPUT="$(echo "$ODH_MANIFESTS" | oc apply --server-side=true --force-conflicts -f - 2>&1)" || APPLY_RC=$?
+  APPLY_RC="${APPLY_RC:-0}"
+  if [[ "$APPLY_RC" -ne 0 ]]; then
+    if grep -qiE 'webhook|admission|denied the request|validation' <<<"$APPLY_OUTPUT"; then
+      echo "⚠️ Initial apply failed due to expected webhook/validation timing. Retrying after controller readiness."
+    else
+      echo "$APPLY_OUTPUT" >&2
+      exit "$APPLY_RC"
+    fi
+  fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/scripts/openshift-ci/setup-e2e-tests.sh` around lines 136 - 138, The
current line piping "$ODH_MANIFESTS" into oc apply ends with "|| true", which
masks all failures; change this to run oc apply without unconditional
suppression, capture its stderr/exit code, and only ignore/ retry when the
failure matches the expected webhook/validation error (e.g., contains
"validation webhook" or the specific LLMInferenceServiceConfig webhook message).
Concretely: run oc apply --server-side=true --force-conflicts -f - and if it
fails, inspect the output for the known webhook/validation error text and retry
a bounded number of times; for any other error (auth, malformed manifest,
missing API, network), propagate the non-zero exit so CI fails and logs the real
error. Use the existing ODH_MANIFESTS variable and the oc apply command as the
location to implement this conditional retry/exit behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/default/manager_image_patch.yaml`:
- Line 11: Replace the unqualified mutable image reference "image:
kserve-controller" with a fully qualified, pinned image (registry hostname,
repository, and immutable tag or digest) so the manager image is deterministic;
update the value in config/default/manager_image_patch.yaml where "image:
kserve-controller" appears and ensure it matches the image naming convention
used by your kustomization (or add an images: transform in
config/default/kustomization.yaml) so deployments use the exact
registry/repository:tag or `@sha256`:digest.

In `@test/e2e/llmisvc/test_resources.py`:
- Line 31: Tests hardcode the gatewayClassName value ("openshift-default")
instead of using the existing env-driven GATEWAY_CLASS_NAME, breaking
non-OpenShift runs; replace the literal "openshift-default" in the JSON/dict
entries (both occurrences) with the GATEWAY_CLASS_NAME variable so the test
reads the environment-driven value, ensuring GATEWAY_CLASS_NAME is in scope
where gatewayClassName is constructed and updating both places referenced in
test_resources.py.

In `@test/scripts/openshift-ci/setup-e2e-tests.sh`:
- Line 150: The shell commands calling oc (e.g., the oc patch configmap
inferenceservice-config invocation) pass the KSERVE_NAMESPACE variable unquoted;
update those invocations to quote the namespace variable as
"${KSERVE_NAMESPACE}" wherever it’s used (both occurrences flagged) to prevent
word-splitting/globbing and command injection risks.

---

Duplicate comments:
In `@test/scripts/openshift-ci/setup-e2e-tests.sh`:
- Around line 136-138: The current line piping "$ODH_MANIFESTS" into oc apply
ends with "|| true", which masks all failures; change this to run oc apply
without unconditional suppression, capture its stderr/exit code, and only
ignore/ retry when the failure matches the expected webhook/validation error
(e.g., contains "validation webhook" or the specific LLMInferenceServiceConfig
webhook message). Concretely: run oc apply --server-side=true --force-conflicts
-f - and if it fails, inspect the output for the known webhook/validation error
text and retry a bounded number of times; for any other error (auth, malformed
manifest, missing API, network), propagate the non-zero exit so CI fails and
logs the real error. Use the existing ODH_MANIFESTS variable and the oc apply
command as the location to implement this conditional retry/exit behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 198d4ce8-9310-43bb-afcf-bd60b92dfa17

📥 Commits

Reviewing files that changed from the base of the PR and between 79a08ae and 94bf125.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum, !**/*.sum
📒 Files selected for processing (6)
  • config/default/manager_image_patch.yaml
  • pkg/controller/v1alpha2/llmisvc/workload_storage.go
  • pkg/utils/storage.go
  • test/e2e/llmisvc/fixtures.py
  • test/e2e/llmisvc/test_resources.py
  • test/scripts/openshift-ci/setup-e2e-tests.sh

Comment thread config/default/manager_image_patch.yaml Outdated
Comment thread test/e2e/llmisvc/test_resources.py Outdated
Comment thread test/scripts/openshift-ci/setup-e2e-tests.sh Outdated
Comment thread pkg/controller/v1alpha2/llmisvc/workload_storage.go Outdated
Comment thread pkg/utils/storage.go Outdated
Signed-off-by: Andres Llausas <allausas@redhat.com>
@andresllh andresllh force-pushed the fix-race-condition-llmisvc-tests-master branch from 8a9813b to a471d38 Compare March 16, 2026 20:12
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/llmisvc/test_llm_inference_service.py`:
- Around line 381-385: The guard only checks if
test_case.llm_service.metadata.annotations is falsy, missing the case where
annotations exist but the key "security.opendatahub.io/enable-auth" is absent;
update the logic around test_case.llm_service.metadata.annotations to ensure
annotations is a dict (create if None) and then explicitly set
test_case.llm_service.metadata.annotations["security.opendatahub.io/enable-auth"]
= "false" only when that specific key is not present, using the existing symbols
test_case.llm_service.metadata.annotations and the annotation key to locate and
modify the code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: a28b493b-467c-4f54-906b-8e508dea39e8

📥 Commits

Reviewing files that changed from the base of the PR and between 94bf125 and a471d38.

📒 Files selected for processing (4)
  • config/overlays/odh-test/configmap/inferenceservice.yaml
  • test/e2e/llmisvc/fixtures.py
  • test/e2e/llmisvc/test_llm_inference_service.py
  • test/scripts/openshift-ci/setup-e2e-tests.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/llmisvc/fixtures.py

Comment thread test/e2e/llmisvc/test_llm_inference_service.py Outdated
Comment thread go.sum
Comment thread pkg/controller/v1alpha2/llmisvc/workload_storage.go Outdated
Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
Comment thread test/e2e/llmisvc/test_resources.py Outdated
Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
Comment thread pkg/controller/v1alpha2/llmisvc/workload_storage.go Outdated
Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
Comment thread test/e2e/llmisvc/test_resources.py Outdated
pierDipi and others added 2 commits March 17, 2026 10:00
Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
@pierDipi pierDipi force-pushed the fix-race-condition-llmisvc-tests-master branch from 6792d67 to 146b83d Compare March 20, 2026 06:57
@pierDipi pierDipi force-pushed the fix-race-condition-llmisvc-tests-master branch from c932e29 to 5444a04 Compare March 20, 2026 08:17
Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
@pierDipi pierDipi force-pushed the fix-race-condition-llmisvc-tests-master branch from 5444a04 to d07db8d Compare March 20, 2026 09:21
Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Switch to `oc`

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
@pierDipi pierDipi force-pushed the fix-race-condition-llmisvc-tests-master branch from d881c16 to df866f7 Compare March 20, 2026 17:20
@pierDipi
Copy link
Copy Markdown
Member

⏳ waiting for authorino-operator to be ready.…
error: timed out waiting for the condition on kuadrants/kuadrant

/test e2e-llm-inference-service

@pierDipi
Copy link
Copy Markdown
Member

/test e2e-graph

@bartoszmajsak
Copy link
Copy Markdown

/retest-required

@pierDipi
Copy link
Copy Markdown
Member

FAILED llmisvc/test_storage_version_migration.py::TestStorageVersionMigration::test_storage_version_migration_after_simulated_upgrade - subprocess.CalledProcessError: Command '['oc', 'rollout', 'restart', 'deployment/llmisvc-controller-manager', '-n', 'opendatahub']' returned non-zero exit status 1.

controller is running in Kserve namespace

  name: llmisvc-controller-manager-5d5d6dffd5-6l4p5
  namespace: kserve

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
@pierDipi pierDipi force-pushed the fix-race-condition-llmisvc-tests-master branch from a966ca2 to c2bd34e Compare March 21, 2026 07:37
Fix error `no matches for kind \"ClusterServingRuntime\" in version \"serving.kserve.io/v1alpha1\"`

Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
Signed-off-by: Pierangelo Di Pilato <pierdipi@redhat.com>
@pierDipi
Copy link
Copy Markdown
Member

The isvc tests failures can be addressed separately

/override ci/prow/e2e-raw
/override ci/prow/e2e-predictor
/override ci/prow/e2e-graph

@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 23, 2026

@pierDipi: Overrode contexts on behalf of pierDipi: ci/prow/e2e-graph, ci/prow/e2e-predictor, ci/prow/e2e-raw

Details

In response to this:

The isvc tests failures can be addressed separately

/override ci/prow/e2e-raw
/override ci/prow/e2e-predictor
/override ci/prow/e2e-graph

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Copy Markdown
Member

@pierDipi pierDipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm label Mar 23, 2026
@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 23, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andresllh, pierDipi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pierDipi pierDipi merged commit 1ac3660 into opendatahub-io:master Mar 23, 2026
35 of 37 checks passed
@github-project-automation github-project-automation bot moved this from New/Backlog to Done in ODH Model Serving Planning Mar 23, 2026
@openshift-cherrypick-robot
Copy link
Copy Markdown

@pierDipi: #1196 failed to apply on top of branch "release-v0.17":

Applying: Fixing race condition for llmisvcconfigs
Applying: Rebasing midstream/master onto my branch
Using index info to reconstruct a base tree...
M	go.sum
M	pkg/controller/v1alpha2/llmisvc/workload_storage.go
M	pkg/utils/storage.go
M	test/e2e/llmisvc/fixtures.py
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/llmisvc/fixtures.py
Auto-merging pkg/utils/storage.go
Auto-merging pkg/controller/v1alpha2/llmisvc/workload_storage.go
Auto-merging go.sum
CONFLICT (content): Merge conflict in go.sum
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0002 Rebasing midstream/master onto my branch

Details

In response to this:

/cherry-pick release-v0.17

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants