
Commit d15d37d

are-ces and claude authored

LCORE-1497: Fix disruption flag not reset when Prow lightspeed restart restores llama-stack (#1628)

* Add diagnostic pod logs on e2e failure and remove disrupt-once optimization
* Increase vLLM max-model-len to 35936 (GPU memory limit)
* Accept 503 as valid port-forward proof in e2e connectivity check

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent f3e3b77 commit d15d37d

4 files changed: 29 additions & 4 deletions

tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -24,7 +24,7 @@ spec:
         - --port
         - "8080"
         - --max-model-len
-        - "32768"
+        - "35936"
       image: quay.io/rh-ee-cpompeia/vllm-cpu:latest
       name: kserve-container
       env:
```

tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -24,7 +24,7 @@ spec:
         - --port
         - "8080"
         - --max-model-len
-        - "32768"
+        - "35936"
         - --gpu-memory-utilization
         - "0.9"
       image: ${VLLM_IMAGE}
```
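
Both ServingRuntime manifests raise --max-model-len from 32768 to 35936, the largest context that fits the GPU memory budget per the commit message. As an illustrative check (not part of this commit), vLLM's OpenAI-compatible /v1/models endpoint reports a max_model_len field per model in recent releases, so the new value can be confirmed after the pods roll. The sketch assumes a port-forward to the inference service on localhost:8080, matching the --port argument above:

```python
# Hedged sketch: verify the served context window after the rollout.
# Assumes something like `oc port-forward svc/<vllm-service> 8080:8080`
# is already running; the service name is a placeholder.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/v1/models", timeout=5) as resp:
    for model in json.load(resp).get("data", []):
        # Recent vLLM builds include max_model_len in each model card.
        print(model["id"], model.get("max_model_len"))  # expect 35936
```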

tests/e2e-prow/rhoai/scripts/e2e-ops.sh

Lines changed: 2 additions & 2 deletions
```diff
@@ -192,10 +192,10 @@ verify_connectivity() {
     local http_code=""
 
     for ((attempt=1; attempt<=max_attempts; attempt++)); do
-        # First check /readiness to see if port-forward is alive (accept 200 or 401)
+        # First check /readiness to see if port-forward is alive (accept 200, 401, or 503)
         http_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://localhost:$local_port/readiness" 2>/dev/null) || http_code="000"
 
-        if [[ "$http_code" == "200" || "$http_code" == "401" ]]; then
+        if [[ "$http_code" == "200" || "$http_code" == "401" || "$http_code" == "503" ]]; then
            # Port-forward works; now verify the app is fully initialized by hitting
            # a real endpoint. /v1/models requires the Llama Stack handshake to complete.
            # Accept 200 (no auth) or 401 (auth enabled) — both prove the full app
```
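
The loosened accept-set encodes the scenario this commit fixes: while llama-stack restarts, lightspeed-stack's /readiness legitimately returns 503, which still proves the port-forward tunnel and the app process are alive, so the loop should keep polling /v1/models rather than treat the tunnel as dead. A minimal Python sketch of the same probe logic, illustrative only (the real check is the bash loop above):

```python
# Illustrative sketch of the accept-set above; not the repository's code.
# In urllib, 401 and 503 raise HTTPError but still carry the status code,
# which is exactly the signal that the port-forward tunnel is alive.
import urllib.error
import urllib.request

def readiness_status(local_port: int, timeout: float = 5.0) -> int:
    """Return the HTTP status of /readiness, or 0 if the connection failed."""
    url = f"http://localhost:{local_port}/readiness"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 401 (auth) or 503 (backend restarting) reached the app
    except (urllib.error.URLError, TimeoutError):
        return 0  # connection refused or timed out: the tunnel is dead

if readiness_status(8080) in (200, 401, 503):
    print("port-forward alive; keep polling /v1/models")
else:
    print("port-forward dead; re-establish it")
```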

tests/e2e/features/environment.py

Lines changed: 25 additions & 0 deletions
```diff
@@ -237,6 +237,26 @@ def before_scenario(context: Context, scenario: Scenario) -> None:
             delattr(context, _attr)
 
 
+def _dump_pod_logs_on_failure(scenario: Scenario, namespace: str) -> None:
+    """Dump llama-stack and lightspeed-stack pod logs when a scenario fails in Prow."""
+    if scenario.status != "failed":
+        return
+    for pod in ("llama-stack-service", "lightspeed-stack-service"):
+        print(f"--- {pod} logs (scenario failed: {scenario.name}) ---")
+        try:
+            r = subprocess.run(
+                ["oc", "logs", pod, "-n", namespace, "--tail=100"],
+                capture_output=True,
+                text=True,
+                timeout=15,
+                check=False,
+            )
+            print(r.stdout or r.stderr or "(no output)")
+        except subprocess.TimeoutExpired:
+            print("(timed out fetching logs)")
+        print(f"--- end {pod} logs ---")
+
+
 def after_scenario(context: Context, scenario: Scenario) -> None:
     """Run after each scenario is run.
 
@@ -266,6 +286,11 @@ def after_scenario(context: Context, scenario: Scenario) -> None:
             used for the llama-stack health check.
         scenario (Scenario): Behave scenario (unused; shield restore uses context flags).
     """
+    if is_prow_environment():
+        _dump_pod_logs_on_failure(
+            scenario, os.environ.get("NAMESPACE", "e2e-rhoai-dsc")
+        )
+
     if getattr(context, "scenario_lightspeed_override_active", False):
         context.scenario_lightspeed_override_active = False
     feature_cfg = getattr(context, "feature_config", None)
```
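
The new after_scenario guard calls is_prow_environment(), a helper defined elsewhere in the repo that this diff does not show. Purely as a hypothetical sketch of what such a guard usually looks like (Prow conventionally exports PROW_JOB_ID into job pods), one plausible shape is:

```python
# Hypothetical sketch only; the repository's real is_prow_environment()
# is not part of this diff and may check something else entirely.
import os

def is_prow_environment() -> bool:
    """Best guess: treat the presence of PROW_JOB_ID as running under Prow."""
    return "PROW_JOB_ID" in os.environ
```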
