[None][chore] Make submit.py can run single GPU test and accept customized config file#14630
[None][chore] Make submit.py can run single GPU test and accept customized config file#14630HuiGao-NV wants to merge 3 commits into
Conversation
Signed-off-by: Hui Gao <huig@nvidia.com>
📝 WalkthroughWalkthroughThe PR updates ChangesPerformance Test Submission Enhancements
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
jenkins/scripts/perf/local/submit.py (1)
293-294: ⚡ Quick winCatch specific exceptions instead of bare
Exception.As per coding guidelines, avoid broad exception handling. The
subprocesscall can raisesubprocess.SubprocessError(or its subclasses likeCalledProcessError),FileNotFoundErrorifsinfoisn't found, orOSErrorfor other OS-level issues.Proposed fix
- except Exception: + except (subprocess.SubprocessError, FileNotFoundError, OSError): return False🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@jenkins/scripts/perf/local/submit.py` around lines 293 - 294, Replace the broad "except Exception:" in the try/except that runs the external "sinfo" subprocess with specific exception handlers: catch subprocess.SubprocessError (to include CalledProcessError and related subprocess failures), FileNotFoundError (if "sinfo" is missing), and OSError for other OS-level issues, returning False for those specific cases; ensure you reference the subprocess invocation of "sinfo" (the block that calls subprocess.run / subprocess.check_output) and include the caught exception in any logging or diagnostics to aid debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@jenkins/scripts/perf/local/submit.py`:
- Line 844: Add a missing argparse definition for the --disagg-server-port
option so args.disagg_server_port exists at runtime; update the argparse setup
(the parser variable used in the argument parsing block) to add a new integer
argument named "--disagg-server-port" with a sensible default (e.g., 8000) and a
short help string indicating it is the port for the disaggregated server (used
for accuracy tests), ensuring subsequent references to args.disagg_server_port
(e.g., where the environment line sets DISAGG_SERVER_PORT) no longer raise
AttributeError.
- Around line 337-339: The substring check `llm_src not in mounts` can falsely
skip adding a mount when llm_src is a prefix of another path; instead parse the
existing mounts into their source paths and compare exactly. Split `mounts` on
commas, for each entry split on ':' and collect the left-side source paths
(normalize with os.path.abspath or os.path.normpath), then check `if llm_src and
llm_src not in source_paths:` before creating `llm_src_mount =
f"{llm_src}:{llm_src}"` and appending to `mounts`; update the logic around the
`llm_src`, `mounts`, and `llm_src_mount` variables accordingly.
---
Nitpick comments:
In `@jenkins/scripts/perf/local/submit.py`:
- Around line 293-294: Replace the broad "except Exception:" in the try/except
that runs the external "sinfo" subprocess with specific exception handlers:
catch subprocess.SubprocessError (to include CalledProcessError and related
subprocess failures), FileNotFoundError (if "sinfo" is missing), and OSError for
other OS-level issues, returning False for those specific cases; ensure you
reference the subprocess invocation of "sinfo" (the block that calls
subprocess.run / subprocess.check_output) and include the caught exception in
any logging or diagnostics to aid debugging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9228a729-16e4-4e0b-93e0-c1bff5ede545
📒 Files selected for processing (1)
jenkins/scripts/perf/local/submit.py
Signed-off-by: Hui Gao <huig@nvidia.com>
|
/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-4,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-3," --disable-fail-fast |
|
PR_Github #50538 [ run ] triggered by Bot. Commit: |
|
/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2" --disable-fail-fast |
|
PR_Github #50544 [ run ] triggered by Bot. Commit: |
|
PR_Github #50538 [ run ] completed with state |
|
PR_Github #50544 [ run ] completed with state
|
|
/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2" --disable-fail-fast |
|
PR_Github #50630 [ run ] triggered by Bot. Commit: |
5c51268 to
2835b14
Compare
|
PR_Github #50630 [ run ] completed with state
|
|
/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2" --disable-fail-fast |
|
PR_Github #50652 [ run ] triggered by Bot. Commit: |
|
PR_Github #50652 [ run ] completed with state
|
|
/bot run --post-merge --only-multi-gpu-test --disable-fail-fast |
|
PR_Github #50694 [ run ] triggered by Bot. Commit: |
|
PR_Github #50694 [ run ] completed with state |
Signed-off-by: HuiGao <huig@nvidia.com>
2835b14 to
f0e0eb4
Compare
|
/bot run --post-merge --only-multi-gpu-test --disable-fail-fast |
|
PR_Github #51000 [ run ] triggered by Bot. Commit: |
|
PR_Github #51000 [ run ] completed with state |
Make submit.py can run single GPU test and accept customized config file
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.