[None][chore] Make submit.py can run single GPU test and accept customized config file by HuiGao-NV · Pull Request #14630 · NVIDIA/TensorRT-LLM

HuiGao-NV · 2026-05-27T11:14:08Z

Make submit.py can run single GPU test and accept customized config file

Summary by CodeRabbit

Improvements
- Enhanced GPU detection and configuration for performance testing environments.
- Optimized accuracy testing workflow with improved configuration handling.
- Refined container and MPI settings for better performance across different job configurations.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Hui Gao <huig@nvidia.com>

coderabbitai · 2026-05-27T11:18:36Z

📝 Walkthrough

Walkthrough

The PR updates jenkins/scripts/perf/local/submit.py to enhance performance test submission with SLURM GPU detection, configuration normalization, improved container argument handling, and updated accuracy test execution for aggregated and disaggregated modes.

Changes

Performance Test Submission Enhancements

Layer / File(s)	Summary
GPU detection and SBATCH configuration `jenkins/scripts/perf/local/submit.py`	Adds `subprocess` import and `partition_has_gpu_gres()` function to probe SLURM partitions via `sinfo`. Conditionally emits `--gpus-per-node` and `--gres` SBATCH flags based on detected GPU GRES availability.
Configuration path resolution and pytest folder setup `jenkins/scripts/perf/local/submit.py`	Resolves `config_yaml` to absolute path when `--config-file` is provided. Computes and exports `AGG_CONFIG_FOLDER` and `DISAGG_CONFIG_FOLDER` from the config file's parent directory for pytest discovery.
Accuracy configuration standardization `jenkins/scripts/perf/local/submit.py`	Introduces `normalize_accuracy_config()` helper to convert flat accuracy configs into nested "tasks" dictionary format. Updates accuracy config initialization to apply normalization.
Enhanced srun arguments with container mounts and MPI tuning `jenkins/scripts/perf/local/submit.py`	Updates `generate_srun_args()` signature to accept `llm_src` and `hardware_config`. Auto-mounts `llm_src` into container via `--container-mounts`. Omits default `--mpi=pmi2` for single-GPU aggregated jobs. Updates call site to pass new parameters.
Aggregated pytest launcher adjustment `jenkins/scripts/perf/local/submit.py`	Conditionally removes `$LLM_API_LAUNCH` segment from aggregated pytest command when `hardware_config.total_gpus == 1`.
Disaggregated accuracy test execution `jenkins/scripts/perf/local/submit.py`	Reworks disaggregated accuracy flow to derive `MODEL_DIR_NAME` from config, locate and invoke `accuracy_runner.py`, export `ACCURACY_CONFIG_JSON` and accuracy environment variables, and extend srun arguments with container env entries including `DISAGG_SERVER_HOST`.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is incomplete. It lacks a proper description section explaining the issue and solution, test coverage details, and most checklist items are unchecked.	Fill in the Description section with details about why single-GPU test support and customized config file acceptance are needed. Add Test Coverage section listing relevant tests. Review and check appropriate checklist items based on the changes made.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly describes the main changes: enabling single GPU tests and accepting customized config files in submit.py.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

jenkins/scripts/perf/local/submit.py (1)

293-294: ⚡ Quick win

Catch specific exceptions instead of bare Exception.

As per coding guidelines, avoid broad exception handling. The subprocess call can raise subprocess.SubprocessError (or its subclasses like CalledProcessError), FileNotFoundError if sinfo isn't found, or OSError for other OS-level issues.

Proposed fix

-    except Exception:
+    except (subprocess.SubprocessError, FileNotFoundError, OSError):
         return False

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/scripts/perf/local/submit.py` around lines 293 - 294, Replace the
broad "except Exception:" in the try/except that runs the external "sinfo"
subprocess with specific exception handlers: catch subprocess.SubprocessError
(to include CalledProcessError and related subprocess failures),
FileNotFoundError (if "sinfo" is missing), and OSError for other OS-level
issues, returning False for those specific cases; ensure you reference the
subprocess invocation of "sinfo" (the block that calls subprocess.run /
subprocess.check_output) and include the caught exception in any logging or
diagnostics to aid debugging.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/scripts/perf/local/submit.py`:
- Line 844: Add a missing argparse definition for the --disagg-server-port
option so args.disagg_server_port exists at runtime; update the argparse setup
(the parser variable used in the argument parsing block) to add a new integer
argument named "--disagg-server-port" with a sensible default (e.g., 8000) and a
short help string indicating it is the port for the disaggregated server (used
for accuracy tests), ensuring subsequent references to args.disagg_server_port
(e.g., where the environment line sets DISAGG_SERVER_PORT) no longer raise
AttributeError.
- Around line 337-339: The substring check `llm_src not in mounts` can falsely
skip adding a mount when llm_src is a prefix of another path; instead parse the
existing mounts into their source paths and compare exactly. Split `mounts` on
commas, for each entry split on ':' and collect the left-side source paths
(normalize with os.path.abspath or os.path.normpath), then check `if llm_src and
llm_src not in source_paths:` before creating `llm_src_mount =
f"{llm_src}:{llm_src}"` and appending to `mounts`; update the logic around the
`llm_src`, `mounts`, and `llm_src_mount` variables accordingly.

---

Nitpick comments:
In `@jenkins/scripts/perf/local/submit.py`:
- Around line 293-294: Replace the broad "except Exception:" in the try/except
that runs the external "sinfo" subprocess with specific exception handlers:
catch subprocess.SubprocessError (to include CalledProcessError and related
subprocess failures), FileNotFoundError (if "sinfo" is missing), and OSError for
other OS-level issues, returning False for those specific cases; ensure you
reference the subprocess invocation of "sinfo" (the block that calls
subprocess.run / subprocess.check_output) and include the caught exception in
any logging or diagnostics to aid debugging.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9228a729-16e4-4e0b-93e0-c1bff5ede545

📥 Commits

Reviewing files that changed from the base of the PR and between 8cd28ed and 7852613.

📒 Files selected for processing (1)

jenkins/scripts/perf/local/submit.py

Signed-off-by: Hui Gao <huig@nvidia.com>

HuiGao-NV · 2026-05-27T12:49:17Z

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-4,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-3," --disable-fail-fast

tensorrt-cicd · 2026-05-27T12:56:22Z

PR_Github #50538 [ run ] triggered by Bot. Commit: 5c51268 Link to invocation

HuiGao-NV · 2026-05-27T13:40:17Z

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2" --disable-fail-fast

tensorrt-cicd · 2026-05-27T13:47:16Z

PR_Github #50544 [ run ] triggered by Bot. Commit: 5c51268 Link to invocation

tensorrt-cicd · 2026-05-27T13:47:25Z

PR_Github #50538 [ run ] completed with state ABORTED. Commit: 5c51268

Link to invocation

tensorrt-cicd · 2026-05-27T15:02:51Z

PR_Github #50544 [ run ] completed with state FAILURE. Commit: 5c51268
/LLM/main/L0_MergeRequest_PR pipeline #40048 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

HuiGao-NV · 2026-05-27T23:36:37Z

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2" --disable-fail-fast

tensorrt-cicd · 2026-05-27T23:42:23Z

PR_Github #50630 [ run ] triggered by Bot. Commit: 5c51268 Link to invocation

tensorrt-cicd · 2026-05-28T01:28:32Z

PR_Github #50630 [ run ] completed with state SUCCESS. Commit: 5c51268
/LLM/main/L0_MergeRequest_PR pipeline #40122 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

HuiGao-NV · 2026-05-28T01:32:49Z

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2" --disable-fail-fast

tensorrt-cicd · 2026-05-28T01:38:24Z

PR_Github #50652 [ run ] triggered by Bot. Commit: 2835b14 Link to invocation

tensorrt-cicd · 2026-05-28T03:17:00Z

PR_Github #50652 [ run ] completed with state SUCCESS. Commit: 2835b14
/LLM/main/L0_MergeRequest_PR pipeline #40144 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

HuiGao-NV · 2026-05-28T04:42:53Z

/bot run --post-merge --only-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-05-28T04:48:35Z

PR_Github #50694 [ run ] triggered by Bot. Commit: 2835b14 Link to invocation

tensorrt-cicd · 2026-05-29T04:49:26Z

PR_Github #50694 [ run ] completed with state ABORTED. Commit: 2835b14

Link to invocation

Signed-off-by: HuiGao <huig@nvidia.com>

HuiGao-NV · 2026-05-29T07:11:34Z

/bot run --post-merge --only-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-05-29T07:17:33Z

PR_Github #51000 [ run ] triggered by Bot. Commit: f0e0eb4 Link to invocation

QiJune

LGTM

tensorrt-cicd · 2026-05-30T07:18:22Z

PR_Github #51000 [ run ] completed with state ABORTED. Commit: f0e0eb4

Link to invocation

Support customized config file path and run with single GPU test

7852613

Signed-off-by: Hui Gao <huig@nvidia.com>

HuiGao-NV requested review from a team as code owners May 27, 2026 11:14

HuiGao-NV requested review from tburt-nv and zeroepoch May 27, 2026 11:14

github-actions Bot assigned HuiGao-NV May 27, 2026

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

Comment thread jenkins/scripts/perf/local/submit.py

Comment thread jenkins/scripts/perf/local/submit.py

Format file

5c51268

Signed-off-by: Hui Gao <huig@nvidia.com>

HuiGao-NV changed the title ~~Make submit.py can run single GPU test and accept customized config file~~ [None][chore] Make submit.py can run single GPU test and accept customized config file May 27, 2026

HuiGao-NV force-pushed the enhance_submit_script branch from 5c51268 to 2835b14 Compare May 28, 2026 00:01

HuiGao-NV requested review from QiJune and kaiyux May 28, 2026 01:37

zeroepoch approved these changes May 28, 2026

View reviewed changes

QiJune reviewed May 29, 2026

View reviewed changes

Comment thread jenkins/scripts/perf/local/submit.py

Add missed arg from parser

f0e0eb4

Signed-off-by: HuiGao <huig@nvidia.com>

HuiGao-NV force-pushed the enhance_submit_script branch from 2835b14 to f0e0eb4 Compare May 29, 2026 07:09

QiJune approved these changes May 29, 2026

View reviewed changes

Conversation

HuiGao-NV commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

HuiGao-NV commented May 27, 2026

Uh oh!

tensorrt-cicd commented May 27, 2026

Uh oh!

HuiGao-NV commented May 27, 2026

Uh oh!

tensorrt-cicd commented May 27, 2026

Uh oh!

tensorrt-cicd commented May 27, 2026

Uh oh!

tensorrt-cicd commented May 27, 2026

Uh oh!

HuiGao-NV commented May 27, 2026

Uh oh!

tensorrt-cicd commented May 27, 2026

Uh oh!

tensorrt-cicd commented May 28, 2026

Uh oh!

HuiGao-NV commented May 28, 2026

Uh oh!

tensorrt-cicd commented May 28, 2026

Uh oh!

tensorrt-cicd commented May 28, 2026

Uh oh!

HuiGao-NV commented May 28, 2026

Uh oh!

tensorrt-cicd commented May 28, 2026

Uh oh!

Uh oh!

tensorrt-cicd commented May 29, 2026

Uh oh!

HuiGao-NV commented May 29, 2026

Uh oh!

tensorrt-cicd commented May 29, 2026

Uh oh!

QiJune left a comment

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

HuiGao-NV commented May 27, 2026 •

edited

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading