Skip to content

[None][chore] Make submit.py can run single GPU test and accept customized config file#14630

Open
HuiGao-NV wants to merge 3 commits into
NVIDIA:mainfrom
HuiGao-NV:enhance_submit_script
Open

[None][chore] Make submit.py can run single GPU test and accept customized config file#14630
HuiGao-NV wants to merge 3 commits into
NVIDIA:mainfrom
HuiGao-NV:enhance_submit_script

Conversation

@HuiGao-NV
Copy link
Copy Markdown
Collaborator

@HuiGao-NV HuiGao-NV commented May 27, 2026

Make submit.py can run single GPU test and accept customized config file

Summary by CodeRabbit

  • Improvements
    • Enhanced GPU detection and configuration for performance testing environments.
    • Optimized accuracy testing workflow with improved configuration handling.
    • Refined container and MPI settings for better performance across different job configurations.

Review Change Stack

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@HuiGao-NV HuiGao-NV requested review from a team as code owners May 27, 2026 11:14
@HuiGao-NV HuiGao-NV requested review from tburt-nv and zeroepoch May 27, 2026 11:14
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

📝 Walkthrough

Walkthrough

The PR updates jenkins/scripts/perf/local/submit.py to enhance performance test submission with SLURM GPU detection, configuration normalization, improved container argument handling, and updated accuracy test execution for aggregated and disaggregated modes.

Changes

Performance Test Submission Enhancements

Layer / File(s) Summary
GPU detection and SBATCH configuration
jenkins/scripts/perf/local/submit.py
Adds subprocess import and partition_has_gpu_gres() function to probe SLURM partitions via sinfo. Conditionally emits --gpus-per-node and --gres SBATCH flags based on detected GPU GRES availability.
Configuration path resolution and pytest folder setup
jenkins/scripts/perf/local/submit.py
Resolves config_yaml to absolute path when --config-file is provided. Computes and exports AGG_CONFIG_FOLDER and DISAGG_CONFIG_FOLDER from the config file's parent directory for pytest discovery.
Accuracy configuration standardization
jenkins/scripts/perf/local/submit.py
Introduces normalize_accuracy_config() helper to convert flat accuracy configs into nested "tasks" dictionary format. Updates accuracy config initialization to apply normalization.
Enhanced srun arguments with container mounts and MPI tuning
jenkins/scripts/perf/local/submit.py
Updates generate_srun_args() signature to accept llm_src and hardware_config. Auto-mounts llm_src into container via --container-mounts. Omits default --mpi=pmi2 for single-GPU aggregated jobs. Updates call site to pass new parameters.
Aggregated pytest launcher adjustment
jenkins/scripts/perf/local/submit.py
Conditionally removes $LLM_API_LAUNCH segment from aggregated pytest command when hardware_config.total_gpus == 1.
Disaggregated accuracy test execution
jenkins/scripts/perf/local/submit.py
Reworks disaggregated accuracy flow to derive MODEL_DIR_NAME from config, locate and invoke accuracy_runner.py, export ACCURACY_CONFIG_JSON and accuracy environment variables, and extend srun arguments with container env entries including DISAGG_SERVER_HOST.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is incomplete. It lacks a proper description section explaining the issue and solution, test coverage details, and most checklist items are unchecked. Fill in the Description section with details about why single-GPU test support and customized config file acceptance are needed. Add Test Coverage section listing relevant tests. Review and check appropriate checklist items based on the changes made.
✅ Passed checks (4 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly describes the main changes: enabling single GPU tests and accepting customized config files in submit.py.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
jenkins/scripts/perf/local/submit.py (1)

293-294: ⚡ Quick win

Catch specific exceptions instead of bare Exception.

As per coding guidelines, avoid broad exception handling. The subprocess call can raise subprocess.SubprocessError (or its subclasses like CalledProcessError), FileNotFoundError if sinfo isn't found, or OSError for other OS-level issues.

Proposed fix
-    except Exception:
+    except (subprocess.SubprocessError, FileNotFoundError, OSError):
         return False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/scripts/perf/local/submit.py` around lines 293 - 294, Replace the
broad "except Exception:" in the try/except that runs the external "sinfo"
subprocess with specific exception handlers: catch subprocess.SubprocessError
(to include CalledProcessError and related subprocess failures),
FileNotFoundError (if "sinfo" is missing), and OSError for other OS-level
issues, returning False for those specific cases; ensure you reference the
subprocess invocation of "sinfo" (the block that calls subprocess.run /
subprocess.check_output) and include the caught exception in any logging or
diagnostics to aid debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/scripts/perf/local/submit.py`:
- Line 844: Add a missing argparse definition for the --disagg-server-port
option so args.disagg_server_port exists at runtime; update the argparse setup
(the parser variable used in the argument parsing block) to add a new integer
argument named "--disagg-server-port" with a sensible default (e.g., 8000) and a
short help string indicating it is the port for the disaggregated server (used
for accuracy tests), ensuring subsequent references to args.disagg_server_port
(e.g., where the environment line sets DISAGG_SERVER_PORT) no longer raise
AttributeError.
- Around line 337-339: The substring check `llm_src not in mounts` can falsely
skip adding a mount when llm_src is a prefix of another path; instead parse the
existing mounts into their source paths and compare exactly. Split `mounts` on
commas, for each entry split on ':' and collect the left-side source paths
(normalize with os.path.abspath or os.path.normpath), then check `if llm_src and
llm_src not in source_paths:` before creating `llm_src_mount =
f"{llm_src}:{llm_src}"` and appending to `mounts`; update the logic around the
`llm_src`, `mounts`, and `llm_src_mount` variables accordingly.

---

Nitpick comments:
In `@jenkins/scripts/perf/local/submit.py`:
- Around line 293-294: Replace the broad "except Exception:" in the try/except
that runs the external "sinfo" subprocess with specific exception handlers:
catch subprocess.SubprocessError (to include CalledProcessError and related
subprocess failures), FileNotFoundError (if "sinfo" is missing), and OSError for
other OS-level issues, returning False for those specific cases; ensure you
reference the subprocess invocation of "sinfo" (the block that calls
subprocess.run / subprocess.check_output) and include the caught exception in
any logging or diagnostics to aid debugging.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9228a729-16e4-4e0b-93e0-c1bff5ede545

📥 Commits

Reviewing files that changed from the base of the PR and between 8cd28ed and 7852613.

📒 Files selected for processing (1)
  • jenkins/scripts/perf/local/submit.py

Comment thread jenkins/scripts/perf/local/submit.py
Comment thread jenkins/scripts/perf/local/submit.py
Signed-off-by: Hui Gao <huig@nvidia.com>
@HuiGao-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-4,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-3," --disable-fail-fast

@HuiGao-NV HuiGao-NV changed the title Make submit.py can run single GPU test and accept customized config file [None][chore] Make submit.py can run single GPU test and accept customized config file May 27, 2026
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50538 [ run ] triggered by Bot. Commit: 5c51268 Link to invocation

@HuiGao-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2" --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50544 [ run ] triggered by Bot. Commit: 5c51268 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50538 [ run ] completed with state ABORTED. Commit: 5c51268

Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50544 [ run ] completed with state FAILURE. Commit: 5c51268
/LLM/main/L0_MergeRequest_PR pipeline #40048 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@HuiGao-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2" --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50630 [ run ] triggered by Bot. Commit: 5c51268 Link to invocation

@HuiGao-NV HuiGao-NV force-pushed the enhance_submit_script branch from 5c51268 to 2835b14 Compare May 28, 2026 00:01
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50630 [ run ] completed with state SUCCESS. Commit: 5c51268
/LLM/main/L0_MergeRequest_PR pipeline #40122 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@HuiGao-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2" --disable-fail-fast

@HuiGao-NV HuiGao-NV requested review from QiJune and kaiyux May 28, 2026 01:37
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50652 [ run ] triggered by Bot. Commit: 2835b14 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50652 [ run ] completed with state SUCCESS. Commit: 2835b14
/LLM/main/L0_MergeRequest_PR pipeline #40144 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@HuiGao-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --post-merge --only-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50694 [ run ] triggered by Bot. Commit: 2835b14 Link to invocation

Comment thread jenkins/scripts/perf/local/submit.py
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50694 [ run ] completed with state ABORTED. Commit: 2835b14

Link to invocation

Signed-off-by: HuiGao <huig@nvidia.com>
@HuiGao-NV HuiGao-NV force-pushed the enhance_submit_script branch from 2835b14 to f0e0eb4 Compare May 29, 2026 07:09
@HuiGao-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --post-merge --only-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #51000 [ run ] triggered by Bot. Commit: f0e0eb4 Link to invocation

Copy link
Copy Markdown
Collaborator

@QiJune QiJune left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #51000 [ run ] completed with state ABORTED. Commit: f0e0eb4

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants