[#14619][perf] AutoDeploy: tune Llama-3.1-8B-Instruct-FP8 TP=2/4 config and handle CG max bs when it is unset in the yaml by MrGeva · Pull Request #14622 · NVIDIA/TensorRT-LLM

MrGeva · 2026-05-27T07:46:33Z

Adds the following knobs to the AD registry config for Llama-3.1-8B (based on the Nemotron-Nano-V3 setup, which is proven to work for FP8 dense TP > 1 inference):

- detect_sharding.allreduce_strategy: SYMM_MEM with an explicit manual tp_plan (q/k/v/o/gate/up/down). Closes the ~80 tokens/s/user gap at TP=2, c=1 vs. the PyTorch (PT) backend by removing the default-allreduce overhead.
- compile_model.piecewise_enabled: true.
- mlir_elementwise_fusion: enabled.

On B200 TP=2, ISL=OSL=1000:

- c=1 tokens/s/user: 347 -> 418 (PT: 427)
- c=8 tokens/s/user: 351 -> 397 (PT: 418)
- c=64 tokens/s/user: 254 -> 256 (PT: 260)
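
To make the size of the change easier to read, the figures above can be turned into relative numbers. The following is a minimal sketch that only reuses the values quoted in this description; the helper name `summarize` and the dictionary layout are illustrative and not part of the PR.

```python
# Per-user throughput (tokens/s/user) quoted above for B200, TP=2, ISL=OSL=1000.
# Each entry: concurrency -> (AutoDeploy before, AutoDeploy after, PyTorch backend).
results = {
    1: (347, 418, 427),
    8: (351, 397, 418),
    64: (254, 256, 260),
}


def summarize(before, after, pt):
    """Return (gain of the new config in %, remaining gap to PT in %)."""
    gain_pct = (after - before) / before * 100.0
    gap_pct = (pt - after) / pt * 100.0
    return gain_pct, gap_pct


for concurrency, (before, after, pt) in results.items():
    gain_pct, gap_pct = summarize(before, after, pt)
    print(f"c={concurrency}: +{gain_pct:.1f}% vs. old config, {gap_pct:.1f}% behind PT")
```

The gain is largest at low concurrency, where the allreduce overhead is a larger share of each step, and nearly vanishes at c=64.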

Accuracy on MMLU/GSM8K with the new config is in line with the reference for nvidia/Llama-3.1-8B-Instruct-FP8 (MMLU 67.30 vs. ref 67.87, GSM8K 74.41 vs. ref 72.85).

Summary by CodeRabbit

Bug Fixes
- Fixed batch size validation in auto-deployment to automatically adjust CUDA graph configuration when batch sizes exceed configured defaults.
Improvements
- Enhanced default deployment configuration with optimized prefill handling, improved attention backend settings, and refined tensor parallelism sharding strategies for better performance and stability.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

MrGeva · 2026-05-27T16:12:55Z

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

coderabbitai · 2026-05-27T16:18:57Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bed1fedb-d3e7-404d-b032-f69f6aba43ea

📥 Commits

Reviewing files that changed from the base of the PR and between 46bf87c and 32fef39.

📒 Files selected for processing (2)

examples/auto_deploy/model_registry/configs/llama3_1_8b.yaml
tensorrt_llm/_torch/auto_deploy/llm_args.py

📝 Walkthrough

Walkthrough

This PR adds automatic cuda_graph_config sizing to LlmArgs and updates the Llama 3.1 8B model registry configuration to use explicit cuda_graph_config with batch sizes, manual sharding strategies, and compilation optimizations including piecewise prefill and elementwise fusion.

Changes

Llama 3.1 8B Deployment with CUDA Graph Configuration

| Layer / File(s) | Summary |
| --- | --- |
| LlmArgs cuda_graph_config validator `tensorrt_llm/_torch/auto_deploy/llm_args.py` | Added `@model_validator(mode="after")` method `extend_default_cuda_graph_config_to_max_batch_size` that rebuilds `cuda_graph_config` with the top-level `max_batch_size` when the configured batch size is smaller and the config was not explicitly provided by the user, preserving `enable_padding` to regenerate `batch_sizes` heuristically. |
| Llama 3.1 8B model configuration `examples/auto_deploy/model_registry/configs/llama3_1_8b.yaml` | Updated top-level model settings to include `cuda_graph_config` with explicit `batch_sizes`, expanded `transforms.detect_sharding` with `manual_config` specifying `head_dim` and `tp_plan` for sharding projection modules, and added `compile_model.piecewise_enabled: true` alongside `mlir_elementwise_fusion` settings for improved compilation. |

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#14352: Updates AutoDeploy test harnesses to derive yaml_extra from the model registry and toggle transforms.compile_model.piecewise_enabled based on enable_chunked_prefill, which aligns with this PR's changes to the model YAML's enable_chunked_prefill and compile_model.piecewise_enabled configuration.

Suggested reviewers

galagam
hnover-nv

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ❓ Inconclusive | The PR description provides implementation details and performance metrics, but the formal description template sections (Description, Test Coverage) are mostly empty. | Fill in the 'Description' section with a concise explanation of what and why, and the 'Test Coverage' section listing relevant tests that safeguard the changes. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly identifies the main change: tuning Llama-3.1-8B-Instruct-FP8 TP=2/4 config and handling CUDA graph max batch size, which aligns with the PR's primary objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


tensorrt-cicd · 2026-05-27T16:19:14Z

PR_Github #50575 [ run ] triggered by Bot. Commit: 32fef39 Link to invocation

tensorrt-cicd · 2026-05-27T23:02:48Z

PR_Github #50575 [ run ] completed with state FAILURE. Commit: 32fef39
/LLM/main/L0_MergeRequest_PR pipeline #40076 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Adds the following knobs to the AD registry config for Llama-3.1-8B (based on Nemotron-Nano-V3 setup that's proven to work for FP8 dense TP > 1 inference):

- detect_sharding.allreduce_strategy: SYMM_MEM with an explicit manual tp_plan (q/k/v/o/gate/up/down). Closes the ~80 tps/u TP=2 c=1 gap vs the PyTorch backend by removing the default-allreduce overhead.
- compile_model.piecewise_enabled: true.
- mlir_elementwise_fusion: enabled.

On B200 TP=2, ISL=OSL=1000:

- c=1 tokens/s/user: 347 -> 418 (PT: 427)
- c=8 tokens/s/user: 351 -> 397 (PT: 418)
- c=64 tokens/s/user: 254 -> 256 (PT: 260)

Accuracy on MMLU/GSM8K with the new config matches the reference for nvidia/Llama-3.1-8B-Instruct-FP8 (MMLU 67.30 vs ref 67.87, GSM8K 74.41 vs ref 72.85).

Adds a TestLlama3_1_8B_Instruct_FP8 accuracy test and registers the TP=4 trtllm entry in the H100 post-merge list.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>

The previous Llama-3.1-8B-Instruct-FP8 yaml uses the AD default cuda graph capture set (max bs=128) but sets ``max_batch_size: 256``. Any runtime batch > 128 falls back to eager mode and pays ~2x ITL. Adding bs=192 and bs=256 to ``cuda_graph_config.batch_sizes`` makes those iterations cuda-graphable.

Measured on B200 TP=2 ISL=OSL=1000:

- c=256 ITL: 24.10 -> 13.20 ms (-45%, AD now 2x faster than PT @ c=256)
- c<128 unchanged.

Capture time at startup goes up by ~5 s (2 extra graphs), which is negligible vs. the runtime win.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>
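
The eager fallback described in this commit message can be pictured as a lookup over the captured batch sizes. This is a minimal sketch, not the AutoDeploy runtime: `pick_captured_batch_size` is a hypothetical helper, it assumes a batch may be padded up to the next captured size, and the `old_sizes` list is an illustrative stand-in for a capture set capped at 128 (the real default list is not shown in this PR).

```python
from bisect import bisect_left
from typing import Optional, Sequence


def pick_captured_batch_size(runtime_bs: int, captured: Sequence[int]) -> Optional[int]:
    """Return the smallest captured batch size >= runtime_bs.

    Returns None when no captured graph is large enough, which stands in
    for the eager-mode fallback. Hypothetical helper for illustration only.
    """
    sizes = sorted(captured)
    idx = bisect_left(sizes, runtime_bs)
    return sizes[idx] if idx < len(sizes) else None


# Illustrative capture set capped at 128 (assumed values, not the real default).
old_sizes = [1, 2, 4, 8, 16, 32, 64, 128]
# The two sizes this commit adds to cuda_graph_config.batch_sizes.
new_sizes = old_sizes + [192, 256]

for bs in (100, 150, 256):
    print(bs, pick_captured_batch_size(bs, old_sizes), pick_captured_batch_size(bs, new_sizes))
```

With the capped list, any batch above 128 has no graph to replay; with 192 and 256 added, the same batches map to a captured size.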

…_batch_size

``CudaGraphConfig.validate_cuda_graph_config`` falls back to a hard-coded max_batch_size of 128 when the user has not set ``cuda_graph_config`` explicitly. The top-level ``LlmArgs.max_batch_size`` is not propagated, so any model with ``max_batch_size > 128`` silently runs in eager mode for the larger batches and roughly doubles ITL at those batch sizes.

Add a model_validator on the AutoDeploy LlmArgs that, when the user has not set ``cuda_graph_config`` and its max_batch_size is smaller than the top-level value, rebuilds ``cuda_graph_config`` with the larger max so the heuristic regenerates the batch_sizes list (e.g. ``[..., 256]`` instead of ``[..., 128]`` for ``max_batch_size: 256``). Explicit user configuration (``cuda_graph_config`` set in yaml or kwargs) is preserved untouched.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>
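
The rule in this commit message can be sketched with plain dataclasses. This is not the TensorRT-LLM code: the real change is a pydantic `model_validator` on the AutoDeploy `LlmArgs`, whereas the classes `CudaGraphConfigSketch` and `LlmArgsSketch`, the `heuristic_batch_sizes` function (powers of two plus the max), and the top-level default of 64 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

DEFAULT_CG_MAX_BATCH_SIZE = 128  # the hard-coded fallback described above


def heuristic_batch_sizes(max_batch_size: int) -> List[int]:
    """Stand-in heuristic (an assumption): powers of two below the max, then the max."""
    sizes, bs = [], 1
    while bs < max_batch_size:
        sizes.append(bs)
        bs *= 2
    sizes.append(max_batch_size)
    return sizes


@dataclass
class CudaGraphConfigSketch:
    max_batch_size: int = DEFAULT_CG_MAX_BATCH_SIZE
    enable_padding: bool = False
    batch_sizes: List[int] = field(default_factory=list)

    def __post_init__(self):
        # Generate the list only when the caller did not provide one.
        if not self.batch_sizes:
            self.batch_sizes = heuristic_batch_sizes(self.max_batch_size)


@dataclass
class LlmArgsSketch:
    max_batch_size: int = 64  # placeholder default, not the real one
    cuda_graph_config: Optional[CudaGraphConfigSketch] = None

    def __post_init__(self):
        # In this sketch, "not set by the user" is modeled as None.
        user_set = self.cuda_graph_config is not None
        if not user_set:
            self.cuda_graph_config = CudaGraphConfigSketch()
        cfg = self.cuda_graph_config
        # Rebuild only the default config, and only when it is too small.
        if not user_set and cfg.max_batch_size < self.max_batch_size:
            self.cuda_graph_config = CudaGraphConfigSketch(
                max_batch_size=self.max_batch_size,
                enable_padding=cfg.enable_padding,
            )


print(LlmArgsSketch(max_batch_size=256).cuda_graph_config.batch_sizes)
```

An explicitly passed `CudaGraphConfigSketch` is left untouched even when its largest batch size is below the top-level `max_batch_size`, mirroring the "preserved untouched" behavior the commit describes.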

…uracy test

The previous commit added a dedicated ``TestLlama3_1_8B_Instruct_FP8`` class and a matching ``l0_dgx_h100.yml`` entry. Both are redundant: ``TestModelRegistryAccuracy::test_autodeploy_from_registry`` already covers ``nvidia/Llama-3.1-8B-Instruct-FP8`` via its parametrized ``MODEL_REGISTRY_ACCURACY_PARAMS`` list, loads the ``examples/auto_deploy/model_registry/configs/llama3_1_8b.yaml`` bundle through ``_get_registry_yaml_extra``, and is already wired into ``l0_dgx_h100.yml`` post-merge as ``[nvidia_Llama-3.1-8B-Instruct-FP8-True]``. The yaml tuning landed in this PR is picked up automatically by that existing entry.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>

…y match defaults

Mirrors the nano_v3.yaml cleanup. Removes entries that restated the AutoDeploy default verbatim:

- runtime, model_factory, compile_backend (LlmArgs defaults)
- kv_cache_config.free_gpu_memory_fraction (KvCacheConfig default)
- fuse_silu_mul.enabled (default.yaml already sets enabled: true)
- mlir_elementwise_fusion.stage, run_shape_prop, bypass_ops

Verified via ``LlmArgs(yaml_extra=['llama3_1_8b.yaml'])`` that every effective field is unchanged after the trim. Quant- and TP-specific overrides (``kv_cache_config.dtype: fp8``, ``attn_backend: trtllm``, ``cuda_graph_config.batch_sizes``, ``detect_sharding`` overrides, the ``fuse_*: enabled: true`` toggles, ``compile_model.piecewise_enabled``, ``mlir_elementwise_fusion.enabled``, ``fuse_silu_mul.backend: trtllm``) are preserved.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>
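
The trim-and-verify step in this commit message can be approximated on plain dictionaries. This is a minimal sketch under stated assumptions: `prune_defaults` and `merge` are hypothetical helpers, and the `defaults` values below are placeholders chosen for the example rather than the actual AutoDeploy defaults. The check at the end stands in for comparing effective fields before and after the trim.

```python
from typing import Any, Dict


def prune_defaults(overrides: Dict[str, Any], defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Drop override entries whose value equals the default at the same nested path."""
    kept: Dict[str, Any] = {}
    for key, value in overrides.items():
        default = defaults.get(key)
        if isinstance(value, dict) and isinstance(default, dict):
            nested = prune_defaults(value, default)
            if nested:
                kept[key] = nested
        elif key not in defaults or value != default:
            kept[key] = value
    return kept


def merge(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively overlay overrides on defaults to get the effective config."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Placeholder defaults (assumed values, for illustration only).
defaults = {
    "kv_cache_config": {"free_gpu_memory_fraction": 0.9, "dtype": "auto"},
    "fuse_silu_mul": {"enabled": True, "backend": "default"},
}
# Overrides that partly restate the defaults, as the old yaml did.
overrides = {
    "kv_cache_config": {"free_gpu_memory_fraction": 0.9, "dtype": "fp8"},
    "fuse_silu_mul": {"enabled": True, "backend": "trtllm"},
}

trimmed = prune_defaults(overrides, defaults)
print(trimmed)
# The effective config must be identical before and after the trim.
assert merge(defaults, trimmed) == merge(defaults, overrides)
```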

MrGeva · 2026-05-28T13:59:08Z

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

tensorrt-cicd · 2026-05-28T14:05:43Z

PR_Github #50790 [ run ] triggered by Bot. Commit: e672760 Link to invocation

tensorrt-cicd · 2026-05-28T20:38:00Z

PR_Github #50790 [ run ] completed with state FAILURE. Commit: e672760
/LLM/main/L0_MergeRequest_PR pipeline #40265 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

MrGeva · 2026-05-29T04:05:17Z

/bot run

tensorrt-cicd · 2026-05-29T04:11:03Z

PR_Github #50958 [ run ] triggered by Bot. Commit: e672760 Link to invocation

tensorrt-cicd · 2026-05-29T04:41:31Z

PR_Github #50958 [ run ] completed with state SUCCESS. Commit: e672760
/LLM/main/L0_MergeRequest_PR pipeline #40415 completed with status: 'SUCCESS'

CI Report

Link to invocation

github-actions Bot assigned MrGeva May 27, 2026

MrGeva mentioned this pull request May 27, 2026

[Feature]: LLAMA 3.1 8B bad perf when scaling #14619

Open

1 task

MrGeva changed the title ~~[#14619][perf] AutoDeploy: tune Llama-3.1-8B-Instruct-FP8 TP=2/4 config~~ [#14619][perf] AutoDeploy: tune Llama-3.1-8B-Instruct-FP8 TP=2/4 config and handle CG max bs when it is unset in the yaml May 27, 2026

MrGeva marked this pull request as ready for review May 27, 2026 16:12

MrGeva requested a review from a team as a code owner May 27, 2026 16:12

MrGeva requested a review from hnover-nv May 27, 2026 16:12

MrGeva enabled auto-merge (squash) May 27, 2026 16:13

suyoggupta approved these changes May 27, 2026


MrGeva added 5 commits May 28, 2026 16:59

MrGeva force-pushed the egeva/llama_3_1_8b_fp8_perf branch from 32fef39 to e672760 Compare May 28, 2026 13:59

MrGeva merged commit 566c226 into NVIDIA:main May 29, 2026
7 checks passed
