[#14619][perf] AutoDeploy: tune Llama-3.1-8B-Instruct-FP8 TP=2/4 config and handle CG max bs when it is unset in the yaml #14622

Merged
MrGeva merged 5 commits into NVIDIA:main from nv-auto-deploy:egeva/llama_3_1_8b_fp8_perf
May 29, 2026

@MrGeva
Collaborator

@MrGeva MrGeva commented May 27, 2026

Adds the following knobs to the AD registry config for Llama-3.1-8B (based on the Nemotron-Nano-V3 setup, which is proven to work for FP8 dense inference at TP > 1):

  • detect_sharding.allreduce_strategy: SYMM_MEM with an explicit manual tp_plan (q/k/v/o/gate/up/down). Narrows the TP=2 c=1 gap vs the PyTorch backend from ~80 to ~9 tokens/s/user by removing the default-allreduce overhead.
  • compile_model.piecewise_enabled: true.
  • mlir_elementwise_fusion: enabled.
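As a rough illustration of why a manual tp_plan over the projection modules keeps allreduce traffic low, the sketch below assumes the common Megatron-style layout: the up projection is split by columns and the down projection by rows, so each rank computes independently and a single sum-reduction recovers the full result. The layout, the tiny integer weights, and every helper name here are assumptions for illustration only; none of them are taken from the PR's YAML.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise activation; it commutes with a column split."""
    return [[max(0, v) for v in row] for row in m]


def split_cols(w, parts):
    """Column-wise shard (assumed layout for up/gate/q/k/v)."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def split_rows(w, parts):
    """Row-wise shard (assumed layout for down/o)."""
    step = len(w) // parts
    return [w[i * step:(i + 1) * step] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for the allreduce: elementwise sum of per-rank outputs."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


# Hypothetical toy weights: hidden size 3, intermediate size 4, output size 2.
x = [[1, -2, 3]]
w_up = [[1, 0, -1, 2], [0, 1, 1, -1], [2, -1, 0, 1]]
w_down = [[1, 2], [0, -1], [3, 1], [-2, 0]]

reference = matmul(relu(matmul(x, w_up)), w_down)

tp = 2
partials = [
    matmul(relu(matmul(x, wu)), wd)
    for wu, wd in zip(split_cols(w_up, tp), split_rows(w_down, tp))
]
sharded = all_reduce_sum(partials)
print(reference, sharded)
```

Under this assumed layout, one reduction per MLP block (and likewise one per attention block) is enough, which is why the cost of that single allreduce matters so much at low concurrency.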

On B200 TP=2, ISL=OSL=1000:

  • c=1 tokens/s/user: 347 -> 418 (PT: 427)
  • c=8 tokens/s/user: 351 -> 397 (PT: 418)
  • c=64 tokens/s/user: 254 -> 256 (PT: 260)
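The numbers above can be turned into relative gains and remaining gaps with a few lines of arithmetic. This is only a convenience sketch over the figures quoted in this description; the function and field names are made up for illustration.

```python
# (concurrency, AutoDeploy before, AutoDeploy after, PyTorch backend),
# all in tokens/s/user, copied from the B200 TP=2 results above.
RESULTS = [(1, 347, 418, 427), (8, 351, 397, 418), (64, 254, 256, 260)]


def summarize(before, after, pt):
    """Relative gain of the new config and the gap to the PyTorch backend."""
    return {
        "gain_pct": round(100 * (after - before) / before, 1),
        "gap_before": pt - before,
        "gap_after": pt - after,
    }


for concurrency, before, after, pt in RESULTS:
    print(f"c={concurrency}: {summarize(before, after, pt)}")
```

At c=1 the gap to the PyTorch backend shrinks from 80 to 9 tokens/s/user, which is the "~80 tps/u" gap the first bullet refers to.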

Accuracy on MMLU/GSM8K with the new config matches the reference for nvidia/Llama-3.1-8B-Instruct-FP8 (MMLU 67.30 vs ref 67.87, GSM8K 74.41 vs ref 72.85).

Summary by CodeRabbit

  • Bug Fixes

    • Fixed batch size validation in auto-deployment to automatically adjust CUDA graph configuration when batch sizes exceed configured defaults.
  • Improvements

    • Enhanced default deployment configuration with optimized prefill handling, improved attention backend settings, and refined tensor parallelism sharding strategies for better performance and stability.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@MrGeva MrGeva changed the title from "[#14619][perf] AutoDeploy: tune Llama-3.1-8B-Instruct-FP8 TP=2/4 config" to "[#14619][perf] AutoDeploy: tune Llama-3.1-8B-Instruct-FP8 TP=2/4 config and handle CG max bs when it is unset in the yaml" May 27, 2026
@MrGeva MrGeva marked this pull request as ready for review May 27, 2026 16:12
@MrGeva MrGeva requested a review from a team as a code owner May 27, 2026 16:12
@MrGeva MrGeva requested a review from hnover-nv May 27, 2026 16:12
@MrGeva
Collaborator Author

MrGeva commented May 27, 2026

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@MrGeva MrGeva enabled auto-merge (squash) May 27, 2026 16:13
@coderabbitai
Contributor

coderabbitai Bot commented May 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bed1fedb-d3e7-404d-b032-f69f6aba43ea

📥 Commits

Reviewing files that changed from the base of the PR and between 46bf87c and 32fef39.

📒 Files selected for processing (2)
  • examples/auto_deploy/model_registry/configs/llama3_1_8b.yaml
  • tensorrt_llm/_torch/auto_deploy/llm_args.py

📝 Walkthrough


This PR adds automatic cuda_graph_config sizing to LlmArgs and updates the Llama 3.1 8B model registry configuration to use explicit cuda_graph_config with batch sizes, manual sharding strategies, and compilation optimizations including piecewise prefill and elementwise fusion.

Changes

Llama 3.1 8B Deployment with CUDA Graph Configuration

| Layer / File(s) | Summary |
|---|---|
| LlmArgs cuda_graph_config validator (tensorrt_llm/_torch/auto_deploy/llm_args.py) | Added @model_validator(mode="after") method extend_default_cuda_graph_config_to_max_batch_size that rebuilds cuda_graph_config with the top-level max_batch_size when the configured batch size is smaller and the config was not explicitly provided by the user, preserving enable_padding to regenerate batch_sizes heuristically. |
| Llama 3.1 8B model configuration (examples/auto_deploy/model_registry/configs/llama3_1_8b.yaml) | Updated top-level model settings to include cuda_graph_config with explicit batch_sizes, expanded transforms.detect_sharding with manual_config specifying head_dim and tp_plan for sharding projection modules, and added compile_model.piecewise_enabled: true alongside mlir_elementwise_fusion settings for improved compilation. |
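The validator behaviour summarized above can be sketched with plain dataclasses. This is not the code in llm_args.py: the class names, the use of `None` to mean "not set by the user", and the power-of-two batch-size heuristic are all assumptions made for illustration; only the 128 fallback and the "extend when the top-level max is larger, leave explicit configs alone" rule come from the PR text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

DEFAULT_CG_MAX_BATCH_SIZE = 128  # fallback value described in the PR


def heuristic_batch_sizes(max_bs: int) -> List[int]:
    """Hypothetical heuristic: powers of two up to max_bs, plus max_bs itself."""
    sizes = [1]
    while sizes[-1] * 2 <= max_bs:
        sizes.append(sizes[-1] * 2)
    if sizes[-1] != max_bs:
        sizes.append(max_bs)
    return sizes


@dataclass
class CudaGraphConfigSketch:
    max_batch_size: int = DEFAULT_CG_MAX_BATCH_SIZE
    enable_padding: bool = False
    batch_sizes: List[int] = field(default_factory=list)

    def __post_init__(self):
        # Regenerate the capture list only when none was given explicitly.
        if not self.batch_sizes:
            self.batch_sizes = heuristic_batch_sizes(self.max_batch_size)


@dataclass
class LlmArgsSketch:
    max_batch_size: int = 64
    # None stands in for "the user did not set cuda_graph_config".
    cuda_graph_config: Optional[CudaGraphConfigSketch] = None

    def __post_init__(self):
        if self.cuda_graph_config is not None:
            return  # explicit user configuration is preserved untouched
        default_cfg = CudaGraphConfigSketch()
        if default_cfg.max_batch_size < self.max_batch_size:
            # Rebuild with the larger max, keeping enable_padding as it was.
            default_cfg = CudaGraphConfigSketch(
                max_batch_size=self.max_batch_size,
                enable_padding=default_cfg.enable_padding,
            )
        self.cuda_graph_config = default_cfg


print(LlmArgsSketch(max_batch_size=256).cuda_graph_config.batch_sizes)
```

The real validator operates on pydantic models; the early return for an explicit config is the part that matters for users who tune `batch_sizes` by hand.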

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#14352: Updates AutoDeploy test harnesses to derive yaml_extra from the model registry and toggle transforms.compile_model.piecewise_enabled based on enable_chunked_prefill, which aligns with this PR's changes to the model YAML's enable_chunked_prefill and compile_model.piecewise_enabled configuration.

Suggested reviewers

  • galagam
  • hnover-nv
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ❓ Inconclusive | The PR description provides implementation details and performance metrics, but the formal description template sections (Description, Test Coverage) are mostly empty. | Fill in the 'Description' section with a concise explanation of what and why, and the 'Test Coverage' section listing relevant tests that safeguard the changes. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the main change: tuning Llama-3.1-8B-Instruct-FP8 TP=2/4 config and handling CUDA graph max batch size, which aligns with the PR's primary objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@tensorrt-cicd
Collaborator

PR_Github #50575 [ run ] triggered by Bot. Commit: 32fef39 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50575 [ run ] completed with state FAILURE. Commit: 32fef39
/LLM/main/L0_MergeRequest_PR pipeline #40076 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

MrGeva added 5 commits May 28, 2026 16:59
Adds the following knobs to the AD registry config for Llama-3.1-8B
(based on Nemotron-Nano-V3 setup that's proven to work for FP8 dense
TP > 1 inference):

- detect_sharding.allreduce_strategy: SYMM_MEM with an explicit manual
  tp_plan (q/k/v/o/gate/up/down). Closes the ~80 tps/u TP=2 c=1 gap
  vs the PyTorch backend by removing the default-allreduce overhead.
- compile_model.piecewise_enabled: true.
- mlir_elementwise_fusion: enabled.

On B200 TP=2, ISL=OSL=1000:
- c=1   tokens/s/user: 347 -> 418 (PT: 427)
- c=8   tokens/s/user: 351 -> 397 (PT: 418)
- c=64  tokens/s/user: 254 -> 256 (PT: 260)

Accuracy on MMLU/GSM8K with the new config matches the reference for
nvidia/Llama-3.1-8B-Instruct-FP8 (MMLU 67.30 vs ref 67.87, GSM8K
74.41 vs ref 72.85). Adds a TestLlama3_1_8B_Instruct_FP8 accuracy
test and registers the TP=4 trtllm entry in the H100 post-merge list.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>

The previous Llama-3.1-8B-Instruct-FP8 yaml uses the AD default cuda
graph capture set (max bs=128) but sets ``max_batch_size: 256``. Any
runtime batch > 128 falls back to eager mode and pays ~2x ITL.

Adding bs=192 and bs=256 to ``cuda_graph_config.batch_sizes`` makes
those iterations cuda-graphable. Measured on B200 TP=2 ISL=OSL=1000:

  c=256 ITL: 24.10 -> 13.20 ms  (-45%, AD now 2x faster than PT @ c=256)

c<128 unchanged. Capture time at startup goes up by ~5 s (2 extra
graphs) which is negligible vs. the runtime win.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>
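
The eager-mode fallback described in the commit above can be illustrated with a small lookup: a runtime batch is served by a captured graph only if some captured size is at least as large. The sketch assumes padding up to the next captured size, and the list of smaller default sizes is hypothetical; only the default maximum of 128 and the added 192 and 256 entries come from the commit message.

```python
import bisect


def captured_batch_for(batch_size, captured_sizes):
    """Smallest captured size >= batch_size, or None for eager-mode fallback."""
    sizes = sorted(captured_sizes)
    i = bisect.bisect_left(sizes, batch_size)
    return sizes[i] if i < len(sizes) else None


# Hypothetical default capture set; only its maximum (128) is from the source.
default_sizes = [1, 2, 4, 8, 16, 32, 64, 128]
tuned_sizes = default_sizes + [192, 256]

for bs in (100, 150, 200, 256):
    print(bs, captured_batch_for(bs, default_sizes), captured_batch_for(bs, tuned_sizes))
```

With the default set, every batch above 128 returns `None` and runs eagerly; with the two extra entries, batches up to 256 map to a captured graph.
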
…_batch_size

``CudaGraphConfig.validate_cuda_graph_config`` falls back to a hard-coded
max_batch_size of 128 when the user has not set ``cuda_graph_config``
explicitly. The top-level ``LlmArgs.max_batch_size`` is not propagated,
so any model with ``max_batch_size > 128`` silently runs in eager mode
for the larger batches and roughly doubles ITL at those batch sizes.

Add a model_validator on the AutoDeploy LlmArgs that, when the user has
not set ``cuda_graph_config`` and its max_batch_size is smaller than the
top-level value, rebuilds ``cuda_graph_config`` with the larger max so
the heuristic regenerates the batch_sizes list (e.g. ``[..., 256]``
instead of ``[..., 128]`` for ``max_batch_size: 256``).

Explicit user configuration (``cuda_graph_config`` set in yaml or kwargs)
is preserved untouched.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>

…uracy test

The previous commit added a dedicated ``TestLlama3_1_8B_Instruct_FP8``
class and a matching ``l0_dgx_h100.yml`` entry. Both are redundant:
``TestModelRegistryAccuracy::test_autodeploy_from_registry`` already
covers ``nvidia/Llama-3.1-8B-Instruct-FP8`` via its parametrized
``MODEL_REGISTRY_ACCURACY_PARAMS`` list, loads the
``examples/auto_deploy/model_registry/configs/llama3_1_8b.yaml``
bundle through ``_get_registry_yaml_extra``, and is already wired into
``l0_dgx_h100.yml`` post-merge as
``[nvidia_Llama-3.1-8B-Instruct-FP8-True]``. The yaml tuning that lands
in this PR is picked up automatically by that existing entry.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>

…y match defaults

Mirrors the nano_v3.yaml cleanup. Removes entries that restated the
AutoDeploy default verbatim:

- runtime, model_factory, compile_backend (LlmArgs defaults)
- kv_cache_config.free_gpu_memory_fraction (KvCacheConfig default)
- fuse_silu_mul.enabled (default.yaml already sets enabled: true)
- mlir_elementwise_fusion.stage, run_shape_prop, bypass_ops

Verified via ``LlmArgs(yaml_extra=['llama3_1_8b.yaml'])`` that every
effective field is unchanged after the trim. Quant- and TP-specific
overrides (``kv_cache_config.dtype: fp8``, ``attn_backend: trtllm``,
``cuda_graph_config.batch_sizes``, ``detect_sharding`` overrides, the
``fuse_*: enabled: true`` toggles, ``compile_model.piecewise_enabled``,
``mlir_elementwise_fusion.enabled``, ``fuse_silu_mul.backend: trtllm``)
are preserved.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>
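
The "effective fields unchanged after the trim" check from the last commit can be mimicked on plain dictionaries: merge the overrides onto the defaults before and after removing redundant entries, then compare. The helper names and the default values below are hypothetical stand-ins, not the actual AutoDeploy defaults or the LlmArgs loading path.

```python
def deep_merge(base, override):
    """Recursively overlay override onto base and return a new dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def trim_redundant(override, base):
    """Drop override entries that restate the value already in base."""
    trimmed = {}
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            sub = trim_redundant(value, base[key])
            if sub:
                trimmed[key] = sub
        elif key not in base or base[key] != value:
            trimmed[key] = value
    return trimmed


# Hypothetical defaults and model overrides, for illustration only.
defaults = {
    "runtime": "trtllm",
    "kv_cache_config": {"free_gpu_memory_fraction": 0.9, "dtype": "auto"},
    "transforms": {"fuse_silu_mul": {"enabled": True, "backend": "torch"}},
}
override = {
    "runtime": "trtllm",
    "kv_cache_config": {"free_gpu_memory_fraction": 0.9, "dtype": "fp8"},
    "transforms": {"fuse_silu_mul": {"enabled": True, "backend": "trtllm"}},
}

trimmed = trim_redundant(override, defaults)
print(trimmed)
assert deep_merge(defaults, trimmed) == deep_merge(defaults, override)
```

Only the entries that differ from the defaults survive the trim, which mirrors how the quant- and TP-specific overrides are kept while verbatim restatements are dropped.
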
@MrGeva MrGeva force-pushed the egeva/llama_3_1_8b_fp8_perf branch from 32fef39 to e672760 Compare May 28, 2026 13:59
@MrGeva
Collaborator Author

MrGeva commented May 28, 2026

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #50790 [ run ] triggered by Bot. Commit: e672760 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50790 [ run ] completed with state FAILURE. Commit: e672760
/LLM/main/L0_MergeRequest_PR pipeline #40265 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@MrGeva
Collaborator Author

MrGeva commented May 29, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #50958 [ run ] triggered by Bot. Commit: e672760 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50958 [ run ] completed with state SUCCESS. Commit: e672760
/LLM/main/L0_MergeRequest_PR pipeline #40415 completed with status: 'SUCCESS'

CI Report

Link to invocation

@MrGeva MrGeva merged commit 566c226 into NVIDIA:main May 29, 2026
7 checks passed

3 participants