[#14619][perf] AutoDeploy: tune Llama-3.1-8B-Instruct-FP8 TP=2/4 config and handle CG max bs when it is unset in the yaml #14622
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast |
PR_Github #50575 [ run ] triggered by Bot. Commit: |
|
PR_Github #50575 [ run ] completed with state
|
Adds the following knobs to the AD registry config for Llama-3.1-8B (based on Nemotron-Nano-V3 setup that's proven to work for FP8 dense TP > 1 inference):

- detect_sharding.allreduce_strategy: SYMM_MEM with an explicit manual tp_plan (q/k/v/o/gate/up/down). Closes the ~80 tps/u TP=2 c=1 gap vs the PyTorch backend by removing the default-allreduce overhead.
- compile_model.piecewise_enabled: true.
- mlir_elementwise_fusion: enabled.

On B200 TP=2, ISL=OSL=1000:

- c=1 tokens/s/user: 347 -> 418 (PT: 427)
- c=8 tokens/s/user: 351 -> 397 (PT: 418)
- c=64 tokens/s/user: 254 -> 256 (PT: 260)

Accuracy on MMLU/GSM8K with the new config matches the reference for nvidia/Llama-3.1-8B-Instruct-FP8 (MMLU 67.30 vs ref 67.87, GSM8K 74.41 vs ref 72.85).

Adds a TestLlama3_1_8B_Instruct_FP8 accuracy test and registers the TP=4 trtllm entry in the H100 post-merge list.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>
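To put the throughput numbers above in relative terms, here is a small sketch that computes the gain over the old config and the fraction of PyTorch-backend (PT) throughput reached. The figures are copied from the commit message; the `RESULTS` table and `summarize` helper are illustrative names, not part of the PR.

```python
# Tokens/s/user on B200 TP=2, ISL=OSL=1000, copied from the commit message:
# concurrency -> (AutoDeploy before, AutoDeploy after, PyTorch backend)
RESULTS = {
    1: (347, 418, 427),
    8: (351, 397, 418),
    64: (254, 256, 260),
}


def summarize(before, after, pytorch):
    """Return (gain over the old config in %, share of PT throughput in %)."""
    gain_pct = (after - before) / before * 100.0
    of_pytorch_pct = after / pytorch * 100.0
    return gain_pct, of_pytorch_pct


for concurrency, (before, after, pytorch) in RESULTS.items():
    gain, share = summarize(before, after, pytorch)
    gap_before = pytorch - before  # absolute gap to PT with the old config
    gap_after = pytorch - after    # absolute gap to PT with the new config
    print(
        f"c={concurrency}: {gain:+.1f}% vs. old config, {share:.1f}% of PT, "
        f"gap to PT {gap_before} -> {gap_after} tokens/s/user"
    )
```

At c=1 the old gap to PT is 427 - 347 = 80 tokens/s/user, which is the "~80 tps/u" gap the first bullet refers to.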
The previous Llama-3.1-8B-Instruct-FP8 yaml used the AD default CUDA graph capture set (max bs=128) but set ``max_batch_size: 256``. Any runtime batch > 128 falls back to eager mode and pays ~2x ITL. Adding bs=192 and bs=256 to ``cuda_graph_config.batch_sizes`` makes those iterations cuda-graphable.

Measured on B200 TP=2 ISL=OSL=1000:

- c=256 ITL: 24.10 -> 13.20 ms (-45%, AD now 2x faster than PT @ c=256)
- c<128 unchanged.

Capture time at startup goes up by ~5 s (2 extra graphs), which is negligible vs. the runtime win.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>
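The eager-mode fallback described above can be sketched as a lookup against the captured batch sizes. This is a simplified model, not the AutoDeploy implementation: it assumes a runtime batch is padded up to the next captured size, and the `graph_batch_size` helper and the `old_sizes` list are hypothetical. The only facts taken from the commit are the 128 cap, the added 192 and 256 entries, and the ITL numbers.

```python
from bisect import bisect_left


def graph_batch_size(runtime_bs, captured_sizes):
    """Return the captured size that would serve runtime_bs, or None for eager mode.

    Assumption: the batch is padded up to the next captured size.
    """
    sizes = sorted(captured_sizes)
    i = bisect_left(sizes, runtime_bs)
    return sizes[i] if i < len(sizes) else None


# Illustrative capture set capped at 128 (the real default list may differ).
old_sizes = [1, 2, 4, 8, 16, 32, 64, 128]
# The commit adds bs=192 and bs=256.
new_sizes = old_sizes + [192, 256]

for bs in (100, 129, 200, 256):
    print(bs, graph_batch_size(bs, old_sizes), graph_batch_size(bs, new_sizes))

# ITL change at c=256, from the numbers in the commit message.
itl_before_ms, itl_after_ms = 24.10, 13.20
change_pct = (itl_after_ms - itl_before_ms) / itl_before_ms * 100.0
print(f"ITL change at c=256: {change_pct:.0f}%")
```

Under this model every batch above 128 maps to `None` (eager) with the old set and to 192 or 256 with the new one, which is where the measured ITL drop comes from.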
…_batch_size

``CudaGraphConfig.validate_cuda_graph_config`` falls back to a hard-coded max_batch_size of 128 when the user has not set ``cuda_graph_config`` explicitly. The top-level ``LlmArgs.max_batch_size`` is not propagated, so any model with ``max_batch_size > 128`` silently runs in eager mode for the larger batches and roughly doubles ITL at those batch sizes.

Add a model_validator on the AutoDeploy LlmArgs that, when the user has not set ``cuda_graph_config`` and its max_batch_size is smaller than the top-level value, rebuilds ``cuda_graph_config`` with the larger max so the heuristic regenerates the batch_sizes list (e.g. ``[..., 256]`` instead of ``[..., 128]`` for ``max_batch_size: 256``).

Explicit user configuration (``cuda_graph_config`` set in yaml or kwargs) is preserved untouched.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>
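The validator logic described above can be sketched with plain dataclasses. This is not the PR's code: the real change is a pydantic model_validator on the AutoDeploy LlmArgs, whereas this sketch uses `__post_init__` to stay dependency-free. `CudaGraphConfigSketch`, `LlmArgsSketch`, and `generate_batch_sizes` are hypothetical names, and the powers-of-two heuristic is a stand-in for whatever list the real heuristic generates.

```python
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_CG_MAX_BATCH_SIZE = 128  # hard-coded fallback named in the commit message


def generate_batch_sizes(max_batch_size: int) -> List[int]:
    """Hypothetical stand-in heuristic: powers of two below the max, then the max."""
    sizes = []
    bs = 1
    while bs < max_batch_size:
        sizes.append(bs)
        bs *= 2
    sizes.append(max_batch_size)
    return sizes


@dataclass
class CudaGraphConfigSketch:
    max_batch_size: int = DEFAULT_CG_MAX_BATCH_SIZE
    batch_sizes: Optional[List[int]] = None

    def __post_init__(self):
        # Generate the capture list only when the caller gave none.
        if self.batch_sizes is None:
            self.batch_sizes = generate_batch_sizes(self.max_batch_size)


@dataclass
class LlmArgsSketch:
    max_batch_size: int = 64
    # None models "the user did not set cuda_graph_config".
    cuda_graph_config: Optional[CudaGraphConfigSketch] = None

    def __post_init__(self):
        if self.cuda_graph_config is not None:
            return  # explicit user configuration is preserved untouched
        cfg = CudaGraphConfigSketch()
        if cfg.max_batch_size < self.max_batch_size:
            # Rebuild with the larger max so the list is regenerated.
            cfg = CudaGraphConfigSketch(max_batch_size=self.max_batch_size)
        self.cuda_graph_config = cfg


auto = LlmArgsSketch(max_batch_size=256)
print(auto.cuda_graph_config.batch_sizes)

explicit = LlmArgsSketch(
    max_batch_size=256,
    cuda_graph_config=CudaGraphConfigSketch(max_batch_size=8, batch_sizes=[1, 8]),
)
print(explicit.cuda_graph_config.batch_sizes)
```

The design point is the early return: only the implicit default is widened, so a user who deliberately captures a small set of graphs keeps exactly that set.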
…uracy test

The previous commit added a dedicated ``TestLlama3_1_8B_Instruct_FP8`` class and a matching ``l0_dgx_h100.yml`` entry. Both are redundant: ``TestModelRegistryAccuracy::test_autodeploy_from_registry`` already covers ``nvidia/Llama-3.1-8B-Instruct-FP8`` via its parametrized ``MODEL_REGISTRY_ACCURACY_PARAMS`` list, loads the ``examples/auto_deploy/model_registry/configs/llama3_1_8b.yaml`` bundle through ``_get_registry_yaml_extra``, and is already wired into ``l0_dgx_h100.yml`` post-merge as ``[nvidia_Llama-3.1-8B-Instruct-FP8-True]``.

The yaml tuning landed in this PR is picked up automatically by that existing entry.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>
…y match defaults

Mirrors the nano_v3.yaml cleanup. Removes entries that restated the AutoDeploy default verbatim:

- runtime, model_factory, compile_backend (LlmArgs defaults)
- kv_cache_config.free_gpu_memory_fraction (KvCacheConfig default)
- fuse_silu_mul.enabled (default.yaml already sets enabled: true)
- mlir_elementwise_fusion.stage, run_shape_prop, bypass_ops

Verified via ``LlmArgs(yaml_extra=['llama3_1_8b.yaml'])`` that every effective field is unchanged after the trim.

Quant- and TP-specific overrides (``kv_cache_config.dtype: fp8``, ``attn_backend: trtllm``, ``cuda_graph_config.batch_sizes``, ``detect_sharding`` overrides, the ``fuse_*: enabled: true`` toggles, ``compile_model.piecewise_enabled``, ``mlir_elementwise_fusion.enabled``, ``fuse_silu_mul.backend: trtllm``) are preserved.

Signed-off-by: egeva <19514940+MrGeva@users.noreply.github.com>
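The "effective fields unchanged after the trim" check can be illustrated with a generic deep-merge sketch. It does not use the real ``LlmArgs`` loader; `deep_merge` and `trim_defaults` are hypothetical helpers, and the default values below (``0.9``, ``"auto"``, ``"default_backend"``) are placeholders rather than the actual AutoDeploy defaults.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return base with override applied recursively (override wins)."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def trim_defaults(override, defaults):
    """Drop override entries whose value equals the corresponding default."""
    trimmed = {}
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(defaults.get(key), dict):
            sub = trim_defaults(value, defaults[key])
            if sub:
                trimmed[key] = sub
        elif key not in defaults or defaults[key] != value:
            trimmed[key] = value
    return trimmed


# Placeholder defaults: the values are illustrative, not the real ones.
defaults = {
    "kv_cache_config": {"free_gpu_memory_fraction": 0.9, "dtype": "auto"},
    "fuse_silu_mul": {"enabled": True, "backend": "default_backend"},
}
# Overrides before the cleanup: two entries merely restate a default.
override = {
    "kv_cache_config": {"free_gpu_memory_fraction": 0.9, "dtype": "fp8"},
    "fuse_silu_mul": {"enabled": True, "backend": "trtllm"},
}

trimmed = trim_defaults(override, defaults)
print(trimmed)

# The effective (merged) config must be identical before and after the trim.
assert deep_merge(defaults, trimmed) == deep_merge(defaults, override)
```

Only the entries that differ from the defaults survive, while the merged result stays identical, which is the invariant the commit reports verifying.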
32fef39 to e672760
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast |
|
PR_Github #50790 [ run ] triggered by Bot. Commit: |
|
PR_Github #50790 [ run ] completed with state
|
|
/bot run |
|
PR_Github #50958 [ run ] triggered by Bot. Commit: |
|
PR_Github #50958 [ run ] completed with state |