qwen3_5_moe: add CUDA Engine/Session execution path #20288
Pull request overview
Adds a new CUDA-capable Qwen3.5-MoE execution adapter built around the LLMEngine/LLMSession interfaces, enabling a single loaded model to host multiple isolated sessions by rebinding mutable buffers at execution time. The CLI runner is refactored into a thin wrapper over this engine/session path, and the CUDA build gains a “no-bleed” integration test to validate session isolation/capacity behavior.
Changes:
- Introduces Qwen35MoEEngine (engine + session implementation) supporting CUDA mutable-state rebinding and serving-capacity enforcement.
- Refactors examples/models/qwen3_5_moe/main.cpp to use the engine/session API and adds warmup + multi-iteration timing.
- Extends CUDA export to record per-session mutable-buffer FQNs and adds a CUDA “no-bleed” integration test + build wiring.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| Makefile | Updates CUDA build target messaging to include the new no-bleed test binary. |
| examples/models/qwen3_5_moe/test_qwen35_moe_nobleed.cpp | Adds a CUDA integration proof validating session isolation, memory delta, and capacity enforcement. |
| examples/models/qwen3_5_moe/README.md | Documents the engine/session architecture and the new no-bleed test + new CLI flags. |
| examples/models/qwen3_5_moe/qwen35_moe_engine.h | Adds the public Qwen35MoEEngine interface and config struct. |
| examples/models/qwen3_5_moe/qwen35_moe_engine.cpp | Implements engine/session logic, CUDA mutable-state rebinding, and capacity enforcement. |
| examples/models/qwen3_5_moe/main.cpp | Converts runner to a thin CLI over the new engine/session path with warmup + timing iterations. |
| examples/models/qwen3_5_moe/export.py | Records mutable-buffer FQNs in export metadata and enables share_mutable_buffers in memory planning. |
| examples/models/qwen3_5_moe/CMakePresets.json | Builds both runner and no-bleed test for the CUDA preset. |
| examples/models/qwen3_5_moe/CMakeLists.txt | Links the new engine implementation into runner/test and registers the CUDA test with CTest. |
Gasoonjia left a comment
Thanks for adding that!
| `--num_iters` | `1` | Timed iterations to average after warmup |
| `--cuda_graph` | `false` | CUDA-only decode graph capture for single-session runner use |

`--cuda_graph` is intentionally single-session only. CUDA graph replay captures
We should find some way to support CUDA graphs in a multi-session setting. One idea is to promote the CUDA graph configs into the LLM sessions and recapture the graph whenever we switch to a new session.
This will be a follow-up issue. Created #20310.
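The recapture-on-session-switch idea from this thread could be sketched roughly as follows. This is only an illustration, not the design tracked in the follow-up issue: `CapturedGraph` and `GraphCache` are hypothetical names, and the "capture" is a plain struct assignment standing in for real CUDA graph capture.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a captured CUDA graph. A real implementation
// would hold a graph handle whose kernels reference one session's buffers.
struct CapturedGraph {
  int64_t bound_session;
};

// Hypothetical cache that keeps a single graph and recaptures it whenever
// the active session changes, as suggested in the review comment.
class GraphCache {
 public:
  // Returns a graph valid for `session_id`, recapturing only on a switch.
  const CapturedGraph& graph_for(int64_t session_id) {
    if (!has_graph_ || graph_.bound_session != session_id) {
      graph_ = CapturedGraph{session_id};  // Placeholder for real capture.
      has_graph_ = true;
      ++capture_count_;
    }
    return graph_;
  }

  int capture_count() const { return capture_count_; }

 private:
  CapturedGraph graph_{-1};
  bool has_graph_ = false;
  int capture_count_ = 0;
};
```

With this shape, repeated decode steps on one session reuse the graph, while interleaving sessions pays one recapture per switch. Whether that cost is acceptable is the open question for the follow-up.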
int32_t max_sessions = 1;
// CUDA-only: graph-capture decode for single-session runner use. Incompatible
// with per-session mutable-state rebinding, so capacity remains 1.
bool enable_cuda_graph = false;
Please guard it with #ifdef EXECUTORCH_BUILD_CUDA since it is CUDA-only.
Keeping the config shape stable across CUDA and non-CUDA builds is simpler for callers and for main.cpp. A non-CUDA build can still parse --cuda_graph, set config.enable_cuda_graph, and the engine/CLI can report “ignored on non-CUDA build.” If the field is guarded, every caller that touches it needs its own #ifdef, and the public config type changes by build mode.
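A minimal sketch of the trade-off described in this reply, assuming a simplified config: `EngineConfig`, `built_with_cuda`, and `resolve_cuda_graph` are hypothetical names for illustration and are not taken from qwen35_moe_engine.h. The field stays unguarded, and the only `#ifdef` lives in one helper. Clamping capacity to 1 when the graph is enabled is also an assumption, based on the header comment quoted above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical config mirroring the two fields quoted in the diff above.
struct EngineConfig {
  int32_t max_sessions = 1;
  bool enable_cuda_graph = false;  // Present in every build mode.
};

// The single place that depends on the build macro named in the review.
constexpr bool built_with_cuda() {
#ifdef EXECUTORCH_BUILD_CUDA
  return true;
#else
  return false;
#endif
}

// Hypothetical helper: normalizes the flag once so callers never need
// their own #ifdef. Returns a warning message, or an empty string.
inline std::string resolve_cuda_graph(
    EngineConfig& config, bool has_cuda = built_with_cuda()) {
  if (config.enable_cuda_graph && !has_cuda) {
    config.enable_cuda_graph = false;
    return "--cuda_graph ignored on non-CUDA build";
  }
  if (config.enable_cuda_graph) {
    config.max_sessions = 1;  // Assumed: graph capture is single-session.
  }
  return "";
}
```

A caller such as a CLI wrapper would set `config.enable_cuda_graph` unconditionally from the parsed flag and print whatever message the helper returns.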
Add a Qwen3.5-MoE execution adapter on top of the new LLMEngine/LLMSession and CUDA mutable-state foundations. The engine loads one physical model, registers exported mutable-buffer metadata, and creates isolated sessions that rebind their own KV/conv/recurrent state before execution while sharing model weights.
Keep the existing CLI behavior by making main.cpp a thin wrapper over the engine/session path. The export now records the model-specific mutable-buffer FQNs, and the CUDA build includes a no-bleed integration proof that interleaves two sessions on one loaded model and checks state isolation, memory growth, and capacity enforcement.
This intentionally leaves OpenAI serving, worker loops, warm resume, and per-model serving wrappers for later PRs.
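To make the engine/session split described above concrete, here is a toy sketch under stated assumptions. `ToyEngine` and `ToySession` are hypothetical and do not reflect the actual Qwen35MoEEngine API. A single integer stands in for the shared model weights, and a `std::vector<int>` stands in for the per-session KV/conv/recurrent state that gets rebound before each step.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <numeric>
#include <vector>

class ToySession;

// Hypothetical engine: owns the shared "weights" and enforces capacity.
class ToyEngine {
 public:
  ToyEngine(int weight, std::size_t max_sessions)
      : weight_(weight), max_sessions_(max_sessions) {}

  // Returns nullptr once max_sessions sessions are live.
  std::unique_ptr<ToySession> create_session();

  std::size_t live_sessions() const { return live_; }

 private:
  friend class ToySession;

  // One "decode step" against whichever state buffer is currently bound.
  int run(int token) {
    assert(bound_state_ != nullptr);
    bound_state_->push_back(token);
    int sum = std::accumulate(bound_state_->begin(), bound_state_->end(), 0);
    return weight_ * sum;
  }

  const int weight_;                         // Shared across all sessions.
  const std::size_t max_sessions_;
  std::size_t live_ = 0;
  std::vector<int>* bound_state_ = nullptr;  // Rebound before each step.
};

// Hypothetical session: owns private mutable state. The engine must
// outlive every session it created.
class ToySession {
 public:
  ~ToySession() {
    if (engine_->bound_state_ == &state_) engine_->bound_state_ = nullptr;
    --engine_->live_;
  }

  // Rebinds this session's state into the engine, then executes.
  int step(int token) {
    engine_->bound_state_ = &state_;
    return engine_->run(token);
  }

 private:
  friend class ToyEngine;
  explicit ToySession(ToyEngine* engine) : engine_(engine) {}

  ToyEngine* engine_;
  std::vector<int> state_;
};

inline std::unique_ptr<ToySession> ToyEngine::create_session() {
  if (live_ >= max_sessions_) return nullptr;
  ++live_;
  return std::unique_ptr<ToySession>(new ToySession(this));
}
```

Interleaving `step` calls on two sessions from one engine and checking that each result depends only on that session's own history is the same shape of check the no-bleed test is described as performing, here reduced to integers.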
#20001
Will do Gemma4 31B in later PRs
CI already exercises main.cpp, which wraps QwenEngine and QwenSession.