qwen3_5_moe: add CUDA Engine/Session execution path by mergennachin · Pull Request #20288 · pytorch/executorch

mergennachin · 2026-06-15T21:43:56Z

Add a Qwen3.5-MoE execution adapter on top of the new LLMEngine/LLMSession and CUDA mutable-state foundations. The engine loads one physical model, registers exported mutable-buffer metadata, and creates isolated sessions that rebind their own KV/conv/recurrent state before execution while sharing model weights.
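To make the shape of this concrete, here is a minimal sketch of one engine that owns shared weights and hands out sessions that rebind their own state before each step. Every name in it (`ToyEngine`, `ToySession`, `SessionState`, `step`, `create_session`) is hypothetical and chosen for illustration; it is not the `Qwen35MoEEngine` API, and the "model" is a toy accumulator rather than real CUDA execution.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these names come from ExecuTorch.
struct SessionState {
  std::vector<float> kv;  // per-session mutable state (stands in for KV/conv/recurrent)
};

class ToyEngine;

class ToySession {
 public:
  ToySession(ToyEngine& engine, std::size_t state_size);
  ~ToySession();
  float step(float token);  // rebinds this session's state, then executes

 private:
  ToyEngine& engine_;
  SessionState state_;
};

class ToyEngine {
 public:
  ToyEngine(std::vector<float> weights, int max_sessions)
      : weights_(std::move(weights)), max_sessions_(max_sessions) {}

  // Refuses to create more sessions than the configured capacity.
  std::unique_ptr<ToySession> create_session() {
    if (live_sessions_ >= max_sessions_) {
      throw std::runtime_error("session capacity reached");
    }
    return std::make_unique<ToySession>(*this, weights_.size());
  }

  int live_sessions() const { return live_sessions_; }

 private:
  friend class ToySession;

  // "Execution" reads the shared weights and mutates whichever state is bound.
  float execute(float token) {
    float out = 0.0f;
    for (std::size_t i = 0; i < weights_.size(); ++i) {
      bound_->kv[i] += token;
      out += weights_[i] * bound_->kv[i];
    }
    return out;
  }

  std::vector<float> weights_;  // loaded once, shared by all sessions
  int max_sessions_;
  int live_sessions_ = 0;
  SessionState* bound_ = nullptr;  // state currently bound for execution
};

inline ToySession::ToySession(ToyEngine& engine, std::size_t state_size)
    : engine_(engine), state_{std::vector<float>(state_size, 0.0f)} {
  ++engine_.live_sessions_;
}

inline ToySession::~ToySession() { --engine_.live_sessions_; }

inline float ToySession::step(float token) {
  engine_.bound_ = &state_;  // rebind before execution
  return engine_.execute(token);
}
```

With `ToyEngine engine({1.0f, 2.0f}, 1)`, a first `create_session()` succeeds and a second one throws until the first session is destroyed, which is the capacity behavior the description refers to.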

Keep the existing CLI behavior by making main.cpp a thin wrapper over the engine/session path. The export now records the model-specific mutable-buffer FQNs, and the CUDA build includes a no-bleed integration proof that interleaves two sessions on one loaded model and checks state isolation, memory growth, and capacity enforcement.
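The idea behind a no-bleed check can be sketched as follows: run each prompt alone on a fresh session, then interleave both prompts token by token on one shared model, and require that each session's outputs match its solo run. This is a toy illustration under stated assumptions, not the contents of `test_qwen35_moe_nobleed.cpp`; `MiniModel`, `MiniSession`, `run_alone`, and `no_bleed` are hypothetical names, and integer arithmetic stands in for real inference.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy model: one shared weight, read-only during execution.
struct MiniModel {
  std::int64_t weight = 3;
};

// Hypothetical session: owns its own recurrent state.
struct MiniSession {
  std::int64_t state = 0;
  std::int64_t step(const MiniModel& model, std::int64_t token) {
    state = state * model.weight + token;
    return state;
  }
};

// Runs one prompt alone on a fresh session and records every output.
std::vector<std::int64_t> run_alone(const MiniModel& model,
                                    const std::vector<std::int64_t>& prompt) {
  MiniSession session;
  std::vector<std::int64_t> out;
  for (std::int64_t token : prompt) {
    out.push_back(session.step(model, token));
  }
  return out;
}

// Interleaves two prompts on one model; returns true when each session's
// outputs equal its solo run, i.e. no state leaked between sessions.
bool no_bleed(const MiniModel& model,
              const std::vector<std::int64_t>& prompt_a,
              const std::vector<std::int64_t>& prompt_b) {
  MiniSession a;
  MiniSession b;
  std::vector<std::int64_t> out_a;
  std::vector<std::int64_t> out_b;
  const std::size_t n = std::max(prompt_a.size(), prompt_b.size());
  for (std::size_t i = 0; i < n; ++i) {
    if (i < prompt_a.size()) out_a.push_back(a.step(model, prompt_a[i]));
    if (i < prompt_b.size()) out_b.push_back(b.step(model, prompt_b[i]));
  }
  return out_a == run_alone(model, prompt_a) &&
         out_b == run_alone(model, prompt_b);
}
```

A real integration test would additionally measure device memory before and after creating the second session and verify that exceeding the session capacity is rejected, as the description states.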

This intentionally leaves OpenAI serving, worker loops, warm resume, and per-model serving wrappers for later PRs.

#20001

Gemma4 31B will follow in later PRs.

CI already exercises main.cpp, which wraps QwenEngine and QwenSession.

pytorch-bot · 2026-06-15T21:44:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20288

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 11 Pending, 1 Unrelated Failure, 2 Unclassified Failures

As of commit 7e8992f with merge base e257a71:

NEW FAILURE - The following job has failed:

trunk / test-qnn-model (fp32, mv2) / linux-job (gh)
RuntimeError: Command docker exec -t 1fa9ccba58911f71158a377db8fa6a9068cc62e591a2f4e56a8ad902c129786a /exec failed with exit code 92

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Build Aarch64 Linux Wheels / pytorch/executorch / build-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
/__w/executorch/executorch/pytorch/executorch/backends/apple/coreml/runtime/inmemoryfs/inmemory_filesystem.cpp:722:48: error: ‘inmemoryfs::InMemoryFileSystem::InMemoryNode::Kind’ has not been declared
Build Aarch64 Linux Wheels / pytorch/executorch / upload / upload-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Unable to download artifact(s): Artifact not found for name: pytorch_executorch__3.10_cpu_aarch64

FLAKY - The following job failed but was likely due to flakiness present on trunk:

Test CUDA Windows Export and E2E / test-model-cuda-windows-e2e (facebook, dinov2-small-imagenet1k-1-layer, non-quantized) / windows-job (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-15T21:44:45Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

Adds a new CUDA-capable Qwen3.5-MoE execution adapter built around the LLMEngine/LLMSession interfaces, enabling a single loaded model to host multiple isolated sessions by rebinding mutable buffers at execution time. The CLI runner is refactored into a thin wrapper over this engine/session path, and the CUDA build gains a “no-bleed” integration test to validate session isolation/capacity behavior.

Changes:

Introduces Qwen35MoEEngine (engine + session implementation) supporting CUDA mutable-state rebinding and serving-capacity enforcement.
Refactors examples/models/qwen3_5_moe/main.cpp to use the engine/session API and adds warmup + multi-iteration timing.
Extends CUDA export to record per-session mutable-buffer FQNs and adds a CUDA “no-bleed” integration test + build wiring.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| Makefile | Updates CUDA build target messaging to include the new no-bleed test binary. |
| examples/models/qwen3_5_moe/test_qwen35_moe_nobleed.cpp | Adds a CUDA integration proof validating session isolation, memory delta, and capacity enforcement. |
| examples/models/qwen3_5_moe/README.md | Documents the engine/session architecture and the new no-bleed test + new CLI flags. |
| examples/models/qwen3_5_moe/qwen35_moe_engine.h | Adds the public `Qwen35MoEEngine` interface and config struct. |
| examples/models/qwen3_5_moe/qwen35_moe_engine.cpp | Implements engine/session logic, CUDA mutable-state rebinding, and capacity enforcement. |
| examples/models/qwen3_5_moe/main.cpp | Converts runner to a thin CLI over the new engine/session path with warmup + timing iterations. |
| examples/models/qwen3_5_moe/export.py | Records mutable-buffer FQNs in export metadata and enables `share_mutable_buffers` in memory planning. |
| examples/models/qwen3_5_moe/CMakePresets.json | Builds both runner and no-bleed test for the CUDA preset. |
| examples/models/qwen3_5_moe/CMakeLists.txt | Links the new engine implementation into runner/test and registers the CUDA test with CTest. |


Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.


Gasoonjia

Thanks for adding that!

Gasoonjia · 2026-06-16T17:51:29Z

+| `--num_iters` | `1` | Timed iterations to average after warmup |
+| `--cuda_graph` | `false` | CUDA-only decode graph capture for single-session runner use |
+
+`--cuda_graph` is intentionally single-session only. CUDA graph replay captures


We should find a way to support CUDA graphs in a multi-session setting. One idea is to promote the CUDA graph configs into the LLM sessions and recapture the graph whenever we switch to a new session.

This will be a follow-up issue. Created #20310.
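The recapture-on-switch idea could look roughly like the sketch below. It relies on the general property that a captured CUDA graph replays with the kernel arguments, including device pointers, that were recorded at capture time, so a graph captured against one session's buffers cannot simply be replayed for another. `CapturedGraph` and `GraphCache` are hypothetical names, and the capture itself is a placeholder; nothing here reflects how #20310 will actually be implemented.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a captured decode graph. A real one would hold
// something like a cudaGraphExec_t bound to one session's buffers.
struct CapturedGraph {
  std::uint64_t session_id;
};

class GraphCache {
 public:
  // Returns a graph valid for `session_id`, recapturing when the cached
  // graph was captured against a different session's state.
  const CapturedGraph& get(std::uint64_t session_id) {
    if (!graph_ || graph_->session_id != session_id) {
      graph_ = CapturedGraph{session_id};  // placeholder for real capture
      ++captures_;
    }
    return *graph_;
  }

  int captures() const { return captures_; }

 private:
  std::optional<CapturedGraph> graph_;
  int captures_ = 0;
};
```

Under this scheme, alternating between two sessions on every decode step would recapture each time, so a per-session graph (one cached capture per live session) may be the better trade-off if memory allows.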

Gasoonjia · 2026-06-16T17:55:59Z

+  int32_t max_sessions = 1;
+  // CUDA-only: graph-capture decode for single-session runner use. Incompatible
+  // with per-session mutable-state rebinding, so capacity remains 1.
+  bool enable_cuda_graph = false;


Please guard it with `#ifdef EXECUTORCH_BUILD_CUDA`, since it is CUDA-only.

Keeping the config shape stable across CUDA and non-CUDA builds is simpler for callers and for main.cpp. A non-CUDA build can still parse --cuda_graph and set config.enable_cuda_graph, and the engine/CLI can report “ignored on non-CUDA build.” If the field is guarded, every caller that touches it needs its own #ifdef, and the public config type changes by build mode.
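A minimal sketch of that approach, assuming a config struct with the two fields from the diff above: the field exists in every build, and only the code that acts on it is conditional. `EngineConfig` and `check_cuda_graph` are hypothetical names used for illustration; the warning text is taken from the reply.

```cpp
#include <cstdint>
#include <string>

// Hypothetical config type; the two fields mirror the quoted diff.
struct EngineConfig {
  std::int32_t max_sessions = 1;
  bool enable_cuda_graph = false;  // present in every build, CUDA or not
};

// Returns a warning when the flag cannot be honored, or an empty string.
// Only this function needs the #ifdef; callers set the field unconditionally.
inline std::string check_cuda_graph(const EngineConfig& config) {
#ifdef EXECUTORCH_BUILD_CUDA
  (void)config;  // flag is honored on CUDA builds
  return "";
#else
  return config.enable_cuda_graph ? "--cuda_graph ignored on non-CUDA build"
                                  : "";
#endif
}
```

The design choice is that the build-mode difference is confined to one place, so `sizeof(EngineConfig)` and its member list are identical for every caller regardless of how the library was compiled.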

Copilot AI review requested due to automatic review settings June 15, 2026 21:43

mergennachin requested review from kirklandsign and larryliu0820 as code owners June 15, 2026 21:43

meta-cla Bot added the CLA Signed label Jun 15, 2026

mergennachin temporarily deployed to cadence June 15, 2026 21:44 — with GitHub Actions Inactive

Copilot started reviewing on behalf of mergennachin June 15, 2026 21:44

mergennachin requested review from Gasoonjia and digantdesai June 15, 2026 21:44

mergennachin added the ciflow/cuda label Jun 15, 2026

Copilot AI reviewed Jun 15, 2026


Comment thread examples/models/qwen3_5_moe/qwen35_moe_engine.cpp

Comment thread examples/models/qwen3_5_moe/main.cpp

Comment thread examples/models/qwen3_5_moe/README.md

mergennachin force-pushed the llm-qwen35-moe-engine branch from 003c72a to 7344b98 Compare June 16, 2026 13:52

mergennachin temporarily deployed to cadence June 16, 2026 13:52 — with GitHub Actions Inactive

Copilot AI review requested due to automatic review settings June 16, 2026 15:28

mergennachin force-pushed the llm-qwen35-moe-engine branch from 7344b98 to de1d237 Compare June 16, 2026 15:28

mergennachin temporarily deployed to cadence June 16, 2026 15:29 — with GitHub Actions Inactive

Copilot started reviewing on behalf of mergennachin June 16, 2026 15:29

Copilot AI reviewed Jun 16, 2026


mergennachin deployed to upload-benchmark-results June 16, 2026 16:46 — with GitHub Actions Active

mergennachin force-pushed the llm-qwen35-moe-engine branch from de1d237 to 7e8992f Compare June 16, 2026 17:30

mergennachin temporarily deployed to cadence June 16, 2026 17:30 — with GitHub Actions Inactive

Gasoonjia approved these changes Jun 16, 2026


mergennachin merged commit 3b77c08 into main Jun 16, 2026
539 of 546 checks passed

mergennachin deleted the llm-qwen35-moe-engine branch June 16, 2026 18:56
