qwen3_5_moe: add CUDA Engine/Session execution path#20288

Merged
mergennachin merged 1 commit into
mainfrom
llm-qwen35-moe-engine
Jun 16, 2026

@mergennachin mergennachin commented Jun 15, 2026

Contributor

Add a Qwen3.5-MoE execution adapter on top of the new LLMEngine/LLMSession and CUDA mutable-state foundations. The engine loads one physical model, registers exported mutable-buffer metadata, and creates isolated sessions that rebind their own KV/conv/recurrent state before execution while sharing model weights.

Keep the existing CLI behavior by making main.cpp a thin wrapper over the engine/session path. The export now records the model-specific mutable-buffer FQNs, and the CUDA build includes a no-bleed integration proof that interleaves two sessions on one loaded model and checks state isolation, memory growth, and capacity enforcement.

This intentionally leaves OpenAI serving, worker loops, warm resume, and per-model serving wrappers for later PRs.
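The ownership split described above (one engine holding the shared weights, each session holding its own mutable state and rebinding it before execution, with a capacity limit) can be sketched in plain C++. This is a minimal CPU-only illustration under stated assumptions, not the PR's implementation: `ToyEngine`, `ToySession`, `SharedWeights`, and `SessionState` are hypothetical names, and a single float accumulator stands in for the KV/conv/recurrent buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins; these are NOT the PR's Qwen35MoEEngine types.
struct SharedWeights {
  std::vector<float> data;  // loaded once, read-only afterwards
};

// Per-session mutable state (stands in for KV/conv/recurrent buffers).
struct SessionState {
  std::vector<float> kv;
};

class ToyEngine;

// A session owns its state and must not outlive the engine that created it.
class ToySession {
 public:
  ToySession(ToyEngine* engine, std::size_t kv_size) : engine_(engine) {
    assert(kv_size >= 1);
    state_.kv.assign(kv_size, 0.0f);
  }
  ~ToySession();
  // Rebinds this session's state on the engine, then runs one toy step.
  float step(float token);

 private:
  ToyEngine* engine_;
  SessionState state_;
};

class ToyEngine {
 public:
  ToyEngine(std::size_t max_sessions, std::size_t kv_size)
      : max_sessions_(max_sessions), kv_size_(kv_size) {
    weights_.data = {0.5f};  // the single "physical model"
  }

  // Creates an isolated session, or throws once capacity is reached.
  std::unique_ptr<ToySession> create_session() {
    if (live_sessions_ >= max_sessions_) {
      throw std::runtime_error("session capacity exceeded");
    }
    auto session = std::make_unique<ToySession>(this, kv_size_);
    ++live_sessions_;
    return session;
  }

 private:
  friend class ToySession;

  // Runs against whichever session state is currently bound.
  float execute(float token) {
    assert(bound_ != nullptr);
    bound_->kv[0] += weights_.data[0] * token;
    return bound_->kv[0];
  }

  SharedWeights weights_;
  SessionState* bound_ = nullptr;  // rebound before every execution
  std::size_t max_sessions_;
  std::size_t kv_size_;
  std::size_t live_sessions_ = 0;
};

inline ToySession::~ToySession() {
  if (engine_->bound_ == &state_) {
    engine_->bound_ = nullptr;  // never leave a dangling binding
  }
  --engine_->live_sessions_;    // free the capacity slot
}

inline float ToySession::step(float token) {
  engine_->bound_ = &state_;
  return engine_->execute(token);
}
```

A no-bleed style check on this toy would create two sessions from one engine, interleave `step` calls on them, and confirm that each session's result depends only on its own history. It would also confirm that a further `create_session()` call beyond `max_sessions` throws.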

#20001

Gemma4 31B will follow in later PRs.

CI already exercises main.cpp, which wraps QwenEngine and QwenSession.

Copilot AI review requested due to automatic review settings June 15, 2026 21:43
@pytorch-bot

pytorch-bot Bot commented Jun 15, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20288

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 11 Pending, 1 Unrelated Failure, 2 Unclassified Failures

As of commit 7e8992f with merge base e257a71 (image):

NEW FAILURE - The following job has failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 15, 2026
@github-actions


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Contributor

Pull request overview

Adds a new CUDA-capable Qwen3.5-MoE execution adapter built around the LLMEngine/LLMSession interfaces, enabling a single loaded model to host multiple isolated sessions by rebinding mutable buffers at execution time. The CLI runner is refactored into a thin wrapper over this engine/session path, and the CUDA build gains a “no-bleed” integration test to validate session isolation/capacity behavior.

Changes:

  • Introduces Qwen35MoEEngine (engine + session implementation) supporting CUDA mutable-state rebinding and serving-capacity enforcement.
  • Refactors examples/models/qwen3_5_moe/main.cpp to use the engine/session API and adds warmup + multi-iteration timing.
  • Extends CUDA export to record per-session mutable-buffer FQNs and adds a CUDA “no-bleed” integration test + build wiring.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `Makefile` | Updates CUDA build target messaging to include the new no-bleed test binary. |
| `examples/models/qwen3_5_moe/test_qwen35_moe_nobleed.cpp` | Adds a CUDA integration proof validating session isolation, memory delta, and capacity enforcement. |
| `examples/models/qwen3_5_moe/README.md` | Documents the engine/session architecture and the new no-bleed test + new CLI flags. |
| `examples/models/qwen3_5_moe/qwen35_moe_engine.h` | Adds the public Qwen35MoEEngine interface and config struct. |
| `examples/models/qwen3_5_moe/qwen35_moe_engine.cpp` | Implements engine/session logic, CUDA mutable-state rebinding, and capacity enforcement. |
| `examples/models/qwen3_5_moe/main.cpp` | Converts runner to a thin CLI over the new engine/session path with warmup + timing iterations. |
| `examples/models/qwen3_5_moe/export.py` | Records mutable-buffer FQNs in export metadata and enables share_mutable_buffers in memory planning. |
| `examples/models/qwen3_5_moe/CMakePresets.json` | Builds both runner and no-bleed test for the CUDA preset. |
| `examples/models/qwen3_5_moe/CMakeLists.txt` | Links the new engine implementation into runner/test and registers the CUDA test with CTest. |

Comment thread examples/models/qwen3_5_moe/qwen35_moe_engine.cpp
Comment thread examples/models/qwen3_5_moe/main.cpp
Comment thread examples/models/qwen3_5_moe/README.md

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Comment thread examples/models/qwen3_5_moe/qwen35_moe_engine.cpp
Comment thread examples/models/qwen3_5_moe/qwen35_moe_engine.cpp
Comment thread examples/models/qwen3_5_moe/qwen35_moe_engine.cpp
Comment thread examples/models/qwen3_5_moe/main.cpp
Comment thread examples/models/qwen3_5_moe/main.cpp
Comment thread examples/models/qwen3_5_moe/main.cpp
@mergennachin mergennachin deployed to upload-benchmark-results June 16, 2026 16:46 — with GitHub Actions Active
@Gasoonjia Gasoonjia left a comment

Contributor

Thanks for adding that!

| `--num_iters` | `1` | Timed iterations to average after warmup |
| `--cuda_graph` | `false` | CUDA-only decode graph capture for single-session runner use |

`--cuda_graph` is intentionally single-session only. CUDA graph replay captures

Contributor

We should find a way to support CUDA graphs in the multi-session setting. One idea is to promote the CUDA graph config into the LLM sessions and recapture the graph whenever we switch to a new session.

Contributor Author

This will be a follow-up issue. Created #20310.
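To make the recapture idea in this thread concrete, here is a rough sketch of "recapture when the active session changes". It is a CPU-only toy under stated assumptions, not a proposal for the actual design: `ToyGraph`, `ToySessionState`, and `ToyGraphRunner` are hypothetical names, and a closure that has baked in one session's buffer address stands in for a captured CUDA graph. No real CUDA graph API is used.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical per-session state; one int stands in for the real buffers.
struct ToySessionState {
  std::vector<int> kv = std::vector<int>(1, 0);
};

// Hypothetical stand-in for a captured graph: a replayable closure that has
// baked in the address of the state it was captured against.
struct ToyGraph {
  const ToySessionState* captured_for = nullptr;
  std::function<void()> replay;
};

class ToyGraphRunner {
 public:
  // Runs one decode step for `state`. If the current graph was captured
  // against a different session, it is recaptured first. `state` must
  // outlive the runner's use of it.
  void decode(ToySessionState& state) {
    if (graph_.captured_for != &state) {
      graph_.captured_for = &state;
      graph_.replay = [&state] { state.kv[0] += 1; };
      ++captures_;
    }
    assert(graph_.replay != nullptr);
    graph_.replay();
  }

  // Number of captures so far; a proxy for recapture cost.
  int captures() const { return captures_; }

 private:
  ToyGraph graph_;
  int captures_ = 0;
};
```

In this toy, repeated decodes on one session reuse the capture, while every switch between sessions pays for a new one. Whether that cost is acceptable in practice, or whether one cached graph per session is preferable, is a question the sketch does not answer.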

int32_t max_sessions = 1;
// CUDA-only: graph-capture decode for single-session runner use. Incompatible
// with per-session mutable-state rebinding, so capacity remains 1.
bool enable_cuda_graph = false;

Contributor

Please guard it with `#ifdef EXECUTORCH_BUILD_CUDA`, since it is CUDA-only.

Contributor Author

Keeping the config shape stable across CUDA and non-CUDA builds is simpler for callers and for main.cpp. A non-CUDA build can still parse `--cuda_graph`, set `config.enable_cuda_graph`, and have the engine/CLI report "ignored on non-CUDA build." If the field is guarded, every caller that touches it needs its own `#ifdef`, and the public config type changes by build mode.
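The trade-off argued here can be illustrated with a small sketch in which the field exists in every build and the only `#ifdef` lives in one helper. This is an illustration under stated assumptions, not the code in the PR: `ToyEngineConfig`, `built_with_cuda`, and `resolve_config` are hypothetical names, and the warning text is paraphrased from the comment above.

```cpp
#include <cassert>
#include <string>

// Hypothetical config mirroring the two fields quoted in this thread.
// Its layout is identical in CUDA and non-CUDA builds.
struct ToyEngineConfig {
  int max_sessions = 1;
  bool enable_cuda_graph = false;  // present in every build, never guarded
};

// The single place where the build mode is inspected.
constexpr bool built_with_cuda() {
#ifdef EXECUTORCH_BUILD_CUDA
  return true;
#else
  return false;
#endif
}

// Normalizes the config for the current build. Returns a warning to print,
// or an empty string if every requested option is honored. `has_cuda` is a
// parameter only so the behavior can be exercised in either build mode.
inline std::string resolve_config(ToyEngineConfig& config,
                                  bool has_cuda = built_with_cuda()) {
  assert(config.max_sessions >= 1);
  if (config.enable_cuda_graph && !has_cuda) {
    config.enable_cuda_graph = false;
    return "--cuda_graph ignored on non-CUDA build";
  }
  return "";
}
```

Callers such as a CLI parser can set `enable_cuda_graph` unconditionally and call `resolve_config` once, so no call site needs its own preprocessor guard.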

@mergennachin mergennachin merged commit 3b77c08 into main Jun 16, 2026
539 of 546 checks passed
@mergennachin mergennachin deleted the llm-qwen35-moe-engine branch June 16, 2026 18:56
Labels

ciflow/cuda CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

3 participants