
[not for land yet]: improve cuda graph support for Qwen-Image #13263

Open
vkuzo wants to merge 1 commit into huggingface:main from vkuzo:20260312_qwen_image_cuda_graphs

Conversation

@vkuzo
Contributor

@vkuzo vkuzo commented Mar 12, 2026

Summary:

Very brief writeup as I'm about to head out for the day:

  1. we want to enable cuda graphs for qwen-image + nvfp4 at small batch sizes, because without cuda graphs we are bottlenecked on CPU ops
  2. to make cuda graphs work, we need to change the modeling code a bit to match the cuda graph requirements

There is a cleaner way to make this change repo-wide without touching each model's modeling code; for now this
is just a quick hack to demonstrate the performance and accuracy win.
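As a rough sketch of the kind of modeling-code change involved (illustrative only, not the actual diff in this PR): cuda graph capture under `torch.compile(mode="reduce-overhead")` can be tripped up by forward inputs that alias caller-owned buffers, and cloning those inputs at the top of `forward()` is one way to satisfy the capture requirements. `ToyBlock` below is a hypothetical stand-in, not the Qwen-Image transformer:

```python
import torch

class ToyBlock(torch.nn.Module):
    """Hypothetical stand-in for a transformer block (not diffusers code)."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Cloning breaks aliasing with the caller's buffer, which is the kind
        # of change cuda graph capture under reduce-overhead may require.
        hidden_states = hidden_states.clone()
        return self.proj(hidden_states)

block = ToyBlock()
out = block(torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 8])
```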

Test Plan:

Use a modified version of @sayakpaul's script: https://gist.github.com/vkuzo/acac22c62404c89db2dcf195a64543db

Then run it and observe the nvfp4 + bsz 1 time on Qwen-Image improve by ~1.6x, from 9.5s to 5.9s:

```
// baseline

(pt_nightly) dev@gpu-dev-6c281422:~/tmp$ python 20260212_diffuser_nvfp4.py --compile True --torch_compile_mode reduce-overhead
...
======================================================================
SUMMARY
======================================================================
Quantization: None
Compile: True
Batch size: 1
Latency: 7.461s
Peak Memory: 62.21 GB

// nvfp4 dynamic, torch.compile default

(pt_nightly) dev@gpu-dev-6c281422:~/tmp$ python 20260212_diffuser_nvfp4.py --compile True --quant dynamic --use_filter_fn True
...
======================================================================
SUMMARY
======================================================================
Quantization: dynamic
Compile: True
Batch size: 1
Latency: 9.536s
Peak Memory: 52.45 GB
======================================================================

// nvfp4 dynamic, torch.compile reduce-overhead (for cuda graphs)

(pt_nightly) dev@gpu-dev-6c281422:~/tmp$ python 20260212_diffuser_nvfp4.py --compile True --quant dynamic --use_filter_fn True --torch_compile_mode reduce-overhead
...
======================================================================
SUMMARY
======================================================================
Quantization: dynamic
Compile: True
Batch size: 1
Latency: 5.936s
Peak Memory: 52.45 GB
======================================================================
```
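For reference, the ~1.6x claim follows directly from the two nvfp4 latencies above:

```python
# Latencies (seconds) copied from the SUMMARY blocks above
nvfp4_default = 9.536          # torch.compile, default mode
nvfp4_reduce_overhead = 5.936  # reduce-overhead (cuda graphs)

speedup = nvfp4_default / nvfp4_reduce_overhead
print(f"{speedup:.2f}x")  # 1.61x
```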


@sayakpaul
Member

Thanks for this PR! Do we know how clone() helps the NVFP4 case but not the others like BF16?

> There is a cleaner way to do this change repo-wide without having to change each model's modeling code,

What would you recommend for this? Should hidden_states and encoder_hidden_states enter forward() already cloned?

@sayakpaul sayakpaul requested a review from yiyixuxu March 13, 2026 02:08
@sayakpaul
Member

If we want to keep the modeling code unchanged, the following could be another approach, I guess?

```
def _clone_inputs_hook(module, args, kwargs):
    # Clone tensor inputs so forward() never aliases caller-owned buffers
    args = tuple(a.clone() if isinstance(a, torch.Tensor) else a for a in args)
    kwargs = {k: v.clone() if isinstance(v, torch.Tensor) else v for k, v in kwargs.items()}
    return args, kwargs

transformer.register_forward_pre_hook(_clone_inputs_hook, with_kwargs=True)
```
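A quick sanity check that such a pre-hook really swaps in cloned tensors before `forward()` runs (toy `Linear` module; the second hook only records what `forward()` would receive, since pre-hooks run in registration order):

```python
import torch

def _clone_inputs_hook(module, args, kwargs):
    # Clone every tensor input before it reaches forward()
    args = tuple(a.clone() if isinstance(a, torch.Tensor) else a for a in args)
    kwargs = {k: v.clone() if isinstance(v, torch.Tensor) else v for k, v in kwargs.items()}
    return args, kwargs

seen = {}

def _record_inputs_hook(module, args, kwargs):
    # Runs after the clone hook, so it sees what forward() sees
    seen["input"] = args[0]
    return None

lin = torch.nn.Linear(4, 4)
lin.register_forward_pre_hook(_clone_inputs_hook, with_kwargs=True)
lin.register_forward_pre_hook(_record_inputs_hook, with_kwargs=True)

x = torch.randn(2, 4)
lin(x)
# forward() saw a clone, not the caller's buffer
print(seen["input"].data_ptr() != x.data_ptr())  # True
```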
