Add use_real parameter to Z-Image for platform compatibility #13824

Open
st7109 wants to merge 2 commits into huggingface:main from st7109:optimize-z-image

st7109 commented May 28, 2026

What does this PR do?

Fixes # (issue)

Adds an optional real-number RoPE implementation to the Z-Image transformer and controlnet. When use_real=True, the rotary position embeddings are represented as (cos, sin) tuples instead of complex numbers, which lets the model run on platforms that do not support complex arithmetic (e.g., Cambricon MLU and other NPUs).

Changes:

  • Add apply_rotary_emb() with use_real parameter supporting both complex and real computation
  • Propagate use_real through ZSingleStreamAttnProcessor, ZImageTransformerBlock, RopeEmbedder, ZImageTransformer2DModel, and controlnet variants
  • Update _prepare_sequence and _build_unified_sequence to handle (cos, sin) tuples
  • Default use_real=False maintains backward compatibility
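
To illustrate why the two modes listed above can agree numerically, here is a minimal pure-Python sketch. It is not the PR's PyTorch code: the helper names, the interleaved (even, odd) channel pairing, and the `theta=10000.0` base are all assumptions chosen for illustration.

```python
import cmath
import math


def rope_angles(position, head_dim, theta=10000.0):
    """Rotation angle for each (even, odd) channel pair at one position.

    theta is an illustrative base, not necessarily the value Z-Image uses.
    """
    return [position / theta ** (2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope_complex(x, angles):
    """Complex path: view each channel pair as a + bj and multiply by e^{j*angle}."""
    out = []
    for i, angle in enumerate(angles):
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * angle)
        out.extend([z.real, z.imag])
    return out


def apply_rope_real(x, cos_sin):
    """Real path: the same rotation from a (cos, sin) tuple, with no complex dtype."""
    cos, sin = cos_sin
    out = []
    for i in range(len(cos)):
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * cos[i] - b * sin[i], a * sin[i] + b * cos[i]])
    return out


angles = rope_angles(position=7, head_dim=8)
cos_sin = ([math.cos(t) for t in angles], [math.sin(t) for t in angles])
x = [0.5, -1.25, 2.0, 0.75, -0.3, 1.1, 0.0, -2.0]

complex_out = apply_rope_complex(x, angles)
real_out = apply_rope_real(x, cos_sin)

# Largest element-wise difference between the two paths.
max_diff = max(abs(c - r) for c, r in zip(complex_out, real_out))
```

Both paths compute the same 2D rotation per channel pair, since (a + bj)(cos t + j sin t) = (a cos t - b sin t) + (a sin t + b cos t)j, so any difference comes only from floating-point rounding.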

Tested on Cambricon MLU and NVIDIA A100: both successfully generate 1024x1024 images, with outputs numerically equivalent to complex mode (max diff < 1e-6).

Test code:

import torch
from diffusers import ZImagePipeline

model_id = "/data/sd/sd_models/hf_models/Tongyi-MAI/Z-Image-Turbo/"
# 1. Load the pipeline
# Use bfloat16 for optimal performance on supported GPUs
pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
)
pipe.to("mlu")

# [Optional] Attention Backend
# Diffusers uses SDPA by default. Switch to Flash Attention for better efficiency if supported:
# pipe.transformer.set_attention_backend("flash")    # Enable Flash-Attention-2
# pipe.transformer.set_attention_backend("_flash_3") # Enable Flash-Attention-3

# [Optional] Model Compilation
# Compiling the DiT model accelerates inference, but the first run will take longer to compile.
# pipe.transformer.compile()

# [Optional] CPU Offloading
# Enable CPU offloading for memory-constrained devices.
# pipe.enable_model_cpu_offload()

prompt = "Young Chinese woman in red Hanfu, intricate embroidery. Impeccable makeup, red floral forehead pattern. Elaborate high bun, golden phoenix headdress, red flowers, beads. Holds round folding fan with lady, trees, bird. Neon lightning-bolt lamp (⚡️), bright yellow glow, above extended left palm. Soft-lit outdoor night background, silhouetted tiered pagoda (西安大雁塔), blurred colorful distant lights."

# 2. Generate Image
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # This actually results in 8 DiT forwards
    guidance_scale=0.0,     # Guidance should be 0 for the Turbo models
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]

image.save("example.png")

The test case is taken from https://huggingface.co/Tongyi-MAI/Z-Image-Turbo. When testing on the Cambricon MLU platform, set use_real=True; the script then produces the output below:

python zimage_demo.py 
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 34.44it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00,  1.48s/it]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.02s/it]

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:03<00:00,  2.27it/s]
[Image: z_image_mlu_real_rope]

Who can review?

@sayakpaul

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

github-actions bot added the models and size/L (PR with diff > 200 LOC) labels May 28, 2026
Add optional real-number RoPE implementation to Z-Image transformer and
controlnet. When use_real=True, the rotary position embeddings use (cos, sin)
tuples instead of complex numbers, enabling the model to run on platforms
that don't support complex arithmetic (e.g., MLU).

Changes:
- Add apply_rotary_emb() with use_real parameter supporting both complex
  and real computation
- Propagate use_real through ZSingleStreamAttnProcessor, ZImageTransformerBlock,
  RopeEmbedder, ZImageTransformer2DModel, and controlnet variants
- Update _prepare_sequence and _build_unified_sequence to handle (cos, sin)
  tuples
- Default use_real=False maintains backward compatibility
- Replace hardcoded cuda autocast with device-aware torch.autocast for Z-Image

Tested on MLU: successfully generates 1024x1024 images with
numerical equivalence (max diff < 1e-6) compared to complex mode.
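
The device-aware autocast change mentioned in the commit message could look roughly like the sketch below. The function name `rotate_in_full_precision` and the tensor shapes are hypothetical; the only point shown is deriving `device_type` from the input tensor instead of hardcoding `"cuda"`. Whether a given backend (such as MLU) accepts its device type in `torch.autocast` depends on that backend's PyTorch integration.

```python
import torch


def rotate_in_full_precision(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Apply a real-valued rotary embedding with autocast disabled on x's own device."""
    # A hardcoded variant would pass "cuda" here regardless of where x lives.
    with torch.autocast(device_type=x.device.type, enabled=False):
        # Split the last dimension into (even, odd) channel pairs in float32.
        pairs = x.float().reshape(*x.shape[:-1], -1, 2)
        a, b = pairs[..., 0], pairs[..., 1]
        out = torch.stack([a * cos - b * sin, a * sin + b * cos], dim=-1)
    # Merge the pairs back and restore the input dtype.
    return out.flatten(-2).to(x.dtype)


x = torch.randn(2, 4, 8, dtype=torch.bfloat16)  # (batch, seq, head_dim)
angles = torch.rand(4, 4)                        # (seq, head_dim // 2)
y = rotate_in_full_precision(x, angles.cos(), angles.sin())
```

Reading the device type from the tensor keeps the same code path usable on CPU, CUDA, and other accelerators without a per-platform branch.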
st7109 force-pushed the optimize-z-image branch from 3f13d27 to 4ecd72f June 2, 2026 04:07

st7109 commented Jun 2, 2026

@sayakpaul @yiyixuxu Hello, please help review this commit. If you have any suggestions, let me know. Thanks.

st7109 mentioned this pull request Jun 3, 2026 (6 tasks)