Conversation
Thanks for the PR! Some preliminary design questions and comments:
|
```python
self.register_to_config(
    seq_len=seq_len,
    num_inference_steps=num_inference_steps,
    inject_start_token=inject_start_token,
)
```
Generally we don't register default `__call__` arguments to the config, but rather set them as default arguments to the `__call__` method:
diffusers/src/diffusers/pipelines/ltx2/pipeline_ltx2.py, lines 744–752 in d4f97d1
```python
*,
batch_size: int = 1,
```
diffusers pipelines usually don't make `__call__` arguments keyword-only. (That's not to say there are no arguments in favor of it, but since other pipelines allow positional arguments, I think the expectation is that discrete diffusion pipelines will allow them as well.)
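For illustration, here is a minimal sketch of a signature without the bare `*`; the argument names are hypothetical, not the PR's actual signature:

```python
class TokenDiffusionPipelineSketch:
    # Hypothetical sketch: without a bare `*` in the signature, arguments keep
    # their defaults but can also be passed positionally, matching the calling
    # convention of other diffusers pipelines.
    def __call__(self, prompt=None, seq_len=256, num_inference_steps=50, batch_size=1):
        return {"seq_len": seq_len, "num_inference_steps": num_inference_steps, "batch_size": batch_size}


pipe = TokenDiffusionPipelineSketch()
out = pipe("a prompt", 128, 10)  # positional arguments now work
```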
```python
if seq_len is None:
    seq_len = int(self.config.seq_len)
if num_inference_steps is None:
    num_inference_steps = int(self.config.num_inference_steps)
if inject_start_token is None:
    inject_start_token = bool(self.config.inject_start_token)
```
Following up on #12911 (comment), this logic could be removed if we don't register default arguments to the config.
```python
if infill_mask is not None:
    if infill_mask.shape != (batch_size, seq_len):
        raise ValueError(
            f"`infill_mask` must have shape {(batch_size, seq_len)}, got {tuple(infill_mask.shape)}."
        )
```
I think input checking and exceptions should be moved to a `check_inputs` method, which is the usual practice for diffusers pipelines:
diffusers/src/diffusers/pipelines/flux2/pipeline_flux2.py, lines 686–693 in d4f97d1
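As a sketch (the argument list here is an assumption, and in the pipeline this would be a `self.check_inputs(...)` method called from `__call__`), the shape check could move into a helper like:

```python
def check_inputs(batch_size, seq_len, infill_mask=None):
    # Hypothetical standalone version of the usual diffusers check_inputs
    # pattern: validate everything up front and raise before any compute.
    if infill_mask is not None and tuple(infill_mask.shape) != (batch_size, seq_len):
        raise ValueError(
            f"`infill_mask` must have shape {(batch_size, seq_len)}, got {tuple(infill_mask.shape)}."
        )
```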
```python
            return int(token_id)
        return None

    def _init_latents(
```
We usually name methods which sample latents from the prior distribution `prepare_latents`:
```python
if hasattr(self.scheduler, "forward_process") and getattr(self.scheduler, "forward_process") == "uniform":
    # Uniform prior over token IDs. Mirror scheduler's exclude-mask behavior.
    if getattr(self.scheduler, "exclude_mask_from_uniform", False) and hasattr(
        self.scheduler, "_sample_uniform_tokens"
    ):
        return self.scheduler._sample_uniform_tokens(
            torch.Size((batch_size, seq_len)),
            device=device,
            dtype=torch.long,
            generator=generator,
        )
    vocab_size = int(getattr(self.scheduler, "vocab_size", 0))
    if vocab_size <= 0:
        raise ValueError("Scheduler must define `vocab_size` for uniform prior sampling.")
    return torch.randint(
        0, vocab_size, (batch_size, seq_len), device=device, dtype=torch.long, generator=generator
    )
```
Suggestion: maybe it would be cleaner to define a scheduler method called (say) `sample_prior` which samples from the prior distribution based on the configured `forward_process`? So if `self.forward_process == "uniform"`, we would call `_sample_uniform_tokens` under the hood in `sample_prior` to sample from a uniform prior distribution.
I think this would allow for more graceful support of other possible forward processes, and make the pipeline code cleaner (as most of the logic would be handled inside the scheduler).
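A rough pure-Python sketch of that idea (method and attribute names are assumptions; the real implementation would operate on torch tensors and dispatch to `_sample_uniform_tokens` internally):

```python
import random


class SchedulerPriorSketch:
    """Hypothetical scheduler fragment with a sample_prior dispatch method."""

    def __init__(self, forward_process="absorbing", vocab_size=32, mask_token_id=31):
        self.forward_process = forward_process
        self.vocab_size = vocab_size
        self.mask_token_id = mask_token_id

    def sample_prior(self, batch_size, seq_len, rng=None):
        rng = rng or random.Random()
        if self.forward_process == "absorbing":
            # Absorbing prior: every position starts at the mask token.
            return [[self.mask_token_id] * seq_len for _ in range(batch_size)]
        if self.forward_process == "uniform":
            # Uniform prior over token IDs.
            return [[rng.randrange(self.vocab_size) for _ in range(seq_len)]
                    for _ in range(batch_size)]
        raise ValueError(f"Unsupported forward_process: {self.forward_process!r}")
```

The pipeline would then just call `self.scheduler.sample_prior(...)` without inspecting scheduler internals.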
```python
self.vocab_size = int(vocab_size)
self.mask_token_id = int(mask_token_id)
self.num_train_timesteps = int(num_train_timesteps)
self.t_eps = float(t_eps)
```
Suggested change: delete these assignments.

I think this is unnecessary, as `register_to_config` should make these available as `self.config.vocab_size`, `self.config.mask_token_id`, etc.
```python
p_uniform = max(math.exp(-float(clip_noise)), float(p_uniform))
log_B = float(gamma) * math.log(2.0) + math.log(p_uniform) - math.log(1.0 - p_uniform)
log_B = float(np.clip(log_B, -float(clip_noise), float(clip_noise)))
self.log_B = float(log_B)
self.log_gamma = float(math.log(float(gamma)))
```
Suggested change:

```python
p_uniform = max(math.exp(-clip_noise), p_uniform)
log_B = gamma * math.log(2.0) + math.log(p_uniform) - math.log(1.0 - p_uniform)
log_B = np.clip(log_B, -clip_noise, clip_noise)
self.log_B = float(log_B)
self.log_gamma = math.log(gamma)
```

Can we remove all the `int(...)` and `float(...)` casts here and elsewhere?
```python
class HybridTokenDiffusionScheduler(SchedulerMixin, ConfigMixin):
    """
    Hybrid-transition discrete token diffusion scheduler.
```
Can the `__init__` arguments (such as `p_uniform`, `clip_noise`, `gamma`, etc.) be documented in the docstring here, including what they mean and what values might be reasonable for them?
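Something along these lines; the descriptions below are guesses inferred from the quoted code, to be corrected by the author:

```python
class HybridTokenDiffusionScheduler:
    """
    Hybrid-transition discrete token diffusion scheduler.

    Args:
        p_uniform (`float`, defaults to 0.0):
            Maximum probability that a noised token transitions to a uniformly
            random token rather than to the mask token. 0.0 recovers a purely
            absorbing (mask-only) forward process.
        clip_noise (`float`, defaults to 20.0):
            Clamp applied to log-space quantities for numerical stability; note
            that it also floors the effective `p_uniform` at `exp(-clip_noise)`.
        gamma (`float`, defaults to 1.0):
            Exponent entering the hybrid transition weight
            `log_B = gamma * log(2) + log(p_uniform) - log(1 - p_uniform)`.
    """

    def __init__(self, p_uniform=0.0, clip_noise=20.0, gamma=1.0):
        self.p_uniform = p_uniform
        self.clip_noise = clip_noise
        self.gamma = gamma
```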
```python
p_uniform: float = 0.0,
clip_noise: float = 20.0,
```
The default parameters here set the effective `p_uniform` to `max(exp(-20.0), 0.0) = exp(-20.0) ≈ 2e-9`, i.e. effectively zero. If I understand correctly, `p_uniform` corresponds to the maximum probability that tokens will transition to another token uniformly at random instead of to the mask token. Since the defaults make uniform transitions negligible, would this scheduler then reduce to a purely absorbing forward process (i.e. behave like `TokenDiffusionScheduler`)?
```python
elif noise_type == "cosine":
    return 1.0 - (1.0 - eps) * torch.cos(t * math.pi / 2.0)
```
The alpha schedules here are different from the alpha schedules defined in `TokenDiffusionScheduler`, is this intended? See also #12911 (comment).
```python
def _compute_move_chance(self, t: torch.Tensor) -> torch.Tensor:
    """
    Compute the probability that a token has been masked (move chance) at continuous time *t*.
```
If I understand correctly, the move chance here plays the role of `1 - alpha_t` in other schedulers. Documenting the relationship between `move_chance` and `alpha_t` would be useful, since it would make it easier to compare this scheduler with similar schedulers such as `TokenDiffusionScheduler`.
```python
        `torch.Tensor`: Move chance at each timestep value, same shape as *t*.
    """
    noise_type = self.config.noise_type
    eps = 1e-3
```
I think `eps` should be configurable via `__init__`, following `TokenDiffusionScheduler`.
```python
# Compute move chances at t and s = t - dt
# ------------------------------------------------------------------
move_chance_t = self._compute_move_chance(t).to(dtype=torch.float64)
move_chance_s = self._compute_move_chance(t - dt).to(dtype=torch.float64)
```
I think getting `s` from the timestep schedule would be better in case we want to support non-linspace timestep schedules.
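A minimal sketch of reading `s` from the schedule rather than assuming a fixed `dt` (function and variable names are assumptions):

```python
def get_step_times(timesteps, step_index):
    # Read t and the next (smaller) timestep s directly from the schedule,
    # so non-linspace schedules (e.g. cosine-spaced) are handled correctly.
    t = timesteps[step_index]
    s = timesteps[step_index + 1] if step_index + 1 < len(timesteps) else 0.0
    return t, s
```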
```python
# Subs parameterization: mask token gets -inf, then log_softmax normalizes.
# For unmasked positions, the distribution is forced to be the identity.
logits[..., mask_token_id] = -1e9
logits = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
```
My understanding is that this computes a log softmax; would using something like `torch.special.log_softmax` be better here? Also I think it might be clearer to rename this to something like `log_probs`.
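In pure-Python terms, the normalization being computed is the following (a sketch only; the real code operates on torch tensors, where a dedicated log-softmax call would also be more numerically careful than subtracting a logsumexp):

```python
import math


def masked_log_softmax(logits, mask_token_id):
    # Forbid the mask token, then normalize so the outputs are log-probabilities.
    logits = list(logits)
    logits[mask_token_id] = -1e9
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]
```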
```python
gumbel_noise = -(torch.rand_like(q_xs, generator=generator) + 1e-10).log()
gumbel_noise = (1e-10 + gumbel_noise).clamp(min=1e-30)
x_block = (q_xs / gumbel_noise).argmax(dim=-1)
```
Could we reuse the `_gumbel_argmax` function defined in `scheduling_token_diffusion.py` here?
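For reference, a Gumbel-max draw looks like this in pure Python (the actual `_gumbel_argmax` in `scheduling_token_diffusion.py` is tensor-based; this is only an assumed sketch of the behavior being shared):

```python
import math
import random


def gumbel_argmax(log_probs, rng):
    # Add Gumbel(0, 1) noise to each log-probability and take the argmax;
    # this is equivalent to sampling from the categorical distribution.
    noisy = [lp - math.log(-math.log(rng.random() + 1e-10) + 1e-10) for lp in log_probs]
    return max(range(len(noisy)), key=noisy.__getitem__)
```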
```python
def step(
    self,
    draft_tokens: torch.LongTensor,
    target_logits: torch.Tensor,
```
Would it be possible to refactor the DFlash `step` method such that it follows the standard `step` interface?
diffusers/src/diffusers/schedulers/scheduling_block_refinement.py, lines 166–170 in e365d74
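A hypothetical shape for such a refactor (the class names and the trivial update rule are placeholders, not the actual DFlash logic, which would verify the drafted tokens against the target logits):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchedulerOutputSketch:
    # Mirrors the shape of diffusers scheduler outputs: the updated sample.
    prev_sample: List[int] = field(default_factory=list)


class DFlashSchedulerSketch:
    def step(self, model_output, timestep, sample, return_dict=True):
        # Placeholder update rule: accept the drafted tokens wholesale.
        prev_sample = list(model_output)
        if not return_dict:
            return (prev_sample,)
        return SchedulerOutputSketch(prev_sample=prev_sample)
```

Keeping `(model_output, timestep, sample)` as the leading arguments would let this scheduler slot into pipelines that assume the standard interface.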
Pipelines:
- TokenDiffusion: add usage example, remove unused attention_mask, add sample_prior to scheduler, inline enforce_fixed_masks
- HybridTokenDiffusion: consolidate into thin wrapper over TokenDiffusion
- SDAR: rename cur_x to block_x, remove int() casts, remove unused vars
- DFlash: inline _get_target_layer_ids, remove eval() calls, remove from_pretrained override, simplify model support
- BD3LM: refactor block logic parallel with LLaDA2

Schedulers:
- TokenDiffusion: pre-compute alpha schedule, cleaner if/elif in step, add sample_prior method
- HybridTokenDiffusion: remove redundant self.xxx assignments, document params, remove int/float casts
- BD3LM: make eps configurable, document move_chance vs alpha_t, use log_softmax, get s from timestep schedule, reuse _gumbel_argmax
- DFlash: refactor step to standard model_output interface
…indow context

Three features from the official BD3LM repo are now supported:
- first_hitting: faster equivalent sampler from Zheng et al. 2025 that uses t *= u^(1/num_masked) instead of uniform timestep spacing
- variable_length: entropy-based and EOS-based stopping for variable-length generation, truncating the output accordingly
- context_size: sliding window to cap the context passed to the model

All three features are added to the scheduler as methods (compute_first_hitting_timestep, check_variable_length_stop) and exposed as pipeline __call__ parameters.
What does this PR do?
Adds experimental support for discrete token diffusion methods and pipelines.
Moved LLaDA2 to its own PR: #13226
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.