Add Support for LTX-2.3 Models by dg845 · Pull Request #13217 · huggingface/diffusers

dg845 · 2026-03-06T04:10:04Z

What does this PR do?

This PR adds support for LTX-2.3 (official code, model weights), a new model in the LTX-2.X family of audio-video models. Compared to LTX-2.0, LTX-2.3 improves audio quality, visual quality, and prompt adherence.

T2V Example

import torch
from diffusers import LTX2Pipeline
from diffusers.pipelines.ltx2.export_utils import encode_video

model_id = "dg845/LTX-2.3-Diffusers"
device = "cuda"
dtype = torch.bfloat16
seed = 42

frame_rate = 24.0
width = 768
height = 512
num_inference_steps = 30

prompt = (
    "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in "
    "gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs "
    "before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small "
    "fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly "
    "shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a "
    "smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the "
    "distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a "
    "breath-taking, movie-like shot."
)
negative_prompt = (
    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)

pipe = LTX2Pipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()

generator = torch.Generator(device).manual_seed(seed)
video, audio = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=num_inference_steps,
    guidance_scale=3.0,
    stg_scale=1.0,
    modality_scale=3.0,
    guidance_rescale=0.7,
    audio_guidance_scale=7.0,
    audio_stg_scale=1.0,
    audio_modality_scale=3.0,
    audio_guidance_rescale=0.7,
    spatio_temporal_guidance_blocks=[28],
    generator=generator,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_3_t2v.mp4",
)

I2V Example

import torch
from diffusers import LTX2ImageToVideoPipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.utils import load_image

model_id = "dg845/LTX-2.3-Diffusers"
device = "cuda"
dtype = torch.bfloat16
seed = 42

frame_rate = 24.0
width = 768
height = 512
num_inference_steps = 30

prompt = (
    "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in "
    "gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs "
    "before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small "
    "fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly "
    "shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a "
    "smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the "
    "distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a "
    "breath-taking, movie-like shot."
)
negative_prompt = (
    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)

pipe = LTX2ImageToVideoPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()

generator = torch.Generator(device).manual_seed(seed)
video, audio = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=num_inference_steps,
    guidance_scale=3.0,
    stg_scale=1.0,
    modality_scale=3.0,
    guidance_rescale=0.7,
    audio_guidance_scale=7.0,
    audio_stg_scale=1.0,
    audio_modality_scale=3.0,
    audio_guidance_rescale=0.7,
    spatio_temporal_guidance_blocks=[28],
    generator=generator,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_3_i2v.mp4",
)

FLF2V Example

import torch
from diffusers import LTX2ConditionPipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
from diffusers.utils import load_image

model_id = "dg845/LTX-2.3-Diffusers"
device = "cuda"
dtype = torch.bfloat16
seed = 42

frame_rate = 24.0
width = 768
height = 512
num_inference_steps = 30

prompt = (
    "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are "
    "delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright "
    "sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, "
    "low-angle perspective."
)
negative_prompt = (
    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)

first_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png"
)
last_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png"
)
first_cond = LTX2VideoCondition(frames=first_image, index=0, strength=1.0)
last_cond = LTX2VideoCondition(frames=last_image, index=-1, strength=1.0)
conditions = [first_cond, last_cond]

pipe = LTX2ConditionPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()

generator = torch.Generator(device).manual_seed(seed)
video, audio = pipe(
    conditions=conditions,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=num_inference_steps,
    guidance_scale=3.0,
    stg_scale=1.0,
    modality_scale=3.0,
    guidance_rescale=0.7,
    audio_guidance_scale=7.0,
    audio_stg_scale=1.0,
    audio_modality_scale=3.0,
    audio_guidance_rescale=0.7,
    spatio_temporal_guidance_blocks=[28],
    generator=generator,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_3_flf2v.mp4",
)

I2V Two Stage Example

import torch
from diffusers import FlowMatchEulerDiscreteScheduler, LTX2ImageToVideoPipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2 import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.utils import STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.utils import load_image

model_id = "dg845/LTX-2.3-Diffusers"
device = "cuda"
dtype = torch.bfloat16
seed = 42

frame_rate = 24.0
width = 768
height = 512
num_inference_steps = 30

prompt = (
    "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in "
    "gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs "
    "before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small "
    "fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly "
    "shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a "
    "smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the "
    "distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a "
    "breath-taking, movie-like shot."
)
negative_prompt = (
    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)

pipe = LTX2ImageToVideoPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()

generator = torch.Generator(device).manual_seed(seed)
video_latent, audio_latent = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=num_inference_steps,
    guidance_scale=3.0,
    stg_scale=1.0,
    modality_scale=3.0,
    guidance_rescale=0.7,
    audio_guidance_scale=7.0,
    audio_stg_scale=1.0,
    audio_modality_scale=3.0,
    audio_guidance_rescale=0.7,
    spatio_temporal_guidance_blocks=[28],
    use_cross_timestep=True,
    generator=generator,
    output_type="latent",
    return_dict=False,
)

latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
    "dg845/LTX-2.3-Spatial-Upsampler-Diffusers",
    subfolder="latent_upsampler",
    torch_dtype=dtype,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)
upscaled_video_latent = upsample_pipe(
    latents=video_latent,
    output_type="latent",
    return_dict=False,
)[0]

pipe.load_lora_weights(
    "Lightricks/LTX-2.3",
    adapter_name="stage_2_distilled",
    weight_name="ltx-2.3-22b-distilled-lora-384.safetensors",
)
pipe.set_adapters("stage_2_distilled", 1.0)
# Change scheduler to use Stage 2 distilled sigmas as is
new_scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, use_dynamic_shifting=False, shift_terminal=None
)
pipe.scheduler = new_scheduler

video, audio = pipe(
    image=image,
    latents=upscaled_video_latent,
    audio_latents=audio_latent,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width * 2,
    height=height * 2,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=3,
    noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0],
    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
    guidance_scale=1.0,  # For Stage 2 distilled, disable all guidance
    stg_scale=0.0,
    modality_scale=1.0,
    guidance_rescale=0.0,
    audio_guidance_scale=1.0,
    audio_stg_scale=0.0,
    audio_modality_scale=1.0,
    audio_guidance_rescale=0.0,
    spatio_temporal_guidance_blocks=None,
    use_cross_timestep=True,
    generator=generator,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_3_i2v_two_stage.mp4",
)
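
The resolution bookkeeping between the two stages can be sketched as follows. This is a rough illustration, not code from the PR: `latent_shape` is a hypothetical helper, and the compression ratios (32x spatial, 8x temporal) are an assumption about the video VAE rather than something stated in this thread. The only facts taken from the example above are `num_frames=121`, the 768x512 first-stage size, and the doubled `width * 2`, `height * 2` second-stage size.

```python
def latent_shape(num_frames, height, width, spatial_ratio=32, temporal_ratio=8):
    """Hypothetical helper: (latent_frames, latent_height, latent_width).

    The ratios are assumed values, not read from any model config.
    """
    if (num_frames - 1) % temporal_ratio != 0:
        raise ValueError("num_frames is expected to have the form temporal_ratio * k + 1")
    if height % spatial_ratio != 0 or width % spatial_ratio != 0:
        raise ValueError("height and width are expected to be multiples of spatial_ratio")
    latent_frames = (num_frames - 1) // temporal_ratio + 1
    return (latent_frames, height // spatial_ratio, width // spatial_ratio)


# Stage 1 runs at the base resolution.
stage_1 = latent_shape(num_frames=121, height=512, width=768)

# Stage 2 runs at twice the spatial resolution; the frame count is unchanged.
stage_2 = latent_shape(num_frames=121, height=512 * 2, width=768 * 2)
```

Under these assumptions the second-stage latent has twice the height and width of the first-stage latent and the same number of latent frames, which is the shape the spatial upsampler output would need to have.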

I2V Distilled Example

import torch
from diffusers import LTX2ImageToVideoPipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2 import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.utils import load_image

model_id = "dg845/LTX-2.3-Distilled-Diffusers"
device = "cuda"
dtype = torch.bfloat16
seed = 42

frame_rate = 24.0
width = 768
height = 512
num_inference_steps = 30

prompt = (
    "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in "
    "gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs "
    "before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small "
    "fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly "
    "shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a "
    "smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the "
    "distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a "
    "breath-taking, movie-like shot."
)
negative_prompt = None

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)

pipe = LTX2ImageToVideoPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()

generator = torch.Generator(device).manual_seed(seed)
video_latent, audio_latent = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=8,
    sigmas=DISTILLED_SIGMA_VALUES,
    guidance_scale=1.0,  # Disable all guidance for distilled inference
    stg_scale=0.0,
    modality_scale=1.0,
    guidance_rescale=0.0,
    audio_guidance_scale=1.0,
    audio_stg_scale=0.0,
    audio_modality_scale=1.0,
    audio_guidance_rescale=0.0,
    spatio_temporal_guidance_blocks=None,
    use_cross_timestep=True,
    generator=generator,
    output_type="latent",
    return_dict=False,
)

latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
    "dg845/LTX-2.3-Spatial-Upsampler-Diffusers",
    subfolder="latent_upsampler",
    torch_dtype=dtype,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)
upscaled_video_latent = upsample_pipe(
    latents=video_latent,
    output_type="latent",
    return_dict=False,
)[0]

video, audio = pipe(
    image=image,
    latents=upscaled_video_latent,
    audio_latents=audio_latent,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width * 2,
    height=height * 2,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=3,
    noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0],
    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
    guidance_scale=1.0,  # Disable all guidance for distilled inference
    stg_scale=0.0,
    modality_scale=1.0,
    guidance_rescale=0.0,
    audio_guidance_scale=1.0,
    audio_stg_scale=0.0,
    audio_modality_scale=1.0,
    audio_guidance_rescale=0.0,
    spatio_temporal_guidance_blocks=None,
    use_cross_timestep=True,
    generator=generator,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_3_i2v_distilled.mp4",
)

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yiyixuxu
@sayakpaul

yiyixuxu · 2026-03-06T18:33:47Z

src/diffusers/models/transformers/transformer_ltx2.py

        return hidden_states


+class LTX2PerturbedAttnProcessor:


I think this is just a guider https://github.com/huggingface/diffusers/blob/main/src/diffusers/guiders/skip_layer_guidance.py

Thanks! Looking at the code, it's unclear to me whether SkipLayerGuidance currently works for LTX-2.3 for the following reasons:

Not attention backend agnostic: if I understand correctly, STG is implemented through AttentionProcessorSkipHook, which uses AttentionScoreSkipFunctionMode to intercept calls to torch.nn.functional.scaled_dot_product_attention to simply return the value:

diffusers/src/diffusers/hooks/layer_skip.py

Line 93 in e747fe4

if func is torch.nn.functional.scaled_dot_product_attention:

But I think other attention backends like flash-attn won't call that function and thus will not work with SkipLayerGuidance.

LTX-2.3 does additional computation on the values: it processes the values with learned per-head gates before sending them to the attention output projection to_out. The current SkipLayerGuidance implementation does not support this.

I'm not sure whether these issues can be resolved with changes to the SkipLayerGuidance implementation or whether something like a new attention processor would make more sense here.

I have opened a PR with a possible modification to SkipLayerGuidance to allow it to better support LTX-2.3 at #13220.
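
The backend concern can be illustrated with a toy model that needs no PyTorch. This is an analogy only, not the diffusers hook code: `sdpa`, `flash_like_attention`, `Dispatcher`, and `skip_attention_scores` are all hypothetical names. The one detail borrowed from the thread is that interception relies on an identity check against a single function.

```python
from contextlib import contextmanager


# Hypothetical stand-ins for two attention backends. They return marker strings
# instead of doing real math so the control flow is easy to follow.
def sdpa(query, key, value):
    return f"attn({query},{key},{value})"


def flash_like_attention(query, key, value):
    return f"flash({query},{key},{value})"


class Dispatcher:
    """Toy dispatcher: every backend call is routed through `call`."""

    def __init__(self):
        self.interceptor = None

    def call(self, func, *args):
        if self.interceptor is not None:
            return self.interceptor(func, *args)
        return func(*args)


@contextmanager
def skip_attention_scores(dispatcher):
    """While active, replace the output of `sdpa` (and only `sdpa`) with `value`."""

    def interceptor(func, query, key, value):
        if func is sdpa:  # identity check against one specific function
            return value
        return func(query, key, value)

    dispatcher.interceptor = interceptor
    try:
        yield
    finally:
        dispatcher.interceptor = None


dispatcher = Dispatcher()
with skip_attention_scores(dispatcher):
    skipped = dispatcher.call(sdpa, "q", "k", "v")
    not_skipped = dispatcher.call(flash_like_attention, "q", "k", "v")
```

In this toy setup `skipped` is just the value, while the second backend runs unmodified because it never matches the identity check. That is the failure mode described in the first point above.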

This is a good callout! From my understanding, guider as a component doesn't change much. LTX-2 is probably an exception. If more models start to do their own form of SLG, we could think of giving them their own guider classes / attention processors. But for now, I think modifications to the existing SLG class make more sense.

Let's merge LTX-2.3 with a special custom attention processor in this PR first, ASAP.

The design in the other PR to refactor the guider is fundamentally wrong: the purpose of hooks (and guiders as well) is to modify behavior from the outside, without the model needing to be aware of them or implement logic specific to them.
I will look into refactoring with guiders in the follow-up modular PR.

The point on guiders not being backend agnostic is a good thing to keep in mind.


HuggingFaceDocBuilderDev · 2026-03-10T01:55:37Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

dg845 · 2026-03-10T08:17:59Z

LTX-2.3 diffusers converted checkpoint: dg845/LTX-2.3-Diffusers (may still have bugs).

dg845 · 2026-03-11T06:00:22Z

I2V sample using the example above:

ltx2_3_i2v_stage_1.mp4

This uses dg845/LTX-2.3-Diffusers with CFG + STG + modality guidance with the LTX-2.3 default guidance scales.

tin2tin · 2026-03-11T15:14:05Z

Tried the i2v example, and got this error:

ValueError: Spatio-Temporal Guidance (STG) is specified but no STG blocks are supplied.
Please supply a list of block indices at which to apply STG in `spatio_temporal_guidance_blocks`

Adding this seems to make it work:

    spatio_temporal_guidance_blocks=[6,7,8,9,10],

I don't know if this is the correct way to solve this, but the examples should probably be updated to deal with this problem.
A two-stage example would be good to include as well. The second stage cleaned up the pixel jitter pattern seen in your astronaut video.
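
A minimal sketch of the kind of input check that would produce the error above. `check_stg_inputs` is a hypothetical helper, not the pipeline's actual validation code, and treating `stg_scale > 0` as "STG enabled" is an assumption. The error text is copied from the report. Which block indices are appropriate is not settled by this sketch; both `[28]` and `[6,7,8,9,10]` appear in this thread.

```python
def check_stg_inputs(stg_scale, spatio_temporal_guidance_blocks):
    """Hypothetical validation helper; returns the block indices to perturb."""
    # Assumption: STG counts as enabled whenever its scale is positive.
    if stg_scale <= 0.0:
        return []
    if not spatio_temporal_guidance_blocks:
        raise ValueError(
            "Spatio-Temporal Guidance (STG) is specified but no STG blocks are supplied. "
            "Please supply a list of block indices at which to apply STG in "
            "`spatio_temporal_guidance_blocks`"
        )
    return list(spatio_temporal_guidance_blocks)


enabled_blocks = check_stg_inputs(1.0, [28])
disabled_blocks = check_stg_inputs(0.0, None)
```

With this check, passing `stg_scale=1.0` without a block list fails early, while `stg_scale=0.0` together with `spatio_temporal_guidance_blocks=None` is accepted.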

dg845 · 2026-03-13T04:14:14Z

LTX-2.3 diffusers latent spatial upsampling pipeline: dg845/LTX-2.3-Spatial-Upsampler-Diffusers (may still have bugs).

dg845 · 2026-03-14T01:11:36Z

LTX-2.3 diffusers distilled pipeline: dg845/LTX-2.3-Distilled-Diffusers.

sayakpaul

Thanks so much for the changes and for your patience.

The number of changes (and the difficulty of navigating them) is a bit overwhelming, TBH.

I have left a few comments. Let me know if they make sense. We could consider adding a test-suite mirroring the existing LTX-2 pipeline tests but changing the components with changes specific to LTX-2.3?

sayakpaul · 2026-03-16T10:05:53Z

scripts/convert_ltx2_to_diffusers.py

    LTX2VideoTransformer3DModel,
 )
-from diffusers.pipelines.ltx2 import LTX2LatentUpsamplerModel, LTX2TextConnectors, LTX2Vocoder
+from diffusers.pipelines.ltx2 import LTX2LatentUpsamplerModel, LTX2TextConnectors, LTX2Vocoder, LTX2VocoderWithBWE


Any issue in unifying LTX2Vocoder and LTX2VocoderWithBWE?

I think there is no issue in principle. But because LTX2VocoderWithBWE contains two LTX2Vocoders as submodules it was more natural to me to wrap them in a new module (and it's also more parallel to the original code).

sayakpaul · 2026-03-16T10:06:36Z

src/diffusers/loaders/lora_conversion_utils.py

            "q_norm": "norm_q",
            "k_norm": "norm_k",
+            # LTX-2.3
+            "audio_prompt_adaln_single": "audio_prompt_adaln",


Where did this pop up? Distillation checkpoint?

The prompt_adaln and audio_prompt_adaln modules are used by both the full model and distilled model to calculate scale/shift modulation parameters for the text encoder_hidden_states for the video and audio modalities respectively. (I believe this is in place of the caption_projections, which were removed in LTX-2.3.)
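
A minimal sketch of how a substring rename table like the one in the diff can be applied to checkpoint keys. `rename_keys` is a hypothetical helper rather than the function in `lora_conversion_utils.py`, and the two sample keys are made up for illustration. Only the three table entries come from the diff above.

```python
# Rename table: original checkpoint substring -> diffusers substring.
RENAME_MAP = {
    "q_norm": "norm_q",
    "k_norm": "norm_k",
    "audio_prompt_adaln_single": "audio_prompt_adaln",
}


def rename_keys(state_dict, rename_map):
    """Return a new dict whose keys have every matching substring replaced."""
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        for old, new in rename_map.items():
            new_key = new_key.replace(old, new)
        renamed[new_key] = value
    return renamed


# Made-up keys, used only to exercise the table.
example_state_dict = {
    "audio_prompt_adaln_single.linear.lora_A.weight": "tensor_a",
    "blocks.0.attn1.q_norm.weight": "tensor_b",
}
converted = rename_keys(example_state_dict, RENAME_MAP)
```

Substring replacement is order-sensitive when one pattern is a prefix of another, so a real table may need longer patterns to run before shorter ones.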

sayakpaul · 2026-03-16T10:08:06Z

src/diffusers/models/autoencoders/autoencoder_kl_ltx2.py

        resnet_eps: float = 1e-6,
        resnet_act_fn: str = "swish",
        spatio_temporal_scale: bool = True,
+        upsample_type: str = "spatiotemporal",


Should this go at the end of the init params, to avoid breaking backward compatibility in case someone is passing positional arguments?

I put upsample_type there because it follows the argument ordering of LTX2VideoDownBlock3D, which already used an analogous downsample_type argument. I think the positional argument point is valid, but IMO there is less risk of it breaking things, as I think it's less likely that users are explicitly calling LTX2VideoUpBlock3d on its own.
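
The positional-argument risk under discussion can be shown with plain functions. Everything here is hypothetical: the three functions are simplified stand-ins, not the real block constructors, and the trailing `upscale_factor` parameter is assumed purely for illustration.

```python
def up_block_v1(in_channels, spatio_temporal_scale=True, upscale_factor=1):
    """Stand-in for the signature before the change."""
    return {"spatio_temporal_scale": spatio_temporal_scale, "upscale_factor": upscale_factor}


def up_block_mid_insert(in_channels, spatio_temporal_scale=True, upsample_type="spatiotemporal", upscale_factor=1):
    """New parameter inserted in the middle of the signature."""
    return {
        "spatio_temporal_scale": spatio_temporal_scale,
        "upsample_type": upsample_type,
        "upscale_factor": upscale_factor,
    }


def up_block_keyword_only(in_channels, spatio_temporal_scale=True, upscale_factor=1, *, upsample_type="spatiotemporal"):
    """New parameter appended as keyword-only, so old positional calls keep their meaning."""
    return {
        "spatio_temporal_scale": spatio_temporal_scale,
        "upsample_type": upsample_type,
        "upscale_factor": upscale_factor,
    }


# The same positional call against each signature.
before = up_block_v1(128, True, 2)
shifted = up_block_mid_insert(128, True, 2)  # 2 silently binds to upsample_type
safe = up_block_keyword_only(128, True, 2)  # 2 still binds to upscale_factor
```

The mid-signature insert does not raise an error; it silently rebinds the third positional argument. That only matters if callers pass these arguments positionally, which is the likelihood the reply above is weighing.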

sayakpaul · 2026-03-16T10:11:09Z

src/diffusers/models/autoencoders/autoencoder_kl_ltx2.py

-                    LTXVideoUpsampler3d(
-                        out_channels * upscale_factor,
+            self.upsamplers = nn.ModuleList()
+


It seems like stride is the only thing that varies with upsample_type. So maybe we could do something like:

if upsample_type == "spatial":
    stride = (1, 2, 2)
elif upsample_type == "temporal":
    stride = (2, 1, 1)
elif upsample_type == "spatiotemporal":
    stride = (2, 2, 2)
self.upsamplers.append(..., stride=stride)

WDYT?
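
One way to express the suggestion is a lookup table with an explicit error for unknown types. This is a sketch under stated assumptions: `UPSAMPLE_STRIDES` and `get_upsample_stride` are hypothetical names, the stride tuples are read as (time, height, width), and the key "spatiotemporal" follows the default value shown in the earlier diff.

```python
# Assumed axis order: (time, height, width).
UPSAMPLE_STRIDES = {
    "spatial": (1, 2, 2),
    "temporal": (2, 1, 1),
    "spatiotemporal": (2, 2, 2),
}


def get_upsample_stride(upsample_type):
    """Hypothetical helper: map an upsample type to its stride tuple."""
    try:
        return UPSAMPLE_STRIDES[upsample_type]
    except KeyError:
        valid = ", ".join(sorted(UPSAMPLE_STRIDES))
        raise ValueError(f"Unknown upsample_type {upsample_type!r}; expected one of: {valid}") from None


spatial_stride = get_upsample_stride("spatial")
default_stride = get_upsample_stride("spatiotemporal")
```

Compared with an if/elif chain, the table fails loudly on a misspelled type instead of leaving `stride` undefined.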

sayakpaul · 2026-03-16T10:14:59Z

src/diffusers/models/transformers/transformer_ltx2.py

        return hidden_states


+class LTX2PerturbedAttnProcessor:


The point on guiders not being backend agnostic is a good thing to keep in mind.

sayakpaul · 2026-03-16T11:02:20Z

src/diffusers/pipelines/ltx2/pipeline_ltx2.py

+                        self_attention_mask=None,
+                        audio_self_attention_mask=None,


Are these used by the other pipelines, such as I2V?

I think they are not used by any currently implemented pipeline. They might be used in pipelines that are in the LTX-2 code but not yet implemented in diffusers.

sayakpaul · 2026-03-16T11:03:16Z

src/diffusers/pipelines/ltx2/pipeline_ltx2.py

-                    )
+                    noise_pred_video_uncond_text, noise_pred_video = noise_pred_video.chunk(2)
+                    # Use delta formulation as it works more nicely with multiple guidance terms
+                    video_cfg_delta = (self.guidance_scale - 1) * (noise_pred_video - noise_pred_video_uncond_text)


(note to other reviewers): guidance is computed a bit later to account for everything that comes before the computation.
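
A small numeric sketch of why the delta formulation composes well. It uses plain Python lists in place of tensors, and all function names are hypothetical. The CFG delta matches the line in the diff. The STG delta formula, `stg_scale * (cond - perturbed)`, is an assumption made for illustration and is not shown in this thread.

```python
def cfg_standard(uncond, cond, guidance_scale):
    """Textbook CFG, written from the unconditional prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


def scaled_difference(scale, cond, other):
    """One guidance term: scale * (cond - other), elementwise."""
    return [scale * (c - o) for c, o in zip(cond, other)]


def apply_deltas(cond, deltas):
    """Add any number of guidance deltas onto the conditional prediction."""
    guided = list(cond)
    for delta in deltas:
        guided = [g + d for g, d in zip(guided, delta)]
    return guided


uncond_text = [0.0, 1.0]
cond = [1.0, 3.0]
perturbed = [0.5, 2.0]  # stand-in for the STG-perturbed prediction

cfg_delta = scaled_difference(3.0 - 1, cond, uncond_text)
stg_delta = scaled_difference(1.0, cond, perturbed)  # assumed form of the STG term

cfg_only = apply_deltas(cond, [cfg_delta])
cfg_plus_stg = apply_deltas(cond, [cfg_delta, stg_delta])

# guidance_scale=1.0 and stg_scale=0.0 make every delta zero.
disabled = apply_deltas(
    cond,
    [scaled_difference(1.0 - 1, cond, uncond_text), scaled_difference(0.0, cond, perturbed)],
)
```

With a single term, the delta form reproduces textbook CFG. Additional terms are added independently, and the "disable all guidance" settings used in the distilled examples reduce the result to the conditional prediction.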

sayakpaul · 2026-03-16T11:06:35Z

src/diffusers/pipelines/ltx2/pipeline_ltx2.py

+
+                if self.do_modality_isolation_guidance:
+                    with self.transformer.cache_context("uncond_modality"):
+                        noise_pred_video_uncond_modality, noise_pred_audio_uncond_modality = self.transformer(


Do these calls vary from the previous ones in terms of the inputs? If so, it could be nice to add a small comment about it because the call arg list is pretty long.

I believe there is already an existing comment:

diffusers/src/diffusers/pipelines/ltx2/pipeline_ltx2.py

Lines 1319 to 1320 in 6ee66c9

# Turn off A2V and V2A cross attn to isolate video and audio modalities

isolate_modalities=True,

sayakpaul · 2026-03-16T11:07:02Z

src/diffusers/pipelines/ltx2/pipeline_ltx2.py

+                noise_pred_audio_g = noise_pred_audio + audio_cfg_delta + audio_stg_delta + audio_modality_delta
+
+                # Apply LTX-2.X guidance rescaling
+                if self.guidance_rescale > 0:


Are we unable to use the rescaling utility?
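
For context, here is a sketch of a std-matching guidance rescale on plain Python lists. `rescale_guided` is a hypothetical function; whether it matches the existing utility the question refers to is an assumption. It also does not model the variant named in this PR's commit list ("Perform guidance rescale in sample (x0) space following original code"), which may be why a separate code path exists.

```python
import statistics


def rescale_guided(guided, cond, guidance_rescale):
    """Std-matching rescale (illustrative only).

    Scales `guided` so its standard deviation matches that of `cond`, then
    blends the rescaled and original predictions with weight `guidance_rescale`.
    """
    std_cond = statistics.stdev(cond)
    std_guided = statistics.stdev(guided)
    rescaled = [g * (std_cond / std_guided) for g in guided]
    return [guidance_rescale * r + (1.0 - guidance_rescale) * g for r, g in zip(rescaled, guided)]


cond = [0.0, 2.0]
guided = [0.0, 4.0]  # guidance doubled the spread of the prediction

fully_rescaled = rescale_guided(guided, cond, guidance_rescale=1.0)
partially_rescaled = rescale_guided(guided, cond, guidance_rescale=0.7)
unchanged = rescale_guided(guided, cond, guidance_rescale=0.0)
```

A weight of 0.0 leaves the guided prediction untouched, 1.0 fully restores the conditional prediction's spread, and 0.7 (the value used in the examples in the PR description) lands in between.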

sayakpaul · 2026-03-16T11:11:19Z

src/diffusers/pipelines/ltx2/vocoder.py

+        return x
+
+
+class SnakeBeta(nn.Module):


TIL.

Should this go to activations.py? Okay if not.

I think ideally it should, although I'm not familiar enough with Snake/SnakeBeta to say whether this is a stable, widely reusable implementation. My impression is that it's more or less standard though (this implementation follows the original LTX-2 code, which itself follows the BigVGAN-V2 implementation).

Initial implementation of perturbed attn processor for LTX 2.3

6c7e720

yiyixuxu reviewed Mar 6, 2026


dg845 mentioned this pull request Mar 7, 2026

Refactor AttentionProcessorSkipHook to Support Custom STG Logic #13220

Open

dg845 added 10 commits March 7, 2026 03:32

Update DiT block for LTX 2.3 + add self_attention_mask

e90b90a

Add flag to control using perturbed attn processor for now

f768f8d

Add support for new video upsampling blocks used by LTX-2.3

cde6748

Support LTX-2.3 Big-VGAN V2-style vocoder

236eb8d

Initial implementation of LTX-2.3 vocoder with bandwidth extender

1e89cb3

Initial support for LTX-2.3 per-modality feature extractor

5a44adb

Refactor so that text connectors own all text encoder hidden_states normalization logic

4ff3168

Fix some bugs for inference

835bed6

Fix LTX-2.X DiT block forward pass

19004ef

Support prompt timestep embeds and prompt cross attn modulation

4dfcfeb

asomoza mentioned this pull request Mar 9, 2026

LTX-2.3 Support #13232

Open

2 tasks

Add LTX-2.3 configs to conversion script

13292dd

dg845 added 5 commits March 10, 2026 05:50

Support converting LTX-2.3 DiT checkpoints

0528fde

Support converting LTX-2.3 Video VAE checkpoints

c5e1fcc

Support converting LTX-2.3 Vocoder with bandwidth extender

50da4df

Support converting LTX-2.3 text connectors

4206280

Don't convert any upsamplers for now

e719d32

dg845 added 5 commits March 10, 2026 09:50

Support self attention mask for LTX2Pipeline

fbb50d9

Fix some inference bugs

de3f869

Support self attn mask and sigmas for LTX-2.3 I2V, Cond pipelines

5056aa8

Support STG and modality isolation guidance for LTX-2.3

f875031

make style and make quality

652d363

dg845 marked this pull request as ready for review March 11, 2026 06:02

Make audio guidance values default to video values by default

d018534

Update to LTX-2.3 style guidance rescaling

c0bb2ef

dg845 added 4 commits March 12, 2026 10:17

Support cross timesteps for LTX-2.3 cross attention modulation

ab0e5b5

Fix RMS norm bug for LTX-2.3 text connectors

f78c3da

Perform guidance rescale in sample (x0) space following original code

63b3c9f

Support LTX-2.3 Latent Spatial Upsampler model

6188af2

Support LTX-2.3 distilled LoRA

89f8cc4

iwr-redmond mentioned this pull request Mar 13, 2026

[Feature Request] SDNQ Quantization Lightricks/LTX-Desktop#61

Open

cjkindel mentioned this pull request Mar 13, 2026

LTX-2.3 Support griptape-ai/griptape-nodes-library-advanced-media#42

Open

Support LTX-2.3 Distilled checkpoint

f1a812a

dg845 added 5 commits March 14, 2026 09:31

Support LTX-2.3 prompt enhancement

145e8e4

Make LTX-2.X processor non-required so that tests pass

8a58073

Fix test_components_function tests for LTX2 T2V and I2V

93247a0

Fix LTX-2.3 Video VAE configuration bug causing pixel jitter

17b53f0

Merge branch 'main' into ltx2-3-pipeline

6ee66c9

dg845 changed the title ~~[WIP] Add Support for LTX-2.3 Models~~ Add Support for LTX-2.3 Models Mar 16, 2026

dg845 requested review from sayakpaul and yiyixuxu March 16, 2026 06:58

sayakpaul reviewed Mar 16, 2026


dg845 and others added 2 commits March 16, 2026 18:37

Apply suggestions from code review

c016ce5

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

Refactor LTX-2.X Video VAE upsampler block init logic

2feb460

