Conversation
```python
        return hidden_states


class LTX2PerturbedAttnProcessor:
```
I think this is just a guider https://github.com/huggingface/diffusers/blob/main/src/diffusers/guiders/skip_layer_guidance.py
Thanks! Looking at the code, it's unclear to me whether `SkipLayerGuidance` currently works for LTX-2.3, for the following reasons:

- Not attention-backend agnostic: if I understand correctly, STG is implemented through `AttentionProcessorSkipHook`, which uses `AttentionScoreSkipFunctionMode` to intercept calls to `torch.nn.functional.scaled_dot_product_attention` and simply return the `value`. But I think other attention backends like `flash-attn` won't call that function and thus will not work with `SkipLayerGuidance`.
- LTX-2.3 does additional computation on the `value`s: LTX-2.3 additionally processes the `value`s using learned per-head gates before sending them to the attention output projection `to_out`. This is not supported by the current `SkipLayerGuidance` implementation.

I'm not sure whether these issues can be resolved with changes to the `SkipLayerGuidance` implementation or whether something like a new attention processor would make more sense here.
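To make the backend-agnosticity point concrete, here is a minimal NumPy sketch (names are illustrative, not the actual diffusers API) of the STG perturbation done at the processor level: instead of intercepting `torch.nn.functional.scaled_dot_product_attention` (which backend-specific kernels like flash-attn never call), the processor itself returns the values, so the perturbation works regardless of the attention backend.

```python
# Hypothetical sketch of skip-layer guidance (STG) at the processor level.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, skip=False):
    if skip:
        # Perturbed path: identity "attention map", output is just V.
        # No SDPA call is made, so no function-mode interception is needed.
        return v
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(2, 4, 8)) for _ in range(3))
assert np.allclose(attention(q, k, v, skip=True), v)
assert attention(q, k, v).shape == v.shape
```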
I have opened a PR with a possible modification to SkipLayerGuidance to allow it to better support LTX-2.3 at #13220.
This is a good callout! From my understanding, guider as a component doesn't change much. LTX-2 is probably an exception. If more models start to do their own form of SLG, we could think of giving them their own guider classes / attention processors. But for now, I think modifications to the existing SLG class make more sense.
Let's merge LTX-2.3 with a special custom attention processor in this PR first, ASAP.
The design from the other PR to refactor the guider is fundamentally wrong: the purpose of hooks (and guiders as well) is that they modify behavior from the outside, without the model needing to be aware of them or implement logic specific to them.
I will look into refactoring with guiders in the follow-up modular PR.
The point on guiders not being backend agnostic is a good thing to keep in mind.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LTX-2.3
I2V sample using the example above: ltx2_3_i2v_stage_1.mp4. This uses
Tried the i2v example, and got this error: Adding this seems to make it work: I don't know if this is the correct way to solve this, but the examples should probably be updated to deal with this problem.
sayakpaul left a comment
Thanks so much for the changes and for your patience.
The amount of changes (and navigating them) is a bit overwhelming, TBH.
I have left a few comments. Let me know if they make sense. We could consider adding a test suite mirroring the existing LTX-2 pipeline tests, but swapping in the components specific to LTX-2.3?
```diff
     LTX2VideoTransformer3DModel,
 )
-from diffusers.pipelines.ltx2 import LTX2LatentUpsamplerModel, LTX2TextConnectors, LTX2Vocoder
+from diffusers.pipelines.ltx2 import LTX2LatentUpsamplerModel, LTX2TextConnectors, LTX2Vocoder, LTX2VocoderWithBWE
```
Any issue in unifying LTX2Vocoder and LTX2VocoderWithBWE?
I think there is no issue in principle. But because `LTX2VocoderWithBWE` contains two `LTX2Vocoder`s as submodules, it was more natural to me to wrap them in a new module (it's also more parallel to the original code).
```python
    "q_norm": "norm_q",
    "k_norm": "norm_k",
    # LTX-2.3
    "audio_prompt_adaln_single": "audio_prompt_adaln",
```
Where did this pop up? Distillation checkpoint?
The `prompt_adaln` and `audio_prompt_adaln` modules are used by both the full model and the distilled model to compute scale/shift modulation parameters for the text `encoder_hidden_states`, for the video and audio modalities respectively. (I believe this is in place of the `caption_projection`s, which were removed in LTX-2.3.)
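For readers unfamiliar with the pattern, here is a hypothetical sketch (module and tensor names assumed, not the LTX-2 implementation) of an AdaLN-single style block: a conditioning embedding is projected to per-channel scale/shift parameters, which then modulate the normalized hidden states.

```python
# Hypothetical AdaLN-single sketch: conditioning -> (scale, shift) -> modulation.
import torch
import torch.nn as nn

class AdaLNSingleSketch(nn.Module):
    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        # One linear projection producing both scale and shift.
        self.linear = nn.Linear(cond_dim, 2 * hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)

    def forward(self, hidden_states, conditioning):
        scale, shift = self.linear(conditioning).chunk(2, dim=-1)
        # Modulate normalized hidden states: x * (1 + scale) + shift.
        return self.norm(hidden_states) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

block = AdaLNSingleSketch(cond_dim=16, hidden_dim=32)
text_states = torch.randn(2, 7, 32)   # (batch, tokens, channels)
cond = torch.randn(2, 16)             # e.g. a pooled conditioning embedding
out = block(text_states, cond)
assert out.shape == text_states.shape
```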
```python
    resnet_eps: float = 1e-6,
    resnet_act_fn: str = "swish",
    spatio_temporal_scale: bool = True,
    upsample_type: str = "spatiotemporal",
```
Should this go at the last of init params to prevent backwards breaking in case someone is using positional arguments?
I put `upsample_type` there because it follows the argument ordering of `LTX2VideoDownBlock3D`, which already used an analogous `downsample_type` argument. I think the positional-argument point is valid, but IMO there is less risk of it breaking things, since users are unlikely to call `LTX2VideoUpBlock3D` on its own.
```diff
-            LTXVideoUpsampler3d(
-                out_channels * upscale_factor,
+        self.upsamplers = nn.ModuleList()
```
It seems like `stride` is the only factor that varies depending on `upsample_type`. So, maybe we could do something like:

```python
if upsample_type == "spatial":
    stride = (1, 2, 2)
elif upsample_type == "temporal":
    stride = (2, 1, 1)
elif upsample_type == "spatiotemporal":
    stride = (2, 2, 2)

self.upsamplers.append(..., stride=stride)
```

WDYT?
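A runnable variant of the stride-selection idea, using a dict lookup instead of an if/elif chain (the upsample-type names are taken from the diff; the helper name is hypothetical):

```python
# Hypothetical stride lookup for the sketch above.
STRIDES = {
    "spatial": (1, 2, 2),
    "temporal": (2, 1, 1),
    "spatiotemporal": (2, 2, 2),
}

def stride_for(upsample_type: str) -> tuple:
    # Fail loudly on unknown types instead of silently falling through.
    try:
        return STRIDES[upsample_type]
    except KeyError:
        raise ValueError(f"Unknown upsample_type: {upsample_type!r}")

assert stride_for("spatial") == (1, 2, 2)
assert stride_for("spatiotemporal") == (2, 2, 2)
```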
```python
    self_attention_mask=None,
    audio_self_attention_mask=None,
```
Are these used by the other pipelines, such as I2V?
I think they are not used by any currently implemented pipeline. They might be used in pipelines that are in the LTX-2 code but not yet implemented in diffusers.
```python
)
noise_pred_video_uncond_text, noise_pred_video = noise_pred_video.chunk(2)
# Use delta formulation as it works more nicely with multiple guidance terms
video_cfg_delta = (self.guidance_scale - 1) * (noise_pred_video - noise_pred_video_uncond_text)
```
(note to other reviewers): guidance is computed a bit later, to account for everything that comes before the computation.
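For context, a NumPy sketch (names are illustrative) of the "delta formulation" of classifier-free guidance: each guidance term is expressed as an additive delta on top of the conditional prediction, so multiple terms (CFG, STG, modality isolation, ...) compose by simple summation.

```python
# Hypothetical delta-formulation CFG sketch.
import numpy as np

def cfg_delta(pred_cond, pred_uncond, scale):
    # The extra push beyond the conditional output; scale=1 gives a zero delta.
    return (scale - 1.0) * (pred_cond - pred_uncond)

rng = np.random.default_rng(0)
cond = rng.normal(size=(4,))
uncond = rng.normal(size=(4,))
scale = 5.0

guided = cond + cfg_delta(cond, uncond, scale)
# Matches the classic CFG formula uncond + scale * (cond - uncond):
classic = uncond + scale * (cond - uncond)
assert np.allclose(guided, classic)
```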
```python
if self.do_modality_isolation_guidance:
    with self.transformer.cache_context("uncond_modality"):
        noise_pred_video_uncond_modality, noise_pred_audio_uncond_modality = self.transformer(
```
Do these calls vary from the previous ones in terms of the inputs? If so, it could be nice to add a small comment about it because the call arg list is pretty long.
I believe there is already an existing comment: `diffusers/src/diffusers/pipelines/ltx2/pipeline_ltx2.py`, lines 1319 to 1320 in 6ee66c9.
```python
noise_pred_audio_g = noise_pred_audio + audio_cfg_delta + audio_stg_delta + audio_modality_delta

# Apply LTX-2.X guidance rescaling
if self.guidance_rescale > 0:
```
Are we unable to use the rescaling utility?
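For reference, a NumPy sketch of CFG rescaling in the style of the `rescale_noise_cfg` utility in diffusers: rescale the guided prediction so its standard deviation matches the conditional prediction's, then blend with the unrescaled result by a `guidance_rescale` factor. Names here are illustrative; whether LTX-2.X's rescaling matches this exactly is the open question.

```python
# Hypothetical CFG-rescaling sketch (rescale_noise_cfg style).
import numpy as np

def rescale_guided(noise_cfg, noise_pred_cond, guidance_rescale=0.7):
    std_cond = noise_pred_cond.std()
    std_cfg = noise_cfg.std()
    rescaled = noise_cfg * (std_cond / std_cfg)  # fix "overexposure"
    # Blend between rescaled and original guided predictions.
    return guidance_rescale * rescaled + (1 - guidance_rescale) * noise_cfg

rng = np.random.default_rng(1)
cond = rng.normal(size=(8,))
cfg = 3.0 * cond  # exaggerated guided prediction with inflated std
out = rescale_guided(cfg, cond, guidance_rescale=1.0)
# With full rescaling, the output std matches the conditional std:
assert np.isclose(out.std(), cond.std())
```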
```python
        return x


class SnakeBeta(nn.Module):
```
TIL.
Should this go to activations.py? Okay if not.
I think ideally it should, although I'm not familiar enough with Snake/SnakeBeta to say whether this is a stable, widely reusable implementation. My impression is that it's more or less standard though (this implementation follows the original LTX-2 code, which itself follows the BigVGAN-V2 implementation).
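For readers who hit the same "TIL": a sketch of the SnakeBeta activation as used in BigVGAN-style vocoders, `x + (1/beta) * sin^2(alpha * x)`, with per-channel learned `alpha` (frequency) and `beta` (magnitude). The parameter handling here (log-scale parameters, eps) is an assumption about the common implementation, not a copy of the LTX-2 code.

```python
# Hypothetical SnakeBeta sketch (BigVGAN-style, details assumed).
import torch
import torch.nn as nn

class SnakeBetaSketch(nn.Module):
    def __init__(self, channels: int, eps: float = 1e-9):
        super().__init__()
        # Stored in log scale so exp() keeps them strictly positive.
        self.log_alpha = nn.Parameter(torch.zeros(channels))
        self.log_beta = nn.Parameter(torch.zeros(channels))
        self.eps = eps

    def forward(self, x):  # x: (batch, channels, time)
        alpha = self.log_alpha.exp().view(1, -1, 1)
        beta = self.log_beta.exp().view(1, -1, 1)
        # x + (1/beta) * sin^2(alpha * x): periodic, with a DC-passing identity term.
        return x + (1.0 / (beta + self.eps)) * torch.sin(alpha * x).pow(2)

act = SnakeBetaSketch(channels=4)
x = torch.randn(2, 4, 16)
y = act(x)
assert y.shape == x.shape
```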
What does this PR do?
This PR adds support for LTX-2.3 (official code, model weights), a new model in the LTX-2.X family of audio-video models. LTX-2.3 has improved audio and visual quality and prompt adherence as compared to LTX-2.0.
T2V Example
I2V Example
FLF2V Example
I2V Two Stage Example
I2V Distilled Example
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@yiyixuxu
@sayakpaul