Qwen3.5-4B DeltaNet layers: FP32 dequant bottleneck causes 0.7 tok/s #70

@unamedkr

Description

Qwen3.5-4B loads and generates coherent output, but inference runs at only ~0.7 tok/s on Apple M3. The main suspect is the DeltaNet attention layers: their weights are dequantized to FP32 at load time, so they lose the speed benefit of quantized matmul.

Benchmark

| Model | Params | Vocab | tok/s | Notes |
|---|---|---|---|---|
| Phi-3.5-mini (Q8) | 3.8B | 32K | ~8 | Fast |
| SmolLM2-1.7B (Q8) | 1.7B | 49K | ~12.5 | Fastest |
| Qwen3.5-4B (Q4) | 4B | 248K | ~0.7 | ~11x slower than Phi-3.5, ~18x slower than SmolLM2 |

Root Cause

Server log shows all 24 DeltaNet layers being dequantized to FP32:

tq_load_gguf: layer 0 attn_qkv dequant to FP32 (was type 13)
tq_load_gguf: layer 1 attn_qkv dequant to FP32 (was type 13)
...
tq_load_gguf: layer 30 attn_qkv dequant to FP32 (was type 13)
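
To put a number on the memory cost, here is a rough sketch. It assumes the "type 13" in the log follows ggml's tensor type numbering, where 13 is Q5_K (256-weight super-blocks of 176 bytes each). I have not confirmed that quant.h uses the same numbering, and the helper names are made up for illustration.

```c
#include <stddef.h>

/* Assumption: "type 13" is ggml's Q5_K, which packs 256 weights
   into a 176-byte super-block (5.5 bits per weight). */
#define Q5K_BLOCK_WEIGHTS 256
#define Q5K_BLOCK_BYTES   176

/* Bytes needed to store n_weights in Q5_K (n_weights must be a multiple of 256). */
static size_t q5k_bytes(size_t n_weights) {
    return n_weights / Q5K_BLOCK_WEIGHTS * Q5K_BLOCK_BYTES;
}

/* Bytes needed to store the same weights after dequantizing to FP32. */
static size_t fp32_bytes(size_t n_weights) {
    return n_weights * sizeof(float);
}

/* How many times larger the tensor becomes after the FP32 dequant. */
static double fp32_expansion(size_t n_weights) {
    return (double)fp32_bytes(n_weights) / (double)q5k_bytes(n_weights);
}
```

Under that assumption each dequantized tensor grows by a factor of 1024 / 176, a bit under 6x, and every matmul over it has to stream that much more data from memory.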

Two bottlenecks:

  1. DeltaNet FP32 dequant: 24 layers × full QKV tensors are converted to FP32 at load time, which inflates memory use and removes the speed benefit of quantized weights.

  2. 248K vocab output projection: every token requires a 2560 × 248K matmul for logit computation. The vocabulary is 7.7x larger than Phi-3.5's 32K, and the matmul itself is about 6.5x larger than Phi-3.5's (3072 × 32K).
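
For reference, the output projection cost can be worked out directly. This sketch takes the rounded figures above at face value (248K as 248,000 and 32K as 32,000; the exact vocab sizes will differ slightly), and the function names are only illustrative.

```c
#include <stdint.h>

/* Multiply-accumulates needed to turn one hidden state into logits. */
static uint64_t logit_macs(uint64_t hidden_dim, uint64_t vocab_size) {
    return hidden_dim * vocab_size;
}

/* Cost of model A's output projection relative to model B's. */
static double logit_cost_ratio(uint64_t hidden_a, uint64_t vocab_a,
                               uint64_t hidden_b, uint64_t vocab_b) {
    return (double)logit_macs(hidden_a, vocab_a) /
           (double)logit_macs(hidden_b, vocab_b);
}
```

With these inputs, `logit_macs(2560, 248000)` is about 635M multiply-accumulates per token versus about 98M for `logit_macs(3072, 32000)`. The vocabulary ratio alone is 7.75x, but the full matmul ratio comes out near 6.5x because Phi-3.5 has the larger hidden dimension.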

Impact

At 0.7 tok/s, generating 80 tokens takes well over 100 seconds (80 / 0.7 ≈ 114 s), which is unusable for interactive chat. Qwen3.5-4B had the best output quality among the tested models, but this speed makes it impractical.

Suggested Optimizations

  1. Keep DeltaNet layers in quantized format: run the Q4/Q8 matmul directly on the quantized weights instead of dequantizing to FP32
  2. Optimize vocab projection: for large-vocab models, consider top-k logit computation or speculative sampling
  3. DeltaNet-specific kernel: linear attention keeps a fixed-size recurrent state rather than a growing KV cache, so a dedicated kernel could exploit this for speed
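
As an illustration of the first suggestion, here is a minimal sketch of a dot product computed directly on 8-bit blocks, with no FP32 copy of the weights. The `q8_block` layout, the block size, and both function names are hypothetical and simplified for readability; this is not quant.h's actual format or the GGUF on-disk layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q8_BLOCK 32  /* hypothetical block size, chosen for illustration */

/* Hypothetical 8-bit block: one FP32 scale plus 32 signed 8-bit values. */
typedef struct {
    float  scale;
    int8_t q[Q8_BLOCK];
} q8_block;

/* Quantize n floats (n a multiple of Q8_BLOCK) into n / Q8_BLOCK blocks. */
static void q8_quantize_row(const float *x, q8_block *out, size_t n) {
    assert(n % Q8_BLOCK == 0);
    for (size_t b = 0; b < n / Q8_BLOCK; b++) {
        const float *xb = x + b * Q8_BLOCK;
        float amax = 0.0f;
        for (size_t i = 0; i < Q8_BLOCK; i++) {
            float a = xb[i] < 0.0f ? -xb[i] : xb[i];
            if (a > amax) amax = a;
        }
        float scale = amax / 127.0f;
        float inv = scale > 0.0f ? 1.0f / scale : 0.0f;
        out[b].scale = scale;
        for (size_t i = 0; i < Q8_BLOCK; i++) {
            float v = xb[i] * inv;
            /* Round half away from zero; the result lies in [-127, 127]. */
            out[b].q[i] = (int8_t)(int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
        }
    }
}

/* Dot product of two quantized rows. The inner loop is integer-only;
   the two scales are applied once per block. */
static float q8_dot(const q8_block *w, const q8_block *x, size_t n) {
    assert(n % Q8_BLOCK == 0);
    float sum = 0.0f;
    for (size_t b = 0; b < n / Q8_BLOCK; b++) {
        int32_t acc = 0;
        for (size_t i = 0; i < Q8_BLOCK; i++) {
            acc += (int32_t)w[b].q[i] * (int32_t)x[b].q[i];
        }
        sum += (float)acc * w[b].scale * x[b].scale;
    }
    return sum;
}
```

The point is that the weights stay at roughly one byte each and the inner loop maps well onto integer SIMD. The real Q4_K/Q5_K formats have more involved block layouts, so an actual kernel would look different.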

Environment

  • Model: unsloth/Qwen3.5-4B-GGUF (Q4_K_M, 2.6GB)
  • Hardware: Apple M3, 8-core, 16GB
  • Build: quant.h single-header

Reported by ClawTeam Claw-4 (Optimizer)

Labels: enhancement
