Gemma 4 forward pass reference points from a working CUDA implementation #96

@Mutdogus

Disclosure: This issue was drafted and submitted by an AI assistant (Claude) on behalf of the repository owner. The technical content is based on analysis of public MIT-licensed llama.cpp source code and the author's working Gemma 4 + TurboQuant CUDA implementation.

Hey — I've been following your Gemma 4 work over the last few days and noticed you're hitting some of the same walls I went through. I have Gemma 4 running with TurboQuant KV cache compression on an NVIDIA RTX 4090 via a llama.cpp-based fork, so the architectural issues are fresh.

I'd submit patches directly, but your contribution guidelines exclude AI-generated code, and my implementation work was done with AI assistance, so I'll stick to pointing you at the right reference material instead. Everything below references the public MIT-licensed llama.cpp source.

1. layer_output_scale — the #1 divergence source

I see you've iterated on this 4+ times in the last 48 hours. The correct behavior from src/models/gemma4-iswa.cpp (ggml-org/llama.cpp):

// Applied AFTER all residual connections AND per-layer embedding (PLE)
// It's a simple elementwise multiply on the FULL accumulated tensor
cur = ggml_mul(ctx0, cur, model.layers[il].out_scale);

Key: it's applied to the entire accumulated hidden state (residual included), not to the layer's delta contribution. The "residual-separation" formula (x_old + scale * (x_current - x_old)) that I see in your history is incorrect — the model was trained with scaling applied to the full tensor. The values are small (e.g., 0.0178 for layer 0) but that's by design; each subsequent layer compensates.

Order of operations (this matters):

  1. Attention + attn_post_norm + residual add
  2. FFN (MoE or dense) + ffn_post_norm + residual add
  3. Per-layer embedding (PLE) + residual add
  4. Then layer_output_scale on the result
  5. That becomes inpL for the next layer
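To make the difference between the two formulas concrete, here is a rough standalone sketch (plain C++, no ggml). It treats the scale as a single float for simplicity, which is an assumption on my part; in llama.cpp it is the tensor model.layers[il].out_scale. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Full-tensor scaling, as described above: the scale multiplies the whole
// accumulated hidden state, residual included.
// h_out is the state after attention, FFN and PLE residual adds (steps 1-3).
std::vector<float> scale_full(const std::vector<float>& h_out, float scale) {
    std::vector<float> y(h_out.size());
    for (std::size_t i = 0; i < h_out.size(); ++i) {
        y[i] = h_out[i] * scale;
    }
    return y;
}

// The "residual-separation" variant, shown only for contrast: the layer's
// delta is scaled and the incoming residual h_in passes through unscaled.
std::vector<float> scale_delta_only(const std::vector<float>& h_in,
                                    const std::vector<float>& h_out,
                                    float scale) {
    assert(h_in.size() == h_out.size());
    std::vector<float> y(h_out.size());
    for (std::size_t i = 0; i < h_out.size(); ++i) {
        y[i] = h_in[i] + scale * (h_out[i] - h_in[i]);
    }
    return y;
}
```

With h_in = {1, 2}, h_out = {2, 4} and scale = 0.5, the first gives {1, 2} and the second {1.5, 3}, so the two only agree when the scale is exactly 1.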

2. V-norm — RMSNorm on V before KV cache

Gemma 4 applies RMSNorm to the V projection output before storing to KV cache:

Vcur = ggml_rms_norm(ctx0, Vcur, hparams.f_norm_rms_eps);

This is unusual: architectures that normalize attention inputs usually do it only for Q and K (for Q-K dot product stability). Gemma 4 norms V as well. If you're norming V after cache retrieval or not at all, that's a source of divergence.
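As a reference point for the ordering, a minimal standalone sketch of "normalize, then store" might look like the following. It assumes the norm is weightless (which matches the ggml_rms_norm call above taking only eps) and is applied over one head's vector at a time; VCache and store_v are hypothetical names, not llama.cpp types.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Weightless RMSNorm over one vector: y = x / sqrt(mean(x^2) + eps).
std::vector<float> rms_norm(const std::vector<float>& x, float eps) {
    assert(!x.empty());
    float sum_sq = 0.0f;
    for (float v : x) {
        sum_sq += v * v;
    }
    const float inv_rms =
        1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * inv_rms;
    }
    return y;
}

// Hypothetical stand-in for a V cache: one stored row per (token, head).
struct VCache {
    std::vector<std::vector<float>> rows;
};

// The norm runs BEFORE the write, so the cache only ever holds normalized V
// and nothing needs to be re-normalized on retrieval.
void store_v(VCache& cache, const std::vector<float>& v_head, float eps) {
    cache.rows.push_back(rms_norm(v_head, eps));
}
```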

3. MoE router — non-standard logit calculation

The Gemma 4 expert router doesn't just project the hidden state. It:

  1. Takes attn_out (the post-attention residual, NOT the post-FFN-norm tensor)
  2. Applies RMSNorm
  3. Scales by 1.0 / sqrt(n_embd)
  4. Multiplies by a learned ffn_gate_inp_s scale tensor
  5. Then projects through ffn_gate_inp to get expert logits

If your router operates on the FFN-normed tensor instead of attn_out, or misses the 1/sqrt(n_embd) scaling, expert routing will be wrong and outputs will diverge even if individual expert FFNs are correct.
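The five router steps above can be sketched for a single token like this. It is an illustration under assumptions, not a copy of the llama.cpp graph: I assume the RMSNorm is weightless and that ffn_gate_inp_s has one entry per channel, and the function and parameter names are mine.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Router logits for one token.
// attn_out:   post-attention residual, length n_embd (step 1)
// gate_inp_s: learned scale, assumed here to have length n_embd (step 4)
// gate_inp:   router projection, one row of length n_embd per expert (step 5)
std::vector<float> router_logits(const std::vector<float>& attn_out,
                                 const std::vector<float>& gate_inp_s,
                                 const std::vector<std::vector<float>>& gate_inp,
                                 float eps) {
    const std::size_t n_embd = attn_out.size();
    assert(n_embd > 0 && gate_inp_s.size() == n_embd);

    // Step 2: RMSNorm (assumed weightless).
    float sum_sq = 0.0f;
    for (float v : attn_out) {
        sum_sq += v * v;
    }
    const float inv_rms =
        1.0f / std::sqrt(sum_sq / static_cast<float>(n_embd) + eps);

    // Step 3: scale by 1 / sqrt(n_embd).
    const float inv_sqrt_dim = 1.0f / std::sqrt(static_cast<float>(n_embd));

    // Steps 2-4 fused into one pass over the channels.
    std::vector<float> x(n_embd);
    for (std::size_t i = 0; i < n_embd; ++i) {
        x[i] = attn_out[i] * inv_rms * inv_sqrt_dim * gate_inp_s[i];
    }

    // Step 5: project to one logit per expert.
    std::vector<float> logits;
    logits.reserve(gate_inp.size());
    for (const auto& row : gate_inp) {
        assert(row.size() == n_embd);
        float acc = 0.0f;
        for (std::size_t i = 0; i < n_embd; ++i) {
            acc += row[i] * x[i];
        }
        logits.push_back(acc);
    }
    return logits;
}
```

A quick sanity check: with attn_out = {3, 4}, a scale of all ones and an identity projection, the logits come out as roughly {0.6, 0.8}. Dropping the 1/sqrt(n_embd) factor changes them to roughly {0.85, 1.13}.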

4. Proportional RoPE + dual head dimensions

Gemma 4 has different head_dim for full-attention vs sliding-window layers (e.g., 256 vs 128, or 512 vs 256). The RoPE dimensions and frequency base must switch per layer based on the layer type. Full-attention layers use learned rope_freqs (proportional RoPE) while sliding layers use computed frequencies from rope.freq_base_swa.

From the GGUF metadata: check that you're reading rope.freq_base_swa (or rope.local.freq_base as fallback) for the sliding-window layers, and using the per-layer rope_freqs tensor for full-attention layers.
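One way to keep the per-layer switch in a single place is a small selector like the sketch below. Everything here is hypothetical scaffolding: RopeConfig, the function names and the flat metadata map are mine, and the key strings are copied as written above (real GGUF keys may carry an architecture prefix).

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical per-layer RoPE settings.
struct RopeConfig {
    int   head_dim;          // rotary dimension follows the layer's head_dim
    float freq_base;         // base used to compute frequencies
    bool  use_freq_factors;  // true: also apply the rope_freqs tensor
};

// Sliding-window base: prefer rope.freq_base_swa, fall back to
// rope.local.freq_base, as described above.
float swa_freq_base(const std::map<std::string, float>& meta) {
    auto it = meta.find("rope.freq_base_swa");
    if (it == meta.end()) {
        it = meta.find("rope.local.freq_base");
    }
    assert(it != meta.end() && "no sliding-window RoPE base in metadata");
    return it->second;
}

// Pick the RoPE settings for one layer based on its attention type.
RopeConfig rope_for_layer(bool is_swa,
                          int head_dim_full,
                          int head_dim_swa,
                          float freq_base_full,
                          const std::map<std::string, float>& meta) {
    if (is_swa) {
        return {head_dim_swa, swa_freq_base(meta), false};
    }
    return {head_dim_full, freq_base_full, true};
}
```

The point of routing every layer through one function is that head_dim and the frequency source can never be switched independently of each other.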

5. Attention softcap

Gemma 2 uses attn_logit_softcapping = 50.0 (Gemma 3 already dropped it in favor of QK-norm). Gemma 4 does NOT use it. I see you have the config flag right (is_gemma4 && attn_logit_softcap == 0.0f) but worth double-checking it's not being applied somewhere in the attention computation path.
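For reference, the usual softcap form is cap * tanh(logit / cap). A guard like the sketch below makes "disabled" explicit; treating cap == 0 as "off" is my assumption, based on the 0.0f check in your config flag, and the function name is made up.

```cpp
#include <cassert>
#include <cmath>

// Attention-logit softcap. A cap of 0 (or less) is treated as "disabled"
// and the logit passes through untouched.
float softcap(float logit, float cap) {
    assert(std::isfinite(logit));
    if (cap <= 0.0f) {
        return logit;
    }
    return cap * std::tanh(logit / cap);
}
```

If a Gemma 4 run ever shows attention logits clamped to a narrow symmetric range, that is a hint some code path is still calling the capped branch.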

6. KV sharing

attention.shared_kv_layers gives N: the last N layers reuse K/V from earlier layers of the same attention type (same sliding/full classification). The reference layer lookup walks backward to find the most recent layer of the same attention type.
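A rough sketch of that backward walk is below. It assumes the search starts at the last layer that still owns its K/V (index n_layer - N - 1); that starting point is my reading, so check it against the reference file. The function name and the -1 sentinel are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Returns the layer whose K/V layer `il` should use.
// is_swa[i] is true for sliding-window layers, false for full attention.
// Layers before the shared tail own their K/V and map to themselves.
// Returns -1 if no earlier layer of the same type exists.
int kv_source_layer(const std::vector<bool>& is_swa, int n_shared, int il) {
    const int n_layer = static_cast<int>(is_swa.size());
    assert(il >= 0 && il < n_layer);
    assert(n_shared >= 0 && n_shared <= n_layer);

    const int first_shared = n_layer - n_shared;
    if (il < first_shared) {
        return il;  // not in the shared tail
    }
    // Walk backward over the non-shared layers, most recent first.
    for (int j = first_shared - 1; j >= 0; --j) {
        if (is_swa[j] == is_swa[il]) {
            return j;
        }
    }
    return -1;
}
```

For a six-layer pattern {swa, swa, full, swa, full, swa} with N = 2, layer 4 (full) maps to layer 2 and layer 5 (sliding) maps to layer 3.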


The file to study is src/models/gemma4-iswa.cpp in the ggml-org/llama.cpp repo — it's ~310 lines and covers the complete forward pass. For the model loading side (weight names, GGUF keys), see src/llama-model.cpp searching for LLM_ARCH_GEMMA4.

Happy to answer questions. Great project — looking forward to seeing Gemma 4 run on your CPU/WASM path.
