
TurboQuant follow-ups: per-channel outliers, Llama 3.1 8B reproduction, 5-bit codebook #15

@unamedkr


Tracking the next round of TurboQuant work after #14 was resolved by Variant F (commit ac3c46a).

Current state (post-Variant F):

| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 | 32 | 13.56 | baseline |
| turbo_kv_4b | 4 | 14.28 | +5.3% |
| uniform_4b | 4 | 14.41 | +6.3% |
| turbo_kv_3b | 3 | 15.39 | +13.5% |

At 4-bit we now beat both our previous baseline and llama.cpp q4_0 KV. The remaining +5.3% PPL gap to FP32, and the distance from the Google paper's near-zero degradation claim (~0%), is real and could be closed with the following work.

Open follow-ups

1. Per-channel outlier handling (high impact)

The paper allocates ~25% of channels (32 out of 128) to a higher bit width for outliers. Our turbo_kv_4b uses uniform allocation. This is the most likely structural cause of the remaining gap.

Sketch:

  • Compute per-channel max-abs across a calibration corpus (one-time, per layer)
  • Identify the top-K outlier channels
  • Store outlier indices + their values at FP16 (or higher-bit codebook) in the block header
  • Quantize remaining channels at 3-bit / 4-bit
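
The first two steps of the sketch could look roughly like the following. This is a minimal illustration, not the planned implementation: `CAL_CHANNELS`, `cal_accumulate`, and `cal_top_k` are hypothetical names, and the 128-channel width is assumed from the "32 out of 128" figure above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAL_CHANNELS 128  /* assumed channel count (hypothetical constant) */

/* Fold one calibration vector into the running per-channel max-abs table. */
static void cal_accumulate(float max_abs[CAL_CHANNELS], const float *vec)
{
    for (size_t c = 0; c < CAL_CHANNELS; ++c) {
        float a = vec[c] < 0.0f ? -vec[c] : vec[c];
        if (a > max_abs[c]) max_abs[c] = a;
    }
}

/* Write the indices of the k channels with the largest max-abs into
 * out_idx, largest first. Ties go to the lower channel index.
 * O(k * CAL_CHANNELS), which is fine for a one-time calibration pass. */
static void cal_top_k(const float max_abs[CAL_CHANNELS], size_t k,
                      uint8_t *out_idx)
{
    uint8_t taken[CAL_CHANNELS] = {0};
    assert(k <= CAL_CHANNELS);
    for (size_t i = 0; i < k; ++i) {
        size_t best = CAL_CHANNELS;  /* sentinel: nothing picked yet */
        for (size_t c = 0; c < CAL_CHANNELS; ++c) {
            if (taken[c]) continue;
            if (best == CAL_CHANNELS || max_abs[c] > max_abs[best]) best = c;
        }
        taken[best] = 1;
        out_idx[i] = (uint8_t)best;
    }
}
```

The caller would zero-initialize `max_abs` per layer, call `cal_accumulate` for every key vector in the calibration corpus, then call `cal_top_k` once to get the outlier channel list.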

Storage cost: 32 outliers × (2-byte FP16 value + 1-byte index) = 96 bytes. A 7-bit index is enough to address 128 channels, but even bit-packed that is 32 × 23 bits = 92 bytes, which is more than our entire 72-byte block. Options:
(a) larger block size (256 bytes? trades compression ratio for quality)
(b) per-layer outlier mask shared across blocks (much cheaper)
(c) start with K=8 outliers per block (24 bytes added, might fit)
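
To compare these options quickly, the per-block outlier overhead can be computed with a small helper. This is only a sketch: `outlier_bytes` is a hypothetical function, and it assumes each outlier is stored as an FP16 value plus a channel index, either rounded up to whole bytes or bit-packed.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to store k outliers as (FP16 value, channel index) pairs.
 * If byte_aligned is nonzero, each index is rounded up to whole bytes;
 * otherwise values and indices are bit-packed and the total is rounded up. */
static size_t outlier_bytes(size_t k, unsigned index_bits, int byte_aligned)
{
    const size_t value_bits = 16;  /* FP16 */
    assert(index_bits > 0 && index_bits <= 16);
    if (byte_aligned) {
        size_t idx_bytes = (index_bits + 7) / 8;
        return k * (value_bits / 8 + idx_bytes);
    }
    return (k * (value_bits + index_bits) + 7) / 8;
}
```

With a 7-bit index, `outlier_bytes(32, 7, 1)` gives 96 and `outlier_bytes(8, 7, 1)` gives 24, matching the figures above; bit-packing saves only 4 bytes at K=32 and 1 byte at K=8.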

2. Paper-faithful Llama 3.1 8B + LongBench-E reproduction

The paper reports on Llama 3.1 8B + LongBench-E, not WikiText PPL. We need:

  • Download Llama 3.1 8B GGUF
  • Find or build a LongBench-E runner harness
  • Run baseline, turbo_kv_4b, and (once outlier handling lands) the outlier-aware variant, then compare to paper Table 1

3. 5-bit codebook variant

Layout for 5 bits per element at TQ_BK=128: 128 × 5 / 8 = 80 bytes for indices + 8-byte header = 88 bytes per block. That is larger than the 72-byte 4-bit block, but it covers the ~5 bpc point that the paper also tests. Would need to:

  • Compute Lloyd-Max-Gaussian centroids for 32 levels
  • Add to tq_codebook.c lookup table
  • Add TQ_TYPE_TURBO_KV_5B enum + register in tq_traits.c
  • Pack/unpack 5-bit indices
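
For the pack/unpack step, one possible layout is a little-endian bit stream in which element i occupies bits [5i, 5i+5). The sketch below assumes that layout; `PK5_ELEMS`, `PK5_BYTES`, `pk5_pack`, and `pk5_unpack` are hypothetical names, and the real bit order would have to match whatever the existing 3-bit and 4-bit paths use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PK5_ELEMS 128                  /* matches TQ_BK=128 */
#define PK5_BYTES (PK5_ELEMS * 5 / 8)  /* 80 bytes of packed indices */

/* Pack 128 indices (each in 0..31) into 80 bytes. */
static void pk5_pack(const uint8_t idx[PK5_ELEMS], uint8_t out[PK5_BYTES])
{
    memset(out, 0, PK5_BYTES);
    for (size_t i = 0; i < PK5_ELEMS; ++i) {
        size_t bit = i * 5;
        size_t byte = bit >> 3;
        unsigned shift = (unsigned)(bit & 7);
        unsigned v;
        assert(idx[i] < 32);
        v = (unsigned)idx[i] << shift;       /* at most 12 significant bits */
        out[byte] |= (uint8_t)(v & 0xFF);
        if (shift > 3)                       /* value straddles a byte boundary */
            out[byte + 1] |= (uint8_t)(v >> 8);
    }
}

/* Inverse of pk5_pack. */
static void pk5_unpack(const uint8_t in[PK5_BYTES], uint8_t idx[PK5_ELEMS])
{
    for (size_t i = 0; i < PK5_ELEMS; ++i) {
        size_t bit = i * 5;
        size_t byte = bit >> 3;
        unsigned shift = (unsigned)(bit & 7);
        unsigned v = in[byte];
        if (shift > 3)
            v |= (unsigned)in[byte + 1] << 8;
        idx[i] = (uint8_t)((v >> shift) & 0x1F);
    }
}
```

Because 8 elements occupy exactly 5 bytes, the last element (i = 127) ends on the final byte boundary and the `byte + 1` access never runs past the 80-byte buffer. A production version would likely process 8 elements per iteration to avoid the per-element branch.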

4. Per-head rotation seeds

Currently all keys use `TKV_DEFAULT_SEED`. Per-head or per-layer seeds may help decorrelation in models where certain heads have correlated channels.
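
One way to derive distinct seeds without storing a table is to hash (layer, head) into the base seed. The sketch below uses the splitmix64 mixing step for this; `seed_mix64` and `seed_for_head` are hypothetical helpers, and treating the seed as a 64-bit integer is an assumption about `TKV_DEFAULT_SEED`, not something confirmed here.

```c
#include <stdint.h>

/* splitmix64 step: adds an odd constant, then applies two
 * xorshift-multiply rounds. Each stage is invertible, so distinct
 * inputs always produce distinct outputs. */
static uint64_t seed_mix64(uint64_t x)
{
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* Derive a rotation seed for a given (layer, head) from a base seed.
 * Deterministic, so quantize and dequantize agree without extra storage. */
static uint64_t seed_for_head(uint64_t base_seed, uint32_t layer, uint32_t head)
{
    uint64_t key = ((uint64_t)layer << 32) | (uint64_t)head;
    return seed_mix64(base_seed ^ seed_mix64(key));
}
```

For a per-layer-only variant, passing `head = 0` everywhere gives one seed per layer. Whether either scheme actually improves PPL would need the same ablation treatment as the other variants.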

5. Regression test pinning quality

Add a slow integration test that fails CI if `turbo_kv_4b` PPL on Llama 3.2 3B exceeds 14.5. This guards against future regressions in the Karpathy-loop optimization.
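
The pass/fail check itself is small. A sketch, assuming the test harness already produces a measured PPL as a `double`; `PPL_GATE_TURBO_KV_4B`, `ppl_gate_ok`, and `ppl_delta_pct` are hypothetical names:

```c
#include <assert.h>

#define PPL_GATE_TURBO_KV_4B 14.5  /* threshold proposed in this issue */

/* Relative PPL increase over a baseline, in percent. */
static double ppl_delta_pct(double ppl, double baseline)
{
    assert(baseline > 0.0);
    return (ppl / baseline - 1.0) * 100.0;
}

/* Returns 1 if the measured PPL passes the gate, 0 otherwise.
 * A NaN measurement fails, because any comparison with NaN is false. */
static int ppl_gate_ok(double measured_ppl)
{
    return measured_ppl <= PPL_GATE_TURBO_KV_4B;
}
```

The current 14.28 passes with about 0.2 PPL of headroom, so the gate tolerates small run-to-run noise but would trip on a regression back past the `uniform_4b` level plus a small margin.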

Out of scope (won't fix)

  • ❌ QJL stage revival — ablation showed it contributes ~0; reinvesting bytes in larger codebook is empirically better in our regime
  • ❌ Multi-stage rotation — Walsh-Hadamard one-pass is fast and good enough
  • ❌ Per-block adaptive bit allocation — not enough header space without breaking ABI
