From 4dc84fa9ee2b300a1d0deb71a44e91a8173fabf5 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 11:40:41 -0400 Subject: [PATCH 01/19] review ultra Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 275 +++++++++++++++++++++++------- 1 file changed, 215 insertions(+), 60 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index fc1588f..6c72341 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -15,8 +15,8 @@ in three stages: blocks of size B = the largest power-of-2 ≥ 64 that divides d. For power-of-2 dimensions, B = d (single block, same as current). Per-block norms stored as internal children. -3. **PDX layout** (later): within each block, transpose codes into groups of - 64 vectors for SIMD scan performance. +3. **PDX layout** (later): transpose codes into dimension-major order within + groups of 64 vectors for SIMD scan performance. QJL correction is deferred to a later stage and may ultimately be dropped. Community findings from 6+ independent TurboQuant implementations consistently @@ -40,10 +40,31 @@ embeddings. It works by: 3. Optionally adding a 1-bit QJL (Quantized Johnson-Lindenstrauss) correction on the residual for unbiased inner product estimation (Theorem 2 in [1]). -The paper prescribes a full random orthogonal rotation (QR of Gaussian) for the -MSE stage — O(d²) storage and O(d²) per-vector. For the QJL stage, the paper -uses a random Gaussian projection matrix S with i.i.d. N(0,1) entries (not an -orthogonal rotation); this distinction matters for the unbiasedness proof. +The paper prescribes a full random orthogonal rotation (QR decomposition of a +matrix with i.i.d. N(0,1) entries, yielding a Haar-uniform orthogonal matrix) +for the MSE stage — O(d²) storage and O(d²) per-vector. For the QJL stage, the +paper uses a random Gaussian projection matrix S with i.i.d. 
N(0,1) entries (not +an orthogonal rotation); this distinction matters for the unbiasedness proof. + +**Comparison to Product Quantization.** TurboQuant's block decomposition (Stage +2 of this RFC) is structurally similar to Product Quantization (PQ) [9]: both +partition a vector into sub-vectors and quantize each independently. The key +differences are: + +| | TurboQuant | PQ | +| ---------------------- | --------------------------------------------------------------- | -------------------------------------------------------- | +| Quantization type | Scalar (per-coordinate, after rotation) | Vector (per-sub-vector, learned codebook) | +| Codebook | Analytically derived from Beta distribution; **data-oblivious** | Learned via k-means on training data; **data-dependent** | +| Rotation | Random orthogonal within each sub-vector | Typically none (OPQ [10] adds a learned rotation) | +| Theoretical guarantees | Provable MSE bound (Theorem 1 [1]) | Empirical quality only | +| Indexing time | Zero (codebook precomputed from distribution) | Requires training pass over data | +| Bits per sub-vector | Scalar: b bits per coordinate | Vector: typically 8 bits per sub-vector (256 codewords) | + +TurboQuant trades PQ's flexibility (data-dependent codebooks can exploit +structure) for data-obliviousness (no training, provable bounds, zero indexing +time). For uniformly distributed embeddings, TurboQuant's analytically optimal +centroids should match or exceed PQ's learned codebooks. For highly structured +data, PQ may still win empirically. ### Current Vortex implementation @@ -240,9 +261,9 @@ zero-padding for non-power-of-2, slice/take/scalar_at pushdowns, quantized cosine similarity and dot product, compression scheme integration, minimum dim=3. **Added to metadata (for forward compat):** `block_size: u32` (always = -padded_dim), `num_blocks: u32` (always = 1), `is_pdx: bool` (always = false). 
-These fields are inert in Stage 1 but enable Stage 2/3 decoders to read -Stage 1 files. +padded_dim), `num_blocks: u32` (always = 1). These fields are inert in Stage 1 +but enable Stage 2 decoders to read Stage 1 files. (PDX is handled via the +codes child type, not a metadata flag — see Stage 3.) This is a complete, useful encoding for all dimensions. Power-of-2 dimensions have zero padding waste; non-power-of-2 dimensions have the padding overhead @@ -363,9 +384,9 @@ SORF) are suitable. random orthogonal rotation to make coordinates independent. If we integrate ADSampling-style dimension pruning (see Stage 3), the same rotation could serve both purposes: producing the Beta distribution for quantization AND enabling -hypothesis-testing for early pruning. This would avoid rotating the data twice -and is a natural future optimization when combining block-TurboQuant with -PDX-style scans. +hypothesis-testing for early pruning. This would avoid rotating the data twice. +Note that the query must also be rotated at query time with the same rotation +matrix (stored as a shared child); ADSampling already requires this. #### Quantized-domain operations @@ -421,23 +442,52 @@ x̃ = concat(x̂₀, ..., x̂ₖ₋₁) ### Stage 3: PDX dimension-major layout -Transpose code storage from row-major to dimension-major within groups of 64 -vectors [4]. The 64-vector group size is independent of B. +Introduce a new `PDXArray` encoding type that wraps any `FixedSizeListArray` +with a dimension-major layout within groups of 64 vectors [4]. PDXArray is +**not TurboQuant-specific** — it is a general-purpose layout optimization for +any FixedSizeList of scalar elements (raw float vectors, scalar-quantized +vectors, TurboQuant codes, etc.). **Changes vs. 
Stage 2:** -| Aspect | Stage 2 | Stage 3 | -| ---------------------- | ------------------------------------------------ | ----------------------------------------------------------------- | -| Codes layout | Row-major (all codes for one vector contiguous) | **Dimension-major within 64-vector chunks** | -| Metadata | `is_pdx = false` | **`is_pdx = true`** | -| Distance kernel | Per-vector loop with per-element centroid lookup | **SIMD-friendly 64-vector inner loop with distance-table lookup** | -| Decode path | Direct inverse SORF per vector | **Un-transpose 64-vector chunk first**, then inverse SORF | -| QJL signs (if present) | Row-major | **Also transposed** (same PDX layout as codes) | +| Aspect | Stage 2 | Stage 3 | +| ---------------- | ------------------------------------------------ | ----------------------------------------------------------------- | +| Codes child type | `FixedSizeListArray` | **`PDXArray`** (wraps FSL with transposed layout) | +| TQ metadata | `is_pdx` field | **Removed** — TQ checks if codes child is PDXArray | +| Distance kernel | Per-vector loop with per-element centroid lookup | **SIMD-friendly 64-vector inner loop with distance-table lookup** | +| Decode path | Direct inverse SORF per vector | **PDXArray.to_fsl() first**, then inverse SORF | **Unchanged from Stage 2:** Block size B, centroid computation, norm storage, -SORF rotation, all encoding logic (PDX transpose is applied after encoding). -The encode path produces row-major codes then transposes; the decode path -un-transposes then decodes. +SORF rotation, all encoding logic. The encode path produces row-major codes +(FSL), then the compressor wraps them in a PDXArray; the decode path converts +PDXArray back to FSL then decodes. + +**PDXArray design:** + +``` +PDXArray (general-purpose dimension-major layout for FixedSizeList) +├── metadata: { list_size, chunk_size (= 64) } +├── elements: PrimitiveArray # transposed: 64 values per dim, contiguous +├── validity: ... 
# same as FSL validity +``` + +- `PDXArray::try_new(fsl)` — transposes a FixedSizeListArray into PDX layout +- `PDXArray::to_fsl()` — un-transposes back to row-major FSL (for decode, + scalar_at, or non-aligned slice/take) +- `PDXArray::elements_for_dim(dim, chunk)` — O(1) access to a contiguous slice + of 64 values for one dimension within one chunk +- Slice/take: un-transpose to FSL (simplest). Preserving PDX layout is possible + only for 64-vector-aligned ranges. +- The cascade compressor treats PDXArray as a valid encoding of FSL-typed data. + +**Benefits of PDXArray as a separate type:** + +- PDX logic tested and maintained independently of TurboQuant +- Other encodings (raw float vectors, scalar quantization, future encodings) + get PDX scan performance for free +- TurboQuant doesn't need an `is_pdx` metadata flag — it checks its codes + child's type at runtime +- The distance kernel operates on PDXArray's dimension-contiguous slices Within each 64-vector chunk, codes are stored dimension-major: @@ -484,12 +534,15 @@ for tq_block in 0..k { **Int8 layout variant.** The PDX implementation [pdx-impl] uses a different tiling for int8 data: "4 dims × 16 vecs" to leverage VPDPBUSD/UDOT hardware -dot-product instructions. For TurboQuant codes at b_mse ≤ 8, codes are u8 -centroid indices (not linear values), so VPDPBUSD doesn't apply directly — we -need the distance-table-lookup path shown above. However, if we support a linear -quantization mode (b_mse=8 with uniform centroids), the "4 dims × 16 vecs" -layout could enable direct hardware dot-product on the codes, bypassing the -lookup table entirely. This is a potential Stage 3 optimization to evaluate. +dot-product instructions (which process 4 unsigned×signed byte pairs per +operation). For TurboQuant codes at b_mse ≤ 8, codes are uint8 centroid indices, +so VPDPBUSD doesn't apply directly — we need the distance-table-lookup path +shown above. 
However, at b_mse=8 with high B, the Max-Lloyd centroids are +near-uniformly spaced (see GPU section), potentially enabling direct hardware +dot-product on the codes. Whether this requires a separate linear quantization +mode or works with the existing Max-Lloyd centroids is an empirical question. The +"4 dims × 16 vecs" layout would be a Stage 3 optimization to evaluate alongside +the "1 dim × 64 vecs" float-style layout. **ADSampling integration.** The PDX dimension-pruning approach (ADSampling [4]) is complementary to TurboQuant's block structure. During a scan, the pruner @@ -500,12 +553,12 @@ boundaries (as shown in the kernel above), which our design already provides. **Open design questions:** -- Slice/take on PDX-transposed codes: produce row-major (simpler) or preserve - PDX (aligned 64-vector slices only)? -- Is PDX a property of the encoding or a separate layout layer? -- How does the compressor see the transposed codes? -- Should we support the "4 dims × 16 vecs" int8 layout variant alongside the - "1 dim × 64 vecs" float-style layout? +- Should PDXArray live in `vortex-array` (general infrastructure) or + `vortex-tensor` (vector-specific)? +- Should the cascade compressor automatically PDX-transpose FSL children when + it detects a scan-heavy workload, or should PDX be opt-in? +- Should we support the "4 dims × 16 vecs" uint8 layout variant (for hardware + dot-product) alongside the "1 dim × 64 vecs" float-style layout? ### QJL correction (deferred — experimental) @@ -546,10 +599,11 @@ bit widths, so QJL may not be worth the complexity. 
``` TurboQuantArray ├── metadata: { dimension, b_mse, block_size (= padded_dim), -│ num_blocks (= 1), is_pdx (= false) } +│ num_blocks (= 1) } │ │ # Per-row children ├── codes: FixedSizeListArray # list_size = padded_dim +│ (or PDXArray after Stage 3) ├── norms: PrimitiveArray # len = num_rows (F = f64 for f64, f32 otherwise) │ │ # Shared children @@ -558,16 +612,19 @@ TurboQuantArray ``` Same structure as the [current PR][current-impl] minus the 3 QJL slots, plus -the forward-compatible metadata fields and dtype-matching norms. +the forward-compatible metadata fields and dtype-matching norms. The codes child +is `FixedSizeListArray` in Stages 1-2 and may be swapped to `PDXArray` in Stage +3 — TurboQuant checks the child type at runtime, not via a metadata flag. ### Stage 2 (block decomposition) ``` TurboQuantArray (self-contained, handles blocks internally) -├── metadata: { dimension, b_mse, block_size, num_blocks, is_pdx } +├── metadata: { dimension, b_mse, block_size, num_blocks } │ │ # Per-row children (sliced/taken on row operations) ├── codes: FixedSizeListArray # list_size = k × B +│ (or PDXArray after Stage 3) ├── norms: PrimitiveArray # len = num_rows (k=1) │ or FixedSizeListArray # list_size = k (k>1) │ @@ -578,7 +635,8 @@ TurboQuantArray (self-contained, handles blocks internally) ## Compression ratio -For f32 input, b_mse bits MSE, k = d/B blocks, N vectors: +For f32 input, b_mse bits MSE, k = d/B blocks, N vectors (for f64 input, +replace 32 with 64 in the norms row — ratios decrease accordingly): | Component | Bits per vector | | ----------- | --------------- | @@ -605,7 +663,9 @@ improvement. For d=1024 the encoding is identical to current. ### Encode/decode throughput -SORF at B dimensions: 3 × B × log₂(B) + 3 × B FLOPs per block. For k blocks: +SORF at B dimensions: 3 × B × log₂(B) butterflies + 3 × B sign applications +per block (plus B normalization multiplies, omitted for simplicity). 
For k +blocks: | B | SORF FLOPs/block | k (d=768) | Total MSE FLOPs | | -------------- | ------------------------- | --------- | --------------- | @@ -637,9 +697,41 @@ approach, despite more blocks, because each block is smaller. - Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim padded SORF QJL vs. MSE-only -- Key metric: ANN recall@k on standard benchmarks (SIFT, GloVe) +- Key metric: ANN recall@k on the datasets above (Contriever, OpenAI, SIFT) - Per community findings, MSE-only is expected to win [8] +### Benchmarking datasets + +The current test suite uses i.i.d. Gaussian vectors, which is a pessimistic +baseline for TurboQuant: real embeddings have structure (clusters, anisotropy) +that rotation-based quantization can exploit, while Gaussian vectors are already +rotationally invariant (the rotation is a no-op in distribution). Recent work +(VIBE [11]) argues that traditional benchmarks (SIFT, GloVe) are no longer +representative of modern ANN workloads. + +**Recommended datasets:** + +| Dataset | Dim | Size | Source | Why | +| ----------------------------- | ------ | ------ | ---------------- | ------------------------------------------------------ | +| Contriever | 768 | ~1M | PDX paper [4] | Key non-power-of-2 target; real embeddings | +| OpenAI text-embedding-3-large | 1536 | ~1M | Common in RAG | High-d production embeddings | +| SIFT | 128 | 1M | Classic | Low-d power-of-2 baseline, well-studied recall numbers | +| arXiv embeddings | 768 | 2.25M | PDX paper [4] | Same dim as Contriever, larger scale | +| DEEP | 96 | 10M | Image embeddings | Large scale | +| Synthetic Gaussian | varies | varies | Internal | Pessimistic baseline; validates theoretical bounds | + +**Metrics** (at b_mse ∈ {2, 3, 4, 5, 8}): + +- Recall@10, Recall@100 (ANN ranking quality) +- Normalized MSE distortion (reconstruction quality) +- Inner product mean signed relative error (bias measurement) +- Encode/decode throughput (vectors/sec) + +The Gaussian baseline validates 
that theoretical bounds hold. The real-embedding +datasets measure practical quality — which may be **better** than Gaussian +(structured data benefits more from rotation) or **worse** (if the data has +adversarial properties for the specific rotation). + ### Straggler handling (if needed) Rare for common dimensions. If encountered: zero-pad to B (simplest). Follow-up: @@ -657,8 +749,10 @@ internal children. The `TurboQuantScheme::compress()` method must be updated to: (a) choose B based on d, (b) split input into blocks, (c) normalize per-block, (d) encode each block, and (e) store per-block norms as an internal child array. -**Phase 3** — PDX layout: Dimension-major code transposition within 64-vector -chunks. Distance computation kernels. +**Phase 3** — PDXArray + scan kernels: Introduce `PDXArray` as a general-purpose +dimension-major layout for `FixedSizeListArray`. TurboQuant's codes child is +swapped from FSL to PDXArray by the compressor. Distance computation kernels +operate on PDXArray's dimension-contiguous slices. **Phase 4** (experimental) — QJL: If the experimental plan shows QJL improves recall@k beyond MSE-only, add per-block Gaussian or SORF QJL. Based on @@ -696,9 +790,60 @@ distance table fits in shared memory (1 KB at b_mse=4, 4 KB at b_mse=5); the kernel streams code bytes from HBM with gather-reduce accumulation, using 4-8× less bandwidth than full float vectors. -At b=8, codes are raw int8 indices. Direct int8 tensor core GEMM requires -approximately linear centroids (sacrificing Max-Lloyd optimality); viable for -ANN ranking but not reconstruction. +At b_mse=8, codes are uint8 indices (0-255). Direct int8 tensor core GEMM +(using codes as the unsigned operand in VPDPBUSD) requires approximately linear +centroids — but at high B the Max-Lloyd centroids are already near-uniform +(the Beta distribution is highly concentrated, approaching Gaussian, for which +high-resolution optimal quantization is approximately uniform). 
Whether the +existing Max-Lloyd centroids are "linear enough" for hardware dot-product +instructions is an empirical question worth testing before introducing a +separate linear quantization mode. + +## Integration with Vortex scan engine + +TurboQuant's quantized-domain operations must integrate with Vortex's expression +evaluation and scan pushdown infrastructure. The current implementation provides +this via `ScalarFnVTable` implementations in `vortex-tensor`. + +**Current integration path.** The `CosineSimilarity`, `DotProduct`, and `L2Norm` +scalar functions check whether their input storage arrays are TurboQuant-encoded +(via `TurboQuant::try_match()`). If both operands are TurboQuant and the +`ApproxOptions::Approximate` flag is set, the scalar function dispatches to the +quantized-domain kernel (e.g., `cosine_similarity_quantized_column`), bypassing +full decompression. Otherwise, it falls back to the exact path (decompress → +compute on floats). + +**Stage 2 changes.** With block decomposition, the quantized kernels must be +updated to iterate over TQ blocks, weighting by per-block norms: + +- `cosine_similarity_quantized_column`: currently computes a single unit-norm + dot product per row pair. Must change to `Σ_k norm_a_k · norm_b_k · +unit_dot_k / (‖a‖ · ‖b‖)` with `‖a‖ = √(Σ_k norm_a_k²)`. +- `dot_product_quantized_column`: same per-block weighting. +- `l2_norm`: currently returns the stored norm directly (O(1)). Must change to + `√(Σ_k norm_k²)` — read the norms FSL child and compute. +- Both operands must have the **same block size B** and compatible centroids for + the quantized path to apply. If block sizes differ, fall back to exact. + +**Stage 3 changes.** The PDX distance kernel (shown in Stage 3 pseudocode) is a +new execution path that operates on `PDXArray`-typed codes. 
It should be exposed +as an alternative `ScalarFnVTable` implementation that activates when the codes +child is a `PDXArray` and the scan is over a contiguous 64-vector-aligned range. +For non-aligned ranges or single-vector access (`scalar_at`), the PDXArray is +converted to FSL first via `PDXArray::to_fsl()`. + +**Expression tree integration.** The typical ANN scan expression is: + +``` +top_k(cosine_similarity(column, constant_query), k=10) +``` + +The `constant_query` is broadcast to match the column length. The +`CosineSimilarity` scalar function receives both the column (TurboQuant-encoded) +and the query (ConstantArray wrapping a single vector). For the quantized path, +the query is first encoded with the column's rotation and centroids to produce +query codes and query block norms, then the PDX kernel runs over the column's +codes without decompressing them. ## Migration and compatibility @@ -706,10 +851,9 @@ TurboQuant has not shipped yet, so there are no existing files to migrate. We can design the metadata for forward compatibility from day one. **Strategy: single array ID, versioned metadata.** All stages use the same array -ID (`vortex.turboquant`). The metadata includes `block_size`, `num_blocks`, and -`is_pdx` fields from Stage 1 onward. Stage 1 always writes `num_blocks=1, -is_pdx=false`, but the fields exist so that Stage 2 and 3 decoders can read -Stage 1 files without migration. +ID (`vortex.turboquant`). The metadata includes `block_size` and `num_blocks` +fields from Stage 1 onward. Stage 1 always writes `num_blocks=1`, but the field +exists so that Stage 2 decoders can read Stage 1 files without migration. **Norms are always internal children.** The TurboQuant array is self-contained — it stores norms as a child slot, not in a parent encoding. This means: @@ -723,18 +867,20 @@ The decoder distinguishes k=1 from k>1 by reading `num_blocks` from metadata. A k=1 decoder is backward-compatible with Stage 1 files. 
A k>1 decoder is a new code path that only applies to files written by Stage 2+. -**Stage 3 (PDX) is additive.** The `is_pdx` flag in metadata tells the decoder -whether codes are row-major or dimension-major. Stage 1/2 files have -`is_pdx=false`; Stage 3 files have `is_pdx=true`. The decoder un-transposes -PDX files on read if needed. No migration required. +**Stage 3 (PDXArray) is additive.** PDX is not a TurboQuant metadata flag — it's +a separate array type (`PDXArray`) that wraps the codes child. Stage 1/2 files +have `FixedSizeListArray` codes; Stage 3 files have `PDXArray` codes. The +TurboQuant decoder checks the child type and un-transposes PDXArray on decode if +needed. `PDXArray` itself is registered as a new encoding, independent of +TurboQuant. **Incremental shipping:** -| Stage | Ships to users? | Reads Stage 1 files? | Notes | -| ------------ | ---------------- | ---------------------- | ----------------------------------- | -| 1 (MSE-only) | Yes, immediately | N/A (first version) | New encoding, no backcompat concern | -| 2 (blocks) | Yes | Yes (k=1 is identical) | k>1 files need Stage 2+ decoder | -| 3 (PDX) | Yes | Yes (is_pdx=false) | PDX files need Stage 3 decoder | +| Stage | Ships to users? | Reads Stage 1 files? | Notes | +| ------------ | ---------------- | -------------------------- | ----------------------------------- | +| 1 (MSE-only) | Yes, immediately | N/A (first version) | New encoding, no backcompat concern | +| 2 (blocks) | Yes | Yes (k=1 is identical) | k>1 files need Stage 2+ decoder | +| 3 (PDX) | Yes | Yes (FSL codes still work) | PDX codes need PDXArray registered | Each stage is independently shippable. Users can upgrade incrementally. Files written by earlier stages are always readable by later decoders. @@ -770,3 +916,12 @@ ggml-org/llama.cpp#20969 (C/C++, quantized attention analysis), 0xSero/turboquant (Triton kernels), vivekvar-dl/turboquant (pip package), scos-lab/turboquant (reference reproduction). 
Consensus: MSE-only beats MSE+QJL for attention and ANN ranking at all tested bit widths. + +[9] Jégou, H., Douze, M. and Schmid, C. "Product Quantization for Nearest +Neighbor Search." IEEE Trans. PAMI 33(1):117-128, 2011. + +[10] Ge, T., He, K., Ke, Q. and Sun, J. "Optimized Product Quantization." +IEEE Trans. PAMI 36(4):744-755, 2014. + +[11] Kuffo, L. et al. "VIBE: Vector Index Benchmark for Embeddings." +arXiv:2505.17810, May 2025. From 811ac46c7c7c16eb45fac5de57a965d153f789a8 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 11:44:53 -0400 Subject: [PATCH 02/19] fixes Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 6c72341..88cbc7e 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -450,12 +450,12 @@ vectors, TurboQuant codes, etc.). **Changes vs. Stage 2:** -| Aspect | Stage 2 | Stage 3 | -| ---------------- | ------------------------------------------------ | ----------------------------------------------------------------- | -| Codes child type | `FixedSizeListArray` | **`PDXArray`** (wraps FSL with transposed layout) | -| TQ metadata | `is_pdx` field | **Removed** — TQ checks if codes child is PDXArray | -| Distance kernel | Per-vector loop with per-element centroid lookup | **SIMD-friendly 64-vector inner loop with distance-table lookup** | -| Decode path | Direct inverse SORF per vector | **PDXArray.to_fsl() first**, then inverse SORF | +| Aspect | Stage 2 | Stage 3 | +| ---------------- | ------------------------------------------------ | ------------------------------------------------------------------------------- | +| Codes child type | `FixedSizeListArray` | **`PDXArray`** (wraps FSL with transposed layout) | +| Codes detection | N/A (codes always FSL) | **TQ checks child type**: FSL → row-major decode, PDXArray → 
un-transpose first | +| Distance kernel | Per-vector loop with per-element centroid lookup | **SIMD-friendly 64-vector inner loop with distance-table lookup** | +| Decode path | Direct inverse SORF per vector | **PDXArray.to_fsl() first**, then inverse SORF | **Unchanged from Stage 2:** Block size B, centroid computation, norm storage, SORF rotation, all encoding logic. The encode path produces row-major codes @@ -717,7 +717,7 @@ representative of modern ANN workloads. | OpenAI text-embedding-3-large | 1536 | ~1M | Common in RAG | High-d production embeddings | | SIFT | 128 | 1M | Classic | Low-d power-of-2 baseline, well-studied recall numbers | | arXiv embeddings | 768 | 2.25M | PDX paper [4] | Same dim as Contriever, larger scale | -| DEEP | 96 | 10M | Image embeddings | Large scale | +| DEEP | 96 | 10M | Image embeddings | Large scale; d=96 has no B ≥ 64 divisor → padded path | | Synthetic Gaussian | varies | varies | Internal | Pessimistic baseline; validates theoretical bounds | **Metrics** (at b_mse ∈ {2, 3, 4, 5, 8}): @@ -923,5 +923,5 @@ Neighbor Search." IEEE Trans. PAMI 33(1):117-128, 2011. [10] Ge, T., He, K., Ke, Q. and Sun, J. "Optimized Product Quantization." IEEE Trans. PAMI 36(4):744-755, 2014. -[11] Kuffo, L. et al. "VIBE: Vector Index Benchmark for Embeddings." -arXiv:2505.17810, May 2025. +[11] Jääsaari, E., Hyvönen, V., Ceccarello, M., Roos, T. and Aumüller, M. +"VIBE: Vector Index Benchmark for Embeddings." arXiv:2505.17810, May 2025. 
From 5c913538a8a18b494f1d197f171e5568f30959f5 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 11:54:51 -0400 Subject: [PATCH 03/19] ultra review gpt 5.4 Signed-off-by: Will Manning --- .../0033-block-turboquant-review-gpt-5.4.md | 347 ++++++++++++++++++ 1 file changed, 347 insertions(+) create mode 100644 proposed/0033-block-turboquant-review-gpt-5.4.md diff --git a/proposed/0033-block-turboquant-review-gpt-5.4.md b/proposed/0033-block-turboquant-review-gpt-5.4.md new file mode 100644 index 0000000..f6ddb74 --- /dev/null +++ b/proposed/0033-block-turboquant-review-gpt-5.4.md @@ -0,0 +1,347 @@ +# Review of `0033-block-turboquant.md` + +## Scope + +This review checks the RFC against: + +- the TurboQuant paper (`arXiv:2504.19874`) +- the PDX paper (`arXiv:2503.04422`) +- the cited SORF / ORF paper (`arXiv:1610.09072`) +- the cited PQ / OPQ papers +- the referenced open-source implementations and publicly available discussions that could be located + +The goal of this review is not to argue against the proposal direction. The goal is to make the RFC maximally defensible when read by experts who will check claims, citations, and wording very closely. + +## Executive Summary + +The proposal direction is plausible, and several technical points in the RFC are solid, especially: + +- the Theorem 1 constant correction +- the distinction between orthogonal MSE rotation and Gaussian QJL projection +- the rationale for treating SORF as an approximation rather than as a theorem-preserving drop-in replacement + +The largest problems are not in the core block-decomposition idea. They are in the rhetoric and sourcing around it: + +1. The RFC currently overclaims that community evidence supports dropping QJL for **ANN ranking**, when the located evidence is primarily about **KV-cache attention**. +2. The RFC overstates the PDX paper's speedup claim. +3. The PQ comparison contains an unsupported superiority claim that is likely to irritate reviewers. +4. 
The ADSampling integration discussion makes a nontrivial compatibility question sound easy. +5. The citation hygiene for `[7]` and especially `[8]` is not strong enough for external review. + +## Primary Findings + +### 1. Overclaim: evidence does not currently justify the ANN-ranking conclusion + +The most serious issue is the scope of the QJL claim. The current RFC says: + +> Community findings from 6+ independent TurboQuant implementations consistently show that MSE-only outperforms MSE+QJL for attention and ANN ranking in practice. + +The evidence I could verify does support a strong claim for **KV-cache attention**: + +- `tonbistudio/turboquant-pytorch` explicitly argues that QJL hurts because softmax amplifies variance. +- `scos-lab/turboquant` also reports MSE beating Prod/QJL for attention-like workloads. +- other community sources appear to be in the same family of KV-cache experiments. + +However, that is not the same thing as evidence for ANN ranking. In fact, one of the strongest located community sources explicitly distinguishes the two and says QJL may still work for vector search because there is no softmax nonlinearity. + +That means the current wording is too strong in two ways: + +- it extends **attention evidence** to **ANN ranking** +- it uses that extension to justify a product decision for Vortex's search/storage use case + +For outside review, the RFC should either: + +- narrow the claim to KV-cache attention only, or +- add actual ANN experiments and cite those directly + +### 2. Mis-citation: the PDX paper is overstated + +The RFC currently says PDX achieves "on average 2x speedups over SIMD-optimized row-major kernels." + +The PDX paper's abstract says: + +- PDX beats SIMD-optimized horizontal kernels by **average 40%** +- pruning approaches recover **2-7x** benefit when used with PDX + +Those are different claims. The RFC currently mixes them together in a way that overstates what the paper says. + +### 3. 
Unsupported comparison: TurboQuant is presented as likely superior to PQ on uniform embeddings + +The RFC currently says: + +> For uniformly distributed embeddings, TurboQuant's analytically optimal centroids should match or exceed PQ's learned codebooks. + +This is not supported by the cited PQ/OPQ literature, and it is not obviously true. PQ uses learned **vector** codebooks in subspaces, while TurboQuant uses rotated **scalar** quantization. The correct contrast is: + +- TurboQuant is training-free, data-oblivious, and analyzable. +- PQ/OPQ are data-dependent and require training. +- PQ/OPQ may still be empirically stronger because vector codebooks are more expressive. + +The current sentence sounds like a theorem-shaped statement without theorem-level support. + +### 4. ADSampling integration is presented too casually + +The RFC suggests that TurboQuant and ADSampling might share the same rotation. + +That is not obviously compatible with the proposed Stage 2 design: + +- ADSampling relies on a single full-dimensional random orthogonal projection whose coordinates can be sequentially sampled. +- Stage 2 proposes per-block rotations with blockwise norms and blockwise accumulation. + +A blockwise-rotated representation is not automatically interchangeable with the globally rotated representation assumed by ADSampling's pruning logic. This may still be possible, but it is a research question, not a straightforward integration detail. + +### 5. Citation hygiene is too weak for external review + +Two issues stand out: + +- `[8]` is a prose bundle of repos and issue references rather than an auditable citation. +- `[7]` was not publicly discoverable under the cited title during review. + +For a document going to experts, `[8]` should be expanded into explicit entries with: + +- repository / issue / PR URL +- commit SHA or tag if relevant +- workload type: KV attention vs ANN search +- metric: perplexity, recall@k, cosine, etc. 
+- conclusion actually supported by that source + +If `[7]` is intended as a public citation, it should have a public URL. If it is private, the RFC should not lean on it heavily in externally circulated form. + +### 6. GPU section uses CPU instruction terminology + +The GPU section references `VPDPBUSD`, which is an x86 CPU instruction, not a GPU tensor-core primitive. The section needs either: + +- CPU wording, or +- GPU-native terminology + +Otherwise it looks like a hardware-model mix-up. + +### 7. One worked-example note contradicts the design + +The Stage 2 worked example for `d=768, B=256, k=3` is labeled "zero padding" in the notes column. That should be removed or changed; Stage 2 is explicitly avoiding padding in that case. + +## Secondary Notes + +These items looked good or at least defensible: + +- The Theorem 1 constant appears correctly interpreted as `sqrt(3) * pi / 2`. +- The QJL scale-factor correction appears correct. +- The distinction between QR/Haar rotation for MSE and Gaussian `S` for QJL is correctly emphasized. +- The revised VIBE citation is now correct. + +## Recommended Editorial Strategy + +Before sharing this RFC externally, the safest editorial move is: + +1. Keep the proposal structure. +2. Tighten all empirical claims to exactly what the evidence shows. +3. Replace suggestive superiority language with narrower, falsifiable wording. +4. Mark ADSampling integration as speculative / future investigation. +5. Strengthen citations, especially `[8]`. + +## Proposed Redline + +This redline is intentionally targeted. It focuses on the passages that most need correction before external circulation. + +### 1. Summary: narrow the QJL claim + +#### Proposed replacement + +```diff +-QJL correction is deferred to a later stage and may ultimately be dropped. +-Community findings from 6+ independent TurboQuant implementations consistently +-show that MSE-only outperforms MSE+QJL for attention and ANN ranking in +-practice [8]. 
++QJL correction is deferred to a later stage and may ultimately be dropped. ++Community findings from multiple independent TurboQuant implementations ++consistently show that MSE-only outperforms MSE+QJL for KV-cache attention in ++practice [8]. For ANN ranking and vector-search workloads, the evidence is ++currently less complete, so QJL should remain an empirical question rather than ++a settled conclusion. +``` + +### 2. PQ comparison: remove unsupported superiority language + +#### Proposed replacement + +```diff + TurboQuant trades PQ's flexibility (data-dependent codebooks can exploit + structure) for data-obliviousness (no training, provable bounds, zero indexing + time). +-For uniformly distributed embeddings, TurboQuant's analytically optimal +-centroids should match or exceed PQ's learned codebooks. For highly structured +-data, PQ may still win empirically. ++In return, PQ and OPQ retain a major advantage in expressivity: they learn ++sub-vector codebooks from data rather than applying an analytic scalar ++quantizer. In practice this means TurboQuant is attractive when training-free ++operation, simple deployment, and theoretical guarantees matter most, while PQ ++or OPQ may still win empirically when a learned vector codebook can exploit ++dataset-specific structure. +``` + +### 3. Community QJL section: separate attention from ANN + +#### Proposed replacement + +```diff + ### Community findings on QJL + + Multiple independent TurboQuant implementations have converged on a +-significant practical finding: **MSE-only consistently outperforms MSE+QJL for +-attention and ANN ranking**. The mechanism is a variance-bias tradeoff: +-TurboQuant's QJL correction eliminates bias but increases variance, and softmax +-attention (and cosine/L2 ranking) amplifies variance more than bias. At the same +-total bit budget, allocating all bits to MSE (more centroids, lower variance) +-beats splitting between MSE + QJL (fewer centroids + 1-bit correction). 
This has +-been confirmed by 6+ groups across Python, C, and Rust implementations [8]. ++significant practical finding for **KV-cache attention**: MSE-only often ++outperforms MSE+QJL at the same bit budget. The likely mechanism is a ++variance-bias tradeoff: QJL removes bias in raw inner-product estimation but ++adds variance, and the softmax nonlinearity can amplify variance more than it ++penalizes bias. In that setting, allocating all bits to MSE (more centroids, ++lower variance) can beat splitting the budget between MSE + QJL. This behavior ++has been reported by multiple groups across Python, C, and Rust implementations ++[8]. + +-This finding strongly supports making MSE-only the default strategy for our +-columnar storage use case (ANN search, cosine similarity ranking). ++For ANN search, cosine ranking, and other non-softmax vector-search workloads, ++the evidence is currently less settled. MSE-only is still a reasonable default ++because it is simpler and better supported by the current implementation work, ++but the RFC should treat the ANN question as empirical until evaluated on ANN ++datasets with recall@k and ranking metrics. +``` + +### 4. PDX section: correct the speedup claim + +#### Proposed replacement + +```diff + PDX [4] is a data layout for vector similarity search. The paper (SIGMOD '25) + describes a dimension-major layout within fixed-size blocks of 64 vectors, + enabling the compiler to auto-vectorize the inner distance loop over vectors +-rather than dimensions, achieving on average 2× speedups over SIMD-optimized +-row-major kernels on modern CPUs. The block size of 64 is empirically optimal ++rather than dimensions. In the paper, this yields average speedups of about 40% ++over SIMD-optimized row-major kernels for the direct-kernel comparison, while ++dimension-pruning methods recover much larger gains when paired with the PDX ++layout [4]. The block size of 64 is empirically optimal + across AVX-512, AVX2, and NEON architectures [4]. 
+``` + +### 5. ADSampling integration: mark as speculative + +#### Proposed replacement + +```diff + **Shared rotation with ADSampling.** Both TurboQuant and ADSampling apply a + random orthogonal rotation to make coordinates independent. If we integrate + ADSampling-style dimension pruning (see Stage 3), the same rotation could serve + both purposes: producing the Beta distribution for quantization AND enabling +-hypothesis-testing for early pruning. This would avoid rotating the data twice. +-Note that the query must also be rotated at query time with the same rotation +-matrix (stored as a shared child); ADSampling already requires this. ++hypothesis-testing for early pruning. However, this is not automatic under the ++Stage 2 block-decomposed design: ADSampling is formulated around a single ++full-dimensional random projection, whereas Stage 2 introduces per-block ++rotations and per-block norm weighting. Reusing one rotation across both systems ++should therefore be treated as a future research direction that requires either ++new analysis or direct empirical validation. If it proves viable, it would avoid ++rotating the data twice. The query would also need to be rotated at query time ++with the same stored transform. +``` + +### 6. Worked examples: fix the contradictory note + +#### Proposed replacement + +```diff +-| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; zero padding | ++| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; no padding | +``` + +### 7. GPU section: remove the CPU/GPU terminology mix + +#### Proposed replacement + +```diff + At b_mse=8, codes are uint8 indices (0-255). 
Direct int8 tensor core GEMM +-(using codes as the unsigned operand in VPDPBUSD) requires approximately linear ++or byte-dot-product execution on low-precision hardware requires approximately linear + centroids — but at high B the Max-Lloyd centroids are already near-uniform + (the Beta distribution is highly concentrated, approaching Gaussian, for which + high-resolution optimal quantization is approximately uniform). Whether the + existing Max-Lloyd centroids are "linear enough" for hardware dot-product + instructions is an empirical question worth testing before introducing a + separate linear quantization mode. +``` + +If you want to be more explicit, you could instead split this into separate CPU and GPU paragraphs. + +### 8. Reference `[8]`: make it auditable + +#### Proposed replacement + +Replace the current bundled prose citation with something like: + +```diff +-[8] Community TurboQuant implementations and findings. Key sources: +-tonbistudio/turboquant-pytorch (PyTorch, V3 MSE-only findings), +-ggml-org/llama.cpp#20969 (C/C++, quantized attention analysis), +-0xSero/turboquant (Triton kernels), vivekvar-dl/turboquant (pip package), +-scos-lab/turboquant (reference reproduction). Consensus: MSE-only beats +-MSE+QJL for attention and ANN ranking at all tested bit widths. ++[8] Community TurboQuant implementation reports. These sources primarily study ++KV-cache attention rather than ANN search, and should be cited individually ++with exact URLs and workload scope in the final external draft. Representative ++examples include: ++- tonbistudio/turboquant-pytorch, issue #10 and README discussion of V2 ++ (MSE+QJL) vs V3 (MSE-only) behavior on attention and generation. ++- scos-lab/turboquant README discussion of MSE vs Prod/QJL for KV-cache ++ attention workloads. ++- 0xSero/turboquant README and validation scripts for paper checks and ++ implementation behavior. ++These sources support a strong claim for KV-cache attention. 
They do not, by ++themselves, establish the same conclusion for ANN ranking. +``` + +This version is intentionally conservative. If you have additional ANN-specific sources, add them here explicitly and then strengthen the main text accordingly. + +### 9. Reference `[7]`: either publish it or weaken dependence on it + +#### Proposed replacement note + +Not a text diff, but a release recommendation: + +- If `[7]` is public, add a direct URL. +- If `[7]` is private or unstable, reduce dependence on it in externally + circulated prose. + +For example, this sentence is fine if the report is public: + +```diff +-The Eviox corrections study [7] identified six material bugs in the paper's ++A third-party implementation review [7] identified six material bugs in the paper's + reference Python implementation. +``` + +But the best fix is still to make the citation resolvable. + +## Optional Stronger Rewrite + +If you want the RFC to sound maximally careful in front of skeptical reviewers, the simplest global substitution is: + +- replace `consistently outperforms` with `has often outperformed` +- replace `consensus` with `reported behavior` +- replace `supports making MSE-only the default for ANN` with `supports evaluating MSE-only first, while keeping ANN ranking as an empirical question` + +That wording preserves the proposal but removes the most attackable overclaims. + +## Suggested Next Pass + +If you want a tighter external-facing RFC, the next revision should: + +1. apply the redline above +2. expand `[8]` into exact citations +3. add one explicit sentence saying which claims are backed by theorem, which by implementation, and which remain hypotheses +4. 
add ANN-specific experiments before claiming ANN superiority for MSE-only over QJL From 476127d61f3de364a0da89d69fb71ac1f4b7df18 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 11:58:46 -0400 Subject: [PATCH 04/19] composer2 Signed-off-by: Will Manning --- ...k-turboquant-review-synthesis-composer2.md | 171 +++ ...0033-block-turboquant-revised-composer2.md | 988 ++++++++++++++++++ 2 files changed, 1159 insertions(+) create mode 100644 proposed/0033-block-turboquant-review-synthesis-composer2.md create mode 100644 proposed/0033-block-turboquant-revised-composer2.md diff --git a/proposed/0033-block-turboquant-review-synthesis-composer2.md b/proposed/0033-block-turboquant-review-synthesis-composer2.md new file mode 100644 index 0000000..56e03df --- /dev/null +++ b/proposed/0033-block-turboquant-review-synthesis-composer2.md @@ -0,0 +1,171 @@ +# Peer review synthesis: RFC 0033 Block-Decomposed TurboQuant with PDX + +**Document reviewed:** `proposed/0033-block-turboquant.md` +**Review date:** 2026-04-03 +**Purpose:** Consolidated findings from a detailed technical review (citations, papers, and spot-checks against arXiv HTML and GitHub). + +--- + +## Executive summary + +The RFC is unusually strong for an implementation plan: staged delivery, explicit approximation boundaries (SORF vs Haar, QJL vs MSE-only), and credible linkage to TurboQuant [1] and PDX [4]. For an expert audience, the highest-impact gaps are **broken or unverifiable citations**, **PDX speedup wording that does not match the PDX abstract**, **under-specified conditions for quantized dot product between two stored columns**, and **the blockwise MSE composition paragraph mixing deterministic algebra with probabilistic bounds**. Addressing those items—and making community claims auditable—would make the document review-resistant. 
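The deterministic half of the blockwise-composition point above is easy to make concrete: for any partition of a vector into coordinate blocks, the global relative squared error is exactly the norm-weighted average of the per-block relative squared errors. A minimal Python sketch (the dimensions and the 0.05 noise scale are illustrative stand-ins, not measured quantization error; the probabilistic step — taking expectations over the per-block rotations — is the part the synthesis asks the RFC to state explicitly):

```python
import random

def relative_sq_error(x, xhat):
    """Relative squared error ||x - xhat||^2 / ||x||^2."""
    num = sum((a - b) ** 2 for a, b in zip(x, xhat))
    den = sum(a * a for a in x)
    return num / den

random.seed(0)
d, B = 768, 256
x = [random.gauss(0.0, 1.0) for _ in range(d)]
xhat = [v + random.gauss(0.0, 0.05) for v in x]  # stand-in for quantization error

# Global relative squared error.
lhs = relative_sq_error(x, xhat)

# Norm-weighted average of per-block relative squared errors.
total = sum(v * v for v in x)
rhs = 0.0
for j in range(0, d, B):
    xb, xbh = x[j:j + B], xhat[j:j + B]
    weight = sum(v * v for v in xb) / total  # ||x_j||^2 / ||x||^2
    rhs += weight * relative_sq_error(xb, xbh)

# The identity is exact algebra (no probabilistic content), up to rounding.
assert abs(lhs - rhs) < 1e-9
```

Only the expectation-level argument — not this identity — depends on the randomization assumptions, which is why the synthesis asks for the two to be separated.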
+ +--- + +## Citations and bibliographic issues + +### Broken GitHub reference + +- **Finding:** `ggml-org/llama.cpp#20969` returns **404** (issue does not exist or was removed). +- **Action:** Replace with a **resolvable** link (e.g. [issue #20977](https://github.com/ggml-org/llama.cpp/issues/20977) “Feature Request: TurboQuant support,” or [discussion #21155](https://github.com/ggml-org/llama.cpp/discussions/21155)) and quote the **exact** claim about MSE-only vs MSE+QJL. + +### Eviox report [7] + +- **Finding:** “Eviox Tech Report v1.2.0, March 2026” has **no URL or DOI** in the RFC. Expert readers cannot verify bugs, Theorem 1 constant discussion, or QJL scale claims against that source. +- **Action:** Publish a stable PDF/link, **or** rephrase to “we verified against reference implementation at commit …” with reproducible steps. + +### Community list [8] + +- **Finding:** A list of repos plus “6+ groups” and “consensus” is **not** literature-grade evidence without commits, experiment definitions, and metrics. +- **Action:** Add a small **table** (source, commit or version, workload, bit width, metric, outcome) or move strong claims to “anecdotal / preliminary.” + +### TurboQuant paper internal references + +- **Lemma 1 / Theorem 2:** arXiv HTML aligns with “marginal density” material and **Definition 1** for QJL scaling; **theorem numbering** may differ in the ICLR 2026 camera-ready PDF. **Action:** Reconcile lemma/theorem numbers with the **final** PDF before wide distribution. + +- **QJL scale (Definition 1):** The paper gives \(Q_{\text{qjl}}^{-1}(\mathbf z) := \frac{\sqrt{\pi/2}}{d}\mathbf S^\top\mathbf z\). The RFC’s contrast of `√(π/(2d))` vs `√(π/2)/d` is **correct** (ratio involves **√d**). + +### PDX [4] speedup claims + +- **Finding:** The PDX **abstract** reports beating horizontal SIMD layouts by **~40%** on average (order **1.4×** end-to-end for that comparison), and **2–7×** when **combining PDX with dimension-pruning** (ADSampling/BSA). 
The RFC’s blanket “**on average 2×**” for PDX vs row-major **overstates** the abstract’s headline scalar-scan claim unless restricted to a specific figure/setup. +- **Action:** Quote **40%** for the core PDX-vs-horizontal result; cite **2–7×** only for **PDX + pruning** (with section/figure reference when possible). + +### Flash-KMeans [6] + +- **Finding:** Flash-KMeans is a **GPU k-means** paper (assignment/update kernels), not TurboQuant decode. Referring to “following the double-buffered streaming pattern” suggests direct algorithmic lineage. +- **Action:** Clarify **analogy** (IO-aware fused kernels), not the same problem or method. + +--- + +## Mathematics and methodology + +### Theorem 1 and related quantities + +- The **dimension-free** MSE bound \(D_{\text{mse}} \le (\sqrt{3}\,\pi/2)\,4^{-b}\) matches the arXiv HTML (intro + Theorem 1 region). The RFC’s **Eviox vs \(\sqrt{3\pi}/2\)** argument is directionally correct: **\(\sqrt{3}\pi/2 \approx 2.72\)** is not **\(\sqrt{3\pi}/2 \approx 1.535\)**. + +- The proof chain also introduces quantities such as \(\mathcal C(f_X,b)\) with a **\(1/d\)** factor in intermediate steps. The RFC can briefly note **\(\mathcal C\)** vs **\(D_{\text{mse}}\)** so readers see the full proof stack was considered. + +### Block decomposition and composed MSE bound + +- The **algebraic** identity partitioning \(\|\mathbf x - \hat{\mathbf x}\|^2/\|\mathbf x\|^2\) by orthogonal blocks is **correct**. + +- The step from per-block **probabilistic** guarantees to a global bound should be stated in terms of **expectations** (linearity) and assumptions on randomization, not as a purely **pointwise** weighted average unless the theorem is worst-case (it is not, as stated). + +- **Conceptual gap:** TurboQuant’s analysis uses **one** global Haar rotation and **high-\(d\)** near-independence across coordinates. 
**Independent SORF per block** with **smaller \(B\)** may weaken the “coordinates act like independent scalar sources” story even when the **marginal** after Haar in \(\mathbb R^B\) is correct. The RFC already plans empirical validation; **explicitly call out \(B\)-dependence of near-independence**. + +### Centroids and block dimension + +- Centroids must use the **\(B\)-dimensional** marginal (exponent **\((B-3)/2\)**). The RFC states this; good. + +- **Minimum block size:** Global **\(d \ge 3\)** avoids Beta singularities; state that **each block** satisfies **\(B \ge 3\)** under the chosen policy (**\(B \ge 64\)**), so the marginal is well-defined. + +### DCT discussion + +- In the “Why not DCT?” paragraph, the marginal is written with **\((d-3)/2\)**; for per-block discussion, **\((B-3)/2\)** is the relevant exponent to avoid confusion. + +--- + +## Systems and integration + +### Quantized dot product / cosine: two stored columns + +- For **column vs query** re-encoded with the **column’s** rotation and centroids, the story is clear. + +- For **two TurboQuant-encoded columns**, a fast quantized inner product requires **identical** rotation parameters (**bit-identical `mse_rotation_signs`**, same seeds/structure), not only the same **\(B\)** and centroids. The RFC should **require rotation identity** for the two-sided fast path or **mandate exact fallback**. + +### Mixed precision (f64 norms, f32 directions) + +- Generally sound; a **brief** note on numerical ordering or tiny norms avoids pedantic corner-case questions. + +### PDX layout and indexing + +- Implementers will want a **clear mapping** from logical dimension index (spanning TQ blocks) to **PDX transposed offsets**—either a formula or a short diagram. + +### Slice/take with PDX + +- Full **un-transpose to FSL** is simple but can imply **large transient cost** on small slices. Worth noting **worst-case behavior** and optional **64-row-aligned** fast paths. 
+ +### FLOP table + +- Label counts as **heuristic**; real cost is often **memory bandwidth** and constant factors in butterflies. + +### GPU / VPDPBUSD + +- **VPDPBUSD** is a **specific** mixed int8 dot-product idiom, not arbitrary uint8×uint8. Max-Lloyd centroids are **not** naturally constrained to byte-quantized linear scales; treat “linear enough for tensor cores” as a **strong** empirical hypothesis. + +--- + +## Experimental plan and datasets + +### Gaussian “pessimistic baseline” + +- For **isotropic** Gaussians, a random orthogonal transform is **distributionally neutral**, but that does not make the baseline “pessimistic” for **all** error modes; it can be **misaligned** with heavy-tailed or clustered embeddings. **Soften** wording to: theory anchor / sanity check, not a proxy for worst-case production. + +### DEEP \(d=96\) + +- Correctly noted: **no** power-of-two **\(\ge 64\)** divides 96, so the RFC’s block rule forces **padding / straggler** path. Good. + +### Popular dimensions + +- Optional: add rows for dimensions such as **2560** or **1280** if the RFC targets “common model dims” broadly. + +--- + +## Compression ratio section + +- **“30% storage improvement”** is easy to misread: the worked example is roughly **29% higher compression ratio** (4.8× → 6.2×) and about **24% fewer compressed bits per vector** for \(d=768\), \(b_{\text{mse}}=5\). **Disambiguate** ratio vs bit reduction. + +- **Shared** centroids and SORF signs: remind readers that shared cost is **amortized over \(N\)**; **small** columns are metadata-sensitive. + +--- + +## Minor editorial nits + +- Prefer **“greatest”** over **“largest”** for “power-of-two that divides \(d\)” (standard math English). + +- PQ row: “8 bits per sub-vector” is a **typical** configuration, not the definition of PQ; qualify as such. + +- “Indexing time: Zero” vs PQ training: fair as **no k-means training**, but **encode-time** work remains; soften **“zero”** to avoid pedantic pushback. 
+ +- **QJL variance scaling (“\(d/B\) times more”):** align wording with **Lemma 4**’s **exact** statement in the PDF (variance of **averaged** estimators, constants). + +--- + +## Positive highlights (worth preserving) + +- Clear **staging** (MSE-only → blocks → PDX → optional QJL). + +- Honest **SORF vs Haar** and **SORF for QJL** vs Gaussian **S**. + +- **Theorem 1 constant** clarification vs mistaken \(\sqrt{3\pi}/2\) interpretation. + +- **PDX open-source delta** (SQ8 tiling, ADSampling, zones) is valuable context. + +- **Migration / single array ID** story is clean for a greenfield encoding. + +--- + +## Suggested priority order before external expert send-out + +1. Fix **llama.cpp** link and verify **PDX** speedup sentences against [4]. +2. Tighten **block MSE** subsection (expectations, \(B\)-dependence). +3. Specify **rotation-parameter identity** (or fallback) for **two-column** quantized dot. +4. Make **[7]** and **[8]** **auditable** or soften claims. +5. Add **Flash-KMeans** analogy disclaimer, **compression ratio** disambiguation, **slice/PDX** cost note. + +## Deliverables (this review) + +| File | Purpose | +| ---- | ------- | +| `proposed/0033-block-turboquant-review-synthesis.md` | This document: consolidated findings and recommended actions. | +| `proposed/0033-block-turboquant-revised.md` | Full RFC text with proposed edits applied (does not replace `0033-block-turboquant.md`). 
| diff --git a/proposed/0033-block-turboquant-revised-composer2.md b/proposed/0033-block-turboquant-revised-composer2.md new file mode 100644 index 0000000..fd37059 --- /dev/null +++ b/proposed/0033-block-turboquant-revised-composer2.md @@ -0,0 +1,988 @@ +# Block-Decomposed TurboQuant with PDX Layout + +**Authors:** Will Manning +**Status:** Proposal (revised draft — incorporates peer-review edits; see `0033-block-turboquant-review-synthesis.md`) +**Date:** 2026-04-02 + +## Summary + +We propose evolving the [TurboQuant vector quantization encoding][current-impl] +in three stages: + +1. **MSE-only TurboQuant** (immediate): merge the current PR as an MSE-only + encoding. This is a complete, self-contained building block. +2. **Block decomposition** (next): for non-power-of-2 dimensions, split into + blocks of size B = the **greatest** power-of-2 ≥ 64 that divides d. For + power-of-2 dimensions, B = d (single block, same as current). Per-block + norms stored as internal children. +3. **PDX layout** (later): transpose codes into dimension-major order within + groups of 64 vectors for SIMD scan performance. + +QJL correction is deferred to a later stage and may ultimately be dropped. +Multiple community implementations report that MSE-only often outperforms +MSE+QJL for attention and ANN ranking in practice [8]. **Citation hygiene:** [8] +should be upgraded to pinned commits and a short results table before the RFC +is treated as establishing external “consensus.” + +[current-impl]: https://github.com/vortex-data/vortex/pull/7167 + +## Background + +### TurboQuant + +TurboQuant [1] is a lossy vector quantization algorithm for high-dimensional +embeddings. It works by: + +1. Randomly rotating a unit-norm vector so that each coordinate follows a known + marginal distribution — specifically `(1 - x²)^((d-3)/2)` on [-1, 1], a + concentrated Beta-type marginal on coordinates (see [1]; lemma/section + numbering: verify against the ICLR 2026 / final PDF). +2. 
Applying an MSE-optimal scalar quantizer (Max-Lloyd centroids) independently + to each coordinate. +3. Optionally adding a 1-bit QJL (Quantized Johnson-Lindenstrauss) correction + on the residual for unbiased inner product estimation (see the unbiased + **TurboQuant_prod** result in [1]; verify theorem number in the proceedings + PDF vs arXiv). + +The paper prescribes a full random orthogonal rotation (QR decomposition of a +matrix with i.i.d. N(0,1) entries, yielding a Haar-uniform orthogonal matrix) +for the MSE stage — O(d²) storage and O(d²) per-vector. For the QJL stage, the +paper uses a random Gaussian projection matrix S with i.i.d. N(0,1) entries (not +an orthogonal rotation); this distinction matters for the unbiasedness proof. + +**Comparison to Product Quantization.** TurboQuant's block decomposition (Stage +2 of this RFC) is structurally similar to Product Quantization (PQ) [9]: both +partition a vector into sub-vectors and quantize each independently. The key +differences are: + +| | TurboQuant | PQ | +| ---------------------- | --------------------------------------------------------------- | -------------------------------------------------------- | +| Quantization type | Scalar (per-coordinate, after rotation) | Vector (per-sub-vector, learned codebook) | +| Codebook | Analytically derived from Beta distribution; **data-oblivious** | Learned via k-means on training data; **data-dependent** | +| Rotation | Random orthogonal within each sub-vector | Typically none (OPQ [10] adds a learned rotation) | +| Theoretical guarantees | Provable MSE bound (Theorem 1 [1]) | Empirical quality only | +| Indexing / training | No k-means or learned codebook training (centroids from theory) | Requires training pass over data for codebooks | +| Bits per sub-vector | Scalar: b bits per coordinate | Vector: common choice e.g. 
8 bits × m subquantizers (not universal) |
+
+TurboQuant trades PQ's flexibility (data-dependent codebooks can exploit
+structure) for data-obliviousness (no training, provable bounds, no offline
+index-training phase). Encode-time work (rotation + quantization) still
+applies. Whether TurboQuant's analytically derived centroids can match or
+exceed PQ's learned codebooks is an empirical question even for near-uniform
+embeddings; PQ or OPQ may still win when a learned vector codebook can
+exploit dataset-specific structure.
+
+### Current Vortex implementation
+
+Our [current implementation][current-impl] (Rust, in the `vortex-tensor` crate)
+implements TurboQuant as a Vortex array encoding that compresses
+`FixedSizeList` arrays — the storage format of `Vector` and
+`FixedShapeTensor` extension types. Key design choices and characteristics:
+
+**Rotation.** Instead of the paper's O(d²) QR rotation, we use a 3-round
+Structured Orthogonal Random Features (SORF) transform `HD₃·HD₂·HD₁` [5] for
+both the MSE rotation and the QJL projection, giving O(d) storage (3d sign bits,
+bitpacked) and O(d log d) per-vector. The rotation signs are stored as a
+bitpacked child array rather than recomputed from a seed at decode time. The
+3-round SORF was introduced for kernel approximation [5] and approximates a
+random orthogonal matrix. It is distinct from the single-round SRHT (`R·H·D`)
+analyzed by Tropp [3] and the FJLT (`P·H·D`) of Ailon-Chazelle [2], both of
+which are dimensionality-reducing projections rather than rotation
+approximations.
+
+**Centroids.** Max-Lloyd centroids are computed via numerical integration
+(trapezoid rule, 1000 points per interval) of the marginal Beta distribution at
+the padded dimension, using the `HalfIntExponent` type for exact
+integer/half-integer exponent arithmetic. Centroids are cached in a global
+`DashMap` keyed by `(dimension, bit_width)` and stored as a shared
+`PrimitiveArray` child.
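To make the centroid construction concrete, the procedure described above can be sketched as a plain Lloyd-Max iteration on a discretized version of the marginal `(1 - x²)^((d-3)/2)`. This is an illustrative Python sketch only — the Rust implementation uses `HalfIntExponent` arithmetic with a trapezoid rule and caches results in a `DashMap`; none of the names below come from the codebase:

```python
def lloyd_max_centroids(d: int, bits: int, grid: int = 4000, iters: int = 200):
    """Lloyd-Max centroids for the marginal f(x) ∝ (1 - x²)^((d-3)/2) on [-1, 1].

    Discretized-density sketch: alternate (a) conditional means within the
    current decision cells and (b) boundaries at centroid midpoints.
    """
    k = 1 << bits
    xs = [-1.0 + 2.0 * (i + 0.5) / grid for i in range(grid)]
    w = [(1.0 - x * x) ** ((d - 3) / 2) for x in xs]    # unnormalized density
    c = [-1.0 + 2.0 * (j + 0.5) / k for j in range(k)]  # even initialization
    for _ in range(iters):
        bounds = [(c[j] + c[j + 1]) / 2 for j in range(k - 1)]
        num, den = [0.0] * k, [0.0] * k
        cell = 0
        for x, wx in zip(xs, w):
            while cell < k - 1 and x > bounds[cell]:
                cell += 1
            num[cell] += wx * x
            den[cell] += wx
        c = [num[j] / den[j] if den[j] > 0.0 else c[j] for j in range(k)]
    return c
```

Sanity checks worth keeping in tests of any such routine: the centroids come out sorted, symmetric about zero, and (because the marginal concentrates like a Gaussian with variance 1/d) well inside (-1, 1) for realistic dimensions.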
+ +**Array structure.** The `TurboQuantArray` stores up to 7 child slots: codes +(`FixedSizeListArray`, one per vector, list_size = padded_dim), norms +(`PrimitiveArray`), centroids (shared), MSE rotation signs (shared, +bitpacked), and optionally 3 QJL children (signs, residual norms, QJL rotation +signs). Codes are stored as u8 centroid indices; the cascade compressor +(BitPacked encoding) handles packing to the actual bit width on disk. + +**Compute pushdowns.** Slice and take propagate to per-row children (codes, +norms) while sharing rotation signs and centroids. Quantized cosine similarity +and dot product operate directly on codes and centroids without decompression. +L2 norm returns the stored norm directly (O(1) readthrough). + +**Compression scheme.** `TurboQuantScheme` implements the `Scheme` trait for the +BtrBlocks cascading compressor. It matches `Vector` and `FixedShapeTensor` +extension arrays with non-nullable float elements and dimension ≥ 3, using the +default config (5-bit QJL = 4-bit MSE + 1-bit QJL, seed 42). + +**Input handling.** All float types (f16, f32, f64) are converted to f32 before +quantization. Per-vector L2 norms are computed and stored as f32. Non-power-of-2 +dimensions are zero-padded to the next power of 2 for SORF compatibility. The +minimum dimension is 3 (d=2 causes a singularity in the Beta distribution +exponent). + +### Reference implementation bugs + +The Eviox corrections study [7] identified six material bugs in the paper's +reference Python implementation. **Readers:** [7] should include a stable URL, +DOI, or public artifact; until then, treat detailed Eviox-only claims as +internally verified reproduction notes. The most critical is a mathematical error in +the QJL scale factor: the reference code used `√(π/(2d))` instead of +`√(π/2)/d` (Definition 1 in [1]), differing by a factor of √d (≈11× at d=128). 
+Our [current implementation][current-impl] uses the correct formula +(`sqrt(FRAC_PI_2) / padded_dim` in Rust), so this bug does **not** affect us. + +Other notable Eviox findings: (a) the reference code recomputes codebooks at +every instantiation (we cache in a `DashMap`); (b) the reference uses float16 +for codebook distance computation, causing misassignment at small centroid +spacings (we cast to f32 before quantization). See [7] for the full list. + +### Theorem 1 constant + +There is an ambiguity in the paper's notation for the MSE bound constant. The +formal proof gives `(√3 · π / 2) · 4^{-b}` where the constant √3·π/2 ≈ 2.72. +The Eviox report [7] interprets the notation as `√(3π)/2 ≈ 1.535`, but this is +incorrect: the measured distortion values from the paper (b=2: 0.117, b=3: 0.03) +exceed the putative `√(3π)/2` bound (b=2: 0.096, b=3: 0.024), confirming that +2.72 is the correct constant. The paper's "explicit values" (0.36, 0.117, 0.03, +0.009) are the actual computed distortion of the optimal quantizer, not the +bound itself — they are well below the 2.72/4^b bound. + +### Community findings on QJL + +Several independent TurboQuant implementations report a recurring practical +pattern: **MSE-only often outperforms MSE+QJL for attention and ANN-style +ranking** under fixed bit budgets. A plausible mechanism is variance–bias +tradeoff: QJL reduces bias but adds variance, and softmax attention (and +cosine/L2 ranking) can amplify variance. At the same total bit budget, allocating +all bits to MSE (more centroids, lower quantization variance) sometimes beats +splitting between MSE + QJL (fewer MSE bits + 1-bit correction). See [8] and the +summary’s note on auditable citations. + +This pattern supports making **MSE-only the default** for our columnar storage +use case (ANN search, cosine similarity ranking), pending our own benchmarks. + +### Current limitations + +The SORF requires power-of-2 input dimension. 
For non-power-of-2 dimensions +(e.g., 768-d embeddings), the input is zero-padded to the next power of 2 +(1024). This causes: + +- **33% storage overhead** for 768-d vectors: 1024 codes stored vs. 768 useful + (equivalently, 25% of stored codes are wasted on zero-padded dimensions). +- **No scan-optimized layout**: row-major code storage prevents SIMD-over-vectors + distance computation. + +### PDX + +PDX [4] is a data layout for vector similarity search. The paper (SIGMOD '25) +describes a dimension-major layout within fixed-size blocks of 64 vectors, +enabling the compiler to auto-vectorize the inner distance loop over vectors +rather than dimensions. The PDX abstract reports **~40%** faster end-to-end +search than SIMD-optimized **horizontal** storage in that baseline comparison +(order **1.4×**), not a blanket “2×” headline. **Separately**, combining PDX +with dimension-pruning (ADSampling, BSA) restores **2–7×** benefits over +SIMD-optimized linear scans in their reported settings [4]. The block size of +64 is empirically strong across AVX-512, AVX2, and NEON architectures [4]. + +**PDX implementation evolution.** The [open-source implementation][pdx-impl] +has evolved beyond the paper in several ways relevant to this RFC: + +- **8-bit scalar quantization** (`IndexPDXIVFTreeSQ8`): Maps floats to 0-255 via + linear min-max scaling. The int8 layout differs from float32: dimensions are + packed in groups of 4 ("4 dims × 16 vecs") to leverage hardware dot-product + instructions (VPDPBUSD on x86, UDOT/SDOT on ARM) that process 4 byte pairs + per operation. This is a different tiling than the paper's "1 dim × 64 vecs." +- **ADSampling with random rotation**: The pruner applies a random orthogonal + rotation (QR of Gaussian, or DCT when FFTW is available) to the entire + collection as a preprocessing step. This makes coordinates approximately + independent, enabling dimension-by-dimension hypothesis testing for early + pruning. 
The rotation serves a similar purpose to TurboQuant's rotation — + making the coordinate distribution known — but for pruning rather than + quantization. +- **Dimension zones**: Consecutive dimensions are grouped into zones; at query + time, zones are ranked by "distance-to-means" and the most discriminative + zones are scanned first, enabling faster pruning. +- **Future: 1-bit vectors** are mentioned as planned. + +**Implications for our design.** The PDX paper's float32 layout ("1 dim × 64 +vecs") maps cleanly to our quantized-code scan kernel, where the inner loop +gathers from a centroid-product distance table over 64 vectors. However, if we +pursue direct int8 arithmetic (b_mse=8 with linear centroids, see GPU section), +the "4 dims × 16 vecs" int8 layout from the PDX implementation may be more +appropriate, as it enables hardware dot-product instructions. + +Additionally, ADSampling's dimension-pruning approach is complementary to +TurboQuant's block structure: when scanning with block decomposition, the pruner +could skip entire TQ blocks (B dimensions at a time) if the partial distance +already exceeds the candidate threshold. This combines the storage efficiency of +quantization with the computational savings of early termination. + +[pdx-impl]: https://github.com/cwida/PDX + +## Proposal + +### Block size strategy + +For each dimension d, choose B = the **greatest** power-of-2 ≥ 64 that evenly +divides d. This eliminates stragglers entirely for common embedding dimensions. 
+Each block has **B ≥ 64**, comfortably above the **B ≥ 3** minimum needed for
+the block-level Beta marginal (exponent **(B−3)/2**) to be well-defined (global
+**d ≥ 3** remains required for the single-block padded path):
+
+| Dimension d | Block size B | Blocks k | Notes |
+| ----------- | ------------ | -------- | --------------------------- |
+| 512 | 512 | 1 | Single block (= current TQ) |
+| 768 | 256 | 3 | Greatest dividing power-of-2 |
+| 1024 | 1024 | 1 | Single block |
+| 1536 | 512 | 3 | |
+| 2048 | 2048 | 1 | Single block |
+| 3072 | 1024 | 3 | |
+| 4096 | 4096 | 1 | Single block |
+
+**Key observations:**
+
+- **Power-of-2 dimensions** (512, 1024, 2048, 4096) use B = d — a single block,
+  identical to the current implementation except with PDX underneath (Stage 3).
+  No block decomposition overhead, no per-block norms. These dimensions are
+  already well-served by the current design.
+- **Non-power-of-2 dimensions** (768, 1536, 3072) decompose into k=3 blocks at
+  B=256 or B=512, with zero padding waste. Each block has its own SORF rotation;
+  all blocks share a single centroid set.
+- **Stragglers are eliminated** for all common embedding dimensions. Dimensions
+  that are not multiples of 64 (e.g., 100, 200) would need straggler handling,
+  but these are rare in practice for modern model architectures.
+- **The SORF approximation at B=256+ is expected to be adequate**: 3 rounds at
+  B=256 provide 24 butterfly stages, and at B=512 provide 27 — both comparable
+  to the current B=1024 (30 stages). This needs empirical validation; see
+  Experimental plan.
+
+### Stage 1: MSE-only TurboQuant (immediate — split from current PR)
+
+Split the [current PR][current-impl] to extract and merge the MSE-only subset.
+The QJL code can be preserved on a separate branch for Phase 4.
+
+**Changes vs. 
current PR:** + +| Aspect | Current PR | Stage 1 | +| -------------- | ------------------------------------------- | ----------------------------------------------------- | +| QJL support | Full (encode, decode, QJL slots, QJL tests) | **Removed** | +| Array slots | 7 (4 MSE + 3 QJL) | **4** (codes, norms, centroids, rotation_signs) | +| Scheme default | 5-bit QJL (4-bit MSE + 1-bit QJL) | **5-bit MSE-only** (32 centroids) | +| Norms dtype | Always f32 | **Same-or-wider**: f64 for f64 input, f32 for f32/f16 | +| Metadata | `has_qjl: bool` | **Removed** (always MSE-only) | + +**Unchanged from current PR:** SORF rotation, Max-Lloyd centroids, +zero-padding for non-power-of-2, slice/take/scalar_at pushdowns, quantized +cosine similarity and dot product, compression scheme integration, minimum dim=3. + +**Added to metadata (for forward compat):** `block_size: u32` (always = +padded_dim), `num_blocks: u32` (always = 1). These fields are inert in Stage 1 +but enable Stage 2 decoders to read Stage 1 files. (PDX is handled via the +codes child type, not a metadata flag — see Stage 3.) + +This is a complete, useful encoding for all dimensions. Power-of-2 dimensions +have zero padding waste; non-power-of-2 dimensions have the padding overhead +described above. + +### Stage 2: Block decomposition + +For non-power-of-2 dimensions, split into blocks of size B (as determined by the +table above). Each full block gets an independent B-dim SORF rotation. + +**Changes vs. 
Stage 1:** + +| Aspect | Stage 1 | Stage 2 | +| --------------------- | ------------------------------------ | ---------------------------------------------------------------------------- | +| Block count | k = 1 (single block at padded_dim) | **k = d/B** (multiple blocks, no padding) | +| SORF dimension | padded_dim (e.g., 1024 for d=768) | **B** (e.g., 256 for d=768) | +| Rotation signs | Single set, len = 3 × padded_dim | **k sets**, len = k × 3 × B | +| Centroids | Computed for padded_dim distribution | **Computed for B-dim distribution** (different codebook!) | +| Norms child | `PrimitiveArray`, 1 per vector | **`PrimitiveArray` (k=1) or `FixedSizeListArray` (k>1)**, same dtype F | +| Codes list_size | padded_dim | **k × B** (= d for no-straggler dims) | +| Scheme compress() | Pad → single SORF → quantize | **Choose B → split → per-block normalize/rotate/quantize** | +| Quantized dot product | Single sum over padded_dim centroids | **Per-block weighted sum** (Σ_k norm_a_k · norm_b_k · unit_dot_k) | +| L2 norm readthrough | O(1) — return stored norm | **O(k)** — compute √(Σ_k norm_k²) | +| Zero-padding waste | Up to 33% (768→1024) | **Zero** for common dims | + +**Unchanged from Stage 1:** SORF construction (3-round HD), Max-Lloyd algorithm, +f32 internal quantization, slice/take semantics (per-row data sliced, shared +data cloned), bitpacked rotation sign storage, compression scheme trait. + +**For power-of-2 dimensions**: B = d, k = 1. The encoding produces an identical +wire format to Stage 1 (single norm, single SORF, single codes block). A Stage 2 +encoder writing k=1 data is fully backward-compatible with Stage 1 decoders. + +**Key design properties:** + +- **Self-contained.** The TurboQuant array handles block splitting, per-block + normalization, rotation, and quantization internally. No parent cooperation + is needed. +- **One shared centroid set** for all blocks at the same B-dim distribution. 
+- **Per-block SORF rotation signs.** Each block's SORF is independent (different + seed). Signs are 3 × B bits per block. + +#### Norm architecture + +Per-block norms are stored as an **internal child** of the TurboQuant array: + +- For k = 1 (power-of-2 dims): `PrimitiveArray` with len = num_rows + (identical to Stage 1's single-norm layout). +- For k > 1: `FixedSizeListArray` with list_size = k, len = num_rows. + +The norm dtype `F` matches or widens the input element type: + +| Input dtype | Norm dtype | Rationale | +| ----------- | ---------- | ---------------------------------------------- | +| f16 | f32 | f16 has insufficient range/precision for norms | +| f32 | f32 | Same type | +| f64 | f64 | Preserve full precision | + +Norms are stored as plain child arrays; the cascading compressor handles +secondary encoding (ALP, Pco, etc.). + +Note: centroids and quantization always operate in f32 internally (the +[current implementation][current-impl] converts all input to f32 before +quantization). For f64 input, decode produces f32 unit-direction reconstructions +scaled by f64 norms — a mixed-precision multiply that preserves norm precision; +use numerically stable ordering (e.g. `norm * direction`) and the existing +zero-block fast path for subnormal edge cases. + +#### Zero-norm sub-vectors + +When splitting a vector into B-dim blocks, some blocks may have zero norm. The +encoding handles ‖xₖ‖ = 0 explicitly: skip rotation and quantization, store +norm = 0, decode as all zeros. + +#### Theoretical MSE bound + +The paper's MSE bound (Theorem 1 in [1]) is stated for **unit** \(\mathbf x \in +S^{d-1}\) with \(D_{\text{mse}} := \mathbb{E}\|\mathbf x - \hat{\mathbf x}\|_2^2\) +(and equals \(\mathbb{E}[\|\mathbf x - \hat{\mathbf x}\|^2/\|\mathbf x\|^2]\) +in that case): + +``` +E[‖x - x̂‖²] ≤ (√3 · π / 2) / 4^b ≈ 2.72 / 4^b (x unit norm; same as normalized MSE) +``` + +The proof chain also uses intermediate quantities (e.g. 
\(\mathcal C(f_X,b)\)) +that carry a **1/d** factor in some steps; the headline **\(D_{\text{mse}}\)** +bound above is the dimension-free form quoted in the abstract [1]. + +**Crucially, Theorem 1 is proved for true random orthogonal matrices (QR of +Gaussian), not SORF.** Our SORF is an approximation. The bound holds exactly +only with a true random orthogonal rotation or with empirical SORF validation +(see Experimental plan). + +**Blockwise composition.** For an orthogonal partition into blocks, +\(\|\mathbf x - \hat{\mathbf x}\|^2/\|\mathbf x\|^2 = \sum_k +(\|\mathbf x_k\|^2/\|\mathbf x\|^2)\, +(\|\mathbf x_k - \hat{\mathbf x}_k\|^2/\|\mathbf x_k\|^2)\) holds **exactly** +as algebra. To lift Theorem 1’s **probabilistic** guarantee to the whole +vector, state the conclusion in terms of **expectations** and the randomness +model (e.g. independent rotations per block), not as a pointwise inequality +unless a worst-case theorem is invoked. **Independence / near-independence:** +TurboQuant’s original analysis leverages **high-\(d\)** near-independence of +coordinates after **one** global rotation; with **smaller \(B\)**, coordinate +dependence after rotation may strengthen even when marginals match—this is an +additional reason the experimental plan compares block sizes and SORF rounds. + +The actual MSE may depend on block dimension B: at larger B the coordinate +distribution is more concentrated (variance ~1/B), giving the Max-Lloyd +quantizer more to exploit. See Experimental plan. + +**SORF approximation.** The 3-round SORF `HD₃·HD₂·HD₁` [5] provides log₂(B) +butterfly stages per round × 3 rounds = 3·log₂(B) total (18 at B=64, 24 at +B=256, 27 at B=512). +This is a rough heuristic for mixing quality — [5] does not analyze convergence +rate as a function of rounds × dimension. Empirical validation is needed. + +**Fallback: dense rotation.** If SORF proves insufficient at the chosen B, use a +B × B random orthogonal matrix (QR of Gaussian). 
Storage at B=256: 256 KB per +block. For d=768 with k=3: 768 KB total. Amortizes for large columns (100K+ +vectors). Each block must have an **independent** rotation matrix. + +**Why not DCT?** The PDX implementation [pdx-impl] uses DCT (via FFTW) as a fast +rotation for ADSampling. DCT is O(B log B) and invertible, but it is a **fixed +structured transform**, not a random rotation — it does not produce the Beta +marginal distribution `(1-x²)^((B-3)/2)` in block dimension **B** that TurboQuant's +Max-Lloyd centroids are optimized for. ADSampling only needs approximate coordinate independence +(for hypothesis-testing pruning), so DCT suffices there. TurboQuant needs a +specific known marginal distribution, so only random orthogonal rotations (QR or +SORF) are suitable. + +**Shared rotation with ADSampling.** Both TurboQuant and ADSampling apply a +random orthogonal rotation to make coordinates independent. If we integrate +ADSampling-style dimension pruning (see Stage 3), the same rotation could serve +both purposes: producing the Beta distribution for quantization AND enabling +hypothesis-testing for early pruning. This would avoid rotating the data twice. +Note that the query must also be rotated at query time with the same rotation +matrix (stored as a shared child); ADSampling already requires this. + +#### Quantized-domain operations + +All quantized operations read per-block norms from the internal child array: + +- **L2 distance**: `‖a-b‖² = Σ_k ‖aₖ‖² + Σ_k ‖bₖ‖² - 2·Σ_k ‖aₖ‖·‖bₖ‖· +unit_dotₖ`. Primary ANN metric; reuses per-block dot product and norms. +- **Dot product**: ` ≈ Σ_k ‖aₖ‖·‖bₖ‖ · Σ_j centroids[code_aₖ[j]] · +centroids[code_bₖ[j]]`. +- **Cosine similarity**: `cos(a,b) ≈ dot(a,b) / (‖a‖·‖b‖)` where + `‖a‖ = √(Σ_k ‖aₖ‖²)`. +- **L2 norm**: `√(Σ_k ‖xₖ‖²)`. O(k) per vector — a regression from the + current O(1) single-norm readthrough, but modest. 
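
A minimal sketch of the per-block reductions above (hypothetical free
functions; the real kernels recover `unit_dots` from the codes via a centroid
product table, not from floats):

```rust
/// Sketch of the per-block quantized-domain operations.
/// `unit_dots[k]` is the unit-norm dot product of block k of `a` and `b`.
fn dot(norms_a: &[f32], norms_b: &[f32], unit_dots: &[f32]) -> f32 {
    norms_a
        .iter()
        .zip(norms_b)
        .zip(unit_dots)
        .map(|((na, nb), ud)| na * nb * ud)
        .sum()
}

fn l2_norm(norms: &[f32]) -> f32 {
    norms.iter().map(|n| n * n).sum::<f32>().sqrt()
}

fn l2_distance_sq(norms_a: &[f32], norms_b: &[f32], unit_dots: &[f32]) -> f32 {
    let na2: f32 = norms_a.iter().map(|n| n * n).sum();
    let nb2: f32 = norms_b.iter().map(|n| n * n).sum();
    na2 + nb2 - 2.0 * dot(norms_a, norms_b, unit_dots)
}

fn cosine(norms_a: &[f32], norms_b: &[f32], unit_dots: &[f32]) -> f32 {
    dot(norms_a, norms_b, unit_dots) / (l2_norm(norms_a) * l2_norm(norms_b))
}

fn main() {
    // k = 2 blocks; b = 2a, so both unit dots are 1 and cosine is 1.
    let a: [f32; 2] = [3.0, 4.0];
    let b: [f32; 2] = [6.0, 8.0];
    let ud = [1.0, 1.0];
    assert!((dot(&a, &b, &ud) - 50.0).abs() < 1e-6); // 3·6 + 4·8
    assert!((l2_norm(&a) - 5.0).abs() < 1e-6); // √(9 + 16)
    assert!((cosine(&a, &b, &ud) - 1.0).abs() < 1e-6);
    assert!((l2_distance_sq(&a, &b, &ud) - 25.0).abs() < 1e-4);
}
```

Note that `l2_norm` is the O(k) readthrough: the full-vector norm is never
stored, only reconstructed from the per-block norms.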
+ +#### Encoding algorithm + +``` +Input: x ∈ ℝ^d, b_mse bits per coordinate, block_size B +k = d / B (exact division, no straggler for chosen B) +num_centroids = 2^b_mse + +# Block split and normalize +for i in 0..k: + xᵢ = x[i*B .. (i+1)*B] + nᵢ = ‖xᵢ‖ + if nᵢ > 0: + ûᵢ = xᵢ / nᵢ + else: + ûᵢ = zeros(B) + +# MSE stage (per block, SORF rotation) +for i in 0..k: + if nᵢ > 0: + rᵢ = SORFᵢ(ûᵢ) + cᵢ[j] = nearest_centroid(rᵢ[j]) + else: + cᵢ[j] = 0 + +Store (all as internal children): + codes (k × B per vector), norms (k per vector), + centroids (2^b_mse, shared), SORF signs (k × 3 × B, shared) +``` + +#### Decoding algorithm + +``` +for i in 0..k: + r̂ᵢ[j] = centroids[cᵢ[j]] + ûᵢ = SORF⁻¹ᵢ(r̂ᵢ) + x̂ᵢ = nᵢ × ûᵢ (nᵢ read from internal norms child) +x̃ = concat(x̂₀, ..., x̂ₖ₋₁) +``` + +### Stage 3: PDX dimension-major layout + +Introduce a new `PDXArray` encoding type that wraps any `FixedSizeListArray` +with a dimension-major layout within groups of 64 vectors [4]. PDXArray is +**not TurboQuant-specific** — it is a general-purpose layout optimization for +any FixedSizeList of scalar elements (raw float vectors, scalar-quantized +vectors, TurboQuant codes, etc.). + +**Changes vs. Stage 2:** + +| Aspect | Stage 2 | Stage 3 | +| ---------------- | ------------------------------------------------ | ------------------------------------------------------------------------------- | +| Codes child type | `FixedSizeListArray` | **`PDXArray`** (wraps FSL with transposed layout) | +| Codes detection | N/A (codes always FSL) | **TQ checks child type**: FSL → row-major decode, PDXArray → un-transpose first | +| Distance kernel | Per-vector loop with per-element centroid lookup | **SIMD-friendly 64-vector inner loop with distance-table lookup** | +| Decode path | Direct inverse SORF per vector | **PDXArray.to_fsl() first**, then inverse SORF | + +**Unchanged from Stage 2:** Block size B, centroid computation, norm storage, +SORF rotation, all encoding logic. 
The encode path produces row-major codes +(FSL), then the compressor wraps them in a PDXArray; the decode path converts +PDXArray back to FSL then decodes. + +**PDXArray design:** + +``` +PDXArray (general-purpose dimension-major layout for FixedSizeList) +├── metadata: { list_size, chunk_size (= 64) } +├── elements: PrimitiveArray # transposed: 64 values per dim, contiguous +├── validity: ... # same as FSL validity +``` + +- `PDXArray::try_new(fsl)` — transposes a FixedSizeListArray into PDX layout +- `PDXArray::to_fsl()` — un-transposes back to row-major FSL (for decode, + scalar_at, or non-aligned slice/take) +- `PDXArray::elements_for_dim(dim, chunk)` — O(1) access to a contiguous slice + of 64 values for one dimension within one chunk. **Indexing:** logical code + index for global dimension \(g \in [0, d)\) maps to TurboQuant block + \(t = \lfloor g/B \rfloor\), within-block dimension \(j = g \bmod B\); the + PDX transpose lays out dimension-major runs of 64 values per **global** + dimension in order \(g = 0, \ldots, d-1\) (spanning TQ blocks contiguously in + code space). +- Slice/take: un-transpose to FSL (simplest). Preserving PDX layout is possible + only for 64-vector-aligned ranges. **Cost note:** naive un-transpose can be + \(O(\text{chunk size} \times d)\) per slice; document worst-case behavior and + consider 64-row-aligned fast paths for hot scans. +- The cascade compressor treats PDXArray as a valid encoding of FSL-typed data. + +**Benefits of PDXArray as a separate type:** + +- PDX logic tested and maintained independently of TurboQuant +- Other encodings (raw float vectors, scalar quantization, future encodings) + get PDX scan performance for free +- TurboQuant doesn't need an `is_pdx` metadata flag — it checks its codes + child's type at runtime +- The distance kernel operates on PDXArray's dimension-contiguous slices + +Within each 64-vector chunk, codes are stored dimension-major: + +``` +TQ block 0, dim 0: [v0 v1 v2 ... 
v63]
+TQ block 0, dim 1: [v0 v1 v2 ... v63]
+...
+TQ block 0, dim (B - 1): [v0 v1 v2 ... v63]
+TQ block 1, dim 0: [v0 v1 v2 ... v63]
+...
+```
+
+The inner SIMD loop (64 vectors) has no inter-vector dependencies. TQ block
+boundaries only affect where norm weighting occurs — they don't affect the
+transpose.
+
+**Quantized distance kernel (dot product):**
+
+```rust
+let dist_table = precompute_product_table(&centroids);
+// At b_mse=4: 16×16 = 256 floats = 1KB, fits in L1
+
+let mut distances = [0.0f32; 64];
+let mut unit_dots = [0.0f32; 64];
+let mut offset = 0;
+
+for tq_block in 0..k {
+    for dim in 0..B {
+        let qd = query_codes[tq_block * B + dim];
+        let row = &dist_table[qd as usize];
+        for v in 0..64 { // SIMD-friendly: no inter-vector deps
+            unit_dots[v] += row[codes[offset] as usize];
+            offset += 1;
+        }
+    }
+    // Weight per-block unit-norm dot product by both vectors' block norms
+    for v in 0..64 {
+        distances[v] += query_norms[tq_block] * data_norms[v][tq_block]
+            * unit_dots[v];
+        unit_dots[v] = 0.0; // reset for next TQ block
+    }
+}
+```
+
+**Int8 layout variant.** The PDX implementation [pdx-impl] uses a different
+tiling for int8 data: "4 dims × 16 vecs" to leverage hardware dot-product
+instructions (VPDPBUSD on x86, UDOT/SDOT on ARM). For TurboQuant codes at
+b_mse ≤ 8, codes are uint8 **centroid indices**, not quantized coordinate
+values, so these instructions do not apply directly — we need the
+distance-table-lookup path shown above. However, at b_mse=8 with high B, the
+Max-Lloyd centroids are near-uniformly spaced (see GPU section), potentially
+enabling direct hardware dot-product on the codes. Whether this requires a
+separate linear quantization mode or works with the existing Max-Lloyd
+centroids is an empirical question. The "4 dims × 16 vecs" layout would be a
+Stage 3 optimization to evaluate alongside the "1 dim × 64 vecs" float-style
+layout. 
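
The "1 dim × 64 vecs" transpose itself is a simple permutation. A standalone
sketch (hypothetical helpers, not the proposed `PDXArray` API; assumes the
vector count is a multiple of the chunk size):

```rust
const CHUNK: usize = 64;

/// Transpose row-major codes (num_vecs × d) into the dimension-major
/// "1 dim × 64 vecs" layout, one 64-vector chunk at a time.
fn to_pdx(codes: &[u8], d: usize) -> Vec<u8> {
    let num_vecs = codes.len() / d;
    let mut out = vec![0u8; codes.len()];
    for chunk in 0..num_vecs / CHUNK {
        let base = chunk * CHUNK * d;
        for dim in 0..d {
            for v in 0..CHUNK {
                // Row-major: vector-then-dim. PDX: dim-then-vector.
                out[base + dim * CHUNK + v] = codes[base + v * d + dim];
            }
        }
    }
    out
}

/// O(1) access to the 64 contiguous values of one dimension in one chunk.
fn elements_for_dim(pdx: &[u8], d: usize, chunk: usize, dim: usize) -> &[u8] {
    let start = chunk * CHUNK * d + dim * CHUNK;
    &pdx[start..start + CHUNK]
}

fn main() {
    let d = 3;
    // 64 vectors: vector v has codes [v, v, v].
    let codes: Vec<u8> = (0..64u8).flat_map(|v| [v, v, v]).collect();
    let pdx = to_pdx(&codes, d);
    // After the transpose, dimension 1 of chunk 0 is the run [0, 1, ..., 63].
    assert_eq!(
        elements_for_dim(&pdx, d, 0, 1),
        (0..64u8).collect::<Vec<_>>().as_slice()
    );
}
```

`to_fsl()` is the inverse permutation, which is why non-aligned slice/take can
simply round-trip through row-major.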
+ +**ADSampling integration.** The PDX dimension-pruning approach (ADSampling [4]) +is complementary to TurboQuant's block structure. During a scan, the pruner +could evaluate partial distances after each TQ block (B dimensions) and skip +remaining blocks if the partial L2 distance already exceeds the candidate +threshold. This requires the per-block norm weighting to happen at block +boundaries (as shown in the kernel above), which our design already provides. + +**Open design questions:** + +- Should PDXArray live in `vortex-array` (general infrastructure) or + `vortex-tensor` (vector-specific)? +- Should the cascade compressor automatically PDX-transpose FSL children when + it detects a scan-heavy workload, or should PDX be opt-in? +- Should we support the "4 dims × 16 vecs" uint8 layout variant (for hardware + dot-product) alongside the "1 dim × 64 vecs" float-style layout? + +### QJL correction (deferred — experimental) + +Based on community findings [8], QJL is deferred to after the MSE stages are +validated. + +**Changes vs. 
MSE-only (if pursued):** + +| Aspect | MSE-only | MSE + QJL | +| ---------------------- | -------------------------------- | --------------------------------------------------------------- | +| Bit budget | All b bits → MSE (2^b centroids) | b-1 bits MSE + 1 bit QJL (2^(b-1) centroids) | +| Inner product estimate | Biased (MSE quantization noise) | Unbiased (QJL correction; see **TurboQuant_prod** in [1]) | +| Additional children | None | QJL signs, QJL residual norms, QJL projection params | +| Encode cost | SORF only | SORF + QJL projection (O(B²) for Gaussian, O(B log B) for SORF) | +| Decode cost | Inverse SORF only | Inverse SORF + QJL inverse projection | + +If pursued, four strategies should be compared: + +| Strategy | Theoretical | Speed | Storage | +| -------------------- | --------------------- | ---------------- | --------------- | +| Per-block Gaussian | Correct (Lemma 4 [1]) | O(B²)/block | k×B²×4 bytes | +| Per-block SORF | Approximate | O(B log B)/block | k×3×B bits | +| Full-dim padded SORF | Approximate | O(d log d) total | 3×padded_d bits | +| MSE-only (no QJL) | N/A | 0 | None | + +The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] is proved specifically +for Gaussian. SORF for QJL is an additional approximation (the +[current implementation][current-impl] uses SORF for QJL). Per-block QJL’s +variance scaling vs full-dimension QJL is stated in **Lemma 4 [1]**—quote the +lemma’s **exact** variance expression when making quantitative comparisons (not +just “\(d/B\) times” in prose). + +The community consensus is that MSE-only likely wins for ANN ranking at all +bit widths, so QJL may not be worth the complexity. 
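
As a sanity check on the storage column, the shared-state arithmetic at d=768,
B=256, k=3 (padded_d=1024 for the full-dim variant) works out as follows; the
formulas come from the table above, and the code is illustrative only:

```rust
fn main() {
    let (b, k) = (256usize, 3usize);

    // Per-block Gaussian QJL: k dense B×B f32 matrices.
    let gaussian_bytes = k * b * b * 4;
    assert_eq!(gaussian_bytes, 786_432); // 768 KB shared

    // Per-block SORF QJL: k sets of 3 rounds of B sign bits.
    let sorf_bits = k * 3 * b;
    assert_eq!(sorf_bits, 2_304); // 288 bytes shared

    // Full-dim padded SORF QJL: 3 rounds of padded_d sign bits (768 → 1024).
    let padded_d = 1024usize;
    assert_eq!(3 * padded_d, 3_072); // 384 bytes shared
}
```

The three orders of magnitude between the Gaussian and SORF variants are the
main storage argument for SORF-based QJL, if QJL is pursued at all.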
+ +## Array layout + +### Stage 1 (MSE-only single block) + +``` +TurboQuantArray +├── metadata: { dimension, b_mse, block_size (= padded_dim), +│ num_blocks (= 1) } +│ +│ # Per-row children +├── codes: FixedSizeListArray # list_size = padded_dim +│ (or PDXArray after Stage 3) +├── norms: PrimitiveArray # len = num_rows (F = f64 for f64, f32 otherwise) +│ +│ # Shared children +├── centroids: PrimitiveArray # len = 2^b_mse +├── mse_rotation_signs: PrimitiveArray # len = 3 × padded_dim (bitpacked) +``` + +Same structure as the [current PR][current-impl] minus the 3 QJL slots, plus +the forward-compatible metadata fields and dtype-matching norms. The codes child +is `FixedSizeListArray` in Stages 1-2 and may be swapped to `PDXArray` in Stage +3 — TurboQuant checks the child type at runtime, not via a metadata flag. + +### Stage 2 (block decomposition) + +``` +TurboQuantArray (self-contained, handles blocks internally) +├── metadata: { dimension, b_mse, block_size, num_blocks } +│ +│ # Per-row children (sliced/taken on row operations) +├── codes: FixedSizeListArray # list_size = k × B +│ (or PDXArray after Stage 3) +├── norms: PrimitiveArray # len = num_rows (k=1) +│ or FixedSizeListArray # list_size = k (k>1) +│ +│ # Shared children (cloned on row operations, not sliced) +├── centroids: PrimitiveArray # len = 2^b_mse +├── mse_rotation_signs: PrimitiveArray # len = k × 3 × B +``` + +## Compression ratio + +For f32 input, b_mse bits MSE, k = d/B blocks, N vectors (for f64 input, +replace 32 with 64 in the norms row — ratios decrease accordingly): + +| Component | Bits per vector | +| ----------- | --------------- | +| MSE codes | k × B × b_mse | +| Block norms | k × 32 | + +| Component | Shared bits | +| ---------- | ------------ | +| Centroids | 2^b_mse × 32 | +| SORF signs | k × 3 × B | + +### Worked examples (f32, b_mse=5, N=1000) + +| d | B | k | Per-vec bits | Ratio | Notes | +| ------------- | ---- | --- | --------------------- | ----- | -------------------------- 
| +| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; zero padding | +| 1024 | 1024 | 1 | 1024×5 + 32 = 5152 | 6.4× | Single block (= current) | +| 768 (current) | 1024 | 1 | 1024×5 + 32 = 5152 | 4.8× | Padded; 33% overhead | + +Block decomposition improves the **compression ratio** for d=768 from ~4.8× to +~6.2× (about **29%** higher ratio). In **compressed bits per vector** for the +same settings, that is about **24%** fewer bits (5152 → 3936). For d=1024 the +encoding is identical to current. + +**Shared overhead:** centroids and SORF signs are **amortized over N** vectors; +for **small N**, per-vector shared metadata dominates—report totals both with +and without amortization when publishing ratios. + +## Performance analysis + +### Encode/decode throughput + +SORF at B dimensions (order-of-magnitude): 3 × B × log₂(B) butterflies + 3 × B +sign applications per block (plus B normalization multiplies, omitted). Constants +and memory traffic dominate in practice; treat FLOP estimates as **heuristic**. +For k blocks: + +| B | SORF FLOPs/block | k (d=768) | Total MSE FLOPs | +| -------------- | ------------------------- | --------- | --------------- | +| 256 | 3×256×8 + 768 = 6,912 | 3 | 20,736 | +| 512 | 3×512×9 + 1536 = 15,360 | — | — | +| 1024 (current) | 3×1024×10 + 3072 = 33,792 | 1 | 33,792 | + +Block decomposition at d=768 is ~40% fewer FLOPs than the current padded +approach, despite more blocks, because each block is smaller. + +### Benchmarking plan + +1. Encode/decode throughput: block TQ vs. current TQ at d=128, 768, 1024 +2. Quantized cosine similarity: block vs. current +3. L2 norm readthrough: O(k) vs. O(1) +4. PDX scan throughput vs. row-major (Stage 3) + +## Experimental plan + +### MSE quality vs. block size + +- Compare actual normalized MSE at B ∈ {64, 128, 256, 512} vs. single-SORF at + padded dimension, at bit widths b ∈ {2, 3, 4, 5, 8} +- Test SORF coordinate distribution at each B: histogram vs. 
analytical Beta
+- Test 3, 4, 5 SORF rounds at each B
+- Determine if the practical MSE constant is worse at smaller B
+
+### QJL strategy comparison (if pursued)
+
+- Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim padded SORF QJL
+  vs. MSE-only
+- Key metric: ANN recall@k on the benchmark datasets below (Contriever,
+  OpenAI, SIFT)
+- Per community findings, MSE-only is expected to win [8]
+
+### Benchmarking datasets
+
+The current test suite uses i.i.d. Gaussian vectors: for **isotropic** data, a
+random orthogonal transform is **distributionally neutral**, so this is a clean
+**theory/sanity** anchor—not a guaranteed “pessimistic” proxy for all production
+embedding geometries (heavy tails, clusters, anisotropy can behave differently).
+Recent work (VIBE [11]) argues that traditional benchmarks (SIFT, GloVe) are no
+longer representative of modern ANN workloads.
+
+**Recommended datasets:**
+
+| Dataset | Dim | Size | Source | Why |
+| ----------------------------- | ------ | ------ | ---------------- | ------------------------------------------------------ |
+| Contriever | 768 | ~1M | PDX paper [4] | Key non-power-of-2 target; real embeddings |
+| OpenAI text-embedding-3-large | 1536 | ~1M | Common in RAG | High-d production embeddings |
+| SIFT | 128 | 1M | Classic | Low-d power-of-2 baseline, well-studied recall numbers |
+| arXiv embeddings | 768 | 2.25M | PDX paper [4] | Same dim as Contriever, larger scale |
+| DEEP | 96 | 10M | Image embeddings | Large scale; d=96 has no B ≥ 64 divisor → padded path |
+| Synthetic Gaussian | varies | varies | Internal | Theory/sanity anchor; validates theoretical bounds |
+
+**Metrics** (at b_mse ∈ {2, 3, 4, 5, 8}):
+
+- Recall@10, Recall@100 (ANN ranking quality)
+- Normalized MSE distortion (reconstruction quality)
+- Inner product mean signed relative error (bias measurement)
+- Encode/decode throughput (vectors/sec)
+
+The Gaussian baseline validates that theoretical bounds hold. 
The real-embedding +datasets measure practical quality — which may be **better** than Gaussian +(structured data benefits more from rotation) or **worse** (if the data has +adversarial properties for the specific rotation). + +### Straggler handling (if needed) + +Rare for common dimensions. If encountered: zero-pad to B (simplest). Follow-up: +dense rotation at actual dimension. + +## Phasing + +**Phase 1** — MSE-only single-block TurboQuant: Split the [current PR][current-impl] +to merge MSE-only (no QJL). This is a complete encoding for all dimensions +(with padding for non-power-of-2). + +**Phase 2** — Block decomposition: Add block splitting for non-power-of-2 +dimensions. B = greatest power-of-2 ≥ 64 dividing d. Per-block norms stored as +internal children. The `TurboQuantScheme::compress()` method must be updated to: +(a) choose B based on d, (b) split input into blocks, (c) normalize per-block, +(d) encode each block, and (e) store per-block norms as an internal child array. + +**Phase 3** — PDXArray + scan kernels: Introduce `PDXArray` as a general-purpose +dimension-major layout for `FixedSizeListArray`. TurboQuant's codes child is +swapped from FSL to PDXArray by the compressor. Distance computation kernels +operate on PDXArray's dimension-contiguous slices. + +**Phase 4** (experimental) — QJL: If the experimental plan shows QJL improves +recall@k beyond MSE-only, add per-block Gaussian or SORF QJL. Based on +community findings, this may not be pursued. + +## Practical recommendations + +For common model dimensions, the most promising configurations are: + +| Dimension | Recommendation | Rationale | +| --------------------- | --------------------------- | -------------------------------------------------------------------------- | +| 512, 1024, 2048, 4096 | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout. | +| 768, 1536, 3072 | 3-block MSE-only + PDX | B=256 or 512. Zero padding waste. 
3 blocks, shared centroids. |
+| 2560, 1280, … | Evaluate table rule | Greatest power-of-2 ≥ 64 dividing d (e.g. 2560 → B=512, k=5). |
+| Arbitrary d (rare) | Padded single-block | Fall back to current approach. Padding overhead bounded by B-1 dims. |
+
+In all cases, MSE-only is the recommended starting point. QJL should only be
+added if experiments demonstrate clear recall@k improvements for the target
+workload.
+
+## Future work: GPU decode and fused distance computation
+
+The B-dim block structure maps naturally to GPU tile sizes and tensor cores.
+For a batch of N vectors sharing the same rotation matrix R⁻¹:
+
+```
+decoded_batch = diag(norms) × R⁻¹ × codebook_lookup_batch(codes)
+                                    ↑ B×N matrix
+                              ↑ B×B × B×N = GEMM
+```
+
+The codebook gather + inverse rotation + norm scaling can be fused into a single
+kernel using an **IO-aware streaming pattern analogous in spirit** to Flash-KMeans’
+fused assignment/update philosophy [6]—**not** the same algorithm (Flash-KMeans is
+GPU k-means), but a similar systems goal: reduce HBM traffic and avoid full
+materialization.
+For distance computation without full decode, a precomputed (2^b_mse)²-entry
+distance table fits in shared memory (1 KB at b_mse=4, 4 KB at b_mse=5); the
+kernel streams code bytes from HBM with gather-reduce accumulation, using
+4-8× less bandwidth than full float vectors.
+
+At b_mse=8, codes are uint8 indices (0-255). Hypothetical int8 tensor-core paths
+(e.g. VPDPBUSD-style idioms) require **quantized coordinate values** in a narrow
+dynamic range and typically **near-linear** centroid spacing—but Max-Lloyd
+centroids are **not** constrained to such a representation. At high B the
+centroids are **near-uniform** under the concentrated marginal
+(the Beta distribution is highly concentrated, approaching Gaussian, for which
+high-resolution optimal quantization is approximately uniform). 
Whether the +existing Max-Lloyd centroids are "linear enough" for hardware dot-product +instructions is an empirical question worth testing before introducing a +separate linear quantization mode. + +## Integration with Vortex scan engine + +TurboQuant's quantized-domain operations must integrate with Vortex's expression +evaluation and scan pushdown infrastructure. The current implementation provides +this via `ScalarFnVTable` implementations in `vortex-tensor`. + +**Current integration path.** The `CosineSimilarity`, `DotProduct`, and `L2Norm` +scalar functions check whether their input storage arrays are TurboQuant-encoded +(via `TurboQuant::try_match()`). If both operands are TurboQuant and the +`ApproxOptions::Approximate` flag is set, the scalar function dispatches to the +quantized-domain kernel (e.g., `cosine_similarity_quantized_column`), bypassing +full decompression. Otherwise, it falls back to the exact path (decompress → +compute on floats). + +**Stage 2 changes.** With block decomposition, the quantized kernels must be +updated to iterate over TQ blocks, weighting by per-block norms: + +- `cosine_similarity_quantized_column`: currently computes a single unit-norm + dot product per row pair. Must change to `Σ_k norm_a_k · norm_b_k · +unit_dot_k / (‖a‖ · ‖b‖)` with `‖a‖ = √(Σ_k norm_a_k²)`. +- `dot_product_quantized_column`: same per-block weighting. +- `l2_norm`: currently returns the stored norm directly (O(1)). Must change to + `√(Σ_k norm_k²)` — read the norms FSL child and compute. +- Both operands must have the **same block size B**, **compatible centroids** + (same `b_mse` and block-**B** codebook), and **bit-identical MSE rotation + parameters** (`mse_rotation_signs` and the same SORF construction) for the + quantized inner-product path to equal the true dot product in expectation + under the TurboQuant model. 
**Two stored columns** with different rotations + must **fall back to exact** (decompress → float) unless a higher-level contract + guarantees shared rotation metadata. The common **column vs constant query** + path remains: re-encode the query with the **column’s** rotation and + centroids. + +**Stage 3 changes.** The PDX distance kernel (shown in Stage 3 pseudocode) is a +new execution path that operates on `PDXArray`-typed codes. It should be exposed +as an alternative `ScalarFnVTable` implementation that activates when the codes +child is a `PDXArray` and the scan is over a contiguous 64-vector-aligned range. +For non-aligned ranges or single-vector access (`scalar_at`), the PDXArray is +converted to FSL first via `PDXArray::to_fsl()`. + +**Expression tree integration.** The typical ANN scan expression is: + +``` +top_k(cosine_similarity(column, constant_query), k=10) +``` + +The `constant_query` is broadcast to match the column length. The +`CosineSimilarity` scalar function receives both the column (TurboQuant-encoded) +and the query (ConstantArray wrapping a single vector). For the quantized path, +the query is first encoded with the column's rotation and centroids to produce +query codes and query block norms, then the PDX kernel runs over the column's +codes without decompressing them. + +## Migration and compatibility + +TurboQuant has not shipped yet, so there are no existing files to migrate. We +can design the metadata for forward compatibility from day one. + +**Strategy: single array ID, versioned metadata.** All stages use the same array +ID (`vortex.turboquant`). The metadata includes `block_size` and `num_blocks` +fields from Stage 1 onward. Stage 1 always writes `num_blocks=1`, but the field +exists so that Stage 2 decoders can read Stage 1 files without migration. + +**Norms are always internal children.** The TurboQuant array is self-contained — +it stores norms as a child slot, not in a parent encoding. 
This means: + +- Stage 1: norms child is `PrimitiveArray`, one norm per vector (F = f64 for + f64 input, f32 otherwise). +- Stage 2 with k=1 (power-of-2 dims): same as Stage 1, identical wire format. +- Stage 2 with k>1: norms child is `FixedSizeListArray`, k norms per vector. + +The decoder distinguishes k=1 from k>1 by reading `num_blocks` from metadata. +A k=1 decoder is backward-compatible with Stage 1 files. A k>1 decoder is a new +code path that only applies to files written by Stage 2+. + +**Stage 3 (PDXArray) is additive.** PDX is not a TurboQuant metadata flag — it's +a separate array type (`PDXArray`) that wraps the codes child. Stage 1/2 files +have `FixedSizeListArray` codes; Stage 3 files have `PDXArray` codes. The +TurboQuant decoder checks the child type and un-transposes PDXArray on decode if +needed. `PDXArray` itself is registered as a new encoding, independent of +TurboQuant. + +**Incremental shipping:** + +| Stage | Ships to users? | Reads Stage 1 files? | Notes | +| ------------ | ---------------- | -------------------------- | ----------------------------------- | +| 1 (MSE-only) | Yes, immediately | N/A (first version) | New encoding, no backcompat concern | +| 2 (blocks) | Yes | Yes (k=1 is identical) | k>1 files need Stage 2+ decoder | +| 3 (PDX) | Yes | Yes (FSL codes still work) | PDX codes need PDXArray registered | + +Each stage is independently shippable. Users can upgrade incrementally. Files +written by earlier stages are always readable by later decoders. + +## References + +[1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. "TurboQuant: Online +Vector Quantization with Near-optimal Distortion Rate." ICLR 2026. +arXiv:2504.19874, April 2025. + +[2] Ailon, N. and Chazelle, B. "The Fast Johnson-Lindenstrauss Transform and +Approximate Nearest Neighbors." SIAM J. Comput. 39(1):302-322, 2009. + +[3] Tropp, J.A. "Improved Analysis of the Subsampled Randomized Hadamard +Transform." Adv. Adaptive Data Analysis 3(1-2):115-126, 2011. 
+ +[4] Kuffo, L., Krippner, E. and Boncz, P. "PDX: A Data Layout for Vector +Similarity Search." SIGMOD '25. arXiv:2503.04422, March 2025. + +[5] Yu, F.X., Suresh, A.T., Choromanski, K., Holtmann-Rice, D. and Kumar, S. +"Orthogonal Random Features." NeurIPS 2016. arXiv:1610.09072. + +[6] Yang, S. et al. "Flash-KMeans: Fast and Memory-Efficient Exact K-Means." +arXiv:2603.09229, March 2026. + +[7] Pathare, T. et al. "TurboQuant: Implementation Corrections, Production +Hardening, and Deployment Infrastructure." Eviox Tech Report v1.2.0, +March 2026. + +[8] Community TurboQuant implementations and findings. Key sources (pin +**commits** or **releases** in the final RFC): tonbistudio/turboquant-pytorch +(PyTorch, MSE-only reports); ggml-org/llama.cpp — use a **resolvable** issue or +discussion (e.g. issue **#20977** “Feature Request: TurboQuant support,” or +discussion **#21155**, as of 2026; replace if superseded); 0xSero/turboquant +(Triton); vivekvar-dl/turboquant (pip); scos-lab/turboquant (reproduction). +**Claim:** several groups report MSE-only beating MSE+QJL for attention / ANN-style +metrics at tested bit widths—treat as **empirical community reports** until +summarized in a peer-reviewed study or a pinned benchmark table. + +[9] Jégou, H., Douze, M. and Schmid, C. "Product Quantization for Nearest +Neighbor Search." IEEE Trans. PAMI 33(1):117-128, 2011. + +[10] Ge, T., He, K., Ke, Q. and Sun, J. "Optimized Product Quantization." +IEEE Trans. PAMI 36(4):744-755, 2014. + +[11] Jääsaari, E., Hyvönen, V., Ceccarello, M., Roos, T. and Aumüller, M. +"VIBE: Vector Index Benchmark for Embeddings." arXiv:2505.17810, May 2025. 
From c9230f672d538de9bc2aa0d198ff71fc5353bf68 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 12:10:57 -0400 Subject: [PATCH 05/19] incorporate synthesized ultra reviewer feedback Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 164 +++++++++++++++++++----------- 1 file changed, 107 insertions(+), 57 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 88cbc7e..56b1ff4 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -19,9 +19,11 @@ in three stages: groups of 64 vectors for SIMD scan performance. QJL correction is deferred to a later stage and may ultimately be dropped. -Community findings from 6+ independent TurboQuant implementations consistently -show that MSE-only outperforms MSE+QJL for attention and ANN ranking in -practice [8]. +Community findings from multiple independent TurboQuant implementations +consistently show that MSE-only outperforms MSE+QJL for KV-cache attention [8]. +For ANN ranking and vector-search workloads, the evidence is currently less +complete, so QJL should remain an empirical question rather than a settled +conclusion. [current-impl]: https://github.com/vortex-data/vortex/pull/7167 @@ -61,10 +63,14 @@ differences are: | Bits per sub-vector | Scalar: b bits per coordinate | Vector: typically 8 bits per sub-vector (256 codewords) | TurboQuant trades PQ's flexibility (data-dependent codebooks can exploit -structure) for data-obliviousness (no training, provable bounds, zero indexing -time). For uniformly distributed embeddings, TurboQuant's analytically optimal -centroids should match or exceed PQ's learned codebooks. For highly structured -data, PQ may still win empirically. +structure) for data-obliviousness (no training, provable bounds, no offline +index-training phase). Encode-time work (rotation + quantization) still applies. 
+In return, PQ and OPQ retain a major advantage in expressivity: they learn +sub-vector codebooks from data rather than applying an analytic scalar quantizer. +In practice this means TurboQuant is attractive when training-free operation, +simple deployment, and theoretical guarantees matter most, while PQ or OPQ may +still win empirically when a learned vector codebook can exploit dataset-specific +structure. ### Current Vortex implementation @@ -140,17 +146,20 @@ bound itself — they are well below the 2.72/4^b bound. ### Community findings on QJL -Multiple independent TurboQuant implementations have converged on a -significant practical finding: **MSE-only consistently outperforms MSE+QJL for -attention and ANN ranking**. The mechanism is a variance-bias tradeoff: -TurboQuant's QJL correction eliminates bias but increases variance, and softmax -attention (and cosine/L2 ranking) amplifies variance more than bias. At the same -total bit budget, allocating all bits to MSE (more centroids, lower variance) -beats splitting between MSE + QJL (fewer centroids + 1-bit correction). This has -been confirmed by 6+ groups across Python, C, and Rust implementations [8]. - -This finding strongly supports making MSE-only the default strategy for our -columnar storage use case (ANN search, cosine similarity ranking). +Multiple independent TurboQuant implementations have converged on a significant +practical finding for **KV-cache attention**: MSE-only often outperforms MSE+QJL +at the same bit budget. The likely mechanism is a variance-bias tradeoff: QJL +removes bias in raw inner-product estimation but adds variance, and the softmax +nonlinearity amplifies variance more than it penalizes bias. In that setting, +allocating all bits to MSE (more centroids, lower quantization variance) can beat +splitting the budget between MSE + QJL. This behavior has been reported by +multiple groups across Python, C, and Rust implementations [8]. 
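One piece of the variance-bias intuition can be made concrete with a toy
calculation (not from the RFC: a two-key attention case, where softmax reduces
to the logistic sigmoid of the logit gap). Zero-mean noise on a pre-softmax
score does not stay unbiased after the nonlinearity:

```python
import math

def sigmoid(x: float) -> float:
    # Two-way softmax reduces to the logistic sigmoid of the logit gap.
    return 1.0 / (1.0 + math.exp(-x))

logit = 1.0  # clean logit gap between the two keys
eps = 1.0    # symmetric, zero-mean two-point noise: +eps or -eps

clean = sigmoid(logit)
# Average attention weight under unbiased (zero-mean) noise on the logit.
noisy_mean = 0.5 * (sigmoid(logit + eps) + sigmoid(logit - eps))

# sigmoid is strictly concave on [0, 2], so zero-mean noise on the logit
# yields a systematically *smaller* expected attention weight: bias appears
# after the nonlinearity even though the logit estimate was unbiased.
assert noisy_mean < clean
print(f"clean={clean:.4f}  mean under zero-mean noise={noisy_mean:.4f}")
```

This is only directional intuition: it illustrates why QJL's pre-softmax
unbiasedness loses value under softmax, not that MSE-only wins in any
particular workload.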
+ +For ANN search, cosine ranking, and other non-softmax vector-search workloads, +the evidence is currently less settled. MSE-only is still a reasonable default +because it is simpler and better supported by the current implementation work, +but the ANN question should be treated as empirical until evaluated on ANN +datasets with recall@k and ranking metrics (see Experimental plan). ### Current limitations @@ -168,8 +177,10 @@ The SORF requires power-of-2 input dimension. For non-power-of-2 dimensions PDX [4] is a data layout for vector similarity search. The paper (SIGMOD '25) describes a dimension-major layout within fixed-size blocks of 64 vectors, enabling the compiler to auto-vectorize the inner distance loop over vectors -rather than dimensions, achieving on average 2× speedups over SIMD-optimized -row-major kernels on modern CPUs. The block size of 64 is empirically optimal +rather than dimensions. In the paper, this yields average speedups of about 40% +over SIMD-optimized row-major kernels for the direct kernel comparison, while +dimension-pruning methods (ADSampling, BSA) recover much larger gains (2-7×) +when paired with the PDX layout [4]. The block size of 64 is empirically optimal across AVX-512, AVX2, and NEON architectures [4]. **PDX implementation evolution.** The [open-source implementation][pdx-impl] @@ -338,7 +349,8 @@ norm = 0, decode as all zeros. #### Theoretical MSE bound -The paper's MSE bound (Theorem 1 in [1]) is: +The paper's MSE bound (Theorem 1 in [1]; verify theorem numbering against the +ICLR 2026 camera-ready if it differs from the arXiv version) is: ``` E[‖x - x̂‖² / ‖x‖²] ≤ (√3 · π / 2) / 4^b ≈ 2.72 / 4^b @@ -349,13 +361,23 @@ Gaussian), not SORF.** Our SORF is an approximation. The bound holds exactly only with a true random orthogonal rotation or with empirical SORF validation (see Experimental plan). 
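For reference, the bound's constant and a few bit widths work out as follows (a
quick numeric sketch of the formula above; nothing beyond the stated bound is
assumed):

```python
import math

# Theorem 1 constant: sqrt(3) * pi / 2 ≈ 2.72, divided by 4^b.
C = math.sqrt(3.0) * math.pi / 2.0
assert abs(C - 2.7207) < 1e-3

for b in (4, 5, 8):
    bound = C / 4.0 ** b
    print(f"b_mse={b}: relative MSE bound <= {bound:.6f}")
```

At b_mse=5 the bound works out to roughly 0.27% relative squared error.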
-Assuming the per-block MSE bound holds, for a vector split into blocks: +Assuming the per-block MSE bound holds, for a vector split into blocks the +following **algebraic** identity is exact: ``` ‖x - x̂‖² / ‖x‖² = Σ_k (‖xₖ‖² / ‖x‖²) × (‖xₖ - x̂ₖ‖² / ‖xₖ‖²) ≤ MSE_bound × Σ_k (‖xₖ‖² / ‖x‖²) = MSE_bound ``` +The inequality applies Theorem 1's **probabilistic** bound (over the random +rotation) to each block independently. The conclusion should be read in terms +of **expectations**: `E[‖x - x̂‖² / ‖x‖²] ≤ MSE_bound` assuming independent +per-block rotations. Note that TurboQuant's original analysis uses a single +global rotation in high-d where coordinates are nearly independent; with +smaller block dimension B, within-block coordinate dependence after rotation may +be stronger even when marginals are correct — this is an additional motivation +for the experimental plan's comparison of block sizes. + The actual MSE may depend on block dimension B: at larger B the coordinate distribution is more concentrated (variance ~1/B), giving the Max-Lloyd quantizer more to exploit. See Experimental plan. @@ -374,19 +396,25 @@ vectors). Each block must have an **independent** rotation matrix. **Why not DCT?** The PDX implementation [pdx-impl] uses DCT (via FFTW) as a fast rotation for ADSampling. DCT is O(B log B) and invertible, but it is a **fixed structured transform**, not a random rotation — it does not produce the Beta -marginal distribution `(1-x²)^((d-3)/2)` that TurboQuant's Max-Lloyd centroids -are optimized for. ADSampling only needs approximate coordinate independence +marginal distribution `(1-x²)^((B-3)/2)` (in block dimension B) that +TurboQuant's Max-Lloyd centroids are optimized for. ADSampling only needs +approximate coordinate independence (for hypothesis-testing pruning), so DCT suffices there. TurboQuant needs a specific known marginal distribution, so only random orthogonal rotations (QR or SORF) are suitable. 
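The blockwise error decomposition above is purely algebraic, so it can be
sanity-checked with any stand-in quantizer. A minimal sketch (rounding plays
the role of the real Max-Lloyd quantizer; block size and count are arbitrary):

```python
import math
import random

random.seed(7)
B, k = 4, 3  # block dim and number of blocks (d = 12)
x = [random.gauss(0.0, 1.0) for _ in range(B * k)]
# Stand-in for the real quantizer: round each coordinate to 1 decimal place.
x_hat = [round(v, 1) for v in x]

sq = lambda v: sum(c * c for c in v)
err_total = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / sq(x)

# Blockwise recombination: sum_k (|x_k|^2 / |x|^2) * (|x_k - x̂_k|^2 / |x_k|^2)
err_blocks = 0.0
for j in range(k):
    xb = x[j * B:(j + 1) * B]
    xhb = x_hat[j * B:(j + 1) * B]
    err_blocks += (sq(xb) / sq(x)) * (
        sum((a - b) ** 2 for a, b in zip(xb, xhb)) / sq(xb)
    )

assert math.isclose(err_total, err_blocks, rel_tol=1e-9)
print("total relative squared error:", err_total)
```

The identity holding exactly is what lets the probabilistic Theorem 1 bound be
applied per block and then recombined in expectation.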
-**Shared rotation with ADSampling.** Both TurboQuant and ADSampling apply a -random orthogonal rotation to make coordinates independent. If we integrate -ADSampling-style dimension pruning (see Stage 3), the same rotation could serve -both purposes: producing the Beta distribution for quantization AND enabling -hypothesis-testing for early pruning. This would avoid rotating the data twice. -Note that the query must also be rotated at query time with the same rotation -matrix (stored as a shared child); ADSampling already requires this. +**Shared rotation with ADSampling (speculative).** Both TurboQuant and +ADSampling apply a random orthogonal rotation to make coordinates independent. +If we integrate ADSampling-style dimension pruning (see Stage 3), the same +rotation could in principle serve both purposes. However, this is not automatic +under the Stage 2 block-decomposed design: ADSampling is formulated around a +single full-dimensional random projection whose coordinates can be sequentially +sampled, whereas Stage 2 introduces per-block rotations and per-block norm +weighting. Reusing one rotation across both systems should be treated as a +**future research direction** that requires new analysis or direct empirical +validation. If it proves viable, it would avoid rotating the data twice. The +query would also need to be rotated at query time with the same stored +transform. #### Quantized-domain operations @@ -589,8 +617,9 @@ for Gaussian. SORF for QJL is an additional approximation (the [current implementation][current-impl] uses SORF for QJL). Per-block QJL has d/B times more variance than full-dimension QJL (Lemma 4 [1]). -The community consensus is that MSE-only likely wins for ANN ranking at all -bit widths, so QJL may not be worth the complexity. +Community reports indicate MSE-only often wins for KV-cache attention at all +tested bit widths [8]. 
Whether this extends to ANN ranking is an empirical +question (see Experimental plan); QJL may not be worth the complexity. ## Array layout @@ -650,22 +679,27 @@ replace 32 with 64 in the norms row — ratios decrease accordingly): ### Worked examples (f32, b_mse=5, N=1000) -| d | B | k | Per-vec bits | Ratio | Notes | -| ------------- | ---- | --- | --------------------- | ----- | -------------------------- | -| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; zero padding | -| 1024 | 1024 | 1 | 1024×5 + 32 = 5152 | 6.4× | Single block (= current) | -| 768 (current) | 1024 | 1 | 1024×5 + 32 = 5152 | 4.8× | Padded; 33% overhead | +| d | B | k | Per-vec bits | Ratio | Notes | +| ------------- | ---- | --- | --------------------- | ----- | ------------------------ | +| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; no padding | +| 1024 | 1024 | 1 | 1024×5 + 32 = 5152 | 6.4× | Single block (= current) | +| 768 (current) | 1024 | 1 | 1024×5 + 32 = 5152 | 4.8× | Padded; 33% overhead | + +Block decomposition improves the compression ratio for d=768 from ~4.8× to +~6.2× (about 29% higher ratio; equivalently, about 24% fewer compressed bits +per vector: 5152 → 3936). For d=1024 the encoding is identical to current. -Block decomposition improves d=768 from 4.8× to 6.2× — a 30% storage -improvement. For d=1024 the encoding is identical to current. +**Shared overhead note:** centroids and SORF signs are amortized over N vectors; +for small N, per-column shared metadata is significant — report totals with and +without amortization when publishing ratios. ## Performance analysis ### Encode/decode throughput -SORF at B dimensions: 3 × B × log₂(B) butterflies + 3 × B sign applications -per block (plus B normalization multiplies, omitted for simplicity). 
For k -blocks: +SORF at B dimensions (heuristic — real cost is dominated by memory bandwidth +and constant factors): 3 × B × log₂(B) butterflies + 3 × B sign applications +per block (plus B normalization multiplies, omitted). For k blocks: | B | SORF FLOPs/block | k (d=768) | Total MSE FLOPs | | -------------- | ------------------------- | --------- | --------------- | @@ -698,7 +732,8 @@ approach, despite more blocks, because each block is smaller. - Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim padded SORF QJL vs. MSE-only - Key metric: ANN recall@k on the datasets above (Contriever, OpenAI, SIFT) -- Per community findings, MSE-only is expected to win [8] +- Per community findings for attention, MSE-only is expected to win [8]; ANN + ranking is the key open question ### Benchmarking datasets @@ -784,14 +819,17 @@ decoded_batch = diag(norms) × R⁻¹ × codebook_lookup_batch(codes) ``` The codebook gather + inverse rotation + norm scaling can be fused into a single -kernel following the double-buffered streaming pattern from Flash-KMeans [6]. +kernel using an IO-aware streaming pattern analogous to Flash-KMeans [6] — not +the same algorithm (Flash-KMeans is GPU k-means), but a similar systems goal: +reduce HBM traffic and avoid full materialization. For distance computation without full decode, a precomputed (2^b_mse)²-entry distance table fits in shared memory (1 KB at b_mse=4, 4 KB at b_mse=5); the kernel streams code bytes from HBM with gather-reduce accumulation, using 4-8× less bandwidth than full float vectors. -At b_mse=8, codes are uint8 indices (0-255). Direct int8 tensor core GEMM -(using codes as the unsigned operand in VPDPBUSD) requires approximately linear +At b_mse=8, codes are uint8 indices (0-255). 
Direct low-precision GEMM on +hardware accelerators (tensor cores on GPU, byte-dot-product instructions on +CPU) requires approximately linear centroids — but at high B the Max-Lloyd centroids are already near-uniform (the Beta distribution is highly concentrated, approaching Gaussian, for which high-resolution optimal quantization is approximately uniform). Whether the @@ -822,8 +860,13 @@ unit_dot_k / (‖a‖ · ‖b‖)` with `‖a‖ = √(Σ_k norm_a_k²)`. - `dot_product_quantized_column`: same per-block weighting. - `l2_norm`: currently returns the stored norm directly (O(1)). Must change to `√(Σ_k norm_k²)` — read the norms FSL child and compute. -- Both operands must have the **same block size B** and compatible centroids for - the quantized path to apply. If block sizes differ, fall back to exact. +- Both operands must have the **same block size B**, compatible centroids (same + `b_mse` and B-dim codebook), and **bit-identical MSE rotation parameters** + (`mse_rotation_signs` and same SORF construction) for the quantized + inner-product path to be valid. Two stored columns with different rotations + must **fall back to exact** (decompress → float). The common **column vs. + constant query** path avoids this: the query is re-encoded with the column's + rotation and centroids at query time. **Stage 3 changes.** The PDX distance kernel (shown in Stage 3 pseudocode) is a new execution path that operates on `PDXArray`-typed codes. It should be exposed @@ -908,14 +951,21 @@ arXiv:2603.09229, March 2026. [7] Pathare, T. et al. "TurboQuant: Implementation Corrections, Production Hardening, and Deployment Infrastructure." Eviox Tech Report v1.2.0, -March 2026. - -[8] Community TurboQuant implementations and findings. Key sources: -tonbistudio/turboquant-pytorch (PyTorch, V3 MSE-only findings), -ggml-org/llama.cpp#20969 (C/C++, quantized attention analysis), -0xSero/turboquant (Triton kernels), vivekvar-dl/turboquant (pip package), -scos-lab/turboquant (reference reproduction). 
Consensus: MSE-only beats -MSE+QJL for attention and ANN ranking at all tested bit widths. +March 2026. https://eviox.tech/nexus/eviox_turboquant_corrections_study.pdf + +[8] Community TurboQuant implementation reports. These sources primarily study +KV-cache attention rather than ANN search; claims should be scoped accordingly. +Key sources (pin commits/releases in final external draft): + +- tonbistudio/turboquant-pytorch: MSE-only (V3) vs MSE+QJL (V2) for attention + and generation. Workload: KV-cache attention. +- ggml-org/llama.cpp discussion #21155: TurboQuant quantized attention analysis. + Workload: KV-cache attention. +- 0xSero/turboquant: Triton kernels, paper validation scripts. +- scos-lab/turboquant: Reference reproduction, MSE vs Prod comparison. + Several groups report MSE-only beating MSE+QJL for attention metrics at tested + bit widths. ANN ranking conclusions remain preliminary pending dedicated + benchmarks. [9] Jégou, H., Douze, M. and Schmid, C. "Product Quantization for Nearest Neighbor Search." IEEE Trans. PAMI 33(1):117-128, 2011. 
From 7c3311b6a8c48a1857f9da5400d1b26659c6b368 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 12:11:31 -0400 Subject: [PATCH 06/19] delete reviews Signed-off-by: Will Manning --- .../0033-block-turboquant-review-gpt-5.4.md | 347 ------ ...k-turboquant-review-synthesis-composer2.md | 171 --- ...0033-block-turboquant-revised-composer2.md | 988 ------------------ 3 files changed, 1506 deletions(-) delete mode 100644 proposed/0033-block-turboquant-review-gpt-5.4.md delete mode 100644 proposed/0033-block-turboquant-review-synthesis-composer2.md delete mode 100644 proposed/0033-block-turboquant-revised-composer2.md diff --git a/proposed/0033-block-turboquant-review-gpt-5.4.md b/proposed/0033-block-turboquant-review-gpt-5.4.md deleted file mode 100644 index f6ddb74..0000000 --- a/proposed/0033-block-turboquant-review-gpt-5.4.md +++ /dev/null @@ -1,347 +0,0 @@ -# Review of `0033-block-turboquant.md` - -## Scope - -This review checks the RFC against: - -- the TurboQuant paper (`arXiv:2504.19874`) -- the PDX paper (`arXiv:2503.04422`) -- the cited SORF / ORF paper (`arXiv:1610.09072`) -- the cited PQ / OPQ papers -- the referenced open-source implementations and publicly available discussions that could be located - -The goal of this review is not to argue against the proposal direction. The goal is to make the RFC maximally defensible when read by experts who will check claims, citations, and wording very closely. - -## Executive Summary - -The proposal direction is plausible, and several technical points in the RFC are solid, especially: - -- the Theorem 1 constant correction -- the distinction between orthogonal MSE rotation and Gaussian QJL projection -- the rationale for treating SORF as an approximation rather than as a theorem-preserving drop-in replacement - -The largest problems are not in the core block-decomposition idea. They are in the rhetoric and sourcing around it: - -1. 
The RFC currently overclaims that community evidence supports dropping QJL for **ANN ranking**, when the located evidence is primarily about **KV-cache attention**. -2. The RFC overstates the PDX paper's speedup claim. -3. The PQ comparison contains an unsupported superiority claim that is likely to irritate reviewers. -4. The ADSampling integration discussion makes a nontrivial compatibility question sound easy. -5. The citation hygiene for `[7]` and especially `[8]` is not strong enough for external review. - -## Primary Findings - -### 1. Overclaim: evidence does not currently justify the ANN-ranking conclusion - -The most serious issue is the scope of the QJL claim. The current RFC says: - -> Community findings from 6+ independent TurboQuant implementations consistently show that MSE-only outperforms MSE+QJL for attention and ANN ranking in practice. - -The evidence I could verify does support a strong claim for **KV-cache attention**: - -- `tonbistudio/turboquant-pytorch` explicitly argues that QJL hurts because softmax amplifies variance. -- `scos-lab/turboquant` also reports MSE beating Prod/QJL for attention-like workloads. -- other community sources appear to be in the same family of KV-cache experiments. - -However, that is not the same thing as evidence for ANN ranking. In fact, one of the strongest located community sources explicitly distinguishes the two and says QJL may still work for vector search because there is no softmax nonlinearity. - -That means the current wording is too strong in two ways: - -- it extends **attention evidence** to **ANN ranking** -- it uses that extension to justify a product decision for Vortex's search/storage use case - -For outside review, the RFC should either: - -- narrow the claim to KV-cache attention only, or -- add actual ANN experiments and cite those directly - -### 2. Mis-citation: the PDX paper is overstated - -The RFC currently says PDX achieves "on average 2x speedups over SIMD-optimized row-major kernels." 
- -The PDX paper's abstract says: - -- PDX beats SIMD-optimized horizontal kernels by **average 40%** -- pruning approaches recover **2-7x** benefit when used with PDX - -Those are different claims. The RFC currently mixes them together in a way that overstates what the paper says. - -### 3. Unsupported comparison: TurboQuant is presented as likely superior to PQ on uniform embeddings - -The RFC currently says: - -> For uniformly distributed embeddings, TurboQuant's analytically optimal centroids should match or exceed PQ's learned codebooks. - -This is not supported by the cited PQ/OPQ literature, and it is not obviously true. PQ uses learned **vector** codebooks in subspaces, while TurboQuant uses rotated **scalar** quantization. The correct contrast is: - -- TurboQuant is training-free, data-oblivious, and analyzable. -- PQ/OPQ are data-dependent and require training. -- PQ/OPQ may still be empirically stronger because vector codebooks are more expressive. - -The current sentence sounds like a theorem-shaped statement without theorem-level support. - -### 4. ADSampling integration is presented too casually - -The RFC suggests that TurboQuant and ADSampling might share the same rotation. - -That is not obviously compatible with the proposed Stage 2 design: - -- ADSampling relies on a single full-dimensional random orthogonal projection whose coordinates can be sequentially sampled. -- Stage 2 proposes per-block rotations with blockwise norms and blockwise accumulation. - -A blockwise-rotated representation is not automatically interchangeable with the globally rotated representation assumed by ADSampling's pruning logic. This may still be possible, but it is a research question, not a straightforward integration detail. - -### 5. Citation hygiene is too weak for external review - -Two issues stand out: - -- `[8]` is a prose bundle of repos and issue references rather than an auditable citation. 
-- `[7]` was not publicly discoverable under the cited title during review. - -For a document going to experts, `[8]` should be expanded into explicit entries with: - -- repository / issue / PR URL -- commit SHA or tag if relevant -- workload type: KV attention vs ANN search -- metric: perplexity, recall@k, cosine, etc. -- conclusion actually supported by that source - -If `[7]` is intended as a public citation, it should have a public URL. If it is private, the RFC should not lean on it heavily in externally circulated form. - -### 6. GPU section uses CPU instruction terminology - -The GPU section references `VPDPBUSD`, which is an x86 CPU instruction, not a GPU tensor-core primitive. The section needs either: - -- CPU wording, or -- GPU-native terminology - -Otherwise it looks like a hardware-model mix-up. - -### 7. One worked-example note contradicts the design - -The Stage 2 worked example for `d=768, B=256, k=3` is labeled "zero padding" in the notes column. That should be removed or changed; Stage 2 is explicitly avoiding padding in that case. - -## Secondary Notes - -These items looked good or at least defensible: - -- The Theorem 1 constant appears correctly interpreted as `sqrt(3) * pi / 2`. -- The QJL scale-factor correction appears correct. -- The distinction between QR/Haar rotation for MSE and Gaussian `S` for QJL is correctly emphasized. -- The revised VIBE citation is now correct. - -## Recommended Editorial Strategy - -Before sharing this RFC externally, the safest editorial move is: - -1. Keep the proposal structure. -2. Tighten all empirical claims to exactly what the evidence shows. -3. Replace suggestive superiority language with narrower, falsifiable wording. -4. Mark ADSampling integration as speculative / future investigation. -5. Strengthen citations, especially `[8]`. - -## Proposed Redline - -This redline is intentionally targeted. It focuses on the passages that most need correction before external circulation. - -### 1. 
Summary: narrow the QJL claim - -#### Proposed replacement - -```diff --QJL correction is deferred to a later stage and may ultimately be dropped. --Community findings from 6+ independent TurboQuant implementations consistently --show that MSE-only outperforms MSE+QJL for attention and ANN ranking in --practice [8]. -+QJL correction is deferred to a later stage and may ultimately be dropped. -+Community findings from multiple independent TurboQuant implementations -+consistently show that MSE-only outperforms MSE+QJL for KV-cache attention in -+practice [8]. For ANN ranking and vector-search workloads, the evidence is -+currently less complete, so QJL should remain an empirical question rather than -+a settled conclusion. -``` - -### 2. PQ comparison: remove unsupported superiority language - -#### Proposed replacement - -```diff - TurboQuant trades PQ's flexibility (data-dependent codebooks can exploit - structure) for data-obliviousness (no training, provable bounds, zero indexing - time). --For uniformly distributed embeddings, TurboQuant's analytically optimal --centroids should match or exceed PQ's learned codebooks. For highly structured --data, PQ may still win empirically. -+In return, PQ and OPQ retain a major advantage in expressivity: they learn -+sub-vector codebooks from data rather than applying an analytic scalar -+quantizer. In practice this means TurboQuant is attractive when training-free -+operation, simple deployment, and theoretical guarantees matter most, while PQ -+or OPQ may still win empirically when a learned vector codebook can exploit -+dataset-specific structure. -``` - -### 3. Community QJL section: separate attention from ANN - -#### Proposed replacement - -```diff - ### Community findings on QJL - - Multiple independent TurboQuant implementations have converged on a --significant practical finding: **MSE-only consistently outperforms MSE+QJL for --attention and ANN ranking**. 
The mechanism is a variance-bias tradeoff: --TurboQuant's QJL correction eliminates bias but increases variance, and softmax --attention (and cosine/L2 ranking) amplifies variance more than bias. At the same --total bit budget, allocating all bits to MSE (more centroids, lower variance) --beats splitting between MSE + QJL (fewer centroids + 1-bit correction). This has --been confirmed by 6+ groups across Python, C, and Rust implementations [8]. -+significant practical finding for **KV-cache attention**: MSE-only often -+outperforms MSE+QJL at the same bit budget. The likely mechanism is a -+variance-bias tradeoff: QJL removes bias in raw inner-product estimation but -+adds variance, and the softmax nonlinearity can amplify variance more than it -+penalizes bias. In that setting, allocating all bits to MSE (more centroids, -+lower variance) can beat splitting the budget between MSE + QJL. This behavior -+has been reported by multiple groups across Python, C, and Rust implementations -+[8]. - --This finding strongly supports making MSE-only the default strategy for our --columnar storage use case (ANN search, cosine similarity ranking). -+For ANN search, cosine ranking, and other non-softmax vector-search workloads, -+the evidence is currently less settled. MSE-only is still a reasonable default -+because it is simpler and better supported by the current implementation work, -+but the RFC should treat the ANN question as empirical until evaluated on ANN -+datasets with recall@k and ranking metrics. -``` - -### 4. PDX section: correct the speedup claim - -#### Proposed replacement - -```diff - PDX [4] is a data layout for vector similarity search. The paper (SIGMOD '25) - describes a dimension-major layout within fixed-size blocks of 64 vectors, - enabling the compiler to auto-vectorize the inner distance loop over vectors --rather than dimensions, achieving on average 2× speedups over SIMD-optimized --row-major kernels on modern CPUs. 
The block size of 64 is empirically optimal -+rather than dimensions. In the paper, this yields average speedups of about 40% -+over SIMD-optimized row-major kernels for the direct-kernel comparison, while -+dimension-pruning methods recover much larger gains when paired with the PDX -+layout [4]. The block size of 64 is empirically optimal - across AVX-512, AVX2, and NEON architectures [4]. -``` - -### 5. ADSampling integration: mark as speculative - -#### Proposed replacement - -```diff - **Shared rotation with ADSampling.** Both TurboQuant and ADSampling apply a - random orthogonal rotation to make coordinates independent. If we integrate - ADSampling-style dimension pruning (see Stage 3), the same rotation could serve - both purposes: producing the Beta distribution for quantization AND enabling --hypothesis-testing for early pruning. This would avoid rotating the data twice. --Note that the query must also be rotated at query time with the same rotation --matrix (stored as a shared child); ADSampling already requires this. -+hypothesis-testing for early pruning. However, this is not automatic under the -+Stage 2 block-decomposed design: ADSampling is formulated around a single -+full-dimensional random projection, whereas Stage 2 introduces per-block -+rotations and per-block norm weighting. Reusing one rotation across both systems -+should therefore be treated as a future research direction that requires either -+new analysis or direct empirical validation. If it proves viable, it would avoid -+rotating the data twice. The query would also need to be rotated at query time -+with the same stored transform. -``` - -### 6. Worked examples: fix the contradictory note - -#### Proposed replacement - -```diff --| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; zero padding | -+| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; no padding | -``` - -### 7. 
GPU section: remove the CPU/GPU terminology mix - -#### Proposed replacement - -```diff - At b_mse=8, codes are uint8 indices (0-255). Direct int8 tensor core GEMM --(using codes as the unsigned operand in VPDPBUSD) requires approximately linear -+or byte-dot-product execution on low-precision hardware requires approximately linear - centroids — but at high B the Max-Lloyd centroids are already near-uniform - (the Beta distribution is highly concentrated, approaching Gaussian, for which - high-resolution optimal quantization is approximately uniform). Whether the - existing Max-Lloyd centroids are "linear enough" for hardware dot-product - instructions is an empirical question worth testing before introducing a - separate linear quantization mode. -``` - -If you want to be more explicit, you could instead split this into separate CPU and GPU paragraphs. - -### 8. Reference `[8]`: make it auditable - -#### Proposed replacement - -Replace the current bundled prose citation with something like: - -```diff --[8] Community TurboQuant implementations and findings. Key sources: --tonbistudio/turboquant-pytorch (PyTorch, V3 MSE-only findings), --ggml-org/llama.cpp#20969 (C/C++, quantized attention analysis), --0xSero/turboquant (Triton kernels), vivekvar-dl/turboquant (pip package), --scos-lab/turboquant (reference reproduction). Consensus: MSE-only beats --MSE+QJL for attention and ANN ranking at all tested bit widths. -+[8] Community TurboQuant implementation reports. These sources primarily study -+KV-cache attention rather than ANN search, and should be cited individually -+with exact URLs and workload scope in the final external draft. Representative -+examples include: -+- tonbistudio/turboquant-pytorch, issue #10 and README discussion of V2 -+ (MSE+QJL) vs V3 (MSE-only) behavior on attention and generation. -+- scos-lab/turboquant README discussion of MSE vs Prod/QJL for KV-cache -+ attention workloads. 
-+- 0xSero/turboquant README and validation scripts for paper checks and -+ implementation behavior. -+These sources support a strong claim for KV-cache attention. They do not, by -+themselves, establish the same conclusion for ANN ranking. -``` - -This version is intentionally conservative. If you have additional ANN-specific sources, add them here explicitly and then strengthen the main text accordingly. - -### 9. Reference `[7]`: either publish it or weaken dependence on it - -#### Proposed replacement note - -Not a text diff, but a release recommendation: - -- If `[7]` is public, add a direct URL. -- If `[7]` is private or unstable, reduce dependence on it in externally - circulated prose. - -For example, this sentence is fine if the report is public: - -```diff --The Eviox corrections study [7] identified six material bugs in the paper's -+A third-party implementation review [7] identified six material bugs in the paper's - reference Python implementation. -``` - -But the best fix is still to make the citation resolvable. - -## Optional Stronger Rewrite - -If you want the RFC to sound maximally careful in front of skeptical reviewers, the simplest global substitution is: - -- replace `consistently outperforms` with `has often outperformed` -- replace `consensus` with `reported behavior` -- replace `supports making MSE-only the default for ANN` with `supports evaluating MSE-only first, while keeping ANN ranking as an empirical question` - -That wording preserves the proposal but removes the most attackable overclaims. - -## Suggested Next Pass - -If you want a tighter external-facing RFC, the next revision should: - -1. apply the redline above -2. expand `[8]` into exact citations -3. add one explicit sentence saying which claims are backed by theorem, which by implementation, and which remain hypotheses -4. 
add ANN-specific experiments before claiming ANN superiority for MSE-only over QJL diff --git a/proposed/0033-block-turboquant-review-synthesis-composer2.md b/proposed/0033-block-turboquant-review-synthesis-composer2.md deleted file mode 100644 index 56e03df..0000000 --- a/proposed/0033-block-turboquant-review-synthesis-composer2.md +++ /dev/null @@ -1,171 +0,0 @@ -# Peer review synthesis: RFC 0033 Block-Decomposed TurboQuant with PDX - -**Document reviewed:** `proposed/0033-block-turboquant.md` -**Review date:** 2026-04-03 -**Purpose:** Consolidated findings from a detailed technical review (citations, papers, and spot-checks against arXiv HTML and GitHub). - ---- - -## Executive summary - -The RFC is unusually strong for an implementation plan: staged delivery, explicit approximation boundaries (SORF vs Haar, QJL vs MSE-only), and credible linkage to TurboQuant [1] and PDX [4]. For an expert audience, the highest-impact gaps are **broken or unverifiable citations**, **PDX speedup wording that does not match the PDX abstract**, **under-specified conditions for quantized dot product between two stored columns**, and **the blockwise MSE composition paragraph mixing deterministic algebra with probabilistic bounds**. Addressing those items—and making community claims auditable—would make the document review-resistant. - ---- - -## Citations and bibliographic issues - -### Broken GitHub reference - -- **Finding:** `ggml-org/llama.cpp#20969` returns **404** (issue does not exist or was removed). -- **Action:** Replace with a **resolvable** link (e.g. [issue #20977](https://github.com/ggml-org/llama.cpp/issues/20977) “Feature Request: TurboQuant support,” or [discussion #21155](https://github.com/ggml-org/llama.cpp/discussions/21155)) and quote the **exact** claim about MSE-only vs MSE+QJL. - -### Eviox report [7] - -- **Finding:** “Eviox Tech Report v1.2.0, March 2026” has **no URL or DOI** in the RFC. 
Expert readers cannot verify bugs, Theorem 1 constant discussion, or QJL scale claims against that source. -- **Action:** Publish a stable PDF/link, **or** rephrase to “we verified against reference implementation at commit …” with reproducible steps. - -### Community list [8] - -- **Finding:** A list of repos plus “6+ groups” and “consensus” is **not** literature-grade evidence without commits, experiment definitions, and metrics. -- **Action:** Add a small **table** (source, commit or version, workload, bit width, metric, outcome) or move strong claims to “anecdotal / preliminary.” - -### TurboQuant paper internal references - -- **Lemma 1 / Theorem 2:** arXiv HTML aligns with “marginal density” material and **Definition 1** for QJL scaling; **theorem numbering** may differ in the ICLR 2026 camera-ready PDF. **Action:** Reconcile lemma/theorem numbers with the **final** PDF before wide distribution. - -- **QJL scale (Definition 1):** The paper gives \(Q_{\text{qjl}}^{-1}(\mathbf z) := \frac{\sqrt{\pi/2}}{d}\mathbf S^\top\mathbf z\). The RFC’s contrast of `√(π/(2d))` vs `√(π/2)/d` is **correct** (ratio involves **√d**). - -### PDX [4] speedup claims - -- **Finding:** The PDX **abstract** reports beating horizontal SIMD layouts by **~40%** on average (order **1.4×** end-to-end for that comparison), and **2–7×** when **combining PDX with dimension-pruning** (ADSampling/BSA). The RFC’s blanket “**on average 2×**” for PDX vs row-major **overstates** the abstract’s headline scalar-scan claim unless restricted to a specific figure/setup. -- **Action:** Quote **40%** for the core PDX-vs-horizontal result; cite **2–7×** only for **PDX + pruning** (with section/figure reference when possible). - -### Flash-KMeans [6] - -- **Finding:** Flash-KMeans is a **GPU k-means** paper (assignment/update kernels), not TurboQuant decode. Referring to “following the double-buffered streaming pattern” suggests direct algorithmic lineage. 
-- **Action:** Clarify **analogy** (IO-aware fused kernels), not the same problem or method. - ---- - -## Mathematics and methodology - -### Theorem 1 and related quantities - -- The **dimension-free** MSE bound \(D_{\text{mse}} \le (\sqrt{3}\,\pi/2)\,4^{-b}\) matches the arXiv HTML (intro + Theorem 1 region). The RFC’s **Eviox vs \(\sqrt{3\pi}/2\)** argument is directionally correct: **\(\sqrt{3}\pi/2 \approx 2.72\)** is not **\(\sqrt{3\pi}/2 \approx 1.535\)**. - -- The proof chain also introduces quantities such as \(\mathcal C(f_X,b)\) with a **\(1/d\)** factor in intermediate steps. The RFC can briefly note **\(\mathcal C\)** vs **\(D_{\text{mse}}\)** so readers see the full proof stack was considered. - -### Block decomposition and composed MSE bound - -- The **algebraic** identity partitioning \(\|\mathbf x - \hat{\mathbf x}\|^2/\|\mathbf x\|^2\) by orthogonal blocks is **correct**. - -- The step from per-block **probabilistic** guarantees to a global bound should be stated in terms of **expectations** (linearity) and assumptions on randomization, not as a purely **pointwise** weighted average unless the theorem is worst-case (it is not, as stated). - -- **Conceptual gap:** TurboQuant’s analysis uses **one** global Haar rotation and **high-\(d\)** near-independence across coordinates. **Independent SORF per block** with **smaller \(B\)** may weaken the “coordinates act like independent scalar sources” story even when the **marginal** after Haar in \(\mathbb R^B\) is correct. The RFC already plans empirical validation; **explicitly call out \(B\)-dependence of near-independence**. - -### Centroids and block dimension - -- Centroids must use the **\(B\)-dimensional** marginal (exponent **\((B-3)/2\)**). The RFC states this; good. - -- **Minimum block size:** Global **\(d \ge 3\)** avoids Beta singularities; state that **each block** satisfies **\(B \ge 3\)** under the chosen policy (**\(B \ge 64\)**), so the marginal is well-defined. 
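To make the block-size policy and the exponent claim above concrete, here is a minimal sketch (function names `block_size` and `marginal_exponent` are illustrative, not from the Vortex code):

```python
# Sketch of the Stage 2 rule: B = greatest power of two >= 64 that evenly
# divides d, and the block-level Beta-marginal exponent (B - 3) / 2.

def block_size(d: int):
    """Greatest power-of-two block size B >= 64 that evenly divides d."""
    b = None
    p = 64
    while p <= d:
        if d % p == 0:
            b = p
        p *= 2
    return b  # None means the straggler/padding path is needed (e.g. d = 96)

def marginal_exponent(b: int) -> float:
    """Exponent of the block-level marginal (1 - x^2)^((B-3)/2)."""
    return (b - 3) / 2

for d, expected in [(512, 512), (768, 256), (1536, 512), (3072, 1024)]:
    b = block_size(d)
    assert b == expected
    assert marginal_exponent(b) > 0  # well-defined for every B >= 64

assert block_size(96) is None  # no power of two >= 64 divides 96 (DEEP d=96)
```

Because the policy enforces B ≥ 64, the exponent is always well above the B = 3 singularity, which is the point of the bullet above.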
- -### DCT discussion - -- In the “Why not DCT?” paragraph, the marginal is written with **\((d-3)/2\)**; for per-block discussion, **\((B-3)/2\)** is the relevant exponent to avoid confusion. - ---- - -## Systems and integration - -### Quantized dot product / cosine: two stored columns - -- For **column vs query** re-encoded with the **column’s** rotation and centroids, the story is clear. - -- For **two TurboQuant-encoded columns**, a fast quantized inner product requires **identical** rotation parameters (**bit-identical `mse_rotation_signs`**, same seeds/structure), not only the same **\(B\)** and centroids. The RFC should **require rotation identity** for the two-sided fast path or **mandate exact fallback**. - -### Mixed precision (f64 norms, f32 directions) - -- Generally sound; a **brief** note on numerical ordering or tiny norms avoids pedantic corner-case questions. - -### PDX layout and indexing - -- Implementers will want a **clear mapping** from logical dimension index (spanning TQ blocks) to **PDX transposed offsets**—either a formula or a short diagram. - -### Slice/take with PDX - -- Full **un-transpose to FSL** is simple but can imply **large transient cost** on small slices. Worth noting **worst-case behavior** and optional **64-row-aligned** fast paths. - -### FLOP table - -- Label counts as **heuristic**; real cost is often **memory bandwidth** and constant factors in butterflies. - -### GPU / VPDPBUSD - -- **VPDPBUSD** is a **specific** mixed int8 dot-product idiom, not arbitrary uint8×uint8. Max-Lloyd centroids are **not** naturally constrained to byte-quantized linear scales; treat “linear enough for tensor cores” as a **strong** empirical hypothesis. 
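For the PDX indexing bullet above, one plausible closed-form mapping is easy to state, assuming the paper-style "1 dim × 64 vecs" tiling. This is an illustrative sketch, not the committed Vortex layout:

```python
# Illustrative mapping only: `codes` is row-major [num_vectors][D]; `pdx`
# stores, per 64-vector group, all 64 codes for dimension 0, then all 64
# codes for dimension 1, and so on.

GROUP = 64

def pdx_offset(vec: int, dim: int, d: int) -> int:
    """Flat offset of (vector, dimension) in the transposed PDX buffer."""
    group, lane = divmod(vec, GROUP)
    return group * GROUP * d + dim * GROUP + lane

def to_pdx(codes):
    n, d = len(codes), len(codes[0])
    assert n % GROUP == 0  # a real implementation must handle a ragged tail
    out = [0] * (n * d)
    for v in range(n):
        for j in range(d):
            out[pdx_offset(v, j, d)] = codes[v][j]
    return out

# Round-trip check on a tiny example: 128 vectors, d = 8.
codes = [[(v * 8 + j) % 251 for j in range(8)] for v in range(128)]
flat = to_pdx(codes)
assert flat[pdx_offset(70, 3, 8)] == codes[70][3]
# Within one group, a fixed dimension's 64 codes are contiguous:
start = pdx_offset(64, 3, 8)
assert flat[start:start + 64] == [codes[64 + lane][3] for lane in range(64)]
```

The contiguity assertion at the end is exactly the property the SIMD scan kernel relies on: the inner loop over 64 vectors reads one cache-friendly run per dimension.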
- ---- - -## Experimental plan and datasets - -### Gaussian “pessimistic baseline” - -- For **isotropic** Gaussians, a random orthogonal transform is **distributionally neutral**, but that does not make the baseline “pessimistic” for **all** error modes; it can be **misaligned** with heavy-tailed or clustered embeddings. **Soften** wording to: theory anchor / sanity check, not a proxy for worst-case production. - -### DEEP \(d=96\) - -- Correctly noted: **no** power-of-two **\(\ge 64\)** divides 96, so the RFC’s block rule forces **padding / straggler** path. Good. - -### Popular dimensions - -- Optional: add rows for dimensions such as **2560** or **1280** if the RFC targets “common model dims” broadly. - ---- - -## Compression ratio section - -- **“30% storage improvement”** is easy to misread: the worked example is roughly **29% higher compression ratio** (4.8× → 6.2×) and about **24% fewer compressed bits per vector** for \(d=768\), \(b_{\text{mse}}=5\). **Disambiguate** ratio vs bit reduction. - -- **Shared** centroids and SORF signs: remind readers that shared cost is **amortized over \(N\)**; **small** columns are metadata-sensitive. - ---- - -## Minor editorial nits - -- Prefer **“greatest”** over **“largest”** for “power-of-two that divides \(d\)” (standard math English). - -- PQ row: “8 bits per sub-vector” is a **typical** configuration, not the definition of PQ; qualify as such. - -- “Indexing time: Zero” vs PQ training: fair as **no k-means training**, but **encode-time** work remains; soften **“zero”** to avoid pedantic pushback. - -- **QJL variance scaling (“\(d/B\) times more”):** align wording with **Lemma 4**’s **exact** statement in the PDF (variance of **averaged** estimators, constants). - ---- - -## Positive highlights (worth preserving) - -- Clear **staging** (MSE-only → blocks → PDX → optional QJL). - -- Honest **SORF vs Haar** and **SORF for QJL** vs Gaussian **S**. 
- -- **Theorem 1 constant** clarification vs mistaken \(\sqrt{3\pi}/2\) interpretation. - -- **PDX open-source delta** (SQ8 tiling, ADSampling, zones) is valuable context. - -- **Migration / single array ID** story is clean for a greenfield encoding. - ---- - -## Suggested priority order before external expert send-out - -1. Fix **llama.cpp** link and verify **PDX** speedup sentences against [4]. -2. Tighten **block MSE** subsection (expectations, \(B\)-dependence). -3. Specify **rotation-parameter identity** (or fallback) for **two-column** quantized dot. -4. Make **[7]** and **[8]** **auditable** or soften claims. -5. Add **Flash-KMeans** analogy disclaimer, **compression ratio** disambiguation, **slice/PDX** cost note. - -## Deliverables (this review) - -| File | Purpose | -| ---- | ------- | -| `proposed/0033-block-turboquant-review-synthesis.md` | This document: consolidated findings and recommended actions. | -| `proposed/0033-block-turboquant-revised.md` | Full RFC text with proposed edits applied (does not replace `0033-block-turboquant.md`). | diff --git a/proposed/0033-block-turboquant-revised-composer2.md b/proposed/0033-block-turboquant-revised-composer2.md deleted file mode 100644 index fd37059..0000000 --- a/proposed/0033-block-turboquant-revised-composer2.md +++ /dev/null @@ -1,988 +0,0 @@ -# Block-Decomposed TurboQuant with PDX Layout - -**Authors:** Will Manning -**Status:** Proposal (revised draft — incorporates peer-review edits; see `0033-block-turboquant-review-synthesis.md`) -**Date:** 2026-04-02 - -## Summary - -We propose evolving the [TurboQuant vector quantization encoding][current-impl] -in three stages: - -1. **MSE-only TurboQuant** (immediate): merge the current PR as an MSE-only - encoding. This is a complete, self-contained building block. -2. **Block decomposition** (next): for non-power-of-2 dimensions, split into - blocks of size B = the **greatest** power-of-2 ≥ 64 that divides d. 
For - power-of-2 dimensions, B = d (single block, same as current). Per-block - norms stored as internal children. -3. **PDX layout** (later): transpose codes into dimension-major order within - groups of 64 vectors for SIMD scan performance. - -QJL correction is deferred to a later stage and may ultimately be dropped. -Multiple community implementations report that MSE-only often outperforms -MSE+QJL for attention and ANN ranking in practice [8]. **Citation hygiene:** [8] -should be upgraded to pinned commits and a short results table before the RFC -is treated as establishing external “consensus.” - -[current-impl]: https://github.com/vortex-data/vortex/pull/7167 - -## Background - -### TurboQuant - -TurboQuant [1] is a lossy vector quantization algorithm for high-dimensional -embeddings. It works by: - -1. Randomly rotating a unit-norm vector so that each coordinate follows a known - marginal distribution — specifically `(1 - x²)^((d-3)/2)` on [-1, 1], a - concentrated Beta-type marginal on coordinates (see [1]; lemma/section - numbering: verify against the ICLR 2026 / final PDF). -2. Applying an MSE-optimal scalar quantizer (Max-Lloyd centroids) independently - to each coordinate. -3. Optionally adding a 1-bit QJL (Quantized Johnson-Lindenstrauss) correction - on the residual for unbiased inner product estimation (see the unbiased - **TurboQuant_prod** result in [1]; verify theorem number in the proceedings - PDF vs arXiv). - -The paper prescribes a full random orthogonal rotation (QR decomposition of a -matrix with i.i.d. N(0,1) entries, yielding a Haar-uniform orthogonal matrix) -for the MSE stage — O(d²) storage and O(d²) per-vector. For the QJL stage, the -paper uses a random Gaussian projection matrix S with i.i.d. N(0,1) entries (not -an orthogonal rotation); this distinction matters for the unbiasedness proof. 
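The three MSE-stage steps above can be sketched end-to-end. This is a toy pure-Python illustration: a Gram-Schmidt orthonormalization of a Gaussian matrix stands in for the paper's QR-of-Gaussian rotation, and the centroid values are placeholders, not the real Max-Lloyd codebook.

```python
# Toy sketch of the MSE stage: random orthogonal rotation, then
# per-coordinate nearest-centroid quantization, then decode via the
# transpose rotation. Centroids here are placeholders, not Max-Lloyd.
import math, random

def random_rotation(d, rng):
    """Orthonormalize a Gaussian matrix (Gram-Schmidt) -> random rotation."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        n = math.sqrt(sum(a * a for a in v))
        if n > 1e-9:
            rows.append([a / n for a in v])
    return rows

def encode(x, rot, centroids):
    norm = math.sqrt(sum(a * a for a in x))
    unit = [a / norm for a in x]
    rotated = [sum(r * u for r, u in zip(row, unit)) for row in rot]
    codes = [min(range(len(centroids)), key=lambda i: (c - centroids[i]) ** 2)
             for c in rotated]
    return norm, codes

def decode(norm, codes, rot, centroids):
    approx = [centroids[i] for i in codes]
    # The inverse of an orthogonal matrix is its transpose.
    unit = [sum(rot[r][j] * approx[r] for r in range(len(rot)))
            for j in range(len(rot))]
    return [norm * u for u in unit]

rng = random.Random(42)
d = 16
rot = random_rotation(d, rng)
centroids = [-0.75, -0.45, -0.25, -0.1, 0.1, 0.25, 0.45, 0.75]  # toy 3-bit
x = [rng.gauss(0, 1) for _ in range(d)]
norm, codes = encode(x, rot, centroids)
xhat = decode(norm, codes, rot, centroids)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, xhat))) / norm
assert rel_err < 0.9  # loose sanity bound for a toy 3-bit codebook
```

The production path replaces `random_rotation` with the O(d log d) SORF transform and derives centroids from the Beta marginal, but the encode/decode dataflow (store norm + codes, reconstruct via transpose rotation) is the same shape.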
- -**Comparison to Product Quantization.** TurboQuant's block decomposition (Stage -2 of this RFC) is structurally similar to Product Quantization (PQ) [9]: both -partition a vector into sub-vectors and quantize each independently. The key -differences are: - -| | TurboQuant | PQ | -| ---------------------- | --------------------------------------------------------------- | -------------------------------------------------------- | -| Quantization type | Scalar (per-coordinate, after rotation) | Vector (per-sub-vector, learned codebook) | -| Codebook | Analytically derived from Beta distribution; **data-oblivious** | Learned via k-means on training data; **data-dependent** | -| Rotation | Random orthogonal within each sub-vector | Typically none (OPQ [10] adds a learned rotation) | -| Theoretical guarantees | Provable MSE bound (Theorem 1 [1]) | Empirical quality only | -| Indexing / training | No k-means or learned codebook training (centroids from theory) | Requires training pass over data for codebooks | -| Bits per sub-vector | Scalar: b bits per coordinate | Vector: common choice e.g. 8 bits × m subquantizers (not universal) | - -TurboQuant trades PQ's flexibility (data-dependent codebooks can exploit -structure) for data-obliviousness (no training, provable bounds, no offline -index-training phase). Encode-time work (rotation + quantization) still -applies. For uniformly distributed embeddings, TurboQuant's analytically optimal -centroids should match or exceed PQ's learned codebooks. For highly structured -data, PQ may still win empirically. - -### Current Vortex implementation - -Our [current implementation][current-impl] (Rust, in the `vortex-tensor` crate) -implements TurboQuant as a Vortex array encoding that compresses -`FixedSizeList` arrays — the storage format of `Vector` and -`FixedShapeTensor` extension types. 
Key design choices and characteristics: - -**Rotation.** Instead of the paper's O(d²) QR rotation, we use a 3-round -Structured Orthogonal Random Features (SORF) transform `HD₃·HD₂·HD₁` [5] for -both the MSE rotation and the QJL projection, giving O(d) storage (3d sign bits, -bitpacked) and O(d log d) per-vector. The rotation signs are stored as a -bitpacked child array rather than recomputed from a seed at decode time. The -3-round SORF was introduced for kernel approximation [5] and approximates a -random orthogonal matrix. It is distinct from the single-round SRHT (`R·H·D`) -analyzed by Tropp [3] and the FJLT (`P·H·D`) of Ailon-Chazelle [2], both of -which are dimensionality-reducing projections rather than rotation -approximations. - -**Centroids.** Max-Lloyd centroids are computed via numerical integration -(trapezoid rule, 1000 points per interval) of the marginal Beta distribution at -the padded dimension, using the `HalfIntExponent` type for exact integer/half- -integer exponent arithmetic. Centroids are cached in a global `DashMap` keyed by -`(dimension, bit_width)` and stored as a shared `PrimitiveArray` child. - -**Array structure.** The `TurboQuantArray` stores up to 7 child slots: codes -(`FixedSizeListArray`, one per vector, list_size = padded_dim), norms -(`PrimitiveArray`), centroids (shared), MSE rotation signs (shared, -bitpacked), and optionally 3 QJL children (signs, residual norms, QJL rotation -signs). Codes are stored as u8 centroid indices; the cascade compressor -(BitPacked encoding) handles packing to the actual bit width on disk. - -**Compute pushdowns.** Slice and take propagate to per-row children (codes, -norms) while sharing rotation signs and centroids. Quantized cosine similarity -and dot product operate directly on codes and centroids without decompression. -L2 norm returns the stored norm directly (O(1) readthrough). 
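The 3-round SORF rotation described above can be illustrated with an in-place fast Walsh-Hadamard transform. This sketch uses unpacked ±1 sign vectors and explicit per-round normalization; the production implementation stores bitpacked signs as a child array.

```python
# Hedged sketch of HD3·HD2·HD1: three rounds of (sign flip, FWHT, rescale).
# Each normalized round is orthogonal, so the composite preserves L2 norms.
import math, random

def fwht(v):
    """In-place fast Walsh-Hadamard transform; len(v) must be a power of 2."""
    h = 1
    n = len(v)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2

def sorf(x, sign_rounds):
    """Apply HD once per sign vector; H/sqrt(n) keeps each round orthogonal."""
    v = list(x)
    n = len(v)
    for signs in sign_rounds:
        v = [s * a for s, a in zip(signs, v)]
        fwht(v)
        v = [a / math.sqrt(n) for a in v]
    return v

rng = random.Random(7)
n = 256
signs = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(3)]
x = [rng.gauss(0, 1) for _ in range(n)]
y = sorf(x, signs)
# Orthogonal transforms preserve the L2 norm:
nx = math.sqrt(sum(a * a for a in x))
ny = math.sqrt(sum(a * a for a in y))
assert abs(nx - ny) / nx < 1e-9
```

Each round is O(n log n) via the butterfly loop, which is where the "3 × log2(B) butterfly stages" accounting elsewhere in this document comes from.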
- -**Compression scheme.** `TurboQuantScheme` implements the `Scheme` trait for the -BtrBlocks cascading compressor. It matches `Vector` and `FixedShapeTensor` -extension arrays with non-nullable float elements and dimension ≥ 3, using the -default config (5-bit QJL = 4-bit MSE + 1-bit QJL, seed 42). - -**Input handling.** All float types (f16, f32, f64) are converted to f32 before -quantization. Per-vector L2 norms are computed and stored as f32. Non-power-of-2 -dimensions are zero-padded to the next power of 2 for SORF compatibility. The -minimum dimension is 3 (d=2 causes a singularity in the Beta distribution -exponent). - -### Reference implementation bugs - -The Eviox corrections study [7] identified six material bugs in the paper's -reference Python implementation. **Readers:** [7] should include a stable URL, -DOI, or public artifact; until then, treat detailed Eviox-only claims as -internally verified reproduction notes. The most critical is a mathematical error in -the QJL scale factor: the reference code used `√(π/(2d))` instead of -`√(π/2)/d` (Definition 1 in [1]), differing by a factor of √d (≈11× at d=128). -Our [current implementation][current-impl] uses the correct formula -(`sqrt(FRAC_PI_2) / padded_dim` in Rust), so this bug does **not** affect us. - -Other notable Eviox findings: (a) the reference code recomputes codebooks at -every instantiation (we cache in a `DashMap`); (b) the reference uses float16 -for codebook distance computation, causing misassignment at small centroid -spacings (we cast to f32 before quantization). See [7] for the full list. - -### Theorem 1 constant - -There is an ambiguity in the paper's notation for the MSE bound constant. The -formal proof gives `(√3 · π / 2) · 4^{-b}` where the constant √3·π/2 ≈ 2.72. 
-The Eviox report [7] interprets the notation as `√(3π)/2 ≈ 1.535`, but this is -incorrect: the measured distortion values from the paper (b=2: 0.117, b=3: 0.03) -exceed the putative `√(3π)/2` bound (b=2: 0.096, b=3: 0.024), confirming that -2.72 is the correct constant. The paper's "explicit values" (0.36, 0.117, 0.03, -0.009) are the actual computed distortion of the optimal quantizer, not the -bound itself — they are well below the 2.72/4^b bound. - -### Community findings on QJL - -Several independent TurboQuant implementations report a recurring practical -pattern: **MSE-only often outperforms MSE+QJL for attention and ANN-style -ranking** under fixed bit budgets. A plausible mechanism is variance–bias -tradeoff: QJL reduces bias but adds variance, and softmax attention (and -cosine/L2 ranking) can amplify variance. At the same total bit budget, allocating -all bits to MSE (more centroids, lower quantization variance) sometimes beats -splitting between MSE + QJL (fewer MSE bits + 1-bit correction). See [8] and the -summary’s note on auditable citations. - -This pattern supports making **MSE-only the default** for our columnar storage -use case (ANN search, cosine similarity ranking), pending our own benchmarks. - -### Current limitations - -The SORF requires power-of-2 input dimension. For non-power-of-2 dimensions -(e.g., 768-d embeddings), the input is zero-padded to the next power of 2 -(1024). This causes: - -- **33% storage overhead** for 768-d vectors: 1024 codes stored vs. 768 useful - (equivalently, 25% of stored codes are wasted on zero-padded dimensions). -- **No scan-optimized layout**: row-major code storage prevents SIMD-over-vectors - distance computation. - -### PDX - -PDX [4] is a data layout for vector similarity search. The paper (SIGMOD '25) -describes a dimension-major layout within fixed-size blocks of 64 vectors, -enabling the compiler to auto-vectorize the inner distance loop over vectors -rather than dimensions. 
The PDX abstract reports **~40%** faster end-to-end -search than SIMD-optimized **horizontal** storage in that baseline comparison -(order **1.4×**), not a blanket “2×” headline. **Separately**, combining PDX -with dimension-pruning (ADSampling, BSA) restores **2–7×** benefits over -SIMD-optimized linear scans in their reported settings [4]. The block size of -64 is empirically strong across AVX-512, AVX2, and NEON architectures [4]. - -**PDX implementation evolution.** The [open-source implementation][pdx-impl] -has evolved beyond the paper in several ways relevant to this RFC: - -- **8-bit scalar quantization** (`IndexPDXIVFTreeSQ8`): Maps floats to 0-255 via - linear min-max scaling. The int8 layout differs from float32: dimensions are - packed in groups of 4 ("4 dims × 16 vecs") to leverage hardware dot-product - instructions (VPDPBUSD on x86, UDOT/SDOT on ARM) that process 4 byte pairs - per operation. This is a different tiling than the paper's "1 dim × 64 vecs." -- **ADSampling with random rotation**: The pruner applies a random orthogonal - rotation (QR of Gaussian, or DCT when FFTW is available) to the entire - collection as a preprocessing step. This makes coordinates approximately - independent, enabling dimension-by-dimension hypothesis testing for early - pruning. The rotation serves a similar purpose to TurboQuant's rotation — - making the coordinate distribution known — but for pruning rather than - quantization. -- **Dimension zones**: Consecutive dimensions are grouped into zones; at query - time, zones are ranked by "distance-to-means" and the most discriminative - zones are scanned first, enabling faster pruning. -- **Future: 1-bit vectors** are mentioned as planned. - -**Implications for our design.** The PDX paper's float32 layout ("1 dim × 64 -vecs") maps cleanly to our quantized-code scan kernel, where the inner loop -gathers from a centroid-product distance table over 64 vectors. 
However, if we -pursue direct int8 arithmetic (b_mse=8 with linear centroids, see GPU section), -the "4 dims × 16 vecs" int8 layout from the PDX implementation may be more -appropriate, as it enables hardware dot-product instructions. - -Additionally, ADSampling's dimension-pruning approach is complementary to -TurboQuant's block structure: when scanning with block decomposition, the pruner -could skip entire TQ blocks (B dimensions at a time) if the partial distance -already exceeds the candidate threshold. This combines the storage efficiency of -quantization with the computational savings of early termination. - -[pdx-impl]: https://github.com/cwida/PDX - -## Proposal - -### Block size strategy - -For each dimension d, choose B = the **greatest** power-of-2 ≥ 64 that evenly -divides d. This eliminates stragglers entirely for common embedding dimensions. -Each block uses **B ≥ 64**, hence **B ≥ 3**, so the block-level Beta marginal -(exponent **(B−3)/2**) is well-defined (global **d ≥ 3** remains required for -the single-block padded path): - -| Dimension d | Block size B | Blocks k | Notes | -| ----------- | ------------ | -------- | --------------------------- | -| 512 | 512 | 1 | Single block (= current TQ) | -| 768 | 256 | 3 | Greatest dividing power-of-2 | -| 1024 | 1024 | 1 | Single block | -| 1536 | 512 | 3 | | -| 2048 | 2048 | 1 | Single block | -| 3072 | 1024 | 3 | | -| 4096 | 4096 | 1 | Single block | - -**Key observations:** - -- **Power-of-2 dimensions** (512, 1024, 2048, 4096) use B = d — a single block, - identical to the current implementation except with PDX underneath (Stage 3). - No block decomposition overhead, no per-block norms. These dimensions are - already well-served by the current design. -- **Non-power-of-2 dimensions** (768, 1536, 3072) decompose into k=3 blocks at - B=256 or B=512. Zero padding waste. Each block has its own SORF rotation and - shares a single centroid set. 
-- **Stragglers are eliminated** for all common embedding dimensions. Dimensions - that are not multiples of 64 (e.g., 100, 200) would need straggler handling, - but these are rare in practice for modern model architectures. -- **The SORF approximation at B=256+ is expected to be adequate**: 3 rounds at - B=256 provides 24 butterfly stages, and at B=512 provides 27 — both comparable - to the current B=1024 (30 stages). This needs empirical validation; see - Experimental plan. - -### Stage 1: MSE-only TurboQuant (immediate — split from current PR) - -Split the [current PR][current-impl] to extract and merge the MSE-only subset. -The QJL code can be preserved on a separate branch for Phase 4. - -**Changes vs. current PR:** - -| Aspect | Current PR | Stage 1 | -| -------------- | ------------------------------------------- | ----------------------------------------------------- | -| QJL support | Full (encode, decode, QJL slots, QJL tests) | **Removed** | -| Array slots | 7 (4 MSE + 3 QJL) | **4** (codes, norms, centroids, rotation_signs) | -| Scheme default | 5-bit QJL (4-bit MSE + 1-bit QJL) | **5-bit MSE-only** (32 centroids) | -| Norms dtype | Always f32 | **Same-or-wider**: f64 for f64 input, f32 for f32/f16 | -| Metadata | `has_qjl: bool` | **Removed** (always MSE-only) | - -**Unchanged from current PR:** SORF rotation, Max-Lloyd centroids, -zero-padding for non-power-of-2, slice/take/scalar_at pushdowns, quantized -cosine similarity and dot product, compression scheme integration, minimum dim=3. - -**Added to metadata (for forward compat):** `block_size: u32` (always = -padded_dim), `num_blocks: u32` (always = 1). These fields are inert in Stage 1 -but enable Stage 2 decoders to read Stage 1 files. (PDX is handled via the -codes child type, not a metadata flag — see Stage 3.) - -This is a complete, useful encoding for all dimensions. Power-of-2 dimensions -have zero padding waste; non-power-of-2 dimensions have the padding overhead -described above. 
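The storage arithmetic for the running d = 768, b_mse = 5 example can be checked directly. Bit counts below exclude shared centroid and sign metadata, which amortizes over N:

```python
# Arithmetic check of the worked d=768, b_mse=5 numbers: padded Stage 1
# layout vs the Stage 2 block-decomposed layout.
d, b_mse, norm_bits = 768, 5, 32
raw_bits = d * 32                       # f32 input

padded = 1024                           # next power of two for the SORF
stage1_bits = padded * b_mse + norm_bits
assert stage1_bits == 5152
assert round(raw_bits / stage1_bits, 1) == 4.8

k, B = 3, 256                           # greatest power of two >= 64 dividing 768
stage2_bits = k * B * b_mse + k * norm_bits   # codes + per-block norms
assert stage2_bits == 3936
assert round(raw_bits / stage2_bits, 1) == 6.2

# "~30%" is a compression-ratio improvement; the bit reduction is smaller:
assert round((6.2 / 4.8 - 1) * 100) == 29
assert round((1 - stage2_bits / stage1_bits) * 100) == 24
```

This also makes the ratio-vs-bits disambiguation concrete: 4.8× → 6.2× is a ~29% higher ratio but only ~24% fewer compressed bits per vector.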
- -### Stage 2: Block decomposition - -For non-power-of-2 dimensions, split into blocks of size B (as determined by the -table above). Each full block gets an independent B-dim SORF rotation. - -**Changes vs. Stage 1:** - -| Aspect | Stage 1 | Stage 2 | -| --------------------- | ------------------------------------ | ---------------------------------------------------------------------------- | -| Block count | k = 1 (single block at padded_dim) | **k = d/B** (multiple blocks, no padding) | -| SORF dimension | padded_dim (e.g., 1024 for d=768) | **B** (e.g., 256 for d=768) | -| Rotation signs | Single set, len = 3 × padded_dim | **k sets**, len = k × 3 × B | -| Centroids | Computed for padded_dim distribution | **Computed for B-dim distribution** (different codebook!) | -| Norms child | `PrimitiveArray`, 1 per vector | **`PrimitiveArray` (k=1) or `FixedSizeListArray` (k>1)**, same dtype F | -| Codes list_size | padded_dim | **k × B** (= d for no-straggler dims) | -| Scheme compress() | Pad → single SORF → quantize | **Choose B → split → per-block normalize/rotate/quantize** | -| Quantized dot product | Single sum over padded_dim centroids | **Per-block weighted sum** (Σ_k norm_a_k · norm_b_k · unit_dot_k) | -| L2 norm readthrough | O(1) — return stored norm | **O(k)** — compute √(Σ_k norm_k²) | -| Zero-padding waste | Up to 33% (768→1024) | **Zero** for common dims | - -**Unchanged from Stage 1:** SORF construction (3-round HD), Max-Lloyd algorithm, -f32 internal quantization, slice/take semantics (per-row data sliced, shared -data cloned), bitpacked rotation sign storage, compression scheme trait. - -**For power-of-2 dimensions**: B = d, k = 1. The encoding produces an identical -wire format to Stage 1 (single norm, single SORF, single codes block). A Stage 2 -encoder writing k=1 data is fully backward-compatible with Stage 1 decoders. 
- -**Key design properties:** - -- **Self-contained.** The TurboQuant array handles block splitting, per-block - normalization, rotation, and quantization internally. No parent cooperation - is needed. -- **One shared centroid set** for all blocks at the same B-dim distribution. -- **Per-block SORF rotation signs.** Each block's SORF is independent (different - seed). Signs are 3 × B bits per block. - -#### Norm architecture - -Per-block norms are stored as an **internal child** of the TurboQuant array: - -- For k = 1 (power-of-2 dims): `PrimitiveArray` with len = num_rows - (identical to Stage 1's single-norm layout). -- For k > 1: `FixedSizeListArray` with list_size = k, len = num_rows. - -The norm dtype `F` matches or widens the input element type: - -| Input dtype | Norm dtype | Rationale | -| ----------- | ---------- | ---------------------------------------------- | -| f16 | f32 | f16 has insufficient range/precision for norms | -| f32 | f32 | Same type | -| f64 | f64 | Preserve full precision | - -Norms are stored as plain child arrays; the cascading compressor handles -secondary encoding (ALP, Pco, etc.). - -Note: centroids and quantization always operate in f32 internally (the -[current implementation][current-impl] converts all input to f32 before -quantization). For f64 input, decode produces f32 unit-direction reconstructions -scaled by f64 norms — a mixed-precision multiply that preserves norm precision; -use numerically stable ordering (e.g. `norm * direction`) and the existing -zero-block fast path for subnormal edge cases. - -#### Zero-norm sub-vectors - -When splitting a vector into B-dim blocks, some blocks may have zero norm. The -encoding handles ‖xₖ‖ = 0 explicitly: skip rotation and quantization, store -norm = 0, decode as all zeros. 
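The split-plus-normalize step with the zero-norm fast path can be sketched as follows (illustrative helper, not Vortex API; the real encoder then applies the per-block SORF rotation and centroid quantization to each unit direction):

```rust
/// Split x (len = k * b) into b-dim blocks and compute per-block norms
/// and unit directions, handling the zero-norm case explicitly.
fn split_and_normalize(x: &[f32], b: usize) -> (Vec<f32>, Vec<Vec<f32>>) {
    let mut norms = Vec::new();
    let mut units = Vec::new();
    for block in x.chunks_exact(b) {
        let n = block.iter().map(|v| v * v).sum::<f32>().sqrt();
        norms.push(n);
        if n > 0.0 {
            units.push(block.iter().map(|v| v / n).collect());
        } else {
            // Zero-norm block: skip rotation/quantization entirely,
            // store norm = 0, decode as all zeros.
            units.push(vec![0.0; b]);
        }
    }
    (norms, units)
}
```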
- -#### Theoretical MSE bound - -The paper's MSE bound (Theorem 1 in [1]) is stated for **unit** \(\mathbf x \in -S^{d-1}\) with \(D_{\text{mse}} := \mathbb{E}\|\mathbf x - \hat{\mathbf x}\|_2^2\) -(and equals \(\mathbb{E}[\|\mathbf x - \hat{\mathbf x}\|^2/\|\mathbf x\|^2]\) -in that case): - -``` -E[‖x - x̂‖²] ≤ (√3 · π / 2) / 4^b ≈ 2.72 / 4^b (x unit norm; same as normalized MSE) -``` - -The proof chain also uses intermediate quantities (e.g. \(\mathcal C(f_X,b)\)) -that carry a **1/d** factor in some steps; the headline **\(D_{\text{mse}}\)** -bound above is the dimension-free form quoted in the abstract [1]. - -**Crucially, Theorem 1 is proved for true random orthogonal matrices (QR of -Gaussian), not SORF.** Our SORF is an approximation. The bound holds exactly -only with a true random orthogonal rotation or with empirical SORF validation -(see Experimental plan). - -**Blockwise composition.** For an orthogonal partition into blocks, -\(\|\mathbf x - \hat{\mathbf x}\|^2/\|\mathbf x\|^2 = \sum_k -(\|\mathbf x_k\|^2/\|\mathbf x\|^2)\, -(\|\mathbf x_k - \hat{\mathbf x}_k\|^2/\|\mathbf x_k\|^2)\) holds **exactly** -as algebra. To lift Theorem 1’s **probabilistic** guarantee to the whole -vector, state the conclusion in terms of **expectations** and the randomness -model (e.g. independent rotations per block), not as a pointwise inequality -unless a worst-case theorem is invoked. **Independence / near-independence:** -TurboQuant’s original analysis leverages **high-\(d\)** near-independence of -coordinates after **one** global rotation; with **smaller \(B\)**, coordinate -dependence after rotation may strengthen even when marginals match—this is an -additional reason the experimental plan compares block sizes and SORF rounds. - -The actual MSE may depend on block dimension B: at larger B the coordinate -distribution is more concentrated (variance ~1/B), giving the Max-Lloyd -quantizer more to exploit. See Experimental plan. 
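Plugging the experimental plan's bit widths into the headline bound gives concrete distortion targets for the Gaussian-baseline validation (values to two significant figures; these are upper bounds for a true random orthogonal rotation, so empirical SORF results may differ):

```latex
D_{\text{mse}}(b) \le \frac{\sqrt{3}\,\pi/2}{4^{b}} \approx \frac{2.72}{4^{b}}:
\qquad
\begin{array}{ll}
b = 2: & 2.72/16 \approx 1.7 \times 10^{-1} \\
b = 3: & 2.72/64 \approx 4.3 \times 10^{-2} \\
b = 4: & 2.72/256 \approx 1.1 \times 10^{-2} \\
b = 5: & 2.72/1024 \approx 2.7 \times 10^{-3} \\
b = 8: & 2.72/65536 \approx 4.2 \times 10^{-5}
\end{array}
```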
- -**SORF approximation.** The 3-round SORF `HD₃·HD₂·HD₁` [5] provides log₂(B) -butterfly stages per round × 3 rounds = 3·log₂(B) total (18 at B=64, 24 at -B=256, 27 at B=512). -This is a rough heuristic for mixing quality — [5] does not analyze convergence -rate as a function of rounds × dimension. Empirical validation is needed. - -**Fallback: dense rotation.** If SORF proves insufficient at the chosen B, use a -B × B random orthogonal matrix (QR of Gaussian). Storage at B=256: 256 KB per -block. For d=768 with k=3: 768 KB total. Amortizes for large columns (100K+ -vectors). Each block must have an **independent** rotation matrix. - -**Why not DCT?** The PDX implementation [pdx-impl] uses DCT (via FFTW) as a fast -rotation for ADSampling. DCT is O(B log B) and invertible, but it is a **fixed -structured transform**, not a random rotation — it does not produce the Beta -marginal distribution `(1-x²)^((B-3)/2)` in block dimension **B** that TurboQuant's -Max-Lloyd centroids are optimized for. ADSampling only needs approximate coordinate independence -(for hypothesis-testing pruning), so DCT suffices there. TurboQuant needs a -specific known marginal distribution, so only random orthogonal rotations (QR or -SORF) are suitable. - -**Shared rotation with ADSampling.** Both TurboQuant and ADSampling apply a -random orthogonal rotation to make coordinates independent. If we integrate -ADSampling-style dimension pruning (see Stage 3), the same rotation could serve -both purposes: producing the Beta distribution for quantization AND enabling -hypothesis-testing for early pruning. This would avoid rotating the data twice. -Note that the query must also be rotated at query time with the same rotation -matrix (stored as a shared child); ADSampling already requires this. - -#### Quantized-domain operations - -All quantized operations read per-block norms from the internal child array: - -- **L2 distance**: `‖a-b‖² = Σ_k ‖aₖ‖² + Σ_k ‖bₖ‖² - 2·Σ_k ‖aₖ‖·‖bₖ‖· -unit_dotₖ`. 
Primary ANN metric; reuses per-block dot product and norms. -- **Dot product**: ` ≈ Σ_k ‖aₖ‖·‖bₖ‖ · Σ_j centroids[code_aₖ[j]] · -centroids[code_bₖ[j]]`. -- **Cosine similarity**: `cos(a,b) ≈ dot(a,b) / (‖a‖·‖b‖)` where - `‖a‖ = √(Σ_k ‖aₖ‖²)`. -- **L2 norm**: `√(Σ_k ‖xₖ‖²)`. O(k) per vector — a regression from the - current O(1) single-norm readthrough, but modest. - -#### Encoding algorithm - -``` -Input: x ∈ ℝ^d, b_mse bits per coordinate, block_size B -k = d / B (exact division, no straggler for chosen B) -num_centroids = 2^b_mse - -# Block split and normalize -for i in 0..k: - xᵢ = x[i*B .. (i+1)*B] - nᵢ = ‖xᵢ‖ - if nᵢ > 0: - ûᵢ = xᵢ / nᵢ - else: - ûᵢ = zeros(B) - -# MSE stage (per block, SORF rotation) -for i in 0..k: - if nᵢ > 0: - rᵢ = SORFᵢ(ûᵢ) - cᵢ[j] = nearest_centroid(rᵢ[j]) - else: - cᵢ[j] = 0 - -Store (all as internal children): - codes (k × B per vector), norms (k per vector), - centroids (2^b_mse, shared), SORF signs (k × 3 × B, shared) -``` - -#### Decoding algorithm - -``` -for i in 0..k: - r̂ᵢ[j] = centroids[cᵢ[j]] - ûᵢ = SORF⁻¹ᵢ(r̂ᵢ) - x̂ᵢ = nᵢ × ûᵢ (nᵢ read from internal norms child) -x̃ = concat(x̂₀, ..., x̂ₖ₋₁) -``` - -### Stage 3: PDX dimension-major layout - -Introduce a new `PDXArray` encoding type that wraps any `FixedSizeListArray` -with a dimension-major layout within groups of 64 vectors [4]. PDXArray is -**not TurboQuant-specific** — it is a general-purpose layout optimization for -any FixedSizeList of scalar elements (raw float vectors, scalar-quantized -vectors, TurboQuant codes, etc.). - -**Changes vs. 
Stage 2:** - -| Aspect | Stage 2 | Stage 3 | -| ---------------- | ------------------------------------------------ | ------------------------------------------------------------------------------- | -| Codes child type | `FixedSizeListArray` | **`PDXArray`** (wraps FSL with transposed layout) | -| Codes detection | N/A (codes always FSL) | **TQ checks child type**: FSL → row-major decode, PDXArray → un-transpose first | -| Distance kernel | Per-vector loop with per-element centroid lookup | **SIMD-friendly 64-vector inner loop with distance-table lookup** | -| Decode path | Direct inverse SORF per vector | **PDXArray.to_fsl() first**, then inverse SORF | - -**Unchanged from Stage 2:** Block size B, centroid computation, norm storage, -SORF rotation, all encoding logic. The encode path produces row-major codes -(FSL), then the compressor wraps them in a PDXArray; the decode path converts -PDXArray back to FSL then decodes. - -**PDXArray design:** - -``` -PDXArray (general-purpose dimension-major layout for FixedSizeList) -├── metadata: { list_size, chunk_size (= 64) } -├── elements: PrimitiveArray # transposed: 64 values per dim, contiguous -├── validity: ... # same as FSL validity -``` - -- `PDXArray::try_new(fsl)` — transposes a FixedSizeListArray into PDX layout -- `PDXArray::to_fsl()` — un-transposes back to row-major FSL (for decode, - scalar_at, or non-aligned slice/take) -- `PDXArray::elements_for_dim(dim, chunk)` — O(1) access to a contiguous slice - of 64 values for one dimension within one chunk. **Indexing:** logical code - index for global dimension \(g \in [0, d)\) maps to TurboQuant block - \(t = \lfloor g/B \rfloor\), within-block dimension \(j = g \bmod B\); the - PDX transpose lays out dimension-major runs of 64 values per **global** - dimension in order \(g = 0, \ldots, d-1\) (spanning TQ blocks contiguously in - code space). -- Slice/take: un-transpose to FSL (simplest). Preserving PDX layout is possible - only for 64-vector-aligned ranges. 
**Cost note:** naive un-transpose can be - \(O(\text{chunk size} \times d)\) per slice; document worst-case behavior and - consider 64-row-aligned fast paths for hot scans. -- The cascade compressor treats PDXArray as a valid encoding of FSL-typed data. - -**Benefits of PDXArray as a separate type:** - -- PDX logic tested and maintained independently of TurboQuant -- Other encodings (raw float vectors, scalar quantization, future encodings) - get PDX scan performance for free -- TurboQuant doesn't need an `is_pdx` metadata flag — it checks its codes - child's type at runtime -- The distance kernel operates on PDXArray's dimension-contiguous slices - -Within each 64-vector chunk, codes are stored dimension-major: - -``` -TQ block 0, dim 0: [v0 v1 v2 ... v63] -TQ block 0, dim 1: [v0 v1 v2 ... v63] -... -TQ block 0, dim (B - 1): [v0 v1 v2 ... v63] -TQ block 1, dim 0: [v0 v1 v2 ... v63] -... -``` - -The inner SIMD loop (64 vectors) has no inter-vector dependencies. TQ block -boundaries only affect where norm weighting occurs — they don't affect the -transpose. 
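The chunk-level layout conversion itself is a plain transpose. A sketch of the round trip (illustrative only — the real `PDXArray` also carries validity and must handle a partial tail chunk):

```rust
const CHUNK: usize = 64;

/// Row-major codes[v * d + g] -> dimension-major pdx[g * CHUNK + v]
/// for one 64-vector chunk of d global dimensions.
fn to_pdx(codes: &[u8], d: usize) -> Vec<u8> {
    assert_eq!(codes.len(), CHUNK * d);
    let mut out = vec![0u8; codes.len()];
    for v in 0..CHUNK {
        for g in 0..d {
            out[g * CHUNK + v] = codes[v * d + g];
        }
    }
    out
}

/// Inverse transpose, used for decode, scalar_at, and non-aligned slices.
fn from_pdx(pdx: &[u8], d: usize) -> Vec<u8> {
    assert_eq!(pdx.len(), CHUNK * d);
    let mut out = vec![0u8; pdx.len()];
    for g in 0..d {
        for v in 0..CHUNK {
            out[v * d + g] = pdx[g * CHUNK + v];
        }
    }
    out
}
```

In this layout `elements_for_dim(g, chunk)` is just the contiguous slice `&pdx[g * CHUNK .. (g + 1) * CHUNK]`.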
-
-**Quantized distance kernel (dot product):**
-
-```rust
-let dist_table = precompute_product_table(&centroids);
-// At b_mse=4: 16×16 = 256 floats = 1KB, fits in L1
-
-let mut distances = [0.0f32; 64];
-let mut unit_dots = [0.0f32; 64];
-let mut offset = 0;
-
-for tq_block in 0..k {
-    for dim in 0..B {
-        let qd = query_codes[tq_block * B + dim];
-        let row = &dist_table[qd as usize];
-        for v in 0..64 { // SIMD-friendly: no inter-vector deps
-            unit_dots[v] += row[codes[offset] as usize];
-            offset += 1;
-        }
-    }
-    // Weight per-block unit-norm dot product by both vectors' block norms
-    for v in 0..64 {
-        distances[v] += query_norms[tq_block] * data_norms[v][tq_block]
-            * unit_dots[v];
-        unit_dots[v] = 0.0; // reset for next TQ block
-    }
-}
-```
-
-**Int8 layout variant.** The PDX implementation [pdx-impl] uses a different
-tiling for int8 data: "4 dims × 16 vecs" to leverage VPDPBUSD/UDOT hardware
-dot-product instructions (specific **mixed** int8 dot-product idioms on each
-architecture, e.g. VPDPBUSD on x86). For TurboQuant codes at b_mse ≤ 8, codes
-are uint8 **centroid indices**, not quantized coordinate values, so these
-instructions do not apply directly — we need the distance-table-lookup path
-shown above. However, at b_mse=8 with high B, the Max-Lloyd centroids are
-near-uniformly spaced (see GPU section), potentially enabling direct hardware
-dot-product on the codes. Whether this requires a separate linear quantization
-mode or works with the existing Max-Lloyd centroids is an empirical question.
-The "4 dims × 16 vecs" layout would be a Stage 3 optimization to evaluate
-alongside the "1 dim × 64 vecs" float-style layout.
-
-**ADSampling integration.** The PDX dimension-pruning approach (ADSampling [4])
-is complementary to TurboQuant's block structure. During a scan, the pruner
-could evaluate partial distances after each TQ block (B dimensions) and skip
-remaining blocks if the partial L2 distance already exceeds the candidate
-threshold.
This requires the per-block norm weighting to happen at block -boundaries (as shown in the kernel above), which our design already provides. - -**Open design questions:** - -- Should PDXArray live in `vortex-array` (general infrastructure) or - `vortex-tensor` (vector-specific)? -- Should the cascade compressor automatically PDX-transpose FSL children when - it detects a scan-heavy workload, or should PDX be opt-in? -- Should we support the "4 dims × 16 vecs" uint8 layout variant (for hardware - dot-product) alongside the "1 dim × 64 vecs" float-style layout? - -### QJL correction (deferred — experimental) - -Based on community findings [8], QJL is deferred to after the MSE stages are -validated. - -**Changes vs. MSE-only (if pursued):** - -| Aspect | MSE-only | MSE + QJL | -| ---------------------- | -------------------------------- | --------------------------------------------------------------- | -| Bit budget | All b bits → MSE (2^b centroids) | b-1 bits MSE + 1 bit QJL (2^(b-1) centroids) | -| Inner product estimate | Biased (MSE quantization noise) | Unbiased (QJL correction; see **TurboQuant_prod** in [1]) | -| Additional children | None | QJL signs, QJL residual norms, QJL projection params | -| Encode cost | SORF only | SORF + QJL projection (O(B²) for Gaussian, O(B log B) for SORF) | -| Decode cost | Inverse SORF only | Inverse SORF + QJL inverse projection | - -If pursued, four strategies should be compared: - -| Strategy | Theoretical | Speed | Storage | -| -------------------- | --------------------- | ---------------- | --------------- | -| Per-block Gaussian | Correct (Lemma 4 [1]) | O(B²)/block | k×B²×4 bytes | -| Per-block SORF | Approximate | O(B log B)/block | k×3×B bits | -| Full-dim padded SORF | Approximate | O(d log d) total | 3×padded_d bits | -| MSE-only (no QJL) | N/A | 0 | None | - -The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] is proved specifically -for Gaussian. 
SORF for QJL is an additional approximation (the -[current implementation][current-impl] uses SORF for QJL). Per-block QJL’s -variance scaling vs full-dimension QJL is stated in **Lemma 4 [1]**—quote the -lemma’s **exact** variance expression when making quantitative comparisons (not -just “\(d/B\) times” in prose). - -The community consensus is that MSE-only likely wins for ANN ranking at all -bit widths, so QJL may not be worth the complexity. - -## Array layout - -### Stage 1 (MSE-only single block) - -``` -TurboQuantArray -├── metadata: { dimension, b_mse, block_size (= padded_dim), -│ num_blocks (= 1) } -│ -│ # Per-row children -├── codes: FixedSizeListArray # list_size = padded_dim -│ (or PDXArray after Stage 3) -├── norms: PrimitiveArray # len = num_rows (F = f64 for f64, f32 otherwise) -│ -│ # Shared children -├── centroids: PrimitiveArray # len = 2^b_mse -├── mse_rotation_signs: PrimitiveArray # len = 3 × padded_dim (bitpacked) -``` - -Same structure as the [current PR][current-impl] minus the 3 QJL slots, plus -the forward-compatible metadata fields and dtype-matching norms. The codes child -is `FixedSizeListArray` in Stages 1-2 and may be swapped to `PDXArray` in Stage -3 — TurboQuant checks the child type at runtime, not via a metadata flag. 
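The same-or-wider norm dtype rule in the layout above can be stated as a direct mapping; a stand-in sketch for illustration (the real code dispatches on Vortex's DType, not this enum):

```rust
/// Stand-in for the float widths TurboQuant accepts.
#[derive(Debug, PartialEq)]
enum FloatWidth {
    F16,
    F32,
    F64,
}

/// Same-or-wider norm dtype: f16 lacks the range/precision for norms,
/// so it widens to f32; f64 keeps full precision.
fn norm_dtype(input: FloatWidth) -> FloatWidth {
    match input {
        FloatWidth::F16 | FloatWidth::F32 => FloatWidth::F32,
        FloatWidth::F64 => FloatWidth::F64,
    }
}
```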
- -### Stage 2 (block decomposition) - -``` -TurboQuantArray (self-contained, handles blocks internally) -├── metadata: { dimension, b_mse, block_size, num_blocks } -│ -│ # Per-row children (sliced/taken on row operations) -├── codes: FixedSizeListArray # list_size = k × B -│ (or PDXArray after Stage 3) -├── norms: PrimitiveArray # len = num_rows (k=1) -│ or FixedSizeListArray # list_size = k (k>1) -│ -│ # Shared children (cloned on row operations, not sliced) -├── centroids: PrimitiveArray # len = 2^b_mse -├── mse_rotation_signs: PrimitiveArray # len = k × 3 × B -``` - -## Compression ratio - -For f32 input, b_mse bits MSE, k = d/B blocks, N vectors (for f64 input, -replace 32 with 64 in the norms row — ratios decrease accordingly): - -| Component | Bits per vector | -| ----------- | --------------- | -| MSE codes | k × B × b_mse | -| Block norms | k × 32 | - -| Component | Shared bits | -| ---------- | ------------ | -| Centroids | 2^b_mse × 32 | -| SORF signs | k × 3 × B | - -### Worked examples (f32, b_mse=5, N=1000) - -| d | B | k | Per-vec bits | Ratio | Notes | -| ------------- | ---- | --- | --------------------- | ----- | -------------------------- | -| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; zero padding | -| 1024 | 1024 | 1 | 1024×5 + 32 = 5152 | 6.4× | Single block (= current) | -| 768 (current) | 1024 | 1 | 1024×5 + 32 = 5152 | 4.8× | Padded; 33% overhead | - -Block decomposition improves the **compression ratio** for d=768 from ~4.8× to -~6.2× (about **29%** higher ratio). In **compressed bits per vector** for the -same settings, that is about **24%** fewer bits (5152 → 3936). For d=1024 the -encoding is identical to current. - -**Shared overhead:** centroids and SORF signs are **amortized over N** vectors; -for **small N**, per-vector shared metadata dominates—report totals both with -and without amortization when publishing ratios. 
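The worked examples follow from a two-line computation. A sketch reproducing the table's per-vector bits and ratios (shared centroid/sign bits are amortized over N and excluded, matching the table; function names are illustrative):

```rust
/// Per-vector compressed bits: k blocks of b_dims coordinates at b_mse
/// bits each, plus one norm per block (norm_bits = 32 for f32 input,
/// 64 for f64).
fn per_vector_bits(k: u32, b_dims: u32, b_mse: u32, norm_bits: u32) -> u32 {
    k * b_dims * b_mse + k * norm_bits
}

/// Compression ratio vs. raw f32 vectors of dimension d.
fn ratio_vs_f32(d: u32, per_vec_bits: u32) -> f64 {
    (d * 32) as f64 / per_vec_bits as f64
}
```

`ratio_vs_f32(768, per_vector_bits(3, 256, 5, 32))` recovers the table's ~6.2× for block-decomposed d=768.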
- -## Performance analysis - -### Encode/decode throughput - -SORF at B dimensions (order-of-magnitude): 3 × B × log₂(B) butterflies + 3 × B -sign applications per block (plus B normalization multiplies, omitted). Constants -and memory traffic dominate in practice; treat FLOP estimates as **heuristic**. -For k blocks: - -| B | SORF FLOPs/block | k (d=768) | Total MSE FLOPs | -| -------------- | ------------------------- | --------- | --------------- | -| 256 | 3×256×8 + 768 = 6,912 | 3 | 20,736 | -| 512 | 3×512×9 + 1536 = 15,360 | — | — | -| 1024 (current) | 3×1024×10 + 3072 = 33,792 | 1 | 33,792 | - -Block decomposition at d=768 is ~40% fewer FLOPs than the current padded -approach, despite more blocks, because each block is smaller. - -### Benchmarking plan - -1. Encode/decode throughput: block TQ vs. current TQ at d=128, 768, 1024 -2. Quantized cosine similarity: block vs. current -3. L2 norm readthrough: O(k) vs. O(1) -4. PDX scan throughput vs. row-major (Stage 3) - -## Experimental plan - -### MSE quality vs. block size - -- Compare actual normalized MSE at B ∈ {64, 128, 256, 512} vs. single-SORF at - padded dimension, at bit widths b ∈ {2, 3, 4, 5, 8} -- Test SORF coordinate distribution at each B: histogram vs. analytical Beta -- Test 3, 4, 5 SORF rounds at each B -- Determine if the practical MSE constant is worse at smaller B - -### QJL strategy comparison (if pursued) - -- Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim padded SORF QJL - vs. MSE-only -- Key metric: ANN recall@k on the datasets above (Contriever, OpenAI, SIFT) -- Per community findings, MSE-only is expected to win [8] - -### Benchmarking datasets - -The current test suite uses i.i.d. Gaussian vectors: for **isotropic** data, a -random orthogonal transform is **distributionally neutral**, so this is a clean -**theory/sanity** anchor—not a guaranteed “pessimistic” proxy for all production -embedding geometries (heavy tails, clusters, anisotropy can behave differently). 
-Recent work (VIBE [11]) argues that traditional benchmarks (SIFT, GloVe) are no -longer representative of modern ANN workloads. - -**Recommended datasets:** - -| Dataset | Dim | Size | Source | Why | -| ----------------------------- | ------ | ------ | ---------------- | ------------------------------------------------------ | -| Contriever | 768 | ~1M | PDX paper [4] | Key non-power-of-2 target; real embeddings | -| OpenAI text-embedding-3-large | 1536 | ~1M | Common in RAG | High-d production embeddings | -| SIFT | 128 | 1M | Classic | Low-d power-of-2 baseline, well-studied recall numbers | -| arXiv embeddings | 768 | 2.25M | PDX paper [4] | Same dim as Contriever, larger scale | -| DEEP | 96 | 10M | Image embeddings | Large scale; d=96 has no B ≥ 64 divisor → padded path | -| Synthetic Gaussian | varies | varies | Internal | Pessimistic baseline; validates theoretical bounds | - -**Metrics** (at b_mse ∈ {2, 3, 4, 5, 8}): - -- Recall@10, Recall@100 (ANN ranking quality) -- Normalized MSE distortion (reconstruction quality) -- Inner product mean signed relative error (bias measurement) -- Encode/decode throughput (vectors/sec) - -The Gaussian baseline validates that theoretical bounds hold. The real-embedding -datasets measure practical quality — which may be **better** than Gaussian -(structured data benefits more from rotation) or **worse** (if the data has -adversarial properties for the specific rotation). - -### Straggler handling (if needed) - -Rare for common dimensions. If encountered: zero-pad to B (simplest). Follow-up: -dense rotation at actual dimension. - -## Phasing - -**Phase 1** — MSE-only single-block TurboQuant: Split the [current PR][current-impl] -to merge MSE-only (no QJL). This is a complete encoding for all dimensions -(with padding for non-power-of-2). - -**Phase 2** — Block decomposition: Add block splitting for non-power-of-2 -dimensions. B = greatest power-of-2 ≥ 64 dividing d. Per-block norms stored as -internal children. 
The `TurboQuantScheme::compress()` method must be updated to:
-(a) choose B based on d, (b) split input into blocks, (c) normalize per-block,
-(d) encode each block, and (e) store per-block norms as an internal child array.
-
-**Phase 3** — PDXArray + scan kernels: Introduce `PDXArray` as a general-purpose
-dimension-major layout for `FixedSizeListArray`. TurboQuant's codes child is
-swapped from FSL to PDXArray by the compressor. Distance computation kernels
-operate on PDXArray's dimension-contiguous slices.
-
-**Phase 4** (experimental) — QJL: If the experimental plan shows QJL improves
-recall@k beyond MSE-only, add per-block Gaussian or SORF QJL. Based on
-community findings, this may not be pursued.
-
-## Practical recommendations
-
-For common model dimensions, the most promising configurations are:
-
-| Dimension | Recommendation | Rationale |
-| --------------------- | --------------------------- | -------------------------------------------------------------------------- |
-| 512, 1024, 2048, 4096 | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout. |
-| 768, 1536, 3072 | 3-block MSE-only + PDX | B=256 or 512. Zero padding waste. 3 blocks, shared centroids. |
-| 2560, 1280, … | Evaluate table rule | Greatest power-of-2 ≥ 64 dividing d (e.g. 2560 → B=512, k=5; 1280 → B=256, k=5). |
-| Arbitrary d (rare) | Padded single-block | Fall back to current approach. Padding overhead bounded by B-1 dims. |
-
-In all cases, MSE-only is the recommended starting point. QJL should only be
-added if experiments demonstrate clear recall@k improvements for the target
-workload.
-
-## Future work: GPU decode and fused distance computation
-
-The B-dim block structure maps naturally to GPU tile sizes and tensor cores.
-For a batch of N vectors sharing the same rotation matrix R⁻¹: - -``` -decoded_batch = diag(norms) × R⁻¹ × codebook_lookup_batch(codes) - ↑ B×N matrix - ↑ B×B × B×N = GEMM -``` - -The codebook gather + inverse rotation + norm scaling can be fused into a single -kernel using an **IO-aware streaming pattern analogous in spirit** to Flash-KMeans’ -fused assignment/update philosophy [6]—**not** the same algorithm (Flash-KMeans is -GPU k-means), but a similar systems goal: reduce HBM traffic and avoid full -materialization. -For distance computation without full decode, a precomputed (2^b_mse)²-entry -distance table fits in shared memory (1 KB at b_mse=4, 4 KB at b_mse=5); the -kernel streams code bytes from HBM with gather-reduce accumulation, using -4-8× less bandwidth than full float vectors. - -At b_mse=8, codes are uint8 indices (0-255). Hypothetical int8 tensor-core paths -(e.g. VPDPBUSD-style idioms) require **quantized coordinate values** in a narrow -dynamic range and typically **near-linear** centroid spacing—but Max-Lloyd -centroids are **not** constrained to such a representation. At high B the -centroids are **near-uniform** under the concentrated marginal -(the Beta distribution is highly concentrated, approaching Gaussian, for which -high-resolution optimal quantization is approximately uniform). Whether the -existing Max-Lloyd centroids are "linear enough" for hardware dot-product -instructions is an empirical question worth testing before introducing a -separate linear quantization mode. - -## Integration with Vortex scan engine - -TurboQuant's quantized-domain operations must integrate with Vortex's expression -evaluation and scan pushdown infrastructure. The current implementation provides -this via `ScalarFnVTable` implementations in `vortex-tensor`. - -**Current integration path.** The `CosineSimilarity`, `DotProduct`, and `L2Norm` -scalar functions check whether their input storage arrays are TurboQuant-encoded -(via `TurboQuant::try_match()`). 
If both operands are TurboQuant and the -`ApproxOptions::Approximate` flag is set, the scalar function dispatches to the -quantized-domain kernel (e.g., `cosine_similarity_quantized_column`), bypassing -full decompression. Otherwise, it falls back to the exact path (decompress → -compute on floats). - -**Stage 2 changes.** With block decomposition, the quantized kernels must be -updated to iterate over TQ blocks, weighting by per-block norms: - -- `cosine_similarity_quantized_column`: currently computes a single unit-norm - dot product per row pair. Must change to `Σ_k norm_a_k · norm_b_k · -unit_dot_k / (‖a‖ · ‖b‖)` with `‖a‖ = √(Σ_k norm_a_k²)`. -- `dot_product_quantized_column`: same per-block weighting. -- `l2_norm`: currently returns the stored norm directly (O(1)). Must change to - `√(Σ_k norm_k²)` — read the norms FSL child and compute. -- Both operands must have the **same block size B**, **compatible centroids** - (same `b_mse` and block-**B** codebook), and **bit-identical MSE rotation - parameters** (`mse_rotation_signs` and the same SORF construction) for the - quantized inner-product path to equal the true dot product in expectation - under the TurboQuant model. **Two stored columns** with different rotations - must **fall back to exact** (decompress → float) unless a higher-level contract - guarantees shared rotation metadata. The common **column vs constant query** - path remains: re-encode the query with the **column’s** rotation and - centroids. - -**Stage 3 changes.** The PDX distance kernel (shown in Stage 3 pseudocode) is a -new execution path that operates on `PDXArray`-typed codes. It should be exposed -as an alternative `ScalarFnVTable` implementation that activates when the codes -child is a `PDXArray` and the scan is over a contiguous 64-vector-aligned range. -For non-aligned ranges or single-vector access (`scalar_at`), the PDXArray is -converted to FSL first via `PDXArray::to_fsl()`. 
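The Stage 2 kernel change amounts to a per-block weighted sum. A sketch of the cosine path, assuming the per-block unit dot products have already been recovered from the code/centroid product tables (hypothetical signature, not the Vortex kernel):

```rust
/// Quantized cosine with block decomposition: weight each block's
/// unit-direction dot product by both operands' block norms, then
/// divide by the whole-vector norms sqrt(sum_k norm_k^2).
fn cosine_quantized(norms_a: &[f32], norms_b: &[f32], unit_dots: &[f32]) -> f32 {
    let dot: f32 = norms_a
        .iter()
        .zip(norms_b)
        .zip(unit_dots)
        .map(|((na, nb), ud)| na * nb * ud)
        .sum();
    let norm_a = norms_a.iter().map(|n| n * n).sum::<f32>().sqrt();
    let norm_b = norms_b.iter().map(|n| n * n).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0 // zero vector: return 0 rather than NaN
    } else {
        dot / (norm_a * norm_b)
    }
}
```

With k=1 this degenerates to the current single-norm formula, consistent with the backward-compatible wire format.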
- -**Expression tree integration.** The typical ANN scan expression is: - -``` -top_k(cosine_similarity(column, constant_query), k=10) -``` - -The `constant_query` is broadcast to match the column length. The -`CosineSimilarity` scalar function receives both the column (TurboQuant-encoded) -and the query (ConstantArray wrapping a single vector). For the quantized path, -the query is first encoded with the column's rotation and centroids to produce -query codes and query block norms, then the PDX kernel runs over the column's -codes without decompressing them. - -## Migration and compatibility - -TurboQuant has not shipped yet, so there are no existing files to migrate. We -can design the metadata for forward compatibility from day one. - -**Strategy: single array ID, versioned metadata.** All stages use the same array -ID (`vortex.turboquant`). The metadata includes `block_size` and `num_blocks` -fields from Stage 1 onward. Stage 1 always writes `num_blocks=1`, but the field -exists so that Stage 2 decoders can read Stage 1 files without migration. - -**Norms are always internal children.** The TurboQuant array is self-contained — -it stores norms as a child slot, not in a parent encoding. This means: - -- Stage 1: norms child is `PrimitiveArray`, one norm per vector (F = f64 for - f64 input, f32 otherwise). -- Stage 2 with k=1 (power-of-2 dims): same as Stage 1, identical wire format. -- Stage 2 with k>1: norms child is `FixedSizeListArray`, k norms per vector. - -The decoder distinguishes k=1 from k>1 by reading `num_blocks` from metadata. -A k=1 decoder is backward-compatible with Stage 1 files. A k>1 decoder is a new -code path that only applies to files written by Stage 2+. - -**Stage 3 (PDXArray) is additive.** PDX is not a TurboQuant metadata flag — it's -a separate array type (`PDXArray`) that wraps the codes child. Stage 1/2 files -have `FixedSizeListArray` codes; Stage 3 files have `PDXArray` codes. 
The -TurboQuant decoder checks the child type and un-transposes PDXArray on decode if -needed. `PDXArray` itself is registered as a new encoding, independent of -TurboQuant. - -**Incremental shipping:** - -| Stage | Ships to users? | Reads Stage 1 files? | Notes | -| ------------ | ---------------- | -------------------------- | ----------------------------------- | -| 1 (MSE-only) | Yes, immediately | N/A (first version) | New encoding, no backcompat concern | -| 2 (blocks) | Yes | Yes (k=1 is identical) | k>1 files need Stage 2+ decoder | -| 3 (PDX) | Yes | Yes (FSL codes still work) | PDX codes need PDXArray registered | - -Each stage is independently shippable. Users can upgrade incrementally. Files -written by earlier stages are always readable by later decoders. - -## References - -[1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. "TurboQuant: Online -Vector Quantization with Near-optimal Distortion Rate." ICLR 2026. -arXiv:2504.19874, April 2025. - -[2] Ailon, N. and Chazelle, B. "The Fast Johnson-Lindenstrauss Transform and -Approximate Nearest Neighbors." SIAM J. Comput. 39(1):302-322, 2009. - -[3] Tropp, J.A. "Improved Analysis of the Subsampled Randomized Hadamard -Transform." Adv. Adaptive Data Analysis 3(1-2):115-126, 2011. - -[4] Kuffo, L., Krippner, E. and Boncz, P. "PDX: A Data Layout for Vector -Similarity Search." SIGMOD '25. arXiv:2503.04422, March 2025. - -[5] Yu, F.X., Suresh, A.T., Choromanski, K., Holtmann-Rice, D. and Kumar, S. -"Orthogonal Random Features." NeurIPS 2016. arXiv:1610.09072. - -[6] Yang, S. et al. "Flash-KMeans: Fast and Memory-Efficient Exact K-Means." -arXiv:2603.09229, March 2026. - -[7] Pathare, T. et al. "TurboQuant: Implementation Corrections, Production -Hardening, and Deployment Infrastructure." Eviox Tech Report v1.2.0, -March 2026. - -[8] Community TurboQuant implementations and findings. 
Key sources (pin -**commits** or **releases** in the final RFC): tonbistudio/turboquant-pytorch -(PyTorch, MSE-only reports); ggml-org/llama.cpp — use a **resolvable** issue or -discussion (e.g. issue **#20977** “Feature Request: TurboQuant support,” or -discussion **#21155**, as of 2026; replace if superseded); 0xSero/turboquant -(Triton); vivekvar-dl/turboquant (pip); scos-lab/turboquant (reproduction). -**Claim:** several groups report MSE-only beating MSE+QJL for attention / ANN-style -metrics at tested bit widths—treat as **empirical community reports** until -summarized in a peer-reviewed study or a pinned benchmark table. - -[9] Jégou, H., Douze, M. and Schmid, C. "Product Quantization for Nearest -Neighbor Search." IEEE Trans. PAMI 33(1):117-128, 2011. - -[10] Ge, T., He, K., Ke, Q. and Sun, J. "Optimized Product Quantization." -IEEE Trans. PAMI 36(4):744-755, 2014. - -[11] Jääsaari, E., Hyvönen, V., Ceccarello, M., Roos, T. and Aumüller, M. -"VIBE: Vector Index Benchmark for Embeddings." arXiv:2505.17810, May 2025. From d4ca306c253884d0dab44ada4544a7843857d557 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 12:16:54 -0400 Subject: [PATCH 07/19] second pass with external reviewers Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 65 ++++++++++++++++--------------- 1 file changed, 33 insertions(+), 32 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 56b1ff4..ea3378a 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -12,7 +12,7 @@ in three stages: 1. **MSE-only TurboQuant** (immediate): merge the current PR as an MSE-only encoding. This is a complete, self-contained building block. 2. **Block decomposition** (next): for non-power-of-2 dimensions, split into - blocks of size B = the largest power-of-2 ≥ 64 that divides d. For + blocks of size B = the greatest power-of-2 ≥ 64 that divides d. 
For power-of-2 dimensions, B = d (single block, same as current). Per-block norms stored as internal children. 3. **PDX layout** (later): transpose codes into dimension-major order within @@ -20,7 +20,7 @@ in three stages: QJL correction is deferred to a later stage and may ultimately be dropped. Community findings from multiple independent TurboQuant implementations -consistently show that MSE-only outperforms MSE+QJL for KV-cache attention [8]. +often show that MSE-only outperforms MSE+QJL for KV-cache attention [8]. For ANN ranking and vector-search workloads, the evidence is currently less complete, so QJL should remain an empirical question rather than a settled conclusion. @@ -59,7 +59,7 @@ differences are: | Codebook | Analytically derived from Beta distribution; **data-oblivious** | Learned via k-means on training data; **data-dependent** | | Rotation | Random orthogonal within each sub-vector | Typically none (OPQ [10] adds a learned rotation) | | Theoretical guarantees | Provable MSE bound (Theorem 1 [1]) | Empirical quality only | -| Indexing time | Zero (codebook precomputed from distribution) | Requires training pass over data | +| Codebook training | None (centroids derived from theory) | Requires training pass over data | | Bits per sub-vector | Scalar: b bits per coordinate | Vector: typically 8 bits per sub-vector (256 codewords) | TurboQuant trades PQ's flexibility (data-dependent codebooks can exploit @@ -222,18 +222,18 @@ quantization with the computational savings of early termination. ### Block size strategy -For each dimension d, choose B = the largest power-of-2 ≥ 64 that evenly +For each dimension d, choose B = the greatest power-of-2 ≥ 64 that evenly divides d. 
This eliminates stragglers entirely for common embedding dimensions: -| Dimension d | Block size B | Blocks k | Notes | -| ----------- | ------------ | -------- | --------------------------- | -| 512 | 512 | 1 | Single block (= current TQ) | -| 768 | 256 | 3 | Largest dividing power-of-2 | -| 1024 | 1024 | 1 | Single block | -| 1536 | 512 | 3 | | -| 2048 | 2048 | 1 | Single block | -| 3072 | 1024 | 3 | | -| 4096 | 4096 | 1 | Single block | +| Dimension d | Block size B | Blocks k | Notes | +| ----------- | ------------ | -------- | ---------------------------- | +| 512 | 512 | 1 | Single block (= current TQ) | +| 768 | 256 | 3 | Greatest dividing power-of-2 | +| 1024 | 1024 | 1 | Single block | +| 1536 | 512 | 3 | | +| 2048 | 2048 | 1 | Single block | +| 3072 | 1024 | 3 | | +| 4096 | 4096 | 1 | Single block | **Key observations:** @@ -614,8 +614,9 @@ If pursued, four strategies should be compared: The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] is proved specifically for Gaussian. SORF for QJL is an additional approximation (the -[current implementation][current-impl] uses SORF for QJL). Per-block QJL has -d/B times more variance than full-dimension QJL (Lemma 4 [1]). +[current implementation][current-impl] uses SORF for QJL). Per-block QJL can +incur up to d/B times larger variance bound than full-dimension QJL (Lemma 4 +[1]), depending on how query and residual energy are distributed across blocks. Community reports indicate MSE-only often wins for KV-cache attention at all tested bit widths [8]. Whether this extends to ANN ranking is an empirical @@ -737,10 +738,11 @@ approach, despite more blocks, because each block is smaller. ### Benchmarking datasets -The current test suite uses i.i.d. Gaussian vectors, which is a pessimistic -baseline for TurboQuant: real embeddings have structure (clusters, anisotropy) -that rotation-based quantization can exploit, while Gaussian vectors are already -rotationally invariant (the rotation is a no-op in distribution). 
Recent work +The current test suite uses i.i.d. Gaussian vectors as a theory anchor and +sanity check: for isotropic data, a random orthogonal transform is +distributionally neutral, which cleanly validates theoretical bounds. This is +not a universal "worst case" for all production workloads — heavy-tailed or +clustered embeddings can behave differently. Recent work (VIBE [11]) argues that traditional benchmarks (SIFT, GloVe) are no longer representative of modern ANN workloads. @@ -779,7 +781,7 @@ to merge MSE-only (no QJL). This is a complete encoding for all dimensions (with padding for non-power-of-2). **Phase 2** — Block decomposition: Add block splitting for non-power-of-2 -dimensions. B = largest power-of-2 ≥ 64 dividing d. Per-block norms stored as +dimensions. B = greatest power-of-2 ≥ 64 dividing d. Per-block norms stored as internal children. The `TurboQuantScheme::compress()` method must be updated to: (a) choose B based on d, (b) split input into blocks, (c) normalize per-block, (d) encode each block, and (e) store per-block norms as an internal child array. @@ -791,7 +793,7 @@ operate on PDXArray's dimension-contiguous slices. **Phase 4** (experimental) — QJL: If the experimental plan shows QJL improves recall@k beyond MSE-only, add per-block Gaussian or SORF QJL. Based on -community findings, this may not be pursued. +KV-cache community reports [8], this may not be pursued. ## Practical recommendations @@ -953,17 +955,16 @@ arXiv:2603.09229, March 2026. Hardening, and Deployment Infrastructure." Eviox Tech Report v1.2.0, March 2026. https://eviox.tech/nexus/eviox_turboquant_corrections_study.pdf -[8] Community TurboQuant implementation reports. These sources primarily study -KV-cache attention rather than ANN search; claims should be scoped accordingly. -Key sources (pin commits/releases in final external draft): - -- tonbistudio/turboquant-pytorch: MSE-only (V3) vs MSE+QJL (V2) for attention - and generation. Workload: KV-cache attention. 
-- ggml-org/llama.cpp discussion #21155: TurboQuant quantized attention analysis. - Workload: KV-cache attention. -- 0xSero/turboquant: Triton kernels, paper validation scripts. -- scos-lab/turboquant: Reference reproduction, MSE vs Prod comparison. - Several groups report MSE-only beating MSE+QJL for attention metrics at tested +[8] Community TurboQuant implementation reports (primarily KV-cache attention): + +- https://github.com/tonbistudio/turboquant-pytorch — MSE-only (V3) vs + MSE+QJL (V2); reports MSE-only wins for attention and generation quality. +- https://github.com/ggml-org/llama.cpp/discussions/21155 — Quantized + attention analysis; MSE vs Prod comparison for KV-cache workloads. +- https://github.com/0xSero/turboquant — Triton kernels; paper validation. +- https://github.com/scos-lab/turboquant — Reference reproduction; MSE vs + Prod/QJL comparison. + Multiple groups report MSE-only beating MSE+QJL for attention metrics at tested bit widths. ANN ranking conclusions remain preliminary pending dedicated benchmarks. From 58a64a2c3bd36e55467c82a09f8c3abf2d6d9898 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 12:26:50 -0400 Subject: [PATCH 08/19] third pass with external reviewers Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 115 +++++++++++++++++------------- 1 file changed, 66 insertions(+), 49 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index ea3378a..5af2529 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -146,7 +146,7 @@ bound itself — they are well below the 2.72/4^b bound. ### Community findings on QJL -Multiple independent TurboQuant implementations have converged on a significant +Multiple independent TurboQuant implementations have repeatedly reported a practical finding for **KV-cache attention**: MSE-only often outperforms MSE+QJL at the same bit budget. 
The likely mechanism is a variance-bias tradeoff: QJL removes bias in raw inner-product estimation but adds variance, and the softmax @@ -223,7 +223,9 @@ quantization with the computational savings of early termination. ### Block size strategy For each dimension d, choose B = the greatest power-of-2 ≥ 64 that evenly -divides d. This eliminates stragglers entirely for common embedding dimensions: +divides d. If no such B exists (e.g., d=96), fall back to the padded +single-block path from Stage 1. For common embedding dimensions, this rule +always produces a valid B and eliminates padding entirely: | Dimension d | Block size B | Blocks k | Notes | | ----------- | ------------ | -------- | ---------------------------- | @@ -242,8 +244,8 @@ divides d. This eliminates stragglers entirely for common embedding dimensions: No block decomposition overhead, no per-block norms. These dimensions are already well-served by the current design. - **Non-power-of-2 dimensions** (768, 1536, 3072) decompose into k=3 blocks at - B=256 or B=512. Zero padding waste. Each block has its own SORF rotation and - shares a single centroid set. + B=256 or B=512. No padding waste (vs. 33% for the padded single-block path). + Each block has its own SORF rotation and shares a single centroid set. - **Stragglers are eliminated** for all common embedding dimensions. Dimensions that are not multiples of 64 (e.g., 100, 200) would need straggler handling, but these are rare in practice for modern model architectures. @@ -362,17 +364,19 @@ only with a true random orthogonal rotation or with empirical SORF validation (see Experimental plan). Assuming the per-block MSE bound holds, for a vector split into blocks the -following **algebraic** identity is exact: - -``` -‖x - x̂‖² / ‖x‖² = Σ_k (‖xₖ‖² / ‖x‖²) × (‖xₖ - x̂ₖ‖² / ‖xₖ‖²) - ≤ MSE_bound × Σ_k (‖xₖ‖² / ‖x‖²) = MSE_bound -``` - -The inequality applies Theorem 1's **probabilistic** bound (over the random -rotation) to each block independently. 
The conclusion should be read in terms
-of **expectations**: `E[‖x - x̂‖² / ‖x‖²] ≤ MSE_bound` assuming independent
-per-block rotations. Note that TurboQuant's original analysis uses a single
+first line is an **algebraic** identity (exact); the inequality on the second
+line applies Theorem 1's **probabilistic** bound to each block and should be
+read as holding in **expectation** over independent per-block rotations, not
+almost surely:
+
+````
+‖x - x̂‖² / ‖x‖² = Σ_k (‖xₖ‖² / ‖x‖²) × (‖xₖ - x̂ₖ‖² / ‖xₖ‖²) (exact)
+  E[...] ≤ MSE_bound × Σ_k (‖xₖ‖² / ‖x‖²) = MSE_bound (in expectation)
+``` The conclusion: `E[‖x - x̂‖² / ‖x‖²] ≤ MSE_bound` assuming independent
+per-block rotations. (Theorem 1 applies because each block is normalized to
+unit norm before rotation and quantization; the per-block encoding pipeline is:
+split → normalize → rotate → quantize, matching the theorem's unit-sphere
+assumption.) Note that TurboQuant's original analysis uses a single
 global rotation in high-d where coordinates are nearly independent; with
 smaller block dimension B, within-block coordinate dependence after rotation
 may be stronger even when marginals are correct — this is an additional motivation
@@ -431,41 +435,47 @@ centroids[code_bₖ[j]]`.

 #### Encoding algorithm

-```
+````
+
 Input: x ∈ ℝ^d, b_mse bits per coordinate, block_size B

-k = d / B (exact division, no straggler for chosen B)
+k = d / B (exact division, no straggler for chosen B)
 num_centroids = 2^b_mse

 # Block split and normalize
+
 for i in 0..k:
-  xᵢ = x[i*B .. (i+1)*B]
-  nᵢ = ‖xᵢ‖
-  if nᵢ > 0:
-    ûᵢ = xᵢ / nᵢ
-  else:
-    ûᵢ = zeros(B)
+  xᵢ = x[i*B .. (i+1)*B]
+  nᵢ = ‖xᵢ‖
+  if nᵢ > 0:
+    ûᵢ = xᵢ / nᵢ
+  else:
+    ûᵢ = zeros(B)

 # MSE stage (per block, SORF rotation)
+
 for i in 0..k:
-  if nᵢ > 0:
-    rᵢ = SORFᵢ(ûᵢ)
-    cᵢ[j] = nearest_centroid(rᵢ[j])
-  else:
-    cᵢ[j] = 0
+  if nᵢ > 0:
+    rᵢ = SORFᵢ(ûᵢ)
+    cᵢ[j] = nearest_centroid(rᵢ[j])
+  else:
+    cᵢ[j] = 0

 Store (all as internal children):
-  codes (k × B per vector), norms (k per vector),
-  centroids (2^b_mse, shared), SORF signs (k × 3 × B, shared)
+  codes (k × B per vector), norms (k per vector),
+  centroids (2^b_mse, shared), SORF signs (k × 3 × B, shared)
+
 ```

 #### Decoding algorithm

 ```
+
 for i in 0..k:
-  r̂ᵢ[j] = centroids[cᵢ[j]]
-  ûᵢ = SORF⁻¹ᵢ(r̂ᵢ)
-  x̂ᵢ = nᵢ × ûᵢ (nᵢ read from internal norms child)
+  r̂ᵢ[j] = centroids[cᵢ[j]]
+  ûᵢ = SORF⁻¹ᵢ(r̂ᵢ)
+  x̂ᵢ = nᵢ × ûᵢ (nᵢ read from internal norms child)

 x̃ = concat(x̂₀, ..., x̂ₖ₋₁)
+
 ```

 ### Stage 3: PDX dimension-major layout
@@ -493,10 +503,12 @@ PDXArray back to FSL then decodes.

 **PDXArray design:**

 ```
+
 PDXArray (general-purpose dimension-major layout for FixedSizeList)
 ├── metadata: { list_size, chunk_size (= 64) }
-├── elements: PrimitiveArray # transposed: 64 values per dim, contiguous
-├── validity: ... # same as FSL validity
+├── elements: PrimitiveArray  # transposed: 64 values per dim, contiguous
+├── validity: ...             # same as FSL validity
+
 ```

- `PDXArray::try_new(fsl)` — transposes a FixedSizeListArray into PDX layout
  scalar_at, or non-aligned slice/take)
- `PDXArray::elements_for_dim(dim, chunk)` — O(1) access to a contiguous slice
  of 64 values for one dimension within one chunk
- Slice/take: un-transpose to FSL (simplest). Un-transpose cost is
+  O(rows × list_size) per operation; consider 64-row-aligned fast paths for
+  hot scan workloads. Preserving PDX layout is possible only for
+  64-vector-aligned ranges. 
- The cascade compressor treats PDXArray as a valid encoding of FSL-typed data. **Benefits of PDXArray as a separate type:** @@ -520,13 +534,15 @@ PDXArray (general-purpose dimension-major layout for FixedSizeList) Within each 64-vector chunk, codes are stored dimension-major: ``` -TQ block 0, dim 0: [v0 v1 v2 ... v63] -TQ block 0, dim 1: [v0 v1 v2 ... v63] + +TQ block 0, dim 0: [v0 v1 v2 ... v63] +TQ block 0, dim 1: [v0 v1 v2 ... v63] ... -TQ block 0, dim (B - 1): [v0 v1 v2 ... v63] -TQ block 1, dim 0: [v0 v1 v2 ... v63] +TQ block 0, dim (B - 1): [v0 v1 v2 ... v63] +TQ block 1, dim 0: [v0 v1 v2 ... v63] ... -``` + +```` The inner SIMD loop (64 vectors) has no inter-vector dependencies. TQ block boundaries only affect where norm weighting occurs — they don't affect the @@ -558,7 +574,7 @@ for tq_block in 0..k { unit_dots[v] = 0.0; // reset for next TQ block } } -``` +```` **Int8 layout variant.** The PDX implementation [pdx-impl] uses a different tiling for int8 data: "4 dims × 16 vecs" to leverage VPDPBUSD/UDOT hardware @@ -598,7 +614,7 @@ validated. | Aspect | MSE-only | MSE + QJL | | ---------------------- | -------------------------------- | --------------------------------------------------------------- | | Bit budget | All b bits → MSE (2^b centroids) | b-1 bits MSE + 1 bit QJL (2^(b-1) centroids) | -| Inner product estimate | Biased (MSE quantization noise) | Unbiased (QJL correction, Theorem 2 [1]) | +| Inner product estimate | Biased (MSE quantization noise) | Unbiased (QJL correction; see TurboQuant_prod in [1]) | | Additional children | None | QJL signs, QJL residual norms, QJL projection params | | Encode cost | SORF only | SORF + QJL projection (O(B²) for Gaussian, O(B log B) for SORF) | | Decode cost | Inverse SORF only | Inverse SORF + QJL inverse projection | @@ -755,7 +771,7 @@ representative of modern ANN workloads. 
| SIFT | 128 | 1M | Classic | Low-d power-of-2 baseline, well-studied recall numbers | | arXiv embeddings | 768 | 2.25M | PDX paper [4] | Same dim as Contriever, larger scale | | DEEP | 96 | 10M | Image embeddings | Large scale; d=96 has no B ≥ 64 divisor → padded path | -| Synthetic Gaussian | varies | varies | Internal | Pessimistic baseline; validates theoretical bounds | +| Synthetic Gaussian | varies | varies | Internal | Theory anchor / sanity check; not universal worst case | **Metrics** (at b_mse ∈ {2, 3, 4, 5, 8}): @@ -802,7 +818,7 @@ For common model dimensions, the most promising configurations are: | Dimension | Recommendation | Rationale | | --------------------- | --------------------------- | -------------------------------------------------------------------------- | | 512, 1024, 2048, 4096 | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout. | -| 768, 1536, 3072 | 3-block MSE-only + PDX | B=256 or 512. Zero padding waste. 3 blocks, shared centroids. | +| 768, 1536, 3072 | 3-block MSE-only + PDX | B=256 or 512. No padding waste. 3 blocks, shared centroids. | | Arbitrary d (rare) | Padded single-block | Fall back to current approach. Padding overhead bounded by B-1 dims. | In all cases, MSE-only is the recommended starting point. QJL should only be @@ -861,7 +877,8 @@ updated to iterate over TQ blocks, weighting by per-block norms: unit_dot_k / (‖a‖ · ‖b‖)` with `‖a‖ = √(Σ_k norm_a_k²)`. - `dot_product_quantized_column`: same per-block weighting. - `l2_norm`: currently returns the stored norm directly (O(1)). Must change to - `√(Σ_k norm_k²)` — read the norms FSL child and compute. + `√(Σ_k norm_k²)` — read the norms child (`PrimitiveArray` for k=1, + `FixedSizeListArray` for k>1) and compute. 
- Both operands must have the **same block size B**, compatible centroids (same `b_mse` and B-dim codebook), and **bit-identical MSE rotation parameters** (`mse_rotation_signs` and same SORF construction) for the quantized @@ -959,8 +976,8 @@ March 2026. https://eviox.tech/nexus/eviox_turboquant_corrections_study.pdf - https://github.com/tonbistudio/turboquant-pytorch — MSE-only (V3) vs MSE+QJL (V2); reports MSE-only wins for attention and generation quality. -- https://github.com/ggml-org/llama.cpp/discussions/21155 — Quantized - attention analysis; MSE vs Prod comparison for KV-cache workloads. +- https://github.com/ggml-org/llama.cpp/discussions/20969 — TurboQuant + discussion; quantized attention analysis and MSE vs Prod comparison. - https://github.com/0xSero/turboquant — Triton kernels; paper validation. - https://github.com/scos-lab/turboquant — Reference reproduction; MSE vs Prod/QJL comparison. From f327e5ddf7b4128fa74afa162e93b618801f0eb9 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 12:32:11 -0400 Subject: [PATCH 09/19] fourth pass with external reviewers Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 47 ++++++++++++++++++------------- 1 file changed, 27 insertions(+), 20 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 5af2529..563f474 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -11,10 +11,11 @@ in three stages: 1. **MSE-only TurboQuant** (immediate): merge the current PR as an MSE-only encoding. This is a complete, self-contained building block. -2. **Block decomposition** (next): for non-power-of-2 dimensions, split into - blocks of size B = the greatest power-of-2 ≥ 64 that divides d. For - power-of-2 dimensions, B = d (single block, same as current). Per-block - norms stored as internal children. +2. 
**Block decomposition** (next): for dimensions where a valid B exists + (greatest power-of-2 ≥ 64 dividing d), split into blocks of size B. For + power-of-2 dimensions, B = d (single block). Dimensions with no qualifying + B fall back to padded single-block. Per-block norms stored as internal + children. 3. **PDX layout** (later): transpose codes into dimension-major order within groups of 64 vectors for SIMD scan performance. @@ -36,11 +37,13 @@ embeddings. It works by: 1. Randomly rotating a unit-norm vector so that each coordinate follows a known marginal distribution — specifically `(1 - x²)^((d-3)/2)` on [-1, 1], a - concentrated Beta distribution (Lemma 1 in [1]). + concentrated Beta distribution (Lemma 1 in [1]; verify numbering against the + ICLR 2026 camera-ready if it differs from the arXiv version). 2. Applying an MSE-optimal scalar quantizer (Max-Lloyd centroids) independently to each coordinate. 3. Optionally adding a 1-bit QJL (Quantized Johnson-Lindenstrauss) correction - on the residual for unbiased inner product estimation (Theorem 2 in [1]). + on the residual for unbiased inner product estimation (Theorem 2 in [1]; + same camera-ready caveat). The paper prescribes a full random orthogonal rotation (QR decomposition of a matrix with i.i.d. N(0,1) entries, yielding a Haar-uniform orthogonal matrix) @@ -284,8 +287,10 @@ described above. ### Stage 2: Block decomposition -For non-power-of-2 dimensions, split into blocks of size B (as determined by the -table above). Each full block gets an independent B-dim SORF rotation. +For dimensions where the block-size rule produces a valid B (see table above), +split into blocks of size B. Each full block gets an independent B-dim SORF +rotation. Dimensions with no qualifying B (e.g., d=96) remain on the padded +single-block path from Stage 1. **Changes vs. 
Stage 1:** @@ -369,10 +374,12 @@ line applies Theorem 1's **probabilistic** bound to each block and should be read as holding in **expectation** over independent per-block rotations, not almost surely: -```` +``` ‖x - x̂‖² / ‖x‖² = Σ_k (‖xₖ‖² / ‖x‖²) × (‖xₖ - x̂ₖ‖² / ‖xₖ‖²) (exact) E[...] ≤ MSE_bound × Σ_k (‖xₖ‖² / ‖x‖²) = MSE_bound (in expectation) -``` The conclusion: `E[‖x - x̂‖² / ‖x‖²] ≤ MSE_bound` assuming independent +``` + +The conclusion: `E[‖x - x̂‖² / ‖x‖²] ≤ MSE_bound` assuming independent per-block rotations. (Theorem 1 applies because each block is normalized to unit norm before rotation and quantization; the per-block encoding pipeline is: split → normalize → rotate → quantize, matching the theorem's unit-sphere @@ -435,7 +442,7 @@ centroids[code_bₖ[j]]`. #### Encoding algorithm -```` +``` Input: x ∈ ℝ^d, b_mse bits per coordinate, block_size B k = d / B (exact division, no straggler for chosen B) @@ -542,7 +549,7 @@ TQ block 0, dim (B - 1): [v0 v1 v2 ... v63] TQ block 1, dim 0: [v0 v1 v2 ... v63] ... -```` +``` The inner SIMD loop (64 vectors) has no inter-vector dependencies. TQ block boundaries only affect where norm weighting occurs — they don't affect the @@ -574,7 +581,7 @@ for tq_block in 0..k { unit_dots[v] = 0.0; // reset for next TQ block } } -```` +``` **Int8 layout variant.** The PDX implementation [pdx-impl] uses a different tiling for int8 data: "4 dims × 16 vecs" to leverage VPDPBUSD/UDOT hardware @@ -796,8 +803,8 @@ dense rotation at actual dimension. to merge MSE-only (no QJL). This is a complete encoding for all dimensions (with padding for non-power-of-2). -**Phase 2** — Block decomposition: Add block splitting for non-power-of-2 -dimensions. B = greatest power-of-2 ≥ 64 dividing d. Per-block norms stored as +**Phase 2** — Block decomposition: Add block splitting for dimensions where a +valid B exists (greatest power-of-2 ≥ 64 dividing d). Per-block norms stored as internal children. 
The `TurboQuantScheme::compress()` method must be updated to: (a) choose B based on d, (b) split input into blocks, (c) normalize per-block, (d) encode each block, and (e) store per-block norms as an internal child array. @@ -815,11 +822,11 @@ KV-cache community reports [8], this may not be pursued. For common model dimensions, the most promising configurations are: -| Dimension | Recommendation | Rationale | -| --------------------- | --------------------------- | -------------------------------------------------------------------------- | -| 512, 1024, 2048, 4096 | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout. | -| 768, 1536, 3072 | 3-block MSE-only + PDX | B=256 or 512. No padding waste. 3 blocks, shared centroids. | -| Arbitrary d (rare) | Padded single-block | Fall back to current approach. Padding overhead bounded by B-1 dims. | +| Dimension | Recommendation | Rationale | +| ---------------------- | --------------------------- | -------------------------------------------------------------------------- | +| 512, 1024, 2048, 4096 | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout. | +| 768, 1536, 3072 | 3-block MSE-only + PDX | B=256 or 512. No padding waste. 3 blocks, shared centroids. | +| No qualifying B (rare) | Padded single-block | Fall back to Stage 1 padded path. Padding overhead bounded by B-1 dims. | In all cases, MSE-only is the recommended starting point. 
QJL should only be
added if experiments demonstrate clear recall@k improvements for the target

From 6ea13784aba94da864332d17cf30722a741d3f12 Mon Sep 17 00:00:00 2001
From: Will Manning
Date: Fri, 3 Apr 2026 12:36:49 -0400
Subject: [PATCH 10/19] fifth pass with external reviewers

Signed-off-by: Will Manning

---
 proposed/0033-block-turboquant.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md
index 563f474..04fbd0e 100644
--- a/proposed/0033-block-turboquant.md
+++ b/proposed/0033-block-turboquant.md
@@ -249,9 +249,9 @@ always produces a valid B and eliminates padding entirely:
 - **Non-power-of-2 dimensions** (768, 1536, 3072) decompose into k=3 blocks at
   B=256 or B=512. No padding waste (vs. 33% for the padded single-block path).
   Each block has its own SORF rotation and shares a single centroid set.
-- **Stragglers are eliminated** for all common embedding dimensions. Dimensions
-  that are not multiples of 64 (e.g., 100, 200) would need straggler handling,
-  but these are rare in practice for modern model architectures.
+- **Dimensions with no qualifying B are rare.** When no power-of-2 ≥ 64 divides
+  d (e.g., 96, 100), the encoding falls back to Stage 1's padded single-block
+  path. Such dimensions are uncommon in modern model architectures.
 - **The SORF approximation at B=256+ is expected to be adequate**: 3 rounds at
   B=256 provides 24 butterfly stages, and at B=512 provides 27 — both comparable
   to the current B=1024 (30 stages). 
This needs empirical validation; see @@ -635,7 +635,8 @@ If pursued, four strategies should be compared: | Full-dim padded SORF | Approximate | O(d log d) total | 3×padded_d bits | | MSE-only (no QJL) | N/A | 0 | None | -The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] is proved specifically +The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] (same camera-ready +numbering caveat as Theorem 1) is proved specifically for Gaussian. SORF for QJL is an additional approximation (the [current implementation][current-impl] uses SORF for QJL). Per-block QJL can incur up to d/B times larger variance bound than full-dimension QJL (Lemma 4 @@ -792,10 +793,11 @@ datasets measure practical quality — which may be **better** than Gaussian (structured data benefits more from rotation) or **worse** (if the data has adversarial properties for the specific rotation). -### Straggler handling (if needed) +### Dimensions with no qualifying B -Rare for common dimensions. If encountered: zero-pad to B (simplest). Follow-up: -dense rotation at actual dimension. +Rare for common embedding dimensions (e.g., d=96). These fall back to the +Stage 1 padded single-block path (pad to next power-of-2, single SORF). No +block decomposition is attempted. ## Phasing @@ -826,7 +828,7 @@ For common model dimensions, the most promising configurations are: | ---------------------- | --------------------------- | -------------------------------------------------------------------------- | | 512, 1024, 2048, 4096 | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout. | | 768, 1536, 3072 | 3-block MSE-only + PDX | B=256 or 512. No padding waste. 3 blocks, shared centroids. | -| No qualifying B (rare) | Padded single-block | Fall back to Stage 1 padded path. Padding overhead bounded by B-1 dims. | +| No qualifying B (rare) | Padded single-block | Fall back to Stage 1: pad to next power-of-2, single SORF. 
| In all cases, MSE-only is the recommended starting point. QJL should only be added if experiments demonstrate clear recall@k improvements for the target From 8d03678d0e6cfc7e619fc85e1b22ae5e98b000f6 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 12:41:06 -0400 Subject: [PATCH 11/19] min dim 128 Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 52 +++++++++++++++++++++++++++++-- 1 file changed, 50 insertions(+), 2 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 04fbd0e..9de937e 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -113,7 +113,8 @@ L2 norm returns the stored norm directly (O(1) readthrough). **Compression scheme.** `TurboQuantScheme` implements the `Scheme` trait for the BtrBlocks cascading compressor. It matches `Vector` and `FixedShapeTensor` -extension arrays with non-nullable float elements and dimension ≥ 3, using the +extension arrays with non-nullable float elements and dimension ≥ 3 (to be +raised to ≥ 128 in Stage 1; see Minimum dimension below), using the default config (5-bit QJL = 4-bit MSE + 1-bit QJL, seed 42). **Input handling.** All float types (f16, f32, f64) are converted to f32 before @@ -257,6 +258,37 @@ always produces a valid B and eliminates padding entirely: to the current B=1024 (30 stages). This needs empirical validation; see Experimental plan. +### Minimum dimension + +The compression scheme should only select TurboQuant for vectors with +dimension ≥ 128. Below this threshold, several factors degrade quality and +efficiency: + +- **SORF mixing quality:** 3-round SORF at d=64 provides only 18 butterfly + stages (vs. 21 at d=128, 30 at d=1024). The coordinate distribution deviates + more from the analytical Beta, making Max-Lloyd centroids less optimal. 
+- **Practical MSE:** At smaller d, the Beta marginal is wider (variance ~1/d), + so the Max-Lloyd quantizer achieves distortion closer to the theoretical + bound — worse in absolute terms than at higher d. +- **Overhead ratio:** Per-vector norm (32 bits) is a larger fraction of the + compressed representation at small d. At d=32, b=5: norm is 20% of the + compressed size. At d=768: <1%. +- **Diminishing returns for high bit widths:** With fewer coordinates, the + fine-grained centroid structure of high-b quantization has less to exploit. + +The threshold of 128 is conservative: + +- d=128 (SIFT) is the smallest common embedding dimension. +- SORF at d=128 has 21 butterfly stages — tested and adequate in the current + implementation. +- The block-size rule produces B=128 for d=128 (single block, no decomposition). + +The array-level minimum remains d=3 (for the Beta distribution to be +well-defined), so users can still explicitly construct a TurboQuantArray at +smaller dimensions. The scheme minimum (128) controls automatic selection only. + +The exact threshold should be validated experimentally — see Experimental plan. + ### Stage 1: MSE-only TurboQuant (immediate — split from current PR) Split the [current PR][current-impl] to extract and merge the MSE-only subset. @@ -271,10 +303,11 @@ The QJL code can be preserved on a separate branch for Phase 4. | Scheme default | 5-bit QJL (4-bit MSE + 1-bit QJL) | **5-bit MSE-only** (32 centroids) | | Norms dtype | Always f32 | **Same-or-wider**: f64 for f64 input, f32 for f32/f16 | | Metadata | `has_qjl: bool` | **Removed** (always MSE-only) | +| Scheme minimum | dimension ≥ 3 | **dimension ≥ 128** (see Minimum dimension below) | **Unchanged from current PR:** SORF rotation, Max-Lloyd centroids, zero-padding for non-power-of-2, slice/take/scalar_at pushdowns, quantized -cosine similarity and dot product, compression scheme integration, minimum dim=3. +cosine similarity and dot product, compression scheme integration. 
**Added to metadata (for forward compat):** `block_size: u32` (always = padded_dim), `num_blocks: u32` (always = 1). These fields are inert in Stage 1 @@ -744,6 +777,21 @@ approach, despite more blocks, because each block is smaller. ## Experimental plan +### Minimum dimension threshold + +Test TurboQuant quality at d ∈ {32, 64, 96, 128, 256} to validate the scheme +minimum of 128: + +- Compare TurboQuant MSE distortion and ANN recall@k against scalar + quantization (SQ8, linear min-max to uint8) at the same compressed bit budget +- Plot the crossover point: at what d does TurboQuant's recall@k drop below SQ8? +- Test SORF coordinate distribution quality at each d (histogram vs. Beta) +- Measure overhead ratio (norm bits / total compressed bits) at each d + +The scheme minimum should be set at the smallest d where TurboQuant reliably +beats SQ8 on recall@k across the benchmarking datasets. The current proposal +of 128 is conservative; experiments may justify lowering to 64 or raising to 256. + ### MSE quality vs. block size - Compare actual normalized MSE at B ∈ {64, 128, 256, 512} vs. single-SORF at From f4c211e3f99298171a39ffaa613c9b80f576b676 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 12:44:22 -0400 Subject: [PATCH 12/19] sixth pass with external reviewers Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 38 ++++++++++++++++++------------- 1 file changed, 22 insertions(+), 16 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 9de937e..561b09a 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -268,8 +268,9 @@ efficiency: stages (vs. 21 at d=128, 30 at d=1024). The coordinate distribution deviates more from the analytical Beta, making Max-Lloyd centroids less optimal. 
- **Practical MSE:** At smaller d, the Beta marginal is wider (variance ~1/d), - so the Max-Lloyd quantizer achieves distortion closer to the theoretical - bound — worse in absolute terms than at higher d. + leading to higher absolute MSE at the same bit width b. The gap between + practical MSE and the theoretical upper bound is an empirical question at + each d. - **Overhead ratio:** Per-vector norm (32 bits) is a larger fraction of the compressed representation at small d. At d=32, b=5: norm is 20% of the compressed size. At d=768: <1%. @@ -278,7 +279,7 @@ efficiency: The threshold of 128 is conservative: -- d=128 (SIFT) is the smallest common embedding dimension. +- d=128 (SIFT) is the smallest dimension in our recommended benchmark table. - SORF at d=128 has 21 butterfly stages — tested and adequate in the current implementation. - The block-size rule produces B=128 for d=128 (single block, no decomposition). @@ -314,7 +315,9 @@ padded_dim), `num_blocks: u32` (always = 1). These fields are inert in Stage 1 but enable Stage 2 decoders to read Stage 1 files. (PDX is handled via the codes child type, not a metadata flag — see Stage 3.) -This is a complete, useful encoding for all dimensions. Power-of-2 dimensions +This is a complete, useful encoding for all dimensions ≥ 3 (automatic scheme +selection applies only for d ≥ 128; smaller d remains available via explicit +array construction). Power-of-2 dimensions have zero padding waste; non-power-of-2 dimensions have the padding overhead described above. @@ -783,14 +786,17 @@ Test TurboQuant quality at d ∈ {32, 64, 96, 128, 256} to validate the scheme minimum of 128: - Compare TurboQuant MSE distortion and ANN recall@k against scalar - quantization (SQ8, linear min-max to uint8) at the same compressed bit budget -- Plot the crossover point: at what d does TurboQuant's recall@k drop below SQ8? 
+ quantization at matched bit rates (e.g., linear min-max quantization at the + same bits-per-coordinate as TurboQuant's b_mse setting) +- Plot the crossover point: at what d does TurboQuant's recall@k drop below + rate-matched scalar quantization? - Test SORF coordinate distribution quality at each d (histogram vs. Beta) - Measure overhead ratio (norm bits / total compressed bits) at each d The scheme minimum should be set at the smallest d where TurboQuant reliably -beats SQ8 on recall@k across the benchmarking datasets. The current proposal -of 128 is conservative; experiments may justify lowering to 64 or raising to 256. +beats rate-matched scalar quantization on recall@k across the benchmarking +datasets. The current proposal of 128 is conservative; experiments may justify +lowering to 64 or raising to 256. ### MSE quality vs. block size @@ -820,14 +826,14 @@ representative of modern ANN workloads. **Recommended datasets:** -| Dataset | Dim | Size | Source | Why | -| ----------------------------- | ------ | ------ | ---------------- | ------------------------------------------------------ | -| Contriever | 768 | ~1M | PDX paper [4] | Key non-power-of-2 target; real embeddings | -| OpenAI text-embedding-3-large | 1536 | ~1M | Common in RAG | High-d production embeddings | -| SIFT | 128 | 1M | Classic | Low-d power-of-2 baseline, well-studied recall numbers | -| arXiv embeddings | 768 | 2.25M | PDX paper [4] | Same dim as Contriever, larger scale | -| DEEP | 96 | 10M | Image embeddings | Large scale; d=96 has no B ≥ 64 divisor → padded path | -| Synthetic Gaussian | varies | varies | Internal | Theory anchor / sanity check; not universal worst case | +| Dataset | Dim | Size | Source | Why | +| ----------------------------- | ------ | ------ | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | +| Contriever | 768 | ~1M | PDX paper [4] | Key non-power-of-2 
target; real embeddings | +| OpenAI text-embedding-3-large | 1536 | ~1M | Common in RAG | High-d production embeddings | +| SIFT | 128 | 1M | Classic | Low-d power-of-2 baseline, well-studied recall numbers | +| arXiv embeddings | 768 | 2.25M | PDX paper [4] | Same dim as Contriever, larger scale | +| DEEP | 96 | 10M | Image embeddings | Large scale; d=96 < scheme min (128) and has no B ≥ 64 — requires explicit TurboQuantArray construction or benchmark-only scheme override | +| Synthetic Gaussian | varies | varies | Internal | Theory anchor / sanity check; not universal worst case | **Metrics** (at b_mse ∈ {2, 3, 4, 5, 8}): From d649adbb7c7c1610a1cf08296959941133146dbc Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 12:47:29 -0400 Subject: [PATCH 13/19] another pass with external reviewers Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 561b09a..5d44ba4 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -10,7 +10,8 @@ We propose evolving the [TurboQuant vector quantization encoding][current-impl] in three stages: 1. **MSE-only TurboQuant** (immediate): merge the current PR as an MSE-only - encoding. This is a complete, self-contained building block. + encoding for d ≥ 128 (see Minimum dimension). This is a complete, + self-contained building block. 2. **Block decomposition** (next): for dimensions where a valid B exists (greatest power-of-2 ≥ 64 dividing d), split into blocks of size B. For power-of-2 dimensions, B = d (single block). 
Dimensions with no qualifying @@ -786,10 +787,11 @@ Test TurboQuant quality at d ∈ {32, 64, 96, 128, 256} to validate the scheme minimum of 128: - Compare TurboQuant MSE distortion and ANN recall@k against scalar - quantization at matched bit rates (e.g., linear min-max quantization at the - same bits-per-coordinate as TurboQuant's b_mse setting) + quantization matched on **total compressed bits per vector** (codes + norm + + amortized shared metadata), not just bits-per-coordinate — this is critical + at small d where norm overhead is significant - Plot the crossover point: at what d does TurboQuant's recall@k drop below - rate-matched scalar quantization? + the rate-matched scalar baseline? - Test SORF coordinate distribution quality at each d (histogram vs. Beta) - Measure overhead ratio (norm bits / total compressed bits) at each d @@ -856,8 +858,8 @@ block decomposition is attempted. ## Phasing **Phase 1** — MSE-only single-block TurboQuant: Split the [current PR][current-impl] -to merge MSE-only (no QJL). This is a complete encoding for all dimensions -(with padding for non-power-of-2). +to merge MSE-only (no QJL). Scheme auto-selects for d ≥ 128; smaller d available +via explicit construction. Padding for non-power-of-2 dimensions. **Phase 2** — Block decomposition: Add block splitting for dimensions where a valid B exists (greatest power-of-2 ≥ 64 dividing d). 
Per-block norms stored as From 4ab0b7924b170f9c4642a28fd835d79eca7a7282 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 12:51:25 -0400 Subject: [PATCH 14/19] another pass with external reviewers Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 25 +++++++++++++++---------- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 5d44ba4..3b6d67a 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -10,8 +10,9 @@ We propose evolving the [TurboQuant vector quantization encoding][current-impl] in three stages: 1. **MSE-only TurboQuant** (immediate): merge the current PR as an MSE-only - encoding for d ≥ 128 (see Minimum dimension). This is a complete, - self-contained building block. + encoding with d ≥ 128 scheme selection (see Minimum dimension; smaller d + available via explicit construction). This is a complete, self-contained + building block. 2. **Block decomposition** (next): for dimensions where a valid B exists (greatest power-of-2 ≥ 64 dividing d), split into blocks of size B. For power-of-2 dimensions, B = d (single block). Dimensions with no qualifying @@ -112,11 +113,11 @@ norms) while sharing rotation signs and centroids. Quantized cosine similarity and dot product operate directly on codes and centroids without decompression. L2 norm returns the stored norm directly (O(1) readthrough). -**Compression scheme.** `TurboQuantScheme` implements the `Scheme` trait for the -BtrBlocks cascading compressor. It matches `Vector` and `FixedShapeTensor` -extension arrays with non-nullable float elements and dimension ≥ 3 (to be -raised to ≥ 128 in Stage 1; see Minimum dimension below), using the -default config (5-bit QJL = 4-bit MSE + 1-bit QJL, seed 42). +**Compression scheme (pre-Stage 1).** `TurboQuantScheme` implements the `Scheme` +trait for the BtrBlocks cascading compressor. 
It matches `Vector` and +`FixedShapeTensor` extension arrays with non-nullable float elements and +dimension ≥ 3 (to be raised to ≥ 128 in Stage 1; see Minimum dimension below), +using the default config (5-bit QJL = 4-bit MSE + 1-bit QJL, seed 42). **Input handling.** All float types (f16, f32, f64) are converted to f32 before quantization. Per-vector L2 norms are computed and stored as f32. Non-power-of-2 @@ -796,9 +797,13 @@ minimum of 128: - Measure overhead ratio (norm bits / total compressed bits) at each d The scheme minimum should be set at the smallest d where TurboQuant reliably -beats rate-matched scalar quantization on recall@k across the benchmarking -datasets. The current proposal of 128 is conservative; experiments may justify -lowering to 64 or raising to 256. +beats the scalar baseline on recall@k across the benchmarking datasets. Default +scalar baseline: per-dimension linear min-max quantization at b bits per +coordinate plus an f32 norm (matching TurboQuant's norm overhead). Report +results at a reference N (e.g., N=100K vectors) where shared metadata is +amortized; optionally show sensitivity to small N where shared costs dominate. +The current proposal of 128 is conservative; experiments may justify lowering +to 64 or raising to 256. ### MSE quality vs. 
block size From 4c32a12093eab7ef896a56e934fb0fa686ab2cc2 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 12:56:47 -0400 Subject: [PATCH 15/19] another pass with external reviewers Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 42 ++++++++++++++++++++----------- 1 file changed, 27 insertions(+), 15 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 3b6d67a..664d8dd 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -63,7 +63,7 @@ differences are: | Quantization type | Scalar (per-coordinate, after rotation) | Vector (per-sub-vector, learned codebook) | | Codebook | Analytically derived from Beta distribution; **data-oblivious** | Learned via k-means on training data; **data-dependent** | | Rotation | Random orthogonal within each sub-vector | Typically none (OPQ [10] adds a learned rotation) | -| Theoretical guarantees | Provable MSE bound (Theorem 1 [1]) | Empirical quality only | +| Theoretical guarantees | Provable data-oblivious MSE bound (Theorem 1 [1]) | No comparable data-oblivious bound | | Codebook training | None (centroids derived from theory) | Requires training pass over data | | Bits per sub-vector | Scalar: b bits per coordinate | Vector: typically 8 bits per sub-vector (256 codewords) | @@ -119,8 +119,9 @@ trait for the BtrBlocks cascading compressor. It matches `Vector` and dimension ≥ 3 (to be raised to ≥ 128 in Stage 1; see Minimum dimension below), using the default config (5-bit QJL = 4-bit MSE + 1-bit QJL, seed 42). -**Input handling.** All float types (f16, f32, f64) are converted to f32 before -quantization. Per-vector L2 norms are computed and stored as f32. Non-power-of-2 +**Input handling (pre-Stage 1).** All float types (f16, f32, f64) are converted +to f32 before quantization. Per-vector L2 norms are computed and stored as f32 +(Stage 1 changes this to dtype-matching: f64 for f64 input). 
Non-power-of-2 dimensions are zero-padded to the next power of 2 for SORF compatibility. The minimum dimension is 3 (d=2 causes a singularity in the Beta distribution exponent). @@ -143,12 +144,17 @@ spacings (we cast to f32 before quantization). See [7] for the full list. There is an ambiguity in the paper's notation for the MSE bound constant. The formal proof gives `(√3 · π / 2) · 4^{-b}` where the constant √3·π/2 ≈ 2.72. -The Eviox report [7] interprets the notation as `√(3π)/2 ≈ 1.535`, but this is -incorrect: the measured distortion values from the paper (b=2: 0.117, b=3: 0.03) -exceed the putative `√(3π)/2` bound (b=2: 0.096, b=3: 0.024), confirming that -2.72 is the correct constant. The paper's "explicit values" (0.36, 0.117, 0.03, -0.009) are the actual computed distortion of the optimal quantizer, not the -bound itself — they are well below the 2.72/4^b bound. +The Eviox report [7] (Item 7) deliberately adopts the alternative parsing +`√(3π)/2 ≈ 1.535`, claiming it is "consistent with the formal proof." We treat +`√3·π/2 ≈ 2.72` as the theorem constant because: (a) the paper's prose +describes the constant as "≈ 2.7," which matches 2.72 not 1.535; and (b) the +paper's reported distortion values (b=2: 0.117, b=3: 0.03) exceed the 1.535- +based bound (b=2: 0.096, b=3: 0.024), ruling out `√(3π)/2` as a valid +**upper** bound on the measured quantity. The definitive resolution requires +checking the exact LaTeX grouping in the ICLR 2026 camera-ready proof. The +paper's "explicit values" (0.36, 0.117, 0.03, 0.009) are the actual computed +distortion of the optimal quantizer, not the bound itself — they are well below +the 2.72/4^b bound. ### Community findings on QJL @@ -269,13 +275,13 @@ efficiency: - **SORF mixing quality:** 3-round SORF at d=64 provides only 18 butterfly stages (vs. 21 at d=128, 30 at d=1024). The coordinate distribution deviates more from the analytical Beta, making Max-Lloyd centroids less optimal. 
-- **Practical MSE:** At smaller d, the Beta marginal is wider (variance ~1/d), - leading to higher absolute MSE at the same bit width b. The gap between - practical MSE and the theoretical upper bound is an empirical question at - each d. +- **Practical MSE:** At smaller d, the SORF mixing quality and coordinate- + independence approximations are weaker, potentially worsening practical + quantization quality beyond what the dimension-free theoretical bound + captures. The actual MSE at each d is an empirical question. - **Overhead ratio:** Per-vector norm (32 bits) is a larger fraction of the - compressed representation at small d. At d=32, b=5: norm is 20% of the - compressed size. At d=768: <1%. + compressed representation at small d. At d=32, b=5: codes=160 bits, + norm=32 bits, total=192 — norm is ~17% of compressed size. At d=768: <1%. - **Diminishing returns for high bit widths:** With fewer coordinates, the fine-grained centroid structure of high-b quantization has less to exploit. @@ -987,6 +993,12 @@ ID (`vortex.turboquant`). The metadata includes `block_size` and `num_blocks` fields from Stage 1 onward. Stage 1 always writes `num_blocks=1`, but the field exists so that Stage 2 decoders can read Stage 1 files without migration. +**Decoder invariant:** `block_size` is always the per-block SORF dimension B. +`codes.list_size` = `num_blocks × block_size`. In Stage 1, `num_blocks=1` and +`block_size = padded_dim`, so `codes.list_size = padded_dim`. In Stage 2 with +k>1, `block_size = B` (e.g., 256) and `codes.list_size = d` (e.g., 768). The +decoder reconstructs `k = codes.list_size / block_size`. + **Norms are always internal children.** The TurboQuant array is self-contained — it stores norms as a child slot, not in a parent encoding. 
This means: From b409f07c3d442c2bd6af41e10529211bf428508d Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 13:01:30 -0400 Subject: [PATCH 16/19] another pass with external reviewers Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 37 ++++++++++++++++++++----------- 1 file changed, 24 insertions(+), 13 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 664d8dd..3045ad6 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -39,13 +39,12 @@ embeddings. It works by: 1. Randomly rotating a unit-norm vector so that each coordinate follows a known marginal distribution — specifically `(1 - x²)^((d-3)/2)` on [-1, 1], a - concentrated Beta distribution (Lemma 1 in [1]; verify numbering against the - ICLR 2026 camera-ready if it differs from the arXiv version). + concentrated Beta distribution (Lemma 1 in [1]; numbering per arXiv v1). 2. Applying an MSE-optimal scalar quantizer (Max-Lloyd centroids) independently to each coordinate. 3. Optionally adding a 1-bit QJL (Quantized Johnson-Lindenstrauss) correction on the residual for unbiased inner product estimation (Theorem 2 in [1]; - same camera-ready caveat). + numbering per arXiv v1). The paper prescribes a full random orthogonal rotation (QR decomposition of a matrix with i.i.d. N(0,1) entries, yielding a Haar-uniform orthogonal matrix) @@ -400,8 +399,7 @@ norm = 0, decode as all zeros. 
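The data-oblivious centroid derivation sketched in the hunk above — Max-Lloyd on the Lemma 1 marginal `(1 - x²)^((d-3)/2)` — is easy to reproduce numerically. The following is a toy discretized version for intuition only, assuming nothing about the actual Vortex implementation (function names, grid size, and iteration count are ours):

```python
def beta_marginal(d, n=4001):
    # Discretize the Lemma 1 coordinate marginal (1 - x^2)^((d-3)/2) on [-1, 1]
    # and normalize it into a probability mass function on the grid.
    xs = [-1 + 2 * i / (n - 1) for i in range(n)]
    w = [(1 - x * x) ** ((d - 3) / 2) for x in xs]
    z = sum(w)
    return xs, [wi / z for wi in w]

def lloyd_max_centroids(d, b, iters=300):
    # 1-D Lloyd-Max: alternate nearest-centroid partitioning (cell boundaries
    # are midpoints between adjacent centroids) with mass-weighted updates.
    xs, p = beta_marginal(d)
    k = 2 ** b
    cents = [-1 + 2 * (j + 0.5) / k for j in range(k)]  # evenly spaced init
    for _ in range(iters):
        bounds = [(cents[j] + cents[j + 1]) / 2 for j in range(k - 1)]
        mass, moment = [0.0] * k, [0.0] * k
        j = 0
        for x, px in zip(xs, p):  # xs ascending, so j only moves forward
            while j < k - 1 and x > bounds[j]:
                j += 1
            mass[j] += px
            moment[j] += px * x
        cents = [moment[j] / mass[j] if mass[j] > 0 else cents[j]
                 for j in range(k)]
    return cents

cents = lloyd_max_centroids(d=128, b=2)
print([round(c, 4) for c in cents])  # four centroids, symmetric about 0
```

Because the marginal is analytic, this derivation needs no training data — the "zero indexing time" property from the PQ comparison. The marginal's variance is exactly 1/d (each coordinate of a uniform unit vector contributes equally to ‖x‖² = 1), which is why the centroids concentrate near zero at large d.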
#### Theoretical MSE bound -The paper's MSE bound (Theorem 1 in [1]; verify theorem numbering against the -ICLR 2026 camera-ready if it differs from the arXiv version) is: +The paper's MSE bound (Theorem 1 in [1]; numbering per arXiv v1) is: ``` E[‖x - x̂‖² / ‖x‖²] ≤ (√3 · π / 2) / 4^b ≈ 2.72 / 4^b @@ -679,8 +677,8 @@ If pursued, four strategies should be compared: | Full-dim padded SORF | Approximate | O(d log d) total | 3×padded_d bits | | MSE-only (no QJL) | N/A | 0 | None | -The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] (same camera-ready -numbering caveat as Theorem 1) is proved specifically +The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] (numbering per arXiv +v1) is proved specifically for Gaussian. SORF for QJL is an additional approximation (the [current implementation][current-impl] uses SORF for QJL). Per-block QJL can incur up to d/B times larger variance bound than full-dimension QJL (Lemma 4 @@ -811,14 +809,22 @@ amortized; optionally show sensitivity to small N where shared costs dominate. The current proposal of 128 is conservative; experiments may justify lowering to 64 or raising to 256. -### MSE quality vs. block size +### MSE quality and scan performance vs. block size - Compare actual normalized MSE at B ∈ {64, 128, 256, 512} vs. single-SORF at padded dimension, at bit widths b ∈ {2, 3, 4, 5, 8} +- Compare ANN recall@k and scan throughput at fixed d (e.g., d=3072) across + B ∈ {256, 512, 1024} — smaller B gives more pruning checkpoints for + ADSampling-style early termination but increases norm overhead - Test SORF coordinate distribution at each B: histogram vs. analytical Beta - Test 3, 4, 5 SORF rounds at each B - Determine if the practical MSE constant is worse at smaller B +The block-size rule ("greatest qualifying B") is a starting heuristic that +maximizes per-block quality and minimizes norm count. 
Experiments may show that +smaller B with more pruning checkpoints yields better end-to-end scan +performance despite higher per-block overhead. + ### QJL strategy comparison (if pursued) - Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim padded SORF QJL @@ -904,7 +910,8 @@ workload. ## Future work: GPU decode and fused distance computation The B-dim block structure maps naturally to GPU tile sizes and tensor cores. -For a batch of N vectors sharing the same rotation matrix R⁻¹: +For a single block (k=1; Stage 2 generalizes to k independent per-block GEMMs) +with a batch of N vectors sharing the same rotation matrix R⁻¹: ``` decoded_batch = diag(norms) × R⁻¹ × codebook_lookup_batch(codes) @@ -994,10 +1001,14 @@ fields from Stage 1 onward. Stage 1 always writes `num_blocks=1`, but the field exists so that Stage 2 decoders can read Stage 1 files without migration. **Decoder invariant:** `block_size` is always the per-block SORF dimension B. -`codes.list_size` = `num_blocks × block_size`. In Stage 1, `num_blocks=1` and -`block_size = padded_dim`, so `codes.list_size = padded_dim`. In Stage 2 with -k>1, `block_size = B` (e.g., 256) and `codes.list_size = d` (e.g., 768). The -decoder reconstructs `k = codes.list_size / block_size`. +`codes.list_size` = `num_blocks × block_size`. The decoder reconstructs +`k = codes.list_size / block_size`. Note that `metadata.dimension` may differ +from `codes.list_size`: + +- Stage 1, non-power-of-2 d: `dimension=768`, `block_size=1024` (padded), + `list_size=1024`. `dimension < list_size` is expected; trailing code slots + are structural zeros from padding. +- Stage 2, no stragglers: `dimension = list_size = num_blocks × block_size`. **Norms are always internal children.** The TurboQuant array is self-contained — it stores norms as a child slot, not in a parent encoding. 
This means: From 757089881e8ec23a2bdf35a9bc7b1e0559209f32 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 13:01:43 -0400 Subject: [PATCH 17/19] prettier Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 3045ad6..5fd85be 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -62,7 +62,7 @@ differences are: | Quantization type | Scalar (per-coordinate, after rotation) | Vector (per-sub-vector, learned codebook) | | Codebook | Analytically derived from Beta distribution; **data-oblivious** | Learned via k-means on training data; **data-dependent** | | Rotation | Random orthogonal within each sub-vector | Typically none (OPQ [10] adds a learned rotation) | -| Theoretical guarantees | Provable data-oblivious MSE bound (Theorem 1 [1]) | No comparable data-oblivious bound | +| Theoretical guarantees | Provable data-oblivious MSE bound (Theorem 1 [1]) | No comparable data-oblivious bound | | Codebook training | None (centroids derived from theory) | Requires training pass over data | | Bits per sub-vector | Scalar: b bits per coordinate | Vector: typically 8 bits per sub-vector (256 codewords) | From f385bdb13e81967ee82968a90ed63c57e6c66a94 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 13:09:10 -0400 Subject: [PATCH 18/19] another pass with external reviewers Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 24 ++++++++++++++---------- 1 file changed, 14 insertions(+), 10 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 5fd85be..1c6819d 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -39,12 +39,11 @@ embeddings. It works by: 1. 
Randomly rotating a unit-norm vector so that each coordinate follows a known marginal distribution — specifically `(1 - x²)^((d-3)/2)` on [-1, 1], a - concentrated Beta distribution (Lemma 1 in [1]; numbering per arXiv v1). + concentrated Beta distribution (Lemma 1 in [1]). 2. Applying an MSE-optimal scalar quantizer (Max-Lloyd centroids) independently to each coordinate. 3. Optionally adding a 1-bit QJL (Quantized Johnson-Lindenstrauss) correction - on the residual for unbiased inner product estimation (Theorem 2 in [1]; - numbering per arXiv v1). + on the residual for unbiased inner product estimation (Theorem 2 in [1]). The paper prescribes a full random orthogonal rotation (QR decomposition of a matrix with i.i.d. N(0,1) entries, yielding a Haar-uniform orthogonal matrix) @@ -227,7 +226,7 @@ could skip entire TQ blocks (B dimensions at a time) if the partial distance already exceeds the candidate threshold. This combines the storage efficiency of quantization with the computational savings of early termination. -[pdx-impl]: https://github.com/cwida/PDX +[pdx-impl]: https://github.com/cwida/PDX (specific files: `include/pdx/quantizers/scalar.hpp` for SQ8, `include/pdx/pruners/adsampling.hpp` for ADSampling/DCT, `include/pdx/layout.hpp` for int8 interleaving, `include/pdx/distance_computers/avx512_computers.hpp` for VPDPBUSD kernels) ## Proposal @@ -399,7 +398,7 @@ norm = 0, decode as all zeros. 
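The constant disambiguation argued earlier in this RFC (√3·π/2 ≈ 2.72 versus the alternative parsing √(3π)/2 ≈ 1.535) reduces to arithmetic that can be machine-checked against the paper's reported distortion values:

```python
import math

c_proof = math.sqrt(3) * math.pi / 2   # sqrt(3)*pi/2 ~= 2.7207
c_alt = math.sqrt(3 * math.pi) / 2     # sqrt(3*pi)/2 ~= 1.5350

# Distortion values reported in the paper for b = 1..4.
measured = {1: 0.36, 2: 0.117, 3: 0.03, 4: 0.009}

for b, mse in sorted(measured.items()):
    print(b, mse, round(c_alt / 4 ** b, 4), round(c_proof / 4 ** b, 4))
# At b=2 and b=3 the measured distortion exceeds c_alt / 4^b, so sqrt(3*pi)/2
# cannot be a valid upper bound; every value sits below c_proof / 4^b.
```

This is exactly the consistency argument made in the "MSE bound constant" discussion: the measured quantities rule out the smaller constant as an upper bound while remaining compatible with 2.72.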
#### Theoretical MSE bound -The paper's MSE bound (Theorem 1 in [1]; numbering per arXiv v1) is: +The paper's MSE bound (Theorem 1 in [1]) is: ``` E[‖x - x̂‖² / ‖x‖²] ≤ (√3 · π / 2) / 4^b ≈ 2.72 / 4^b @@ -677,8 +676,7 @@ If pursued, four strategies should be compared: | Full-dim padded SORF | Approximate | O(d log d) total | 3×padded_d bits | | MSE-only (no QJL) | N/A | 0 | None | -The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] (numbering per arXiv -v1) is proved specifically +The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] is proved specifically for Gaussian. SORF for QJL is an additional approximation (the [current implementation][current-impl] uses SORF for QJL). Per-block QJL can incur up to d/B times larger variance bound than full-dimension QJL (Lemma 4 @@ -686,7 +684,9 @@ incur up to d/B times larger variance bound than full-dimension QJL (Lemma 4 Community reports indicate MSE-only often wins for KV-cache attention at all tested bit widths [8]. Whether this extends to ANN ranking is an empirical -question (see Experimental plan); QJL may not be worth the complexity. +question (see Experimental plan); QJL may not be worth the complexity. Note: +the [current PR][current-impl] flags a known SORF-related QJL bias for +non-power-of-2 padded dimensions (#7245); MSE-only Stage 1 avoids this path. ## Array layout @@ -1001,8 +1001,9 @@ fields from Stage 1 onward. Stage 1 always writes `num_blocks=1`, but the field exists so that Stage 2 decoders can read Stage 1 files without migration. **Decoder invariant:** `block_size` is always the per-block SORF dimension B. -`codes.list_size` = `num_blocks × block_size`. The decoder reconstructs -`k = codes.list_size / block_size`. Note that `metadata.dimension` may differ +`codes.list_size` = `num_blocks × block_size`. The decoder **validates** +`num_blocks == codes.list_size / block_size` (exact integer division; reject +files where this does not hold). 
Note that `metadata.dimension` may differ from `codes.list_size`: - Stage 1, non-power-of-2 d: `dimension=768`, `block_size=1024` (padded), @@ -1042,6 +1043,9 @@ written by earlier stages are always readable by later decoders. ## References +*All lemma, theorem, and definition numbers for [1] refer to arXiv:2504.19874v1. +The ICLR 2026 camera-ready proceedings may use different numbering.* + [1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate." ICLR 2026. arXiv:2504.19874, April 2025. From 4abedb21da38b6190513a2b3ca01019a1b194e77 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 13:09:16 -0400 Subject: [PATCH 19/19] prettier Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 1c6819d..9a53d81 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -226,7 +226,7 @@ could skip entire TQ blocks (B dimensions at a time) if the partial distance already exceeds the candidate threshold. This combines the storage efficiency of quantization with the computational savings of early termination. -[pdx-impl]: https://github.com/cwida/PDX (specific files: `include/pdx/quantizers/scalar.hpp` for SQ8, `include/pdx/pruners/adsampling.hpp` for ADSampling/DCT, `include/pdx/layout.hpp` for int8 interleaving, `include/pdx/distance_computers/avx512_computers.hpp` for VPDPBUSD kernels) +[pdx-impl]: https://github.com/cwida/PDX "specific files: `include/pdx/quantizers/scalar.hpp` for SQ8, `include/pdx/pruners/adsampling.hpp` for ADSampling/DCT, `include/pdx/layout.hpp` for int8 interleaving, `include/pdx/distance_computers/avx512_computers.hpp` for VPDPBUSD kernels" ## Proposal @@ -1043,8 +1043,8 @@ written by earlier stages are always readable by later decoders. 
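The decoder invariant above can be expressed as a small validation sketch. This is schematic Python, not the Vortex decoder — the function name and plain-argument signature are ours — but it encodes the same checks (exact integer division, reconstructed k, and `dimension ≤ codes.list_size`):

```python
def validate_layout(dimension, block_size, num_blocks, codes_list_size):
    """Check the TurboQuant layout invariants and return the block count k."""
    # Invariant: codes.list_size = num_blocks * block_size, exactly.
    if codes_list_size % block_size != 0:
        raise ValueError("codes.list_size must be a multiple of block_size")
    if codes_list_size // block_size != num_blocks:
        raise ValueError("num_blocks != codes.list_size / block_size")
    # dimension may be smaller than codes.list_size (Stage 1 padding writes
    # structural zeros), but never larger: every coordinate needs a code slot.
    if dimension > codes_list_size:
        raise ValueError("dimension exceeds codes.list_size")
    return codes_list_size // block_size  # reconstructed k

# Stage 1, non-power-of-2 d: padded single block.
assert validate_layout(dimension=768, block_size=1024,
                       num_blocks=1, codes_list_size=1024) == 1
# Stage 2, no stragglers: d=768 split into three B=256 blocks.
assert validate_layout(dimension=768, block_size=256,
                       num_blocks=3, codes_list_size=768) == 3
```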
## References -*All lemma, theorem, and definition numbers for [1] refer to arXiv:2504.19874v1. -The ICLR 2026 camera-ready proceedings may use different numbering.* +_All lemma, theorem, and definition numbers for [1] refer to arXiv:2504.19874v1. +The ICLR 2026 camera-ready proceedings may use different numbering._ [1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate." ICLR 2026.
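As a companion to the block-size rule used throughout these patches ("greatest power-of-2 ≥ 64 dividing d"), the rule has a one-line bit-trick form. This sketch is illustrative only (the function name is ours): `d & -d` isolates the lowest set bit of d, which is exactly the largest power of 2 dividing d.

```python
def block_size(d, min_block=64):
    """Greatest power-of-2 >= min_block dividing d, or None if none exists
    (e.g. d=96, whose largest power-of-2 divisor is 32 -> padded path)."""
    b = d & -d  # lowest set bit = largest power-of-2 factor of d
    return b if b >= min_block else None

assert block_size(768) == 256    # 768 = 3 * 256
assert block_size(1536) == 512   # 1536 = 3 * 512
assert block_size(128) == 128    # power-of-2 dimension: single block, B = d
assert block_size(96) is None    # no qualifying B; falls back to padding
```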