StatPopGen: per-column compression analysis for the Default (no Pco/Zstd) writer — GT is the big win

## Summary

Analysis of the `statpopgen` benchmark file (gnomAD v3.1.2 HGDP-1kG chr21, first 100,000 variants — the CI default `scale_factor=100`) focused on the **Default** write strategy, i.e. the stable BtrBlocks scheme set **without Pco or Zstd**.

Sizes today:

| Format | Size |
|---|---|
| Source Parquet | **1404 MB** |
| Vortex **Default** | **1996 MB** ⚠️ larger than Parquet |
| Vortex **Compact** (Pco+Zstd) | 952 MB (reference only — uses Pco/Zstd) |

The Default file is larger than Parquet because the Default scheme set excludes Pco/Zstd, so the heavy per-sample FORMAT columns fall back to plain FastLanes bit-packing / `sparse`. The **9 per-sample FORMAT columns are ~88% of the file**; the ~1160 INFO columns + 7 core fields together are only ~250 MB and are already well-encoded (dict / sparse / constant / ALP / FSST). So all of the addressable headroom is in the FORMAT columns.

> Methodology note: real-data stats below were sampled **uniformly across all 100k rows** (every 50th/200th row). The first ~1000 chr21 rows are a low-coverage telomeric region and are pathologically compressible, so they must not be used as a sample.

## Per-column classification (Default, no Pco/Zstd)

| Col | Default MB | dtype | Real data (sampled) | Current encoding | Family → improvement | Est. Δ (MB) |
|---|--:|---|---|---|---|--:|
| **GT** | 304.9 | list(u8) | {null 21%, 0 ≈73%, 1/2 ≈6%}, 3 distinct, **run 5.28** | `sparse(fill=0)` + bit-packed **u32** indices | **A: low-card + high-null + runs** → sorted Dict, nullable codes (2-bit), **RLE** codes | **−225…−265** |
| **PGT** | 5.9 | list(i32) | {null 57%, 0/1/2}, 3 distinct, **run 7.86** | sparse/bitpack | **A** (same family as GT) | −4 |
| **MIN_DP** | 359.8 | list(i32) | 0..1296 (11b), 1045 distinct, 7.6% null, run 1.06 | FastLanes bitpack (+FoR/patches) | **B: noisy ints, no runs** → ≈bitpack floor; tighter per-block FoR only | −5…−15 |
| **DP** | 353.1 | list(i32) | 0..1296 (11b), 1089 distinct, ~0% null, run 1.04 | bitpack (+patches) | **B** → near floor | −5…−15 |
| **GQ** | 275.8 | list(i32) | 0..99 (**7b**), 100 distinct, 0% null, run 1.55 | bitpack | **B** → 7-bit FoR already tight | −3…−8 |
| **AD** | 190.5 | list(list(i32)) | 0..767 (10b), 524 distinct, 0% null, **run 1.01** | bitpack (+patches) | **B** → near floor | −3…−10 |
| **PL** | 152.9 | list(list(i32)) | **0..102103 (17b)**, 10,989 distinct, 0% null, run 1.00 | bitpack — likely **too wide** for rare outliers | **C: skewed wide-range** → FoR(~9b)+patch outliers (phred 0 dominates) | **−30…−70** (uncertain) |
| **SB** | 80.8 | list(list(i32)) | 0..437 (9b), 322 distinct, 0% null, run 1.0 | bitpack | **B** → near floor | −2…−6 |
| **PID** | 22.9 | list(utf8) | highly repetitive phasing IDs | dict/bitpack | **D: repetitive strings** → FSST+dict | −12…−16 |
| **vep** | 12.4 | utf8 | free-text annotation | dict | **D** → FSST | −9…−11 |

## The four families

**A — GT + PGT (the big win).** Low cardinality {0,1,2}, high null, **and runs** (5.3 / 7.9 mean run length). The current `sparse(fill=0)` is a poor model at ~24.7% non-fill: it stores ~8.6M u32 patch indices/chunk → **GT = 305 MB** in Default (Pco only rescues this to 33.6 MB in Compact). Proposed encoding — directly expressible with Vortex `DictArray` (`{codes, values}` with `is_nullable_codes`):

- `values` = **sorted** dictionary `[0,1,2]` (3 × u8)
- `codes` = 2-bit packed, with **validity as a child** (nullable codes) carrying the ~21% nulls — keeps the dictionary a pure, ordered numeric domain
- **RLE** on the codes exploits the runs (0 ≈ 73%)

Sizing (no Pco/Zstd): dense 2-bit codes ≈ 104 MB + validity bitmap ≈ 52 MB = ~156 MB even with **zero** entropy coding (already ~2× better than 305 MB); RLE on the runs should push GT toward ~40–80 MB. A **sorted** dictionary additionally makes codes order-preserving (`code_i < code_j ⇔ value_i < value_j`), which restores predicate pushdown (`GT = 2`, `GT >= 1`) and zone-map min/max pruning on the packed codes — perf the `sparse` path loses. Likely also wants a `values_are_sorted` flag on `DictArray` to advertise that property to the query layer.

**B — DP / MIN_DP / GQ / AD / SB (~1,260 MB, stuck).** Noisy small-range integers with **run length ≈ 1**, so no RLE; the range already sets the bit width. The entropy gain you'd want here is exactly Pco. Without it, the only stable lever is tighter per-block FoR via smaller row blocks (see Layout below), and it's small. A prior micro-benchmark confirmed naive global-width FoR+bitpack is *worse* than what Default already does (e.g. DP 453 vs 353 MB).

**C — PL.** The one Family-B exception worth measuring: 17-bit range but phred-likelihoods are mostly 0/small with rare huge values, so a FoR(~9-bit)+patch-outlier split could beat a wide bitpack.

**D — PID + vep (strings).** FSST is the no-Zstd substitute; ~−25 MB combined.

## Layout note (orthogonal to encoding)

The 9 FORMAT columns are written as **13 chunks of ~10–40 MB each** (8192 rows × ~4151 samples/row), 10–40× the writer's own ~1 MB segment target. List **offsets are fine** — every row has exactly 4,151 genotypes (constant stride), so offsets compress to ~nothing. A field-aware smaller `row_block_size` (~512–1024 rows) for the per-sample list columns would bring segments toward the target and improve random access, read concurrency, and pruning resolution. This is roughly **size-neutral** — a perf/layout change, not a compression lever — and may give small Family-B gains via tighter per-block FoR.

## Bottom line

- **GT (Family A) is the single biggest Default-only win: ~−240 MB**, achievable with stable schemes (sorted Dict + nullable codes + RLE), and it *improves* query perf via order-preserving codes.
- Strings (Family D) add ~−25 MB; PGT/PL add some more.
- **Net realistic Default reduction ≈ −300…−380 MB → ~1620–1700 MB.** Default stays **above Parquet** because Family B (~1,260 MB of noisy depth/quality integers) is entropy-bound — no stable scheme compresses random ~9–11-bit integers further without Pco/Zstd.

## Proposed work items

1. **Dense low-cardinality encoding for GT/PGT**: make a sorted `Dict` (nullable codes, 2-bit, RLE'd) a candidate that beats `Sparse` when the non-fill fraction is high (e.g. > ~10–15%); add a `values_are_sorted` flag for pushdown. *(largest impact)*
2. **FSST for PID/vep** in the Default scheme set.
3. **PL outlier-split** (FoR + patch list) — measure first.
4. **Field-aware FORMAT `row_block_size`** for segment granularity (perf).

> Δ figures are current estimates grounded in measured per-column real data; prototypes quantifying GT (item 1), the noisy integers + strings, and chunk sizing are in progress and will firm these up.

### Related
While reproducing the CI default, the `myrrc/statpopgen-u8` branch (GT schema → `list(u8)`) crashed data-gen: the schema was changed to `UInt8` but the Arrow builder still emitted `UInt64` (`expected List(UInt8) but found List(UInt64)`). Fixed by emitting `u8` in the GnomAD builder / `parse_genotype`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StatPopGen: per-column compression analysis for the Default (no Pco/Zstd) writer — GT is the big win #8229

Summary

Per-column classification (Default, no Pco/Zstd)

The four families

Layout note (orthogonal to encoding)

Bottom line

Proposed work items

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Format	Size
Source Parquet	1404 MB
Vortex Default	1996 MB ⚠️ larger than Parquet
Vortex Compact (Pco+Zstd)	952 MB (reference only — uses Pco/Zstd)

Col	Default MB	dtype	Real data (sampled)	Current encoding	Family → improvement	Est. Δ (MB)
GT	304.9	list(u8)	{null 21%, 0 ≈73%, 1/2 ≈6%}, 3 distinct, run 5.28	`sparse(fill=0)` + bit-packed u32 indices	A: low-card + high-null + runs → sorted Dict, nullable codes (2-bit), RLE codes	−225…−265
PGT	5.9	list(i32)	{null 57%, 0/1/2}, 3 distinct, run 7.86	sparse/bitpack	A (same family as GT)	−4
MIN_DP	359.8	list(i32)	0..1296 (11b), 1045 distinct, 7.6% null, run 1.06	FastLanes bitpack (+FoR/patches)	B: noisy ints, no runs → ≈bitpack floor; tighter per-block FoR only	−5…−15
DP	353.1	list(i32)	0..1296 (11b), 1089 distinct, ~0% null, run 1.04	bitpack (+patches)	B → near floor	−5…−15
GQ	275.8	list(i32)	0..99 (7b), 100 distinct, 0% null, run 1.55	bitpack	B → 7-bit FoR already tight	−3…−8
AD	190.5	list(list(i32))	0..767 (10b), 524 distinct, 0% null, run 1.01	bitpack (+patches)	B → near floor	−3…−10
PL	152.9	list(list(i32))	0..102103 (17b), 10,989 distinct, 0% null, run 1.00	bitpack — likely too wide for rare outliers	C: skewed wide-range → FoR(~9b)+patch outliers (phred 0 dominates)	−30…−70 (uncertain)
SB	80.8	list(list(i32))	0..437 (9b), 322 distinct, 0% null, run 1.0	bitpack	B → near floor	−2…−6
PID	22.9	list(utf8)	highly repetitive phasing IDs	dict/bitpack	D: repetitive strings → FSST+dict	−12…−16
vep	12.4	utf8	free-text annotation	dict	D → FSST	−9…−11

StatPopGen: per-column compression analysis for the Default (no Pco/Zstd) writer — GT is the big win #8229

Description

Summary

Per-column classification (Default, no Pco/Zstd)

The four families

Layout note (orthogonal to encoding)

Bottom line

Proposed work items

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions