Skip to content

StatPopGen: per-column compression analysis for the Default (no Pco/Zstd) writer — GT is the big win #8229

@joseph-isaacs

Description

@joseph-isaacs

Summary

Analysis of the statpopgen benchmark file (gnomAD v3.1.2 HGDP-1kG chr21, first 100,000 variants — the CI default scale_factor=100) focused on the Default write strategy, i.e. the stable BtrBlocks scheme set without Pco or Zstd.

Sizes today:

Format Size
Source Parquet 1404 MB
Vortex Default 1996 MB ⚠️ larger than Parquet
Vortex Compact (Pco+Zstd) 952 MB (reference only — uses Pco/Zstd)

The Default file is larger than Parquet because the Default scheme set excludes Pco/Zstd, so the heavy per-sample FORMAT columns fall back to plain FastLanes bit-packing / sparse. The 9 per-sample FORMAT columns are ~88% of the file; the ~1160 INFO columns + 7 core fields together are only ~250 MB and are already well-encoded (dict / sparse / constant / ALP / FSST). So all of the addressable headroom is in the FORMAT columns.

Methodology note: real-data stats below were sampled uniformly across all 100k rows (every 50th/200th row). The first ~1000 chr21 rows are a low-coverage telomeric region and are pathologically compressible, so they must not be used as a sample.

Per-column classification (Default, no Pco/Zstd)

Col Default MB dtype Real data (sampled) Current encoding Family → improvement Est. Δ (MB)
GT 304.9 list(u8) {null 21%, 0 ≈73%, 1/2 ≈6%}, 3 distinct, run 5.28 sparse(fill=0) + bit-packed u32 indices A: low-card + high-null + runs → sorted Dict, nullable codes (2-bit), RLE codes −225…−265
PGT 5.9 list(i32) {null 57%, 0/1/2}, 3 distinct, run 7.86 sparse/bitpack A (same family as GT) −4
MIN_DP 359.8 list(i32) 0..1296 (11b), 1045 distinct, 7.6% null, run 1.06 FastLanes bitpack (+FoR/patches) B: noisy ints, no runs → ≈bitpack floor; tighter per-block FoR only −5…−15
DP 353.1 list(i32) 0..1296 (11b), 1089 distinct, ~0% null, run 1.04 bitpack (+patches) B → near floor −5…−15
GQ 275.8 list(i32) 0..99 (7b), 100 distinct, 0% null, run 1.55 bitpack B → 7-bit FoR already tight −3…−8
AD 190.5 list(list(i32)) 0..767 (10b), 524 distinct, 0% null, run 1.01 bitpack (+patches) B → near floor −3…−10
PL 152.9 list(list(i32)) 0..102103 (17b), 10,989 distinct, 0% null, run 1.00 bitpack — likely too wide for rare outliers C: skewed wide-range → FoR(~9b)+patch outliers (phred 0 dominates) −30…−70 (uncertain)
SB 80.8 list(list(i32)) 0..437 (9b), 322 distinct, 0% null, run 1.0 bitpack B → near floor −2…−6
PID 22.9 list(utf8) highly repetitive phasing IDs dict/bitpack D: repetitive strings → FSST+dict −12…−16
vep 12.4 utf8 free-text annotation dict D → FSST −9…−11

The four families

A — GT + PGT (the big win). Low cardinality {0,1,2}, high null, and runs (5.3 / 7.9 mean run length). The current sparse(fill=0) is a poor model at ~24.7% non-fill: it stores ~8.6M u32 patch indices/chunk → GT = 305 MB in Default (Pco only rescues this to 33.6 MB in Compact). Proposed encoding — directly expressible with Vortex DictArray ({codes, values} with is_nullable_codes):

  • values = sorted dictionary [0,1,2] (3 × u8)
  • codes = 2-bit packed, with validity as a child (nullable codes) carrying the ~21% nulls — keeps the dictionary a pure, ordered numeric domain
  • RLE on the codes exploits the runs (0 ≈ 73%)

Sizing (no Pco/Zstd): dense 2-bit codes ≈ 104 MB + validity bitmap ≈ 52 MB = ~156 MB even with zero entropy coding (already ~2× better than 305 MB); RLE on the runs should push GT toward ~40–80 MB. A sorted dictionary additionally makes codes order-preserving (code_i < code_j ⇔ value_i < value_j), which restores predicate pushdown (GT = 2, GT >= 1) and zone-map min/max pruning on the packed codes — perf the sparse path loses. Likely also wants a values_are_sorted flag on DictArray to advertise that property to the query layer.

B — DP / MIN_DP / GQ / AD / SB (~1,260 MB, stuck). Noisy small-range integers with run length ≈ 1, so no RLE; the range already sets the bit width. The entropy gain you'd want here is exactly Pco. Without it, the only stable lever is tighter per-block FoR via smaller row blocks (see Layout below), and it's small. A prior micro-benchmark confirmed naive global-width FoR+bitpack is worse than what Default already does (e.g. DP 453 vs 353 MB).

C — PL. The one Family-B exception worth measuring: 17-bit range but phred-likelihoods are mostly 0/small with rare huge values, so a FoR(~9-bit)+patch-outlier split could beat a wide bitpack.

D — PID + vep (strings). FSST is the no-Zstd substitute; ~−25 MB combined.

Layout note (orthogonal to encoding)

The 9 FORMAT columns are written as 13 chunks of ~10–40 MB each (8192 rows × ~4151 samples/row), 10–40× the writer's own ~1 MB segment target. List offsets are fine — every row has exactly 4,151 genotypes (constant stride), so offsets compress to ~nothing. A field-aware smaller row_block_size (~512–1024 rows) for the per-sample list columns would bring segments toward the target and improve random access, read concurrency, and pruning resolution. This is roughly size-neutral — a perf/layout change, not a compression lever — and may give small Family-B gains via tighter per-block FoR.

Bottom line

  • GT (Family A) is the single biggest Default-only win: ~−240 MB, achievable with stable schemes (sorted Dict + nullable codes + RLE), and it improves query perf via order-preserving codes.
  • Strings (Family D) add ~−25 MB; PGT/PL add some more.
  • Net realistic Default reduction ≈ −300…−380 MB → ~1620–1700 MB. Default stays above Parquet because Family B (~1,260 MB of noisy depth/quality integers) is entropy-bound — no stable scheme compresses random ~9–11-bit integers further without Pco/Zstd.

Proposed work items

  1. Dense low-cardinality encoding for GT/PGT: make a sorted Dict (nullable codes, 2-bit, RLE'd) a candidate that beats Sparse when the non-fill fraction is high (e.g. > ~10–15%); add a values_are_sorted flag for pushdown. (largest impact)
  2. FSST for PID/vep in the Default scheme set.
  3. PL outlier-split (FoR + patch list) — measure first.
  4. Field-aware FORMAT row_block_size for segment granularity (perf).

Δ figures are current estimates grounded in measured per-column real data; prototypes quantifying GT (item 1), the noisy integers + strings, and chunk sizing are in progress and will firm these up.

Related

While reproducing the CI default, the myrrc/statpopgen-u8 branch (GT schema → list(u8)) crashed data-gen: the schema was changed to UInt8 but the Arrow builder still emitted UInt64 (expected List(UInt8) but found List(UInt64)). Fixed by emitting u8 in the GnomAD builder / parse_genotype.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions