Skip to content

Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O#186

Draft
zhuqi-lucas wants to merge 14 commits into
apache:mainfrom
zhuqi-lucas:blog-sort-pushdown
Draft

Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O#186
zhuqi-lucas wants to merge 14 commits into
apache:mainfrom
zhuqi-lucas:blog-sort-pushdown

Conversation

@zhuqi-lucas
Copy link
Copy Markdown
Contributor

Summary

A walkthrough of the sort pushdown work landed and in flight on Apache DataFusion. Opening as a draft to share the narrative early — the in-flight PRs the post discusses are still in flight, but the structure and the merged work (Phase 1 #19064, Phase 2 #21182, BufferExec #21426, row-group reverse #18817) are in their final shape.

What this post covers

  • Why SortExec is expensive, and what Exact / Inexact mean at runtime (static fetch vs TopK dynamic filter).
  • Phase 1 (#19064) — the PushdownSort rule + reverse row-group case.
  • Phase 2 (#21182) — statistics-based file sort that upgrades Unsupported to Exact, eliminating the SortExec on non-overlapping ASC scans. Includes the BufferExec compensation (#21426) so the SPM above doesn't lose its implicit memory buffer.
  • Reverse scans — today's row-group reverse (#18817, Inexact) and the community decision to wait for arrow-rs page-level reverse (apache/arrow-rs#9937) before pursuing Exact reverse, after memory-profile pushback on the original row-group-level proposal.
  • Benchmarks — 2.1×–49× on the ASC-LIMIT sort_pushdown suite.
  • What's next — the dynamic / TopK-driven path (#21351 merged, #21733, #21712, #21956, #21580) including the precise RG-level pruning vs mid-stream early return distinction, and the EnsureRequirements unification (#21976).
  • Links into the prior dynamic filters and limit pruning posts so the series reads as a coherent thread.

Why a draft

A few of the in-flight PRs the "What's next" section references may evolve in review (e.g. #21580 may be split into smaller pieces, dynamic RG scheduling on top of #21351 is described but not yet on a PR). Opening as draft so we can adjust wording as those land or change shape — happy to flip to ready for review when the dust settles, or earlier if reviewers prefer.

Test plan

  • Rendered locally with the Pelican Docker image from the project README — images, internal links, code blocks, and tables all render correctly.
  • All issue / PR / blog-post links checked against current state.

A walkthrough of the sort pushdown work landed and in flight on Apache
DataFusion. Covers:

- Why SortExec is expensive and what `Exact` / `Inexact` ordering mean
  at runtime (static `fetch` vs `TopK` dynamic filter).
- Phase 1 (#19064): the `PushdownSort` rule + reverse row-group case.
- Phase 2 (#21182): statistics-based file sort that upgrades
  `Unsupported` to `Exact`, eliminating the `SortExec` on
  non-overlapping ASC scans. Includes the BufferExec compensation
  (#21426) so the SPM above doesn't lose its implicit memory buffer.
- Reverse scans: today's row-group reverse (Inexact, #18817) and the
  community decision to wait for arrow-rs page-level reverse (#9937)
  before pursuing Exact reverse, after memory-profile pushback on the
  original row-group-level proposal.
- Benchmarks: 2.1×-49× on the ASC-LIMIT sort_pushdown suite.
- What's next: the dynamic / TopK-driven path (#21351 merged, #21733,
  #21712, #21956, #21580) including the precise RG-pruning vs
  mid-stream-early-return distinction, and the EnsureRequirements
  unification (#21976).
- Links into the prior dynamic filters and limit pruning posts so the
  series reads as a coherent thread.
The post previously only mentioned #21956 in passing. The PR is
landing the full mechanism — `try_pushdown_sort` decision tree,
two flags on `ParquetSource`, three composable runtime steps
(file reorder + RG reorder + reverse), and a `sort_prefix`-
preserving short-circuit — so cover it as a dedicated phase
between Phase 2 and the existing reverse-scan section.

- TL;DR: add a Phase 3 bullet alongside Phase 1 and Phase 2.
- Phase 1: replace the in-flight `#21956` aside with a cross-link
  to the new section.
- Phase 2: keep the caveat about function-wrapped sorts but note
  that #21956's `Inexact` path now covers them via monotonicity
  inference.
- New `## Phase 3` section with two SVG diagrams: a decision tree
  for the three protocol outcomes, and a three-step pipeline for
  the `Inexact` runtime. Covers the two-flag design, the nested
  file/RG layers, when EXPLAIN surfaces each flag, and four
  scenarios where Phase 3 does not fire (aggregations, multi-
  column secondary keys, function-wrapped sorts without a
  declared ordering, source declares a forward prefix of the
  request).
- "What's Next": rename the old "Phase 3 — filter + sort" bullet
  to "Filtered reverse TopK end-to-end" so the label doesn't
  clash with the new section, and add a follow-up bullet
  referencing #22198 for multi-column / function-wrapped reorder.
Add an ### Empirical note subsection inside "Reverse Scans for
`ORDER BY ... DESC`" that records what we found running an in-house
RG-level `Exact` reverse against upstream `Inexact` + `TopK`:

- `LIMIT N` does not propagate as a static stop signal in the
  Inexact path. The dynamic filter pushdown can stats-prune
  *subsequent* row groups once the threshold tightens, but inside
  the row group `TopK` is currently reading the sort column has to
  be fully decoded so the filter can be evaluated row by row.
  `LIMIT 10` on a 1M-row row group is still ~1M sort-column
  decodes regardless of N. LIMIT only saves work on non-sort
  columns inside that row group and on whole subsequent row
  groups the threshold prunes.
- `SortExec` stays on top of `Inexact`, so the final ordering
  pass and per-row heap maintenance are both extra costs the
  `Exact` path (which deletes `SortExec`) does not pay.

Then explain why we run RG-level `Exact` in production but did
not upstream it: parquet does not allow partial row-group reads,
so any RG-level `Exact` implementation peaks at one whole row
group (~128 MB) of decoded data in memory — the same constraint
that closed `#18817`. Our runtime advantage comes from skipping
heap / filter / `SortExec` overhead, not from decoding less.

Frame the page-level `Exact` reverse work in arrow-rs `#9937` as
the path that keeps the runtime win we measured while bringing
peak memory back into the streaming regime via `OffsetIndex`
page-level seek.
…tion

Two corrections in the empirical-note / reverse-scans section:

1. The internal RG-level `Exact` reverse path was incorrectly
   described as "walks pages from the back, decodes each batch,
   reverses row-wise, stops the moment K rows have been delivered."
   That is actually the page-level `Exact` shape (arrow-rs #9937),
   not the in-house RG-level implementation. Parquet does not allow
   partial row-group reads, so the in-house path has to decode the
   entire target row group, reverse the buffer in memory, take the
   first K rows, and stop — same memory profile as #18817's
   proposal. The runtime advantage over `Inexact` + `TopK` comes
   from removing the per-row heap maintenance, dynamic-filter
   evaluation, and `SortExec` final ordering pass, not from
   decoding less sort-column data. Sort col decode on the target
   row group is the same on both paths.

2. The arrow-rs #9937 paragraph I previously added duplicated the
   technical detail already present in the long-standing
   "That primitive is the page-level reverse traversal..."
   paragraph. Replaced with a one-sentence bridge ("The fix that
   keeps both the runtime win and a streaming memory profile is
   page-level `Exact` reverse via arrow-rs #9937, described next.")
   so the existing paragraph carries the explanation without
   repetition.
All three sort-pushdown PRs have now landed, so the chronological
'Phase 1/2/3' framing is less useful for readers than a capability
breakdown. Sections are now:
- The PushdownSort Rule (#19064)
- Sort Elimination via Statistics (#21182)
- Runtime Reorder for TopK Convergence (#21956)
- Reverse Scans for ORDER BY ... DESC (unchanged)

In-body Phase references replaced with PR numbers or capability names;
anchor links updated; references section restructured.
…ction'

The previous wording 'nest by construction' could be read as a code-
enforced property. It's actually a logical consequence of using the
same sort key (min) at both file and row-group level: a file's
min(col) is just the minimum over its row groups' min(col) values, so
the most-promising file contains the most-promising row group. The
rewritten paragraph spells that out and ties it to why TopK's dynamic
filter tightens fast.
Match the dynamic-filter blog's style: narrative talks about
capabilities/mechanisms, not 'PR #21956 did X / PR #19064 introduced
Y'. The 81 inline PR/issue references in the body were dropping the
reader out of the narrative; they belong in a single Issues-and-PRs
list at the end.

Changes:
- TL;DR: drop 6 inline PR refs from the 4 capability bullets
- Body sections (PushdownSort Rule, Sort Elimination, Runtime Reorder,
  Reverse Scans, Empirical Note): drop ~30 inline refs to historical
  PRs; replace with capability names or 'the rule' / 'the runtime
  reorder path' style descriptions
- What's Next: switch from [#NNNNN] format to named markdown links
  (matching dynamic-filter's Future Work style)
- Issues for new contributors: same conversion
- References section: rewrite using full URLs (no link-ref
  indirection); split into 'Landed' vs 'In flight / open' for
  clarity

Net: ~90 lines removed, all PR/issue numbers now consolidated at
the bottom of the post.
The previous draft mixed merged work, in-flight work, and runtime-cost
analysis into a single 'Reverse Scans' section and a sprawling 'What's
Next' section. Reorganize so the post answers three clear questions in
sequence:

1. What's merged today? (Sort Elimination via Statistics + benchmark,
   Runtime Reorder for TopK Convergence, Reverse Scans for DESC) —
   unchanged content, just kept tight.
2. Where do those merged features still leave performance on the
   table? New 'Current Bottlenecks' section with three explicitly
   numbered bottlenecks: SortExec stays / sort column fully decoded
   inside open RG / file-granular scheduling can't close the tap
   mid-file. Pulls in the runtime-cost content that used to be buried
   in an 'Empirical note' subsection.
3. How does each next-step optimization remove a specific bottleneck?
   New 'Roadmap' section maps page-level Exact reverse to bottlenecks
   1+2, row-group-level dynamic early termination to bottleneck 3, and
   shows the in-flight 17x-60x pipeline benchmark as a preview of what
   stacking these mechanisms can deliver.

Smaller follow-ups (EnsureRequirements, OFFSET pushdown, multi-column
reorder) collected at the end of the roadmap section as a short
'Other follow-ups' bullet list.
- 'Why not full Exact reverse' paragraph: cut reviewer attribution and
  forward-pointers that were already in the bottleneck/roadmap
  sections that follow.
- TL;DR: trim Runtime Reorder + Reverse Scans bullets to capability
  and impact; drop implementation mechanics like 'stamps two flags'
  and 'three-step pipeline'.
- 'The PushdownSort Rule' section: cut three paragraphs of
  'covered in X below' forward-references that were repeating
  the section TOC.
- Function-wrapped parenthetical in Sort Elimination: 4 lines to 2.
- Single-partition vs multi-partition edge case: drop the trailing
  'which is why the example is drawn that way' tangent.
- 'What this change does not affect' note: trimmed redundant prose.
- Remove all references to the limit-pruning blog (intro mention,
  link definition, References section bullet) — that work is about
  static LIMIT as an I/O optimization, separate problem from sort
  ordering.
…mmetry explicit

Per code in datasource-parquet/src/source.rs:849-870, the reversed-
satisfies branch is 'strictly more powerful than the column-in-schema
check' — it runs the request through EquivalenceProperties's full
reasoning machinery and handles function monotonicity, constants from
filters, equivalence relationships, and multi-column composite
orderings. The blog had been treating reverse as just step 3 of the
runtime pipeline, which undersold its standalone reach.

Structural changes:
- Drop the standalone 'Reverse Scans for ORDER BY DESC' H2 section;
  reverse is now a case of the Inexact runtime reorder path.
- Rename Runtime Reorder section to 'Runtime Reorder for TopK and
  DESC Queries'; intro now lists three classes that fall outside
  Exact (unsorted, overlapping, DESC).
- 'try_pushdown_sort' subsection rewritten as 'Two independent
  triggers for Inexact', describing column-in-schema vs reversed-
  satisfies as separate signals with the latter being strictly more
  powerful.
- 'Three runtime steps' subsection: step 3 now explicitly notes when
  steps 1-2 are skipped and only the iteration reverse runs.
- New 'ORDER BY DESC in practice' subsection right after the 3-step
  pipeline, explaining the RGs-descending-x-rows-ascending stream.
- Move reverse-scan.svg from the deleted Reverse Scans section into
  the Roadmap > Page-level Exact reverse subsection where it
  illustrates the 128 MB vs 1 MB peak comparison directly.

Accuracy fix:
- 'Multi-column reorder follow-ups' bullet was inaccurate — said the
  reorder 'only fires on plain columns'. The reverse path does
  handle function-wrapped and multi-column cases via
  EquivalenceProperties; only the stats reorder step is restricted.
  Updated wording to scope the limitation correctly.
…shdown

EnsureRequirements (#21976) is a rule-unification effort for
EnforceDistribution+EnforceSorting. Touches the same area but isn't
a sort pushdown optimization.

OFFSET pushdown (#21828) is about LIMIT/OFFSET pruning. Same kind of
tangent as the limit-pruning blog reference removed earlier — it's
LIMIT optimization, not sort pushdown.

The remaining 'Multi-column and function-wrapped reorder follow-ups'
bullet is actually directly about sort pushdown's reorder step
(#22198), so it stays. With the other two removed, 'Other follow-ups'
collapsed to a single point — promoted to its own subsection
'Extending the stats reorder step' for clarity.

Also dropped the corresponding entries from the References section.
… (k-way merge)

The 'switch work queue from PartitionedFile to RG descriptors' fix is
sufficient for non-overlapping ranges (post-reorder), where the first
file globally has the smallest values and subsequent files are
already stats-pruned. For overlapping ranges, the next smallest value
could sit in any of several open files — matching the non-overlap
efficiency requires explicit k-way merge across open files' next-RG
mins. The dynamic filter does this implicitly (RGs with max < threshold
are dropped), but explicit comparison closes the tap earlier when the
filter tightens slowly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant