Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O by zhuqi-lucas · Pull Request #186 · apache/datafusion-site

zhuqi-lucas · 2026-05-14T07:12:23Z

Summary

A walkthrough of the sort pushdown work landed and in flight on Apache DataFusion. Opening as a draft to share the narrative early — the in-flight PRs the post discusses are still in flight, but the structure and the merged work (Phase 1 #19064, Phase 2 #21182, BufferExec #21426, row-group reverse #18817) are in their final shape.

What this post covers

Why SortExec is expensive, and what Exact / Inexact mean at runtime (static fetch vs TopK dynamic filter).
Phase 1 (#19064) — the PushdownSort rule + reverse row-group case.
Phase 2 (#21182) — statistics-based file sort that upgrades Unsupported to Exact, eliminating the SortExec on non-overlapping ASC scans. Includes the BufferExec compensation (#21426) so the SPM above doesn't lose its implicit memory buffer.
Reverse scans — today's row-group reverse (#18817, Inexact) and the community decision to wait for arrow-rs page-level reverse (apache/arrow-rs#9937) before pursuing Exact reverse, after memory-profile pushback on the original row-group-level proposal.
Benchmarks — 2.1×–49× on the ASC-LIMIT sort_pushdown suite.
What's next — the dynamic / TopK-driven path (#21351 merged, #21733, #21712, #21956, #21580) including the precise RG-level pruning vs mid-stream early return distinction, and the EnsureRequirements unification (#21976).
Links into the prior dynamic filters and limit pruning posts so the series reads as a coherent thread.

Why a draft

A few of the in-flight PRs the "What's next" section references may evolve in review (e.g. #21580 may be split into smaller pieces, dynamic RG scheduling on top of #21351 is described but not yet on a PR). Opening as draft so we can adjust wording as those land or change shape — happy to flip to ready for review when the dust settles, or earlier if reviewers prefer.

Test plan

Rendered locally with the Pelican Docker image from the project README — images, internal links, code blocks, and tables all render correctly.
All issue / PR / blog-post links checked against current state.

A walkthrough of the sort pushdown work landed and in flight on Apache DataFusion. Covers: - Why SortExec is expensive and what `Exact` / `Inexact` ordering mean at runtime (static `fetch` vs `TopK` dynamic filter). - Phase 1 (#19064): the `PushdownSort` rule + reverse row-group case. - Phase 2 (#21182): statistics-based file sort that upgrades `Unsupported` to `Exact`, eliminating the `SortExec` on non-overlapping ASC scans. Includes the BufferExec compensation (#21426) so the SPM above doesn't lose its implicit memory buffer. - Reverse scans: today's row-group reverse (Inexact, #18817) and the community decision to wait for arrow-rs page-level reverse (#9937) before pursuing Exact reverse, after memory-profile pushback on the original row-group-level proposal. - Benchmarks: 2.1×-49× on the ASC-LIMIT sort_pushdown suite. - What's next: the dynamic / TopK-driven path (#21351 merged, #21733, #21712, #21956, #21580) including the precise RG-pruning vs mid-stream-early-return distinction, and the EnsureRequirements unification (#21976). - Links into the prior dynamic filters and limit pruning posts so the series reads as a coherent thread.

The post previously only mentioned #21956 in passing. The PR is landing the full mechanism — `try_pushdown_sort` decision tree, two flags on `ParquetSource`, three composable runtime steps (file reorder + RG reorder + reverse), and a `sort_prefix`- preserving short-circuit — so cover it as a dedicated phase between Phase 2 and the existing reverse-scan section. - TL;DR: add a Phase 3 bullet alongside Phase 1 and Phase 2. - Phase 1: replace the in-flight `#21956` aside with a cross-link to the new section. - Phase 2: keep the caveat about function-wrapped sorts but note that #21956's `Inexact` path now covers them via monotonicity inference. - New `## Phase 3` section with two SVG diagrams: a decision tree for the three protocol outcomes, and a three-step pipeline for the `Inexact` runtime. Covers the two-flag design, the nested file/RG layers, when EXPLAIN surfaces each flag, and four scenarios where Phase 3 does not fire (aggregations, multi- column secondary keys, function-wrapped sorts without a declared ordering, source declares a forward prefix of the request). - "What's Next": rename the old "Phase 3 — filter + sort" bullet to "Filtered reverse TopK end-to-end" so the label doesn't clash with the new section, and add a follow-up bullet referencing #22198 for multi-column / function-wrapped reorder.

Add an ### Empirical note subsection inside "Reverse Scans for `ORDER BY ... DESC`" that records what we found running an in-house RG-level `Exact` reverse against upstream `Inexact` + `TopK`: - `LIMIT N` does not propagate as a static stop signal in the Inexact path. The dynamic filter pushdown can stats-prune *subsequent* row groups once the threshold tightens, but inside the row group `TopK` is currently reading the sort column has to be fully decoded so the filter can be evaluated row by row. `LIMIT 10` on a 1M-row row group is still ~1M sort-column decodes regardless of N. LIMIT only saves work on non-sort columns inside that row group and on whole subsequent row groups the threshold prunes. - `SortExec` stays on top of `Inexact`, so the final ordering pass and per-row heap maintenance are both extra costs the `Exact` path (which deletes `SortExec`) does not pay. Then explain why we run RG-level `Exact` in production but did not upstream it: parquet does not allow partial row-group reads, so any RG-level `Exact` implementation peaks at one whole row group (~128 MB) of decoded data in memory — the same constraint that closed `#18817`. Our runtime advantage comes from skipping heap / filter / `SortExec` overhead, not from decoding less. Frame the page-level `Exact` reverse work in arrow-rs `#9937` as the path that keeps the runtime win we measured while bringing peak memory back into the streaming regime via `OffsetIndex` page-level seek.

…tion Two corrections in the empirical-note / reverse-scans section: 1. The internal RG-level `Exact` reverse path was incorrectly described as "walks pages from the back, decodes each batch, reverses row-wise, stops the moment K rows have been delivered." That is actually the page-level `Exact` shape (arrow-rs #9937), not the in-house RG-level implementation. Parquet does not allow partial row-group reads, so the in-house path has to decode the entire target row group, reverse the buffer in memory, take the first K rows, and stop — same memory profile as #18817's proposal. The runtime advantage over `Inexact` + `TopK` comes from removing the per-row heap maintenance, dynamic-filter evaluation, and `SortExec` final ordering pass, not from decoding less sort-column data. Sort col decode on the target row group is the same on both paths. 2. The arrow-rs #9937 paragraph I previously added duplicated the technical detail already present in the long-standing "That primitive is the page-level reverse traversal..." paragraph. Replaced with a one-sentence bridge ("The fix that keeps both the runtime win and a streaming memory profile is page-level `Exact` reverse via arrow-rs #9937, described next.") so the existing paragraph carries the explanation without repetition.

All three sort-pushdown PRs have now landed, so the chronological 'Phase 1/2/3' framing is less useful for readers than a capability breakdown. Sections are now: - The PushdownSort Rule (#19064) - Sort Elimination via Statistics (#21182) - Runtime Reorder for TopK Convergence (#21956) - Reverse Scans for ORDER BY ... DESC (unchanged) In-body Phase references replaced with PR numbers or capability names; anchor links updated; references section restructured.

…ction' The previous wording 'nest by construction' could be read as a code- enforced property. It's actually a logical consequence of using the same sort key (min) at both file and row-group level: a file's min(col) is just the minimum over its row groups' min(col) values, so the most-promising file contains the most-promising row group. The rewritten paragraph spells that out and ties it to why TopK's dynamic filter tightens fast.

Match the dynamic-filter blog's style: narrative talks about capabilities/mechanisms, not 'PR #21956 did X / PR #19064 introduced Y'. The 81 inline PR/issue references in the body were dropping the reader out of the narrative; they belong in a single Issues-and-PRs list at the end. Changes: - TL;DR: drop 6 inline PR refs from the 4 capability bullets - Body sections (PushdownSort Rule, Sort Elimination, Runtime Reorder, Reverse Scans, Empirical Note): drop ~30 inline refs to historical PRs; replace with capability names or 'the rule' / 'the runtime reorder path' style descriptions - What's Next: switch from [#NNNNN] format to named markdown links (matching dynamic-filter's Future Work style) - Issues for new contributors: same conversion - References section: rewrite using full URLs (no link-ref indirection); split into 'Landed' vs 'In flight / open' for clarity Net: ~90 lines removed, all PR/issue numbers now consolidated at the bottom of the post.

The previous draft mixed merged work, in-flight work, and runtime-cost analysis into a single 'Reverse Scans' section and a sprawling 'What's Next' section. Reorganize so the post answers three clear questions in sequence: 1. What's merged today? (Sort Elimination via Statistics + benchmark, Runtime Reorder for TopK Convergence, Reverse Scans for DESC) — unchanged content, just kept tight. 2. Where do those merged features still leave performance on the table? New 'Current Bottlenecks' section with three explicitly numbered bottlenecks: SortExec stays / sort column fully decoded inside open RG / file-granular scheduling can't close the tap mid-file. Pulls in the runtime-cost content that used to be buried in an 'Empirical note' subsection. 3. How does each next-step optimization remove a specific bottleneck? New 'Roadmap' section maps page-level Exact reverse to bottlenecks 1+2, row-group-level dynamic early termination to bottleneck 3, and shows the in-flight 17x-60x pipeline benchmark as a preview of what stacking these mechanisms can deliver. Smaller follow-ups (EnsureRequirements, OFFSET pushdown, multi-column reorder) collected at the end of the roadmap section as a short 'Other follow-ups' bullet list.

- 'Why not full Exact reverse' paragraph: cut reviewer attribution and forward-pointers that were already in the bottleneck/roadmap sections that follow. - TL;DR: trim Runtime Reorder + Reverse Scans bullets to capability and impact; drop implementation mechanics like 'stamps two flags' and 'three-step pipeline'. - 'The PushdownSort Rule' section: cut three paragraphs of 'covered in X below' forward-references that were repeating the section TOC. - Function-wrapped parenthetical in Sort Elimination: 4 lines to 2. - Single-partition vs multi-partition edge case: drop the trailing 'which is why the example is drawn that way' tangent. - 'What this change does not affect' note: trimmed redundant prose. - Remove all references to the limit-pruning blog (intro mention, link definition, References section bullet) — that work is about static LIMIT as an I/O optimization, separate problem from sort ordering.

…mmetry explicit Per code in datasource-parquet/src/source.rs:849-870, the reversed- satisfies branch is 'strictly more powerful than the column-in-schema check' — it runs the request through EquivalenceProperties's full reasoning machinery and handles function monotonicity, constants from filters, equivalence relationships, and multi-column composite orderings. The blog had been treating reverse as just step 3 of the runtime pipeline, which undersold its standalone reach. Structural changes: - Drop the standalone 'Reverse Scans for ORDER BY DESC' H2 section; reverse is now a case of the Inexact runtime reorder path. - Rename Runtime Reorder section to 'Runtime Reorder for TopK and DESC Queries'; intro now lists three classes that fall outside Exact (unsorted, overlapping, DESC). - 'try_pushdown_sort' subsection rewritten as 'Two independent triggers for Inexact', describing column-in-schema vs reversed- satisfies as separate signals with the latter being strictly more powerful. - 'Three runtime steps' subsection: step 3 now explicitly notes when steps 1-2 are skipped and only the iteration reverse runs. - New 'ORDER BY DESC in practice' subsection right after the 3-step pipeline, explaining the RGs-descending-x-rows-ascending stream. - Move reverse-scan.svg from the deleted Reverse Scans section into the Roadmap > Page-level Exact reverse subsection where it illustrates the 128 MB vs 1 MB peak comparison directly. Accuracy fix: - 'Multi-column reorder follow-ups' bullet was inaccurate — said the reorder 'only fires on plain columns'. The reverse path does handle function-wrapped and multi-column cases via EquivalenceProperties; only the stats reorder step is restricted. Updated wording to scope the limitation correctly.

…shdown EnsureRequirements (#21976) is a rule-unification effort for EnforceDistribution+EnforceSorting. Touches the same area but isn't a sort pushdown optimization. OFFSET pushdown (#21828) is about LIMIT/OFFSET pruning. Same kind of tangent as the limit-pruning blog reference removed earlier — it's LIMIT optimization, not sort pushdown. The remaining 'Multi-column and function-wrapped reorder follow-ups' bullet is actually directly about sort pushdown's reorder step (#22198), so it stays. With the other two removed, 'Other follow-ups' collapsed to a single point — promoted to its own subsection 'Extending the stats reorder step' for clarity. Also dropped the corresponding entries from the References section.

… (k-way merge) The 'switch work queue from PartitionedFile to RG descriptors' fix is sufficient for non-overlapping ranges (post-reorder), where the first file globally has the smallest values and subsequent files are already stats-pruned. For overlapping ranges, the next smallest value could sit in any of several open files — matching the non-overlap efficiency requires explicit k-way merge across open files' next-RG mins. The dynamic filter does this implicitly (RGs with max < threshold are dropped), but explicit comparison closes the tap earlier when the filter tightens slowly.

zhuqi-lucas added 14 commits May 14, 2026 15:11

Push draft date to 2026-05-25

47e45dd

Mark #21956 as landed (merged via merge queue)

3395119

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O#186

Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O#186
zhuqi-lucas wants to merge 14 commits into
apache:mainfrom
zhuqi-lucas:blog-sort-pushdown

zhuqi-lucas commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zhuqi-lucas commented May 14, 2026

Summary

What this post covers

Why a draft

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant