Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O#186
Draft
zhuqi-lucas wants to merge 14 commits into
Draft
Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O#186zhuqi-lucas wants to merge 14 commits into
zhuqi-lucas wants to merge 14 commits into
Conversation
A walkthrough of the sort pushdown work landed and in flight on Apache DataFusion. Covers: - Why SortExec is expensive and what `Exact` / `Inexact` ordering mean at runtime (static `fetch` vs `TopK` dynamic filter). - Phase 1 (#19064): the `PushdownSort` rule + reverse row-group case. - Phase 2 (#21182): statistics-based file sort that upgrades `Unsupported` to `Exact`, eliminating the `SortExec` on non-overlapping ASC scans. Includes the BufferExec compensation (#21426) so the SPM above doesn't lose its implicit memory buffer. - Reverse scans: today's row-group reverse (Inexact, #18817) and the community decision to wait for arrow-rs page-level reverse (#9937) before pursuing Exact reverse, after memory-profile pushback on the original row-group-level proposal. - Benchmarks: 2.1×-49× on the ASC-LIMIT sort_pushdown suite. - What's next: the dynamic / TopK-driven path (#21351 merged, #21733, #21712, #21956, #21580) including the precise RG-pruning vs mid-stream-early-return distinction, and the EnsureRequirements unification (#21976). - Links into the prior dynamic filters and limit pruning posts so the series reads as a coherent thread.
The post previously only mentioned #21956 in passing. The PR is landing the full mechanism — `try_pushdown_sort` decision tree, two flags on `ParquetSource`, three composable runtime steps (file reorder + RG reorder + reverse), and a `sort_prefix`- preserving short-circuit — so cover it as a dedicated phase between Phase 2 and the existing reverse-scan section. - TL;DR: add a Phase 3 bullet alongside Phase 1 and Phase 2. - Phase 1: replace the in-flight `#21956` aside with a cross-link to the new section. - Phase 2: keep the caveat about function-wrapped sorts but note that #21956's `Inexact` path now covers them via monotonicity inference. - New `## Phase 3` section with two SVG diagrams: a decision tree for the three protocol outcomes, and a three-step pipeline for the `Inexact` runtime. Covers the two-flag design, the nested file/RG layers, when EXPLAIN surfaces each flag, and four scenarios where Phase 3 does not fire (aggregations, multi- column secondary keys, function-wrapped sorts without a declared ordering, source declares a forward prefix of the request). - "What's Next": rename the old "Phase 3 — filter + sort" bullet to "Filtered reverse TopK end-to-end" so the label doesn't clash with the new section, and add a follow-up bullet referencing #22198 for multi-column / function-wrapped reorder.
Add an ### Empirical note subsection inside "Reverse Scans for `ORDER BY ... DESC`" that records what we found running an in-house RG-level `Exact` reverse against upstream `Inexact` + `TopK`: - `LIMIT N` does not propagate as a static stop signal in the Inexact path. The dynamic filter pushdown can stats-prune *subsequent* row groups once the threshold tightens, but inside the row group `TopK` is currently reading the sort column has to be fully decoded so the filter can be evaluated row by row. `LIMIT 10` on a 1M-row row group is still ~1M sort-column decodes regardless of N. LIMIT only saves work on non-sort columns inside that row group and on whole subsequent row groups the threshold prunes. - `SortExec` stays on top of `Inexact`, so the final ordering pass and per-row heap maintenance are both extra costs the `Exact` path (which deletes `SortExec`) does not pay. Then explain why we run RG-level `Exact` in production but did not upstream it: parquet does not allow partial row-group reads, so any RG-level `Exact` implementation peaks at one whole row group (~128 MB) of decoded data in memory — the same constraint that closed `#18817`. Our runtime advantage comes from skipping heap / filter / `SortExec` overhead, not from decoding less. Frame the page-level `Exact` reverse work in arrow-rs `#9937` as the path that keeps the runtime win we measured while bringing peak memory back into the streaming regime via `OffsetIndex` page-level seek.
…tion
Two corrections in the empirical-note / reverse-scans section:
1. The internal RG-level `Exact` reverse path was incorrectly
described as "walks pages from the back, decodes each batch,
reverses row-wise, stops the moment K rows have been delivered."
That is actually the page-level `Exact` shape (arrow-rs #9937),
not the in-house RG-level implementation. Parquet does not allow
partial row-group reads, so the in-house path has to decode the
entire target row group, reverse the buffer in memory, take the
first K rows, and stop — same memory profile as #18817's
proposal. The runtime advantage over `Inexact` + `TopK` comes
from removing the per-row heap maintenance, dynamic-filter
evaluation, and `SortExec` final ordering pass, not from
decoding less sort-column data. Sort col decode on the target
row group is the same on both paths.
2. The arrow-rs #9937 paragraph I previously added duplicated the
technical detail already present in the long-standing
"That primitive is the page-level reverse traversal..."
paragraph. Replaced with a one-sentence bridge ("The fix that
keeps both the runtime win and a streaming memory profile is
page-level `Exact` reverse via arrow-rs #9937, described next.")
so the existing paragraph carries the explanation without
repetition.
All three sort-pushdown PRs have now landed, so the chronological 'Phase 1/2/3' framing is less useful for readers than a capability breakdown. Sections are now: - The PushdownSort Rule (#19064) - Sort Elimination via Statistics (#21182) - Runtime Reorder for TopK Convergence (#21956) - Reverse Scans for ORDER BY ... DESC (unchanged) In-body Phase references replaced with PR numbers or capability names; anchor links updated; references section restructured.
…ction' The previous wording 'nest by construction' could be read as a code- enforced property. It's actually a logical consequence of using the same sort key (min) at both file and row-group level: a file's min(col) is just the minimum over its row groups' min(col) values, so the most-promising file contains the most-promising row group. The rewritten paragraph spells that out and ties it to why TopK's dynamic filter tightens fast.
Match the dynamic-filter blog's style: narrative talks about capabilities/mechanisms, not 'PR #21956 did X / PR #19064 introduced Y'. The 81 inline PR/issue references in the body were dropping the reader out of the narrative; they belong in a single Issues-and-PRs list at the end. Changes: - TL;DR: drop 6 inline PR refs from the 4 capability bullets - Body sections (PushdownSort Rule, Sort Elimination, Runtime Reorder, Reverse Scans, Empirical Note): drop ~30 inline refs to historical PRs; replace with capability names or 'the rule' / 'the runtime reorder path' style descriptions - What's Next: switch from [#NNNNN] format to named markdown links (matching dynamic-filter's Future Work style) - Issues for new contributors: same conversion - References section: rewrite using full URLs (no link-ref indirection); split into 'Landed' vs 'In flight / open' for clarity Net: ~90 lines removed, all PR/issue numbers now consolidated at the bottom of the post.
The previous draft mixed merged work, in-flight work, and runtime-cost analysis into a single 'Reverse Scans' section and a sprawling 'What's Next' section. Reorganize so the post answers three clear questions in sequence: 1. What's merged today? (Sort Elimination via Statistics + benchmark, Runtime Reorder for TopK Convergence, Reverse Scans for DESC) — unchanged content, just kept tight. 2. Where do those merged features still leave performance on the table? New 'Current Bottlenecks' section with three explicitly numbered bottlenecks: SortExec stays / sort column fully decoded inside open RG / file-granular scheduling can't close the tap mid-file. Pulls in the runtime-cost content that used to be buried in an 'Empirical note' subsection. 3. How does each next-step optimization remove a specific bottleneck? New 'Roadmap' section maps page-level Exact reverse to bottlenecks 1+2, row-group-level dynamic early termination to bottleneck 3, and shows the in-flight 17x-60x pipeline benchmark as a preview of what stacking these mechanisms can deliver. Smaller follow-ups (EnsureRequirements, OFFSET pushdown, multi-column reorder) collected at the end of the roadmap section as a short 'Other follow-ups' bullet list.
- 'Why not full Exact reverse' paragraph: cut reviewer attribution and forward-pointers that were already in the bottleneck/roadmap sections that follow. - TL;DR: trim Runtime Reorder + Reverse Scans bullets to capability and impact; drop implementation mechanics like 'stamps two flags' and 'three-step pipeline'. - 'The PushdownSort Rule' section: cut three paragraphs of 'covered in X below' forward-references that were repeating the section TOC. - Function-wrapped parenthetical in Sort Elimination: 4 lines to 2. - Single-partition vs multi-partition edge case: drop the trailing 'which is why the example is drawn that way' tangent. - 'What this change does not affect' note: trimmed redundant prose. - Remove all references to the limit-pruning blog (intro mention, link definition, References section bullet) — that work is about static LIMIT as an I/O optimization, separate problem from sort ordering.
…mmetry explicit Per code in datasource-parquet/src/source.rs:849-870, the reversed- satisfies branch is 'strictly more powerful than the column-in-schema check' — it runs the request through EquivalenceProperties's full reasoning machinery and handles function monotonicity, constants from filters, equivalence relationships, and multi-column composite orderings. The blog had been treating reverse as just step 3 of the runtime pipeline, which undersold its standalone reach. Structural changes: - Drop the standalone 'Reverse Scans for ORDER BY DESC' H2 section; reverse is now a case of the Inexact runtime reorder path. - Rename Runtime Reorder section to 'Runtime Reorder for TopK and DESC Queries'; intro now lists three classes that fall outside Exact (unsorted, overlapping, DESC). - 'try_pushdown_sort' subsection rewritten as 'Two independent triggers for Inexact', describing column-in-schema vs reversed- satisfies as separate signals with the latter being strictly more powerful. - 'Three runtime steps' subsection: step 3 now explicitly notes when steps 1-2 are skipped and only the iteration reverse runs. - New 'ORDER BY DESC in practice' subsection right after the 3-step pipeline, explaining the RGs-descending-x-rows-ascending stream. - Move reverse-scan.svg from the deleted Reverse Scans section into the Roadmap > Page-level Exact reverse subsection where it illustrates the 128 MB vs 1 MB peak comparison directly. Accuracy fix: - 'Multi-column reorder follow-ups' bullet was inaccurate — said the reorder 'only fires on plain columns'. The reverse path does handle function-wrapped and multi-column cases via EquivalenceProperties; only the stats reorder step is restricted. Updated wording to scope the limitation correctly.
…shdown EnsureRequirements (#21976) is a rule-unification effort for EnforceDistribution+EnforceSorting. Touches the same area but isn't a sort pushdown optimization. OFFSET pushdown (#21828) is about LIMIT/OFFSET pruning. Same kind of tangent as the limit-pruning blog reference removed earlier — it's LIMIT optimization, not sort pushdown. The remaining 'Multi-column and function-wrapped reorder follow-ups' bullet is actually directly about sort pushdown's reorder step (#22198), so it stays. With the other two removed, 'Other follow-ups' collapsed to a single point — promoted to its own subsection 'Extending the stats reorder step' for clarity. Also dropped the corresponding entries from the References section.
… (k-way merge) The 'switch work queue from PartitionedFile to RG descriptors' fix is sufficient for non-overlapping ranges (post-reorder), where the first file globally has the smallest values and subsequent files are already stats-pruned. For overlapping ranges, the next smallest value could sit in any of several open files — matching the non-overlap efficiency requires explicit k-way merge across open files' next-RG mins. The dynamic filter does this implicitly (RGs with max < threshold are dropped), but explicit comparison closes the tap earlier when the filter tightens slowly.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A walkthrough of the sort pushdown work landed and in flight on Apache DataFusion. Opening as a draft to share the narrative early — the in-flight PRs the post discusses are still in flight, but the structure and the merged work (Phase 1 #19064, Phase 2 #21182, BufferExec #21426, row-group reverse #18817) are in their final shape.
What this post covers
SortExecis expensive, and whatExact/Inexactmean at runtime (staticfetchvsTopKdynamic filter).PushdownSortrule + reverse row-group case.UnsupportedtoExact, eliminating theSortExecon non-overlapping ASC scans. Includes theBufferExeccompensation (#21426) so the SPM above doesn't lose its implicit memory buffer.Exactreverse, after memory-profile pushback on the original row-group-level proposal.sort_pushdownsuite.EnsureRequirementsunification (#21976).Why a draft
A few of the in-flight PRs the "What's next" section references may evolve in review (e.g. #21580 may be split into smaller pieces, dynamic RG scheduling on top of #21351 is described but not yet on a PR). Opening as draft so we can adjust wording as those land or change shape — happy to flip to ready for review when the dust settles, or earlier if reviewers prefer.
Test plan