[CALCITE-5101] LISTAGG function with DISTINCT and ORDER BY fails by xuzifu666 · Pull Request #4983 · apache/calcite

xuzifu666 · 2026-06-01T16:27:33Z

jira: https://issues.apache.org/jira/browse/CALCITE-5101

mihaibudiu · 2026-06-01T17:39:18Z

    if (collations.size() == 1) {
      RelFieldCollation collation = collations.get(0);
+      final int index = collation.getFieldIndex();
+      if (index < 0 || index >= rowType.getFieldList().size()) {


I guess these should never be reachable if the code generator is correct

Yes, this is just a defensive check, but it seems unnecessary now, so I deleted it.

mihaibudiu · 2026-06-01T17:41:33Z

+          }
+        }
+
+        if (!validCollation) {


I am a bit confused: there is a sort order specified, yet it is ignored? That cannot be right.

There are two options:

Complete fix: add the ORDER BY columns to the re-grouped input. This requires architectural changes to aggregate planning, changes query semantics, and may affect other parts of the planner. It is a large-scale refactoring.

Graceful degradation (the current approach): returns correct results without crashing, but loses the sorting information. This is a rare scenario: it only occurs with LISTAGG(DISTINCT ...) WITHIN GROUP (ORDER BY non-group-column).

Given these trade-offs, I chose the second option. Do you think my current plan is reasonable?

Why is the result correct if you ignore the order?

This is a compromise that avoids the crash by downgrading how such special statements are handled (the ORDER BY clause is removed). The main issue is that a thorough fix touches many areas, so its return on investment is low. If it becomes clear that a fully accurate ORDER BY result is required, I will attempt a fix with the lowest possible cost in the future.

I don't understand the difference between "less accurate" and "wrong".
Either the result is correct, or it's not. I don't think we should take a fix which avoids crashes and produces wrong results.

Okay, I will try to fix this issue completely later.

@mihaibudiu I have fixed the issue, PTAL.

FYI, if you're going to change AggregateExpandDistinctAggregatesRule to fix this bug, you should also update rewriteUsingGroupingSets, since the same issue can be triggered by a query like:

SELECT deptno, LISTAGG(DISTINCT ename) WITHIN GROUP (ORDER BY sal), LISTAGG(DISTINCT deptno) WITHIN GROUP (ORDER BY ename) FROM emp GROUP BY deptno;

At the very least, a test checking this should be added

Good suggestion! I have addressed this issue and added a related test. @mihaibudiu @GoncaloCoutoDosSantos

mihaibudiu

Please reply to the other comment you have received as well.

mihaibudiu · 2026-06-03T17:48:59Z

      if (aggCall.isDistinct()) {
        bottomGroups.addAll(aggCall.getArgList());
+        // Also include ORDER BY columns from WITHIN GROUP
+        for (RelFieldCollation fc : aggCall.collation.getFieldCollations()) {


I think this is a better approach.
Can this add the same column twice? Is that a problem?

Wouldn't this cause DISTINCT to be ignored in certain scenarios?

Consider the following example:

SELECT deptno, SUM(DISTINCT sal) WITHIN GROUP (ORDER BY bonus) FROM EMP GROUP BY deptno

With this modification, the rule would rewrite the query as:

SELECT deptno, SUM(sal) WITHIN GROUP (ORDER BY bonus) FROM ( SELECT deptno, sal, bonus FROM EMP GROUP BY deptno, sal, bonus ) GROUP BY deptno

This does not correctly enforce DISTINCT on sal: if two rows share the same sal value but have different bonus values, both survive the inner GROUP BY and sal ends up counted twice in the outer SUM, which violates the DISTINCT semantics.

PS: Please feel free to correct me or let me know if I am intervening inappropriately.
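
The double counting described above can be reproduced outside Calcite. The following is a minimal plain-Java sketch, not the rule's code: the class name, the method names, and the three sample rows (each row is {deptno, sal, bonus}, all in one department) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration; rows are {deptno, sal, bonus} for a single deptno. */
final class DistinctRewriteSketch {
  /** SUM(DISTINCT sal): deduplicates on sal alone. */
  static int sumDistinctSal(List<int[]> rows) {
    Set<Integer> seen = new LinkedHashSet<>();
    int sum = 0;
    for (int[] row : rows) {
      if (seen.add(row[1])) {
        sum += row[1];
      }
    }
    return sum;
  }

  /** Rewritten form: inner GROUP BY deptno, sal, bonus, then a plain SUM(sal). */
  static int sumAfterInnerGroupBy(List<int[]> rows) {
    Set<List<Integer>> groups = new LinkedHashSet<>();
    for (int[] row : rows) {
      groups.add(Arrays.asList(row[0], row[1], row[2]));
    }
    int sum = 0;
    for (List<Integer> group : groups) {
      sum += group.get(1);
    }
    return sum;
  }

  public static void main(String[] args) {
    List<int[]> rows = Arrays.asList(
        new int[] {10, 1000, 1},
        new int[] {10, 1000, 2},  // same sal, different bonus
        new int[] {10, 2000, 1});
    System.out.println(sumDistinctSal(rows));        // prints 3000
    System.out.println(sumAfterInnerGroupBy(rows));  // prints 4000
  }
}
```

The two rows with sal = 1000 differ only in bonus, so both survive the inner grouping and the outer sum counts 1000 twice.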

I think this is a better approach. Can this add the same column twice? Is that a problem?

That should not be a problem. bottomGroups is a NavigableSet<Integer> (TreeSet), which automatically removes duplicates. If the ORDER BY column is already in the GROUP BY or DISTINCT parameter, the Set will ignore duplicates. I've added a comment to clarify this. @mihaibudiu
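
The set behaviour mentioned in this reply can be checked with the JDK alone. This is a small sketch assuming hypothetical field indices; `BottomGroupsSketch` and `collect` are illustrative names, not part of the rule.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical illustration of collecting field indices into a sorted set. */
final class BottomGroupsSketch {
  /** Merges group keys, DISTINCT arguments and ORDER BY fields; repeats are ignored. */
  static NavigableSet<Integer> collect(List<Integer> groupKeys,
      List<Integer> distinctArgs, List<Integer> orderByFields) {
    NavigableSet<Integer> bottomGroups = new TreeSet<>(groupKeys);
    bottomGroups.addAll(distinctArgs);
    bottomGroups.addAll(orderByFields);
    return bottomGroups;
  }

  public static void main(String[] args) {
    // Field 7 is both a group key and an ORDER BY field (assumed indices).
    System.out.println(
        collect(Arrays.asList(7), Arrays.asList(1), Arrays.asList(5, 7)));
    // prints [1, 5, 7]
  }
}
```

Deduplication only covers adding the same index twice. It does not address whether an extra ORDER BY column in the inner grouping changes the DISTINCT result, which is the separate concern raised in this thread.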

Wouldn't this cause DISTINCT to be ignored in certain scenarios?


I think the question you raised is quite valuable, and I've already fixed it in a recent commit. You can take a look. @GoncaloCoutoDosSantos

mihaibudiu · 2026-06-04T17:43:39Z

+      int originalIdx = fc.getFieldIndex();
+      int newIdx = fullGroupSet.indexOf(originalIdx);
+      if (newIdx >= 0) {
+        remappedFCs.add(fc.withFieldIndex(newIdx));


Can you explain why it is OK for the result collation to have fewer elements than the input collation?

In my view, it's acceptable for the result collation to have fewer elements because:

fullGroupSet only contains columns that are part of some grouping set. If an ORDER BY column is not in fullGroupSet, it means that column is not in any grouping set.

Columns not in any grouping set have inconsistent values within each group — different rows with the same group key may have different values for that column.

Sorting by columns with inconsistent values is meaningless. It's safe to silently drop them from the collation.

This should be in a comment in the code.
Can you also create a test case to cover this case and make sure it gives the same results as some other reference database?

OK, I have added the comment and a test case that covers this case.
I tested it in PostgreSQL and the result is as expected. Link: https://onecompiler.com/postgresql/44rcggaqf
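
The remapping discussed in this thread (look up each ORDER BY field in fullGroupSet and keep it only if it is found) can be sketched with plain lists. This is an assumption-laden illustration: a `List<Integer>` stands in for the real group set, bare indices stand in for `RelFieldCollation`, and `CollationRemapSketch` and `remap` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of remapping collation field indices. */
final class CollationRemapSketch {
  /**
   * Maps each collation field to its ordinal position in fullGroupSet.
   * Fields that are not in fullGroupSet are dropped from the result.
   */
  static List<Integer> remap(List<Integer> collationFields,
      List<Integer> fullGroupSet) {
    List<Integer> remapped = new ArrayList<>();
    for (Integer originalIdx : collationFields) {
      int newIdx = fullGroupSet.indexOf(originalIdx);
      if (newIdx >= 0) {
        remapped.add(newIdx);
      }
    }
    return remapped;
  }

  public static void main(String[] args) {
    // ORDER BY fields 5 and 2; the group set holds fields 0, 2 and 7.
    System.out.println(remap(Arrays.asList(5, 2), Arrays.asList(0, 2, 7)));
    // prints [1]: field 2 moves to position 1, field 5 is dropped
  }
}
```

The sketch makes the reviewer's question concrete: the output collation can be shorter than the input, so the dropped fields need the justification given above and a test against a reference database.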

mihaibudiu · 2026-06-05T03:31:42Z

            "deptno=20; total_distinct_salary=8000.0");
  }

+  @Test void countDistinctWithOrderByNonGroupColumn() {


I am confused, where is the ORDER BY in this example?

I updated the test case. Based on the PostgreSQL results at https://onecompiler.com/postgresql/44rcggaqf, it looks OK.

mihaibudiu · 2026-06-05T04:43:26Z

+            "deptno=20; total_distinct_salary=8000.0");
+  }
+
+  @Test void groupingSetsWithCollationReferencingOutsideGroupingSets() {


I don't see any ORDER BY in these last two SQL programs.
Where is the collation coming from?
Maybe you can show the plan having a SORT?

I have added order by, and the corresponding PostgreSQL test is here: https://onecompiler.com/postgresql/44rcggaqf

sonarqubecloud · 2026-06-05T06:22:23Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
81.3% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

[CALCITE-5101] LISTAGG function with DISTINCT and ORDER BY fails · 2bd42fd

xuzifu666 mentioned this pull request Jun 1, 2026: [CALCITE-5101] AggregateExpandDistinctAggregatesRule does not remap WITHIN GROUP collation indices, causing ArrayIndexOutOfBoundsException #4982 (Closed)

mihaibudiu reviewed Jun 1, 2026

Addressed · 32de196

xuzifu666 requested a review from mihaibudiu June 2, 2026 15:27

xuzifu666 added 2 commits June 3, 2026 12:03

Addressed · 2886c8e

Addressed · c94e4c3

mihaibudiu reviewed Jun 3, 2026

xuzifu666 added 4 commits June 4, 2026 09:59

Addressed · 6998e3b

Addressed · a06107d

Addressed · 147069b

Addressed · bb6db76

mihaibudiu reviewed Jun 4, 2026

Addressed · e92f573

mihaibudiu reviewed Jun 5, 2026

xuzifu666 added 2 commits June 5, 2026 11:42

Addressed · a520f79

Addressed · ebf6804

mihaibudiu reviewed Jun 5, 2026

Addressed · 92ebb06

mihaibudiu approved these changes Jun 5, 2026
