[CALCITE-5101] LISTAGG function with DISTINCT and ORDER BY fails#4983
[CALCITE-5101] LISTAGG function with DISTINCT and ORDER BY fails#4983xuzifu666 wants to merge 12 commits into
Conversation
| if (collations.size() == 1) { | ||
| RelFieldCollation collation = collations.get(0); | ||
| final int index = collation.getFieldIndex(); | ||
| if (index < 0 || index >= rowType.getFieldList().size()) { |
There was a problem hiding this comment.
I guess these should never be reachable if the code generator is correct
There was a problem hiding this comment.
Yes, this is just a defensive check, but it seems unnecessary now, so I deleted it.
| } | ||
| } | ||
|
|
||
| if (!validCollation) { |
There was a problem hiding this comment.
I am a bit confused: there is a sort order specified, yet it is ignored? That cannot be right.
There was a problem hiding this comment.
There are two options:
- Complete Fix: add
ORDER BYcolumns to the re-grouped input, requires architectural changes to aggregate planning,changes query semantics, potential impact on other parts.This is a large-scale refactoring; - Graceful Degradation(current way): returns correct results without crashes, loses sorting information, but doesn't crash.This is a rare scenario:only occurs with
LISTAGG(DISTINCT ...) WITHIN GROUP (ORDER BY non-group-column).
According the conditions I choose the second way. So do you think my current plan is reasonable?
There was a problem hiding this comment.
Why is the result correct if you ignore the order?
There was a problem hiding this comment.
This is a compromise solution to avoid crashes by downgrading the handling of such special statements (removing the order by clause). The main issue is that a thorough fix involves many aspects, with less consideration for ROI. If it becomes clear that a completely accurate order by result is required, I will attempt a fix with the lowest possible cost in the future.
There was a problem hiding this comment.
I don't understand the difference between "less acurate" and "wrong"
Either the result is correct, or it's not. I don't think we should take a fix which avoids crashes and produces wrong results.
There was a problem hiding this comment.
Okay, I will try to fix this issue completely later.
There was a problem hiding this comment.
FYI, if you're going to change AggregateExpandDistinctAggregatesRule to fix this bug, you'll also should to update rewriteUsingGroupingSets, since the same issue can be triggered by a query like:
SELECT deptno,
LISTAGG(DISTINCT ename) WITHIN GROUP (ORDER BY sal),
LISTAGG(DISTINCT deptno) WITHIN GROUP (ORDER BY ename)
FROM emp
GROUP BY deptno;```
There was a problem hiding this comment.
At the very least, a test checking this should be added
There was a problem hiding this comment.
Good suggestion! I had addressed this issue and add related test about it. @mihaibudiu @GoncaloCoutoDosSantos
mihaibudiu
left a comment
There was a problem hiding this comment.
Please reply to the other comment you have received as well.
| if (aggCall.isDistinct()) { | ||
| bottomGroups.addAll(aggCall.getArgList()); | ||
| // Also include ORDER BY columns from WITHIN GROUP | ||
| for (RelFieldCollation fc : aggCall.collation.getFieldCollations()) { |
There was a problem hiding this comment.
I think this is a better approach.
Can this add the same column twice? Is that a problem?
There was a problem hiding this comment.
Wouldn't this cause DISTINCT to be ignored in certain scenarios?
Consider the following example:
SELECT deptno, SUM(DISTINCT sal) WITHIN GROUP (ORDER BY bonus)
FROM EMP
GROUP BY deptno
With this modification, the rule would rewrite the query as:
SELECT deptno, SUM(sal) WITHIN GROUP (ORDER BY bonus)
FROM (
SELECT deptno, sal, bonus FROM EMP GROUP BY deptno, sal, bonus
)
GROUP BY deptno
This does not correctly enforce DISTINCT on sal: if two rows share the same sal value but have different bonus values, both survive the inner GROUP BY and sal ends up counted twice in the outer SUM, which violates the DISTINCT semantics.
PS: Please feel free to correct me or let me know if I am intervening inappropriately.
There was a problem hiding this comment.
I think this is a better approach. Can this add the same column twice? Is that a problem?
That's should not a problem. bottomGroups is a NavigableSet<Integer> (TreeSet), which automatically removes duplicates. If the ORDER BY column is already in the GROUP BY or DISTINCT parameter, the Set will ignore duplicates. I've added a comment to clarify this. @mihaibudiu
There was a problem hiding this comment.
Wouldn't this cause DISTINCT to be ignored in certain scenarios?
Consider the following example:
SELECT deptno, SUM(DISTINCT sal) WITHIN GROUP (ORDER BY bonus) FROM EMP GROUP BY deptnoWith this modification, the rule would rewrite the query as:
SELECT deptno, SUM(sal) WITHIN GROUP (ORDER BY bonus) FROM ( SELECT deptno, sal, bonus FROM EMP GROUP BY deptno, sal, bonus ) GROUP BY deptnoThis does not correctly enforce DISTINCT on sal: if two rows share the same sal value but have different bonus values, both survive the inner GROUP BY and sal ends up counted twice in the outer SUM, which violates the DISTINCT semantics.
PS: Please feel free to correct me or let me know if I am intervening inappropriately.
I think the question you raised is quite valuable, and I've already fixed it in a recent commit. You can take a look. @GoncaloCoutoDosSantos
| int originalIdx = fc.getFieldIndex(); | ||
| int newIdx = fullGroupSet.indexOf(originalIdx); | ||
| if (newIdx >= 0) { | ||
| remappedFCs.add(fc.withFieldIndex(newIdx)); |
There was a problem hiding this comment.
can you explain why is it ok for the result collation to have fewer elements than the input collation?
There was a problem hiding this comment.
In my view it's acceptable for result collation to have fewer elements because:
- fullGroupSet only contains columns that are part of some grouping set. If an ORDER BY column is not in fullGroupSet, it means that column is not in any grouping set.
- Columns not in any grouping set have inconsistent values within each group — different rows with the same group key may have different values for that column.
- Sorting by columns with inconsistent values is meaningless. It's safe to silently drop them from the collation.
There was a problem hiding this comment.
This should be in a comment in the code.
Can you also create a test case to cover this case and make sure it gives the same results as some other reference database?
There was a problem hiding this comment.
OK, I had add comment and add a test case cover this case.
I test it in postgresql and result is expected. link:https://onecompiler.com/postgresql/44rcggaqf
| "deptno=20; total_distinct_salary=8000.0"); | ||
| } | ||
|
|
||
| @Test void countDistinctWithOrderByNonGroupColumn() { |
There was a problem hiding this comment.
I am confused, where is the ORDER BY in this example?
There was a problem hiding this comment.
I updated test case, Based on the PostgreSQL results https://onecompiler.com/postgresql/44rcggaqf, it looks like it's OK.
| "deptno=20; total_distinct_salary=8000.0"); | ||
| } | ||
|
|
||
| @Test void groupingSetsWithCollationReferencingOutsideGroupingSets() { |
There was a problem hiding this comment.
I don't see any ORDER BY in these last two SQL programs.
Where is the collation coming from?
Maybe you can show the plan having a SORT?
There was a problem hiding this comment.
I have added order by, and the corresponding PostgreSQL test is here: https://onecompiler.com/postgresql/44rcggaqf
|



jira: https://issues.apache.org/jira/browse/CALCITE-5101