Skip to content

Optimize multiple COUNT(DISTINCT) memory via logical plan rewrite #21087

@ydgandhi

Description

@ydgandhi

Is your feature request related to a problem or challenge?

DataFusion's default execution creates separate HashSet accumulators for each distinct count per group. With high-cardinality data (eg sellers with 4,000+ distinct cities in an ecommerce dataset), this causes memory explosion.

Describe the solution you'd like

DataFusion already optimizes the single shared distinct field case via SingleDistinctToGroupBy. This PR adds a conservative logical rewrite for multiple distinct COUNT(DISTINCT …) arguments by splitting work into per-distinct branches joined on the group keys, which reduces peak memory for eligible plans.

COUNT(DISTINCT x) must ignore NULL x; the rewrite applies x IS NOT NULL on each distinct branch before inner grouping so semantics stay aligned with count_distinct behavior.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request
    No fields configured for Feature.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions