Fix/support duplicate column names #6543 #21126

Open
RafaelHerrero wants to merge 2 commits into apache:main from
RafaelHerrero:fix/support-duplicate-column-names-6543

@RafaelHerrero

Which issue does this PR close?

Rationale for this change

We're building a SQL engine on top of DataFusion and hit this while running TPC-DS benchmarks — Q39 fails during planning with:

Projections require unique expression names but the expression
"CAST(inv1.cov AS Decimal128(30, 10))" at position 4 and "inv1.cov"
at position 10 have the same name.

The underlying issue is that CAST is transparent to schema_name(), so both expressions resolve to inv1.cov. But this also affects simpler cases like SELECT 1, 1 or SELECT x, x FROM t — all of which PostgreSQL, Trino, and SQLite handle without errors.
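To make the collision concrete, here is a toy model of it. This is not DataFusion's `Expr` type: the enum, the `schema_name()` body, and the `display_name()` helper below are all assumptions written for illustration. The only behavior it mirrors is the one described above, that the cast is skipped when computing the output name.

```rust
/// Toy expression tree for illustration only; NOT DataFusion's `Expr`.
enum Expr {
    Column(String),
    Cast { expr: Box<Expr>, to: String },
}

impl Expr {
    /// Output-column name. The cast is transparent here, so a cast-wrapped
    /// column reports the same name as the bare column.
    fn schema_name(&self) -> String {
        match self {
            Expr::Column(name) => name.clone(),
            Expr::Cast { expr, .. } => expr.schema_name(),
        }
    }

    /// Human-readable form that keeps the cast visible (hypothetical helper,
    /// shaped like the text quoted in the error message).
    fn display_name(&self) -> String {
        match self {
            Expr::Column(name) => name.clone(),
            Expr::Cast { expr, to } => format!("CAST({} AS {})", expr.display_name(), to),
        }
    }
}
```

In this model, `Column("inv1.cov")` and a `Cast` wrapping it display differently but return the same `schema_name()`, which is the situation the uniqueness check reports.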

Looking at the issue discussion, @alamb suggested adding auto-aliases in the SQL planner:

"I think that is actually a pretty neat idea -- specifically add the aliases in the SQL planner. I would be happy to review such a PR"

That's what this PR does.

TPC-DS Q39 reproduction

The query joins two CTEs that both produce columns named cov, mean, etc. When the planner applies implicit casts during type coercion, the cast-wrapped and original expressions end up with the same schema name:

WITH inv1 AS (
    SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
           stdev, mean, (CASE mean WHEN 0 THEN NULL ELSE stdev/mean END) AS cov
    FROM (SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
                 stddev_samp(inv_quantity_on_hand) AS stdev,
                 avg(inv_quantity_on_hand) AS mean
          FROM inventory, item, warehouse, date_dim
          WHERE inv_item_sk = i_item_sk AND inv_warehouse_sk = w_warehouse_sk
            AND inv_date_sk = d_date_sk AND d_year = 2001
          GROUP BY w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy) foo
    WHERE CASE mean WHEN 0 THEN 0 ELSE stdev/mean END > 1
),
inv2 AS (
    SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
           stdev, mean, (CASE mean WHEN 0 THEN NULL ELSE stdev/mean END) AS cov
    FROM (SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
                 stddev_samp(inv_quantity_on_hand) AS stdev,
                 avg(inv_quantity_on_hand) AS mean
          FROM inventory, item, warehouse, date_dim
          WHERE inv_item_sk = i_item_sk AND inv_warehouse_sk = w_warehouse_sk
            AND inv_date_sk = d_date_sk AND d_year = 2001 AND d_moy = 2
          GROUP BY w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy) foo
    WHERE CASE mean WHEN 0 THEN 0 ELSE stdev/mean END > 1
)
SELECT inv1.w_warehouse_sk, inv1.i_item_sk, inv1.d_moy, inv1.mean, inv1.cov,
       inv2.w_warehouse_sk, inv2.i_item_sk, inv2.d_moy, inv2.mean, inv2.cov
FROM inv1 JOIN inv2
  ON inv1.i_item_sk = inv2.i_item_sk AND inv1.w_warehouse_sk = inv2.w_warehouse_sk
ORDER BY inv1.w_warehouse_sk, inv1.i_item_sk, inv1.d_moy, inv1.mean, inv1.cov,
         inv2.d_moy, inv2.mean, inv2.cov;

What changes are included in this PR?

A dedup pass in SqlToRel that runs right after prepare_select_exprs() and before self.project(). It detects duplicate schema_name() values and wraps the second and each subsequent occurrence in an Alias whose name carries a :{N} suffix; the first occurrence is left unchanged:

SELECT x AS c1, y AS c1 FROM t;
-- produces columns: c1, c1:1
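The naming rule can be sketched on plain strings. This is a simplified stand-in, not the code in this PR: the real pass works on expressions and wraps them in an alias, while `dedup_names` below is a hypothetical helper that only computes the resulting names. Skipping a suffix that is already taken by another column is an assumption of this sketch, not something the PR description states.

```rust
use std::collections::{HashMap, HashSet};

/// Takes the output names of the SELECT expressions in order and returns
/// the names after suffixing duplicates with `:{N}`.
fn dedup_names(names: &[&str]) -> Vec<String> {
    // Number of earlier occurrences of each original name.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    // Every name emitted so far.
    let mut taken: HashSet<String> = HashSet::new();
    let mut out = Vec::with_capacity(names.len());

    for &name in names {
        let count = counts.entry(name).or_insert(0);
        let mut n = *count;
        *count += 1;

        // First occurrence keeps its name; later ones get `name:N`.
        let mut candidate = if n == 0 {
            name.to_string()
        } else {
            format!("{name}:{n}")
        };
        // Assumption of this sketch: if the candidate is already in use
        // (e.g. a user alias literally named `c1:1`), try the next suffix.
        while !taken.insert(candidate.clone()) {
            n += 1;
            candidate = format!("{name}:{n}");
        }
        out.push(candidate);
    }
    out
}
```

Called with `["c1", "c1"]` it yields `c1` and `c1:1`, matching the example above; unique names pass through untouched.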

The actual code is small (~45 lines of logic across 2 files):

  • datafusion/sql/src/utils.rs — new deduplicate_select_expr_names() function
  • datafusion/sql/src/select.rs — one call site between prepare_select_exprs() and self.project()

I intentionally kept this scoped to the SQL planner only:

  • validate_unique_names("Projections") in builder.rs is untouched, so the Rust API (LogicalPlanBuilder::project) still rejects duplicates
  • No changes to the optimizer, physical planner, or DFSchema
  • validate_unique_names("Windows") is unchanged
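For contrast, the kind of strict check that stays in place on the Rust API path can be modeled like this. `check_unique_names` is a hypothetical, string-based stand-in written for this description; it is not the signature or the error text of validate_unique_names in builder.rs.

```rust
use std::collections::HashMap;

/// Rejects the first repeated name, reporting both positions.
/// Hypothetical stand-in; operates on names instead of expressions.
fn check_unique_names(node_name: &str, names: &[&str]) -> Result<(), String> {
    // Position at which each name was first seen.
    let mut first_seen: HashMap<&str, usize> = HashMap::new();
    for (position, &name) in names.iter().enumerate() {
        if let Some(&previous) = first_seen.get(name) {
            return Err(format!(
                "{node_name} require unique expression names but \"{name}\" appears at positions {previous} and {position}"
            ));
        }
        first_seen.insert(name, position);
    }
    Ok(())
}
```

Because the dedup pass runs only in the SQL planner, a plan built directly through the Rust API never has its names rewritten and still hits a check of this shape.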

Known limitation: SELECT *, x FROM t still errors when x is also one of the columns produced by *, because wildcard expansion happens after this dedup pass (inside project_with_validation). Happy to address that in a follow-up if desired.

Are these changes tested?

New sqllogictest file (duplicate_column_alias.slt) with 13 test cases covering:

  • Basic duplicate aliases, literals, and same-column-twice
  • Subquery with duplicate names
  • ORDER BY resolving to first occurrence
  • CTE join (TPC-DS Q39 pattern)
  • Three-way duplicates
  • CAST producing same schema_name as original column
  • GROUP BY and aggregates with duplicates
  • ORDER BY positional reference to the renamed column
  • iszero(0.0), iszero(-0.0) (reported in the issue by @jatin510)
  • UNION with duplicate column names
  • Wildcard limitation documented as explicit query error test

Updated existing tests in sql_integration.rs (5 tests), aggregate.slt, and unnest.slt that previously asserted the "Projections require unique" error.

Are there any user-facing changes?

Yes, this is a behavior change:

  • SQL queries with duplicate expression names now succeed instead of erroring
  • Duplicate columns get a :{N} suffix in the output (e.g., cov, cov:1)
  • First occurrence keeps its original name, so ORDER BY / HAVING references still work
  • The programmatic Rust API is unchanged

RafaelHerrero and others added 2 commits March 22, 2026 23:55
Add a deduplication pass in the SQL planner that auto-suffixes
duplicate expression names with :{count} before projection, so
queries like SELECT x, x or TPC-DS Q39 no longer error.

The fix is scoped to the SQL path only. The Rust API
(LogicalPlanBuilder::project) still rejects duplicates via
validate_unique_names, keeping optimizer invariants intact.
github-actions bot added the sql (SQL Planner) and sqllogictest (SQL Logic Tests (.slt)) labels on Mar 23, 2026
Contributor

@alamb alamb left a comment

Thank you @RafaelHerrero -- this looks really nice. My only concern is that this may slow down planning (as now it has to create a bunch more strings). I'll run some benchmarks to be sure

@alamb
Contributor

alamb commented Mar 24, 2026

run benchmark sql_planner

@adriangbot

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4121355245-526-gxcdt 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing fix/support-duplicate-column-names-6543 (b3b94bd) to 2b986c8 (merge-base) diff
BENCH_NAME=sql_planner
BENCH_COMMAND=cargo bench --features=parquet --bench sql_planner
BENCH_FILTER=
Results will be posted here when complete

@adriangbot

🤖 Criterion benchmark completed (GKE) | trigger

Details

group                                                 fix_support-duplicate-column-names-6543    main
-----                                                 ---------------------------------------    ----
logical_aggregate_with_join                           1.03    439.6±1.09µs        ? ?/sec        1.00    428.0±1.29µs        ? ?/sec
logical_plan_struct_join_agg_sort                     1.05    175.5±0.72µs        ? ?/sec        1.00    166.7±0.51µs        ? ?/sec
logical_select_all_from_1000                          1.00      7.9±0.03ms        ? ?/sec        1.00      7.9±0.04ms        ? ?/sec
logical_select_one_from_700                           1.00    320.5±1.17µs        ? ?/sec        1.00    318.9±2.42µs        ? ?/sec
logical_trivial_join_high_numbered_columns            1.02    266.6±0.75µs        ? ?/sec        1.00    261.8±0.67µs        ? ?/sec
logical_trivial_join_low_numbered_columns             1.01    253.8±1.59µs        ? ?/sec        1.00    250.6±1.55µs        ? ?/sec
physical_intersection                                 1.03    597.9±1.35µs        ? ?/sec        1.00    578.8±1.95µs        ? ?/sec
physical_join_consider_sort                           1.02   1030.2±4.16µs        ? ?/sec        1.00   1006.7±3.06µs        ? ?/sec
physical_join_distinct                                1.01    247.5±0.67µs        ? ?/sec        1.00    244.1±0.54µs        ? ?/sec
physical_many_self_joins                              1.02      7.6±0.08ms        ? ?/sec        1.00      7.5±0.08ms        ? ?/sec
physical_plan_clickbench_all                          1.00    110.8±2.28ms        ? ?/sec        1.01    111.4±2.92ms        ? ?/sec
physical_plan_clickbench_q1                           1.00  1082.4±11.09µs        ? ?/sec        1.00  1082.4±11.73µs        ? ?/sec
physical_plan_clickbench_q10                          1.01  1810.9±46.44µs        ? ?/sec        1.00  1784.6±45.05µs        ? ?/sec
physical_plan_clickbench_q11                          1.01  1900.6±64.90µs        ? ?/sec        1.00  1874.4±52.55µs        ? ?/sec
physical_plan_clickbench_q12                          1.00  1957.4±71.70µs        ? ?/sec        1.00  1966.0±59.60µs        ? ?/sec
physical_plan_clickbench_q13                          1.01  1732.9±57.15µs        ? ?/sec        1.00  1707.7±32.39µs        ? ?/sec
physical_plan_clickbench_q14                          1.00  1861.3±36.62µs        ? ?/sec        1.02  1898.6±46.38µs        ? ?/sec
physical_plan_clickbench_q15                          1.01  1786.8±47.21µs        ? ?/sec        1.00  1764.2±28.33µs        ? ?/sec
physical_plan_clickbench_q16                          1.00  1491.5±16.94µs        ? ?/sec        1.00  1492.4±26.26µs        ? ?/sec
physical_plan_clickbench_q17                          1.01  1565.2±31.72µs        ? ?/sec        1.00  1549.5±38.05µs        ? ?/sec
physical_plan_clickbench_q18                          1.01  1396.8±21.23µs        ? ?/sec        1.00  1378.8±19.48µs        ? ?/sec
physical_plan_clickbench_q19                          1.00  1795.9±30.13µs        ? ?/sec        1.00  1803.4±32.04µs        ? ?/sec
physical_plan_clickbench_q2                           1.00  1436.0±21.95µs        ? ?/sec        1.02  1459.9±24.99µs        ? ?/sec
physical_plan_clickbench_q20                          1.02  1171.5±19.19µs        ? ?/sec        1.00   1144.3±8.06µs        ? ?/sec
physical_plan_clickbench_q21                          1.01  1449.9±15.71µs        ? ?/sec        1.00  1437.6±15.81µs        ? ?/sec
physical_plan_clickbench_q22                          1.00  1861.7±40.00µs        ? ?/sec        1.00  1861.3±62.31µs        ? ?/sec
physical_plan_clickbench_q23                          1.00      2.1±0.08ms        ? ?/sec        1.00      2.1±0.07ms        ? ?/sec
physical_plan_clickbench_q24                          1.00      2.9±0.08ms        ? ?/sec        1.02      3.0±0.10ms        ? ?/sec
physical_plan_clickbench_q25                          1.00  1545.4±34.55µs        ? ?/sec        1.00  1543.8±33.15µs        ? ?/sec
physical_plan_clickbench_q26                          1.00  1372.2±16.97µs        ? ?/sec        1.01  1389.5±14.72µs        ? ?/sec
physical_plan_clickbench_q27                          1.00  1543.0±21.88µs        ? ?/sec        1.01  1557.4±25.86µs        ? ?/sec
physical_plan_clickbench_q28                          1.02      2.1±0.05ms        ? ?/sec        1.00      2.0±0.04ms        ? ?/sec
physical_plan_clickbench_q29                          1.00      2.2±0.04ms        ? ?/sec        1.03      2.3±0.07ms        ? ?/sec
physical_plan_clickbench_q3                           1.01  1416.7±21.65µs        ? ?/sec        1.00  1400.0±17.11µs        ? ?/sec
physical_plan_clickbench_q30                          1.01     16.6±0.13ms        ? ?/sec        1.00     16.4±0.23ms        ? ?/sec
physical_plan_clickbench_q31                          1.01      2.1±0.08ms        ? ?/sec        1.00      2.1±0.04ms        ? ?/sec
physical_plan_clickbench_q32                          1.00      2.1±0.04ms        ? ?/sec        1.03      2.2±0.06ms        ? ?/sec
physical_plan_clickbench_q33                          1.01  1800.5±33.07µs        ? ?/sec        1.00  1786.9±41.80µs        ? ?/sec
physical_plan_clickbench_q34                          1.01  1523.7±32.23µs        ? ?/sec        1.00  1508.7±23.38µs        ? ?/sec
physical_plan_clickbench_q35                          1.00  1579.0±31.49µs        ? ?/sec        1.01  1590.9±26.54µs        ? ?/sec
physical_plan_clickbench_q36                          1.01  1936.2±49.84µs        ? ?/sec        1.00  1922.4±33.37µs        ? ?/sec
physical_plan_clickbench_q37                          1.00      2.2±0.05ms        ? ?/sec        1.04      2.3±0.07ms        ? ?/sec
physical_plan_clickbench_q38                          1.00      2.3±0.06ms        ? ?/sec        1.00      2.3±0.08ms        ? ?/sec
physical_plan_clickbench_q39                          1.02      2.1±0.06ms        ? ?/sec        1.00      2.1±0.03ms        ? ?/sec
physical_plan_clickbench_q4                           1.00  1202.3±12.42µs        ? ?/sec        1.01  1218.1±15.32µs        ? ?/sec
physical_plan_clickbench_q40                          1.00      2.6±0.07ms        ? ?/sec        1.02      2.7±0.09ms        ? ?/sec
physical_plan_clickbench_q41                          1.01      2.3±0.06ms        ? ?/sec        1.00      2.3±0.09ms        ? ?/sec
physical_plan_clickbench_q42                          1.01      2.2±0.07ms        ? ?/sec        1.00      2.2±0.04ms        ? ?/sec
physical_plan_clickbench_q43                          1.00      2.4±0.05ms        ? ?/sec        1.01      2.5±0.08ms        ? ?/sec
physical_plan_clickbench_q44                          1.00  1307.7±15.34µs        ? ?/sec        1.00  1303.8±34.07µs        ? ?/sec
physical_plan_clickbench_q45                          1.01  1300.5±21.88µs        ? ?/sec        1.00  1293.9±20.40µs        ? ?/sec
physical_plan_clickbench_q46                          1.00  1635.4±27.09µs        ? ?/sec        1.01  1646.9±28.60µs        ? ?/sec
physical_plan_clickbench_q47                          1.02      2.4±0.06ms        ? ?/sec        1.00      2.3±0.04ms        ? ?/sec
physical_plan_clickbench_q48                          1.00      2.4±0.05ms        ? ?/sec        1.01      2.4±0.08ms        ? ?/sec
physical_plan_clickbench_q49                          1.03      2.7±0.10ms        ? ?/sec        1.00      2.6±0.08ms        ? ?/sec
physical_plan_clickbench_q5                           1.01  1360.7±31.12µs        ? ?/sec        1.00  1348.4±29.93µs        ? ?/sec
physical_plan_clickbench_q50                          1.02      2.5±0.10ms        ? ?/sec        1.00      2.5±0.05ms        ? ?/sec
physical_plan_clickbench_q51                          1.00  1760.5±32.39µs        ? ?/sec        1.02  1792.6±54.40µs        ? ?/sec
physical_plan_clickbench_q6                           1.01  1353.1±25.67µs        ? ?/sec        1.00  1339.4±23.75µs        ? ?/sec
physical_plan_clickbench_q7                           1.01  1125.7±12.41µs        ? ?/sec        1.00  1112.8±12.55µs        ? ?/sec
physical_plan_clickbench_q8                           1.02  1605.4±37.62µs        ? ?/sec        1.00  1568.6±21.05µs        ? ?/sec
physical_plan_clickbench_q9                           1.01  1681.2±49.12µs        ? ?/sec        1.00  1665.1±28.06µs        ? ?/sec
physical_plan_struct_join_agg_sort                    1.01   1326.0±4.69µs        ? ?/sec        1.00   1316.3±5.39µs        ? ?/sec
physical_plan_tpcds_all                               1.01   795.4±10.86ms        ? ?/sec        1.00   789.6±10.42ms        ? ?/sec
physical_plan_tpch_all                                1.02     49.2±1.07ms        ? ?/sec        1.00     48.2±0.76ms        ? ?/sec
physical_plan_tpch_q1                                 1.00   1528.8±6.75µs        ? ?/sec        1.00   1528.3±8.67µs        ? ?/sec
physical_plan_tpch_q10                                1.00      2.9±0.03ms        ? ?/sec        1.02      2.9±0.03ms        ? ?/sec
physical_plan_tpch_q11                                1.00      2.6±0.03ms        ? ?/sec        1.00      2.6±0.02ms        ? ?/sec
physical_plan_tpch_q12                                1.00   1299.1±6.63µs        ? ?/sec        1.01   1315.9±8.35µs        ? ?/sec
physical_plan_tpch_q13                                1.00   1006.3±3.85µs        ? ?/sec        1.01   1013.7±3.74µs        ? ?/sec
physical_plan_tpch_q14                                1.02   1342.2±6.94µs        ? ?/sec        1.00   1312.5±4.98µs        ? ?/sec
physical_plan_tpch_q16                                1.01   1716.3±8.42µs        ? ?/sec        1.00  1697.2±19.47µs        ? ?/sec
physical_plan_tpch_q17                                1.02  1864.8±18.18µs        ? ?/sec        1.00  1836.3±16.51µs        ? ?/sec
physical_plan_tpch_q18                                1.00  1994.0±25.80µs        ? ?/sec        1.00  1986.5±21.22µs        ? ?/sec
physical_plan_tpch_q19                                1.00      2.5±0.02ms        ? ?/sec        1.00      2.5±0.03ms        ? ?/sec
physical_plan_tpch_q2                                 1.01      4.4±0.06ms        ? ?/sec        1.00      4.4±0.09ms        ? ?/sec
physical_plan_tpch_q20                                1.03      2.3±0.02ms        ? ?/sec        1.00      2.3±0.02ms        ? ?/sec
physical_plan_tpch_q21                                1.02      3.1±0.06ms        ? ?/sec        1.00      3.1±0.05ms        ? ?/sec
physical_plan_tpch_q22                                1.00      2.1±0.02ms        ? ?/sec        1.00      2.1±0.02ms        ? ?/sec
physical_plan_tpch_q3                                 1.00  1931.1±15.18µs        ? ?/sec        1.00  1933.6±12.37µs        ? ?/sec
physical_plan_tpch_q4                                 1.00   1043.0±3.37µs        ? ?/sec        1.01   1057.7±3.68µs        ? ?/sec
physical_plan_tpch_q5                                 1.00      2.5±0.02ms        ? ?/sec        1.00      2.5±0.02ms        ? ?/sec
physical_plan_tpch_q6                                 1.00    628.3±1.80µs        ? ?/sec        1.01    635.7±2.02µs        ? ?/sec
physical_plan_tpch_q7                                 1.00      3.1±0.03ms        ? ?/sec        1.02      3.2±0.04ms        ? ?/sec
physical_plan_tpch_q8                                 1.01      4.3±0.08ms        ? ?/sec        1.00      4.2±0.10ms        ? ?/sec
physical_plan_tpch_q9                                 1.01      3.0±0.06ms        ? ?/sec        1.00      3.0±0.03ms        ? ?/sec
physical_select_aggregates_from_200                   1.00     14.7±0.10ms        ? ?/sec        1.00     14.7±0.11ms        ? ?/sec
physical_select_all_from_1000                         1.00     17.8±0.08ms        ? ?/sec        1.00     17.7±0.07ms        ? ?/sec
physical_select_one_from_700                          1.01    788.7±2.52µs        ? ?/sec        1.00    778.4±4.68µs        ? ?/sec
physical_sorted_union_order_by_10_int64               1.01      4.7±0.05ms        ? ?/sec        1.00      4.6±0.05ms        ? ?/sec
physical_sorted_union_order_by_10_uint64              1.01     12.0±0.13ms        ? ?/sec        1.00     11.8±0.10ms        ? ?/sec
physical_sorted_union_order_by_50_int64               1.00    114.5±1.39ms        ? ?/sec        1.00    114.9±1.26ms        ? ?/sec
physical_sorted_union_order_by_50_uint64              1.00    658.1±9.96ms        ? ?/sec        1.00    658.1±9.79ms        ? ?/sec
physical_theta_join_consider_sort                     1.02   1336.3±4.80µs        ? ?/sec        1.00   1313.7±7.28µs        ? ?/sec
physical_unnest_to_join                               1.01   1361.3±4.54µs        ? ?/sec        1.00   1348.3±7.37µs        ? ?/sec
physical_window_function_partition_by_12_on_values    1.03    731.7±2.21µs        ? ?/sec        1.00    711.6±1.68µs        ? ?/sec
physical_window_function_partition_by_30_on_values    1.02   1457.9±4.72µs        ? ?/sec        1.00   1432.3±3.80µs        ? ?/sec
physical_window_function_partition_by_4_on_values     1.04    439.1±1.26µs        ? ?/sec        1.00    420.4±1.47µs        ? ?/sec
physical_window_function_partition_by_7_on_values     1.04    547.6±1.84µs        ? ?/sec        1.00    526.8±4.14µs        ? ?/sec
physical_window_function_partition_by_8_on_values     1.03    586.6±1.80µs        ? ?/sec        1.00    571.1±1.76µs        ? ?/sec
with_param_values_many_columns                        1.00    461.4±2.84µs        ? ?/sec        1.00    462.2±2.42µs        ? ?/sec

Resource Usage

base (merge-base)

Metric         Value
------         -----
Wall time      1273.5s
Peak memory    18.5 GiB
Avg memory     18.5 GiB
CPU user       1517.8s
CPU sys        1.8s
Disk read      0 B
Disk write     634.3 MiB

branch

Metric         Value
------         -----
Wall time      1277.0s
Peak memory    18.5 GiB
Avg memory     18.5 GiB
CPU user       1521.3s
CPU sys        1.2s
Disk read      0 B
Disk write     23.8 MiB

Development

Successfully merging this pull request may close these issues.

Support columns having the same alias

3 participants