unexpected schema field name conflicts in joins #670

jimexist · 2021-07-04T14:44:03Z

Describe the bug
A clear and concise description of what the bug is.

with a join, we'd get false positive schema name collision

To Reproduce
Steps to reproduce the behavior:

this works:

SELECT
  t1.c9 x
FROM test t1
INNER JOIN test t2
ON t1.c9 = t2.c9
ORDER BY x;

but this won't:

SELECT
  t1.c9 x, t1.c1 c1
FROM test t1
INNER JOIN test t2
ON t1.c9 = t2.c9
ORDER BY x;

Plan("The left schema and the right schema have the following columns with the same name without being on the ON statement: {Column { name: \"c1\", index: 0 }}. Consider aliasing them.")

Expected behavior
A clear and concise description of what you expected to happen.

it shall work as expected (e.g. as in Postgres)

Additional context
Add any other context about the problem here.

The text was updated successfully, but these errors were encountered:

houqp · 2021-07-04T18:35:12Z

I think this might have something to do with our projection push down logic:

> SELECT
  t1.c9 x, t1.c1 c1
FROM test t1
INNER JOIN test t2
ON t1.c9 = t2.c9
ORDER BY x;
[datafusion/src/execution/dataframe_impl.rs:148] &plan = Sort: #x ASC NULLS FIRST
  Projection: #t1.c9 AS x, #t1.c1 AS c1
    Join: #t1.c9 = #t2.c9
      TableScan: t1 projection=Some([0, 1])
      TableScan: t2 projection=Some([0, 1])

Noticed c1 got pushed down to t2 table scan as well even though it's not being projected. Most likely the projection push down logic is not distinguishing between column names and aliases.

Jefffrey · 2023-01-26T20:25:21Z

Issue seems resolved now as of latest master:

DataFusion CLI v16.0.0
❯ CREATE EXTERNAL TABLE test STORED AS CSV WITH HEADER ROW LOCATION 'test.csv';
0 rows in set. Query took 0.011 seconds.
❯ SELECT
  t1.c9 x
FROM test t1
INNER JOIN test t2
ON t1.c9 = t2.c9
ORDER BY x;
+---+
| x |
+---+
| 2 |
| 3 |
| 4 |
| 5 |
+---+
4 rows in set. Query took 0.013 seconds.
❯ SELECT
  t1.c9 x, t1.c1 c1
FROM test t1
INNER JOIN test t2
ON t1.c9 = t2.c9
ORDER BY x;
+---+----+
| x | c1 |
+---+----+
| 2 | 1  |
| 3 | 2  |
| 4 | 3  |
| 5 | 4  |
+---+----+
4 rows in set. Query took 0.013 seconds.
❯ explain SELECT
  t1.c9 x, t1.c1 c1
FROM test t1
INNER JOIN test t2
ON t1.c9 = t2.c9
ORDER BY x;
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                       |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: x ASC NULLS LAST                                                                                                                                     |
|               |   Projection: t1.c9 AS x, t1.c1 AS c1                                                                                                                      |
|               |     Inner Join: t1.c9 = t2.c9                                                                                                                              |
|               |       SubqueryAlias: t1                                                                                                                                    |
|               |         TableScan: test projection=[c1, c9]                                                                                                                |
|               |       SubqueryAlias: t2                                                                                                                                    |
|               |         TableScan: test projection=[c9]                                                                                                                    |
| physical_plan | SortExec: [x@0 ASC NULLS LAST]                                                                                                                             |
|               |   CoalescePartitionsExec                                                                                                                                   |
|               |     ProjectionExec: expr=[c9@1 as x, c1@0 as c1]                                                                                                           |
|               |       CoalesceBatchesExec: target_batch_size=8192                                                                                                          |
|               |         HashJoinExec: mode=Partitioned, join_type=Inner, on=[(Column { name: "c9", index: 1 }, Column { name: "c9", index: 0 })]                           |
|               |           CoalesceBatchesExec: target_batch_size=8192                                                                                                      |
|               |             RepartitionExec: partitioning=Hash([Column { name: "c9", index: 1 }], 12), input_partitions=12                                                 |
|               |               RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                                                                        |
|               |                 CsvExec: files={1 group: [[home/jeffrey/Code/arrow-datafusion/datafusion-cli/test.csv]]}, has_header=true, limit=None, projection=[c1, c9] |
|               |           CoalesceBatchesExec: target_batch_size=8192                                                                                                      |
|               |             RepartitionExec: partitioning=Hash([Column { name: "c9", index: 0 }], 12), input_partitions=12                                                 |
|               |               RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                                                                        |
|               |                 CsvExec: files={1 group: [[home/jeffrey/Code/arrow-datafusion/datafusion-cli/test.csv]]}, has_header=true, limit=None, projection=[c9]     |
|               |                                                                                                                                                            |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.010 seconds.
❯

yukkit · 2023-04-28T08:51:22Z

@alamb resolved, can be closed

alamb · 2023-04-28T10:06:23Z

Thanks @yukkit

…apache#741) ## Which issue does this PR close? Part of apache#670 ## Rationale for this change This PR improves the native execution performance on decimals with a small precision ## What changes are included in this PR? This PR changes not to promote decimal128 to decimal256 if the precisions are small enough ## How are these changes tested? Existing tests

jimexist added the bug Something isn't working label Jul 4, 2021

alamb closed this as completed Apr 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unexpected schema field name conflicts in joins #670

unexpected schema field name conflicts in joins #670

jimexist commented Jul 4, 2021 •

edited

Loading

houqp commented Jul 4, 2021

Jefffrey commented Jan 26, 2023

yukkit commented Apr 28, 2023

alamb commented Apr 28, 2023

unexpected schema field name conflicts in joins #670

unexpected schema field name conflicts in joins #670

Comments

jimexist commented Jul 4, 2021 • edited Loading

houqp commented Jul 4, 2021

Jefffrey commented Jan 26, 2023

yukkit commented Apr 28, 2023

alamb commented Apr 28, 2023

jimexist commented Jul 4, 2021 •

edited

Loading