Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unexpected schema field name conflicts in joins #670

Closed
jimexist opened this issue Jul 4, 2021 · 4 comments
Closed

unexpected schema field name conflicts in joins #670

jimexist opened this issue Jul 4, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@jimexist
Copy link
Member

jimexist commented Jul 4, 2021

Describe the bug
A clear and concise description of what the bug is.

with a join, we'd get false positive schema name collision

To Reproduce
Steps to reproduce the behavior:

this works:

SELECT
  t1.c9 x
FROM test t1
INNER JOIN test t2
ON t1.c9 = t2.c9
ORDER BY x;

but this won't:

SELECT
  t1.c9 x, t1.c1 c1
FROM test t1
INNER JOIN test t2
ON t1.c9 = t2.c9
ORDER BY x;
Plan("The left schema and the right schema have the following columns with the same name without being on the ON statement: {Column { name: \"c1\", index: 0 }}. Consider aliasing them.")

Expected behavior
A clear and concise description of what you expected to happen.

it shall work as expected (e.g. as in Postgres)

Additional context
Add any other context about the problem here.

@jimexist jimexist added the bug Something isn't working label Jul 4, 2021
@houqp
Copy link
Member

houqp commented Jul 4, 2021

I think this might have something to do with our projection push down logic:

> SELECT
  t1.c9 x, t1.c1 c1
FROM test t1
INNER JOIN test t2
ON t1.c9 = t2.c9
ORDER BY x;
[datafusion/src/execution/dataframe_impl.rs:148] &plan = Sort: #x ASC NULLS FIRST
  Projection: #t1.c9 AS x, #t1.c1 AS c1
    Join: #t1.c9 = #t2.c9
      TableScan: t1 projection=Some([0, 1])
      TableScan: t2 projection=Some([0, 1])

Noticed c1 got pushed down to t2 table scan as well even though it's not being projected. Most likely the projection push down logic is not distinguishing between column names and aliases.

@Jefffrey
Copy link
Contributor

Issue seems resolved now as of latest master:

DataFusion CLI v16.0.0
❯ CREATE EXTERNAL TABLE test STORED AS CSV WITH HEADER ROW LOCATION 'test.csv';
0 rows in set. Query took 0.011 seconds.
❯ SELECT
  t1.c9 x
FROM test t1
INNER JOIN test t2
ON t1.c9 = t2.c9
ORDER BY x;
+---+
| x |
+---+
| 2 |
| 3 |
| 4 |
| 5 |
+---+
4 rows in set. Query took 0.013 seconds.
❯ SELECT
  t1.c9 x, t1.c1 c1
FROM test t1
INNER JOIN test t2
ON t1.c9 = t2.c9
ORDER BY x;
+---+----+
| x | c1 |
+---+----+
| 2 | 1  |
| 3 | 2  |
| 4 | 3  |
| 5 | 4  |
+---+----+
4 rows in set. Query took 0.013 seconds.
❯ explain SELECT
  t1.c9 x, t1.c1 c1
FROM test t1
INNER JOIN test t2
ON t1.c9 = t2.c9
ORDER BY x;
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                       |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: x ASC NULLS LAST                                                                                                                                     |
|               |   Projection: t1.c9 AS x, t1.c1 AS c1                                                                                                                      |
|               |     Inner Join: t1.c9 = t2.c9                                                                                                                              |
|               |       SubqueryAlias: t1                                                                                                                                    |
|               |         TableScan: test projection=[c1, c9]                                                                                                                |
|               |       SubqueryAlias: t2                                                                                                                                    |
|               |         TableScan: test projection=[c9]                                                                                                                    |
| physical_plan | SortExec: [x@0 ASC NULLS LAST]                                                                                                                             |
|               |   CoalescePartitionsExec                                                                                                                                   |
|               |     ProjectionExec: expr=[c9@1 as x, c1@0 as c1]                                                                                                           |
|               |       CoalesceBatchesExec: target_batch_size=8192                                                                                                          |
|               |         HashJoinExec: mode=Partitioned, join_type=Inner, on=[(Column { name: "c9", index: 1 }, Column { name: "c9", index: 0 })]                           |
|               |           CoalesceBatchesExec: target_batch_size=8192                                                                                                      |
|               |             RepartitionExec: partitioning=Hash([Column { name: "c9", index: 1 }], 12), input_partitions=12                                                 |
|               |               RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                                                                        |
|               |                 CsvExec: files={1 group: [[home/jeffrey/Code/arrow-datafusion/datafusion-cli/test.csv]]}, has_header=true, limit=None, projection=[c1, c9] |
|               |           CoalesceBatchesExec: target_batch_size=8192                                                                                                      |
|               |             RepartitionExec: partitioning=Hash([Column { name: "c9", index: 0 }], 12), input_partitions=12                                                 |
|               |               RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                                                                        |
|               |                 CsvExec: files={1 group: [[home/jeffrey/Code/arrow-datafusion/datafusion-cli/test.csv]]}, has_header=true, limit=None, projection=[c9]     |
|               |                                                                                                                                                            |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.010 seconds.
❯

@yukkit
Copy link
Contributor

yukkit commented Apr 28, 2023

@alamb resolved, can be closed

@alamb
Copy link
Contributor

alamb commented Apr 28, 2023

Thanks @yukkit

@alamb alamb closed this as completed Apr 28, 2023
andygrove pushed a commit to andygrove/datafusion that referenced this issue Jan 31, 2025
…apache#741)

## Which issue does this PR close?

Part of apache#670

## Rationale for this change

This PR improves the native execution performance on decimals with a small precision

## What changes are included in this PR?

This PR changes not to promote decimal128 to decimal256 if the precisions are small enough

## How are these changes tested?

Existing tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants