Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in projection: "column types must match schema types, expected XXX but found YYY" #1448

Merged
merged 3 commits into from
Dec 15, 2021

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 14, 2021

Which issue does this PR close?

Closes #1447

Rationale for this change

There was a bug introduced in #1378

What changes are included in this PR?

Do not copy the field from the input schema, but explicitly copy the metadata

Are there any user-facing changes?

Queries that should work will once again do so

cc @hntd187

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Dec 14, 2021
@@ -63,13 +63,22 @@ impl ProjectionExec {

let fields: Result<Vec<Field>> = expr
.iter()
.map(|(e, name)| match input_schema.field_with_name(name) {
Ok(f) => Ok(f.clone()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally missed this in my review -- basically the projection's output is not the same for its input even if the field name matches (due to aliases). The actual output type needs to be calculated from the expr for all cases.

let nullable = e.nullable(&input_schema)?;
Ok(Field::new(name, dt, nullable))
}
.map(|(e, name)| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced it is a great idea to copy field metadata from the input to the output based on the field name alone... As that would mean metadata on a field named a from the input could end up on the output field a even if the output was not related at all to a.

Thus I will make made this code only copy field level metadata for a direct input column reference

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the question I had asked previously about when projections created new columns, basically, it doesn't seem to make sense to just blindly carry over metadata because it's about a column that may no longer exist.

@hntd187
Copy link
Contributor

hntd187 commented Dec 14, 2021

Queries that should work will once again do so

I like what you did there :)

I have a perhaps naive question, shouldn't aliasing a query to an already existing column just fail outright? Wouldn't that reasonably be something of a "which column do you really want here?" confusion?

@alamb
Copy link
Contributor Author

alamb commented Dec 14, 2021

I have a perhaps naive question, shouldn't aliasing a query to an already existing column just fail outright? Wouldn't that reasonably be something of a "which column do you really want here?" confusion?

@hntd187 -- I agree that aliasing one column to another column is unlikely to be actually useful 😆 it was just the minimal reproducer I could come up with.

Another use of Projection, despite its slightly misleading name, is to evaluate expressions as well as to control the names of the fields in the output schema, which is what IOx was doing that triggered this bug.

Specifically, the plan in the IOx test was:

2021-12-13T14:38:49.774041Z DEBUG datafusion::execution::context: Logical plan:
 Projection: #cpu.cpu, #cpu.host, CAST(#usage_system AS Int64) AS usage_system, CAST(#usage_user AS Int64) AS usage_user, #time
  Sort: #cpu.cpu ASC NULLS FIRST, #cpu.host ASC NULLS FIRST
    Projection: #cpu.cpu, #cpu.host, #usage_system, #usage_user, #time
      Aggregate: groupBy=[[#cpu.cpu, #cpu.host]], aggr=[[COUNT(#cpu.usage_system AS usage_system) AS usage_system, COUNT(#cpu.usage_user AS usage_user) AS usage_user, MAX(#cpu.time) AS time]]
        Filter: TimestampNanosecond(0) <= #cpu.time AND #cpu.time < TimestampNanosecond(2001)
          TableScan: cpu projection=None

Which was built programatically, but is approximately what would come out of this query

SELECT 
  cpu,
  host,
  count(usage_system) as usage_system, 
  count(usage_user) as usage_user
  max(time) as time,
FROM
  cpu
WHERE 
  0 < time 
  AND time < 2001

In this case, the consumer of the output expect the columns named a certain way, and without the alias count(usage_system) results in a column named something like count(usage_system) rather than the expected usage_system, and this the alias is added

@hntd187
Copy link
Contributor

hntd187 commented Dec 14, 2021

Ahh this fails because of the count changing the type of usage_system and usage_user correct?

@alamb
Copy link
Contributor Author

alamb commented Dec 14, 2021

Ahh this fails because of the count changing the type of usage_system and usage_user correct?

Yes, I think the type of usage_system is UInt64 but the output of the projection count(usage_system) is Int64 which is not the same as UInt64 🤣

@alamb alamb merged commit 1448d97 into apache:master Dec 15, 2021
EricJoy2048 pushed a commit to EricJoy2048/arrow-datafusion that referenced this pull request Feb 23, 2022
Fix bug in projection: "column types must match schema types, expected XXX but found YYY" (apache#1448)

See merge request noah/arrow-datafusion!2
@alamb alamb deleted the alamb/fix_projection branch August 8, 2023 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error on some queries: "column types must match schema types, expected XXX but found YYY"
2 participants