Fix bug in projection: "column types must match schema types, expected XXX but found YYY" #1448

alamb · 2021-12-14T15:57:05Z

Which issue does this PR close?

Closes #1447

Rationale for this change

There was a bug introduced in #1378

What changes are included in this PR?

Do not copy the field from the input schema, but explicitly copy the metadata

Are there any user-facing changes?

Queries that should work will once again do so

cc @hntd187

alamb · 2021-12-14T15:58:18Z

datafusion/src/physical_plan/projection.rs

@@ -63,13 +63,22 @@ impl ProjectionExec {

        let fields: Result<Vec<Field>> = expr
            .iter()
-            .map(|(e, name)| match input_schema.field_with_name(name) {
-                Ok(f) => Ok(f.clone()),


I totally missed this in my review -- basically the projection's output is not the same for its input even if the field name matches (due to aliases). The actual output type needs to be calculated from the expr for all cases.

alamb · 2021-12-14T16:00:23Z

datafusion/src/physical_plan/projection.rs

-                    let nullable = e.nullable(&input_schema)?;
-                    Ok(Field::new(name, dt, nullable))
-                }
+            .map(|(e, name)| {


I am not convinced it is a great idea to copy field metadata from the input to the output based on the field name alone... As that would mean metadata on a field named a from the input could end up on the output field a even if the output was not related at all to a.

Thus I ~~will make~~ made this code only copy field level metadata for a direct input column reference

This was the question I had asked previously about when projections created new columns, basically, it doesn't seem to make sense to just blindly carry over metadata because it's about a column that may no longer exist.

hntd187 · 2021-12-14T16:55:08Z

Queries that should work will once again do so

I like what you did there :)

I have a perhaps naive question, shouldn't aliasing a query to an already existing column just fail outright? Wouldn't that reasonably be something of a "which column do you really want here?" confusion?

alamb · 2021-12-14T18:49:28Z

I have a perhaps naive question, shouldn't aliasing a query to an already existing column just fail outright? Wouldn't that reasonably be something of a "which column do you really want here?" confusion?

@hntd187 -- I agree that aliasing one column to another column is unlikely to be actually useful 😆 it was just the minimal reproducer I could come up with.

Another use of Projection, despite its slightly misleading name, is to evaluate expressions as well as to control the names of the fields in the output schema, which is what IOx was doing that triggered this bug.

Specifically, the plan in the IOx test was:

2021-12-13T14:38:49.774041Z DEBUG datafusion::execution::context: Logical plan:
 Projection: #cpu.cpu, #cpu.host, CAST(#usage_system AS Int64) AS usage_system, CAST(#usage_user AS Int64) AS usage_user, #time
  Sort: #cpu.cpu ASC NULLS FIRST, #cpu.host ASC NULLS FIRST
    Projection: #cpu.cpu, #cpu.host, #usage_system, #usage_user, #time
      Aggregate: groupBy=[[#cpu.cpu, #cpu.host]], aggr=[[COUNT(#cpu.usage_system AS usage_system) AS usage_system, COUNT(#cpu.usage_user AS usage_user) AS usage_user, MAX(#cpu.time) AS time]]
        Filter: TimestampNanosecond(0) <= #cpu.time AND #cpu.time < TimestampNanosecond(2001)
          TableScan: cpu projection=None

Which was built programatically, but is approximately what would come out of this query

SELECT 
  cpu,
  host,
  count(usage_system) as usage_system, 
  count(usage_user) as usage_user
  max(time) as time,
FROM
  cpu
WHERE 
  0 < time 
  AND time < 2001

In this case, the consumer of the output expect the columns named a certain way, and without the alias count(usage_system) results in a column named something like count(usage_system) rather than the expected usage_system, and this the alias is added

hntd187 · 2021-12-14T19:21:59Z

Ahh this fails because of the count changing the type of usage_system and usage_user correct?

alamb · 2021-12-14T20:01:05Z

Ahh this fails because of the count changing the type of usage_system and usage_user correct?

Yes, I think the type of usage_system is UInt64 but the output of the projection count(usage_system) is Int64 which is not the same as UInt64 🤣

Fix bug in projection: "column types must match schema types, expected XXX but found YYY" (apache#1448) See merge request noah/arrow-datafusion!2

Fix bug in projection exec

6ef7f4d

github-actions bot added the datafusion Changes in the datafusion crate label Dec 14, 2021

Only extract field metadata for direct field access

bd0f1ac

alamb commented Dec 14, 2021

View reviewed changes

clippy

bd4dced

alamb merged commit 1448d97 into apache:master Dec 15, 2021

alamb deleted the alamb/fix_projection branch August 8, 2023 20:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix bug in projection: "column types must match schema types, expected XXX but found YYY" #1448

Fix bug in projection: "column types must match schema types, expected XXX but found YYY" #1448

alamb commented Dec 14, 2021

alamb Dec 14, 2021

alamb Dec 14, 2021

hntd187 Dec 14, 2021

hntd187 commented Dec 14, 2021 •

edited

Loading

alamb commented Dec 14, 2021 •

edited

Loading

hntd187 commented Dec 14, 2021

alamb commented Dec 14, 2021

Fix bug in projection: "column types must match schema types, expected XXX but found YYY" #1448

Fix bug in projection: "column types must match schema types, expected XXX but found YYY" #1448

Conversation

alamb commented Dec 14, 2021

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

alamb Dec 14, 2021

Choose a reason for hiding this comment

alamb Dec 14, 2021

Choose a reason for hiding this comment

hntd187 Dec 14, 2021

Choose a reason for hiding this comment

hntd187 commented Dec 14, 2021 • edited Loading

alamb commented Dec 14, 2021 • edited Loading

hntd187 commented Dec 14, 2021

alamb commented Dec 14, 2021

hntd187 commented Dec 14, 2021 •

edited

Loading

alamb commented Dec 14, 2021 •

edited

Loading