A problem about the projection_push_down optimizer gathers valid columns #1312

ic4y · 2021-11-15T16:47:18Z

Describe the bug
Why does projection_push_down gather the columns needed for the expressions in an Aggregate from aggr_expr rather than from the schema?
An error occurs when a custom optimizer produces a schema whose field names do not match the column names derived from aggr_expr.

https://github.com/apache/arrow-datafusion/blob/0b6db52435c7c955b51e44de8a3eae58e53716d0/datafusion/src/optimizer/projection_push_down.rs#L292-L305

Gathering the columns from the schema instead, as follows, avoids the error:

            schema
                .fields()[group_expr.len()..].to_vec()
                .iter()
                .enumerate()
                .try_for_each(|(i, field)| {
                    if required_columns.contains(&field.qualified_column()) {
                        new_aggr_expr.push(aggr_expr[i].clone());
                        utils::expr_to_columns(&aggr_expr[i], &mut new_required_columns)
                    } else {
                        Ok(())
                    }
                })?;

houqp · 2021-11-16T06:49:01Z

I think this was probably written with the assumption that schema is always in sync with aggr_expr. @ic4y could you give a more concrete example of how a custom optimizer would change the schema without updating aggr_expr?

ic4y · 2021-11-16T07:37:31Z

@houqp
As in the following plans, I added a group-by Aggregate for the distinct aggregate and then removed DISTINCT from the COUNT. At this point aggr_expr has changed, but I don't want to rewrite the schema, because subsequent plan nodes, such as the Projection, still use this schema.

Projection: #COUNT(DISTINCT lineorder_flat.lo_orderkey) AS a [a:UInt64;N]
  Aggregate: groupBy=[[]], aggr=[[COUNT(DISTINCT #lineorder_flat.lo_orderkey)]] [COUNT(DISTINCT lineorder_flat.lo_orderkey):UInt64;N]
    TableScan: lineorder_flat projection=Some([0]) [lo_orderkey:Int64;N]

-------------------custom optimizer------------------------------------------
Projection: #COUNT(DISTINCT lineorder_flat.lo_orderkey) AS a [a:UInt64;N]
  Aggregate: groupBy=[[]], aggr=[[COUNT(#lineorder_flat.lo_orderkey)]] [COUNT(DISTINCT lineorder_flat.lo_orderkey):UInt64;N]
    Aggregate: groupBy=[[#lineorder_flat.lo_orderkey]], aggr=[[]] [lo_orderkey:Int64;N]
      TableScan: lineorder_flat projection=Some([0]) [lo_orderkey:Int64;N]
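
To make the mismatch in the plans above concrete, here is a minimal sketch in plain Rust. It does not use DataFusion: `AggrExpr`, `keep_by_expr_name`, and `keep_by_schema_field` are hypothetical stand-ins, and columns are modelled as plain strings. It only illustrates the difference between matching required columns against names derived from aggr_expr and matching them positionally against schema fields, assuming the schema lists the group-by fields first and the aggregate fields after them.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an aggregate expression; only its derived name matters here.
#[derive(Clone, Debug, PartialEq)]
struct AggrExpr {
    name: String,
}

/// Keep an aggregate expression only if the name derived from the expression is required.
fn keep_by_expr_name(aggr_expr: &[AggrExpr], required: &HashSet<String>) -> Vec<AggrExpr> {
    aggr_expr
        .iter()
        .filter(|e| required.contains(&e.name))
        .cloned()
        .collect()
}

/// Keep an aggregate expression if the schema field at the same position is required.
/// Assumes the first `group_len` schema fields belong to the group-by expressions.
fn keep_by_schema_field(
    aggr_expr: &[AggrExpr],
    schema_fields: &[String],
    group_len: usize,
    required: &HashSet<String>,
) -> Vec<AggrExpr> {
    schema_fields[group_len..]
        .iter()
        .zip(aggr_expr)
        .filter(|(field, _)| required.contains(*field))
        .map(|(_, e)| e.clone())
        .collect()
}

fn main() {
    // The Projection still asks for the original column name.
    let required = HashSet::from(["COUNT(DISTINCT lineorder_flat.lo_orderkey)".to_string()]);
    // After the custom rewrite, the expression no longer carries DISTINCT...
    let aggr_expr = vec![AggrExpr {
        name: "COUNT(lineorder_flat.lo_orderkey)".to_string(),
    }];
    // ...but the Aggregate's schema keeps the old field name.
    let schema_fields = vec!["COUNT(DISTINCT lineorder_flat.lo_orderkey)".to_string()];

    let by_name = keep_by_expr_name(&aggr_expr, &required);
    let by_schema = keep_by_schema_field(&aggr_expr, &schema_fields, 0, &required);

    // Name-based matching drops the aggregate; schema-based matching keeps it.
    assert!(by_name.is_empty());
    assert_eq!(by_schema, aggr_expr);
    println!("by name: {}, by schema: {}", by_name.len(), by_schema.len());
}
```

In this toy model the name-based filter finds no required aggregate, which corresponds to the error described in the issue, while the positional filter keeps it.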

alamb · 2021-11-17T16:57:41Z

This sounds very similar to #1316, which @viirya is working on in #1319 I believe

ic4y · 2021-11-22T02:44:10Z

I solved this problem by using a projection output relation alias.

ic4y mentioned this issue Nov 16, 2021

Optimize the performance queries with a single distinct aggregate #1315

Merged

ic4y closed this as completed Nov 22, 2021
