Public Expr simplification API #1694

Closed
alamb opened this issue Jan 28, 2022 · 0 comments · Fixed by #1717
Labels
datafusion Changes in the datafusion crate enhancement New feature or request


alamb commented Jan 28, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In IOx, each table is broken up logically into chunks (like row groups in parquet files), but a chunk might be missing some columns, and each chunk has its own statistics.

When predicates are applied to scan / filter these chunks, they are potentially expressed in terms of all columns of a table. If a chunk is missing a column (so the column is entirely null in that chunk), or we know from statistics that a column contains no nulls, expressions like col IS NULL and col IS NOT NULL can be replaced with true or false, and predicates like col > 5 can be replaced with null > 5 in some cases.

Once this substitution is done, it may allow additional simplification of the predicate -- ideally all the way down to true or false.
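To make the substitute-then-simplify idea concrete, here is a minimal sketch. It uses a hypothetical `ToyExpr` enum and a hypothetical `simplify_for_chunk` function, not DataFusion's real `Expr` or any existing API, and it only handles the "column is missing from this chunk" case.

```rust
// Hypothetical stand-in for an expression tree; NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum ToyExpr {
    Column(String),
    /// A boolean literal; `None` models SQL NULL.
    Literal(Option<bool>),
    IsNull(Box<ToyExpr>),
    IsNotNull(Box<ToyExpr>),
    And(Box<ToyExpr>, Box<ToyExpr>),
}

/// Replaces references to columns missing from a chunk with NULL,
/// then folds whatever becomes constant as a result.
fn simplify_for_chunk(expr: ToyExpr, missing: &[&str]) -> ToyExpr {
    use ToyExpr::*;
    match expr {
        // A column that the chunk does not have is NULL for every row.
        Column(name) if missing.contains(&name.as_str()) => Literal(None),
        IsNull(inner) => match simplify_for_chunk(*inner, missing) {
            Literal(v) => Literal(Some(v.is_none())),
            other => IsNull(Box::new(other)),
        },
        IsNotNull(inner) => match simplify_for_chunk(*inner, missing) {
            Literal(v) => Literal(Some(v.is_some())),
            other => IsNotNull(Box::new(other)),
        },
        And(l, r) => match (
            simplify_for_chunk(*l, missing),
            simplify_for_chunk(*r, missing),
        ) {
            // false AND anything is false, even under three-valued logic.
            (Literal(Some(false)), _) | (_, Literal(Some(false))) => Literal(Some(false)),
            // true AND x is x.
            (Literal(Some(true)), other) | (other, Literal(Some(true))) => other,
            (l, r) => And(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

With the predicate a IS NULL AND b IS NOT NULL, a chunk missing b lets the whole predicate collapse to false, while a chunk missing a leaves only b IS NOT NULL to evaluate.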

One particular expression of this type that we will use in IOx maps null to an empty string ('') like this:

CASE 
  WHEN col is NULL THEN '' 
  ELSE col 
END
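As a rough illustration of what simplifying this CASE could look like for a chunk that lacks col, the sketch below uses a hypothetical `ToyValueExpr` enum and hypothetical `null_to_empty` / `simplify_case` helpers. None of these names exist in DataFusion or IOx; they are assumptions made for the example.

```rust
// Hypothetical string-valued expression tree; NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum ToyValueExpr {
    Column(String),
    /// A string literal; `None` models SQL NULL.
    Utf8(Option<String>),
    /// CASE WHEN <operand> IS NULL THEN <then> ELSE <otherwise> END
    CaseWhenNull {
        operand: Box<ToyValueExpr>,
        then: Box<ToyValueExpr>,
        otherwise: Box<ToyValueExpr>,
    },
}

/// Builds `CASE WHEN col IS NULL THEN '' ELSE col END` for the given column.
fn null_to_empty(col: &str) -> ToyValueExpr {
    ToyValueExpr::CaseWhenNull {
        operand: Box::new(ToyValueExpr::Column(col.to_string())),
        then: Box::new(ToyValueExpr::Utf8(Some(String::new()))),
        otherwise: Box::new(ToyValueExpr::Column(col.to_string())),
    }
}

/// Substitutes NULL for missing columns and picks a CASE branch
/// whenever the operand's nullness is known.
fn simplify_case(expr: ToyValueExpr, missing: &[&str]) -> ToyValueExpr {
    use ToyValueExpr::*;
    match expr {
        Column(name) if missing.contains(&name.as_str()) => Utf8(None),
        CaseWhenNull { operand, then, otherwise } => {
            match simplify_case(*operand, missing) {
                // Operand is known to be NULL: only the THEN branch matters.
                Utf8(None) => simplify_case(*then, missing),
                // Operand is a known non-NULL literal: only ELSE matters.
                Utf8(Some(_)) => simplify_case(*otherwise, missing),
                // Nullness unknown: keep the CASE, simplify its branches.
                operand => CaseWhenNull {
                    operand: Box::new(operand),
                    then: Box::new(simplify_case(*then, missing)),
                    otherwise: Box::new(simplify_case(*otherwise, missing)),
                },
            }
        }
        other => other,
    }
}
```

For a chunk that is missing col, `simplify_case(null_to_empty("col"), &["col"])` reduces the whole CASE to the literal ''. For a chunk that has col, the expression is returned unchanged.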

The same general pattern likely holds for ParquetExec now that @thinkharderdev has added support to merge schemas for multiple files in #1622. Once DataFusion is able to push predicates down into the parquet scans, simplifying the predicates as much as possible beforehand would be ideal.

The current API in https://github.com/apache/arrow-datafusion/blob/03075d5f4b3fdfd8f82144fcd409418832a4bf69/datafusion/src/optimizer/simplify_expressions.rs is

  1. Private
  2. Requires ExecutionProps, which is fairly entangled with the overall machinery of how plans are executed (and means we see issues like "DiskManager and TempFiles getting created several times per query", #1690)

Describe the solution you'd like
I would like DataFusion to have a public API for simplifying expressions. The proposed API looks like

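/// Information needed to evaluate / simplify an expression
/// (required methods still to be determined)
pub trait ExprEvalContext {
}

impl Expr {
  /// Returns a simplified version of this expression
  pub fn simplify(self, context: &dyn ExprEvalContext) -> Self {
    todo!()
  }
}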

I am thinking of ExprEvalContext as a trait so that it is clear what expression evaluation actually requires, and so that Exprs can be simplified either prior to execution or in the bowels of DataFusion's planner (I will implement it for ExecutionProps).
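One possible shape for this, as a sketch only: the trait exposes just the facts that simplification needs, and anything that can supply them can implement it. Everything here is hypothetical (`ToyEvalContext`, its `query_start_nanos` method, `ToyExecutionProps`, and `ToyTimeExpr`). The choice of a query start time as the required fact is an assumption made for illustration, not the settled design.

```rust
/// Hypothetical context trait: only what simplification needs to know.
trait ToyEvalContext {
    /// Assumed requirement: the timestamp (in nanoseconds) that `now()`
    /// should evaluate to for the current query.
    fn query_start_nanos(&self) -> i64;
}

/// Hypothetical stand-in for the executor's properties struct.
struct ToyExecutionProps {
    query_start_nanos: i64,
}

impl ToyEvalContext for ToyExecutionProps {
    fn query_start_nanos(&self) -> i64 {
        self.query_start_nanos
    }
}

// Hypothetical expression tree; NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum ToyTimeExpr {
    Column(String),
    /// The `now()` function call.
    Now,
    TimestampNanos(i64),
    Lt(Box<ToyTimeExpr>, Box<ToyTimeExpr>),
}

impl ToyTimeExpr {
    /// Replaces `now()` with the constant supplied by the context.
    fn simplify(self, context: &dyn ToyEvalContext) -> Self {
        match self {
            Self::Now => Self::TimestampNanos(context.query_start_nanos()),
            Self::Lt(l, r) => Self::Lt(
                Box::new((*l).simplify(context)),
                Box::new((*r).simplify(context)),
            ),
            other => other,
        }
    }
}
```

Because `simplify` only sees `&dyn ToyEvalContext`, a caller outside the planner could pass its own small implementation instead of constructing the full execution machinery.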

Describe alternatives you've considered
I am not fully sure about the API design -- I'll know more when I sketch one out

Additional context
#1693
https://github.com/influxdata/influxdb_iox/pull/3557

@alamb alamb added the enhancement New feature or request label Jan 28, 2022
@alamb alamb added the datafusion Changes in the datafusion crate label Feb 10, 2022