Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In IOx, each table is broken up logically into chunks (like row groups in Parquet files), but a chunk might be missing some columns, and each chunk has its own statistics.
When predicates are applied to scan / filter these chunks, they are potentially in terms of all columns of the table. If a chunk is missing a column (or we know from statistics that the column contains no nulls), expressions like col IS NULL and col IS NOT NULL can be replaced with true or false, and predicates like col > 5 can, in some cases, be replaced with null > 5.
Once this substitution is done, it may allow additional simplification of the predicate -- ideally all the way down to true or false.
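To make the two steps concrete, here is a minimal sketch in plain Rust. It uses a toy `Expr` enum and the hypothetical helpers `substitute_missing` and `simplify`; none of these are DataFusion's actual types or API, they only illustrate substitution followed by simplification.

```rust
use std::collections::HashSet;

/// Toy expression type (hypothetical; NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Int(i64),
    Bool(bool),
    Null,
    IsNull(Box<Expr>),
    IsNotNull(Box<Expr>),
    Gt(Box<Expr>, Box<Expr>),
}

/// Step 1: replace references to columns the chunk does not have with NULL.
pub fn substitute_missing(expr: Expr, present: &HashSet<&str>) -> Expr {
    match expr {
        Expr::Column(name) if !present.contains(name.as_str()) => Expr::Null,
        Expr::IsNull(e) => Expr::IsNull(Box::new(substitute_missing(*e, present))),
        Expr::IsNotNull(e) => Expr::IsNotNull(Box::new(substitute_missing(*e, present))),
        Expr::Gt(l, r) => Expr::Gt(
            Box::new(substitute_missing(*l, present)),
            Box::new(substitute_missing(*r, present)),
        ),
        other => other,
    }
}

/// Step 2: fold whatever became constant after the substitution.
pub fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::IsNull(e) => match simplify(*e) {
            Expr::Null => Expr::Bool(true),
            Expr::Int(_) | Expr::Bool(_) => Expr::Bool(false),
            other => Expr::IsNull(Box::new(other)),
        },
        Expr::IsNotNull(e) => match simplify(*e) {
            Expr::Null => Expr::Bool(false),
            Expr::Int(_) | Expr::Bool(_) => Expr::Bool(true),
            other => Expr::IsNotNull(Box::new(other)),
        },
        Expr::Gt(l, r) => match (simplify(*l), simplify(*r)) {
            // SQL three-valued logic: a comparison with NULL yields NULL.
            (Expr::Null, _) | (_, Expr::Null) => Expr::Null,
            (Expr::Int(a), Expr::Int(b)) => Expr::Bool(a > b),
            (l, r) => Expr::Gt(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

With a chunk that only has column a, `b IS NULL` folds to true and `b > 5` folds to NULL, while `a > 5` is left untouched because nothing is known about it.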
One particular expression of this type that we will use in IOx maps null to the empty string '', like this:
CASE
WHEN col IS NULL THEN ''
ELSE col
END
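As an illustration of how that CASE expression could collapse, the sketch below again uses a toy `Expr` enum plus a hypothetical `ColumnInfo` lookup standing in for chunk schema and statistics. It is an assumption about what such a simplifier might do, not how DataFusion or IOx implement it.

```rust
/// Toy expression type (hypothetical; NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Str(String),
    Bool(bool),
    Null,
    IsNull(Box<Expr>),
    /// CASE WHEN <when> THEN <then> ELSE <otherwise> END
    Case { when: Box<Expr>, then: Box<Expr>, otherwise: Box<Expr> },
}

/// What a chunk knows about a column (hypothetical stand-in for statistics).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ColumnInfo {
    Missing,
    NoNulls,
    MayHaveNulls,
}

pub fn simplify(expr: Expr, info: &dyn Fn(&str) -> ColumnInfo) -> Expr {
    match expr {
        // A column the chunk does not have is NULL for every row.
        Expr::Column(name) => match info(name.as_str()) {
            ColumnInfo::Missing => Expr::Null,
            _ => Expr::Column(name),
        },
        Expr::IsNull(inner) => {
            // Statistics say the column has no nulls: IS NULL is always false.
            if let Expr::Column(name) = inner.as_ref() {
                if info(name.as_str()) == ColumnInfo::NoNulls {
                    return Expr::Bool(false);
                }
            }
            match simplify(*inner, info) {
                Expr::Null => Expr::Bool(true),
                Expr::Str(_) | Expr::Bool(_) => Expr::Bool(false),
                other => Expr::IsNull(Box::new(other)),
            }
        }
        Expr::Case { when, then, otherwise } => match simplify(*when, info) {
            Expr::Bool(true) => simplify(*then, info),
            // A false or NULL condition selects the ELSE branch.
            Expr::Bool(false) | Expr::Null => simplify(*otherwise, info),
            w => Expr::Case {
                when: Box::new(w),
                then: Box::new(simplify(*then, info)),
                otherwise: Box::new(simplify(*otherwise, info)),
            },
        },
        other => other,
    }
}
```

Under these assumptions, the CASE above becomes the literal '' for a chunk that is missing col, becomes plain col for a chunk whose statistics show no nulls, and is left unchanged otherwise.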
The same general pattern likely holds for ParquetExec now that @thinkharderdev has added support to merge schemas for multiple files in #1622. Once DataFusion is able to push predicates down into the Parquet scans, simplifying the predicates as much as possible beforehand would be ideal.
The current API in https://github.com/apache/arrow-datafusion/blob/03075d5f4b3fdfd8f82144fcd409418832a4bf69/datafusion/src/optimizer/simplify_expressions.rs is ExecutionProps, which is fairly entangled with the overall machinery of how plans are executed (and means we see issues like DiskManager and TempFiles getting created several times per query, #1690).
Describe the solution you'd like
I would like DataFusion to have a public API for simplifying expressions.
I am thinking of something like ExprEvalContext as a trait, so that it is clear what expression evaluation actually requires, and so that Exprs can be simplified prior to execution or in the bowels of DataFusion's planner (I will implement it for ExecutionProps).
Describe alternatives you've considered
I am not fully sure about the API design -- I'll know more when I sketch one out.
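Purely as a conversation starter, one possible shape for such a trait is sketched below. Only the name ExprEvalContext comes from this issue. The `nullable` method, `ChunkContext`, `NoInfo`, and `simplify_is_null` are all hypothetical, and the final design may look quite different.

```rust
use std::collections::HashMap;

/// Hypothetical trait: the minimum a simplifier needs to know,
/// decoupled from the machinery that executes plans.
pub trait ExprEvalContext {
    /// `Some(false)` if `column` is known to contain no nulls,
    /// `Some(true)` if it is known to contain nulls, `None` if unknown.
    fn nullable(&self, column: &str) -> Option<bool>;
}

/// Hypothetical per-chunk context backed by null-count statistics.
pub struct ChunkContext {
    null_counts: HashMap<String, u64>,
}

impl ChunkContext {
    pub fn new(null_counts: HashMap<String, u64>) -> Self {
        Self { null_counts }
    }
}

impl ExprEvalContext for ChunkContext {
    fn nullable(&self, column: &str) -> Option<bool> {
        self.null_counts.get(column).map(|&n| n > 0)
    }
}

/// A context that knows nothing; simplification then leaves predicates alone.
pub struct NoInfo;

impl ExprEvalContext for NoInfo {
    fn nullable(&self, _column: &str) -> Option<bool> {
        None
    }
}

/// Try to decide `column IS NULL` from the context alone.
/// Returns `Some(false)` only when the column is known to have no nulls.
pub fn simplify_is_null(ctx: &dyn ExprEvalContext, column: &str) -> Option<bool> {
    match ctx.nullable(column) {
        Some(false) => Some(false),
        // Known to contain some nulls, or unknown: cannot decide per row.
        _ => None,
    }
}
```

The point of the trait object in `simplify_is_null` is that the same simplification code could run against chunk statistics, a planner context, or no information at all, without pulling in any execution state.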
Additional context
#1693
https://github.com/influxdata/influxdb_iox/pull/3557