Simplify Pushed Down Predicates #4020

Closed
tustvold opened this issue Oct 29, 2022 · 3 comments · Fixed by #4279
Labels
enhancement New feature or request

Comments

@tustvold

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Not sure if this is a bug or a feature request, but running the query from #4005 pushes down a somewhat unexpected predicate:

explain select * from foo where container = 'backend_container_0' OR pod = 'aqcathnxqsphdhgjtgvxsfyiwbmhlmg';

...
|               |         ParquetExec: limit=None, partitions=[home/raphael/Downloads/data.parquet], predicate=container_min@0 <= backend_container_0 AND backend_container_0 <= container_max@1 OR pod_min@2 <= aqcathnxqsphdhgjtgvxsfyiwbmhlmg AND aqcathnxqsphdhgjtgvxsfyiwbmhlmg <= pod_max@3, projection=[service, host, pod, container, image, time, client_addr, request_duration_ns, request_user_agent, request_method, request_host, request_bytes, response_bytes, response_status] |

In particular, each equality predicate appears to be split into a <= and a >= comparison, effectively doubling the work needed to evaluate it.

Describe the solution you'd like

I would expect the minimal set of predicates to be pushed down

Describe alternatives you've considered

Additional context

FYI @alamb

tustvold added the enhancement (New feature or request) label on Oct 29, 2022
@alamb

alamb commented Oct 30, 2022

I agree that is not good. We can isolate which optimizer pass is doing it by using EXPLAIN VERBOSE, and then track it down further.

@alamb

alamb commented Nov 11, 2022

I have confirmed that what is displayed as predicate is actually the PruningPredicate, not the pushed-down filters:

https://github.com/apache/arrow-datafusion/blob/f2f846512ab032845de5dcee768a8a69ddf17eac/datafusion/core/src/physical_plan/file_format/parquet.rs#L304-L313
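
To illustrate why an equality shows up as a pair of <= comparisons, here is a minimal sketch of statistics-based pruning. It is not DataFusion's implementation: ColumnStats, eq_may_match, and or_eq_may_match are hypothetical names, and string min/max values are assumed. The point is that the range check runs once per row group against its min/max statistics, not once per row.

```rust
/// Min/max statistics for one column in one row group.
/// Hypothetical type for illustration; not a DataFusion or parquet type.
#[derive(Debug, Clone)]
pub struct ColumnStats {
    pub min: String,
    pub max: String,
}

/// Pruning form of `col = value`.
/// A row group can contain a matching row only if min <= value <= max,
/// which is why one equality turns into two comparisons on statistics.
pub fn eq_may_match(stats: &ColumnStats, value: &str) -> bool {
    stats.min.as_str() <= value && value <= stats.max.as_str()
}

/// Pruning form of `a = x OR b = y`.
/// The row group must be read if either side may match.
pub fn or_eq_may_match(a: &ColumnStats, x: &str, b: &ColumnStats, y: &str) -> bool {
    eq_may_match(a, x) || eq_may_match(b, y)
}
```

A row group for which the check returns false can be skipped entirely; one for which it returns true still has to be scanned and filtered row by row.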

While reviewing the code, I found what I think is a real limitation as well: we push filter expressions down into the parquet scan only if we can build a pruning predicate from them, and pruning predicates cover a much more limited set of expressions.

I think what we should do to fix this issue is:

  1. Change the display name from predicate to pruning_predicate
  2. Push down the filter that is passed to ParquetExec::new() rather than just the pruning predicate.
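
The two steps above could look roughly like the following. This is only a sketch under stated assumptions, not the actual ParquetExec code: Filter, PruningPredicate, try_build_pruning, and Scan are hypothetical, simplified stand-ins. The scan keeps the original filter even when no pruning predicate can be derived from it, and the derived expression is displayed under the name pruning_predicate.

```rust
use std::fmt;

/// Hypothetical, simplified filter expression.
#[derive(Debug, Clone, PartialEq)]
pub enum Filter {
    /// `column = value`
    Eq { column: String, value: String },
    /// A filter with no min/max rewrite in this sketch (for example a UDF call).
    Opaque(String),
}

/// Hypothetical statistics-level rewrite of a filter, kept as display text.
#[derive(Debug, Clone, PartialEq)]
pub struct PruningPredicate {
    pub text: String,
}

/// Derive a pruning predicate when the filter has a min/max form.
pub fn try_build_pruning(filter: &Filter) -> Option<PruningPredicate> {
    match filter {
        Filter::Eq { column, value } => Some(PruningPredicate {
            text: format!("{c}_min <= {v} AND {v} <= {c}_max", c = column, v = value),
        }),
        Filter::Opaque(_) => None,
    }
}

/// Hypothetical scan node that stores both the filter and its pruning form.
pub struct Scan {
    pub predicate: Option<Filter>,
    pub pruning_predicate: Option<PruningPredicate>,
}

impl Scan {
    /// The filter is kept regardless of whether pruning can be derived from it.
    pub fn new(predicate: Option<Filter>) -> Self {
        let pruning_predicate = predicate.as_ref().and_then(try_build_pruning);
        Scan { predicate, pruning_predicate }
    }
}

impl fmt::Display for Scan {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Scan:")?;
        // Label the statistics-level expression for what it is.
        if let Some(p) = &self.pruning_predicate {
            write!(f, " pruning_predicate={}", p.text)?;
        }
        Ok(())
    }
}
```

With this shape, a filter that has no min/max rewrite is still available to the scan for row-level filtering; it just contributes nothing to row group pruning.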

@alamb

alamb commented Nov 18, 2022

I plan to fix this in the next few days
