[ENH] Dynamic partition pruning improvements #1121

Open · 4 of 11 tasks
sarahyurick opened this issue Apr 21, 2023 · 0 comments
Labels: enhancement (New feature or request), needs triage (Awaiting triage by a dask-sql maintainer)

sarahyurick (Collaborator) commented Apr 21, 2023
#1102 adds dynamic partition pruning functionality. While working on it, I noticed several features, outside the original intended scope of the project, that could enhance this optimization rule. I think DPP could benefit from expanding to cover these cases in the future.
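For context, here is a minimal, self-contained sketch of the idea behind DPP in plain pandas (not the dask-sql implementation); the partition layout and column names are made up for illustration:

```python
# Illustrative sketch of dynamic partition pruning: collect the join-key values
# that survive the dimension-table filter, then read only the fact-table
# partitions that can contain those values.
import pandas as pd

dim = pd.DataFrame({"dim_id": [1, 2, 3], "region": ["US", "EU", "US"]})

# Pretend each Parquet file of the fact table is one partition keyed by dim_id.
fact_partitions = {
    1: pd.DataFrame({"dim_id": [1] * 4, "amount": [10, 20, 30, 40]}),
    2: pd.DataFrame({"dim_id": [2] * 4, "amount": [1, 2, 3, 4]}),
    3: pd.DataFrame({"dim_id": [3] * 4, "amount": [7, 8, 9, 10]}),
}

# Step 1: evaluate the selective filter on the (small) dimension table.
keys = set(dim.loc[dim["region"] == "US", "dim_id"])  # {1, 3}

# Step 2: prune -- only scan fact partitions whose key is in the surviving set,
# instead of reading every partition and filtering after the join.
pruned = [df for key, df in fact_partitions.items() if key in keys]
result = pd.concat(pruned).merge(dim[dim["region"] == "US"], on="dim_id")
print(result)
```

The win comes from step 2: partitions that cannot match the filtered dimension keys are never read at all.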

  • Currently, we only check a join's ON conditions, but we should also check and make use of join filters
  • Right now, we only use DPP for joins between 2 columns. However, it would also be possible to run DPP for joins between binary expressions, e.g. WHERE col1 + 10 = col2 + 20
  • In a similar vein, we should expand the get_filtered_fields function to be able to handle more complex binary expressions than it currently does
  • Allow the fact_dimension_ratio and possibly other parameters to be specified by the user
  • Handle the case where there is more than one scan of the same table
  • Modify the c.explain() function to cut off large strings of INLIST vectors
  • Currently, we can only use DPP with local Parquet files, and we assume a Parquet table is formatted as table_name/*.parquet. Ideally, we should add logic to handle remote files (i.e., checks to not apply DPP to remote files), folders of subfolders containing Parquet files (as in Hive partitioning), and other formats such as CSV (see the sketch after this list)
  • In the satisfies_int64 function, if we match a Utf8, we should add logic to check if the string can be converted to a timestamp.
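As a starting point for the remote-file and Hive-partitioning item, here is a rough sketch of how the path handling could look. It is only an illustration under the assumption that the current rule handles layouts like table_name/*.parquet; the helper names (`is_remote_path`, `list_parquet_files`) are hypothetical, not existing dask-sql functions:

```python
# Sketch: decide whether a table path is safe for DPP to read at optimization
# time, and discover Parquet files nested in Hive-style partition folders.
import glob
import os

from fsspec.utils import infer_storage_options


def is_remote_path(path: str) -> bool:
    """Return True for paths such as s3://... or gs://... where DPP should be skipped."""
    protocol = infer_storage_options(path).get("protocol", "file")
    return protocol not in ("file", "local")


def list_parquet_files(table_path: str) -> list:
    """Collect Parquet files, including Hive-style partition subdirectories."""
    if is_remote_path(table_path):
        # Reading remote data eagerly during optimization is expensive; bail out
        # and let the plan run without DPP.
        return []
    # "**" also matches zero directories, so this picks up both table_name/*.parquet
    # and partition folders like table_name/col=value/part.parquet.
    return sorted(glob.glob(os.path.join(table_path, "**", "*.parquet"), recursive=True))


if __name__ == "__main__":
    print(list_parquet_files("table_name"))        # local layout: files are listed
    print(list_parquet_files("s3://bucket/table")) # remote: empty list, DPP skipped
```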

In addition, we should add some DPP tests, including:

  • Rust functionality tests
  • DPP functionality PyTests (see the sketch after this list)
  • DPP config PyTests
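For the functionality PyTests, something along these lines could serve as a starting point. The plan assertion is deliberately loose, since the exact way the optimizer renders an IN-list filter in c.explain() output is not pinned down here:

```python
# Sketch of a DPP functionality test: a selective filter on a small dimension
# table should restrict the fact-table scan, and the query result should match.
import pandas as pd
import pytest

from dask_sql import Context


@pytest.fixture
def context():
    c = Context()
    fact = pd.DataFrame({"id": range(1000), "dim_id": [i % 10 for i in range(1000)]})
    dim = pd.DataFrame({"dim_id": [3], "name": ["only-this-one"]})
    c.create_table("fact", fact)
    c.create_table("dim", dim)
    return c


def test_dpp_filters_fact_scan(context):
    query = """
        SELECT f.id
        FROM fact f
        JOIN dim d ON f.dim_id = d.dim_id
        WHERE d.name = 'only-this-one'
    """
    # With DPP enabled, the fact-table scan in the plan should carry a filter on
    # dim_id; the exact marker text depends on how the plan is rendered.
    plan = context.explain(query)
    assert "dim_id" in plan

    result = context.sql(query).compute()
    assert set(result["id"]) == {i for i in range(1000) if i % 10 == 3}
```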