Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for filter pushdown rule #924

Merged
merged 21 commits into from
Dec 1, 2022

Conversation

ayushdg
Copy link
Collaborator

@ayushdg ayushdg commented Nov 16, 2022

This PR enables using the filter_pushdown_rule from datafusion and allows passing filters down to the tablescan operation.

@codecov-commenter
Copy link

codecov-commenter commented Nov 18, 2022

Codecov Report

Merging #924 (acdab19) into main (6ae69a8) will increase coverage by 0.22%.
The diff coverage is 100.00%.

❗ Current head acdab19 differs from pull request most recent head 47eee60. Consider uploading reports for the commit 47eee60 to get more accurate results

@@            Coverage Diff             @@
##             main     #924      +/-   ##
==========================================
+ Coverage   75.81%   76.04%   +0.22%     
==========================================
  Files          73       73              
  Lines        4065     4082      +17     
  Branches      737      739       +2     
==========================================
+ Hits         3082     3104      +22     
+ Misses        823      814       -9     
- Partials      160      164       +4     
Impacted Files Coverage Δ
dask_sql/physical/rel/logical/table_scan.py 88.63% <100.00%> (+7.15%) ⬆️
dask_sql/_version.py 35.31% <0.00%> (+1.41%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Comment on lines +160 to +165
[
[("a", "==", 1), ("b", "<", 10)],
[("a", "==", 1), ("b", ">", 5)],
[("b", ">", 5), ("b", "<", 10)],
[("a", "==", 1)],
],
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest version of datafusion applies some conversions that convert filters to a cnf like format. So the dnf here
(b > 5 AND b < 10) OR a = 1 gets remapped to
(b>5 or a=1) and (b<10 or a=1) by datafusion which becomes this is dnf
(b>5 and a=1) or (b<10 and a=1) or (b>5 and b<10) or (a=1) .

I don't know if there's an easy way to simplify these predicate expressions but some cases might lead to larger number of redundant filters because of this.

sarahyurick added a commit to sarahyurick/dask-sql that referenced this pull request Nov 30, 2022
@@ -11,7 +11,6 @@
14,
16,
18,
21,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into both these queries, for both it seems like they previously failed with metadata inference errors trying to do a comparison between a datetime64 and string object.

I'll open up another issue to investigate but in the meanwhile pushing those filters down to the io seems to fix the queries

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Digging into this before it seemed like the underlying issue was that we were at some point casting a column to Date32 and trying to compare to a Date32 scalar, but their dtypes didn't match up - not sure if that's helpful to make a simple reproducer for this

@ayushdg
Copy link
Collaborator Author

ayushdg commented Dec 1, 2022

rerun tests

@ayushdg ayushdg merged commit cf719c5 into dask-contrib:main Dec 1, 2022
@ayushdg ayushdg deleted the support-filter-pushdown-rule branch February 6, 2023 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants