-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support pushdown multi-columns in PageIndex pruning. #3967
Conversation
offset_indexes: &'a Vec<Vec<PageLocation>>, | ||
parquet_schema: &'a Schema, | ||
col_id: usize, | ||
col_page_indexes: &'a Index, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simplify the PagesPruningStatistics
no need pass all cols index!
// | ||
// returned: NNNNNNNNY | ||
// set `need_combine` true will combine result: Select(2) + Select(1) + Skip(2) -> Select(3) + Skip(2) | ||
pub(crate) fn intersect_row_selection( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will move this to arrow-rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed apache/arrow-rs#3003
Signed-off-by: yangjiang <[email protected]>
metrics.predicate_evaluation_errors.add(1); | ||
return Ok(vec![RowSelector::select(group.num_rows() as usize)]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb related to #4002 add one fallback, which select all rows.
got
❯ select count(*) from foo where container = 'backend_container_1';
[2022-10-29T15:51:31Z ERROR datafusion::physical_plan::file_format::parquet] Error evaluating page index predicate values Error during planning: Can not create statistics record batch: Invalid argument error: Column 'container_min' is declared as non-nullable but contains null values
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 15963 |
+-----------------+
1 row in set. Query took 0.035 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool! I hope to have #3976 ready for review later today -- with that additional testing coverage I will feel pretty good about this particular optimization 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: yangjiang <[email protected]>
Sorry @Ted-Jiang -- i will review this PR today |
@Ted-Jiang - I took the liberty of merging up from master and fixing a merge conflict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @Ted-Jiang 🦾 🏆 -- I think it can be merged as is.
I am impressed with the tests in this PR and I also verified that this optimization passes my parquet predicate torture integration test #4062 that found #4002
I have some follow up suggestions but I think we can do them as follow on PRs if you prefer
Other items to do:
- Add statistics to verify and report on the efficiency of page index pruning (I will file a follow on ticket)
- Add some more variations to the parquet predicate integration tests to ensure multiple pages are being used (will file a follow on ticket)
}), | ||
); | ||
} else { | ||
// fallback select all rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
} | ||
} | ||
_ => { | ||
if a.row_count < b.row_count { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be taking which one of a
or b
was skipped (rather than just the one that was smaller)? I tried messing around and trying to write some test that failed, but was not able to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think two element at least one is skip
, we can sure about skip the min len is correct🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
// create filter (year = 2009 and id = 1) or (year = 2010) | ||
// this filter use two columns will not push down | ||
// todo but after use CNF rewrite it could rewrite to (year = 2009 or year = 2010) and (id = 1 or year = 2010) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about the CNF rewrite comment here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we file a ticket do some tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what tests you have in mind here so I did not file a ticket -- perhaps you could add to one of the existing tickets (#4087 ?) or file a new one?
Thanks for the helping!😄
Agree! will try to familiar with the IT, support these✌️ |
Signed-off-by: yangjiang <[email protected]>
Signed-off-by: yangjiang <[email protected]>
Looks good -- thanks @Ted-Jiang -- I'll merge this in and file some follow on tickets for the remaining items |
I filed #4085 to track enabling this feature by default |
Filed #4086 to track adding statistics |
Filed #4087 for increasing test coverage |
Benchmark runs are scheduled for baseline = 065a478 and contender = 1a5f6ab. 1a5f6ab is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
* Support pushdown multi-columns in PageIndex pruning. Signed-off-by: yangjiang <[email protected]> * fix test Signed-off-by: yangjiang <[email protected]> * fix comments Signed-off-by: yangjiang <[email protected]> * avoid extract predicates when enable is false Signed-off-by: yangjiang <[email protected]> Signed-off-by: yangjiang <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: yangjiang [email protected]
Which issue does this PR close?
Closes #3834.
Closes #4002
Rationale for this change
What changes are included in this PR?
Thanks the original pic from @alamb ❤️
Finally do the intersection get selector(200~244)
Are there any user-facing changes?