-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] AST filtering in parquet reader #13348
[FEA] AST filtering in parquet reader #13348
Conversation
CompactProtocolFieldWriter c(*this); | ||
if (s.max.size() != 0) { c.field_binary(1, s.max); } | ||
if (s.min.size() != 0) { c.field_binary(2, s.min); } | ||
if (s.null_count != -1) { c.field_int(3, s.null_count); } | ||
if (s.distinct_count != -1) { c.field_int(4, s.distinct_count); } | ||
if (s.max_value.size() != 0) { c.field_binary(5, s.max_value); } | ||
if (s.min_value.size() != 0) { c.field_binary(6, s.min_value); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised to see a writer change here - was there something incorrect about the way we've been writing row group statistics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. The statistics are pre-encoded during page encoding itself and stored as binary blob. I changed the datatype from binary blob to Statistics
struct. So, this change is encoding that struct again.
- Add device column from Column chunk statistics in metadata - Add Filter AST to StatsAST transformer - Apply StatsAST on chunk statistics column in device - Filter the row groups, and give to base parquet reader
…uet_predicate_row_group
…uet_predicate_row_group
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No further requests from me. I'm ok once Bradley signs off on this.
Thanks Mike — I’ll review again tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments -- mostly minor changes but I definitely want to get stream conventions figured out before approving. I know that our I/O code might not be the best example of stream conventions today, but we need to get the entirety of libcudf moving in that direction. New APIs should accept and leverage streams correctly.
// Add empty columns if needed. | ||
return finalize_output(out_metadata, out_columns); | ||
// Add empty columns if needed. Filter output columns based on filter. | ||
return finalize_output(out_metadata, out_columns, filter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have much familiarity with the chunked reader. Do we need to raise an error, or something like that? Just trying to understand if additional work is needed here, or if this can be resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like all the requested changes in my previous review were addressed. I think this is close enough to approve, but we need to address the conversation about compatibility with the chunked reader in https://github.com/rapidsai/cudf/pull/13348/files#r1271118219 before merging.
/merge |
Description
The plan to support AST based filter predicate pushdown in parquet. This PR adds predicate pushdown on row group filtering.
The statistics of columns of each row group are loaded to a device column, and AST filter is applied on min, max of each column to select the row groups to read. The user given AST needs to be converted to another AST to be applied on min, max values of each column ('Statistics AST'). After the row groups are parsed, the user given AST is applied on the output columns to filter any remaining rows in the row groups.
New
column_name_reference
is introduced to help the users create AST's that reference columns by name, as the user may or may not have the column indices information before reading. Since AST engine takes only column index reference, a transformation is applied to the user given AST. So, 2 new AST transformation classes are introduced:named_to_reference_converter
- Converts column name references to column index referencesstats_expression_converter
- Converts the above output table filtering AST to 'Statistics AST'.Note: This column_name_reference only supported for predicate pushdown filtering, but not supported for other AST operations such as transform, joins etc.
Depends on #13472
Checklist