-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-39064: [C++][Parquet] Support row group filtering for nested paths for struct fields #39065
GH-39064: [C++][Parquet] Support row group filtering for nested paths for struct fields #39065
Conversation
… paths for struct fields
|
const Field& field, const parquet::Statistics& statistics) { | ||
auto field_expr = compute::field_ref(field.name()); | ||
const Field& field, const FieldRef& field_ref, const parquet::Statistics& statistics) { | ||
auto field_expr = compute::field_ref(field_ref); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if there is a better way, but the reason I am passing the FieldRef through the various function calls (through ColumnChunkStatisticsAsExpression to EvaluateStatisticsAsExpression), is that once we are here and we have the "schild" field, we don't know what the original full nested field was.
Neither the SchemaField passed to ColumnChunkStatisticsAsExpression
nor the Field (schema_field.field
) passed to EvaluateStatisticsAsExpression
currently have that information AFAIU.
cpp/src/arrow/dataset/file_parquet.h
Outdated
static std::optional<compute::Expression> EvaluateStatisticsAsExpression( | ||
const Field& field, const parquet::Statistics& statistics); | ||
const Field& field, const FieldRef& field_ref, const parquet::Statistics& statistics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is public API? So I can add a variant with the original signature that creates a FieldRef from the Field?
I'll take a quick look now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General LGTM
if (schema_field->field->type()->id() != Type::STRUCT) { | ||
return Status::Invalid("nested paths only supported for structs"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this limit user passing an filter on Map/List?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but we currently also don't support any predicate kernels for those data types at the moment AFAIK.
For example for a list column, you can't do something like "list_field > 1" because 1) such kernel isn't implemented, and 2) that actually also doesn't really make sense as a list scalar contains multiple values, so that doesn't evaluate to simple True/False, you need some kind of aggregation like "elementwise_all(list_field > 1)" (i.e. are "all" (or any) values in a list scalar larger than 1).
And even then simplifying such more complex expression based on the parquet statistics would also need to be implemented.
(I would like to see this work at some point, but that's certainly future work)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind add a test for the "List/Map" filter doesn't work in cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree list/map filter is so hard to filtering, which might need extra predicates. Let disable it now, but maybe we can test some more complex struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind add a test for the "List/Map" filter doesn't work in cpp
Filtering with a list or map field actually already fails in an earlier step, when binding the filter expression to the schema (and binding isn't done in FilterRowGroups
, it's expected to already be done, also in the test for this it is done up front in the test setup code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM on parquet side, but I think we need a more one familiar with dataset to review it
cc @bkietz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just one nit
@@ -897,16 +907,25 @@ Result<std::vector<compute::Expression>> ParquetFileFragment::TestRowGroups( | |||
ARROW_ASSIGN_OR_RAISE(auto match, ref.FindOneOrNone(*physical_schema_)); | |||
|
|||
if (match.empty()) continue; | |||
if (statistics_expressions_complete_[match[0]]) continue; | |||
statistics_expressions_complete_[match[0]] = true; | |||
const SchemaField* schema_field = &manifest_->schema_fields[match[0]]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is the same logic as FieldPath::Get, would you mind extracting it as a separate function? It would be nice to have a clear single entry point for future work on nested field references in parquet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean including the for loop below, right?
I think a method on the SchemaManifest might be a logical place to have this. It already has a GetColumnField
to return a SchemaField based on a single integer index. There could be a variant which accepts a FieldPath
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to merge this before the 15.0 branch cut-off, so going to merge as is, but will look into factoring it out as a helper function in a follow-up!
…uet-dataset-row-group-filtering-nested-path
033d6e6
to
fb10913
Compare
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit ffcfabd. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them. |
@github-actions crossbow submit wheel-macos-* |
Revision: fb10913 Submitted crossbow builds: ursacomputing/crossbow @ actions-e8de21cfb7 |
… paths for struct fields (apache#39065) ### Rationale for this change Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node. This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from. ### Are there any user-facing changes? No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data * Closes: apache#39064 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
… paths for struct fields (apache#39065) ### Rationale for this change Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node. This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from. ### Are there any user-facing changes? No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data * Closes: apache#39064 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
… paths for struct fields (apache#39065) ### Rationale for this change Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node. This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from. ### Are there any user-facing changes? No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data * Closes: apache#39064 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
Rationale for this change
Currently when filtering with a nested field reference, we were taking the corresponding parquet SchemaField for just the first index of the nested path, i.e. the parent node in the Parquet schema. But logically, filtering on statistics only works for a primitive leaf node.
This PR changes that logic to iterate over all indices of the FieldPath, if nested, to ensure we use the actual corresponding child leaf node of the ParquetSchema to get the statistics from.
Are there any user-facing changes?
No, only improving performance by doing the filtering at the row group stage, instead of afterwards on the read data