# GH-39064: [C++][Parquet] Support row group filtering for nested paths for struct fields (#39065)
Changes from all commits: 3312381, b1115bf, 56476cf, b87c829, 0d3da99, bd8f127, 4205837, fb10913
```diff
@@ -161,7 +161,8 @@ bool IsNan(const Scalar& value) {
 }
 
 std::optional<compute::Expression> ColumnChunkStatisticsAsExpression(
-    const SchemaField& schema_field, const parquet::RowGroupMetaData& metadata) {
+    const FieldRef& field_ref, const SchemaField& schema_field,
+    const parquet::RowGroupMetaData& metadata) {
   // For the remaining of this function, failure to extract/parse statistics
   // are ignored by returning nullptr. The goal is two fold. First
   // avoid an optimization which breaks the computation. Second, allow the
```
```diff
@@ -180,7 +181,8 @@ std::optional<compute::Expression> ColumnChunkStatisticsAsExpression(
     return std::nullopt;
   }
 
-  return ParquetFileFragment::EvaluateStatisticsAsExpression(*field, *statistics);
+  return ParquetFileFragment::EvaluateStatisticsAsExpression(*field, field_ref,
+                                                             *statistics);
 }
 
 void AddColumnIndices(const SchemaField& schema_field,
```
```diff
@@ -360,8 +362,9 @@ Result<bool> IsSupportedParquetFile(const ParquetFileFormat& format,
 }  // namespace
 
 std::optional<compute::Expression> ParquetFileFragment::EvaluateStatisticsAsExpression(
-    const Field& field, const parquet::Statistics& statistics) {
-  auto field_expr = compute::field_ref(field.name());
+    const Field& field, const FieldRef& field_ref,
+    const parquet::Statistics& statistics) {
+  auto field_expr = compute::field_ref(field_ref);
 
   // Optimize for corner case where all values are nulls
   if (statistics.num_values() == 0 && statistics.null_count() > 0) {
```
```diff
@@ -418,6 +421,13 @@ std::optional<compute::Expression> ParquetFileFragment::EvaluateStatisticsAsExpr
   return std::nullopt;
 }
 
+std::optional<compute::Expression> ParquetFileFragment::EvaluateStatisticsAsExpression(
+    const Field& field, const parquet::Statistics& statistics) {
+  const auto field_name = field.name();
+  return EvaluateStatisticsAsExpression(field, FieldRef(std::move(field_name)),
+                                        statistics);
+}
+
 ParquetFileFormat::ParquetFileFormat()
     : FileFormat(std::make_shared<ParquetFragmentScanOptions>()) {}
```
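The statistics machinery in the hunks above is what powers row group pruning: per-column min/max statistics are converted into a guarantee expression and tested against the filter. As a rough illustration of the underlying idea only (a hypothetical, simplified sketch; `Stats`, `GreaterThan`, and `MayContainMatches` are stand-ins, not Arrow APIs):

```cpp
// Hypothetical simplified sketch: how per-column min/max statistics let a
// reader decide whether a row group can be skipped for a filter like "x > 10".
#include <cassert>
#include <string>

// Stand-in for parquet column chunk statistics.
struct Stats {
  bool has_min_max;
  long min;
  long max;
};

// Stand-in for a bound filter expression "column > value".
struct GreaterThan {
  std::string column;
  long value;
};

// A row group may be skipped only when the statistics *prove* no row can
// match: for "column > value", that is when max <= value.
bool MayContainMatches(const Stats& stats, const GreaterThan& filter) {
  if (!stats.has_min_max) return true;  // no statistics: must scan
  return stats.max > filter.value;
}
```

The real implementation builds a `compute::Expression` (e.g. `field >= min && field <= max`) instead of evaluating directly, so arbitrary filters can be simplified against it.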
```diff
@@ -810,7 +820,7 @@ Status ParquetFileFragment::SetMetadata(
   manifest_ = std::move(manifest);
 
   statistics_expressions_.resize(row_groups_->size(), compute::literal(true));
-  statistics_expressions_complete_.resize(physical_schema_->num_fields(), false);
+  statistics_expressions_complete_.resize(manifest_->descr->num_columns(), false);
 
   for (int row_group : *row_groups_) {
     // Ensure RowGroups are indexing valid RowGroups before augmenting.
```
```diff
@@ -900,16 +910,25 @@ Result<std::vector<compute::Expression>> ParquetFileFragment::TestRowGroups(
     ARROW_ASSIGN_OR_RAISE(auto match, ref.FindOneOrNone(*physical_schema_));
 
     if (match.empty()) continue;
-    if (statistics_expressions_complete_[match[0]]) continue;
-    statistics_expressions_complete_[match[0]] = true;
+    const SchemaField* schema_field = &manifest_->schema_fields[match[0]];
+
+    for (size_t i = 1; i < match.indices().size(); ++i) {
+      if (schema_field->field->type()->id() != Type::STRUCT) {
+        return Status::Invalid("nested paths only supported for structs");
+      }
+      schema_field = &schema_field->children[match[i]];
+    }
+
+    if (!schema_field->is_leaf()) continue;
+    if (statistics_expressions_complete_[schema_field->column_index]) continue;
+    statistics_expressions_complete_[schema_field->column_index] = true;
 
-    const SchemaField& schema_field = manifest_->schema_fields[match[0]];
     int i = 0;
     for (int row_group : *row_groups_) {
       auto row_group_metadata = metadata_->RowGroup(row_group);
 
-      if (auto minmax =
-              ColumnChunkStatisticsAsExpression(schema_field, *row_group_metadata)) {
+      if (auto minmax = ColumnChunkStatisticsAsExpression(ref, *schema_field,
+                                                          *row_group_metadata)) {
         FoldingAnd(&statistics_expressions_[i], std::move(*minmax));
         ARROW_ASSIGN_OR_RAISE(statistics_expressions_[i],
                               statistics_expressions_[i].Bind(*physical_schema_));
```

Review thread on lines +916 to +918 (the `Status::Invalid` check):

> So does this limit users passing a filter on Map/List?

> Yes, but we currently also don't support any predicate kernels for those data types AFAIK. For example, for a list column you can't do something like `list_field > 1`, because (1) such a kernel isn't implemented, and (2) it doesn't really make sense: a list scalar contains multiple values, so the comparison doesn't evaluate to a simple true/false. You would need some kind of aggregation, like `elementwise_all(list_field > 1)` (i.e. are *all* (or *any*) values in the list scalar larger than 1). I would like to see this work at some point, but that's certainly future work.

> Would you mind adding a C++ test showing that a List/Map filter doesn't work?

> I agree that list/map filtering is hard and might need extra predicates. Let's disable it for now, but maybe we can test some more complex structs?

> Filtering with a list or map field actually already fails in an earlier step, when binding the filter expression to the schema (and binding isn't done in …)
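The review discussion notes that a plain comparison on a list column is ill-defined without an explicit aggregation such as the mentioned `elementwise_all`. A minimal standalone sketch of those semantics (hypothetical helper functions, not Arrow compute kernels):

```cpp
// Hypothetical sketch: a list scalar holds many values, so "list_field > 1"
// only becomes a boolean once an aggregation is chosen ("all" vs "any").
#include <algorithm>
#include <cassert>
#include <vector>

// "all values in the list are greater than threshold"
bool ElementwiseAllGreater(const std::vector<int>& list_value, int threshold) {
  return std::all_of(list_value.begin(), list_value.end(),
                     [&](int v) { return v > threshold; });
}

// "at least one value in the list is greater than threshold"
bool ElementwiseAnyGreater(const std::vector<int>& list_value, int threshold) {
  return std::any_of(list_value.begin(), list_value.end(),
                     [&](int v) { return v > threshold; });
}
```

The same ambiguity applies to min/max statistics for list columns, which is one reason the PR restricts nested pruning to struct paths.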
Review thread on the path-traversal loop:

> Since this is the same logic as `FieldPath::Get`, would you mind extracting it as a separate function? It would be nice to have a clear single entry point for future work on nested field references in Parquet.

> You mean including the for loop below, right? I think a method on the `SchemaManifest` might be a logical place for this. It already has a `GetColumnField` that returns a `SchemaField` based on a single integer index; there could be a variant that accepts a `FieldPath`.

> I would like to merge this before the 15.0 branch cut-off, so I'm going to merge as is, but I will look into factoring it out as a helper function in a follow-up!
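The traversal the reviewers suggest factoring out can be sketched in isolation. Below is a hypothetical, simplified stand-in (`Node` and `ResolveLeafColumn` are illustrative only, not Arrow's `SchemaField`/`FieldPath`), showing the shape such a helper could take: follow a path of child indices down a field tree, refusing to descend through anything that is not a struct, and return the leaf's column index.

```cpp
// Hypothetical simplified sketch of the nested-path traversal added in
// TestRowGroups: walk child indices down a schema tree of struct fields.
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

enum class Kind { kStruct, kLeaf };

struct Node {
  Kind kind;
  int column_index;            // meaningful only for leaves
  std::vector<Node> children;  // non-empty only for structs
};

// `root` corresponds to path[0] (already resolved, as in the PR, where
// match[0] selects the top-level field), so the walk starts at index 1.
int ResolveLeafColumn(const Node& root, const std::vector<int>& path) {
  const Node* node = &root;
  for (std::size_t i = 1; i < path.size(); ++i) {
    if (node->kind != Kind::kStruct) {
      // Mirrors the PR's Status::Invalid("nested paths only supported for structs")
      throw std::invalid_argument("nested paths only supported for structs");
    }
    node = &node->children[path[i]];
  }
  if (node->kind != Kind::kLeaf) {
    throw std::invalid_argument("path does not end at a leaf");
  }
  return node->column_index;
}
```

A `SchemaManifest` method accepting a `FieldPath`, as suggested above, would play exactly this role while reusing Arrow's real types.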