-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43944: [C++][Parquet] Add support for arrow::ArrayStatistics: non zero-copy int based types #43945
Conversation
…: non zero-copy int based types Target types: * `UInt8` * `Int8` * `UInt16` * `Int16` * `UInt32` * `UInt64` * `Date32` * `Time32` * `Time64` * `Duration`
|
array_statistics->is_min_exact = true; | ||
array_statistics->is_max_exact = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add correspond comment here? This might be a bit tricky
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. We should document about the discussion at #43595 (comment) , right?
BTW, could you share the e-mail URL for #43595 (comment) ?
I guess no, I'll send a mail to maillist to make it sure
I couldn't find it at https://lists.apache.org/[email protected] .
Ah, I forgot to add a writer check here. I should have set true
only when a writer is Apache Parquet C++. I'll fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, let me setup a discussion, generally if it's from Parquet C++, it will works. I'm a bit busy this morning preparing for my tour, I'll try to work it out this noon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a comment that we can always use true
for integer based min/max.
I didn't need if (::arrow::internal::StartsWith(ctx->reader->metadata()->created_by(), "parquet-cpp-arrow"))
for this case based on your e-mail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, I found the string and FLBA might being truncated, other types in public impl will not being truncated if exists
auto array_data = | ||
::arrow::ArrayData::Make(field->type(), length, std::move(buffers), null_count); | ||
auto array_statistics = std::make_shared<::arrow::ArrayStatistics>(); | ||
array_statistics->null_count = null_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The null_count for some type ( nested ) would be a bit weird, FYI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the information.
Let's revisit it when we add support for arrow::ArrayStatistics
of nested types.
array_statistics->null_count = null_count; | ||
auto statistics = metadata->statistics().get(); | ||
if (statistics) { | ||
if (statistics->HasDistinctCount()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a separate function for the stats conversion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do it when I add more target types as the next pull request.
I'll know what is common pattern when I add more target types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 262d6f6. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…: non zero-copy int based types (apache#43945) ### Rationale for this change Statistics is useful for fast processing. Target types: * `UInt8` * `Int8` * `UInt16` * `Int16` * `UInt32` * `UInt64` * `Date32` * `Time32` * `Time64` * `Duration` ### What changes are included in this PR? Map `ColumnChunkMetaData` information to `arrow::ArrayStatistics`. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * GitHub Issue: apache#43944 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…: non zero-copy int based types (apache#43945) ### Rationale for this change Statistics is useful for fast processing. Target types: * `UInt8` * `Int8` * `UInt16` * `Int16` * `UInt32` * `UInt64` * `Date32` * `Time32` * `Time64` * `Duration` ### What changes are included in this PR? Map `ColumnChunkMetaData` information to `arrow::ArrayStatistics`. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * GitHub Issue: apache#43944 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
Rationale for this change
Statistics is useful for fast processing.
Target types:
UInt8
Int8
UInt16
Int16
UInt32
UInt64
Date32
Time32
Time64
Duration
What changes are included in this PR?
Map
ColumnChunkMetaData
information toarrow::ArrayStatistics
.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.