GH-43944: [C++][Parquet] Add support for arrow::ArrayStatistics: non zero-copy int based types #43945

Merged (3 commits) on Sep 5, 2024

@kou kou commented Sep 4, 2024

Rationale for this change

Statistics are useful for fast processing.

Target types:

  • UInt8
  • Int8
  • UInt16
  • Int16
  • UInt32
  • UInt64
  • Date32
  • Time32
  • Time64
  • Duration

What changes are included in this PR?

Map ColumnChunkMetaData information to arrow::ArrayStatistics.
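
As a rough illustration of this mapping, the sketch below copies chunk-level statistics into an array-level statistics struct. Both `ChunkStatsSketch` and `ArrayStatsSketch` are hypothetical stand-ins written for this example, not the real Parquet or Arrow classes. Only `null_count`, `is_min_exact`, and `is_max_exact` appear in the review snippets on this page; the remaining members are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the chunk-level statistics a Parquet reader
// exposes. This is NOT the real parquet::Statistics API.
struct ChunkStatsSketch {
  int64_t null_count = 0;
  std::optional<int64_t> distinct_count;  // absent when the writer omitted it
  std::optional<int64_t> min;             // absent when no min/max was written
  std::optional<int64_t> max;
};

// Hypothetical stand-in for arrow::ArrayStatistics. Members other than
// null_count, is_min_exact and is_max_exact are assumptions of this sketch.
struct ArrayStatsSketch {
  std::optional<int64_t> null_count;
  std::optional<int64_t> distinct_count;
  std::optional<int64_t> min;
  bool is_min_exact = false;
  std::optional<int64_t> max;
  bool is_max_exact = false;
};

// Copies whatever the chunk statistics provide; missing values stay unset.
ArrayStatsSketch ToArrayStats(const ChunkStatsSketch& chunk) {
  assert(chunk.null_count >= 0);  // a negative count would mean corrupt input
  ArrayStatsSketch out;
  out.null_count = chunk.null_count;
  out.distinct_count = chunk.distinct_count;
  if (chunk.min.has_value() && chunk.max.has_value()) {
    out.min = chunk.min;
    out.max = chunk.max;
    // Integer-based min/max are treated as exact, following the review
    // discussion in this PR.
    out.is_min_exact = true;
    out.is_max_exact = true;
  }
  return out;
}
```

When the chunk carries no min/max, the exact flags stay `false` and the optional fields stay empty, so a consumer can tell "unknown" apart from a real value.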

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@kou kou requested a review from wgtmac as a code owner September 4, 2024 08:36
github-actions bot commented Sep 4, 2024

⚠️ GitHub issue #43944 has been automatically assigned in GitHub to PR creator.

Comment on lines +361 to +362
array_statistics->is_min_exact = true;
array_statistics->is_max_exact = true;
Member

Can you add a corresponding comment here? This might be a bit tricky.

Member Author

Sure. We should document the discussion at #43595 (comment), right?

BTW, could you share the e-mail URL for #43595 (comment)?

> I guess no, I'll send a mail to maillist to make it sure

I couldn't find it at https://lists.apache.org/[email protected] .

Ah, I forgot to add a writer check here. I should have set these to true only when the writer is Apache Parquet C++. I'll fix it.

@mapleFU mapleFU Sep 5, 2024

Sorry, let me set up a discussion. Generally, if the file is from Parquet C++, it will work. I'm a bit busy this morning preparing for my tour; I'll try to work it out around noon.

Member Author

No problem. Thanks!

Member Author

Thanks!

Member Author

I've added a comment explaining that we can always use true for integer-based min/max.

Based on your e-mail, I didn't need the `if (::arrow::internal::StartsWith(ctx->reader->metadata()->created_by(), "parquet-cpp-arrow"))` check for this case.
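
For reference, the writer check that turned out to be unnecessary amounts to a prefix test on the file's `created_by` string. A minimal sketch, assuming a hand-written `StartsWithSketch` helper in place of `::arrow::internal::StartsWith` and a hypothetical `WrittenByParquetCppArrow` wrapper (neither name exists in Arrow):

```cpp
#include <string_view>

// Hypothetical helper that imitates a prefix check; it is not the Arrow
// implementation of StartsWith.
constexpr bool StartsWithSketch(std::string_view text, std::string_view prefix) {
  return text.size() >= prefix.size() &&
         text.substr(0, prefix.size()) == prefix;
}

// Hypothetical wrapper: true when the created_by string names the
// Parquet C++ writer discussed in this thread.
constexpr bool WrittenByParquetCppArrow(std::string_view created_by) {
  return StartsWithSketch(created_by, "parquet-cpp-arrow");
}
```

With such a check, the exact flags would be set only for files produced by Parquet C++. The thread concluded that integer-based min/max can be marked exact regardless of the writer, so the check was left out.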

Member

FYI, I found that string and FLBA statistics might be truncated; for other types, the public implementations do not truncate the statistics when they exist.
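
The distinction in this comment could be expressed as a small predicate on the Parquet physical type. This is only a sketch: `PhysicalTypeSketch` and `MinMaxMayBeTruncated` are hypothetical names made up for this example, and the rule encodes nothing more than the observation above (string values, which Parquet stores as BYTE_ARRAY, and FLBA, i.e. FIXED_LEN_BYTE_ARRAY, may be truncated; other types are not).

```cpp
// Hypothetical enum listing Parquet physical types; it does not mirror the
// real parquet::Type enum values.
enum class PhysicalTypeSketch {
  kBoolean,
  kInt32,
  kInt64,
  kInt96,
  kFloat,
  kDouble,
  kByteArray,          // strings are stored with this physical type
  kFixedLenByteArray,  // FLBA
};

// Returns true when a writer might have truncated min/max for this type,
// in which case the values should not be reported as exact.
constexpr bool MinMaxMayBeTruncated(PhysicalTypeSketch type) {
  switch (type) {
    case PhysicalTypeSketch::kByteArray:
    case PhysicalTypeSketch::kFixedLenByteArray:
      return true;
    default:
      return false;
  }
}
```

A reader could consult such a predicate before setting `is_min_exact` and `is_max_exact`; the integer-based types covered by this PR all fall in the "not truncated" branch.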

auto array_data =
    ::arrow::ArrayData::Make(field->type(), length, std::move(buffers), null_count);
auto array_statistics = std::make_shared<::arrow::ArrayStatistics>();
array_statistics->null_count = null_count;
Member

FYI, the null_count for some types (nested ones) would be a bit weird.

Member Author

Thanks for the information.
Let's revisit it when we add support for arrow::ArrayStatistics of nested types.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 5, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 5, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Sep 5, 2024
array_statistics->null_count = null_count;
auto statistics = metadata->statistics().get();
if (statistics) {
  if (statistics->HasDistinctCount()) {
Member

Do we need a separate function for the stats conversion?

Member Author

I'll do it in the next pull request, when I add more target types.
I'll know what the common pattern is once I add them.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Sep 5, 2024
@wgtmac wgtmac left a comment

+1

@kou kou merged commit 262d6f6 into apache:main Sep 5, 2024
38 of 39 checks passed
@kou kou deleted the cpp-parquet-statistics branch September 5, 2024 20:41
@kou kou removed the awaiting changes Awaiting changes label Sep 5, 2024
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 262d6f6.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Sep 6, 2024
…: non zero-copy int based types (apache#43945)

### Rationale for this change

Statistics is useful for fast processing.

Target types:

* `UInt8`
* `Int8`
* `UInt16`
* `Int16`
* `UInt32`
* `UInt64`
* `Date32`
* `Time32`
* `Time64`
* `Duration`

### What changes are included in this PR?

Map `ColumnChunkMetaData` information to `arrow::ArrayStatistics`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: apache#43944

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024