Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read FIXED_LEN_BYTE_ARRAY as binary in parquet reader #13437

Merged
merged 34 commits into from
Aug 24, 2023

Conversation

PointKernel
Copy link
Member

@PointKernel PointKernel commented May 24, 2023

Description

Closes #12590

This PR adds support of reading FIXED_LEN_BYTE_ARRAY as lists of INT8 in the parquet reader.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 24, 2023
@PointKernel PointKernel self-assigned this May 24, 2023
@PointKernel PointKernel added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change labels May 24, 2023
@etseidl
Copy link
Contributor

etseidl commented May 24, 2023

Can this be deferred until #13302 is finished (or rejected >.<)? I think the fixed width byte arrays can be processed more efficiently since they'll behave like any other fixed-width type (sizing can be known up front).

@PointKernel
Copy link
Member Author

Can this be deferred until #13302 is finished (or rejected >.<)? I think the fixed width byte arrays can be processed more efficiently since they'll behave like any other fixed-width type (sizing can be known up front).

@etseidl sure, this is still a draft PR and I'm putting it up just for testing.

@revans2
Copy link
Contributor

revans2 commented May 25, 2023

I ran some tests with this and it looks good to me. Note that I just did some very simple tests though.

@PointKernel
Copy link
Member Author

PointKernel commented May 25, 2023

I ran some tests with this and it looks good to me. Note that I just did some very simple tests though.

Thanks for testing. I have some improvements to apply once #13302 is merged. In the meanwhile, @revans2 please let me know if you find any further issues with the current work.

Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this is working out. Nice cleanups along the way as well!

@@ -603,7 +611,7 @@ inline __device__ void gpuOutputString(volatile page_state_s* s,
void* dstv)
{
auto [ptr, len] = gpuGetStringData(s, sb, src_pos);
if (s->dtype_len == 4) {
if (s->dtype_len == 4 and (s->col.data_type & 7) == BYTE_ARRAY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this really be a logical and? It seems in the past we would take this path if the data size was 4 bytes, but now we only do it if the data size if 4 bytes AND it is a byte array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this PR, only BYTE_ARRAY invokes gpuOutputString and the input length cannot be 4 in that case. Now with function being potentially invoked by FIXED_LEN_BYTE_ARRAY where the length could be 4, this and logic is needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a comment along the lines of "// make sure to only hash variable length byte arrays when specified with the output type size"?

@@ -2027,11 +2035,14 @@ __global__ void __launch_bounds__(decode_block_size) gpuDecodePageData(
return;
}

auto const data_type = s->col.data_type & 7;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked seeing this change above to use this type, but why not move this above and use this variable all around?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data_type is not used that often in other functions. Let me have a further look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO ColumnChunkDesc should have data_type() and type_len() so the bitwise packing madness is hidden from the rest of the code.
data_type & 7 appears 11 times in the code :\

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use parquet::Type directly instead of uint16_t?

uint16_t data_type; // basic column data type, ((type_length << 3) |
// parquet::Type)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(for some reason) data_type holds the type in the last three bits and length in the rest of the number. Using Type here would make this data member even more misleading.

cpp/src/io/parquet/page_data.cu Outdated Show resolved Hide resolved
@@ -603,7 +611,7 @@ inline __device__ void gpuOutputString(volatile page_state_s* s,
void* dstv)
{
auto [ptr, len] = gpuGetStringData(s, sb, src_pos);
if (s->dtype_len == 4) {
if (s->dtype_len == 4 and (s->col.data_type & 7) == BYTE_ARRAY) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this PR, only BYTE_ARRAY invokes gpuOutputString and the input length cannot be 4 in that case. Now with function being potentially invoked by FIXED_LEN_BYTE_ARRAY where the length could be 4, this and logic is needed.

cpp/src/io/parquet/page_data.cu Outdated Show resolved Hide resolved
@etseidl
Copy link
Contributor

etseidl commented Jul 5, 2023

So will this take FIXED_LEN_BYTE_ARRAY data in and produce a string column on output? Would it be better to produce a fixed length UINT8 array? I'm a little bit worried about getting the column statistics right.

@PointKernel
Copy link
Member Author

PointKernel commented Jul 5, 2023

So will this take FIXED_LEN_BYTE_ARRAY data in and produce a string column on output? Would it be better to produce a fixed length UINT8 array? I'm a little bit worried about getting the column statistics right.

Right, similar to BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY will be handled as string columns by default.

Would it be better to produce a fixed length UINT8 array?

This work is based on the assumption that users may switch between string and binary as the eventual output of FIXED_LEN_BYTE_ARRAY. Are you suggesting producing binary as default and converting back to strings when requested?

I'm a little bit worried about getting the column statistics right.

Valid concern, I didn't think about this.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small change, but tricky! Cool stuff 👍
Partial review, will continue tomorrow.

cpp/src/io/parquet/page_decode.cuh Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_decode.cuh Show resolved Hide resolved
@@ -41,7 +41,7 @@ inline __device__ void gpuOutputString(volatile page_state_s* s,
void* dstv)
{
auto [ptr, len] = gpuGetStringData(s, sb, src_pos);
if (s->dtype_len == 4) {
if (s->dtype_len == 4 and (s->col.data_type & 7) == BYTE_ARRAY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really related to this PR, but we have this dtype_len == 4 condition as a stand-in for "output hash" in a few places and it really requires detailed knowledge of the code to understand. Nothing actionable, I just don't like it :D

cpp/src/io/parquet/parquet_gpu.hpp Outdated Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just a small suggestion

cpp/src/io/parquet/reader_impl.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reran my tests and they still all pass

@vuule
Copy link
Contributor

vuule commented Aug 24, 2023

/merge

@rapids-bot rapids-bot bot merged commit 6095a92 into rapidsai:branch-23.10 Aug 24, 2023
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuDF (Python) Reviewer labels Feb 23, 2024
rapids-bot bot pushed a commit that referenced this pull request May 8, 2024
#13437 added the ability to consume FIXED_LEN_BYTE_ARRAY encoded data and represent it as lists of `UINT8`. When trying to write this data back to Parquet there are two problems. 1) the notion of fixed length is lost, and 2) the `UINT8` data is written as a list of `INT32` which can quadruple the storage required. This PR addresses both issues by adding fields to the input and output metadata to allow for preserving the form of the original data.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #15600
@PointKernel PointKernel deleted the parquet-fixed-len-binary branch May 23, 2024 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support reading FIXED_LEN_BYTE_ARRAY as binary in read_parquet
8 participants