Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally write version 2 page headers in Parquet writer #13751

Merged
merged 36 commits into from
Aug 15, 2023

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Jul 25, 2023

Description

Part of #13501. This adds the ability to write V2 page headers to the Parquet writer.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner July 25, 2023 20:13
@etseidl etseidl requested review from mythrocks and vuule July 25, 2023 20:13
@rapids-bot
Copy link

rapids-bot bot commented Jul 25, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write permissions or greater before CI can begin.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 25, 2023
@vuule vuule added feature request New feature or request non-breaking Non-breaking change cuIO cuIO issue labels Jul 25, 2023
@vuule
Copy link
Contributor

vuule commented Jul 25, 2023

/ok to test

@vuule
Copy link
Contributor

vuule commented Jul 25, 2023

/ok to test

Comment on lines 256 to +259
ParquetFieldInt32(3, p->compressed_page_size),
ParquetFieldStruct(5, p->data_page_header),
ParquetFieldStruct(7, p->dictionary_page_header));
ParquetFieldStruct(7, p->dictionary_page_header),
ParquetFieldStruct(8, p->data_page_header_v2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is number 6?

Copy link
Contributor Author

@etseidl etseidl Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 would be the un-implemented IndexPageHeader. It's a placeholder type in the Parquet spec.

And before you ask, 4 is an optional CRC that no one adds either. 😉

@vuule
Copy link
Contributor

vuule commented Aug 4, 2023

/ok to test

@vuule
Copy link
Contributor

vuule commented Aug 7, 2023

/ok to test

@@ -275,6 +276,18 @@ bool CompactProtocolReader::read(DictionaryPageHeader* d)
return function_builder(this, op);
}

bool CompactProtocolReader::read(DataPageHeaderV2* d)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come this wasn't needed in #11778?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the header decoder is device code, so there's a duplicate thrift decoder buried in page_hdr.cu We've got little bits of thrift decoding sprinkled all over the place :(

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just one minor suggestion.

comp_in[blockIdx.x] = {base, actual_data_size};
comp_out[blockIdx.x] = {s->page.compressed_data + s->page.max_hdr_size, 0}; // size is unused
// V2 does not compress rep and def level data
size_t const offset = s->page.def_lvl_bytes + s->page.rep_lvl_bytes;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me a bit to connect the dots in this block. Maybe a rename?

Suggested change
size_t const offset = s->page.def_lvl_bytes + s->page.rep_lvl_bytes;
size_t const skip_comp_size = s->page.def_lvl_bytes + s->page.rep_lvl_bytes;

Copy link
Contributor Author

@etseidl etseidl Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good. Oh, the memcpy just below...I'm thinking that should be parallelized, but keep forgetting. Thoughts? The level data can get pretty large (100s of bytes?).

Or I can just address this in the next PR.

@ttnghia
Copy link
Contributor

ttnghia commented Aug 11, 2023

/ok to test

@vuule
Copy link
Contributor

vuule commented Aug 15, 2023

/ok to test

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Aug 15, 2023
@vuule
Copy link
Contributor

vuule commented Aug 15, 2023

/merge

@rapids-bot rapids-bot bot merged commit da2560f into rapidsai:branch-23.10 Aug 15, 2023
@etseidl etseidl deleted the feature/write_v2_header branch August 15, 2023 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants