Adding support for list<int8> columns to be written as byte arrays in parquet #11328

Merged

hyperbolic2346
Contributor

This is the last major feature in the byte array changes for Parquet. This PR lets columns of type list<int8> (lists of bytes) be written as byte arrays in Parquet files, which is a more efficient storage mechanism than what was used before.
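To make the layout difference concrete, here is a minimal pure-Python sketch, not the libcudf implementation. It assumes a list<int8> column is held as an offsets array plus a flat child array of bytes, and it emits each row the way Parquet's PLAIN encoding stores a BYTE_ARRAY value: a 4-byte little-endian length followed by the raw bytes. The function name `encode_byte_array_plain` and the sample data are made up for illustration, and nulls are ignored.

```python
import struct


def encode_byte_array_plain(offsets, child):
    """Encode list<int8> rows as Parquet PLAIN BYTE_ARRAY values.

    offsets: row boundaries into `child` (len = number of rows + 1)
    child:   flat sequence of int8 values (-128..127)
    Hypothetical helper for illustration only; nulls are not handled.
    """
    out = bytearray()
    for row in range(len(offsets) - 1):
        start, end = offsets[row], offsets[row + 1]
        values = child[start:end]
        # One 4-byte little-endian length per row ...
        out += struct.pack("<I", len(values))
        # ... followed by the row's bytes (int8 reinterpreted as uint8).
        out += bytes(v & 0xFF for v in values)
    return bytes(out)


# Three rows: [1, 2, 3], [], [-1, 127]
offsets = [0, 3, 3, 5]
child = [1, 2, 3, -1, 127]
encoded = encode_byte_array_plain(offsets, child)
print(encoded.hex(" "))
```

Written this way, each row costs one length prefix plus its payload, whereas a list column in Parquet carries repetition and definition levels for every element.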

Limitations:

  • Only top-level lists are currently considered for writing. Supporting nested lists of bytes needs further changes, including Dremel changes, which are not in this PR. This isn't a must-have yet, but it is desired.
  • No dictionary support for lists of bytes. Dictionaries are supported for string columns, so the current workaround is to change the column type to string before saving and use the option to write strings as byte arrays. Supporting dictionaries for lists of bytes directly will require more work with the murmur hash.

This is based on top of #11160 and should not be merged until that PR is. Once that merges, the diff here will shrink a good deal.

@hyperbolic2346 hyperbolic2346 added labels Jul 21, 2022:
  • feature request: New feature or request
  • 3 - Ready for Review: Ready for review by team
  • libcudf: Affects libcudf (C++/CUDA) code.
  • 5 - DO NOT MERGE: Hold off on merging; see PR for details
  • non-breaking: Non-breaking change
@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner July 21, 2022 23:39
@hyperbolic2346 hyperbolic2346 self-assigned this Jul 21, 2022
@hyperbolic2346 hyperbolic2346 requested review from a team as code owners July 21, 2022 23:39
hyperbolic2346 and others added 2 commits July 28, 2022 17:54
nice fine, thanks Vukasin

Co-authored-by: Vukasin Milovanovic <[email protected]>
…e during review changes, but I'm not 100% sure yet. Still testing and bisecting.
Contributor

@mythrocks mythrocks left a comment

Some very minor, optional suggestions for changes.
Apologies for taking so long with this.

@github-actions github-actions bot removed Python Affects Python cuDF API. Java Affects Java cuDF API. labels Jul 29, 2022
Contributor

@vuule vuule left a comment

💯

@hyperbolic2346 hyperbolic2346 removed request for a team, mroeschke and galipremsagar July 29, 2022 05:57
@hyperbolic2346
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 03f1c1c into rapidsai:branch-22.08 Jul 29, 2022
@hyperbolic2346 hyperbolic2346 deleted the mwilson/parquet_list_write branch July 29, 2022 08:04
Labels
  • 3 - Ready for Review: Ready for review by team
  • feature request: New feature or request
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change