Add support for list type in ORC writer #8723

Merged: 96 commits into rapidsai:branch-21.08 from fea-orc-write-list, Jul 21, 2021

@vuule vuule commented Jul 13, 2021

Fixes #7640

Adds support for list columns to the ORC writer, including nested lists.
Adds Python tests for the new type.
Modifies a lot of host-side logic in the writer, because rowgroup sizes are not constant now. Rowgroup sizes are now precomputed for all columns.
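
To illustrate why rowgroup sizes stop being constant once list columns are involved, here is a minimal host-side sketch. It is not the code from this PR: `rowgroup_range`, `parent_rowgroups`, and `child_rowgroups` are hypothetical names, and the sketch assumes the usual list layout in which a child column's rows are delimited by the parent's offsets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Half-open range of rows covered by one rowgroup (hypothetical type).
struct rowgroup_range {
  std::int32_t begin;
  std::int32_t end;
  std::int32_t size() const { return end - begin; }
};

// Hypothetical helper: split the top-level rows into fixed-size rowgroups.
// Only the last rowgroup may be smaller than `rowgroup_size`.
std::vector<rowgroup_range> parent_rowgroups(std::int32_t num_rows, std::int32_t rowgroup_size)
{
  assert(rowgroup_size > 0);
  std::vector<rowgroup_range> ranges;
  for (std::int32_t begin = 0; begin < num_rows; begin += rowgroup_size) {
    ranges.push_back({begin, std::min(begin + rowgroup_size, num_rows)});
  }
  return ranges;
}

// Hypothetical helper: derive the child column's rowgroups from the parent's.
// `offsets` has one more element than the parent has rows; the child rows of
// parent rows [begin, end) are [offsets[begin], offsets[end]).
std::vector<rowgroup_range> child_rowgroups(std::vector<rowgroup_range> const& parents,
                                            std::vector<std::int32_t> const& offsets)
{
  std::vector<rowgroup_range> ranges;
  ranges.reserve(parents.size());
  for (auto const& p : parents) {
    assert(static_cast<std::size_t>(p.end) < offsets.size());
    ranges.push_back({offsets[p.begin], offsets[p.end]});
  }
  return ranges;
}
```

With offsets `{0, 2, 2, 5, 9, 10}` (five lists) and a parent rowgroup size of 2, the parent rowgroups hold 2, 2, and 1 rows, while the child rowgroups hold 2, 7, and 1 rows. The child sizes depend on the data, which is why they need to be computed up front for every column rather than assumed.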

Performance impact: ~5% improvement for existing types (except floating point - no changes there) 🎉

vuule added 30 commits May 24, 2021 15:48

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

…fea-orc-write-list


@devavret devavret left a comment

Still reviewing.

cpp/src/io/orc/writer_impl.cu (outdated, resolved)
cpp/src/io/orc/stripe_enc.cu (outdated, resolved)

@galipremsagar galipremsagar left a comment

Python Approval

@galipremsagar galipremsagar added the "4 - Needs cuIO Reviewer" and "4 - Needs Review" (Waiting for reviewer to review or respond) labels Jul 20, 2021

@devavret devavret left a comment

Sorry but still reviewing.

cpp/src/io/orc/writer_impl.cu (outdated, resolved)
cpp/src/io/orc/writer_impl.cu (outdated, resolved)
@@ -390,71 +464,74 @@ orc_streams writer::impl::create_streams(host_span<orc_column_view> columns,
});

std::vector<int32_t> ids(columns.size() * gpu::CI_NUM_STREAMS, -1);
std::vector<TypeKind> types(streams.size(), INVALID_TYPE_KIND);
Contributor

Is this intentional, or did you mean to reserve, or to use subscripting instead of push_back where the vector is filled?

Contributor Author

This is intentional: the first n+1 streams don't have a valid type, so we fill out the type vector here and then append valid types as we append streams.
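
The fill-then-append pattern described in this reply can be sketched as follows. This is a simplified illustration, not the writer's code: `type_kind` is a stand-in for the real `TypeKind` enum, and `stream_table` and `add_stream` are hypothetical names. The point is that constructing the vector with a count and a fill value creates real placeholder elements, which `reserve` would not do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for the writer's TypeKind enum (values are illustrative).
enum class type_kind { invalid, int32, list };

// Hypothetical container mirroring the fill-then-append pattern.
struct stream_table {
  std::vector<type_kind> types;

  // The leading streams have no valid type, so they get placeholder entries.
  explicit stream_table(std::size_t num_untyped) : types(num_untyped, type_kind::invalid) {}

  // Appends a typed stream and returns its index, which stays aligned with
  // the placeholder entries created in the constructor.
  std::size_t add_stream(type_kind kind)
  {
    assert(kind != type_kind::invalid);
    types.push_back(kind);
    return types.size() - 1;
  }
};
```

For example, a `stream_table` built with three untyped streams returns index 3 for the first call to `add_stream`, and entries 0 through 2 remain `type_kind::invalid`. With `reserve(3)` followed by `push_back`, the first typed stream would instead land at index 0.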


 std::vector<std::vector<uint8_t>> stat_blobs(num_stat_blobs);
-hostdevice_vector<stats_column_desc> stat_desc(columns.size(), stream);
+hostdevice_vector<stats_column_desc> stat_desc(orc_table.num_columns(), stream);
 hostdevice_vector<statistics_merge_group> stat_merge(num_stat_blobs, stream);
Contributor

2d?

Contributor Author

This is actually pretty messy to convert; the array is used in creative ways.
If you don't mind, I'd prefer to do this in a separate PR (since there are a few more improvements that can be made in this code).

cpp/src/io/orc/writer_impl.cu (outdated, resolved)
cpp/src/io/orc/writer_impl.cu (outdated, resolved)
@vuule vuule requested a review from devavret July 20, 2021 20:52

vuule commented Jul 21, 2021

rerun tests

@codereport codereport left a comment

looks awesome! just one small spelling fix

cpp/src/io/orc/writer_impl.cu (outdated, resolved)
Co-authored-by: Conor Hoekstra <[email protected]>
@vuule vuule added the "5 - Ready to Merge" (Testing and reviews complete, ready to merge) label and removed the "4 - Needs cuIO Reviewer" and "4 - Needs Review" (Waiting for reviewer to review or respond) labels Jul 21, 2021

vuule commented Jul 21, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 9875e6f into rapidsai:branch-21.08 Jul 21, 2021
@vuule vuule deleted the fea-orc-write-list branch July 21, 2021 22:05
Labels
5 - Ready to Merge: Testing and reviews complete, ready to merge
cuIO: cuIO issue
feature request: New feature or request
libcudf: Affects libcudf (C++/CUDA) code.
non-breaking: Non-breaking change
Python: Affects Python cuDF API.
Development

Successfully merging this pull request may close these issues.

[FEA] Support list types in ORC writer
7 participants