-
Notifications
You must be signed in to change notification settings - Fork 928
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for list type in ORC writer #8723
Conversation
…fea-orc-write-list
…fea-orc-write-list
…fea-orc-write-list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still reviewing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python Approval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry but still reviewing.
@@ -390,71 +464,74 @@ orc_streams writer::impl::create_streams(host_span<orc_column_view> columns, | |||
}); | |||
|
|||
std::vector<int32_t> ids(columns.size() * gpu::CI_NUM_STREAMS, -1); | |||
std::vector<TypeKind> types(streams.size(), INVALID_TYPE_KIND); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intentional or did you mean to reserve/use subscript instead of push_back where using.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is intentional, the first n+1 streams don't have a valid type, so we fill out the type vector here and then append valid types as we append streams.
|
||
std::vector<std::vector<uint8_t>> stat_blobs(num_stat_blobs); | ||
hostdevice_vector<stats_column_desc> stat_desc(columns.size(), stream); | ||
hostdevice_vector<stats_column_desc> stat_desc(orc_table.num_columns(), stream); | ||
hostdevice_vector<statistics_merge_group> stat_merge(num_stat_blobs, stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2d?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually pretty messy to convert, the array is used in creative ways.
If you don't mind, I'd prefer to do this is a separate PR (since there's a few more improvements that can be made in this code).
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks awesome! just one small spelling fix
Co-authored-by: Conor Hoekstra <[email protected]>
@gpucibot merge |
Fixes #7640
Adds support for list columns to the ORC writer, including nested lists.
Adds Python tests for the new type.
Modifies a lot of host-side logic in the writer, because rowgroup sizes are not constant now. Rowgroup sizes are now precomputed for all columns.
Performance impact: ~5% improvement for existing types (expect floating point - no changes there) 🎉