-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include writer code and writerVersion in ORC files #14458
Include writer code and writerVersion in ORC files #14458
Conversation
…fea-write_orc-version-and-code
…fea-write_orc-version-and-code
@@ -40,6 +40,9 @@ namespace orc { | |||
static constexpr uint32_t block_header_size = 3; | |||
// Seconds from January 1st, 1970 to January 1st, 2015 | |||
static constexpr int64_t orc_utc_epoch = 1420070400; | |||
// Each ORC writer implementation should write its code to the file footer | |||
// the codes are specified in the ORC specification | |||
static constexpr int32_t cudf_writer_code = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are ORC writer number 5?
Is there a link to these codes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we got assigned 5. Search for cudf in https://orc.apache.org/specification/ORCv1/
I can add the link above.
I think someone more familiar with ORC should review. Removing my assignment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
cpp/src/io/orc/writer_impl.cu
Outdated
@@ -2702,7 +2703,8 @@ void writer::impl::close() | |||
ps.footerLength = pbw.size(); | |||
ps.compression = _compression_kind; | |||
ps.compressionBlockSize = _compression_blocksize; | |||
ps.version = {0, 12}; | |||
ps.version = {0, 12}; // Hive 0.12 | |||
ps.writerVersion = 7; // ORC-517 fixed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this 7
be defined above like MAGIC
with a comment about it? Do we change the writer version every major release or major changes? Is this something we never change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that we would increase this version when we implement newer features and fix notable bugs, see https://github.com/apache/orc/blob/main/proto/orc_proto.proto#L412-L422
I'm not sure if this should be a constant. I'll make the change if you think that's better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is a constant it is easier to add a multi-line comment explaining that major changes should increment it. I'd also be ok with a comment about the version. At the extreme, I would have a changelist in the comments
// version 1: Initial implementation
// version 2: added nested types
// version 3: fixed bug xyz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Included the list of versions, let me know what you think
…fea-write_orc-version-and-code
cpp/src/io/orc/orc.hpp
Outdated
static constexpr int32_t cudf_writer_code = 5; | ||
// Each ORC writer implementation should write its version to the PostScript | ||
// The version values are based on the ORC Java writer bug fixes and features | ||
// From https://github.com/apache/orc/blob/main/proto/orc_proto.proto: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The link cannot be accessed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, they removed the proto module in a recent PR :(
I'll track down where the information is now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed the link
what timing!
uint32_t compressionBlockSize{}; // the maximum size of each compression chunk | ||
std::vector<uint32_t> version; // the version of the file format [major, minor] | ||
uint64_t metadataLength = 0; // the length of the metadata section in bytes | ||
std::optional<uint32_t> writerVersion; // The version of the writer that wrote the file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this is an optional? Should it be a permanent field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's based on the protobuf specs: https://github.com/apache/orc/blob/branch-1.9/proto/orc_proto.proto#L439
We might always write the version now, but we should be able to read files without this field.
@@ -90,6 +113,7 @@ struct FileFooter { | |||
uint64_t numberOfRows = 0; // the total number of rows in the file | |||
std::vector<ColStatsBlob> statistics; // Column statistics blobs | |||
uint32_t rowIndexStride = 0; // the maximum number of rows in each index entry | |||
std::optional<uint32_t> writer; // Writer code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, should it be a permanent field?
…fea-write_orc-version-and-code
/merge |
Closes rapidsai#14325 Changes some of the metadata written to ORC file: - Include the (cuDF) writer code (5). - include writerVersion, with the value of 7; This value means that bugs up to ORC-517 are fixed. This version (6+ required) allows us to write the nanosecond statistics. This change can have unexpected impact, depending on how readers use these fields. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) - Nghia Truong (https://github.com/ttnghia) - Mike Wilson (https://github.com/hyperbolic2346) URL: rapidsai#14458
Description
Closes #14325
Changes some of the metadata written to ORC file:
This change can have unexpected impact, depending on how readers use these fields.
Checklist