-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include row group level stats when writing ORC files #10041
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.02 #10041 +/- ##
================================================
- Coverage 10.49% 10.41% -0.08%
================================================
Files 119 119
Lines 20305 20541 +236
================================================
+ Hits 2130 2139 +9
- Misses 18175 18402 +227
Continue to review full report at Codecov.
|
Measured a small performance regression due to additional stats encode. Difference is within a few percent (hard to measure exactly due to variance between runs). |
Will look into adding a test, unsure if this is possible with the available readers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copyrights need to be updated on some files here. I like these changes, I think they are going in the right direction for readability.
Maybe we can add a configure to enable or disable this FEA after SPARK has bumped to the ORC repo which has the fix. |
👍 |
That only works when everyone stops using the older Spark version(s) (and any other Java-based data processing frameworks) that still using the old ORC version with the reading bug. While those frameworks on the older ORC version are still in use, cudf applications could still end up creating ORC files that those frameworks will silently drop data when reading with predicate pushdown. Even though the spec says these things are technically optional, it is very sketchy to be the only ORC writer on the planet that is not generating these stats. I think it's fine making it possible in libcudf to avoid writing these stats, but IMO cudf applications should always ask for the stats to be generated unless they know there's no chance the files they're creating could be read by data processing frameworks that could be affected by the ORC reading bug. |
That's right. Writing all statistics will be the default. |
rerun tests |
m_buf->data()[lpos + 2] = (uint8_t)(sz); | ||
|
||
if (stats != nullptr) { | ||
sz += put_uint(encode_field_number<decltype(*stats)>(2)); // 2: statistics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe field number (2 in this case) should be an enum. I see that it's used in a lot of places though, so maybe a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's doable, but there would need to be different enums for each ORC message type, since the numbers are not unique between messages (see https://orc.apache.org/specification/ORCv1/). We can have the set of enums (non-class) and still pass them as int. I would really need to do this in a follow up for this one to make it into 22.02.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Might not hurt to cook up some kind of tests for this.
…fea-rowgroup-stats
…fea-rowgroup-stats
rerun tests |
1 similar comment
rerun tests |
@gpucibot merge |
Depends on #10041. The erstwhile ORC writer API exposed only a binary choice to choose the level of statistics: ENABLED/DISABLED. This commit allows the ORC writer to further choose whether statistics are collected at the ROW_GROUP or STRIPE level. This commit also includes the relevant changes to `java/` and `python/`. Authors: - MithunR (https://github.com/mythrocks) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Jason Lowe (https://github.com/jlowe) - GALI PREM SAGAR (https://github.com/galipremsagar) - Christopher Harris (https://github.com/cwharris) - Vukasin Milovanovic (https://github.com/vuule) URL: #10058
Closes #9964
Encodes row group level stats with the rest and writes the encoded blobs into the protobuf, at the start of each stripe (other stats are in the file footer).
Adds
put_bytes
toProtobufWriter
to optimize writing of buffers.Adds new struct to represent the encoded ORC statistics so they are separated by granularity level (instead of using a single vector).