Fix compression in ORC writer #12194
Conversation
Codecov Report
Base: 88.25% // Head: 88.25% // No change to project coverage 👍
Additional details and impacted files
@@ Coverage Diff @@
## branch-22.12 #12194 +/- ##
=============================================
Coverage 88.25% 88.25%
=============================================
Files 137 137
Lines 22571 22571
=============================================
Hits 19921 19921
Misses 2650 2650
☔ View full report at Codecov.
@@ -1234,7 +1235,7 @@ __global__ void __launch_bounds__(1024)
                       ? results[ss.first_block + b].bytes_written
                       : src_len;
      uint32_t blk_size24{};
-     if (results[ss.first_block + b].status == compression_status::SUCCESS) {
+     if (src_len < dst_len) {
What are the semantics here? "If we failed to produce a compressed block smaller than the uncompressed data (either by failure or because of unlucky compression), just use the uncompressed data"?
That's exactly right.
Planning to add a comment.
We can also just invert the original condition, but this gives us additional benefits with incompressible data.
Right. This seems like a good way to do it.
Wait, so there is no longer a check for the SUCCESS compression status?
It is, indirectly, in the code slightly above it:
auto dst_len = (results[ss.first_block + b].status == compression_status::SUCCESS)
? results[ss.first_block + b].bytes_written
: src_len;
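For reference, here is a minimal restatement of the decision being discussed, using the names from the quoted snippets (results, ss, src_len, compression_status); this is a sketch for clarity, not the actual kernel code:

// The SUCCESS check survives indirectly: on failure, dst_len falls back to src_len.
auto const dst_len = (results[ss.first_block + b].status == compression_status::SUCCESS)
                       ? results[ss.first_block + b].bytes_written
                       : src_len;
// New condition: if the "compressed" output ended up larger than the original
// block (unlucky, incompressible data), just keep the uncompressed bytes.
if (src_len < dst_len) {
  // fall back to the uncompressed data for this block
}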
We don't throw any exception if the decomp process failed?
Added a comment. The logic should be refactored, but I don't want to make any unnecessary changes at this point.
Tested against the Spark plugin + integration tests. All passed. Suggest adding a comment on the if statement for clarity.
Should this also close #12170?
Most likely it will address the issue. I don't see a way to verify the fix; the file is not attached.
…bug-write_orc-compressission
-      auto const dst_offset = b * (padded_block_header_size + padded_comp_block_size);
       inputs[ss.first_block + b] = {src + b * comp_blk_size, blk_size};
+      auto const dst_offset =
+        padded_block_header_size + b * (padded_block_header_size + padded_comp_block_size);
Was this also wrongly computed before?
Sort of. We insert a 3-byte header before each block when we compact the blocks in the output buffer. The output buffer is actually the same as the compressed data buffer. So without the additional padded_block_header_size offset we would overwrite the first three bytes of the first compressed block and then copy the block 3 bytes ahead, potentially doing even more damage to the block. This is why tests failed when @nvdbaranec fixed the SUCCESS condition locally. Luckily, this one change to the location where we write the compressed blocks fixes all issues in compaction.
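To make the overwrite concrete, here is a worked example for the first block (b = 0), based on the explanation above and the names in the diff; the old_dst_offset/new_dst_offset names are only for illustration, not the writer's actual code:

// Old offset: block 0's compressed output starts at byte 0 of the shared buffer,
// exactly where compaction later writes the 3-byte block header.
auto const old_dst_offset = b * (padded_block_header_size + padded_comp_block_size);   // b = 0 -> 0
// New offset: each block's compressed output starts one padded header into its slot,
// so writing the header no longer clobbers the compressed bytes, and the data copy
// is no longer shifted relative to its source.
auto const new_dst_offset =
  padded_block_header_size + b * (padded_block_header_size + padded_comp_block_size);  // b = 0 -> padded_block_header_size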
The code that copies the blocks:
cudf/cpp/src/io/orc/stripe_enc.cu, lines 1258 to 1264 in d335aa3
if (src != dst) {
  for (uint32_t i = 0; i < blk_size; i += 1024) {
    uint8_t v = (i + t < blk_size) ? src[i + t] : 0;
    __syncthreads();
    if (i + t < blk_size) { dst[i + t] = v; }
  }
}
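As a side note on this loop: the __syncthreads() between the load and the store makes all threads read their 1024-byte chunk before any of them writes it, which is presumably what keeps the copy correct when the source and destination ranges overlap by only a few bytes during compaction.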
rerun tests
Description
Issues #12066, #12170
There is a logic error in the ORC writer that prevents the use of compressed blocks in the output file. This caused all ORC files to effectively be written without compression, producing large files in many cases.
This PR makes minimal changes to fix the logic and use compressed blocks when the compression ratio is larger than one.
Also fixed the offset at which compressed blocks are written, to avoid overwriting the data as blocks are compacted in the output buffer.
Verified reduction in file size for the files generated in benchmarks.
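For background on the 3-byte header and the uncompressed fallback: the ORC format precedes each compression chunk with a 3-byte little-endian header that stores (chunk length * 2 + isOriginal), so a block that does not shrink can be stored as-is with the "original" bit set. A small illustrative sketch follows (write_orc_block_header is a hypothetical helper, not the writer's actual code):

#include <cstdint>

// Illustrative only: encode an ORC compression-chunk header.
// A set "original" bit marks the chunk as stored uncompressed.
inline void write_orc_block_header(uint8_t* dst, uint32_t chunk_length, bool is_original)
{
  uint32_t const header = (chunk_length << 1) | (is_original ? 1u : 0u);
  dst[0] = static_cast<uint8_t>(header & 0xff);          // little-endian, low byte first
  dst[1] = static_cast<uint8_t>((header >> 8) & 0xff);
  dst[2] = static_cast<uint8_t>((header >> 16) & 0xff);
}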
Checklist