Fix compression in ORC writer #12194

Merged
merged 4 commits into branch-22.12 from bug-write_orc-compressission on Nov 21, 2022

Conversation

@vuule vuule commented Nov 18, 2022

Description

Issues #12066, #12170
There is a logic error in the ORC writer that prevents the use of compressed blocks in the output file. This caused all ORC files to effectively be written without compression, producing much larger files in many cases.
This PR makes minimal changes to fix the logic and use compressed blocks when the compression ratio is larger than one.
Also fixed the offset at which compressed blocks are written, to avoid overwriting data as blocks are compacted in the output buffer.
Verified the reduction in file size for the files generated in benchmarks.
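
As an illustration only, here is a minimal C++ sketch of the per-block decision described above, plus the 3-byte ORC block header encoding; the helper names are hypothetical and this is not the libcudf implementation.

  #include <cstdint>

  // ORC prefixes each block in a compressed stream with a 3-byte header whose
  // value is (block_length << 1) | is_original; the low bit marks a block that
  // is stored uncompressed ("original").
  inline uint32_t make_block_header24(uint32_t block_length, bool is_original)
  {
    return (block_length << 1) | (is_original ? 1u : 0u);
  }

  struct block_choice {
    uint32_t length;        // bytes to copy into the output stream
    bool use_uncompressed;  // true -> copy the raw block and set the original bit
  };

  // Use the compressed output only when it is strictly smaller than the source;
  // on compression failure or incompressible data, fall back to the raw bytes.
  inline block_choice choose_block(uint32_t src_len, uint32_t dst_len, bool compression_ok)
  {
    if (!compression_ok || dst_len >= src_len) { return {src_len, true}; }
    return {dst_len, false};
  }

Readers treat a block whose low header bit is set as raw bytes, so falling back to the uncompressed data always produces a valid file.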

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Nov 18, 2022
@vuule vuule self-assigned this Nov 18, 2022
@github-actions github-actions bot added the libcudf label (Affects libcudf (C++/CUDA) code) Nov 18, 2022

codecov bot commented Nov 18, 2022

Codecov Report

Base: 88.25% // Head: 88.25% // No change to project coverage 👍

Coverage data is based on head (08c0c5a) compared to base (782fba3).
Patch has no changes to coverable lines.

Additional details and impacted files
@@              Coverage Diff              @@
##           branch-22.12   #12194   +/-   ##
=============================================
  Coverage         88.25%   88.25%           
=============================================
  Files               137      137           
  Lines             22571    22571           
=============================================
  Hits              19921    19921           
  Misses             2650     2650           
Impacted Files                             Coverage Δ
python/dask_cudf/dask_cudf/backends.py     85.17% <0.00%> (ø)

@@ -1234,7 +1235,7 @@ __global__ void __launch_bounds__(1024)
                          ? results[ss.first_block + b].bytes_written
                          : src_len;
       uint32_t blk_size24{};
-      if (results[ss.first_block + b].status == compression_status::SUCCESS) {
+      if (src_len < dst_len) {
Contributor

What are the semantics here? "If we failed to produce a compressed block smaller than the uncompressed data (either by failure or because of unlucky compression), just use the uncompressed data"?

Contributor Author

That's exactly right.
Planning to add a comment.

Contributor Author

We could also just invert the original condition, but this form gives us additional benefits with incompressible data.

Contributor

Right. This seems like a good way to do it.
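
For illustration, a hedged sketch of the two alternatives discussed in this thread, using the names from the diff above; the helper functions and the stand-in enum are hypothetical and do not reproduce the kernel's actual branch structure.

  #include <cstdint>

  // Minimal stand-in for the status enum referenced in the diff above.
  enum class compression_status { SUCCESS, FAILURE };

  // Option A: just invert the original status check; this only covers outright
  // compression failures.
  inline bool keep_uncompressed_a(compression_status status)
  {
    return status != compression_status::SUCCESS;
  }

  // Option B (the semantics confirmed earlier in this thread): compare sizes.
  // Because dst_len falls back to src_len on failure, this also keeps the raw
  // bytes whenever compression did not make the block strictly smaller.
  inline bool keep_uncompressed_b(uint32_t src_len, uint32_t dst_len)
  {
    return dst_len >= src_len;
  }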

Contributor

Wait, so there is no longer a check for the SUCCESS compression status?

Contributor

There is, indirectly, in the code slightly above:

  auto dst_len = (results[ss.first_block + b].status == compression_status::SUCCESS)
                       ? results[ss.first_block + b].bytes_written
                       : src_len;

Contributor

We don't throw any exception if the compression process failed?

Contributor Author

Added a comment. The logic should be refactored, but I don't want to make any unnecessary changes at this point.

@vuule vuule marked this pull request as ready for review November 18, 2022 18:36
@vuule vuule requested a review from a team as a code owner November 18, 2022 18:36
@vuule vuule requested review from harrism and PointKernel and removed request for a team November 18, 2022 18:36
Contributor

@nvdbaranec nvdbaranec left a comment

Tested against the Spark plugin + integration tests. All passed. Suggest adding a comment on the if statement for clarity.

ttnghia commented Nov 18, 2022

Should this also close #12170?

vuule commented Nov 18, 2022

Should this also close #12170?

Most likely it will address the issue. I don't see a way to verify the fix, since the file is not attached.

-      auto const dst_offset = b * (padded_block_header_size + padded_comp_block_size);
       inputs[ss.first_block + b] = {src + b * comp_blk_size, blk_size};
+      auto const dst_offset =
+        padded_block_header_size + b * (padded_block_header_size + padded_comp_block_size);
Contributor

Was this also wrongly computed before?

Contributor Author

Sort of. We insert a 3-byte header in front of each block when we compact the blocks in the output buffer, and the output buffer is actually the same as the compressed data buffer. So, without the additional padded_block_header_size offset, we would overwrite the first three bytes of the first compressed block and then copy the block 3 bytes ahead, potentially doing even more damage to the block. This is why tests failed when @nvdbaranec fixed the SUCCESS condition locally. Luckily, this one change to the location where the compressed blocks are written fixes all of the issues in compaction.
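
For illustration, a small self-contained sketch of the offset computation and buffer layout described above; the parameter names come from the diff, while the wrapper function itself is hypothetical.

  #include <cstddef>

  // Per-block layout of the shared compressed-data / output buffer:
  //
  //   [header slot][compressed block 0][header slot][compressed block 1] ...
  //
  // Block b's compressed bytes start after its own 3-byte header slot, so
  // compaction can write the header in place without clobbering the block.
  inline std::size_t compressed_block_offset(std::size_t b,
                                             std::size_t padded_block_header_size,
                                             std::size_t padded_comp_block_size)
  {
    return padded_block_header_size + b * (padded_block_header_size + padded_comp_block_size);
  }

  // The old offset, b * (padded_block_header_size + padded_comp_block_size), placed
  // block 0 at offset 0, so writing its 3-byte header during compaction overwrote the
  // block's first bytes before the block itself was copied 3 bytes ahead.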

Contributor Author

The code that copies the blocks:

  if (src != dst) {
    // Copy one byte per thread in 1024-byte chunks; each chunk is staged in
    // registers and synchronized so all reads complete before any writes.
    for (uint32_t i = 0; i < blk_size; i += 1024) {
      uint8_t v = (i + t < blk_size) ? src[i + t] : 0;
      __syncthreads();
      if (i + t < blk_size) { dst[i + t] = v; }
    }
  }

vuule commented Nov 18, 2022

rerun tests

@jolorunyomi jolorunyomi merged commit 769dfbb into rapidsai:branch-22.12 Nov 21, 2022
@vuule vuule deleted the bug-write_orc-compressission branch August 10, 2023 03:26