Use cuco::static_map to build string dictionaries in ORC writer #13580

Merged (59 commits), Jul 14, 2023


Contributor

@vuule vuule commented Jun 14, 2023

Description

Issues #13326, #10495

This PR reimplements the creation of stripe dictionaries in the ORC writer to eliminate row group size limitations.
The new implementation uses cuco::static_map in a way that's very similar to the Parquet writer.

This PR brings large performance gains because the sorting previously done for each column in each stripe, which invoked hundreds of thrust calls, is now removed.
Also verified that the original row group size limit (2^16) for dictionary encoding is removed, so dictionary encoding can now be applied to large lists of strings.
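
As a rough illustration of the hash-based approach described above, here is a minimal host-side sketch. It is not the libcudf implementation: `std::unordered_map` stands in for `cuco::static_map`, the loop is sequential rather than a GPU kernel, and the names `StringDictionary` and `build_dictionary` are hypothetical. It only shows why no sorting is needed and why the number of rows is not tied to a 2^16 limit.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result type for this sketch; not a libcudf type.
struct StringDictionary {
  std::vector<std::string> entries;    // unique values, in first-seen order
  std::vector<std::uint32_t> indices;  // one dictionary index per input row
};

// Builds a dictionary for the rows of one column in one stripe using a single
// hash-map pass. No sorting is involved, and the 32-bit indices are not
// limited to 2^16 rows.
StringDictionary build_dictionary(std::vector<std::string> const& rows)
{
  StringDictionary dict;
  std::unordered_map<std::string, std::uint32_t> index_of;
  index_of.reserve(rows.size());
  dict.indices.reserve(rows.size());

  for (auto const& value : rows) {
    // Index the value would receive if it has not been seen before.
    auto const next_index = static_cast<std::uint32_t>(dict.entries.size());
    auto const [it, inserted] = index_of.try_emplace(value, next_index);
    if (inserted) { dict.entries.push_back(value); }
    dict.indices.push_back(it->second);
  }
  return dict;
}
```

For example, `build_dictionary({"a", "b", "a"})` yields two dictionary entries and the index sequence 0, 1, 0. A GPU version would insert keys concurrently, so the order of entries would generally not be first-seen order.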

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added the feature request and non-breaking labels Jun 14, 2023
@vuule vuule self-assigned this Jun 14, 2023
@vuule vuule changed the title from "Use cuco map in OCR writer" to "Use cuco map in ORC writer" Jun 14, 2023
@github-actions github-actions bot added the libcudf label Jun 14, 2023
Member

@PointKernel PointKernel left a comment

One minor suggestion and LGTM

cpp/src/io/orc/orc_gpu.hpp (outdated)
}
namespace cudf::io::orc::gpu {

constexpr int DEFAULT_BLOCK_SIZE = 256;
Contributor

What block? Data block or thread block? Can you add a comment for it please?

cpp/src/io/orc/dict_enc.cu (outdated)
cpp/src/io/orc/dict_enc.cu (outdated)
@vuule vuule requested a review from ttnghia July 13, 2023 22:51
@vuule vuule added and removed the 5 - Ready to Merge label Jul 13, 2023
Contributor Author

vuule commented Jul 14, 2023

/merge

@rapids-bot rapids-bot bot merged commit d9f1d94 into rapidsai:branch-23.08 Jul 14, 2023
@vuule vuule deleted the fea-write-orc-dict branch July 14, 2023 02:27
@vuule vuule restored the fea-write-orc-dict branch August 10, 2023 03:13
@vuule vuule deleted the fea-write-orc-dict branch August 10, 2023 03:25
Labels: cuIO, feature request, libcudf, non-breaking
5 participants