[FEA] Refactor ORC's dictionary encoding using cuCo #10495
Comments
This issue has been labeled
Still relevant
…3580)

Issue #13326, #10495

This PR reimplements the creation of stripe dictionaries in the ORC writer to eliminate row group size limitations. The new implementation uses `cuco::static_map` in a way that is very similar to the Parquet writer. The PR brings large performance gains because the per-column, per-stripe sorting, which invoked hundreds of thrust calls, is now removed. It was also verified that the original row group size limit (2^16) for dictionary encoding is removed, so dictionaries can now be applied to large lists of strings.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)

URL: #13580
Closed by #13580
The ORC writer uses dictionary encoding on string columns, built on a custom hash-map-like structure. This structure is no longer needed, since we can use cuCo's hash maps instead.
Using #8476 as a template, we should be able to refactor and clean up the dictionary encoding code that currently resides in `src/io/orc/dict_enc.cu`.
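For readers unfamiliar with the technique, here is a minimal host-side sketch of hash-map-based dictionary encoding of a string column. It is not the libcudf or cuCo code: it uses `std::unordered_map` on the CPU as a stand-in for a GPU hash map, and the names `DictionaryEncoded`, `dictionary_encode`, and `dictionary_decode` are hypothetical, chosen only for illustration. Storing the dictionary in first-seen order is also an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result type for this sketch; not a libcudf type.
struct DictionaryEncoded {
  std::vector<std::string> dictionary;  // unique values, in first-seen order
  std::vector<std::uint32_t> indices;   // one dictionary index per input row
};

// Builds the dictionary with a hash map instead of sorting the rows:
// each distinct string is inserted once and mapped to its dictionary slot.
DictionaryEncoded dictionary_encode(const std::vector<std::string>& rows) {
  DictionaryEncoded out;
  std::unordered_map<std::string, std::uint32_t> index_of;
  out.indices.reserve(rows.size());
  for (const std::string& value : rows) {
    const std::uint32_t next = static_cast<std::uint32_t>(out.dictionary.size());
    // emplace leaves the map unchanged if the key is already present.
    auto result = index_of.emplace(value, next);
    if (result.second) {
      out.dictionary.push_back(value);  // first occurrence: new dictionary entry
    }
    out.indices.push_back(result.first->second);
  }
  return out;
}

// Reconstructs the original rows from the dictionary and the index stream.
std::vector<std::string> dictionary_decode(const DictionaryEncoded& enc) {
  std::vector<std::string> rows;
  rows.reserve(enc.indices.size());
  for (std::uint32_t idx : enc.indices) {
    assert(idx < enc.dictionary.size());  // every index must point into the dictionary
    rows.push_back(enc.dictionary[idx]);
  }
  return rows;
}
```

The sketch uses 32-bit indices, so the number of rows does not bound the dictionary in the way a 16-bit index would. A GPU implementation would additionally have to handle concurrent inserts, which this serial version does not attempt to show.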