[FEA] Refactor ORC's dictionary encoding using cuCo #10495

devavret · 2022-03-23T17:17:32Z

The ORC writer uses dictionary encoding on string columns, implemented with a custom hash-map-like structure. This custom structure is no longer needed, as we can use cuCo's hash maps instead.

Using #8476 as a template, we should be able to refactor and clean up the dictionary encoding code that currently resides in `src/io/orc/dict_enc.cu`.
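For readers unfamiliar with the technique, here is a minimal host-side sketch of hash-map-based dictionary encoding of a string column. It is not the libcudf code: `std::unordered_map` stands in for a GPU hash map such as `cuco::static_map`, and the names `DictionaryEncoded` and `dictionary_encode` are hypothetical, chosen only for illustration.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Result of dictionary-encoding one string column.
// Hypothetical type for illustration; not part of libcudf or cuCo.
struct DictionaryEncoded {
  std::vector<std::string> dictionary;  // unique values, in first-seen order
  std::vector<std::uint32_t> indices;   // one dictionary index per input row
};

// Host-side sketch. std::unordered_map is an assumed stand-in for a GPU hash
// map; a real GPU implementation would insert keys concurrently.
DictionaryEncoded dictionary_encode(const std::vector<std::string>& column) {
  DictionaryEncoded out;
  out.indices.reserve(column.size());

  std::unordered_map<std::string, std::uint32_t> lookup;
  for (const auto& value : column) {
    // try_emplace inserts only when the key is absent. The candidate index is
    // the current dictionary size, i.e. the slot the new value would occupy.
    auto [it, inserted] =
        lookup.try_emplace(value, static_cast<std::uint32_t>(out.dictionary.size()));
    if (inserted) { out.dictionary.push_back(value); }
    out.indices.push_back(it->second);
  }
  return out;
}
```

With the input rows `a, b, a, c, b`, this sketch yields the dictionary `a, b, c` and the indices `0, 1, 0, 2, 1`. Each distinct string is stored once and every row refers to it by index, which is the part a general-purpose hash map can take over from a custom structure.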

github-actions · 2022-05-05T22:03:10Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

PointKernel · 2022-05-05T22:08:31Z

Still relevant

github-actions · 2022-06-04T23:03:13Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions · 2022-09-26T05:30:47Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

…3580) Issue #13326, #10495

This PR reimplements creation of stripe dictionaries in ORC writer to eliminate row group size limitations. New implementation uses `cuco::static_map` in a way that's very similar to the Parquet writer. PR brings large performance gains because per-column X per-stripe sorting that invoked hundreds of thrust calls is now removed. Also verified that the original row group size limit (2^16) for dictionary encoding is removed, allowing dictionaries to be applicable to large lists of strings.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)

URL: #13580

GregoryKimball · 2023-07-31T19:28:44Z

Closed by #13580

devavret added the feature request (New feature or request) and Needs Triage (Need team to review and classify) labels Mar 23, 2022

GregoryKimball added the cuIO (cuIO issue) label Mar 23, 2022

PointKernel self-assigned this Mar 23, 2022

github-actions bot added the inactive-30d label May 5, 2022

github-actions bot removed the inactive-30d label May 5, 2022

github-actions bot added the inactive-30d label Jun 4, 2022

GregoryKimball added the Performance (Performance related issue) and tech debt labels and removed the Needs Triage (Need team to review and classify) and feature request (New feature or request) labels Jun 28, 2022

github-actions bot added the inactive-90d label Sep 26, 2022

GregoryKimball added this to the Refactor using cuco containers milestone Oct 4, 2022

PointKernel mentioned this issue Nov 29, 2022

[FEA] Refactor hash-based algorithms with new cuco data structures #12261

Open

GregoryKimball added the libcudf (Affects libcudf (C++/CUDA) code.) label and removed the inactive-90d label Apr 2, 2023

vuule mentioned this issue Jun 29, 2023

Use cuco::static_map to build string dictionaries in ORC writer #13580

Merged

3 tasks

vuule assigned vuule and unassigned PointKernel Jun 29, 2023

GregoryKimball closed this as completed Jul 31, 2023
