Currently, the parquet writer has a 16-bit maximum dictionary size.
See cudf/cpp/src/io/parquet/parquet_gpu.hpp, lines 48 to 50 at commit da74744.
This means that within a column chunk, dictionary-based compression can be used to potentially save space when there are <= 65535 unique entries. However, if there are > 65535 unique entries, dictionary compression cannot be used, potentially leading to larger file sizes.
The Apache Parquet format specification indicates support for a 32-bit maximum dictionary index width, and this is available in the official Java implementation of Parquet.
Some users have shared that using libcudf instead of the traditional Java Parquet implementation has resulted in file sizes >5x larger in some scenarios (due to not being able to use dictionary encoding), which can pose a storage challenge at scale.
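To make the impact concrete, here is a minimal sketch (not from the issue; it assumes a GPU-enabled cudf environment, and the column names and file path are placeholders) that writes one low-cardinality and one high-cardinality column with cudf, then uses pyarrow to inspect which encodings the writer chose for each column chunk:

```python
# Sketch: write two columns with cudf, then inspect the chosen encodings
# with pyarrow. "low_card" stays well under 65535 unique values per chunk;
# "high_card" exceeds it, so only the former can use dictionary encoding.
import cudf
import numpy as np
import pyarrow.parquet as pq

n_rows = 200_000
df = cudf.DataFrame(
    {
        "low_card": np.arange(n_rows) % 1_000,  # 1,000 unique values
        "high_card": np.arange(n_rows),         # 200,000 unique values
    }
)
df.to_parquet("dict_limit_demo.parquet")

meta = pq.ParquetFile("dict_limit_demo.parquet").metadata
for col_idx in range(meta.num_columns):
    col = meta.row_group(0).column(col_idx)
    print(col.path_in_schema, col.encodings)
# Expected: the low-cardinality column reports a dictionary encoding
# (e.g. PLAIN_DICTIONARY or RLE_DICTIONARY, depending on the writer
# version), while the high-cardinality column falls back to plain encoding.
```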