[BUG] Parquet writer does not apply offset to nullmask #6642

devavret · 2020-11-02T16:31:33Z

cuDF's parquet writer writes an incorrect file when a column has both a null mask and a nonzero offset.

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1,None,3,4,5]})

In [3]: df2 = df[2:]

In [4]: df2
Out[4]: 
   a
2  3
3  4
4  5

In [5]: df2.to_parquet("sliced.parquet")

In [6]: cudf.read_parquet("sliced.parquet")
Out[6]: 
      a
0     3
1  <NA>
2     5

The example above shows that the writer wrote the validity mask without applying the input's offset. The output reuses the null bit of row 1 of the original column, so the element at index 1 in the output (the value 4) is read back as null.
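To make the mechanism concrete, here is a minimal pure-Python sketch of the failure mode. It is not the cuDF writer code: the helper names (`pack_validity`, `is_valid`, `slice_validity`) are hypothetical, and it assumes an Arrow-style validity bitmask in which bit `i` (least-significant bit first) is set when row `i` is valid. A slice shares the parent's bitmask and only records an offset, so any reader of the mask has to add that offset to the row index.

```python
def pack_validity(values):
    """Pack per-row validity into bytes, least-significant bit first."""
    mask = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            mask[i // 8] |= 1 << (i % 8)
    return bytes(mask)


def is_valid(mask, row):
    """Return True if the bit for absolute row index `row` is set."""
    return bool((mask[row // 8] >> (row % 8)) & 1)


def slice_validity(mask, offset, size, apply_offset=True):
    """Validity of a slice that shares the parent's mask.

    With apply_offset=False this mimics the behaviour reported in the
    issue: the mask is read from bit 0 instead of from bit `offset`.
    """
    start = offset if apply_offset else 0
    return [is_valid(mask, start + i) for i in range(size)]


parent = [1, None, 3, 4, 5]
mask = pack_validity(parent)

# df[2:] corresponds to offset=2, size=3 over the same mask.
buggy = slice_validity(mask, offset=2, size=3, apply_offset=False)
fixed = slice_validity(mask, offset=2, size=3)

print(buggy)  # validity read from bits 0..2 of the parent
print(fixed)  # validity read from bits 2..4 of the parent
```

Reading from bit 0 picks up the null that belongs to the parent's row 1, which matches the `<NA>` at index 1 in the round-tripped file. Reading from bit `offset` marks all three sliced rows as valid.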

Other cuDF writers may have the same problem. I haven't tried to reproduce it with ORC, but the code suggests it is affected as well.

Fixes #6642
Authors:
- Kumar Aatish <[email protected]>
- skirui-source <[email protected]>
Approvers:
- Vukasin Milovanovic
- Devavret Makkar
- Ram (Ramakrishna Prabhu)
URL: #6889

devavret added the "bug" and "Needs Triage" labels Nov 2, 2020

devavret added the "cuIO" label and removed the "Needs Triage" label Nov 2, 2020

kaatish self-assigned this Nov 24, 2020

kaatish mentioned this issue Dec 3, 2020

Fix nullmask offset handling in parquet and orc writer #6889

Merged

rapids-bot bot closed this as completed in #6889 Dec 13, 2020
