[BUG] Bools written by cuIO ORC writer don't match when read by pyarrow/pyorc #6763
Comments
Based on an offline discussion with @rgsl888prabhu, this is potentially a pyarrow issue: its reader does not handle the way our writer splits boolean data streams into stripes.
For reference, see https://issues.apache.org/jira/browse/ARROW-10635 and the assumption in the cudf ORC writer at line 210 in 01b8b5c.
We got confirmation that the issue also reproduces with the Spark reader, so we are treating this as a cuIO bug (not a pyarrow bug).
code to reproduce
Root cause: Thus, we need to encode bool values from the next row group into the incomplete byte, and set the next row group's starting offset to the correct bit within the data encoded as part of the current row group. This shifts the encoding of all following row groups, and the effect ripples through the entire stripe. Significant changes to the current implementation are needed to support this.
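To make the byte-boundary problem concrete, here is a minimal pure-Python sketch. It is not the cuIO implementation and does not use the ORC library; it only assumes that booleans are bit-packed most-significant-bit first, and all function names (`pack_bits`, `pack_padded_per_group`, `pack_continuous`, `group_bit_offsets`) are hypothetical. Each inner list stands for the non-null bool values of one row group. With nulls, that count need not be a multiple of 8, which would be consistent with the bug not showing up on null-free input.

```python
def pack_bits(values):
    """Pack booleans into bytes, most significant bit first (assumed layout)."""
    out = bytearray()
    for start in range(0, len(values), 8):
        byte = 0
        for i, v in enumerate(values[start:start + 8]):
            if v:
                byte |= 0x80 >> i
        out.append(byte)  # a short final chunk is zero-padded
    return bytes(out)


def unpack_bits(data, count):
    """Read `count` booleans back from a continuous bit stream."""
    return [bool(data[i // 8] & (0x80 >> (i % 8))) for i in range(count)]


def pack_padded_per_group(groups):
    """Each row group starts on a fresh byte (the problematic layout)."""
    return b"".join(pack_bits(g) for g in groups)


def pack_continuous(groups):
    """The next row group fills the incomplete byte of the previous one."""
    return pack_bits([v for g in groups for v in g])


def group_bit_offsets(groups):
    """(byte, bit) position at which each row group starts in the continuous stream."""
    offsets, total = [], 0
    for g in groups:
        offsets.append(divmod(total, 8))
        total += len(g)
    return offsets


# Two row groups with 3 and 5 non-null values (not multiples of 8).
groups = [[True, False, True], [True, True, False, True, True]]
flat = [v for g in groups for v in g]

padded = pack_padded_per_group(groups)
continuous = pack_continuous(groups)

decoded_padded = unpack_bits(padded, len(flat))
decoded_continuous = unpack_bits(continuous, len(flat))
offsets = group_bit_offsets(groups)

print(padded.hex(), continuous.hex())  # a0d8 bb
print(decoded_padded == flat)          # False
print(decoded_continuous == flat)      # True
print(offsets)                         # [(0, 0), (0, 3)]
```

A reader that treats the stream as continuous decodes the padded layout incorrectly from the second row group onward, while the continuous layout round-trips. The second row group then starts at bit 3 of byte 0, which is the kind of mid-byte offset the root cause says the row group index has to record.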
Suggested approach:
…#7261)
Issue #6763
Authors:
- Vukasin Milovanovic (@vuule)
Approvers:
- Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
- @nvdbaranec
- GALI PREM SAGAR (@galipremsagar)
- Keith Kraus (@kkraus14)
URL: #7261
Issue #6763
Cleanup of the code surrounding the column data encode in the ORC writer:
1. Add a 2D version of `hostdevice_vector` (single allocation);
2. Add 2D versions of `host_span` and `device_span`;
3. Add implicit conversions from `hostdevice_vector` to `host_span` and `device_span`;
4. Use the new types to represent collections that currently use flattened `hostdevice_vector`s;
5. Separate a part of `EncChunk` into its own class, `encoder_chunk_streams`, as this is the only part used after data encode;
6. Add `orc_streams` to represent per-column streams and compute offsets;
7. Partial `writer_impl.cu` code "modernization";
8. Remove redundant size parameters (since the 2D span and 2D vector hold the size info);
9. Use `device_uvector` instead of `device_vector`.
Authors:
- Vukasin Milovanovic (@vuule)
Approvers:
- Jake Hemstad (@jrhemstad)
- Kumar Aatish (@kaatish)
- Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
URL: #7324
When a large dataframe with a bool column is written using the cuIO ORC writer, reading the file back with pyarrow does not match the input dataframe. Reading it back with cudf's ORC reader does match.
Note that the mismatch does not occur when the input contains no nulls.