[BUG] Bools written by cuIO ORC writer don't match when read by pyarrow/pyorc #6763

devavret · 2020-11-13T08:17:43Z

When writing a large dataframe with bool column using cuIO ORC writer, the result of reading the file back using pyarrow does not match the input dataframe. However when reading back from cudf's ORC reader it matches.

import pandas as pd
import numpy as np
import pyarrow as pa
import pyorc
import cudf

np.random.seed(0)
from cudf._lib.null_mask import bitmask_allocation_size_bytes

def random_bitmask(size):
    sz = bitmask_allocation_size_bytes(size)
    data = np.random.randint(0, 255, dtype="u1", size=sz)
    return data.view("i1")

size = 6000000
arr = np.random.randint(low=0, high=2, size=size).astype(np.bool)
s = cudf.Series.from_masked_array(arr, random_bitmask(size))
gdf = cudf.DataFrame({"col_bool": s})

# write with cuIO
fname = "brokenbool.orc"
gdf.to_orc(fname)

# read with pyarrow
pdf = pa.orc.ORCFile(fname).read().to_pandas()

# the sum doesn't match
print(gdf.col_bool.sum(), pdf.col_bool.sum())

# read with pyorc
file = open(fname, 'rb')
data = pyorc.Reader(file).read()
pdf = pd.DataFrame(data, columns=["col_bool"])

# sum matches pyarrow but not original df
print(gdf.col_bool.sum(), pdf.col_bool.sum())

# reading with cuIO gives the correct result
print(gdf.col_bool.sum(), cudf.read_orc(fname).col_bool.sum())

Note that this doesn't occur when there are no nulls in the input.

vuule · 2020-11-18T00:00:58Z

Based on the offline discussion with @rgsl888prabhu , this is potentially a Pyarrow issue, as they don't handle the way our writer splits boolean data streams into stripes.
Keeping the priority for now.

rgsl888prabhu · 2020-11-18T00:36:50Z

For the reference, https://issues.apache.org/jira/browse/ARROW-10635

And assumption in cudf ORC writer

cudf/cpp/src/io/orc/orc.cpp

Line 210 in 01b8b5c

putb(0); // bit position in byte, always zero

vuule · 2021-01-21T20:10:53Z

We got confirmation that the issue also repros with Spark reader, so treating this as a cuIO bug (not Pyarrow bug).

rgsl888prabhu · 2021-01-23T00:50:16Z

bool_pq.zip

code to reproduce

import cudf
import pandas as pd
df = cudf.read_parquet("bool_pq.parquet")
df.to_orc("broken_bool.orc")
pdf = pd.read_orc("broken_bool.orc")
gdf = cudf.read_orc("broken_bool.orc")
# test pandas and cudf orc read
pdf.dropna()[gdf.dropna().to_pandas()['col_bool'] != pdf.dropna()['col_bool']]
# Compare parquet and orc cudf read
gdf.dropna()[df.dropna().to_pandas()['col_bool'] != gdf.dropna().to_pandas()['col_bool']]

vuule · 2021-01-26T21:59:33Z

Root cause:
ORC encodes bools as bits, where null values are omitted from the data stream. Row groups have 10k elements, so when there are no nulls they fill 1250 bytes completely.
When nulls are present, the last byte might be incomplete. Other readers (pyarrow, Spark) expect all bits in encoded column to be valid (with the exception of last byte in the stripe).

Thus, we need to encode bool values from the next row group into the incomplete byte and set the next row group starting offset to the correct bit within the data encoded as part of the current row group. This offsets the encoding of the next row groups and the effect ripples over the entire stripe. Significant changes are needed to the current implementation to be able to support this.

vuule · 2021-01-26T23:30:09Z

Suggested approach:

Gather row ranges that need to be encoded together (new step/code):
1. Gather null counts per row group.
2. Apply thrust::inclusive_scan on the null counts.
3. Modulo 8 each element to get the number of bits that this row group borrow from the next one (as part of inclusive_scan?).
4. Find the row number of the last row each row group encoder needs to encode (might be larger than the number of borrowed bits, as nulls are not encoded in the data stream) - thread per row group.
Encode kernel changes:
1. Pass the information about bool row ranges to the encode kernel.
2. Encode the bool column rows in the range, instead of rows in the row group.
3. Save the location of the last batch of encoded (Byte RLE) values .
Row group offset computation: use information about last encoded batch and the number of borrowed bits.

@vuule

…#7261) Issue #6763 Authors: - Vukasin Milovanovic (@vuule) Approvers: - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) - @nvdbaranec - GALI PREM SAGAR (@galipremsagar) - Keith Kraus (@kkraus14) URL: #7261

github-actions · 2021-02-26T00:40:07Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@vuule

Issue #6763 Clean up of the code surrounding the column data encode in the ORC writer: 1. Add a 2D version of `hostdevice_vector` (single allocation); 2. Add 2D versions of `host_span` and `device_span`; 3. Add implicit conversions from `hostdevice_vector` to `host_span` and `device_span`. 4. Use the new types to represent collections that currently use flattened `hostdevice_vectors`; 5. Separated a part of `EncChunk` into a separate class, `encoder_chunk_streams`, as this is the only part used after data encode; 6. Add `orc_streams` to represent per-column streams and compute offsets. 7. Partial `writer_impl.cu` code "modernization". 8. Removed redundant size parameters (since 2dspan and 2dvector hold the size info). 9. use `device_uvector` instead of `device_vector`. Authors: - Vukasin Milovanovic (@vuule) Approvers: - Jake Hemstad (@jrhemstad) - Kumar Aatish (@kaatish) - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) URL: #7324

@vuule

Issue rapidsai#6763 Clean up of the code surrounding the column data encode in the ORC writer: 1. Add a 2D version of `hostdevice_vector` (single allocation); 2. Add 2D versions of `host_span` and `device_span`; 3. Add implicit conversions from `hostdevice_vector` to `host_span` and `device_span`. 4. Use the new types to represent collections that currently use flattened `hostdevice_vectors`; 5. Separated a part of `EncChunk` into a separate class, `encoder_chunk_streams`, as this is the only part used after data encode; 6. Add `orc_streams` to represent per-column streams and compute offsets. 7. Partial `writer_impl.cu` code "modernization". 8. Removed redundant size parameters (since 2dspan and 2dvector hold the size info). 9. use `device_uvector` instead of `device_vector`. Authors: - Vukasin Milovanovic (@vuule) Approvers: - Jake Hemstad (@jrhemstad) - Kumar Aatish (@kaatish) - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) URL: rapidsai#7324

devavret added bug Something isn't working Needs Triage Need team to review and classify labels Nov 13, 2020

devavret added the cuIO cuIO issue label Nov 13, 2020

devavret added a commit to devavret/cudf that referenced this issue Nov 13, 2020

Added bool test minus bool with nulls test rapidsai#6763

a4177b3

rgsl888prabhu self-assigned this Nov 13, 2020

kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Nov 17, 2020

vuule self-assigned this Jan 21, 2021

vuule mentioned this issue Jan 30, 2021

Throw if bool column would cause incorrect result when writing to ORC #7261

Merged

vuule mentioned this issue Feb 5, 2021

Partial clean up of ORC writer #7324

Merged

github-actions bot added the inactive-30d label Feb 26, 2021

vuule unassigned rgsl888prabhu Oct 28, 2021

GregoryKimball added this to the ORC continuous improvement milestone Nov 20, 2022

GregoryKimball removed the inactive-90d label Apr 6, 2023

GregoryKimball added this to libcudf Jul 12, 2023

GregoryKimball removed the inactive-30d label Jul 12, 2023

GregoryKimball removed this from libcudf Oct 26, 2023

bdice mentioned this issue Feb 7, 2024

Reduce execution time of Python ORC tests #14776

Merged

3 tasks

jlowe mentioned this issue Nov 19, 2024

[BUG] GPU file writes only test writing a single row group or stripe NVIDIA/spark-rapids#11735

Open

tgravescs mentioned this issue Nov 19, 2024

[BUG] Orc writes don't fully support Booleans with nulls NVIDIA/spark-rapids#11736

Closed

GregoryKimball added this to libcudf Nov 19, 2024

GregoryKimball moved this to Needs owner in libcudf Nov 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Bools written by cuIO ORC writer don't match when read by pyarrow/pyorc #6763

[BUG] Bools written by cuIO ORC writer don't match when read by pyarrow/pyorc #6763

devavret commented Nov 13, 2020 •

edited by rgsl888prabhu

Loading

vuule commented Nov 18, 2020

rgsl888prabhu commented Nov 18, 2020 •

edited

Loading

vuule commented Jan 21, 2021

rgsl888prabhu commented Jan 23, 2021

vuule commented Jan 26, 2021

vuule commented Jan 26, 2021

github-actions bot commented Feb 26, 2021

[BUG] Bools written by cuIO ORC writer don't match when read by pyarrow/pyorc #6763

[BUG] Bools written by cuIO ORC writer don't match when read by pyarrow/pyorc #6763

Comments

devavret commented Nov 13, 2020 • edited by rgsl888prabhu Loading

vuule commented Nov 18, 2020

rgsl888prabhu commented Nov 18, 2020 • edited Loading

vuule commented Jan 21, 2021

rgsl888prabhu commented Jan 23, 2021

vuule commented Jan 26, 2021

vuule commented Jan 26, 2021

github-actions bot commented Feb 26, 2021

devavret commented Nov 13, 2020 •

edited by rgsl888prabhu

Loading

rgsl888prabhu commented Nov 18, 2020 •

edited

Loading