Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Bools written by cuIO ORC writer don't match when read by pyarrow/pyorc #6763

Open
devavret opened this issue Nov 13, 2020 · 8 comments
Assignees
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@devavret
Copy link
Contributor

devavret commented Nov 13, 2020

When writing a large dataframe with bool column using cuIO ORC writer, the result of reading the file back using pyarrow does not match the input dataframe. However when reading back from cudf's ORC reader it matches.

import pandas as pd
import numpy as np
import pyarrow as pa
import pyorc
import cudf

np.random.seed(0)
from cudf._lib.null_mask import bitmask_allocation_size_bytes

def random_bitmask(size):
    sz = bitmask_allocation_size_bytes(size)
    data = np.random.randint(0, 255, dtype="u1", size=sz)
    return data.view("i1")

size = 6000000
arr = np.random.randint(low=0, high=2, size=size).astype(np.bool)
s = cudf.Series.from_masked_array(arr, random_bitmask(size))
gdf = cudf.DataFrame({"col_bool": s})

# write with cuIO
fname = "brokenbool.orc"
gdf.to_orc(fname)

# read with pyarrow
pdf = pa.orc.ORCFile(fname).read().to_pandas()

# the sum doesn't match
print(gdf.col_bool.sum(), pdf.col_bool.sum())

# read with pyorc
file = open(fname, 'rb')
data = pyorc.Reader(file).read()
pdf = pd.DataFrame(data, columns=["col_bool"])

# sum matches pyarrow but not original df
print(gdf.col_bool.sum(), pdf.col_bool.sum())

# reading with cuIO gives the correct result
print(gdf.col_bool.sum(), cudf.read_orc(fname).col_bool.sum())

Note that this doesn't occur when there are no nulls in the input.

@devavret devavret added bug Something isn't working Needs Triage Need team to review and classify labels Nov 13, 2020
@devavret devavret added the cuIO cuIO issue label Nov 13, 2020
devavret added a commit to devavret/cudf that referenced this issue Nov 13, 2020
@rgsl888prabhu rgsl888prabhu self-assigned this Nov 13, 2020
@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Nov 17, 2020
@vuule
Copy link
Contributor

vuule commented Nov 18, 2020

Based on the offline discussion with @rgsl888prabhu , this is potentially a Pyarrow issue, as they don't handle the way our writer splits boolean data streams into stripes.
Keeping the priority for now.

@rgsl888prabhu
Copy link
Contributor

rgsl888prabhu commented Nov 18, 2020

For the reference, https://issues.apache.org/jira/browse/ARROW-10635

And assumption in cudf ORC writer

putb(0); // bit position in byte, always zero

@vuule vuule self-assigned this Jan 21, 2021
@vuule
Copy link
Contributor

vuule commented Jan 21, 2021

We got confirmation that the issue also repros with Spark reader, so treating this as a cuIO bug (not Pyarrow bug).

@rgsl888prabhu
Copy link
Contributor

bool_pq.zip

code to reproduce

import cudf
import pandas as pd
df = cudf.read_parquet("bool_pq.parquet")
df.to_orc("broken_bool.orc")
pdf = pd.read_orc("broken_bool.orc")
gdf = cudf.read_orc("broken_bool.orc")
# test pandas and cudf orc read
pdf.dropna()[gdf.dropna().to_pandas()['col_bool'] != pdf.dropna()['col_bool']]
# Compare parquet and orc cudf read
gdf.dropna()[df.dropna().to_pandas()['col_bool'] != gdf.dropna().to_pandas()['col_bool']]

@vuule
Copy link
Contributor

vuule commented Jan 26, 2021

Root cause:
ORC encodes bools as bits, where null values are omitted from the data stream. Row groups have 10k elements, so when there are no nulls they fill 1250 bytes completely.
When nulls are present, the last byte might be incomplete. Other readers (pyarrow, Spark) expect all bits in encoded column to be valid (with the exception of last byte in the stripe).

Thus, we need to encode bool values from the next row group into the incomplete byte and set the next row group starting offset to the correct bit within the data encoded as part of the current row group. This offsets the encoding of the next row groups and the effect ripples over the entire stripe. Significant changes are needed to the current implementation to be able to support this.

@vuule
Copy link
Contributor

vuule commented Jan 26, 2021

Suggested approach:

  1. Gather row ranges that need to be encoded together (new step/code):
    1. Gather null counts per row group.
    2. Apply thrust::inclusive_scan on the null counts.
    3. Modulo 8 each element to get the number of bits that this row group borrow from the next one (as part of inclusive_scan?).
    4. Find the row number of the last row each row group encoder needs to encode (might be larger than the number of borrowed bits, as nulls are not encoded in the data stream) - thread per row group.
  2. Encode kernel changes:
    1. Pass the information about bool row ranges to the encode kernel.
    2. Encode the bool column rows in the range, instead of rows in the row group.
    3. Save the location of the last batch of encoded (Byte RLE) values .
  3. Row group offset computation: use information about last encoded batch and the number of borrowed bits.

rapids-bot bot pushed a commit that referenced this issue Feb 4, 2021
…#7261)

Issue #6763

Authors:
  - Vukasin Milovanovic (@vuule)

Approvers:
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
  - @nvdbaranec
  - GALI PREM SAGAR (@galipremsagar)
  - Keith Kraus (@kkraus14)

URL: #7261
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Mar 4, 2021
Issue #6763

Clean up of the code surrounding the column data encode in the ORC writer:
1. Add a 2D version of `hostdevice_vector` (single allocation);
2. Add 2D versions of `host_span` and `device_span`;
3. Add implicit conversions from `hostdevice_vector` to `host_span` and `device_span`.
4. Use the new types to represent collections that currently use flattened `hostdevice_vectors`;
5. Separated a part of `EncChunk` into a separate class, `encoder_chunk_streams`, as this is the only part used after data encode;
6. Add `orc_streams` to represent per-column streams and compute offsets.
7. Partial `writer_impl.cu` code "modernization".
8. Removed redundant size parameters (since 2dspan and 2dvector hold the size info).
9. use `device_uvector` instead of `device_vector`.

Authors:
  - Vukasin Milovanovic (@vuule)

Approvers:
  - Jake Hemstad (@jrhemstad)
  - Kumar Aatish (@kaatish)
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)

URL: #7324
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this issue Mar 25, 2021
Issue rapidsai#6763

Clean up of the code surrounding the column data encode in the ORC writer:
1. Add a 2D version of `hostdevice_vector` (single allocation);
2. Add 2D versions of `host_span` and `device_span`;
3. Add implicit conversions from `hostdevice_vector` to `host_span` and `device_span`.
4. Use the new types to represent collections that currently use flattened `hostdevice_vectors`;
5. Separated a part of `EncChunk` into a separate class, `encoder_chunk_streams`, as this is the only part used after data encode;
6. Add `orc_streams` to represent per-column streams and compute offsets.
7. Partial `writer_impl.cu` code "modernization".
8. Removed redundant size parameters (since 2dspan and 2dvector hold the size info).
9. use `device_uvector` instead of `device_vector`.

Authors:
  - Vukasin Milovanovic (@vuule)

Approvers:
  - Jake Hemstad (@jrhemstad)
  - Kumar Aatish (@kaatish)
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)

URL: rapidsai#7324
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: Needs owner
Development

No branches or pull requests

5 participants