[BUG] Chunked parquet reader incorrect results for positive values of n_rows #17311

Closed
brandon-b-miller opened this issue Nov 13, 2024 · 2 comments · Fixed by #17321
Labels: bug (Something isn't working), cuIO (cuIO issue)

Comments

@brandon-b-miller
Contributor

With 24.12, the following script reproduces the problem:

import pylibcudf as plc
import pyarrow as pa
import pyarrow.parquet as pq


data = {
    "a": [1, 2, 3, None, 4, 5],
    "b": ["ẅ", "x", "y", "z", "123", "abcd"],
    "c": [None, None, 4, 5, -1, 0],
}


path = "./test.parquet"
pq.write_table(pa.Table.from_pydict(data), path)

reader = plc.io.parquet.ChunkedParquetReader(
    plc.io.SourceInfo([path]),
    columns=["a", "b", "c"],
    nrows=2,
    skip_rows=0,
    chunk_read_limit=0,
    pass_read_limit=17179869184,  # 16 GiB
)

# Read data by chunk
chk = reader.read_chunk()
tbl = chk.tbl
names = chk.column_names()
concatenated_columns = tbl.columns()
while reader.has_next():
    tbl = reader.read_chunk().tbl

    for i in range(tbl.num_columns()):
        concatenated_columns[i] = plc.concatenate.concatenate(
            [concatenated_columns[i], tbl._columns[i]]
        )
        # Drop residual columns to save memory
        tbl._columns[i] = None

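# Build the result from the concatenated columns so that every chunk is included
gpu_result = plc.interop.to_arrow(plc.Table(concatenated_columns))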
cpu_result = pq.read_table(path)[:2]

print(cpu_result.column(1).to_pylist())
print(gpu_result.column(1).to_pylist())

This prints the CPU result followed by the GPU result; the second GPU string carries trailing null bytes:

['ẅ', 'x']
['ẅ', 'x\x00\x00\x00\x00\x00\x00\x00\x00\x00']
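The mismatch above looks like trailing NUL padding: the second GPU string is the expected value followed by a run of `\x00` characters. Below is a minimal sketch of a check that flags this pattern. It uses plain Python lists copied from the output above rather than calling the reader, and the helper name `find_nul_padded` is hypothetical (it is not part of pylibcudf or pyarrow).

```python
def find_nul_padded(expected, actual):
    """Return (index, pad_len) for entries where `actual` is `expected` plus trailing NULs."""
    padded = []
    for i, (exp, act) in enumerate(zip(expected, actual)):
        # Nulls and exact matches are not padding mismatches
        if exp is None or act is None or exp == act:
            continue
        # Flag only values whose extra tail consists purely of NUL characters
        if act.startswith(exp) and set(act[len(exp):]) == {"\x00"}:
            padded.append((i, len(act) - len(exp)))
    return padded


# Values copied from the printed CPU and GPU results in this report
cpu = ['ẅ', 'x']
gpu = ['ẅ', 'x\x00\x00\x00\x00\x00\x00\x00\x00\x00']

report = find_nul_padded(cpu, gpu)
print(report)
```

A check like this could be pointed at `cpu_result.column(1).to_pylist()` and `gpu_result.column(1).to_pylist()` from the reproducer to tell padding apart from other kinds of mismatch.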
brandon-b-miller added the bug and cuIO labels on Nov 13, 2024
@vyasr
Contributor

vyasr commented Nov 14, 2024

Based on this comment could this be related to #16186 @mhaseeb123?

@mhaseeb123
Member

mhaseeb123 commented Nov 14, 2024

> Based on this comment could this be related to #16186 @mhaseeb123?

Nah, as discussed offline, this one is related to the decoder while #16186 is related to chunking.
