[BUG] Chunked parquet reader incorrect results for positive values of `n_rows` #17311

brandon-b-miller · 2024-11-13T16:47:39Z

With 24.12, I have:

import pylibcudf as plc
import pyarrow as pa
import pyarrow.parquet as pq


data = {
    "a": [1, 2, 3, None, 4, 5],
    "b": ["ẅ", "x", "y", "z", "123", "abcd"],
    "c": [None, None, 4, 5, -1, 0],
}


path = "./test.parquet"
pq.write_table(pa.Table.from_pydict(data), path)

reader = plc.io.parquet.ChunkedParquetReader(
    plc.io.SourceInfo([path]),
    columns=['a', 'b', 'c'],
    nrows=2,
    skip_rows=0,
    chunk_read_limit=0,
    pass_read_limit=17179869184,  # 16 GiB
)

# Read data by chunk
chk = reader.read_chunk()
tbl = chk.tbl
names = chk.column_names()
concatenated_columns = tbl.columns()
while reader.has_next():
    tbl = reader.read_chunk().tbl

    for i in range(tbl.num_columns()):
        concatenated_columns[i] = plc.concatenate.concatenate(
            [concatenated_columns[i], tbl._columns[i]]
        )
        # Drop residual columns to save memory
        tbl._columns[i] = None

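# Convert the concatenated columns; after the loop, tbl would only hold the last chunk
gpu_result = plc.interop.to_arrow(plc.Table(concatenated_columns))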
cpu_result = pq.read_table(path)[:2]

print(cpu_result.column(1).to_pylist())
print(gpu_result.column(1).to_pylist())

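This prints the following (pyarrow result first, pylibcudf result second); the second string in the pylibcudf result has trailing NUL characters: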

['ẅ', 'x']
['ẅ', 'x\x00\x00\x00\x00\x00\x00\x00\x00\x00']
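To make the mismatch easier to spot in larger outputs, here is a minimal sketch of a check that flags rows where the GPU string equals the CPU string plus trailing NUL characters. The helper name `find_nul_padded` is hypothetical and not part of pylibcudf or pyarrow; it operates only on plain Python lists such as the ones returned by `to_pylist()` above.

```python
def find_nul_padded(expected, actual):
    """Return (row, expected, actual) for rows where `actual` is
    `expected` followed by one or more trailing NUL characters.

    Hypothetical helper for illustration; not a pylibcudf or pyarrow API.
    """
    padded = []
    for row, (exp, act) in enumerate(zip(expected, actual)):
        if exp is None or act is None:
            continue  # nulls cannot be NUL-padded strings
        if exp != act and act.rstrip("\x00") == exp:
            padded.append((row, exp, act))
    return padded


# The two lists printed in the report above
cpu = ['ẅ', 'x']
gpu = ['ẅ', 'x\x00\x00\x00\x00\x00\x00\x00\x00\x00']

padded = find_nul_padded(cpu, gpu)
for row, exp, act in padded:
    print(f"row {row}: {len(act) - len(exp)} trailing NUL characters")
```

In the reproducer this could be called as `find_nul_padded(cpu_result.column(1).to_pylist(), gpu_result.column(1).to_pylist())`.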


vyasr · 2024-11-14T00:26:57Z

Based on this comment could this be related to #16186 @mhaseeb123?

mhaseeb123 · 2024-11-14T04:19:30Z

> Based on this comment could this be related to #16186 @mhaseeb123?

Nah, as discussed offline, this one is related to the decoder while #16186 is related to chunking.

brandon-b-miller added the bug (Something isn't working) and cuIO (cuIO issue) labels Nov 13, 2024

mhaseeb123 added this to the Parquet continuous improvement milestone Nov 13, 2024

GregoryKimball assigned mhaseeb123 Nov 13, 2024

mhaseeb123 mentioned this issue Nov 14, 2024 in: Fix reading Parquet string cols when nrows and input_pass_limit > 0 #17321 (merged)

rapids-bot bot closed this as completed in #17321 Nov 18, 2024

rapids-bot bot closed this as completed in 43f2f68 Nov 18, 2024
