[BUG] Unable to read parquet with cuDF which are written with ParquetWriter #9216

Closed
bschifferer opened this issue Sep 10, 2021 · 4 comments · Fixed by #10000
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@bschifferer

Describe the bug
I write multiple dataframes with ParquetWriter to a single parquet file. The reason is that I convert tfrecords to parquet files, and the tfrecords can be too large to load into memory all at once. Therefore, I need to write them to the parquet file in chunks.
When I try to load the parquet file with cuDF, I get following error:

ValueError: Length mismatch: Expected axis has 20 elements, new values have 5 elements

However, I am able to read the same parquet file with pandas.
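As a stopgap, one workaround (a sketch, not from the original report) is to read the file with pandas and convert the result back to cuDF:

```python
import cudf
import pandas as pd

# pandas reads all row groups of the file correctly, so round-trip
# through pandas and convert back to a cuDF DataFrame on the GPU.
pdf = pd.read_parquet("test1.parquet")
gdf = cudf.DataFrame.from_pandas(pdf)
```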

Steps/Code to reproduce bug
cudf version: '21.06.01+0.g101fc0fda4.dirty'

import cudf

from cudf.io.parquet import ParquetWriter

df = cudf.DataFrame({'col1': [0,1,2,3,4]})

w = ParquetWriter('test1.parquet')
w.write_table(df)
w.write_table(df)
w.write_table(df)
w.write_table(df)
w.close()

df1 = cudf.read_parquet('test1.parquet')
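To confirm that the file itself contains every written row (a diagnostic sketch, not part of the original report), the parquet footer can be inspected with pyarrow:

```python
import pyarrow.parquet as pq

# Each write_table() call appears to produce one row group; with four
# calls of five rows each, the footer should report 4 row groups and
# 20 rows, matching the "Expected axis has 20 elements" error above.
md = pq.ParquetFile("test1.parquet").metadata
print(md.num_row_groups, md.num_rows)  # 4 20
```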
@bschifferer bschifferer added Needs Triage Need team to review and classify bug Something isn't working labels Sep 10, 2021
@benfred
Member

benfred commented Sep 10, 2021

I also see this on cudf v21.08.02

@beckernick
Member

beckernick commented Sep 15, 2021

I am able to read the file, but we only get 5 records in the index (we expect 10, i.e. 5 per call to write_table):

import cudf
import pandas as pd
from cudf.io.parquet import ParquetWriter

df = cudf.DataFrame({'col1': [0,1,2,3,4]})

w = ParquetWriter('test1.parquet')
w.write_table(df)
w.write_table(df)
w.close()

df1 = cudf.read_parquet('test1.parquet')
len(df1), len(pd.read_parquet("test1.parquet"))
(5, 10)

The underlying issue appears to be that we construct a RangeIndex from 0 to 5 rather than 0 to 10, even though the full buffer of data is present.

df1 = cudf.read_parquet('test1.parquet')
df1.col1._column
<cudf.core.column.numerical.NumericalColumn object at 0x7fc0010367c0>
[
  0,
  1,
  2,
  3,
  4,
  0,
  1,
  2,
  3,
  4
]
dtype: int64
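Since the full buffer is present, the data can be recovered by rebuilding a Series from the underlying column (a sketch relying on the private `_column` attribute, so illustration only, not a supported API):

```python
import cudf

df1 = cudf.read_parquet("test1.parquet")

# _column is private; it holds all 10 values even though the frame's
# RangeIndex was truncated to 5. Wrapping it in a fresh Series
# rebuilds a default index of the correct length.
full = cudf.Series(df1.col1._column)
print(len(full))  # 10
```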

@beckernick beckernick added Cython libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Sep 15, 2021
@devavret
Contributor

Seems to be a duplicate of #7011

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Jan 14, 2022
Chunked writer (`class ParquetWriter`) now takes an argument `partition_cols`. For each call to `write_table(df)`, the `df` is partitioned and the parts are appended to the same corresponding file in the dataset directory. This can be used when partitioning is desired but one wants to avoid creating many small files in each subdirectory. For example, instead of repeated calls to `write_to_dataset` like so:
```python
write_to_dataset(df1, root_path, partition_cols=['group'])
write_to_dataset(df2, root_path, partition_cols=['group'])
...
```
which will yield the following structure
```
root_dir/
  group=value1/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  group=value2/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  ...
```
One can write with
```python
pw = ParquetWriter(root_path, partition_cols=['group'])
pw.write_table(df1)
pw.write_table(df2)
pw.close()
```
to get the structure
```
root_dir/
  group=value1/
    <uuid1>.parquet
  group=value2/
    <uuid1>.parquet
  ...
```
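For completeness (not part of the PR description), reading the partitioned dataset back should reconstruct the partition keys as a column; a sketch assuming `cudf.read_parquet` accepts a dataset directory, as with pyarrow-style datasets:

```python
import cudf

# The 'group' values are encoded in the directory names
# (group=value1/, group=value2/, ...) and are recovered as a
# regular column when the whole dataset directory is read.
df = cudf.read_parquet("root_dir")
print(df["group"].unique())
```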

Closes #7196
Also fixes, via this workaround:
fixes #9216
fixes #7011

TODO:

- [x] Tests

Authors:
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Ashwin Srinath (https://github.com/shwina)

URL: #10000