[BUG] Unable to read parquet with cuDF which are written with ParquetWriter #9216

Closed
bschifferer opened this issue Sep 10, 2021 · 4 comments · Fixed by #10000
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@bschifferer

Describe the bug
I write multiple dataframes with ParquetWriter to a single parquet file. The reason is that I convert tfrecords to parquet files, and the tfrecords can be too large to load into memory all at once. Therefore, I need to write them to the parquet file in chunks.
When I try to load the parquet file with cuDF, I get following error:

ValueError: Length mismatch: Expected axis has 20 elements, new values have 5 elements

However, I am able to read the same parquet file with pandas.
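As a stopgap, one workaround (a sketch, not from the original report) is to read the file with pandas and convert the result back to cuDF:

```python
import cudf
import pandas as pd

# pandas reads all row groups of the file correctly, so round-trip
# through pandas and convert back to a cuDF DataFrame on the GPU.
pdf = pd.read_parquet("test1.parquet")
gdf = cudf.DataFrame.from_pandas(pdf)
```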

Steps/Code to reproduce bug
cudf version: '21.06.01+0.g101fc0fda4.dirty'

import cudf

from cudf.io.parquet import ParquetWriter

df = cudf.DataFrame({'col1': [0,1,2,3,4]})

w = ParquetWriter('test1.parquet')
w.write_table(df)
w.write_table(df)
w.write_table(df)
w.write_table(df)
w.close()

df1 = cudf.read_parquet('test1.parquet')
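To confirm that the file itself contains every written row (a diagnostic sketch, not part of the original report), the parquet footer can be inspected with pyarrow:

```python
import pyarrow.parquet as pq

# Each write_table() call appears to produce one row group; with four
# calls of five rows each, the footer should report 4 row groups and
# 20 rows, matching the "Expected axis has 20 elements" error above.
md = pq.ParquetFile("test1.parquet").metadata
print(md.num_row_groups, md.num_rows)  # 4 20
```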
@bschifferer bschifferer added Needs Triage Need team to review and classify bug Something isn't working labels Sep 10, 2021
@benfred
Member

benfred commented Sep 10, 2021

I also see this on cudf v21.08.02

@beckernick
Member

beckernick commented Sep 15, 2021

I am able to read the file, but we only get 5 records in the index (we expect 10, i.e. 5 per call to write_table):

import cudf
import pandas as pd
from cudf.io.parquet import ParquetWriter

df = cudf.DataFrame({'col1': [0,1,2,3,4]})

w = ParquetWriter('test1.parquet')
w.write_table(df)
w.write_table(df)
w.close()

df1 = cudf.read_parquet('test1.parquet')
len(df1), len(pd.read_parquet("test1.parquet"))
(5, 10)

The underlying issue appears to be that we construct a RangeIndex from 0 to 5 rather than 0 to 10, even though the full buffer of data is present.

df1 = cudf.read_parquet('test1.parquet')
df1.col1._column
<cudf.core.column.numerical.NumericalColumn object at 0x7fc0010367c0>
[
  0,
  1,
  2,
  3,
  4,
  0,
  1,
  2,
  3,
  4
]
dtype: int64
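Since the full buffer is present, the data can be recovered by rebuilding a Series from the underlying column (a sketch relying on the private `_column` attribute, so illustration only, not a supported API):

```python
import cudf

df1 = cudf.read_parquet("test1.parquet")

# _column is private; it holds all 10 values even though the frame's
# RangeIndex was truncated to 5. Wrapping it in a fresh Series
# rebuilds a default index of the correct length.
full = cudf.Series(df1.col1._column)
print(len(full))  # 10
```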

@beckernick beckernick added Cython libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Sep 15, 2021
@devavret
Contributor

Seems to be a duplicate of #7011

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Jan 14, 2022
Chunked writer (`class ParquetWriter`) now takes an argument `partition_cols`. For each call to `write_table(df)`, the `df` is partitioned and the parts are appended to the same corresponding file in the dataset directory. This can be used when partitioning is desired but one wants to avoid creating many small files in each subdirectory. For example, instead of repeated calls to `write_to_dataset` like so:
```python
write_to_dataset(df1, root_path, partition_cols=['group'])
write_to_dataset(df2, root_path, partition_cols=['group'])
...
```
which will yield the following structure
```
root_dir/
  group=value1/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  group=value2/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  ...
```
One can write with
```python
pw = ParquetWriter(root_path, partition_cols=['group'])
pw.write_table(df1)
pw.write_table(df2)
pw.close()
```
to get the structure
```
root_dir/
  group=value1/
    <uuid1>.parquet
  group=value2/
    <uuid1>.parquet
  ...
```
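For completeness (not part of the PR description), reading the partitioned dataset back should reconstruct the partition keys as a column; a sketch assuming `cudf.read_parquet` accepts a dataset directory, as with pyarrow-style datasets:

```python
import cudf

# The 'group' values are encoded in the directory names
# (group=value1/, group=value2/, ...) and are recovered as a
# regular column when the whole dataset directory is read.
df = cudf.read_parquet("root_dir")
print(df["group"].unique())
```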

Closes #7196
Also fixes, via this workaround:
fixes #9216
fixes #7011

TODO:

- [x] Tests

Authors:
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Ashwin Srinath (https://github.com/shwina)

URL: #10000