[BUG] Unable to read parquet files with cuDF that were written with ParquetWriter #9216
Comments
I also see this on cudf v21.08.02.
I am able to read the file, but we only get 5 records in our index (we expect 10 -- 5 per call to `write_table`):

```python
import cudf
import pandas as pd
from cudf.io.parquet import ParquetWriter

df = cudf.DataFrame({'col1': [0, 1, 2, 3, 4]})
w = ParquetWriter('test1' + ".parquet")
w.write_table(df)
w.write_table(df)
w.close()

df1 = cudf.read_parquet('test1.parquet')
len(df1), len(pd.read_parquet("test1.parquet"))
```

```
(5, 10)
```

Our underlying issue appears to be that we construct a RangeIndex from 0 to 5, rather than 0 to 10. We have the full buffer of data:

```python
df1 = cudf.read_parquet('test1.parquet')
df1.col1._column
```

```
<cudf.core.column.numerical.NumericalColumn object at 0x7fc0010367c0>
[
  0,
  1,
  2,
  3,
  4,
  0,
  1,
  2,
  3,
  4
]
dtype: int64
```
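For anyone hitting this on an affected release, a minimal workaround sketch (relying on the fact, shown above, that pandas reads back all rows) is to read with pandas and convert back to cuDF:

```python
# Workaround sketch, not an official fix: on affected cudf versions,
# read the chunked file with pandas (which sees all row groups) and
# convert the result back to a cudf DataFrame with a full-length index.
import cudf
import pandas as pd

pdf = pd.read_parquet("test1.parquet")   # returns all 10 rows
gdf = cudf.from_pandas(pdf)              # cudf frame with the expected RangeIndex
assert len(gdf) == 10
```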
Seems to be a duplicate of #7011.
This issue has been labeled.
Chunked writer (`class ParquetWriter`) now takes an argument `partition_cols`. For each call to `write_table(df)`, the `df` is partitioned and the parts are appended to the same corresponding files in the dataset directory. This can be used when partitioning is desired but one wants to avoid making many small files in each sub-directory. For example, instead of repeated calls to `write_to_dataset` like so:

```python
write_to_dataset(df1, root_path, partition_cols=['group'])
write_to_dataset(df2, root_path, partition_cols=['group'])
...
```

which will yield the following structure

```
root_dir/
  group=value1/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  group=value2/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  ...
```

one can write with

```python
pw = ParquetWriter(root_path, partition_cols=['group'])
pw.write_table(df1)
pw.write_table(df2)
pw.close()
```

to get the structure

```
root_dir/
  group=value1/
    <uuid1>.parquet
  group=value2/
    <uuid1>.parquet
  ...
```

Closes #7196

Also workaround fixes:
fixes #9216
fixes #7011

TODO:
- [x] Tests

Authors:
- Devavret Makkar (https://github.com/devavret)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Ashwin Srinath (https://github.com/shwina)

URL: #10000
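A short end-to-end sketch of the flow described above, assuming a cudf release that includes this change (the column names and values are illustrative, not taken from the PR):

```python
import cudf
from cudf.io.parquet import ParquetWriter

df1 = cudf.DataFrame({'group': ['a', 'a', 'b'], 'val': [1, 2, 3]})
df2 = cudf.DataFrame({'group': ['a', 'b', 'b'], 'val': [4, 5, 6]})

# One file per partition value; each write_table call appends to those files
# instead of creating new small files per call.
pw = ParquetWriter('root_dir', partition_cols=['group'])
pw.write_table(df1)
pw.write_table(df2)
pw.close()

# Reading the dataset directory back should return all rows from both chunks.
out = cudf.read_parquet('root_dir')
assert len(out) == 6
```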
Describe the bug
I write multiple dataframes with ParquetWriter to a single parquet file. The reason is that I convert tfrecords to parquet files, and the tfrecords can be too large to load entirely into memory, so I need to write them to the parquet file in chunks.
When I try to load the parquet file with cuDF, I get the following error:
However, I am able to read the same parquet file with pandas.
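For context, a minimal sketch of the chunked conversion pattern described above; the tfrecord-parsing helper is a hypothetical placeholder rather than a real API, and only the ParquetWriter usage mirrors the report:

```python
# Hypothetical sketch of the chunked-conversion flow described above.
# read_tfrecord_batches is a stand-in for real tfrecord parsing, not a real API;
# the relevant part is appending each chunk to one parquet file via ParquetWriter.
import cudf
from cudf.io.parquet import ParquetWriter

def read_tfrecord_batches(path, batch_size=5):
    # Placeholder: yields small dict-of-lists chunks instead of parsing tfrecords.
    for start in range(0, 10, batch_size):
        yield {'col1': list(range(start, start + batch_size))}

writer = ParquetWriter('converted.parquet')
for batch in read_tfrecord_batches('data.tfrecord'):
    gdf = cudf.DataFrame(batch)   # only one chunk is resident in memory at a time
    writer.write_table(gdf)
writer.close()
```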
Steps/Code to reproduce bug
cudf version: '21.06.01+0.g101fc0fda4.dirty'