[BUG] Reading back chunked parquet file fails #7011
Comments
As for the writer, the issue is in how the `ParquetWriter` class handles writing the pandas metadata. It generates the metadata upon the first call to `write_table()`, so the index is fixed as the one used for the first table. This metadata generation could be moved to the `close()` call, because that is when the metadata is actually written out, but it is not obvious how to merge the index across all of the `write_table()` calls. I'll take a look at the reader next.
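A rough sketch of the shape such a change could take (illustrative only; the names and structure are hypothetical, not cudf's actual writer internals): accumulate row counts across chunks and only build the `RangeIndex` description at `close()`.

```python
import json


class DeferredMetadataWriter:
    """Toy stand-in for a chunked writer that defers pandas-metadata creation."""

    def __init__(self):
        self._num_rows = 0  # total rows written across all chunks

    def write_table(self, num_rows):
        # A real writer would encode a row group here; this sketch only
        # tracks how many rows have been written so far.
        self._num_rows += num_rows

    def close(self):
        # Generate the pandas metadata once, at close time, so the RangeIndex
        # covers every chunk instead of only the first one.
        pandas_meta = {
            "index_columns": [
                {"kind": "range", "name": None, "start": 0,
                 "stop": self._num_rows, "step": 1}
            ]
        }
        return json.dumps(pandas_meta)


w = DeferredMetadataWriter()
w.write_table(1000)
w.write_table(500)
print(w.close())  # "stop" is 1500, matching the total number of rows written
```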
It may be fair to leave the index information out of the pandas metadata for the chunked `ParquetWriter` case. It may also make sense to leave `index=False` as the recommended workaround in the meantime.
It's a simple fix to ignore the pandas metadata when `use_pandas_metadata=False` is passed to `read_parquet`.
Fixing this is going to involve moving around the code that generates the pandas metadata and passing the final metadata to the writer state, as detailed in my previous comment.
BTW, I'm not sure the `use_pandas_metadata` kwarg is actually used by the cudf reader today.
Great! An ideal fix would automatically ignore the pandas metadata if that metadata contains a RangeIndex and the size of that index does not agree with the number of rows in the data. However, I would consider the "full" fix a much lower priority than a functional `use_pandas_metadata=False` option.
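As a rough illustration of that heuristic (a hypothetical helper, not actual cudf code), the check could compare the length implied by the RangeIndex entry in the pandas metadata against the number of rows actually read:

```python
def range_index_matches(index_meta, num_rows):
    """Return True if the RangeIndex described by the metadata covers num_rows rows."""
    start = index_meta.get("start", 0)
    stop = index_meta["stop"]
    step = index_meta.get("step", 1)
    # Length of range(start, stop, step), computed by Python's range object.
    return len(range(start, stop, step)) == num_rows


meta = {"kind": "range", "start": 0, "stop": 1000, "step": 1}
print(range_index_matches(meta, 1500))  # False -> ignore metadata, fall back to a default index
print(range_index_matches(meta, 1000))  # True  -> safe to apply the RangeIndex
```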
It is probably rare for a user to actually need a `use_pandas_metadata` kwarg here anyway.
Oops - good point. I thought the pyarrow engine was using this kwarg on the backend, but I was wrong.
While trying to "Fix" this, and while writing tests, I found that pyarrow doesn't work as expected when specifying In [1]: import cudf
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(
...: {
...: "a": range(6, 9),
...: "b": range(3, 6),
...: "c": range(6, 9),
...: "d": ["abc", "def", "xyz"],
...: }
...: )
...:
In [6]: df.set_index(pd.RangeIndex(stop=9, step=3), inplace=True)
In [7]: df
Out[7]:
a b c d
0 6 3 6 abc
3 7 4 7 def
6 8 5 8 xyz
In [8]: df.to_parquet('bleh.parquet')
In [9]: import pyarrow as pa
In [12]: pa.parquet.read_table('bleh.parquet', use_pandas_metadata=False).to_pandas()
Out[12]:
a b c d
0 6 3 6 abc
3 7 4 7 def
6 8 5 8 xyz |
For example: #2748 (comment). (I verified this on my end.) I think we should raise this upstream with pyarrow.
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
@rjzamora Is this issue still relevant?
This bug still exists: if you use `ParquetWriter` to write a parquet file in multiple `write_table` calls, reading the result back with `cudf.read_parquet` still fails unless the writer was created with `index=False`. I wouldn't consider it a priority, since NVTabular works around it. However, it would be great if cudf automatically ignored a problematic `RangeIndex` in the pandas metadata.
Chunked writer (`class ParquetWriter`) now takes an argument `partition_cols`. For each call to `write_table(df)`, the `df` is partitioned and the parts are appended to the same corresponding file in the dataset directory. This can be used when partitioning is desired but one wants to avoid making many small files in each subdirectory. For example, instead of repeated calls to `write_to_dataset` like so:

```python
write_to_dataset(df1, root_path, partition_cols=['group'])
write_to_dataset(df2, root_path, partition_cols=['group'])
...
```

which will yield the following structure:

```
root_dir/
  group=value1/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  group=value2/
    <uuid1>.parquet
    <uuid2>.parquet
    ...
  ...
```

one can write with:

```python
pw = ParquetWriter(root_path, partition_cols=['group'])
pw.write_table(df1)
pw.write_table(df2)
pw.close()
```

to get the structure:

```
root_dir/
  group=value1/
    <uuid1>.parquet
  group=value2/
    <uuid1>.parquet
  ...
```

Closes #7196
Also, as a workaround, fixes #9216 and fixes #7011.

TODO:
- [x] Tests

Authors:
- Devavret Makkar (https://github.com/devavret)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Ashwin Srinath (https://github.com/shwina)

URL: #10000
Describe the bug
Using cudf-0.16, NVTabular uses `cudf.io.parquet.ParquetWriter` to iteratively write out a parquet file to a `BytesIO` object as data is processed, and then later reads back the final "file" into device memory (the intention being to pre-emptively "spill" data to host memory without relying on Dask-CUDA). This approach fails for cudf>=0.17, unless the initial `ParquetWriter` object is initialized with `index=False`.

The main problem seems to be in the final `cudf.read_parquet` call, because everything works fine if that call is replaced with `pd.read_parquet`. However, the parquet "file" does include incomplete index information in the pandas metadata, so there is also a problem in the write phase.

Steps/Code to reproduce bug
Output

Traceback
I believe that part of the problem is that the final `BytesIO` object contains "pandas metadata" that specifies a `RangeIndex` that does not match the number of rows in the parquet file. Within the bytes returned by `bio.getvalue()`, I can see that mismatched `RangeIndex`. This suggests that the pandas metadata was not updated when the second chunk was added to the file, and explains the `Length mismatch` error. However, it is not clear why this metadata would matter if we are specifying both `index=False` and `use_pandas_metadata=False` in `read_parquet`. Note that, even with this "bad" metadata, the pandas version of `read_parquet` returns the correct result.
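For reference, one way to look at the embedded pandas metadata with pyarrow (shown here on a pandas-written buffer for self-containment; the same `b"pandas"` key appears in the buffer produced by the chunked writer):

```python
import io
import json

import pandas as pd
import pyarrow.parquet as pq

bio = io.BytesIO()
pd.DataFrame({"a": range(5)}).to_parquet(bio)
bio.seek(0)

# Key/value metadata from the parquet footer; the "pandas" entry is JSON.
kv_meta = pq.ParquetFile(bio).metadata.metadata
pandas_meta = json.loads(kv_meta[b"pandas"])
print(pandas_meta["index_columns"])  # e.g. a RangeIndex entry with start/stop/step
```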
Expected behavior

I would expect:

- `ParquetWriter` to include complete "pandas metadata"
- `cudf.read_parquet` to ignore the pandas metadata if it is "bad," and especially if `use_pandas_metadata==False`
Environment overview (please complete the following information)
Environment details
Additional context
cc @albert17