[BUG] GPU Parquet output for TIMESTAMP_MICROS is misinterpreted by fastparquet as nanos #8778
Comments
The LogicalType was added more recently and the CUDF parquet writer does not support it. We should be tagging the column with TIMESTAMP_MILLIS or TIMESTAMP_MICROS if it is not in nanoseconds, which is the default. |
@gerashegalov what version of When I try to read the CPU file with fast parquet I get an error
I am running with fastparquet 0.8.1. I also get different results for the GPU file: fastparquet gives me an overflow error
Not sure what is happening here. I could see that the footers are tagged equivalently, but it is clear that fastparquet is taking a different path to parse the GPU file vs the CPU file because the GPU one does not get the error that the CPU does. When I try to read them using pandas I get a very similar error about overflow, but it looks like the CPU version has set the GPU Error:
CPU Error:
|
@revans2 good point, should have listed versions in my venv:
import fastparquet, numpy, pandas
for p in [fastparquet, numpy, pandas]:
    print(f"name={p.__name__} version={p.__version__}\n")
added pip list to https://github.com/gerashegalov/rapids-shell/blob/557a96c450a307a206330410b335d346d3cc4170/src/jupyter/timestamp_micros.ipynb |
I have reproduced the issue and gone through it several times. It appears to be a bug in fastparquet and how they compute a large timestamp from a V1 file. CUDF is still spitting out parquet files with a V1 footer. When I use Spark 3.1.1 to write the file (which also writes them out with a V1 footer) then I get the exact same result. fastparquet thinks it is from 1854.
@gerashegalov do you want me to file an issue against fastparquet? Just FYI CUDF is in the process of going to V2 for writes, eventually. rapidsai/cudf#13501 |
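The overflow errors mentioned above are consistent with a reader converting a microsecond count into a signed 64-bit nanosecond count. The following is a minimal sketch of that arithmetic using only the Python standard library. The helper names are hypothetical, and whether fastparquet follows exactly this code path is not established here; the sketch only shows why microsecond timestamps past the year 2262 cannot be expressed as int64 nanoseconds, and how a silent wraparound would yield an unrelated date.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def micros_to_nanos_checked(micros: int) -> int:
    """Scale microseconds to nanoseconds, refusing values outside int64."""
    nanos = micros * 1000
    if not INT64_MIN <= nanos <= INT64_MAX:
        raise OverflowError(f"{micros} us does not fit in int64 nanoseconds")
    return nanos


def micros_to_nanos_wrapped(micros: int) -> int:
    """Simulate two's-complement int64 wraparound of the same multiplication."""
    return (micros * 1000 + 2**63) % 2**64 - 2**63


# Largest microsecond count whose nanosecond equivalent still fits in int64.
max_safe_micros = INT64_MAX // 1000
latest_safe = EPOCH + timedelta(microseconds=max_safe_micros)
print("latest timestamp representable as int64 nanos:", latest_safe)

# Hypothetical large timestamp (year 3000), stored as microseconds since epoch.
far_future_micros = (datetime(3000, 1, 1, tzinfo=timezone.utc) - EPOCH) // timedelta(microseconds=1)
wrapped_nanos = micros_to_nanos_wrapped(far_future_micros)
print("wrapped nanos decode to:", EPOCH + timedelta(microseconds=wrapped_nanos // 1000))
```

A reader that checks the range raises an overflow error, while one that multiplies without checking ends up with a wrapped value that decodes to a date unrelated to the original.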
yes feel free to file a fastparquet issue @revans2 |
Done. dask/fastparquet#872 |
@sameerz should we document this, or do we just close this issue because it is a bug in fastparquet? |
I am inclined to close this as it is a bug in fastparquet. |
Superseded by the issue dask/fastparquet#872. Thanks @revans2 for investigating. |
Describe the bug
GPU Parquet timestamp is misinterpreted as nanos by fastparquet. Output produced by identical code on the CPU is interpreted correctly by fastparquet.
Steps/Code to reproduce bug
On CPU, Spark and fastparquet are consistent
Out
GPU's output appears corrupt when read by fastparquet
Out
The issue appears to be that, in the GPU case, fastparquet assumes the logical time unit is nanos,
because, unlike the CPU output, the GPU output does not have the logicalType metadata
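To illustrate why the assumed time unit matters, here is a small standard-library sketch that decodes the same raw INT64 value under different units. The function name `decode_int64_timestamp` and the sample date are assumptions made for illustration; this is not fastparquet's or Spark's actual decoding logic.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_int64_timestamp(raw: int, unit: str) -> datetime:
    """Decode a raw INT64 timestamp under an assumed unit (hypothetical helper)."""
    if unit == "millis":
        return EPOCH + timedelta(milliseconds=raw)
    if unit == "micros":
        return EPOCH + timedelta(microseconds=raw)
    if unit == "nanos":
        # datetime only has microsecond resolution, so drop the sub-microsecond part.
        return EPOCH + timedelta(microseconds=raw // 1000)
    raise ValueError(f"unknown unit: {unit}")


# Hypothetical value as a TIMESTAMP_MICROS writer would store it.
original = datetime(2023, 7, 21, tzinfo=timezone.utc)
raw = (original - EPOCH) // timedelta(microseconds=1)

as_micros = decode_int64_timestamp(raw, "micros")
as_nanos = decode_int64_timestamp(raw, "nanos")
print("read as micros:", as_micros)
print("read as nanos: ", as_nanos)
```

The stored integer is identical in both cases; only the unit annotation in the file metadata tells the reader which interpretation is intended.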
Expected behavior
spark-rapids should be interoperable with non-Spark parquet readers, at least with the ones that work with upstream Spark
Environment details (please complete the following information)
spark.sql.parquet.outputTimestampType='TIMESTAMP_MICROS'
Additional context
encountered working on #8625