[BUG] fastparquet compatibility tests fail with data mismatch if TZ is not set and system timezone is not UTC #9776
Labels: bug (Something isn't working)
Comments
jlowe added the `bug` (Something isn't working) and `? - Needs Triage` (Need team to review and classify) labels on Nov 17, 2023
I just ran into the problem with |
mythrocks added a commit to mythrocks/spark-rapids that referenced this issue on Nov 21, 2023:
Fixes NVIDIA#9776.

`fastparquet` seems to read Parquet timestamp columns and interpret them in the UTC timezone, regardless of timestamp settings. The Spark RAPIDS plugin falls back on Apache Spark (CPU) to interpret timestamp columns, when it detects that the timezone is non-UTC. Apache Spark seems to correctly interpret the timestamps based on timezone.

This causes the `fastparquet` timestamp tests to fail in cases where the timezone is unspecified.

This commit xfails the timestamp tests when a non-UTC timezone is detected.

Signed-off-by: MithunR <[email protected]>
mythrocks added a commit to mythrocks/spark-rapids that referenced this issue on Nov 22, 2023:
Fixes NVIDIA#9776.

The tests in `fastparquet_compatibility_test.py` check for compatibility between Apache Spark, the Spark RAPIDS plugin, and fastparquet. In particular:
1. `test_reading_file_written_by_spark_cpu` checks if timestamp columns written with Apache Spark are read similarly with fastparquet and the plugin.
2. `test_reading_file_written_with_gpu` checks if timestamps written with the plugin are read the same on Apache Spark and fastparquet.

If the timezone is not set to "UTC", and the system timezone isn't "UTC" either, the plugin falls back to CPU for read/write of Parquet timestamp columns. This would cause the above tests not to run: the plugin can neither read nor write timestamps on GPU.

Further, fastparquet seems to interpret timestamps written from Spark as being in "UTC", regardless of the timezone settings. So on non-UTC timezones, Apache Spark and fastparquet get different results for the same input.

For the two reasons above, it is best to only run the three-way timestamp comparison tests in setups with "UTC" timezone.

This commit skips the timestamp tests described above, when a non-UTC timezone is detected.

Signed-off-by: MithunR <[email protected]>
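The "skip when a non-UTC timezone is detected" idea from the commit message above could be sketched roughly as follows. This is not the code from the commit: `is_utc_timezone`, `timestamp_skip_reason`, and `UTC_NAMES` are hypothetical names, and the detection rule (check `TZ` first, then the system zone abbreviation) is an assumption about what "detected" might mean.

```python
import os
import time

# Names treated as UTC here; this set is an assumption of the sketch.
UTC_NAMES = {"UTC", "ETC/UTC"}


def is_utc_timezone(env=None, tzname=None):
    """Return True if the effective timezone looks like UTC.

    Hypothetical helper: checks TZ first, then the system zone name.
    """
    env = os.environ if env is None else env
    tz = env.get("TZ")
    if tz:
        # An explicit TZ setting wins over the system default.
        return tz.upper() in UTC_NAMES
    # TZ unset (or empty, which this sketch treats the same way):
    # fall back to the abbreviation of the local standard-time zone.
    names = time.tzname if tzname is None else tzname
    return names[0].upper() in UTC_NAMES


def timestamp_skip_reason(env=None, tzname=None):
    """Return a skip reason for the timestamp tests, or None to run them."""
    if is_utc_timezone(env, tzname):
        return None
    return "three-way timestamp comparison requires a UTC timezone"
```

In a pytest suite, a check like this could feed `pytest.mark.skipif(not is_utc_timezone(), reason=...)` on the timestamp tests. Matching on zone names is deliberately strict: a zone such as Europe/London has a zero offset in winter but is still treated as non-UTC here.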
The fastparquet compatibility tests fail with a data mismatch if the TZ environment variable is not set and the system timezone is not UTC. I suspect this is because, after #9652, we can tell the JVM to use a timezone that is out of sync with what the native code thinks the timezone is.
Here's an example of how I reproduced the errors:
The tests pass if I export TZ=UTC. If I explicitly set TZ=America/Chicago, they fail due to a fallback to the CPU, which is appropriate for the current state of the plugin.
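To illustrate the kind of data mismatch described here, the following minimal sketch shows how one stored instant yields different wall-clock values depending on which timezone the reader applies. It does not use Spark or fastparquet. The helper names and the sample timestamp are made up for illustration, and a fixed UTC-6 offset stands in for America/Chicago in November. Whether this is the exact mechanism behind the failures is not confirmed in this issue.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Fixed-offset stand-in for America/Chicago during standard time.
CST = timezone(timedelta(hours=-6), "CST")


def to_epoch_micros(wall_clock, session_tz):
    """Epoch microseconds for a naive wall-clock time read in session_tz."""
    aware = wall_clock.replace(tzinfo=session_tz)
    return (aware - EPOCH) // timedelta(microseconds=1)


def render(micros, reader_tz):
    """Naive wall-clock time a reader using reader_tz would display."""
    instant = EPOCH + timedelta(microseconds=micros)
    return instant.astimezone(reader_tz).replace(tzinfo=None)


wall = datetime(2023, 11, 17, 12, 0, 0)
micros = to_epoch_micros(wall, CST)      # writer uses a non-UTC session zone

as_utc = render(micros, timezone.utc)    # reader that always assumes UTC
as_local = render(micros, CST)           # reader that honors the session zone

print(as_utc, as_local, as_utc == as_local)
```

When the writer and both readers all use UTC, the two renderings coincide, which is consistent with the tests passing under `TZ=UTC`.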