Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] fastparquet compatibility tests fail with data mismatch if TZ is not set and system timezone is not UTC #9776

Closed
jlowe opened this issue Nov 17, 2023 · 1 comment · Fixed by #9831
Assignees
Labels
bug Something isn't working

Comments

@jlowe
Copy link
Contributor

jlowe commented Nov 17, 2023

Running the fastparquet compatibility tests will fail if the TZ environment variable is not set. I suspect this is because after #9652 we can tell the JVM to use a timezone that could be out-of-sync with what the native code thinks the timezone is.

Here's an example of how I reproduced the errors:

$ echo $TZ

$ date
Fri 17 Nov 2023 09:36:39 AM CST
$ SPARK_HOME=/home/jlowe/spark-3.4.1-bin-hadoop3/ integration_tests/run_pyspark_from_build.sh -k test_reading_file_written --pyarrow_test
[...]
FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_by_spark_cpu[Timestamp(not_null)0][DATAGEN_SEED=1700171382, INJECT_OOM] - AssertionError: GPU and CPU timestamp values are different at [0, 'a']
FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_gpu[Timestamp(not_null)0][DATAGEN_SEED=1700171382, INJECT_OOM] - AssertionError: GPU and CPU timestamp values are different at [0, 'a']

The tests pass if I export TZ=UTC or fail due to a fallback to the CPU (which is appropriate for the current state of the plugin) if I explicitly set TZ=America/Chicago.

@jlowe jlowe added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 17, 2023
@revans2
Copy link
Collaborator

revans2 commented Nov 21, 2023

I just ran into the problem with test_timestamp_{seconds,millis,micros}. I didn't set TZ explicitly to UTC, and python thought we were in one time zone, while java thought we were in UTC. This resulted in them failing with an error that the minimum value generated by the tests for a long we before the year 0.

@mythrocks mythrocks self-assigned this Nov 21, 2023
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Nov 21, 2023
mythrocks added a commit to mythrocks/spark-rapids that referenced this issue Nov 21, 2023
Fixes NVIDIA#9776.

`fastparquet` seems to read Parquet timestamp columns and interpret them
in the UTC timezone, regardless of timestamp settings.
The Spark RAPIDS plugin falls back on Apache Spark (CPU) to interpret
timestamp columns, when it detects that the timezone is non-UTC.
Apache Spark seems to correctly interpret the timestamps based on
timezone.

This causes the `fastparquet` timestamp tests to fail in cases where
the timezone is unspecified.

This commit xfails the timestamp tests when a non-UTC timezone is
detected.

Signed-off-by: MithunR <[email protected]>
mythrocks added a commit to mythrocks/spark-rapids that referenced this issue Nov 22, 2023
Fixes NVIDIA#9776.

The tests in `fastparquet_compatibility_test.py` check for compatibility between
Apache Spark, the Spark RAPIDS plugin, and fastparquet. In particular:
1. `test_reading_file_written_by_spark_cpu` checks if timestamp columns written
    with Apache Spark are read similarly with fastparquet and the plugin.
2. `test_reading_file_written_with_gpu` checks if timestamps written with
   the plugin are read the same on Apache Spark and fastparquet.

If the timezone is not set to "UTC", and the system timezone isn't "UTC" either,
the plugin falls back to CPU for read/write of Parquet timestamp columns. This would
cause the above tests not to run: the plugin can neither read nor write timestamps
on GPU.

Further, fastparquet seems to interpret timestamps written from Spark as being
in "UTC", regardless of the timezone settings. So on non-UTC timezones,
Apache Spark and fastparquet get different results for the same input.

For the two reasons above, it is best to only run the three-way timestamp comparison
tests in setups with "UTC" timezone.

This commit skips the timestamp tests described above, when a non-UTC timezone is
detected.

Signed-off-by: MithunR <[email protected]>
@nvauto nvauto closed this as completed in c1a0c61 Dec 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants