Skip fastparquet timestamp tests for non-UTC timezones.
Fixes NVIDIA#9776.

The tests in `fastparquet_compatibility_test.py` check for compatibility between
Apache Spark, the Spark RAPIDS plugin, and fastparquet. In particular:
1. `test_reading_file_written_by_spark_cpu` checks that timestamp columns written
   by Apache Spark are read identically by fastparquet and the plugin.
2. `test_reading_file_written_with_gpu` checks that timestamps written by
   the plugin are read identically by Apache Spark and fastparquet.

If the Spark session timezone is not set to "UTC", and the system timezone isn't
"UTC" either, the plugin falls back to the CPU for reads and writes of Parquet
timestamp columns. The tests above then cannot run as intended: the plugin can
neither read nor write timestamps on the GPU.

Further, fastparquet seems to interpret timestamps written from Spark as being
in "UTC", regardless of the timezone settings. So on non-UTC timezones,
Apache Spark and fastparquet get different results for the same input.
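
The mismatch can be illustrated without Spark or fastparquet. The sketch below
is not taken from this commit; it assumes the file stores an instant as
microseconds since the Unix epoch, that one reader renders it in the session
timezone, and that the other always treats it as UTC. The `render` helper and
the fixed UTC+05:30 offset are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def render(micros, tz):
    """Return the naive wall-clock datetime of an epoch-microsecond instant in tz."""
    instant = EPOCH + timedelta(microseconds=micros)
    return instant.astimezone(tz).replace(tzinfo=None)


# Hypothetical non-UTC session timezone, given as a fixed offset so that
# no timezone database is needed.
session_tz = timezone(timedelta(hours=5, minutes=30))

stored_micros = 1_700_000_000_000_000  # an arbitrary stored instant

as_session = render(stored_micros, session_tz)  # reader honouring the session timezone
as_utc = render(stored_micros, timezone.utc)    # reader assuming UTC

# The two readers disagree by exactly the session timezone's UTC offset.
print(as_session - as_utc)  # prints 5:30:00
```

With a UTC session timezone the two renderings coincide, which is why the
comparison is only meaningful in that setup.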

For the two reasons above, it is best to only run the three-way timestamp comparison
tests in setups with "UTC" timezone.

This commit skips the timestamp tests described above when a non-UTC timezone is
detected.
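
The skip decision can be sketched without a Spark session. In the example
below, `is_utc` and `select_cases` are hypothetical names, and the session
timezone is passed in as a string instead of being read from
`spark.sql.session.timeZone`; only the system-timezone expression mirrors the
one used in this commit.

```python
import time


def is_utc(session_tz, system_tzname=None):
    """Return True only when both the session and the system timezone are UTC."""
    if system_tzname is None:
        # Same expression the commit uses to name the system timezone.
        system_tzname = time.tzname[time.daylight]
    return session_tz == "UTC" and system_tzname == "UTC"


def select_cases(cases, utc):
    """Keep every case, except UTC-only cases when the environment is not UTC."""
    return [name for name, utc_only in cases if utc or not utc_only]


# (case name, requires UTC) pairs; the names are illustrative only.
cases = [("dates", False), ("timestamps_vanilla", True)]

print(select_cases(cases, is_utc("UTC", "UTC")))
print(select_cases(cases, is_utc("America/Los_Angeles", "PST")))
```

Both conditions must hold: a UTC session timezone on a non-UTC system still
counts as non-UTC here, matching the `and` in the commit's helper.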

Signed-off-by: MithunR <[email protected]>
mythrocks committed Nov 22, 2023
1 parent 2667941 commit 8693bdb
Showing 1 changed file with 19 additions and 6 deletions.
@@ -94,6 +94,13 @@ def read_with_fastparquet_or_plugin(spark):
     return read_with_fastparquet_or_plugin


+def is_timezone_utc():
+    from spark_init_internal import get_spark_i_know_what_i_am_doing
+    import time
+    spark = get_spark_i_know_what_i_am_doing()
+    return spark.conf.get("spark.sql.session.timeZone") == "UTC" and time.tzname[time.daylight] == "UTC"
+
+
 @pytest.mark.skipif(condition=fastparquet_unavailable(),
                     reason="fastparquet is required for testing fastparquet compatibility")
 @pytest.mark.skipif(condition=spark_version() < "3.4.0",
@@ -119,9 +126,12 @@ def read_with_fastparquet_or_plugin(spark):
                  marks=pytest.mark.xfail(reason="fastparquet reads dates as timestamps.")),
     pytest.param(DateGen(nullable=False),
                  marks=pytest.mark.xfail(reason="fastparquet reads far future dates (e.g. year=8705) incorrectly.")),
-    TimestampGen(nullable=False,
-                 start=pandas_min_datetime,
-                 end=pandas_max_datetime), # Vanilla case.
+    pytest.param(TimestampGen(nullable=False,
+                              start=pandas_min_datetime,
+                              end=pandas_max_datetime),
+                 marks=pytest.mark.skipif(condition=not is_timezone_utc(),
+                                          reason="fastparquet interprets timestamps in UTC timezone, regardless "
+                                                 "of timezone settings")), # Vanilla case.
     pytest.param(TimestampGen(nullable=False,
                               start=pandas_min_datetime,
                               end=pandas_max_datetime),
@@ -188,9 +198,12 @@ def test_reading_file_written_by_spark_cpu(data_gen, spark_tmp_path):
                  marks=pytest.mark.xfail(reason="fastparquet reads dates as timestamps.")),
     pytest.param(DateGen(nullable=False),
                  marks=pytest.mark.xfail(reason="fastparquet reads far future dates (e.g. year=8705) incorrectly.")),
-    TimestampGen(nullable=False,
-                 start=pandas_min_datetime,
-                 end=pandas_max_datetime), # Vanilla case.
+    pytest.param(TimestampGen(nullable=False,
+                              start=pandas_min_datetime,
+                              end=pandas_max_datetime),
+                 marks=pytest.mark.skipif(condition=not is_timezone_utc(),
+                                          reason="fastparquet interprets timestamps in UTC timezone, regardless "
+                                                 "of timezone settings")), # Vanilla case.
     pytest.param(TimestampGen(nullable=False,
                               start=datetime(1, 1, 1, tzinfo=timezone.utc),
                               end=pandas_min_datetime),