[FEA] Understand why we were able to parse a timestamp correctly for America/Los_Angeles when all we support is UTC. #10488

Open
revans2 opened this issue Feb 23, 2024 · 1 comment
Labels
feature request New feature or request

Comments

@revans2
Collaborator

revans2 commented Feb 23, 2024

Is your feature request related to a problem? Please describe.
This is really odd to me. I have a test that is still a WIP.

@pytest.mark.parametrize('zone_id', [
    "UTC",
    pytest.param("-08:00",marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10485')),
    pytest.param("+01:00",marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10485')),
    pytest.param("Africa/Dakar",marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10485')),
    "America/Los_Angeles", # technically this is passing for this test, but it should not be the same as UTC
    pytest.param("Asia/Urumqi",marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10485')),
    pytest.param("Asia/Hong_Kong",marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10485')),
    pytest.param("Europe/Brussels",marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/10485'))], ids=idfn)
@allow_non_gpu(*non_utc_allow)
def test_spark_from_json_timestamp_format_option_zoneid_but_default_format(zone_id):
    schema = StructType([StructField("t", TimestampType())])
    data = [[r'''{"t": "2016-01-01 00:00:00"}'''],
        [r'''{"t": "2023-07-27 12:21:05"}''']]
    assert_gpu_and_cpu_are_equal_collect(
            lambda spark : spark.createDataFrame(data, 'json STRING').select(f.col('json'), f.from_json(f.col('json'), schema, {'timeZone': zone_id})),
        conf = { 'spark.rapids.sql.expression.JsonToStructs': True })

from_json lets you set the time zone that is used to parse timestamps. But when I set the time zone to "America/Los_Angeles" the test passes, and I don't know why. It shouldn't: that zone has DST rules that are still in effect.

scala> import java.time.ZoneId
import java.time.ZoneId

scala> val al = ZoneId.of("America/Los_Angeles").normalized
al: java.time.ZoneId = America/Los_Angeles

scala> al.getRules.getTransitions
res1: java.util.List[java.time.zone.ZoneOffsetTransition] = [Transition[Overlap at 1883-11-18T12:07:02-07:52:58 to -08:00], Transition[Gap at 1918-03-31T02:00-08:00 to -07:00], Transition[Overlap at 1918-10-27T02:00-07:00 to -08:00], Transition[Gap at 1919-03-30T02:00-08:00 to -07:00], Transition[Overlap at 1919-10-26T02:00-07:00 to -08:00], Transition[Gap at 1942-02-09T02:00-08:00 to -07:00], Transition[Overlap at 1945-09-30T02:00-07:00 to -08:00], Transition[Gap at 1948-03-14T02:01-08:00 to -07:00], Transition[Overlap at 1949-01-01T02:00-07:00 to -08:00], Transition[Gap at 1950-04-30T01:00-08:00 to -07:00], Transition[Overlap at 1950-09-24T02:00-07:00 to -08:00], Transition[Gap at 1951-04-29T01:00-08:00 to -07:00], Transition[Overlap at 1951-09-30T02:00-07:00...

scala> al.getRules.getTransitionRules
res2: java.util.List[java.time.zone.ZoneOffsetTransitionRule] = [TransitionRule[Gap -08:00 to -07:00, SUNDAY on or after MARCH 8 at 02:00 WALL, standard offset -08:00], TransitionRule[Overlap -07:00 to -08:00, SUNDAY on or after NOVEMBER 1 at 02:00 WALL, standard offset -08:00]]

We should not be producing the right answer. More likely Spark is producing the wrong answer somehow.
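
To make the expected mismatch concrete, here is a minimal Python sketch that applies only the two current transition rules printed above (`getTransitionRules`) to the two timestamps from the test. The helper names (`first_sunday_on_or_after`, `los_angeles_offset_hours`, `to_utc`) are made up for illustration and are not part of Spark or spark-rapids. It also ignores the historical transitions and does not treat wall-clock times inside a gap or overlap precisely, so it is an approximation of what `java.time` does, not a replacement for it.

```python
from datetime import datetime, timedelta, timezone


def first_sunday_on_or_after(year, month, day):
    """Midnight of the first Sunday on or after the given date."""
    d = datetime(year, month, day)
    # datetime.weekday(): Monday == 0 ... Sunday == 6
    return d + timedelta(days=(6 - d.weekday()) % 7)


def los_angeles_offset_hours(wall):
    """UTC offset in hours for a naive wall-clock time.

    Assumption: only the two current rules from the issue apply
    (gap on the Sunday on or after March 8 at 02:00 wall time,
    overlap on the Sunday on or after November 1 at 02:00 wall time).
    """
    dst_start = first_sunday_on_or_after(wall.year, 3, 8).replace(hour=2)
    dst_end = first_sunday_on_or_after(wall.year, 11, 1).replace(hour=2)
    return -7 if dst_start <= wall < dst_end else -8


def to_utc(wall, offset_hours):
    """Interpret a naive wall-clock time at a fixed offset, return it in UTC."""
    return (wall - timedelta(hours=offset_hours)).replace(tzinfo=timezone.utc)


for text in ["2016-01-01 00:00:00", "2023-07-27 12:21:05"]:
    wall = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    parsed_as_utc = to_utc(wall, 0)
    parsed_as_la = to_utc(wall, los_angeles_offset_hours(wall))
    print(text, "| as UTC:", parsed_as_utc.isoformat(),
          "| as America/Los_Angeles:", parsed_as_la.isoformat(),
          "| difference:", parsed_as_la - parsed_as_utc)
```

Under these assumptions the first timestamp lands 8 hours after its UTC interpretation and the second lands 7 hours after, so a parser that only understands UTC should not agree with a correct America/Los_Angeles parse for either row.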

https://github.com/apache/spark/blob/e6a3385e27fa95391433ea02fa053540fe101d40/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala#L529-L571

is the test that this is based off of. I hope that I am wrong and everything is working fine, but it looks really odd to me.

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Feb 23, 2024
@revans2
Collaborator Author

revans2 commented Feb 23, 2024

It just failed for me when I ran it with all of the tests together instead of running it single-threaded by itself. This might be a test-related issue.

@revans2 revans2 changed the title [FEA] Unserstand why we were able to parse a timestamp correctly for America/Los_Angeles when all we support is UTC. [FEA] Understand why we were able to parse a timestamp correctly for America/Los_Angeles when all we support is UTC. Feb 26, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Feb 27, 2024