Integration tests failing for non-utc timestamp in date_test.py #11539

Open
nartal1 opened this issue Sep 27, 2024 · 3 comments
Labels
bug Something isn't working

@nartal1
Collaborator

nartal1 commented Sep 27, 2024

The following nightly integration tests are failing:


FAILED ../../src/main/python/date_time_test.py::test_formats_for_legacy_mode[([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])-yyyyMMdd][DATAGEN_SEED=1727466731, TZ=America/Punta_Arenas] - AssertionError: GPU and CPU int values are different at [356, 'unix_timestamp(a, yyyyMMdd)']

FAILED ../../src/main/python/date_time_test.py::test_formats_for_legacy_mode[([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])-yyyymmdd][DATAGEN_SEED=1727466731, TZ=America/Punta_Arenas, INJECT_OOM] - AssertionError: GPU and CPU int values are different at [356, 'unix_timestamp(a, yyyymmdd)']

Additional info of failing tests:
[2024-09-27T22:01:34.718Z] =================================== FAILURES ===================================

[2024-09-27T22:01:34.718Z] _ test_formats_for_legacy_mode[([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])-yyyyMMdd] _

[2024-09-27T22:01:34.718Z] [gw4] linux -- Python 3.10.15 /opt/conda/bin/python

[2024-09-27T22:01:34.718Z]

[2024-09-27T22:01:34.718Z] format = 'yyyyMMdd', data_gen_regexp = '([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])'

[2024-09-27T22:01:34.718Z]

[2024-09-27T22:01:34.718Z] @pytest.mark.skipif(not is_supported_time_zone(), reason="not all time zones are supported now, refer to #6839, please update after all time zones are supported")

[2024-09-27T22:01:34.718Z] @pytest.mark.parametrize("format", ['yyyyMMdd', 'yyyymmdd'], ids=idfn)

[2024-09-27T22:01:34.718Z] # these regexps exclude zero year, python does not like zero year

[2024-09-27T22:01:34.718Z] @pytest.mark.parametrize("data_gen_regexp", ['([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])', '([0-9]{3}[1-9])([0-9]{4})'], ids=idfn)

[2024-09-27T22:01:34.718Z] def test_formats_for_legacy_mode(format, data_gen_regexp):

[2024-09-27T22:01:34.719Z] gen = StringGen(data_gen_regexp)

[2024-09-27T22:01:34.719Z] > assert_gpu_and_cpu_are_equal_sql(

[2024-09-27T22:01:34.719Z] lambda spark : unary_op_df(spark, gen),

[2024-09-27T22:01:34.719Z] "tab",

[2024-09-27T22:01:34.719Z] '''select unix_timestamp(a, '{}'),

[2024-09-27T22:01:34.719Z] from_unixtime(unix_timestamp(a, '{}'), '{}'),

[2024-09-27T22:01:34.719Z] date_format(to_timestamp(a, '{}'), '{}')

[2024-09-27T22:01:34.719Z] from tab

[2024-09-27T22:01:34.719Z] '''.format(format, format, format, format, format),

[2024-09-27T22:01:34.719Z] { 'spark.sql.legacy.timeParserPolicy': 'LEGACY',

[2024-09-27T22:01:34.719Z] 'spark.rapids.sql.incompatibleDateFormats.enabled': True})

[2024-09-27T22:01:34.719Z]

[2024-09-27T22:01:34.719Z] ../../src/main/python/date_time_test.py:469:

[2024-09-27T22:01:34.719Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[2024-09-27T22:01:34.719Z] ../../src/main/python/asserts.py:641: in assert_gpu_and_cpu_are_equal_sql

[2024-09-27T22:01:34.719Z] assert_gpu_and_cpu_are_equal_collect(do_it_all, conf, is_cpu_first=is_cpu_first)

[2024-09-27T22:01:34.719Z] ../../src/main/python/asserts.py:599: in assert_gpu_and_cpu_are_equal_collect

[2024-09-27T22:01:34.719Z] _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)

[2024-09-27T22:01:34.719Z] ../../src/main/python/asserts.py:521: in _assert_gpu_and_cpu_are_equal

[2024-09-27T22:01:34.719Z] assert_equal(from_cpu, from_gpu)

[2024-09-27T22:01:34.719Z] ../../src/main/python/asserts.py:111: in assert_equal

[2024-09-27T22:01:34.719Z] _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])

[2024-09-27T22:01:34.719Z] ../../src/main/python/asserts.py:43: in _assert_equal

[2024-09-27T22:01:34.719Z] _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.719Z] ../../src/main/python/asserts.py:36: in _assert_equal

[2024-09-27T22:01:34.719Z] _assert_equal(cpu[field], gpu[field], float_check, path + [field])

[2024-09-27T22:01:34.719Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[2024-09-27T22:01:34.719Z]

[2024-09-27T22:01:34.719Z] cpu = -2311528634, gpu = -2311528635

[2024-09-27T22:01:34.719Z] float_check = <function get_float_check.. at 0x7f94e28a83a0>

[2024-09-27T22:01:34.719Z] path = [356, 'unix_timestamp(a, yyyyMMdd)']

[2024-09-27T22:01:34.719Z]

[2024-09-27T22:01:34.719Z] def _assert_equal(cpu, gpu, float_check, path):

[2024-09-27T22:01:34.719Z] t = type(cpu)

[2024-09-27T22:01:34.719Z] if (t is Row):

[2024-09-27T22:01:34.719Z] assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-09-27T22:01:34.719Z] if hasattr(cpu, "fields") and hasattr(gpu, "fields"):

[2024-09-27T22:01:34.719Z] assert cpu.fields == gpu.fields, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.fields, gpu.fields)

[2024-09-27T22:01:34.719Z] for field in cpu.fields:

[2024-09-27T22:01:34.719Z] _assert_equal(cpu[field], gpu[field], float_check, path + [field])

[2024-09-27T22:01:34.719Z] else:

[2024-09-27T22:01:34.719Z] for index in range(len(cpu)):

[2024-09-27T22:01:34.719Z] _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.719Z] elif (t is list):

[2024-09-27T22:01:34.719Z] assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-09-27T22:01:34.719Z] for index in range(len(cpu)):

[2024-09-27T22:01:34.719Z] _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.719Z] elif (t is tuple):

[2024-09-27T22:01:34.719Z] assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-09-27T22:01:34.719Z] for index in range(len(cpu)):

[2024-09-27T22:01:34.719Z] _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.719Z] elif (t is pytypes.GeneratorType):

[2024-09-27T22:01:34.719Z] index = 0

[2024-09-27T22:01:34.719Z] # generator has no zip :( so we have to do this the hard way

[2024-09-27T22:01:34.719Z] done = False

[2024-09-27T22:01:34.719Z] while not done:

[2024-09-27T22:01:34.719Z] sub_cpu = None

[2024-09-27T22:01:34.719Z] sub_gpu = None

[2024-09-27T22:01:34.719Z] try:

[2024-09-27T22:01:34.719Z] sub_cpu = next(cpu)

[2024-09-27T22:01:34.719Z] except StopIteration:

[2024-09-27T22:01:34.719Z] done = True

[2024-09-27T22:01:34.719Z]

[2024-09-27T22:01:34.719Z] try:

[2024-09-27T22:01:34.719Z] sub_gpu = next(gpu)

[2024-09-27T22:01:34.719Z] except StopIteration:

[2024-09-27T22:01:34.720Z] done = True

[2024-09-27T22:01:34.720Z]

[2024-09-27T22:01:34.720Z] if done:

[2024-09-27T22:01:34.720Z] assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)

[2024-09-27T22:01:34.720Z] else:

[2024-09-27T22:01:34.720Z] _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])

[2024-09-27T22:01:34.720Z]

[2024-09-27T22:01:34.720Z] index = index + 1

[2024-09-27T22:01:34.720Z] elif (t is dict):

[2024-09-27T22:01:34.720Z] # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark

[2024-09-27T22:01:34.720Z] # so sort the items to do our best with ignoring the order of dicts

[2024-09-27T22:01:34.720Z] cpu_items = list(cpu.items()).sort(key=_RowCmp)

[2024-09-27T22:01:34.720Z] gpu_items = list(gpu.items()).sort(key=_RowCmp)

[2024-09-27T22:01:34.720Z] _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])

[2024-09-27T22:01:34.720Z] elif (t is int):

[2024-09-27T22:01:34.720Z] > assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)

[2024-09-27T22:01:34.720Z] E AssertionError: GPU and CPU int values are different at [356, 'unix_timestamp(a, yyyyMMdd)']

[2024-09-27T22:01:34.720Z]

[2024-09-27T22:01:34.720Z] ../../src/main/python/asserts.py:78: AssertionError

[2024-09-27T22:01:34.720Z] ----------------------------- Captured stdout call -----------------------------

[2024-09-27T22:01:34.720Z] ### CPU RUN ###

[2024-09-27T22:01:34.720Z] ### GPU RUN ###

[2024-09-27T22:01:34.720Z] ### COLLECT: GPU TOOK 0.20502257347106934 CPU TOOK 0.25730133056640625 ###

[2024-09-27T22:01:34.720Z] --- CPU OUTPUT

[2024-09-27T22:01:34.720Z] +++ GPU OUTPUT

[2024-09-27T22:01:34.720Z] @@ -354,7 +354,7 @@

[2024-09-27T22:01:34.720Z] Row(unix_timestamp(a, yyyyMMdd)=1851217200, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)='20280830', date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)='20280830')

[2024-09-27T22:01:34.720Z] Row(unix_timestamp(a, yyyyMMdd)=None, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)=None, date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)=None)

[2024-09-27T22:01:34.720Z] Row(unix_timestamp(a, yyyyMMdd)=None, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)=None, date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)=None)

[2024-09-27T22:01:34.720Z] -Row(unix_timestamp(a, yyyyMMdd)=-2311528634, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)='18961001', date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)='18961001')

[2024-09-27T22:01:34.720Z] +Row(unix_timestamp(a, yyyyMMdd)=-2311528635, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)='18961001', date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)='18961001')

[2024-09-27T22:01:34.720Z] Row(unix_timestamp(a, yyyyMMdd)=None, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)=None, date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)=None)

[2024-09-27T22:01:34.720Z] Row(unix_timestamp(a, yyyyMMdd)=None, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)=None, date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)=None)

[2024-09-27T22:01:34.720Z] Row(unix_timestamp(a, yyyyMMdd)=None, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)=None, date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)=None)

[2024-09-27T22:01:34.720Z] _ test_formats_for_legacy_mode[([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])-yyyymmdd] _

[2024-09-27T22:01:34.720Z] [gw4] linux -- Python 3.10.15 /opt/conda/bin/python

[2024-09-27T22:01:34.720Z]

[2024-09-27T22:01:34.720Z] format = 'yyyymmdd', data_gen_regexp = '([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])'

[2024-09-27T22:01:34.720Z]

[2024-09-27T22:01:34.720Z] @pytest.mark.skipif(not is_supported_time_zone(), reason="not all time zones are supported now, refer to #6839, please update after all time zones are supported")

[2024-09-27T22:01:34.720Z] @pytest.mark.parametrize("format", ['yyyyMMdd', 'yyyymmdd'], ids=idfn)

[2024-09-27T22:01:34.720Z] # these regexps exclude zero year, python does not like zero year

[2024-09-27T22:01:34.720Z] @pytest.mark.parametrize("data_gen_regexp", ['([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])', '([0-9]{3}[1-9])([0-9]{4})'], ids=idfn)

[2024-09-27T22:01:34.720Z] def test_formats_for_legacy_mode(format, data_gen_regexp):

[2024-09-27T22:01:34.720Z] gen = StringGen(data_gen_regexp)

[2024-09-27T22:01:34.720Z] > assert_gpu_and_cpu_are_equal_sql(

[2024-09-27T22:01:34.720Z] lambda spark : unary_op_df(spark, gen),

[2024-09-27T22:01:34.720Z] "tab",

[2024-09-27T22:01:34.720Z] '''select unix_timestamp(a, '{}'),

[2024-09-27T22:01:34.720Z] from_unixtime(unix_timestamp(a, '{}'), '{}'),

[2024-09-27T22:01:34.720Z] date_format(to_timestamp(a, '{}'), '{}')

[2024-09-27T22:01:34.720Z] from tab

[2024-09-27T22:01:34.720Z] '''.format(format, format, format, format, format),

[2024-09-27T22:01:34.720Z] { 'spark.sql.legacy.timeParserPolicy': 'LEGACY',

[2024-09-27T22:01:34.720Z] 'spark.rapids.sql.incompatibleDateFormats.enabled': True})

[2024-09-27T22:01:34.720Z]

[2024-09-27T22:01:34.720Z] ../../src/main/python/date_time_test.py:469:

[2024-09-27T22:01:34.720Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[2024-09-27T22:01:34.720Z] ../../src/main/python/asserts.py:641: in assert_gpu_and_cpu_are_equal_sql

[2024-09-27T22:01:34.720Z] assert_gpu_and_cpu_are_equal_collect(do_it_all, conf, is_cpu_first=is_cpu_first)

[2024-09-27T22:01:34.720Z] ../../src/main/python/asserts.py:599: in assert_gpu_and_cpu_are_equal_collect

[2024-09-27T22:01:34.720Z] _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)

[2024-09-27T22:01:34.720Z] ../../src/main/python/asserts.py:521: in _assert_gpu_and_cpu_are_equal

[2024-09-27T22:01:34.720Z] assert_equal(from_cpu, from_gpu)

[2024-09-27T22:01:34.720Z] ../../src/main/python/asserts.py:111: in assert_equal

[2024-09-27T22:01:34.720Z] _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])

[2024-09-27T22:01:34.721Z] ../../src/main/python/asserts.py:43: in _assert_equal

[2024-09-27T22:01:34.721Z] _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.721Z] ../../src/main/python/asserts.py:36: in _assert_equal

[2024-09-27T22:01:34.721Z] _assert_equal(cpu[field], gpu[field], float_check, path + [field])

[2024-09-27T22:01:34.721Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[2024-09-27T22:01:34.721Z]

[2024-09-27T22:01:34.721Z] cpu = -2335201634, gpu = -2335201635

[2024-09-27T22:01:34.721Z] float_check = <function get_float_check.. at 0x7f94e22bd870>

[2024-09-27T22:01:34.721Z] path = [356, 'unix_timestamp(a, yyyymmdd)']

[2024-09-27T22:01:34.721Z]

[2024-09-27T22:01:34.721Z] def _assert_equal(cpu, gpu, float_check, path):

[2024-09-27T22:01:34.721Z] t = type(cpu)

[2024-09-27T22:01:34.721Z] if (t is Row):

[2024-09-27T22:01:34.721Z] assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-09-27T22:01:34.721Z] if hasattr(cpu, "fields") and hasattr(gpu, "fields"):

[2024-09-27T22:01:34.721Z] assert cpu.fields == gpu.fields, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.fields, gpu.fields)

[2024-09-27T22:01:34.721Z] for field in cpu.fields:

[2024-09-27T22:01:34.721Z] _assert_equal(cpu[field], gpu[field], float_check, path + [field])

[2024-09-27T22:01:34.721Z] else:

[2024-09-27T22:01:34.721Z] for index in range(len(cpu)):

[2024-09-27T22:01:34.721Z] _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.721Z] elif (t is list):

[2024-09-27T22:01:34.721Z] assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-09-27T22:01:34.721Z] for index in range(len(cpu)):

[2024-09-27T22:01:34.721Z] _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.721Z] elif (t is tuple):

[2024-09-27T22:01:34.721Z] assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-09-27T22:01:34.721Z] for index in range(len(cpu)):

[2024-09-27T22:01:34.721Z] _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.721Z] elif (t is pytypes.GeneratorType):

[2024-09-27T22:01:34.721Z] index = 0

[2024-09-27T22:01:34.721Z] # generator has no zip :( so we have to do this the hard way

[2024-09-27T22:01:34.721Z] done = False

[2024-09-27T22:01:34.721Z] while not done:

[2024-09-27T22:01:34.721Z] sub_cpu = None

[2024-09-27T22:01:34.721Z] sub_gpu = None

[2024-09-27T22:01:34.721Z] try:

[2024-09-27T22:01:34.721Z] sub_cpu = next(cpu)

[2024-09-27T22:01:34.721Z] except StopIteration:

[2024-09-27T22:01:34.721Z] done = True

[2024-09-27T22:01:34.721Z]

[2024-09-27T22:01:34.721Z] try:

[2024-09-27T22:01:34.721Z] sub_gpu = next(gpu)

[2024-09-27T22:01:34.721Z] except StopIteration:

[2024-09-27T22:01:34.721Z] done = True

[2024-09-27T22:01:34.721Z]

[2024-09-27T22:01:34.721Z] if done:

[2024-09-27T22:01:34.721Z] assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)

[2024-09-27T22:01:34.721Z] else:

[2024-09-27T22:01:34.721Z] _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])

[2024-09-27T22:01:34.721Z]

[2024-09-27T22:01:34.721Z] index = index + 1

[2024-09-27T22:01:34.721Z] elif (t is dict):

[2024-09-27T22:01:34.721Z] # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark

[2024-09-27T22:01:34.721Z] # so sort the items to do our best with ignoring the order of dicts

[2024-09-27T22:01:34.721Z] cpu_items = list(cpu.items()).sort(key=_RowCmp)

[2024-09-27T22:01:34.721Z] gpu_items = list(gpu.items()).sort(key=_RowCmp)

[2024-09-27T22:01:34.721Z] _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])

[2024-09-27T22:01:34.721Z] elif (t is int):

[2024-09-27T22:01:34.721Z] > assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)

[2024-09-27T22:01:34.721Z] E AssertionError: GPU and CPU int values are different at [356, 'unix_timestamp(a, yyyymmdd)']

[2024-09-27T22:01:34.721Z]

[2024-09-27T22:01:34.721Z] ../../src/main/python/asserts.py:78: AssertionError

[2024-09-27T22:01:34.721Z] ----------------------------- Captured stdout call -----------------------------

[2024-09-27T22:01:34.721Z] ### CPU RUN ###

[2024-09-27T22:01:34.721Z] ### GPU RUN ###

[2024-09-27T22:01:34.721Z] ### COLLECT: GPU TOOK 0.20046424865722656 CPU TOOK 0.2261199951171875 ###

[2024-09-27T22:01:34.721Z] --- CPU OUTPUT

[2024-09-27T22:01:34.721Z] +++ GPU OUTPUT

[2024-09-27T22:01:34.721Z] @@ -354,7 +354,7 @@

[2024-09-27T22:01:34.721Z] Row(unix_timestamp(a, yyyymmdd)=1832814480, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='20280830', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='20280830')

[2024-09-27T22:01:34.721Z] Row(unix_timestamp(a, yyyymmdd)=188302419960, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='79374625', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='79374625')

[2024-09-27T22:01:34.722Z] Row(unix_timestamp(a, yyyymmdd)=106473843300, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='53443509', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='53443509')

[2024-09-27T22:01:34.722Z] -Row(unix_timestamp(a, yyyymmdd)=-2335201634, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='18961001', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='18961001')

[2024-09-27T22:01:34.722Z] +Row(unix_timestamp(a, yyyymmdd)=-2335201635, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='18961001', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='18961001')

[2024-09-27T22:01:34.722Z] Row(unix_timestamp(a, yyyymmdd)=54974432640, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='37124427', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='37124427')

[2024-09-27T22:01:34.722Z] Row(unix_timestamp(a, yyyymmdd)=138914363880, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='63721809', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='63721809')

[2024-09-27T22:01:34.722Z] Row(unix_timestamp(a, yyyymmdd)=252931605180, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='99851331', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='99851331')

@res-life
Collaborator

Reproduce

select unix_timestamp('18961001', 'yyyyMMdd')
with config:
  'spark.sql.legacy.timeParserPolicy': 'LEGACY',
  'spark.rapids.sql.incompatibleDateFormats.enabled': True
with timezone:
  America/Punta_Arenas

CPU: -2311528634
GPU: -2311528635

The difference is one second.
Note: Other time zones such as Asia/Shanghai and Iran do not have this issue.
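
To make the one-second gap concrete, here is a small Python sketch (not part of the test suite; the helper names `implied_utc_offset` and `format_offset` are made up for illustration). It derives the UTC offset that each result implies for local midnight on 1896-10-01. It only does arithmetic on the two reported values and does not consult any time zone database.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def implied_utc_offset(epoch_seconds, local):
    """Seconds east of UTC implied by mapping the naive wall time `local`
    to `epoch_seconds` (a negative result means west of UTC)."""
    local_as_utc = (local.replace(tzinfo=timezone.utc) - EPOCH) // timedelta(seconds=1)
    return local_as_utc - epoch_seconds

def format_offset(seconds):
    """Render an offset in seconds as +HH:MM:SS or -HH:MM:SS."""
    sign = "-" if seconds < 0 else "+"
    hours, rest = divmod(abs(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{sign}{hours:02d}:{minutes:02d}:{secs:02d}"

local_midnight = datetime(1896, 10, 1)  # '18961001' parsed with 'yyyyMMdd'
cpu_offset = implied_utc_offset(-2311528634, local_midnight)  # CPU value from this issue
gpu_offset = implied_utc_offset(-2311528635, local_midnight)  # GPU value from this issue

print(format_offset(cpu_offset))  # -04:42:46
print(format_offset(gpu_offset))  # -04:42:45
```

So the CPU (LEGACY) path and the GPU disagree only on the seconds part of the pre-standard-time offset for this zone. Which offset a given code path uses depends on its time zone data, which this sketch does not establish.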

Analysis

Test in a Spark 3.3.0 (330) shell:

scala> import java.time._
import java.time._

scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.catalyst.util.DateTimeUtils

scala> val epochSeconds = LocalDateTime.of(1896,10,1,0,0,0).toInstant(ZoneOffset.UTC).getEpochSecond()
epochSeconds: Long = -2311545600

scala> val micros = epochSeconds * 1000000
micros: Long = -2311545600000000

scala> val expected = DateTimeUtils.convertTz(micros, ZoneId.of("America/Punta_Arenas"),  ZoneId.of("UTC"))/1000000L
expected: Long = -2311528635    // this is the same as the GPU output

Test non-LEGACY mode

Save the following value into a Parquet file:
"1896-10-01"
select unix_timestamp(col, 'yyyy-MM-dd') from tab
The results are correct:

CPU: -2311528635
GPU: -2311528635

Conclusion

This is a corner case in LEGACY mode; non-LEGACY mode does not have this problem.
Other time zones such as Asia/Shanghai and Iran do not have this issue.
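
As a cross-check, the sketch below (plain Python, with hypothetical helpers `parse_fixed_width` and `to_epoch_seconds`) reproduces the values from both failing parametrizations using a fixed UTC offset. It assumes the usual pattern-letter meaning where lowercase `mm` is minute-of-hour, so '18961001' under 'yyyymmdd' reads as 1896-01-01 00:10:00. The offsets -04:42:45 and -04:42:46 are inferred from the reported numbers, not read from a time zone database, and the parser is strict (no lenient rollover of out-of-range fields).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def parse_fixed_width(text, pattern):
    """Parse an 8-digit string as a naive local datetime (illustrative only)."""
    year, middle, day = int(text[0:4]), int(text[4:6]), int(text[6:8])
    if pattern == "yyyyMMdd":
        return datetime(year, middle, day)         # MM = month
    if pattern == "yyyymmdd":
        return datetime(year, 1, day, 0, middle)   # mm = minute, month defaults to 1
    raise ValueError(f"unsupported pattern: {pattern}")

def to_epoch_seconds(local, utc_offset_seconds):
    """Epoch seconds for a naive local time at a fixed UTC offset."""
    aware = local.replace(tzinfo=timezone(timedelta(seconds=utc_offset_seconds)))
    return (aware - EPOCH) // timedelta(seconds=1)

OFFSET_GPU = -(4 * 3600 + 42 * 60 + 45)  # -04:42:45, inferred from the GPU values
OFFSET_CPU = -(4 * 3600 + 42 * 60 + 46)  # -04:42:46, inferred from the CPU values

for pattern in ("yyyyMMdd", "yyyymmdd"):
    local = parse_fixed_width("18961001", pattern)
    print(pattern, to_epoch_seconds(local, OFFSET_CPU), to_epoch_seconds(local, OFFSET_GPU))
# yyyyMMdd -2311528634 -2311528635
# yyyymmdd -2335201634 -2335201635
```

The same one-second offset difference accounts for both failures; the 'yyyymmdd' case differs only in which local time the string is parsed to.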

TODO

Debug Spark to see what happens in LEGACY mode.

@res-life
Collaborator

Spark behaves differently in LEGACY and non-LEGACY modes:

Spark330:

scala> spark.conf.set("spark.sql.session.timeZone", "America/Punta_Arenas")

scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")

scala> spark.sql("select unix_timestamp('18961001', 'yyyyMMdd')").show()
+----------------------------------+
|unix_timestamp(18961001, yyyyMMdd)|
+----------------------------------+
|                       -2311528635|
+----------------------------------+


scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

scala> spark.sql("select unix_timestamp('18961001', 'yyyyMMdd')").show()
+----------------------------------+
|unix_timestamp(18961001, yyyyMMdd)|
+----------------------------------+
|                       -2311535143|
+----------------------------------+

@res-life
Collaborator

We already documented that LEGACY mode has several limitations:

LEGACY timeParserPolicy support has the following limitations when running on the GPU:

Only 4 digit years are supported
The proleptic Gregorian calendar is used instead of the hybrid Julian+Gregorian calendar that Spark uses in legacy mode
When the format is yyyyMMdd, the GPU supports only 8-digit strings. Spark also accepts 7-digit strings such as 2024101, which the GPU does not support.
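
For the last limitation, a caller could pre-check input before relying on the GPU path. A minimal Python sketch follows; the helper name `is_eight_digit_date_string` is hypothetical, and this is not how the plugin itself validates input.

```python
import re

# Exactly eight ASCII digits, e.g. "18961001"; "2024101" (7 digits) is rejected.
_EIGHT_DIGITS = re.compile(r"[0-9]{8}")

def is_eight_digit_date_string(text):
    """Return True if `text` has the 8-digit shape required for yyyyMMdd on the GPU."""
    return text is not None and _EIGHT_DIGITS.fullmatch(text) is not None

print(is_eight_digit_date_string("18961001"))  # True
print(is_eight_digit_date_string("2024101"))   # False
```

This checks only the shape of the string, not whether the digits form a valid calendar date.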
