Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support TimeAdd for non-UTC time zone #10068

Closed
wants to merge 26 commits into from

Conversation

thirtiseven
Copy link
Collaborator

Closes #10067

This PR supports TimeAdd for non-UTC timezone. Removed GpuTimeMath because it is only overridden by GpuTimeAdd ( only in 311 shim) and GpuDateAddInterval (columnarEval is overridden). Also cleaned up some integration tests.

Perf test results to be updated.

@thirtiseven thirtiseven self-assigned this Dec 18, 2023
Signed-off-by: Haoyang Li <[email protected]>
@sameerz sameerz added the feature request New feature or request label Dec 18, 2023
@sperlingxx sperlingxx self-requested a review December 19, 2023 03:15
def test_timeadd(data_gen):
days, seconds = data_gen
assert_gpu_and_cpu_are_equal_collect(
# We are starting at year 0005 to make sure we don't go before year 0001
# and beyond year 10000 while doing TimeAdd
lambda spark: unary_op_df(spark, TimestampGen(start=datetime(5, 1, 1, tzinfo=timezone.utc), end=datetime(15, 1, 1, tzinfo=timezone.utc)), seed=1)
.selectExpr("a + (interval {} days {} seconds)".format(days, seconds)))
.selectExpr("a + (interval {} days {} seconds)".format(days, seconds)),
conf = {'spark.rapids.sql.nonUTC.enabled': True})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this configuration any more.

@@ -473,7 +473,7 @@ object GpuScalar extends Logging {
*
* This class is introduced because many expressions require both the cudf Scalar and its
* corresponding Scala value to complete their computations. e.g. 'GpuStringSplit',
* 'GpuStringLocate', 'GpuDivide', 'GpuDateAddInterval', 'GpuTimeMath' ...
* 'GpuStringLocate', 'GpuDivide', 'GpuDateAddInterval', 'GpuTimeAdd' ...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Why name changed? It seems different over different Spark version. We can comment both in GpuTimeAdd/GpuTimeMath.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GpuTimeMath was an abstract class being implemented by GpuTimeAdd and GpuDateAddInterval. I removed it because the reusability of the two classes is actually poor.

GpuTimeZoneDB.fromTimestampToUtcTimestamp(utcRes, zoneId)
}
}
GpuColumnVector.from(res, dataType)
Copy link
Collaborator

@res-life res-life Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be careful, seems res leaked.

@thirtiseven
Copy link
Collaborator Author

Behavior seems to mismatch with cpu when timestamp is long overflow. Will check and fix it, converting to draft

@thirtiseven thirtiseven marked this pull request as draft December 25, 2023 03:11
@thirtiseven thirtiseven marked this pull request as ready for review December 29, 2023 02:51
@thirtiseven
Copy link
Collaborator Author

Perf test results on 50000000 timestamps from big data gen:

spark.time(df.selectExpr("COUNT(a + INTERVAL '1 02:03:04' DAY TO SECOND) as a1", "COUNT(a + INTERVAL '1 02:03:04' DAY TO SECOND) as a2", "COUNT(a + INTERVAL '1 02:03:04' DAY TO SECOND) as a3", "COUNT(a + INTERVAL '1 02:03:04' DAY TO SECOND) as a4", "COUNT(a + INTERVAL '1 02:03:04' DAY TO SECOND) as a5", "COUNT(a + INTERVAL '1 02:03:04' DAY TO SECOND) as a6", "COUNT(a + INTERVAL '1 02:03:04' DAY TO SECOND) as a7", "COUNT(a + INTERVAL '1 02:03:04' DAY TO SECOND) as a8", "COUNT(a + INTERVAL '1 02:03:04' DAY TO SECOND) as a9", "COUNT(a + INTERVAL '1 02:03:04' DAY TO SECOND) as a10", "COUNT(a + INTERVAL '1 02:03:04' DAY TO SECOND) as a11", "COUNT(a + INTERVAL '1 02:03:04' DAY TO SECOND) as a12", "COUNT(a + INTERVAL '1 02:03:04' DAY TO SECOND) as a13").show())
GPU Time (ms) CPU Time (ms) Speed up
370.2 8,124.6 21.95x

@thirtiseven thirtiseven requested a review from res-life December 29, 2023 03:15
@res-life
Copy link
Collaborator

We can use this class to test perf:

https://github.com/NVIDIA/spark-rapids/blob/branch-24.02/tests/src/test/scala/com/nvidia/spark/rapids/timezone/TimeZonePerfSuite.scala

Add a new case in this suite.

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

Perf test results with TimeZonePerfSuite:

test,type,zone,used MS
time_add,Cpu,Asia/Shanghai,9557
time_add,Gpu,Asia/Shanghai,189
time_add,Cpu,Asia/Shanghai,9397
time_add,Gpu,Asia/Shanghai,181
time_add,Cpu,Asia/Shanghai,9303
time_add,Gpu,Asia/Shanghai,186
time_add,Cpu,Asia/Shanghai,9376
time_add,Gpu,Asia/Shanghai,167
time_add,Cpu,Asia/Shanghai,9478
time_add,Gpu,Asia/Shanghai,154
time_add, Asia/Shanghai: mean cpu time: 9422.20 ms, mean gpu time: 175.40 ms, speedup: 53.72 x

Also added mean and speed up stats for it.

@@ -133,10 +164,56 @@ case class GpuTimeAdd(start: Expression,
}
}

// A tricky way to check overflow. The result is overflow when positive + positive = negative
// or negative + negative = positive, so we can check the sign of the result is the same as
// the sign of the operands.
private def timestampAddDuration(cv: ColumnView, duration: BinaryOperable): ColumnVector = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dulicated.
Could we extract this function into a file like: datetimeExpressionsUtils.scala. It seems that it applys for all Spark versions, so do not put this funciton into a shim.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already has overflow check utility: AddOverflowChecks.basicOpOverflowCheck

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks.

@@ -60,6 +67,15 @@ def test_timeadd_daytime_column():
assert_gpu_and_cpu_are_equal_collect(
lambda spark: gen_df(spark, gen_list).selectExpr("t + d", "t + INTERVAL '1 02:03:04' DAY TO SECOND"))

@pytest.mark.skipif(is_before_spark_330(), reason='DayTimeInterval is not supported before Pyspark 3.3.0')
@allow_non_gpu(*non_supported_tz_allow)
def test_timeadd_daytime_column_long_overflow():
Copy link
Collaborator

@res-life res-life Dec 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to ensure the random df will 100% overflow?
Maybe specify some constant variables to ensure overflow.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By not making it actually random.

DayTimeIntervalGen Has both a min_value and a max_value. You could set it up so all of the values generated would overflow. You might need to also remove the special cases and disable nulls to be 100% sure of it.

def __init__(self, min_value=MIN_DAY_TIME_INTERVAL, max_value=MAX_DAY_TIME_INTERVAL, start_field="day", end_field="second",

You could also use SetValuesGen with only values in it that would overflow.

class SetValuesGen(DataGen):

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to SetValuesGen.

integration_tests/src/main/python/date_time_test.py Outdated Show resolved Hide resolved
@@ -60,6 +67,15 @@ def test_timeadd_daytime_column():
assert_gpu_and_cpu_are_equal_collect(
lambda spark: gen_df(spark, gen_list).selectExpr("t + d", "t + INTERVAL '1 02:03:04' DAY TO SECOND"))

@pytest.mark.skipif(is_before_spark_330(), reason='DayTimeInterval is not supported before Pyspark 3.3.0')
@allow_non_gpu(*non_supported_tz_allow)
def test_timeadd_daytime_column_long_overflow():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By not making it actually random.

DayTimeIntervalGen Has both a min_value and a max_value. You could set it up so all of the values generated would overflow. You might need to also remove the special cases and disable nulls to be 100% sure of it.

def __init__(self, min_value=MIN_DAY_TIME_INTERVAL, max_value=MAX_DAY_TIME_INTERVAL, start_field="day", end_field="second",

You could also use SetValuesGen with only values in it that would overflow.

class SetValuesGen(DataGen):

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

thirtiseven commented Jan 3, 2024

Current code will randomly fail tests. If the results are inside overlap or gap, they may not match with cpu.

I think this is because TimeAdd uses a different way to handle overlap or gap in Spark.

ZonedDateTime in Java (doc) will try to keep the offset of the additive number if the results are in overlap or gap, while this PR will ignore the offset of the additive number and simply convert the timestamp results in utc back to the given timezone.

We can't access offset values of data in plugin side, so this behavior needs a kernel to match. It won't be difficult, and we can also add the long overflow check to the kernel.

For now, the TimeAdd in this PR can be off by default, or note the difference in the compatibility doc, or just hang there to wait for kernel if we will do it soon.

Two cases that will always fail:

@pytest.mark.skipif(is_before_spark_330(), reason='DayTimeInterval is not supported before Pyspark 3.3.0')
@allow_non_gpu(*non_supported_tz_allow)
def test_timeadd_daytime_column_normal():
    gen_list = [
        # timestamp column max year is 1000
        ('t', TimestampGen(start=datetime(1900, 12, 31, 15, tzinfo=timezone.utc), end=datetime(1900, 12, 31, 16, tzinfo=timezone.utc))),
        # max days is 8000 year, so added result will not be out of range
        ('d', DayTimeIntervalGen(min_value=timedelta(seconds=0), max_value=timedelta(seconds=0)))]
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: gen_df(spark, gen_list, length=2048).selectExpr("t", "d", "t + d"))

@pytest.mark.parametrize('data_gen', [(0, 1)], ids=idfn)
@allow_non_gpu(*non_supported_tz_allow)
def test_timeadd_special(data_gen):
    days, seconds = data_gen
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, TimestampGen(start=datetime(1900, 12, 31, 15, 55, tzinfo=timezone.utc), end=datetime(1900, 12, 31, 16, tzinfo=timezone.utc)), length=100)
            .selectExpr("a + (interval {} days {} seconds)".format(days, seconds)))

@thirtiseven thirtiseven marked this pull request as draft January 3, 2024 12:11
integration_tests/src/main/python/date_time_test.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/date_time_test.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/date_time_test.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/date_time_test.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/date_time_test.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/date_time_test.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/date_time_test.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/date_time_test.py Outdated Show resolved Hide resolved
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

@revans2 Sorry I forgot to revert my test code when investigating failed cases. This PR have something wrong now, please check my comment above

Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

New perf test results:

test,type,zone,used MS
time_add,Cpu,Asia/Shanghai,8164
time_add,Gpu,Asia/Shanghai,189
time_add,Cpu,Asia/Shanghai,8094
time_add,Gpu,Asia/Shanghai,186
time_add,Cpu,Asia/Shanghai,8217
time_add,Gpu,Asia/Shanghai,190
time_add,Cpu,Asia/Shanghai,8181
time_add,Gpu,Asia/Shanghai,172
time_add,Cpu,Asia/Shanghai,8132
time_add,Gpu,Asia/Shanghai,172
time_add, Asia/Shanghai: mean cpu time: 8157.60 ms, mean gpu time: 181.80 ms, speedup: 44.87 x
time_add,Cpu,Japan,8205
time_add,Gpu,Japan,165
time_add,Cpu,Japan,8129
time_add,Gpu,Japan,168
time_add,Cpu,Japan,8248
time_add,Gpu,Japan,179
time_add,Cpu,Japan,8182
time_add,Gpu,Japan,160
time_add,Cpu,Japan,8040
time_add,Gpu,Japan,159
time_add, Japan: mean cpu time: 8160.80 ms, mean gpu time: 166.20 ms, speedup: 49.10 x

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

Depends on NVIDIA/spark-rapids-jni#1700

@thirtiseven thirtiseven marked this pull request as ready for review January 16, 2024 11:02
import com.nvidia.spark.rapids.jni.GpuTimeZoneDB

object datetimeExpressionsUtils {
def timestampAddDuration(cv: ColumnVector, duration: BinaryOperable,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add some comments/checks here about the types expected. They can be assertions that get turned off in production if we want. I just want it to be very clear what is and is not supported for the types here. As soon as we start to try and do things like bitCastTo I get a little nervous that we might get errors showing up over time if we are not clear/defensive now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

override def left: Expression = start
override def right: Expression = interval

override def toString: String = s"$left - $right"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is an add why do we show it as left - right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven thirtiseven changed the base branch from branch-24.02 to branch-24.04 January 27, 2024 05:15
@thirtiseven thirtiseven marked this pull request as draft February 20, 2024 06:59
@thirtiseven
Copy link
Collaborator Author

Close for not planned recently

@thirtiseven thirtiseven closed this Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support TimeAdd for non-UTC time zone
5 participants