Support from_unixtime via Gpu for non-UTC time zone #9814

Merged
merged 10 commits into NVIDIA:branch-24.02 from FromUnixTime on Dec 7, 2023

Conversation

res-life
Collaborator

@res-life res-life commented Nov 21, 2023

closes #9605

Support from_unixtime via Gpu

  • First convert the long values to timestamp seconds, then to timestamp microseconds, then shift the time to the expected time zone (sketched below).
  • Add "is Daylight Saving Time" utils.
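A minimal Scala sketch of that conversion path (not the PR's exact code): it reuses the cudf asTimestampSeconds/asTimestampMicroseconds/asStrings casts that appear in the diffs below, while shiftToTimeZone is a hypothetical stand-in for the GpuTimeZoneDB-based shift.

import java.time.ZoneId
import ai.rapids.cudf.ColumnVector
import com.nvidia.spark.rapids.Arm.withResource

// Hypothetical stand-in for the real GpuTimeZoneDB-based UTC -> zone shift.
def shiftToTimeZone(utcMicros: ColumnVector, zoneId: ZoneId): ColumnVector = ???

// Sketch: epoch seconds -> timestamp seconds -> timestamp microseconds
// -> shift into the target zone -> format as strings.
def fromUnixTimeSketch(
    seconds: ColumnVector,  // long column holding seconds since the epoch
    strfFormat: String,     // cudf strftime-style output format
    zoneId: ZoneId): ColumnVector = {
  withResource(seconds.asTimestampSeconds) { secondsVector =>
    withResource(secondsVector.asTimestampMicroseconds) { microsVector =>
      withResource(shiftToTimeZone(microsVector, zoneId)) { shifted =>
        shifted.asStrings(strfFormat)
      }
    }
  }
}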

Signed-off-by: Chong Gao [email protected]

@res-life
Collaborator Author

build

@res-life
Collaborator Author

Please only review the last commit: Support from_unixtime via CPU POC

@res-life
Collaborator Author

build

1 similar comment
@res-life
Collaborator Author

build

@@ -1098,6 +1098,7 @@ abstract class BaseExprMeta[INPUT <: Expression](
//+------------------------+-------------------+-----------------------------------------+
lazy val needTimezoneCheck: Boolean = {
wrapped match {
case _: FromUnixTime => false
Collaborator

why do we still need FromUnixTime given it's a sub-class of TimeZoneAwareExpression?

Collaborator Author

Done, now using the new APIs.

tsVector.asStrings(strfFormat)
withResource(lhs.getBase.asTimestampSeconds) { secondsVector =>
withResource(secondsVector.asTimestampMicroseconds) { tsVector =>
if (zoneId.normalized().equals(ZoneId.of("UTC").normalized())) {
Collaborator

Use TimeZoneDB utils instead of hard coding UTC here.
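For illustration, the change being asked for might look like the snippet below; isUTCTimezone is a hypothetical helper name here, the point is only to replace the inline ZoneId.of("UTC") comparison with a shared TimeZoneDB utility.

val result =
  if (TimeZoneDB.isUTCTimezone(zoneId)) {   // hypothetical helper name
    tsVector.asStrings(strfFormat)          // UTC: format directly
  } else {
    ???                                     // non-UTC: shift to zoneId first, then format
  }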

Collaborator Author

Done.

@res-life
Collaborator Author

Premerge failed; the failing case is: convert large InternalRow iterator to cached batch single col *** FAILED ***

[2023-11-21T10:38:24.433Z] CachedBatchWriterSuite:
[2023-11-21T10:38:26.315Z] - convert columnar batch to cached batch on single col table with 0 rows in a batch
[2023-11-21T10:38:26.883Z] - convert large columnar batch to cached batch on single col table
[2023-11-21T10:38:26.883Z] - convert large columnar batch to cached batch on multi-col table
[2023-11-21T10:38:28.766Z] - convert large InternalRow iterator to cached batch single col *** FAILED ***
[2023-11-21T10:38:28.767Z]   java.lang.RuntimeException: ai.rapids.cudf.CudfException: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-211-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:238: allocation not found
[2023-11-21T10:38:28.767Z]   at ai.rapids.cudf.ColumnVector$OffHeapState.cleanImpl(ColumnVector.java:1131)
[2023-11-21T10:38:28.767Z]   at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
[2023-11-21T10:38:28.767Z]   at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:268)
[2023-11-21T10:38:28.767Z]   at com.nvidia.spark.rapids.CachedBatchWriterSuite$TestResources.close(CachedBatchWriterSuite.scala:56)
[2023-11-21T10:38:28.767Z]   at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
[2023-11-21T10:38:28.767Z]   at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
[2023-11-21T10:38:28.767Z]   at com.nvidia.spark.rapids.CachedBatchWriterSuite.$anonfun$new$7(CachedBatchWriterSuite.scala:96)
[2023-11-21T10:38:28.767Z]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2023-11-21T10:38:28.767Z]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[2023-11-21T10:38:28.767Z]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[2023-11-21T10:38:28.767Z]   ...
[2023-11-21T10:38:28.767Z]   Cause: ai.rapids.cudf.CudfException: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-211-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:238: allocation not found
[2023-11-21T10:38:28.767Z]   at ai.rapids.cudf.Rmm.free(Native Method)
[2023-11-21T10:38:28.767Z]   at ai.rapids.cudf.DeviceMemoryBuffer$DeviceBufferCleaner.cleanImpl(DeviceMemoryBuffer.java:50)
[2023-11-21T10:38:28.767Z]   at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
[2023-11-21T10:38:28.767Z]   at ai.rapids.cudf.MemoryBuffer.close(MemoryBuffer.java:247)
[2023-11-21T10:38:28.767Z]   at ai.rapids.cudf.ColumnVector$OffHeapState.cleanImpl(ColumnVector.java:1115)
[2023-11-21T10:38:28.767Z]   at ai.rapids.cudf.MemoryCleaner$Cleaner.clean(MemoryCleaner.java:117)
[2023-11-21T10:38:28.767Z]   at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:268)
[2023-11-21T10:38:28.767Z]   at com.nvidia.spark.rapids.CachedBatchWriterSuite$TestResources.close(CachedBatchWriterSuite.scala:56)
[2023-11-21T10:38:28.767Z]   at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:61)
[2023-11-21T10:38:28.767Z]   at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:31)
[2023-11-21T10:38:28.767Z]   ...

@res-life
Collaborator Author

build

@res-life res-life changed the title Support from_unixtime via CPU POC for non-UTC time zone Support from_unixtime via Gpu for non-UTC time zone Nov 27, 2023
@sameerz sameerz added the feature request New feature or request label Nov 28, 2023
@res-life res-life changed the base branch from branch-23.12 to branch-24.02 November 28, 2023 01:58
@@ -1130,7 +1130,7 @@ abstract class BaseExprMeta[INPUT <: Expression](
if (!isTimeZoneSupported) return checkUTCTimezone(this)

// Level 4 check
if (TimeZoneDB.isSupportedTimezone(getZoneId())) {
if (!TimeZoneDB.isSupportedTimezone(getZoneId())) {
Collaborator

Switch to GpuTimeZoneDB.isSupportedTimezone. It's the same logic but in a class that will not be removed in the future.
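As a sketch, the Level 4 check after that switch could read roughly as below; it assumes GpuTimeZoneDB exposes the same isSupportedTimezone signature, and the willNotWorkOnGpu message is illustrative.

// Level 4 check (sketch): same logic, routed through GpuTimeZoneDB
if (!GpuTimeZoneDB.isSupportedTimezone(getZoneId())) {
  willNotWorkOnGpu(s"the time zone ${getZoneId()} is not supported")  // illustrative message
}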

Collaborator Author

Done.

@res-life res-life marked this pull request as ready for review December 5, 2023 00:52
@res-life
Collaborator Author

res-life commented Dec 5, 2023

build

@res-life
Collaborator Author

res-life commented Dec 5, 2023

build

@@ -0,0 +1,630 @@
# Copyright (c) 2021-2022, NVIDIA CORPORATION.
Collaborator Author

Should be 2023; will update.

Collaborator Author

Done

@res-life
Collaborator Author

res-life commented Dec 5, 2023

build

# limitations under the License.

# get from Java:
# ZoneId.getAvailableZoneIds
Collaborator

nit: The problem is that the time zone backend in Java is pluggable. It can change from one version of Java to another, and the TimeZoneProvider can also change it within a single version of Java. Also, this does not actually reflect all of the possible time zones: these are the normalized ones. There are deprecated ones that Java still kind of supports but maps to these, and there are fixed time zone offsets that are not covered by this and are technically any offset up to 24 hours at second granularity. This is minor because it is not likely to change, but if we could take the time zone, send it to Java to check whether it is valid, and then memoize the result instead, that would make me feel better about this.
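A small Scala sketch of that "ask Java, then memoize" idea (names are illustrative, not the PR's final API): java.time is consulted once per zone string and the answer is cached.

import java.time.ZoneId
import scala.collection.concurrent.TrieMap

object TimeZoneValiditySketch {
  private val cache = TrieMap.empty[String, Boolean]

  // Let java.time decide whether the id parses (fixed offsets included)
  // and memoize the result per zone string.
  def isValidTimeZone(tz: String): Boolean =
    cache.getOrElseUpdate(tz,
      try { ZoneId.of(tz); true } catch { case _: Exception => false })
}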

Collaborator

I think we can potentially handle this as part of #9747.

Collaborator

Actually this issue: #9633

Collaborator Author

Now forwarded to Java to check:

jvm.org.apache.spark.sql.rapids.TimeZoneDB.isSupportedTimeZone(tz)

I tested it; it works in the pytest marker.

NVnavkumar
NVnavkumar previously approved these changes Dec 5, 2023
@NVnavkumar NVnavkumar dismissed their stale review December 5, 2023 18:47

Didn't see newer comments

@res-life
Collaborator Author

res-life commented Dec 6, 2023

build

@res-life
Collaborator Author

res-life commented Dec 6, 2023

Tested the Iran and America/Los_Angeles time zones.
It falls back when TZ is America/Los_Angeles.

tz = get_test_tz()
jvm = spark_jvm()
ret = jvm.org.apache.spark.sql.rapids.TimeZoneDB.isSupportedTimeZone(tz)
print("my debug: is_supported_time_zone " + str(ret))
Collaborator

If this print is useful, it should not be "my debug".

Collaborator Author

Done, removed the debug print.

"""
tz = get_test_tz()
jvm = spark_jvm()
ret = jvm.org.apache.spark.sql.rapids.TimeZoneDB.isSupportedTimeZone(tz)
Collaborator

Could we memoize this? Calling back into the JVM can be expensive, and if this is never going to change I would much rather have us keep the result cached.

Collaborator

Can we also use the GpuTimeZoneDB version of this?

Collaborator Author

Now the supported-time-zone info is cached, and it is updated to use GpuTimeZoneDB.

@res-life
Collaborator Author

res-life commented Dec 7, 2023

build

NVnavkumar
NVnavkumar previously approved these changes Dec 7, 2023
Collaborator

@NVnavkumar NVnavkumar left a comment

LGTM

winningsix
winningsix previously approved these changes Dec 7, 2023
@res-life res-life dismissed stale reviews from winningsix and NVnavkumar via e7b1e7e December 7, 2023 07:19
@res-life res-life changed the title Support from_unixtime via Gpu for non-UTC time zone Support from_unixtime via Gpu for non-UTC time zone [databricks] Dec 7, 2023
@res-life
Collaborator Author

res-life commented Dec 7, 2023

build

@res-life res-life changed the title Support from_unixtime via Gpu for non-UTC time zone [databricks] Support from_unixtime via Gpu for non-UTC time zone Dec 7, 2023
@res-life
Collaborator Author

res-life commented Dec 7, 2023

build

@res-life res-life merged commit 2805b95 into NVIDIA:branch-24.02 Dec 7, 2023
37 checks passed
@res-life res-life deleted the FromUnixTime branch December 7, 2023 10:39