Add timestamp parser to parse timestamp string with time zone #1539

res-life · 2023-11-06T01:04:06Z

contributes to NVIDIA/spark-rapids#6846
contributes to #1654
closes #1721

Add kernel code for SparkDateTimeUtils functions:

SparkDateTimeUtils.stringToTimestamp
SparkDateTimeUtils.stringToTimestampWithoutTimeZone
SparkDateTimeUtils.stringToTimestampAnsi

This PR contains 3 parts:

parser
time rebase part: rebase a local time (year, month, day, hour, minute, second) in a time zone to UTC time.
JNI part.
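
The time rebase step can be pictured with a small Python sketch. It is only an illustration of the idea, not the kernel code in this PR: the function name `local_to_utc_micros` and its parameters are hypothetical, and Python's `datetime` only covers years 1 to 9999, which is narrower than what the parser accepts.

```python
from datetime import datetime, timedelta, timezone

# The UTC epoch that the microsecond count is measured from.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def local_to_utc_micros(year, month, day, hour, minute, second, micros, tz):
    """Interpret the fields as wall-clock time in `tz` (any datetime.tzinfo)
    and return microseconds since the UTC epoch.

    Hypothetical helper for illustration only.
    """
    local = datetime(year, month, day, hour, minute, second, micros, tzinfo=tz)
    # Subtracting two aware datetimes accounts for the UTC offset of `tz`.
    delta = local - EPOCH
    # Use integer arithmetic to avoid float rounding for large values.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


# Example: 08:00:01.000005 local time at UTC+8 is one second and five
# microseconds after the epoch.
utc_plus_8 = timezone(timedelta(hours=8))
example = local_to_utc_micros(1970, 1, 1, 8, 0, 1, 5, utc_plus_8)
```

A region-based zone object such as `zoneinfo.ZoneInfo("Asia/Shanghai")` could be passed as `tz` instead of a fixed offset. How gaps and overlaps around DST transitions should be resolved is not addressed by this sketch.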

Cast(string as timestamp) for special strings: now, today, ...

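from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField("c1", StringType()),
    StructField("c2", IntegerType()),
])
data = [
    ("today", 1),
    ("now", 2),
]
# Split the rows across two partitions, then register the DataFrame as a temp view.
rdd = spark.sparkContext.parallelize(data, numSlices=2)
spark.createDataFrame(rdd, schema).createOrReplaceTempView("tab")
spark.sql("select cast(c1 as timestamp) from tab").show()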

Spark 311: returns non-null values; special strings are supported.
Spark 320 and later: returns null values; special strings are not supported.

Note:

This kernel does not support special strings: now, today, ...
A separate PR will handle cast to date.
Parsing time-only strings is not supported; refer to the follow-up issue.

Signed-off-by: sperlingxx [email protected]
Signed-off-by: Chong Gao [email protected]

winningsix

Later, please add some UTs.

src/main/cpp/src/datetime_parser.cu

src/main/cpp/src/datetime_parser.hpp

src/main/cpp/src/datetime_parser.cu

revans2

It would also be nice to clarify that this is for a single timestamp format. CSV/JSON and others allow you to configure the timestamp format. That is not a requirement here, but a follow on issue should be filed for it.

src/main/cpp/src/datetime_parser.cu

res-life · 2023-12-14T02:32:35Z

This PR contains 3 parts:

parser: ready for review.
time rebase part: rebase a local time (year, month, day, hour, minute, second) in a time zone to UTC time. @sperlingxx will add this part in this PR.
JNI part. Will add later.

@revans2 Could you first review the parser part?

res-life · 2023-12-14T05:21:15Z

It would also be nice to clarify that this is for a single timestamp format. CSV/JSON and others allow you to configure the timestamp format. That is not a requirement here, but a follow on issue should be filed for it.

Currently supported formats:

  `[+-]yyyy*`
  `[+-]yyyy*-[m]m`
  `[+-]yyyy*-[m]m-[d]d`
  `[+-]yyyy*-[m]m-[d]d `
  `[+-]yyyy*-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
  `[+-]yyyy*-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
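
As a rough illustration of the shapes listed above, here is a Python regular-expression sketch. It is not the CUDA parser from this PR: the names `TIMESTAMP_RE` and `parse_fields` are hypothetical, the fractional part and zone id are assumed to be optional, the zone id is assumed to start with a letter or a sign, and no range checks (month 1 to 12, year limits, valid zone ids) are performed.

```python
import re

# Verbose-mode pattern: whitespace and comments are ignored outside
# character classes, so "[ ]" is a literal space.
TIMESTAMP_RE = re.compile(
    r"""
    (?P<sign>[+-])?(?P<year>\d{4,})        # optional sign, four or more year digits
    (?:-(?P<month>\d{1,2})                 # one or two month digits
      (?:-(?P<day>\d{1,2})                 # one or two day digits
        (?:
          [ ]                              # date followed by a single trailing space
          |
          [ T]                             # date and time separated by space or T
          (?P<hour>\d{1,2}):(?P<minute>\d{1,2}):(?P<second>\d{1,2})
          (?:\.(?P<fraction>\d{1,6}))?     # up to microsecond precision
          (?P<zone>[A-Za-z+-].*)?          # remainder is taken as the zone id
        )?
      )?
    )?
    """,
    re.VERBOSE,
)


def parse_fields(text):
    """Return the matched fields as a dict of strings, or None when `text`
    fits none of the listed shapes. Hypothetical helper for illustration."""
    match = TIMESTAMP_RE.fullmatch(text)
    if match is None:
        return None
    # Drop groups that did not take part in the match.
    return {name: value for name, value in match.groupdict().items() if value is not None}
```

For example, `parse_fields("2023-11-6")` yields the year, month, and day fields, while `parse_fields("2023-11-06 01:04")` returns `None` because the seconds are missing.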

ToUnixTimestamp and GetTimestamp require a format parameter. Refer to the Spark code:

val formatter = formatterOption.getOrElse(getFormatter(fmt.toString))
formatter.parse()

I think our GPU implementation currently does not support non-UTC time zones:

def parseStringAsTimestamp(

We may need a new kernel to replace the current GPU implementation of parseStringAsTimestamp.

res-life · 2023-12-14T08:27:26Z

Later, please add some UTs.

Done.

src/main/cpp/src/datetime_parser.cu

res-life · 2024-01-02T07:04:10Z

@revans2 @hyperbolic2346 Could you review this PR?

This PR contains 2 parts:

Parser: ready for review.
JNI part. Ready for review. But will be modified by Alfred to add more parameters.

Alfred will post a PR for a sub-task.

Time rebase part. Not ready for review. (Alfred will post)

Maybe we can merge this PR first and then review Alfred's PR (based on this PR).

revans2

The code looks okay to me. I am not thrilled with this only being half done and the APIs give no indication of that. At a minimum they need to be marked in some way so that they are not used while the timezone parts are added in.

I am also not thrilled with the tests being set to test that the answer is wrong. I would like to see each test tagged that it is doing the wrong thing and ideally have code (commented out if needed) for what the correct result really is.

src/main/java/com/nvidia/spark/rapids/jni/CastStrings.java

src/test/java/com/nvidia/spark/rapids/jni/CastStringsTest.java

src/main/cpp/tests/datetime_parser.cpp

src/main/cpp/src/datetime_parser.cu

NVnavkumar · 2024-01-03T00:21:22Z

The timestamp rebase that is referenced in the TODOs ultimately is the requirement for #6846. Can we update the description so that merging this PR doesn't close that issue? I think the TODOs here should ultimately reference that issue.

From what I'm gathering, this code is currently purely an implementation of the Spark date time parser so that we're more consistent with Spark. This should ultimately close the overflow bug here: NVIDIA/spark-rapids#10083.

res-life · 2024-01-24T10:48:24Z

build

res-life · 2024-01-25T01:40:32Z

Two follow-up tasks:

Replace looseInstant
GpuTimeZoneDB missed some time zones which are not normalized.

res-life · 2024-01-25T02:14:39Z

build

res-life · 2024-01-25T11:50:30Z

[commit]:

Fixed in 97a8f8f: GpuTimeZoneDB missed some time zones that are not normalized.
Removed the binary search for short IDs, because TimeZone.getAvailableIDs already contains all short IDs.

Only one follow-up task remains:
Replace looseInstant with epoch to save computation. This will target release 24.04.

res-life · 2024-01-25T11:53:23Z

@revans2 Please help review.

hyperbolic2346

Thanks for working on this. It is a very complex and daunting task, I am sure. My comments look worse than they really are; this is coming along nicely.

src/main/cpp/src/datetime_parser.hpp

src/main/java/com/nvidia/spark/rapids/jni/GpuTimeZoneDB.java

src/main/cpp/src/datetime_parser.cu

res-life · 2024-01-26T13:17:19Z

@hyperbolic2346 Thanks a lot for your review.
Only this comment is not done.

It is not ideal to pass in these pointers like this. Better would be having the parse_string_to_timestamp_us return those things via a move. Then they could be const as well.
auto [ts_comp, tz_lit_ptr, tz_lit_len, result] = parse_string_to_timestamp_us(d_str);
switch (result) {
 ...
I'm not strongly against this, but it is more the cudf way.

res-life · 2024-01-26T13:19:36Z

build

res-life · 2024-01-30T06:55:41Z

@hyperbolic2346 I addressed all the comments, please review again.

res-life · 2024-01-30T06:55:51Z

build

revans2

A few nits. I still want us to have a unified way of dealing with the gap, and I'd rather not incur a lot of tech debt just because we don't want to stop and think about it now.

revans2 · 2024-01-30T14:57:43Z

src/main/java/com/nvidia/spark/rapids/jni/GpuTimeZoneDB.java

  // use this reference to indicate if time zone cache is initialized.
+  // `fixedTransitions` saves transitions for deduplicated time zones, diferent time zones


nit: "different" is misspelled.

revans2 · 2024-01-30T14:59:52Z

src/main/java/com/nvidia/spark/rapids/jni/GpuTimeZoneDB.java

+      // Note: Spark uses ZoneId.SHORT_IDS
+      // `TimeZone.getAvailableIDs` contains all keys in `ZoneId.SHORT_IDS`
+      // So do not need extra work for ZoneId.SHORT_IDS, here just check this assumption
+      for (String tz : ZoneId.SHORT_IDS.keySet()) {


nit: It might be nice to only do this when assertions are enabled, but this is really minor.

hyperbolic2346 · 2024-01-31T03:06:44Z

My review is for the C++ changes, please address other comments and requests before merging.

res-life · 2024-01-31T09:10:01Z

A few nits. I still want us to have a unified way of dealing with the gap, and I'd rather not incur a lot of tech debt just because we don't want to stop and think about it now.

Working on the design for DST/TimeAdd. After the design doc is reviewed, we will develop the unified way first.

NVnavkumar · 2024-02-12T19:40:12Z

Converted this to draft, since a newer unified way is being developed to parse timestamps.

res-life · 2024-04-11T05:53:55Z

Closing this for now; will reopen if it's needed again.

res-life requested review from revans2 and hyperbolic2346 November 6, 2023 01:04

winningsix reviewed Nov 6, 2023

res-life changed the base branch from branch-23.12 to branch-24.02 November 28, 2023 02:01

res-life mentioned this pull request Dec 8, 2023

[FEA] non-UTC time zone feature needs a new kernel to parse time string in a local time zone. NVIDIA/spark-rapids#9997

Closed

sperlingxx reviewed Dec 8, 2023

sperlingxx reviewed Dec 8, 2023

sperlingxx reviewed Dec 8, 2023

res-life force-pushed the timestamp-parser branch from 2320805 to 6854225 Compare December 12, 2023 10:08

revans2 reviewed Dec 12, 2023

res-life assigned res-life and unassigned res-life Dec 14, 2023

res-life mentioned this pull request Dec 15, 2023

[FEA] New kernel to support parsing dates/timestamps string with a timezone parameter. #1655

Closed

3 tasks

res-life commented Dec 27, 2023

res-life marked this pull request as ready for review January 2, 2024 06:37

res-life changed the title ~~Add timestamp parser to parse timestamp string with time zone~~ Part 1: Add timestamp parser to parse timestamp string with time zone Jan 2, 2024

revans2 reviewed Jan 2, 2024

res-life changed the title ~~Part 1: Add timestamp parser to parse timestamp string with time zone~~ Add timestamp parser to parse timestamp string with time zone Jan 8, 2024

res-life marked this pull request as draft January 8, 2024 06:15

Chong Gao added 6 commits January 9, 2024 12:26

Add timestamp parser

b15c762

Signed-off-by: Chong Gao <[email protected]>

Refine parser

73e0f7e

Update

df60772

Update

a4a83c0

Fix bitmask; Parse special timestamp strings: now, today ...; Add Ansi mode check

89eef6b

Format code

533f590

sameerz added the feature request label Jan 25, 2024

Refector GpuTimeZoneDB; Add comment for year has max 6 digits

c8dffb1

Chong Gao added 3 commits January 25, 2024 10:18

format cpp code

4104173

Remove .clang-format

5af012c

Fix do not support non-normalized time zone, like: Etc/GMT; Optimize short time zone ID handling, remove binary search on short IDs

97a8f8f

hyperbolic2346 requested changes Jan 26, 2024

Chong Gao added 4 commits January 26, 2024 13:56

Refector to address comments

0a7efd9

Merge branch 'branch-24.02' into timestamp-parser

f947fbd

Fix cases

21f99db

Fix cudaErrorIllegalAddress error; Fix null pointer bug

6ddb91c

Update comments

863cb83

res-life changed the base branch from branch-24.02 to branch-24.04 January 29, 2024 02:11

Refactor

de74645

revans2 reviewed Jan 30, 2024

hyperbolic2346 approved these changes Jan 31, 2024

NVnavkumar marked this pull request as draft February 12, 2024 19:39

res-life closed this Apr 11, 2024