-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add timestamp parser to parse timestamp string with time zone #1539
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Later, please add some UTs.
2320805
to
6854225
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be nice to clarify that this is for a single timestamp format. CSV/JSON and others allow you to configure the timestamp format. That is not a requirement here, but a follow on issue should be filed for it.
This PR contains 3 parts:
@revans2 Could you first review the parser part? |
Currently suports formats:
For ToUnixTimestamp and GetTimestamp, they require a format parameter. Refer to Spark link
I think our GPU implemetation currently does not support non-utc:
We may need a new kernel to replace current GPU implemetation |
Done. |
@revans2 @hyperbolic2346 Could you review this PR. This PR contains 2 parts:
Alfred will post a PR for a sub-task.
Maybe we can first merge this PR and then review Alfred's PR(based on this PR). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks okay to me. I am not thrilled with this only being half done and the APIs give no indication of that. At a minimum they need to be marked in some way so that they are not used while the timezone parts are added in.
I am also not thrilled with the tests being set to test that the answer is wrong. I would like to see each test tagged that it is doing the wrong thing and ideally have code (commented out if needed) for what the correct result really is.
The timestamp rebase that is referenced in the TODOs ultimately is the requirement for #6846. Can we update the description so that merging this PR doesn't close that issue? I think the TODOs here should ultimately reference that issue. Right now from what I'm gathering, this code right now is purely an implementation of the Spark date time parser so that we're more consistent with Spark. This should ultimately close the overflow bug here NVIDIA/spark-rapids#10083. |
Signed-off-by: Chong Gao <[email protected]>
build |
Two follow-up tasks:
|
build |
…short time zone ID handling, remove binary search on short IDs
[commit]:
Only one follow-up task: |
@revans2 Help review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. It is very complex and daunting of a task, I am sure. My comments look worse than they really are, this is coming along nicely.
@hyperbolic2346 Thanks a lot for your review.
|
build |
@hyperbolic2346 I addressed all the comments, please review agian. |
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few nits. I still want us to have a unified way of dealing with the gap, and I'd rather not incur a lot of tech debt just because we don't want to stop and think about it now.
// use this reference to indicate if time zone cache is initialized. | ||
// `fixedTransitions` saves transitions for deduplicated time zones, diferent time zones |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit different
is misspelled.
// Note: Spark uses ZoneId.SHORT_IDS | ||
// `TimeZone.getAvailableIDs` contains all keys in `ZoneId.SHORT_IDS` | ||
// So do not need extra work for ZoneId.SHORT_IDS, here just check this assumption | ||
for (String tz : ZoneId.SHORT_IDS.keySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It might be nice to only do this when assertions are enabled, but this is really minor.
My review is for the C++ changes, please address other comments and requests before merging. |
Working on the design for DST/TimeAdd, after the design doc are reviewed, then develop unified way first. |
Converted this to draft, since a newer unified way is being developed to parse timestamps. |
Close it first, and will open if it's needed again. |
contributes to NVIDIA/spark-rapids#6846
contributes to #1654
closes #1721
Add kernel code for SparkDateTimeUtils functions:
Main logic is in: parseTimestampString:
This PR contains 3 parts:
rebase local time construction (year,month,day,hour,minute,second) in a time zone to UTC time.
Cast(string as timestamp) for special strings: now, today, ...
Note:
Signed-off-by: sperlingxx [email protected]
Signed-off-by: Chong Gao [email protected]