Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] strings::from_timestamp can overflow on valid timestamps #9790

Closed
revans2 opened this issue Nov 29, 2021 · 4 comments · Fixed by #9793
Closed

[BUG] strings::from_timestamp can overflow on valid timestamps #9790

revans2 opened this issue Nov 29, 2021 · 4 comments · Fixed by #9793
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@revans2
Copy link
Contributor

revans2 commented Nov 29, 2021

Describe the bug
When I try to convert some microsecond values to strings. scale_time appears to overflow and I get the wrong answer out.

Steps/Code to reproduce bug
This is Spark code, but I will try to get a repro case in C++ shortly.

spark.range(128849018879999995L, 128849018880000003L).selectExpr("*", "timestamp_micros(id) as ts").repartition(1).selectExpr("*", "minute(ts)").show(truncate = false)

+------------------+--------------------------+----------+
|id                |ts                        |minute(ts)|
+------------------+--------------------------+----------+
|128849018879999995|6053-01-23 02:07:59.999995|7         |
|128849018879999996|6053-01-23 02:07:59.999996|7         |
|128849018879999997|6053-01-23 02:07:59.999997|7         |
|128849018879999998|6053-01-23 02:07:59.999998|7         |
|128849018879999999|6053-01-23 02:07:59.999999|7         |
|128849018880000000|6053-01-23 02:52:00       |8         |
|128849018880000001|6053-01-23 02:52:00.000001|8         |
|128849018880000002|6053-01-23 02:52:00.000002|8         |
+------------------+--------------------------+----------+

Essentially when we get to 128849018880000000 which is the micro second timestamp value. get_time_components is called to find the hour, min, sec, and sub-sec parts of the timestamp. If it is micro-seconds, then the micro-second part is calculate and the timestamp is converted to seconds.
128849018880000000L / 1000000 => 21474836480

Then scale_time(tstamp, 60) is called, to convert the timestamp into mins instead of seconds. But 21474836480/60 is 2147483648, and scale_time returns an int32_t, so it overflows (Int.MaxValue is 2147483647).

Expected behavior
We get the correct answer back.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify labels Nov 29, 2021
@davidwendt davidwendt self-assigned this Nov 29, 2021
@revans2
Copy link
Contributor Author

revans2 commented Nov 29, 2021

diff --git a/cpp/src/strings/convert/convert_datetime.cu b/cpp/src/strings/convert/convert_datetime.cu
index 51a6a796ba..8d0c5704a7 100644
--- a/cpp/src/strings/convert/convert_datetime.cu
+++ b/cpp/src/strings/convert/convert_datetime.cu
@@ -707,9 +707,9 @@ struct from_timestamp_base {
    *     scale( 61,60) ->  1
    * @endcode
    */
-  __device__ int32_t scale_time(int64_t time, int64_t base) const
+  __device__ int64_t scale_time(int64_t time, int64_t base) const
   {
-    return static_cast<int32_t>((time - ((time < 0) * (base - 1L))) / base);
+    return (time - ((time < 0) * (base - 1L))) / base;
   };
 
   __device__ time_components get_time_components(int64_t tstamp) const

Appears to fix the issue. I'll try to turn it into a PR with some tests.

@revans2
Copy link
Contributor Author

revans2 commented Nov 29, 2021

@davidwendt sorry feel free to take this if you want.

@davidwendt
Copy link
Contributor

@revans2 This change you proposed here looks good. I thought it might be a more involved change but this looks correct. Feel free to post a PR with this change.

@davidwendt davidwendt assigned revans2 and unassigned davidwendt Nov 29, 2021
@davidwendt davidwendt added libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Nov 29, 2021
@revans2
Copy link
Contributor Author

revans2 commented Nov 29, 2021

Will do

@rapids-bot rapids-bot bot closed this as completed in #9793 Dec 1, 2021
rapids-bot bot pushed a commit that referenced this issue Dec 1, 2021
This fixes #9790 

When converting a timestamp to a String it is possible for the %M min calculation to overflow an int32_t part way through casting. This moves that result to be an int64_t which avoids the overflow issues.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)

URL: #9793
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants