[BUG] to_datetime and astype('datetime[]') left-pads zeros or fills with zeros to fractions of second #11350

Closed
OpenDGPS opened this issue Jul 26, 2022 · 3 comments
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@OpenDGPS

Describe the bug
Both cudf.to_datetime() and cudf.DataFrame.astype('datetime64[]') produce wrong results when the fractional-second part of a datetime string has fewer than nine digits. Instead of appending zeros to the fraction, they add leading zeros.

The result also depends on the order of the datetime strings in a dataframe column. If the column's first entry does not have nine fractional digits, the two functions return different results, and astype returns all values in the column in the same format as the first one.

Steps/Code to reproduce bug

import pandas as pd
import cudf
df = cudf.DataFrame({
    't': [
        "2022-04-11T20:14:37.123456789Z",
        "2022-04-11T20:14:37.12345Z",
        "2022-04-11T20:14:37.12345678Z",
        "2022-04-11T20:14:37.1234567Z",
        "2022-04-11T20:14:37.123456Z",
        "2022-04-11T20:14:37.1234Z"
    ],
    'tx': [
        "2022-04-11T20:14:37.123456789Z",
        "2022-04-11T20:14:37.123450000Z",
        "2022-04-11T20:14:37.123456780Z",
        "2022-04-11T20:14:37.123456700Z",
        "2022-04-11T20:14:37.123456000Z",
        "2022-04-11T20:14:37.123400000Z"
    ]})
df['t_astype'] = df['t'].astype("datetime64[ns]")
df['t_to_datetime'] = cudf.to_datetime(df['t'], format='%Y-%m-%dT%H:%M:%S.%f%z')
df['tx_astype'] = df['tx'].astype("datetime64[ns]")
df['tx_to_datetime'] = cudf.to_datetime(df['tx'], format='%Y-%m-%dT%H:%M:%S.%f%z')
df

                                t                              tx                      t_astype                 t_to_datetime                     tx_astype                tx_to_datetime
0  2022-04-11T20:14:37.123456789Z  2022-04-11T20:14:37.123456789Z 2022-04-11 20:14:37.123456789 2022-04-11 20:14:37.123456789 2022-04-11 20:14:37.123456789 2022-04-11 20:14:37.123456789
1      2022-04-11T20:14:37.12345Z  2022-04-11T20:14:37.123450000Z 2022-04-11 20:14:37.001234500 2022-04-11 20:14:37.001234500 2022-04-11 20:14:37.123450000 2022-04-11 20:14:37.123450000
2   2022-04-11T20:14:37.12345678Z  2022-04-11T20:14:37.123456780Z 2022-04-11 20:14:37.000000000 2022-04-11 20:14:37.000000000 2022-04-11 20:14:37.123456780 2022-04-11 20:14:37.123456780
3    2022-04-11T20:14:37.1234567Z  2022-04-11T20:14:37.123456700Z 2022-04-11 20:14:37.001234567 2022-04-11 20:14:37.001234567 2022-04-11 20:14:37.123456700 2022-04-11 20:14:37.123456700
4     2022-04-11T20:14:37.123456Z  2022-04-11T20:14:37.123456000Z 2022-04-11 20:14:37.001234560 2022-04-11 20:14:37.001234560 2022-04-11 20:14:37.123456000 2022-04-11 20:14:37.123456000
5       2022-04-11T20:14:37.1234Z  2022-04-11T20:14:37.123400000Z 2022-04-11 20:14:37.001234000 2022-04-11 20:14:37.001234000 2022-04-11 20:14:37.123400000 2022-04-11 20:14:37.123400000

Expected behavior
For the values of column 't' in rows 1, 3, 4, and 5, both methods return fractions of seconds with leading zeros: "...:37.00123...". The correct result would be "...:37.123...". In row 2 the fraction is replaced entirely by zeros.

Columns tx_astype and tx_to_datetime show the expected return values. These results are identical to what the corresponding pandas functions return for the values in 't'.
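Until a fix is available, one possible workaround is to normalise the strings on the host before handing them to cudf, so that every fraction has nine digits (which is exactly how 'tx' differs from 't'). The sketch below is plain Python; `pad_fraction_to_ns` is a hypothetical helper name, not part of cudf or pandas, and it assumes each string contains a single fractional-seconds field of at most nine digits.

```python
import re

# Hypothetical helper (not a cudf or pandas API): right-pad the fractional
# seconds of a timestamp string with zeros so that it has nine digits.
_FRACTION = re.compile(r"\.(\d{1,9})(?=\D|$)")

def pad_fraction_to_ns(ts: str) -> str:
    # Only the first ".digits" group is rewritten; strings without a
    # fraction (or with more than nine digits) are returned unchanged.
    return _FRACTION.sub(lambda m: "." + m.group(1).ljust(9, "0"), ts, count=1)

# The 't' values from the reproducer above.
t = [
    "2022-04-11T20:14:37.123456789Z",
    "2022-04-11T20:14:37.12345Z",
    "2022-04-11T20:14:37.12345678Z",
    "2022-04-11T20:14:37.1234567Z",
    "2022-04-11T20:14:37.123456Z",
    "2022-04-11T20:14:37.1234Z",
]

padded = [pad_fraction_to_ns(s) for s in t]
```

The padded list matches the 'tx' column, which both functions parse correctly in the output above.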

Environment overview (please complete the following information)

docker pull rapidsai/rapidsai:22.06-cuda11.5-runtime-ubuntu18.04-py3.9
docker run --gpus all --rm -it \
    --shm-size=1g --ulimit memlock=-1 \
    -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    rapidsai/rapidsai:22.06-cuda11.5-runtime-ubuntu18.04-py3.9

Environment details
CUDF from rapids docker image "22.06-cuda11.5-runtime-ubuntu18.04-py3.9"
Host environment: "5.4.0-121-generic #137~18.04.1-Ubuntu SMP"
GPU: NVIDIA 1080ti

Additional context
To minimise I/O, APIs often reduce the number of fractional-second digits for ms, us, and ns (e.g. the Alpaca stock API and IoT APIs).

The observed behaviour results in time drifts of up to one second backwards, which compromise the order of a time series.
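To make the effect on ordering concrete, here is a small plain-Python check that uses only the nanosecond fractions visible in the output table above (expected values from the tx_* columns, observed values from the t_* columns). The variable and function names are illustrative only.

```python
# Nanosecond fractions for rows 0..5 of column 't'.
# expected_ns comes from the tx_* columns, observed_ns from the t_* columns.
expected_ns = [123456789, 123450000, 123456780, 123456700, 123456000, 123400000]
observed_ns = [123456789, 1234500, 0, 1234567, 1234560, 1234000]

# How far each parsed timestamp moved backwards in time.
drift_ns = [e - o for e, o in zip(expected_ns, observed_ns)]

def sort_order(values):
    # Row indices in ascending order of their timestamp fraction.
    return sorted(range(len(values)), key=values.__getitem__)

expected_order = sort_order(expected_ns)  # [5, 1, 4, 3, 2, 0]
observed_order = sort_order(observed_ns)  # [2, 5, 1, 4, 3, 0]
largest_drift = max(drift_ns)             # 123456780 ns, from row 2
```

Every affected row drifts backwards, and sorting by the parsed values puts the rows in a different order than sorting by the true timestamps.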

@OpenDGPS added the Needs Triage and bug labels on Jul 26, 2022
@davidwendt self-assigned this on Jul 26, 2022
@davidwendt added the libcudf label and removed the Needs Triage label on Jul 26, 2022
@davidwendt
Contributor

This appears to be a bug in cudf::to_timestamps logic when processing truncated fractional seconds.

@davidwendt
Contributor

Technically, we do not support varying formats within a column of timestamp strings. The %f specifier expects to see 6 digits.

Regardless, the code has logic to not read past the end of the string, so fewer digits are handled gracefully and extra digits are ignored (though the 'Z' would not be processed correctly in this case).
The use case here uncovered a bug in this logic that accesses invalid memory, and I'll put a fix in a PR soon.

Note you can specify the precision of the %f specifier by including a digit (1-9) -- like %9f to read 9 digits (nanoseconds) for example.
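The behaviour described above can be sketched in plain Python. This is an illustration of the idea only, not the libcudf implementation: `parse_fraction` and its parameters are hypothetical names, and the scaling step is an assumption about what handling fewer digits "gracefully" should mean (treat the digits as the leading part of the fraction).

```python
def parse_fraction(text, start, precision=6):
    """Read up to `precision` digits of a fraction starting at text[start].

    Returns (value, next_index), where value is scaled as if exactly
    `precision` digits had been present.
    """
    end = start
    value = 0
    # Stop at the end of the string, at the first non-digit, or once
    # `precision` digits have been consumed.
    while end < len(text) and end - start < precision and text[end] in "0123456789":
        value = value * 10 + int(text[end])
        end += 1
    digits_read = end - start
    # Scale by the digits actually read, so "12345" at precision 9
    # means 123450000 and not 12345.
    value *= 10 ** (precision - digits_read)
    return value, end

# Fewer digits than the precision: the value is right-padded and parsing
# stops at the 'Z'.
short = parse_fraction("37.12345Z", 3, precision=9)      # (123450000, 8)

# More digits than the precision: the extra digits are left unread, so the
# next character is '7' rather than 'Z'.
long_ = parse_fraction("37.123456789Z", 3, precision=6)  # (123456, 9)
```

The second call shows why a trailing 'Z' is not matched when the string carries more digits than the specifier expects: parsing resumes in the middle of the fraction.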

rapids-bot bot pushed a commit that referenced this issue Jul 29, 2022
Fix truncated subsecond calculation to adjust for the actual digits read in decimal 10 notation.
Added new test strings to include an appended 'Z' after the subsecond digits to test this fix.

Closes #11350

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11367
@davidwendt
Contributor

Closed by #11367
