[BUG] to_datetime and astype('datetime[]') left-pads zeros or fills with zeros to fractions of second #11350

OpenDGPS · 2022-07-26T10:21:19Z

Describe the bug
Both cudf.to_datetime() and cudf.DataFrame.astype('datetime64[]') produce wrong fractional seconds when the datetime strings do not all carry the same number of fractional digits: the fraction is left-padded with zeros instead of being right-padded.

The result also depends on the order of the datetime strings in a DataFrame column. If the column's first entry does not have nine fractional digits, the two functions return different results, and astype returns every value in the column in the same format as the first one.

Steps/Code to reproduce bug

import pandas as pd
import cudf
df = cudf.DataFrame({
    't': [
        "2022-04-11T20:14:37.123456789Z",
        "2022-04-11T20:14:37.12345Z",
        "2022-04-11T20:14:37.12345678Z",
        "2022-04-11T20:14:37.1234567Z",
        "2022-04-11T20:14:37.123456Z",
        "2022-04-11T20:14:37.1234Z"
    ],
    'tx': [
        "2022-04-11T20:14:37.123456789Z",
        "2022-04-11T20:14:37.123450000Z",
        "2022-04-11T20:14:37.123456780Z",
        "2022-04-11T20:14:37.123456700Z",
        "2022-04-11T20:14:37.123456000Z",
        "2022-04-11T20:14:37.123400000Z"
    ]})
df['t_astype'] = df['t'].astype("datetime64[ns]")
df['t_to_datetime'] = cudf.to_datetime(df['t'], format='%Y-%m-%dT%H:%M:%S.%f%z')
df['tx_astype'] = df['tx'].astype("datetime64[ns]")
df['tx_to_datetime'] = cudf.to_datetime(df['tx'], format='%Y-%m-%dT%H:%M:%S.%f%z')
df

                                t                              tx                      t_astype                 t_to_datetime                     tx_astype                tx_to_datetime
0  2022-04-11T20:14:37.123456789Z  2022-04-11T20:14:37.123456789Z 2022-04-11 20:14:37.123456789 2022-04-11 20:14:37.123456789 2022-04-11 20:14:37.123456789 2022-04-11 20:14:37.123456789
1      2022-04-11T20:14:37.12345Z  2022-04-11T20:14:37.123450000Z 2022-04-11 20:14:37.001234500 2022-04-11 20:14:37.001234500 2022-04-11 20:14:37.123450000 2022-04-11 20:14:37.123450000
2   2022-04-11T20:14:37.12345678Z  2022-04-11T20:14:37.123456780Z 2022-04-11 20:14:37.000000000 2022-04-11 20:14:37.000000000 2022-04-11 20:14:37.123456780 2022-04-11 20:14:37.123456780
3    2022-04-11T20:14:37.1234567Z  2022-04-11T20:14:37.123456700Z 2022-04-11 20:14:37.001234567 2022-04-11 20:14:37.001234567 2022-04-11 20:14:37.123456700 2022-04-11 20:14:37.123456700
4     2022-04-11T20:14:37.123456Z  2022-04-11T20:14:37.123456000Z 2022-04-11 20:14:37.001234560 2022-04-11 20:14:37.001234560 2022-04-11 20:14:37.123456000 2022-04-11 20:14:37.123456000
5       2022-04-11T20:14:37.1234Z  2022-04-11T20:14:37.123400000Z 2022-04-11 20:14:37.001234000 2022-04-11 20:14:37.001234000 2022-04-11 20:14:37.123400000 2022-04-11 20:14:37.123400000

Expected behavior
For the values of column t in rows 1, 3, 4, and 5, both methods return fractional seconds with leading zeros: "...:37.00123...". The correct result would be "...:37.123...". In row 2, the fraction is replaced entirely by zeros.

Columns tx_astype and tx_to_datetime show the expected return values. These results are identical to what the corresponding pandas functions return for the values in 't'.
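The relationship between the 't' and 'tx' columns can be sketched in plain Python: each 'tx' string is the matching 't' string with its fraction right-padded with zeros to nine digits. The helper name `pad_fraction` is hypothetical and is not part of cudf or pandas; it runs on host strings only and is meant as an illustration of the expected interpretation, not as a GPU workaround.

```python
import re

# Hypothetical helper (not a cudf or pandas API): right-pad the fractional
# seconds of an ISO 8601-style string with zeros to a fixed number of digits.
_FRACTION = re.compile(r"\.(\d+)")


def pad_fraction(ts: str, digits: int = 9) -> str:
    def _pad(match: re.Match) -> str:
        frac = match.group(1)[:digits]  # drop digits beyond the target precision
        return "." + frac.ljust(digits, "0")  # append zeros, never prepend them

    # Only the first ".<digits>" run is the fractional-seconds field.
    return _FRACTION.sub(_pad, ts, count=1)


# The 't' column from the reproducer above.
t = [
    "2022-04-11T20:14:37.123456789Z",
    "2022-04-11T20:14:37.12345Z",
    "2022-04-11T20:14:37.12345678Z",
    "2022-04-11T20:14:37.1234567Z",
    "2022-04-11T20:14:37.123456Z",
    "2022-04-11T20:14:37.1234Z",
]

tx = [pad_fraction(s) for s in t]
for before, after in zip(t, tx):
    print(before, "->", after)
```

The padded strings match the 'tx' column of the reproducer, which is the input that both cudf functions parse correctly.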

Environment overview (please complete the following information)

docker pull rapidsai/rapidsai:22.06-cuda11.5-runtime-ubuntu18.04-py3.9
docker run --gpus all --rm -it \
    --shm-size=1g --ulimit memlock=-1 \
    -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    rapidsai/rapidsai:22.06-cuda11.5-runtime-ubuntu18.04-py3.9

Environment details
CUDF from rapids docker image "22.06-cuda11.5-runtime-ubuntu18.04-py3.9"
Host environment: "5.4.0-121-generic #137~18.04.1-Ubuntu SMP"
GPU: NVIDIA 1080ti

Additional context
To minimise I/O, APIs often shorten the fractional-second digits for ms, us, and ns precision (e.g. the Alpaca Stock API and IoT APIs).

The observed behaviour shifts timestamps backwards by up to one second, which compromises the ordering of a time series.
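The effect on ordering can be checked with the numbers from the output table above. The sketch below uses only the nanosecond fractions of rows 0 to 5 (expected values from tx_astype, observed values from t_astype); the helper name `row_order` is introduced here for illustration.

```python
# Nanosecond fractions for rows 0-5, copied from the output table above.
expected_ns = [123456789, 123450000, 123456780, 123456700, 123456000, 123400000]
observed_ns = [123456789, 1234500, 0, 1234567, 1234560, 1234000]


def row_order(values):
    """Return the row indices sorted by ascending timestamp fraction."""
    return sorted(range(len(values)), key=values.__getitem__)


expected_order = row_order(expected_ns)
observed_order = row_order(observed_ns)
print(expected_order)  # [5, 1, 4, 3, 2, 0]
print(observed_order)  # [2, 5, 1, 4, 3, 0]

# Backward drift per row in nanoseconds (expected minus observed).
drift_ns = [e - o for e, o in zip(expected_ns, observed_ns)]
print(max(drift_ns))  # 123456780, from row 2
```

Row 2, whose fraction is zeroed out, moves from second-to-last to first place, so a sort on the parsed column no longer reflects the original sequence of events.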


davidwendt · 2022-07-26T18:18:29Z

This appears to be a bug in cudf::to_timestamps logic when processing truncated fractional seconds.

davidwendt · 2022-07-27T14:23:52Z

Technically we do not support varying formats for a column of timestamp strings. The %f specifier expects to see 6 digits.

Regardless, the code has logic to not read past the end of the string, so fewer digits are handled gracefully and more digits are ignored (though the 'Z' would not be processed correctly in this case).
The use case here uncovered a bug in this logic that is accessing invalid memory and I'll put a fix in a PR soon.

Note you can specify the precision of the %f specifier by including a digit (1-9) -- like %9f to read 9 digits (nanoseconds) for example.

Fix truncated subsecond calculation to adjust for the actual digits read in decimal 10 notation. Added new test strings to include an appended 'Z' after the subsecond digits to test this fix.

Closes #11350

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Yunsong Wang (https://github.com/PointKernel)

URL: #11367
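As a rough illustration of what "adjust for the actual digits read" means, the sketch below scales a truncated subsecond field by a power of ten that depends on how many digits were actually present. This is a Python sketch under stated assumptions, not the libcudf C++/CUDA code from the PR; the function name `subsecond_to_ns` and the nine-digit nanosecond target are choices made here for the example.

```python
def subsecond_to_ns(digits: str) -> int:
    """Convert the subsecond digits actually read into nanoseconds.

    Assumption for this sketch: the target resolution is nanoseconds
    (nine decimal digits); any digits beyond the ninth are ignored.
    """
    read = digits[:9]
    if not read:
        return 0
    # Each missing trailing digit is worth one more factor of ten.
    return int(read) * 10 ** (9 - len(read))


for frac in ["123456789", "12345", "12345678", "1234567", "123456", "1234"]:
    print(frac, "->", subsecond_to_ns(frac))
```

Applied to the fractions in the 't' column of the reproducer, this yields the values shown in the tx_astype column (for example, "12345" gives 123450000 rather than 1234500).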

davidwendt · 2022-07-29T11:50:26Z

Closed by #11367

OpenDGPS added the "Needs Triage" and "bug" labels Jul 26, 2022

davidwendt self-assigned this Jul 26, 2022

davidwendt added the "libcudf" label and removed the "Needs Triage" label Jul 26, 2022

davidwendt mentioned this issue Jul 27, 2022 in "Fix to_timestamps truncated subsecond calculation #11367" (merged)

davidwendt closed this as completed Jul 29, 2022
