Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] SUM (and probably MEAN) of TIMESTAMP types should be disallowed and require duration types instead #4074

Closed
jrhemstad opened this issue Feb 5, 2020 · 12 comments · Fixed by #5319
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@jrhemstad
Copy link
Contributor

jrhemstad commented Feb 5, 2020

Describe the bug

The DeviceSum aggregation operator has an incorrect specialization for timestamp types:

return T{DeviceSum{}(lhs.time_since_epoch(), rhs.time_since_epoch())};

It erroneously allows summing together two TIMESTAMP* types together. I believe this is currently only used in the rolling implementation for supporting the MEAN of timestamp types.

Additional context

libcudf TIMESTAMP* types are analogous to C++ std::chrono::time_point. They represent discrete points in time, such as "April 14, 2012 12:37:13 UTC" or "December 18, 1990 16:29:42 UTC".

Logically, it doesn't make any sense to try and sum together two points in time:

April 14, 2012 12:37:13 UTC + December 18, 1990 16:29:42 UTC = ???

A duration is a span of time, such as 37 seconds or 43 days. Adding two durations together is logical and well-defined:

43 days + 37 seconds = 43 days 37 seconds.

Adding a duration to a time point also has a logical and well-defined interpretation:

April 14, 2012 12:37:13 UTC + 37 seconds = April 14, 2012 12:37:50.

Furthermore, subtracting two time points has a well defined result of returning a duration:

April 14, 2012 12:37:13 UTC - December 18, 1990 16:29:42 UTC = 21 years 3 months 26.84 days

All of these statements are reflected in the arithmetic operators that are defined for std::chrono::time_point and std::chrono::duration: https://en.cppreference.com/w/cpp/chrono/time_point/operator_arith2

From the preceding statements, I make the following claims:

  1. Trying to sum together two TIMESTAMP* types should be disallowed because the result is nonsense. Thus, the DeviceSum specialization is incorrect and should be removed.

  2. Computing the MEAN of TIMESTAMP* types directly should probably be disallowed as this would require summing TIMESTAMP* types. Instead, the user should be required to convert to a duration/timedelta type(*).

I feel strongly about statement 1, but 2 could be relaxed.

Re: 2., while it doesn't make sense to compute the average of points in time, it does make sense to compute the average of the duration of each point in time from the epoch. For std::chrono::time_point that just means getting it's underlying duration via time_since_epoch() which returns a std::chrono::duration where it is well-defined to sum and average duration types.

As such, in any libcudf algorithm that is requested to compute the MEAN on a TIMESTAMP* type would require a specialization for TIMESTAMP* types to invoke time_since_epoch(). What that would look like:

timestamp_D t1, t2;

DeviceSum s{};

timestamp_D t3 = s(t1, t2); // This should be illegal, but it currently works because of the `DeviceSum` specialization for timestamps

auto duration = s(t1.time_since_epoch(), t2.time_since_epoch()); // This is okay!

Types like std::chrono::time_point and std::chrono::duration convey meaning and enforce rules for how those types are used. We should respect those rules.

(*) libcudf does not currently have a duration or "timedelta" type yet, but it's on the roadmap and should be straightforward to add via the cuda::std::chrono::duration provided by libcu++. Once it does, users can easily compute the average of a set of timestamp's durations since the epoch by first converting to the duration/timedelta type. Until that time, the average of TIMESTAMPs should not be allowed directly.

It would appear Pandas has the same requirement where the user is required to first convert to a timedelta type before you're allowed to compute the average: https://stackoverflow.com/questions/44964484/pandas-average-timestamp-for-dateframe-subset

@jrhemstad jrhemstad added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. labels Feb 5, 2020
@codereport
Copy link
Contributor

I support this. There is a great keynote on the design rationale for std::chrono. Note that Howard Hinnant was a previous standard library maintainer (before Marshall Clow). It is worth the watch: https://youtu.be/adSAN282YIw

@harrism
Copy link
Member

harrism commented Feb 5, 2020

Until we have time delta support (it's not on the roadmap yet), we will leave the working timestamp mean functionality in place, since users requested it.

@jrhemstad
Copy link
Contributor Author

Until we have time delta support (it's not on the roadmap yet), we will leave the working timestamp mean functionality in place, since users requested it.

Then we should do my suggestion here:

As such, in any libcudf algorithm that is requested to compute the MEAN on a TIMESTAMP* type would require a specialization for TIMESTAMP* types to invoke time_since_epoch(). What that would look like:

timestamp_D t1, t2;

DeviceSum s{};

timestamp_D t3 = s(t1, t2); // This should be illegal, but it currently works because of the DeviceSum specialization for timestamps

auto duration = s(t1.time_since_epoch(), t2.time_since_epoch()); // This is okay!

@jrhemstad
Copy link
Contributor Author

Until we have time delta support (it's not on the roadmap yet), we will leave the working timestamp mean functionality in place, since users requested it.

It's odd that user's requested it if Pandas doesn't support the same functionality.

@harrism
Copy link
Member

harrism commented Feb 6, 2020

Is it possible to avoid specializing every aggregation kernel for timestamps with this approach?

edit: edited for clarity that I meant only aggregation kernels.

@jrhemstad
Copy link
Contributor Author

jrhemstad commented Feb 6, 2020

Is it possible to avoid specializing every kernel for timestamps with this approach?

It's not every kernel. It's only features that are attempting to sum elements. More than that, it's only for features that allow computing the average of elements.

The way to avoid any specialization at all is to require using a duration type :)

@harrism harrism added feature request New feature or request and removed bug Something isn't working labels Feb 10, 2020
@karthikeyann
Copy link
Contributor

karthikeyann commented May 8, 2020

Is it possible to avoid specializing every kernel for timestamps with this approach?

It's not every kernel. It's only features that are attempting to sum elements. More than that, it's only for features that allow computing the average of elements.

The way to avoid any specialization at all is to require using a duration type :)

DeviceSum on cudf::timestamp_* is compiled only for rolling.cu MEAN. For reduction and group_by, it's not enabled because cudf::timestamp_* is neither numeric nor arithmetic.

now, should this rolling MEAN of timestamp_* be disabled or specialized?
Meanwhile, should MEAN of timestamp_* be enabled for group_by and reduction?

@karthikeyann karthikeyann self-assigned this May 8, 2020
@jrhemstad
Copy link
Contributor Author

now, should this rolling MEAN of timestamp_* be disabled or specialized?

Once we have duration types, it should be specialized to convert to a duration and return the mean as a duration.

Meanwhile, should MEAN of timestamp_* be enabled for group_by and reduction?

Only once we have duration types.

@harrism
Copy link
Member

harrism commented May 9, 2020

A mean of times is a time, not a duration. If I want to know what is the average time of day that customers place orders, that's an average time of day, not an average duration.

@karthikeyann
Copy link
Contributor

karthikeyann commented May 22, 2020

in pandas 1.0.3,
SUM in DatetimeIndex is disabled,
MEAN in DatetimeIndex is allowed. pandas.DatetimeIndex.mean

It is implemented by typecasting to "i8" (signed 64-bit integer)
https://github.com/pandas-dev/pandas/blob/239b6a7219a4f89637451406819623f0fb631e6f/pandas/core/arrays/datetimelike.py#L1630

Implementing mean(TIMESTAMP_* column) should be similar as well.

@karthikeyann
Copy link
Contributor

Only once we have duration types.

It can be specialized without duration column types too.
In specialization, timestamp is typecasted to duration type and Intermediate SUM type could be duration type. After dividing by count, duration type can by typecasted to timestamp_* again. (Already tested with column reduction kernel). DeviceSum<timestamp_*> are removed.
created a PR #5319 with above said change.

@jrhemstad
Copy link
Contributor Author

Only once we have duration types.

It can be specialized without duration column types too.
In specialization, timestamp is typecasted to duration type and Intermediate SUM type could be duration type. After dividing by count, duration type can by typecasted to timestamp_* again. (Already tested with column reduction kernel). DeviceSum<timestamp_*> are removed.
created a PR #5319 with above said change.

That's an unnecessary amount of complication when timestamps can be zero-copy casted to durations where the operations work natively without specialization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants