[BUG] SUM (and probably MEAN) of TIMESTAMP types should be disallowed and require duration types instead #4074
Comments
I support this. There is a great keynote on the design rationale for |
Until we have time delta support (it's not on the roadmap yet), we will leave the working timestamp mean functionality in place, since users requested it. |
Then we should do my suggestion here:
|
It's odd that users requested it if Pandas doesn't support the same functionality. |
Is it possible to avoid specializing every aggregation kernel for timestamps with this approach? edit: edited for clarity that I meant only aggregation kernels. |
It's not every kernel. It's only features that are attempting to sum elements. More than that, it's only for features that allow computing the average of elements. The way to avoid any specialization at all is to require using a duration type :) |
DeviceSum on now, should this rolling MEAN of |
Once we have duration types, it should be specialized to convert to a duration and return the mean as a duration.
Only once we have duration types. |
A mean of times is a time, not a duration. If I want to know what is the average time of day that customers place orders, that's an average time of day, not an average duration. |
In pandas 1.0.3, it is implemented by typecasting to "i8" (signed 64-bit integer). Implementing mean( |
It can be specialized without duration column types too. |
That's an unnecessary amount of complication when timestamps can be zero-copy cast to durations, where the operations work natively without specialization. |
Describe the bug
The `DeviceSum` aggregation operator has an incorrect specialization for timestamp types: cudf/cpp/include/cudf/detail/utilities/device_operators.cuh (line 41 in 1ead2d5).
It erroneously allows summing two `TIMESTAMP*` types. I believe this is currently only used in the `rolling` implementation for supporting the `MEAN` of timestamp types.
Additional context
libcudf `TIMESTAMP*` types are analogous to C++ `std::chrono::time_point`. They represent discrete points in time, such as "April 14, 2012 12:37:13 UTC" or "December 18, 1990 16:29:42 UTC".
Logically, it doesn't make any sense to try and sum together two points in time:
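As a minimal sketch in standard C++17, `std::chrono` can stand in for the libcudf timestamp types to show that such a sum is rejected at compile time. The `is_addable` trait and the `timestamp_s` alias are illustrative names, not libcudf APIs.

```cpp
#include <chrono>
#include <type_traits>
#include <utility>

// Detection trait: true only when the expression `A + B` compiles.
template <class A, class B, class = void>
struct is_addable : std::false_type {};

template <class A, class B>
struct is_addable<A, B, std::void_t<decltype(std::declval<A>() + std::declval<B>())>>
    : std::true_type {};

// Stand-in for a seconds-resolution timestamp (hypothetical alias, not a libcudf name).
using timestamp_s =
    std::chrono::time_point<std::chrono::system_clock, std::chrono::seconds>;

// std::chrono provides no operator+ taking two time points, so the sum is ill-formed.
static_assert(!is_addable<timestamp_s, timestamp_s>::value,
              "time_point + time_point must not compile");
```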
A duration is a span of time, such as 37 seconds or 43 days. Adding two durations together is logical and well-defined:
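For instance, a small sketch with `std::chrono` (the variable names are illustrative, and 43 days is written in hours so the code does not depend on C++20's `std::chrono::days`):

```cpp
#include <chrono>

constexpr std::chrono::seconds thirty_seven_seconds{37};
constexpr std::chrono::hours forty_three_days{43 * 24};

// The sum is expressed in the finer of the two units, here seconds.
constexpr std::chrono::seconds total = thirty_seven_seconds + forty_three_days;

static_assert(total.count() == 37 + 43 * 24 * 60 * 60,
              "durations add like ordinary quantities");
```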
Adding a duration to a time point also has a logical and well-defined interpretation:
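A sketch of this with `std::chrono`, again using a hypothetical `timestamp_s` alias as a stand-in for a seconds-resolution timestamp and made-up values:

```cpp
#include <chrono>

// Hypothetical alias, not a libcudf name.
using timestamp_s =
    std::chrono::time_point<std::chrono::system_clock, std::chrono::seconds>;

// A point in time 1000 seconds after the epoch (illustrative value).
constexpr timestamp_s order_placed{std::chrono::seconds{1000}};

// time point + duration yields another time point.
constexpr timestamp_s order_shipped = order_placed + std::chrono::seconds{37};

static_assert(order_shipped.time_since_epoch() == std::chrono::seconds{1037},
              "shifting a time point by a duration gives a later time point");
```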
Furthermore, subtracting two time points has a well defined result of returning a duration:
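Sketched with `std::chrono` under the same assumptions (the `timestamp_s` alias and the values are illustrative only):

```cpp
#include <chrono>
#include <type_traits>

// Hypothetical alias, not a libcudf name.
using timestamp_s =
    std::chrono::time_point<std::chrono::system_clock, std::chrono::seconds>;

constexpr timestamp_s earlier{std::chrono::seconds{1000}};
constexpr timestamp_s later{std::chrono::seconds{1037}};

// time point - time point yields a duration, not a time point.
constexpr auto elapsed = later - earlier;

static_assert(std::is_same<std::remove_const_t<decltype(elapsed)>,
                           std::chrono::seconds>::value,
              "the difference of two time points is a duration");
static_assert(elapsed == std::chrono::seconds{37}, "37 seconds elapsed");
```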
All of these statements are reflected in the arithmetic operators that are defined for `std::chrono::time_point` and `std::chrono::duration`: https://en.cppreference.com/w/cpp/chrono/time_point/operator_arith2
From the preceding statements, I make the following claims:
1. Trying to sum together two `TIMESTAMP*` types should be disallowed because the result is nonsense. Thus, the `DeviceSum` specialization is incorrect and should be removed.
2. Computing the MEAN of `TIMESTAMP*` types directly should probably be disallowed as this would require summing `TIMESTAMP*` types. Instead, the user should be required to convert to a duration/timedelta type(*).
I feel strongly about statement 1, but 2 could be relaxed.
Re: 2., while it doesn't make sense to compute the average of points in time, it does make sense to compute the average of the `duration` of each point in time from the epoch. For `std::chrono::time_point`, that just means getting its underlying duration via `time_since_epoch()`, which returns a `std::chrono::duration`, and summing and averaging duration types is well-defined.
As such, any libcudf algorithm that is requested to compute the `MEAN` on a `TIMESTAMP*` type would require a specialization for `TIMESTAMP*` types to invoke `time_since_epoch()`. What that would look like:
Types like `std::chrono::time_point` and `std::chrono::duration` convey meaning and enforce rules for how those types are used. We should respect those rules.
(*) libcudf does not currently have a duration or "timedelta" type, but it's on the roadmap and should be straightforward to add via the `cuda::std::chrono::duration` provided by libcu++. Once it does, users can easily compute the average of a set of timestamps' durations since the epoch by first converting to the duration/timedelta type. Until that time, the average of `TIMESTAMP`s should not be allowed directly.
It would appear Pandas has the same requirement, where the user is required to first convert to a timedelta type before being allowed to compute the average: https://stackoverflow.com/questions/44964484/pandas-average-timestamp-for-dateframe-subset
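The `time_since_epoch()` approach described above could be sketched roughly as follows. This is host-side standard C++17, not libcudf device code; `timestamp_s` and `mean_since_epoch` are hypothetical names chosen for illustration, and the integer division truncates toward zero.

```cpp
#include <array>
#include <chrono>
#include <cstddef>

// Hypothetical stand-in for a seconds-resolution timestamp (not a libcudf name).
using timestamp_s =
    std::chrono::time_point<std::chrono::system_clock, std::chrono::seconds>;

// Averages the durations since the epoch; no two time points are ever added.
template <std::size_t N>
constexpr std::chrono::seconds mean_since_epoch(const std::array<timestamp_s, N>& ts) {
  static_assert(N > 0, "the mean of an empty set is undefined");
  std::chrono::seconds sum{0};
  for (std::size_t i = 0; i < N; ++i) {
    sum += ts[i].time_since_epoch();  // duration += duration is well-defined
  }
  // duration / integer is a duration; integer division truncates.
  return sum / static_cast<std::chrono::seconds::rep>(N);
}
```

The result is a duration since the epoch. If a timestamp is wanted instead, as one of the comments argues, the returned duration can be wrapped back explicitly with `timestamp_s{mean}`.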