-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] duration types #5359
[REVIEW] duration types #5359
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
cpp/include/cudf/types.hpp
Outdated
@@ -191,6 +195,16 @@ enum type_id { | |||
TIMESTAMP_MILLISECONDS, ///< duration of milliseconds since Unix Epoch in int64 | |||
TIMESTAMP_MICROSECONDS, ///< duration of microseconds since Unix Epoch in int64 | |||
TIMESTAMP_NANOSECONDS, ///< duration of nanoseconds since Unix Epoch in int64 | |||
DURATION_YEARS, ///< duration of years in int32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoah. My CPU just cried thinking about the compile time explosion this is going to cause. We don't need YEARS/MONTHS/WEEKS/DAYS/HOURS/MINUTES for duration since we don't have corresponding timestamp types for that.
We should only have/need durations for where we have timestamp types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need days since we have TIMESTAMP_DAYS
, but otherwise agreed for now.
In the future the goal would be to support most of these though 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
YEARS/MONTHS/WEEKS/DAYS/HOURS/MINUTES would be useful for constructing all timedelta operations. This would allow for eg. (+1 year + 1 month +1 days) operations with timestamp.
This is still open for discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
useful for constructing all timedelta operations
Useful, yes, practical, no. I do not think our compile time or library file size can handle add all of these types in addition to all of the other new ones being added.
This would allow for eg. (+1 year + 1 month +1 days) operations with timestamp.
You can still express that in terms of smaller units like DAYS. So these additional types are not essential for functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, Arrow only supports durations on seconds, milliseconds, microseconds, and nanoseconds.
Numpy supports it on years, months, weeks, days, hour, and minutes as well but uses 64 bit types for all of them.
Maybe we only support seconds, milliseconds, microseconds, and nanoseconds for now, considering using a duration of 64 bit seconds allows you to capture all of a duration of 32 bit days, albeit with memory / performance tradeoffs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For spark we need MONTHS, DAYS, and MICROSECONDS.
This is because if I want to go one year in the future it is 12 months, but it is not always 365 days. If I want to go 1 week in the future it is always 7 days, but I cannot always say how many microseconds it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the use case of months, but isn't 1 week deterministic in the number of microseconds? https://godbolt.org/z/EhSau3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, looks like std::chrono::duration
is all just ratio based: https://en.cppreference.com/w/cpp/chrono/duration
Unit | Ratio |
---|---|
seconds | 1 |
minutes | 60 |
hours | 3600 |
days | 86400 |
weeks | 604800 |
months | 2629746 |
years | 31556952 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because if I want to go one year in the future it is 12 months, but it is not always 365 days.
According to C++ duration types, a year is always 365.2425 days. See:https://en.cppreference.com/w/cpp/chrono/duration
years is equal to 365.2425 days (the average length of a Gregorian year). months is equal to 30.436875 days (exactly 1/12 of years).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now, DAYS, SECONDS, MILLISECONDS, MICROSECONDS, NANOSECONDS are the only types in duration type.
@karthikeyann Could we limit the scope of this PR to just the types and factories and use follow up PRs for other pieces? This also lets us parallelize the work more efficiently 😄 |
@@ -85,7 +85,6 @@ struct dispatch_compute_indices { | |||
cudaStream_t stream) | |||
{ | |||
CUDF_FAIL("list_view dictionary set_keys not supported yet"); | |||
return nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect some compiler versions are going to complain about not having a return value here. Furthermore, this is an unrelated change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, compiler complains if return type is auto
.
namespace detail { | ||
|
||
template <class Duration> | ||
struct duration : Duration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the purpose of this type and the already existing timestamp
type? is it to:
- implicitly convert the tick counts into
duration
andtimestamp
(timestamp_X{3}
etc.)? - convert between 2
time_point
that have different duration time periods where one can't be otherwise converted to another (from a higher precision to a lower precision - seconds to days etc.)?
is this what is preventing us from using?
using duration_D = simt::std::chrono::days;
and using timestamp_D = detail::timestamp<duration_D>;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The duration helper types like std::chrono::days
don't provide sufficient guarantees for the size of the type, e.g., std::chrono::days
is only guaranteed to be at least 25 bits.
That said, I can't recall why we made the explicit timestamp
type instead of doing:
using timestamp_D = simt::std::chrono::time_point<simt::std::chrono::system_clock, simt::std::chrono::duration<int32_t, simt::std::ratio<86400>>>
@trxcllnt do you remember why we didn't make the libcudf timestamp types simply be aliases and instead made a concrete new type? I'm sure we had a good reason, but I can't remember right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think there were a few things:
- We didn't want to make the clock part of the type, since we expected to switch to
utc_clock
eventually once it lands in libcu++. - We have an extra constructor that takes a numeric
typename Duration::rep r
arg that's more convenient to use:timestamp_ms{1000}
- We needed to override the conversion constructor to explicitly call
duration_cast()
since the libcu++ impl doesn't call it (possible bug in libcu++?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the 2nd item above - should we be even supporting constructing a timestamp with a tick count? in this case, what does 1000 really represent? shouldn't it be a duration (of ms in this example) or another time_point with a similar or a different duration. presently, we interpret to mean add 1000 ms to the system_clock epoch (as we only support system_clock for now), which doesn't look to be the original intent, as 1000 could mean different units or displacement from some other clock.
i think 3rd item is required if we want to truncate and lose precision - i.e. assigning say timestamp_s
to a timestamp_D
. that constructor will prevent this if the periods aren't divisible - is_convertible
should evaluate to true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2). 2nd item is present for ease the testing code. timestamp(typename Duration::rep r)
should be explicit.
3) 3rd item is not bug since is_convertible
takes care of it. (lossy conversion is not allowed as per std::chrono
, needs explicit duration_cast
). std::chrono
does not recommend using duration_cast
(lossy conversion) implicitly. One plus point is that it's explicit constructor.
Perhaps lossy conversion is the requirement from users (Spark, Python, etc). is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karthikeyann do you think we should move away from having our own concrete timestamp
type and instead just us a type alias for both timestamps and durations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't replace due to a CUDA limitation. If timestamp
is replaced with simt::std::chrono::time_point
, following error is hit due to limitation of shared memory initialization.
cpp/include/cudf/detail/copy_if.cuh(106): error: initializer not allowed for __shared__ variable
cudf/cpp/include/cudf/detail/copy_if.cuh
Line 106 in efebcd1
__shared__ T temp_data[block_size]; |
This is a limitation of device variable. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-specifiers
- The constructor function has no parameters, the initializer list is empty and the function body is an empty compound statement.
- The default constructors of all base classes of its class can be considered empty.
simt::std::chrono::time_point
calls parameterized constructor of duration
https://github.com/rapidsai/thirdparty-libcxx/blob/a97a7380c76346c22bb67b93695bed19592afad2/include/chrono#L1372
_LIBCPP_INLINE_VISIBILITY _LIBCPP_CONSTEXPR_AFTER_CXX11 time_point() : __d_(duration::zero()) {}
timestamp
class avoids it by calling no parameter constructor of duration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simt::std::chrono::duration
is used for duration types.
for timestamp, wait for #5400 .
- this builds on rapidsai#5359 and fixes or adds a few things - unifies the duration and timestamp scalars - adds `is_chrono` and relevant chrono traits - reworked the duration type so that we can support conversions between durations that can't be supported by the language(lower_precision_duration = higher_precision_duration) - include duration types in the typed tests - added tests for new unary casts with durations i have left a few TODOs which will be in the follow-up pr's
Codecov Report
@@ Coverage Diff @@
## branch-0.15 #5359 +/- ##
============================================
Coverage 88.68% 88.68%
============================================
Files 57 57
Lines 10780 10780
============================================
Hits 9560 9560
Misses 1220 1220 Continue to review full report at Codecov.
|
- sync with upstream 0.15 and rapidsai#5359
/** | ||
* @brief Type alias representing an int32_t duration of days. | ||
**/ | ||
using duration_D = simt::std::chrono::duration<int32_t, simt::std::chrono::days::period>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use these as the duration type of the timestamp types?
@@ -45,6 +45,7 @@ struct timestamp : time_point<Duration> { | |||
* | |||
* @param other The `timestamp` to copy | |||
*/ | |||
// TODO: This explict truncation is intended? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would this one and the constructor constexpr timestamp(Duration d)
have to be explicitly defined? they are:
- either relaxing or adding explicit specifiers that aren't in the chrono's
time_point
type - allows narrowing/truncation of types, when it is disallowed by
std::chrono
is it not possible to inherit them from the base type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the list of timestamp constructors evolved over time, and in response to various compiler errors (@karthikeyann reminds me of one I forgot in this comment). Some of them may not be valid reasons any longer.
I'd say change/remove whichever ones we want, and recompile to see what breaks (this can be frustrating, as changing this header triggers a near full re-compile).
Does this PR ensure that |
the support for these operators (add/sub etc.) with the timestamp/duration operands will be added as a part of a follow-up pr's focusing on unary/binary ops. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, but approving as they're minor.
template <typename Element, typename std::enable_if_t<is_duration<Element>()>* = nullptr> | ||
void operator()(cudf::column_view const& col, std::vector<std::string>& out) | ||
{ | ||
CUDF_FAIL("duration printing not supported yet"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not implement this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supposedly coming in a later PR where testing code will be updated to work with duration types.
@@ -45,6 +45,7 @@ struct timestamp : time_point<Duration> { | |||
* | |||
* @param other The `timestamp` to copy | |||
*/ | |||
// TODO: This explict truncation is intended? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the list of timestamp constructors evolved over time, and in response to various compiler errors (@karthikeyann reminds me of one I forgot in this comment). Some of them may not be valid reasons any longer.
I'd say change/remove whichever ones we want, and recompile to see what breaks (this can be frustrating, as changing this header triggers a near full re-compile).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -351,6 +353,23 @@ __inline__ __device__ cudf::timestamp_ns decode_value(const char *data, | |||
return milli * 1000000; | |||
} | |||
|
|||
// The purpose of this is merely to allow compilation ONLY | |||
// TODO : make this work for json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a TODO to add parsing implementation to csv or is it a reminder to do the same thing for json?
template <typename Element, typename std::enable_if_t<is_duration<Element>()>* = nullptr> | ||
void operator()(cudf::column_view const& col, std::vector<std::string>& out) | ||
{ | ||
CUDF_FAIL("duration printing not supported yet"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supposedly coming in a later PR where testing code will be updated to work with duration types.
addresses C++ part of #5272