Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] duration types #5359

Merged
merged 23 commits into from
Jun 9, 2020

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Jun 2, 2020

addresses C++ part of #5272

  • add duration types
  • factories

@karthikeyann karthikeyann requested a review from sriramch June 2, 2020 14:29
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@@ -191,6 +195,16 @@ enum type_id {
TIMESTAMP_MILLISECONDS, ///< duration of milliseconds since Unix Epoch in int64
TIMESTAMP_MICROSECONDS, ///< duration of microseconds since Unix Epoch in int64
TIMESTAMP_NANOSECONDS, ///< duration of nanoseconds since Unix Epoch in int64
DURATION_YEARS, ///< duration of years in int32
Copy link
Contributor

@jrhemstad jrhemstad Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoah. My CPU just cried thinking about the compile time explosion this is going to cause. We don't need YEARS/MONTHS/WEEKS/DAYS/HOURS/MINUTES for duration since we don't have corresponding timestamp types for that.

We should only have/need durations for where we have timestamp types.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need days since we have TIMESTAMP_DAYS, but otherwise agreed for now.

In the future the goal would be to support most of these though 😢

Copy link
Contributor Author

@karthikeyann karthikeyann Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YEARS/MONTHS/WEEKS/DAYS/HOURS/MINUTES would be useful for constructing all timedelta operations. This would allow for eg. (+1 year + 1 month +1 days) operations with timestamp.
This is still open for discussion.

Copy link
Contributor

@jrhemstad jrhemstad Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useful for constructing all timedelta operations

Useful, yes, practical, no. I do not think our compile time or library file size can handle add all of these types in addition to all of the other new ones being added.

This would allow for eg. (+1 year + 1 month +1 days) operations with timestamp.

You can still express that in terms of smaller units like DAYS. So these additional types are not essential for functionality.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, Arrow only supports durations on seconds, milliseconds, microseconds, and nanoseconds.

Numpy supports it on years, months, weeks, days, hour, and minutes as well but uses 64 bit types for all of them.

Maybe we only support seconds, milliseconds, microseconds, and nanoseconds for now, considering using a duration of 64 bit seconds allows you to capture all of a duration of 32 bit days, albeit with memory / performance tradeoffs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For spark we need MONTHS, DAYS, and MICROSECONDS.

This is because if I want to go one year in the future it is 12 months, but it is not always 365 days. If I want to go 1 week in the future it is always 7 days, but I cannot always say how many microseconds it is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the use case of months, but isn't 1 week deterministic in the number of microseconds? https://godbolt.org/z/EhSau3

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, looks like std::chrono::duration is all just ratio based: https://en.cppreference.com/w/cpp/chrono/duration

Unit Ratio
seconds 1
minutes 60
hours 3600
days 86400
weeks 604800
months 2629746
years 31556952

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because if I want to go one year in the future it is 12 months, but it is not always 365 days.

According to C++ duration types, a year is always 365.2425 days. See:https://en.cppreference.com/w/cpp/chrono/duration

years is equal to 365.2425 days (the average length of a Gregorian year). months is equal to 30.436875 days (exactly 1/12 of years).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now, DAYS, SECONDS, MILLISECONDS, MICROSECONDS, NANOSECONDS are the only types in duration type.

@kkraus14
Copy link
Collaborator

kkraus14 commented Jun 2, 2020

@karthikeyann Could we limit the scope of this PR to just the types and factories and use follow up PRs for other pieces? This also lets us parallelize the work more efficiently 😄

@@ -85,7 +85,6 @@ struct dispatch_compute_indices {
cudaStream_t stream)
{
CUDF_FAIL("list_view dictionary set_keys not supported yet");
return nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect some compiler versions are going to complain about not having a return value here. Furthermore, this is an unrelated change.

Copy link
Contributor Author

@karthikeyann karthikeyann Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, compiler complains if return type is auto.

cpp/include/cudf/utilities/traits.hpp Outdated Show resolved Hide resolved
namespace detail {

template <class Duration>
struct duration : Duration {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the purpose of this type and the already existing timestamp type? is it to:

  • implicitly convert the tick counts into duration and timestamp (timestamp_X{3} etc.)?
  • convert between 2 time_point that have different duration time periods where one can't be otherwise converted to another (from a higher precision to a lower precision - seconds to days etc.)?

is this what is preventing us from using?
using duration_D = simt::std::chrono::days; and using timestamp_D = detail::timestamp<duration_D>;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The duration helper types like std::chrono::days don't provide sufficient guarantees for the size of the type, e.g., std::chrono::days is only guaranteed to be at least 25 bits.

That said, I can't recall why we made the explicit timestamp type instead of doing:

using timestamp_D = simt::std::chrono::time_point<simt::std::chrono::system_clock, simt::std::chrono::duration<int32_t, simt::std::ratio<86400>>>

@trxcllnt do you remember why we didn't make the libcudf timestamp types simply be aliases and instead made a concrete new type? I'm sure we had a good reason, but I can't remember right now.

Copy link
Contributor

@trxcllnt trxcllnt Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think there were a few things:

  1. We didn't want to make the clock part of the type, since we expected to switch to utc_clock eventually once it lands in libcu++.
  2. We have an extra constructor that takes a numeric typename Duration::rep r arg that's more convenient to use: timestamp_ms{1000}
  3. We needed to override the conversion constructor to explicitly call duration_cast() since the libcu++ impl doesn't call it (possible bug in libcu++?).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the 2nd item above - should we be even supporting constructing a timestamp with a tick count? in this case, what does 1000 really represent? shouldn't it be a duration (of ms in this example) or another time_point with a similar or a different duration. presently, we interpret to mean add 1000 ms to the system_clock epoch (as we only support system_clock for now), which doesn't look to be the original intent, as 1000 could mean different units or displacement from some other clock.

i think 3rd item is required if we want to truncate and lose precision - i.e. assigning say timestamp_s to a timestamp_D. that constructor will prevent this if the periods aren't divisible - is_convertible should evaluate to true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2). 2nd item is present for ease the testing code. timestamp(typename Duration::rep r) should be explicit.
3) 3rd item is not bug since is_convertible takes care of it. (lossy conversion is not allowed as per std::chrono, needs explicit duration_cast). std::chrono does not recommend using duration_cast (lossy conversion) implicitly. One plus point is that it's explicit constructor.
Perhaps lossy conversion is the requirement from users (Spark, Python, etc). is it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karthikeyann do you think we should move away from having our own concrete timestamp type and instead just us a type alias for both timestamps and durations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't replace due to a CUDA limitation. If timestamp is replaced with simt::std::chrono::time_point, following error is hit due to limitation of shared memory initialization.
cpp/include/cudf/detail/copy_if.cuh(106): error: initializer not allowed for __shared__ variable

__shared__ T temp_data[block_size];

This is a limitation of device variable. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-specifiers

  • The constructor function has no parameters, the initializer list is empty and the function body is an empty compound statement.
  • The default constructors of all base classes of its class can be considered empty.

simt::std::chrono::time_point calls parameterized constructor of duration
https://github.com/rapidsai/thirdparty-libcxx/blob/a97a7380c76346c22bb67b93695bed19592afad2/include/chrono#L1372

_LIBCPP_INLINE_VISIBILITY _LIBCPP_CONSTEXPR_AFTER_CXX11 time_point() : __d_(duration::zero()) {}

timestamp class avoids it by calling no parameter constructor of duration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simt::std::chrono::duration is used for duration types.
for timestamp, wait for #5400 .

cpp/include/cudf/scalar/scalar.hpp Outdated Show resolved Hide resolved
sriramch added a commit to sriramch/cudf that referenced this pull request Jun 4, 2020
  - this builds on rapidsai#5359 and fixes or adds a few things
  - unifies the duration and timestamp scalars
  - adds `is_chrono` and relevant chrono traits
  - reworked the duration type so that we can support conversions between durations that
    can't be supported by the language(lower_precision_duration = higher_precision_duration)
  - include duration types in the typed tests
  - added tests for new unary casts with durations

i have left a few TODOs which will be in the follow-up pr's
@karthikeyann karthikeyann marked this pull request as ready for review June 7, 2020 15:14
@karthikeyann karthikeyann requested review from a team as code owners June 7, 2020 15:14
@codecov
Copy link

codecov bot commented Jun 7, 2020

Codecov Report

Merging #5359 into branch-0.15 will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##           branch-0.15    #5359   +/-   ##
============================================
  Coverage        88.68%   88.68%           
============================================
  Files               57       57           
  Lines            10780    10780           
============================================
  Hits              9560     9560           
  Misses            1220     1220           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 029005c...030927b. Read the comment docs.

@karthikeyann karthikeyann changed the title [WIP] duration types [REVIEW] duration types Jun 8, 2020
sriramch added a commit to sriramch/cudf that referenced this pull request Jun 8, 2020
- sync with upstream 0.15 and rapidsai#5359
/**
* @brief Type alias representing an int32_t duration of days.
**/
using duration_D = simt::std::chrono::duration<int32_t, simt::std::chrono::days::period>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use these as the duration type of the timestamp types?

@@ -45,6 +45,7 @@ struct timestamp : time_point<Duration> {
*
* @param other The `timestamp` to copy
*/
// TODO: This explict truncation is intended?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this one and the constructor constexpr timestamp(Duration d) have to be explicitly defined? they are:

  • either relaxing or adding explicit specifiers that aren't in the chrono's time_point type
  • allows narrowing/truncation of types, when it is disallowed by std::chrono

is it not possible to inherit them from the base type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the list of timestamp constructors evolved over time, and in response to various compiler errors (@karthikeyann reminds me of one I forgot in this comment). Some of them may not be valid reasons any longer.

I'd say change/remove whichever ones we want, and recompile to see what breaks (this can be frustrating, as changing this header triggers a near full re-compile).

@devavret
Copy link
Contributor

devavret commented Jun 8, 2020

Does this PR ensure that timestamp_s - timestamp_s == duration_s? Is that automatic or would this need overloading timestamp_s::operator-()

@sriramch
Copy link
Contributor

sriramch commented Jun 8, 2020

Does this PR ensure that timestamp_s - timestamp_s == duration_s? Is that automatic or would this need overloading timestamp_s::operator-()

the support for these operators (add/sub etc.) with the timestamp/duration operands will be added as a part of a follow-up pr's focusing on unary/binary ops.

@karthikeyann karthikeyann requested a review from trxcllnt June 8, 2020 17:30
Copy link
Contributor

@trxcllnt trxcllnt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, but approving as they're minor.

template <typename Element, typename std::enable_if_t<is_duration<Element>()>* = nullptr>
void operator()(cudf::column_view const& col, std::vector<std::string>& out)
{
CUDF_FAIL("duration printing not supported yet");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not implement this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supposedly coming in a later PR where testing code will be updated to work with duration types.

@@ -45,6 +45,7 @@ struct timestamp : time_point<Duration> {
*
* @param other The `timestamp` to copy
*/
// TODO: This explict truncation is intended?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the list of timestamp constructors evolved over time, and in response to various compiler errors (@karthikeyann reminds me of one I forgot in this comment). Some of them may not be valid reasons any longer.

I'd say change/remove whichever ones we want, and recompile to see what breaks (this can be frustrating, as changing this header triggers a near full re-compile).

Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -351,6 +353,23 @@ __inline__ __device__ cudf::timestamp_ns decode_value(const char *data,
return milli * 1000000;
}

// The purpose of this is merely to allow compilation ONLY
// TODO : make this work for json
Copy link
Contributor

@devavret devavret Jun 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a TODO to add parsing implementation to csv or is it a reminder to do the same thing for json?

template <typename Element, typename std::enable_if_t<is_duration<Element>()>* = nullptr>
void operator()(cudf::column_view const& col, std::vector<std::string>& out)
{
CUDF_FAIL("duration printing not supported yet");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supposedly coming in a later PR where testing code will be updated to work with duration types.

@karthikeyann karthikeyann merged commit 3fd7b38 into rapidsai:branch-0.15 Jun 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants