Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] disallow SUM and MEAN of timestamp types #5319

Merged

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented May 29, 2020

closes #4074

  • disable SUM of timestamp types
  • disable MEAN of timestamp types
    • column reduction
    • rolling
    • group rolling
    • unit tests
  • remove DeviceProduct of timestamp types

@karthikeyann karthikeyann requested a review from a team as a code owner May 29, 2020 07:55
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@jrhemstad
Copy link
Contributor

We shouldn't be allowing MEAN of timestamp at all. The user should be required to first cast to a duration or integer type.

@harrism
Copy link
Member

harrism commented Jun 1, 2020

We shouldn't be allowing MEAN of timestamp at all. The user should be required to first cast to a duration or integer type.

You were not so firm about this here, which may have led to this PR: #4074 (comment)

From the preceding statements, I make the following claims:

  1. Trying to sum together two TIMESTAMP* types should be disallowed because the result is nonsense. Thus, the DeviceSum specialization is incorrect and should be removed.

  2. Computing the MEAN of TIMESTAMP* types directly should probably be disallowed as this would require summing TIMESTAMP* types. Instead, the user should be required to convert to a duration/timedelta type(*).

I feel strongly about statement 1, but 2 could be relaxed.

@harrism
Copy link
Member

harrism commented Jun 1, 2020

@karthikeyann I think the first step is to add support for duration types... #5272

@karthikeyann
Copy link
Contributor Author

karthikeyann commented Jun 1, 2020

We shouldn't be allowing MEAN of timestamp at all. The user should be required to first cast to a duration or integer type.

casting to duration type/integer type is a workaround.
pandas 1.0.3 implemented MEAN of timestamp without casting (#4074 (comment)). That's the main reason for this PR.

duration types will still be implemented. but it's not required for implementing MEAN of timestamp.

@jrhemstad
Copy link
Contributor

pandas 1.0.3 implemented MEAN of timestamp without casting (#4074 (comment)). That's the main reason for this PR.

libcudf is not Pandas. libcudf is a C++ library that tries to follow semantics of STL types (std::chrono) as closely as possible. As such, we shouldn't allow any operation that requires summing timestamp types.

If Python wants to allow sum/mean of timestamps, then they can zero-copy cast from a timestamp to a duration/integer type and do so. But that's Python's responsibility, not libcudf's.

template <typename OutputType,
typename agg_op,
bool is_mean_of_timestamp,
std::enable_if_t<!is_mean_of_timestamp>* = nullptr>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've always found this explicit "is_mean" bool that gets passed around a little strange (the actual rolling code did it too). The logic for this can be inferred from the other parameters, ie

Suggested change
std::enable_if_t<!is_mean_of_timestamp>* = nullptr>
template <typename OutputType,
typename agg_op,
std::enable_if_t<!(agg_op == MEAN && OutputType == cudf::is_timestamp<OutputType>())>* = nullptr
>

Then this could be bubbled up further so that is_mean can be removed from the template params for create_reference_output

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #5466 is created to track this.

template <typename OutputType,
typename agg_op,
bool is_mean_of_timestamp,
std::enable_if_t<is_mean_of_timestamp>* = nullptr>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

template <typename OutputType,
typename agg_op,
bool is_mean_of_timestamp,
std::enable_if_t<!is_mean_of_timestamp>* = nullptr>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as comment from grouped_rolling_test.cpp

Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this PR should be made to just disable sum of timestamp types. Emulating Pandas MEAN of timestamp can be done by casting to duration.

@harrism
Copy link
Member

harrism commented Jun 5, 2020

Merge after #5359 and #5394

@codecov
Copy link

codecov bot commented Jun 15, 2020

Codecov Report

Merging #5319 into branch-0.15 will increase coverage by 0.29%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.15    #5319      +/-   ##
===============================================
+ Coverage        86.09%   86.38%   +0.29%     
===============================================
  Files               75       75              
  Lines            12667    12983     +316     
===============================================
+ Hits             10906    11216     +310     
- Misses            1761     1767       +6     
Impacted Files Coverage Δ
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_orc.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_json.py 100.00% <0.00%> (ø)
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py 100.00% <0.00%> (ø)
python/cudf/cudf/utils/applyutils.py 98.75% <0.00%> (+0.02%) ⬆️
python/cudf/cudf/core/join/join.py 92.41% <0.00%> (+0.03%) ⬆️
...ython/custreamz/custreamz/tests/test_dataframes.py 98.80% <0.00%> (+0.03%) ⬆️
python/cudf/cudf/core/frame.py 89.67% <0.00%> (+0.04%) ⬆️
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 448f38a...53f427a. Read the comment docs.

@karthikeyann karthikeyann changed the title [REVIEW] disallow SUM and specialize MEAN of timestamp types [REVIEW] disallow SUM and MEAN of timestamp types Jun 15, 2020
@nvdbaranec nvdbaranec self-requested a review July 27, 2020 14:46
@harrism harrism mentioned this pull request Jul 28, 2020
15 tasks
@karthikeyann karthikeyann merged commit a4a9aa6 into rapidsai:branch-0.15 Jul 28, 2020
@karthikeyann karthikeyann removed 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond labels Jul 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] SUM (and probably MEAN) of TIMESTAMP types should be disallowed and require duration types instead
5 participants