[BUG] DataFrame.describe prints datetimes differently than pandas #9826

Closed
Tracked by #9815
bdice opened this issue Dec 2, 2021 · 1 comment

bdice commented Dec 2, 2021

Describe the bug
The string formatting of datetimes in DataFrame.describe differs between cudf and pandas. Specifically, cudf prints the mean as 2006-09-01T08:00:00.000000000 (ISO 8601 "T" separator and nanosecond digits), while pandas prints 2006-09-01 08:00:00. I think the types are the same, and that this is only an issue with formatting -- not data types or casting -- but I'm not completely sure.

Steps/Code to reproduce bug

>>> cudf.Series([
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01"),
... ]).describe(datetime_is_numeric=True)
count                                3
mean     2006-09-01T08:00:00.000000000
min                2000-01-01 00:00:00
25%                2004-12-31 12:00:00
50%                2010-01-01 00:00:00
75%                2010-01-01 00:00:00
max                2010-01-01 00:00:00
dtype: object

Expected behavior

>>> pd.Series([
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01"),
... ]).describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object
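To illustrate the "formatting only" hypothesis, here is a minimal sketch that checks the two printed means denote the same instant. It uses only pandas and NumPy (no cudf), and the variable names are illustrative rather than taken from either library.

```python
import numpy as np
import pandas as pd

# Same input values as in the reproduction above.
values = [
    np.datetime64("2000-01-01"),
    np.datetime64("2010-01-01"),
    np.datetime64("2010-01-01"),
]
mean = pd.Series(values).mean()

# The two renderings of the mean, copied from the outputs above.
cudf_text = "2006-09-01T08:00:00.000000000"
pandas_text = "2006-09-01 08:00:00"

# Both strings parse to the same Timestamp, which equals the computed mean.
same_instant = pd.Timestamp(cudf_text) == pd.Timestamp(pandas_text) == mean
print(same_instant)
```

If this holds, the underlying value is identical and only its string rendering differs.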

This came up in doctests (#9815) because the existing formatting of the docstring does not match either pandas or the current cudf implementation.

>>> s = cudf.Series([
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01")
... ])
>>> s
0   2000-01-01
1   2010-01-01
2   2010-01-01
dtype: datetime64[s]
>>> s.describe()
count                                3
mean     2006-09-01 08:00:00.000000000
min      2000-01-01 00:00:00.000000000
25%      2004-12-31 12:00:00.000000000
50%      2010-01-01 00:00:00.000000000
75%      2010-01-01 00:00:00.000000000
max      2010-01-01 00:00:00.000000000
dtype: object

In the short term, I am updating the doctest in #9815 to match the current implementation (5111a71).
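For context on why a formatting difference breaks doctests: doctest compares the printed output against the docstring as text, so a "T" separator is enough to fail. The sketch below uses only the standard library; `mean_iso`, `mean_str`, and `failures` are hypothetical helpers for illustration and are not part of cudf or pandas.

```python
import doctest
from datetime import datetime


def mean_iso():
    """
    >>> mean_iso()
    '2006-09-01 08:00:00'
    """
    # isoformat() uses a "T" separator, so the docstring above will not match.
    return datetime(2006, 9, 1, 8).isoformat()


def mean_str():
    """
    >>> mean_str()
    '2006-09-01 08:00:00'
    """
    # str() uses a space separator, matching the docstring exactly.
    return str(datetime(2006, 9, 1, 8))


def failures(func):
    """Run the doctest in func's docstring and return the number of failed examples."""
    test = doctest.DocTestParser().get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the count matters here.
    return runner.run(test, out=lambda text: None).failed


print(failures(mean_iso), failures(mean_str))
```

The value is the same in both functions; only the string representation decides whether the doctest passes.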

bdice added the bug (Something isn't working) and Python (Affects Python cuDF API.) labels Dec 2, 2021
bdice mentioned this issue Dec 2, 2021
shwina self-assigned this Dec 3, 2021

bdice commented Jan 13, 2022

This appears to be resolved by #9867. The changes in _describe_timestamp fixed it by converting the mean to a pd.Timestamp before turning it into a string:

"mean": str(pd.Timestamp(self.mean())),

https://github.com/rapidsai/cudf/pull/9867/files#diff-20a3f41cdf918eefd141f3aba57a2db388915724daea1f3829366cbb591e72ceR3403
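A minimal sketch of why that conversion changes the output, using only NumPy and pandas. It assumes the pre-fix mean was a numpy.datetime64 scalar with nanosecond resolution; the issue does not state this, but the "T" separator and nine fractional digits match how NumPy prints such a value.

```python
import numpy as np
import pandas as pd

# Assumption: stand-in for the value returned by self.mean() before the fix.
mean = np.datetime64("2006-09-01T08:00:00", "ns")

# NumPy renders datetime64[ns] in ISO 8601 form with all nine fractional digits.
raw_text = str(mean)

# pandas.Timestamp renders with a space separator and drops a zero fraction.
fixed_text = str(pd.Timestamp(mean))

print(raw_text)
print(fixed_text)
```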

@bdice bdice closed this as completed Jan 13, 2022
rapids-bot bot pushed a commit that referenced this issue Jan 15, 2022
This PR adds doctests and resolves #9513. Several issues were found by running doctests that have now been resolved:

- [x] #9821
- [x] #9822
- [x] #9823
- [x] #9824
- [x] #9825
- [x] #9826
- [x] #9827
- [x] #9828 (workaround by deleting doctests)
- [x] #9829

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9815