Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DataFrame diff() #9817

Merged
merged 51 commits into from
Feb 5, 2022
Merged

Conversation

skirui-source
Copy link
Contributor

@skirui-source skirui-source commented Dec 1, 2021

Fixes: #9604 and resolves #1271

This PR introduces diff() method for DataFrames of numeric types only. As a follow up, PR # #10212 will add support for non-numeric types (specifically timestamps/duration)
for both Series and DataFrame

@skirui-source skirui-source added feature request New feature or request Python Affects Python cuDF API. non-breaking Non-breaking change labels Dec 1, 2021
@skirui-source skirui-source self-assigned this Dec 1, 2021
@codecov
Copy link

codecov bot commented Dec 2, 2021

Codecov Report

Merging #9817 (575c118) into branch-22.04 (a7d88cd) will increase coverage by 0.04%.
The diff coverage is 0.00%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04    #9817      +/-   ##
================================================
+ Coverage         10.42%   10.47%   +0.04%     
================================================
  Files               119      122       +3     
  Lines             20603    20496     -107     
================================================
- Hits               2148     2147       -1     
+ Misses            18455    18349     -106     
Impacted Files Coverage Δ
python/cudf/cudf/core/_base_index.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/numerical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/timedelta.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column_accessor.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 35 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a25a2ec...575c118. Read the comment docs.

@skirui-source skirui-source marked this pull request as ready for review December 3, 2021 05:00
@skirui-source skirui-source requested a review from a team as a code owner December 3, 2021 05:00
python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
@skirui-source skirui-source requested a review from isVoid January 26, 2022 17:08
@vyasr vyasr removed their request for review January 26, 2022 19:20
@vyasr
Copy link
Contributor

vyasr commented Jan 26, 2022

Removing my review request since this PR looks to be pretty well-covered by other reviewers.

@karthikeyann
Copy link
Contributor

rerun tests

1 similar comment
@karthikeyann
Copy link
Contributor

rerun tests

@skirui-source
Copy link
Contributor Author

skirui-source commented Feb 3, 2022

@bdice @isVoid in response to you review comment above, here's an example:

In [1]: import pandas as pd

In [2]: import cudf

In [3]: gdf = cudf.to_datetime(
   ...:     cudf.DataFrame({"year": [2015, 2016], "month": [2, 3], "day": [4, 5]})
   ...: )
   ...: gdf
Out[3]: 
0   2015-02-04
1   2016-03-05
dtype: datetime64[s]

In [4]: pdf = gdf.to_pandas()

In [5]: pdf.diff(periods=2)
Out[5]: 
0   NaT
1   NaT
dtype: timedelta64[ns]

In [6]: gdf.diff(periods=2)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Input In [6], in <module>
----> 1 gdf.diff(periods=2)

File ~/cudf/python/cudf/cudf/core/series.py:3203, in Series.diff(self, periods)
   3198     raise AssertionError(
   3199         "Diff currently requires columns with no null values"
   3200     )
   3202 if not np.issubdtype(self.dtype, np.number):
-> 3203     raise NotImplementedError(
   3204         "Diff currently only supports numeric dtypes"
   3205     )
   3207 # TODO: move this libcudf
   3208 input_col = self._column

NotImplementedError: Diff currently only supports numeric dtypes

This PR doesn;t cover timestamps, so should I go ahead and get this merged first ( add a TO-DO/ file follow-up issue)?
I'm also unsure what the scope is, to add support for time data, any thoughts?

@bdice
Copy link
Contributor

bdice commented Feb 3, 2022

@skirui-source I see that this is already implemented for Series for only numeric types, so we can introduce this PR as-is without expanding the scope. However, we should consider a follow-up that relaxes that restriction (see code below) for both Series and DataFrame to support non-numeric types (specifically to support timestamps/durations, I'm not sure what else) in a follow-up PR. It shouldn't require anything fancy, since we already support binops on timestamp types and this just does a shift-and-subtract. A time-series diff is probably a very common user need, if you have timestamps and want the durations between rows.

These should be relaxed to support any type that can be subtracted:

if not np.issubdtype(self.dtype, np.number):
raise NotImplementedError(
"Diff currently only supports numeric dtypes"
)

if not all(is_numeric_dtype(i) for i in self.dtypes):
raise NotImplementedError(
"DataFrame.diff only supports numeric dtypes"
)

@skirui-source
Copy link
Contributor Author

@isVoid @bdice Cool, Thanks 👍🏾
in that case I think this PR is ready for another review/approval unless there's any more unresolved discussions

@bdice
Copy link
Contributor

bdice commented Feb 3, 2022

I opened #10212 about enabling broader type support for timestamps.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few small requests to improve the tests. Otherwise LGTM.

python/cudf/cudf/tests/test_dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_dataframe.py Outdated Show resolved Hide resolved
@skirui-source
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 2e458b9 into rapidsai:branch-22.04 Feb 5, 2022
@skirui-source skirui-source deleted the df.diff branch March 12, 2022 03:32
rapids-bot bot pushed a commit that referenced this pull request Nov 1, 2022
This PR removes all "smart quotes" from the library by enforcing a pre-commit hook.

Smart quotes typically arise from copying rendered docstrings from Pandas, because Sphinx automatically transforms straight quotes into smart quotes when rendering the docs as HTML. However, the use of smart quotes is undesirable in code, and makes it difficult to do find-replace transformations if straight and smart quotes are mixed.

I have made suggestions to fix this several times before, so I am making the suggestions more permanent and automatically enforceable via a pre-commit style check:
- #12025 (comment)
- #9817 (comment)
- #9571 (comment)

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #12035
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuDF (Python) Reviewer labels Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] DataFrame diff [FEA][BACKLOG] Difference functionality
9 participants