Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DataFrame.hash_values, deprecate DataFrame.hash_columns. #9458

Merged
merged 4 commits into from
Oct 19, 2021

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Oct 16, 2021

This PR implements DataFrame.hash_values, which will replace DataFrame.hash_columns (which is deprecated in this PR). This proposal was discussed offline with @vyasr and in the weekly cuDF Python dev meeting.

This unifies the method name and signature for Series.hash_values and DataFrame.hash_values, enabling future internal refactoring by moving the method's implementation to the Frame class (though I'm waiting for the removal of Series.hash_encode to follow up on this so it can be done in a single pass, see #9381 and #9457).

@bdice bdice added 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change deprecation labels Oct 16, 2021
@bdice bdice self-assigned this Oct 16, 2021
@bdice bdice requested a review from a team as a code owner October 16, 2021 01:23
@codecov
Copy link

codecov bot commented Oct 16, 2021

Codecov Report

Merging #9458 (8f19a31) into branch-21.12 (ab4bfaa) will decrease coverage by 0.13%.
The diff coverage is 0.00%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9458      +/-   ##
================================================
- Coverage         10.79%   10.65%   -0.14%     
================================================
  Files               116      117       +1     
  Lines             18869    19759     +890     
================================================
+ Hits               2036     2106      +70     
- Misses            16833    17653     +820     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_lib/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/_lib/strings/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
... and 62 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b66f14e...8f19a31. Read the comment docs.

Copy link
Contributor

@skirui-source skirui-source left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Need to add tests though!

@bdice
Copy link
Contributor Author

bdice commented Oct 18, 2021

@skirui-source Good catch! I have now added tests. The tests for DataFrame.hash_values are identical to the tests for DataFrame.hash_columns aside from the API change. I also added pytest.warns context managers to the existing tests for DataFrame.hash_columns to make sure that the deprecations are working and hide the deprecation warnings from our tests' output until the feature is removed.

@bdice bdice requested a review from skirui-source October 18, 2021 23:09
@bdice bdice requested a review from galipremsagar October 19, 2021 15:56
@rgsl888prabhu
Copy link
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 2144034 into rapidsai:branch-21.12 Oct 19, 2021
@bdice bdice deleted the dataframe-hash-values branch October 19, 2021 19:56
rapids-bot bot pushed a commit that referenced this pull request Jan 4, 2022
This PR removes the deprecated method `DataFrame.hash_columns`. Users can replace existing calls like `df.hash_columns(columns, method)` with `df[columns].hash_values(method)`. Resolves #9503, follows up on #9458.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - https://github.com/brandon-b-miller

URL: #9943
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants