Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose seed argument to hash_values #12795

Merged
merged 11 commits into from
Feb 24, 2023

Conversation

ayushdg
Copy link
Member

@ayushdg ayushdg commented Feb 16, 2023

Description

This PR exposes the seed param to hash_values that is already supported by libcudf's hash method.
Closes #12775

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Feb 16, 2023
@ayushdg ayushdg added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 17, 2023
@ayushdg ayushdg marked this pull request as ready for review February 17, 2023 23:52
@ayushdg ayushdg requested a review from a team as a code owner February 17, 2023 23:52
Seed value to use for the hash function.
Note - This only has effect for the following supported
hash functions:
* murmur3: MurmurHash3 hash function.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the seed has no effect, I think maybe we should warn, at least since we have some flexibility here given this has no equivalent pandas API.

# Check single column
out_one = gdf[["a"]].hash_values(method=method)
if warning_expected:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative is to separate out the test for warning into a separate pytest and not use seed at all for this one.

@ayushdg
Copy link
Member Author

ayushdg commented Feb 22, 2023

One observation while testing out hash_values for murmur3 with different seed values is that there seems to be little variance and somewhat of a linear relationship b/w the seed and the hashed values. I'm not sure if this is expected behavior or not, but most implementations of murmur3 from 3rd party libraries seem to give much more varied results.

ser = cudf.Series(["hello world"]).str.character_ngrams(5,False)

ser.hash_values(seed=0)
ser.hash_values(seed=0xFFFFFFF)
|               | Seed=0     | Seed=1     | Seed=0xFFFFFFF |
|---------------|------------|------------|----------------|
| 0             | 3267589120 | 3267589185 | 3376188480     |
| 1             | 1877329862 | 1877329927 | 2082093178     |
| 2             | 1758410039 | 1758410102 | 1664142089     |
| 3             | 1327917612 | 1327917677 | 1557763604     |
| 4             | 2896800200 | 2896800265 | 3210106488     |
| 5             | 197501035  | 197501098  | 3825621        |
| 6             | 2580395700 | 2580395765 | 2452769164     |
| dtype: uint32 |            |            |                |

Copy link
Contributor

@brandon-b-miller brandon-b-miller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. Not sure about the variance question but given that we're just wrapping here I'd say it's a libcudf question.

@ayushdg
Copy link
Member Author

ayushdg commented Feb 23, 2023

/merge

@rapids-bot rapids-bot bot merged commit e64e26e into rapidsai:branch-23.04 Feb 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Expose seed argument for series.hash_values
2 participants