Expose seed argument to hash_values #12795
Merged: rapids-bot merged 11 commits into rapidsai:branch-23.04 from ayushdg:enh-hash_values-seed on Feb 24, 2023
Commits
4bcf861
Expose seed param to frame hash_values
ayushdg 6f593e7
Make seed for hash_values an optional argument
ayushdg 1e30bcf
Add tests for hash_values with seed
ayushdg 263c14e
Merge branch 'branch-23.04' into enh-hash_values-seed
ayushdg c930a14
Merge branch 'branch-23.04' into enh-hash_values-seed
ayushdg fe07863
Merge branch 'branch-23.04' into enh-hash_values-seed
ayushdg e59c434
Use np.arange instead of np.asarray(range)
ayushdg 7c5a83f
Add warning when using seed with unsupported methods
ayushdg 6cfe7ca
Update tests
ayushdg e11d21b
Merge branch 'branch-23.04' into enh-hash_values-seed
ayushdg be4fdd8
Merge branch 'branch-23.04' into enh-hash_values-seed
ayushdg
@@ -38,6 +38,7 @@
     NUMERIC_TYPES,
     assert_eq,
     assert_exceptions_equal,
+    assert_neq,
     does_not_raise,
     expect_warning_if,
     gen_rand,
@@ -1323,9 +1324,10 @@ def test_assign():
 
 @pytest.mark.parametrize("nrows", [1, 8, 100, 1000])
 @pytest.mark.parametrize("method", ["murmur3", "md5"])
-def test_dataframe_hash_values(nrows, method):
+@pytest.mark.parametrize("seed", [None, 42])
+def test_dataframe_hash_values(nrows, method, seed):
     gdf = cudf.DataFrame()
-    data = np.asarray(range(nrows))
+    data = np.arange(nrows)
     data[0] = data[-1]  # make first and last the same
     gdf["a"] = data
     gdf["b"] = gdf.a + 100
@@ -1334,12 +1336,41 @@ def test_dataframe_hash_values(nrows, method):
     assert len(out) == nrows
     assert out.dtype == np.uint32
 
+    warning_expected = (
+        True if seed is not None and method not in {"murmur3"} else False
+    )
     # Check single column
-    out_one = gdf[["a"]].hash_values(method=method)
+    if warning_expected:

[Review comment on `if warning_expected:`] Another alternative is to separate out the test for warning into a separate pytest and not use

+        with pytest.warns(
+            UserWarning, match="Provided seed value has no effect*"
+        ):
+            out_one = gdf[["a"]].hash_values(method=method, seed=seed)
+    else:
+        out_one = gdf[["a"]].hash_values(method=method, seed=seed)
     # First matches last
     assert out_one.iloc[0] == out_one.iloc[-1]
     # Equivalent to the cudf.Series.hash_values()
-    assert_eq(gdf["a"].hash_values(method=method), out_one)
+    if warning_expected:
+        with pytest.warns(
+            UserWarning, match="Provided seed value has no effect*"
+        ):
+            assert_eq(gdf["a"].hash_values(method=method, seed=seed), out_one)
+    else:
+        assert_eq(gdf["a"].hash_values(method=method, seed=seed), out_one)
+
+
+@pytest.mark.parametrize("method", ["murmur3"])
+def test_dataframe_hash_values_seed(method):
+    gdf = cudf.DataFrame()
+    data = np.arange(10)
+    data[0] = data[-1]  # make first and last the same
+    gdf["a"] = data
+    gdf["b"] = gdf.a + 100
+    out_one = gdf.hash_values(method=method, seed=0)
+    out_two = gdf.hash_values(method=method, seed=1)
+    assert out_one.iloc[0] == out_one.iloc[-1]
+    assert out_two.iloc[0] == out_two.iloc[-1]
+    assert_neq(out_one, out_two)
 
 
 @pytest.mark.parametrize("nrows", [3, 10, 100, 1000])
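For readers without a GPU, the properties these tests check can be sketched in plain Python: equal inputs hash equally under a given seed, different seeds give different hashes, and a seed passed to a method that ignores it triggers a UserWarning. This sketch does not use cudf. `toy_hash_value` is a hypothetical helper, `zlib.crc32` with a starting value merely stands in for a seeded hash such as murmur3, and the warning wording beyond "Provided seed value has no effect" is an assumption.

```python
import hashlib
import warnings
import zlib


def toy_hash_value(value, method="crc32", seed=None):
    """Hypothetical stand-in for a seeded hash; not the cudf implementation."""
    data = str(value).encode("utf-8")
    if method == "crc32":
        # crc32 accepts a starting value, used here as a stand-in "seed".
        return zlib.crc32(data, 0 if seed is None else seed)
    if method == "md5":
        if seed is not None:
            # Assumed wording; only the leading phrase appears in the tests.
            warnings.warn(
                "Provided seed value has no effect for this hash method",
                UserWarning,
            )
        return hashlib.md5(data).hexdigest()
    raise ValueError(f"Unsupported method: {method}")


# Mirror the test data: make the first and last values the same.
values = list(range(10))
values[0] = values[-1]

seed_zero = [toy_hash_value(v, seed=0) for v in values]
seed_one = [toy_hash_value(v, seed=1) for v in values]
```

Here `seed_zero` and `seed_one` each start and end with the same hash, while the two lists differ from each other, which is the same shape of assertion that `test_dataframe_hash_values_seed` makes with `assert_neq`.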
If the seed has no effect, I think maybe we should warn, at least since we have some flexibility here given this has no equivalent pandas API.
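One way to act on the review comments about warning when the seed has no effect, without repeating each call in an if/else branch, is a conditional warning check. The sketch below uses only the standard library. `warns_if` and `fake_hash_values` are hypothetical names introduced for illustration; it is not the `expect_warning_if` helper that appears in the test module's import list, whose signature is not shown in this PR.

```python
import contextlib
import warnings


@contextlib.contextmanager
def _expect_warning(category, match):
    # Record every warning raised inside the block, then check for a match.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        yield
    assert any(
        issubclass(w.category, category) and match in str(w.message)
        for w in caught
    ), f"expected a {category.__name__} containing {match!r}"


def warns_if(condition, category=UserWarning, match=""):
    """Hypothetical helper: expect a warning only when `condition` is true."""
    if condition:
        return _expect_warning(category, match)
    return contextlib.nullcontext()


def fake_hash_values(method, seed=None):
    """Stand-in for the method under test; warns when the seed is ignored."""
    if seed is not None and method != "murmur3":
        warnings.warn("Provided seed value has no effect", UserWarning)
    return 0


checked = []
for method in ["murmur3", "md5"]:
    for seed in [None, 42]:
        warning_expected = seed is not None and method != "murmur3"
        with warns_if(warning_expected, match="Provided seed value has no effect"):
            fake_hash_values(method, seed=seed)
        checked.append((method, seed, warning_expected))
```

With a helper of this shape, each call under test is written once, and the parametrized `seed` and `method` values decide whether a warning is required.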