-
Notifications
You must be signed in to change notification settings - Fork 541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEA Add support for accepting a Numpy RandomState #6150
base: branch-25.02
Are you sure you want to change the base?
Changes from 6 commits
61768cb
77aa6e3
ea1dfe5
8b973ce
814c62b
c411530
7c4aae3
1daab08
b0f7841
7092d59
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
# | ||
# Copyright (c) 2024, NVIDIA CORPORATION. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# | ||
import numbers | ||
import numpy as np | ||
|
||
|
||
def check_random_seed(seed): | ||
"""Turn a np.random.RandomState instance into a seed. | ||
Parameters | ||
---------- | ||
seed : None | int | instance of RandomState | ||
If seed is None, return a random int as seed. | ||
If seed is an int, return it. | ||
If seed is a RandomState instance, derive a seed from it. | ||
Otherwise raise ValueError. | ||
""" | ||
if seed is None: | ||
seed = np.random.RandomState(None) | ||
|
||
if isinstance(seed, numbers.Integral): | ||
return seed | ||
if isinstance(seed, np.random.RandomState): | ||
return seed.randint( | ||
low=0, high=np.iinfo(np.uint32).max, dtype=np.uint32 | ||
) | ||
raise ValueError("%r cannot be used to create a seed." % seed) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
# | ||
# Copyright (c) 2024, NVIDIA CORPORATION. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# | ||
import pytest | ||
|
||
import numpy as np | ||
import cuml | ||
from cuml.datasets import make_blobs | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"Estimator", | ||
[ | ||
cuml.KMeans, | ||
cuml.RandomForestRegressor, | ||
cuml.RandomForestClassifier, | ||
cuml.TSNE, | ||
cuml.UMAP, | ||
], | ||
) | ||
def test_random_state_argument(Estimator): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we add a quick test here that the results are the same with the seed, or is that tested in the individual algo tests? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think the results will be the same because We can't pass any form of "RNG state" to the internal functions, we can just pass an integer. So I think the best we can do when a For example in this (contrived) example I think the two RFs should not both use rs = RandomState(42)
rf1 = cuml.RandomForestClassifier(random_state=rs)
rf2 = cuml.RandomForestClassifier(random_state=rs) |
||
X, y = make_blobs(random_state=0) | ||
# Check that both integer and np.random.RandomState are accepted | ||
for seed in (42, np.random.RandomState(42)): | ||
est = Estimator(random_state=seed) | ||
|
||
if est.__class__.__name__ != "TSNE": | ||
est.fit(X, y) | ||
else: | ||
est.fit(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do people think about this? Should we re-derive a seed each time
fit
is called?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is an excellent question... what would be the behavior of sklearn?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you pass an
int
each call tofit
is the same, but if you pass a random state it keeps getting forwarded, so eachfit
is different. (I think it is at least somewhat unclear what should happen, at least within scikit-learn we've not really been able to converge on something :-/)I think here I'd vote for deriving a new seed each time. My thinking is that that way we match scikit-learn (no need to somehow special case this for the accelerator). Even if I can't justify why having a new seed each time is "the right thing to do"