FEA Add support for accepting a Numpy RandomState #6150

Open
wants to merge 10 commits into
base: branch-25.02

betatim
Member

@betatim betatim commented Nov 28, 2024

In addition to accepting integers, you can now also pass a RandomState object. It is used to derive an integer to use as a seed.

  • add support for cupy random state objects (something for a new PR)

Closes #4753

@github-actions github-actions bot added the Cython / Python Cython or Python issue label Nov 28, 2024
@betatim betatim marked this pull request as ready for review December 3, 2024 12:25
@betatim betatim requested a review from a team as a code owner December 3, 2024 12:25
@betatim betatim requested review from dantegd and divyegala December 3, 2024 12:25
@betatim
Member Author

betatim commented Dec 3, 2024

The failures seemed to be caused by a dask timeout, which I think is unrelated to this change.

Let's see what happens with the latest commit.

For my education, why does it say I requested a code review from people? I don't remember clicking any buttons :-/

@dantegd dantegd added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Dec 3, 2024
cuml.UMAP,
],
)
def test_random_state_argument(Estimator):
Member

Should we add a quick test here that the results are the same with the seed, or is that tested in the individual algo tests?

Member Author

I don't think the results will be the same because RandomState(42) will not lead to 42 being passed as the seed to the internal functions that cuml calls.

We can't pass any form of "RNG state" to the internal functions; we can only pass an integer. So I think the best we can do when a RandomState is passed in is to use it to generate a uint64 and use that as the seed for the internal functions. I think this is better than trying to extract the (original) seed from the RandomState, because that way you get a different value if the random state has been used previously.

For instance, in this (contrived) example I think the two RFs should not both use 42 as the seed internally, as they are two separate instances.

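from numpy.random import RandomState
import cuml

# One shared RandomState, passed to two separate estimators
rs = RandomState(42)

rf1 = cuml.RandomForestClassifier(random_state=rs)
rf2 = cuml.RandomForestClassifier(random_state=rs)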
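
The idea described above can be sketched with NumPy alone. This is only an illustration, not the code in this PR: `derive_seed` is a hypothetical helper name, and it draws from a 31-bit range for portability rather than the uint64 mentioned in the comment. It does not handle `random_state=None`.

```python
import numpy as np
from numpy.random import RandomState


def derive_seed(random_state):
    """Hypothetical helper: turn an int or RandomState into an integer seed."""
    if isinstance(random_state, RandomState):
        # Drawing from the RandomState advances it, so the next call
        # yields a different seed.
        return int(random_state.randint(0, np.iinfo(np.int32).max))
    # Plain integers are passed through unchanged.
    return int(random_state)


rs = RandomState(42)
seed_rf1 = derive_seed(rs)  # seed the first estimator would receive
seed_rf2 = derive_seed(rs)  # seed the second estimator would receive
print(seed_rf1, seed_rf2)
```

Because each draw consumes state, the two estimators end up with different integer seeds, while a fresh `RandomState(42)` still reproduces the first seed.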

Contributor

@viclafargue viclafargue left a comment

LGTM

@@ -302,6 +306,11 @@ class KMeans(UniversalBase,
else None),
check_dtype=check_dtype)

# XXX Should deriving a seed from a random state be idempotent? Should repeated
# XXX calls of `fit` create new seeds or not?
Member Author

What do people think about this? Should we re-derive a seed each time fit is called?

Member

That is an excellent question... what would be the behavior of sklearn?

Member Author

If you pass an int, each call to fit is the same, but if you pass a random state, it keeps getting forwarded, so each fit is different. (I think it is at least somewhat unclear what should happen; within scikit-learn, at least, we've not really been able to converge on something :-/)

I think here I'd vote for deriving a new seed each time. My thinking is that this way we match scikit-learn (no need to somehow special-case this for the accelerator), even if I can't justify why having a new seed each time is "the right thing to do".
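
A minimal sketch of the behaviour being voted for, assuming a seed is re-derived on every call to fit. `ToyEstimator` and its `seed_` attribute are hypothetical stand-ins for illustration; this is not cuml or scikit-learn code.

```python
from numpy.random import RandomState


class ToyEstimator:
    """Hypothetical stand-in for an estimator; not cuml or scikit-learn code."""

    def __init__(self, random_state):
        self.random_state = random_state

    def fit(self):
        rs = self.random_state
        if isinstance(rs, RandomState):
            # Re-derive on every fit: the draw advances the stored state.
            self.seed_ = int(rs.randint(0, 2**31 - 1))
        else:
            # A plain integer is used as-is, so every fit is identical.
            self.seed_ = int(rs)
        return self


int_est = ToyEstimator(random_state=42)
int_seeds = [int_est.fit().seed_ for _ in range(2)]

rs_est = ToyEstimator(random_state=RandomState(42))
rs_seeds = [rs_est.fit().seed_ for _ in range(2)]

print(int_seeds, rs_seeds)
```

With an integer, both fits see the same seed; with a RandomState, the second fit sees a new one, which mirrors the scikit-learn behaviour described above.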

@dantegd dantegd changed the base branch from branch-24.12 to branch-25.02 December 11, 2024 23:44
Labels
Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Development

Successfully merging this pull request may close these issues.

[FEA] Accept NumPy (and CuPy) RandomState objects as estimator random_state
3 participants