[BUG] make_blobs doesn't behave like cuML and scikit-learn counterparts, and shuffle doesn't really shuffle #1127

Open
Nyrio opened this issue Jan 10, 2023 · 0 comments
Labels
bug Something isn't working

This is better understood by example.

cuML and scikit-learn behavior:

unshuffled: [0,0,0,1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,6,7,7,8,8,9,9]
shuffled:   [0,4,9,4,9,8,5,6,3,6,0,8,4,3,2,7,1,5,0,7,1,2,2,3,1]

raft behavior:

unshuffled: [0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4]
shuffled:   [8,5,2,9,6,3,0,7,4,1,8,5,2,9,6,3,0,7,4,1,8,5,2,9,6]

The difference between the unshuffled versions is cosmetic: cuML and scikit-learn group identical labels contiguously, whereas raft assigns each point its index modulo the number of labels. However, the "shuffled" version in raft is not properly shuffled, as the labels still appear cyclically in the same order.
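To make the two unshuffled layouts concrete, here is a minimal Python sketch that reproduces both label sequences from the example (25 samples, 10 labels). The function names `contiguous_labels` and `modulo_labels` are hypothetical and only illustrate the layouts; they are not taken from any of the three libraries.

```python
def contiguous_labels(n_samples, n_labels):
    """Group identical labels together (layout shown for cuML and scikit-learn)."""
    base, extra = divmod(n_samples, n_labels)
    labels = []
    for label in range(n_labels):
        # The first `extra` labels receive one additional sample each.
        count = base + (1 if label < extra else 0)
        labels.extend([label] * count)
    return labels


def modulo_labels(n_samples, n_labels):
    """Assign each sample its index modulo the number of labels (layout shown for raft)."""
    return [i % n_labels for i in range(n_samples)]


grouped = contiguous_labels(25, 10)
cyclic = modulo_labels(25, 10)

# Both layouts contain every label the same number of times,
# which is why the difference is only cosmetic.
print(sorted(cyclic) == grouped)
```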

This is due to how raft attempts to shuffle them, using an affine transform. Congruence modulo the number of labels is preserved by multiplication and addition. That is, if two points have the same label before the transform, they still have the same label after it.
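The effect can be sketched in a few lines of Python. The coefficients `a = 7` and `b = 8` are assumptions chosen only because they reproduce the "shuffled" raft output in the example above; they are not taken from the raft source, and the actual implementation may apply the transform differently.

```python
def affine_shuffled_labels(n_samples, n_labels, a, b):
    """Label of sample i after an affine transform of its index, taken modulo n_labels."""
    return [(a * i + b) % n_labels for i in range(n_samples)]


n_samples, n_labels = 25, 10
labels = affine_shuffled_labels(n_samples, n_labels, a=7, b=8)
print(labels)

# If i and j are congruent modulo n_labels, then so are a*i + b and a*j + b,
# so samples that are n_labels apart always end up with the same label.
print(all(labels[i] == labels[i + n_labels] for i in range(n_samples - n_labels)))
```

Whatever values are picked for `a` and `b`, the sequence repeats with a period that divides the number of labels, so the result is a relabeling of the cyclic layout rather than a shuffle of the samples.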

We should use a different method to shuffle.
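As one possible direction, and not necessarily the fix raft should adopt, the labels could be reordered with a uniformly random permutation. The sketch below is a CPU-only illustration using Python's standard `random` module; the helper name `shuffle_labels` is hypothetical, and a GPU implementation would need a parallel way to generate the permutation.

```python
import random


def shuffle_labels(labels, seed=None):
    """Return a copy of `labels` reordered by a uniformly random permutation."""
    rng = random.Random(seed)  # seeded generator for reproducibility
    shuffled = list(labels)    # copy, so the input is left untouched
    rng.shuffle(shuffled)      # in-place Fisher-Yates shuffle
    return shuffled


unshuffled = [i % 10 for i in range(25)]
shuffled = shuffle_labels(unshuffled, seed=0)

# The label counts are unchanged; only the order differs.
print(sorted(shuffled) == sorted(unshuffled))
```

In a real `make_blobs`, the same permutation would have to be applied to the data rows as well, so that each point keeps its own label.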

Nyrio added the bug label Jan 10, 2023