Remove scikit-learn #1063
7ebd138 to 349a1de
Trying to decide if the +100 lines is worth it to remove one dependency...
Which is more likely to cause issues? (I'm thinking sklearn obviously -- more generally, I'm tired of dependencies breaking everything, it seems we spend as much time fiddling with CI and dependencies as we do on actual features.) Another relevant question is "do we see any other features of sklearn that we'll be using in the future?" (I'm thinking maybe if we implement fine tuning and want to use sklearn, but that isn't necessary.)
Can you give some arguments for/against?
Pros
+/- sklearn isn't a huge deal just because most ML people will already have it installed anyway. But it's always nice to decrease the number of our deps, both for install times and for simpler solves. For example, sklearn requires setuptools < 60, while fiona requires setuptools 61+. Pip can handle this, but Spack can't, meaning you would have to choose between the latest version of sklearn or fiona, you couldn't have both. This was fixed the other day, but still.
Cons
Maintenance burden is my biggest fear. We aren't explicitly making this function public (it doesn't get an import alias), but it's still something people could potentially try to use. If we do keep it, we should prob prefix with an underscore to avoid people relying on it.
Alternatives
I'm really on the fence with this one, not sure how to decide. Curious how you feel about alternative 1 (moving to
I don't really like the idea of moving sklearn to If we divide current lines of code by current number of dependencies, we can get an idea of how many lines of code are worth one dependency. This number will assuredly be larger than 100 (if it isn't, I'm sure I can code golf the current implementation of group splitting...). Put differently, would you take on a new dependency just to get rid of 100 lines of code? (I think no way in hell 🙂) For the "Possibility we may decide to re-add sklearn for another feature someday" -- that doesn't seem like a con at all. If we do, then we can drop these 100 lines of code, else it is a non-issue. The function that we are maintaining has pretty clear logic:
We don't really care if sklearn changes their API for doing this because we just care about the functionality.
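For readers without the diff open, group splitting of this kind can be sketched roughly as follows. This is a minimal standard-library illustration, not the code in this PR: the name `group_shuffle_split_sketch`, its parameters, and the rounding rule are all assumptions made for the example.

```python
import random
from collections.abc import Hashable, Sequence
from typing import Optional


def group_shuffle_split_sketch(
    groups: Sequence[Hashable],
    test_size: float = 0.2,
    seed: Optional[int] = None,
) -> tuple[list[int], list[int]]:
    """Split sample indices so that no group straddles train and test.

    Hypothetical helper for illustration only; not the PR's implementation.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")

    # Unique groups in first-seen order, so a given seed is reproducible.
    unique_groups = list(dict.fromkeys(groups))
    n_test_groups = round(len(unique_groups) * test_size)
    if n_test_groups == 0 or n_test_groups == len(unique_groups):
        raise ValueError("test_size leaves one side without any group")

    # Sample whole groups, not individual samples.
    rng = random.Random(seed)
    test_groups = set(rng.sample(unique_groups, n_test_groups))

    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx
```

Note that in this sketch `test_size` is a fraction of groups, not of samples, so the resulting sample counts are only approximate when groups differ in size.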
I was 50/50 on this but now I'm more like 60/40 in favor. Still approaching critical mass...
Anything actionable?
Not yet, still mulling...
Yeah let's do this. We're adding a bunch more new deps now, so it would be good to reduce. Can you rebase?
349a1de to dcb455a
I'm not trying to mimic torch.utils or torchgeo.datasets.splits here. This is meant to be a simple drop-in replacement that doesn't depend on all of scikit-learn. (also, are you thinking of something other than
I'm talking about If we're going to write our own splitting utility, I don't see why we shouldn't follow the PyTorch style instead of the sklearn style. Could even put it in
Okay, I see what you're saying now! Sorry that took me a minute -- the disconnect was because I wasn't seeing this as something that operated on torch Datasets. It could go in (and, given that, should it really be in
That's fair. I'm fine with keeping this internal-only and using a different style since it doesn't yet support passing in a NonGeoDataset. But if we can figure out a stable API that would support that it would be pretty cool. Want me to merge this as is and save this for another day?
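To make the style difference concrete: sklearn splitters yield index arrays from a `.split(...)` generator, while `torch.utils.data.random_split` takes a dataset plus a list of lengths or fractions and returns one subset per entry. A group-aware function in the second style might look something like the sketch below. It is purely illustrative: the name `random_group_split_sketch` and its signature are hypothetical, it works on plain index lists rather than datasets, and it is not an existing TorchGeo API.

```python
import random
from collections.abc import Hashable, Sequence
from typing import Optional


def random_group_split_sketch(
    groups: Sequence[Hashable],
    fractions: Sequence[float],
    seed: Optional[int] = None,
) -> list[list[int]]:
    """Return one list of sample indices per fraction, keeping groups intact.

    Hypothetical illustration of a random_split-like signature.
    """
    if any(f < 0 for f in fractions) or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must be non-negative and sum to 1")

    # Shuffle the unique groups once, then cut the shuffled list into pieces.
    unique_groups = list(dict.fromkeys(groups))
    random.Random(seed).shuffle(unique_groups)

    n = len(unique_groups)
    splits: list[list[int]] = []
    start = 0
    cumulative = 0.0
    for k, frac in enumerate(fractions):
        cumulative += frac
        # The last split takes the remainder to absorb rounding.
        end = n if k == len(fractions) - 1 else round(cumulative * n)
        chosen = set(unique_groups[start:end])
        splits.append([i for i, g in enumerate(groups) if g in chosen])
        start = end
    return splits
```

Each returned index list could then be wrapped in `torch.utils.data.Subset(dataset, indices)` to get dataset objects back, which is roughly what supporting a NonGeoDataset input would require.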
Co-authored-by: Adam J. Stewart <[email protected]>
Tests fail with
I would expect it to behave similarly to our other splitting functions.
Can you finish this?
We only use sklearn for GroupShuffleSplit. Reimplementing our own version to remove this dependency.