
Remove scikit-learn #1063

Merged: 18 commits into main on Apr 24, 2023
Conversation

calebrob6
Member

We only use sklearn for GroupShuffleSplit. Reimplementing our own version to remove this dependency.
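For context, the sklearn feature in question is used roughly like this (a minimal sketch; the exact call site and parameters in torchgeo may differ):

```python
from sklearn.model_selection import GroupShuffleSplit

# Samples that share a group (e.g. chips cut from the same scene) must not
# end up on both sides of the split.
groups = ["scene_a", "scene_a", "scene_b", "scene_b", "scene_c", "scene_d"]

splitter = GroupShuffleSplit(test_size=0.5, n_splits=1, random_state=0)
train_idx, val_idx = next(splitter.split(X=list(range(len(groups))), groups=groups))
print(train_idx, val_idx)  # disjoint index arrays; whole groups stay together
```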

@github-actions bot added the datamodules, dependencies, and testing labels on Jan 29, 2023
@adamjstewart
Collaborator

Trying to decide if the +100 lines is worth it to remove one dependency...

@calebrob6
Member Author

Which is more likely to cause issues? (I'm thinking sklearn, obviously. More generally, I'm tired of dependencies breaking everything; it seems we spend as much time fiddling with CI and dependencies as we do on actual features.)

Another relevant question is "do we see any other features of sklearn that we'll be using in the future?" (I'm thinking maybe if we implement fine-tuning and want to use sklearn, but that isn't necessary.)

@calebrob6
Member Author

Can you give some arguments for/against?

@adamjstewart
Collaborator

Pros

  • -1 required direct dependency
  • -11 required indirect dependencies (including scipy which was listed as optional)
  • Fewer dependabot PRs
  • Faster installs during tests

+/- sklearn isn't a huge deal just because most ML people will already have it installed anyway. But it's always nice to decrease the number of our deps, both for install times and for simpler solves. For example, sklearn requires setuptools < 60, while fiona requires setuptools 61+. Pip can handle this, but Spack can't, meaning you would have to choose between the latest version of sklearn or fiona; you couldn't have both. This was fixed the other day, but still.

Cons

  • +100 lines of code we have to maintain
  • Possibility of sklearn implementation getting bug fixes we don't
  • Possibility of sklearn implementation changing API someday
  • Possibility we may decide to re-add sklearn for another feature someday

Maintenance burden is my biggest fear. We aren't explicitly making this function public (it doesn't get an import alias), but it's still something people could potentially try to use. If we do keep it, we should probably prefix it with an underscore to avoid people relying on it.

Alternatives

  • Could move sklearn to [datasets] and make it optional
  • Could simplify the function to the absolute bare minimum feature with no error checks, but that might lose compatibility with the sklearn implementation, making it difficult to revert someday if we decide to use sklearn for something else

I'm really on the fence with this one, not sure how to decide. Curious how you feel about alternative 1 (moving to [datasets]). Most of the pros go away, but at least it's optional. Technically it's for datamodules, not datasets, but close enough.

@calebrob6
Member Author

I don't really like the idea of moving sklearn to [datasets], as the whole point of this was to reduce dependency-related maintenance burden.

If we divide the current lines of code by the current number of dependencies, we get an idea of how many lines of code one dependency is worth. That number will assuredly be larger than 100 (if it isn't, I'm sure I can code-golf the current implementation of group splitting...). Put differently, would you take on a new dependency just to get rid of 100 lines of code? (I think no way in hell 🙂)

For the "Possibility we may decide to re-add sklearn for another feature someday" -- that doesn't seem like a con at all. If we do, then we can drop these 100 lines of code, else it is a non-issue.

The function that we are maintaining has pretty clear logic (see the sketch below):

  • Takes a list of something (hashables, maybe?) and a list of groups.
  • Splits the input list into two (according to a given percentage) while ensuring that the group values assigned to the two splits are disjoint.

We don't really care if sklearn changes their API for doing this, because we only care about the functionality.
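A minimal sketch of that logic in plain Python (illustrative only; the function name, signature, and rounding behavior here are my assumptions, not necessarily what this PR merged):

```python
import random
from collections.abc import Hashable, Sequence

def group_shuffle_split(
    groups: Sequence[Hashable],
    test_size: float = 0.2,
    random_state: int | None = None,
) -> tuple[list[int], list[int]]:
    """Split indices into two lists such that no group appears in both."""
    unique_groups = sorted(set(groups), key=str)  # deterministic order before shuffling
    random.Random(random_state).shuffle(unique_groups)

    # Hold out roughly ``test_size`` of the *groups* for the second split.
    n_test = max(1, round(test_size * len(unique_groups)))
    test_groups = set(unique_groups[:n_test])

    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx
```

For example, group_shuffle_split(["a", "a", "b", "c"], test_size=0.5) keeps both indices of group "a" on the same side of the split.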

@adamjstewart
Collaborator

I was 50/50 on this but now I'm more like 60/40 in favor. Still approaching critical mass...

@calebrob6
Member Author

I was 50/50 on this but now I'm more like 60/40 in favor. Still approaching critical mass...

Anything actionable?

@adamjstewart
Collaborator

Not yet, still mulling...

@adamjstewart
Collaborator

Yeah let's do this. We're adding a bunch more new deps now, so it would be good to reduce. Can you rebase?

@calebrob6
Member Author

calebrob6 commented Apr 23, 2023

I'm not trying to mimic torch.utils or torchgeo.datasets.splits here. This is meant to be a simple drop-in replacement that doesn't depend on all of scikit-learn. (Also, are you thinking of something other than torchgeo.datasets.split? That method doesn't support lengths or generators.)

@adamjstewart
Collaborator

I'm talking about torchgeo/datasets/splits.py.

If we're going to write our own splitting utility, I don't see why we shouldn't follow the PyTorch style instead of the sklearn style. Could even put it in torchgeo/datasets/splits.py and export it to the user.
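To illustrate the stylistic difference, a hypothetical PyTorch-style version (the name and signature here are mine, not something in this PR) would take a Dataset plus a torch.Generator and return Subsets, in the spirit of torch.utils.data.random_split:

```python
import torch
from torch.utils.data import Dataset, Subset

def random_grouped_split(
    dataset: Dataset,
    groups: list,
    test_fraction: float = 0.2,
    generator: torch.Generator | None = None,
) -> tuple[Subset, Subset]:
    """Split a Dataset into two Subsets so that no group spans both."""
    unique = sorted(set(groups), key=str)
    order = torch.randperm(len(unique), generator=generator).tolist()
    n_test = max(1, round(test_fraction * len(unique)))
    test_groups = {unique[i] for i in order[:n_test]}
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return Subset(dataset, train_idx), Subset(dataset, test_idx)
```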

@calebrob6
Member Author

calebrob6 commented Apr 23, 2023

Okay, I see what you're saying now! Sorry that took me a minute -- the disconnect was because I wasn't seeing this as something that operated on torch Datasets. It could go in torchgeo/datasets/splits.py but it isn't really specific to Datasets.

(and, given that, should it really be in torchgeo in the first place 😉 ?)

@adamjstewart
Collaborator

That's fair. I'm fine with keeping this internal-only and using a different style since it doesn't yet support passing in a NonGeoDataset. But if we can figure out a stable API that would support that it would be pretty cool.

Want me to merge this as is and save this for another day?

Review threads on torchgeo/datamodules/utils.py (resolved).
@calebrob6
Member Author

Tests fail with round(...). I expect the behavior guaranteed by ceiling.

@adamjstewart
Collaborator

I expect the behavior guaranteed by ceiling.

I would expect it to behave similarly to our other splitting functions.
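For context, a toy illustration of the rounding difference being discussed (the numbers are mine, chosen only to show the edge case):

```python
import math

n_groups, test_size = 10, 0.25

print(round(test_size * n_groups))      # 2: round() ties to even, so 2.5 -> 2
print(math.ceil(test_size * n_groups))  # 3: ceil() always reserves at least the requested fraction
```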

@calebrob6
Member Author

Can you finish this?

@github-actions bot added the documentation label on Apr 24, 2023
@adamjstewart adamjstewart added this to the 0.4.2 milestone Apr 24, 2023
@adamjstewart adamjstewart enabled auto-merge (squash) April 24, 2023 21:40
@adamjstewart adamjstewart merged commit 4f714f7 into main Apr 24, 2023
@adamjstewart adamjstewart deleted the remove_sklearn branch April 24, 2023 22:00
@adamjstewart adamjstewart modified the milestones: 0.4.2, 0.5.0 Sep 28, 2023