
Add SimCLR trainer #1252

Merged: 67 commits merged into microsoft:main on May 11, 2023

Conversation

@isaaccorley (Collaborator) commented Apr 15, 2023

This PR adds a SimCLR trainer that uses implementations from the lightly package.

Reboot of #1195
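
For context, here is a minimal sketch of what a SimCLR task built on lightly can look like. It is illustrative only, assuming a plain ResNet-18 backbone and default hyperparameters; the class name and arguments below are assumptions, not the final torchgeo API.

import torch
import torch.nn as nn
from lightly.loss import NTXentLoss
from lightly.models.modules import SimCLRProjectionHead
from lightning.pytorch import LightningModule
from torchvision.models import resnet18


class SimCLRSketch(LightningModule):
    """Hypothetical SimCLR task; names here are illustrative, not torchgeo's API."""

    def __init__(self, temperature: float = 0.07) -> None:
        super().__init__()
        backbone = resnet18()
        backbone.fc = nn.Identity()  # drop the classifier head, keep 512-d features
        self.backbone = backbone
        self.projection_head = SimCLRProjectionHead(512, 512, 128)
        self.criterion = NTXentLoss(temperature=temperature)

    def training_step(self, batch, batch_idx):
        x1, x2 = batch  # two augmented views of the same images
        z1 = self.projection_head(self.backbone(x1))
        z2 = self.projection_head(self.backbone(x2))
        loss = self.criterion(z1, z2)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.06)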

@isaaccorley isaaccorley self-assigned this Apr 15, 2023
@github-actions bot added the dependencies, testing, and trainers labels Apr 15, 2023
@adamjstewart (Collaborator) left a comment

Messaged on Slack, need to update to new style.

Review threads (outdated, resolved): requirements/required.txt, setup.cfg
@adamjstewart adamjstewart added this to the 0.5.0 milestone Apr 15, 2023
@github-actions bot added the documentation label Apr 16, 2023
@adamjstewart (Collaborator) commented

Really sick of codecov/feedback#126

@calebrob6 (Member) commented

I tried this trainer with multi-GPU and got ValueError: bad value(s) in fds_to_keep; however, the MultiLabel trainer worked fine with the same DataModule.

@adamjstewart (Collaborator) commented

> I tried this trainer with multi-GPU and got ValueError: bad value(s) in fds_to_keep; however, the MultiLabel trainer worked fine with the same DataModule.

With or without gather_distributed?
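
For context: gather_distributed is a flag on lightly's NTXentLoss. When enabled, the projections are gathered from all devices before the loss is computed, so each GPU sees negatives from the full global batch. A minimal sketch, assuming the trainer forwards the flag to the loss:

from lightly.loss import NTXentLoss

# With gather_distributed=True, the loss gathers projections across GPUs,
# so negatives come from the global batch rather than the local one.
criterion = NTXentLoss(temperature=0.07, gather_distributed=True)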

@calebrob6 (Member) commented Apr 26, 2023

I don't touch gather_distributed, as I would expect this to work by specifying multiple GPUs in the trainer.

E.g., this config works:

module:
  _target_: torchgeo.trainers.MultiLabelClassificationTask
  loss: "bce"
  model: "resnet18"
  learning_rate: 1e-3
  learning_rate_schedule_patience: 6
  weights: null
  in_channels: 12
  num_classes: 19

datamodule:
  _target_: torchgeo.datamodules.BigEarthNetDataModule
  root: "data/BigEarthNet"
  batch_size: 256
  bands: s2
  num_workers: 16

trainer:
  _target_: lightning.pytorch.Trainer
  accelerator: gpu
  devices:
    - 4
    - 6
  min_epochs: 100
  max_epochs: 100

However, if I change the module to the SimCLR trainer, it doesn't work.
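
For reference, a hypothetical version of the failing module block, with everything else in the config unchanged; the target class and its arguments below are assumptions for illustration, not the exact values used:

module:
  _target_: torchgeo.trainers.SimCLRTask
  model: "resnet18"
  in_channels: 12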

@adamjstewart (Collaborator) commented

What's the stack trace for the error? The only thing I can do is search repos for that error message.

@calebrob6 (Member) commented

Can you try running train.py with multiple GPUs?

@calebrob6 (Member) commented

(geospatiallib) calebrobinson@cdf00374c70c:~/ssdprivate/torchgeo$ python train.py config_file=conf/custom/simclr_bigearthnet2.yaml
Global seed set to 0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Traceback (most recent call last):
  File "/home/calebrobinson/ssdprivate/torchgeo/train.py", line 156, in <module>
    main(conf)
  File "/home/calebrobinson/ssdprivate/torchgeo/train.py", line 129, in main
    trainer.fit(model=task, datamodule=datamodule)
  File "/home/calebrobinson/.conda/envs/geospatiallib/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/calebrobinson/.conda/envs/geospatiallib/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/calebrobinson/.conda/envs/geospatiallib/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/home/calebrobinson/.conda/envs/geospatiallib/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    process.start()
  File "/home/calebrobinson/.conda/envs/geospatiallib/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/calebrobinson/.conda/envs/geospatiallib/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/home/calebrobinson/.conda/envs/geospatiallib/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/calebrobinson/.conda/envs/geospatiallib/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/calebrobinson/.conda/envs/geospatiallib/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 58, in _launch
    self.pid = util.spawnv_passfds(spawn.get_executable(),
  File "/home/calebrobinson/.conda/envs/geospatiallib/lib/python3.10/multiprocessing/util.py", line 452, in spawnv_passfds
    return _posixsubprocess.fork_exec(
ValueError: bad value(s) in fds_to_keep

@adamjstewart (Collaborator) commented

Our node is still down. Been going back and forth with the sys admins for days 😭

@adamjstewart adamjstewart marked this pull request as draft May 3, 2023 03:05
@adamjstewart adamjstewart marked this pull request as ready for review May 3, 2023 21:12

Review thread on the diff:

loss = self.criterion(z1, z2)

# Calculate the mean normalized standard deviation over the feature dimensions.
# If this is << 1 / sqrt(h1.shape[1]), then the model is not learning anything.
@adamjstewart (Collaborator):

@calebrob6 do you want this in every SSL trainer or no SSL trainers? Want to be consistent.

@calebrob6 (Member):

It is a good thing to monitor during training per the SimSiam paper, so we shouldn't remove it just for convenience; however, it also isn't urgent to add it to other trainers.

@adamjstewart (Collaborator):

I'll add it to MoCo, should only take a minute.
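
For reference, a sketch of the quantity being monitored, following the SimSiam paper's collapse check: l2-normalize the representations, take the per-dimension standard deviation over the batch, and average it. For healthy features this stays near 1 / sqrt(d); a value far below that signals collapse. The tensor shapes and names below are assumptions.

import math

import torch
import torch.nn.functional as F


def mean_normalized_std(h: torch.Tensor) -> torch.Tensor:
    """h: batch of representations with shape (batch_size, d)."""
    h = F.normalize(h, dim=1)   # project each representation onto the unit sphere
    return h.std(dim=0).mean()  # per-dimension std over the batch, averaged

h1 = torch.randn(256, 512)  # dummy batch of 512-d representations
print(mean_normalized_std(h1).item(), "vs", 1 / math.sqrt(h1.shape[1]))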

@calebrob6 merged commit ef7a9ad into microsoft:main May 11, 2023
@isaaccorley deleted the trainers/simclr branch May 23, 2023 18:10