PSNR not working with multiple GPUs and dataparallel #266
Hi! Thanks for your contribution, great first issue!
Here is the full traceback:
This happens because we only reduce between distributed processes. That being said, I am not sure how we would correctly implement it with DP (due to the internal state, we cannot easily copy it). Also, you cannot gather results, since they are added to the states within the DP module, meaning they don't have access to the DP scope information. I'll need to think about this. To unblock you, can you use DDP instead (which is recommended anyway)?
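As an illustration of that workaround (not from the thread; the exact flag spelling depends on the PyTorch Lightning version installed), switching the Trainer from DP to DDP is typically a one-argument change:

```python
import pytorch_lightning as pl

# Sketch only: newer Lightning releases select the backend via `strategy`,
# while older ones used `accelerator="dp"/"ddp"` (or `distributed_backend`).
trainer = pl.Trainer(gpus=2, strategy="ddp")        # instead of strategy="dp"
# trainer = pl.Trainer(gpus=2, accelerator="ddp")   # older Lightning API
```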
Thanks for the answer. I am using DDP in the meantime, but I might still be interested in using DP, simply because I'm porting some previous PyTorch code that uses DP to Lightning and I want to make sure things still work the same way (I wasn't using torchmetrics before, though).
Hi @amonod-gpfw, so we cannot support updates of metrics inside `training_step` when running under DP. Instead, return the predictions and targets from `training_step` and do the metric update and logging in `training_step_end`:

```python
def training_step(self, batch, batch_idx):
    data, target = batch
    preds = self(data)
    ...  # compute loss here
    return {'loss': loss, 'preds': preds, 'target': target}

def training_step_end(self, outputs):
    # update and log
    self.metric(outputs['preds'], outputs['target'])
    self.log('metric', self.metric)
```

I am also going to add a note to the documentation for future reference.
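The validation metric from the bug report below follows the same pattern. The following is only a hedged sketch, assuming torchmetrics' PSNR metric and standard Lightning hooks; the `val_psnr` attribute name and the batch layout are assumptions, not from the thread:

```python
def validation_step(self, batch, batch_idx):
    noisy, clean = batch          # hypothetical denoising batch layout
    denoised = self(noisy)
    # Return the tensors so Lightning can hand them to validation_step_end.
    return {'preds': denoised, 'target': clean}

def validation_step_end(self, outputs):
    # Update and log the metric outside the DataParallel replicas.
    self.val_psnr(outputs['preds'], outputs['target'])
    self.log('val_psnr', self.val_psnr)
```

The key point is that the `*_step_end` hooks run on the main process after the DP replicas return their outputs, so the metric state only ever lives on one device.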
🐛 Bug
(This is a sort of follow-up to Lightning issue #7257 and torchmetrics bugfix #214)
Hi folks,
I have a problem when using Lightning, DataParallel, and torchmetrics. When training a small denoising network on MNIST with 2 GPUs using DP and torchmetrics to compute training and validation PSNR, I get the following error:
RuntimeError: All input tensors must be on the same device. Received cuda:0 and cuda:1
Code for reproduction
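The original reproduction script did not survive in this copy of the issue. Purely as an illustrative sketch (module structure, metric names, and data handling are assumptions, not the author's code), the failing setup amounts to updating a metric directly inside the step methods while training with DP:

```python
import torch
import pytorch_lightning as pl
import torchmetrics


class DenoiserSketch(pl.LightningModule):  # hypothetical module name
    def __init__(self):
        super().__init__()
        # Stand-in for the real denoising network.
        self.net = torch.nn.Conv2d(1, 1, 3, padding=1)
        # `PSNR` is named `PeakSignalNoiseRatio` in newer torchmetrics releases.
        self.train_psnr = torchmetrics.PSNR()

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        noisy, clean = batch
        denoised = self(noisy)
        loss = torch.nn.functional.mse_loss(denoised, clean)
        # Updating the metric here runs inside each DataParallel replica,
        # which is what triggers the cross-device error described above.
        self.train_psnr(denoised, clean)
        self.log('train_psnr', self.train_psnr)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# trainer = pl.Trainer(gpus=2, accelerator="dp")  # DP across 2 GPUs (older Lightning API)
```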
Expected behavior
Training should perform correctly.
Training and validation work fine when using a single GPU and DP (although there is not much point in doing that).
Environment
- How you installed PyTorch (`conda`, `pip`, source): installed everything using conda