Strange behavior using PyTorch DDP #32
Comments
@burchim
Hi @snakers4! I also recently experimented with replacing it with the official torchaudio.transforms.RNNTLoss from torchaudio 0.10.0.
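For anyone following along, here is a minimal sketch of calling torchaudio.transforms.RNNTLoss; the shapes, vocabulary size, and blank index below are illustrative assumptions, not values from this thread:

```python
import torch
import torchaudio

# Illustrative shapes (assumptions): batch, encoder frames, target length, vocab size.
B, T, U, V = 2, 50, 10, 29

# Joint-network output: (batch, time, target_len + 1, vocab).
logits = torch.randn(B, T, U + 1, V, requires_grad=True)
# Targets exclude the blank label (index 0 here, by assumption).
targets = torch.randint(1, V, (B, U), dtype=torch.int32)

# Two distinct length tensors: one indexes the time axis of `logits`,
# the other indexes the label axis of `targets`.
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

rnnt_loss = torchaudio.transforms.RNNTLoss(blank=0, reduction="mean")
loss = rnnt_loss(logits, targets, logit_lengths, target_lengths)
loss.backward()
```

Note that the loss takes two separate length tensors, which is exactly where the mistake discussed below crept in.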
Thanks for the heads up about the …
@burchim
Yes, this means that the logit/target length tensors do not match the logit/target tensors.
Because I used the target lengths instead of the logit lengths. Stupid error.
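Concretely, the failure mode was passing the target lengths where the logit lengths belong; a hedged sketch, reusing the illustrative names from the snippet above:

```python
# Wrong: the third argument must describe the time axis of `logits`.
# Depending on the implementation, this either fails a length check or
# silently computes the loss over far fewer frames than intended:
# loss = rnnt_loss(logits, targets, target_lengths, target_lengths)

# Right: each length tensor describes its own data tensor.
loss = rnnt_loss(logits, targets, logit_lengths, target_lengths)
```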
@snakers4
@1ytic
Hi,

So far I have been able to use the loss with DDP on a single GPU, and it behaves more or less as expected. But when I use more than one device, the following happens:

- GPU-0: the loss is calculated properly
- GPU-1: the loss is close to zero for each batch

I checked the input tensors, devices, tensor values, etc. So far everything seems to be identical for GPU-0 and the other GPUs.
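When debugging this kind of per-rank divergence, one option is to gather the loss from every rank and print them side by side; a minimal sketch, assuming torch.distributed is already initialized (the helper name is hypothetical):

```python
import torch
import torch.distributed as dist

def report_rank_losses(loss: torch.Tensor) -> None:
    """Gather a scalar loss from all ranks so a GPU whose loss
    collapses to ~0 stands out immediately."""
    buf = loss.detach().reshape(1)
    gathered = [torch.zeros_like(buf) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, buf)
    if dist.get_rank() == 0:
        print({f"rank_{i}": g.item() for i, g in enumerate(gathered)})
```

Calling this right after the loss computation on each rank makes a mismatch like the one described above immediately visible.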