We have seen a difference in model convergence across old and new GPU architectures.
For example,
With the latest Nvidia GPUs such as the V100, A100, and RTX 4090 (those that support mixed precision), the same models converge faster even when trained in single/full precision (fp32) with batch size 1: loss ~0.007 after 16 hours of training on a 40GB A100, i.e. after 120000 of the 300000 total epochs.
However, when trained on Titan XP cards, the same models converge much more slowly - loss ~0.05 after 120000 epochs - so they require more training time.
We need to investigate this further, but it is something to be careful of, since we have seen that fast-converging models may not actually learn the task!
Please report anything like this until we get a chance to look into it further.
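If anyone wants to check whether precision defaults are part of the difference, below is a minimal sketch (PyTorch assumed; nothing in it comes from the runs above) that logs and pins the TF32 flags that only newer Ampere/Ada-class GPUs honour. Depending on the PyTorch version, TF32 math can be enabled by default on those cards even when training nominally in fp32, so pinning it is one way to make fp32 runs comparable across architectures.

```python
# Minimal sketch, assuming PyTorch. Logs and pins the TF32 flags that only
# Ampere/Ada-class GPUs (e.g. A100, RTX 4090) honour; older cards such as the
# Titan XP ignore them. Nothing here is taken from the original experiments.
import torch

def log_precision_settings() -> None:
    # Device identity and compute capability (8.0+ supports TF32).
    print("device:", torch.cuda.get_device_name(0))
    print("capability:", torch.cuda.get_device_capability(0))
    # TF32 toggles for matmuls and cuDNN convolutions.
    print("matmul allow_tf32:", torch.backends.cuda.matmul.allow_tf32)
    print("cudnn  allow_tf32:", torch.backends.cudnn.allow_tf32)

if torch.cuda.is_available():
    log_precision_settings()
    # Force true fp32 math so newer-GPU runs match Titan XP numerics.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
```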
Using AdamW in place of Adam causes further convergence differences.
For example, models can converge rapidly when AdamW is used with learning rates between 1e-2 and 1e-4, with or without BatchNorm, on a single GPU.
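For reference, here is a minimal sketch of the optimizer swap being described (PyTorch assumed; the toy model and hyperparameters are placeholders, not the ones from the actual experiments):

```python
# Minimal sketch of swapping Adam for AdamW, assuming PyTorch. The toy model
# and hyperparameters below are placeholders, not the reported configuration.
# AdamW decouples weight decay from the gradient update, so the two optimizers
# behave differently even at identical learning rates.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Linear(64, 1))

# Fast-converging setup described above: AdamW with an LR somewhere in [1e-4, 1e-2].
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
# Baseline for comparison: plain Adam at the same LR.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(8, 32), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()
```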