Model convergence across Old and New GPU architectures #15

Open
Mohinta2892 opened this issue Jul 24, 2024 · 1 comment

Comments

Mohinta2892 (Owner) commented Jul 24, 2024

We have seen a difference in model convergence across old and new GPU architectures.
For example,

With newer NVIDIA GPUs such as the V100, A100 and RTX 4090 (those that support mixed precision), we see that the same models converge faster even when trained in single/full precision (fp32) with batch size 1: loss ~0.007 after 16 hours of training on a 40 GB A100, at 120,000 epochs (out of 300,000 total).
However, when trained on Titan Xp cards, the same models converge much more slowly (loss ~0.05 after 120,000 epochs), so they require more training time.

We need to investigate this further, but it is something to be careful about, as we have seen that fast-converging models may not actually learn the task!

Please report anything like this until we get a chance to look into it further.
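
For anyone trying to compare runs across cards, below is a minimal sketch of the kind of settings one might pin first. It assumes a PyTorch training script (the snippet is illustrative, not this repository's actual setup); depending on the PyTorch version, TF32 may be enabled by default for cuDNN convolutions and/or matmuls on Ampere-class GPUs, which is one possible source of numerical differences versus older cards.

```python
# Illustrative sketch (not the repo's actual training script): settings one
# might pin when comparing convergence across GPU generations in PyTorch.
import torch

# On Ampere+ GPUs (A100, RTX 30/40 series), TF32 may be enabled by default
# for cuDNN convolutions and, depending on the PyTorch version, for matmuls.
# Forcing it off keeps fp32 math closer to what older cards (e.g. Titan Xp) do.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# Reduce run-to-run variation so architecture effects are easier to isolate.
torch.manual_seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```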

Mohinta2892 (Owner, Author) commented Jul 27, 2024

Using AdamW in place of Adam causes further convergence issues.
For example, models can converge rapidly when AdamW is used with learning rates between 1e-2 and 1e-4, with or without BatchNorm, on a single GPU.
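
For reference, the swap being described is just the optimizer class. A minimal sketch follows, assuming a PyTorch model; the model and hyperparameters are placeholders taken from the range mentioned above, not this repository's defaults.

```python
import torch

model = torch.nn.Linear(16, 1)  # placeholder model for illustration

# Adam (L2 regularisation folded into the gradient) vs AdamW (decoupled
# weight decay); learning rates are from the range mentioned above.
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-4)
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
```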
