Model convergence across Old and New GPU architectures #15

Open
Mohinta2892 opened this issue Jul 24, 2024 · 1 comment

Comments

Mohinta2892 (Owner) commented Jul 24, 2024

We have seen a difference in model convergence across old and new GPU architectures.
For example,

With newer NVIDIA GPUs such as the V100, A100 and RTX 4090 (those that support mixed precision), we see that the same models converge faster even when trained in single/full precision (fp32) with batch size 1: loss ~0.007 after 16 hours of training on a 40 GB A100, at 120,000 epochs (out of 300,000 total).
However, when trained on Titan Xp cards, the same models converge much more slowly (loss ~0.05 after 120,000 epochs), so they require more training time.

We need to investigate this further, but it is something to be careful about, as we have seen that fast-converging models may not actually learn the task!

Please report anything like this until we get a chance to look into it further.
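
For anyone trying to compare runs across cards, below is a minimal sketch of the kind of settings one might pin first. It assumes a PyTorch training script (the snippet is illustrative, not this repository's actual setup); depending on the PyTorch version, TF32 may be enabled by default for cuDNN convolutions and/or matmuls on Ampere-class GPUs, which is one possible source of numerical differences versus older cards.

```python
# Illustrative sketch (not the repo's actual training script): settings one
# might pin when comparing convergence across GPU generations in PyTorch.
import torch

# On Ampere+ GPUs (A100, RTX 30/40 series), TF32 may be enabled by default
# for cuDNN convolutions and, depending on the PyTorch version, for matmuls.
# Forcing it off keeps fp32 math closer to what older cards (e.g. Titan Xp) do.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# Reduce run-to-run variation so architecture effects are easier to isolate.
torch.manual_seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```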

Mohinta2892 (Owner, Author) commented Jul 27, 2024

Using AdamW in place of Adam causes further convergence issues.
For example, models can converge rapidly when AdamW is used with learning rates between 1e-2 and 1e-4, with or without BatchNorm, on a single GPU.
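
For reference, the swap being described is just the optimizer class. A minimal sketch follows, assuming a PyTorch model; the model and hyperparameters are placeholders taken from the range mentioned above, not this repository's defaults.

```python
import torch

model = torch.nn.Linear(16, 1)  # placeholder model for illustration

# Adam (L2 regularisation folded into the gradient) vs AdamW (decoupled
# weight decay); learning rates are from the range mentioned above.
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-4)
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
```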
