Stage-1 batchsize>4 CUDA out of memory #240

Niujunbo2002 · 2024-12-08T15:39:38Z

Hi～
When I use 8 A100-80G GPUs and set the batch size to 8, I get a CUDA out-of-memory error. What per-GPU batch size did you use in Stage 1?
Also, when I add part of the scene OCR data to the Vary-600k dataset and train Stage 1 for 3 epochs, the final loss is greater than 2. What could be the problem? (When training on the Vary-600k dataset alone, the loss is normal.) TAT

Ucas-HaoranWei · 2024-12-09T08:46:50Z

That is because the codebase only supports torch DDP.
How much scene OCR data did you add?
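
For context: under plain DDP every GPU holds a full copy of the model, gradients, and optimizer state, so adding GPUs does not lower per-GPU memory use. A common workaround is to keep the per-GPU batch small and use gradient accumulation to reach the same effective batch size. The arithmetic is sketched below. The function names are illustrative and not part of this repo, and the per-GPU batch size of 4 is an assumption taken from the issue title.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int = 1) -> int:
    """Samples contributing to one optimizer step under DDP."""
    return per_gpu_batch * num_gpus * grad_accum_steps


def accum_steps_for(target_global_batch: int, per_gpu_batch: int, num_gpus: int) -> int:
    """Gradient-accumulation steps needed to reach a target global batch size."""
    per_forward = per_gpu_batch * num_gpus  # samples processed per forward/backward pass
    if target_global_batch % per_forward != 0:
        raise ValueError("target batch must be a multiple of per_gpu_batch * num_gpus")
    return target_global_batch // per_forward


# Assumed setup: 8 GPUs, per-GPU batch 4 fits in memory, per-GPU batch 8 does not.
# Matching the global batch of the 8-per-GPU run (8 * 8 = 64) takes 2 accumulation steps.
steps = accum_steps_for(target_global_batch=64, per_gpu_batch=4, num_gpus=8)
print(steps, effective_batch_size(4, 8, steps))
```

Accumulation trades wall-clock time for memory: each optimizer step then takes `steps` forward/backward passes instead of one.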

Niujunbo2002 · 2024-12-09T12:01:50Z

> That is because the codebase only supports torch DDP.
>
> How much scene OCR data did you add?

I am not using a large amount of data for now.
I use Vary-600k as the PDF data and add ~600k scene OCR samples. I train for 3 epochs on this data, but the loss does not converge; the average loss stays above 2.
So I would like to know how you managed to train for only 3 epochs, as mentioned in the paper. Did you use exactly the code in the repo https://github.com/Ucas-HaoranWei/Vary-tiny-600k?

Niujunbo2002 · 2024-12-09T16:38:17Z

The loss also becomes NaN in Stage 1 (Vary-600k + ~300k full-image scene OCR samples).
But everything is normal when I use scene OCR crops taken from the full images + Vary-600k.
I find this very strange.
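
To narrow this down, it may help to find the first training step at which the loss stops being finite, so the batches around that step can be inspected. Below is a minimal sketch, assuming the per-step losses are available as a plain list of floats. The helper name is hypothetical and is not part of the Vary codebase.

```python
import math
from typing import Optional, Sequence


def first_non_finite_step(losses: Sequence[float]) -> Optional[int]:
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Illustrative values only, not real training logs.
logged = [2.31, 2.18, 2.40, float("nan"), float("nan")]
print(first_non_finite_step(logged))
```

If the step is the same across reruns with a fixed seed, the samples in that batch are the natural place to look; if it moves around, the cause is more likely numerical than tied to specific data.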
