Sorry for the late reply. We haven't actually tried DistributedDataParallel training before. The project is implemented with reference to transfer-learning-conv-ai. We'll work on fixing this error as soon as we can.
So may I ask how many GPUs you used for training? How big is the memory of each one? What is the batch_size? How long did you train?
I used two A40s with 48 GB of video memory each and a batch_size of 32. Training was very slow, averaging 50 minutes per epoch.
When I tried to run the project with DistributedDataParallel, I set up the code like this:
But the program gets stuck, with no output and no further progress. Do you know the reason for this?
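One thing that may be worth ruling out for a silent hang like this: with PyTorch's default `env://` initialization, `init_process_group` blocks until every one of the `WORLD_SIZE` processes has joined, so a missing or inconsistent launch variable can look exactly like a stuck program. The snippet below is only a rough diagnostic sketch, not part of this project; `check_ddp_env` is a hypothetical helper name, and it assumes the job is launched with `torchrun` or a similar launcher that sets these variables.

```python
import os

# Variables read by PyTorch's default "env://" rendezvous.
# Assumption: the job is started with torchrun, which normally sets them.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def check_ddp_env(env=None):
    """Return a list of human-readable problems found in the launch environment.

    Hypothetical helper for debugging; it only inspects environment
    variables and does not touch torch or the GPUs.
    """
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]

    rank, world = env.get("RANK"), env.get("WORLD_SIZE")
    if rank and world:
        if not (rank.isdigit() and world.isdigit()):
            problems.append("RANK and WORLD_SIZE must be non-negative integers")
        elif int(rank) >= int(world):
            problems.append(f"RANK {rank} is outside WORLD_SIZE {world}")
    return problems


if __name__ == "__main__":
    # Run this in each process before calling init_process_group.
    for problem in check_ddp_env():
        print("DDP env problem:", problem)
```

If this prints nothing in every process, the environment is probably not the cause, and the hang is more likely somewhere else (for example, one rank never reaching the same collective call as the others).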