Multi-GPU Training + Single-GPU Eval runs into timeout #223
Comments
The timeout at 30 minutes comes from PyTorch, but you can adjust it when initializing the distributed process group. Accelerate does it automatically, but only if you haven't done it yourself in the script. I'll expose that argument this week or the next, but in the meantime, you can use this line as a workaround:
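A minimal sketch of what that workaround can look like, assuming the `InitProcessGroupKwargs` handler from `accelerate` (the 7200-second value is just an example; pick whatever fits your evaluation time):

```python
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# Raise the distributed timeout so ranks waiting on the main process
# don't abort while it runs a long single-GPU evaluation.
# This only takes effect if the script has not already called
# torch.distributed.init_process_group itself.
init_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=7200))
accelerator = Accelerator(kwargs_handlers=[init_kwargs])
```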
The default is 3600.
I didn't know how to set this as an argument for the setup I am using, but I found the argument I was looking for. Here is my example terminal command.
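One way such a command can look, assuming a Transformers `Trainer`-based script that accepts training arguments on the command line (the script name, flags, and the 7200-second value below are hypothetical examples, not the commenter's actual command):

```bash
torchrun --nproc_per_node 4 train.py \
  --output_dir ./outputs \
  --ddp_timeout 7200  # seconds; forwarded to the distributed process group init
```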
@sgugger if I use FSDP distributed fine-tuning, is the timeout controlled by the same parameter?
Hi everyone,
We run into a timeout when we evaluate for more than 30 minutes on a single GPU. Is there a way to tell the other GPUs to wait until the main GPU completes the evaluation?
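For context, a minimal sketch of this pattern, assuming an `accelerate`-style script (the evaluation helper is hypothetical): all ranks train, only the main process evaluates, and the other ranks block at the barrier; if evaluation takes longer than the distributed timeout, they error out.

```python
from accelerate import Accelerator

accelerator = Accelerator()

# ... multi-GPU training loop runs on every process ...

if accelerator.is_main_process:
    run_evaluation()  # hypothetical long-running single-GPU evaluation

# Non-main ranks wait here; if the evaluation above exceeds the
# distributed timeout (30 minutes by default), they raise a timeout error.
accelerator.wait_for_everyone()
```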
@sgugger Can you please have a look?