System Info

Information

Tasks

- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)

Reproduction

The key issue is that when calling to_kwargs on InitProcessGroupKwargs, the defaults are dropped, but PyTorch's defaults are not identical to those in InitProcessGroupKwargs, which causes a mismatch (e.g., the default timeout in InitProcessGroupKwargs is 1800 seconds, but PyTorch's is 600 seconds). Simply initializing an Accelerator with an InitProcessGroupKwargs handler is enough to trigger the mismatch; a minimal sketch follows the Expected behavior section below.

Expected behavior

Defaults in InitProcessGroupKwargs should take effect, but currently they don't, because to_kwargs drops them.
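The original reproduction snippet is not shown in the thread above, so the following is an assumed reconstruction of the setup being described. It needs to be launched via accelerate launch on a multi-GPU NCCL setup for the timeout to matter; the print makes the filtering visible even in a single process.

```python
# Assumed repro sketch (the issue's original snippet is not reproduced here).
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# 1800 seconds is InitProcessGroupKwargs' own default timeout.
handler = InitProcessGroupKwargs(timeout=timedelta(seconds=1800))

# to_kwargs() only keeps fields whose values differ from the dataclass defaults,
# so the timeout above is filtered out and never reaches init_process_group.
print(handler.to_kwargs())  # -> {}

# Launched with `accelerate launch` on multiple GPUs, the NCCL process group then
# falls back to PyTorch's 600-second default instead of the 1800 seconds the
# handler advertises.
accelerator = Accelerator(kwargs_handlers=[handler])
```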
Hi @carefree0910, thanks for reporting! Should we set the default to 600 seconds @muellerzr, since most people use NCCL with multi-GPU? Or we could just set it to None and explain the default behavior.
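For illustration only, here is a hypothetical sketch of the None-default option (this is not Accelerate's actual code): with no default of its own, the handler would defer to PyTorch's backend-dependent defaults unless the user sets a timeout explicitly, and any explicit value would always survive to_kwargs().

```python
# Hypothetical sketch of an InitProcessGroupKwargs variant with a None default;
# the name and structure are illustrative, not Accelerate's implementation.
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

from accelerate.utils import KwargsHandler


@dataclass
class InitProcessGroupKwargsSketch(KwargsHandler):
    init_method: Optional[str] = None
    # With None as the default, an unset timeout is filtered out by to_kwargs()
    # and PyTorch's own backend default (600 s NCCL / 1800 s others) applies,
    # while any user-supplied timedelta differs from the default and is forwarded.
    timeout: Optional[timedelta] = None
```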
In the PyTorch docs, I see the following:

timeout (timedelta, optional) – Timeout for operations executed against the process group. Default value is 10 minutes for NCCL and 30 minutes for other backends. This is the duration after which collectives will be aborted asynchronously and the process will crash. This is done since CUDA execution is async and it is no longer safe to continue executing user code since failed async NCCL operations might result in subsequent CUDA operations running on corrupted data. When TORCH_NCCL_BLOCKING_WAIT is set, the process will block and wait for this timeout.
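Until the defaults are reconciled, one possible workaround (a sketch that assumes to_kwargs() keeps filtering out default-valued fields, as described above) is to pass a timeout that differs from the handler's own 1800-second default, so the value survives to_kwargs() and actually reaches torch.distributed.init_process_group:

```python
# Workaround sketch: choose a timeout different from InitProcessGroupKwargs' default.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Any non-default value survives to_kwargs(). Note the awkward corner case: exactly
# 1800 seconds cannot be requested this way, because it equals the dataclass default
# and gets dropped.
handler = InitProcessGroupKwargs(timeout=timedelta(seconds=3600))
accelerator = Accelerator(kwargs_handlers=[handler])
```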
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.