
Defaults of InitProcessGroupKwargs not working #2611

Closed
2 of 4 tasks
carefree0910 opened this issue Apr 3, 2024 · 2 comments
Comments


carefree0910 commented Apr 3, 2024

System Info

- `Accelerate` version: 0.28.0
- Platform: Linux-3.10.0-1160.88.1.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.3
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 64 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

The key issue is that when to_kwargs is called on InitProcessGroupKwargs, the defaults are dropped, but PyTorch's defaults are not identical to those in InitProcessGroupKwargs, which causes a mismatch (e.g., the default timeout in InitProcessGroupKwargs is 1800 seconds, but PyTorch's NCCL default is 600 seconds).
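To illustrate the mechanism, here is a minimal, self-contained sketch of a to_kwargs-style handler (a hypothetical re-creation, not Accelerate's actual code): only fields that differ from the handler's own defaults are returned, so a value equal to the handler default never reaches torch.distributed.init_process_group, and PyTorch's own (different) default silently applies.

```python
from dataclasses import dataclass, asdict
from datetime import timedelta

@dataclass
class InitProcessGroupKwargsSketch:
    # Defaults mirroring the ones described above (assumed for illustration).
    backend: str = "nccl"
    timeout: timedelta = timedelta(seconds=1800)

    def to_kwargs(self):
        # Return only fields that differ from a freshly constructed instance;
        # anything equal to the default is filtered out.
        defaults = asdict(self.__class__())
        return {k: v for k, v in asdict(self).items() if defaults[k] != v}

# Passing the default timeout explicitly is filtered out entirely...
print(InitProcessGroupKwargsSketch(timeout=timedelta(seconds=1800)).to_kwargs())
# ...while a value that differs from the handler default survives.
print(InitProcessGroupKwargsSketch(timeout=timedelta(seconds=1801)).to_kwargs())
```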

Simply initializing an Accelerator with InitProcessGroupKwargs can cause the mismatch:

from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=1800))])

Expected behavior

Defaults in InitProcessGroupKwargs should take effect, but currently they don't because to_kwargs drops them.


SunMarc commented Apr 15, 2024

Hi @carefree0910, thanks for reporting! Should we set the default to 600 seconds @muellerzr, since most people use NCCL with multi-GPU? Or we could just set it to None and explain the default behavior.
In the PyTorch docs, I see the following:
timeout (timedelta, optional) – Timeout for operations executed against the process group. Default value is 10 minutes for NCCL and 30 minutes for other backends. This is the duration after which collectives will be aborted asynchronously and the process will crash. This is done since CUDA execution is async and it is no longer safe to continue executing user code since failed async NCCL operations might result in subsequent CUDA operations running on corrupted data. When TORCH_NCCL_BLOCKING_WAIT is set, the process will block and wait for this timeout.
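The "set it to None" option discussed above could be sketched like this (an assumption about a possible fix, not Accelerate's actual code): with timeout=None nothing would be forwarded, letting PyTorch apply its backend-specific default of 10 minutes for NCCL and 30 minutes for other backends, per the doc quoted above.

```python
from datetime import timedelta
from typing import Optional

def resolve_timeout(backend: str, timeout: Optional[timedelta]) -> timedelta:
    # Hypothetical helper: an explicit timeout always wins; otherwise fall
    # back to PyTorch's documented per-backend defaults.
    if timeout is not None:
        return timeout
    return timedelta(minutes=10) if backend == "nccl" else timedelta(minutes=30)

print(resolve_timeout("nccl", None))                   # NCCL default
print(resolve_timeout("gloo", None))                   # non-NCCL default
print(resolve_timeout("nccl", timedelta(seconds=5)))   # explicit value wins
```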


github-actions bot commented May 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
