Training Epoch: 0 / 1 0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--
Training Batch: 0 / 157 0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:-- /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly.
The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use
use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
Training Epoch: 0 / 1 0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:07 / -:--:--
Training Batch: 0 / 157 0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:07 / -:--:-- [2023-08-08 03:08:43,774] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 51194) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.1.0.dev20230725+cu121', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
collie.py FAILED
-----------------------------------------------------
Failures:
[1]:
time : 2023-08-08_03:08:43
host : 58283303bbb0
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 51195)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 51195
[2]:
time : 2023-08-08_03:08:43
host : 58283303bbb0
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 51196)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 51196
[3]:
time : 2023-08-08_03:08:43
host : 58283303bbb0
rank : 3 (local_rank: 3)
exitcode : -7 (pid: 51197)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 51197
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-08_03:08:43
host : 58283303bbb0
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 51194)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 51194
=====================================================
Training Epoch: 0 / 1 0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--
Training Batch: 0 / 79 0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:-- /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly.
The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use
use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
Training Epoch: 0 / 1 0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:13 / -:--:--
Training Batch: 0 / 79 0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:13 / -:--:-- [2023-08-08 03:20:09,633] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 96 closing signal SIGTERM
[2023-08-08 03:20:09,639] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 98 closing signal SIGTERM
[2023-08-08 03:20:09,641] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 99 closing signal SIGTERM
[2023-08-08 03:20:09,643] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 100 closing signal SIGTERM
[2023-08-08 03:20:09,645] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 101 closing signal SIGTERM
[2023-08-08 03:20:09,648] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 102 closing signal SIGTERM
[2023-08-08 03:20:11,395] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 95) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.1.0.dev20230725+cu121', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
==================================================
collie.py FAILED
--------------------------------------------------
Failures:
[1]:
time : 2023-08-08_03:20:09
host : 06a78451e09d
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 97)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 97
--------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-08_03:20:09
host : 06a78451e09d
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 95)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 95
==================================================
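I followed the steps in the README; the only change was swapping the model for llama-2-70B.
Output:
After restarting the container, things went back to normal.
After restarting once more, I switched to 8 GPUs: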
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=8
Output:
How should I go about debugging this kind of failure?
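As a possible first diagnostic, here is a minimal sketch of two checks that could be run inside the container before launching. It rests on an assumption the logs above do not confirm: that the SIGBUS is related to the container's shared-memory mount (`/dev/shm`) being too small, which is a common cause of this signal in containerized multi-process PyTorch jobs. The helper names `shm_report` and `enable_crash_tracebacks` are hypothetical; `shutil.disk_usage` and `faulthandler.enable` are standard-library calls.

```python
import faulthandler
import shutil
import sys


def shm_report(path="/dev/shm"):
    """Return total/used/free space of `path` in MiB.

    Assumption: the container mounts shared memory at /dev/shm.
    """
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return {
        "path": path,
        "total_mib": usage.total // mib,
        "used_mib": usage.used // mib,
        "free_mib": usage.free // mib,
    }


def enable_crash_tracebacks():
    """Ask Python to dump its stack to stderr on fatal signals.

    faulthandler covers SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL,
    so each rank would print where it was when signal 7 arrived.
    """
    faulthandler.enable(file=sys.stderr, all_threads=True)


if __name__ == "__main__":
    enable_crash_tracebacks()
    print(shm_report())
```

If the reported size turns out to be small (Docker's default is 64 MiB), restarting the container with a larger `--shm-size` may be worth trying. Calling `enable_crash_tracebacks()` at the top of the training script would at least replace the empty `error_file: <N/A>` entries with a Python stack for each failing rank.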