Training fails with no error message #102

Closed · 2793145003 opened this issue Aug 8, 2023 · 2 comments
@2793145003
I followed the steps in the README, only swapping the model for llama-2-70B.
Output:

Training Epoch: 0 / 1     0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--
Training Batch: 0 / 157   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--  /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
(the same UserWarning is printed once by each of the remaining ranks)
Training Epoch: 0 / 1     0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:07 / -:--:--
Training Batch: 0 / 157   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:07 / -:--:--  [2023-08-08 03:08:43,774] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 51194) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0.dev20230725+cu121', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
collie.py FAILED
-----------------------------------------------------
Failures:
[1]:
  time      : 2023-08-08_03:08:43
  host      : 58283303bbb0
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 51195)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 51195
[2]:
  time      : 2023-08-08_03:08:43
  host      : 58283303bbb0
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 51196)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 51196
[3]:
  time      : 2023-08-08_03:08:43
  host      : 58283303bbb0
  rank      : 3 (local_rank: 3)
  exitcode  : -7 (pid: 51197)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 51197
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-08_03:08:43
  host      : 58283303bbb0
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 51194)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 51194
=====================================================

Training worked again after restarting the container.
After another restart, I switched to 8 GPUs: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=8
Output:

Training Epoch: 0 / 1    0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--
Training Batch: 0 / 79   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--  /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
(the same UserWarning is printed once by each of the remaining ranks)
Training Epoch: 0 / 1    0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:13 / -:--:--
Training Batch: 0 / 79   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:13 / -:--:--  [2023-08-08 03:20:09,633] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 96 closing signal SIGTERM
[2023-08-08 03:20:09,639] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 98 closing signal SIGTERM
[2023-08-08 03:20:09,641] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 99 closing signal SIGTERM
[2023-08-08 03:20:09,643] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 100 closing signal SIGTERM
[2023-08-08 03:20:09,645] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 101 closing signal SIGTERM
[2023-08-08 03:20:09,648] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 102 closing signal SIGTERM
[2023-08-08 03:20:11,395] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 95) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0.dev20230725+cu121', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
==================================================
collie.py FAILED
--------------------------------------------------
Failures:
[1]:
  time      : 2023-08-08_03:20:09
  host      : 06a78451e09d
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 97)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 97
--------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-08_03:20:09
  host      : 06a78451e09d
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 95)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 95
==================================================

How should I go about debugging this?

@KaiLv69 (Collaborator) commented Aug 9, 2023

Hi,
Thanks for the report; we will improve error reporting during training.
For now, you can try wrapping trainer.train() in a try/except, for example:

import traceback

from rich.console import Console

try:
    trainer.train()
except BaseException:
    # Append the full traceback to a log file, in case the distributed
    # launcher swallows whatever was printed to stdout/stderr.
    with open("./traceback.log", "a+") as log_file:
        traceback.print_exc(file=log_file)
        log_file.write("\n\n")
        # Also render a rich-formatted traceback into the same file.
        Console(file=log_file).print_exception()
    raise

@2793145003 (Author)

Thanks for the reply!
Even with the try/except there is still no error message.
After more searching, this looks like a DeepSpeed problem, or rather a Docker configuration problem.
The solution is here:
deepspeedai/DeepSpeed#4002
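For context: an exit code of -7 means the worker was killed by SIGBUS, and in Docker that is commonly caused by the container's small default /dev/shm (64 MB), which multi-process training can exhaust. A minimal check-and-fix sketch, assuming that is the cause here (the 16g size is an illustrative value, not a recommendation from this thread):

```shell
# Inspect the shared-memory mount inside the container; if its size is
# Docker's 64 MB default, worker processes that exchange tensors through
# /dev/shm can run out of space and die with SIGBUS.
df -h /dev/shm

# If so, recreate the container with a larger shared-memory segment
# (size it to your workload), e.g.:
#   docker run --shm-size=16g ...
```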
