RuntimeError: Connection closed by peer when training on a single node #17

Open · daxiongshu opened this issue Jan 29, 2025 · 1 comment


@daxiongshu

Hello, I followed the instructions to train on a single node with 6x A100 GPUs (80 GB of GPU memory each).
I used `export CUDA_VISIBLE_DEVICES=2,3,4,5,6,7` to select 6 GPUs; the other 2 GPUs were occupied by other workloads.

There were no errors until the end of the first training epoch.
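For reference, here is a minimal sanity check (my own sketch, not part of the OpenRLHF scripts) to confirm the GPU masking takes effect in the environment where Ray and the training job are started:

```python
# Sanity check (not from the OpenRLHF repo): confirm the CUDA_VISIBLE_DEVICES
# masking is in effect in the environment where Ray is started.
import os

import torch

os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3,4,5,6,7")

print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
print("Visible GPUs:", torch.cuda.device_count())  # expected: 6
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")
```

If this reports 8 visible GPUs instead of 6, the export was not picked up by the process that started Ray.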

RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:38032
Full error message:
Train epoch [1/1]: 100%|██████████| 2048/2048 [1:48:48<00:00,  3.19s/it, pg=0.0374, rm=-0.25, ret=-0.25, glen=1958.25, tlen=2062.0, kl=0, act_lr=1e-7]
Traceback (most recent call last):
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/cli/train_ppo_ray_box.py", line 395, in <module>
    train(args)
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/cli/train_ppo_ray_box.py", line 175, in train
    ray.get(refs)
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/ray/_private/worker.py", line 2623, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/ray/_private/worker.py", line 861, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::ActorModelRayActorBOX.fit() (pid=146529, ip=0.0.0.0, actor_id=5f14195c0e9c5caec6afbc9803000000, repr=<openrlhf.trainer.ray.ppo_actor.ActorModelRayActorBOX object at 0x7f588046b390>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ray/ppo_actor.py", line 872, in fit
    trainer.fit(
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ppo_trainer_prm800k_box.py", line 246, in fit
    status = self.ppo_train(steps)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ray/ppo_actor.py", line 357, in ppo_train
    self._broadcast_to_vllm()
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ray/ppo_actor.py", line 390, in _broadcast_to_vllm
    torch.distributed.broadcast(param.data, 0, group=self._model_update_group)
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:38032
(ActorModelRayActorBOX pid=146529)
Episode [1/20]:   0%|          | 0/8 [3:47:36<?, ?it/s]

---------------------------------------
Job 'raysubmit_VZF7LSSz9GK92ZDe' failed
---------------------------------------

Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ray/ppo_actor.py", line 390, in _broadcast_to_vllm
    torch.distributed.broadcast(param.data, 0, group=self._model_update_group)
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:38032
(ActorModelRayActorBOX pid=146529)
Episode [1/20]:   0%|          | 0/8 [3:47:36<?, ?it/s]
@NekoMimiUnagi

I met the same error. If you look further back in your log, you should find an OOM error. Because of that error, Ray shuts down a process, which may be the one whose traceback you posted here.
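If it helps, here is a rough way to locate that earlier OOM message. This is only a sketch; it assumes Ray's default log location (/tmp/ray/session_latest/logs) and that the session_latest symlink still points at the failed run:

```python
# Rough sketch: grep the Ray session logs for OOM / kill messages that may have
# preceded the "Connection closed by peer" error. Assumes the default log
# directory /tmp/ray/session_latest/logs.
import glob

PATTERNS = ("out of memory", "CUDA out of memory", "OutOfMemoryError", "SIGKILL")

for path in sorted(glob.glob("/tmp/ray/session_latest/logs/*")):
    try:
        with open(path, errors="ignore") as f:
            for lineno, line in enumerate(f, 1):
                if any(p in line for p in PATTERNS):
                    print(f"{path}:{lineno}: {line.rstrip()}")
    except (IsADirectoryError, PermissionError):
        continue
```

The raylet and per-worker logs in that directory usually show which process was killed and why.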
