RuntimeError: Connection closed by peer when training on a single node #17

Open · daxiongshu opened this issue Jan 29, 2025 · 1 comment


@daxiongshu

Hello, I followed the instructions to train on a single node with 6x A100 GPUs (80 GB of GPU memory each).
I used `export CUDA_VISIBLE_DEVICES=2,3,4,5,6,7` to select 6 GPUs; the other 2 GPUs were occupied by other workloads.

There were no errors until the end of the first training epoch.
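For reference, here is a minimal sanity check (my own sketch, not part of the OpenRLHF scripts) to confirm the GPU masking takes effect in the environment where Ray and the training job are started:

```python
# Sanity check (not from the OpenRLHF repo): confirm the CUDA_VISIBLE_DEVICES
# masking is in effect in the environment where Ray is started.
import os

import torch

os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3,4,5,6,7")

print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
print("Visible GPUs:", torch.cuda.device_count())  # expected: 6
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")
```

If this reports 8 visible GPUs instead of 6, the export was not picked up by the process that started Ray.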

RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:38032
Full error message:
Train epoch [1/1]: 100%|██████████| 2048/2048 [1:48:48<00:00,  3.19s/it, pg=0.0374, rm=-0.25, ret=-0.25, glen=1958.25, tlen=2062.0, kl=0, act_lr=1e-7]
Traceback (most recent call last):
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/cli/train_ppo_ray_box.py", line 395, in <module>
    train(args)
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/cli/train_ppo_ray_box.py", line 175, in train
    ray.get(refs)
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/ray/_private/worker.py", line 2623, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/ray/_private/worker.py", line 861, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::ActorModelRayActorBOX.fit() (pid=146529, ip=0.0.0.0, actor_id=5f14195c0e9c5caec6afbc9803000000, repr=<openrlhf.trainer.ray.ppo_actor.ActorModelRayActorBOX object at 0x7f588046b390>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ray/ppo_actor.py", line 872, in fit
    trainer.fit(
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ppo_trainer_prm800k_box.py", line 246, in fit
    status = self.ppo_train(steps)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ray/ppo_actor.py", line 357, in ppo_train
    self._broadcast_to_vllm()
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ray/ppo_actor.py", line 390, in _broadcast_to_vllm
    torch.distributed.broadcast(param.data, 0, group=self._model_update_group)
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:38032
(ActorModelRayActorBOX pid=146529)
Episode [1/20]:   0%|          | 0/8 [3:47:36<?, ?it/s]

---------------------------------------
Job 'raysubmit_VZF7LSSz9GK92ZDe' failed
---------------------------------------

Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
  File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ray/ppo_actor.py", line 390, in _broadcast_to_vllm
    torch.distributed.broadcast(param.data, 0, group=self._model_update_group)
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:38032
(ActorModelRayActorBOX pid=146529)
Episode [1/20]:   0%|          | 0/8 [3:47:36<?, ?it/s]
@NekoMimiUnagi

I met the same error. If you look further back in your log, you should find an OOM error. Because of that error, Ray shuts down a process, which may be the one whose traceback you posted here.
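If it helps, here is a rough way to locate that earlier OOM message. This is only a sketch; it assumes Ray's default log location (/tmp/ray/session_latest/logs) and that the session_latest symlink still points at the failed run:

```python
# Rough sketch: grep the Ray session logs for OOM / kill messages that may have
# preceded the "Connection closed by peer" error. Assumes the default log
# directory /tmp/ray/session_latest/logs.
import glob

PATTERNS = ("out of memory", "CUDA out of memory", "OutOfMemoryError", "SIGKILL")

for path in sorted(glob.glob("/tmp/ray/session_latest/logs/*")):
    try:
        with open(path, errors="ignore") as f:
            for lineno, line in enumerate(f, 1):
                if any(p in line for p in PATTERNS):
                    print(f"{path}:{lineno}: {line.rstrip()}")
    except (IsADirectoryError, PermissionError):
        continue
```

The raylet and per-worker logs in that directory usually show which process was killed and why.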
