Hello, I followed the instructions to train on a single node with 6x A100 GPUs (80 GB of GPU memory each).
I used export CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 to select 6 GPUs; the other 2 GPUs were occupied by other workloads.
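For context, the GPU selection is nothing more than restricting which devices are visible before Ray starts. A minimal sketch of what this amounts to (an assumed local setup for illustration, not the exact OpenRLHF launch script):

```python
# Minimal sketch (assumed setup, not the actual launch script):
# CUDA_VISIBLE_DEVICES must be set before the Ray processes start,
# otherwise Ray can still schedule actors on GPUs 0-1.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "2,3,4,5,6,7"  # expose 6 of the 8 A100s

import ray

ray.init()  # Ray auto-detects only the 6 visible GPUs
print(ray.cluster_resources().get("GPU"))  # expected: 6.0
```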
I had no errors until the end of training for 1 epoch, when the job failed with:
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:38032
Click to See the full error message!
Train epoch [1/1]: 100%|██████████| 2048/2048 [1:48:48<00:00, 3.19s/it, pg=0.0374, rm=-0.25, ret=-0.25, glen=1958.25, tlen=2062.0, kl=0, act_lr=1e-7]
Traceback (most recent call last):
File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/cli/train_ppo_ray_box.py", line 395, in<module>
train(args)
File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/cli/train_ppo_ray_box.py", line 175, in train
ray.get(refs)
File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::ActorModelRayActorBOX.fit() (pid=146529, ip=0.0.0.0, actor_id=5f14195c0e9c5caec6afbc9803000000, repr=<openrlhf.trainer.ray.ppo_actor.ActorModelRayActorBOX object at 0x7f588046b390>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ray/ppo_actor.py", line 872, in fit
trainer.fit(
File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ppo_trainer_prm800k_box.py", line 246, in fit
status = self.ppo_train(steps)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ray/ppo_actor.py", line 357, in ppo_train
self._broadcast_to_vllm()
File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ray/ppo_actor.py", line 390, in _broadcast_to_vllm
torch.distributed.broadcast(param.data, 0, group=self._model_update_group)
File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:38032
(ActorModelRayActorBOX pid=146529)
Episode [1/20]: 0%|| 0/8 [3:47:36<?, ?it/s]
---------------------------------------
Job 'raysubmit_VZF7LSSz9GK92ZDe' failed
---------------------------------------
Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
File "/tmp/ray/session_2025-01-27_12-50-47_746465_2724393/runtime_resources/working_dir_files/_ray_pkg_edf297b701eae48d/openrlhf/trainer/ray/ppo_actor.py", line 390, in _broadcast_to_vllm
torch.distributed.broadcast(param.data, 0, group=self._model_update_group)
File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/raid/llm/miniforge3/envs/r1/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:38032
(ActorModelRayActorBOX pid=146529)
Episode [1/20]: 0%|| 0/8 [3:47:36<?, ?it/s]
I met the same error. If you look back through your log, you can find an OOM error. Because of that error, Ray closes a process, which may be the one whose traceback you posted here.
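To illustrate what that looks like at the PyTorch level, here is a minimal, self-contained sketch (hypothetical, not OpenRLHF code): when one member of a Gloo process group dies, for example because it was OOM-killed, the surviving rank's next collective fails with a Gloo connection error like the one in the traceback above.

```python
# Hypothetical repro sketch, not OpenRLHF code: rank 1 stands in for a worker
# torn down after an OOM; rank 0 then hits a Gloo "Connection closed/reset by
# peer" RuntimeError on its next collective, as in the traceback above.
import os
import time

import torch
import torch.distributed as dist
from multiprocessing import Process


def run(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29511"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 1:
        os._exit(1)  # simulate the process killed after the OOM
    time.sleep(2)  # make sure the peer is really gone before the collective
    dist.broadcast(torch.zeros(4), src=0)  # raises the Gloo connection error


if __name__ == "__main__":
    procs = [Process(target=run, args=(r, 2)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

So the Gloo error on the broadcasting rank is only a symptom; the real failure is whichever actor was killed earlier, and the OOM message for it should appear further back in the log.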