Loading Model through Multi-Node Ray Cluster Fails #881

Closed
VarunSreenivasan16 opened this issue Aug 25, 2023 · 7 comments

VarunSreenivasan16 commented Aug 25, 2023

Problem Description

I'm trying to spin up the vLLM API server inside a Docker container (which has vllm and all requirements installed) on the head node of a Ray cluster (the cluster contains 4 nodes with 4 T4 GPUs in total) with the following command inside the container:

python3 -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8000 --model /home/model --tensor-parallel-size 4 --swap-space 2

I get back the following error (NOTE: I x'd out the IP):

python3 -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8000 --model /home/model --tensor-parallel-size 4 --swap-space 2
2023-08-25 22:29:16,736      INFO worker.py:1313 -- Using address ray://xxx.xxx.xxx.xx:10001 set in the environment variable RAY_ADDRESS
2023-08-25 22:29:16,737 INFO client_builder.py:237 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
INFO 08-25 22:29:19 llm_engine.py:70] Initializing an LLM engine with config: model='/home/model', tokenizer='/home/model', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
WARNING 08-25 22:29:19 config.py:191] Possibly too large swap space. 8.00 GiB out of the 15.34 GiB total CPU memory is allocated for the swap space.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 21, in __getattr__
    return getattr(self.worker, name)
  File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 464, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 21, in __getattr__
    return getattr(self.worker, name)
  File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 464, in _resume_span
    return method(self, *_args, **_kwargs)

....

  File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 464, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 21, in __getattr__
    return getattr(self.worker, name)
  File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 463, in _resume_span
    if not _is_tracing_enabled() or _ray_trace_ctx is None:
RecursionError: maximum recursion depth exceeded
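
For context on the RecursionError: a delegating __getattr__ like the one in the traceback recurses indefinitely when the wrapped worker attribute was never assigned (for example, if remote worker initialization on one of the nodes failed). A minimal illustrative sketch of that failure mode, not vLLM's actual ray_utils.py code:

# Minimal sketch: a delegating __getattr__ recurses forever when the wrapped
# attribute was never assigned (e.g. remote initialization failed).
class WorkerWrapper:
    def __init__(self, init_ok: bool):
        if init_ok:
            self.worker = object()  # normally set during worker init

    def __getattr__(self, name):
        # Only called when normal lookup fails; if self.worker is missing,
        # looking it up re-enters __getattr__ and never terminates.
        return getattr(self.worker, name)

try:
    WorkerWrapper(init_ok=False).execute_model  # any attribute access recurses
except RecursionError as err:
    print("RecursionError:", err)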

How to Reproduce

Hardware

  • 4 x AWS EC2 g4dn.xlarge.
  • Each g4dn.xlarge instance has 1 NVIDIA T4 GPU with 16 GB RAM. The node architecture is x86_64 and the platform is Ubuntu.
  • In total, the multi-node deployment has 4 nodes with 4 T4 GPUs.
  • NVIDIA drivers are correctly installed on all of them, and the CUDA version is 11.8.

Model

  • Fine-tuned Llama-2-13B model (obtained by merging Llama-2-13b-hf with a LoRA adapter using PEFT).
  • The tokenizer.model, tokenizer.json, and tokenizer_config.json files are also present in the model directory (taken from the base HF model).

Cluster Setup

  • I ensured that all required ports are open on each node to allow cluster communication.
  • I ran ray status --address <head_node_ip>:6379 and confirmed that all four nodes are reachable from the Docker container running on the head node.
  [Screenshot: ray status output showing all four nodes connected]
  • To allow vLLM to connect to the Ray cluster, I set the environment variable RAY_ADDRESS to ray://<head_node_ip>:10001 and then ran the command to spin up the API server.
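
A quick way to sanity-check the Ray client connection and GPU visibility from inside the container (a hedged sketch, assuming the same RAY_ADDRESS environment variable is set):

# Hedged sanity check: connect through the Ray client address and confirm
# the cluster reports the expected 4 GPUs across the 4 nodes.
import os
import ray

ray.init(address=os.environ.get("RAY_ADDRESS", "auto"))
print(ray.cluster_resources())  # expect something like {'GPU': 4.0, ...}
ray.shutdown()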

Docker Container Setup on Head Node

The following was the Dockerfile used to build the container:

# Set the base image to nvidia/cuda:11.8.0-devel-ubuntu22.04
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

# Update the repository sources list and install dependencies
RUN apt-get update -y -q && apt-get upgrade -y -q && apt-get install -y -q curl python3-pip git vim

# Copy requirements file 
COPY requirements.txt ./requirements.txt

# Install pip dependencies
RUN pip3 install -r requirements.txt

# Copy the model into home directory
COPY llama-2-13b-merged/ /home/model/

# Expose port for http requests
EXPOSE 8000 9999

# Setup entrypoint
ENTRYPOINT ["/bin/bash"]

The requirements.txt is

vllm
pandas
transformers
peft
datasets
jupyterlab

  • I can confirm the installation succeeded, and nvidia-smi inside the container correctly shows the CUDA version as 11.8.

  • I also used the --gpus all flag when running the docker run command.
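
For illustration, a docker run invocation consistent with the details above might look like the sketch below (the image tag vllm-server:latest and the use of host networking are illustrative assumptions; the exact flags weren't specified above):

# Hedged sketch of the head-node docker run command; host networking is one
# way to let the container reach the Ray client port 10001 on the head node.
docker run --gpus all --network host \
  -e RAY_ADDRESS=ray://<head_node_ip>:10001 \
  -it vllm-server:latest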

Request

I'd appreciate any advice or help on this issue. I'm happy to provide any more information required to address the issue.

VarunSreenivasan16 changed the title from "Loading Model through Multi-Node Ray Cluster" to "Loading Model through Multi-Node Ray Cluster Fails" on Aug 25, 2023

agrogov commented Oct 8, 2023

It seems there's simply not enough RAM: WARNING 08-25 22:29:19 config.py:191] Possibly too large swap space. 8.00 GiB out of the 15.34 GiB total CPU memory is allocated for the swap space.
Try running without --swap-space 2; it isn't needed if you are sure the VRAM is sufficient.
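
For reference, the same launch command with the swap flag dropped (so vLLM falls back to its default swap allocation) would be:

python3 -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8000 --model /home/model --tensor-parallel-size 4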


masnec commented Dec 26, 2023

@VarunSreenivasan16 did you figure out how to fix this? I ran into the same issue recently.


masnec commented Dec 28, 2023

Hi @WoosukKwon, I encountered the same issue when connecting to a remote Ray cluster with vllm v0.2.6. Any idea or direction? Thank you!

hmellor (Collaborator) commented Mar 28, 2024

@VarunSreenivasan16 @masnec did either of you ever figure out what the issue was or resolve it?


masnec commented Mar 29, 2024

@hmellor it worked after adding these two flags: --engine-use-ray --worker-use-ray.
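
For anyone hitting the same error, the full launch command with those flags added would look roughly like the line below (flag names taken from the comment above; they applied to vLLM versions of that era and may not exist in current releases):

python3 -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8000 --model /home/model --tensor-parallel-size 4 --worker-use-ray --engine-use-ray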

hmellor (Collaborator) commented Mar 31, 2024

@masnec would you mind contributing this information to https://docs.vllm.ai/en/latest/serving/distributed_serving.html?

rkooo567 (Collaborator) commented May 3, 2024

By the way, using tensor parallelism across multiple nodes can hurt performance significantly because it requires heavy inter-node network communication.

hmellor closed this as completed on May 31, 2024