Loading Model through Multi-Node Ray Cluster Fails #881

Closed
VarunSreenivasan16 opened this issue Aug 25, 2023 · 7 comments

VarunSreenivasan16 commented Aug 25, 2023

Problem Description

I'm trying to spin up the vLLM API server inside a Docker container (which has vllm and all requirements installed) on the head node of a Ray cluster (the cluster contains 4 nodes with 4 T4 GPUs in total) with the following command inside the container:

python3 -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8000 --model /home/model --tensor-parallel-size 4 --swap-space 2

I get back the following error (NOTE: I x'd out the IP):

python3 -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8000 --model /home/model --tensor-parallel-size 4 --swap-space 2
2023-08-25 22:29:16,736      INFO worker.py:1313 -- Using address ray://xxx.xxx.xxx.xx:10001 set in the environment variable RAY_ADDRESS
2023-08-25 22:29:16,737 INFO client_builder.py:237 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
INFO 08-25 22:29:19 llm_engine.py:70] Initializing an LLM engine with config: model='/home/model', tokenizer='/home/model', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
WARNING 08-25 22:29:19 config.py:191] Possibly too large swap space. 8.00 GiB out of the 15.34 GiB total CPU memory is allocated for the swap space.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 21, in __getattr__
    return getattr(self.worker, name)
  File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 464, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 21, in __getattr__
    return getattr(self.worker, name)
  File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 464, in _resume_span
    return method(self, *_args, **_kwargs)

....

  File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 464, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 21, in __getattr__
    return getattr(self.worker, name)
  File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 463, in _resume_span
    if not _is_tracing_enabled() or _ray_trace_ctx is None:
RecursionError: maximum recursion depth exceeded
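
For context on the RecursionError: a delegating __getattr__ like the one in the traceback recurses indefinitely when the wrapped worker attribute was never assigned (for example, if remote worker initialization on one of the nodes failed). A minimal illustrative sketch of that failure mode, not vLLM's actual ray_utils.py code:

# Minimal sketch: a delegating __getattr__ recurses forever when the wrapped
# attribute was never assigned (e.g. remote initialization failed).
class WorkerWrapper:
    def __init__(self, init_ok: bool):
        if init_ok:
            self.worker = object()  # normally set during worker init

    def __getattr__(self, name):
        # Only called when normal lookup fails; if self.worker is missing,
        # looking it up re-enters __getattr__ and never terminates.
        return getattr(self.worker, name)

try:
    WorkerWrapper(init_ok=False).execute_model  # any attribute access recurses
except RecursionError as err:
    print("RecursionError:", err)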

How to Reproduce

Hardware

  • 4 x AWS EC2 g4dn.xlarge.
  • Each g4dn.xlarge instance has 1 NVIDIA T4 GPU with 16 GB RAM. The node architecture is x86_64 and the platform is Ubuntu.
  • In total, the multi-node deployment has 4 nodes with 4 T4 GPUs.
  • NVIDIA drivers are correctly installed on all of them, and the CUDA version is 11.8.

Model

  • Fine-tuned Llama-2-13B model (obtained by merging Llama-2-13b-hf with a LoRA adapter using PEFT).
  • The tokenizer.model, tokenizer.json, and tokenizer_config.json files are also present in the model directory (taken from the base HF model).

Cluster Setup

  • I ensured that all required ports are open on each node to allow cluster communication.
  • I ran ray status --address <head_node_ip>:6379 and confirmed that all four nodes are reachable from the Docker container running on the head node.
  [Screenshot: ray status output showing all four nodes connected]
  • To allow vLLM to connect to the Ray cluster, I set the environment variable RAY_ADDRESS to ray://<head_node_ip>:10001 and then ran the command to spin up the API server.
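
A quick way to sanity-check the Ray client connection and GPU visibility from inside the container (a hedged sketch, assuming the same RAY_ADDRESS environment variable is set):

# Hedged sanity check: connect through the Ray client address and confirm
# the cluster reports the expected 4 GPUs across the 4 nodes.
import os
import ray

ray.init(address=os.environ.get("RAY_ADDRESS", "auto"))
print(ray.cluster_resources())  # expect something like {'GPU': 4.0, ...}
ray.shutdown()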

Docker Container Setup on Head Node

The following was the Dockerfile used to build the container:

# Set the base image to nvidia/cuda:11.8.0-devel-ubuntu22.04
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

# Update the repository sources list and install dependencies
RUN apt-get update -y -q && apt-get upgrade -y -q && apt-get install -y -q curl python3-pip git vim

# Copy requirements file 
COPY requirements.txt ./requirements.txt

# Install pip dependencies
RUN pip3 install -r requirements.txt

# Copy the model into home directory
COPY llama-2-13b-merged/ /home/model/

# Expose port for http requests
EXPOSE 8000 9999

# Setup entrypoint
ENTRYPOINT ["/bin/bash"]

The requirements.txt is

vllm
pandas
transformers
peft
datasets
jupyterlab

  • I can confirm the installation succeeded, and nvidia-smi inside the container correctly shows the CUDA version as 11.8.

  • I also used the --gpus all flag when running the docker run command.
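
For illustration, a docker run invocation consistent with the details above might look like the sketch below (the image tag vllm-server:latest and the use of host networking are illustrative assumptions; the exact flags weren't specified above):

# Hedged sketch of the head-node docker run command; host networking is one
# way to let the container reach the Ray client port 10001 on the head node.
docker run --gpus all --network host \
  -e RAY_ADDRESS=ray://<head_node_ip>:10001 \
  -it vllm-server:latest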

Request

I'd appreciate any advice or help on this issue. I'm happy to provide any more information required to address the issue.

VarunSreenivasan16 changed the title from "Loading Model through Multi-Node Ray Cluster" to "Loading Model through Multi-Node Ray Cluster Fails" on Aug 25, 2023

agrogov commented Oct 8, 2023

It seems there's simply not enough RAM: WARNING 08-25 22:29:19 config.py:191] Possibly too large swap space. 8.00 GiB out of the 15.34 GiB total CPU memory is allocated for the swap space.
Try running without --swap-space 2; it isn't needed if you are sure the VRAM is sufficient.
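
For reference, the same launch command with the swap flag dropped (so vLLM falls back to its default swap allocation) would be:

python3 -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8000 --model /home/model --tensor-parallel-size 4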


masnec commented Dec 26, 2023

@VarunSreenivasan16 did you figure out how to fix this? I ran into the same issue recently.


masnec commented Dec 28, 2023

Hi @WoosukKwon, I encountered the same issue when connecting to a remote Ray cluster with vllm v0.2.6. Any idea or direction? Thank you!

hmellor (Collaborator) commented Mar 28, 2024

@VarunSreenivasan16 @masnec did either of you ever figure out what the issue was or resolve it?


masnec commented Mar 29, 2024

@hmellor it worked after adding these two flags: --engine-use-ray --worker-use-ray.
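
For anyone hitting the same error, the full launch command with those flags added would look roughly like the line below (flag names taken from the comment above; they applied to vLLM versions of that era and may not exist in current releases):

python3 -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8000 --model /home/model --tensor-parallel-size 4 --worker-use-ray --engine-use-ray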

hmellor (Collaborator) commented Mar 31, 2024

@masnec would you mind contributing this information to https://docs.vllm.ai/en/latest/serving/distributed_serving.html?

rkooo567 (Collaborator) commented May 3, 2024

By the way, using tensor parallelism across multiple nodes can hurt performance significantly because it requires heavy inter-node network communication.

hmellor closed this as completed on May 31, 2024