Upgrade to support latest vLLM version (max_lora_rank) #2389

Open
dreamiter opened this issue Sep 16, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@dreamiter

Description

In the current version (using the LMI SageMaker image), we are running into the following error:

File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 1288, in __post_init__
raise ValueError(
ValueError: max_lora_rank (128) must be one of (8, 16, 32, 64)

It looks like this error was fixed in vLLM v0.5.5.
Release notes: https://github.com/vllm-project/vllm/releases/tag/v0.5.5
PR: vllm-project/vllm#7146
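For readers unfamiliar with the failure, the check behind the traceback can be sketched as follows. This is an illustration reconstructed from the error message only, not vLLM's actual source; the function name `check_max_lora_rank` is hypothetical, and the allowed tuple is copied from the message above, so it reflects the version that raised the error.

```python
# Allowed ranks as reported by the error message in this issue.
# Newer vLLM releases may accept a different set; this is an assumption
# for illustration, not a copy of vLLM's code.
ALLOWED_MAX_LORA_RANKS = (8, 16, 32, 64)


def check_max_lora_rank(max_lora_rank, allowed=ALLOWED_MAX_LORA_RANKS):
    """Raise ValueError when the requested rank is not in the allowed set."""
    if max_lora_rank not in allowed:
        raise ValueError(
            f"max_lora_rank ({max_lora_rank}) must be one of {allowed}"
        )
    return max_lora_rank


if __name__ == "__main__":
    try:
        check_max_lora_rank(128)
    except ValueError as err:
        print(err)
```

With `option.max_lora_rank=128`, a check of this shape fails at configuration time, before any model weights are loaded.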

References

N/A

@dreamiter added the enhancement (New feature or request) label on Sep 16, 2024
@dreamiter
Author

Hi @frankfliu - would you be able to help? Thanks.

@siddvenk
Contributor

We are planning a release that will use vllm 0.6.0 (or 0.6.1.post2) soon.

In the meantime, you can try providing a requirements.txt file with vllm==0.5.5 (or a later version) to work around this.

@dreamiter
Author

dreamiter commented Sep 18, 2024

Thank you @siddvenk for your suggestions.

I tried building a custom image by running pip install vllm==0.5.5 in a Dockerfile, based on your latest stable image 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124

We specified the following in the serving.properties file:

option.model_id=unsloth/mistral-7b-instruct-v0.3
option.engine=Python
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.enable_lora=true
option.gpu_memory_utilization=0.95
option.max_model_len=16000
option.max_lora_rank=128

We tried setting max_tokens to a very high number, but the response is still very short.
We also see the following log message, which suggests that the vLLM backend does not support the max_tokens parameter.

The following parameters are not supported by vllm with rolling batch: {'logprobs', 'temperature', 'seed', 'max_tokens'}. The supported parameters are set()

Do you have any insights?

@siddvenk
Contributor

Yes, you should use max_new_tokens.

You can find the schema for our inference api here https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/user_guides/lmi_input_output_schema.md.

We also support the openai chat completions schema, details here https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/user_guides/chat_input_output_schema.md.
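As a small illustration of the rename suggested above, the sketch below maps a client-side `max_tokens` key to `max_new_tokens` before the request is sent. The helper name `to_lmi_parameters` is hypothetical and not part of DJL Serving; it only renames that one key and passes everything else through unchanged. Whether the other parameters are accepted is defined by the linked schema and is not checked here.

```python
def to_lmi_parameters(params):
    """Return a copy of params with max_tokens renamed to max_new_tokens.

    Hypothetical client-side helper: it only performs the rename discussed
    in this thread and does not validate any other parameter.
    """
    converted = dict(params)  # shallow copy so the caller's dict is untouched
    if "max_tokens" in converted:
        value = converted.pop("max_tokens")
        # Keep an explicit max_new_tokens if the caller already set one.
        converted.setdefault("max_new_tokens", value)
    return converted


if __name__ == "__main__":
    print(to_lmi_parameters({"max_tokens": 512, "do_sample": True}))
```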

@dreamiter
Author

Thanks again for your quick response @siddvenk -

Just want to make sure, should we:

  • Add max_new_tokens to the serving.properties file, e.g. option.max_new_tokens=16000
  • Or, pass max_new_tokens as a parameter when invoking the endpoint, such as:
curl -X POST https://my.sample.endpoint.com/invocations \
  -H 'Content-Type: application/json' \
  -d '
    {
        "inputs": "What is Deep Learning?",
        "parameters": {
            "do_sample": true,
            "max_new_tokens": 16000,
            "details": true
        },
        "stream": true
    }'

@dreamiter
Author

btw, forgot to mention, we are deploying this to sagemaker

@siddvenk
Contributor

There are two different configurations.

On a per request basis, you can specify max_new_tokens to limit the number of generated tokens. This is just a limit on the output, not on the total sequence length.

You can limit the maximum length of sequences globally by setting option.max_model_len in serving.properties. This enforces a limit that applies to all requests, which includes both the input (prompt) tokens and generated output tokens.
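The interaction between the two limits can be sketched as simple arithmetic. The helper names below are hypothetical, and the prompt length is an assumed example value; the thread does not say how the server reacts when the total exceeds option.max_model_len (rejecting or truncating), so the sketch only computes the budget.

```python
def fits_in_context(prompt_tokens, max_new_tokens, max_model_len):
    """True if prompt plus requested output fits within the global limit."""
    return prompt_tokens + max_new_tokens <= max_model_len


def output_budget(prompt_tokens, max_model_len):
    """Largest number of output tokens that still fits, never negative."""
    return max(0, max_model_len - prompt_tokens)


if __name__ == "__main__":
    max_model_len = 16000  # option.max_model_len from serving.properties
    prompt_tokens = 10     # assumed prompt length, for illustration only
    print(fits_in_context(prompt_tokens, 16000, max_model_len))
    print(output_budget(prompt_tokens, max_model_len))
```

With option.max_model_len=16000, a request that asks for max_new_tokens=16000 cannot fit alongside any non-empty prompt, so a smaller per-request value is the safer choice.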

@dreamiter
Author

Thanks, @siddvenk .

We ran more tests, and it turns out the short-response issue was specific to the custom image I built (mentioned above).

So we suspect we missed some key steps when building the image - can you help us review our process?

Steps:

  1. Create the following files:
|- Dockerfile
|- requirements.txt
  2. In Dockerfile:
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124

# Copy files
COPY ./requirements.txt /opt/requirements.txt

# Install third-party Python dependencies within the Docker environment
RUN pip install --upgrade pip && \
    pip install awscli --trusted-host pypi.org --trusted-host files.pythonhosted.org && \
    pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org -r /opt/requirements.txt
  3. In requirements.txt:
vllm==0.5.5
  4. Build the new Docker image using docker build
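One way to sanity-check a build like this is to confirm which vLLM version actually ended up in the image. The sketch below uses only the Python standard library; the helper names are hypothetical, and the version comparison is deliberately simplified (it compares leading numeric components and ignores suffixes such as `.post2`), so treat it as a rough check rather than a full version parser.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version):
    """Leading numeric components of a version, e.g. '0.6.1.post2' -> (0, 6, 1)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(version, minimum):
    """Simplified comparison on numeric release components only."""
    return release_tuple(version) >= release_tuple(minimum)


if __name__ == "__main__":
    found = installed_version("vllm")
    if found is None:
        print("vllm is not installed in this environment")
    else:
        print(f"vllm {found}; at least 0.5.5: {meets_minimum(found, '0.5.5')}")
```

Running the script inside the built container (for example through docker run with a Python entrypoint) shows whether the pip step replaced the bundled version.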


2 participants