[Usage]: Unable to run gemma-2-2b-it #7060

Closed
Maximgitman opened this issue Aug 2, 2024 · 15 comments

Labels: stale, usage (How to use vllm)

Comments

@Maximgitman

The model to consider.

https://huggingface.co/google/gemma-2-2b-it

The closest model vllm already supports.

https://huggingface.co/google/gemma-2-9b-it

What's your difficulty of supporting the model you want?

Faster inference

Maximgitman added the new model (Requests to new models) label on Aug 2, 2024
@DarkLight1337
Member

Provided that the model architecture is the same as the 9B model, the 2B model should already be supported. Just try running vLLM with that model.

@Maximgitman
Author

Provided that the model architecture is the same as the 9B model, the 2B model should already be supported. Just try running vLLM with that model.

Thank you for the response. I tried today and got the following error:

import os
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Setting the environment variable as suggested
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'

llm = LLM(model="google/gemma-2-2b-it", enable_lora=True)
ValueError: Please use Flashinfer backend for models with logits_soft_cap (i.e., Gemma-2). Otherwise, the output might be wrong. Set Flashinfer backend by export VLLM_ATTENTION_BACKEND=FLASHINFER.

vLLM version 0.5.3

@arunpatala

arunpatala commented Aug 2, 2024

Same error for me using the latest Docker image.

@DarkLight1337
Member

(quoting Maximgitman's comment above)

Can you show the full log?

@Zbaoli

Zbaoli commented Aug 2, 2024

I tried using export VLLM_ATTENTION_BACKEND=FLASHINFER to solve this error, but another error occurred:

[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 155, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 441, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 896, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1272, in execute_model
[rank0]:     BatchDecodeWithPagedKVCacheWrapper(
[rank0]: TypeError: 'NoneType' object is not callable

@DarkLight1337
Member

DarkLight1337 commented Aug 2, 2024

[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1272, in execute_model
[rank0]:     BatchDecodeWithPagedKVCacheWrapper(
[rank0]: TypeError: 'NoneType' object is not callable

This means you don't have FlashInfer installed.
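A quick way to check is to try importing it in the same environment vLLM runs in (a minimal sketch; the version attribute may not exist in every release, hence the getattr):

# Sanity check: is FlashInfer importable in this environment?
import importlib

try:
    flashinfer = importlib.import_module("flashinfer")
    print("flashinfer found:", getattr(flashinfer, "__version__", "unknown version"))
except ImportError as exc:
    print("flashinfer is NOT importable:", exc)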

@arunpatala

Thanks. I was able to run it by setting the environment variable in the docker command as follows:

volume_hf=$HF_HOME
docker run --runtime nvidia --gpus all \
    -v $volume_hf:/root/.cache/huggingface \
    -e VLLM_ATTENTION_BACKEND=FLASHINFER \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model "google/gemma-2-2b-it" \
    --max-model-len 4096 \
    --max-num-seqs 8 \
    --served-model-name model
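Once the container is up, a minimal sketch of querying it through the OpenAI-compatible API (this assumes the openai Python client is installed; the model name matches --served-model-name above):

# Query the vLLM OpenAI-compatible server started by the docker command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="model",  # matches --served-model-name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)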

@Zbaoli

Zbaoli commented Aug 2, 2024

[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1272, in execute_model
[rank0]:     BatchDecodeWithPagedKVCacheWrapper(
[rank0]: TypeError: 'NoneType' object is not callable

This means you don't have FlashInfer installed.

Yes, maybe that's the reason, but I still encountered this error after installing the flashinfer package with the pip install flashinfer command:

>>> pip show flashinfer
Name: flashinfer
Version: 1.0.0
Summary: White Hat Researcher
Home-page: UNKNOWN
Author: Pastaga
Author-email: UNKNOWN
License: MIT
Location: /root/miniconda3/envs/vllm/lib/python3.11/site-packages
Requires:
Required-by:
>>> python -c "import flashinfer"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'flashinfer'

@DarkLight1337
Member

DarkLight1337 commented Aug 2, 2024

Make sure you're installing FlashInfer for the correct version of PyTorch/CUDA: https://github.com/flashinfer-ai/flashinfer?tab=readme-ov-file#installation
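For example, a quick way to check which PyTorch/CUDA build you have, and hence which FlashInfer wheel index to install from (a sketch; the index URL in the comment is just the one matching torch 2.3 + CUDA 12.1):

# Print the PyTorch and CUDA versions so you can pick the matching FlashInfer
# wheel index, e.g. https://flashinfer.ai/whl/cu121/torch2.3/ for torch 2.3.x + CUDA 12.1.
import torch

print("torch:", torch.__version__)
print("cuda :", torch.version.cuda)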

@DarkLight1337
Member

DarkLight1337 commented Aug 2, 2024

Also, vLLM requires FlashInfer v0.0.8 (refer to the part about Gemma 2 in https://github.com/vllm-project/vllm/releases/tag/v0.5.1)
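A quick sketch of confirming which FlashInfer version is actually installed (this assumes the wheel registers under the distribution name flashinfer, as in the pip commands in this thread):

# Check the installed FlashInfer version against the v0.0.8 requirement.
from importlib.metadata import PackageNotFoundError, version

try:
    print("flashinfer:", version("flashinfer"))  # expect 0.0.8 for Gemma 2 on vLLM 0.5.x
except PackageNotFoundError:
    print("flashinfer is not installed")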

@hornikmatej

Also, vLLM requires FlashInfer v0.0.8 (refer to the part about Gemma 2 in https://github.com/vllm-project/vllm/releases/tag/v0.5.1)

Yes, you are right: it works with v0.0.8 but not with v0.1.3, which throws the following error:

[rank0]:   File "/home/ubuntu/clone-brain/.venv/lib/python3.11/site-packages/flashinfer/prefill.py", line 791, in begin_forward
[rank0]:     self._wrapper.begin_forward(
[rank0]: RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257

Using vLLM 0.5.3.post1

@Maximgitman
Author

Here are the steps that helped me run a fine-tuned gemma-2-2b-it with vLLM. Maybe this will help someone.

  1. Install and Verify vLLM
    Make sure the vLLM version is 0.5.3.
!pip install vllm==0.5.3
import vllm
print(vllm.__version__)
0.5.3
  2. Install FlashInfer
    Follow the FlashInfer installation instructions.
    Check your torch version and CUDA compatibility:
import torch
print(torch.__version__)  # Should print: 2.3.1+cu121
print(torch.version.cuda) # Should print: 12.1

Based on the documentation, vLLM requires FlashInfer v0.0.8 for Gemma 2 (refer to the part about Gemma 2 in the vLLM releases and the FlashInfer documentation).
!pip install flashinfer==0.0.8 -i https://flashinfer.ai/whl/cu121/torch2.3/

  3. Set the VLLM_ATTENTION_BACKEND environment variable
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
  4. Test vLLM
import random

from vllm import LLM, SamplingParams

llm = LLM(model="<gemma-2-2b-model>", trust_remote_code=True)  # path or HF repo of the fine-tuned model

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=512,
    top_p=0.95,
    top_k=1,
)

# test_data is my own test dataset; pick one random example as the prompt.
prompts = [
    test_data[random.randint(0, test_data.shape[0] - 1)]["text"],
]

outputs = llm.generate(
    prompts,
    sampling_params
)

# Expected output:
# Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.24s/it, est. speed input: 991.44 toks/s, output: 87.79 toks/s]
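For completeness, a short sketch of printing the generated text from the outputs above (field names follow vLLM's RequestOutput objects):

# Each RequestOutput carries the prompt and its generated completions.
for output in outputs:
    print("Prompt:   ", output.prompt)
    print("Generated:", output.outputs[0].text)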

Hardware: NVIDIA A100-SXM4-40GB

Should it be faster?

DarkLight1337 changed the title from [New Model]: … to [Usage]: Unable to run gemma-2-2b-it on Aug 3, 2024
DarkLight1337 added the usage (How to use vllm) label and removed the new model (Requests to new models) label on Aug 3, 2024

github-actions bot commented Nov 1, 2024

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Nov 1, 2024

github-actions bot commented Dec 1, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions bot closed this as not planned on Dec 1, 2024
@fgenie

fgenie commented Dec 27, 2024

(quoting Maximgitman's step-by-step comment above)

After following this, if you still encounter a runtime error like the one here, it is likely a FlashInfer version issue (mine was vllm==0.5.3 and flashinfer==0.1.6).

#7070 suggests installing flashinfer==0.1.2, which resolved my problem. Hope it helps!
