[Usage]: Unable to run `gemma-2-2b-it` #7060

Maximgitman · 2024-08-02T03:25:33Z

The model to consider.

https://huggingface.co/google/gemma-2-2b-it

The closest model vllm already supports.

https://huggingface.co/google/gemma-2-9b-it

What's your difficulty of supporting the model you want?

Faster inference

DarkLight1337 · 2024-08-02T05:08:07Z

Provided that the model architecture is the same as the 9B model, the 2B model should already be supported. Just try running vLLM with that model.

Maximgitman · 2024-08-02T05:15:46Z

Provided that the model architecture is the same as the 9B model, the 2B model should already be supported. Just try running vLLM with that model.

Thank you for the response, I tried today and got the following error

import os
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Setting the environment variable as suggested
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'

llm = LLM(model="google/gemma-2-2b-it", enable_lora=True)

ValueError: Please use Flashinfer backend for models with logits_soft_cap (i.e., Gemma-2). Otherwise, the output might be wrong. Set Flashinfer backend by export VLLM_ATTENTION_BACKEND=FLASHINFER.

vLLM version 0.5.3

arunpatala · 2024-08-02T05:23:33Z

same error for me using latest docker image

DarkLight1337 · 2024-08-02T05:30:35Z

Provided that the model architecture is the same as the 9B model, the 2B model should already be supported. Just try running vLLM with that model.

Thank you for the response, I tried today and got the following error
import os
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Setting the environment variable as suggested
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'

llm = LLM(model="google/gemma-2-2b-it", enable_lora=True)
ValueError: Please use Flashinfer backend for models with logits_soft_cap (i.e., Gemma-2). Otherwise, the output might be wrong. Set Flashinfer backend by export VLLM_ATTENTION_BACKEND=FLASHINFER.
vLLM version 0.5.3

Can you show the full log?

Zbaoli · 2024-08-02T05:38:10Z

i try to use export VLLM_ATTENTION_BACKEND=FLASHINFER to solve this error,
but another error occured:

[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 155, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 441, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 896, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1272, in execute_model
[rank0]:     BatchDecodeWithPagedKVCacheWrapper(
[rank0]: TypeError: 'NoneType' object is not callable

DarkLight1337 · 2024-08-02T05:43:23Z

[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1272, in execute_model
[rank0]:     BatchDecodeWithPagedKVCacheWrapper(
[rank0]: TypeError: 'NoneType' object is not callable

This means you don't have FlashInfer installed.

arunpatala · 2024-08-02T05:52:33Z

thanks. I was able to run with setting environment variable in docker command as follows:

volume_hf=$HF_HOME
docker run --runtime nvidia --gpus all
-v $volume_hf:/root/.cache/huggingface
-e VLLM_ATTENTION_BACKEND=FLASHINFER
-p 8000:8000
--ipc=host
vllm/vllm-openai:latest
--model "google/gemma-2-2b-it"
--max-model-len 4096
--max-num-seqs 8
--served-model-name model

Zbaoli · 2024-08-02T06:20:48Z

[rank0]:   File "/root/miniconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1272, in execute_model
[rank0]:     BatchDecodeWithPagedKVCacheWrapper(
[rank0]: TypeError: 'NoneType' object is not callable

This means you don't have FlashInfer installed.

yes, maybe it's the reason, but i encountered this error, i install flashinfer pacakge by using pip install flashinfer command.

>>> pip show flashinfer
Name: flashinfer
Version: 1.0.0
Summary: White Hat Researcher
Home-page: UNKNOWN
Author: Pastaga
Author-email: UNKNOWN
License: MIT
Location: /root/miniconda3/envs/vllm/lib/python3.11/site-packages
Requires:
Required-by:
>>> python -c "import flashinfer"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'flashinfer'

DarkLight1337 · 2024-08-02T06:26:33Z

Make sure you're installing FlashInfer for the correct version of PyTorch/CUDA: https://github.com/flashinfer-ai/flashinfer?tab=readme-ov-file#installation

DarkLight1337 · 2024-08-02T06:29:28Z

Also, vLLM requires FlashInfer v0.0.8 (refer to the part about Gemma 2 in https://github.com/vllm-project/vllm/releases/tag/v0.5.1)

hornikmatej · 2024-08-02T10:13:46Z

Also, vLLM requires FlashInfer v0.0.8 (refer to the part about Gemma 2 in https://github.com/vllm-project/vllm/releases/tag/v0.5.1)

Yes, you are right, works with v0.0.8 but not with v0.1.3, throws following error:

[rank0]:   File "/home/ubuntu/clone-brain/.venv/lib/python3.11/site-packages/flashinfer/prefill.py", line 791, in begin_forward
[rank0]:     self._wrapper.begin_forward(
[rank0]: RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257

Using vLLM 0.5.3.post1

Maximgitman · 2024-08-02T23:15:01Z

Here are the steps that helped me run fine-tuned gemma-2-2b-it with vLLM:
Maybe will help someone

Install and Verify vLLM
Make sure the vLLM version is 0.5.3.

!pip install vllm==0.5.3
import vllm
print(vllm.__version__)
0.5.3

Install Flashinfer
Follow the instructions here to install Flashinfer.
Check your torch version and CUDA compatibility:

import torch
print(torch.__version__)  # Should print: 2.3.1+cu121
print(torch.version.cuda) # Should print: 12.1

Based on the documentation, gemma runs on version 0.08. vLLM requires FlashInfer v0.0.8 (refer to the part about Gemma 2 in vLLM releases and Flashinfer documentation).
!pip install flashinfer==0.0.8 -i https://flashinfer.ai/whl/cu121/torch2.3/

Update VLLM Backend Variable in Environment

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

Test vLLM

from vllm import LLM, SamplingParams

llm = LLM(model=<gemma-2-2b-model>, trust_remote_code=True)

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=512,
    top_p=0.95,
    top_k=1,
)

prompts = [
    test_data[random.randint(0, test_data.shape[0])]["text"],
]

outputs = llm.generate(
    prompts,
    sampling_params
)

# Expected output:
# Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.24s/it, est. speed input: 991.44 toks/s, output: 87.79 toks/s]

Hardware
NVIDIA A100-SXM4-40GB

Should it be faster?

github-actions · 2024-11-01T02:04:47Z

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions · 2024-12-01T02:14:02Z

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

fgenie · 2024-12-27T10:21:51Z

Here are the steps that helped me run fine-tuned gemma-2-2b-it with vLLM: Maybe will help someone

Install and Verify vLLM
Make sure the vLLM version is 0.5.3.
!pip install vllm==0.5.3
import vllm
print(vllm.__version__)
0.5.3
Install Flashinfer
Follow the instructions here to install Flashinfer.
Check your torch version and CUDA compatibility:
import torch
print(torch.__version__)  # Should print: 2.3.1+cu121
print(torch.version.cuda) # Should print: 12.1
Based on the documentation, gemma runs on version 0.08. vLLM requires FlashInfer v0.0.8 (refer to the part about Gemma 2 in vLLM releases and Flashinfer documentation). !pip install flashinfer==0.0.8 -i https://flashinfer.ai/whl/cu121/torch2.3/

Update VLLM Backend Variable in Environment
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
Test vLLM
from vllm import LLM, SamplingParams

llm = LLM(model=<gemma-2-2b-model>, trust_remote_code=True)

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=512,
    top_p=0.95,
    top_k=1,
)

prompts = [
    test_data[random.randint(0, test_data.shape[0])]["text"],
]

outputs = llm.generate(
    prompts,
    sampling_params
)

# Expected output:
# Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.24s/it, est. speed input: 991.44 toks/s, output: 87.79 toks/s]
Hardware NVIDIA A100-SXM4-40GB

Should it be faster?

After following this, for whom might still encounter some runtime error like here, it is likely flashinfer version issue (mine was vllm==0.5.3 and flashinfer==0.1.6).

#7070 suggests installing flashinfer==0.1.2 which resolved my problem. Hope it helped!

Maximgitman added the new model Requests to new models label Aug 2, 2024

DarkLight1337 mentioned this issue Aug 3, 2024

[Misc]: Support for Shieldgemma model #7084

Closed

DarkLight1337 changed the title ~~[New Model]:~~ [Usage]: Unable to run gemma-2-2b-it Aug 3, 2024

DarkLight1337 added usage How to use vllm and removed new model Requests to new models labels Aug 3, 2024

spitis mentioned this issue Aug 20, 2024

[Bug]: Gemma2 models inference using vLLM 0.5.4 produces incorrect responses #7650

Open

github-actions bot added the stale label Nov 1, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Usage]: Unable to run `gemma-2-2b-it` #7060

[Usage]: Unable to run `gemma-2-2b-it` #7060

Maximgitman commented Aug 2, 2024

DarkLight1337 commented Aug 2, 2024

Maximgitman commented Aug 2, 2024

arunpatala commented Aug 2, 2024 •

edited

Loading

DarkLight1337 commented Aug 2, 2024

Zbaoli commented Aug 2, 2024

DarkLight1337 commented Aug 2, 2024 •

edited

Loading

arunpatala commented Aug 2, 2024

Zbaoli commented Aug 2, 2024

DarkLight1337 commented Aug 2, 2024 •

edited

Loading

DarkLight1337 commented Aug 2, 2024 •

edited

Loading

hornikmatej commented Aug 2, 2024

Maximgitman commented Aug 2, 2024

github-actions bot commented Nov 1, 2024

github-actions bot commented Dec 1, 2024

fgenie commented Dec 27, 2024

[Usage]: Unable to run gemma-2-2b-it #7060

[Usage]: Unable to run gemma-2-2b-it #7060

Comments

Maximgitman commented Aug 2, 2024

The model to consider.

The closest model vllm already supports.

What's your difficulty of supporting the model you want?

DarkLight1337 commented Aug 2, 2024

Maximgitman commented Aug 2, 2024

arunpatala commented Aug 2, 2024 • edited Loading

DarkLight1337 commented Aug 2, 2024

Zbaoli commented Aug 2, 2024

DarkLight1337 commented Aug 2, 2024 • edited Loading

arunpatala commented Aug 2, 2024

Zbaoli commented Aug 2, 2024

DarkLight1337 commented Aug 2, 2024 • edited Loading

DarkLight1337 commented Aug 2, 2024 • edited Loading

hornikmatej commented Aug 2, 2024

Maximgitman commented Aug 2, 2024

github-actions bot commented Nov 1, 2024

github-actions bot commented Dec 1, 2024

fgenie commented Dec 27, 2024

[Usage]: Unable to run `gemma-2-2b-it` #7060

[Usage]: Unable to run `gemma-2-2b-it` #7060

arunpatala commented Aug 2, 2024 •

edited

Loading

DarkLight1337 commented Aug 2, 2024 •

edited

Loading

DarkLight1337 commented Aug 2, 2024 •

edited

Loading

DarkLight1337 commented Aug 2, 2024 •

edited

Loading