[Feature]: GGUF quantization with tensor parallelism #7662
Comments
@chrismrutherford thanks for reporting, we just landed a fix!
Hi, I'm encountering an issue and wanted to report it here. The following is the Python code that I used for testing. It worked well with

from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

def run_gguf_inference(model_path):
    # Load the GGUF checkpoint, sharding it across 2 GPUs.
    llm = LLM(
        model=model_path,
        max_model_len=4096,
        tokenizer="meta-llama/Meta-Llama-3.1-8B-Instruct",
        tensor_parallel_size=2,
    )
    tokenizer = llm.get_tokenizer()
    # Format a single-turn conversation with the model's chat template.
    conversations = tokenizer.apply_chat_template(
        [{'role': 'user', 'content': 'what is the future of AI?'}],
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = llm.generate(
        [conversations],
        SamplingParams(temperature=0, max_tokens=1000),
    )
    for output in outputs:
        print(output)

if __name__ == "__main__":
    repo_id = "bullerwins/Meta-Llama-3.1-8B-Instruct-GGUF"
    filename = "Meta-Llama-3.1-8B-Instruct-Q2_K.gguf"
    # Download the GGUF file from the Hugging Face Hub, then run inference.
    model = hf_hub_download(repo_id, filename=filename)
    run_gguf_inference(model)
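Until tensor parallelism is supported for GGUF, a single-GPU run should avoid the failing code path. This is a minimal workaround sketch, assuming one GPU has enough memory for the Q2_K file; it is not an officially documented fallback:

from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

# Single-GPU fallback: keep tensor_parallel_size at 1 so the GGUF weights
# are loaded without the unsupported tensor-parallel path.
model = hf_hub_download(
    "bullerwins/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q2_K.gguf",
)
llm = LLM(
    model=model,
    max_model_len=4096,
    tokenizer="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,  # TP > 1 is what triggers the ValueError
)
out = llm.generate(
    ["What is the future of AI?"],
    SamplingParams(temperature=0, max_tokens=64),
)
print(out[0].outputs[0].text)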
🚀 The feature, motivation and pitch
When I launch vLLM with a GGUF model (Q8_0 snapshot) and Ray (--tensor-parallel-size 8, across 2 nodes with 4 GPUs each), I get the following error message:
(RayWorkerWrapper pid=11033) ERROR 08-19 16:07:35 worker_base.py:438] ValueError: GGUF quantization hasn't supported tensor parallelism yet. [repeated 2x across cluster]
Could you please add tensor-parallelism support for GGUF quantization?
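For reference, here is a minimal sketch of the kind of launch that hits this check, written against the offline LLM API; the GGUF path is a placeholder and the explicit Ray backend selection is an assumption about the multi-node setup, not the exact invocation from this report:

from vllm import LLM

# Hypothetical reproduction: a GGUF checkpoint sharded across 8 GPUs on a
# Ray cluster spanning 2 nodes. At the time of this issue, loading fails with
# ValueError: GGUF quantization hasn't supported tensor parallelism yet.
# as soon as tensor_parallel_size > 1.
llm = LLM(
    model="/path/to/model-Q8_0.gguf",  # local GGUF file (illustrative path)
    tokenizer="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=8,            # 2 nodes x 4 GPUs
    distributed_executor_backend="ray",  # multi-node execution via Ray
)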
Alternatives
No response
Additional context
No response