
[Feature]: LoRA support for Mixtral GPTQ and AWQ #5540

Closed

StrikerRUS opened this issue Jun 14, 2024 · 8 comments

StrikerRUS commented Jun 14, 2024

🚀 The feature, motivation and pitch

Please consider adding LoRA support for GPTQ- and AWQ-quantized Mixtral models.

I guess that after #4012 it should be technically possible.

Alternatives

No response

Additional context

My Docker compose:
---
version: "3.8"

services:
  vllm-vllm:
    image: mirror.gcr.io/vllm/vllm-openai:v0.4.2
    container_name: vllm-vllm
# --model=casperhansen/mixtral-instruct-awq
    command: --model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --download-dir=/root/.cache/huggingface/hub/ --dtype=half --gpu-memory-utilization=0.9 --enforce-eager --device=cuda --disable-log-stats --enable-lora --lora-modules mixtral-finetune-0-1-5=/root/adapters/
    ports:
      - xxxx:8000
    restart: unless-stopped
    healthcheck:
      test: /bin/bash -c "cat < /dev/null > /dev/tcp/vllm-vllm/8000"
      interval: 10s
      start_period: 2m
    logging:
      options:
        max-size: 500mb
        max-file: 4
    environment:
      - HF_HOME=/root/.cache/huggingface/
    volumes:
      - vllm_models:/root/.cache/huggingface/
      - vllm_adapters:/root/adapters/
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['all']
              capabilities: [gpu]

volumes:
  vllm_models:
    driver: local
    driver_opts:
      type: 'none'
      o: 'bind'
      device: '/storage/gpt-project/Models_local/hf_local_0_1_0/'
  vllm_adapters:
    driver: local
    driver_opts:
      type: 'none'
      o: 'bind'
      device: '/storage/classifier-project/Models/Mixtral_finetune_0_1_5/checkpoint-7308/'
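
For reference, the same setup can be reproduced without Docker via vLLM's offline API (a minimal sketch assuming the LoRARequest interface as of v0.4.x; the prompt and the adapter id are placeholders). With a quantized Mixtral model it fails with the same ValueError shown in the log below.

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Same options the OpenAI-compatible server receives on the command line above.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    dtype="half",
    enforce_eager=True,
    enable_lora=True,  # corresponds to --enable-lora
)

outputs = llm.generate(
    ["[INST] Classify this text. [/INST]"],  # placeholder prompt
    SamplingParams(max_tokens=64),
    # name / integer id / local path of the adapter, mirroring --lora-modules
    lora_request=LoRARequest("mixtral-finetune-0-1-5", 1, "/root/adapters/"),
)
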
Error log:
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
WARNING 06-14 12:29:50 config.py:1086] Casting torch.bfloat16 to torch.float16.
INFO 06-14 12:29:50 config.py:177] The model is convertible to Marlin format. Using Marlin kernel.
WARNING 06-14 12:29:50 config.py:976] gptq_marlin quantization is not tested with LoRA yet.
INFO 06-14 12:29:50 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', speculative_config=None, tokenizer='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir='/root/.cache/huggingface/hub/', load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ)
INFO 06-14 12:29:50 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 06-14 12:29:51 selector.py:27] Using FlashAttention-2 backend.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 168, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 324, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 160, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 23, in _init_executor
[rank0]:     self._init_non_spec_worker()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 69, in _init_non_spec_worker
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 118, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 164, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 222, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 90, in _initialize_model
[rank0]:     **_get_model_initialization_kwargs(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 70, in _get_model_initialization_kwargs
[rank0]:     raise ValueError(
[rank0]: ValueError: Model MixtralForCausalLM does not support LoRA, but LoRA is enabled. Support for this model may be added in the future. If this is important to you, please open an issue on github.
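
For context, the failure comes from a capability check in the model loader rather than from the quantization kernels themselves. A simplified sketch of the kind of guard involved (not the exact vLLM source) is:

# Simplified sketch of the guard behind the ValueError above (not the exact
# vLLM v0.4.x source): with LoRA enabled, the loader refuses model classes
# that do not advertise LoRA support.
def get_model_initialization_kwargs(model_class, lora_config):
    extra_kwargs = {}
    if hasattr(model_class, "supported_lora_modules"):
        # LoRA-capable models expose this class-level attribute and accept
        # a lora_config argument in their constructor.
        extra_kwargs["lora_config"] = lora_config
    elif lora_config is not None:
        raise ValueError(
            f"Model {model_class.__name__} does not support LoRA, "
            "but LoRA is enabled.")
    return extra_kwargs
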
@robertgshaw2-redhat (Collaborator)

:)

hmellor (Collaborator) commented Jul 4, 2024

@StrikerRUS, has the PR you mentioned handled your use case?

@StrikerRUS (Author)

@hmellor Nope. LoRA adapters still cannot be used with quantized Mixtral models.
There is no supported_lora_modules attribute in the quantized MixtralForCausalLM class.
Compare with the non-quantized version of the MixtralForCausalLM class:

# LoRA specific attributes
supported_lora_modules = [
    "qkv_proj",
    "o_proj",
    "embed_tokens",
    "lm_head",
]

Even after adding that attribute and adjusting the method arguments, vLLM crashes with a tensor shape mismatch error. I guess some further work is needed to bring full LoRA support.
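
For illustration, the attempted patch amounts to something like the following (a hypothetical sketch against vLLM v0.4.x internals; as reported above, it only gets past the loader check and then fails with a tensor shape mismatch):

from torch import nn

# Hypothetical sketch only: mirror the LoRA attributes of the non-quantized
# MixtralForCausalLM onto the quantized variant. This satisfies the loader's
# hasattr() check, but the quantized/MoE projection layers still need real
# LoRA wiring, hence the shape-mismatch crash mentioned above.
class MixtralForCausalLM(nn.Module):
    # LoRA specific attributes copied from the non-quantized class
    supported_lora_modules = [
        "qkv_proj",
        "o_proj",
        "embed_tokens",
        "lm_head",
    ]

    def __init__(self, config, quant_config=None, lora_config=None):
        # ... the existing quantized-Mixtral __init__, now also accepting
        # (and so far ignoring) lora_config ...
        super().__init__()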

@ksjadeja

I am facing a similar issue. Did you find any workaround, @StrikerRUS?

@StrikerRUS (Author)

@ksjadeja Switched to Llama 3.1 😄

@ksjadeja

@hmellor Do you think this is going to get picked up by someone?


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Dec 17, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions bot closed this as not planned (stale) Jan 17, 2025