
[Feature]: LoRA support for Mixtral GPTQ and AWQ #5540

Closed

StrikerRUS opened this issue Jun 14, 2024 · 8 comments

StrikerRUS commented Jun 14, 2024

🚀 The feature, motivation and pitch

Please consider adding LoRA support for GPTQ- and AWQ-quantized Mixtral models.

I guess that after #4012 it should be technically possible.

Alternatives

No response

Additional context

My Docker compose:
---
version: "3.8"

services:
  vllm-vllm:
    image: mirror.gcr.io/vllm/vllm-openai:v0.4.2
    container_name: vllm-vllm
# --model=casperhansen/mixtral-instruct-awq
    command: --model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --download-dir=/root/.cache/huggingface/hub/ --dtype=half --gpu-memory-utilization=0.9 --enforce-eager --device=cuda --disable-log-stats --enable-lora --lora-modules mixtral-finetune-0-1-5=/root/adapters/
    ports:
      - xxxx:8000
    restart: unless-stopped
    healthcheck:
      test: /bin/bash -c "cat < /dev/null > /dev/tcp/vllm-vllm/8000"
      interval: 10s
      start_period: 2m
    logging:
      options:
        max-size: 500mb
        max-file: 4
    environment:
      - HF_HOME=/root/.cache/huggingface/
    volumes:
      - vllm_models:/root/.cache/huggingface/
      - vllm_adapters:/root/adapters/
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['all']
              capabilities: [gpu]

volumes:
  vllm_models:
    driver: local
    driver_opts:
      type: 'none'
      o: 'bind'
      device: '/storage/gpt-project/Models_local/hf_local_0_1_0/'
  vllm_adapters:
    driver: local
    driver_opts:
      type: 'none'
      o: 'bind'
      device: '/storage/classifier-project/Models/Mixtral_finetune_0_1_5/checkpoint-7308/'
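
For reference, the same setup can be reproduced without Docker via vLLM's offline API (a minimal sketch assuming the LoRARequest interface as of v0.4.x; the prompt and the adapter id are placeholders). With a quantized Mixtral model it fails with the same ValueError shown in the log below.

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Same options the OpenAI-compatible server receives on the command line above.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    dtype="half",
    enforce_eager=True,
    enable_lora=True,  # corresponds to --enable-lora
)

outputs = llm.generate(
    ["[INST] Classify this text. [/INST]"],  # placeholder prompt
    SamplingParams(max_tokens=64),
    # name / integer id / local path of the adapter, mirroring --lora-modules
    lora_request=LoRARequest("mixtral-finetune-0-1-5", 1, "/root/adapters/"),
)
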
Error log:
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
WARNING 06-14 12:29:50 config.py:1086] Casting torch.bfloat16 to torch.float16.
INFO 06-14 12:29:50 config.py:177] The model is convertible to Marlin format. Using Marlin kernel.
WARNING 06-14 12:29:50 config.py:976] gptq_marlin quantization is not tested with LoRA yet.
INFO 06-14 12:29:50 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', speculative_config=None, tokenizer='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir='/root/.cache/huggingface/hub/', load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ)
INFO 06-14 12:29:50 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 06-14 12:29:51 selector.py:27] Using FlashAttention-2 backend.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 168, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 324, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 160, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 23, in _init_executor
[rank0]:     self._init_non_spec_worker()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 69, in _init_non_spec_worker
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 118, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 164, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 222, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 90, in _initialize_model
[rank0]:     **_get_model_initialization_kwargs(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 70, in _get_model_initialization_kwargs
[rank0]:     raise ValueError(
[rank0]: ValueError: Model MixtralForCausalLM does not support LoRA, but LoRA is enabled. Support for this model may be added in the future. If this is important to you, please open an issue on github.
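
For context, the failure comes from a capability check in the model loader rather than from the quantization kernels themselves. A simplified sketch of the kind of guard involved (not the exact vLLM source) is:

# Simplified sketch of the guard behind the ValueError above (not the exact
# vLLM v0.4.x source): with LoRA enabled, the loader refuses model classes
# that do not advertise LoRA support.
def get_model_initialization_kwargs(model_class, lora_config):
    extra_kwargs = {}
    if hasattr(model_class, "supported_lora_modules"):
        # LoRA-capable models expose this class-level attribute and accept
        # a lora_config argument in their constructor.
        extra_kwargs["lora_config"] = lora_config
    elif lora_config is not None:
        raise ValueError(
            f"Model {model_class.__name__} does not support LoRA, "
            "but LoRA is enabled.")
    return extra_kwargs
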
@robertgshaw2-redhat (Collaborator)

:)

hmellor (Collaborator) commented Jul 4, 2024

@StrikerRUS, has the PR you mentioned handled your use case?

@StrikerRUS (Author)

@hmellor Nope. LoRA adapters still cannot be used with quantized Mixtral models.
There is no supported_lora_modules attribute in the quantized MixtralForCausalLM class.
Compare with the non-quantized version of the MixtralForCausalLM class:

# LoRA specific attributes
supported_lora_modules = [
    "qkv_proj",
    "o_proj",
    "embed_tokens",
    "lm_head",
]

Even after adding that attribute and adjusting the method arguments, vLLM crashes with a tensor shape mismatch error. I guess some further work is needed to bring full LoRA support.
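
For illustration, the attempted patch amounts to something like the following (a hypothetical sketch against vLLM v0.4.x internals; as reported above, it only gets past the loader check and then fails with a tensor shape mismatch):

from torch import nn

# Hypothetical sketch only: mirror the LoRA attributes of the non-quantized
# MixtralForCausalLM onto the quantized variant. This satisfies the loader's
# hasattr() check, but the quantized/MoE projection layers still need real
# LoRA wiring, hence the shape-mismatch crash mentioned above.
class MixtralForCausalLM(nn.Module):
    # LoRA specific attributes copied from the non-quantized class
    supported_lora_modules = [
        "qkv_proj",
        "o_proj",
        "embed_tokens",
        "lm_head",
    ]

    def __init__(self, config, quant_config=None, lora_config=None):
        # ... the existing quantized-Mixtral __init__, now also accepting
        # (and so far ignoring) lora_config ...
        super().__init__()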

@ksjadeja

I am facing a similar issue. Did you find any workaround, @StrikerRUS?

@StrikerRUS (Author)

@ksjadeja Switched to Llama 3.1 😄

@ksjadeja

@hmellor Do you think this is going to get picked up by someone?


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Dec 17, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions bot closed this as not planned (stale) Jan 17, 2025