Fork main #1523
Closed
Feature #182
Because I need to use baichuan2-13B with more than one LoRA adapter at the same time, I tried to implement this feature myself. It works well for my use case now, and the feature was requested in #182. Comments are welcome, and I'll do my best to address them.
Add Features
I use peft to implement the multi-LoRA adapters. Because we want to serve more than one LoRA adapter, we cannot merge the LoRA weights into the base model, so there is extra computation that increases latency. If you only need a single LoRA adapter, simply do not enable this feature. I am still working on a more efficient implementation of multiple LoRA adapters within a single batch.
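To make the latency point concrete, here is a minimal sketch in plain PyTorch of the extra computation an unmerged LoRA adapter adds to a linear layer. The class name LinearWithLoRA, the rank/alpha defaults, and the wrapper structure are illustrative assumptions, not the actual implementation in this PR.

```python
import torch
import torch.nn as nn


class LinearWithLoRA(nn.Module):
    """Illustrative only: a linear layer with an unmerged LoRA adapter."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base                                               # frozen base weight W
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)    # A
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)   # B
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base path: x @ W^T -- the only matmul needed once weights are merged.
        out = self.base(x)
        # Extra LoRA path: (x @ A^T) @ B^T, scaled. Keeping the adapter
        # unmerged means paying for these two small matmuls on every forward.
        return out + self.lora_b(self.lora_a(x)) * self.scaling
```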
Changed files
requirements.txt
Add peft for LoRA adapters
tests/kernels/test_blora.py
Test scripts for multi-LoRA computation
tests/kernels/test_normhead.py
Test scripts for the NormHead layer used in baichuan2-13B (a rough sketch of NormHead follows this list)
vllm/engine/arg_utils.py
Add two args used to initialize LoRA adapters when loading the model
vllm/engine/async_llm_engine.py
Check that the LoRA config args are valid.
vllm/engine/llm_engine.py
Check that the LoRA config args are valid, and pass the LoRA config to the workers.
vllm/entrypoints/llm.py
Add LoRA config parameters and pass them to llm_engine
vllm/model_executor/lora_utils.py
Create LoRA adapters and replace the target modules in the base model (see the module-replacement sketch after this list)
vllm/model_executor/model_loader.py
Support baichuan2 and add LoRA adapters when initializing the model.
vllm/model_executor/models/__init__.py
Support baichuan2
vllm/model_executor/models/baichuan.py
Support baichuan2 and schedule the LoRA information after each iteration according to the metadata.
Implement the method to load LoRA weights in parallel.
vllm/model_executor/parallel_utils/layers.py
Implement the LoRA module with ColumnParallelLinear and RowParallelLinear
vllm/sampling_params.py
Add a lora_id parameter to specify the LoRA adapter to use for each prompt (see the usage sketch after this list).
vllm/worker/worker.py
Pass the LoRA config when initializing the model
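For context on the NormHead tests, this is a minimal sketch of how I understand Baichuan2's NormHead output layer, which L2-normalizes the lm_head weight before the final projection; the class below is illustrative, not the code added in this PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NormHead(nn.Module):
    """Sketch of Baichuan2's output head: the lm_head weight rows are
    L2-normalized before projecting hidden states onto the vocabulary."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(vocab_size, hidden_size))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        norm_weight = F.normalize(self.weight)       # normalize each vocab row
        return F.linear(hidden_states, norm_weight)  # logits over the vocabulary
```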
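And here is a rough sketch of the module-replacement idea behind vllm/model_executor/lora_utils.py. The helper name replace_with_lora, the target-module suffixes, and the LinearWithLoRA wrapper (from the sketch near the top) are assumptions for illustration; the real code wraps vLLM's ColumnParallelLinear and RowParallelLinear layers rather than plain nn.Linear.

```python
import torch.nn as nn


def replace_with_lora(model: nn.Module,
                      target_suffixes=("W_pack", "o_proj"),  # assumed target modules
                      rank: int = 8, alpha: int = 16) -> nn.Module:
    """Hypothetical helper: swap matching nn.Linear children for LoRA wrappers."""
    # Snapshot the module tree first so freshly inserted wrappers are not
    # traversed (and wrapped) again.
    for _, module in list(model.named_modules()):
        for child_name, child in list(module.named_children()):
            if isinstance(child, nn.Linear) and child_name.endswith(target_suffixes):
                # LinearWithLoRA is the illustrative wrapper defined earlier.
                setattr(module, child_name, LinearWithLoRA(child, rank, alpha))
    return model
```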
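Finally, a hedged end-to-end usage sketch. The lora_id field on SamplingParams comes from this PR; the engine argument name lora_paths, the adapter paths, and the id-to-adapter numbering are placeholders, since the exact names of the two new args are not listed above.

```python
from vllm import LLM, SamplingParams

# `lora_paths` is a placeholder for the new LoRA engine args added in
# arg_utils.py; the real argument names may differ.
llm = LLM(
    model="baichuan-inc/Baichuan2-13B-Chat",
    lora_paths=["/path/to/adapter_a", "/path/to/adapter_b"],
)

# Each prompt can pick its own adapter via lora_id on SamplingParams.
# The mapping of ids to adapters shown here is assumed.
for prompt, lora_id in [("Hello from adapter A", 1),
                        ("Hello from adapter B", 2)]:
    outputs = llm.generate([prompt], SamplingParams(max_tokens=64, lora_id=lora_id))
    print(outputs[0].outputs[0].text)
```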
Thanks and looking forward to your comments!