[Kernel] Full Tensor Parallelism for LoRA Layers #3524
Conversation
…bility. These make use of the design from S-LoRA
@FurtherAI Thank you for your contribution! I will be happy to help you get the tests converted to pytest.
@FurtherAI Thanks for the PR. General feedback:
I'll try to reduce the duplicate code following what I mentioned and add that by next week.
…hanges to test_layers which should test the new layers with pytest.
@Yard1 Check out the latest commit. Minimized the duplicate code, allowed the user to select the implementation, and modified the `test_layers` tests.
Thanks, this looks good! I think it should be straightforward to convert the sharded test to use pytest. I am a bit worried about its complexity - it would be good to add some more comments to explain it and what happens in the `Worker` class.
Could we also add low-level unit tests for the new dimensions in the `test_punica.py` file?
I think we should drop my …. I have already proposed the updated ….
@FurtherAI Ok, that sounds good, let's drop it. Can you merge master into your branch to resolve conflicts?
Yes, I'll merge master in the next commit.
Forgot about this. The error on the last test is an OOM error; not sure if it is related to the changes. But I'll merge main, add the tests, and see if it occurs again.
Merged main and added the extra Punica tests. The formatter is still a little mad about two things:
Thanks @FurtherAI, this is looking good! I think we can merge them after conflicts are resolved, and then we can work on integrating those. Thank you for your patience!
Honestly, let's just explicitly define them in a list or something. This auto detection is too magical.
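A minimal sketch of the explicit-list idea, assuming a simple registry; the `LORA_WRAPPERS` list, `register_lora_wrapper`, and `can_replace` names are illustrative, not vLLM's actual API:

```python
from typing import Optional, Type

import torch.nn as nn

# Explicit registry of LoRA wrapper classes, checked in order instead of
# being discovered automatically (illustrative names, not vLLM's API).
LORA_WRAPPERS: list = []


def register_lora_wrapper(cls: Type) -> Type:
    """Decorator that adds a wrapper class to the explicit registry."""
    LORA_WRAPPERS.append(cls)
    return cls


def wrap_with_lora(layer: nn.Module) -> Optional[Type]:
    """Return the first registered wrapper that claims it can replace `layer`."""
    for wrapper in LORA_WRAPPERS:
        if wrapper.can_replace(layer):  # `can_replace` is a hypothetical hook
            return wrapper
    return None
```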
Add …
…sor parallelism for QKV.
@Yard1 could you look at this? It can't figure out what formatting it wants for ….
@FurtherAI should be good now
Tyvm.
Thanks! Let's work on a follow-up to use the new layers.
@FurtherAI …
Yes, sorry about that. Made a new PR to fix that. I was able to reproduce and fix your error, so it should be correct.
Co-authored-by: Antoni Baum <[email protected]>
@FurtherAI seems like there may be a bug with llama3-70b-instruct. Getting the following error on startup of vLLM (during engine initialization, e.g. counting num blocks) when using tensor parallel size 8 with A100 GPUs. Full error stack:
Let me know if you are able to reproduce.
@sfc-gh-ybsat Thanks for finding this. I'm waiting on access to Llama 3 or a way to skip that before I can reproduce it. It seems to be the buffer which is not contiguous. Quick fix is to just call `.contiguous()` on it. Not sure why the buffer can be discontiguous, though, since the buffers should be contiguous when they're created and all-gather shouldn't change that.

Update: I was not able to reproduce, at least in eager mode. Graph capture was either taking too long or stuck waiting for NCCL. Running the example script …
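As a side note on the contiguity point, a small standalone PyTorch illustration (not vLLM code): strided views such as a transpose share storage without being contiguous, and calls that require contiguous memory fail on them until `.contiguous()` copies the data into a dense layout.

```python
import torch

x = torch.randn(4, 8)
y = x.t()                     # transposed view: same storage, swapped strides
print(y.is_contiguous())      # False
# y.view(-1)                  # would raise: view() requires contiguous memory
z = y.contiguous()            # copies into a contiguous layout
print(z.is_contiguous())      # True
```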
@sfc-gh-ybsat Do you have some minimal code that can reproduce the error? |
Thanks @FurtherAI for the follow-up. The example is adjusted from https://docs.vllm.ai/en/latest/getting_started/examples/multilora_inference.html
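A hedged sketch of what such a reproduction script could look like, adapted from the linked multilora_inference example; the model name and adapter path are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# TP=8 multi-LoRA setup; the reported error appears during engine
# initialization, so constructing the LLM is enough to trigger it.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    enable_lora=True,
    tensor_parallel_size=8,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(
    ["Write a SQL query that lists all users."],
    sampling_params,
    lora_request=LoRARequest("sql-lora", 1, "/path/to/lora/adapter"),
)
print(outputs[0].outputs[0].text)
```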
@sfc-gh-ybsat Thanks, I can reproduce. Calling contiguous is correct, I think, to fix the error, but there's another error with the graph capture. You should be able to run it in eager mode for now with the line changed as below:

`buffers = tensor_model_parallel_all_gather(buffers).contiguous()`

@Yard1, I think it is throwing an error during the all-gather; do you know anything about what's wrong with this or with changing it to call `.contiguous()`?
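For reference, a minimal sketch of the eager-mode workaround mentioned above, assuming `enforce_eager=True` (vLLM's standard switch to skip CUDA graph capture) and the same setup as the repro script:

```python
from vllm import LLM

# Skip CUDA graph capture until the graph-capture error is resolved.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    enable_lora=True,
    tensor_parallel_size=8,
    enforce_eager=True,
)
```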
The current multi-LoRA layers run only one of the two LoRA matmuls with tensor parallelism. Here, I add layers that are fully sharded, as in, both LoRA matmuls run with tensor parallelism. This gives better performance at higher sequence lengths, ranks, tensor parallel sizes, and so on, with similar performance at smaller scales.
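A rough sketch of the column-parallel case of this scheme (following the S-LoRA design referenced above); the function, shapes, and `all_gather` callable are illustrative, not the PR's kernels. Both LoRA matrices are sharded across tensor-parallel ranks, and only the small (num_tokens, lora_rank) intermediate is all-gathered between the two matmuls.

```python
import torch

def fully_sharded_lora(x, lora_a_shard, lora_b_shard, all_gather):
    # x:            (num_tokens, hidden)          replicated input
    # lora_a_shard: (hidden, r // tp_size)        A sharded along the rank dim
    # lora_b_shard: (r, out_features // tp_size)  B sharded along the output dim
    buf = x @ lora_a_shard            # sharded matmul, (num_tokens, r // tp_size)
    buf = all_gather(buf)             # cheap: only (num_tokens, r) is communicated
    return buf @ lora_b_shard         # sharded matmul, (num_tokens, out // tp_size)
```

The result matches the sharding of the column-parallel base layer's output, so it can be added to the base output without further communication.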
One thing that needs to be added is the ability for the user to select the fully sharded versions, or vLLM should select these based on some heuristics. The tests I wrote also need to be converted to pytest; they should be a copy of the existing LoRA tests with the layers replaced by the fully sharded versions, or added to those tests. I had trouble running the tests myself.
I do not add versions for the sampler, since I had mixed results with the improvement, or the vocab parallel embedding, since it did not look like it would benefit very much.
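On the open question of how a user would opt in, one option is a plain engine argument; this is a hypothetical sketch, and the `fully_sharded_loras` name is illustrative, not something this PR adds.

```python
from vllm import LLM

# Hypothetical opt-in flag for the fully sharded LoRA layers; the argument
# name is illustrative, not part of this PR as submitted.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    enable_lora=True,
    tensor_parallel_size=8,
    fully_sharded_loras=True,
)
```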
ADD TO #1804