performance issue #323
Hey @sleepwalker2017, thanks for reporting this. We'll definitely take a look on our side to try to repro and see if we get similar results. One thing to point out, since you're running on multiple GPUs: there was a bug that was just recently fixed (PR #324) that was disabling the optimized SGMV kernels when using small-rank LoRAs like this across multiple GPUs. It should now be fixed if you want to pull the latest image and try running again. Regardless, we'll take a look and share our results as well. We do have benchmark results we can share for single GPU that might be helpful; we haven't run as extensive tests with multi-GPU, and there will be some additional overhead from cross-device comms. Are you connecting your GPUs via NVLink or PCIe?
It's PCIe.
@tgaddair I have noticed the same performance drop with 1× 4090 and batch size 1, input text length ~2K chars on average, and ~300 generated tokens on average. Other info:
performance:
Is there any potential optimization planned?
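For anyone wanting to reproduce this kind of base-vs-adapter comparison, here is a minimal timing sketch. It assumes the TGI-compatible `/generate` endpoint that LoRAX exposes, with `adapter_id` passed in the request parameters and `details.generated_tokens` in the response; the URL, prompt, and adapter ID are placeholders, not the config used above.

```python
# Rough base-model vs. adapter latency/throughput check against a LoRAX
# /generate endpoint. URL, prompt, and adapter ID below are placeholders.
import time
import requests

URL = "http://127.0.0.1:8080/generate"      # assumed local deployment
PROMPT = "some long input text " * 100      # ~2K characters, stand-in prompt

def timed_generate(adapter_id=None, max_new_tokens=300):
    """Send one request and return (latency in seconds, generated tokens/sec)."""
    params = {"max_new_tokens": max_new_tokens, "details": True}
    if adapter_id is not None:
        params["adapter_id"] = adapter_id    # omit for the base model
    start = time.perf_counter()
    resp = requests.post(URL, json={"inputs": PROMPT, "parameters": params})
    elapsed = time.perf_counter() - start
    resp.raise_for_status()
    generated = resp.json()["details"]["generated_tokens"]
    return elapsed, generated / elapsed

base_latency, base_tps = timed_generate()
lora_latency, lora_tps = timed_generate(adapter_id="my-org/test-lora")  # placeholder adapter
print(f"base:    {base_latency:.2f}s  {base_tps:.1f} tok/s")
print(f"adapter: {lora_latency:.2f}s  {lora_tps:.1f} tok/s")
print(f"slowdown: {(lora_latency / base_latency - 1) * 100:.1f}%")
```

Averaging over several runs and discarding the first adapter request (which may include one-time adapter loading) gives a fairer comparison.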
@tgaddair Hi, I have got updated results after using the latest main code, same config, with the performance below:
About a 21.6% perf drop, so is that reasonable? And could you share what optimization has been made? Thanks.
Hi, I'm benchmarking lora-x on 2× A30.
I'm getting poor performance; is that normal?
In the first sheet, I send requests for the base model; the batch column means the number of clients.
In the second sheet, I send requests for multiple LoRAs; I notice the token throughput is low and GPU utilization is also low.
Here are my scripts:
Client: I use Locust to start multiple clients to send requests. The core code is here, each request with its own unique adapter id:
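A minimal sketch of what such a Locust client can look like, assuming a LoRAX-style `/generate` endpoint with `adapter_id` passed in the request parameters; the adapter IDs, prompt, and endpoint below are illustrative placeholders, not the original script:

```python
# Minimal Locust client sketch: each request picks its own adapter ID and
# posts to a LoRAX-style /generate endpoint. Adapter IDs and the prompt
# below are placeholders.
import itertools

from locust import HttpUser, task, between

# Placeholder pool of adapter IDs; every request grabs the next one.
ADAPTER_IDS = [f"my-org/lora-adapter-{i}" for i in range(8)]
_adapter_cycle = itertools.cycle(ADAPTER_IDS)

PROMPT = "some long input text " * 100  # stand-in for the real prompt


class LoraxUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task
    def generate(self):
        adapter_id = next(_adapter_cycle)
        payload = {
            "inputs": PROMPT,
            "parameters": {
                "max_new_tokens": 300,
                "adapter_id": adapter_id,
            },
        }
        # Group stats per adapter so per-LoRA latency shows up separately.
        self.client.post("/generate", json=payload, name=f"generate[{adapter_id}]")
```

Run with something like `locust -f locustfile.py --host http://127.0.0.1:8080 --users 16 --spawn-rate 4`, sweeping `--users` to reproduce the different client counts in the sheets above.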
Any benchmark results for lora-x? Or any benchmark example code? Thank you.