
Split by rows instead of layers for llama.cpp multi-gpu #5435

Merged
33 commits merged into oobabooga:dev from Ph0rk0z:patch-4 on Feb 5, 2024

Conversation

@Ph0rk0z (Contributor) commented Feb 4, 2024

On some cards, the new splitting by layer causes a performance loss. Even on 3090s, GPU utilization drops from over 50% to 43%, and P40s show demonstrable losses. This parameter lets you split by rows, like the original behavior, which should fix those speed issues. The default is still splitting by layers.
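
For reference, here is a minimal sketch of how a row-split toggle can map onto llama-cpp-python, assuming a version that exposes the `split_mode` load parameter; the `row_split` argument and `load_model` helper are illustrative names, not necessarily the exact option this PR adds.

```python
# Illustrative sketch only: the row_split flag and load_model helper are
# hypothetical names, not necessarily what this PR adds.
# Recent llama-cpp-python builds expose LLAMA_SPLIT_MODE_LAYER / LLAMA_SPLIT_MODE_ROW;
# older releases used LLAMA_SPLIT_LAYER / LLAMA_SPLIT_ROW instead.
import llama_cpp

def load_model(model_path: str, row_split: bool = False) -> llama_cpp.Llama:
    # Splitting by layer is llama.cpp's default since ggerganov/llama.cpp#4606;
    # splitting by row restores the older multi-GPU behavior.
    split_mode = (
        llama_cpp.LLAMA_SPLIT_MODE_ROW
        if row_split
        else llama_cpp.LLAMA_SPLIT_MODE_LAYER
    )
    return llama_cpp.Llama(
        model_path=model_path,
        n_gpu_layers=-1,       # offload all layers to the GPUs
        split_mode=split_mode,
    )

# Example usage: llm = load_model("model.gguf", row_split=True)
```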

oobabooga and others added 30 commits December 14, 2023 22:39
@oobabooga (Owner)

Is there a reason to not have split by rows by default if it leads to better performance?

@Ph0rk0z (Contributor, Author) commented Feb 5, 2024

I kept the default behavior of llama.cpp, and I also have no way to test a 4090 or all the different combinations. I can say the P40 gains its 2 or 3 t/s back and the 3090 goes from 40% utilization per GPU to over 5X%.

Unfortunately, nothing brings it back to how it was before ggerganov/llama.cpp#4606.

@oobabooga (Owner)

Fair enough

@oobabooga oobabooga changed the base branch from main to dev February 5, 2024 02:36
@oobabooga oobabooga merged commit 2a45620 into oobabooga:dev Feb 5, 2024
PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Feb 22, 2024
@Ph0rk0z Ph0rk0z deleted the patch-4 branch May 12, 2024 17:39