ggml: create thread pool lazily #2674
Conversation
Does this change have any advantage compared to passing n_threads == 1 to ggml_graph_compute() when CUDA is enabled? No threads would be created, and it seems like a less intrusive solution compared to the proposed one.
// Check whether the node and all of its sources live on the CPU backend.
bool node_and_src_all_cpu = node->backend == GGML_BACKEND_CPU;
for (int j = 0; node_and_src_all_cpu && j < GGML_MAX_SRC; ++j) {
    if (node->src[j] != NULL && node->src[j]->backend != GGML_BACKEND_CPU) {
        node_and_src_all_cpu = false;
    }
}
// If the node or any of its sources are on another backend (e.g. CUDA),
// the CPU threads have nothing useful to do, so run it as a single task.
if (!node_and_src_all_cpu) {
    n_tasks = 1;
}
Long term I would like to see ggml_tensor.backend removed, so I prefer to limit its application, especially inside ggml.
Currently, not really. I personally think creating the thread pool lazily is preferable, but it would not be difficult to just add a check for the number of GPU layers to the user code. @slaren may have an opinion with regards to #2239, but I don't think it would make much of a difference.
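For illustration, a minimal sketch of that user-side check, assuming the application tracks how many layers were offloaded (n_gpu_layers) and how many layers the model has (n_layer); both names and the surrounding setup are placeholders, not code from this PR:

// Hypothetical user-side check (not part of this PR): if every layer was
// offloaded to the GPU, evaluate the graph with a single thread so that
// no worker threads are created at all.
int n_threads = params.n_threads;
if (n_gpu_layers >= n_layer) {
    n_threads = 1;
}

struct ggml_cplan plan = ggml_graph_plan(gf, n_threads);
if (plan.work_size > 0) {
    plan.work_data = malloc(plan.work_size);
}
ggml_graph_compute(gf, &plan);
free(plan.work_data);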
I don't think that this affects the ggml backends interface, other than that it would become obsolete after it and would need to be removed. The problem with the backends interface is the overhead of having to create the threads in each call to
Funny, in my tests it makes no difference at all:

(base) ➜ llama.cpp git:(master) ✗ CUDA_VISIBLE_DEVICES=1 ./build/bin/main --gqa 8 -m ./airoboros-l2-70b-gpt4-2.0.ggmlv3.q4_K_M.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 1 --n-gpu-layers 100 --mul-mat-q -c 4096

system_info: n_threads = 1 / 128 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Building a website can be done in 10 simple steps:
llama_print_timings: load time = 4808.19 ms

system_info: n_threads = 10 / 128 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Building a website can be done in 10 simple steps:
llama_print_timings: load time = 4872.12 ms
Obsoleted by #2915.
Currently, ggml eagerly creates a thread pool when it starts evaluating a graph. However, when all layers are offloaded with CUDA, these threads provide no benefit and just spin, so in that case it is better to manually set the number of threads to 1. This PR makes the thread pool creation lazy: the pool is only created once the first graph node with more than one task is encountered. If all layers are offloaded, every node automatically has a single task, and the thread pool is never created. In my testing, the lazy creation of the thread pool does not affect CPU-only performance.
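For illustration only, a toy self-contained sketch of the lazy scheme described above; the worker body, the per-node task counts, and the loop structure are simplified placeholders, not the actual patch:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

// Toy illustration of lazy thread pool creation: workers are started only
// when the first node that needs more than one task is encountered. If every
// node needs a single task (all layers offloaded), no threads are created.
static void * worker(void * arg) {
    // Real workers would wait for and execute tasks; here they simply exit
    // to keep the example short.
    (void) arg;
    return NULL;
}

int main(void) {
    const int n_threads = 4;
    const int n_nodes = 4;
    const int n_tasks_per_node[] = { 1, 1, 4, 1 };  // toy graph: one multi-task node

    pthread_t threads[4];
    bool workers_started = false;

    for (int i = 0; i < n_nodes; ++i) {
        if (n_tasks_per_node[i] > 1 && !workers_started) {
            // First node that benefits from threading: create the pool now.
            for (int t = 0; t < n_threads - 1; ++t) {
                pthread_create(&threads[t], NULL, worker, NULL);
            }
            workers_started = true;
            printf("thread pool created at node %d\n", i);
        }
        // ... compute node i here ...
    }

    if (workers_started) {
        for (int t = 0; t < n_threads - 1; ++t) {
            pthread_join(threads[t], NULL);
        }
    } else {
        printf("no multi-task node encountered, no threads created\n");
    }
    return 0;
}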