ggml: create thread pool lazily #2674

Conversation

JohannesGaessler (Collaborator):

Currently, ggml eagerly creates a thread pool when it starts evaluating a graph. However, when all layers are offloaded with CUDA these threads provide no benefit and just spin, so in that case it is better to manually set the number of threads to 1. This PR makes the thread pool creation lazy: the pool is created only once the first graph node with more than one task is encountered. If all layers are offloaded, every node automatically has only a single task, so the thread pool is never created. In my testing the lazy creation of the thread pool does not affect CPU-only performance.
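
A minimal sketch of the idea (assumed, simplified names; `worker`, `graph_compute_lazy`, and the `n_tasks` array are illustrative stand-ins, not the actual ggml internals): workers are spawned only when the first node with more than one task is encountered, so a fully offloaded graph never creates them.

```c
#include <pthread.h>
#include <stdbool.h>

#define MAX_THREADS 16 // sketch assumes n_threads <= MAX_THREADS

// stub standing in for the per-thread compute loop
static void * worker(void * arg) {
    (void) arg;
    return NULL;
}

// n_tasks[i] is 1 for every node when all layers are offloaded to the GPU,
// in which case the loop below never spawns a single worker thread
static void graph_compute_lazy(const int * n_tasks, int n_nodes, int n_threads) {
    pthread_t workers[MAX_THREADS];
    bool started = false;
    for (int i = 0; i < n_nodes; ++i) {
        if (!started && n_tasks[i] > 1) {
            // first node that actually needs the pool: create it now
            for (int t = 0; t < n_threads - 1; ++t) {
                pthread_create(&workers[t], NULL, worker, NULL);
            }
            started = true;
        }
        // ... evaluate node i on the main thread (and the workers, if started) ...
    }
    if (started) {
        for (int t = 0; t < n_threads - 1; ++t) {
            pthread_join(workers[t], NULL);
        }
    }
}
```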

ggerganov (Member) left a comment:

Does this change have any advantage compared to passing n_threads == 1 to ggml_graph_compute() when CUDA is enabled?

No threads would be created, and it seems like a less intrusive solution compared to the proposed one.
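
For illustration, a hedged sketch of that alternative in user code, assuming the `ggml_graph_plan()`/`ggml_graph_compute()` API of that era; the `all_layers_offloaded` flag is hypothetical (e.g. derived from `n_gpu_layers` in the caller):

```c
// hypothetical flag computed by the caller from the number of GPU layers
int n_threads = all_layers_offloaded ? 1 : params.n_threads;

// with n_threads == 1 the graph runs entirely on the calling thread,
// so no worker threads are created at all
struct ggml_cplan plan = ggml_graph_plan(&graph, n_threads);
ggml_graph_compute(&graph, &plan);
```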

Comment on lines +16756 to +16765:

```c
// a node can be split into multiple tasks only if it and all of its
// sources live on the CPU; otherwise force a single task for it
bool node_and_src_all_cpu = node->backend == GGML_BACKEND_CPU;
for (int j = 0; node_and_src_all_cpu && j < GGML_MAX_SRC; ++j) {
    if (node->src[j] != NULL && node->src[j]->backend != GGML_BACKEND_CPU) {
        node_and_src_all_cpu = false;
    }
}
if (!node_and_src_all_cpu) {
    n_tasks = 1;
}
```

ggerganov (Member):

Long term I would like to see ggml_tensor.backend removed, so I prefer to limit its application, especially inside ggml.

JohannesGaessler (Collaborator, Author):

> Does this change have any advantage compared to passing n_threads == 1 to ggml_graph_compute() when CUDA is enabled?

Currently, not really. I personally think creating the thread pool lazily is preferable, but it would not be difficult to just add a check for the number of GPU layers to the user code.

@slaren may have an opinion with regards to #2239 but I don't think it would make much of a difference.

slaren (Member) commented Aug 22, 2023:

I don't think that this affects the ggml backends interface, other than that this change would become obsolete once the interface lands and would need to be removed. The problem with the backends interface is the overhead of having to create the threads in each call to ggml_graph_compute, which, as far as I can tell, this doesn't solve.
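
To illustrate the remaining cost (a sketch, not actual llama.cpp code; names follow the era's API): token-by-token generation calls ggml_graph_compute once per token, so any CPU-side work still pays a thread spawn/join cycle inside every call, lazy creation or not.

```c
// each iteration generates one token; with the lazy scheme, threads are
// still created and joined inside ggml_graph_compute whenever any node
// needs more than one task, i.e. once per token here
for (int i = 0; i < n_predict; ++i) {
    struct ggml_cplan plan = ggml_graph_plan(&graph, n_threads);
    ggml_graph_compute(&graph, &plan);
}
// a persistent thread pool would hoist the create/join out of this loop
```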

darkacorn:

Funny, in my tests it makes no difference at all (-t 1 vs. -t 10; eval speed is 11.95 vs. 11.66 tokens per second in the logs below):

(base) ➜ llama.cpp git:(master) ✗ CUDA_VISIBLE_DEVICES=1 ./build/bin/main --gqa 8 -m ./airoboros-l2-70b-gpt4-2.0.ggmlv3.q4_K_M.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 1 --n-gpu-layers 100 --mul-mat-q -c 4096
main: warning: base model only supports context sizes no greater than 2048 tokens (4096 specified)
main: build = 1009 (9e232f0)
main: seed = 1692738933
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA RTX A6000, compute capability 8.6
llama.cpp: loading model from ./airoboros-l2-70b-gpt4-2.0.ggmlv3.q4_K_M.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 7168
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1217.85 MB (+ 1280.00 MB per state)
llama_model_load_internal: allocating batch_size x (1280 kB + n_ctx x 256 B) = 1152 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 80 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 83/83 layers to GPU
llama_model_load_internal: total VRAM used: 41755 MB
llama_new_context_with_model: kv self size = 1280.00 MB

system_info: n_threads = 1 / 128 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = 512, n_keep = 0

Building a website can be done in 10 simple steps:

  1. Find a niche: Identify your target audience and the purpose of your website.
  2. Choose a platform: Decide on whether to use ready-made website builders like Wix or Squarespace, or learn how to code from scratch using HTML and CSS.
  3. Register your domain name: Purchase a unique URL for your site (e.g., www.yourwebsitename.com).
  4. Choose a hosting service: Find a reliable web host where all the files of your website will be stored.
  5. Design your layout: Decide on the look and feel of your pages, including colors, fonts, images etc.
  6. Create content: Write text, add images, videos or any other type of media that is relevant to your niche.
  7. Make it mobile-friendly: Ensure your website looks good and functions well on all devices - desktops, laptops, tablets, and smartphones.
  8. Test thoroughly: Check for typos, broken links, slow loading times etc., before going live.
  9. Go Live! Once everything is tested and approved, publish your site so it's accessible online.
  10. Promote your website: Use SEO techniques to make sure people can find you when they search related keywords; use social media platforms to spread the word about your new venture. [end of text]

llama_print_timings: load time = 4808.19 ms
llama_print_timings: sample time = 196.23 ms / 297 runs ( 0.66 ms per token, 1513.53 tokens per second)
llama_print_timings: prompt eval time = 396.47 ms / 14 tokens ( 28.32 ms per token, 35.31 tokens per second)
llama_print_timings: eval time = 24767.13 ms / 296 runs ( 83.67 ms per token, 11.95 tokens per second)
llama_print_timings: total time = 25415.24 ms
(base) ➜ llama.cpp git:(master) ✗ CUDA_VISIBLE_DEVICES=1 ./build/bin/main --gqa 8 -m ./airoboros-l2-70b-gpt4-2.0.ggmlv3.q4_K_M.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 10 --n-gpu-layers 100 --mul-mat-q -c 4096
main: warning: base model only supports context sizes no greater than 2048 tokens (4096 specified)
main: build = 1009 (9e232f0)
main: seed = 1692738975
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA RTX A6000, compute capability 8.6
llama.cpp: loading model from ./airoboros-l2-70b-gpt4-2.0.ggmlv3.q4_K_M.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 7168
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1217.85 MB (+ 1280.00 MB per state)
llama_model_load_internal: allocating batch_size x (1280 kB + n_ctx x 256 B) = 1152 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 80 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 83/83 layers to GPU
llama_model_load_internal: total VRAM used: 41755 MB
llama_new_context_with_model: kv self size = 1280.00 MB

system_info: n_threads = 10 / 128 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = 512, n_keep = 0

Building a website can be done in 10 simple steps:

  1. Domain Name - This is the web address for your site, such as "www.yourwebsite.com". You'll need to purchase this from a domain registrar like GoDaddy or Namecheap.
  2. Web Hosting - This is where your website files will be stored and accessed by visitors on the internet. There are many hosting providers available, including Bluehost, HostGator, and DreamHost.
  3. Content Management System (CMS) - If you want to create a more complex site with dynamic content, then using a CMS like WordPress or Joomla might be helpful. But if you just want a simple static website, no need for this step.
  4. Theme & Plugins - If you're using WordPress (or similar), now would be the time to choose and install a theme that determines how your site looks, and any necessary plugins that add extra functionality.
  5. Website Content - Start creating content for your website including text, images, videos etc., depending on what kind of website you're building.
  6. Testing & Debugging - Test all pages of the website to make sure they work properly across different devices and browsers. Also, check if there are any broken links or errors showing up in webmaster tools.
  7. SEO Optimization - Apply basic SEO techniques like adding meta tags, alt text for images, internal linking, etc., so your site can rank better on search engines.
  8. Launching the Website - Once everything is tested and ready, publish the website by setting it live on your web host.
  9. Backup & Security - Regularly backup your website data to prevent loss in case of server crash or hacking attack. Ensure security measures like SSL certificate installation, firewall setup, regular software updates are taken care of.
  10. Maintenance & Updates - Keep updating the website with fresh content regularly and ensure that all components (including themes, plugins) are updated to their latest versions for smooth functioning and optimal user experience. [end of text]

llama_print_timings: load time = 4872.12 ms
llama_print_timings: sample time = 286.72 ms / 455 runs ( 0.63 ms per token, 1586.91 tokens per second)
llama_print_timings: prompt eval time = 391.76 ms / 14 tokens ( 27.98 ms per token, 35.74 tokens per second)
llama_print_timings: eval time = 38937.41 ms / 454 runs ( 85.77 ms per token, 11.66 tokens per second)
llama_print_timings: total time = 39704.37 ms

JohannesGaessler (Collaborator, Author):

Obsoleted by #2915.
