NUMA CPU selection doesn't work on EPYC Genoa #5121
I'm not deeply familiar with NUMA, but it would be great to improve support in this regard. PRs welcome.
@ggerganov I was able to fix this particular problem with a fairly simple patch, but it doesn't fix the entire problem. Without making sure memory is allocated in the correct NUMA node's local memory, the speedups are minimal, as the default malloc strategy is to interleave allocations across NUMA nodes.
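For illustration, here is a minimal sketch (not the ggml code) of allocating a buffer on a specific NUMA node with libnuma instead of relying on the default interleaved policy; the buffer size and node index are placeholders:

```c
// Minimal sketch (not the ggml code): allocate a buffer backed by a specific
// NUMA node's memory with libnuma. Link with -lnuma.
// The size and node index are placeholders.
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    size_t size = 1 << 20;   // 1 MiB placeholder buffer
    int    node = 0;         // target NUMA node (placeholder)

    void * buf = numa_alloc_onnode(size, node);  // pages placed on the given node
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    memset(buf, 0, size);    // touch the pages so they are actually committed
    numa_free(buf, size);
    return 0;
}
```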
There was some discussion about NUMA when we added initial support in #1556 (review) that might provide some insights.
This sounds cumbersome; it would be better to avoid extra parameters and do whatever is possible.
OK, I've got some test NUMA code built in a new independent test project. Once I validate my strategy performance-wise and wrap my head around a workable set of final data structures, I'll produce a minimal patch to fix this specific scheduling bug and start work on a new NUMA improvements branch. I'm hoping to see near-linear performance improvements across CPU sockets; ideally Genoa can sustain 920 GB/s across multiple TB of memory. I will try to make the NUMA support work automatically with GGML_USE_NUMA so it can be IFDEF'd for minimal impact on non-NUMA users and so there aren't a bunch of new CLI parameters. Maybe just a --numa-mirror or --numa-multi-node flag to replicate the data across nodes, since the impact on memory will be to use modelsize*numanodes GB of RAM? Thank you for the link to that pull; it's very useful information.
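As a rough sketch of what the replication behind a hypothetical --numa-mirror flag could look like, assuming libnuma is available; the replicate_to_nodes helper is illustrative only, not an existing llama.cpp/ggml API:

```c
// Rough sketch of per-node replication of a read-only weight buffer using
// libnuma (link with -lnuma). The function name and structure are
// hypothetical, not existing llama.cpp/ggml APIs. Memory cost is
// model_size * num_nodes, as noted above.
#include <numa.h>
#include <stdlib.h>
#include <string.h>

// Returns an array of per-node copies of `src` (index = NUMA node), or NULL.
static void ** replicate_to_nodes(const void * src, size_t size) {
    int n_nodes = numa_num_configured_nodes();
    void ** copies = calloc(n_nodes, sizeof(void *));
    if (copies == NULL) {
        return NULL;
    }
    for (int node = 0; node < n_nodes; ++node) {
        copies[node] = numa_alloc_onnode(size, node);
        if (copies[node] == NULL) {
            // clean up partial allocations on failure
            for (int i = 0; i < node; ++i) {
                numa_free(copies[i], size);
            }
            free(copies);
            return NULL;
        }
        memcpy(copies[node], src, size);  // each node gets its own local copy
    }
    return copies;
}
```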
Welcome - the referenced issue in there (#1437) might have some extra useful info.
We can consider this depending on the implementation. I think if the main interest is server deployment, these could even be compile-time options.
I've finished my proof of concept and am convinced that the changes needed to maximize per-thread bandwidth in a NUMA system are worth it for anyone with multi-socket Genoa. As for the current NUMA node scheduling scheme, it appears to always attempt to spread the threads across all NUMA nodes, which seems to me to be the worst case. Is this the intention? If not, I'd like to create a patch that forces execution of all threads onto the node where the main thread is running, to ensure data locality with the loaded model's data buffer. That way numactl can be used to launch llama.cpp and control which node handles all the threads, which is how I'd assumed it was intended to operate.
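A minimal sketch of the "pin everything to the main thread's node" idea, assuming libnuma and glibc; this is illustrative, not the actual patch:

```c
// Minimal sketch: pin the calling thread to the NUMA node it is currently
// executing on, so worker threads stay local to the model buffer.
// Assumes libnuma (-lnuma) and glibc; not the actual llama.cpp patch.
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>

static int pin_to_current_node(void) {
    int cpu  = sched_getcpu();            // CPU this thread is on right now
    int node = numa_node_of_cpu(cpu);     // NUMA node that CPU belongs to
    if (node < 0) {
        return -1;
    }
    struct bitmask * mask = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, mask) != 0) {   // all CPUs of that node
        numa_free_cpumask(mask);
        return -1;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned int i = 0; i < mask->size; ++i) {
        if (numa_bitmask_isbitset(mask, i)) {
            CPU_SET(i, &set);
        }
    }
    numa_free_cpumask(mask);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

With something like this in place, a command such as `numactl --cpunodebind=0 --membind=0 ./main ...` would keep both the worker threads and the model buffer on node 0.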
https://github.com/bmtwl/numabw is my repo for the NUMA bandwidth testing tool I developed for my PoC.
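For context, a very rough sketch of the kind of measurement such a tool performs (this is not the numabw code): allocate a buffer on one node, stream over it, and report the resulting bandwidth. Buffer size and node index are placeholders:

```c
// Very rough sketch of a node-local read-bandwidth measurement (not the
// numabw code). Link with -lnuma. Buffer size and node are placeholders.
#include <numa.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void) {
    if (numa_available() < 0) {
        return 1;
    }
    const size_t size = (size_t)1 << 30;   // 1 GiB placeholder buffer
    const int    node = 0;                 // placeholder NUMA node

    uint64_t * buf = numa_alloc_onnode(size, node);
    if (buf == NULL) {
        return 1;
    }
    memset(buf, 1, size);                  // commit the pages on the node

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile uint64_t sum = 0;
    for (size_t i = 0; i < size / sizeof(uint64_t); ++i) {
        sum += buf[i];                     // sequential read stream
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("read %.2f GB/s (checksum %llu)\n",
           size / secs / 1e9, (unsigned long long)sum);

    numa_free(buf, size);
    return 0;
}
```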
@ggerganov I have attempted pull request number 2.
When running the latest git pull of llama.cpp on a dual-socket EPYC Genoa system, the set_numa_thread_affinity() code attempts to set pthread affinity in a linear fashion (i = 0; i < node->n_cpus; ++i).
However, the NUMA nodes on this system have interleaved CPUs, so many threads end up not accessing local memory, which makes generation very slow.
I proved this to myself by disabling** the CPUs not in the other NUMA node, and the llama.cpp code continuously faults with:
I think the g_state.numa structure has to be modified to encode the info from /sys/devices/system/node/ and use that as a CPU mask when calling pthread_setaffinity_np.
** echo 0 > /sys/devices/system/cpu/cpu$1/online
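A sketch of the kind of fix described above, reading the per-node CPU list from sysfs and turning it into a cpu_set_t for pthread_setaffinity_np; the cpulist parsing is simplified and this is not the actual patch:

```c
// Sketch of building a per-node CPU mask from /sys/devices/system/node/
// and using it with pthread_setaffinity_np, as proposed above. The cpulist
// parsing handles "a-b,c,d-e" style entries; illustrative, not the actual patch.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

// Fill `set` with the CPUs belonging to NUMA node `node`. Returns 0 on success.
static int cpuset_from_node(int node, cpu_set_t * set) {
    char path[128];
    snprintf(path, sizeof(path), "/sys/devices/system/node/node%d/cpulist", node);
    FILE * f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    CPU_ZERO(set);
    int lo, hi;
    while (fscanf(f, "%d", &lo) == 1) {
        hi = lo;
        if (fscanf(f, "-%d", &hi) != 1) {
            hi = lo;                       // single CPU, not a range
        }
        for (int cpu = lo; cpu <= hi; ++cpu) {
            CPU_SET(cpu, set);
        }
        if (fgetc(f) != ',') {             // consume ',' between entries
            break;
        }
    }
    fclose(f);
    return 0;
}

// Usage example: pin the calling thread to the CPUs of node 0 (placeholder id).
// cpu_set_t set;
// if (cpuset_from_node(0, &set) == 0) {
//     pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
// }
```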