
NUMA CPU selection doesn't work on EPYC Genoa #5121

Closed
bmtwl opened this issue Jan 25, 2024 · 9 comments

Comments

@bmtwl
Contributor

bmtwl commented Jan 25, 2024

When running the latest git pull of llama.cpp on a dual-socket EPYC Genoa system, the set_numa_thread_affinity() code attempts to set pthread affinity in a linear fashion (i = 0; i < node->n_cpus; ++i).
However, the NUMA nodes on this system have interleaved CPUs:

found 2 numa nodes, 128 CPUs
CPUs on node 0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
CPUs on node 1: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127

As a result, many threads are not accessing local memory, making generation very slow.

I proved this to myself by disabling** the CPUs belonging to one of the NUMA nodes; with those CPUs offline, the llama.cpp code continuously faults with:

warning: pthread_setaffinity_np() failed: Invalid argument

I think the g_state.numa structure has to be modified to encode the info from /sys/devices/system/node/ and use that as the CPU mask when calling pthread_setaffinity_np().

** echo 0 > /sys/devices/system/cpu/cpu$1/online
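
To illustrate, something along these lines is what I have in mind - a rough sketch only, not a patch against ggml.c, and the helper name numa_node_cpu_mask() is made up here:

```c
// Rough sketch: build the affinity mask from the kernel's per-node CPU list
// instead of assuming CPUs 0..n_cpus-1 belong to the node.
// The helper name numa_node_cpu_mask() is made up for illustration.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static int numa_node_cpu_mask(int node, cpu_set_t * mask) {
    char path[128], line[4096];
    snprintf(path, sizeof(path), "/sys/devices/system/node/node%d/cpulist", node);
    FILE * f = fopen(path, "r");
    if (!f || !fgets(line, sizeof(line), f)) {
        if (f) fclose(f);
        return -1;
    }
    fclose(f);

    CPU_ZERO(mask);
    // cpulist looks like "0-31,64-95": comma-separated ranges or single CPUs
    for (char * tok = strtok(line, ","); tok != NULL; tok = strtok(NULL, ",")) {
        int lo, hi;
        const int n = sscanf(tok, "%d-%d", &lo, &hi);
        if (n < 1) continue;
        if (n < 2) hi = lo;
        for (int cpu = lo; cpu <= hi; ++cpu) CPU_SET(cpu, mask);
    }
    return 0;
}

// usage: pin the calling thread to node 0's actual CPUs
// cpu_set_t mask;
// if (numa_node_cpu_mask(0, &mask) == 0) {
//     pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
// }
```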

@ggerganov
Owner

I'm not deeply familiar with NUMA, but it would be great to improve support in this regard. PRs welcome

@bmtwl
Contributor Author

bmtwl commented Jan 28, 2024

@ggerganov I was able to fix this particular problem with a fairly simple patch, but it doesn't fix the entire problem. Without making sure memory is allocated in the correct NUMA node's local memory, the speedups are minimal, as the default malloc strategy is to interleave across NUMA nodes.
Since it requires swapping malloc for numa_alloc_onnode and free for numa_free only when NUMA is on, the changes will be intrusive: numa.h has to be included in ggml.c and -lnuma has to be passed to the linker.
There are also other opportunities for speedups, e.g. mirroring the GGUF data buffer locally on each NUMA node and using the local copy when running inference, but I'm struggling to grasp the entire program flow around the nested data structures, memory allocation, and inference in general.
I'm happy to create a branch to work on this, but I'd like to follow your general plan for development and not create any ugly kludges or bad performance regressions.
Any insight into how you'd like to see the allocation of large buffers on NUMA systems handled (compiler directive and IFDEFs? CLI options for strategy and setting NUMA nodes/CPU sets to use?), and a basic rundown of the structures that hold the GGUF files in memory, would be very much appreciated.
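
For reference, the kind of swap I mean on the allocation path looks roughly like this (just a sketch, guarded by GGML_USE_NUMA; buffer_alloc/buffer_free are placeholder names, not the real ggml functions):

```c
// Sketch of the allocation swap, so non-NUMA builds are untouched.
// Needs <numa.h> and linking with -lnuma when GGML_USE_NUMA is defined.
// buffer_alloc/buffer_free are placeholder names for illustration.
#ifdef GGML_USE_NUMA
#include <numa.h>
#endif
#include <stdlib.h>

static void * buffer_alloc(size_t size, int node) {
#ifdef GGML_USE_NUMA
    if (numa_available() >= 0) {
        // allocate from the requested node's local memory
        return numa_alloc_onnode(size, node);
    }
#endif
    (void) node;
    return malloc(size);
}

static void buffer_free(void * ptr, size_t size) {
#ifdef GGML_USE_NUMA
    if (numa_available() >= 0) {
        numa_free(ptr, size); // numa_free() needs the original size
        return;
    }
#endif
    (void) size;
    free(ptr);
}
```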

@ggerganov
Owner

ggerganov commented Jan 29, 2024

There was some discussion about NUMA when we added initial support in #1556 (review) that might provide some insights into the way ggml works with the data.

-lnuma can be an optional dependency when we build with GGML_USE_NUMA, but I cannot say for sure at this point that I will be OK with merging the patch - it will depend on how big and intrusive it is and what the benefits are.

CLI options for strategy and setting NUMA nodes/CPU sets to use?

This sounds cumbersome; it would be better to avoid extra parameters and do whatever is possible with default settings. I remember in #1556 that @zrm showed how to control some of this stuff outside of ggml through command-line tools.

@bmtwl
Contributor Author

bmtwl commented Jan 29, 2024

OK, I've got some test NUMA code built in a new independent test project. Once I validate my strategy performance-wise and wrap my head around a workable set of final data structures, I'll produce a minimal patch to fix this specific scheduling bug and start work on a new NUMA improvements branch. I'm hoping to see near-linear performance improvements across CPU sockets. Ideally Genoa can sustain 920GB/s across multiple TB of memory.

I will try to make the NUMA support work automatically with GGML_USE_NUMA so it can be ifdef'd for minimal impact on non-NUMA users, and so there aren't a bunch of new CLI parameters. Maybe just a --numa-mirror or --numa-multi-node flag to replicate the data across nodes, since the impact to memory will be to use modelsize*numanodes GB of RAM?
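
Conceptually, the mirroring I'm picturing is something like this (illustrative only, using libnuma; node_mirror/mirror_create are made-up names):

```c
// Illustrative sketch of mirroring a read-only weight buffer on every node.
// Uses libnuma; the struct and function names here are made up.
#include <numa.h>
#include <stdlib.h>
#include <string.h>

struct node_mirror {
    int     n_nodes;
    size_t  size;
    void ** copies; // copies[node] = replica local to that node
};

static struct node_mirror * mirror_create(const void * src, size_t size) {
    struct node_mirror * m = calloc(1, sizeof(*m));
    m->n_nodes = numa_num_configured_nodes();
    m->size    = size;
    m->copies  = calloc(m->n_nodes, sizeof(void *));
    for (int node = 0; node < m->n_nodes; ++node) {
        // total memory cost: size * n_nodes, which is the tradeoff mentioned above
        m->copies[node] = numa_alloc_onnode(size, node);
        memcpy(m->copies[node], src, size);
    }
    return m;
}

// at inference time a thread would read from
// m->copies[numa_node_of_cpu(sched_getcpu())] so all accesses stay local
```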

Thank you for the link to that pull request. It's very useful information.

@ggerganov
Owner

ggerganov commented Jan 29, 2024

Welcome - the referenced issue in there (#1437) might have some extra useful info.

Maybe just a --numa-mirror or --numa-multi-node flag to replicate the data across nodes, since the impact to memory will be to use modelsize*numanodes GB of RAM?

We can consider this depending on the implementation. I think if the main interest is server deployment, these could even be compile-time options

@bmtwl
Contributor Author

bmtwl commented Feb 5, 2024

I've finished my proof of concept and am convinced that the changes needed to maximize per-thread bandwidth in a NUMA system are worth it for anyone with multi-socket Genoa.
However, I think you are right: the changes will likely be invasive enough that a compile-time flag is the best way to handle including them in the build.

As for the current NUMA node scheduling scheme, it appears to be always attempting to spread the threads across all NUMA nodes, which seems to me to be a worst case:
(ggml.c line 16588)
const int node_num = thread_n / ((n_threads + g_state.numa.n_nodes - 1) / g_state.numa.n_nodes);

Is this the intention? If not, I'd like to create a patch that forces execution of all threads onto the main thread's node to ensure data locality with the loaded model's data buffer. That way numactl can be used to launch llama.cpp and control which node handles all the threads, which is how I'd assumed it was intended to operate.
Please let me know and I'll submit a patch to change the behaviour of the --numa CLI argument.
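
For concreteness, the behaviour I'm proposing is roughly the following (a sketch using libnuma, not the actual ggml.c change):

```c
// Sketch of the proposed behaviour: workers inherit the NUMA node of the
// main thread instead of being spread across all nodes. Uses libnuma.
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>

static int g_main_node = -1;

// called once from the main thread, before the workers are spawned
static void record_main_node(void) {
    g_main_node = numa_node_of_cpu(sched_getcpu());
}

// called at the start of every worker thread
static void pin_worker_to_main_node(void) {
    struct bitmask * cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(g_main_node, cpus) == 0) {
        numa_sched_setaffinity(0, cpus); // pid 0 = the calling thread
    }
    numa_free_cpumask(cpus);
}
```

With something like that in place, launching via numactl --cpunodebind=N --membind=N would keep both the threads and the model buffer on node N, which is the behaviour I originally expected.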

@bmtwl
Contributor Author

bmtwl commented Feb 5, 2024

https://github.com/bmtwl/numabw is my repo for the NUMA bandwidth testing tool I developed for my PoC

@bmtwl
Contributor Author

bmtwl commented Feb 6, 2024

@ggerganov
I've added a pull request. Please let me know if any changes are needed.

@bmtwl
Contributor Author

bmtwl commented Feb 6, 2024

I've made a second attempt at the pull request.
