NUMA CPU selection doesn't work on EPYC Genoa #5121
I'm not deeply familiar with NUMA, but it would be great to improve support in this regard. PRs welcome.
@ggerganov I was able to fix this particular problem with a fairly simple patch, but it doesn't fix the entire problem. Without making sure memory is allocated in the correct NUMA node's local memory, the speedups are minimal, as the default malloc strategy is to interleave allocations across NUMA nodes.
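For illustration, here is a minimal sketch (not the ggml code) of allocating a buffer on a specific NUMA node with libnuma instead of relying on the default interleaved policy; the buffer size and node index are placeholders:

```c
// Minimal sketch (not the ggml code): allocate a buffer backed by a specific
// NUMA node's memory with libnuma. Link with -lnuma.
// The size and node index are placeholders.
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    size_t size = 1 << 20;   // 1 MiB placeholder buffer
    int    node = 0;         // target NUMA node (placeholder)

    void * buf = numa_alloc_onnode(size, node);  // pages placed on the given node
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    memset(buf, 0, size);    // touch the pages so they are actually committed
    numa_free(buf, size);
    return 0;
}
```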
There was some discussion about NUMA when we added initial support in #1556 (review) that might provide some insights.
This sounds cumbersome; it would be better to avoid extra parameters and do whatever is possible.
OK, I've got some test NUMA code built in a new independent test project. Once I validate my strategy performance-wise and wrap my head around a workable set of final data structures, I'll produce a minimal patch to fix this specific scheduling bug and start work on a new NUMA improvements branch. I'm hoping to see near-linear performance improvements across CPU sockets; ideally Genoa can sustain 920 GB/s across multiple TB of memory. I will try to make the NUMA support work automatically with GGML_USE_NUMA so it can be IFDEF'd for minimal impact on non-NUMA users and so there aren't a bunch of new CLI parameters. Maybe just a --numa-mirror or --numa-multi-node flag to replicate the data across nodes, since the impact on memory will be to use modelsize*numanodes GB of RAM? Thank you for the link to that pull; it's very useful information.
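As a rough sketch of what the replication behind a hypothetical --numa-mirror flag could look like, assuming libnuma is available; the replicate_to_nodes helper is illustrative only, not an existing llama.cpp/ggml API:

```c
// Rough sketch of per-node replication of a read-only weight buffer using
// libnuma (link with -lnuma). The function name and structure are
// hypothetical, not existing llama.cpp/ggml APIs. Memory cost is
// model_size * num_nodes, as noted above.
#include <numa.h>
#include <stdlib.h>
#include <string.h>

// Returns an array of per-node copies of `src` (index = NUMA node), or NULL.
static void ** replicate_to_nodes(const void * src, size_t size) {
    int n_nodes = numa_num_configured_nodes();
    void ** copies = calloc(n_nodes, sizeof(void *));
    if (copies == NULL) {
        return NULL;
    }
    for (int node = 0; node < n_nodes; ++node) {
        copies[node] = numa_alloc_onnode(size, node);
        if (copies[node] == NULL) {
            // clean up partial allocations on failure
            for (int i = 0; i < node; ++i) {
                numa_free(copies[i], size);
            }
            free(copies);
            return NULL;
        }
        memcpy(copies[node], src, size);  // each node gets its own local copy
    }
    return copies;
}
```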
Welcome - the referenced issue in there (#1437) might have some extra useful info.
We can consider this depending on the implementation. I think if the main interest is server deployment, these could even be compile-time options.
I've finished my proof of concept and am convinced that the changes needed to maximize per-thread bandwidth in a NUMA system are worth it for anyone with multi-socket Genoa. As for the current NUMA node scheduling scheme, it appears to always attempt to spread the threads across all NUMA nodes, which seems to me to be the worst case. Is this the intention? If not, I'd like to create a patch that forces execution of all threads onto the node where the main thread is running, to ensure data locality with the loaded model's data buffer. That way numactl can be used to launch llama.cpp and control which node handles all the threads, which is how I'd assumed it was intended to operate.
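A minimal sketch of the "pin everything to the main thread's node" idea, assuming libnuma and glibc; this is illustrative, not the actual patch:

```c
// Minimal sketch: pin the calling thread to the NUMA node it is currently
// executing on, so worker threads stay local to the model buffer.
// Assumes libnuma (-lnuma) and glibc; not the actual llama.cpp patch.
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>

static int pin_to_current_node(void) {
    int cpu  = sched_getcpu();            // CPU this thread is on right now
    int node = numa_node_of_cpu(cpu);     // NUMA node that CPU belongs to
    if (node < 0) {
        return -1;
    }
    struct bitmask * mask = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, mask) != 0) {   // all CPUs of that node
        numa_free_cpumask(mask);
        return -1;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned int i = 0; i < mask->size; ++i) {
        if (numa_bitmask_isbitset(mask, i)) {
            CPU_SET(i, &set);
        }
    }
    numa_free_cpumask(mask);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

With something like this in place, a command such as `numactl --cpunodebind=0 --membind=0 ./main ...` would keep both the worker threads and the model buffer on node 0.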
https://github.com/bmtwl/numabw is my repo for the NUMA bandwidth testing tool I developed for my PoC.
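For context, a very rough sketch of the kind of measurement such a tool performs (this is not the numabw code): allocate a buffer on one node, stream over it, and report the resulting bandwidth. Buffer size and node index are placeholders:

```c
// Very rough sketch of a node-local read-bandwidth measurement (not the
// numabw code). Link with -lnuma. Buffer size and node are placeholders.
#include <numa.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void) {
    if (numa_available() < 0) {
        return 1;
    }
    const size_t size = (size_t)1 << 30;   // 1 GiB placeholder buffer
    const int    node = 0;                 // placeholder NUMA node

    uint64_t * buf = numa_alloc_onnode(size, node);
    if (buf == NULL) {
        return 1;
    }
    memset(buf, 1, size);                  // commit the pages on the node

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile uint64_t sum = 0;
    for (size_t i = 0; i < size / sizeof(uint64_t); ++i) {
        sum += buf[i];                     // sequential read stream
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("read %.2f GB/s (checksum %llu)\n",
           size / secs / 1e9, (unsigned long long)sum);

    numa_free(buf, size);
    return 0;
}
```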
@ggerganov I have attempted pull request number 2.
When running the latest git pull of llama.cpp on a dual-socket EPYC Genoa system, the set_numa_thread_affinity() code attempts to set pthread affinity in a linear fashion (i = 0; i < node->n_cpus; ++i).
However, the NUMA nodes on this system have interleaved CPUs, so many threads end up not accessing local memory, which makes generation very slow.
I proved this to myself by disabling** the CPUs not in the other NUMA node, and the llama.cpp code continuously faults with:
I think the g_state.numa structure has to be modified to encode the info from /sys/devices/system/node/ and use that as a CPU mask when calling pthread_setaffinity_np.
** echo 0 > /sys/devices/system/cpu/cpu$1/online
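A sketch of the kind of fix described above, reading the per-node CPU list from sysfs and turning it into a cpu_set_t for pthread_setaffinity_np; the cpulist parsing is simplified and this is not the actual patch:

```c
// Sketch of building a per-node CPU mask from /sys/devices/system/node/
// and using it with pthread_setaffinity_np, as proposed above. The cpulist
// parsing handles "a-b,c,d-e" style entries; illustrative, not the actual patch.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

// Fill `set` with the CPUs belonging to NUMA node `node`. Returns 0 on success.
static int cpuset_from_node(int node, cpu_set_t * set) {
    char path[128];
    snprintf(path, sizeof(path), "/sys/devices/system/node/node%d/cpulist", node);
    FILE * f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    CPU_ZERO(set);
    int lo, hi;
    while (fscanf(f, "%d", &lo) == 1) {
        hi = lo;
        if (fscanf(f, "-%d", &hi) != 1) {
            hi = lo;                       // single CPU, not a range
        }
        for (int cpu = lo; cpu <= hi; ++cpu) {
            CPU_SET(cpu, set);
        }
        if (fgetc(f) != ',') {             // consume ',' between entries
            break;
        }
    }
    fclose(f);
    return 0;
}

// Usage example: pin the calling thread to the CPUs of node 0 (placeholder id).
// cpu_set_t set;
// if (cpuset_from_node(0, &set) == 0) {
//     pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
// }
```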