ZFSOnLinux is not NUMA-safe #1000
Comments
I have a system we should be able to test with. It'll take a little bit of work (I'd have to install the second processor and an OS), but it's doable.
I have gotten good stability running the latest ZFSOnLinux on a NUMA-aware kernel and machine when the ARC size is limited to 16G (64G total memory). The system has 2 CPU sockets in use.
It sounds like reducing the maximum ARC size to RAM / (2 * number of sockets) might prevent issues, assuming that all processors have equal amounts of RAM. This agrees with past reports claiming that reducing the ARC size helped.
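As a concrete instance of that arithmetic on the 64G, 2-socket system reported above: 64G / (2 * 2) = 16G, which matches the cap that was found to be stable. A minimal sketch of applying such a cap through the `zfs_arc_max` module parameter follows; the byte value and paths are the usual ones, but verify them against your release:

```sh
# Cap ARC at RAM / (2 * sockets); for 64G and 2 sockets: 16G.
# 16G = 16 * 1024^3 = 17179869184 bytes.

# Persistent, applied when the zfs module loads:
echo "options zfs zfs_arc_max=17179869184" >> /etc/modprobe.d/zfs.conf

# On a running system (older releases may need a module reload
# before this takes effect):
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
```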
For anyone that is not familiar with NUMA safety, the issue involves how synchronization is done. Typically, synchronization in SMP systems relies on cores locking the bus to enable the atomic operations upon which synchronization mechanisms depend. Other cores use a technique called bus snooping to know when this happens and avoid doing anything until the bus is unlocked. If other cores need to do memory operations while the bus is locked, they must wait, which is known as bus contention. That causes SMP-scalability problems.

NUMA hardware attempts to avoid such problems by using a separate bus for each socket. That can improve performance by permitting groups of threads to avoid bus contention with unrelated groups by running them on different sockets, but it breaks kernel code that assumes that there is only one bus, which is what I mean when I talk about NUMA safety. Even code where some considerations for NUMA have been made will be broken if a single aspect of it lacks NUMA safety.

The original Solaris code was NUMA-safe, so this is a ZFSOnLinux-specific regression. Other ports may have similar regressions, mainly because their developers likely aren't doing regression tests on NUMA-capable hardware.
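To make the "separate bus per socket" point concrete, here is a rough sketch using the `numactl` tool on a hypothetical two-node machine; pinning unrelated workloads to different nodes keeps each on its own socket's bus and memory (`workload-a` and `workload-b` are placeholder programs):

```sh
# Inspect the topology the kernel detected: nodes, CPUs, memory.
numactl --hardware

# Pin each workload's CPUs and allocations to one NUMA node so
# the two groups of threads do not contend on the same bus.
numactl --cpunodebind=0 --membind=0 ./workload-a &
numactl --cpunodebind=1 --membind=1 ./workload-b &
```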
I think this is #922 |
@cwedgwood This report describes an issue that could explain a vast number of reports, #922 included. In the case of #922, is NUMA involved?
@ryao Yes, it's NUMA (sorry, I thought that was known). Somewhere here or on IRC are details of things I tried, such as various zone reclaim changes or limiting ARC to <50% of the smallest NUMA node's size (none of this works very well; it maybe buys time).
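For anyone wanting to reproduce those workarounds, a sketch of what they might look like; the specific values are illustrative, not recommendations:

```sh
# Zone reclaim: 0 disables preferring reclaim from the local node,
# 1 enables it (see Documentation/sysctl/vm.txt for the other bits).
echo 0 > /proc/sys/vm/zone_reclaim_mode

# Keep ARC under half of the smallest NUMA node; e.g. with a
# 32G node, stay below 16G. 12G = 12884901888 bytes.
echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_max
```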
What are the causes and/or symptoms of NUMA non-safety? My naive understanding is that non-local memory is simply slower to access than node-local memory.
Would you verify that your issue still occurs if you disable all but one NUMA node? |
@ryao It breaks with NUMA disabled. I don't think this is NUMA-specific; NUMA might make it easier to trigger in some cases, but it's certainly not a requirement.
Would you describe how you disabled NUMA? |
Further details from that boot: there is no SRAT table in this case.
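For anyone else wanting to test this, a sketch of one way to disable NUMA for a boot and then verify the kernel saw a single node; `numa=off` is an x86 kernel boot parameter, and exact paths may differ by distribution:

```sh
# Boot with "numa=off" appended to the kernel command line, then:
cat /proc/cmdline                    # confirm the parameter took
dmesg | grep -i -e numa -e srat      # should show NUMA/SRAT disabled
cat /sys/devices/system/node/online  # expect a single node: 0
```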
There are very few CPUs that have 16 logical threads, and if the system has multiple sockets, it is quite likely that they are not sharing the same memory bus. With that said, would you tell me more about this system? Specifically, I would like references to the model and information on what parts compose it, in particular the motherboard model.
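A sketch of commands that would gather the requested details (`dmidecode` requires root):

```sh
# Sockets, cores, threads, and NUMA node layout:
lscpu

# Motherboard vendor and model:
dmidecode -t baseboard

# CPU and installed memory module details:
dmidecode -t processor -t memory
```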
After talking with @behlendorf, I am no longer convinced that the presence of NUMA explains this. It seems that NUMA systems implement a cache coherence protocol in hardware that should prevent the scenario I thought might be happening.
There have been reports of deadlocks with ZFS when systems have large quantities of RAM, such as issues #417, #695, #837 and #922. Until now, none of these reports mentioned whether or not NUMA hardware was involved.
MooingLemur reported in #zfsonlinux on the freenode IRC network that he is having deadlocks with 0.6.0-rc11 on a Gentoo Linux system that has 96GB of memory. He has an older Gentoo Linux system with 32GB of memory that has no deadlocks. The key difference between the two is that the newer one is a 2-socket NUMA system and the older one is not.
I have long suspected that the issues people with large quantities of memory reported were caused by NUMA-safety issues, but I have lacked access to NUMA hardware to examine that possibility. I am now confident that my suspicion is correct based on my discussion with MooingLemur.
As a volunteer, I will not have time to look into this for the remainder of the year. If I were to find time, I doubt that I would have access to NUMA hardware. I expect that professional developers working on the ZFSOnLinux code are in a much better position to debug this than I am, both time-wise and hardware-wise.