
ZFSOnLinux is not NUMA-safe #1000

Closed
ryao opened this issue Sep 30, 2012 · 15 comments
Labels
Type: Performance Performance improvement or performance problem
Milestone

Comments

@ryao
Contributor

ryao commented Sep 30, 2012

There have been reports of deadlocks with ZFS when systems have large quantities of RAM, such as issues #417, #695, #837 and #922. Until now, none of these reports mentioned whether or not NUMA hardware was involved.

MooingLemur reported in #zfsonlinux on the freenode IRC network that he is having deadlocks with 0.6.0-rc11 on a Gentoo Linux system that has 96GB of memory. He has an older Gentoo Linux system with 32GB of memory that has no deadlocks. The key difference between the two is that the newer system is a 2-socket NUMA system and the older one is not.

I have long suspected that the issues people with large quantities of memory reported were caused by NUMA-safety issues, but I have lacked access to NUMA hardware to examine that possibility. I am now confident that my suspicion is correct based on my discussion with MooingLemur.

As a volunteer, I will not have time to look into this for the remainder of the year. If I were to find time, I doubt that I would have access to NUMA hardware. I expect that professional developers working on the ZFSOnLinux code are in a much better position to debug this than I am, both time-wise and hardware-wise.

@prometheanfire
Contributor

I have a system we should be able to test with. It'll take a little bit of work (I have to install the second processor and an OS), but it's doable.

@atonkyra

I have gotten good stability running the latest ZFSOnLinux with a NUMA-aware kernel and machine when the ARC size is limited to 16G (64G total memory).

System has 2 CPU sockets in use.

@ryao
Contributor Author

ryao commented Sep 30, 2012

It sounds like reducing the maximum ARC size to RAM / (2 * number of sockets) might prevent issues, assuming that all processors have equal amounts of RAM. This agrees with past reports that claimed that reducing ARC size helped.
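
As a rough illustration of that formula (a sketch only, using MooingLemur's 96GB, 2-socket figures from above), the cap works out to 24GB; the zfs_arc_max module parameter, which takes a byte count, is the usual knob for applying such a limit:

```c
/* Sketch: work out the suggested ARC cap, RAM / (2 * number of sockets).
 * The figures below match the 96GB, 2-socket machine described above and
 * are only an illustration of the arithmetic. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t total_ram = 96ULL << 30;  /* 96 GiB of RAM */
    unsigned sockets   = 2;            /* 2-socket NUMA system */

    uint64_t arc_max = total_ram / (2 * sockets);  /* 24 GiB */

    /* zfs_arc_max is given in bytes, e.g. in /etc/modprobe.d/zfs.conf:
     *   options zfs zfs_arc_max=<bytes> */
    printf("options zfs zfs_arc_max=%llu\n", (unsigned long long)arc_max);
    return 0;
}
```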

@ryao
Contributor Author

ryao commented Sep 30, 2012

For anyone that is not familiar with NUMA safety, the issue involves how synchronization is done. Typically, synchronization in SMP systems relies on cores locking the bus to enable atomic operations upon which synchronization mechanisms depend. Other cores will use a technique called bus snooping to know when this happens and avoid doing anything until the bus is unlocked. If other cores need to do memory operations while the bus is locked, they need to wait, which is known as bus contention. That causes SMP-scalability problems.

NUMA hardware attempts to avoid such problems by using a separate bus for each socket. That can improve performance by permitting groups of threads to avoid bus contention with unrelated groups by running them on different sockets, but it breaks kernel code that assumes that there is only one bus, which is what I mean when I talk about NUMA safety. Even code where some considerations for NUMA have been made will be broken if a single aspect of it lacks NUMA safety.

The original Solaris code was NUMA-safe, so this is a ZFSOnLinux-specific regression. It is possible that other ports have similar regressions, mainly because the developers likely aren't doing regression tests on NUMA-capable hardware.
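
To make the synchronization point above concrete, here is a minimal, hypothetical spinlock (not taken from the ZFS code) built on a compare-and-swap. On a classic single-bus SMP machine, the locked instruction behind the compare-and-swap is what other cores observe via bus snooping; NUMA machines have to provide the same atomicity across several buses through their cache-coherence protocol.

```c
/* Minimal illustration of bus-lock-based synchronization (not ZFS code).
 * __sync_bool_compare_and_swap compiles to a locked compare-and-exchange
 * on x86; on a single-bus SMP system that lock is what other cores see
 * via bus snooping, and contended cores must wait (bus contention). */
#include <stdio.h>

static volatile int lock_word = 0;

static void spin_lock(volatile int *l)
{
    /* Spin until we atomically change 0 -> 1. */
    while (!__sync_bool_compare_and_swap(l, 0, 1))
        ;  /* other CPUs busy-wait here */
}

static void spin_unlock(volatile int *l)
{
    __sync_lock_release(l);  /* store 0 with release semantics */
}

int main(void)
{
    spin_lock(&lock_word);
    printf("in the critical section\n");
    spin_unlock(&lock_word);
    return 0;
}
```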

@cwedgwood
Contributor

I think this is #922

@ryao
Contributor Author

ryao commented Oct 1, 2012

@cwedgwood This report is a description of an issue that could explain a vast number of reports, #922 included. In the case of #922, is NUMA involved?

@cwedgwood
Contributor

@ryao Yes, it's NUMA (sorry, I thought that was known).

Somewhere here or on IRC are details of things I tried, such as various zone reclaim changes or limiting the ARC to <50% of the smallest NUMA node's size (none of this works very well; at best it buys time).

@chrisrd
Contributor

chrisrd commented Oct 1, 2012

What are the causes and/or symptoms of NUMA non-safety? My naive understanding is that non-local memory is simply slower to access than node-local memory.

@cwedgwood
Contributor

@chrisrd Reclaim potentially works differently, and some resources are per-zone, so you can exhaust a local zone if you only consider global state.

That said, I'm not convinced the #922 issues are entirely NUMA-related; they seem to have more to do with memory pressure and ARC interaction.
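
As a rough way to see the per-zone accounting described above, here is a small sketch (assuming the usual Linux /proc/zoneinfo layout) that prints the free-page count of each zone on each node; a local zone can run dry even while the global free counter still looks healthy:

```c
/* Sketch: print per-node, per-zone free pages from /proc/zoneinfo.
 * This only illustrates that memory is tracked per zone and per node. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/zoneinfo", "r");
    if (!f) {
        perror("/proc/zoneinfo");
        return 1;
    }

    char line[256], zone[64] = "?";
    int node = -1;
    long free_pages;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Node %d, zone %63s", &node, zone) == 2)
            continue;                     /* remember which zone follows */
        if (sscanf(line, " pages free %ld", &free_pages) == 1)
            printf("node %d zone %-8s free pages %ld\n",
                   node, zone, free_pages);
    }
    fclose(f);
    return 0;
}
```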

@ryao
Contributor Author

ryao commented Oct 3, 2012

Would you verify that your issue still occurs if you disable all but one NUMA node?

@cwedgwood
Contributor

@ryao It breaks with NUMA disabled.

I don't think this is NUMA-specific; NUMA might make it easier to trigger in some cases, but it's certainly not a requirement.

@ryao
Contributor Author

ryao commented Oct 3, 2012

Would you describe how you disabled NUMA?

@cwedgwood
Contributor

#922 (comment)

Further details from that boot:

There is no SRAT table in this case.

[    0.000000] NUMA: Initialized distance table, cnt=1
[    0.000000] NUMA: Warning: node ids are out of bound, from=-1 to=-1 distance=10
[    0.000000] No NUMA configuration found

...

[    0.000000] On node 0 totalpages: 6289175
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 7 pages reserved
[    0.000000]   DMA zone: 3904 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 16320 pages used for memmap
[    0.000000]   DMA32 zone: 763856 pages, LIFO batch:31
[    0.000000]   Normal zone: 86016 pages used for memmap
[    0.000000]   Normal zone: 5419008 pages, LIFO batch:31

...

[    0.495826] Booting Node   0, Processors  #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
[    0.871420] Brought up 16 CPUs
[    0.874795] Total of 16 processors activated (76801.61 BogoMIPS).

...

@ryao
Contributor Author

ryao commented Oct 3, 2012

There are very few CPUs that have 16 logical threads, and if the system has multiple sockets, it is quite likely that they are not sharing the same memory bus.

With that said, would you tell me more about this system? Specifically, I would like the model and information on the parts that compose it. The motherboard model that dmidecode reports is probably the most important piece of information. In fact, the entire output of dmidecode, lspci and dmesg would be useful. Feel free to censor serial numbers that could be used for identification purposes; such information is useless for examining your issue.

@ryao
Contributor Author

ryao commented Oct 3, 2012

After talking with @behlendorf, I am no longer convinced that the presence of NUMA explains this. It seems that NUMA systems implement a cache coherence protocol in hardware that should prevent the scenario I thought might be happening.
