
ZFSOnLinux is not NUMA-safe #1000

Closed
ryao opened this issue Sep 30, 2012 · 15 comments
Labels
Type: Performance Performance improvement or performance problem
Milestone

Comments

@ryao
Contributor

ryao commented Sep 30, 2012

There have been reports of deadlocks with ZFS when systems have large quantities of RAM, such as issues #417, #695, #837 and #922. Until now, none of these reports mentioned whether or not NUMA hardware was involved.

MooingLemur reported in #zfsonlinux on the freenode IRC network that he is having deadlocks with 0.6.0-rc11 on a Gentoo Linux system that has 96GB of memory. He has an older Gentoo Linux system with 32GB of memory that has no deadlocks. The key difference between the two is that the newer system is a 2-socket NUMA system and the older one is not.

I have long suspected that the issues people with large quantities of memory reported were caused by NUMA-safety issues, but I have lacked access to NUMA hardware to examine that possibility. I am now confident that my suspicion is correct based on my discussion with MooingLemur.

As a volunteer, I will not have time to look into this for the remainder of the year. If I were to find time, I doubt that I would have access to NUMA hardware. I expect that professional developers working on the ZFSOnLinux code are in a much better position to debug this than I am, both time-wise and hardware-wise.

@prometheanfire
Contributor

I have a system we should be able to test with. It'll take a little bit of work (I have to install the second processor and an OS), but it's doable.

@atonkyra

I have gotten good stability running the latest ZFSOnLinux with a NUMA-aware kernel and machine when the ARC size is limited to 16G (64G total memory).

System has 2 CPU sockets in use.

@ryao
Contributor Author

ryao commented Sep 30, 2012

It sounds like reducing the maximum ARC size to RAM / (2 * number of sockets) might prevent issues, assuming that all processors have equal amounts of RAM. This agrees with past reports that claimed that reducing ARC size helped.
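
As a rough illustration of that formula (a sketch only, using MooingLemur's 96GB, 2-socket figures from above), the cap works out to 24GB; the zfs_arc_max module parameter, which takes a byte count, is the usual knob for applying such a limit:

```c
/* Sketch: work out the suggested ARC cap, RAM / (2 * number of sockets).
 * The figures below match the 96GB, 2-socket machine described above and
 * are only an illustration of the arithmetic. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t total_ram = 96ULL << 30;  /* 96 GiB of RAM */
    unsigned sockets   = 2;            /* 2-socket NUMA system */

    uint64_t arc_max = total_ram / (2 * sockets);  /* 24 GiB */

    /* zfs_arc_max is given in bytes, e.g. in /etc/modprobe.d/zfs.conf:
     *   options zfs zfs_arc_max=<bytes> */
    printf("options zfs zfs_arc_max=%llu\n", (unsigned long long)arc_max);
    return 0;
}
```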

@ryao
Contributor Author

ryao commented Sep 30, 2012

For anyone that is not familiar with NUMA safety, the issue involves how synchronization is done. Typically, synchronization in SMP systems relies on cores locking the bus to enable atomic operations upon which synchronization mechanisms depend. Other cores will use a technique called bus snooping to know when this happens and avoid doing anything until the bus is unlocked. If other cores need to do memory operations while the bus is locked, they need to wait, which is known as bus contention. That causes SMP-scalability problems.

NUMA hardware attempts to avoid such problems by using a separate bus for each socket. That can improve performance by permitting groups of threads to avoid bus contention with unrelated groups by running them on different sockets, but it breaks kernel code that assumes that there is only one bus, which is what I mean when I talk about NUMA safety. Even code where some considerations for NUMA have been made will be broken if a single aspect of it lacks NUMA safety.

The original Solaris code was NUMA-safe, so this is a ZFSOnLinux-specific regression. It is possible that other ports have similar regressions, mainly because the developers likely aren't doing regression tests on NUMA-capable hardware.
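
To make the synchronization point above concrete, here is a minimal, hypothetical spinlock (not taken from the ZFS code) built on a compare-and-swap. On a classic single-bus SMP machine, the locked instruction behind the compare-and-swap is what other cores observe via bus snooping; NUMA machines have to provide the same atomicity across several buses through their cache-coherence protocol.

```c
/* Minimal illustration of bus-lock-based synchronization (not ZFS code).
 * __sync_bool_compare_and_swap compiles to a locked compare-and-exchange
 * on x86; on a single-bus SMP system that lock is what other cores see
 * via bus snooping, and contended cores must wait (bus contention). */
#include <stdio.h>

static volatile int lock_word = 0;

static void spin_lock(volatile int *l)
{
    /* Spin until we atomically change 0 -> 1. */
    while (!__sync_bool_compare_and_swap(l, 0, 1))
        ;  /* other CPUs busy-wait here */
}

static void spin_unlock(volatile int *l)
{
    __sync_lock_release(l);  /* store 0 with release semantics */
}

int main(void)
{
    spin_lock(&lock_word);
    printf("in the critical section\n");
    spin_unlock(&lock_word);
    return 0;
}
```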

@cwedgwood
Contributor

I think this is #922

@ryao
Contributor Author

ryao commented Oct 1, 2012

@cwedgwood This report is a description of an issue that could explain a vast number of reports, #922 included. In the case of #922, is NUMA involved?

@cwedgwood
Contributor

@ryao Yes, it's NUMA (sorry, I thought that was known).

Somewhere here or on IRC are details of things I tried, such as various zone reclaim changes or limiting the ARC to <50% of the smallest NUMA node's size (none of this works very well; at best it buys time).

@chrisrd
Contributor

chrisrd commented Oct 1, 2012

What are the causes and/or symptoms of NUMA non-safety? My naive understanding is that non-local memory is simply slower to access than node-local memory.

@cwedgwood
Contributor

@chrisrd Reclaim potentially works differently, and some resources are per-zone, so you can exhaust a local zone if you only consider global state.

That said, I'm not convinced the #922 issues are entirely NUMA-related; they seem to have more to do with memory pressure and ARC interaction.
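
As a rough way to see the per-zone accounting described above, here is a small sketch (assuming the usual Linux /proc/zoneinfo layout) that prints the free-page count of each zone on each node; a local zone can run dry even while the global free counter still looks healthy:

```c
/* Sketch: print per-node, per-zone free pages from /proc/zoneinfo.
 * This only illustrates that memory is tracked per zone and per node. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/zoneinfo", "r");
    if (!f) {
        perror("/proc/zoneinfo");
        return 1;
    }

    char line[256], zone[64] = "?";
    int node = -1;
    long free_pages;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Node %d, zone %63s", &node, zone) == 2)
            continue;                     /* remember which zone follows */
        if (sscanf(line, " pages free %ld", &free_pages) == 1)
            printf("node %d zone %-8s free pages %ld\n",
                   node, zone, free_pages);
    }
    fclose(f);
    return 0;
}
```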

@ryao
Contributor Author

ryao commented Oct 3, 2012

Would you verify that your issue still occurs if you disable all but one NUMA node?

@cwedgwood
Contributor

@ryao It breaks with NUMA disabled.

I don't think this is NUMA-specific; NUMA might make it easier to trigger in some cases, but it's certainly not a requirement.

@ryao
Contributor Author

ryao commented Oct 3, 2012

Would you describe how you disabled NUMA?

@cwedgwood
Contributor

#922 (comment)

Further details from that boot:

There is no SRAT table in this case.

[    0.000000] NUMA: Initialized distance table, cnt=1
[    0.000000] NUMA: Warning: node ids are out of bound, from=-1 to=-1 distance=10
[    0.000000] No NUMA configuration found

...

[    0.000000] On node 0 totalpages: 6289175
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 7 pages reserved
[    0.000000]   DMA zone: 3904 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 16320 pages used for memmap
[    0.000000]   DMA32 zone: 763856 pages, LIFO batch:31
[    0.000000]   Normal zone: 86016 pages used for memmap
[    0.000000]   Normal zone: 5419008 pages, LIFO batch:31

...

[    0.495826] Booting Node   0, Processors  #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
[    0.871420] Brought up 16 CPUs
[    0.874795] Total of 16 processors activated (76801.61 BogoMIPS).

...

@ryao
Contributor Author

ryao commented Oct 3, 2012

There are very few CPUs that have 16 logical threads, and if the system has multiple sockets, it is quite likely that they are not sharing the same memory bus.

With that said, would you tell me more about this system? Specifically, I would like the model and information on the parts that compose it. The motherboard model that dmidecode reports is probably the most important piece of information. In fact, the entire output of dmidecode, lspci and dmesg would be useful. Feel free to censor serial numbers that could be used for identification purposes; such information is useless for examining your issue.

@ryao
Contributor Author

ryao commented Oct 3, 2012

After talking with @behlendorf, I am no longer convinced that the presence of NUMA explains this. It seems that NUMA systems implement a cache coherence protocol in hardware that should prevent the scenario I thought might be happening.
