arc_reclaim pegs CPU and blocks I/O #7559
Comments
I have seen this doing scrubs.
Problem still seen on kernel 4.16.14 with ZFS 0.7.9 and ARC increased to 32G; after 18 days, disk iowait suddenly increased exponentially (with almost no disk I/O according to iostat), the kernel decided arbitrary processes (like systemd units) were non-responsive (because they were stalled waiting on I/O) and began semi-randomly restarting them; hilarity ensued.

Unlike prior instances where sending 2 to drop_caches worked and restored order as soon as the flush completed, this time it didn't seem to have any effect (might not have waited long enough - perhaps things simply got too bad?) - but resorting to 3 got things back into a sane state, and ARC appeared to happily maintain itself at size 'c' again after it re-populated.

The hope was that constraining ARC to a relatively small size for a fairly large (~20T) pool was the issue, but 32G (on a 256G host) doesn't seem unreasonable. Hopefully the following, taken at the time, will be useful (more info - like all of /proc/spl/kstat/zfs/* captured at intervals while the issue was ongoing - is available if useful): flamegraph: https://sendeyo.com/up/d/842b42d155

I observe (perhaps naively) that these symptoms seem similar to #4345, reported against 0.6.4 and resolved as "should be fixed in 0.7.0 series".
Any update? I have the same issue on 0.7.11, latest Debian Stretch stable kernel, ARC size 48G.
You may find the discussion in https://www.bountysource.com/issues/46175251-arc_prune-high-load-and-soft-lockups and/or #3303 relevant; that is, check if you're running into
I have the same issue on 0.7.12 on Debian 9, kernel "4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19)". On a server with low activity (it has one NFS mount, and the average access generates 1.7 MB/s write and 883 kB/s read network traffic; no local access happens), the dnode usage reaches arc_dnode_limit, then arc_reclaim starts and crushes performance to the point that even this minimal access no longer works (up to 1 minute to write a 1 MB file).
Setting zfs_arc_dnode_limit_percent to 75 or 99% doesn't change anything; arc_reclaim is still destroying the I/O.
The only fix that worked is to drop all caches:
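(Presumably this is the same drop_caches interface described in the original report further down, along the lines of:)

```sh
# Drop reclaimable dentries/inodes (slab); "3" also drops the page cache
# and, per the original report below, flushes ARC contents as well.
echo 2 > /proc/sys/vm/drop_caches
# or, more aggressively:
echo 3 > /proc/sys/vm/drop_caches
```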
The following graph shows the issue more visually; the arc_reclaim issue begins at 16:17 and the drop_caches is at 16:57.
Same problem, using Ubuntu 18.04 with ZFS 0.7.5, kernel 4.15.18.
fixed the problem for the moment
What do you guys think about this? I think that could be the cause, but I'm not 100% sure. It would make sense to me...
Just hit what appears to be the same issue, on an NFS server with SSDs (Ubuntu 18.04, kernel 4.15.0-99). This server is used for temporary files, and it gets hit very hard with NFS traffic over a 10G network. It has 12TB of available storage (6 x 4TB SSD, as 3 mirrored vdevs), and 32GB of RAM.
Tried:
but still no joy. We had to reboot it. The previous post from @snajpa suggests it might be a simple typo in the source code - can any developer comment on this?
That previous post from @snajpa referencing vpsfreecz/zfs@3cad19fd5927 certainly looks plausible. It refers to code introduced by @behlendorf and @dweeezil in commit 25458cb "Limit the amount of dnode metadata in the ARC":
@snajpa's possible fix is:
I haven't tried to work out exactly what is supposed to be happening there, but the general tone of that commit indicates this may indeed be a correct fix. @behlendorf and @dweeezil - can you take a look?
Wanted to clarify that the only reason for our constraining ARC was historical. In 0.6.x, despite a suitable value for zfs_arc_shrink_shift, ARC would fail to respond to system memory pressure fast enough - short-lived processes would start, try to grab a lot of memory, and OOM. Constraining ARC to 15G left them room - and also had the effect that ARC would periodically fill up only to collapse to zero. I believe this was a bug - but operationally pretty benign, other than a minor performance penalty.

In 0.7.x, ARC collapse was never observed - but only one specific workload pegged arc_reclaim. We stopped setting zfs_arc_max in 0.7.x, and it was reactive enough that there were no more OOMs. The original host that prompted this report (and many more with the exact same workload) logged multiple hundreds of days of uptime afterwards with no interventions. Which unfortunately won't help anyone with a machine doing this because ARC has used everything and there's no more to give it. 32G of RAM certainly seems like it should be reasonable for an NFS server...

Hadn't looked at this ticket in years; I'm not in a position to test the possible fix from @snajpa, or I would. ZFS is still my favorite filesystem (even on my laptop), so it'd be great to get this sorted; it seems likely the issue will continue to bite some combination of "low" ARC/high load systems.
So yeah, the bug is actually a typo; our servers haven't experienced this problem since deploying that ^^ fix. If you look at what the condition is talking about, it deals with dnode sizes, which are metadata, and the path then goes on to evict the buffer. So the condition should IMO test for the metadata buffer type if we want to evict metadata there :)
@snajpa - how about submitting a patch? Not related to this fix, but perhaps to the overall issue: calling an async freeing function in a loop whilst trying to free up a specified amount of space seems problematic - it seems like we could fire off a whole bunch of
@chrisrd if a day had like 48 hours, I'd already have submitted that (in fact there are more things in the queue waiting). Feel free to pick this up (and take credit, no problem, just get it done, someone :D)
FYI, we had the same issue occur again with the same machine as before. We have prometheus node_exporter now, which collects ZFS stats, so I'm able to see how arc_meta_used was increasing over time. This system has 32GB RAM, and it has chosen about 12.5G as the arc_meta_limit:
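For anyone wanting to check the same values directly on a host, they can be read from the ARC kstats (a rough sketch; field names are as I recall them from 0.7/0.8 and worth verifying against your own arcstats):

```sh
# Show current ARC metadata and dnode usage vs. their limits (values in bytes)
awk '$1 == "arc_meta_used" || $1 == "arc_meta_limit" || \
     $1 == "dnode_size"    || $1 == "arc_dnode_limit" {print $1, $3}' \
    /proc/spl/kstat/zfs/arcstats
```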
Right now it's hovering around 8-9GB used. However, prior to the problem it kept increasing and increasing, until it flatlined around 16GB and the system became so bogged down we had to reboot it. (Tried dropping caches the last time, and that didn't work.)

Comparing the same query across machines, I notice this occasionally happens for a couple of other machines, but they managed to correct themselves and didn't need rebooting. Those other machines are also NFS servers, but they are under much lower load than the one which needed rebooting.

FWIW, those two machines ("monster21" and "monster22" in the graphs) are running Ubuntu 19.10, kernel 5.3.0-26, ZOL 0.8.1. The one which locked up ("storage14") is running Ubuntu 18.04, kernel 4.15.0-106, ZOL 0.7.5.

Anyway, I'm not sure how much use this will be in helping to fix the underlying problem, but at least I now have a prometheus query that I can alert on. Then hopefully I can give the affected system some careful observation before it becomes too serious.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
How's the state of things related to this issue? Are you still running into this?
Yes, I am still getting these alerts occasionally for
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
Under heavy usage, especially with a limited ARC (we normally run 15G in production, on hosts with 256G total), arc_reclaim CPU usage can climb to 100% (with many arc_prune threads using ~30% apiece) and stay there.
When the ARC is being stressed, arc_reclaim CPU load begins low and appears to dynamically rise and fall - however, under some conditions it can begin to trend upwards over time. While in this state, the ARC size reported by arcstat is always at or below zfs_arc_max - it may exceed 100% of that value for a single sample.
Eventually, arc_reclaim can peg at 100% CPU, with many arc_prune threads showing ~30% CPU apiece. When it's doing this, arcstat always seems to show the ARC at or slightly above the configured value for zfs_arc_max, never below.
For a while, there may be no apparent bad effect (other than disk writes taking slightly longer than expected). However, if left like this for some time, there is a point at which application disk write times abruptly climb from ~2000ms to 10-18s. At the same time, disk I/O as shown by iostat drops off to almost nothing and individual disk %util is usually <1%...
It looks to be blocking on a resource - and the system's writes back up behind it.
If caught early enough and some load removed, arc_reclaim seems to respond and CPU usage will drop back to less concerning (but still somewhat high) figures, ~25-35% CPU (with ~10% for arc_prune threads).
However, if left long enough (some hours) in the pegged state, arc_reclaim seems to lose the ability to recover on its own entirely - it will remain pegged at 100% for multiple hours after the system is removed from rotation, thus removing all system load.
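Not part of the original report, but a rough sketch of how one might watch for this state on a host (kstat field names as on 0.7.x; worth verifying on your version):

```sh
# Current ARC size vs. target vs. configured maximum (values in bytes)
awk '$1 == "size" || $1 == "c" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats

# CPU usage of the ARC reclaim/prune threads
top -b -n 1 | grep -E 'arc_reclaim|arc_prune'
```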
Describe how to reproduce the problem
This can be problematic to reproduce; the issue was observed on a production host that had been running apparently perfectly happily for 15 days prior... It does seem to be possible to (sometimes) reproduce the symptoms after a few dozen hours by
a) constraining ARC lower than usual (e.g. 4G),
b) running a somewhat higher (+25% at least) than normal save load,
c) running a "normal" query load (includes deep/wide directory traversals with a client that mmap()'s files >1MB in size, of which there are many) - and
d) aggravating things with multiple arbitrary rsyncs & recursive find/deletes for good measure (a rough sketch of this kind of load is included below).
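For what it's worth, a sketch of the kind of aggravating load described in (d); the paths and counts are made up purely for illustration:

```sh
#!/bin/sh
# Hypothetical load generator: several concurrent rsyncs plus a recursive
# find/delete against scratch directories on the pool (adjust paths to taste).
SRC=/tank/data          # assumed source dataset
SCRATCH=/tank/scratch   # assumed scratch dataset

for i in 1 2 3 4; do
    rsync -a "$SRC/" "$SCRATCH/copy$i/" &
done

# Recursive traversal + delete to churn dentries/inodes while the rsyncs run
find "$SCRATCH" -type f -name '*.tmp' -delete &

wait
```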
Host showing the issue is running 4 x RAID-Z1 vdevs consisting of 6 SSDs apiece, gzip compression enabled.
We have other hosts with identical hardware (albeit very different workloads/RAID organization e.g. RAID-10 with lz4 compression) running the exact same PXE-booted OS image for nearly two months that have not exhibited any issues - but it does seem to be workload-related.
Describe how to workaround the problem
Increasing zfs_arc_max makes arc_reclaim/arc_prune drop out of top - but leaves the nagging doubt that the issue might simply recur once the ARC re-fills to the new size.
Which is exactly what seems to happen if you increase it only a little (e.g. by 1-2G on an already artificially low setting): as soon as the ARC fills, arc_reclaim re-appears in top with high CPU usage, which then starts to climb, etc.
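For reference, zfs_arc_max can be raised at runtime via the module parameter (a sketch only; the value is in bytes, and 32G is used here purely as an example):

```sh
# Raise the ARC size cap to 32 GiB without rebooting (requires root)
echo $((32 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max

# Confirm the ARC picked up the new target maximum
awk '$1 == "c_max" {print $3}' /proc/spl/kstat/zfs/arcstats
```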
echo 1 > /proc/sys/vm/drop_caches: has no apparent effect.
echo 2 > /proc/sys/vm/drop_caches: makes arc_reclaim vanish from top output entirely. The ARC appears to be happily maintained at or below zfs_arc_max with expected low CPU usage thereafter.
echo 3 > /proc/sys/vm/drop_caches: behaves as for "echo 2", except this flushes ARC contents as well. Again, arc_reclaim seems to behave as expected after the ARC gets back up to zfs_arc_max in size.
Include any warning/errors/backtraces from the system logs
The system logs nothing of interest (e.g. no stack traces), although application monitoring/alerting gets exercised based on disk write/query times.