arc_reclaim pegs CPU and blocks I/O #7559
Comments
I have seen this doing scrubs.
Problem still seen on kernel 4.16.14 with ZFS 0.7.9 and ARC increased to 32G; after 18 days, disk iowait suddenly increased exponentially (with almost no disk I/O according to iostat), the kernel decided arbitrary processes (like systemd units) were non-responsive (because they were stalled waiting on I/O) and began semi-randomly restarting them; hilarity ensued.

Unlike prior instances where sending 2 to drop_caches worked and restored order as soon as the flush completed, this time it didn't seem to have any effect (might not have waited long enough - perhaps things simply got too bad?) - but resorting to 3 got things back into a sane state, and ARC appeared to happily maintain itself at size 'c' again after it re-populated.

The hope was that constraining ARC to a relatively small size for a fairly large (~20T) pool was the issue, but 32G (on a 256G host) doesn't seem unreasonable. Hopefully the following, taken at the time, will be useful (more info - like all of /proc/spl/kstat/zfs/* captured at intervals while the issue was ongoing - is available if useful): flamegraph: https://sendeyo.com/up/d/842b42d155

I observe (perhaps naively) that these symptoms seem similar to #4345, reported against 0.6.4 and resolved as "should be fixed in 0.7.0 series".
Any update? I have the same issue on 0.7.11, latest Debian Stretch stable kernel, ARC size 48G.
You may find the discussion in https://www.bountysource.com/issues/46175251-arc_prune-high-load-and-soft-lockups and/or #3303 relevant; that is, check if you're running into
I have the same issue on 0.7.12 on Debian 9, kernel "4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19)". On a server with low activity (it has one NFS mount, and the average access generates 1.7 MB/s write and 883 kB/s read network traffic; no local access happens), the dnode usage reaches arc_dnode_limit, then arc_reclaim starts and crushes performance to the point that even this minimal access no longer works (up to 1 minute to write a 1 MB file).
Setting zfs_arc_dnode_limit_percent to 75 or 99% doesn't change anything; arc_reclaim is still destroying the I/O.
The only fix that worked is to drop all caches:
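(Presumably this is the same drop_caches interface described in the original report further down, along the lines of:)

```sh
# Drop reclaimable dentries/inodes (slab); "3" also drops the page cache
# and, per the original report below, flushes ARC contents as well.
echo 2 > /proc/sys/vm/drop_caches
# or, more aggressively:
echo 3 > /proc/sys/vm/drop_caches
```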
The following graph shows the issue more visually; the arc_reclaim issue begins at 16:17 and the drop_caches is at 16:57.
Same problem, using Ubuntu 18.04 with ZFS 0.7.5, kernel 4.15.18.
fixed the problem for the moment
What do you guys think about this? I think that could be the cause, but I'm not 100% sure. It would make sense to me...
Just hit what appears to be the same issue, on an NFS server with SSDs (Ubuntu 18.04, kernel 4.15.0-99). This server is used for temporary files, and it gets hit very hard with NFS traffic over a 10G network. It has 12TB of available storage (6 x 4TB SSD, as 3 mirrored vdevs), and 32GB of RAM.
Tried:
but still no joy. We had to reboot it. The previous post from @snajpa suggests it might be a simple typo in the source code - can any developer comment on this?
That previous post from @snajpa referencing vpsfreecz/zfs@3cad19fd5927 certainly looks plausible. It refers to code introduced by @behlendorf and @dweeezil in commit 25458cb "Limit the amount of dnode metadata in the ARC":
@snajpa's possible fix is:
I haven't tried to work out exactly what is supposed to be happening there, but the general tone of that commit indicates this may indeed be a correct fix. @behlendorf and @dweeezil - can you take a look?
Wanted to clarify that the only reason for our constraining ARC was historical. In 0.6.x, despite a suitable value for zfs_arc_shrink_shift, ARC would fail to respond to system memory pressure fast enough - short-lived processes would start, try to grab a lot of memory, and OOM. Constraining ARC to 15G left them room - and also had the effect that ARC would periodically fill up only to collapse to zero. I believe this was a bug - but operationally pretty benign, other than a minor performance penalty.

In 0.7.x, ARC collapse was never observed - but only one specific workload pegged arc_reclaim. We stopped setting zfs_arc_max in 0.7.x, and it was reactive enough that there were no more OOMs. The original host that prompted this report (and many more with the exact same workload) logged multiple hundreds of days of uptime afterwards with no interventions. Which unfortunately won't help anyone with a machine doing this because ARC has used everything and there's no more to give it. 32G of RAM certainly seems like it should be reasonable for an NFS server...

Hadn't looked at this ticket in years; I'm not in a position to test the possible fix from @snajpa, or I would. ZFS is still my favorite filesystem (even on my laptop), so it'd be great to get this sorted; it seems likely the issue will continue to bite some combination of "low" ARC/high load systems.
So yeah, the bug is actually a typo; our servers haven't experienced this problem since deploying that ^^ fix. If you look at what the condition is talking about, it deals with dnode sizes, which are metadata, and the path then goes on to evict the buffer. So the condition should IMO test for the metadata buffer type if we want to evict metadata there :)
@snajpa - how about submitting a patch? Not related to this fix, but perhaps to the overall issue: calling an async freeing function in a loop whilst trying to free up a specified amount of space seems problematic - it seems like we could fire off a whole bunch of
@chrisrd if a day had like 48 hours, I'd already have submitted that (in fact there are more things in the queue waiting). Feel free to pick this up (and take credit, no problem, just get it done, someone :D)
FYI, we had the same issue occur again with the same machine as before. We have prometheus node_exporter now, which collects ZFS stats, so I'm able to see how arc_meta_used was increasing over time. This system has 32GB RAM, and it has chosen about 12.5G as the arc_meta_limit:
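For anyone wanting to check the same values directly on a host, they can be read from the ARC kstats (a rough sketch; field names are as I recall them from 0.7/0.8 and worth verifying against your own arcstats):

```sh
# Show current ARC metadata and dnode usage vs. their limits (values in bytes)
awk '$1 == "arc_meta_used" || $1 == "arc_meta_limit" || \
     $1 == "dnode_size"    || $1 == "arc_dnode_limit" {print $1, $3}' \
    /proc/spl/kstat/zfs/arcstats
```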
Right now it's hovering around 8-9GB used. However, prior to the problem it kept increasing and increasing, until it flatlined around 16GB and the system became so bogged down we had to reboot it. (Tried dropping caches the last time, and that didn't work.)

Comparing the same query across machines, I notice this occasionally happens for a couple of other machines, but they managed to correct themselves and didn't need rebooting. Those other machines are also NFS servers, but they are under much lower load than the one which needed rebooting.

FWIW, those two machines ("monster21" and "monster22" in the graphs) are running Ubuntu 19.10, kernel 5.3.0-26, ZOL 0.8.1. The one which locked up ("storage14") is running Ubuntu 18.04, kernel 4.15.0-106, ZOL 0.7.5.

Anyway, I'm not sure how much use this will be in helping to fix the underlying problem, but at least I now have a prometheus query that I can alert on. Then hopefully I can give the affected system some careful observation before it becomes too serious.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
How's the state of things related to this issue? Are you still running into this?
Yes, I am still getting these alerts occasionally for
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
Under heavy usage, especially with a limited ARC (we normally run 15G in production, on hosts with 256G total), arc_reclaim CPU usage can climb to 100% (with many arc_prune threads using ~30% apiece) and stay there.
When the ARC is being stressed, arc_reclaim CPU load begins low and appears to dynamically rise and fall - however, under some conditions it can begin to trend upwards over time. While in this state, the ARC size reported by arcstat is always at or below zfs_arc_max - it may exceed 100% of that value for a single sample.
Eventually, arc_reclaim can peg at 100% CPU, with many arc_prune threads showing ~30% CPU apiece. When it's doing this, arcstat always seems to show the ARC at or slightly above the configured value for zfs_arc_max, never below.
For a while, there may be no apparent bad effect (other than disk writes taking slightly longer than expected). However, if left like this for some time, there is a point at which application disk write times abruptly climb from ~2000ms to 10-18s. At the same time, disk I/O as shown by iostat drops off to almost nothing and individual disk %util is usually <1%...
It looks to be blocking on a resource - and the system's writes back up behind it.
If caught early enough and some load removed, arc_reclaim seems to respond and CPU usage will drop back to less concerning (but still somewhat high) figures, ~25-35% CPU (with ~10% for arc_prune threads).
However, if left long enough (some hours) in the pegged state, arc_reclaim seems to lose the ability to recover on its own entirely - it will remain pegged at 100% for multiple hours after the system is removed from rotation, thus removing all system load.
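Not part of the original report, but a rough sketch of how one might watch for this state on a host (kstat field names as on 0.7.x; worth verifying on your version):

```sh
# Current ARC size vs. target vs. configured maximum (values in bytes)
awk '$1 == "size" || $1 == "c" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats

# CPU usage of the ARC reclaim/prune threads
top -b -n 1 | grep -E 'arc_reclaim|arc_prune'
```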
Describe how to reproduce the problem
This can be problematic to reproduce; the issue was observed on a production host that had been running apparently perfectly happily for 15 days prior... It does seem to be possible to (sometimes) reproduce the symptoms after a few dozen hours by
a) constraining ARC lower than usual (e.g. 4G),
b) running a somewhat higher (+25% at least) than normal save load,
c) running a "normal" query load (includes deep/wide directory traversals with a client that mmap()'s files >1MB in size, of which there are many) - and
d) aggravating things with multiple arbitrary rsyncs & recursive find/deletes for good measure (a rough sketch of this kind of load is included below).
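For what it's worth, a sketch of the kind of aggravating load described in (d); the paths and counts are made up purely for illustration:

```sh
#!/bin/sh
# Hypothetical load generator: several concurrent rsyncs plus a recursive
# find/delete against scratch directories on the pool (adjust paths to taste).
SRC=/tank/data          # assumed source dataset
SCRATCH=/tank/scratch   # assumed scratch dataset

for i in 1 2 3 4; do
    rsync -a "$SRC/" "$SCRATCH/copy$i/" &
done

# Recursive traversal + delete to churn dentries/inodes while the rsyncs run
find "$SCRATCH" -type f -name '*.tmp' -delete &

wait
```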
Host showing the issue is running 4 x RAID-Z1 vdevs consisting of 6 SSDs apiece, gzip compression enabled.
We have other hosts with identical hardware (albeit very different workloads/RAID organization e.g. RAID-10 with lz4 compression) running the exact same PXE-booted OS image for nearly two months that have not exhibited any issues - but it does seem to be workload-related.
Describe how to workaround the problem
Increasing zfs_arc_max makes arc_reclaim/arc_prune drop out of top - but leaves the nagging doubt that the issue might simply recur once the ARC re-fills to the new size.
Which is exactly what seems to happen if you increase it only a little (e.g. by 1-2G on an already artificially low setting): as soon as the ARC fills, arc_reclaim re-appears in top with high CPU usage, which then starts to climb, etc.
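For reference, zfs_arc_max can be raised at runtime via the module parameter (a sketch only; the value is in bytes, and 32G is used here purely as an example):

```sh
# Raise the ARC size cap to 32 GiB without rebooting (requires root)
echo $((32 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max

# Confirm the ARC picked up the new target maximum
awk '$1 == "c_max" {print $3}' /proc/spl/kstat/zfs/arcstats
```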
echo 1 > /proc/sys/vm/drop_caches: has no apparent effect.
echo 2 > /proc/sys/vm/drop_caches: makes arc_reclaim vanish from top output entirely. The ARC appears to be happily maintained at or below zfs_arc_max with expected low CPU usage thereafter.
echo 3 > /proc/sys/vm/drop_caches: behaves as for "echo 2", except this flushes ARC contents as well. Again, arc_reclaim seems to behave as expected after the ARC gets back up to zfs_arc_max in size.
Include any warning/errors/backtraces from the system logs
The system logs nothing of interest (e.g. no stack traces), although application monitoring/alerting gets exercised based on disk write/query times.