arc_prune eats the whole CPU core [0.6.5.4] #4345
@validname Hi, can you please post the output of
some hardware specs and the values for
@kernelOfTruth OK.
Actually, we've set only
@validname How big is your L2ARC? What additional filesystems do you use? What use case are we talking about? Any reason why you chose such a small ARC size? Is there output in dmesg (the kernel log)?
2+3.
All processes mentioned above are either absent (they were killed by the LXC utils) or stuck in the 'D' state now.
@validname Thanks for the function graph trace, that sheds considerable light on the problem. You're definitely repeatedly hitting the

I suspect you may have run afoul of a deadlock in the eviction path which was recently fixed in master, 3b9fd93. The fix has already been cherry-picked into the zfs-0.6.5-release branch and will appear in the next zfs tag. If this was just a one-off, I'd suggest waiting for the new tag in a week or two. If you're able to trigger this more easily, then you may want to cherry-pick the fix and verify it resolves the issue.
@behlendorf Thanks for the answer!
Unfortunately, we don't know what exactly triggers this problem. But we have several servers with nearly the same configuration and load, so I'll just set up ZFS with the fix you mentioned on half of them.
I also seem to be hitting this bug. This is about as much as I could get out of the system, as the whole machine becomes completely unresponsive (I couldn't even open a new terminal window or start programs; root is on ZFS). echo 3 > /proc/sys/vm/drop_caches returned nearly immediately with no memory actually being freed. There was a scrub running when the problem began to manifest, so this may contribute to the problem without being a root cause.

I should also probably point out that this happens fairly regularly on this machine, which is NOT in 24/7 use (it gets switched off when not used most of the time), but when it does get used it is usually under a very heavy load.

Spec:

Two pools:

I am reasonably certain this wasn't happening last year, so it may be interesting to find the version where this issue first started arising.
@validname I think @behlendorf meant
Am I maybe hitting some issue related to this? arc_prune is eating a core, and many (maybe all?) processes accessing this filesystem are hung. zpool iostat shows basically no I/O on the pool.
Looking at the work going on in #4850 ... is there a workaround for production systems right now by setting some ARC parameters? I have a few systems that keep falling into this.
@dweeezil: I'm testing out your patch at https://github.com/dweeezil/zfs/tree/arc-dnode-limit-0.6.5 and seeing similar breakage still.
Metadata-intensive workloads can cause the ARC to become permanently filled with dnode_t objects as they're pinned by the VFS layer. Subsequent data-intensive workloads may only benefit from about 25% of the potential ARC (arc_c_max - arc_meta_limit).

In order to help track metadata usage more precisely, the other_size metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size.

The new zfs_arc_dnode_limit tunable, which defaults to 10% of zfs_arc_meta_limit, defines the minimum number of bytes which is desirable to be consumed by dnodes. Attempts to evict non-metadata will trigger async prune tasks if the space used by dnodes exceeds this limit.

The new zfs_arc_dnode_reduce_percent tunable specifies the amount by which the excess dnode space is attempted to be pruned as a percentage of the amount by which zfs_arc_dnode_limit is being exceeded. By default, it tries to unpin 10% of the dnodes.

The problem of dnode metadata pinning was observed with the following testing procedure (in this example, zfs_arc_max is set to 4GiB):

- Create a large number of small files until arc_meta_used exceeds arc_meta_limit (3GiB with default tuning) and arc_prune starts increasing.
- Create a 3GiB file with dd. Observe arc_meta_used. It will still be around 3GiB.
- Repeatedly read the 3GiB file and observe arc_meta_used as before. It will continue to stay around 3GiB.

With this modification, space for the 3GiB file is gradually made available as subsequent demands on the ARC are made. The previous behavior can be restored by setting zfs_arc_dnode_limit to the same value as zfs_arc_meta_limit.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#4345
Issue openzfs#4512
Issue openzfs#4773
Closes openzfs#4858
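For readers following along, a minimal sketch of how the arcstats and tunables named in that commit message can be inspected and adjusted at runtime; paths assume a standard ZFS on Linux install, and the 1 GiB value is only an illustration:

```sh
# Current metadata/dnode accounting in the ARC (name and value columns).
awk '$1 ~ /^(arc_meta_used|arc_meta_limit|dnode_size|dbuf_size|bonus_size)$/ {print $1, $3}' \
    /proc/spl/kstat/zfs/arcstats

# The tunables added by the patch (bytes / percent).
cat /sys/module/zfs/parameters/zfs_arc_dnode_limit
cat /sys/module/zfs/parameters/zfs_arc_dnode_reduce_percent

# Example only: cap dnode space at 1 GiB so async pruning kicks in earlier.
echo $((1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_dnode_limit
```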
Is there any additional debugging I could provide to help? I am still hitting this constantly on https://github.com/dweeezil/zfs/tree/arc-dnode-limit-0.6.5
@logan2211 A good start would be the arcstats when running with the dnode limiting patch. I referenced this issue in the patch because your original arcstats showed the characteristics of a problem which it would fix. The extra arcstats it provides might help determine what's happening. (Note: I'm "off the grid" until August 15th, so will likely not have a chance to pursue this further until then.)
Sorry, I forgot to include that! Thanks for the reply. This is on a system running 0.6.5.7-1 from your branch that is spinning arc_prune right now.
@logan2211 Those arcstats do show a dnode metadata jam-up. You mentioned Gluster, which I think might be a big xattr user. If so, are you using
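For reference, switching a dataset to SA-based xattrs (which the next comment confirms was the suggestion here) is a one-line property change; the dataset name below is a placeholder, and the change only affects files written afterwards:

```sh
# Store xattrs as system attributes in the dnode instead of hidden directories;
# this changes the metadata footprint for xattr-heavy workloads like Gluster.
zfs set xattr=sa tank/gluster-brick
zfs get xattr tank/gluster-brick
# Note: existing files keep their old xattr layout until they are rewritten.
```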
Thanks for the info. I was not using xattr=sa, but I have enabled it now and deleted/resynced a lot of data into the pools. I also increased zfs_arc_dnode_reduce_percent to 100. I was still seeing the problem after doing these things, and one of the patches you mentioned did not merge cleanly into your arc-dnode-limit-0.6.5 branch, so what I did a few days ago was build a module from zfs/master (e35c5a8), and since then I have not had any problems. I am considering removing the zfs_arc_dnode_reduce_percent override on one of the nodes to see if things continue to run smoothly. It seems like whatever was causing it may be fixed in master, though!
I do still get some serious [arc_prune] CPU usage if I set zfs_arc_max to only 4GB, even with dweeezil@650f6cf applied on my Gentoo box with ZFS 0.6.5.8. If I raise the bar to 8GB zfs_arc_max with a 6GB arc_meta_limit, it doesn't happen. My system has a workload mix of heavy metadata usage and some streaming reads. Here is my system status 50 minutes after boot:
@AndCycle Actually, that seems to be normal behaviour for a system with limited resources (like memory), and arc_prune just prunes obsolete cache entries. What happens if you increase the limits (both zfs_arc_max and arc_meta_limit) while arc_prune is consuming CPU? Do you have hard disk activity when this happens?
@validname If I increase zfs_arc_max and arc_meta_limit, it doesn't consume that much CPU, just a few percent. My hard disk is still active, but the system isn't completely fine with this: I can observe some hiccups in Munin, as there are empty data points. I think this is just one step away from pushing it over the limit. So my case doesn't really fit your situation; the only common part here is the high arc_prune CPU usage.
Can someone verify that this is the case? Has anyone been able to reproduce this issue in master? How about 0.6.5.8?
@validname Nope, not really normal, because it's a busy loop without actually doing anything. After playing around on my system, I found out it's probably caused by inotify or some related stuff. I have the CrashPlan backup service on my machine, which constantly scans through the system; once I stop CrashPlan, dnode_size drops immediately. My guess is that inotify pins the metadata, which then can't be released at all in my scenario.
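A quick way to check this observation on a similar system; the service name below is an assumption (CrashPlan's Linux installer usually ships a crashplan service/init script), and the 30-second pause is arbitrary:

```sh
# dnode_size before stopping the watcher.
grep ^dnode_size /proc/spl/kstat/zfs/arcstats

# Stop the backup/indexing service (unit name is an assumption).
systemctl stop crashplan 2>/dev/null || /etc/init.d/crashplan stop

sleep 30

# dnode_size afterwards; in the scenario described above it drops immediately.
grep ^dnode_size /proc/spl/kstat/zfs/arcstats
```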
Interesting. I suspect SELinux's restorecond may also be doing a similar thing, albeit not as extensively as CrashPlan or lsyncd.
@AndCycle It seems an "inotify watch everything" workload would call for increasing
If the dnodes are being wedged in an unevictable state in the ARC and can't be freed while the inotify watches are in place, won't upping zfs_arc_dnode_limit_percent to 100 only postpone the inevitable? Especially in cases where the ARC size is very restricted.
@dweeezil I tried increasing zfs_arc_dnode_limit_percent to 100, with no effect. Here are my current system values. I increased max_user_watches as CrashPlan requested, due to the storage size.
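For anyone following along, these are the two knobs being discussed; zfs_arc_dnode_limit_percent exists on builds carrying the dnode-limit patch, and the watch count below is purely illustrative, not a recommendation:

```sh
# Let dnodes use up to 100% of arc_meta_limit before async pruning triggers.
echo 100 > /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
cat /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent

# Raise the inotify watch budget (persist it in /etc/sysctl.d/ if it should
# survive reboots).
sysctl -w fs.inotify.max_user_watches=1048576
```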
@AndCycle Unfortunately, it's definitely possible to cause Bad Things to happen with ZoL if you can pin enough metadata, and it seems the idea of inotifying everything can do just that. You might be able to mitigate the issue a bit by setting

I'm not very familiar with these filesystem notification schemes within the kernel. Is this program trying to watch every single file? Or just every single directory? Or both? I can't imagine this ever working very well on a very large filesystem with, say, millions or billions of files/directories. It doesn't seem very scalable.

One thing that comes to mind is that we're using
@dweeezil Yeah, it's not very scalable.
If you want to recreate the scenario, get inotify-tools and make it watch some place with a massive directory structure, as sketched below;
you will easily end up with tons of dnode usage pinned inside the ARC.
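A rough reproducer along those lines, assuming inotify-tools is installed and using a placeholder path for a large directory tree on a ZFS dataset:

```sh
# Recursively watch a large tree; the recursive setup alone touches every
# directory, so dnodes get pinned by the watches.
inotifywait -m -r /tank/huge-tree > /dev/null 2>&1 &

# In another shell, watch the pinned dnode metadata grow in the ARC.
watch -n 5 "grep -E '^(dnode_size|arc_meta_used|arc_meta_limit)' /proc/spl/kstat/zfs/arcstats"
```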
Sorry for the very long delay in answering. And sorry for the bad news.
The test server worked for 52 days (vs. 30 days on average before) under its usual workload, and then it happened again. Again, one CPU core was eaten by one arc_prune kernel process.
Hi! I seem to be having the same issue. It is on

After about two weeks of system uptime and ~30 minutes into the nightly rsync that copies data from the zfs

arc_prune goes to 100% on one CPU core and stays there until reboot. Right now arc_prune has been running like this for about 12 hours. As opposed to #4726, I only have one arc_prune process running wild, and no wild arc_reclaims. Am I suffering from the same issue?
I've got some info from another stalled server with a 4.3.3 kernel and zfs 0.6.5.4 (1ffc4c1) plus the patch from 650f6cf. As usual, increasing the ARC size didn't help; only a reboot did.
dnode_size is far away from its limit of 4GB, but arc_meta_used is at its limit.
Any progress on this issue? We have servers with zfs 0.6.5.4 and the aforementioned patches that get stuck every two weeks (the pattern is a high write rate into systemd-journald). Is there any kind of additional debugging info that I can provide to help identify the bug?
@seletskiy Could you please open a new issue with specifics from your system when the problem is occurring (see https://github.com/zfsonlinux/zfs/wiki/FAQ#reporting-a-problem) and also the arcstats.

@validname In the gist you posted on Dec 5 2016, the

Something else I'd like to point out to anyone seeing

Finally, anyone experiencing this issue should try setting
@cserem @seletskiy @validname Thanks in advance!
Closing. This issue should be resolved in the 0.7 series.
Hello!
Right now we have a server on which [arc_prune] constantly eats CPU and a lot of processes are stuck in the 'D' state. This has lasted for 18 hours without change. We've tried to increase both zfs_arc_max and zfs_arc_meta_limit, but nothing happens.
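For context, both limits are writable module parameters at runtime; a minimal sketch of adjusting and checking them (the sizes are examples only, not the values used on this server):

```sh
# Raise the ARC and ARC metadata limits on the fly.
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
echo $((6 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_meta_limit

# Confirm the ARC picked them up and see whether metadata is still pinned.
grep -E '^(c_max|arc_meta_limit|arc_meta_used|arc_prune)' /proc/spl/kstat/zfs/arcstats
```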
ZFS: 0.6.5.4/1ffc4c1
SPL: 0.6.5.4/6e5e068
Linux kernel: 4.3.3
We've tried to ftrace this process for 1 second:
https://gist.githubusercontent.com/validname/1a7f44ec106a933e0ca0/raw/61cb22b928bdb1cf4977b193d27dc66ea71e9f8b/01.arc_prune_function_trace
https://gist.githubusercontent.com/validname/44bcc5e55e6e294d9e09/raw/bc36b1d394034c4e12fcfa97b82e7dabdc7bc696/02.arc_prune_function_graph_trace
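For readers who want to capture a similar trace, a minimal sketch using ftrace's function_graph tracer; these are not necessarily the exact commands used for the gists above, and the pgrep lookup assumes a single arc_prune kernel thread:

```sh
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo function_graph > current_tracer
pgrep arc_prune | head -n1 > set_ftrace_pid   # trace only the arc_prune thread

echo 1 > tracing_on
sleep 1
echo 0 > tracing_on

cat trace > /tmp/arc_prune_function_graph_trace
echo nop > current_tracer                      # restore the default tracer
```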
What can we investigate further to find the reason for this behaviour?
And what can we do to prevent it?