
Fix deadlocks resulting from the post-kmem rework merge #3225

Closed
wants to merge 1 commit

Conversation

dweeezil
Contributor

Put the following functions under PF_FSTRANS to prevent deadlocks
due to re-entering the filesystem during reclaim:

dbuf_hold()
dmu_prefetch()
sa_buf_hold()

Testing this one now:
dmu_zfetch_fetch() - z_hold_mtx[] via zget and zf_rwlock via dmu_zfetch
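
For reference, the marking pattern applied to these functions looks roughly like the following sketch (the placement inside dbuf_hold() is illustrative, not the literal diff):

dmu_buf_impl_t *
dbuf_hold(dnode_t *dn, uint64_t blkid, void *tag)
{
	fstrans_cookie_t cookie;
	dmu_buf_impl_t *db;

	cookie = spl_fstrans_mark();	/* disable direct reclaim re-entry into the fs */
	db = dbuf_hold_level(dn, 0, blkid, tag);
	spl_fstrans_unmark(cookie);

	return (db);
}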

@dweeezil
Contributor Author

This is intended to fix the issues outlined in #3183. It's a WIP branch but since it appears others have been grabbing patches from it, I think it's time to get it posted as a pull request.

The "testing this one now" change to dmu_zfetch_zfetch() was added late yesterday to fix a deadlock I recently outlined but it hasn't been tested heavily yet. There's a very good chance I pushed the fstrans marking too far down the call chain so further evaluation is necessary.

@dweeezil
Contributor Author

Unfortunately, it seems z_hold_mtx[i] strikes again, and does so in a manner I feared could happen. It's possible for one process to hold entry X via the zfs_zget() path and try to acquire entry Y via zfs_zinactive(), while another process holding entry Y via zfs_zinactive() enters reclaim in the zfs_zget->zfs_znode_alloc path and winds up in zfs_zinactive() trying to acquire X.

I'm going to try the more heavy-handed approach of putting z_hold_mtx[] under MUTEX_FSTRANS, but I suspect the plan was to use that feature as a last resort. I had to put the system under the most extreme memory pressure possible in order to catch this one; the moral equivalent of a "reclaim torture test" mentioned in another issue.

@tuxoko
Contributor

tuxoko commented Mar 26, 2015

@dweeezil
You HAVE TO put z_hold_mtx under FSTRANS. Otherwise, you'll have to add spl_fstrans_mark to every possible memory allocation under z_hold_mtx. The reason is just as you described: direct reclaim will cause reentry on z_hold_mtx, and it will cause a deadlock even if the reentry is on a different z_hold_mtx entry, because there could be multiple threads doing reclaim.

@dweeezil
Contributor Author

@tuxoko I'll be running tests with z_hold_mtx under MUTEX_FSTRANS shortly. I expect it will fix this current case of deadlocks. There are clearly too many code paths to fix, individually, otherwise.

@tuxoko
Contributor

tuxoko commented Mar 26, 2015

@dweeezil @behlendorf
By the way, I'm not sure why nobody commented on this: #3183 (comment) ...
As far as I can tell, after I added MUTEX_FSTRANS to l2arc_buflist_mtx, the lock/unlock order was already violated with respect to hash_lock in arc_release().

If we are going to keep the MUTEX_FSTRANS thing as it is, we need to be very careful about the lock/unlock sequence of such mutexes. spl_fstrans_mark would also have an effect on this. But personally, unless we can fix the MUTEX_FSTRANS thing, I think we should stick to spl_fstrans_mark, because it is more explicit.
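
To make the ordering concern concrete, here is a commented sketch of the hazard (this assumes MUTEX_FSTRANS saves the task's PF_FSTRANS state at mutex_enter() and restores it at mutex_exit(); the sequence loosely follows the arc_release() example above):

mutex_enter(&l2arc_buflist_mtx);  /* MUTEX_FSTRANS: saves flags, sets PF_FSTRANS     */
mutex_enter(hash_lock);           /* ordinary mutex                                  */
mutex_exit(&l2arc_buflist_mtx);   /* restores saved flags, clearing PF_FSTRANS       */
                                  /* even though hash_lock is still held             */
kmem_alloc(size, KM_SLEEP);       /* may now enter direct reclaim, which can try to  */
                                  /* re-take hash_lock and deadlock                  */
mutex_exit(hash_lock);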

@dweeezil
Contributor Author

@tuxoko Sorry, I had been meaning to read the comment you referenced in #3183 more closely, but since I had not been leaning on MUTEX_FSTRANS, I didn't bother. Your comment is, of course, correct, and the issue it raises is certainly a potential problem with z_hold_mtx, in which the entries are aliased among potentially many unrelated objects. Using MUTEX_FSTRANS as a fix for z_hold_mtx isn't going to work as it stands now.

@tuxoko
Contributor

tuxoko commented Mar 26, 2015

@dweeezil
As far as I can tell, z_hold_mtx is pretty well behaved. It always acquires and releases in the same function, so it should be pretty straightforward to use spl_fstrans_mark on them.
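
A sketch of the pattern at one of those sites (ZFS_OBJ_HOLD_ENTER/EXIT are the existing macros that index into z_hold_mtx; the function body is elided):

	fstrans_cookie_t cookie;

	cookie = spl_fstrans_mark();		/* no direct reclaim while the hold is taken */
	ZFS_OBJ_HOLD_ENTER(zsb, obj_num);	/* mutex_enter(&zsb->z_hold_mtx[i]) */
	/* ... zfs_zget() / zfs_zinactive() body, which may allocate with KM_SLEEP ... */
	ZFS_OBJ_HOLD_EXIT(zsb, obj_num);
	spl_fstrans_unmark(cookie);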

@dweeezil dweeezil force-pushed the post-kmem-rework-deadlocks branch from fc43636 to 8b5bec7 on March 27, 2015 13:19
@dweeezil
Contributor Author

My latest commit simply marks/unmarks where the z_hold_mtx array is used. This seems to cure all the z_hold_mtx-related deadlocks I've been able to produce so far. I've got one small related tweak I'll push later today, but I think this is a good start.

Unfortunately, I can still deadlock my test system with the stress test. I've not fully analyzed the stack traces and lock debugging yet, but it looks like it may be related to the kernel's per-inode i_mutex lock. I'm concerned this one may be more of an interaction between the VFS and ZFS rather than strictly a ZFS issue. It looks more like a simple deadlock than a lock inversion problem. My specific concern is that something in the manner in which directories are mapped to objects, and/or the way their parentage is managed under ZFS, may be incompatible with the assumptions the VFS makes about managing i_mutex. I'll know more after analyzing the stacks I've captured.

@dweeezil dweeezil force-pushed the post-kmem-rework-deadlocks branch from 8b5bec7 to 7045329 on March 28, 2015 16:18
@dweeezil
Contributor Author

I've pushed another version of this with the comments cleaned up and one small superfluous change reverted. Further testing shows that the deadlocks I was still able to create are somewhat expected on systems with very little memory, no swap and on which tmpfs is being used. After removing all uses of tmpfs, I was no longer able to create the same deadlock. That said, I'm going to continue stress testing over the next day or so.

@dweeezil
Contributor Author

Here's an update after chasing deadlocks all day. Most of the last class of deadlocks I was seeing were due to direct reclaim during the various dmu_tx_hold family of functions. Unfortunately, after locking all of those paths down, I'm now able to generate deadlocks involving the metaslab_preload()...arc_buf_alloc() path. At this rate, the entire system will be reduced to non-sleeping allocations, just as it mostly was before the kmem-rework merge. Furthermore, all these semi-gratuitous changes cause even more divergence from upstream code than KM_PUSHPAGE did in the past. I'm going to keep plugging away at it for a while, but a little voice in the back of my head is telling me we need a different approach.

@dweeezil dweeezil force-pushed the post-kmem-rework-deadlocks branch from 6d617e7 to d85a1f7 on March 29, 2015 15:41
@dweeezil
Contributor Author

I've backed off of all the dmu_tx_hold lockdowns. Instead, as @tuxoko rightly pointed out, memory allocations when db_mtx is held can cause deadlocks. The latest commit puts it under MUTEX_FSTRANS but since I've got a good set of stack traces involving this deadlock, it might be possible to do this in a more granular manner with spl_fstrans_mark().
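
For reference, the MUTEX_FSTRANS variant is selected at mutex_init() time, so the db_mtx change amounts to something like this (a sketch; db_mtx is normally initialized in the dbuf kmem cache constructor):

-	mutex_init(&db->db_mtx, NULL, MUTEX_DEFAULT, NULL);
+	mutex_init(&db->db_mtx, NULL, MUTEX_FSTRANS, NULL);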

With this patch in place, however, I can still lock up my test system to the point where it spins trying to free memory in the SPL slab shrinker. I'm going to rebase this on the current master code now in order to get the latest ARC changes (2cbb06b in particular).

@kernelOfTruth
Contributor

hm, several git fetch/pull failures due to GitHub being DDoSed,

failed test imports

gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Fail (2)

whatever that means

and

http://buildbot.zfsonlinux.org/builders/centos-7.0-x86_64-builder/builds/1764/steps/shell_17

[ 5959.028703] WARNING: at fs/xfs/xfs_aops.c:968 xfs_vm_writepage+0x5ab/0x5c0 [xfs]()
[ 5959.028704] Modules linked in: zfs(POF) zunicode(POF) zavl(POF) zcommon(POF) znvpair(POF) spl(OF) sd_mod crct10dif_generic crc_t10dif crct10dif_common ext4 mbcache jbd2 loop zlib_deflate ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg crc32c_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd virtio_console serio_raw pcspkr virtio_balloon i2c_piix4 mperf xfs libcrc32c sr_mod cdrom ata_generic pata_acpi virtio_blk virtio_scsi qxl virtio_net drm_kms_helper
[ 5959.028743]  ttm drm ata_piix libata virtio_pci virtio_ring virtio i2c_core floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
[ 5959.028752] CPU: 2 PID: 10982 Comm: kworker/u8:2 Tainted: PF       W  O--------------   3.10.0-123.20.1.el7.x86_64 #1
[ 5959.028754] Hardware name: Red Hat RHEV Hypervisor, BIOS 0.5.1 01/01/2007
[ 5959.028756] Workqueue: writeback bdi_writeback_workfn (flush-253:1)
[ 5959.028758]  0000000000000000 00000000c7ca0d4c ffff88004d159900 ffffffff815e2b0c
[ 5959.028761]  ffff88004d159938 ffffffff8105dee1 ffff880029dbeae8 ffff880029dbeae8
[ 5959.028765]  ffff88004d159c50 ffff880029dbe998 ffffea00011383c0 ffff88004d159948

http://oss.sgi.com/archives/xfs/2014-07/msg00395.html

looks like another buildbot issue

@dweeezil I'll give your latest patch a try, although I doubt I'll run into any serious lockups due to my (compared to others) minor load

Thanks for your continued work on this issue!

@dweeezil dweeezil force-pushed the post-kmem-rework-deadlocks branch from d85a1f7 to 1c05d84 on March 30, 2015 00:32
@dweeezil
Contributor Author

I've pushed a master-rebase of dweeezil/zfs@d85a1f7 (it's dweeezil/zfs@1c05d84). So far, it's looking very good. I've not been able to deadlock my test system, but I have been able to live-lock it to the point where only a SysRq-e will get it going again.

I'm concerned that @tuxoko's observation about the ordering issue with MUTEX_FSTRANS may be happening along the sync path because I'm seeing reclaim entered from txg_sync_thread in my live-locks:

do_try_to_free_pages+0x155/0x440
...
kmem_cache_alloc+0x1b4/0x210
spl_kmem_cache_alloc+0xe9/0xd10 
zio_create+0x84/0x780 
zio_write+0x119/0x1e0 
arc_write+0x121/0x250 
dbuf_write.isra.12+0x24c/0x760 
dbuf_sync_leaf+0x1ac/0xa20 
dbuf_sync_list+0x95/0xe0 
dnode_sync+0x5c5/0x1b60 
dmu_objset_sync_dnodes+0xc5/0x220 
dmu_objset_sync+0x1b8/0x480 
dsl_dataset_sync+0x6a/0x100 
dsl_pool_sync+0xec/0x690 
spa_sync+0x418/0xd50 
txg_sync_thread+0x3e8/0x7f0 

I have a feeling this patch's use of MUTEX_FSTRANS for db_mtx may be the culprit so I'm going to try to lock down the individual db_mtx users and move away from MUTEX_FSTRANS (and use spl_fstrans_mark() instead).

@chrisrd
Contributor

chrisrd commented Mar 30, 2015

@dweeezil "I've backed off of all the dmu_tx_hold lockdowns. Instead, as @tuxoko rightly pointed out, memory allocations when db_mtx is held can cause deadlocks."

Hmm, that rings a bell...

https://www.illumos.org/issues/5056 - ZFS deadlock on db_mtx and dn_holds

According to the git log, this hasn't been applied to ZoL?

@dweeezil
Contributor Author

@chrisrd That patch is going to take a bit of digesting; it's pretty extensive. The stacks shown in https://www.illumos.org/issues/5056 don't look like anything I've seen as part of working on this issue (at least insofar as the umount path is concerned). That said, I may try a quick port of the patch to see how well it fits into our current code.

@behlendorf
Contributor

I'm concerned that @tuxoko's observation about the ordering issue with MUTEX_FSTRANS may be happening along the sync path because I'm seeing reclaim entered from txg_sync_thread in my live-locks:

@tuxoko sorry about the long delayed comment on #3183 (comment). You're exactly right, that's a legitimate case which can happen and I should have commented on it explicitly. At the time I didn't think that could occur in the existing code but given the stacks observed by @dweeezil it seems it may be possible. Direct reclaim should never ever ever be possible in the context of the txg_sync_thread().

You mentioned the only way to handle this transparently is to keep a reference count in the task. Unfortunately, that's not really something we can do because we can't add a field to the task struct. We could add a bit of thread specific data (TSD) but I suspect the performance hit there would be unacceptable.

We may not be able to transparently hide this case, which is a shame because I'd really like to keep the Linux-specific bits to a minimum. I completely agree with @dweeezil's comments above:

At this rate, the entire system will be reduced to non-sleeping allocations, just as it mostly was before the kmem-rework merge. Furthermore, all these semi-gratuitous changes cause even more divergence from upstream code than KM_PUSHPAGE did in the past. I'm going to keep plugging away at it for a while, but a little voice in the back of my head is telling me we need a different approach.

So with that in mind let me throw out a couple possible alternate approaches we could take. Better ideas are welcome! I'd love to put this issue behind us once and for all.

  • Continue adding targeted calls to spl_fstrans_mark() throughout the common ZFS code. One possible clean way to minimize the footprint here would be to extend the ZFS_OBJ_HOLD_ENTER/ZFS_OBJ_HOLD_EXIT macros to take a third fstrans_cookie_t argument (see the sketch after this list). Then we could hide the mark/unmark in the macros for Linux, and a patch could be pushed upstream to add the argument and just ignore it. I'd have to see how this works out in practice, but at the moment this seems like the best approach to me.
  • Alternately, we could disable direct reclaim in larger chunks of the code by adding the mark/unmark to the Linux-specific zpl_* functions. That would keep the changes in Linux-specific code and I think should cover most common cases, although it's heavy-handed.
  • Even more heavy-handed would be to just add GFP_NOFS to all KM_SLEEP allocations. This would be even more comprehensive than the KM_PUSHPAGE solution, but it would be straightforward. We'd definitely want to somehow measure the performance / stability impact. I could see this potentially causing issues on a machine which is starved for memory, but perhaps not. It's hard to say for sure.
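
To illustrate the first option, the extended macros might look roughly like this (a sketch only; ZFS_OBJ_MUTEX is the existing helper that indexes into z_hold_mtx, and upstream would simply ignore the extra argument):

/* Linux definition: disable direct reclaim for as long as the hold is taken. */
#define	ZFS_OBJ_HOLD_ENTER(zsb, obj_num, cookie)		\
do {								\
	(cookie) = spl_fstrans_mark();				\
	mutex_enter(ZFS_OBJ_MUTEX((zsb), (obj_num)));		\
} while (0)

#define	ZFS_OBJ_HOLD_EXIT(zsb, obj_num, cookie)			\
do {								\
	mutex_exit(ZFS_OBJ_MUTEX((zsb), (obj_num)));		\
	spl_fstrans_unmark(cookie);				\
} while (0)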

@dweeezil
Contributor Author

@behlendorf Before addressing your points above, here are a few other details. First I want to characterize the workload of my torture tests (see https://github.com/dweeezil/stress-tests) to give an idea of why I'm able to create deadlocks so quickly. All my testing is done with "mem=4g" to simulate a system that's tight on memory from the beginning. It's much, much harder to deadlock even at the 8GiB level with all other things equal. The filesystem is populated with about 1 million files in a 3-level directory hierarchy consisting of 1000 directories. I've not been re-populating it (right now, there are 899K inodes used on my test system). A set of 20 concurrent file-creation-with-random-overwrite processes is run repeatedly and, at the same time, a set of 20 concurrent random file-removal processes is run. The random removal uses find(1) to traverse sets of the 1000 directories and randomly removes files. In other words, I'm running 40 concurrent processes which are rapidly traversing the filesystem. I wrote these a while ago to try to create a workload for the issue addressed by bc88866 but started running into this issue.

This is a mostly metadata-heavy workload and on a memory constrained system, the ARC quickly fills to its automatically-calculated 1GiB cap (and overshoots periodically).

While all this is running, I run a single memory-hogging program which nibbles away at a pre-specified amount of memory in 10MiB chunks, dirtying each chunk as it goes. I'm currently using a 2000MiB cap (somewhat arbitrarily). The memory hog is allowed to run to completion and is then re-run after a 4-second delay to allow the ARC to refill.

When I launch the memory hog, I watch the ARC via arcstats.py and, in general, on the first run of the memory hog, I can let it run a good long while without deadlocking. It will eventually deadlock but my normal regimen is to kill the hogger and let the ARC totally refill and then wait a bit. The system will generally deadlock on the very next run of the hogger.

With this patch in its current state, I can usually wake the system up by doing a SysRq-e to SIGTERM all the processes. The trouble seems to stem from the sync task going into reclaim. I do want to try to back off of using MUTEX_FSTRANS for db_mtx and use spl_fstrans_mark as necessary instead. I do have stack traces saved from a deadlock to help guide me to the trouble spot(s).

As to your points above:

  • I gather that continued use of spl_fstrans_mark() was the planned path forward and is what I'd like to do to address the deadlocks involving db_mtx. I do like the idea of encapsulating some of the work in the ZFS_OBJ... macros to minimize the number of code line additions. That said, however, similar tricks won't likely work for the potentially many other places in which spl_fstrans_mark() might be necessary should we continue down that path.
  • I think the scheme of marking the zpl_ functions as well as the sync task might be the way to go for the time being (see below for rationale).
  • I certainly thought of the "always use GFP_NOFS" scheme but it is clearly heavy handed.

My overriding concern is the timing of this issue with respect to a (from what I gather) long-overdue tagged release. If we miss marking something with spl_fstrans_mark() because testing doesn't catch a deadlock, it will most certainly be uncovered once the code gets deployed, and if one of these deadlocks is found it would be big trouble given what appears to be most distros' current release model. This, however, should be mitigated by more widespread adoption of "stable" release updates.

I'd like to add one other note regarding the case of reclaim I spotted in the sync thread: my first idea was that something was turning off PF_FSTRANS, so I put a test and a printk() at the bottom of its loop, but the test was never triggered. In other words, if something is disabling PF_FSTRANS, it is apparently re-enabling it before hitting the bottom of the sync thread's loop (I did not pepper the loop with any additional tests).
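
For reference, the check was roughly the following (a sketch of the debug instrumentation, placed at the bottom of the txg_sync_thread() loop; not part of the patch itself):

	/* Debug only: complain if something has cleared PF_FSTRANS on the sync thread. */
	if (!(current->flags & PF_FSTRANS))
		printk(KERN_WARNING "txg_sync_thread: PF_FSTRANS unexpectedly clear\n");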

@behlendorf
Contributor

@dweeezil it sounds like we're on the same page about this. I think the short term goal here needs to be putting together some patch which addresses most/all of the known deadlocks. Otherwise we're almost certainly going to have people updating to the next released version and encountering this issue. That's something I definitely want to avoid.

However, this is the last remaining blocker on the list holding up a tag. So in the interests of getting something out the door, I agree that we either need to adopt the zpl_ scheme in the short term, which I think has a pretty high likelihood of resolving the vast majority of cases; or convince ourselves that your existing patch, with a reworked bit for MUTEX_FSTRANS, in practice covers most realistic scenarios; or potentially reintroduce a KM_PUSHPAGE replacement for now, called something like KM_NOFS. The last option would be straightforward to do since 79c76d5 clearly identifies all the KM_PUSHPAGE sites which were changed. We'd be in no worse shape than the existing released tag and it would be trivial to revert later.

Have you perhaps already put together a patch for the zpl_ wrapper case? If so, how is it holding up in testing? I could propose and test patches for a KM_NOFS approach if you think that's reasonable.

If you could provide a trivial wrapper script for your test case in the short term, I could get some additional testing done myself for these patches.

And speaking of testing, thanks for describing your workload and providing a pointer to the test scripts. It would be ideal if at some point we could add some form of your testing to the buildbot, perhaps as part of the zfsstress tests @nedbass put together, https://github.com/nedbass/zfsstress. These tests are very similar in spirit to the kind of testing you've been doing.

@dweeezil
Contributor Author

@behlendorf Kind of in reverse order here: I've put together a wrapper script for my current testing regimen and cleaned up the (very crude) scripts just a bit. You'll need both https://github.com/dweeezil/stress-tests and https://github.com/dweeezil/junkutils (or your own memory hogging and dirtying program). The former repository has the stress testing scripts and test1.sh is the wrapper which will (live|dead)lock my mem=4g test system very reliably. The scripts use /tank/fish as their filesystem and the text in the top of test1.sh explains how to seed it before testing. I've not re-seeded my test filesystem in a long time. One other note, and I suspect it doesn't matter: I just realized my current test pool has all/most of the newer features disabled (spacemap_histogram, enabled_txg, hole_birth, extensible_dataset, embedded_data and bookmarks). I've got no idea why but I can't imagine it would matter for this testing.

I've not yet done the zpl_... wrapper case but it certainly seems it would be pretty easy. Given that we don't have any upstream with which to maintain compatibility, I don't see any problem with simply peppering each function with the fstrans mark/unmark. I could write a patch this evening and run it through its paces. Presumably the ZFS_OBJ_HOLD... mark/unmarking I've added could go away as well.
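
A sketch of what the zpl_ wrapping would look like, using zpl_evict_inode() as the example (the body shown is illustrative; the same mark/unmark bracketing would be applied to each zpl_ entry point):

static void
zpl_evict_inode(struct inode *ip)
{
	fstrans_cookie_t cookie;

	cookie = spl_fstrans_mark();	/* no direct reclaim below this point */
	truncate_setsize(ip, 0);
	clear_inode(ip);
	zfs_inactive(ip);
	spl_fstrans_unmark(cookie);
}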

This should give you enough information to get the test case running. Hopefully you can duplicate the problem as easily as I am able to (BTW, I'm strictly running on bare metal right now).

@behlendorf
Contributor

@dweeezil when you refresh the SPL patch can you also revert the MUTEX_FSTRANS patch openzfs/spl@d0d5dd7? Let's get rid of that interface since we've decided not to use it.

As for the zio_taskqs[] I've thought for a while we should at a minimum make those values tunable somehow. That would allow us to more easily get feedback on reasonable values for a variety of systems.

@behlendorf
Contributor

@dweeezil @tuxoko the latest version of the patch stack continues to hold together well for me in all but the most abusive testing. In my estimation it is behaving at least as well as, if not better than, the existing 0.6.3 code in a low-memory situation. There's certainly room for us to continue improving things, and we should, but in the interests of getting the overdue tag out I'd like to merge this. Then we can get a final round of testing in over the weekend before the tag.

Do you guys have any reservations or last-minute improvements you'd like to get into this change? What's your general impression of the state of things with this patch applied to master?

@dweeezil
Contributor Author

dweeezil commented Apr 3, 2015

@behlendorf Insofar as the original issues described in this thread and some other related issues are concerned, I'm happy with this patch as it stands. What's your take on the SPL patch? I've been doing all my testing with it, as it helps things a lot in this particular case.

My impression is that things are in pretty good shape with this patch applied. I'm planning to do some more tests this weekend on bigger hardware configurations along with my (delayed by this issue) testing of pull request 3115 (purposely not hash-referenced; it was this issue which stopped me in my tracks while working on that patch).

@behlendorf
Contributor

@dweeezil I've been using the SPL patch as well and I think it helps considerably in certain situations. I was going to include it when I merge this change. Thus far it's worked well for me after making that one minor fix. OK, then I'll get this merged today and queue up some additional weekend testing on the new master.

@behlendorf behlendorf closed this in 40d06e3 Apr 3, 2015
kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this pull request Apr 4, 2015
…3115_WIP_clean

sync against latest upstream master from April 3rd to benefit from the
changes from openzfs#3225
@deajan

deajan commented Apr 5, 2015

Hi,
Since I upgraded my until-now very stable backup server from CentOS 7.0 to 7.1, I had to upgrade zfs-testing from zfs-0.6.3-166 to zfs-0.6.3-260.
I now have a lot of deadlocks like this one, which end up totally locking up my machine, both over SSH and on the console.

Apr  5 05:53:22 backupmaster kernel: INFO: task kswapd0:46 blocked for more than 120 seconds.
Apr  5 05:53:22 backupmaster kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  5 05:53:22 backupmaster kernel: kswapd0         D ffff88021fc13680     0    46      2 0x00000000
Apr  5 05:53:22 backupmaster kernel: ffff880210fab778 0000000000000046 ffff880210fabfd8 0000000000013680
Apr  5 05:53:22 backupmaster kernel: ffff880210fabfd8 0000000000013680 ffff880213d64440 ffff880213d64440
Apr  5 05:53:22 backupmaster kernel: ffff8800d22ee570 fffffffeffffffff ffff8800d22ee578 0000000000000001
Apr  5 05:53:22 backupmaster kernel: Call Trace:
Apr  5 05:53:22 backupmaster kernel: [<ffffffff81609e29>] schedule+0x29/0x70
Apr  5 05:53:22 backupmaster kernel: [<ffffffff8160b925>] rwsem_down_read_failed+0xf5/0x165
Apr  5 05:53:22 backupmaster kernel: [<ffffffff812ede0d>] ? list_del+0xd/0x30
Apr  5 05:53:22 backupmaster kernel: [<ffffffff812e31e4>] call_rwsem_down_read_failed+0x14/0x30
Apr  5 05:53:22 backupmaster kernel: [<ffffffffa02b59d2>] ? spl_kmem_free+0x32/0x50 [spl]
Apr  5 05:53:22 backupmaster kernel: [<ffffffff816091f0>] ? down_read+0x20/0x30
Apr  5 05:53:22 backupmaster kernel: [<ffffffffa03effeb>] dmu_zfetch_find+0x4b/0x830 [zfs]
Apr  5 05:53:22 backupmaster kernel: [<ffffffff812ede0d>] ? list_del+0xd/0x30
Apr  5 05:53:22 backupmaster kernel: [<ffffffffa03f0add>] dmu_zfetch+0xcd/0x930 [zfs]
Apr  5 05:53:22 backupmaster kernel: [<ffffffffa03d7249>] dbuf_read+0xa99/0xbd0 [zfs]
Apr  5 05:53:22 backupmaster kernel: [<ffffffffa03f2548>] dnode_hold_impl+0x198/0x610 [zfs]
Apr  5 05:53:22 backupmaster kernel: [<ffffffff811ae07d>] ? __kmalloc_node+0x13d/0x280
Apr  5 05:53:22 backupmaster kernel: [<ffffffffa03f29d9>] dnode_hold+0x19/0x20 [zfs]
Apr  5 05:53:22 backupmaster kernel: [<ffffffffa03ed732>] dmu_tx_hold_object_impl.isra.1+0x42/0x190 [zfs]
Apr  5 05:53:22 backupmaster kernel: [<ffffffffa03ed895>] dmu_tx_hold_bonus+0x15/0x30 [zfs]
Apr  5 05:53:22 backupmaster kernel: [<ffffffffa03ef01c>] dmu_tx_hold_sa+0x3c/0x180 [zfs]
Apr  5 05:53:22 backupmaster kernel: [<ffffffffa047302b>] zfs_inactive+0x15b/0x290 [zfs]
Apr  5 05:53:22 backupmaster kernel: [<ffffffffa048b783>] zpl_evict_inode+0x43/0x60 [zfs]
Apr  5 05:53:22 backupmaster kernel: [<ffffffff811e2237>] evict+0xa7/0x170
Apr  5 05:53:22 backupmaster kernel: [<ffffffff811e233e>] dispose_list+0x3e/0x50
Apr  5 05:53:22 backupmaster kernel: [<ffffffff811e3213>] prune_icache_sb+0x163/0x320
Apr  5 05:53:22 backupmaster kernel: [<ffffffff811ca496>] prune_super+0xd6/0x1a0
Apr  5 05:53:22 backupmaster kernel: [<ffffffff811691a5>] shrink_slab+0x165/0x300
Apr  5 05:53:22 backupmaster kernel: [<ffffffff811c0751>] ? vmpressure+0x21/0x90
Apr  5 05:53:22 backupmaster kernel: [<ffffffff8116cdf1>] balance_pgdat+0x4b1/0x5e0
Apr  5 05:53:22 backupmaster kernel: [<ffffffff8116d093>] kswapd+0x173/0x450
Apr  5 05:53:22 backupmaster kernel: [<ffffffff81098340>] ? wake_up_bit+0x30/0x30
Apr  5 05:53:22 backupmaster kernel: [<ffffffff8116cf20>] ? balance_pgdat+0x5e0/0x5e0
Apr  5 05:53:22 backupmaster kernel: [<ffffffff8109738f>] kthread+0xcf/0xe0
Apr  5 05:53:22 backupmaster kernel: [<ffffffff810972c0>] ? kthread_create_on_node+0x140/0x140
Apr  5 05:53:22 backupmaster kernel: [<ffffffff8161497c>] ret_from_fork+0x7c/0xb0
Apr  5 05:53:22 backupmaster kernel: [<ffffffff810972c0>] ? kthread_create_on_node+0x140/0x140

I am currently connected to my machine, which is quite unresponsive; top shows a load average of 18, but no high CPU usage nor disk usage (wa = 0.9).

I've set up a crontab that logs free memory and arcstats every 3 minutes until it crashes.

Current stats are

[root@backupmaster ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           7,6G        4,2G        343M        5,3M        3,1G        1,4G
Swap:          7,9G        272M        7,6G

[root@backupmaster ~]# cat /proc/spl/kstat/zfs/arcstats
5 1 0x01 86 4128 10440723686 103106010736704
name                            type data
hits                            4    6799000
misses                          4    903734
demand_data_hits                4    1052478
demand_data_misses              4    98478
demand_metadata_hits            4    3979439
demand_metadata_misses          4    308492
prefetch_data_hits              4    90300
prefetch_data_misses            4    389395
prefetch_metadata_hits          4    1676783
prefetch_metadata_misses        4    107369
mru_hits                        4    2804169
mru_ghost_hits                  4    68374
mfu_hits                        4    2227748
mfu_ghost_hits                  4    17802
deleted                         4    1092616
recycle_miss                    4    95086
mutex_miss                      4    95
evict_skip                      4    34452369
evict_l2_cached                 4    0
evict_l2_eligible               4    94026845184
evict_l2_ineligible             4    3055902720
hash_elements                   4    37261
hash_elements_max               4    121953
hash_collisions                 4    95936
hash_chains                     4    723
hash_chain_max                  4    4
p                               4    1261568616
c                               4    1261827688
c_min                           4    4194304
c_max                           4    4085360640
size                            4    1261617624
hdr_size                        4    13965568
data_size                       4    52940800
meta_size                       4    311249920
other_size                      4    883461336
anon_size                       4    3293184
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    341787136
mru_evict_data                  4    40685568
mru_evict_metadata              4    2132480
mru_ghost_size                  4    57884160
mru_ghost_evict_data            4    5898240
mru_ghost_evict_metadata        4    51985920
mfu_size                        4    19110400
mfu_evict_data                  4    12255232
mfu_evict_metadata              4    72192
mfu_ghost_size                  4    250805248
mfu_ghost_evict_data            4    76679168
mfu_ghost_evict_metadata        4    174126080
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    1361
memory_indirect_count           4    91273
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    1208676824
arc_meta_limit                  4    3064020480
arc_meta_max                    4    1220892016

[root@backupmaster ~]# cat /proc/meminfo
MemTotal:        7979220 kB
MemFree:          356532 kB
MemAvailable:    1482912 kB
Buffers:          492676 kB
Cached:           135984 kB
SwapCached:        32176 kB
Active:          2214544 kB
Inactive:        1874108 kB
Active(anon):    1911372 kB
Inactive(anon):  1554044 kB
Active(file):     303172 kB
Inactive(file):   320064 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       8273916 kB
SwapFree:        7994748 kB
Dirty:                20 kB
Writeback:             0 kB
AnonPages:       3434528 kB
Mapped:            45848 kB
Shmem:              5424 kB
Slab:            2590800 kB
SReclaimable:     756560 kB
SUnreclaim:      1834240 kB
KernelStack:        8272 kB
PageTables:        19976 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    12263524 kB
Committed_AS:    6059032 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      310168 kB
VmallocChunk:   34355582156 kB
HardwareCorrupted:     0 kB
AnonHugePages:   1095680 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      108980 kB
DirectMap2M:     8247296 kB

[root@backupmaster ~]# vmstat -s
      7979220 K total memory
      4403136 K used memory
      2214524 K active memory
      1874108 K inactive memory
       356592 K free memory
       492676 K buffer memory
      2726816 K swap cache
      8273916 K total swap
       279168 K used swap
      7994748 K free swap
       167961 non-nice user cpu ticks
         2039 nice user cpu ticks
       635800 system cpu ticks
     39524247 idle cpu ticks
       913491 IO-wait cpu ticks
           29 IRQ cpu ticks
         5232 softirq cpu ticks
            0 stolen cpu ticks
     64657816 pages paged in
    171284419 pages paged out
         7477 pages swapped in
        83728 pages swapped out
     99269066 interrupts
    206560852 CPU context switches
   1428141322 boot time
        75119 forks

The load average keeps climbing and I will probably lose access to the machine.
Anything I could try? Anything I should log to help find where the deadlock comes from?

Running CentOS 7.1 Linux backupmaster.siege.local 3.10.0-229.1.2.el7.x86_64 #1 SMP Fri Mar 27 03:04:26 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux on a Fujitsu TX140S1P with 2x 4TB RE4 drives and 8GB RAM, no deduplication.

PS: sorry if this isn't the right issue to post to, I just read a lot of deadlock material and this one looks like mine.

@DeHackEd
Contributor

DeHackEd commented Apr 5, 2015

There should be more than one stack trace. Can you paste (or pastebin, or gist, etc.) some more from the same time period? They should all be logged within a 2-minute window.

Also note that another commit went in just recently to improve the situation further. It's not a 100% fix but every little bit helps. Pulling the next update would be my suggestion.

@deajan

deajan commented Apr 5, 2015

Well, there is a zfs-0.6.3-261 package, but it seems it isn't public or built yet (neither was it yesterday).
It says HTTP 404 Not Found for the spl and zfs rpms when I try yum --enablerepo=zfs-testing update

The full stack trace from the last deadlock can be found here: https://gist.github.com/deajan/78757af1066b7d77a73d

@kernelOfTruth kernelOfTruth mentioned this pull request Apr 5, 2015
@behlendorf
Contributor

@deajan Sorry about that. The EPEL zfs-testing repository will be fixed today.

@deajan

deajan commented Apr 9, 2015

@behlendorf Thanks a lot, and congrats for release 0.6.4 :))

behlendorf added a commit to behlendorf/zfs that referenced this pull request Apr 14, 2015
Prevent deadlocks by disabling direct reclaim during all NFS calls.
This is related to 40d06e3.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3225
behlendorf added a commit to behlendorf/zfs that referenced this pull request Apr 14, 2015
Prevent deadlocks by disabling direct reclaim during all NFS and
xattr calls.  This is related to 40d06e3.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3225
chrisrd added a commit to chrisrd/zfs that referenced this pull request Apr 16, 2015
Commit @40d06e3 removed a scheme to avoid reacquiring a mutex in
zfs_zinactive. It turns out the scheme is necessary. Reinstate it.

Signed-off-by: Chris Dunlop <[email protected]>
Issue openzfs#3225
Closes openzfs#3304
behlendorf added a commit to behlendorf/zfs that referenced this pull request Apr 16, 2015
Prevent deadlocks by disabling direct reclaim during all NFS, xattr,
ctldir, and super function calls.  This is related to 40d06e3.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3225
behlendorf added a commit to behlendorf/zfs that referenced this pull request Apr 17, 2015
Prevent deadlocks by disabling direct reclaim during all NFS, xattr,
ctldir, and super function calls.  This is related to 40d06e3.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Issue openzfs#3225
behlendorf added a commit that referenced this pull request Apr 17, 2015
Prevent deadlocks by disabling direct reclaim during all NFS, xattr,
ctldir, and super function calls.  This is related to 40d06e3.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Issue #3225
@ryao
Contributor

ryao commented Apr 22, 2015

It would be better to identify each lock that can be accessed during direct reclaim, convert the entry/exit routines into preprocessor macro wrappers and insert spl_fstrans_mark()/spl_fstrans_unmark() there. That way it would be clear why a given location needs it when doing review in the future. Doing it in the VFS hooks will miss direct reclaim issues in code paths exercised by taskq threads (e.g. iput_async) and will make the code harder to understand.
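
A sketch of what that per-lock wrapper approach might look like for db_mtx (DB_MTX_ENTER/DB_MTX_EXIT are hypothetical names, not existing ZFS macros):

/* Hypothetical wrappers: every db_mtx user goes through these, which makes   */
/* it self-documenting that the lock can be taken in the direct reclaim path. */
#define	DB_MTX_ENTER(db, cookie)				\
do {								\
	(cookie) = spl_fstrans_mark();				\
	mutex_enter(&(db)->db_mtx);				\
} while (0)

#define	DB_MTX_EXIT(db, cookie)					\
do {								\
	mutex_exit(&(db)->db_mtx);				\
	spl_fstrans_unmark(cookie);				\
} while (0)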
