
Process stuck at zfs_fsync #10440

Closed
sebu06 opened this issue Jun 11, 2020 · 15 comments
Labels
Status: Stale (no recent activity for issue) · Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

sebu06 commented Jun 11, 2020

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  Buster
Linux Kernel          5.4.41-1-pve
Architecture          x86_64
ZFS Version           0.8.4-pve1
SPL Version           0.8.4-pve1

I'm using the tool backy2 to back up a Ceph cluster to a raidz ZFS volume. The backy2 process uses a SQLite database stored in a dataset on the backup pool. Sometimes during the backup, the process gets stuck (D state). The stack trace shows that it is stuck in a ZFS task:

root@zds:/proc/2958595$ cat stack
[<0>] cv_wait_common+0x104/0x130 [spl]
[<0>] __cv_wait+0x15/0x20 [spl]
[<0>] zil_commit_impl+0x241/0xd90 [zfs]
[<0>] zil_commit+0x3d/0x60 [zfs]
[<0>] zfs_fsync+0x77/0xe0 [zfs]
[<0>] zpl_fsync+0x68/0xa0 [zfs]
[<0>] vfs_fsync_range+0x48/0x80
[<0>] do_fsync+0x3d/0x70
[<0>] __x64_sys_fdatasync+0x17/0x20
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

The problem does not happen every day, but it occurs regularly. It is quite annoying because it requires a reboot of the cluster node. The cluster is a Proxmox 6 cluster. The same setup worked for a long time on a Proxmox 5 cluster (kernel 4.15.18-29-pve, ZFS 0.7.13-pve1~bpo2, Debian Stretch), and the problems only started after the upgrade.

behlendorf added the Type: Defect (incorrect behavior, e.g. crash, hang) label Jun 11, 2020

sebu06 commented Jun 11, 2020

By the way, there is a process stuck at the moment which I did not kill, so if any additional data can be obtained from it, I'm happy to help.

sebu06 commented Jun 11, 2020

thanks, I'll try recompiling with that fix and let you know if it occurs again

@behlendorf
Contributor

This looks very similar to openzfsonosx/openzfs#3. There's a tiny proposed fix in openzfsonosx/openzfs@adeb452. This code was added in the 0.8.x releases, so that's consistent with your observation that it didn't happen with the earlier release.

snajpa commented Aug 21, 2020

This bug is still relevant for me with the current ZFS master (as of today).

lockdep reveals this; it is probably relevant, and I'm investigating further:

[  140.025805] ======================================================
[  140.025806] WARNING: possible circular locking dependency detected
[  140.025808] 5.9.0-rc1+ #51 Tainted: G           O     
[  140.025810] ------------------------------------------------------
[  140.025811] systemd-journal/3974 is trying to acquire lock:
[  140.025813] ffff94ba81261070 (&zilog->zl_lock){+.+.}-{3:3}, at: zil_alloc_lwb+0x1df/0x3e0 [zfs]
[  140.025970] 
               but task is already holding lock:
[  140.025971] ffff94ba9ba504f8 (&zcw->zcw_lock){+.+.}-{3:3}, at: zil_commit_impl+0x52b/0x1850 [zfs]
[  140.026100] 
               which lock already depends on the new lock.

[  140.026100] 
               the existing dependency chain (in reverse order) is:
[  140.026101] 
               -> #1 (&zcw->zcw_lock){+.+.}-{3:3}:
[  140.026110]        __mutex_lock+0xac/0x9e0
[  140.026113]        mutex_lock_nested+0x1b/0x20
[  140.026171]        zil_commit_waiter_link_lwb+0x51/0x1a0 [zfs]
[  140.026271]        zil_commit_impl+0x12b0/0x1850 [zfs]
[  140.026366]        zil_commit+0x43/0x60 [zfs]
[  140.026478]        zpl_writepages+0xf8/0x1a0 [zfs]
[  140.026540]        do_writepages+0x43/0xf0
[  140.026544]        __filemap_fdatawrite_range+0xd5/0x110
[  140.026546]        filemap_write_and_wait_range+0x4b/0xb0
[  140.026601]        zpl_fsync+0x4d/0xb0 [zfs]
[  140.026648]        vfs_fsync_range+0x49/0x80
[  140.026650]        do_fsync+0x3d/0x70
[  140.026653]        __x64_sys_fsync+0x14/0x20
[  140.026658]        do_syscall_64+0x38/0x90
[  140.026662]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  140.026663] 
               -> #0 (&zilog->zl_lock){+.+.}-{3:3}:
[  140.026670]        __lock_acquire+0x1298/0x2460
[  140.026673]        lock_acquire+0xab/0x380
[  140.026675]        __mutex_lock+0xac/0x9e0
[  140.026676]        mutex_lock_nested+0x1b/0x20
[  140.026719]        zil_alloc_lwb+0x1df/0x3e0 [zfs]
[  140.026803]        zil_lwb_write_issue+0x265/0x3f0 [zfs]
[  140.026886]        zil_commit_impl+0x577/0x1850 [zfs]
[  140.026969]        zil_commit+0x43/0x60 [zfs]
[  140.027051]        zpl_writepages+0xf8/0x1a0 [zfs]
[  140.027116]        do_writepages+0x43/0xf0
[  140.027119]        __filemap_fdatawrite_range+0xd5/0x110
[  140.027122]        filemap_write_and_wait_range+0x4b/0xb0
[  140.027172]        zpl_fsync+0x4d/0xb0 [zfs]
[  140.027215]        vfs_fsync_range+0x49/0x80
[  140.027217]        do_fsync+0x3d/0x70
[  140.027220]        __x64_sys_fsync+0x14/0x20
[  140.027223]        do_syscall_64+0x38/0x90
[  140.027226]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  140.027227] 
               other info that might help us debug this:

[  140.027228]  Possible unsafe locking scenario:

[  140.027228]        CPU0                    CPU1
[  140.027229]        ----                    ----
[  140.027229]   lock(&zcw->zcw_lock);
[  140.027231]                                lock(&zilog->zl_lock);
[  140.027232]                                lock(&zcw->zcw_lock);
[  140.027233]   lock(&zilog->zl_lock);
[  140.027234] 
                *** DEADLOCK ***

[  140.027236] 2 locks held by systemd-journal/3974:
[  140.027236]  #0: ffff94ba812612b8 (&zilog->zl_issuer_lock){+.+.}-{3:3}, at: zil_commit_impl+0x4dc/0x1850 [zfs]
[  140.027342]  #1: ffff94ba9ba504f8 (&zcw->zcw_lock){+.+.}-{3:3}, at: zil_commit_impl+0x52b/0x1850 [zfs]
[  140.027452] 
               stack backtrace:
[  140.027455] CPU: 13 PID: 25 Comm: systemd-journal Tainted: G           O      5.9.0-rc1+ #51
[  140.027457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[  140.027459] Call Trace:
[  140.027467]  dump_stack+0x9f/0xda
[  140.027472]  print_circular_bug.isra.0.cold+0x13d/0x142
[  140.027476]  check_noncircular+0x169/0x180
[  140.027483]  __lock_acquire+0x1298/0x2460
[  140.027489]  lock_acquire+0xab/0x380
[  140.027545]  ? zil_alloc_lwb+0x1df/0x3e0 [zfs]
[  140.027609]  __mutex_lock+0xac/0x9e0
[  140.027657]  ? zil_alloc_lwb+0x1df/0x3e0 [zfs]
[  140.027742]  ? zil_alloc_lwb+0x1df/0x3e0 [zfs]
[  140.027792]  ? spl_kmem_cache_alloc+0xc2/0x840 [spl]
[  140.027798]  mutex_lock_nested+0x1b/0x20
[  140.027840]  ? zio_buf_alloc+0x30/0x60 [zfs]
[  140.027882]  ? mutex_lock_nested+0x1b/0x20
[  140.027936]  zil_alloc_lwb+0x1df/0x3e0 [zfs]
[  140.028021]  zil_lwb_write_issue+0x265/0x3f0 [zfs]
[  140.028113]  zil_commit_impl+0x577/0x1850 [zfs]
[  140.028199]  zil_commit+0x43/0x60 [zfs]
[  140.028281]  zpl_writepages+0xf8/0x1a0 [zfs]
[  140.028325]  do_writepages+0x43/0xf0
[  140.028327]  ? _raw_spin_unlock+0x23/0x30
[  140.028330]  __filemap_fdatawrite_range+0xd5/0x110
[  140.028333]  filemap_write_and_wait_range+0x4b/0xb0
[  140.028376]  zpl_fsync+0x4d/0xb0 [zfs]
[  140.028418]  vfs_fsync_range+0x49/0x80
[  140.028421]  do_fsync+0x3d/0x70
[  140.028424]  __x64_sys_fsync+0x14/0x20
[  140.028428]  do_syscall_64+0x38/0x90
[  140.028430]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  140.028434] RIP: 0033:0x7f48b7837214
[  140.028436] Code: 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8d 05 f1 96 2d 00 8b 00 85 c0 75 13 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c f3 c3 66 90 53 89 fb 48 83 ec 10 e8 b4 94
[  140.028438] RSP: 002b:00007ffc919c1ef8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[  140.028440] RAX: ffffffffffffffda RBX: 000056389b01cb30 RCX: 00007f48b7837214
[  140.028441] RDX: 00007f48b2f36000 RSI: 000056389b01cd20 RDI: 0000000000000012
[  140.028442] RBP: 00007ffc919c2210 R08: 0005ad650b223649 R09: 000000005f3feab5
[  140.028444] R10: 00007ffc919f6000 R11: 0000000000000246 R12: 00000000ffffffff
[  140.028445] R13: 0000000000000000 R14: 000056389b01cb30 R15: 00007ffc919c21d0

snajpa commented Aug 21, 2020

Proposed fix, testing:

diff --git a/module/zfs/zil.c b/module/zfs/zil.c
index 9dc20ba14..cf4d6d9dc 100644
--- a/module/zfs/zil.c
+++ b/module/zfs/zil.c
@@ -2513,8 +2513,8 @@ zil_commit_waiter_timeout(zilog_t *zilog, zil_commit_waiter_t *zcw)
         * elsewhere.
         */
        mutex_exit(&zcw->zcw_lock);
-       mutex_enter(&zilog->zl_issuer_lock);
        mutex_enter(&zcw->zcw_lock);
+       mutex_enter(&zilog->zl_issuer_lock);
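
For context, the cycle in the lockdep report above is a plain ABBA inversion between zcw->zcw_lock and zilog->zl_lock: one path takes the two locks in one order, another path takes them in the opposite order. The sketch below is illustrative only (plain pthreads with made-up path names, not OpenZFS or SPL code), but it shows the shape of the problem and why any fix has to make every path agree on a single acquisition order:

/* Illustrative only: an ABBA inversion of the kind lockdep reports above,
 * using plain pthreads rather than the ZFS SPL mutex API. path_a mirrors
 * a path that takes zcw_lock before zl_lock; path_b takes them in the
 * opposite order. If both threads each grab their first lock at the same
 * time, neither can make progress: the "Possible unsafe locking scenario". */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t zcw_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for zcw->zcw_lock */
static pthread_mutex_t zl_lock  = PTHREAD_MUTEX_INITIALIZER; /* stand-in for zilog->zl_lock */

static void *path_a(void *arg)          /* order: zcw_lock -> zl_lock */
{
	for (int i = 0; i < 1000000; i++) {
		pthread_mutex_lock(&zcw_lock);
		pthread_mutex_lock(&zl_lock);
		pthread_mutex_unlock(&zl_lock);
		pthread_mutex_unlock(&zcw_lock);
	}
	return NULL;
}

static void *path_b(void *arg)          /* inverted order: zl_lock -> zcw_lock */
{
	for (int i = 0; i < 1000000; i++) {
		pthread_mutex_lock(&zl_lock);
		pthread_mutex_lock(&zcw_lock);
		pthread_mutex_unlock(&zcw_lock);
		pthread_mutex_unlock(&zl_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, path_a, NULL);
	pthread_create(&b, NULL, path_b, NULL);
	pthread_join(a, NULL);          /* likely never returns: ABBA deadlock */
	pthread_join(b, NULL);
	puts("no deadlock hit this run");
	return 0;
}

Built with gcc -pthread, this usually hangs within seconds; changing path_b to take the locks in the same order as path_a lets it finish, which is the property a consistent lock ordering in zil.c is meant to restore.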

snajpa commented Aug 21, 2020

Oh, it's not going to be that simple :-D

snajpa commented Aug 21, 2020

After noinlining all functions in zil.c:

[   59.868232] ======================================================
[   59.868233] WARNING: possible circular locking dependency detected
[   59.868236] 5.9.0-rc1+ #51 Tainted: G           O     
[   59.868237] ------------------------------------------------------
[   59.868239] systemd-journal/3685 is trying to acquire lock:
[   59.868241] ffff989c8b1bc070 (&zilog->zl_lock){+.+.}-{3:3}, at: zil_alloc_lwb+0x1f8/0x4a0 [zfs]
[   59.868405] 
               but task is already holding lock:
[   59.868406] ffff989c6da5b1f8 (&zcw->zcw_lock){+.+.}-{3:3}, at: zil_commit_waiter_timeout+0x1ca/0x670 [zfs]
[   59.868542] 
               which lock already depends on the new lock.

[   59.868543] 
               the existing dependency chain (in reverse order) is:
[   59.868544] 
               -> #1 (&zcw->zcw_lock){+.+.}-{3:3}:
[   59.868552]        __mutex_lock+0xac/0x9e0
[   59.868555]        mutex_lock_nested+0x1b/0x20
[   59.868622]        zil_commit_waiter_skip+0x58/0x210 [zfs]
[   59.868765]        zil_prune_commit_list+0x1c9/0x250 [zfs]
[   59.868895]        zil_commit_writer.isra.0+0x118/0x280 [zfs]
[   59.869023]        zil_commit_impl+0x4a/0x150 [zfs]
[   59.869151]        zil_commit+0x5f/0x170 [zfs]
[   59.869278]        zfs_fsync+0x7c/0x110 [zfs]
[   59.869405]        zpl_fsync+0x93/0x120 [zfs]
[   59.869475]        vfs_fsync_range+0x49/0x80
[   59.869478]        do_fsync+0x3d/0x70
[   59.869482]        __x64_sys_fsync+0x14/0x20
[   59.869488]        do_syscall_64+0x38/0x90
[   59.869493]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   59.869495] 
               -> #0 (&zilog->zl_lock){+.+.}-{3:3}:
[   59.869503]        __lock_acquire+0x1298/0x2460
[   59.869507]        lock_acquire+0xab/0x380
[   59.869510]        __mutex_lock+0xac/0x9e0
[   59.869512]        mutex_lock_nested+0x1b/0x20
[   59.869577]        zil_alloc_lwb+0x1f8/0x4a0 [zfs]
[   59.869715]        zil_lwb_write_issue+0x2e8/0x580 [zfs]
[   59.869843]        zil_commit_waiter_timeout+0x236/0x670 [zfs]
[   59.869972]        zil_commit_waiter+0x2e9/0x4d0 [zfs]
[   59.870100]        zil_commit_impl+0x55/0x150 [zfs]
[   59.870227]        zil_commit+0x5f/0x170 [zfs]
[   59.870354]        zpl_writepages+0xf8/0x1a0 [zfs]
[   59.870423]        do_writepages+0x43/0xf0
[   59.870427]        __filemap_fdatawrite_range+0xd5/0x110
[   59.870430]        filemap_write_and_wait_range+0x4b/0xb0
[   59.870495]        zpl_fsync+0x4d/0x120 [zfs]
[   59.870561]        vfs_fsync_range+0x49/0x80
[   59.870565]        do_fsync+0x3d/0x70
[   59.870568]        __x64_sys_fsync+0x14/0x20
[   59.870571]        do_syscall_64+0x38/0x90
[   59.870575]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   59.870577] 
               other info that might help us debug this:

[   59.870578]  Possible unsafe locking scenario:

[   59.870579]        CPU0                    CPU1
[   59.870580]        ----                    ----
[   59.870580]   lock(&zcw->zcw_lock);
[   59.870583]                                lock(&zilog->zl_lock);
[   59.870585]                                lock(&zcw->zcw_lock);
[   59.870587]   lock(&zilog->zl_lock);
[   59.870589] 
                *** DEADLOCK ***

[   59.870591] 2 locks held by systemd-journal/3685:
[   59.870592]  #0: ffff989c8b1bc2b8 (&zilog->zl_issuer_lock){+.+.}-{3:3}, at: zil_commit_waiter_timeout+0x15c/0x670 [zfs]
[   59.870750]  #1: ffff989c6da5b1f8 (&zcw->zcw_lock){+.+.}-{3:3}, at: zil_commit_waiter_timeout+0x1ca/0x670 [zfs]
[   59.870881] 
               stack backtrace:
[   59.870884] CPU: 11 PID: 22 Comm: systemd-journal Tainted: G           O      5.9.0-rc1+ #51
[   59.870887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[   59.870889] Call Trace:
[   59.870896]  dump_stack+0x9f/0xda
[   59.870900]  print_circular_bug.isra.0.cold+0x13d/0x142
[   59.870904]  check_noncircular+0x169/0x180
[   59.870911]  __lock_acquire+0x1298/0x2460
[   59.870917]  lock_acquire+0xab/0x380
[   59.870983]  ? zil_alloc_lwb+0x1f8/0x4a0 [zfs]
[   59.871052]  __mutex_lock+0xac/0x9e0
[   59.871117]  ? zil_alloc_lwb+0x1f8/0x4a0 [zfs]
[   59.871204]  ? spl_kmem_cache_alloc+0xda/0xd50 [spl]
[   59.871286]  ? zil_alloc_lwb+0x1f8/0x4a0 [zfs]
[   59.871355]  mutex_lock_nested+0x1b/0x20
[   59.871357]  ? mutex_lock_nested+0x1b/0x20
[   59.871423]  zil_alloc_lwb+0x1f8/0x4a0 [zfs]
[   59.871553]  zil_lwb_write_issue+0x2e8/0x580 [zfs]
[   59.871715]  zil_commit_waiter_timeout+0x236/0x670 [zfs]
[   59.871844]  zil_commit_waiter+0x2e9/0x4d0 [zfs]
[   59.871984]  zil_commit_impl+0x55/0x150 [zfs]
[   59.872109]  zil_commit+0x5f/0x170 [zfs]
[   59.872257]  zpl_writepages+0xf8/0x1a0 [zfs]
[   59.872323]  do_writepages+0x43/0xf0
[   59.872327]  ? _raw_spin_unlock+0x23/0x30
[   59.872331]  __filemap_fdatawrite_range+0xd5/0x110
[   59.872336]  filemap_write_and_wait_range+0x4b/0xb0
[   59.872415]  zpl_fsync+0x4d/0x120 [zfs]
[   59.872482]  vfs_fsync_range+0x49/0x80
[   59.872487]  do_fsync+0x3d/0x70
[   59.872491]  __x64_sys_fsync+0x14/0x20
[   59.872495]  do_syscall_64+0x38/0x90
[   59.872499]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   59.872503] RIP: 0033:0x7fe5bf0a6214
[   59.872507] Code: 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8d 05 f1 96 2d 00 8b 00 85 c0 75 13 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c f3 c3 66 90 53 89 fb 48 83 ec 10 e8 b4 94
[   59.872509] RSP: 002b:00007ffde604ec88 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[   59.872513] RAX: ffffffffffffffda RBX: 00005585ac058640 RCX: 00007fe5bf0a6214
[   59.872515] RDX: 00007fe5ba7a5000 RSI: 0000000000000003 RDI: 0000000000000015
[   59.872517] RBP: 00007fe5becb4ea0 R08: 0000000000000005 R09: 0000000000000002
[   59.872518] R10: 00005585ac048010 R11: 0000000000000246 R12: 0000000000000002
[   59.872520] R13: 00000000fffffff0 R14: 00005585ac058640 R15: 00007ffde604f0d0

snajpa added a commit to vpsfreecz/zfs that referenced this issue Sep 3, 2020
There are three locks at play when handling ZIL records. The first is
zl_issuer_lock, which is supposed to be held whenever we're issuing a
new record to the ZIL. Then there is zl_lock, which actually protects
the data in the zilog struct. And finally there is zcw_lock, which is
used to protect the lists of ZIL call-back waiters.

The locking order is supposed to be:

zl_issuer_lock -> zcw_lock -> zl_lock, as implied by their usage in
the zil_alloc_lwb and zil_commit_waiter_timeout functions.

The function zil_commit_waiter_link_lwb goes against this: it expects
to be entered with zl_lock already held and only then takes zcw_lock.

This patch straightens the locking in zil_commit_waiter_link_lwb to
take zl_lock on its own, correcting the locking order.

Following is the message from the Linux lockdep mechanism describing
the potential deadlock - in our production, it was a matter of a few
hours to hit this one...

(redacted to fit 72 cols)

systemd-journal/5561 is trying to acquire lock:
(&zilog->zl_lock){+.+.}-{4:4}, at: zil_alloc_lwb+0x1df/0x3e0

but task is already holding lock:
(&zcw->zcw_lock){+.+.}-{4:4}, at: zil_commit_impl+0x52b/0x1850

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&zcw->zcw_lock){+.+.}-{4:4}:
       __mutex_lock+0xac/0x9e0
       mutex_lock_nested+0x1b/0x20
       zil_commit_waiter_link_lwb+0x51/0x1a0 [zfs]
       zil_commit_impl+0x12b0/0x1850 [zfs]
       zil_commit+0x43/0x60 [zfs]
       zpl_writepages+0xf8/0x1a0 [zfs]
       do_writepages+0x43/0xf0
       __filemap_fdatawrite_range+0xd5/0x110
       filemap_write_and_wait_range+0x4b/0xb0
       zpl_fsync+0x4d/0xb0 [zfs]
       vfs_fsync_range+0x49/0x80
       do_fsync+0x3d/0x70
       __x64_sys_fsync+0x14/0x20
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (&zilog->zl_lock){+.+.}-{4:4}:
       __lock_acquire+0x12b6/0x2480
       lock_acquire+0xab/0x380
       __mutex_lock+0xac/0x9e0
       mutex_lock_nested+0x1b/0x20
       zil_alloc_lwb+0x1df/0x3e0 [zfs]
       zil_lwb_write_issue+0x265/0x3f0 [zfs]
       zil_commit_impl+0x577/0x1850 [zfs]
       zil_commit+0x43/0x60 [zfs]
       zpl_writepages+0xf8/0x1a0 [zfs]
       do_writepages+0x43/0xf0
       __filemap_fdatawrite_range+0xd5/0x110
       filemap_write_and_wait_range+0x4b/0xb0
       zpl_fsync+0x4d/0xb0 [zfs]
       vfs_fsync_range+0x49/0x80
       do_fsync+0x3d/0x70
       __x64_sys_fsync+0x14/0x20
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&zcw->zcw_lock);
                               lock(&zilog->zl_lock);
                               lock(&zcw->zcw_lock);
  lock(&zilog->zl_lock);

 *** DEADLOCK ***

2 locks held by systemd-journal/5561:
 #0: (&zilog->zl_issuer_lock){+.+.}-{4:4}, at: zil_commit_impl+0x4dc...
 #1: (&zcw->zcw_lock){+.+.}-{4:4}, at: zil_commit_impl+0x52b/0x1850

Fixes openzfs#7038
Fixes openzfs#10440

Signed-off-by: Pavel Snajdr <[email protected]>
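
As an aside, the hierarchy documented in the commit message above (zl_issuer_lock -> zcw_lock -> zl_lock) is exactly the kind of invariant lockdep verifies in the kernel. A rough user-space analogue, with names and ranks made up purely for illustration, might look like this:

/* Sketch only, not OpenZFS code: a documented lock hierarchy expressed as
 * ranks, so that acquiring locks out of order trips an assertion in user
 * space, much like lockdep flags the inversion inside the kernel. */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

enum lock_rank { RANK_NONE = 0, RANK_ISSUER = 1, RANK_ZCW = 2, RANK_ZL = 3 };

struct ranked_lock {
	pthread_mutex_t mtx;
	enum lock_rank  rank;
};

/* Highest rank currently held by this thread; locks must be taken in
 * strictly increasing rank order. */
static __thread enum lock_rank held_rank = RANK_NONE;

static void ranked_lock(struct ranked_lock *l)
{
	assert(l->rank > held_rank);    /* an ordering violation trips here */
	pthread_mutex_lock(&l->mtx);
	held_rank = l->rank;
}

static void ranked_unlock(struct ranked_lock *l, enum lock_rank outer)
{
	pthread_mutex_unlock(&l->mtx);
	held_rank = outer;              /* rank of the lock still held, if any */
}

int main(void)
{
	struct ranked_lock issuer = { .rank = RANK_ISSUER };
	struct ranked_lock zcw    = { .rank = RANK_ZCW };
	struct ranked_lock zl     = { .rank = RANK_ZL };

	pthread_mutex_init(&issuer.mtx, NULL);
	pthread_mutex_init(&zcw.mtx, NULL);
	pthread_mutex_init(&zl.mtx, NULL);

	/* Compliant order, matching the hierarchy documented above. */
	ranked_lock(&issuer);
	ranked_lock(&zcw);
	ranked_lock(&zl);
	ranked_unlock(&zl, RANK_ZCW);
	ranked_unlock(&zcw, RANK_ISSUER);
	ranked_unlock(&issuer, RANK_NONE);

	puts("lock hierarchy respected");
	return 0;
}
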
snajpa mentioned this issue Sep 3, 2020
snajpa added a commit to vpsfreecz/zfs that referenced this issue Sep 3, 2020
snajpa added a commit to vpsfreecz/zfs that referenced this issue Sep 3, 2020

snajpa commented Jan 26, 2021

It took a while, but the problem has manifested again, running with a ~2-month-old git HEAD:

[[email protected]:~]# for pid in `ct ps -o pid 18205`; do pid=$(ps aux | grep journald | grep $pid | awk '{ print $2; }'); if [ "$pid" != "" ]; then ps aux | grep $pid; cat /proc/$pid/stack; fi; done
5071175+   12166  0.0  0.0  27432  4504 ?        Ds   11:51   0:00 /lib/systemd/systemd-journald
90308608  121662  0.0  0.0 217132   616 pts/3    Ss+   2020   0:00 agetty --login-program /nix/store/1l5yw8hyfpvcn2a5lzds5nvgaz519n67-shadow-4.8/bin/login --noclear --keep-baud pts/3 115200,38400,9600 vt220
root     3166910  0.0  0.0   6348  2268 pts/194  S+   13:07   0:00 grep 12166
[<0>] zil_commit_impl+0x147/0xe00 [zfs]
[<0>] zfs_fsync+0x72/0x100 [zfs]
[<0>] zpl_fsync+0x67/0xa0 [zfs]
[<0>] do_fsync+0x38/0x70
[<0>] __x64_sys_fsync+0x10/0x20
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
5071175+  784154  0.0  0.0  27432  4456 ?        Ds   10:27   0:00 /lib/systemd/systemd-journald
root     3169606  0.0  0.0   6348  2272 pts/194  S+   13:08   0:00 grep 784154
[<0>] zil_commit_impl+0x147/0xe00 [zfs]
[<0>] zfs_fsync+0x72/0x100 [zfs]
[<0>] zpl_fsync+0x67/0xa0 [zfs]
[<0>] do_fsync+0x38/0x70
[<0>] __x64_sys_fsync+0x10/0x20
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
5071175+  857385  0.0  0.0  27432  4692 ?        Ds   12:12   0:00 /lib/systemd/systemd-journald
root     3171751  0.0  0.0   6348  2236 pts/194  S+   13:08   0:00 grep 857385
[<0>] zil_commit_impl+0x147/0xe00 [zfs]
[<0>] zfs_fsync+0x72/0x100 [zfs]
[<0>] zpl_fsync+0x67/0xa0 [zfs]
[<0>] do_fsync+0x38/0x70
[<0>] __x64_sys_fsync+0x10/0x20
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
5071175+ 1651466  0.0  0.0  27432  4460 ?        Ds   10:48   0:00 /lib/systemd/systemd-journald
root     3174521  0.0  0.0   6348  2176 pts/194  S+   13:08   0:00 grep 1651466
[<0>] zil_commit_impl+0x147/0xe00 [zfs]
[<0>] zfs_fsync+0x72/0x100 [zfs]
[<0>] zpl_fsync+0x67/0xa0 [zfs]
[<0>] do_fsync+0x38/0x70
[<0>] __x64_sys_fsync+0x10/0x20
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
5071175+ 1703898  0.0  0.0  27432  4680 ?        Ds   12:34   0:00 /lib/systemd/systemd-journald
root     3174676  0.0  0.0   6348  2180 pts/194  S+   13:08   0:00 grep 1703898
[<0>] zil_commit_impl+0x147/0xe00 [zfs]
[<0>] zfs_fsync+0x72/0x100 [zfs]
[<0>] zpl_fsync+0x67/0xa0 [zfs]
[<0>] do_fsync+0x38/0x70
[<0>] __x64_sys_fsync+0x10/0x20
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
5071175+ 2510625  0.0  0.0  27432  4432 ?        Ds   11:09   0:00 /lib/systemd/systemd-journald
root     3187083  0.0  0.0   6348  2272 pts/194  S+   13:08   0:00 grep 2510625
[<0>] zil_commit_impl+0x147/0xe00 [zfs]
[<0>] zfs_fsync+0x72/0x100 [zfs]
[<0>] zpl_fsync+0x67/0xa0 [zfs]
[<0>] do_fsync+0x38/0x70
[<0>] __x64_sys_fsync+0x10/0x20
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
5071175+ 2545314  0.0  0.0  27432  4340 ?        Ds   09:24   0:00 /lib/systemd/systemd-journald
root     3188253  0.0  0.0   6348  2264 pts/194  S+   13:08   0:00 grep 2545314
[<0>] zil_commit_impl+0x147/0xe00 [zfs]
[<0>] zpl_writepages+0xdd/0x180 [zfs]
[<0>] do_writepages+0x43/0xd0
[<0>] __filemap_fdatawrite_range+0xce/0x110
[<0>] filemap_write_and_wait_range+0x26/0x70
[<0>] zpl_fsync+0x37/0xa0 [zfs]
[<0>] do_fsync+0x38/0x70
[<0>] __x64_sys_fsync+0x10/0x20
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
5071175+ 2641300  0.0  0.0  27432  4612 ?        Ds   12:55   0:00 /lib/systemd/systemd-journald
root     3189963  0.0  0.0   6348  2276 pts/194  S+   13:08   0:00 grep 2641300
[<0>] zil_commit_impl+0x147/0xe00 [zfs]
[<0>] zfs_fsync+0x72/0x100 [zfs]
[<0>] zpl_fsync+0x67/0xa0 [zfs]
[<0>] do_fsync+0x38/0x70
[<0>] __x64_sys_fsync+0x10/0x20
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
root     3191574  0.0  0.0   6348  2264 pts/194  S+   13:08   0:00 grep 3293685
5071175+ 3293685  0.0  0.0  27432  4440 ?        Ds   11:30   0:00 /lib/systemd/systemd-journald
[<0>] zil_commit_impl+0x147/0xe00 [zfs]
[<0>] zfs_fsync+0x72/0x100 [zfs]
[<0>] zpl_fsync+0x67/0xa0 [zfs]
[<0>] do_fsync+0x38/0x70
[<0>] __x64_sys_fsync+0x10/0x20
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
root     3191923  0.0  0.0   6348  2272 pts/194  S+   13:08   0:00 grep 3337412
5071175+ 3337412  0.0  0.0  27432  4432 ?        Ds   09:45   0:00 /lib/systemd/systemd-journald
[<0>] zil_commit_impl+0x147/0xe00 [zfs]
[<0>] zfs_fsync+0x72/0x100 [zfs]
[<0>] zpl_fsync+0x67/0xa0 [zfs]
[<0>] do_fsync+0x38/0x70
[<0>] __x64_sys_fsync+0x10/0x20
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
root     3195197  0.0  0.0   6348  2236 pts/194  S+   13:08   0:00 grep 4063111
5071175+ 4063111  0.0  0.0 213280 10808 ?        Dsl   2020  21:40 /lib/systemd/systemd-journald
[<0>] do_coredump+0x33f/0x1070
[<0>] get_signal+0x166/0x8f0
[<0>] arch_do_signal+0x30/0x700
[<0>] exit_to_user_mode_prepare+0xf9/0x160
[<0>] syscall_exit_to_user_mode+0x2c/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
root     3199445  0.0  0.0   6348  2268 pts/194  S+   13:08   0:00 grep 4188468
5071175+ 4188468  0.0  0.0  27432  4440 ?        Ds   10:06   0:00 /lib/systemd/systemd-journald
[<0>] zil_commit_impl+0x147/0xe00 [zfs]
[<0>] zfs_fsync+0x72/0x100 [zfs]
[<0>] zpl_fsync+0x67/0xa0 [zfs]
[<0>] do_fsync+0x38/0x70
[<0>] __x64_sys_fsync+0x10/0x20
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

snajpa commented Jan 26, 2021

Well, our issue seems to be only relevant when core-dumping - #10875 (comment)

AndyLavr commented Nov 21, 2021

Well, our issue seems to be only relevant when core-dumping - #10875 (comment)

The problem is real and it does exist - linux-next 20211118 + current ZFS:

You can catch this deadlock when debugging; use this config:

# The config is based on running daily CI for enterprise Linux distros to
# seek regressions on linux-next builds on different bare-metal and virtual
# platforms. It can be used for example,
#
# $ make ARCH=amd64 defconfig debug.config
#
# Keep alphabetically sorted inside each section.
#
# printk and dmesg options
#
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DYNAMIC_DEBUG=y
CONFIG_PRINTK_CALLER=y
CONFIG_PRINTK_TIME=y
CONFIG_SYMBOLIC_ERRNAME=y
#
# Compile-time checks and compiler options
#
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_SECTION_MISMATCH=y
CONFIG_FRAME_WARN=2048
CONFIG_SECTION_MISMATCH_WARN_ONLY=y
#
# Generic Kernel Debugging Instruments
#
# CONFIG_UBSAN_ALIGNMENT is not set
# CONFIG_UBSAN_DIV_ZERO is not set
# CONFIG_UBSAN_TRAP is not set
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
CONFIG_DEBUG_IRQFLAGS=y
CONFIG_UBSAN=y
CONFIG_UBSAN_BOOL=y
CONFIG_UBSAN_BOUNDS=y
CONFIG_UBSAN_ENUM=y
CONFIG_UBSAN_SHIFT=y
CONFIG_UBSAN_UNREACHABLE=y
#
# Memory Debugging
#
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF is not set
# CONFIG_DEBUG_RODATA_TEST is not set
# CONFIG_DEBUG_WX is not set
# CONFIG_KFENCE is not set
# CONFIG_PAGE_POISONING is not set
# CONFIG_SLUB_STATS is not set
CONFIG_PAGE_EXTENSION=y
CONFIG_PAGE_OWNER=y
CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
CONFIG_DEBUG_OBJECTS_FREE=y
CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
CONFIG_DEBUG_OBJECTS_TIMERS=y
CONFIG_DEBUG_OBJECTS_WORK=y
CONFIG_DEBUG_PER_CPU_MAPS=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_DEBUG_VIRTUAL=y
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_DEBUG_VM_RB=y
CONFIG_DEBUG_VM_VMACACHE=y
CONFIG_GENERIC_PTDUMP=y
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_KASAN_INLINE=y
CONFIG_KASAN_VMALLOC=y
CONFIG_PTDUMP_DEBUGFS=y
CONFIG_SCHED_STACK_END_CHECK=y
CONFIG_SLUB_DEBUG_ON=y
#
# Debug Oops, Lockups and Hangs
#
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DETECT_HUNG_TASK=y
CONFIG_PANIC_ON_OOPS=y
CONFIG_PANIC_TIMEOUT=0
CONFIG_SOFTLOCKUP_DETECTOR=y
#
# Lock Debugging (spinlocks, mutexes, etc...)
#
# CONFIG_PROVE_RAW_LOCK_NESTING is not set
CONFIG_PROVE_LOCKING=y
#
# Debug kernel data structures
#
CONFIG_BUG_ON_DATA_CORRUPTION=y
#
# RCU Debugging
#
CONFIG_PROVE_RCU=y
CONFIG_PROVE_RCU_LIST=y
#
# Tracers
#
CONFIG_BRANCH_PROFILE_NONE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y

[  +0,010730] ======================================================
[  +0,000003] WARNING: possible circular locking dependency detected
[  +0,000003] 5.16.0-generic #20211119 Not tainted
[  +0,000004] ------------------------------------------------------
[  +0,000001] systemd-random-/883 is trying to acquire lock:
[  +0,000005] ffff88812d2ec068 (&zilog->zl_lock){+.+.}-{3:3}, at: zil_alloc_lwb+0x4a3/0xd20 [zfs]
[  +0,000321] 
              but task is already holding lock:
[  +0,000002] ffff888131d5d0a0 (&zcw->zcw_lock){+.+.}-{3:3}, at: zil_commit_waiter_timeout+0x26d/0xbe0 [zfs]
[  +0,000299] 
              which lock already depends on the new lock.

[  +0,000003] 
              the existing dependency chain (in reverse order) is:
[  +0,000002] 
              -> #1 (&zcw->zcw_lock){+.+.}-{3:3}:
[  +0,000012]        __lock_acquire+0xaaf/0x1750
[  +0,000010]        lock_acquire.part.0+0x105/0x350
[  +0,000006]        lock_acquire+0xf1/0x2b0
[  +0,000005]        __mutex_lock+0x151/0x13f0
[  +0,000006]        mutex_lock_nested+0x1b/0x20
[  +0,000006]        zil_commit_waiter_skip+0x6f/0x440 [zfs]
[  +0,000289]        zil_commit_impl+0xf80/0x20e0 [zfs]
[  +0,000279]        zil_commit+0xbc/0x120 [zfs]
[  +0,000284]        zfs_fsync+0x1db/0x3c0 [zfs]
[  +0,000286]        zpl_fsync+0x17e/0x290 [zfs]
[  +0,000282]        vfs_fsync_range+0xf6/0x220
[  +0,000007]        __x64_sys_fsync+0x56/0x90
[  +0,000005]        do_syscall_64+0x5c/0xc0
[  +0,000008]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0,000006] 
              -> #0 (&zilog->zl_lock){+.+.}-{3:3}:
[  +0,000012]        check_prev_add+0x162/0x20b0
[  +0,000007]        validate_chain+0xa9c/0xfc0
[  +0,000005]        __lock_acquire+0xaaf/0x1750
[  +0,000006]        lock_acquire.part.0+0x105/0x350
[  +0,000005]        lock_acquire+0xf1/0x2b0
[  +0,000005]        __mutex_lock+0x151/0x13f0
[  +0,000005]        mutex_lock_nested+0x1b/0x20
[  +0,000005]        zil_alloc_lwb+0x4a3/0xd20 [zfs]
[  +0,000282]        zil_lwb_write_issue+0x830/0x12c0 [zfs]
[  +0,000286]        zil_commit_waiter_timeout+0x403/0xbe0 [zfs]
[  +0,000289]        zil_commit_impl+0x8c3/0x20e0 [zfs]
[  +0,000287]        zil_commit+0xbc/0x120 [zfs]
[  +0,000290]        zfs_fsync+0x1db/0x3c0 [zfs]
[  +0,000284]        zpl_fsync+0x17e/0x290 [zfs]
[  +0,000283]        vfs_fsync_range+0xf6/0x220
[  +0,000007]        __x64_sys_fsync+0x56/0x90
[  +0,000005]        do_syscall_64+0x5c/0xc0
[  +0,000005]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0,000006] 
              other info that might help us debug this:

[  +0,000003]  Possible unsafe locking scenario:

[  +0,000002]        CPU0                    CPU1
[  +0,000001]        ----                    ----
[  +0,000002]   lock(&zcw->zcw_lock);
[  +0,000007]                                lock(&zilog->zl_lock);
[  +0,000005]                                lock(&zcw->zcw_lock);
[  +0,000006]   lock(&zilog->zl_lock);
[  +0,000005] 
               *** DEADLOCK ***

[  +0,000002] 2 locks held by systemd-random-/883:
[  +0,000004]  #0: ffff88812d2ec288 (&zilog->zl_issuer_lock){+.+.}-{3:3}, at: zil_commit_waiter_timeout+0x1b5/0xbe0 [zfs]
[  +0,000300]  #1: ffff888131d5d0a0 (&zcw->zcw_lock){+.+.}-{3:3}, at: zil_commit_waiter_timeout+0x26d/0xbe0 [zfs]
[  +0,000298] 
              stack backtrace:
[  +0,000003] CPU: 7 PID: 883 Comm: systemd-random- Not tainted 5.16.0-generic #20211119
[  +0,000007] Hardware name: Dell Inc. Precision M6600/04YY4M, BIOS A18 09/14/2018
[  +0,000004] Call Trace:
[  +0,000003]  <TASK>
[  +0,000004]  show_stack+0x52/0x58
[  +0,000008]  dump_stack_lvl+0x74/0xa6
[  +0,000007]  dump_stack+0x10/0x12
[  +0,000005]  print_circular_bug.cold+0x15d/0x16c
[  +0,000012]  check_noncircular+0x268/0x310
[  +0,000008]  ? print_circular_bug+0x1d0/0x1d0
[  +0,000007]  ? lock_acquire.part.0+0x105/0x350
[  +0,000006]  ? _raw_spin_lock_irqsave+0xe/0x10
[  +0,000009]  ? check_prev_add+0x20b0/0x20b0
[  +0,000013]  check_prev_add+0x162/0x20b0
[  +0,000005]  ? add_chain_cache+0x26a/0x860
[  +0,000006]  ? call_rcu_zapped+0xb0/0xb0
[  +0,000012]  validate_chain+0xa9c/0xfc0
[  +0,000011]  ? check_prev_add+0x20b0/0x20b0
[  +0,000013]  __lock_acquire+0xaaf/0x1750
[  +0,000015]  lock_acquire.part.0+0x105/0x350
[  +0,000007]  ? zil_alloc_lwb+0x4a3/0xd20 [zfs]
[  +0,000288]  ? lock_acquire.part.0+0x105/0x350
[  +0,000007]  ? _raw_spin_lock_irqsave+0xe/0x10
[  +0,000006]  ? rcu_read_unlock+0x50/0x50
[  +0,000010]  ? __kasan_check_read+0x11/0x20
[  +0,000013]  lock_acquire+0xf1/0x2b0
[  +0,000006]  ? zil_alloc_lwb+0x4a3/0xd20 [zfs]
[  +0,000291]  __mutex_lock+0x151/0x13f0
[  +0,000006]  ? zil_alloc_lwb+0x4a3/0xd20 [zfs]
[  +0,000290]  ? zil_alloc_lwb+0x4a3/0xd20 [zfs]
[  +0,000289]  ? _raw_spin_unlock_irqrestore+0xe/0x10
[  +0,000007]  ? create_object.isra.0+0x5a7/0xae0
[  +0,000009]  ? mutex_lock_io_nested+0x1240/0x1240
[  +0,000009]  ? rcu_read_lock_sched_held+0x46/0x80
[  +0,000008]  ? kmem_cache_alloc+0x326/0x3f0
[  +0,000012]  ? spl_kmem_cache_alloc+0x2a1/0x3d0 [spl]
[  +0,000030]  mutex_lock_nested+0x1b/0x20
[  +0,000005]  ? zio_buf_alloc+0x4f/0xa0 [zfs]
[  +0,000286]  ? mutex_lock_nested+0x1b/0x20
[  +0,000007]  zil_alloc_lwb+0x4a3/0xd20 [zfs]
[  +0,000289]  ? dsl_dataset_dirty+0x122/0x390 [zfs]
[  +0,232318]  zil_lwb_write_issue+0x830/0x12c0 [zfs]
[  +0,001872] systemd[1]: Started Journal Service.
[  -0,000061]  ? zil_lwb_add_block+0x8e0/0x8e0 [zfs]
[  +0,003970]  zil_commit_waiter_timeout+0x403/0xbe0 [zfs]
[  +0,001912]  zil_commit_impl+0x8c3/0x20e0 [zfs]
[  +0,000175]  ? __mutex_unlock_slowpath+0x160/0x3c0
[  +0,000012]  ? zil_itx_assign+0x1180/0x1180 [zfs]
[  +0,000165]  ? _raw_spin_unlock+0x23/0x40
[  +0,000005]  ? rrw_enter_read_impl+0x3e9/0x830 [zfs]
[  +0,000160]  zil_commit+0xbc/0x120 [zfs]
[  +0,000163]  zfs_fsync+0x1db/0x3c0 [zfs]
[  +0,000171]  zpl_fsync+0x17e/0x290 [zfs]
[  +0,000163]  ? zpl_iter_write+0x5e0/0x5e0 [zfs]
[  +0,000152]  vfs_fsync_range+0xf6/0x220
[  +0,000005]  ? __fget_light+0x57/0x240
[  +0,000003]  ? syscall_enter_from_user_mode+0x21/0x60
[  +0,000006]  __x64_sys_fsync+0x56/0x90
[  +0,000004]  do_syscall_64+0x5c/0xc0
[  +0,000005]  ? lockdep_hardirqs_on_prepare.part.0+0x1be/0x370
[  +0,000006]  ? irqentry_exit_to_user_mode+0x29/0x30
[  +0,000003]  ? irqentry_exit+0x63/0x90
[  +0,000003]  ? exc_page_fault+0x86/0xf0
[  +0,000003]  ? asm_exc_page_fault+0x8/0x30
[  +0,000005]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0,000004] RIP: 0033:0x7f86c54468e7
[  +0,000005] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 e3 5f f7 ff
[  +0,000003] RSP: 002b:00007ffecef47df8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[  +0,000005] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f86c54468e7
[  +0,000003] RDX: 0000000000000200 RSI: 0000000000000200 RDI: 0000000000000004
[  +0,000002] RBP: 0000000000000004 R08: 000055be174c1a00 R09: 0000000000000000
[  +0,000002] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000005
[  +0,000002] R13: 0000000000000000 R14: 000055be174c1cf0 R15: 0000000000000200
[  +0,000007]  </TASK>

I will report the result after testing this fix:
Fix zfs_fsync deadlock

@AndyLavr

The situation changed after the patch: the TRAP disappeared, but the deadlock warning remained.

[   46.278264] ======================================================
[   46.278268] WARNING: possible circular locking dependency detected
[   46.278270] 5.16.0-generic #20211121 Not tainted
[   46.278272] ------------------------------------------------------
[   46.278273] z_ioctl_int/427 is trying to acquire lock:
[   46.278276] ffff88814a9ecd88 (&zcw->zcw_lock){+.+.}-{3:3}, at: zil_lwb_flush_vdevs_done+0x67e/0x1270 [zfs]
[   46.302996] 
               but task is already holding lock:
[   46.303001] ffff88812d6f1068 (&zilog->zl_lock){+.+.}-{3:3}, at: zil_lwb_flush_vdevs_done+0x204/0x1270 [zfs]
[   46.315210] 
               which lock already depends on the new lock.

[   46.315222] 
               the existing dependency chain (in reverse order) is:
[   46.315223] 
               -> #1 (&zilog->zl_lock){+.+.}-{3:3}:
[   46.315232]        __lock_acquire+0xaaf/0x1750
[   46.315241]        lock_acquire.part.0+0x105/0x350
[   46.315244]        lock_acquire+0xf1/0x2b0
[   46.315248]        __mutex_lock+0x151/0x13f0
[   46.346712]        mutex_lock_nested+0x1b/0x20
[   46.346729]        zil_commit_waiter_link_lwb+0x137/0x930 [zfs]
[   46.352404]        zil_lwb_commit+0x8a7/0xc30 [zfs]
[   46.352595]        zil_process_commit_list+0x3df/0x1700 [zfs]
[   46.352758]        zil_commit_impl+0x1755/0x22b0 [zfs]
[   46.352919]        zil_commit+0xbc/0x120 [zfs]
[   46.353081]        zfs_fsync+0x1db/0x3c0 [zfs]
[   46.353242]        zpl_fsync+0x17e/0x290 [zfs]
[   46.353411]        vfs_fsync_range+0xf6/0x220
[   46.353416]        __x64_sys_fsync+0x56/0x90
[   46.353419]        do_syscall_64+0x5c/0xc0
[   46.353424]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[   46.353429] 
               -> #0 (&zcw->zcw_lock){+.+.}-{3:3}:
[   46.353438]        check_prev_add+0x162/0x20b0
[   46.353442]        validate_chain+0xa9c/0xfc0
[   46.353446]        __lock_acquire+0xaaf/0x1750
[   46.353449]        lock_acquire.part.0+0x105/0x350
[   46.353452]        lock_acquire+0xf1/0x2b0
[   46.353455]        __mutex_lock+0x151/0x13f0
[   46.353458]        mutex_lock_nested+0x1b/0x20
[   46.353461]        zil_lwb_flush_vdevs_done+0x67e/0x1270 [zfs]
[   46.353623]        zio_done+0xcbb/0x5600 [zfs]
[   46.353785]        zio_execute+0x1c5/0x450 [zfs]
[   46.353947]        taskq_thread+0x9de/0x15d0 [spl]
[   46.353963]        kthread+0x35d/0x420
[   46.353966]        ret_from_fork+0x22/0x30
[   46.353971] 
               other info that might help us debug this:

[   46.353973]  Possible unsafe locking scenario:

[   46.353974]        CPU0                    CPU1
[   46.353975]        ----                    ----
[   46.353976]   lock(&zilog->zl_lock);
[   46.353980]                                lock(&zcw->zcw_lock);
[   46.353983]                                lock(&zilog->zl_lock);
[   46.353986]   lock(&zcw->zcw_lock);
[   46.353989] 
                *** DEADLOCK ***

[   46.353990] 1 lock held by z_ioctl_int/427:
[   46.353993]  #0: ffff88812d6f1068 (&zilog->zl_lock){+.+.}-{3:3}, at: zil_lwb_flush_vdevs_done+0x204/0x1270 [zfs]
[   46.354163] 
               stack backtrace:
[   46.354165] CPU: 1 PID: 427 Comm: z_ioctl_int Not tainted 5.16.0-generic #20211121
[   46.354169] Hardware name: Dell Inc. Precision M6600/04YY4M, BIOS A18 09/14/2018
[   46.354172] Call Trace:
[   46.354174]  <TASK>
[   46.354176]  show_stack+0x52/0x58
[   46.354182]  dump_stack_lvl+0x74/0xa6
[   46.354187]  dump_stack+0x10/0x12
[   46.354190]  print_circular_bug.cold+0x15d/0x16c
[   46.354197]  check_noncircular+0x268/0x310
[   46.354202]  ? print_circular_bug+0x1d0/0x1d0
[   46.354208]  ? ret_from_fork+0x22/0x30
[   46.354215]  check_prev_add+0x162/0x20b0
[   46.354219]  ? add_chain_cache+0x26a/0x860
[   46.354223]  ? call_rcu_zapped+0xb0/0xb0
[   46.354229]  validate_chain+0xa9c/0xfc0
[   46.354235]  ? check_prev_add+0x20b0/0x20b0
[   46.354242]  __lock_acquire+0xaaf/0x1750
[   46.354246]  ? zio_data_buf_free+0x52/0xa0 [zfs]
[   46.354410]  ? zil_lwb_flush_vdevs_done+0x59d/0x1270 [zfs]
[   46.354572]  ? zio_done+0xcbb/0x5600 [zfs]
[   46.354752]  lock_acquire.part.0+0x105/0x350
[   46.354757]  ? zil_lwb_flush_vdevs_done+0x67e/0x1270 [zfs]
[   46.354922]  ? rcu_read_unlock+0x50/0x50
[   46.354926]  ? _raw_spin_unlock_irqrestore+0xe/0x10
[   46.354932]  ? mark_held_locks+0xa4/0xf0
[   46.354936]  ? __kasan_slab_free+0xde/0x110
[   46.354942]  lock_acquire+0xf1/0x2b0
[   46.354946]  ? zil_lwb_flush_vdevs_done+0x67e/0x1270 [zfs]
[   46.355112]  __mutex_lock+0x151/0x13f0
[   46.355116]  ? zil_lwb_flush_vdevs_done+0x67e/0x1270 [zfs]
[   46.355279]  ? __kasan_slab_free+0xde/0x110
[   46.355284]  ? zil_lwb_flush_vdevs_done+0x67e/0x1270 [zfs]
[   46.355447]  ? spl_kmem_cache_free+0x3ad/0x6a0 [spl]
[   46.355463]  ? mutex_lock_io_nested+0x1240/0x1240
[   46.355466]  ? kmem_cache_free+0x1bd/0x3d0
[   46.355472]  ? lockdep_hardirqs_on_prepare.part.0+0x1be/0x370
[   46.355479]  ? spl_kmem_cache_free+0x3f0/0x6a0 [spl]
[   46.355495]  mutex_lock_nested+0x1b/0x20
[   46.355498]  ? mutex_lock_nested+0x1b/0x20
[   46.355502]  zil_lwb_flush_vdevs_done+0x67e/0x1270 [zfs]
[   46.511317]  ? zil_create+0x670/0x670 [zfs]
[   46.513140]  ? zio_wait_for_children+0x1cb/0x4d0 [zfs]
[   46.513348] systemd[1]: Started Journal Service.
[   46.514977]  zio_done+0xcbb/0x5600 [zfs]
[   46.518333]  zio_execute+0x1c5/0x450 [zfs]
[   46.520161]  taskq_thread+0x9de/0x15d0 [spl]
[   46.520182]  ? taskq_thread_spawn+0x170/0x170 [spl]
[   46.520197]  ? wake_up_q+0x100/0x100
[   46.520203]  ? lock_acquire.part.0+0x105/0x350
[   46.520208]  ? _raw_spin_lock_irqsave+0xe/0x10
[   46.520213]  ? find_held_lock+0x35/0x130
[   46.520225]  ? zio_gang_tree_free+0xf0/0xf0 [zfs]
[   46.532048]  ? __kasan_check_read+0x11/0x20
[   46.532055]  ? __kthread_parkme+0xcb/0x1e0
[   46.532067]  ? __kasan_check_read+0x11/0x20
[   46.532070]  ? taskq_thread_spawn+0x170/0x170 [spl]
[   46.532086]  kthread+0x35d/0x420
[   46.540627]  ? set_kthread_struct+0x110/0x110
[   46.540632]  ret_from_fork+0x22/0x30
[   46.540650]  </TASK>

snajpa commented Nov 21, 2021

@AndyLavr I'm not sure the issues are actually related; it's been a while, and all I recall now is a good salad from the mix of info I had at the time :D

Are you experiencing this with any other linux revisions than the pretty recent -next?

Btw, do you think you'd be able to distill a reproducer from the workload you have?

AndyLavr commented Nov 21, 2021

@snajpa

Are you experiencing this with any other linux revisions than the pretty recent -next?

I was debugging another issue and accidentally saw this. I only use -next.

Btw, do you think you'd be able to distill a reproducer from the workload you have?

This is one message during boot. I will send you dmesg by email.

snajpa commented Nov 21, 2021

Ah, OK, I see, that's a lockdep warning about a potential deadlock; if you're not really hitting it, I guess you can ignore that one :)

I haven't seen the actual deadlock occur again, so it must have been resolved for quite some time now. :)

stale bot commented Nov 22, 2022

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

The stale bot added the Status: Stale label Nov 22, 2022
The stale bot closed this as completed Mar 19, 2023