Fix zfs_fsync deadlock #10875
Conversation
Never mind, there is a massive oversight in zil_prune_commit_list, which is taking the zl_lock too. So back to the drawing board...
Still, it will be interesting to let the bots die on that :-D I'll close the PR in a bit, waiting to see what they tell us :)
There are three locks at play when handling ZIL records. First is zl_issuer_lock, which is supposed to be held whenever we're issuing a new record to the ZIL. Then there is zl_lock, to actually protect the contents of the zilog struct. And then there's a last lock, zcw_lock, which is used to protect the lists of ZIL call-back waiters.

The order of locking is supposed to be: zl_issuer_lock -> zcw_lock -> zl_lock, as implied by their usage in the zil_alloc_lwb and zil_commit_waiter_timeout functions.

Function zil_commit_waiter_link_lwb goes against this: it expects to be entered with zl_lock already held and only then takes the zcw_lock. This patch straightens the locking in zil_commit_waiter_link_lwb to take the zl_lock on its own, correcting the locking order.

Following is the attached message from the Linux lockdep mechanism describing the potential deadlock; in our production, it was a matter of a few hours to hit this one... (redacted to fit 72 cols)

systemd-journal/5561 is trying to acquire lock:
 (&zilog->zl_lock){+.+.}-{4:4}, at: zil_alloc_lwb+0x1df/0x3e0

but task is already holding lock:
 (&zcw->zcw_lock){+.+.}-{4:4}, at: zil_commit_impl+0x52b/0x1850

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&zcw->zcw_lock){+.+.}-{4:4}:
       __mutex_lock+0xac/0x9e0
       mutex_lock_nested+0x1b/0x20
       zil_commit_waiter_link_lwb+0x51/0x1a0 [zfs]
       zil_commit_impl+0x12b0/0x1850 [zfs]
       zil_commit+0x43/0x60 [zfs]
       zpl_writepages+0xf8/0x1a0 [zfs]
       do_writepages+0x43/0xf0
       __filemap_fdatawrite_range+0xd5/0x110
       filemap_write_and_wait_range+0x4b/0xb0
       zpl_fsync+0x4d/0xb0 [zfs]
       vfs_fsync_range+0x49/0x80
       do_fsync+0x3d/0x70
       __x64_sys_fsync+0x14/0x20
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (&zilog->zl_lock){+.+.}-{4:4}:
       __lock_acquire+0x12b6/0x2480
       lock_acquire+0xab/0x380
       __mutex_lock+0xac/0x9e0
       mutex_lock_nested+0x1b/0x20
       zil_alloc_lwb+0x1df/0x3e0 [zfs]
       zil_lwb_write_issue+0x265/0x3f0 [zfs]
       zil_commit_impl+0x577/0x1850 [zfs]
       zil_commit+0x43/0x60 [zfs]
       zpl_writepages+0xf8/0x1a0 [zfs]
       do_writepages+0x43/0xf0
       __filemap_fdatawrite_range+0xd5/0x110
       filemap_write_and_wait_range+0x4b/0xb0
       zpl_fsync+0x4d/0xb0 [zfs]
       vfs_fsync_range+0x49/0x80
       do_fsync+0x3d/0x70
       __x64_sys_fsync+0x14/0x20
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&zcw->zcw_lock);
                               lock(&zilog->zl_lock);
                               lock(&zcw->zcw_lock);
  lock(&zilog->zl_lock);

 *** DEADLOCK ***

2 locks held by systemd-journal/5561:
 #0: (&zilog->zl_issuer_lock){+.+.}-{4:4}, at: zil_commit_impl+0x4dc...
 #1: (&zcw->zcw_lock){+.+.}-{4:4}, at: zil_commit_impl+0x52b/0x1850

Fixes openzfs#7038
Fixes openzfs#10440
Signed-off-by: Pavel Snajdr <[email protected]>
77c1ceb to af06fc7
Ok, I think I have figured out what to do about the …
		zil_commit_waiter_skip(itx->itx_private);
	} else {
		mutex_exit(&zilog->zl_lock);
Can you comment on why it's safe to drop this lock here, and then reacquire it in zil_commit_waiter_link_lwb? For example, may anything bad happen if zilog->zl_last_lwb_opened were to change in between the drop and reacquire?
I don't mean to imply this is wrong, just trying to refresh my memory on whether this is safe or not, since the locking between these three locks is subtle.
Would it have been possible for the lwb assigned above:

lwb_t *last_lwb = zilog->zl_last_lwb_opened;

to have been issued, and perhaps completed, in between dropping the lock here and acquiring it later? In which case, the lwb_t structure pointed to by last_lwb may no longer be valid?
I'm looking at the code, and I don't think that's possible, since we will be holding the zl_issuer_lock here, and it also has to be held when issuing the zl_last_lwb_opened; so there's no chance for last_lwb to be freed in between the drop/acquire of zl_lock.
Thoughts?
And just to be clear, my concern is whether the lwb we pass into zil_commit_waiter_link_lwb can be freed by the time it's used in that function, due to us dropping zl_lock and racing with some other thread.
Looking closer, the lwb wouldn't necessarily have to have been freed for this to break; we just need to ensure it doesn't transition into the LWB_STATE_FLUSH_DONE state during the drop/acquire, since at that point we would have wanted to skip the itx rather than add it to the lwb.
Further, I believe at this point last_lwb should be in the LWB_STATE_OPENED state, and another thread can't transition it to LWB_STATE_ISSUED (and thus it can't transition to LWB_STATE_FLUSH_DONE either), due to us still holding the zl_issuer_lock.
If I'm thinking about all this correctly, then this change makes sense to me.
The reason I put the unlock after the brackets in both if/else bodies is exactly so that zilog->zl_last_lwb_opened doesn't change. The function is then called with the *itx pointer, not the lwb. I think this is legit, but it for sure needs more eyeballs on it.
-> #1 (&zcw->zcw_lock){+.+.}-{4:4}:
__mutex_lock+0xac/0x9e0
mutex_lock_nested+0x1b/0x20
zil_commit_waiter_link_lwb+0x51/0x1a0 [zfs]
zil_commit_impl+0x12b0/0x1850 [zfs]
zil_commit+0x43/0x60 [zfs]
zpl_writepages+0xf8/0x1a0 [zfs]
do_writepages+0x43/0xf0
__filemap_fdatawrite_range+0xd5/0x110
filemap_write_and_wait_range+0x4b/0xb0
zpl_fsync+0x4d/0xb0 [zfs]
vfs_fsync_range+0x49/0x80
do_fsync+0x3d/0x70
__x64_sys_fsync+0x14/0x20
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&zilog->zl_lock){+.+.}-{4:4}:
__lock_acquire+0x12b6/0x2480
lock_acquire+0xab/0x380
__mutex_lock+0xac/0x9e0
mutex_lock_nested+0x1b/0x20
zil_alloc_lwb+0x1df/0x3e0 [zfs]
zil_lwb_write_issue+0x265/0x3f0 [zfs]
zil_commit_impl+0x577/0x1850 [zfs]
zil_commit+0x43/0x60 [zfs]
zpl_writepages+0xf8/0x1a0 [zfs]
do_writepages+0x43/0xf0
__filemap_fdatawrite_range+0xd5/0x110
filemap_write_and_wait_range+0x4b/0xb0
zpl_fsync+0x4d/0xb0 [zfs]
vfs_fsync_range+0x49/0x80
do_fsync+0x3d/0x70
__x64_sys_fsync+0x14/0x20
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Also, looking at these two stacks from the PR description, don't they both require the single (per-ZIL) zl_issuer_lock? So then, we actually could not have two threads simultaneously in these two stacks, and thus there isn't any actual deadlock concern (even if the lock ordering is confusing)?
I think it'd be good to get stack information from the system when the actual deadlock occurs, since I'm not sure I'm convinced that lockdep is reporting a legitimate deadlock scenario.
The actual lockdep message is a bit confusing, that is true. The main concern here is the zil_commit_waiter_timeout function, which releases the zcw->zcw_lock for LWBs that aren't finished yet, then takes the zl_issuer_lock, but doesn't care about the zl_lock at all, which might be held by a different thread. My main concern is the order of locking in that function:
mutex_exit(&zcw->zcw_lock);
mutex_enter(&zilog->zl_issuer_lock);
mutex_enter(&zcw->zcw_lock);
Now if we combine that with functions that take the zl_lock in combination with the above locks, and the fact that they do it in different orders (sometimes zcw_lock is first, sometimes zl_lock)... I don't know, I think this is rather opaque, and I like the idea of a separate lock protecting the zl_lwb_list best.
I'll paste the stacks of the stuck processes later today; they are still present on the production machines we run. We just clone the workload into a new dataset every time this happens, so that we don't have to reboot the whole box. This is precisely "enabled" by the zcw_lock being specific to the callback waiter.
So the node with the stuck processes has been cycled in the meantime; I'll have to wait a bit for this to turn up again... should have saved the stacks when I had the chance :-D
I don't know, I think this is rather opaque and I think I like the idea of separate lock protecting the zl_lwb_list best.
I don't necessarily disagree with this sentiment, but I'd like to determine whether we're making a change to eliminate a legitimate deadlock scenario, or whether we're making a change to make the code easier to read/comprehend/etc (and/or to silence a lockdep false positive).
Which is why I'd like to see the state of the system when the actual deadlock occurs, so we can verify whether that deadlock is the same issue being raised here, or if it's something different. In which case, if it is different, I'd still be OK looking into making some changes to the code that make this logic easier to understand.
then it takes the zl_issuer_lock, but doesn't care about the zl_lock at all, which might be held by a different thread.
I'm not sure how much of the ZIL logic and/or locking you're familiar with, so just to make sure we're all on the same page (and sorry if this isn't already documented in the code): the zl_issuer_lock is more about ensuring that only a single thread is ever issuing ZIL writes to disk than about protecting structures in memory.
Due to how the ZIL is laid out on disk, via a linked list of blocks and block pointers, we can only have a single thread issue ZIL writes to disk, as the thread that issues these writes is also the thread that allocates the "next" ZIL block on disk; and there can only ever be a single "next" block at any given time. So we use the zl_issuer_lock to ensure there aren't two threads racing when allocating that "next" block.
My main concern is the order of locking in that function
Taking my above explanation into account, here we have to acquire the zl_issuer_lock because we're about to call zil_lwb_write_issue, and zl_issuer_lock is effectively used to wrap calls to zil_lwb_write_issue rather than to protect access to any particular structure in memory.
Additionally, after we acquire zl_issuer_lock we have to re-acquire zcw_lock to ensure the waiter that we're concerned about doesn't transition into the "done" state while we're about to issue that waiter's lwb (this is also why we check zcw_done after we re-acquire the zcw_lock, since it could have been marked done after we dropped the lock).
We don't acquire the zl_lock at this point since we're not making any direct modifications to the zilog structure. Further, we may not need that lock at all, for example in the case where zcw_done is already set; acquiring zl_lock unnecessarily may hinder ZIL performance without any benefit (e.g. if acquiring this lock here conflicts with acquiring the lock in zil_lwb_flush_vdevs_done).
I'm more than happy to discuss this more; please let me know if this makes sense, and/or if I need to add any more details, clarification, etc...
Also, FWIW, there's some commentary on zl_issuer_lock and zl_lock in zil_impl.h, above the definitions of the lwb_state_t, lwb_t, and zil_commit_waiter_t structures.
Given what's stated in the testing section, about having run this through the original reproducer workload and no longer seeing the issue (and not seeing any new issues), LGTM.
PR still needs work, based on discussion in comments.
Just wanted to let you guys know that I'm letting this one sleep until the deadlock reappears in production. I may be chasing ghosts which don't actually exist anymore :-)
Well, I'm not able to prove this bug exists anymore, so closing this. 😸
Well, meh... TL;DR: not a ZFS bug; actually most likely a kernel bug, less likely something with gcc. It was weird to me that the last offsets all refer to lines with …
Description
There are three locks at play when handling ZIL records. First is zl_issuer_lock, which is supposed to be held whenever we're issuing a new record to the ZIL. Then there is zl_lock, to actually protect the contents of the zilog struct. And then there's a last lock, zcw_lock, which is used to protect the lists of ZIL call-back waiters.
The order of locking is supposed to be: zl_issuer_lock -> zcw_lock -> zl_lock, as implied by their usage in the zil_alloc_lwb and zil_commit_waiter_timeout functions.
Function zil_commit_waiter_link_lwb goes against this: it expects to be entered with zl_lock already held and only then takes the zcw_lock. This patch straightens the locking in zil_commit_waiter_link_lwb to take the zl_lock on its own, correcting the locking order.
Following is the attached message from the Linux lockdep mechanism describing the potential deadlock; in our production, it was a matter of a few hours to hit this one... (redacted to fit 72 cols)
Fixes #7038
Fixes #10440
Signed-off-by: Pavel Snajdr [email protected]
How Has This Been Tested?
Tested locally for now; deployment to production is pending. The lockdep warning is gone and I can't reproduce the deadlock anymore (though it was hard to reproduce as it was; sometimes it took hours in the testing setup I have - nothing fancy, just some Docker runs over and over again).