
Fix zfs_fsync deadlock #10875

Closed · wants to merge 1 commit
16 changes: 7 additions & 9 deletions module/zfs/zil.c
@@ -998,11 +998,8 @@ zil_commit_waiter_skip(zil_commit_waiter_t *zcw)
static void
zil_commit_waiter_link_lwb(zil_commit_waiter_t *zcw, lwb_t *lwb)
{
/*
* The lwb_waiters field of the lwb is protected by the zilog's
* zl_lock, thus it must be held when calling this function.
*/
ASSERT(MUTEX_HELD(&lwb->lwb_zilog->zl_lock));
ASSERT(MUTEX_HELD(&lwb->lwb_zilog->zl_issuer_lock));
ASSERT(!MUTEX_HELD(&lwb->lwb_zilog->zl_lock));

mutex_enter(&zcw->zcw_lock);
ASSERT(!list_link_active(&zcw->zcw_node));
@@ -1012,8 +1009,10 @@ zil_commit_waiter_link_lwb(zil_commit_waiter_t *zcw, lwb_t *lwb)
lwb->lwb_state == LWB_STATE_ISSUED ||
lwb->lwb_state == LWB_STATE_WRITE_DONE);

mutex_enter(&lwb->lwb_zilog->zl_lock);
list_insert_tail(&lwb->lwb_waiters, zcw);
zcw->zcw_lwb = lwb;
mutex_exit(&lwb->lwb_zilog->zl_lock);
mutex_exit(&zcw->zcw_lock);
}

@@ -1635,10 +1634,8 @@ zil_lwb_commit(zilog_t *zilog, itx_t *itx, lwb_t *lwb)
* For more details, see the comment above zil_commit().
*/
if (lrc->lrc_txtype == TX_COMMIT) {
mutex_enter(&zilog->zl_lock);
zil_commit_waiter_link_lwb(itx->itx_private, lwb);
itx->itx_private = NULL;
mutex_exit(&zilog->zl_lock);
return (lwb);
}

@@ -2168,8 +2165,8 @@ zil_prune_commit_list(zilog_t *zilog)
break;

mutex_enter(&zilog->zl_lock);

lwb_t *last_lwb = zilog->zl_last_lwb_opened;

if (last_lwb == NULL ||
last_lwb->lwb_state == LWB_STATE_FLUSH_DONE) {
/*
@@ -2178,13 +2175,14 @@ zil_prune_commit_list(zilog_t *zilog)
* never any itx's for it to wait on), so it's
* safe to skip this waiter and mark it done.
*/
mutex_exit(&zilog->zl_lock);
zil_commit_waiter_skip(itx->itx_private);
} else {
mutex_exit(&zilog->zl_lock);
Member:

Can you comment on why it's safe to drop this lock here, and then reacquire it in zil_commit_waiter_link_lwb? For example, may anything bad happen if zilog->zl_last_lwb_opened were to change in between the drop and reacquire?

I don't mean to imply this is wrong, just trying to refresh my memory on whether this is safe, since the locking between these three locks is subtle.

Member:

Would it have been possible for the lwb assigned above:

lwb_t *last_lwb = zilog->zl_last_lwb_opened;

to have been issued, and perhaps completed, in between dropping the lock here and acquiring it later? In which case, the lwb_t structure pointed to by last_lwb may no longer be valid?

I'm looking at the code, and I don't think that's possible, since we will be holding the zl_issuer_lock lock here, and it also has to be held when issuing the zl_last_lwb_opened; so there's no chance for last_lwb to be freed in between the drop/acquire of zl_lock.

Thoughts?

Member:

And just to be clear, my concern is if the lwb we pass into zil_commit_waiter_link_lwb can be freed by the time it's used in that function, due to us dropping zl_lock and racing with some other thread.

Member:

Looking closer, the lwb wouldn't necessarily have to have been freed for this to break, we just need to ensure it doesn't transition into the LWB_STATE_FLUSH_DONE state during the drop/acquire, since at that point we would have wanted to skip the itx, rather than add it to the lwb.

Further, I believe at this point, last_lwb should be in the LWB_STATE_OPENED state, and another thread can't transition it to the LWB_STATE_ISSUED (and thus it can't transition to LWB_STATE_FLUSH_DONE either), due to us still holding the zl_issuer_lock lock.

If I'm thinking about all this correctly, then I think this change makes sense to me.

Contributor Author (@snajpa, Sep 4, 2020):

The reason I put the unlock after the brackets in both if/else bodies is exactly so that zilog->zl_last_lwb_opened doesn't change. The function is then called with the *itx pointer, not the lwb. I think this is legit, but it certainly needs more eyeballs on it.

Member:

-> #1 (&zcw->zcw_lock){+.+.}-{4:4}:
       __mutex_lock+0xac/0x9e0
       mutex_lock_nested+0x1b/0x20
       zil_commit_waiter_link_lwb+0x51/0x1a0 [zfs]
       zil_commit_impl+0x12b0/0x1850 [zfs]
       zil_commit+0x43/0x60 [zfs]
       zpl_writepages+0xf8/0x1a0 [zfs]
       do_writepages+0x43/0xf0
       __filemap_fdatawrite_range+0xd5/0x110
       filemap_write_and_wait_range+0x4b/0xb0
       zpl_fsync+0x4d/0xb0 [zfs]
       vfs_fsync_range+0x49/0x80
       do_fsync+0x3d/0x70
       __x64_sys_fsync+0x14/0x20
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (&zilog->zl_lock){+.+.}-{4:4}:
       __lock_acquire+0x12b6/0x2480
       lock_acquire+0xab/0x380
       __mutex_lock+0xac/0x9e0
       mutex_lock_nested+0x1b/0x20
       zil_alloc_lwb+0x1df/0x3e0 [zfs]
       zil_lwb_write_issue+0x265/0x3f0 [zfs]
       zil_commit_impl+0x577/0x1850 [zfs]
       zil_commit+0x43/0x60 [zfs]
       zpl_writepages+0xf8/0x1a0 [zfs]
       do_writepages+0x43/0xf0
       __filemap_fdatawrite_range+0xd5/0x110
       filemap_write_and_wait_range+0x4b/0xb0
       zpl_fsync+0x4d/0xb0 [zfs]
       vfs_fsync_range+0x49/0x80
       do_fsync+0x3d/0x70
       __x64_sys_fsync+0x14/0x20
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

Also, looking at these two stacks from the PR description, don't they both require the single (per ZIL) zl_issuer_lock? So then, we actually could not have two threads simultaneously in these two stacks, and thus there isn't any actual deadlock concern (even if the lock ordering is confusing)?

I think it'd be good to get stack information from the system when the actual deadlock occurs, since I'm not sure I'm convinced that lockdep is reporting a legitimate deadlock scenario.

Contributor Author:

The actual lockdep message is a bit confusing, that is true. The main concern here is the zil_commit_waiter_timeout function, which releases zcw->zcw_lock for LWBs that aren't finished yet, then takes zl_issuer_lock, but doesn't touch zl_lock at all, even though it might be held by a different thread. What worries me most is the locking order in that function:

        mutex_exit(&zcw->zcw_lock);                                             
        mutex_enter(&zilog->zl_issuer_lock);                                    
        mutex_enter(&zcw->zcw_lock);  

Now if we combine that with functions that take zl_lock in combination with the above locks, and the fact that they do so in different orders (sometimes zcw_lock first, sometimes zl_lock)... I don't know, I think this is rather opaque, and I like the idea of a separate lock protecting the zl_lwb_list best.

I'll paste the stacks of the stuck processes later today; they are still present on the production machines we run. We just clone the workload into a new dataset every time this happens, so that we don't have to reboot the whole box. This is precisely "enabled" by the zcw_lock being specific to the callback waiter.

Contributor Author:

So the node with the stuck processes has been cycled meanwhile, I'll have to wait a bit for this to turn up again... should have saved the stacks when I had the chance :-D

Member:

> I don't know, I think this is rather opaque and I think I like the idea of separate lock protecting the zl_lwb_list best.

I don't necessarily disagree with this sentiment, but I'd like to determine if we're making a change to eliminate a legitimate deadlock scenario, or if we're making a change to make the code easier to read/comprehend/etc (and/or to silence a lockdep false positive).

Which is why I'd like to see the state of the system when the actual deadlock occurs, so we can verify whether that deadlock is the same issue being raised here, or if it's something different. If it is different, I'd still be OK with making some changes to the code that make this logic easier to understand.

> then it takes the zl_issuer_lock, but doesn't care about the zl_lock at all, which might be held by a different thread.

I'm not sure how much of the ZIL logic and/or locking you're familiar with, so just to make sure we're all on the same page (and sorry if this isn't already documented in the code): the zl_issuer_lock is more about ensuring that only a single thread is ever issuing ZIL writes to disk, rather than protecting structures in memory.

Due to how the ZIL is laid out on disk, as a linked list of blocks and block pointers, we can only have a single thread issue ZIL writes to disk, because the thread that issues these writes is also the thread that allocates the "next" ZIL block on disk, and there can only ever be a single "next" block at any given time. So we use the zl_issuer_lock to ensure there aren't two threads racing when allocating that "next" block.

> My main concern is the order of locking in that function

Taking my above explanation into account, here we have to acquire the zl_issuer_lock because we're about to call zil_lwb_write_issue, and zl_issuer_lock is effectively used to wrap calls to zil_lwb_write_issue rather than accessing any particular structure in memory.

Additionally, after we acquire zl_issuer_lock we have to re-acquire zcw_lock to ensure the waiter that we're concerned about doesn't transition into the "done" state while we're about to issue that waiter's lwb (this is also why we check zcw_done after we re-acquire the zcw_lock, since it could have been marked done after we dropped the lock).

We don't acquire the zl_lock at this point since we're not making any direct modifications to the zilog structure. Further, we may not need that lock at all, for example in the case where zcw_done is already set; acquiring zl_lock unnecessarily there could hinder ZIL performance without any benefit (e.g. if acquiring the lock here conflicts with acquiring it in zil_lwb_flush_vdevs_done).

I'm more than happy to discuss this more; please let me know if this makes sense, and/or if I need to add any more details, clarification, etc...

Member (@prakashsurya, Sep 8, 2020):

Also, FWIW, there's some commentary on zl_issuer_lock and zl_lock in zil_impl.h above the definitions of the lwb_state_t, lwb_t, and zil_commit_waiter_t structures.

zil_commit_waiter_link_lwb(itx->itx_private, last_lwb);
itx->itx_private = NULL;
}

mutex_exit(&zilog->zl_lock);

list_remove(&zilog->zl_itx_commit_list, itx);
zil_itx_destroy(itx);