Deadlock on dp_config_rwlock on simultaneous snap/rename #2652
Comments
It's completely reproducible by "hacking" … Example patch: https://gist.github.com/seletskiy/b5ac103852367f1afcf4. After the message "ok, waking up" …
@seletskiy I just filed #2654 as a duplicate of this without realizing it. I am behind on reading issues in the tracker. The other party that encountered this did not get backtraces to me until today. My analysis is that the txg_sync task called …

This is a Linux-specific regression, because Illumos does not have anything for renaming zvol snapshots (I am not sure the snapshots are even accessible there).

I have a rough plan of action sketched out in my head. The general idea is to make the rename functionality asynchronous. I will just keep a couple of lists of zvols to rename (syncing/quiescing style), similar to the txg commit. Then I will kick the system taskq to run this code whenever work is batched. That ought to avoid this deadlock. I just need to make sure not to race with zvol destruction.
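For illustration, here is a rough sketch of how that batched, asynchronous approach could look. This is not the actual patch: the list, the mutex, and the helper names (`zvol_rename_entry_t`, `zvol_rename_minors_async()`, `zvol_rename_task()`) are hypothetical, and initialization/teardown of the list and mutex are omitted. It only shows the "queue in sync context, drain from the system taskq" idea.

```c
/*
 * Hypothetical sketch only: batch rename requests from sync context and
 * let the system taskq apply them, so zvol_state_lock is never taken
 * while dp_config_rwlock is held.  The queueing helpers below do not
 * exist in ZFS; only zvol_rename_minors() is a real function.
 */
typedef struct zvol_rename_entry {
	list_node_t	zre_node;
	char		zre_oldname[MAXNAMELEN];
	char		zre_newname[MAXNAMELEN];
} zvol_rename_entry_t;

static list_t zvol_rename_list;		/* pending rename requests */
static kmutex_t zvol_rename_lock;	/* protects only the list above */

static void zvol_rename_task(void *arg);

/* Called from sync context: just queue the work; no heavy locks taken. */
static void
zvol_rename_minors_async(const char *oldname, const char *newname)
{
	zvol_rename_entry_t *zre;

	zre = kmem_alloc(sizeof (*zre), KM_PUSHPAGE);
	(void) strlcpy(zre->zre_oldname, oldname, sizeof (zre->zre_oldname));
	(void) strlcpy(zre->zre_newname, newname, sizeof (zre->zre_newname));

	mutex_enter(&zvol_rename_lock);
	list_insert_tail(&zvol_rename_list, zre);
	mutex_exit(&zvol_rename_lock);

	/* Kick the system taskq to drain the batch outside the sync task. */
	(void) taskq_dispatch(system_taskq, zvol_rename_task, NULL, TQ_SLEEP);
}

/* Runs in taskq context, where taking zvol_state_lock is safe. */
static void
zvol_rename_task(void *arg)
{
	zvol_rename_entry_t *zre;

	mutex_enter(&zvol_rename_lock);
	while ((zre = list_remove_head(&zvol_rename_list)) != NULL) {
		mutex_exit(&zvol_rename_lock);
		zvol_rename_minors(zre->zre_oldname, zre->zre_newname);
		kmem_free(zre, sizeof (*zre));
		mutex_enter(&zvol_rename_lock);
	}
	mutex_exit(&zvol_rename_lock);
}
```

The race with zvol destruction mentioned above would still need care in such a scheme, e.g. by draining the pending list before minors are torn down.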
@ryao: I tried to fix the bug yesterday and found that the root cause is that when a snapshot is created, at first it starts …

So, the actual problem is that the same lock is acquired, freed and then acquired again in the same flow, so another task has a chance to grab the lock in between. I've tried to unite those two actions in the sync task (by passing an additional field …). Trying to figure out why just now.

EDIT: oh, actually I was mistaken …
So, my point is that the issue may be not in the … Consider:

```c
dsl_dataset_snapshot(nvlist_t *snaps, nvlist_t *props, nvlist_t *errors)
{
    // [omitted]
    if (error == 0) {
        // This is an atomic operation, so the rrwlock will be held only
        // until it is done: we acquire the write lock on `dp_config_rwlock`
        // here and release it after dsl_dataset_snapshot_sync.
        error = dsl_sync_task(firstname, dsl_dataset_snapshot_check,
            dsl_dataset_snapshot_sync, &ddsa,
            fnvlist_num_pairs(snaps) * 3);
    }
    // [omitted]
#ifdef _KERNEL
    if (error == 0) {
        for (pair = nvlist_next_nvpair(snaps, NULL); pair != NULL;
            // [omitted]
            // Here, we acquire `zvol_state_lock` and then acquire
            // `dp_config_rwlock` for writing. Again.
            zvol_create_minors(snapname);
    }
#endif
```

So, here's the activity on those locks:

If …
Oh, ok, got it. Looks like it [the snapshot order] is the correct locking order, while …
This commit should prevent a deadlock on dp_config_rwlock when running `zfs rename` by ensuring zvol_rename_minors() is not called under this lock.

Signed-off-by: Stanislav Seletskiy <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#2652.
Closes openzfs#2525.
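In outline, the approach named in the commit message (call zvol_rename_minors() only after the sync task has finished and dp_config_rwlock has been dropped) looks roughly like the sketch below. This is heavily simplified and is not the literal diff: the function name `zfs_ioc_rename_sketch` is invented, error handling is abbreviated, and the recursive snapshot-rename case is omitted.

```c
/*
 * Simplified sketch (not the literal patch): do the rename through the
 * sync task first, and only touch the zvol minors once dsl_sync_task()
 * has returned, i.e. once dp_config_rwlock is no longer held.
 */
static int
zfs_ioc_rename_sketch(zfs_cmd_t *zc)
{
	char *at;
	int err;

	at = strchr(zc->zc_name, '@');
	if (at != NULL) {
		/* Snapshot rename: dataset@old -> dataset@new */
		*at = '\0';
		err = dsl_dataset_rename_snapshot(zc->zc_name, at + 1,
		    strchr(zc->zc_value, '@') + 1, B_FALSE);
		*at = '@';
	} else {
		err = dsl_dir_rename(zc->zc_name, zc->zc_value);
	}

#ifdef _KERNEL
	/* Safe here: the sync task is done and dp_config_rwlock is dropped. */
	if (err == 0)
		zvol_rename_minors(zc->zc_name, zc->zc_value);
#endif
	return (err);
}
```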
I have positive confirmation from the organization affected by this issue that @seletskiy's patch has resolved it. @seletskiy Thanks again for working on this. I was pleasantly surprised to see you solve it so quickly and elegantly.
I've been able to reproduce the above stacktrace from time to time on ZFS 0.6.3 while running `zfs snap` / `zfs rename` simultaneously.

My investigation leads me to believe that ZFS gets stuck in a deadlock because of the following sequence:

1. `zfs rename` pends a `rename` operation to be made on the sync event;
2. `zfs snap` performs its action immediately and acquires `zvol_state_lock` from the `zvol_create_minor` function;
3. `txg_sync` starts to process the `rename` operation;
4. `txg_sync` enters `dsl_sync_task_sync` and acquires `dp_config_rwlock` for writing;
5. `zfs snap` enters `dsl_pool_hold` and tries to acquire `dp_config_rwlock` for reading (and hangs there);
6. `txg_sync` enters `zvol_rename_minors` and tries to acquire `zvol_state_lock` (and hangs there).
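To make that sequence concrete, here is a small self-contained userspace program. It is an analogy only, not ZFS code: two pthread mutexes stand in for `zvol_state_lock` and `dp_config_rwlock`, and the `sleep()` calls play the role of the unlucky scheduling window. Compiled and run, it reproduces the same AB/BA lock inversion and hangs.

```c
/* Compile with: cc deadlock.c -o deadlock -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t zvol_state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t dp_config_rwlock = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for "zfs snap": zvol_create_minor() then dsl_pool_hold(). */
static void *
snap_task(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&zvol_state_lock);
	sleep(1);				/* let the other thread squeeze in */
	pthread_mutex_lock(&dp_config_rwlock);	/* hangs here */
	pthread_mutex_unlock(&dp_config_rwlock);
	pthread_mutex_unlock(&zvol_state_lock);
	return (NULL);
}

/* Stands in for txg_sync: dsl_sync_task_sync() then zvol_rename_minors(). */
static void *
sync_task(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&dp_config_rwlock);
	sleep(1);
	pthread_mutex_lock(&zvol_state_lock);	/* hangs here */
	pthread_mutex_unlock(&zvol_state_lock);
	pthread_mutex_unlock(&dp_config_rwlock);
	return (NULL);
}

int
main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, snap_task, NULL);
	pthread_create(&b, NULL, sync_task, NULL);
	pthread_join(a, NULL);	/* never returns once both threads block */
	pthread_join(b, NULL);
	printf("no deadlock this time\n");
	return (0);
}
```

The fix discussed above breaks exactly this cycle by ensuring the thread holding `dp_config_rwlock` never has to wait for `zvol_state_lock`.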