Deadlock on dp_config_rwlock on simultaneous snap/rename #2652

Closed
seletskiy opened this issue Sep 2, 2014 · 6 comments

@seletskiy
Contributor

Sep 02 09:04:47 kernel: INFO: task txg_sync:1009 blocked for more than 120 seconds.
Sep 02 09:04:47 kernel:       Tainted: P           O  3.16.1-1-apparmor #1
Sep 02 09:04:47 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 02 09:04:47 kernel: txg_sync        D 0000000000000004     0  1009      2 0x00000000
Sep 02 09:04:47 kernel:  ffff8817f07cbaf0 0000000000000046 ffff8817e9bef010 00000000000145c0
Sep 02 09:04:47 kernel:  ffff8817f07cbfd8 00000000000145c0 ffff8817e9bef010 0000000000000112
Sep 02 09:04:47 kernel:  ffff8817f07cbac0 ffffffffa0188a4d ffff8817f04b6900 ffff8817f07cba98
Sep 02 09:04:47 kernel: Call Trace:
Sep 02 09:04:47 kernel:  [<ffffffffa0188a4d>] ? kmem_alloc_debug+0x29d/0x540 [spl]
Sep 02 09:04:47 kernel:  [<ffffffffa04be031>] ? dbuf_dirty+0x471/0x8c0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffff811b22e3>] ? __kmalloc+0x163/0x1c0
Sep 02 09:04:47 kernel:  [<ffffffffa0188a4d>] ? kmem_alloc_debug+0x29d/0x540 [spl]
Sep 02 09:04:47 kernel:  [<ffffffff81549d79>] schedule+0x29/0x70
Sep 02 09:04:47 kernel:  [<ffffffff8154a1e6>] schedule_preempt_disabled+0x16/0x20
Sep 02 09:04:47 kernel:  [<ffffffff8154bb65>] __mutex_lock_slowpath+0xe5/0x230
Sep 02 09:04:47 kernel:  [<ffffffff8154bcc7>] mutex_lock+0x17/0x30
Sep 02 09:04:47 kernel:  [<ffffffffa0564c51>] zvol_rename_minors+0x61/0x170 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04dda16>] dsl_dataset_rename_snapshot_sync_impl+0x196/0x2e0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04ddc2f>] dsl_dataset_rename_snapshot_sync+0xcf/0x1c0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04eee4a>] dsl_sync_task_sync+0xda/0xe0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04e6ea3>] dsl_pool_sync+0x323/0x480 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04ff53d>] spa_sync+0x42d/0xb10 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffff810da948>] ? ktime_get_ts+0x48/0xf0
Sep 02 09:04:47 kernel:  [<ffffffffa0510252>] txg_sync_thread+0x382/0x5f0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa050fed0>] ? txg_delay+0xf0/0xf0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa018d2ca>] thread_generic_wrapper+0x7a/0x90 [spl]
Sep 02 09:04:47 kernel:  [<ffffffffa018d250>] ? __thread_exit+0xa0/0xa0 [spl]
Sep 02 09:04:47 kernel:  [<ffffffff81091eca>] kthread+0xea/0x100
Sep 02 09:04:47 kernel:  [<ffffffff811d0000>] ? __fput+0x110/0x200
Sep 02 09:04:47 kernel:  [<ffffffff81091de0>] ? kthread_create_on_node+0x1b0/0x1b0
Sep 02 09:04:47 kernel:  [<ffffffff8154db7c>] ret_from_fork+0x7c/0xb0
Sep 02 09:04:47 kernel:  [<ffffffff81091de0>] ? kthread_create_on_node+0x1b0/0x1b0

Sep 02 09:04:47 kernel: INFO: task zfs:16622 blocked for more than 120 seconds.
Sep 02 09:04:47 kernel:       Tainted: P           O  3.16.1-1-apparmor #1
Sep 02 09:04:47 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 02 09:04:47 kernel: zfs             D 000000000000000d     0 16622  14558 0x00000000
Sep 02 09:04:47 kernel:  ffff881d825f3a10 0000000000000082 ffff881f5ec7b2f0 00000000000145c0
Sep 02 09:04:47 kernel:  ffff881d825f3fd8 00000000000145c0 ffff881f5ec7b2f0 ffff881d825f3970
Sep 02 09:04:47 kernel:  0000000000000000 ffff881d825f3970 ffff8817c8ef8000 0000000000000240
Sep 02 09:04:47 kernel: Call Trace:
Sep 02 09:04:47 kernel:  [<ffffffffa01886ab>] ? kmem_free_debug+0x4b/0x150 [spl]
Sep 02 09:04:47 kernel:  [<ffffffffa00d4b1e>] ? zprop_name_to_prop_cb+0x6e/0x90 [zcommon]
Sep 02 09:04:47 kernel:  [<ffffffffa00d4ab0>] ? zprop_string_to_index+0xc0/0xc0 [zcommon]
Sep 02 09:04:47 kernel:  [<ffffffffa00d4ab0>] ? zprop_string_to_index+0xc0/0xc0 [zcommon]
Sep 02 09:04:47 kernel:  [<ffffffffa00d48c1>] ? zprop_iter_common+0xe1/0x140 [zcommon]
Sep 02 09:04:47 kernel:  [<ffffffff81549d79>] schedule+0x29/0x70
Sep 02 09:04:47 kernel:  [<ffffffffa01951d5>] cv_wait_common+0x125/0x1b0 [spl]
Sep 02 09:04:47 kernel:  [<ffffffff810b6d10>] ? __wake_up_sync+0x20/0x20
Sep 02 09:04:47 kernel:  [<ffffffffa0195275>] __cv_wait+0x15/0x20 [spl]
Sep 02 09:04:47 kernel:  [<ffffffffa04f5c9b>] rrw_enter_read+0x3b/0x130 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04f5e30>] rrw_enter+0x20/0x30 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04e7e4a>] dsl_pool_hold+0x5a/0x80 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04c8745>] dmu_objset_hold+0x35/0xd0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0188a4d>] ? kmem_alloc_debug+0x29d/0x540 [spl]
Sep 02 09:04:47 kernel:  [<ffffffffa04e8654>] dsl_prop_get+0x44/0xa0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04e86ce>] dsl_prop_get_integer+0x1e/0x20 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0562ae7>] __zvol_create_minor+0xc7/0x660 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0564933>] zvol_create_minor+0x23/0x40 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa056495e>] zvol_create_minors_cb+0xe/0x20 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04ca35d>] dmu_objset_find_impl+0x38d/0x420 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0564950>] ? zvol_create_minor+0x40/0x40 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04ca1b6>] dmu_objset_find_impl+0x1e6/0x420 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0564950>] ? zvol_create_minor+0x40/0x40 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa05029ed>] ? spa_open_common+0x2ad/0x430 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0564950>] ? zvol_create_minor+0x40/0x40 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa04ca448>] dmu_objset_find+0x58/0x90 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0564a9f>] zvol_create_minors+0x2f/0x40 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0535dfa>] zfs_ioc_snapshot+0x14a/0x2d0 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffffa0536c2a>] zfsdev_ioctl+0x20a/0x500 [zfs]
Sep 02 09:04:47 kernel:  [<ffffffff811e1480>] do_vfs_ioctl+0x2d0/0x4b0
Sep 02 09:04:47 kernel:  [<ffffffff811e16e1>] SyS_ioctl+0x81/0xa0
Sep 02 09:04:47 kernel:  [<ffffffff8154dc29>] system_call_fastpath+0x16/0x1b

I've been able to reproduce the above stack trace from time to time on ZFS 0.6.3 while running `zfs snap` and `zfs rename` simultaneously.

My investigation suggests that ZFS gets stuck in a deadlock because of the following sequence:

  • zfs rename defers the rename to a sync task, to be performed at the next txg sync;
  • zfs snap performs its action immediately and acquires zvol_state_lock in the zvol_create_minor function;
  • the sync event fires and txg_sync starts the rename operation;
  • txg_sync enters dsl_sync_task_sync and acquires dp_config_rwlock for writing;
  • zfs snap enters dsl_pool_hold and tries to acquire dp_config_rwlock for reading (and hangs there);
  • txg_sync enters zvol_rename_minors and tries to acquire zvol_state_lock (and hangs there);
  • deadlock (see the sketch below).
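
Here is a minimal, self-contained user-space sketch of that lock-order inversion (the lock and function names only mirror the ZFS ones; this is not ZFS code, just two threads taking the same pair of locks in opposite order). Build it with `cc abba.c -lpthread` and it hangs forever:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t dp_config_rwlock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  zvol_state_lock  = PTHREAD_MUTEX_INITIALIZER;

/* Mimics the rename path: dp_config_rwlock (writer) first, then zvol_state_lock. */
static void *txg_sync(void *arg)
{
    pthread_rwlock_wrlock(&dp_config_rwlock);   /* dsl_sync_task_sync() */
    sleep(1);                                   /* let the other thread grab its first lock */
    printf("txg_sync: waiting for zvol_state_lock\n");
    pthread_mutex_lock(&zvol_state_lock);       /* zvol_rename_minors() - hangs here */
    pthread_mutex_unlock(&zvol_state_lock);
    pthread_rwlock_unlock(&dp_config_rwlock);
    return NULL;
}

/* Mimics the snapshot path: zvol_state_lock first, then dp_config_rwlock (reader). */
static void *zfs_snap(void *arg)
{
    pthread_mutex_lock(&zvol_state_lock);       /* zvol_create_minor() */
    sleep(1);
    printf("zfs snap: waiting for dp_config_rwlock\n");
    pthread_rwlock_rdlock(&dp_config_rwlock);   /* dsl_pool_hold() - hangs here */
    pthread_rwlock_unlock(&dp_config_rwlock);
    pthread_mutex_unlock(&zvol_state_lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, txg_sync, NULL);
    pthread_create(&t2, NULL, zfs_snap, NULL);
    pthread_join(t1, NULL);                     /* never returns: ABBA deadlock */
    pthread_join(t2, NULL);
    return 0;
}
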
@seletskiy
Contributor Author

It's completely reproducible by "hacking" the zvol.c file: add an msleep in __zvol_snapdev_hidden just before dsl_prop_get_integer (e.g. a 30-second sleep is enough), then run zfs snap zroot@3 followed by zfs rename zroot@1 zroot@2.

Example patch: https://gist.github.com/seletskiy/b5ac103852367f1afcf4.

After the "ok, waking up" message, txg_sync will be deadlocked.
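
Roughly, the hack amounts to inserting a delay like the following in zvol.c, inside __zvol_snapdev_hidden() just before the dsl_prop_get_integer() call (this is only an approximation of the gist above, not the exact patch):

    printk(KERN_INFO "zvol: sleeping 30s, run zfs rename now\n");
    msleep(30000);      /* widen the race window so the rename sync task grabs dp_config_rwlock first */
    printk(KERN_INFO "zvol: ok, waking up\n");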

@ryao
Contributor

ryao commented Sep 2, 2014

@seletskiy I just filed #2654 as a duplicate of this without realizing. I am behind on reading issues in the tracker. The other party that encountered this did not get backtraces to me until today.

My analysis is that the txg_sync task called rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); in dsl_sync_task_sync() and then zvol_rename_minors(), which immediately called mutex_lock(&zvol_state_lock). Simultaneously, a userland process started a snapshot and took mutex_lock(&zvol_state_lock) before creating minors. It then called dsl_pool_config_enter(), which did rrw_enter(&dp->dp_config_rwlock, RW_READER, tag). This is identical to what you found.

This is a Linux-specific regression because Illumos does not have anything for renaming zvol snapshots (I am not sure the snapshots are even accessible). I have a rough plan of action sketched out in my head. The general idea is to make the rename functionality asynchronous. I will just have a couple of lists of zvols to rename (syncing/quiescing style), similar to the txg commit. Then I will kick the system taskq to run this code whenever work is batched. That ought to avoid this deadlock. I just need to make sure not to race with zvol destruction.
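
A rough sketch of that idea (not the actual implementation; the work-item type and the zvol_rename_minors_async() helper are hypothetical names, the dispatch flags are illustrative, and the real change would batch work per txg as described above):

#include <sys/kmem.h>
#include <sys/taskq.h>
#include <sys/zvol.h>

/* Hypothetical per-rename work item. */
typedef struct zvol_rename_task {
    char zrt_oldname[MAXNAMELEN];
    char zrt_newname[MAXNAMELEN];
} zvol_rename_task_t;

static void
zvol_rename_minors_task(void *arg)
{
    zvol_rename_task_t *zrt = arg;

    /*
     * Runs from the taskq in open context, with dp_config_rwlock no
     * longer held, so taking zvol_state_lock inside zvol_rename_minors()
     * cannot invert the lock order. The real change must also take care
     * not to race with zvol destruction.
     */
    zvol_rename_minors(zrt->zrt_oldname, zrt->zrt_newname);
    kmem_free(zrt, sizeof (zvol_rename_task_t));
}

/* Would be called from the sync task in place of zvol_rename_minors(). */
static void
zvol_rename_minors_async(const char *oldname, const char *newname)
{
    zvol_rename_task_t *zrt;

    zrt = kmem_alloc(sizeof (zvol_rename_task_t), KM_PUSHPAGE);
    (void) strlcpy(zrt->zrt_oldname, oldname, sizeof (zrt->zrt_oldname));
    (void) strlcpy(zrt->zrt_newname, newname, sizeof (zrt->zrt_newname));

    (void) taskq_dispatch(system_taskq, zvol_rename_minors_task,
        zrt, TQ_SLEEP);
}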

@seletskiy
Contributor Author

@ryao: I tried to fix the bug yesterday and found that the root cause is this: when a snapshot is created, it first runs a sync task, which takes the rwlock for writing, creates the actual snapshot structures, and then exits the sync task, releasing the rwlock. Immediately afterwards the code goes into create_minors, where zvol_state_lock is taken and the rwlock is acquired again for reading (which is where it hangs).

So the actual problem is that the same lock is acquired, released, and then acquired again within the same flow, which gives another task a chance to grab the lock in between.

I tried to unite those two actions in the sync task (by passing an additional suspended field in the ddsa struct and moving the create_minors part of the code into the actual snapshot work, so the rwlock is taken only once, before zvol_state_lock), but that ended with an assertion that the lock is being acquired from a different thread.

Still trying to figure out why.

EDIT: oh, actually I mistook ASSERT != for ASSERT ==, so it's actually complaining that the rwlock is being acquired for reading (in zvol_create_minors → dsl_pool_config_enter) while it's already held for writing (in dsl_sync_task_sync).

@seletskiy
Contributor Author

So, my point is that the issue may lie not in the rename mechanism, but in the snapshot one.

Consider:

dsl_dataset_snapshot(nvlist_t *snaps, nvlist_t *props, nvlist_t *errors)
{
    // [omitted]

    if (error == 0) {
        // This is an atomic operation, so the rrwlock is held only while it runs.
        // We acquire the write lock on `dp_config_rwlock` here, and after
        // dsl_dataset_snapshot_sync completes we release that lock.
        error = dsl_sync_task(firstname, dsl_dataset_snapshot_check,
            dsl_dataset_snapshot_sync, &ddsa,
            fnvlist_num_pairs(snaps) * 3);
    }

    // [omitted]

#ifdef _KERNEL
    if (error == 0) {
        for (pair = nvlist_next_nvpair(snaps, NULL); pair != NULL;
            // [omitted]

            // Here we acquire `zvol_state_lock` and then acquire
            // `dp_config_rwlock` again, this time for reading.
            zvol_create_minors(snapname);
        }
    }
#endif

So, here's the activity on those locks:

zfs snapshot
|
|
+ dsl_dataset_snapshot
    |
    |
    + acquire dp_config_rwlock [W]
    |
    + do snapshot_sync stuff
    |
    + release dp_config_rwlock [W]
    |
    + zvol_create_minors
        |
        |
        + acquire zvol_state_lock
        |
        + <<< HERE some unlucky thread can cause a deadlock if it acquires
        |     dp_config_rwlock first
        |
        + acquire dp_config_rwlock [R]

If dp_config_rwlock were not released until the minors are created, everything would be fine.

@seletskiy
Contributor Author

Oh, OK, got it. Looks like the snapshot order is the correct locking order, while rename uses the opposite one.
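
One possible shape of a fix consistent with the commit message quoted below (the actual patch may differ in detail): keep zvol_rename_minors() out of the sync task entirely and call it from open context, e.g. from the rename ioctl handler, after the sync task has finished and dp_config_rwlock has been dropped. The zc_name/zc_value usage here is illustrative, not the actual change:

static int
zfs_ioc_rename(zfs_cmd_t *zc)
{
    // [omitted]

    if (strchr(zc->zc_name, '@') != NULL) {
        // The sync task takes and drops dp_config_rwlock internally
        // and no longer calls zvol_rename_minors().
        error = dsl_dataset_rename_snapshot(/* [omitted] */);
#ifdef _KERNEL
        // Open context: no DSL locks are held here, so taking
        // zvol_state_lock inside zvol_rename_minors() is safe.
        if (error == 0)
            zvol_rename_minors(zc->zc_name, zc->zc_value);
#endif
    }

    // [omitted]
}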

@behlendorf added this to the 0.6.4 milestone Sep 4, 2014
ryao pushed a commit to ryao/zfs that referenced this issue Sep 5, 2014
This commit should prevent a deadlock on dp_config_rwlock when
running `zfs rename` by ensuring zvol_rename_minors() is not
called under this lock.

Signed-off-by: Stanislav Seletskiy <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#2652.
Closes openzfs#2525.
@ryao
Contributor

ryao commented Sep 8, 2014

I have positive confirmation from the organization affected by this issue that @seletskiy's patch has resolved it.

@seletskiy Thanks again for working on this. I was pleasantly surprised to see you solve it so quickly and elegantly.

DeHackEd pushed a commit to DeHackEd/zfs that referenced this issue Sep 18, 2014
ryao pushed a commit to ryao/zfs that referenced this issue Nov 29, 2014