Use UMH_WAIT_EXEC with call_usermodehelper() to reliably set elevator #9422
Conversation
On Ubuntu 18.04, the disk elevator is no longer set to `noop` after the change to use the UMH_NO_WAIT flag. This change uses the UMH_WAIT_EXEC flag (wait for exec), and the disk elevator is now reliably set on pool creation and reboot. Signed-off-by: Tony Nguyen <[email protected]>
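For context, here is a minimal, hypothetical sketch of the kind of call involved (not the actual `vdev_elevator_switch()` code; the function name, device name, command string, and environment below are assumptions for illustration). `call_usermodehelper()` with UMH_WAIT_EXEC blocks only until the helper has been exec'd, whereas UMH_NO_WAIT returns as soon as the work is queued:

```c
#include <linux/kmod.h>	/* call_usermodehelper(), UMH_* flags */

/*
 * Illustrative only: set the scheduler for a given block device by
 * exec'ing /bin/sh through the kernel usermodehelper.  UMH_WAIT_EXEC
 * waits for the exec() to happen; UMH_NO_WAIT queues the work and
 * returns immediately.
 */
static int
set_elevator_example(const char *device, const char *elevator)
{
	char cmd[128];
	char *argv[] = { "/bin/sh", "-c", cmd, NULL };
	char *envp[] = { "HOME=/", "PATH=/sbin:/bin:/usr/sbin:/usr/bin", NULL };

	snprintf(cmd, sizeof (cmd),
	    "echo %s > /sys/block/%s/queue/scheduler", elevator, device);

	return (call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC));
}
```

The trade-off discussed below is that waiting for the exec can itself block on I/O if reading /bin/sh has to touch the pool being opened.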
Using UMH_WAIT_EXEC was something we originally considered. However, in the end we opted for UMH_NO_WAIT since technically we could hit the same deadlock when executing /bin/sh, though that's probably astonishingly unlikely in practice.
We could probably even use UMH_NO_WAIT just to be safe in case reading /bin/sh from disk triggers the same issue; I haven't been able to reproduce this particular case though.
I'm curious, were you able to determine why UMH_NO_WAIT does not work as expected in your environment? Aside from the wait behavior, I thought these would behave the same.
That all said, since this does restore the intended behavior, it LGTM. However, be aware that the intention is to retire this functionality in the master branch in the not too distant future. See #9317. Due to issues like this, it's preferable to delegate setting the scheduler to some infrastructure in user space.
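As a rough sketch of that user-space approach (the rule path, device match, and scheduler name below are assumptions, not anything prescribed by #9317), a udev rule can set the scheduler without involving the kernel module at all:

```
# Hypothetical /etc/udev/rules.d/60-zfs-elevator.rules
# Match block devices that carry ZFS members and set their scheduler;
# adjust the KERNEL/ENV match for your layout and use "none" instead
# of "noop" on blk-mq kernels.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
  ENV{ID_FS_TYPE}=="zfs_member", ATTR{queue/scheduler}="noop"
```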
I managed to reproduce #8664 on my root-on-zfs test VM with this change applied (full console output is quoted in the reply below).
Same reproducer running HEAD^ (94bcf6f) does not deadlock.
I'm not sure why UMH_NO_WAIT is not working, we should probably investigate that too. Incidentally this change seems to prevent #9417.
Interesting. Thanks for reproducing the issue. I’ll use your test case.
I agree we should also investigate why UMH_NO_WAIT isn’t working.
…On Mon, Oct 7, 2019 at 14:51 loli10K wrote:
I managed to reproduce #8664 on my root-on-zfs test VM with this change applied:
***@***.***:~# cat /proc/sys/kernel/spl/gitrev
zfs-0.8.0-309-gff214b82a
***@***.***:~# sysctl -w kernel.hung_task_timeout_secs=60
kernel.hung_task_timeout_secs = 60
***@***.***:~# zfs set primarycache=none rpool
***@***.***:~# echo 3 > /proc/sys/vm/drop_caches
[ 323.265982] bash (1453): drop_caches: 3
***@***.***:~#
***@***.***:~#
***@***.***:~# time zpool scrub rpool
[ 445.680923] INFO: task txg_sync:534 blocked for more than 61 seconds.
[ 445.683620] Tainted: P OE 5.2.14 #1
[ 445.685922] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 445.689250] txg_sync D 0 534 2 0x80004000
[ 445.691255] Call Trace:
[ 445.691749] ? __schedule+0x24a/0x600
[ 445.692437] schedule+0x2f/0xa0
[ 445.693076] cv_wait_common+0x156/0x290 [spl]
[ 445.693943] ? do_wait_intr_irq+0xb0/0xb0
[ 445.694802] spa_config_enter+0x13b/0x1e0 [zfs]
[ 445.695729] spa_sync+0x1e7/0x1850 [zfs]
[ 445.696556] ? zfs_refcount_remove_many+0x1e9/0x270 [zfs]
[ 445.697604] ? _cond_resched+0x15/0x30
[ 445.698350] ? mutex_lock+0xe/0x30
[ 445.699074] ? spa_txg_history_set+0xde/0x130 [zfs]
[ 445.700067] ? spa_txg_history_init_io+0xdc/0x110 [zfs]
[ 445.701140] txg_sync_thread+0x2ce/0x550 [zfs]
[ 445.702053] ? txg_thread_exit.isra.7+0xe0/0xe0 [zfs]
[ 445.703014] ? __thread_exit+0x20/0x20 [spl]
[ 445.703831] thread_generic_wrapper+0x9f/0xe0 [spl]
[ 445.704803] kthread+0x116/0x130
[ 445.705446] ? __kthread_cancel_work+0x80/0x80
[ 445.706303] ret_from_fork+0x35/0x40
[ 445.707013] INFO: task systemd-timesyn:887 blocked for more than 61 seconds.
[ 445.708382] Tainted: P OE 5.2.14 #1
[ 445.709450] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 445.710977] systemd-timesyn D 0 887 1 0x00000120
[ 445.712074] Call Trace:
[ 445.712576] ? __schedule+0x24a/0x600
[ 445.713318] ? mutex_lock+0xe/0x30
[ 445.714012] schedule+0x2f/0xa0
[ 445.714656] cv_wait_common+0x156/0x290 [spl]
[ 445.715539] ? do_wait_intr_irq+0xb0/0xb0
[ 445.716388] dbuf_read+0x26f/0x600 [zfs]
[ 445.717235] dmu_buf_hold+0x56/0x80 [zfs]
[ 445.718113] zap_lockdir+0x54/0x120 [zfs]
[ 445.718927] ? stack_trace_save+0x4b/0x70
[ 445.719762] zap_lookup_norm+0x5d/0xd0 [zfs]
[ 445.720672] zap_lookup+0x12/0x20 [zfs]
[ 445.721508] zfs_dirent_lock+0x565/0x7d0 [zfs]
[ 445.722425] zfs_dirlook+0x8a/0x2c0 [zfs]
[ 445.723263] zfs_lookup+0x1ef/0x3b0 [zfs]
[ 445.724124] zpl_lookup+0xbd/0x280 [zfs]
[ 445.724919] __lookup_slow+0x97/0x150
[ 445.725650] lookup_slow+0x35/0x50
[ 445.726307] walk_component+0x1bf/0x2f0
[ 445.727049] path_lookupat.isra.37+0x6d/0x210
[ 445.727932] filename_lookup+0xaf/0x180
[ 445.728708] ? kmem_cache_alloc+0x1c0/0x240
[ 445.729580] vfs_statx+0x73/0xe0
[ 445.730223] ? _cond_resched+0x15/0x30
[ 445.730997] ? dentry_kill+0x4d/0x190
[ 445.731733] __do_sys_newstat+0x39/0x70
[ 445.732512] ? syscall_trace_enter+0x192/0x2b0
[ 445.733426] do_syscall_64+0x55/0x130
[ 445.734165] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 445.735178] RIP: 0033:0x7f6b42d51aa5
[ 445.735893] Code: Bad RIP value.
[ 445.736542] RSP: 002b:00007ffe8b033d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[ 445.738079] RAX: ffffffffffffffda RBX: 00007ffe8b033d20 RCX: 00007f6b42d51aa5
[ 445.739465] RDX: 00007ffe8b033d20 RSI: 00007ffe8b033d20 RDI: 00007f6b42dec54f
[ 445.740884] RBP: 00007ffe8b033e80 R08: 000000000000000f R09: 0000556380d3d840
[ 445.742274] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f6b42e27900
[ 445.743675] R13: 0000000000000002 R14: 0000000000000065 R15: 0000000000000066
[ 445.745082] INFO: task cron:1045 blocked for more than 61 seconds.
[ 445.746193] Tainted: P OE 5.2.14 #1
[ 445.746963] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 445.748105] cron D 0 1045 1 0x00000000
[ 445.748924] Call Trace:
[ 445.749291] ? __schedule+0x24a/0x600
[ 445.749827] schedule+0x2f/0xa0
[ 445.750295] cv_wait_common+0x156/0x290 [spl]
[ 445.750941] ? do_wait_intr_irq+0xb0/0xb0
[ 445.751574] spa_config_enter+0x13b/0x1e0 [zfs]
[ 445.752286] zio_vdev_io_start+0x529/0x5a0 [zfs]
[ 445.753025] ? tsd_get_by_thread+0x3b/0x80 [spl]
[ 445.753867] zio_nowait+0x148/0x2f0 [zfs]
[ 445.754584] arc_read+0xb44/0x1a50 [zfs]
[ 445.755295] ? dbuf_rele_and_unlock+0x750/0x750 [zfs]
[ 445.756192] dbuf_read_impl.constprop.16+0x2d4/0xc50 [zfs]
[ 445.757178] dbuf_read+0x107/0x600 [zfs]
[ 445.757891] dmu_buf_hold+0x56/0x80 [zfs]
[ 445.758628] zap_lockdir+0x54/0x120 [zfs]
[ 445.759324] ? stack_trace_save+0x4b/0x70
[ 445.760065] zap_lookup_norm+0x5d/0xd0 [zfs]
[ 445.760867] zap_lookup+0x12/0x20 [zfs]
[ 445.761576] zfs_dirent_lock+0x565/0x7d0 [zfs]
[ 445.762392] zfs_dirlook+0x8a/0x2c0 [zfs]
[ 445.763136] zfs_lookup+0x1ef/0x3b0 [zfs]
[ 445.763878] zpl_lookup+0xbd/0x280 [zfs]
[ 445.764535] __lookup_slow+0x97/0x150
[ 445.765094] lookup_slow+0x35/0x50
[ 445.765596] walk_component+0x1bf/0x2f0
[ 445.766161] path_lookupat.isra.37+0x6d/0x210
[ 445.766794] filename_lookup+0xaf/0x180
[ 445.767361] ? kmem_cache_alloc+0x1c0/0x240
[ 445.767980] vfs_statx+0x73/0xe0
[ 445.768457] __do_sys_newstat+0x39/0x70
[ 445.769065] ? __hrtimer_init+0xb0/0xb0
[ 445.769840] do_syscall_64+0x55/0x130
[ 445.770423] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 445.771161] RIP: 0033:0x7fb243a4faa5
[ 445.771682] Code: Bad RIP value.
[ 445.772163] RSP: 002b:00007ffc3247f608 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[ 445.773283] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb243a4faa5
[ 445.774477] RDX: 00007ffc3247f690 RSI: 00007ffc3247f690 RDI: 00007fb243ae88b1
[ 445.775505] RBP: 0000000000000000 R08: 000000000000000f R09: 0000564cd0d1ddb0
[ 445.776535] R10: 0000000000000001 R11: 0000000000000246 R12: 00007fb243ae88b1
[ 445.777597] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 445.778772] INFO: task zpool:1812 blocked for more than 61 seconds.
[ 445.779682] Tainted: P OE 5.2.14 #1
[ 445.780438] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 445.781591] zpool D 0 1812 1453 0x00004000
[ 445.782527] Call Trace:
[ 445.782902] ? __schedule+0x24a/0x600
[ 445.783486] ? vdev_open_children+0x160/0x160 [zfs]
[ 445.784202] schedule+0x2f/0xa0
[ 445.784669] taskq_wait+0x7a/0xc0 [spl]
[ 445.785265] ? do_wait_intr_irq+0xb0/0xb0
[ 445.786052] taskq_destroy+0x50/0x420 [spl]
[ 445.786707] ? vdev_open_children+0x160/0x160 [zfs]
[ 445.787466] vdev_open_children+0x10a/0x160 [zfs]
[ 445.788201] vdev_root_open+0x37/0x120 [zfs]
[ 445.788918] vdev_open+0xd9/0x900 [zfs]
[ 445.789687] vdev_reopen+0x4c/0x100 [zfs]
[ 445.790318] dsl_scan+0x36/0x120 [zfs]
[ 445.790916] zfs_ioc_pool_scan+0x5a/0xd0 [zfs]
[ 445.791612] zfsdev_ioctl_common+0x72e/0x9d0 [zfs]
[ 445.792316] ? __handle_mm_fault+0x10a2/0x1230
[ 445.793000] do_vfs_ioctl+0xa4/0x620
[ 445.793716] ksys_ioctl+0x60/0x90
[ 445.794207] __x64_sys_ioctl+0x16/0x20
[ 445.794752] do_syscall_64+0x55/0x130
[ 445.795291] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 445.796023] RIP: 0033:0x7f851ccba427
[ 445.796543] Code: Bad RIP value.
[ 445.797050] RSP: 002b:00007ffccae900f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 445.798358] RAX: ffffffffffffffda RBX: 00007ffccae90530 RCX: 00007f851ccba427
[ 445.799389] RDX: 00007ffccae90530 RSI: 0000000000005a07 RDI: 0000000000000003
[ 445.800414] RBP: 00007ffccae93b10 R08: 0000000000000008 R09: 0000000000000001
[ 445.801471] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[ 445.802636] R13: 0000000000000000 R14: 000056476e2b87a0 R15: 000056476e2b6430
[ 445.803686] INFO: task vdev_open:1813 blocked for more than 61 seconds.
[ 445.804731] Tainted: P OE 5.2.14 #1
[ 445.805580] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 445.806848] vdev_open D 0 1813 2 0x80004000
[ 445.807752] Call Trace:
[ 445.808157] ? __schedule+0x24a/0x600
[ 445.808803] ? vdev_open_children+0x160/0x160 [zfs]
[ 445.809604] schedule+0x2f/0xa0
[ 445.810122] taskq_wait+0x7a/0xc0 [spl]
[ 445.810733] ? do_wait_intr_irq+0xb0/0xb0
[ 445.811421] taskq_destroy+0x50/0x420 [spl]
[ 445.812153] ? vdev_open_children+0x160/0x160 [zfs]
[ 445.813024] vdev_open_children+0x10a/0x160 [zfs]
[ 445.813812] vdev_mirror_open+0x2c/0x140 [zfs]
[ 445.814591] vdev_open+0xd9/0x900 [zfs]
[ 445.815216] ? __schedule+0x252/0x600
[ 445.815855] vdev_open_child+0x1e/0x40 [zfs]
[ 445.816541] taskq_thread+0x336/0x660 [spl]
[ 445.817211] ? __switch_to_asm+0x40/0x70
[ 445.817821] ? wake_up_q+0x70/0x70
[ 445.818353] ? taskq_thread_spawn+0x50/0x50 [spl]
[ 445.819071] kthread+0x116/0x130
[ 445.819578] ? __kthread_cancel_work+0x80/0x80
[ 445.820265] ret_from_fork+0x35/0x40
[ 445.820865] INFO: task vdev_open:1815 blocked for more than 61 seconds.
[ 445.821957] Tainted: P OE 5.2.14 #1
[ 445.822774] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 445.824005] vdev_open D 0 1815 2 0x80004000
[ 445.824934] Call Trace:
[ 445.825341] ? __schedule+0x24a/0x600
[ 445.825934] ? __switch_to_asm+0x40/0x70
[ 445.826564] ? __switch_to_asm+0x34/0x70
[ 445.827265] schedule+0x2f/0xa0
[ 445.827852] schedule_timeout+0x20d/0x310
[ 445.828582] ? __schedule+0x252/0x600
[ 445.829284] ? try_to_wake_up+0x54/0x460
[ 445.829968] wait_for_completion+0x11f/0x190
[ 445.830637] ? wake_up_q+0x70/0x70
[ 445.831179] call_usermodehelper_exec+0x12e/0x160
[ 445.831979] vdev_elevator_switch+0x112/0x1a0 [zfs]
[ 445.832811] vdev_disk_open+0x2a0/0x4b0 [zfs]
[ 445.833598] vdev_open+0xd9/0x900 [zfs]
[ 445.834220] ? __schedule+0x252/0x600
[ 445.834854] vdev_open_child+0x1e/0x40 [zfs]
[ 445.835536] taskq_thread+0x336/0x660 [spl]
[ 445.836230] ? __switch_to_asm+0x40/0x70
[ 445.836872] ? wake_up_q+0x70/0x70
[ 445.837418] ? taskq_thread_spawn+0x50/0x50 [spl]
[ 445.838171] kthread+0x116/0x130
[ 445.838696] ? __kthread_cancel_work+0x80/0x80
[ 445.839398] ret_from_fork+0x35/0x40
[ 445.840056] INFO: task kworker/u8:5:1816 blocked for more than 61 seconds.
[ 445.841283] Tainted: P OE 5.2.14 #1
[ 445.842190] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 445.843438] kworker/u8:5 D 0 1816 2 0x80004000
[ 445.844313] Call Trace:
[ 445.844713] ? __schedule+0x24a/0x600
[ 445.845284] schedule+0x2f/0xa0
[ 445.845765] cv_wait_common+0x156/0x290 [spl]
[ 445.846447] ? do_wait_intr_irq+0xb0/0xb0
[ 445.847190] spa_config_enter+0x13b/0x1e0 [zfs]
[ 445.848067] zio_vdev_io_start+0x529/0x5a0 [zfs]
[ 445.848940] ? tsd_get_by_thread+0x3b/0x80 [spl]
[ 445.849725] zio_nowait+0x148/0x2f0 [zfs]
[ 445.850411] arc_read+0xb44/0x1a50 [zfs]
[ 445.851078] ? dbuf_rele_and_unlock+0x750/0x750 [zfs]
[ 445.851927] dbuf_read_impl.constprop.16+0x2d4/0xc50 [zfs]
[ 445.852887] dbuf_read+0x107/0x600 [zfs]
[ 445.853567] dmu_buf_hold+0x56/0x80 [zfs]
[ 445.854253] zap_lockdir+0x54/0x120 [zfs]
[ 445.854902] ? stack_trace_save+0x4b/0x70
[ 445.855574] zap_lookup_norm+0x5d/0xd0 [zfs]
[ 445.856304] zap_lookup+0x12/0x20 [zfs]
[ 445.856970] zfs_dirent_lock+0x565/0x7d0 [zfs]
[ 445.857716] zfs_dirlook+0x8a/0x2c0 [zfs]
[ 445.858401] zfs_lookup+0x1ef/0x3b0 [zfs]
[ 445.859073] zpl_lookup+0xbd/0x280 [zfs]
[ 445.859683] path_openat+0x89f/0x1420
[ 445.860277] do_filp_open+0x93/0x100
[ 445.860910] ? zpl_fallocate+0x10/0x10 [zfs]
[ 445.861582] ? stack_trace_save+0x4b/0x70
[ 445.862222] ? _raw_write_lock_irqsave+0x2e/0x30
[ 445.862975] do_open_execat+0x83/0x1a0
[ 445.863606] __do_execve_file.isra.34+0x70a/0x9d0
[ 445.864495] ? create_object+0x21d/0x260
[ 445.865241] do_execve+0x21/0x30
[ 445.865846] call_usermodehelper_exec_async+0x189/0x1b0
[ 445.866690] ? recalc_sigpending+0x17/0x50
[ 445.867387] ? proc_cap_handler+0x1b0/0x1b0
[ 445.868073] ret_from_fork+0x35/0x40
Same reproducer running HEAD^ (94bcf6f) does not deadlock:
***@***.***:~# cat /proc/sys/kernel/spl/gitrev
zfs-0.8.0-308-g94bcf6f5e
***@***.***:~# sysctl -w kernel.hung_task_timeout_secs=60
kernel.hung_task_timeout_secs = 60
***@***.***:~# zfs set primarycache=none rpool
***@***.***:~# echo 3 > /proc/sys/vm/drop_caches
[ 116.529482] bash (1773): drop_caches: 3
***@***.***:~#
***@***.***:~#
***@***.***:~# time zpool scrub rpool
real 0m1.487s
user 0m0.000s
sys 0m0.153s
***@***.***:~#
I'm not sure why UMH_NO_WAIT is not working, we should probably investigate that too.
Incidentally this change seems to prevent #9417.
Codecov Report
@@             Coverage Diff             @@
##            master    #9422      +/-   ##
===========================================
+ Coverage    78.95%   79.03%   +0.08%
===========================================
  Files          410      410
  Lines       122499   122499
===========================================
+ Hits         96723    96823     +100
+ Misses       25776    25676     -100
Continue to review full report at Codecov.
@tonynguien I don't believe we can move ahead with this PR since it will reintroduce the original deadlock. There are alternative ways to handle this, a few of which come to mind are:
All of which are a bit clunky. Since automatically setting the I/O scheduler isn't critical functionality, I'd suggest that instead we move forward with the original #9317 proposal and deprecate this functionality.
Thanks @behlendorf. I agree that we should move forward with #9317. For systems where all devices are used for ZFS, setting …
Motivation and Context
The change reliably sets the device elevator to `noop`.
Description
On Ubuntu 18.04, the disk elevator is no longer set to `noop` after the change to use the UMH_NO_WAIT flag. This change uses the UMH_WAIT_EXEC flag (wait for exec), and the disk elevator is now reliably set on pool creation and reboot.
How Has This Been Tested?
On Ubuntu 18.04, where the disk elevator is no longer set to `noop` on pool creation or reboot (100% reproducible), we no longer see the issue with this change. I also ran `zpool scrub` several times and didn't hit the panic mentioned in #8664.
Types of changes
Checklist:
All commit messages are properly formatted and contain `Signed-off-by`.