Use UMH_WAIT_EXEC with call_usermodehelper() to reliably set elevator #9422

Closed
wants to merge 1 commit into from

Conversation

tonynguien
Contributor

Motivation and Context

The change reliably sets the device elevator to noop.

Description

On Ubuntu 18.04, the disk elevator is no longer set to noop after the
change to use the UMH_NO_WAIT flag. This change switches to the
UMH_WAIT_EXEC flag (wait for exec), and the disk elevator is now
reliably set on pool creation and reboot.
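
A minimal sketch of the idea, assuming the upcall site looks like the
/bin/sh invocation in vdev_elevator_switch() (module/zfs/vdev_disk.c);
the function name and command string below are illustrative, not the
actual diff:

#include <linux/kmod.h>	/* call_usermodehelper(), UMH_* flags */

static int
set_elevator_sketch(const char *cmd)
{
	char *argv[] = { "/bin/sh", "-c", (char *)cmd, NULL };
	char *envp[] = { NULL };

	/*
	 * Previously UMH_NO_WAIT: queue the helper and return at once.
	 * With UMH_WAIT_EXEC the caller sleeps until /bin/sh has been
	 * exec'd, which is what makes the elevator change take reliably.
	 */
	return (call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC));
}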

How Has This Been Tested?

On Ubuntu 18.04, where the disk elevator was no longer set to noop on pool creation or reboot (100% reproducible), we no longer see the issue with this change.

I also ran zpool scrub several times and didn't hit panic mentioned in #8664.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:


Signed-off-by: Tony Nguyen <[email protected]>
@behlendorf behlendorf requested a review from loli10K October 7, 2019 16:35
@behlendorf behlendorf added the Status: Code Review Needed label Oct 7, 2019
Contributor

@behlendorf behlendorf left a comment


Using UMH_WAIT_EXEC was something which we originally considered. However, in the end we opted for UMH_NO_WAIT since technically we could hit the same deadlock when executing /bin/sh. Though that's probably astonishingly unlikely in practice.

We could probably even use UMH_NO_WAIT just to be safe in case reading /bin/sh from disk triggers the same issue; I haven't been able to reproduce this particular case though.

I'm curious, were you able to determine why UMH_NO_WAIT does not work as expected in your environment? Aside from the wait behavior I thought these would behave the same.

That all said, since this does restore the intended behavior it LGTM. However, be aware that the intention is to retire this functionality in the master branch in the not too distant future. See #9317. Due to issues like this it's preferable to delegate setting the scheduler to some infrastructure in user space.
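
For reference, a sketch of the wait modes being weighed here, with the
semantics paraphrased (include/linux/umh.h is authoritative); argv/envp
stand in for the "/bin/sh -c ..." invocation discussed above:

#include <linux/kmod.h>	/* call_usermodehelper(), UMH_* flags */

static void
umh_wait_modes_sketch(char **argv, char **envp)
{
	/* Queue the helper and return immediately (UMH_NO_WAIT today). */
	(void) call_usermodehelper(argv[0], argv, envp, UMH_NO_WAIT);

	/*
	 * Sleep until /bin/sh has been exec'd, but not until it exits
	 * (this PR).  If /bin/sh must be read back from the pool whose
	 * vdevs are being opened, the caller can block on locks it
	 * already holds -- the deadlock being described.
	 */
	(void) call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);

	/* Sleep until the helper process has exited. */
	(void) call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
}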

@loli10K
Contributor

loli10K commented Oct 7, 2019

I managed to reproduce #8664 on my root-on-zfs test VM with this change applied:

root@linux:~# cat /proc/sys/kernel/spl/gitrev
zfs-0.8.0-309-gff214b82a
root@linux:~# sysctl -w kernel.hung_task_timeout_secs=60
kernel.hung_task_timeout_secs = 60
root@linux:~# zfs set primarycache=none rpool
root@linux:~# echo 3 > /proc/sys/vm/drop_caches 
[  323.265982] bash (1453): drop_caches: 3
root@linux:~# 
root@linux:~# 
root@linux:~# time zpool scrub rpool
[  445.680923] INFO: task txg_sync:534 blocked for more than 61 seconds.
[  445.683620]       Tainted: P           OE     5.2.14 #1
[  445.685922] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  445.689250] txg_sync        D    0   534      2 0x80004000
[  445.691255] Call Trace:
[  445.691749]  ? __schedule+0x24a/0x600
[  445.692437]  schedule+0x2f/0xa0
[  445.693076]  cv_wait_common+0x156/0x290 [spl]
[  445.693943]  ? do_wait_intr_irq+0xb0/0xb0
[  445.694802]  spa_config_enter+0x13b/0x1e0 [zfs]
[  445.695729]  spa_sync+0x1e7/0x1850 [zfs]
[  445.696556]  ? zfs_refcount_remove_many+0x1e9/0x270 [zfs]
[  445.697604]  ? _cond_resched+0x15/0x30
[  445.698350]  ? mutex_lock+0xe/0x30
[  445.699074]  ? spa_txg_history_set+0xde/0x130 [zfs]
[  445.700067]  ? spa_txg_history_init_io+0xdc/0x110 [zfs]
[  445.701140]  txg_sync_thread+0x2ce/0x550 [zfs]
[  445.702053]  ? txg_thread_exit.isra.7+0xe0/0xe0 [zfs]
[  445.703014]  ? __thread_exit+0x20/0x20 [spl]
[  445.703831]  thread_generic_wrapper+0x9f/0xe0 [spl]
[  445.704803]  kthread+0x116/0x130
[  445.705446]  ? __kthread_cancel_work+0x80/0x80
[  445.706303]  ret_from_fork+0x35/0x40
[  445.707013] INFO: task systemd-timesyn:887 blocked for more than 61 seconds.
[  445.708382]       Tainted: P           OE     5.2.14 #1
[  445.709450] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  445.710977] systemd-timesyn D    0   887      1 0x00000120
[  445.712074] Call Trace:
[  445.712576]  ? __schedule+0x24a/0x600
[  445.713318]  ? mutex_lock+0xe/0x30
[  445.714012]  schedule+0x2f/0xa0
[  445.714656]  cv_wait_common+0x156/0x290 [spl]
[  445.715539]  ? do_wait_intr_irq+0xb0/0xb0
[  445.716388]  dbuf_read+0x26f/0x600 [zfs]
[  445.717235]  dmu_buf_hold+0x56/0x80 [zfs]
[  445.718113]  zap_lockdir+0x54/0x120 [zfs]
[  445.718927]  ? stack_trace_save+0x4b/0x70
[  445.719762]  zap_lookup_norm+0x5d/0xd0 [zfs]
[  445.720672]  zap_lookup+0x12/0x20 [zfs]
[  445.721508]  zfs_dirent_lock+0x565/0x7d0 [zfs]
[  445.722425]  zfs_dirlook+0x8a/0x2c0 [zfs]
[  445.723263]  zfs_lookup+0x1ef/0x3b0 [zfs]
[  445.724124]  zpl_lookup+0xbd/0x280 [zfs]
[  445.724919]  __lookup_slow+0x97/0x150
[  445.725650]  lookup_slow+0x35/0x50
[  445.726307]  walk_component+0x1bf/0x2f0
[  445.727049]  path_lookupat.isra.37+0x6d/0x210
[  445.727932]  filename_lookup+0xaf/0x180
[  445.728708]  ? kmem_cache_alloc+0x1c0/0x240
[  445.729580]  vfs_statx+0x73/0xe0
[  445.730223]  ? _cond_resched+0x15/0x30
[  445.730997]  ? dentry_kill+0x4d/0x190
[  445.731733]  __do_sys_newstat+0x39/0x70
[  445.732512]  ? syscall_trace_enter+0x192/0x2b0
[  445.733426]  do_syscall_64+0x55/0x130
[  445.734165]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  445.735178] RIP: 0033:0x7f6b42d51aa5
[  445.735893] Code: Bad RIP value.
[  445.736542] RSP: 002b:00007ffe8b033d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  445.738079] RAX: ffffffffffffffda RBX: 00007ffe8b033d20 RCX: 00007f6b42d51aa5
[  445.739465] RDX: 00007ffe8b033d20 RSI: 00007ffe8b033d20 RDI: 00007f6b42dec54f
[  445.740884] RBP: 00007ffe8b033e80 R08: 000000000000000f R09: 0000556380d3d840
[  445.742274] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f6b42e27900
[  445.743675] R13: 0000000000000002 R14: 0000000000000065 R15: 0000000000000066
[  445.745082] INFO: task cron:1045 blocked for more than 61 seconds.
[  445.746193]       Tainted: P           OE     5.2.14 #1
[  445.746963] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  445.748105] cron            D    0  1045      1 0x00000000
[  445.748924] Call Trace:
[  445.749291]  ? __schedule+0x24a/0x600
[  445.749827]  schedule+0x2f/0xa0
[  445.750295]  cv_wait_common+0x156/0x290 [spl]
[  445.750941]  ? do_wait_intr_irq+0xb0/0xb0
[  445.751574]  spa_config_enter+0x13b/0x1e0 [zfs]
[  445.752286]  zio_vdev_io_start+0x529/0x5a0 [zfs]
[  445.753025]  ? tsd_get_by_thread+0x3b/0x80 [spl]
[  445.753867]  zio_nowait+0x148/0x2f0 [zfs]
[  445.754584]  arc_read+0xb44/0x1a50 [zfs]
[  445.755295]  ? dbuf_rele_and_unlock+0x750/0x750 [zfs]
[  445.756192]  dbuf_read_impl.constprop.16+0x2d4/0xc50 [zfs]
[  445.757178]  dbuf_read+0x107/0x600 [zfs]
[  445.757891]  dmu_buf_hold+0x56/0x80 [zfs]
[  445.758628]  zap_lockdir+0x54/0x120 [zfs]
[  445.759324]  ? stack_trace_save+0x4b/0x70
[  445.760065]  zap_lookup_norm+0x5d/0xd0 [zfs]
[  445.760867]  zap_lookup+0x12/0x20 [zfs]
[  445.761576]  zfs_dirent_lock+0x565/0x7d0 [zfs]
[  445.762392]  zfs_dirlook+0x8a/0x2c0 [zfs]
[  445.763136]  zfs_lookup+0x1ef/0x3b0 [zfs]
[  445.763878]  zpl_lookup+0xbd/0x280 [zfs]
[  445.764535]  __lookup_slow+0x97/0x150
[  445.765094]  lookup_slow+0x35/0x50
[  445.765596]  walk_component+0x1bf/0x2f0
[  445.766161]  path_lookupat.isra.37+0x6d/0x210
[  445.766794]  filename_lookup+0xaf/0x180
[  445.767361]  ? kmem_cache_alloc+0x1c0/0x240
[  445.767980]  vfs_statx+0x73/0xe0
[  445.768457]  __do_sys_newstat+0x39/0x70
[  445.769065]  ? __hrtimer_init+0xb0/0xb0
[  445.769840]  do_syscall_64+0x55/0x130
[  445.770423]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  445.771161] RIP: 0033:0x7fb243a4faa5
[  445.771682] Code: Bad RIP value.
[  445.772163] RSP: 002b:00007ffc3247f608 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  445.773283] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb243a4faa5
[  445.774477] RDX: 00007ffc3247f690 RSI: 00007ffc3247f690 RDI: 00007fb243ae88b1
[  445.775505] RBP: 0000000000000000 R08: 000000000000000f R09: 0000564cd0d1ddb0
[  445.776535] R10: 0000000000000001 R11: 0000000000000246 R12: 00007fb243ae88b1
[  445.777597] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[  445.778772] INFO: task zpool:1812 blocked for more than 61 seconds.
[  445.779682]       Tainted: P           OE     5.2.14 #1
[  445.780438] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  445.781591] zpool           D    0  1812   1453 0x00004000
[  445.782527] Call Trace:
[  445.782902]  ? __schedule+0x24a/0x600
[  445.783486]  ? vdev_open_children+0x160/0x160 [zfs]
[  445.784202]  schedule+0x2f/0xa0
[  445.784669]  taskq_wait+0x7a/0xc0 [spl]
[  445.785265]  ? do_wait_intr_irq+0xb0/0xb0
[  445.786052]  taskq_destroy+0x50/0x420 [spl]
[  445.786707]  ? vdev_open_children+0x160/0x160 [zfs]
[  445.787466]  vdev_open_children+0x10a/0x160 [zfs]
[  445.788201]  vdev_root_open+0x37/0x120 [zfs]
[  445.788918]  vdev_open+0xd9/0x900 [zfs]
[  445.789687]  vdev_reopen+0x4c/0x100 [zfs]
[  445.790318]  dsl_scan+0x36/0x120 [zfs]
[  445.790916]  zfs_ioc_pool_scan+0x5a/0xd0 [zfs]
[  445.791612]  zfsdev_ioctl_common+0x72e/0x9d0 [zfs]
[  445.792316]  ? __handle_mm_fault+0x10a2/0x1230
[  445.793000]  do_vfs_ioctl+0xa4/0x620
[  445.793716]  ksys_ioctl+0x60/0x90
[  445.794207]  __x64_sys_ioctl+0x16/0x20
[  445.794752]  do_syscall_64+0x55/0x130
[  445.795291]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  445.796023] RIP: 0033:0x7f851ccba427
[  445.796543] Code: Bad RIP value.
[  445.797050] RSP: 002b:00007ffccae900f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  445.798358] RAX: ffffffffffffffda RBX: 00007ffccae90530 RCX: 00007f851ccba427
[  445.799389] RDX: 00007ffccae90530 RSI: 0000000000005a07 RDI: 0000000000000003
[  445.800414] RBP: 00007ffccae93b10 R08: 0000000000000008 R09: 0000000000000001
[  445.801471] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[  445.802636] R13: 0000000000000000 R14: 000056476e2b87a0 R15: 000056476e2b6430
[  445.803686] INFO: task vdev_open:1813 blocked for more than 61 seconds.
[  445.804731]       Tainted: P           OE     5.2.14 #1
[  445.805580] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  445.806848] vdev_open       D    0  1813      2 0x80004000
[  445.807752] Call Trace:
[  445.808157]  ? __schedule+0x24a/0x600
[  445.808803]  ? vdev_open_children+0x160/0x160 [zfs]
[  445.809604]  schedule+0x2f/0xa0
[  445.810122]  taskq_wait+0x7a/0xc0 [spl]
[  445.810733]  ? do_wait_intr_irq+0xb0/0xb0
[  445.811421]  taskq_destroy+0x50/0x420 [spl]
[  445.812153]  ? vdev_open_children+0x160/0x160 [zfs]
[  445.813024]  vdev_open_children+0x10a/0x160 [zfs]
[  445.813812]  vdev_mirror_open+0x2c/0x140 [zfs]
[  445.814591]  vdev_open+0xd9/0x900 [zfs]
[  445.815216]  ? __schedule+0x252/0x600
[  445.815855]  vdev_open_child+0x1e/0x40 [zfs]
[  445.816541]  taskq_thread+0x336/0x660 [spl]
[  445.817211]  ? __switch_to_asm+0x40/0x70
[  445.817821]  ? wake_up_q+0x70/0x70
[  445.818353]  ? taskq_thread_spawn+0x50/0x50 [spl]
[  445.819071]  kthread+0x116/0x130
[  445.819578]  ? __kthread_cancel_work+0x80/0x80
[  445.820265]  ret_from_fork+0x35/0x40
[  445.820865] INFO: task vdev_open:1815 blocked for more than 61 seconds.
[  445.821957]       Tainted: P           OE     5.2.14 #1
[  445.822774] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  445.824005] vdev_open       D    0  1815      2 0x80004000
[  445.824934] Call Trace:
[  445.825341]  ? __schedule+0x24a/0x600
[  445.825934]  ? __switch_to_asm+0x40/0x70
[  445.826564]  ? __switch_to_asm+0x34/0x70
[  445.827265]  schedule+0x2f/0xa0
[  445.827852]  schedule_timeout+0x20d/0x310
[  445.828582]  ? __schedule+0x252/0x600
[  445.829284]  ? try_to_wake_up+0x54/0x460
[  445.829968]  wait_for_completion+0x11f/0x190
[  445.830637]  ? wake_up_q+0x70/0x70
[  445.831179]  call_usermodehelper_exec+0x12e/0x160
[  445.831979]  vdev_elevator_switch+0x112/0x1a0 [zfs]
[  445.832811]  vdev_disk_open+0x2a0/0x4b0 [zfs]
[  445.833598]  vdev_open+0xd9/0x900 [zfs]
[  445.834220]  ? __schedule+0x252/0x600
[  445.834854]  vdev_open_child+0x1e/0x40 [zfs]
[  445.835536]  taskq_thread+0x336/0x660 [spl]
[  445.836230]  ? __switch_to_asm+0x40/0x70
[  445.836872]  ? wake_up_q+0x70/0x70
[  445.837418]  ? taskq_thread_spawn+0x50/0x50 [spl]
[  445.838171]  kthread+0x116/0x130
[  445.838696]  ? __kthread_cancel_work+0x80/0x80
[  445.839398]  ret_from_fork+0x35/0x40
[  445.840056] INFO: task kworker/u8:5:1816 blocked for more than 61 seconds.
[  445.841283]       Tainted: P           OE     5.2.14 #1
[  445.842190] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  445.843438] kworker/u8:5    D    0  1816      2 0x80004000
[  445.844313] Call Trace:
[  445.844713]  ? __schedule+0x24a/0x600
[  445.845284]  schedule+0x2f/0xa0
[  445.845765]  cv_wait_common+0x156/0x290 [spl]
[  445.846447]  ? do_wait_intr_irq+0xb0/0xb0
[  445.847190]  spa_config_enter+0x13b/0x1e0 [zfs]
[  445.848067]  zio_vdev_io_start+0x529/0x5a0 [zfs]
[  445.848940]  ? tsd_get_by_thread+0x3b/0x80 [spl]
[  445.849725]  zio_nowait+0x148/0x2f0 [zfs]
[  445.850411]  arc_read+0xb44/0x1a50 [zfs]
[  445.851078]  ? dbuf_rele_and_unlock+0x750/0x750 [zfs]
[  445.851927]  dbuf_read_impl.constprop.16+0x2d4/0xc50 [zfs]
[  445.852887]  dbuf_read+0x107/0x600 [zfs]
[  445.853567]  dmu_buf_hold+0x56/0x80 [zfs]
[  445.854253]  zap_lockdir+0x54/0x120 [zfs]
[  445.854902]  ? stack_trace_save+0x4b/0x70
[  445.855574]  zap_lookup_norm+0x5d/0xd0 [zfs]
[  445.856304]  zap_lookup+0x12/0x20 [zfs]
[  445.856970]  zfs_dirent_lock+0x565/0x7d0 [zfs]
[  445.857716]  zfs_dirlook+0x8a/0x2c0 [zfs]
[  445.858401]  zfs_lookup+0x1ef/0x3b0 [zfs]
[  445.859073]  zpl_lookup+0xbd/0x280 [zfs]
[  445.859683]  path_openat+0x89f/0x1420
[  445.860277]  do_filp_open+0x93/0x100
[  445.860910]  ? zpl_fallocate+0x10/0x10 [zfs]
[  445.861582]  ? stack_trace_save+0x4b/0x70
[  445.862222]  ? _raw_write_lock_irqsave+0x2e/0x30
[  445.862975]  do_open_execat+0x83/0x1a0
[  445.863606]  __do_execve_file.isra.34+0x70a/0x9d0
[  445.864495]  ? create_object+0x21d/0x260
[  445.865241]  do_execve+0x21/0x30
[  445.865846]  call_usermodehelper_exec_async+0x189/0x1b0
[  445.866690]  ? recalc_sigpending+0x17/0x50
[  445.867387]  ? proc_cap_handler+0x1b0/0x1b0
[  445.868073]  ret_from_fork+0x35/0x40

Same reproducer running HEAD^ (94bcf6f) does not deadlock:

root@linux:~# cat /proc/sys/kernel/spl/gitrev
zfs-0.8.0-308-g94bcf6f5e
root@linux:~# sysctl -w kernel.hung_task_timeout_secs=60
kernel.hung_task_timeout_secs = 60
root@linux:~# zfs set primarycache=none rpool
root@linux:~# echo 3 > /proc/sys/vm/drop_caches 
[  116.529482] bash (1773): drop_caches: 3
root@linux:~# 
root@linux:~# 
root@linux:~# time zpool scrub rpool

real	0m1.487s
user	0m0.000s
sys	0m0.153s
root@linux:~# 

I'm not sure why UMH_NO_WAIT is not working; we should probably investigate that too.

Incidentally, this change seems to prevent #9417.

@tonynguien
Contributor Author

tonynguien commented Oct 8, 2019 via email

@codecov

codecov bot commented Oct 9, 2019

Codecov Report

Merging #9422 into master will increase coverage by 0.08%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9422      +/-   ##
==========================================
+ Coverage   78.95%   79.03%   +0.08%     
==========================================
  Files         410      410              
  Lines      122499   122499              
==========================================
+ Hits        96723    96823     +100     
+ Misses      25776    25676     -100
Flag      Coverage Δ
#kernel   79.75% <100%> (-0.01%) ⬇️
#user     66.5% <ø> (+0.28%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 94bcf6f...ff214b8.

@behlendorf
Contributor

@tonynguien I don't believe we can move ahead with this PR since it will reintroduce the original deadlock. There are alternative ways to handle this; a few that come to mind are:

  1. A new udev rule,
  2. Update the zpool utility to set the scheduler, or
  3. Perform the upcall from the system taskq

All of which are a bit clunky. Since automatically setting the I/O scheduler isn't critical functionality, I'd suggest that instead we move forward with the original #9317 proposal and deprecate this functionality.
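
A rough sketch of option 3 above, assuming the SPL taskq API
(taskq_dispatch() on system_taskq); the task function, argument struct,
and buffer size are hypothetical, not existing ZFS code:

#include <sys/taskq.h>	/* SPL: taskq_dispatch(), system_taskq, TQ_SLEEP */
#include <sys/kmem.h>
#include <linux/kmod.h>

/* Hypothetical argument block carrying the command to run. */
typedef struct elv_switch_arg {
	char esa_cmd[128];	/* shell command that writes the scheduler */
} elv_switch_arg_t;

static void
elv_switch_task(void *arg)
{
	elv_switch_arg_t *esa = arg;
	char *argv[] = { "/bin/sh", "-c", esa->esa_cmd, NULL };
	char *envp[] = { NULL };

	/* This thread holds no pool locks, so it is safe to wait here. */
	(void) call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
	kmem_free(esa, sizeof (*esa));
}

/*
 * The vdev open path would then dispatch and return without waiting:
 *
 *	(void) taskq_dispatch(system_taskq, elv_switch_task, esa, TQ_SLEEP);
 */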

@behlendorf behlendorf self-requested a review October 30, 2019 22:28
@behlendorf behlendorf closed this Nov 1, 2019
@tonynguien
Contributor Author

Thanks @behlendorf. I agree that we should move forward with #9317.

For systems where all devices are used for ZFS, setting the elevator=noop GRUB option is one way to ensure disks use the noop I/O scheduler.

@tonynguien tonynguien deleted the elevator.UMH_WAIT_EXEC branch November 1, 2019 19:18