Docker: failed to exit within 10 seconds of kill - trying direct SIGKILL #41676

Open
discostur opened this issue Nov 16, 2020 · 5 comments

discostur commented Nov 16, 2020

Description
We are running a 3-node Kubernetes (1.17.13) cluster with Ceph RBD as the storage backend. The cluster runs multiple MariaDB pods. Occasionally a pod crashes and we see the following hung-task trace in dmesg:

[So Nov 15 15:59:41 2020] INFO: task mysqld:44333 blocked for more than 120 seconds.
[So Nov 15 15:59:41 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[So Nov 15 15:59:41 2020] mysqld          D ffff98eb48645280     0 44333  39200 0x00000080
[So Nov 15 15:59:41 2020] Call Trace:
[So Nov 15 15:59:41 2020]  [<ffffffffb975ec27>] ? blk_mq_run_hw_queue+0x57/0x110
[So Nov 15 15:59:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 15:59:41 2020]  [<ffffffffb9b871e9>] schedule+0x29/0x70
[So Nov 15 15:59:41 2020]  [<ffffffffb9b84cd1>] schedule_timeout+0x221/0x2d0
[So Nov 15 15:59:41 2020]  [<ffffffffb9761b9c>] ? blk_mq_flush_plug_list+0x19c/0x200
[So Nov 15 15:59:41 2020]  [<ffffffffb9421a0f>] ? xen_clocksource_get_cycles+0x1f/0x30
[So Nov 15 15:59:41 2020]  [<ffffffffb9506362>] ? ktime_get_ts64+0x52/0xf0
[So Nov 15 15:59:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 15:59:41 2020]  [<ffffffffb9b868bd>] io_schedule_timeout+0xad/0x130
[So Nov 15 15:59:41 2020]  [<ffffffffb9b86958>] io_schedule+0x18/0x20
[So Nov 15 15:59:41 2020]  [<ffffffffb9b85321>] bit_wait_io+0x11/0x50
[So Nov 15 15:59:41 2020]  [<ffffffffb9b84e47>] __wait_on_bit+0x67/0x90
[So Nov 15 15:59:41 2020]  [<ffffffffb95bcf01>] wait_on_page_bit+0x81/0xa0
[So Nov 15 15:59:41 2020]  [<ffffffffb94c6dd0>] ? wake_bit_function+0x40/0x40
[So Nov 15 15:59:41 2020]  [<ffffffffb95bd031>] __filemap_fdatawait_range+0x111/0x190
[So Nov 15 15:59:41 2020]  [<ffffffffb95cade1>] ? do_writepages+0x21/0x50
[So Nov 15 15:59:41 2020]  [<ffffffffb95bd0c4>] filemap_fdatawait_range+0x14/0x30
[So Nov 15 15:59:41 2020]  [<ffffffffb95bfab6>] filemap_write_and_wait_range+0x56/0x90
[So Nov 15 15:59:41 2020]  [<ffffffffc08ab9fa>] ext4_sync_file+0xba/0x320 [ext4]
[So Nov 15 15:59:41 2020]  [<ffffffffb9683a07>] do_fsync+0x67/0xb0
[So Nov 15 15:59:41 2020]  [<ffffffffb9683cf0>] SyS_fsync+0x10/0x20
[So Nov 15 15:59:41 2020]  [<ffffffffb9b93f92>] system_call_fastpath+0x25/0x2a
[So Nov 15 16:01:41 2020] INFO: task mysqld:44333 blocked for more than 120 seconds.
[So Nov 15 16:01:41 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[So Nov 15 16:01:41 2020] mysqld          D ffff98eb48645280     0 44333  39200 0x00000080
[So Nov 15 16:01:41 2020] Call Trace:
[So Nov 15 16:01:41 2020]  [<ffffffffb975ec27>] ? blk_mq_run_hw_queue+0x57/0x110
[So Nov 15 16:01:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:01:41 2020]  [<ffffffffb9b871e9>] schedule+0x29/0x70
[So Nov 15 16:01:41 2020]  [<ffffffffb9b84cd1>] schedule_timeout+0x221/0x2d0
[So Nov 15 16:01:41 2020]  [<ffffffffb9761b9c>] ? blk_mq_flush_plug_list+0x19c/0x200
[So Nov 15 16:01:41 2020]  [<ffffffffb9421a0f>] ? xen_clocksource_get_cycles+0x1f/0x30
[So Nov 15 16:01:41 2020]  [<ffffffffb9506362>] ? ktime_get_ts64+0x52/0xf0
[So Nov 15 16:01:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:01:41 2020]  [<ffffffffb9b868bd>] io_schedule_timeout+0xad/0x130
[So Nov 15 16:01:41 2020]  [<ffffffffb9b86958>] io_schedule+0x18/0x20
[So Nov 15 16:01:41 2020]  [<ffffffffb9b85321>] bit_wait_io+0x11/0x50
[So Nov 15 16:01:41 2020]  [<ffffffffb9b84e47>] __wait_on_bit+0x67/0x90
[So Nov 15 16:01:41 2020]  [<ffffffffb95bcf01>] wait_on_page_bit+0x81/0xa0
[So Nov 15 16:01:41 2020]  [<ffffffffb94c6dd0>] ? wake_bit_function+0x40/0x40
[So Nov 15 16:01:41 2020]  [<ffffffffb95bd031>] __filemap_fdatawait_range+0x111/0x190
[So Nov 15 16:01:41 2020]  [<ffffffffb95cade1>] ? do_writepages+0x21/0x50
[So Nov 15 16:01:41 2020]  [<ffffffffb95bd0c4>] filemap_fdatawait_range+0x14/0x30
[So Nov 15 16:01:41 2020]  [<ffffffffb95bfab6>] filemap_write_and_wait_range+0x56/0x90
[So Nov 15 16:01:41 2020]  [<ffffffffc08ab9fa>] ext4_sync_file+0xba/0x320 [ext4]
[So Nov 15 16:01:41 2020]  [<ffffffffb9683a07>] do_fsync+0x67/0xb0
[So Nov 15 16:01:41 2020]  [<ffffffffb9683cf0>] SyS_fsync+0x10/0x20
[So Nov 15 16:01:41 2020]  [<ffffffffb9b93f92>] system_call_fastpath+0x25/0x2a
[So Nov 15 16:03:41 2020] INFO: task mysqld:44333 blocked for more than 120 seconds.
[So Nov 15 16:03:41 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[So Nov 15 16:03:41 2020] mysqld          D ffff98eb48645280     0 44333  39200 0x00000080
[So Nov 15 16:03:41 2020] Call Trace:
[So Nov 15 16:03:41 2020]  [<ffffffffb975ec27>] ? blk_mq_run_hw_queue+0x57/0x110
[So Nov 15 16:03:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:03:41 2020]  [<ffffffffb9b871e9>] schedule+0x29/0x70
[So Nov 15 16:03:41 2020]  [<ffffffffb9b84cd1>] schedule_timeout+0x221/0x2d0
[So Nov 15 16:03:41 2020]  [<ffffffffb9761b9c>] ? blk_mq_flush_plug_list+0x19c/0x200
[So Nov 15 16:03:41 2020]  [<ffffffffb9421a0f>] ? xen_clocksource_get_cycles+0x1f/0x30
[So Nov 15 16:03:41 2020]  [<ffffffffb9506362>] ? ktime_get_ts64+0x52/0xf0
[So Nov 15 16:03:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:03:41 2020]  [<ffffffffb9b868bd>] io_schedule_timeout+0xad/0x130
[So Nov 15 16:03:41 2020]  [<ffffffffb9b86958>] io_schedule+0x18/0x20
[So Nov 15 16:03:41 2020]  [<ffffffffb9b85321>] bit_wait_io+0x11/0x50
[So Nov 15 16:03:41 2020]  [<ffffffffb9b84e47>] __wait_on_bit+0x67/0x90
[So Nov 15 16:03:41 2020]  [<ffffffffb95bcf01>] wait_on_page_bit+0x81/0xa0
[So Nov 15 16:03:41 2020]  [<ffffffffb94c6dd0>] ? wake_bit_function+0x40/0x40
[So Nov 15 16:03:41 2020]  [<ffffffffb95bd031>] __filemap_fdatawait_range+0x111/0x190
[So Nov 15 16:03:41 2020]  [<ffffffffb95cade1>] ? do_writepages+0x21/0x50
[So Nov 15 16:03:41 2020]  [<ffffffffb95bd0c4>] filemap_fdatawait_range+0x14/0x30
[So Nov 15 16:03:41 2020]  [<ffffffffb95bfab6>] filemap_write_and_wait_range+0x56/0x90
[So Nov 15 16:03:41 2020]  [<ffffffffc08ab9fa>] ext4_sync_file+0xba/0x320 [ext4]
[So Nov 15 16:03:41 2020]  [<ffffffffb9683a07>] do_fsync+0x67/0xb0
[So Nov 15 16:03:41 2020]  [<ffffffffb9683cf0>] SyS_fsync+0x10/0x20
[So Nov 15 16:03:41 2020]  [<ffffffffb9b93f92>] system_call_fastpath+0x25/0x2a
[So Nov 15 16:05:41 2020] INFO: task mysqld:44333 blocked for more than 120 seconds.
[So Nov 15 16:05:41 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[So Nov 15 16:05:41 2020] mysqld          D ffff98eb48645280     0 44333  39200 0x00000080
[So Nov 15 16:05:41 2020] Call Trace:
[So Nov 15 16:05:41 2020]  [<ffffffffb975ec27>] ? blk_mq_run_hw_queue+0x57/0x110
[So Nov 15 16:05:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:05:41 2020]  [<ffffffffb9b871e9>] schedule+0x29/0x70
[So Nov 15 16:05:41 2020]  [<ffffffffb9b84cd1>] schedule_timeout+0x221/0x2d0
[So Nov 15 16:05:41 2020]  [<ffffffffb9761b9c>] ? blk_mq_flush_plug_list+0x19c/0x200
[So Nov 15 16:05:41 2020]  [<ffffffffb9421a0f>] ? xen_clocksource_get_cycles+0x1f/0x30
[So Nov 15 16:05:41 2020]  [<ffffffffb9506362>] ? ktime_get_ts64+0x52/0xf0
[So Nov 15 16:05:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:05:41 2020]  [<ffffffffb9b868bd>] io_schedule_timeout+0xad/0x130
[So Nov 15 16:05:41 2020]  [<ffffffffb9b86958>] io_schedule+0x18/0x20
[So Nov 15 16:05:41 2020]  [<ffffffffb9b85321>] bit_wait_io+0x11/0x50
[So Nov 15 16:05:41 2020]  [<ffffffffb9b84e47>] __wait_on_bit+0x67/0x90
[So Nov 15 16:05:41 2020]  [<ffffffffb95bcf01>] wait_on_page_bit+0x81/0xa0
[So Nov 15 16:05:41 2020]  [<ffffffffb94c6dd0>] ? wake_bit_function+0x40/0x40
[So Nov 15 16:05:41 2020]  [<ffffffffb95bd031>] __filemap_fdatawait_range+0x111/0x190
[So Nov 15 16:05:41 2020]  [<ffffffffb95cade1>] ? do_writepages+0x21/0x50
[So Nov 15 16:05:41 2020]  [<ffffffffb95bd0c4>] filemap_fdatawait_range+0x14/0x30
[So Nov 15 16:05:41 2020]  [<ffffffffb95bfab6>] filemap_write_and_wait_range+0x56/0x90
[So Nov 15 16:05:41 2020]  [<ffffffffc08ab9fa>] ext4_sync_file+0xba/0x320 [ext4]
[So Nov 15 16:05:41 2020]  [<ffffffffb9683a07>] do_fsync+0x67/0xb0
[So Nov 15 16:05:41 2020]  [<ffffffffb9683cf0>] SyS_fsync+0x10/0x20
[So Nov 15 16:05:41 2020]  [<ffffffffb9b93f92>] system_call_fastpath+0x25/0x2a
[So Nov 15 16:07:41 2020] INFO: task mysqld:44333 blocked for more than 120 seconds.
[So Nov 15 16:07:41 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[So Nov 15 16:07:41 2020] mysqld          D ffff98eb48645280     0 44333  39200 0x00000080
[So Nov 15 16:07:41 2020] Call Trace:
[So Nov 15 16:07:41 2020]  [<ffffffffb975ec27>] ? blk_mq_run_hw_queue+0x57/0x110
[So Nov 15 16:07:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:07:41 2020]  [<ffffffffb9b871e9>] schedule+0x29/0x70
[So Nov 15 16:07:41 2020]  [<ffffffffb9b84cd1>] schedule_timeout+0x221/0x2d0
[So Nov 15 16:07:41 2020]  [<ffffffffb9761b9c>] ? blk_mq_flush_plug_list+0x19c/0x200
[So Nov 15 16:07:41 2020]  [<ffffffffb9421a0f>] ? xen_clocksource_get_cycles+0x1f/0x30
[So Nov 15 16:07:41 2020]  [<ffffffffb9506362>] ? ktime_get_ts64+0x52/0xf0
[So Nov 15 16:07:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:07:41 2020]  [<ffffffffb9b868bd>] io_schedule_timeout+0xad/0x130
[So Nov 15 16:07:41 2020]  [<ffffffffb9b86958>] io_schedule+0x18/0x20
[So Nov 15 16:07:41 2020]  [<ffffffffb9b85321>] bit_wait_io+0x11/0x50
[So Nov 15 16:07:41 2020]  [<ffffffffb9b84e47>] __wait_on_bit+0x67/0x90
[So Nov 15 16:07:41 2020]  [<ffffffffb95bcf01>] wait_on_page_bit+0x81/0xa0
[So Nov 15 16:07:41 2020]  [<ffffffffb94c6dd0>] ? wake_bit_function+0x40/0x40
[So Nov 15 16:07:41 2020]  [<ffffffffb95bd031>] __filemap_fdatawait_range+0x111/0x190
[So Nov 15 16:07:41 2020]  [<ffffffffb95cade1>] ? do_writepages+0x21/0x50
[So Nov 15 16:07:41 2020]  [<ffffffffb95bd0c4>] filemap_fdatawait_range+0x14/0x30
[So Nov 15 16:07:41 2020]  [<ffffffffb95bfab6>] filemap_write_and_wait_range+0x56/0x90
[So Nov 15 16:07:41 2020]  [<ffffffffc08ab9fa>] ext4_sync_file+0xba/0x320 [ext4]
[So Nov 15 16:07:41 2020]  [<ffffffffb9683a07>] do_fsync+0x67/0xb0
[So Nov 15 16:07:41 2020]  [<ffffffffb9683cf0>] SyS_fsync+0x10/0x20
[So Nov 15 16:07:41 2020]  [<ffffffffb9b93f92>] system_call_fastpath+0x25/0x2a
[So Nov 15 16:09:41 2020] INFO: task mysqld:44333 blocked for more than 120 seconds.
[So Nov 15 16:09:41 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[So Nov 15 16:09:41 2020] mysqld          D ffff98eb48645280     0 44333  39200 0x00000080
[So Nov 15 16:09:41 2020] Call Trace:
[So Nov 15 16:09:41 2020]  [<ffffffffb975ec27>] ? blk_mq_run_hw_queue+0x57/0x110
[So Nov 15 16:09:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:09:41 2020]  [<ffffffffb9b871e9>] schedule+0x29/0x70
[So Nov 15 16:09:41 2020]  [<ffffffffb9b84cd1>] schedule_timeout+0x221/0x2d0
[So Nov 15 16:09:41 2020]  [<ffffffffb9761b9c>] ? blk_mq_flush_plug_list+0x19c/0x200
[So Nov 15 16:09:41 2020]  [<ffffffffb9421a0f>] ? xen_clocksource_get_cycles+0x1f/0x30
[So Nov 15 16:09:41 2020]  [<ffffffffb9506362>] ? ktime_get_ts64+0x52/0xf0
[So Nov 15 16:09:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:09:41 2020]  [<ffffffffb9b868bd>] io_schedule_timeout+0xad/0x130
[So Nov 15 16:09:41 2020]  [<ffffffffb9b86958>] io_schedule+0x18/0x20
[So Nov 15 16:09:41 2020]  [<ffffffffb9b85321>] bit_wait_io+0x11/0x50
[So Nov 15 16:09:41 2020]  [<ffffffffb9b84e47>] __wait_on_bit+0x67/0x90
[So Nov 15 16:09:41 2020]  [<ffffffffb95bcf01>] wait_on_page_bit+0x81/0xa0
[So Nov 15 16:09:41 2020]  [<ffffffffb94c6dd0>] ? wake_bit_function+0x40/0x40
[So Nov 15 16:09:41 2020]  [<ffffffffb95bd031>] __filemap_fdatawait_range+0x111/0x190
[So Nov 15 16:09:41 2020]  [<ffffffffb95cade1>] ? do_writepages+0x21/0x50
[So Nov 15 16:09:41 2020]  [<ffffffffb95bd0c4>] filemap_fdatawait_range+0x14/0x30
[So Nov 15 16:09:41 2020]  [<ffffffffb95bfab6>] filemap_write_and_wait_range+0x56/0x90
[So Nov 15 16:09:41 2020]  [<ffffffffc08ab9fa>] ext4_sync_file+0xba/0x320 [ext4]
[So Nov 15 16:09:41 2020]  [<ffffffffb9683a07>] do_fsync+0x67/0xb0
[So Nov 15 16:09:41 2020]  [<ffffffffb9683cf0>] SyS_fsync+0x10/0x20
[So Nov 15 16:09:41 2020]  [<ffffffffb9b93f92>] system_call_fastpath+0x25/0x2a
[So Nov 15 16:11:41 2020] INFO: task mysqld:44333 blocked for more than 120 seconds.
[So Nov 15 16:11:41 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[So Nov 15 16:11:41 2020] mysqld          D ffff98eb48645280     0 44333  39200 0x00000080
[So Nov 15 16:11:41 2020] Call Trace:
[So Nov 15 16:11:41 2020]  [<ffffffffb975ec27>] ? blk_mq_run_hw_queue+0x57/0x110
[So Nov 15 16:11:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:11:41 2020]  [<ffffffffb9b871e9>] schedule+0x29/0x70
[So Nov 15 16:11:41 2020]  [<ffffffffb9b84cd1>] schedule_timeout+0x221/0x2d0
[So Nov 15 16:11:41 2020]  [<ffffffffb9761b9c>] ? blk_mq_flush_plug_list+0x19c/0x200
[So Nov 15 16:11:41 2020]  [<ffffffffb9421a0f>] ? xen_clocksource_get_cycles+0x1f/0x30
[So Nov 15 16:11:41 2020]  [<ffffffffb9506362>] ? ktime_get_ts64+0x52/0xf0
[So Nov 15 16:11:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:11:41 2020]  [<ffffffffb9b868bd>] io_schedule_timeout+0xad/0x130
[So Nov 15 16:11:41 2020]  [<ffffffffb9b86958>] io_schedule+0x18/0x20
[So Nov 15 16:11:41 2020]  [<ffffffffb9b85321>] bit_wait_io+0x11/0x50
[So Nov 15 16:11:41 2020]  [<ffffffffb9b84e47>] __wait_on_bit+0x67/0x90
[So Nov 15 16:11:41 2020]  [<ffffffffb95bcf01>] wait_on_page_bit+0x81/0xa0
[So Nov 15 16:11:41 2020]  [<ffffffffb94c6dd0>] ? wake_bit_function+0x40/0x40
[So Nov 15 16:11:41 2020]  [<ffffffffb95bd031>] __filemap_fdatawait_range+0x111/0x190
[So Nov 15 16:11:41 2020]  [<ffffffffb95cade1>] ? do_writepages+0x21/0x50
[So Nov 15 16:11:41 2020]  [<ffffffffb95bd0c4>] filemap_fdatawait_range+0x14/0x30
[So Nov 15 16:11:41 2020]  [<ffffffffb95bfab6>] filemap_write_and_wait_range+0x56/0x90
[So Nov 15 16:11:41 2020]  [<ffffffffc08ab9fa>] ext4_sync_file+0xba/0x320 [ext4]
[So Nov 15 16:11:41 2020]  [<ffffffffb9683a07>] do_fsync+0x67/0xb0
[So Nov 15 16:11:41 2020]  [<ffffffffb9683cf0>] SyS_fsync+0x10/0x20
[So Nov 15 16:11:41 2020]  [<ffffffffb9b93f92>] system_call_fastpath+0x25/0x2a
[So Nov 15 16:13:41 2020] INFO: task mysqld:44333 blocked for more than 120 seconds.
[So Nov 15 16:13:41 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[So Nov 15 16:13:41 2020] mysqld          D ffff98eb48645280     0 44333  39200 0x00000080
[So Nov 15 16:13:41 2020] Call Trace:
[So Nov 15 16:13:41 2020]  [<ffffffffb975ec27>] ? blk_mq_run_hw_queue+0x57/0x110
[So Nov 15 16:13:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:13:41 2020]  [<ffffffffb9b871e9>] schedule+0x29/0x70
[So Nov 15 16:13:41 2020]  [<ffffffffb9b84cd1>] schedule_timeout+0x221/0x2d0
[So Nov 15 16:13:41 2020]  [<ffffffffb9761b9c>] ? blk_mq_flush_plug_list+0x19c/0x200
[So Nov 15 16:13:41 2020]  [<ffffffffb9421a0f>] ? xen_clocksource_get_cycles+0x1f/0x30
[So Nov 15 16:13:41 2020]  [<ffffffffb9506362>] ? ktime_get_ts64+0x52/0xf0
[So Nov 15 16:13:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:13:41 2020]  [<ffffffffb9b868bd>] io_schedule_timeout+0xad/0x130
[So Nov 15 16:13:41 2020]  [<ffffffffb9b86958>] io_schedule+0x18/0x20
[So Nov 15 16:13:41 2020]  [<ffffffffb9b85321>] bit_wait_io+0x11/0x50
[So Nov 15 16:13:41 2020]  [<ffffffffb9b84e47>] __wait_on_bit+0x67/0x90
[So Nov 15 16:13:41 2020]  [<ffffffffb95bcf01>] wait_on_page_bit+0x81/0xa0
[So Nov 15 16:13:41 2020]  [<ffffffffb94c6dd0>] ? wake_bit_function+0x40/0x40
[So Nov 15 16:13:41 2020]  [<ffffffffb95bd031>] __filemap_fdatawait_range+0x111/0x190
[So Nov 15 16:13:41 2020]  [<ffffffffb95cade1>] ? do_writepages+0x21/0x50
[So Nov 15 16:13:41 2020]  [<ffffffffb95bd0c4>] filemap_fdatawait_range+0x14/0x30
[So Nov 15 16:13:41 2020]  [<ffffffffb95bfab6>] filemap_write_and_wait_range+0x56/0x90
[So Nov 15 16:13:41 2020]  [<ffffffffc08ab9fa>] ext4_sync_file+0xba/0x320 [ext4]
[So Nov 15 16:13:41 2020]  [<ffffffffb9683a07>] do_fsync+0x67/0xb0
[So Nov 15 16:13:41 2020]  [<ffffffffb9683cf0>] SyS_fsync+0x10/0x20
[So Nov 15 16:13:41 2020]  [<ffffffffb9b93f92>] system_call_fastpath+0x25/0x2a
[So Nov 15 16:15:41 2020] INFO: task mysqld:44333 blocked for more than 120 seconds.
[So Nov 15 16:15:41 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[So Nov 15 16:15:41 2020] mysqld          D ffff98eb48645280     0 44333  39200 0x00000080
[So Nov 15 16:15:41 2020] Call Trace:
[So Nov 15 16:15:41 2020]  [<ffffffffb975ec27>] ? blk_mq_run_hw_queue+0x57/0x110
[So Nov 15 16:15:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:15:41 2020]  [<ffffffffb9b871e9>] schedule+0x29/0x70
[So Nov 15 16:15:41 2020]  [<ffffffffb9b84cd1>] schedule_timeout+0x221/0x2d0
[So Nov 15 16:15:41 2020]  [<ffffffffb9761b9c>] ? blk_mq_flush_plug_list+0x19c/0x200
[So Nov 15 16:15:41 2020]  [<ffffffffb9421a0f>] ? xen_clocksource_get_cycles+0x1f/0x30
[So Nov 15 16:15:41 2020]  [<ffffffffb9506362>] ? ktime_get_ts64+0x52/0xf0
[So Nov 15 16:15:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:15:41 2020]  [<ffffffffb9b868bd>] io_schedule_timeout+0xad/0x130
[So Nov 15 16:15:41 2020]  [<ffffffffb9b86958>] io_schedule+0x18/0x20
[So Nov 15 16:15:41 2020]  [<ffffffffb9b85321>] bit_wait_io+0x11/0x50
[So Nov 15 16:15:41 2020]  [<ffffffffb9b84e47>] __wait_on_bit+0x67/0x90
[So Nov 15 16:15:41 2020]  [<ffffffffb95bcf01>] wait_on_page_bit+0x81/0xa0
[So Nov 15 16:15:41 2020]  [<ffffffffb94c6dd0>] ? wake_bit_function+0x40/0x40
[So Nov 15 16:15:41 2020]  [<ffffffffb95bd031>] __filemap_fdatawait_range+0x111/0x190
[So Nov 15 16:15:41 2020]  [<ffffffffb95cade1>] ? do_writepages+0x21/0x50
[So Nov 15 16:15:41 2020]  [<ffffffffb95bd0c4>] filemap_fdatawait_range+0x14/0x30
[So Nov 15 16:15:41 2020]  [<ffffffffb95bfab6>] filemap_write_and_wait_range+0x56/0x90
[So Nov 15 16:15:41 2020]  [<ffffffffc08ab9fa>] ext4_sync_file+0xba/0x320 [ext4]
[So Nov 15 16:15:41 2020]  [<ffffffffb9683a07>] do_fsync+0x67/0xb0
[So Nov 15 16:15:41 2020]  [<ffffffffb9683cf0>] SyS_fsync+0x10/0x20
[So Nov 15 16:15:41 2020]  [<ffffffffb9b93f92>] system_call_fastpath+0x25/0x2a
[So Nov 15 16:17:41 2020] INFO: task mysqld:44333 blocked for more than 120 seconds.
[So Nov 15 16:17:41 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[So Nov 15 16:17:41 2020] mysqld          D ffff98eb48645280     0 44333  39200 0x00000080
[So Nov 15 16:17:41 2020] Call Trace:
[So Nov 15 16:17:41 2020]  [<ffffffffb975ec27>] ? blk_mq_run_hw_queue+0x57/0x110
[So Nov 15 16:17:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:17:41 2020]  [<ffffffffb9b871e9>] schedule+0x29/0x70
[So Nov 15 16:17:41 2020]  [<ffffffffb9b84cd1>] schedule_timeout+0x221/0x2d0
[So Nov 15 16:17:41 2020]  [<ffffffffb9761b9c>] ? blk_mq_flush_plug_list+0x19c/0x200
[So Nov 15 16:17:41 2020]  [<ffffffffb9421a0f>] ? xen_clocksource_get_cycles+0x1f/0x30
[So Nov 15 16:17:41 2020]  [<ffffffffb9506362>] ? ktime_get_ts64+0x52/0xf0
[So Nov 15 16:17:41 2020]  [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 16:17:41 2020]  [<ffffffffb9b868bd>] io_schedule_timeout+0xad/0x130
[So Nov 15 16:17:41 2020]  [<ffffffffb9b86958>] io_schedule+0x18/0x20
[So Nov 15 16:17:41 2020]  [<ffffffffb9b85321>] bit_wait_io+0x11/0x50
[So Nov 15 16:17:41 2020]  [<ffffffffb9b84e47>] __wait_on_bit+0x67/0x90
[So Nov 15 16:17:41 2020]  [<ffffffffb95bcf01>] wait_on_page_bit+0x81/0xa0
[So Nov 15 16:17:41 2020]  [<ffffffffb94c6dd0>] ? wake_bit_function+0x40/0x40
[So Nov 15 16:17:41 2020]  [<ffffffffb95bd031>] __filemap_fdatawait_range+0x111/0x190
[So Nov 15 16:17:41 2020]  [<ffffffffb95cade1>] ? do_writepages+0x21/0x50
[So Nov 15 16:17:41 2020]  [<ffffffffb95bd0c4>] filemap_fdatawait_range+0x14/0x30
[So Nov 15 16:17:41 2020]  [<ffffffffb95bfab6>] filemap_write_and_wait_range+0x56/0x90
[So Nov 15 16:17:41 2020]  [<ffffffffc08ab9fa>] ext4_sync_file+0xba/0x320 [ext4]
[So Nov 15 16:17:41 2020]  [<ffffffffb9683a07>] do_fsync+0x67/0xb0
[So Nov 15 16:17:41 2020]  [<ffffffffb9683cf0>] SyS_fsync+0x10/0x20
[So Nov 15 16:17:41 2020]  [<ffffffffb9b93f92>] system_call_fastpath+0x25/0x2a
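
The trace above repeats every two minutes for the same task, so it can help to condense it. The following is a minimal sketch, not part of the original report, that counts hung-task reports per task from dmesg text. The regular expression is written against the header line exactly as it appears above, and the script name in the usage line is hypothetical.

```python
import re
from collections import Counter

# Matches the hung-task header line as it appears in the dmesg output above.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def summarize_hung_tasks(dmesg_text):
    """Return a Counter mapping (comm, pid) to the number of hung-task reports."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            counts[(match.group("comm"), int(match.group("pid")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Read dmesg output from stdin and print one summary line per task.
    for (comm, pid), n in summarize_hung_tasks(sys.stdin.read()).items():
        print(f"{comm} (pid {pid}): {n} hung-task reports")
```

Usage, assuming the sketch is saved as hung_tasks.py: dmesg -T | python3 hung_tasks.py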

After that crash, the MariaDB process is stuck in the defunct state and we are unable to kill it:

root      39194  0.0  0.0 110120  4100 ?        Sl   Nov13   0:06  \_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/ebeaa83822517eddf5921864d773a52a5f3c7db965f06532a1edee32d0cfc759 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc -systemd-cgroup
polkitd   39214  0.0  0.0      0     0 ?        Zsl  Nov13   1:25  |   \_ [mysqld] <defunct>
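
The ps line shows the process as Zsl, i.e. a defunct, multi-threaded process, while dmesg reports a mysqld task blocked in state D. One possible explanation, which is an assumption and not confirmed in this thread, is that the main thread has exited while another thread is still blocked on I/O, so the process cannot be fully torn down. A minimal sketch for checking this reads the per-thread state from /proc/<pid>/task/<tid>/stat; the helper names are hypothetical.

```python
import os


def parse_stat_line(line):
    """Parse the first three fields of a /proc/<pid>/stat line: (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    return int(pid_text), comm, tail.split()[0]


def thread_states(pid, proc_root="/proc"):
    """Return [(tid, comm, state)] for every thread of the given process."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    result = []
    for tid in sorted(os.listdir(task_dir), key=int):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as handle:
                result.append(parse_stat_line(handle.read()))
        except OSError:
            continue  # the thread exited while we were reading
    return result


if __name__ == "__main__":
    import sys

    # Print one line per thread; state D is uninterruptible sleep, Z is zombie.
    for tid, comm, state in thread_states(int(sys.argv[1])):
        print(tid, comm, state)
```

Run on the affected node with the PID of the defunct process as the only argument. Any thread reported in state D is waiting inside the kernel and will not act on signals until that wait ends.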

The Docker daemon is also unable to kill the defunct container:

Nov 16 14:05:23 node02 dockerd: time="2020-11-16T14:05:23.489093574+01:00" level=info msg="Container ebeaa83822517eddf5921864d773a52a5f3c7db965f06532a1edee32d0cfc759 failed to exit within 30 seconds of signal 15 - using the force"
Nov 16 14:05:33 node02 dockerd: time="2020-11-16T14:05:33.518450650+01:00" level=info msg="Container ebeaa8382251 failed to exit within 10 seconds of kill - trying direct SIGKILL"
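
The two log lines describe an escalation: signal 15 (SIGTERM) with a 30 second grace period, then a forced kill with a further 10 second wait. The sketch below illustrates that general pattern for an ordinary child process using Python's subprocess module. It is not dockerd's implementation, and the function name and return strings are made up for illustration.

```python
import subprocess


def stop_with_escalation(proc, term_timeout=30.0, kill_timeout=10.0):
    """Stop a subprocess.Popen child: SIGTERM first, then SIGKILL.

    Returns "exited after SIGTERM", "exited after SIGKILL", or "still running".
    """
    proc.terminate()  # sends SIGTERM (signal 15) on POSIX
    try:
        proc.wait(timeout=term_timeout)
        return "exited after SIGTERM"
    except subprocess.TimeoutExpired:
        pass

    proc.kill()  # sends SIGKILL on POSIX
    try:
        proc.wait(timeout=kill_timeout)
        return "exited after SIGKILL"
    except subprocess.TimeoutExpired:
        # A task in uninterruptible sleep (state D) does not act on SIGKILL
        # until the blocking kernel operation returns.
        return "still running"
```

The last branch corresponds to the situation in this report: if the task is blocked in the kernel, even SIGKILL has no visible effect until the pending I/O completes or fails.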

We have to reboot the whole node to get rid of the defunct process / container. This is a problem for us because the MariaDB storage lives on a Ceph RBD volume. While the pod is defunct / stuck, k8s tries to schedule a new container on another host, but that is not possible because the old container still holds the lock on the Ceph RBD volume.

Why are we not able to kill / remove the defunct process / container? I thought containerd-shim should reap any defunct child processes? Containerd is running correctly and we are able to spin up new containers on the same host ...

Steps to reproduce the issue:
At the moment I'm not able to reproduce this. The error happens randomly in our environment and I have not been able to track it down to anything specific ...

Output of docker version:

$ docker version
Client: Docker Engine - Community
 Version:           19.03.13
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        4484c46d9d
 Built:             Wed Sep 16 17:03:45 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.13
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       4484c46d9d
  Built:            Wed Sep 16 17:02:21 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.3.7
  GitCommit:        8fba4e9a7d01810a393d5d25a3621dc101981175
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

$ docker info
Client:
 Debug Mode: false

Server:
 Containers: 308
  Running: 307
  Paused: 0
  Stopped: 1
 Images: 42
 Server Version: 19.03.13
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1160.2.2.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 9
 Total Memory: 12.53GiB
 ID: TRHX:I6MB:PCU4:QRL6:DPFH:EB5O:5DJK:2GMT:JIZ6:I7ZO:52LF:COX7
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
@derbauer97

We have the same issue with EKS 1.18 (Kubernetes 1.18.9) and Docker 19.03.13.

@thaJeztah
Member

@kolyshkin ptal

@MrOffline77

Are there any suggested workarounds?

@dajianderichang

repo

@PungYoung

Try docker rm -f xxxxxx, or update your docker-ce version.
