Description
We are running a 3-node Kubernetes (1.17.13) cluster with Ceph RBD as the storage backend. The cluster runs multiple MariaDB pods. Sometimes a pod crashes and we see the following kernel hung-task warnings in dmesg:
[So Nov 15 15:59:41 2020] INFO: task mysqld:44333 blocked for more than 120 seconds.
[So Nov 15 15:59:41 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[So Nov 15 15:59:41 2020] mysqld D ffff98eb48645280 0 44333 39200 0x00000080
[So Nov 15 15:59:41 2020] Call Trace:
[So Nov 15 15:59:41 2020] [<ffffffffb975ec27>] ? blk_mq_run_hw_queue+0x57/0x110
[So Nov 15 15:59:41 2020] [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 15:59:41 2020] [<ffffffffb9b871e9>] schedule+0x29/0x70
[So Nov 15 15:59:41 2020] [<ffffffffb9b84cd1>] schedule_timeout+0x221/0x2d0
[So Nov 15 15:59:41 2020] [<ffffffffb9761b9c>] ? blk_mq_flush_plug_list+0x19c/0x200
[So Nov 15 15:59:41 2020] [<ffffffffb9421a0f>] ? xen_clocksource_get_cycles+0x1f/0x30
[So Nov 15 15:59:41 2020] [<ffffffffb9506362>] ? ktime_get_ts64+0x52/0xf0
[So Nov 15 15:59:41 2020] [<ffffffffb9b85310>] ? bit_wait+0x50/0x50
[So Nov 15 15:59:41 2020] [<ffffffffb9b868bd>] io_schedule_timeout+0xad/0x130
[So Nov 15 15:59:41 2020] [<ffffffffb9b86958>] io_schedule+0x18/0x20
[So Nov 15 15:59:41 2020] [<ffffffffb9b85321>] bit_wait_io+0x11/0x50
[So Nov 15 15:59:41 2020] [<ffffffffb9b84e47>] __wait_on_bit+0x67/0x90
[So Nov 15 15:59:41 2020] [<ffffffffb95bcf01>] wait_on_page_bit+0x81/0xa0
[So Nov 15 15:59:41 2020] [<ffffffffb94c6dd0>] ? wake_bit_function+0x40/0x40
[So Nov 15 15:59:41 2020] [<ffffffffb95bd031>] __filemap_fdatawait_range+0x111/0x190
[So Nov 15 15:59:41 2020] [<ffffffffb95cade1>] ? do_writepages+0x21/0x50
[So Nov 15 15:59:41 2020] [<ffffffffb95bd0c4>] filemap_fdatawait_range+0x14/0x30
[So Nov 15 15:59:41 2020] [<ffffffffb95bfab6>] filemap_write_and_wait_range+0x56/0x90
[So Nov 15 15:59:41 2020] [<ffffffffc08ab9fa>] ext4_sync_file+0xba/0x320 [ext4]
[So Nov 15 15:59:41 2020] [<ffffffffb9683a07>] do_fsync+0x67/0xb0
[So Nov 15 15:59:41 2020] [<ffffffffb9683cf0>] SyS_fsync+0x10/0x20
[So Nov 15 15:59:41 2020] [<ffffffffb9b93f92>] system_call_fastpath+0x25/0x2a
[... the same call trace for task mysqld:44333 repeats every 120 seconds, from 16:01:41 through 16:17:41 ...]
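The warnings above mean the mysqld task has been in uninterruptible sleep (state `D`) for over 120 seconds while waiting for fsync writeback to the RBD-backed device; a task in `D` state ignores all signals, including SIGKILL, until the blocked I/O completes. To see which tasks are currently stuck this way, and which kernel function they are blocked in, something like the following works (a generic procps sketch, not specific to this environment):

```shell
# List tasks in uninterruptible sleep (STAT contains "D") together with the
# kernel function they are blocked in (WCHAN). The output keeps the header
# line and is otherwise empty when no task is stuck waiting on I/O.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /D/'
```

In this case WCHAN would be expected to show something like wait_on_page_bit, matching the trace.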
After that crash, the MariaDB process is stuck in the defunct state and we are not able to kill it.
The Docker daemon itself is also unable to kill the defunct container:
Nov 16 14:05:23 node02 dockerd: time="2020-11-16T14:05:23.489093574+01:00" level=info msg="Container ebeaa83822517eddf5921864d773a52a5f3c7db965f06532a1edee32d0cfc759 failed to exit within 30 seconds of signal 15 - using the force"
Nov 16 14:05:33 node02 dockerd: time="2020-11-16T14:05:33.518450650+01:00" level=info msg="Container ebeaa8382251 failed to exit within 10 seconds of kill - trying direct SIGKILL"
We have to restart the entire node to remove the defunct process / container. This is problematic for us because the MariaDB storage lives on a Ceph RBD volume. While the pod is defunct / stuck, Kubernetes tries to schedule a new container on another host, but that's not possible because the old container still holds the lock on the Ceph RBD volume.
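As a stopgap while the node is stuck, the exclusive lock on the RBD image can be inspected and broken from the Ceph side so the volume can attach on another node. This is only an operational sketch: the pool, image, lock id, and client address below are placeholders, and force-removing a lock is only safe once the old client is guaranteed not to write to the image anymore (fencing it via blacklist first is the safer order):

```shell
# Placeholder pool/image/lock names -- substitute the values shown by "lock list".
rbd lock list rbd/pvc-example-volume                  # shows lock id and locker client
ceph osd blacklist add 10.0.0.2:0/123456789           # fence the stale client address first
rbd lock remove rbd/pvc-example-volume \
    "kubelet_lock_magic_node02" client.4123           # then break the stale lock
```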
Why are we not able to kill / remove the defunct process / container? I thought containerd-shim should reap any defunct child processes? Containerd itself is running correctly and we are able to spin up new containers on the same host.
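On the reaping question: a defunct (zombie) entry is just a dead process whose parent has not yet called wait(); SIGKILL has no effect on it because there is nothing left to kill. containerd-shim normally performs this reaping, but it cannot reap a child that is still alive in `D` state, and a kill of a `D`-state task is only delivered after the blocked I/O returns. The zombie mechanics themselves can be seen in a few lines of Python (a generic Linux demo, unrelated to this issue's environment):

```python
import os
import time

pid = os.fork()
if pid == 0:
    os._exit(0)              # child exits immediately

time.sleep(0.5)              # give the child time to die

# Read the child's state from /proc: "Z" means defunct (zombie).
# /proc/<pid>/stat is "pid (comm) state ...", so split after the ")".
with open(f"/proc/{pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]

print(state)                 # "Z" -- dead but unreaped; SIGKILL is a no-op here
os.waitpid(pid, 0)           # parent reaps; the zombie entry disappears
```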
Steps to reproduce the issue:
At the moment I'm not able to reproduce this. The error happens randomly in our environment and I have not been able to track it down to anything specific.
Output of docker version:
$ docker version
Client: Docker Engine - Community
Version: 19.03.13
API version: 1.40
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:03:45 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.13
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:02:21 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.7
GitCommit: 8fba4e9a7d01810a393d5d25a3621dc101981175
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683