**1. What `kops` version are you running?**

**3. What cloud provider are you using?**
AWS

**4. What commands did you run? What is the simplest way to reproduce this issue?**
It happened after the AWS outage of 12 November 2019 (https://www.reddit.com/r/aws/comments/dv6xc5/on_going_aws_frankfurt_is_partially_down/).
The node didn't reboot, but its root EBS volume was disconnected:
[Tue Nov 12 07:46:12 2019] INFO: task systemd-journal:2268 blocked for more than 120 seconds.
[Tue Nov 12 07:46:12 2019] Not tainted 4.14.104-95.84.amzn2.x86_64 #1
[Tue Nov 12 07:46:12 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Nov 12 07:46:12 2019] systemd-journal D 0 2268 1 0x00000004
[Tue Nov 12 07:46:12 2019] Call Trace:
[Tue Nov 12 07:46:12 2019] ? __schedule+0x28e/0x890
[Tue Nov 12 07:46:12 2019] schedule+0x28/0x80
[Tue Nov 12 07:46:12 2019] io_schedule+0x12/0x40
[Tue Nov 12 07:46:12 2019] wait_on_page_bit+0x110/0x150
[Tue Nov 12 07:46:12 2019] ? page_cache_tree_insert+0xc0/0xc0
[Tue Nov 12 07:46:12 2019] write_cache_pages+0x172/0x4a0
[Tue Nov 12 07:46:12 2019] ? xfs_aops_discard_page+0x130/0x130 [xfs]
[Tue Nov 12 07:46:12 2019] ? __switch_to_asm+0x40/0x70
[Tue Nov 12 07:46:12 2019] ? __switch_to_asm+0x34/0x70
[Tue Nov 12 07:46:12 2019] ? __switch_to_asm+0x34/0x70
[Tue Nov 12 07:46:12 2019] xfs_vm_writepages+0x64/0xa0 [xfs]
[Tue Nov 12 07:46:12 2019] do_writepages+0x4b/0xe0
[Tue Nov 12 07:46:12 2019] ? ep_read_events_proc+0xe0/0xe0
[Tue Nov 12 07:46:12 2019] ? __filemap_fdatawrite_range+0xc1/0x100
[Tue Nov 12 07:46:12 2019] __filemap_fdatawrite_range+0xc1/0x100
[Tue Nov 12 07:46:12 2019] file_write_and_wait_range+0x31/0x90
[Tue Nov 12 07:46:12 2019] xfs_file_fsync+0x5d/0x1d0 [xfs]
[Tue Nov 12 07:46:12 2019] do_fsync+0x38/0x60
[Tue Nov 12 07:46:12 2019] SyS_fsync+0xc/0x10
[Tue Nov 12 07:46:12 2019] do_syscall_64+0x67/0x100
[Tue Nov 12 07:46:12 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue Nov 12 07:46:12 2019] RIP: 0033:0x7f31a37de4e4
[Tue Nov 12 07:46:12 2019] RSP: 002b:00007fff2f743ff8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[Tue Nov 12 07:46:12 2019] RAX: ffffffffffffffda RBX: 000055a884d472b0 RCX: 00007f31a37de4e4
[Tue Nov 12 07:46:12 2019] RDX: 00007fff2f744110 RSI: 0000034da3f0cf6e RDI: 000000000000001c
[Tue Nov 12 07:46:12 2019] RBP: 0000000000000000 R08: 0000000200000001 R09: 00007fff2f744024
[Tue Nov 12 07:46:12 2019] R10: 0000000000000020 R11: 0000000000000246 R12: 00007fff2f744108
[Tue Nov 12 07:46:12 2019] R13: 00059721639cb947 R14: 0000000000000000 R15: 0000000000000000
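For reference, the watchdog behind these messages can be inspected directly on the node; a minimal sketch, assuming a standard Amazon Linux 2 shell:

```sh
# Current hung-task timeout (the trace above reports tasks blocked for more than 120 seconds)
cat /proc/sys/kernel/hung_task_timeout_secs

# List any other tasks the watchdog has flagged since boot
dmesg | grep "blocked for more than"
```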
**5. What happened after the commands executed?**
After the AWS outage, we saw a constant increase in the number of containers on a set of Kubernetes nodes. Every newly created container was a protokube container:
...
40cc828e28c7 protokube:1.13.2 "/usr/bin/protokube …" 2 days ago Exited (2) 2 days ago brave_mcclintock
12e19029052d protokube:1.13.2 "/usr/bin/protokube …" 2 days ago Exited (2) 2 days ago happy_beaver
c7e99abede18 protokube:1.13.2 "/usr/bin/protokube …" 2 weeks ago Up 2 weeks infallible_knuth
Over 2 days, we accumulated more than 66K failed containers:
# docker container ls -a | grep protokube:1.13.2 | wc -l
66934
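The same count can be obtained with docker's own filters instead of grep; a sketch, assuming the image tag `protokube:1.13.2` shown above:

```sh
# Count only exited containers created from the protokube image
docker container ls -a -q \
  --filter "ancestor=protokube:1.13.2" \
  --filter "status=exited" | wc -l
```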
The protokube systemd service keeps restarting because the initial protokube container is still running:
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: protokube version 0.1
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.190232 23472 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.192703 23472 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.194951 23472 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.196715 23472 aws_volume.go:72] AWS API Request: ec2/DescribeInstances
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446441 23472 main.go:233] cluster-id: staging.k8s.local
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446465 23472 gossip.go:56] gossip dns connection limit is:0
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446562 23472 gossip.go:153] UpdateValues: remove=[], put=map[dns/local/NS/local:gossip]
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446639 23472 kube_boot.go:131] Not in role master; won't scan for volumes
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446659 23472 kube_boot.go:169] ensuring that kubelet systemd service is running
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: panic: listen tcp4 0.0.0.0:3999: bind: address already in use
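The panic points at a port conflict on 3999. A quick way to confirm which process holds the port (a sketch, assuming `ss` is installed on the node and the long-running container is the listener):

```sh
# Show the listener on tcp/3999 together with its PID
ss -ltnp 'sport = :3999'

# Cross-check that the listener is the original, still-running protokube container
docker container ls --filter "ancestor=protokube:1.13.2" --filter "status=running"
```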
To fix the issue, I stopped the systemd service, stopped the running container, and cleaned up all exited containers:
docker container rm $(docker container ls -a -q -f "ancestor=protokube:1.13.2")
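For completeness, the whole recovery sequence looks roughly like this (the unit name `protokube.service` is an assumption in this sketch; adjust it to the actual unit on the node):

```sh
# Stop the restart loop first, then the long-running container, then clean up
systemctl stop protokube.service
docker container stop $(docker container ls -q -f "ancestor=protokube:1.13.2")
docker container rm $(docker container ls -a -q -f "ancestor=protokube:1.13.2")

# Start the unit again once the duplicates are gone
systemctl start protokube.service
```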
As a permanent fix, I've opened PR #7927, which gives the protokube container a static name to avoid duplication.
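To illustrate the idea behind the fix (the container name `protokube` is an assumption for this sketch, not necessarily what the PR uses): with a fixed name, Docker refuses to create a second container instead of piling up anonymous exited ones.

```sh
# First start succeeds
docker run -d --name protokube protokube:1.13.2

# A second attempt fails fast with a name conflict instead of creating a new container
docker run -d --name protokube protokube:1.13.2
# docker: Error response from daemon: Conflict. The container name "/protokube" is already in use ...
```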
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale