**1. What `kops` version are you running?**

**3. What cloud provider are you using?**
AWS

**4. What commands did you run? What is the simplest way to reproduce this issue?**
It happened after the AWS outage of 12 November 2019 (https://www.reddit.com/r/aws/comments/dv6xc5/on_going_aws_frankfurt_is_partially_down/).
The node didn't reboot, but its root EBS volume was disconnected:
[Tue Nov 12 07:46:12 2019] INFO: task systemd-journal:2268 blocked for more than 120 seconds.
[Tue Nov 12 07:46:12 2019] Not tainted 4.14.104-95.84.amzn2.x86_64 #1
[Tue Nov 12 07:46:12 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Nov 12 07:46:12 2019] systemd-journal D 0 2268 1 0x00000004
[Tue Nov 12 07:46:12 2019] Call Trace:
[Tue Nov 12 07:46:12 2019] ? __schedule+0x28e/0x890
[Tue Nov 12 07:46:12 2019] schedule+0x28/0x80
[Tue Nov 12 07:46:12 2019] io_schedule+0x12/0x40
[Tue Nov 12 07:46:12 2019] wait_on_page_bit+0x110/0x150
[Tue Nov 12 07:46:12 2019] ? page_cache_tree_insert+0xc0/0xc0
[Tue Nov 12 07:46:12 2019] write_cache_pages+0x172/0x4a0
[Tue Nov 12 07:46:12 2019] ? xfs_aops_discard_page+0x130/0x130 [xfs]
[Tue Nov 12 07:46:12 2019] ? __switch_to_asm+0x40/0x70
[Tue Nov 12 07:46:12 2019] ? __switch_to_asm+0x34/0x70
[Tue Nov 12 07:46:12 2019] ? __switch_to_asm+0x34/0x70
[Tue Nov 12 07:46:12 2019] xfs_vm_writepages+0x64/0xa0 [xfs]
[Tue Nov 12 07:46:12 2019] do_writepages+0x4b/0xe0
[Tue Nov 12 07:46:12 2019] ? ep_read_events_proc+0xe0/0xe0
[Tue Nov 12 07:46:12 2019] ? __filemap_fdatawrite_range+0xc1/0x100
[Tue Nov 12 07:46:12 2019] __filemap_fdatawrite_range+0xc1/0x100
[Tue Nov 12 07:46:12 2019] file_write_and_wait_range+0x31/0x90
[Tue Nov 12 07:46:12 2019] xfs_file_fsync+0x5d/0x1d0 [xfs]
[Tue Nov 12 07:46:12 2019] do_fsync+0x38/0x60
[Tue Nov 12 07:46:12 2019] SyS_fsync+0xc/0x10
[Tue Nov 12 07:46:12 2019] do_syscall_64+0x67/0x100
[Tue Nov 12 07:46:12 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue Nov 12 07:46:12 2019] RIP: 0033:0x7f31a37de4e4
[Tue Nov 12 07:46:12 2019] RSP: 002b:00007fff2f743ff8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[Tue Nov 12 07:46:12 2019] RAX: ffffffffffffffda RBX: 000055a884d472b0 RCX: 00007f31a37de4e4
[Tue Nov 12 07:46:12 2019] RDX: 00007fff2f744110 RSI: 0000034da3f0cf6e RDI: 000000000000001c
[Tue Nov 12 07:46:12 2019] RBP: 0000000000000000 R08: 0000000200000001 R09: 00007fff2f744024
[Tue Nov 12 07:46:12 2019] R10: 0000000000000020 R11: 0000000000000246 R12: 00007fff2f744108
[Tue Nov 12 07:46:12 2019] R13: 00059721639cb947 R14: 0000000000000000 R15: 0000000000000000
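For reference, the watchdog behind these messages can be inspected directly on the node; a minimal sketch, assuming a standard Amazon Linux 2 shell:

```sh
# Current hung-task timeout (the trace above reports tasks blocked for more than 120 seconds)
cat /proc/sys/kernel/hung_task_timeout_secs

# List any other tasks the watchdog has flagged since boot
dmesg | grep "blocked for more than"
```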
**5. What happened after the commands executed?**
After the AWS outage, we saw a constant increase in the number of containers on a set of Kubernetes nodes. Every newly created container was a protokube container:
...
40cc828e28c7 protokube:1.13.2 "/usr/bin/protokube …" 2 days ago Exited (2) 2 days ago brave_mcclintock
12e19029052d protokube:1.13.2 "/usr/bin/protokube …" 2 days ago Exited (2) 2 days ago happy_beaver
c7e99abede18 protokube:1.13.2 "/usr/bin/protokube …" 2 weeks ago Up 2 weeks infallible_knuth
Over 2 days, we accumulated more than 66K failed containers:
# docker container ls -a | grep protokube:1.13.2 | wc -l
66934
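The same count can be obtained with docker's own filters instead of grep; a sketch, assuming the image tag `protokube:1.13.2` shown above:

```sh
# Count only exited containers created from the protokube image
docker container ls -a -q \
  --filter "ancestor=protokube:1.13.2" \
  --filter "status=exited" | wc -l
```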
The protokube systemd service keeps restarting because the initial protokube container is still running:
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: protokube version 0.1
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.190232 23472 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.192703 23472 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.194951 23472 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.196715 23472 aws_volume.go:72] AWS API Request: ec2/DescribeInstances
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446441 23472 main.go:233] cluster-id: staging.k8s.local
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446465 23472 gossip.go:56] gossip dns connection limit is:0
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446562 23472 gossip.go:153] UpdateValues: remove=[], put=map[dns/local/NS/local:gossip]
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446639 23472 kube_boot.go:131] Not in role master; won't scan for volumes
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446659 23472 kube_boot.go:169] ensuring that kubelet systemd service is running
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: panic: listen tcp4 0.0.0.0:3999: bind: address already in use
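The panic points at a port conflict on 3999. A quick way to confirm which process holds the port (a sketch, assuming `ss` is installed on the node and the long-running container is the listener):

```sh
# Show the listener on tcp/3999 together with its PID
ss -ltnp 'sport = :3999'

# Cross-check that the listener is the original, still-running protokube container
docker container ls --filter "ancestor=protokube:1.13.2" --filter "status=running"
```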
To fix the issue, I stopped the systemd service, stopped the running container, and cleaned up all exited containers:
docker container rm $(docker container ls -a -q -f "ancestor=protokube:1.13.2")
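For completeness, the whole recovery sequence looks roughly like this (the unit name `protokube.service` is an assumption in this sketch; adjust it to the actual unit on the node):

```sh
# Stop the restart loop first, then the long-running container, then clean up
systemctl stop protokube.service
docker container stop $(docker container ls -q -f "ancestor=protokube:1.13.2")
docker container rm $(docker container ls -a -q -f "ancestor=protokube:1.13.2")

# Start the unit again once the duplicates are gone
systemctl start protokube.service
```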
As a permanent fix, I've opened PR #7927, which gives the protokube container a static name to avoid duplication.
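To illustrate the idea behind the fix (the container name `protokube` is an assumption for this sketch, not necessarily what the PR uses): with a fixed name, Docker refuses to create a second container instead of piling up anonymous exited ones.

```sh
# First start succeeds
docker run -d --name protokube protokube:1.13.2

# A second attempt fails fast with a name conflict instead of creating a new container
docker run -d --name protokube protokube:1.13.2
# docker: Error response from daemon: Conflict. The container name "/protokube" is already in use ...
```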
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale