protokube always restart and spawn new containers #7928

Closed
MqllR opened this issue Nov 14, 2019 · 2 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

MqllR commented Nov 14, 2019

1. What kops version are you running?

kops version
Version 1.13.2 (git-7cfcf8261)

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

It happened after the AWS outage of 12 November 2019 (https://www.reddit.com/r/aws/comments/dv6xc5/on_going_aws_frankfurt_is_partially_down/).

The node didn't reboot, but the root EBS volume was disconnected:

[Tue Nov 12 07:46:12 2019] INFO: task systemd-journal:2268 blocked for more than 120 seconds.
[Tue Nov 12 07:46:12 2019]       Not tainted 4.14.104-95.84.amzn2.x86_64 #1
[Tue Nov 12 07:46:12 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Nov 12 07:46:12 2019] systemd-journal D    0  2268      1 0x00000004
[Tue Nov 12 07:46:12 2019] Call Trace:
[Tue Nov 12 07:46:12 2019]  ? __schedule+0x28e/0x890
[Tue Nov 12 07:46:12 2019]  schedule+0x28/0x80
[Tue Nov 12 07:46:12 2019]  io_schedule+0x12/0x40
[Tue Nov 12 07:46:12 2019]  wait_on_page_bit+0x110/0x150
[Tue Nov 12 07:46:12 2019]  ? page_cache_tree_insert+0xc0/0xc0
[Tue Nov 12 07:46:12 2019]  write_cache_pages+0x172/0x4a0
[Tue Nov 12 07:46:12 2019]  ? xfs_aops_discard_page+0x130/0x130 [xfs]
[Tue Nov 12 07:46:12 2019]  ? __switch_to_asm+0x40/0x70
[Tue Nov 12 07:46:12 2019]  ? __switch_to_asm+0x34/0x70
[Tue Nov 12 07:46:12 2019]  ? __switch_to_asm+0x34/0x70
[Tue Nov 12 07:46:12 2019]  xfs_vm_writepages+0x64/0xa0 [xfs]
[Tue Nov 12 07:46:12 2019]  do_writepages+0x4b/0xe0
[Tue Nov 12 07:46:12 2019]  ? ep_read_events_proc+0xe0/0xe0
[Tue Nov 12 07:46:12 2019]  ? __filemap_fdatawrite_range+0xc1/0x100
[Tue Nov 12 07:46:12 2019]  __filemap_fdatawrite_range+0xc1/0x100
[Tue Nov 12 07:46:12 2019]  file_write_and_wait_range+0x31/0x90
[Tue Nov 12 07:46:12 2019]  xfs_file_fsync+0x5d/0x1d0 [xfs]
[Tue Nov 12 07:46:12 2019]  do_fsync+0x38/0x60
[Tue Nov 12 07:46:12 2019]  SyS_fsync+0xc/0x10
[Tue Nov 12 07:46:12 2019]  do_syscall_64+0x67/0x100
[Tue Nov 12 07:46:12 2019]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue Nov 12 07:46:12 2019] RIP: 0033:0x7f31a37de4e4
[Tue Nov 12 07:46:12 2019] RSP: 002b:00007fff2f743ff8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[Tue Nov 12 07:46:12 2019] RAX: ffffffffffffffda RBX: 000055a884d472b0 RCX: 00007f31a37de4e4
[Tue Nov 12 07:46:12 2019] RDX: 00007fff2f744110 RSI: 0000034da3f0cf6e RDI: 000000000000001c
[Tue Nov 12 07:46:12 2019] RBP: 0000000000000000 R08: 0000000200000001 R09: 00007fff2f744024
[Tue Nov 12 07:46:12 2019] R10: 0000000000000020 R11: 0000000000000246 R12: 00007fff2f744108
[Tue Nov 12 07:46:12 2019] R13: 00059721639cb947 R14: 0000000000000000 R15: 0000000000000000

5. What happened after the commands executed?

After the AWS outage, we saw a constant increase in the number of containers on a set of Kubernetes nodes. All newly created containers were protokube containers:

...
40cc828e28c7        protokube:1.13.2                                                    "/usr/bin/protokube …"   2 days ago           Exited (2) 2 days ago                               brave_mcclintock
12e19029052d        protokube:1.13.2                                                    "/usr/bin/protokube …"   2 days ago           Exited (2) 2 days ago                               happy_beaver
c7e99abede18        protokube:1.13.2                                                    "/usr/bin/protokube …"   2 weeks ago          Up 2 weeks                                          infallible_knuth

Over 2 days, we accumulated more than 66K failed containers:

# docker container ls -a | grep protokube:1.13.2 | wc -l
66934
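
For reference, roughly the same count can be obtained with Docker's own filters instead of grep (assuming the same image tag):

# count exited containers created from the protokube:1.13.2 image
docker container ls -a -q --filter "ancestor=protokube:1.13.2" --filter "status=exited" | wc -l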

The protokube systemd service keeps restarting because the initial protokube container is still running:

Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: protokube version 0.1
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.190232   23472 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.192703   23472 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.194951   23472 aws_volume.go:72] AWS API Request: ec2metadata/GetMetadata
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.196715   23472 aws_volume.go:72] AWS API Request: ec2/DescribeInstances
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446441   23472 main.go:233] cluster-id: staging.k8s.local
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446465   23472 gossip.go:56] gossip dns connection limit is:0
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446562   23472 gossip.go:153] UpdateValues: remove=[], put=map[dns/local/NS/local:gossip]
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446639   23472 kube_boot.go:131] Not in role master; won't scan for volumes
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: I1112 08:24:50.446659   23472 kube_boot.go:169] ensuring that kubelet systemd service is running
Nov 12 08:24:50 ip-10-200-44-108.eu-central-1.compute.internal docker[22834]: panic: listen tcp4 0.0.0.0:3999: bind: address already in use
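
The panic is the newly spawned container failing to bind port 3999 because the original protokube container still owns it. On the host, this can be confirmed with something along these lines (a quick check, not part of the original report; the filter values are assumptions):

# show which process is listening on protokube's port 3999
ss -ltnp | grep 3999
# list the protokube container that is still up
docker ps --filter "ancestor=protokube:1.13.2" --filter "status=running"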

To work around the issue, I stopped the systemd service, stopped the running container, and cleaned up all exited containers:

docker container rm $(docker container ls -a -q -f "ancestor=protokube:1.13.2")
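
For completeness, the whole sequence looked roughly like this (the systemd unit name and the ID of the long-running container are assumptions based on the output above):

# stop the service so it stops respawning containers
systemctl stop protokube.service
# stop the original container that is still holding port 3999
docker stop c7e99abede18
# remove all exited protokube containers
docker container rm $(docker container ls -a -q -f "ancestor=protokube:1.13.2")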

To provide a permanent fix, I've opened PR #7927, which gives the protokube container a static name to avoid duplication.
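
The idea is that with a fixed --name, a second docker run fails fast with a name conflict instead of piling up randomly named containers. A minimal sketch of the behaviour (simplified command line, not the actual change in the PR):

# first start succeeds and owns the name
docker run -d --name protokube protokube:1.13.2
# while that container exists, any further run with the same name
# is rejected by the Docker daemon with a name-conflict error
docker run -d --name protokube protokube:1.13.2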

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 13, 2020

MqllR commented Feb 29, 2020

PR #7986 fixes the problem.

MqllR closed this as completed on Feb 29, 2020.