
Extreme slowdown since upgrading to v0.9.7 (5 minutes) #309

Closed
ruudk opened this issue Feb 13, 2020 · 23 comments · Fixed by #327

@ruudk (Contributor) commented Feb 13, 2020

When I run v0.9.6 I get all debug messages with the --debug flag.

But starting from v0.9.7 they stop, and I only see:

public-ingress-controller-58d54554c7-rknd6 controller time="2020-02-13T09:07:06Z" level=info msg="starting /kube-ingress-aws-controller"

I don't see anything weird in the v0.9.6...v0.9.7 diff.

@mikkeloscar (Collaborator)

I cannot replicate this issue. I see debug messages for v0.10.1:

time="2020-02-13T09:34:00Z" level=debug msg="Ingress update not needed default/

If you only see the line you posted and nothing else, then it seems like it's stuck somewhere.

@ruudk (Contributor, Author) commented Feb 13, 2020

Super weird. Any idea how I can find out where it's stuck?

@szuecs (Member) commented Feb 24, 2020

@ruudk normally I would try strace -ff -p <pid> to see where a process is stuck. I guess it's too old and you already killed the process. :)

Does it make sense to leave the issue open, or should we close it?

@ruudk (Contributor, Author) commented Mar 27, 2020

Sorry for not getting back to you.

I still haven't found the root cause. It turns out that v0.9.7 is where the problem starts. When I check the diff, only the AWS SDK was updated.

@szuecs Do you know how to do this in Kubernetes? How do I strace this process?

@szuecs (Member) commented Mar 27, 2020

@ruudk you can always SSH to your node, find the process, and use the command mentioned above.

@ruudk (Contributor, Author) commented Mar 27, 2020

I was able to do it like this:

```
docker run -t --pid=container:k8s_controller_private-ingress-controller-64d9d89697-8m8qt_kube-system_2f6c2879-7035-11ea-abdb-02e721b79532_0 \
>   --net=container:k8s_controller_private-ingress-controller-64d9d89697-8m8qt_kube-system_2f6c2879-7035-11ea-abdb-02e721b79532_0 \
>   --cap-add sys_admin \
>   --cap-add sys_ptrace \
>   strace
strace: Process 1 attached
futex(0x1e51748, FUTEX_WAIT_PRIVATE, 0, NULL
```

Doesn't give much info, because by the time I'm able to strace it, the program is already waiting... and I miss all the history. Any tips?

@szuecs (Member) commented Mar 27, 2020

@ruudk Maybe use an initContainer that does "sleep 600", or create your own Docker container and start it with strace -ff kube-ingress-aws-controller ...

@ruudk (Contributor, Author) commented Mar 27, 2020

OK, after some debugging and investigation I found out that version 0.9.7 does work; it's just extremely slow. It takes more than 5 minutes to start up. See the logs:

time="2020-03-27T16:02:57Z" level=info msg="starting /kube-ingress-aws-controller"
time="2020-03-27T16:07:03Z" level=info msg="controller manifest:"
... now it continues

This is caused by the AWS Go SDK update, which introduced support for Instance Metadata Service v2 (IMDSv2) in aws/aws-sdk-go#2958.

IMDSv2 introduces a session-based security scheme: you first have to request a token by issuing a PUT to http://169.254.169.254/latest/api/token. The response to that PUT is sent with an IP hop limit (TTL) of 1 by default. That works fine directly on the EC2 instance, but inside Docker / Kubernetes the response first has to cross the Docker bridge network, which is one hop too many, so the packet is dropped. The token request therefore never completes and the client waits forever.

After some sort of timeout (about 5 minutes) it seems to fall back to IMDSv1 and starts working.
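For illustration, here is a minimal standalone Go sketch of the IMDSv2 handshake (not part of the controller; the endpoint and header names are the documented IMDSv2 ones). Run from inside a pod, it shows within a couple of seconds whether the PUT response makes it back across the extra hop:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Short timeout so a dropped response shows up immediately instead of after minutes.
	client := &http.Client{Timeout: 2 * time.Second}

	// Step 1: request an IMDSv2 session token. The response to this PUT is sent
	// with the instance's hop limit (default 1), so from inside a pod behind the
	// Docker bridge it may never come back.
	req, _ := http.NewRequest(http.MethodPut, "http://169.254.169.254/latest/api/token", nil)
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "21600")
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("token request failed (likely the hop limit):", err)
		return
	}
	defer resp.Body.Close()
	token, _ := io.ReadAll(resp.Body)

	// Step 2: use the token for a normal metadata lookup.
	req, _ = http.NewRequest(http.MethodGet, "http://169.254.169.254/latest/meta-data/instance-id", nil)
	req.Header.Set("X-aws-ec2-metadata-token", string(token))
	resp, err = client.Do(req)
	if err != nil {
		fmt.Println("metadata request failed:", err)
		return
	}
	defer resp.Body.Close()
	id, _ := io.ReadAll(resp.Body)
	fmt.Println("instance id:", string(id))
}
```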

In order to fix this, you need to increase the hop limit. Luckily, AWS provides a way to do this:

aws ec2 modify-instance-metadata-options --instance-id i-34215432543254235 --http-endpoint enabled --http-put-response-hop-limit 2

This increases the hop limit from the default of 1 to 2 and allows the IMDSv2 token to be created.
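The same change can also be made programmatically through the EC2 API; a sketch using aws-sdk-go (the instance ID below is a placeholder, and credentials/region are assumed to come from the environment, as with the CLI):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Credentials and region are resolved from the environment/instance profile.
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	_, err := svc.ModifyInstanceMetadataOptions(&ec2.ModifyInstanceMetadataOptionsInput{
		InstanceId:              aws.String("i-0123456789abcdef0"), // placeholder instance ID
		HttpEndpoint:            aws.String("enabled"),
		HttpPutResponseHopLimit: aws.Int64(2),
	})
	if err != nil {
		log.Fatalf("modify-instance-metadata-options failed: %v", err)
	}
	log.Println("hop limit raised to 2")
}
```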

@mikkeloscar, you couldn't reproduce this; you must be running kube-ingress-aws-controller inside a Docker container on EC2, right? Did you set --http-put-response-hop-limit 2?

@ruudk ruudk changed the title Debug log stopped working between v0.9.6...v0.10.1 Extreme slowdown since upgrading to v0.9.7 (5 minutes) Mar 27, 2020
@ruudk (Contributor, Author) commented Mar 27, 2020

After running aws ec2 modify-instance-metadata-options --instance-id i-34215432543254235 --http-endpoint enabled --http-put-response-hop-limit 2, the start-up time of kube-ingress-aws-controller was super fast again...

@szuecs (Member) commented Mar 29, 2020

Maybe kube2iam or kube-aws-iam-controller already have the credentials cached in our case.
The logs show it's bad, and the referenced issue looks like we can merge the PRs.
Thanks @ruudk for the investigation and PR.

@ruudk (Contributor, Author) commented Mar 30, 2020

I found out what causes this issue.

The AwsAdapter overrides the default HttpClient with an instrumented_http client:

cfg = cfg.WithHTTPClient(instrumented_http.NewClient(cfg.HTTPClient, nil))

But the ec2metadata service in the AWS SDK only sets its timeout to 1 second when no custom HTTP client is defined: https://github.com/aws/aws-sdk-go/blob/323bf04864819db39fd5de23b4d083312666d9fa/aws/ec2metadata/service.go#L76-L87

What to do here?
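For illustration, one possible mitigation (a minimal sketch assuming aws-sdk-go v1, not necessarily what the project should ship) is to build the ec2metadata client with its own plain, short-timeout HTTP client, separate from the instrumented one, so a blocked IMDSv2 token request fails fast instead of hanging:

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/ec2metadata"
	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	sess := session.Must(session.NewSession())

	// Give the metadata client its own un-instrumented HTTP client with a short
	// timeout, mirroring the 1-second default the SDK would apply on its own,
	// so the client can give up quickly and fall back instead of waiting minutes.
	metaCfg := aws.NewConfig().WithHTTPClient(&http.Client{Timeout: 1 * time.Second})
	meta := ec2metadata.New(sess, metaCfg)

	region, err := meta.Region()
	if err != nil {
		fmt.Println("metadata lookup failed quickly:", err)
		return
	}
	fmt.Println("region:", region)
}
```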

@mikkeloscar (Collaborator)

What to do here?

It seems like an aws-sdk issue if you can't provide a custom HTTP client the way we do. Either they should allow setting a client specifically for the EC2 metadata service, or the calls could take a context with a timeout so they don't need to modify the whole client.

From our side I would also be OK with a flag to disable the instrumented HTTP client, if that would be helpful for you?

@ruudk (Contributor, Author) commented Apr 1, 2020

How is the instrumented HTTP client used? I don't think I need it.

@mikkeloscar (Collaborator)

@ruudk it exposes Prometheus metrics for HTTP calls. We depend on this to some extent, so I would like to not remove it, but I'm OK with disabling it behind a flag.
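A minimal sketch of what such a flag could look like (the flag name, package, and function here are hypothetical illustrations, not the actual change in #327); the only wrapping step it gates is the instrumented_http line quoted above:

```go
package example

import (
	"flag"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/linki/instrumented_http"
)

// Hypothetical flag name; the real flag (if any) is whatever #327 introduces.
var disableInstrumentedHTTPClient = flag.Bool(
	"disable-instrumented-http-client",
	false,
	"do not wrap the AWS SDK HTTP client with instrumented_http (Prometheus metrics)",
)

// withOptionalInstrumentation mirrors the line quoted earlier: the HTTP client is
// wrapped only when instrumentation is enabled. When the flag is set and the
// client is left untouched, the SDK applies its own 1-second timeout to the
// ec2metadata client (see the service.go link above).
func withOptionalInstrumentation(cfg *aws.Config) *aws.Config {
	if *disableInstrumentedHTTPClient {
		return cfg
	}
	return cfg.WithHTTPClient(instrumented_http.NewClient(cfg.HTTPClient, nil))
}
```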

@ruudk (Contributor, Author) commented Apr 1, 2020

@mikkeloscar Like this? #327
