
Extreme slowdown since upgrading to v0.9.7 (5 minutes) #309

Closed
ruudk opened this issue Feb 13, 2020 · 23 comments · Fixed by #327

@ruudk (Contributor) commented Feb 13, 2020

When I run v0.9.6 I get all debug messages with the --debug flag.

But starting from v0.9.7 they stop, and I only see:

public-ingress-controller-58d54554c7-rknd6 controller time="2020-02-13T09:07:06Z" level=info msg="starting /kube-ingress-aws-controller"

I don't see anything weird in the v0.9.6...v0.9.7 diff.

@mikkeloscar (Collaborator)

I cannot replicate this issue. I see debug messages for v0.10.1:

time="2020-02-13T09:34:00Z" level=debug msg="Ingress update not needed default/

If you only see the line you posted and nothing else, then it seems like it's stuck somewhere.

@ruudk (Contributor, Author) commented Feb 13, 2020

Super weird. Any idea how I can find out where it's stuck?

@szuecs (Member) commented Feb 24, 2020

@ruudk normally I would try strace -ff -p <pid> to see where a process is stuck. I guess it's too old and you already killed the process. :)

Does it make sense to leave the issue open, or should we close it?

@ruudk (Contributor, Author) commented Mar 27, 2020

Sorry for not getting back to you.

I still haven't found the root cause. It turns out that v0.9.7 is where the problem starts. When I check the diff, only the AWS SDK was updated.

@szuecs Do you know how to do this in Kubernetes? How do I strace this process?

@szuecs (Member) commented Mar 27, 2020

@ruudk you can always SSH to your node, find the process, and use the command mentioned above.

@ruudk (Contributor, Author) commented Mar 27, 2020

I was able to do it like this:

```
docker run -t --pid=container:k8s_controller_private-ingress-controller-64d9d89697-8m8qt_kube-system_2f6c2879-7035-11ea-abdb-02e721b79532_0 \
>   --net=container:k8s_controller_private-ingress-controller-64d9d89697-8m8qt_kube-system_2f6c2879-7035-11ea-abdb-02e721b79532_0 \
>   --cap-add sys_admin \
>   --cap-add sys_ptrace \
>   strace
strace: Process 1 attached
futex(0x1e51748, FUTEX_WAIT_PRIVATE, 0, NULL
```

Doesn't give much info, because by the time I'm able to strace it, the program is already waiting... and I miss all the history. Any tips?

@szuecs (Member) commented Mar 27, 2020

@ruudk Maybe use an initContainer that does "sleep 600", or create your own Docker container and start it with strace -ff kube-ingress-aws-controller ...

@ruudk (Contributor, Author) commented Mar 27, 2020

OK, after some debugging and investigation I found out that version 0.9.7 does work; it's just extremely slow. It takes more than 5 minutes to start up. See the logs:

time="2020-03-27T16:02:57Z" level=info msg="starting /kube-ingress-aws-controller"
time="2020-03-27T16:07:03Z" level=info msg="controller manifest:"
... now it continues

This is caused by the AWS Go SDK update, which introduced support for Instance Metadata Service v2 (IMDSv2) in aws/aws-sdk-go#2958.

IMDSv2 introduces a session-based security scheme: you first have to request a token by issuing a PUT to http://169.254.169.254/latest/api/token. The response to that PUT is sent with an IP hop limit (TTL) of 1 by default. That works fine directly on the EC2 instance, but inside Docker / Kubernetes the response first has to cross the Docker bridge network, which is one hop too many, so the packet is dropped. The token request therefore never completes and the client waits forever.

After some sort of timeout (about 5 minutes) it seems to fall back to IMDSv1 and starts working.
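For illustration, here is a minimal standalone Go sketch of the IMDSv2 handshake (not part of the controller; the endpoint and header names are the documented IMDSv2 ones). Run from inside a pod, it shows within a couple of seconds whether the PUT response makes it back across the extra hop:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Short timeout so a dropped response shows up immediately instead of after minutes.
	client := &http.Client{Timeout: 2 * time.Second}

	// Step 1: request an IMDSv2 session token. The response to this PUT is sent
	// with the instance's hop limit (default 1), so from inside a pod behind the
	// Docker bridge it may never come back.
	req, _ := http.NewRequest(http.MethodPut, "http://169.254.169.254/latest/api/token", nil)
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "21600")
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("token request failed (likely the hop limit):", err)
		return
	}
	defer resp.Body.Close()
	token, _ := io.ReadAll(resp.Body)

	// Step 2: use the token for a normal metadata lookup.
	req, _ = http.NewRequest(http.MethodGet, "http://169.254.169.254/latest/meta-data/instance-id", nil)
	req.Header.Set("X-aws-ec2-metadata-token", string(token))
	resp, err = client.Do(req)
	if err != nil {
		fmt.Println("metadata request failed:", err)
		return
	}
	defer resp.Body.Close()
	id, _ := io.ReadAll(resp.Body)
	fmt.Println("instance id:", string(id))
}
```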

In order to fix this, you need to increase the hop limit. Luckily, AWS provides a way to do this:

aws ec2 modify-instance-metadata-options --instance-id i-34215432543254235 --http-endpoint enabled --http-put-response-hop-limit 2

This increases the hop limit from the default of 1 to 2 and allows the IMDSv2 token to be created.
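The same change can also be made programmatically through the EC2 API; a sketch using aws-sdk-go (the instance ID below is a placeholder, and credentials/region are assumed to come from the environment, as with the CLI):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Credentials and region are resolved from the environment/instance profile.
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	_, err := svc.ModifyInstanceMetadataOptions(&ec2.ModifyInstanceMetadataOptionsInput{
		InstanceId:              aws.String("i-0123456789abcdef0"), // placeholder instance ID
		HttpEndpoint:            aws.String("enabled"),
		HttpPutResponseHopLimit: aws.Int64(2),
	})
	if err != nil {
		log.Fatalf("modify-instance-metadata-options failed: %v", err)
	}
	log.Println("hop limit raised to 2")
}
```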

@mikkeloscar, you couldn't reproduce this; you must be running kube-ingress-aws-controller inside a Docker container on EC2, right? Did you set --http-put-response-hop-limit 2?

@ruudk ruudk changed the title Debug log stopped working between v0.9.6...v0.10.1 Extreme slowdown since upgrading to v0.9.7 (5 minutes) Mar 27, 2020
@ruudk (Contributor, Author) commented Mar 27, 2020

After running aws ec2 modify-instance-metadata-options --instance-id i-34215432543254235 --http-endpoint enabled --http-put-response-hop-limit 2, the start-up time of kube-ingress-aws-controller was super fast again...

@szuecs (Member) commented Mar 29, 2020

Maybe kube2iam or kube-aws-iam-controller already have the credentials cached in our case.
The logs show it's bad, and the referenced issue looks like we can merge the PRs.
Thanks @ruudk for the investigation and PR.

@ruudk (Contributor, Author) commented Mar 30, 2020

I found out what causes this issue.

The AwsAdapter overrides the default HttpClient with an instrumented_http client:

cfg = cfg.WithHTTPClient(instrumented_http.NewClient(cfg.HTTPClient, nil))

But the ec2metadata service in the AWS SDK only sets its timeout to 1 second when no custom HTTP client is defined: https://github.com/aws/aws-sdk-go/blob/323bf04864819db39fd5de23b4d083312666d9fa/aws/ec2metadata/service.go#L76-L87

What to do here?
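For illustration, one possible mitigation (a minimal sketch assuming aws-sdk-go v1, not necessarily what the project should ship) is to build the ec2metadata client with its own plain, short-timeout HTTP client, separate from the instrumented one, so a blocked IMDSv2 token request fails fast instead of hanging:

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/ec2metadata"
	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	sess := session.Must(session.NewSession())

	// Give the metadata client its own un-instrumented HTTP client with a short
	// timeout, mirroring the 1-second default the SDK would apply on its own,
	// so the client can give up quickly and fall back instead of waiting minutes.
	metaCfg := aws.NewConfig().WithHTTPClient(&http.Client{Timeout: 1 * time.Second})
	meta := ec2metadata.New(sess, metaCfg)

	region, err := meta.Region()
	if err != nil {
		fmt.Println("metadata lookup failed quickly:", err)
		return
	}
	fmt.Println("region:", region)
}
```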

@mikkeloscar (Collaborator)

What to do here?

It seems like an aws-sdk issue if you can't provide a custom HTTP client the way we do. Either they should allow setting a client specifically for the EC2 metadata service, or the calls could take a context with a timeout so they don't need to modify the whole client.

From our side I would also be OK with a flag to disable the instrumented HTTP client, if that would be helpful for you?

@ruudk (Contributor, Author) commented Apr 1, 2020

How is the instrumented HTTP client used? I don't think I need it.

@mikkeloscar (Collaborator)

@ruudk it exposes Prometheus metrics for HTTP calls. We depend on this to some extent, so I would like to not remove it, but I'm OK with disabling it behind a flag.
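A minimal sketch of what such a flag could look like (the flag name, package, and function here are hypothetical illustrations, not the actual change in #327); the only wrapping step it gates is the instrumented_http line quoted above:

```go
package example

import (
	"flag"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/linki/instrumented_http"
)

// Hypothetical flag name; the real flag (if any) is whatever #327 introduces.
var disableInstrumentedHTTPClient = flag.Bool(
	"disable-instrumented-http-client",
	false,
	"do not wrap the AWS SDK HTTP client with instrumented_http (Prometheus metrics)",
)

// withOptionalInstrumentation mirrors the line quoted earlier: the HTTP client is
// wrapped only when instrumentation is enabled. When the flag is set and the
// client is left untouched, the SDK applies its own 1-second timeout to the
// ec2metadata client (see the service.go link above).
func withOptionalInstrumentation(cfg *aws.Config) *aws.Config {
	if *disableInstrumentedHTTPClient {
		return cfg
	}
	return cfg.WithHTTPClient(instrumented_http.NewClient(cfg.HTTPClient, nil))
}
```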

@ruudk (Contributor, Author) commented Apr 1, 2020

@mikkeloscar Like this? #327
