ready: Use envoy listener to expose endpoint from worker for ambassador #4253

Closed
wants to merge 40 commits

Conversation

jfrabaute
Contributor

The /ready endpoint used by emissary uses the admin port (8001 by default).
This causes a problem during config reloads with large configs: the admin
thread is blocking, so the /ready endpoint can be very slow to answer
(on the order of several seconds, or even more).

The problem is described in this envoy issue:
envoyproxy/envoy#16425

This change fixes the slowness of the /ready endpoint.
The /ready endpoint is now exposed in the worker pool by adding a
listener plus a health-check HTTP filter.
This way, the /ready endpoint is fast and is not blocked by any
config reload or blocking admin operation, since it is served by the
worker pool.

Related Issues

envoyproxy/envoy#16425

Testing

Manual tests
Deployed on a local k8s cluster.

LukeShu and others added 30 commits May 27, 2022 22:00
 - Rename `actions/collect-testing-logs` -> `actions/collect-logs`
 - Rename the `name` arg to `jobname`
 - Revise the description

Signed-off-by: Luke Shumaker <[email protected]>
This was causing problems for me on my laptop.

Signed-off-by: Luke Shumaker <[email protected]>
Avoid writing repetitive rules for simple files.

Signed-off-by: Luke Shumaker <[email protected]>
… in CI [ci-skip]

This is marked [ci-skip] because (as this PR reveals) `make clobber`
is currently broken.

Signed-off-by: Luke Shumaker <[email protected]>
Get it to actually clean up all the things that it should.

Signed-off-by: Luke Shumaker <[email protected]>
If there's an old PR that's already been merged, that means we should be
creating a new one, rather than assuming that pushing to the branch will
update the old one.

As seen on https://github.com/emissary-ingress/emissary/runs/6523998225

Signed-off-by: Luke Shumaker <[email protected]>
…k on, and that's no good for Envoy.

Signed-off-by: Flynn <[email protected]>
… on MacOS. Sigh.

Thanks to @LukeShu for the help here!

Signed-off-by: Flynn <[email protected]>
Signed-off-by: Flynn <[email protected]>
If there are any authservices in annotation config, then add logic to decide whether the synthetic authservice should be injected/removed

Signed-off-by: AliceProxy <[email protected]>
Signed-off-by: Flynn <[email protected]>
Signed-off-by: AliceProxy <[email protected]>
Edge-stack does not support custom authservices so we xfail these tests when running edge-stack.
The synthetic authservice should ensure that a valid authservice is always present for edge-stack.

Signed-off-by: AliceProxy <[email protected]>
… custom authservice not compatible with the default edge-stack authservice

Signed-off-by: AliceProxy <[email protected]>
This reverts commit 09ecd8e.

Signed-off-by: AliceProxy <[email protected]>
This reverts commit b494d99.

Signed-off-by: AliceProxy <[email protected]>
alex and others added 10 commits May 27, 2022 22:00
…ltiple mappings have the same name

Signed-off-by: AliceProxy <[email protected]>
The /ready endpoint used by emissary uses the admin port (8001 by default).
This causes a problem during config reloads with large configs: the admin
thread is blocking, so the /ready endpoint can be very slow to answer
(on the order of several seconds, or even more).

The problem is described in this envoy issue:
envoyproxy/envoy#16425

This change tries to fix the /ready endpoint problem.
The /ready endpoint can be exposed in the worker pool by adding a
listener plus a health-check HTTP filter.
This way, the /ready endpoint is fast and is not blocked by any
config reload or blocking admin operation, since it is served by the
worker pool.

Future changes will allow diagd and the go code to use this endpoint as
well, so they get a fast /ready endpoint and do not use the admin port.

This listener is disabled by default. The config "ready_port" can be used
to set the port and enable this new listener on envoy.

Signed-off-by: Fabrice Rabaute <[email protected]>
Using the new /ready listener/endpoint for the health check means the go code
will be involved at some point.
Right now, the config for the ready endpoint is based on 3 fields:
ready_port
ready_ip
ready_log

In the current code, those 3 fields are part of the ambassador module config.

When the go code is involved, there will be a race condition on "when" those values
are applied/taken into account by the go process versus the python+envoy process,
and that might be a problem.
What could happen is the following:
The ambassador module is changed and ready_port is changed, for instance.
t0: the go process gets the change and starts to process it.
t0: the python process gets the change and starts to process it.
t10: the go process has processed the change and starts to ping the new port for the /ready endpoint.
t100: the python process and envoy have changed the port.

During t10 to t100, the go process /ready check will fail, because the config refresh
is not synced between the go process and the python+envoy process.
That might generate problems during updates, with the pod becoming not ready
and k8s no longer sending traffic to it.

Alternative option:
Move the 3 fields out of the ambassador module and define defaults that can only be changed with env vars.
The 3 env vars are:
AMBASSADOR_READY_PORT
AMBASSADOR_READY_IP
AMBASSADOR_READY_LOG

This way, the config cannot change between updates; it is set up at startup.

Signed-off-by: Fabrice Rabaute <[email protected]>
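A minimal sketch of the env-var approach described above: the port is read from AMBASSADOR_READY_PORT at startup and falls back to 8002 when unset or out of range (mirroring the bounds check visible in the diff). The function and variable names are illustrative, not the actual emissary code.

```go
// Hypothetical sketch: resolve the ready-listener port from the
// environment once at startup, defaulting to 8002.
package main

import (
	"fmt"
	"os"
	"strconv"
)

func readyPortFromEnv() int {
	port, err := strconv.Atoi(os.Getenv("AMBASSADOR_READY_PORT"))
	if err != nil || port < 1 || port > 32767 {
		return 8002 // default port for the worker-pool /ready listener
	}
	return port
}

func main() {
	fmt.Printf("ready endpoint: http://127.0.0.1:%d/ready\n", readyPortFromEnv())
}
```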
Now that we expose a listener for the /ready endpoint, it can be used by
the go code to query envoy readiness, rather than using the endpoint on
the envoy admin thread (which can be slow).

Signed-off-by: Fabrice Rabaute <[email protected]>
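A minimal sketch, assuming the default 127.0.0.1:8002 listener, of what such a readiness query could look like from Go (not the actual emissary implementation). A short timeout is enough here because the health-check filter answers from the worker threads and is not blocked by admin operations.

```go
// Hypothetical readiness check against the worker-pool /ready listener.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func envoyReady(port int) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://127.0.0.1:%d/ready", port))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	fmt.Println("envoy ready:", envoyReady(8002))
}
```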
There should be no need to configure this option to a value other than
127.0.0.1, so remove the env var option and always use 127.0.0.1.

Signed-off-by: Fabrice Rabaute <[email protected]>
@alexgervais alexgervais (Contributor) left a comment
Thanks for your contribution @jfrabaute!
I'm a bit curious about your testing strategy for this PR. How did you go about testing large configuration reloads? Could we automate any part of your manual tests?

}
}
if readyPort < 1 || readyPort > 32767 {
readyPort = 8002
Contributor

Should the fallback ready port be 8001?
I see multiple occurrences of 8002 in the python code as well, yet the previously hardcoded ready-url was "http://localhost:8001/ready".

Contributor Author

I'm adding a new listener here, where the default port is 8002.
This is the port that I'm now using as the "default" port for readiness.
This new listener just exposes the "/ready" endpoint with the special ready filter.
The default is 8002, and it can be changed to another value in case port 8002 is already used by some customer config (to expose a TCPMapping, for instance).

Now the new URL for readiness is "http://127.0.0.1:8002/ready".
If the user changes this port using the env var AMBASSADOR_READY_PORT, the value can be different.

I hope it's clear now.

Contributor

Thanks for the clarification, I understand much better now.
I know a lot of installations and monitors are configured to ping :8001/ready for uptime and health checks, of course. I wonder how we can make a smooth transition to the new port... any thoughts, @LanceEa?

Contributor Author

The :8001/ready endpoint is still working. It is automatically exposed by the envoy admin thread, so installs using this endpoint should still work.

@LanceEa (Contributor) commented May 31, 2022

Thanks @jfrabaute for the PR. I will take a look this week and let you know if I have any questions.

I would also echo what @alexgervais mentioned around testing: how you tested the large configuration, and whether we could add it to our test suite to validate the previous vs. new behavior.

@jfrabaute (Contributor Author)

Hi,

For testing, I tested on the previous PR: #3626
This PR is exactly the same, except that it also impacts the go process so it checks readiness on this new endpoint rather than the old endpoint (which is slow).

In order to test manually, you can start an ambassador instance on a cluster (or a kind cluster, for instance) and create 1000 mappings.
When done, you can hit the stats endpoint on envoy (something like http://localhost:8001/stats?format=prometheus) and, while that request is running, query http://localhost:8001/ready right after the stats request has started (but not finished).
You'll see that the prometheus stats request takes a while, like several seconds, and in the meantime the /ready endpoint will also hang.
Then, when the stats request ends, the ready request returns. So the ready endpoint execution on the envoy side is blocked by the stats endpoint (the admin thread is single threaded).
When moving the ready endpoint to a worker listener, the hit is always fast, just a few milliseconds, regardless of whether a stats request is in progress or not.
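For illustration only, a small Go program along the lines of the manual test described above: it starts the expensive /stats?format=prometheus request in the background and then times /ready on the admin port (8001) and on the new worker listener (8002). The ports and the 1000-mapping setup come from the comment; this is not part of the test suite.

```go
// Rough sketch of the manual test: with a slow admin-port stats scrape
// in flight, compare /ready latency on the admin port vs the worker
// listener.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func timeGet(url string) time.Duration {
	start := time.Now()
	resp, err := http.Get(url)
	if err == nil {
		resp.Body.Close()
	}
	return time.Since(start)
}

func main() {
	// Kick off the expensive stats scrape in the background.
	go timeGet("http://localhost:8001/stats?format=prometheus")
	time.Sleep(100 * time.Millisecond) // give it time to start

	fmt.Println("admin  /ready:", timeGet("http://localhost:8001/ready"))
	fmt.Println("worker /ready:", timeGet("http://localhost:8002/ready"))
}
```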

@jfrabaute (Contributor Author)

Hi,

Any chance of getting this PR reviewed and merged, or getting feedback about what's missing for a merge?

Thanks.

@jfrabaute (Contributor Author)

Superseded by: #4300

@jfrabaute jfrabaute closed this Jun 27, 2022