ready: Use envoy listener to expose endpoint from worker for ambassador #4253

Closed
wants to merge 40 commits

Conversation

jfrabaute
Contributor

The /ready endpoint used by emissary uses the admin port (8001 by default).
This causes a problem during config reloads with large configs: the admin
thread is blocking, so the /ready endpoint can be very slow to answer
(on the order of several seconds, or even more).

The problem is described in this envoy issue:
envoyproxy/envoy#16425

This change fixes the slowness of the /ready endpoint.
The /ready endpoint is now exposed in the worker pool by adding a
listener plus a health-check HTTP filter.
This way, the /ready endpoint is fast and is not blocked by any
config reload or blocking admin operation, since it is served by the
worker pool.

Related Issues

envoyproxy/envoy#16425

Testing

Manual tests
Deployed on a local k8s cluster.

LukeShu and others added 30 commits May 27, 2022 22:00
 - Rename `actions/collect-testing-logs` -> `actions/collect-logs`
 - Rename the `name` arg to `jobname`
 - Revise the description

Signed-off-by: Luke Shumaker <[email protected]>
This was causing problems for me on my laptop.

Signed-off-by: Luke Shumaker <[email protected]>
Avoid writing repetitive rules for simple files.

Signed-off-by: Luke Shumaker <[email protected]>
… in CI [ci-skip]

This is marked [ci-skip] because (as this PR reveals) `make clobber`
is currently broken.

Signed-off-by: Luke Shumaker <[email protected]>
Get it to actually clean up all the things that it should.

Signed-off-by: Luke Shumaker <[email protected]>
If there's an old PR that's already been merged, that means we should be
creating a new one, rather than assuming that pushing to the branch will
update the old one.

As seen on https://github.com/emissary-ingress/emissary/runs/6523998225

Signed-off-by: Luke Shumaker <[email protected]>
…k on, and that's no good for Envoy.

Signed-off-by: Flynn <[email protected]>
… on MacOS. Sigh.

Thanks to @LukeShu for the help here!

Signed-off-by: Flynn <[email protected]>
Signed-off-by: Flynn <[email protected]>
If there are any authservices in annotation config, then add logic to decide whether the synthetic authservice should be injected/removed

Signed-off-by: AliceProxy <[email protected]>
Signed-off-by: Flynn <[email protected]>
Signed-off-by: AliceProxy <[email protected]>
Edge-stack does not support custom authservices so we xfail these tests when running edge-stack.
The synthetic authservice should ensure that a valid authservice is always present for edge-stack.

Signed-off-by: AliceProxy <[email protected]>
… custom authservice not compatible with the default edge-stack authservice

Signed-off-by: AliceProxy <[email protected]>
This reverts commit 09ecd8e.

Signed-off-by: AliceProxy <[email protected]>
This reverts commit b494d99.

Signed-off-by: AliceProxy <[email protected]>
alex and others added 10 commits May 27, 2022 22:00
…ltiple mappings have the same name

Signed-off-by: AliceProxy <[email protected]>
The /ready endpoint used by emissary uses the admin port (8001 by default).
This causes a problem during config reloads with large configs: the admin
thread is blocking, so the /ready endpoint can be very slow to answer
(on the order of several seconds, or even more).

The problem is described in this envoy issue:
envoyproxy/envoy#16425

This change tries to fix the /ready endpoint problem.
The /ready endpoint can be exposed in the worker pool by adding a
listener plus a health-check HTTP filter.
This way, the /ready endpoint is fast and is not blocked by any
config reload or blocking admin operation, since it is served by the
worker pool.

Future changes will allow diagd and the go code to use this endpoint as
well, so they get a fast /ready endpoint and do not use the admin port.

This listener is disabled by default. The config "ready_port" can be used
to set the port and enable this new listener on envoy.

Signed-off-by: Fabrice Rabaute <[email protected]>
Using the new /ready listener/endpoint for the health check means the go code
will be involved at some point.
Right now, the config for the ready endpoint is based on 3 fields:
ready_port
ready_ip
ready_log

In the current code, those 3 fields are part of the ambassador module config.

When the go code is involved, there will be a race condition on "when" those values
are applied/taken into account by the go process versus the python+envoy process,
and that might be a problem.
What could happen is the following:
The ambassador module is changed and ready_port is changed, for instance.
t0: the go process gets the change and starts to process it.
t0: the python process gets the change and starts to process it.
t10: the go process has processed the change and starts to ping the new port for the /ready endpoint.
t100: the python process and envoy have changed the port.

During t10 to t100, the go process /ready check will fail, because the config refresh
is not synced between the go process and the python+envoy process.
That might generate problems during updates, with the pod becoming not ready
and k8s no longer sending traffic to it.

Alternative option:
Move the 3 fields out of the ambassador module and define defaults that can only be changed with env vars.
The 3 env vars are:
AMBASSADOR_READY_PORT
AMBASSADOR_READY_IP
AMBASSADOR_READY_LOG

This way, the config cannot change between updates; it is set up at startup.

Signed-off-by: Fabrice Rabaute <[email protected]>
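A minimal sketch of the env-var approach described above: the port is read from AMBASSADOR_READY_PORT at startup and falls back to 8002 when unset or out of range (mirroring the bounds check visible in the diff). The function and variable names are illustrative, not the actual emissary code.

```go
// Hypothetical sketch: resolve the ready-listener port from the
// environment once at startup, defaulting to 8002.
package main

import (
	"fmt"
	"os"
	"strconv"
)

func readyPortFromEnv() int {
	port, err := strconv.Atoi(os.Getenv("AMBASSADOR_READY_PORT"))
	if err != nil || port < 1 || port > 32767 {
		return 8002 // default port for the worker-pool /ready listener
	}
	return port
}

func main() {
	fmt.Printf("ready endpoint: http://127.0.0.1:%d/ready\n", readyPortFromEnv())
}
```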
Now that we expose a listener for the /ready endpoint, it can be used by
the go code to query envoy readiness, rather than using the endpoint on
the envoy admin thread (which can be slow).

Signed-off-by: Fabrice Rabaute <[email protected]>
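A minimal sketch, assuming the default 127.0.0.1:8002 listener, of what such a readiness query could look like from Go (not the actual emissary implementation). A short timeout is enough here because the health-check filter answers from the worker threads and is not blocked by admin operations.

```go
// Hypothetical readiness check against the worker-pool /ready listener.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func envoyReady(port int) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://127.0.0.1:%d/ready", port))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	fmt.Println("envoy ready:", envoyReady(8002))
}
```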
There should be no need to configure this option to a value other than
127.0.0.1, so remove the env var option and always use 127.0.0.1.

Signed-off-by: Fabrice Rabaute <[email protected]>
@alexgervais alexgervais (Contributor) left a comment
Thanks for your contribution @jfrabaute!
I'm a bit curious about your testing strategy for this PR. How did you go about testing large configuration reloads? Could we automate any part of your manual tests?

}
}
if readyPort < 1 || readyPort > 32767 {
readyPort = 8002
Contributor

Should the fallback ready port be 8001?
I see multiple occurrences of 8002 in the python code as well, yet the previously hardcoded ready-url was "http://localhost:8001/ready".

Contributor Author

I'm adding a new listener here, where the default port is 8002.
This is the port that I'm now using as the "default" port for readiness.
This new listener just exposes the "/ready" endpoint with the special ready filter.
The default is 8002, and it can be changed to another value in case port 8002 is already used by some customer config (to expose a TCPMapping, for instance).

Now the new URL for readiness is "http://127.0.0.1:8002/ready".
If the user changes this port using the env var AMBASSADOR_READY_PORT, the value can be different.

I hope it's clear now.

Contributor

Thanks for the clarification, I understand much better now.
I know a lot of installations and monitors are configured to ping :8001/ready for uptime and health checks, of course. I wonder how we can make a smooth transition to the new port... any thoughts, @LanceEa?

Contributor Author

The :8001/ready endpoint is still working. It is automatically exposed by the envoy admin thread, so installs using this endpoint should still work.

@LanceEa (Contributor) commented May 31, 2022

Thanks @jfrabaute for the PR. I will take a look this week and let you know if I have any questions.

I would also echo what @alexgervais mentioned around testing: how you tested the large configuration, and whether we could add it to our test suite to validate the previous vs. new behavior.

@jfrabaute (Contributor Author)

Hi,

For testing, I tested on the previous PR: #3626
This PR is exactly the same, except that it also impacts the go process so it checks readiness on this new endpoint rather than the old endpoint (which is slow).

In order to test manually, you can start an ambassador instance on a cluster (or a kind cluster, for instance) and create 1000 mappings.
When done, you can hit the stats endpoint on envoy (something like http://localhost:8001/stats?format=prometheus) and, while that request is running, query http://localhost:8001/ready right after the stats request has started (but not finished).
You'll see that the prometheus stats request takes a while, like several seconds, and in the meantime the /ready endpoint will also hang.
Then, when the stats request ends, the ready request returns. So the ready endpoint execution on the envoy side is blocked by the stats endpoint (the admin thread is single threaded).
When moving the ready endpoint to a worker listener, the hit is always fast, just a few milliseconds, regardless of whether a stats request is in progress or not.
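For illustration only, a small Go program along the lines of the manual test described above: it starts the expensive /stats?format=prometheus request in the background and then times /ready on the admin port (8001) and on the new worker listener (8002). The ports and the 1000-mapping setup come from the comment; this is not part of the test suite.

```go
// Rough sketch of the manual test: with a slow admin-port stats scrape
// in flight, compare /ready latency on the admin port vs the worker
// listener.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func timeGet(url string) time.Duration {
	start := time.Now()
	resp, err := http.Get(url)
	if err == nil {
		resp.Body.Close()
	}
	return time.Since(start)
}

func main() {
	// Kick off the expensive stats scrape in the background.
	go timeGet("http://localhost:8001/stats?format=prometheus")
	time.Sleep(100 * time.Millisecond) // give it time to start

	fmt.Println("admin  /ready:", timeGet("http://localhost:8001/ready"))
	fmt.Println("worker /ready:", timeGet("http://localhost:8002/ready"))
}
```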

@jfrabaute (Contributor Author)

Hi,

Any chance of getting this PR reviewed and merged, or getting feedback about what's missing for a merge?

Thanks.

@jfrabaute (Contributor Author)

Superseded by: #4300

@jfrabaute jfrabaute closed this Jun 27, 2022