-
Notifications
You must be signed in to change notification settings - Fork 687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ready: Use envoy listener to expose endpoint from worker for ambassador #4253
Conversation
- Rename `actions/collect-testing-logs` -> `actions/collect-logs` - Rename the `name` arg to `jobname` - Revise the description Signed-off-by: Luke Shumaker <[email protected]>
This was causing problems for me on my laptop. Signed-off-by: Luke Shumaker <[email protected]>
Signed-off-by: Luke Shumaker <[email protected]>
Signed-off-by: Luke Shumaker <[email protected]>
Avoid writing repetitive rules for simple files. Signed-off-by: Luke Shumaker <[email protected]>
Signed-off-by: Luke Shumaker <[email protected]>
Signed-off-by: Luke Shumaker <[email protected]>
Signed-off-by: Luke Shumaker <[email protected]>
… in CI [ci-skip] This is marked [ci-skip] because (as this PR reveals) `make clobber` is currently broken. Signed-off-by: Luke Shumaker <[email protected]>
Get it to actually clean up all the things that it should. Signed-off-by: Luke Shumaker <[email protected]>
If there's an old PR that's already been merged, that means we should be creating a new one, rather than assuming that pushing to the branch will update the old one. As seen on https://github.com/emissary-ingress/emissary/runs/6523998225 Signed-off-by: Luke Shumaker <[email protected]>
…k on, and that's no good for Envoy. Signed-off-by: Flynn <[email protected]>
… on MacOS. Sigh. Thanks to @LukeShu for the help here! Signed-off-by: Flynn <[email protected]>
Signed-off-by: Flynn <[email protected]>
Signed-off-by: Flynn <[email protected]>
Signed-off-by: AliceProxy <[email protected]>
If there are any authservices in annotation config, then add logic to whether the synthetic authservice should be injected/removed Signed-off-by: AliceProxy <[email protected]> Signed-off-by: Flynn <[email protected]> Signed-off-by: AliceProxy <[email protected]>
Edge-stack does not support custom authservices so we xfail these tests when running edge-stack. The synthetic authservice should ensure that a valid authservice is always present for edge-stack. Signed-off-by: AliceProxy <[email protected]>
Signed-off-by: AliceProxy <[email protected]>
… custom authservice not compatible with the default edge-stack authservice Signed-off-by: AliceProxy <[email protected]>
Signed-off-by: AliceProxy <[email protected]>
Signed-off-by: AliceProxy <[email protected]>
Signed-off-by: AliceProxy <[email protected]>
…oved Signed-off-by: AliceProxy <[email protected]>
This reverts commit 09ecd8e. Signed-off-by: AliceProxy <[email protected]>
This reverts commit b494d99. Signed-off-by: AliceProxy <[email protected]>
This reverts commit dfbdc99. Signed-off-by: AliceProxy <[email protected]>
This reverts commit 4fae467. Signed-off-by: AliceProxy <[email protected]>
This reverts commit adbb2b7. Signed-off-by: AliceProxy <[email protected]>
Signed-off-by: alex <[email protected]>
Signed-off-by: alex <[email protected]>
Signed-off-by: alex <[email protected]>
Signed-off-by: alex <[email protected]>
…ltiple mappings have the same name Signed-off-by: AliceProxy <[email protected]>
Signed-off-by: AliceProxy <[email protected]>
Signed-off-by: AliceProxy <[email protected]>
/ready endpoint used by emissary is using the admin port (8001 by default). This generates a problem during config reloads with large configs as the admin thread is blocking so the /ready endpoint can be very slow to answer (in the order of several seconds, even more). The problem is described in this envoy issue: envoyproxy/envoy#16425 This change is trying to fix the /ready endpoint problem. The /ready endpoint can be exposed in the worker pool by adding a listener+ health check http filter. This way, the /ready endpoint is fast and it is not blocked by any config reload or blocking admin operation as it depends on the worker pool. Future changes will allow to use this endpoint with diagd and the go code as well so they get a fast /ready endpoint and they do not use the admin port. This listener is disabled by default. the config "read_port" can be used to set the port and enable this new listener on envoy. Signed-off-by: Fabrice Rabaute <[email protected]>
Using the new /ready listener/endpoint for the healcheck means the go code will be involved at some point. Right now, the config for the ready endpoint is based on 3 fields: ready_port ready_ip ready_log In the current code, those 3 fields are part of the ambassador module config. When the go code is involved, there will be a race condition on "when" those values are applied/taken into account between the go process and the python+envoy process. So that might be a problem. What could happen is the following: The ambassador module is changed and ready_port is changed for instance. t0: the go process gets the change and starts to process it t0: the python process gets the change and starts to process it. t10: the go process has processed the change and starts to ping the new port for /ready endpoint. t100: the python process and envoy has changed the port. During t10 to t100, the go process /ready check will fail, because the refresh of the config between the go process and the python+envoy process are not synced. That might be a problem and generate problems during updates, with the pod becoming not ready and k8s not sending traffic to it anymore. Alternative option: Move the 3 fields out of the ambassador module, define some defaults, that can only be changed with env vars. The 3 env vars are: AMBASSADOR_READY_PORT AMBASSADOR_READY_IP AMBASSADOR_READY_LOG This way, the config cannot change between updates, it's setup at startup. Signed-off-by: Fabrice Rabaute <[email protected]>
Now that we expose a listener for the /ready endpoint, it can be used by the go code to query envoy readiness rather than using the one in the envoy admin thread (which can be slow). Signed-off-by: Fabrice Rabaute <[email protected]>
There should be no need to configure this option to a different value than 127.0.0.1, so remove the env var option and set it to 127.0.0.1 all the time. Signed-off-by: Fabrice Rabaute <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution @jfrabaute!
I'm a bit curious about your testing strategy for this PR. How did you go about testing large configuration reloads? Could we automate any part of your manual tests?
} | ||
} | ||
if readyPort < 1 || readyPort > 32767 { | ||
readyPort = 8002 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the fallback ready port be 8001
?
I see multiple occurrences of 8002
in the python code as well, yet the previously hardcoded ready-url was "http://localhost:8001/ready"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm adding a new listener here, where the default port is 8002.
This is this port that I'm using as "default" port now for the readiness.
This new listener is just exposing the "/ready" endpoint with the special ready filter.
The default is 8002, and it can be changed to another value, in case port 8002 is already used by some customer config (to expose a TCPMapping for instance).
Now the new url for the readiness is "http://127.0.0.1:8002/ready".
If the user changes this port using the env var AMBASSADOR_READY_PORT, the value can be different.
I hope it's clear now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification, I understand much better now.
I know a lot of installations and monitors are configured to ping :8001/ready
for uptime and health check of course. I wonder how we can make a smooth transition to the new port... any thoughts @LanceEa ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:8001/ready
endpoint is still working. This is an endpoint automatically exposed by the envoy admin thread, so installs using this endpoint should still work.
Thanks @jfrabaute for the PR. I will take a look this week and let you know if I have any questions. I would also echo what @alexgervais mentioned around testing and how you tested the large configuration and whether we could add it to our test suite to validate the previous vs new behavior. |
Hi, For testing, I did test on the previous PR: #3626 In order to test manually, you can start an ambassador instance on a cluster(or a kind cluster for instance), create 1000 mappings. |
Hi, Any chance to have this PR reviewed and merged? or get feedback about what's missing for a merge? Thanks. |
Superseded by: #4300 |
/ready endpoint used by emissary is using the admin port (8001 by
default).
This generates a problem during config reloads with large configs as the
admin thread is blocking so the /ready endpoint can be very slow to
answer (in the order of several seconds, even more).
The problem is described in this envoy issue:
envoyproxy/envoy#16425
This change is fixing the slowness of /ready endpoint.
The /ready endpoint is now exposed in the worker pool by adding a
listener+ health check http filter.
This way, the /ready endpoint is fast and it is not blocked by any
config reload or blocking admin operation as it depends on the worker
pool.
Related Issues
envoyproxy/envoy#16425
Testing
Manual tests
Deployed on a local k8s cluster.