slow stats endpoint and blocking /ready endpoint when hitting /stats endpoint #16425
I've seen people recommend against hitting an admin endpoint as a readiness check for this reason (I think from @howardjohn?). The fact that it might be delayed due to xDS responses makes it a poor choice for periodic health checks.
It would be good to ensure that config updates can proceed even if admin requests are consuming significant CPU.
Yeah, for Istio we hit this a lot. We ended up using it just for startup (at which point /stats is probably not huge and any large xDS load is probably Envoy actually initializing). From then on, we just hit a normal (non-admin) listener, so the requests go through Envoy's worker threads, not the main thread.
See also #16139 (comment). I think if we streamed out large HTTP responses from admin, it would take less cpu/memory and, depending on how we managed events, might not block other admin requests from running during the streaming process.
Thanks for the feedback. One extra question about the json and prometheus formats: is it expected that asking for those formats makes the request much slower? Is the transformation step between the plain stats format and another format really so big that it makes the request 5x slower?
Right now each request is entirely blocking, which is fine for tiny requests like /ready. But if we change the stats handlers to stream data out, and if we don't buffer and sort, I think the admin port could service /ready in between chunks of a /stats response.
Yes, we're hitting the scrape problem as well when config is reloaded (so that's a 3rd problem 😞). Are there any workarounds/options here with the current admin thread design (especially for the blocked behavior during a config reload)?
I don't think there is a workaround. I think it is a fairly difficult problem to solve, since certain data structures are only safe to access from the main thread, hence the restrictions on config reload and admin handlers running on the main thread.
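The serialization described above can be sketched with a toy model (not Envoy code, just an illustration): when a single thread handles admin requests strictly one at a time, a cheap /ready queued behind an expensive /stats inherits the full /stats latency.

```shell
# Toy model of a single-threaded admin: requests are handled one at a
# time, so a cheap /ready queued behind an expensive /stats inherits
# the /stats latency. Costs here are stand-ins, not real measurements.
handle() {            # handle <endpoint> <cost_in_seconds>
  sleep "$2"
  echo "$1 served"
}

start=$(date +%s%N)
handle /stats 1       # expensive scrape occupies the only thread
handle /ready 0       # cheap check, but it had to wait its turn
end=$(date +%s%N)
latency_ms=$(( (end - start) / 1000000 ))
echo "/ready latency: ${latency_ms} ms"
```

The observed /ready latency is at least the full cost of the in-flight /stats request, which is exactly the behavior reported in this issue.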
I guess there are two approaches:
A lot of the work that's being done now may not be needed by all users, such as sorting the stats output. If we do sort, we don't need to buffer the serialized form (which is happening now); we can just buffer the vector of stat pointers, or at least shared_ptrs, so that we can survive a stat being dropped while streaming it out.
Doing stats operations that take 6 seconds on workers is not acceptable either. Ideally we'd do very little on the main thread, just the things that need to go through it due to ordering requirements. Admin requests should be served by a thread pool, and config processing should happen on its own set of threads.
Sorry if I wasn't clear; I was proposing the second option: break up the stats streaming work and do it asynchronously on the main thread. I think we'd do a little less total work by not buffering up the full serialized form (and maybe not sorting), but that might be too optimistic. Even if it's the same 6s of work, doing it in 100 chunks of 60ms rather than all at once would have much less latency impact on small requests like /ready.

I'm not opposed to having other threads dedicated to admin processing, but I'd be a little worried about the potential impact on request latency at (say) p99 or p95 of having more runnable threads than physical cores. We could also reserve a few cores for admin processing, depending on the setup, but that might not work well for everyone. So I thought async-chunks-on-the-main-thread might be a good way to go.
I think for stats it should be relatively plausible to push this to a distinct thread; given we already have stats sinks that gather this data, can't we have a pseudo-sink that dumps to some shared memory buffer for a consumer thread that can handle these requests? This doesn't solve the more general problem for the other admin endpoints, though.
Yeah, you can collect the stats in vectors of shared_ptrs on the main thread and then stream them out from another. But I think we should stream them out in chunks regardless, rather than buffering up all the text before sending a response. That will solve a memory burst problem per #16139. Once we are streaming, we can make it async and see whether we actually need to add a new thread.
@snowp @jmarantz By the way, this is part of what #15876 is addressing, to a small degree. One of the larger issues, for us at least, was that once there is a large number of VirtualHosts (>80K), even a single VirtualHost add/delete via VHDS was causing delays of /stats, /ready, etc. This seems, to a large degree, to stem from the way RouteMatcher and VirtualHosts are handled. Out of curiosity, @jfrabaute, are you using VHDS? Do you have many VirtualHosts?
No idea. We're using Ambassador, which takes care of managing the envoy config and lifecycle. I don't think Ambassador is using VHDS, but I'm not sure; I'll ask the Ambassador team. Regarding the problem, it looks like it is not limited to VHDS.
Answer: Ambassador does not use VHDS |
Okay, good to know it's not limited to VHDS. I'd be curious to see a flamegraph, but I'm just a fly on the wall here.
I am not aware of such a plan.
Could those two endpoints be separated?
All that can be done! It's just a matter of someone doing it. To simplify: is the goal just to make /ready always fast, and would you be willing to have that on a different port with its own (very cheap) thread? That project is nice because it doesn't have to touch the complexity of stats and config updates, but it would involve plumbing the declaration of a new port through the API. Other @envoyproxy/maintainers may have better ideas as well.
Another strategy might be to expose whatever relevant information you want out of /ready via an HTTP filter, allowing users to set up a filter chain on a distinct port that can handle this. That wouldn't require any core API changes. This somewhat begs the question of what information /ready uses to declare readiness: much of it, I expect, is just dependent on the server having started, so anything handling traffic on a worker thread would already be gated on this. Does this just boil down to a direct response from the HCM?
Yeah.... Totally see what you mean here.
That's the small goal for me, but that might just be a narrow view of a larger problem, hence I created the issue to get an idea of a possible larger picture.
Yes, that's the idea behind my proposal. I had a quick look at the code and, knowing close to nothing, the admin thread already seems like a complex piece of code, so isolating this change and not touching the admin thread seemed like an interesting option. As it is just for monitoring, it could make sense to have a first iteration in a reasonable amount of time.
That seems interesting. Similar to what I'm proposing (IIUC), but even simpler, as it's just getting the info and using the worker threads.
I need to check if we are experiencing delays on stats scrapes, but I suspect we are. We have a similar scenario but using Contour for ingress on Kubernetes as opposed to Ambassador. This is a multi-tenant cluster and can end up hosting a lot of routes, clusters, etc.

This has me thinking about some @mattklein123 tweets about push vs. pull: https://twitter.com/mattklein123/status/1328559009633239040, https://twitter.com/mattklein123/status/1266010765669961729

I plan on looking into changing the model from having Prometheus scrape the stats endpoint to pushing metrics instead. In theory the push approach wouldn't see the periodic load when the scrape occurs. Keen on advice or gotchas, and on whether a StatsD approach would help with the blocking/contention.
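For anyone exploring that push direction, Envoy ships a StatsD sink configured at the bootstrap level. A minimal sketch (the agent address and port are assumptions, not from this thread):

```yaml
# Bootstrap-level stats sink sketch: push metrics over UDP to a local
# statsd agent instead of having Prometheus scrape the admin endpoint.
stats_sinks:
- name: envoy.stat_sinks.statsd
  typed_config:
    "@type": type.googleapis.com/envoy.config.metrics.v3.StatsdSink
    address:
      socket_address:
        address: 127.0.0.1   # assumed local statsd/telegraf agent
        port_value: 8125
```

Sinks are flushed periodically from the main thread on the stats flush interval, so this spreads the cost over time rather than concentrating it in one large scrape response.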
One thing to add to this already long and involved discussion is that we directionally want to head towards the admin port not being special in any particular way vs. regular listeners. This has a ton of advantages around security and consistency. This might matter if we start adding new ports (which I don't think should be preferred; it's not great to fix an implementation issue with an API one).
One question: I'm looking at HTTP filters, and it looks like a health check HTTP filter already exists. Would that filter solve the problem if used on a specific listener for a monitoring system like k8s readiness/liveness probes? If so, that would be a first step.
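For reference, a minimal sketch of such a listener: the health check filter in non-pass-through mode answers /ready on a worker thread, so it stays responsive during config reloads. The port, names, and matcher here are assumptions for illustration, not from the issue:

```yaml
static_resources:
  listeners:
  - name: ready_listener          # dedicated, cheap readiness listener
    address:
      socket_address: { address: 0.0.0.0, port_value: 8002 }
    filter_chains:
    - filters:
      - name: envoy.filters.http.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ready
          route_config:
            virtual_hosts:
            - name: ready
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                direct_response: { status: 404 }   # anything but /ready
          http_filters:
          - name: envoy.filters.http.health_check
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
              pass_through_mode: false   # answer locally, no upstream
              headers:
              - name: ":path"
                exact_match: "/ready"
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```

Because this runs on the worker threads, it only reports that the server is up and serving traffic; it does not reflect admin-level state the way the admin /ready handler does.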
The /ready endpoint used by emissary is using the admin port (8001 by default). This generates a problem during config reloads with large configs: the admin thread is blocking, so the /ready endpoint can be very slow to answer (on the order of several seconds, or even more). The problem is described in this envoy issue: envoyproxy/envoy#16425

This change tries to fix the /ready endpoint problem. The /ready endpoint can be exposed in the worker pool by adding a listener + health check HTTP filter. This way, the /ready endpoint is fast and is not blocked by any config reload or blocking admin operation, as it depends on the worker pool. Future changes will allow diagd and the Go code to use this endpoint as well, so they also get a fast /ready endpoint and do not use the admin port.

This listener is disabled by default. The config "read_port" can be used to set the port and enable this new listener on envoy.

Signed-off-by: Fabrice Rabaute <[email protected]>
I should probably report 2 distinct problems:

1. /stats?format={json|prometheus} slowness
2. /ready endpoint blocked by the /stats endpoint when running at the same time.

Maybe I should create 2 issues, but I'm starting with this one to get feedback from the envoy team.
So, we have some envoy instances with a lot of mappings/big config (between 5000 and 10000).
In order to reproduce it, I created a sample envoy config with 7000 mappings, so it's easy to test.
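The attached generator script isn't shown inline; a hedged sketch of how such a config generator might loop (the names, route shape, and output file are assumptions for illustration, not the actual sample-long.sh):

```shell
# Hypothetical generator: emit N route entries to splice into a virtual
# host in an envoy config, approximating a large "mappings" setup.
N=7000
for i in $(seq 1 "$N"); do
  cat <<EOF
            - match: { prefix: "/mapping-$i/" }
              route: { cluster: cluster-$i }
EOF
done > routes-fragment.yaml
wc -l < routes-fragment.yaml   # 2 lines per mapping
```

Each mapping adds routes (and typically clusters and their stats), which is what drives both the /stats payload size and the config-reload cost discussed above.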
Problem 1:
With this number of mappings, the /stats endpoint takes a bit more than 1 second. The /stats?format={json|prometheus} endpoint (either json or prometheus format) takes more than 6 seconds! That is more than 5x between /stats and the two other formats. Is that expected?
Problem 2:
When /stats is called and another client makes a /ready request, the /ready request seems to be stuck until the /stats request is done. So, when /stats?format=prometheus is called and takes 7 seconds (for example), and a /ready request is made by a client at the beginning of that 7-second window, the /ready request is also going to take 7 seconds.

This generates a bunch of problems with monitoring, especially because we are using Ambassador, and the Ambassador /check_ready endpoint is a wrapper around the envoy /ready endpoint with a 2-second timeout (https://github.com/datawire/ambassador/blob/d1a8b1ca89d878b4c8722f51f2479028288b747e/pkg/acp/envoy.go#L61).
So, if prometheus is scraping at the same time, the readiness is failing.
Repro steps:
(all attachments have an extra .txt extension that should be removed)
I am attaching the "sample-long.yaml" config with the 7000 mappings.
I am also attaching the "sample-long.sh" basic bash script that generates the sample config (if needed).
Here is the command I'm using to run envoy locally:
I'm attaching the test.sh script to test the perf of the different endpoints. Here is the output of this test.sh script when running on my laptop. My laptop is doing nothing, and envoy is doing nothing (no traffic except the tests). You can see in the output that /ready takes more than 8 seconds when executed at the same time as /stats?format=prometheus. Is that expected?
Thank you for any feedback.
sample-long.sh.txt
test.sh.txt
sample-long.yaml.txt