Slow /stats endpoint and blocking /ready endpoint when hitting /stats endpoint #16425

Open

jfrabaute opened this issue May 11, 2021 · 28 comments

Labels: area/admin, area/stats, bug, no stalebot (disables stalebot from closing an issue)

Comments

@jfrabaute commented May 11, 2021

I am probably reporting two distinct problems:

1/ /stats?format={json|prometheus} slowness
2/ /ready endpoint blocked by the /stats endpoint when both run at the same time.

Maybe I should create two issues, but I'm starting with this one to get feedback from the Envoy team.

We have some Envoy instances with a lot of mappings / big configs (between 5,000 and 10,000 mappings).
To reproduce the problem, I created a sample Envoy config with 7,000 mappings, so it's easy to test.

Problem 1:

With this number of mappings, the plain /stats endpoint takes a bit more than 1 second.
/stats?format={json|prometheus} (either JSON or Prometheus format) takes more than 6 seconds!
That is a more-than-5x difference between plain /stats and the two formatted variants.
Is that expected?

Problem 2:

When /stats is called and another client issues a /ready request, the /ready request seems to be stuck until the /stats request is done.
So if /stats?format=prometheus takes 7 seconds (for example) and a client sends a /ready request at the beginning of that 7-second window, the /ready request also takes 7 seconds.
This generates a bunch of monitoring problems, especially because we are using Ambassador: its /check_ready endpoint is a wrapper around the Envoy /ready endpoint and has a 2-second timeout (https://github.com/datawire/ambassador/blob/d1a8b1ca89d878b4c8722f51f2479028288b747e/pkg/acp/envoy.go#L61).

So, if Prometheus is scraping at the same time, the readiness check fails.

Repro steps:

(All attachments have an extra .txt extension that should be removed.)
I am attaching the "sample-long.yaml" config with the 7,000 mappings.
I am also attaching the "sample-long.sh" basic bash script that generates the sample config (if needed).
Here is the command I'm using to run Envoy locally:

docker run --rm --network=host \
    -v $(pwd)/sample-long.yaml:/sample.yaml \
    -ti envoyproxy/envoy:v1.18.2 \
    --config-path /sample.yaml
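
For reference, each of the 7,000 mappings in the generated config is essentially the same route/cluster pair repeated under a different name. A minimal sketch of the shape (names and the backend port are illustrative assumptions; the attached sample-long.yaml is authoritative):

# Hypothetical shape of one generated mapping (repeated ~7,000 times).
routes:
- match: { prefix: "/service-0001/" }
  route: { cluster: service-0001 }
clusters:
- name: service-0001
  connect_timeout: 1s
  type: STATIC
  load_assignment:
    cluster_name: service-0001
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: 127.0.0.1, port_value: 8080 }

(In a real bootstrap the routes live under the HCM's virtual host and the clusters under static_resources; this only shows the repeated unit.)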

I'm attaching the test.sh script that tests the performance of the different endpoints.

Here is the output of this test.sh script when run on my laptop, which is otherwise idle; Envoy is handling no traffic except the tests.

> ./test.sh
*********************
*** TEST ready endpoint (it's fast)
0.01user 0.00system 0:00.01elapsed 92%CPU (0avgtext+0avgdata 11860maxresident)k
0inputs+0outputs (0major+650minor)pagefaults 0swaps

*********************
*** TEST stats+ready endpoint (ready endpoint is going to be slow because it's probably locked waiting for stat endpoint)
READY timing (slow :-( ):
0.00user 0.00system 0:08.14elapsed 0%CPU (0avgtext+0avgdata 12152maxresident)k
0inputs+0outputs (0major+660minor)pagefaults 0swaps
---------------
STATS timing:
0.00user 0.02system 0:08.76elapsed 0%CPU (0avgtext+0avgdata 12116maxresident)k
0inputs+0outputs (0major+683minor)pagefaults 0swaps

*********************
*** TEST stat endpoints (basic one: around 1 second)
0.00user 0.01system 0:01.12elapsed 1%CPU (0avgtext+0avgdata 12060maxresident)k
0inputs+0outputs (0major+677minor)pagefaults 0swaps

*********************
*** TEST stat endpoints (json one: more than 5x than basic one)
0.00user 0.01system 0:04.95elapsed 0%CPU (0avgtext+0avgdata 12276maxresident)k
0inputs+0outputs (0major+680minor)pagefaults 0swaps

*********************
*** TEST stat endpoints (prometheus one: more than 5x than basic one)
0.01user 0.01system 0:09.32elapsed 0%CPU (0avgtext+0avgdata 11840maxresident)k
0inputs+0outputs (0major+672minor)pagefaults 0swaps

You can see in the output that /ready takes more than 8 seconds when executed at the same time as /stats?format=prometheus.
Is that expected?

Thank you for any feedback.

sample-long.sh.txt

test.sh.txt

sample-long.yaml.txt

jfrabaute added the bug and triage (issue requires triage) labels on May 11, 2021
@antoniovicente (Contributor) commented May 11, 2021

cc @htuch @jmarantz

IIRC the main thread is used for admin handlers like /stats and /ready, and also for config processing from xDS. If any of those are slow, the rest will be delayed.

antoniovicente added the area/admin label and removed the triage (issue requires triage) label on May 11, 2021
@snowp (Contributor) commented May 11, 2021

I've seen people recommend against hitting an admin endpoint as a readiness check for this reason (I think from @howardjohn?). The fact that it might be delayed by xDS responses makes it a poor choice for periodic health checks.

@antoniovicente (Contributor):

It would be good to ensure that config updates can proceed even if admin requests are consuming significant CPU.
Also, it would be good to allow multiple admin requests to make progress in parallel with each other and in parallel with config; a large config reload shouldn't block /stats scrapes and vice versa.

@howardjohn (Contributor):

Yeah, for Istio we hit this a lot. We ended up using it just for startup (at which point /stats is probably not huge, and any large xDS load is probably Envoy actually initializing). From then on, we just hit a normal (non-admin) listener, so the requests go through Envoy's worker threads, not the main thread.

@jmarantz (Contributor):

See also #16139 (comment)

I think if we streamed out large HTTP responses from admin, it would take less CPU/memory and, depending on how we managed events, might not block other admin requests from running during the streaming process.

@jfrabaute (Author):

Thanks for the feedback.

Regarding health checks: so the /ready endpoint should not be used. (We should probably loop in the Ambassador folks here, because this is what they use and they seem to expect /ready to be fast. Adding @kflynn at least.)
So, what is the recommended health check endpoint for Envoy? One of the regular exposed endpoints (so a worker thread, not the admin thread)?

Regarding the /stats slowness with the JSON and Prometheus formats: it looks like the admin thread is synchronous and can only handle one request at a time, is that correct? So, when doing the two requests /ready and /stats, one is simply waiting for the other to finish. Correct? (That is, it's not related to any locking; I was wrong about that.)
In that case, yeah, there is not much to be done.

One extra question about the JSON and Prometheus formats: is it expected that asking for those formats makes the request much slower? Is the transformation step from the native stats format to another format really so expensive that it makes the request 5x slower?

@jmarantz (Contributor):

Right now each request is entirely blocking, which is fine for tiny requests like /ready. But if we changed the stats handlers to stream data out, and if we didn't buffer and sort, I think the admin port could service /ready in between chunks of a /stats response.

@jfrabaute (Author):

> It would be good to ensure that config updates can proceed even if admin requests are consuming significant CPU.
> Also, it would be good to allow multiple admin requests to make progress in parallel with each other and in parallel with config; a large config reload shouldn't block /stats scrapes and vice versa.

Yes, we're hitting the scrape problem as well when the config is reloaded (so that's a third problem 😞).
When a config is reloading, the three readiness/liveness/stats endpoints are all slow; they time out and things start to go bad. We did increase the timeout, but that's not really a good solution.

Are there any workarounds/options here with the current admin thread design (especially for the blocking behavior during a config reload)?

@antoniovicente (Contributor):

I don't think there is a workaround. I think it is a fairly difficult problem to solve, since certain data structures are only safe to access from the main thread; hence the restriction that config reload and admin handlers run on the main thread.

@jmarantz (Contributor):

I guess there are two approaches:

  1. try to do some of the admin work on worker threads
  2. keep using the main thread, but stream data out asynchronously rather than buffering all of it

A lot of the work that's being done now may not be needed by all users, such as sorting the stats output. Even if we do sort, we don't need to buffer the serialized form (which is what happens now); we can just buffer the vector of stat pointers, or at least shared_ptrs, so that we can survive a stat being dropped while we stream it out.

@antoniovicente (Contributor):

Doing stats operations that take 6 seconds on workers is not acceptable either. Ideally we'd do very little on the main thread, just the things that need to go through it for ordering reasons. Admin requests should be served by a thread pool, and config processing should happen on its own set of threads.

@jmarantz (Contributor):

Sorry if I wasn't clear; I was proposing the second option: break up the stats streaming work and do it asynchronously on the main thread. I think we'd do a little less total work by not buffering up the full serialized form (and maybe not sorting), but that might be too optimistic. Even if it's the same 6s of work, doing it in 100 chunks of 60ms rather than all at once would have less latency impact on /ready etc.

I'm not opposed to having other threads dedicated to admin processing. But I'd be a little worried about the potential impact on request latency at (say) p99 or p95 of having more runnable threads than physical cores. We could also reserve a few cores for admin processing, depending on the setup, but that might not work well for everyone. So I thought async-chunks-on-main-thread might be a good way to go.

@htuch (Member) commented May 12, 2021

I think for stats it should be relatively plausible to push this to a distinct thread; given that we already have stats sinks that gather this data, can't we have a pseudo-sink that dumps to some shared memory buffer for a consumer thread that can handle these requests?

This doesn't solve the more general problem of /healthz latency and config-update interference.

@jmarantz (Contributor):

Yeah, you can collect the stats in vectors of shared_ptrs on the main thread and then stream them out from another. But I think we should stream them out in chunks regardless, rather than buffering up all the text before sending a response. That would solve the memory-burst problem per #16139.

Once we are streaming, we can make it async and see whether we actually need to add a new thread.

@jtway (Contributor) commented May 12, 2021

> I've seen people recommend against hitting an admin endpoint as a readiness check for this reason (I think from @howardjohn?). The fact that it might be delayed by xDS responses makes it a poor choice for periodic health checks.

@snowp @jmarantz By the way, this is part of what #15876 addresses, to a small degree. One of the larger issues, for us at least, was that once there is a large number of VirtualHosts (>80K), even a single VirtualHost add/delete via VHDS caused delays in /stats, /ready, etc. This seems, to a large degree, to stem from the way RouteMatcher and VirtualHosts are handled.

Out of curiosity, @jfrabaute, are you using VHDS? Do you have many VirtualHosts?

@jfrabaute (Author):

> Out of curiosity, @jfrabaute, are you using VHDS? Do you have many VirtualHosts?

No idea. We're using Ambassador, which takes care of managing the Envoy config and lifecycle. I don't think Ambassador uses VHDS, but I'm not sure. I'll ask the Ambassador team.

Regarding the problem: it looks like the /ready vs. /stats issue is only one small part of the blocking problem.
The config reload seems to be an even bigger problem, as it blocks both the /ready and /stats endpoints.
Is there a plan to change this (config reload) so the two other endpoints are not blocked?

@jfrabaute (Author):

Answer: Ambassador does not use VHDS.

@jtway (Contributor) commented May 12, 2021

Okay, good to know it's not limited to VHDS. I'd be curious to see a flamegraph, but I'm just a fly on the wall here.

@jmarantz (Contributor):

I am not aware of a plan to allow /ready to run concurrently with config reload, but it seems like this should also be possible; maybe more difficult than stats, though.

@jfrabaute (Author):

Could those two endpoints, /ready and /stats, run in a secondary HTTP server (listening on a different port)?
/ready can return stale information, or just read an atomic value.
For stats, it's probably already possible to read them in parallel; if a value is stale, it's not a big deal, the next scrape will pick up the update.
This way, they are moved off the admin thread/server and can live their own life.
That's one more "control" thread to manage, but it might make sense to isolate them to keep them fast, as they are expected to be used by external services monitoring Envoy. And it does not touch the admin thread, which is important, especially for config management.

@jmarantz (Contributor):

All of that can be done! It's just a matter of someone doing it.

To simplify: is the goal just to make /ready always fast, and would you be willing to have that on a different port with its own (very cheap) thread?

That project is nice because it doesn't have to touch the complexity of stats and config updates. But it would involve plumbing the declaration of a new port through the API. Other @envoyproxy/maintainers may have better ideas as well.

@snowp (Contributor) commented May 12, 2021

Another strategy might be to expose whatever relevant information you want out of /ready via an HTTP filter, allowing users to set up a filter chain on a distinct port that can handle this. That wouldn't require any core API changes.

This somewhat begs the question of what information /ready uses to declare readiness: much of it, I expect, just depends on the server having started, so anything handling traffic on a worker thread would already be gated on that. Does this just boil down to a direct response from the HCM?

@jfrabaute (Author):

@jmarantz:

> All of that can be done! It's just a matter of someone doing it.

Yeah... I totally see what you mean here.

> To simplify: is the goal just to make /ready always fast, and would you be willing to have that on a different port with its own (very cheap) thread?

That's the immediate goal for me, but that might just be a narrow view of a larger problem, hence creating this issue to get an idea of the possible bigger picture.
If, at least in the near/mid term, most people would benefit from having this change for /ready and /stats only, that could be a first step.

> That project is nice because it doesn't have to touch the complexity of stats and config updates.

Yes, that's the idea behind my proposal. I had a quick look at the code and, knowing close to nothing about it, the admin thread already seems like a complex piece of code. So isolating this change and not touching the admin thread seemed like an interesting option, and since it is just for monitoring, it could make sense as a first iteration deliverable in a reasonable amount of time.

> But it would involve plumbing the declaration of a new port through the API. Other @envoyproxy/maintainers may have better ideas as well.

@snowp:

> Another strategy might be to expose whatever relevant information you want out of /ready via an HTTP filter, allowing users to set up a filter chain on a distinct port that can handle this. That wouldn't require any core API changes.

That seems interesting. It's similar to what I'm proposing (IIUC), but even simpler, as it's just exposing the info through the worker threads.
But I don't know if Ambassador provides the level of flexibility where we can manipulate the HTTP filters. I'll ask the Ambassador team; that part is not related to Envoy, though Ambassador could make changes to enable this.

@moderation (Contributor) commented May 12, 2021

I need to check whether we are experiencing delays on stats scrapes, but I suspect we are. We have a similar scenario, but using Contour for ingress on Kubernetes as opposed to Ambassador. It's a multi-tenant cluster and can end up hosting a lot of routes, clusters, etc.

This has me thinking about some @mattklein123 tweets about push vs. pull - https://twitter.com/mattklein123/status/1328559009633239040, https://twitter.com/mattklein123/status/1266010765669961729

I plan to look into changing the model from Prometheus scraping /stats to a push model using the StatsdSink: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/metrics/v3/stats.proto#extension-envoy-stat-sinks-statsd

In theory this approach wouldn't see the periodic load when the scrape occurs. Keen on advice or gotchas, and on whether a statsd approach would help with the blocking/contention.
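
For reference, a minimal bootstrap sketch of that push model (the statsd address, prefix, and flush interval here are illustrative placeholders, not values from this thread):

stats_sinks:
- name: envoy.stat_sinks.statsd
  typed_config:
    "@type": type.googleapis.com/envoy.config.metrics.v3.StatsdSink
    address:
      socket_address: { address: 127.0.0.1, port_value: 8125 }  # local statsd agent
    prefix: envoy  # prepended to every metric name
stats_flush_interval: 10s  # how often Envoy flushes counters/gauges to sinks

One caveat, if I understand the flush model correctly: sink flushes are still driven from the main thread, so this moves work off the scrape path rather than removing main-thread stats work entirely.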

@htuch (Member) commented May 13, 2021

One thing to add to this already long and involved discussion: directionally, we want to head toward the admin port not being special in any particular way vs. regular listeners. This has a ton of advantages around security and consistency. It might matter if we start adding new ports (which I don't think should be the preferred approach; it's not great to fix an implementation issue with an API one).

@jfrabaute (Author):

One question: I'm looking at HTTP filters, and it looks like a health check HTTP filter already exists:
https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/health_check_filter
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/health_checking#arch-overview-health-checking-filter

Would that HTTP filter solve the problem if used on a dedicated listener for monitoring systems like k8s readiness/liveness probes?

If so, that would be a first step (/stats being the second one).
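
For illustration, a minimal sketch of such a dedicated listener (the port, names, and 404 fallback route are assumptions; the filter config follows the docs linked above, using the header-matcher syntax of the v1.18 era):

# Sketch: a dedicated readiness listener served by worker threads, using the
# health check HTTP filter in non-pass-through mode.
listeners:
- name: ready_listener
  address:
    socket_address: { address: 0.0.0.0, port_value: 8002 }
  filter_chains:
  - filters:
    - name: envoy.filters.network.http_connection_manager
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
        stat_prefix: ready
        route_config:
          virtual_hosts:
          - name: ready
            domains: ["*"]
            routes:
            - match: { prefix: "/" }
              direct_response: { status: 404 }  # anything that isn't /ready
        http_filters:
        - name: envoy.filters.http.health_check
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
            pass_through_mode: false  # the filter answers directly, no upstream
            headers:
            - name: ":path"
              exact_match: "/ready"
        - name: envoy.filters.http.router
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

Because this listener runs on the worker pool, it should keep answering during config reloads and long admin requests, which is essentially what the emissary commit below ended up doing.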

jfrabaute added a commit to jfrabaute/ambassador that referenced this issue Mar 1, 2022:

The /ready endpoint used by emissary uses the admin port (8001 by default). This causes a problem during config reloads with large configs: the admin thread blocks, so the /ready endpoint can be very slow to answer (on the order of several seconds, or even more).

The problem is described in this envoy issue:
envoyproxy/envoy#16425

This change tries to fix the /ready endpoint problem. The /ready endpoint can be exposed on the worker pool by adding a listener + health check http filter. This way, the /ready endpoint is fast and is not blocked by any config reload or blocking admin operation, since it depends only on the worker pool.

Future changes will allow diagd and the Go code to use this endpoint as well, so that they get a fast /ready endpoint and do not use the admin port.

This listener is disabled by default. The config "read_port" can be used to set the port and enable this new listener on envoy.

Signed-off-by: Fabrice Rabaute <[email protected]>
The same commit, with an identical message, was subsequently pushed by jfrabaute, tsuna, tomasbanet, and LanceEa to various emissary forks (tsuna/emissary, tomasbanet/emissary, aristanetworks/emissary) and to emissary-ingress/emissary between May 2022 and February 2023.