Support exposing server readiness state by a file #11397

Closed
l8huang opened this issue Jun 2, 2020 · 9 comments
Labels
investigate (Potential bug that needs verification), stale (stalebot believes this issue/PR has not been touched recently)

Comments

@l8huang
Contributor

l8huang commented Jun 2, 2020

Description:

We are running Istio with Envoy as the ingress gateway, where more than 2k Gateways/VirtualServices/Services can get translated into xDS config. The problem at this scale is that Envoy may take a long time to handle an xDS response pushed from the control plane. This causes the ingress-gateway pod's readiness probe (an admin API call to /stats?usedonly&filter=server.state) to time out, and the pod is then reported as "Not Ready". We tried increasing the readiness probe timeout to 15s, but that doesn't fundamentally resolve the issue; it may time out again as the amount of xDS config grows. In this case, an admin API call timing out doesn't mean the worker threads aren't working.
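
For context, the probe in question effectively boils down to an HTTP GET against Envoy's admin endpoint with a short timeout. A rough Python sketch of what the check does (this is not the actual pilot-agent code; the admin port 15000 and the mapping 0 = LIVE are assumptions here):

```python
# Rough sketch of what the readiness check amounts to. NOT the actual
# pilot-agent code; admin port 15000 and "0 == LIVE" are assumptions.
import urllib.request

ADMIN = "http://127.0.0.1:15000"  # assumed Envoy admin address

def envoy_is_live(timeout_s: float = 1.0) -> bool:
    url = f"{ADMIN}/stats?usedonly&filter=server.state"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read().decode()
    except Exception:
        # If the main thread is busy applying a large xDS update, this
        # request can time out and the pod gets reported as "Not Ready".
        return False
    # The admin stats endpoint returns plain text lines like "server.state: 0".
    for line in body.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "server.state":
            return value.strip() == "0"  # 0 assumed to mean LIVE
    return False
```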

Currently the main thread handles both xDS responses and admin API calls. One possibility is separating the admin API handler from the main thread, but I suspect that might be overkill for this issue. How about exposing the server readiness state via a local file? Then another process (e.g. pilot-agent) could use that file to check Envoy's server state instead of calling the admin API.

@mattklein123 added the "question" label on Jun 2, 2020
@mattklein123
Member

Can you please provide a perf trace showing that the main thread is actually blocked processing xDS responses for 15s or longer?

I don't really understand how saving state to a file is going to fix this problem if the actual processing takes that long. It seems like the server simply isn't ready, and you either have to account for that or the performance needs to be improved?

@mattklein123 added the "investigate" label and removed the "question" label on Jun 2, 2020
@ramaraochavali
Contributor

@mattklein123 I think the problem here is not the initial config but subsequent config updates. The initial config update would certainly delay pod readiness (correctly so) until the config is obtained, but after the pod is ready, config updates that take longer cause the calls to the admin handlers to fail, which might result in this issue?

Relevant Istio issues: istio/istio#18046 and istio/istio#23371.

I tried caching server.state after the initial config update, but there were concerns about the cache's validity across hot restarts, so that change was reverted.

But I do think writing to a file is not an ideal solution to this problem, because there are other stats that readiness probes depend on, like worker_started, etc.

@mattklein123
Member

> But I do think writing to a file is not an ideal solution to this problem, because there are other stats that readiness probes depend on, like worker_started, etc.

Right, we need to fix the underlying performance problem in some way or increase the timeout. File caching does not make sense. If file caching is desired it can be implemented by an external process.

@zyfjeff
Member

zyfjeff commented Jun 2, 2020

This problem is actually similar to the one I mentioned before in #10394: the main thread is busy handling xDS and the admin interface fails to respond, resulting in a timeout. I think the core problem to solve is still to find the reason xDS processing is slow and fix it. I believe that if the performance problems mentioned in #11362 can be solved, then this problem will eventually be solved as well.

@l8huang
Contributor Author

l8huang commented Jun 2, 2020

I agree this issue could be fundamentally resolved in better ways:

  • create a dedicated thread to handle admin API calls
  • make the control plane use delta pushes to avoid huge LDS/RDS/CDS responses, and give admin API calls high priority to avoid starvation

As the Istio control plane gets more and more Gateways/VirtualServices/Services, the time needed to handle an LDS/RDS/CDS response becomes longer and longer. Of course the number of xDS configs should stay within a reasonable range, but 2k Gateways/VirtualServices/Services seems reasonable. I tried increasing the timeout value, but once it exceeds 15s, I feel this is not the right direction.

IMO, with the current Istio and Envoy implementations, this is a scalability issue.

@ramaraochavali could you please share some details about your file caching solution?

The method I proposed is not file caching; the idea is that Envoy actively creates and updates a state file. I briefly checked how server.state works in Envoy, and it doesn't change frequently (the k8s container readiness probe also only runs at an interval). Using a file to expose health state is not unusual; k8s also supports using a command (e.g. checking whether a health file exists) in a readiness probe.

Of course using a file has some limitations; let's figure out what the best solution for Istio and Envoy is.
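
To make the proposal concrete, here is a rough sketch of the consumer side. Everything here is hypothetical: Envoy does not write such a file today, and the path, the file contents, and the freshness check are only illustrations of how a command-based readiness probe could use it:

```python
# Hypothetical consumer of the proposed state file. Envoy does NOT write
# this file today; the path and the "LIVE" contents are made up here.
import os
import sys
import time

STATE_FILE = "/var/run/envoy/server_state"  # hypothetical path
MAX_AGE_S = 60  # treat an old file as stale (e.g. the writer died)

def ready() -> bool:
    try:
        with open(STATE_FILE) as f:
            state = f.read().strip()
        age = time.time() - os.path.getmtime(STATE_FILE)
    except OSError:
        return False  # file missing or unreadable
    # Assumes Envoy would refresh the file periodically so consumers can
    # detect staleness via its modification time.
    return state == "LIVE" and age < MAX_AGE_S

if __name__ == "__main__":
    sys.exit(0 if ready() else 1)  # exit code usable by an exec probe
```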

@mattklein123
Member

I'm not in favor of writing a file. There are a bunch of issues with this becoming stale. We should just fix the actual issue in some way. In the interim if you want to write a file you can do this externally.
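
For what it's worth, such an external writer can be a very small sidecar loop that polls the admin endpoint and refreshes a state file. A rough sketch (the admin address, output path, and poll interval are arbitrary, and this is not an endorsement of the file approach):

```python
# Rough sketch of an external "state file" writer, as suggested above.
# The admin address, output path, and poll interval are arbitrary choices.
import os
import time
import urllib.request
from typing import Optional

ADMIN = "http://127.0.0.1:15000"            # assumed Envoy admin address
STATE_FILE = "/var/run/envoy/server_state"  # arbitrary output path

def poll_state(timeout_s: float = 1.0) -> Optional[str]:
    url = f"{ADMIN}/stats?usedonly&filter=server.state"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read().decode()
    except Exception:
        return None  # admin unresponsive; keep the previous file contents
    for line in body.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "server.state":
            return value.strip()
    return None

if __name__ == "__main__":
    while True:
        state = poll_state()
        if state is not None:
            # Write atomically so readers never see a half-written file.
            tmp = STATE_FILE + ".tmp"
            with open(tmp, "w") as f:
                f.write("LIVE\n" if state == "0" else "NOT_LIVE\n")
            os.replace(tmp, STATE_FILE)
        time.sleep(5)
```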

@ramaraochavali
Contributor

> could you please share some details about your file caching solution?

I haven't tried file caching. In Pilot, I cached the server state after the initial config was obtained, and from then on reused the cached stat instead of looking at the admin handler.

@stale

stale bot commented Jul 3, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale bot added the "stale" label on Jul 3, 2020
@stale

stale bot commented Jul 12, 2020

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

stale bot closed this as completed on Jul 12, 2020