Support exposing server readiness state by a file #11397

Closed
l8huang opened this issue Jun 2, 2020 · 9 comments
Labels
investigate (Potential bug that needs verification), stale (stalebot believes this issue/PR has not been touched recently)

Comments

@l8huang
Contributor

l8huang commented Jun 2, 2020

Description:

We are running Istio with Envoy as the ingress gateway, where more than 2k Gateways/VirtualServices/Services can get translated into xDS config. The problem at this scale is that Envoy may take a long time to handle an xDS response pushed from the control plane. This causes the ingress-gateway pod's readiness probe (an admin API call to /stats?usedonly&filter=server.state) to time out, and the pod is then reported as "Not Ready". We tried increasing the readiness probe timeout to 15s, but that doesn't fundamentally resolve the issue; it may time out again as the amount of xDS config grows. In this case, an admin API call timing out doesn't mean the worker threads aren't working.
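
For context, the probe in question effectively boils down to an HTTP GET against Envoy's admin endpoint with a short timeout. A rough Python sketch of what the check does (this is not the actual pilot-agent code; the admin port 15000 and the mapping 0 = LIVE are assumptions here):

```python
# Rough sketch of what the readiness check amounts to. NOT the actual
# pilot-agent code; admin port 15000 and "0 == LIVE" are assumptions.
import urllib.request

ADMIN = "http://127.0.0.1:15000"  # assumed Envoy admin address

def envoy_is_live(timeout_s: float = 1.0) -> bool:
    url = f"{ADMIN}/stats?usedonly&filter=server.state"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read().decode()
    except Exception:
        # If the main thread is busy applying a large xDS update, this
        # request can time out and the pod gets reported as "Not Ready".
        return False
    # The admin stats endpoint returns plain text lines like "server.state: 0".
    for line in body.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "server.state":
            return value.strip() == "0"  # 0 assumed to mean LIVE
    return False
```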

Currently the main thread handles both xDS responses and admin API calls. One possibility is separating the admin API handler from the main thread, but I suspect that might be overkill for this issue. How about exposing the server readiness state via a local file? Then another process (e.g. pilot-agent) could use that file to check Envoy's server state instead of calling the admin API.

@mattklein123 added the "question" label on Jun 2, 2020
@mattklein123
Member

Can you please provide a perf trace showing that the main thread is actually blocked processing xDS responses for 15s or longer?

I don't really understand how saving state to a file is going to fix this problem if the actual processing takes that long. It seems like the server simply isn't ready, and you either have to account for that or the performance needs to be improved?

@mattklein123 added the "investigate" label and removed the "question" label on Jun 2, 2020
@ramaraochavali
Contributor

@mattklein123 I think the problem here is not the initial config but subsequent config updates. The initial config update would certainly delay pod readiness (correctly so) until the config is obtained, but after the pod is ready, config updates that take longer cause the calls to the admin handlers to fail, which might result in this issue?

Relevant Istio issues: istio/istio#18046 and istio/istio#23371.

I tried caching server.state after the initial config update, but there were concerns about the cache's validity across hot restarts, so that change was reverted.

But I do think writing to a file is not an ideal solution to this problem, because there are other stats that readiness probes depend on, like worker_started, etc.

@mattklein123
Member

> But I do think writing to a file is not an ideal solution to this problem, because there are other stats that readiness probes depend on, like worker_started, etc.

Right, we need to fix the underlying performance problem in some way or increase the timeout. File caching does not make sense. If file caching is desired it can be implemented by an external process.

@zyfjeff
Member

zyfjeff commented Jun 2, 2020

This problem is actually similar to the one I mentioned before in #10394: the main thread is busy handling xDS and the admin interface fails to respond, resulting in a timeout. I think the core problem to solve is still to find the reason xDS processing is slow and fix it. I believe that if the performance problems mentioned in #11362 can be solved, then this problem will eventually be solved as well.

@l8huang
Contributor Author

l8huang commented Jun 2, 2020

I agree this issue could be fundamentally resolved in better ways:

  • create a dedicated thread to handle admin API calls
  • make the control plane use delta pushes to avoid huge LDS/RDS/CDS responses, and give admin API calls high priority to avoid starvation

As the Istio control plane gets more and more Gateways/VirtualServices/Services, the time needed to handle an LDS/RDS/CDS response becomes longer and longer. Of course the number of xDS configs should stay within a reasonable range, but 2k Gateways/VirtualServices/Services seems reasonable. I tried increasing the timeout value, but once it exceeds 15s, I feel this is not the right direction.

IMO, with the current Istio and Envoy implementations, this is a scalability issue.

@ramaraochavali could you please share some details about your file caching solution?

The method I proposed is not file caching; the idea is that Envoy actively creates and updates a state file. I briefly checked how server.state works in Envoy, and it doesn't change frequently (the k8s container readiness probe also only runs at an interval). Using a file to expose health state is not unusual; k8s also supports using a command (e.g. checking whether a health file exists) in a readiness probe.

Of course using a file has some limitations; let's figure out what the best solution for Istio and Envoy is.
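
To make the proposal concrete, here is a rough sketch of the consumer side. Everything here is hypothetical: Envoy does not write such a file today, and the path, the file contents, and the freshness check are only illustrations of how a command-based readiness probe could use it:

```python
# Hypothetical consumer of the proposed state file. Envoy does NOT write
# this file today; the path and the "LIVE" contents are made up here.
import os
import sys
import time

STATE_FILE = "/var/run/envoy/server_state"  # hypothetical path
MAX_AGE_S = 60  # treat an old file as stale (e.g. the writer died)

def ready() -> bool:
    try:
        with open(STATE_FILE) as f:
            state = f.read().strip()
        age = time.time() - os.path.getmtime(STATE_FILE)
    except OSError:
        return False  # file missing or unreadable
    # Assumes Envoy would refresh the file periodically so consumers can
    # detect staleness via its modification time.
    return state == "LIVE" and age < MAX_AGE_S

if __name__ == "__main__":
    sys.exit(0 if ready() else 1)  # exit code usable by an exec probe
```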

@mattklein123
Member

I'm not in favor of writing a file. There are a bunch of issues with this becoming stale. We should just fix the actual issue in some way. In the interim if you want to write a file you can do this externally.
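
For what it's worth, such an external writer can be a very small sidecar loop that polls the admin endpoint and refreshes a state file. A rough sketch (the admin address, output path, and poll interval are arbitrary, and this is not an endorsement of the file approach):

```python
# Rough sketch of an external "state file" writer, as suggested above.
# The admin address, output path, and poll interval are arbitrary choices.
import os
import time
import urllib.request
from typing import Optional

ADMIN = "http://127.0.0.1:15000"            # assumed Envoy admin address
STATE_FILE = "/var/run/envoy/server_state"  # arbitrary output path

def poll_state(timeout_s: float = 1.0) -> Optional[str]:
    url = f"{ADMIN}/stats?usedonly&filter=server.state"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read().decode()
    except Exception:
        return None  # admin unresponsive; keep the previous file contents
    for line in body.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "server.state":
            return value.strip()
    return None

if __name__ == "__main__":
    while True:
        state = poll_state()
        if state is not None:
            # Write atomically so readers never see a half-written file.
            tmp = STATE_FILE + ".tmp"
            with open(tmp, "w") as f:
                f.write("LIVE\n" if state == "0" else "NOT_LIVE\n")
            os.replace(tmp, STATE_FILE)
        time.sleep(5)
```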

@ramaraochavali
Contributor

> could you please share some details about your file caching solution?

I haven't tried file caching. In Pilot, I cached the server state after the initial config was obtained, and from then on reused the cached stat instead of looking at the admin handler.

@stale

stale bot commented Jul 3, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale bot added the "stale" label on Jul 3, 2020
@stale

stale bot commented Jul 12, 2020

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

stale bot closed this as completed on Jul 12, 2020