Support exposing server readiness state by a file #11397
Comments
Can you please provide a perf trace showing that the main thread is actually blocked processing xDS responses for 15s or longer? I don't really understand how saving state to a file is going to fix this problem if the actual processing time takes that long. It seems like the server simply isn't ready, and you either have to account for that or the performance needs to be improved?
@mattklein123 I think the problem here is not with the initial config but with subsequent config updates. The initial config update would certainly delay pod readiness (correctly so) until the config is obtained, but after the pod is ready, config updates that take longer cause calls to the admin handlers to fail, which might result in this issue. Relevant Istio issues: istio/istio#18046 and istio/istio#23371. I tried to cache the server state in Pilot. But I do think writing to a file is not an ideal solution to this problem, because readiness probes also depend on other stats.
Right, we need to fix the underlying performance problem in some way or increase the timeout. File caching does not make sense. If file caching is desired, it can be implemented by an external process.
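As an illustration of the external-process workaround suggested above, here is a minimal Go sketch of a sidecar that polls the admin endpoint and caches the last successful response in a file. The admin address (Istio's usual 15000) and the file path are assumptions, not anything from this thread:

```go
// Hypothetical external poller: caches the last successful readiness
// result so a busy admin thread doesn't immediately flip the pod to NotReady.
package main

import (
	"io"
	"net/http"
	"os"
	"time"
)

const (
	adminURL  = "http://127.0.0.1:15000/stats?usedonly&filter=server.state" // assumed admin address
	stateFile = "/var/run/envoy/server_state"                               // hypothetical path
)

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	for range time.Tick(5 * time.Second) {
		resp, err := client.Get(adminURL)
		if err != nil {
			continue // admin thread busy; keep the last cached state on disk
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil || resp.StatusCode != http.StatusOK {
			continue
		}
		// Write atomically (temp file + rename) so readers never see a partial write.
		tmp := stateFile + ".tmp"
		if err := os.WriteFile(tmp, body, 0o644); err == nil {
			os.Rename(tmp, stateFile)
		}
	}
}
```

The readiness probe could then read the cached file instead of hitting the admin thread directly, at the cost of the staleness concerns discussed below.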
This problem is actually similar to the one I mentioned before in #10394: the main thread is busy dealing with xDS, so the admin interface fails to respond, resulting in a timeout. I think the core problem is still to find out why xDS processing is slow and fix that. I believe that if the performance problems mentioned in #11362 are solved, then this problem will be solved eventually as well.
I agree this issue could be resolved more fundamentally in better ways. When the Istio control plane has more and more Gateways/VirtualServices/Services, the time Envoy needs to handle an LDS/RDS/CDS response grows longer and longer. Of course the amount of xDS config should stay within a reasonable range, but 2k Gateways/VirtualServices/Services looks reasonable. I tried to increase the timeout value, but once it exceeds 15s I feel this is not the correct direction. IMO, with the current Istio and Envoy implementation, this is a scalability issue. @ramaraochavali could you please share some details about your file caching solution? The method I proposed is not file caching; the idea is that Envoy actively creates and updates a state file. I briefly checked how […]. Of course using a file has some limitations; let's figure out what the best solution is for Istio and Envoy.
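One of the limitations conceded here is freshness: a state file that stops being updated is indistinguishable from a healthy one unless the reader checks for that. A Go sketch of a staleness-aware reader, with a hypothetical file path, an assumed freshness window, and an assumed `server.state: 0` (LIVE) stats format:

```go
// Sketch of the reader side, assuming Envoy (or a helper) rewrites a state
// file periodically. The mtime check is one way to address staleness: a file
// that hasn't been touched recently is treated as "not ready".
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

const stateFile = "/var/run/envoy/server_state" // hypothetical path
const maxAge = 10 * time.Second                 // assumed freshness window

func ready() (bool, error) {
	info, err := os.Stat(stateFile)
	if err != nil {
		return false, err // no file yet: server has never reported ready
	}
	if time.Since(info.ModTime()) > maxAge {
		return false, fmt.Errorf("state file stale, last update %s", info.ModTime())
	}
	data, err := os.ReadFile(stateFile)
	if err != nil {
		return false, err
	}
	// Assumes Envoy's plain-text stats format; 0 is the LIVE state value.
	return strings.TrimSpace(string(data)) == "server.state: 0", nil
}

func main() {
	fmt.Println(ready())
}
```

The mtime check only shifts the problem, of course: something still has to keep touching the file, which is part of the objection raised in the next comment.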
I'm not in favor of writing a file. There are a bunch of issues with this becoming stale. We should just fix the actual issue in some way. In the interim, if you want to write a file, you can do this externally.
I haven't tried file caching. In Pilot, I cached the server state after the initial config is obtained, and from then on reused the cached state instead of looking at the admin handler.
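A rough Go sketch of this caching idea (the names and structure are hypothetical, not Pilot's actual code): once the admin endpoint has reported LIVE, the result is remembered and later probe calls skip the admin API entirely.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"sync"
	"time"
)

type stateCache struct {
	mu       sync.Mutex
	everLive bool
}

func (c *stateCache) isLive(adminURL string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.everLive {
		return true // cached: don't touch the busy admin thread again
	}
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(adminURL)
	if err != nil {
		return false // initial config not ready yet (or admin thread busy)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false
	}
	if strings.TrimSpace(string(body)) == "server.state: 0" { // 0 == LIVE, assumed format
		c.everLive = true
	}
	return c.everLive
}

func main() {
	c := &stateCache{}
	fmt.Println(c.isLive("http://127.0.0.1:15000/stats?usedonly&filter=server.state"))
}
```

The limitation mentioned earlier in the thread still applies: this only works for `server.state`, not for other stats a readiness probe may depend on.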
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.
Description:
We are running Istio with Envoy as the ingress gateway; there can be more than 2k Gateway/VirtualService/Service resources that get translated into xDS config. A problem at this scale is that Envoy may take a long time to handle an xDS response pushed from the control plane, which causes the ingress-gateway pod's readiness probe to time out (it is an admin API call to `/stats?usedonly&filter=server.state`), and the pod is then reported as "Not Ready". We tried increasing the readiness probe timeout to 15s, but that doesn't fundamentally resolve the issue; it may time out again as the amount of xDS config grows. In this case, an admin API call timing out doesn't mean the worker threads aren't working. Currently the main thread handles both xDS responses and admin API calls. One possibility is separating the admin API handler from the main thread, but I guess that might be overkill for this issue. How about exposing the server readiness state in a local file? Then another process (e.g. pilot-agent) could use that file to check Envoy's server state instead of calling the admin API.
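For context, a minimal Go sketch of the kind of admin call the readiness probe makes today (not Istio's actual code; the admin port and the plain-text `server.state: 0` response format are assumptions). Because the admin handler runs on the main thread, a long xDS update makes this call exceed its timeout even while worker threads keep serving traffic:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

func checkReady() error {
	// Assumed admin address; Istio sidecars typically expose it on 15000.
	url := "http://127.0.0.1:15000/stats?usedonly&filter=server.state"
	client := &http.Client{Timeout: 15 * time.Second} // the increased timeout from the report
	resp, err := client.Get(url)
	if err != nil {
		// The failing path: the admin handler shares the main thread with
		// xDS processing, so a large config update makes this time out.
		return fmt.Errorf("admin endpoint not responding: %w", err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	if strings.TrimSpace(string(body)) != "server.state: 0" { // 0 == LIVE
		return fmt.Errorf("server not live: %q", body)
	}
	return nil
}

func main() {
	fmt.Println(checkReady())
}
```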