On-demand DNS resolution #20562
Could you share more details about the case where requests are never actually sent? I haven't heard of that before. Do you mean health checks?
Is there a cache inconsistency issue when …
I mean a user creates a config that results in an Envoy cluster for googleapis.com being sent to a bunch of Envoy instances in the mesh. However, only half of their applications actually call googleapis.com, so we have a bunch of excess DNS requests. Even in the 50% of instances that do depend on googleapis.com, we may call googleapis.com infrequently.
No, this just means we would need to perform a synchronous DNS request.
Could you use the dynamic forward proxy for this? Dynamic forward proxy is effectively on-demand DNS. :)
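Something like the following, roughly (an untested sketch; the cache name and lookup family here are arbitrary choices):

```yaml
# Sketch: HTTP dynamic forward proxy. DNS is resolved on demand, keyed by the
# Host header, through a shared DNS cache. Names are illustrative.
http_filters:
- name: envoy.filters.http.dynamic_forward_proxy
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.dynamic_forward_proxy.v3.FilterConfig
    dns_cache_config:
      name: dynamic_forward_proxy_cache
      dns_lookup_family: V4_ONLY
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
clusters:
- name: dynamic_forward_proxy_cluster
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      dns_cache_config:
        name: dynamic_forward_proxy_cache
        dns_lookup_family: V4_ONLY
```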
Interesting, I hadn't thought of that before, but it does seem close to what we want. One concern is that it only works for HTTP (I think?), and we would likely need a cluster per hostname in a lot of cases to have more control. The dynamic forward proxy could be used in some cases though, I think.
L4 has a variant called the SNI forward proxy which works the same way. I would take a look at that as well. Almost certainly we can make one of these work the way you want.
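Roughly, for TLS traffic (again an untested sketch; the port and names are assumptions):

```yaml
# Sketch: SNI-based dynamic forward proxy for TLS passthrough. The SNI from
# the client hello drives on-demand DNS; tcp_proxy then forwards the bytes.
listener_filters:
- name: envoy.filters.listener.tls_inspector
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector
filter_chains:
- filters:
  - name: envoy.filters.network.sni_dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.sni_dynamic_forward_proxy.v3.FilterConfig
      port_value: 443
      dns_cache_config:
        name: dynamic_forward_proxy_cache
        dns_lookup_family: V4_ONLY
  - name: envoy.filters.network.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: sni_passthrough
      cluster: dynamic_forward_proxy_cluster
```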
We also support just raw TCP without TLS. An example in the Istio API:
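A minimal sketch of such a ServiceEntry (the hostname and VIP here are placeholders, not the original example):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: example-tcp
spec:
  hosts:
  - example.internal     # placeholder hostname
  addresses:
  - 1.2.3.4              # VIP the application dials
  ports:
  - number: 1234
    name: tcp
    protocol: TCP
  resolution: DNS        # becomes a STRICT_DNS cluster in Envoy
```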
What this sets up in Envoy terms is a Listener on 1.2.3.4:1234 which forwards to a STRICT_DNS cluster resolving the configured hostname. That may just be a minor edge case though. Let me play around with the HTTP/SNI forms and see if there are any gaps.
If you really needed it, we could add an option to the "SNI" filter to just hard-code a target if none is supplied or the traffic is not TLS, etc.
It may be generally useful to have a way to delay service discovery until a cluster is used (even if it's EDS, not DNS). I think it's similar to why we have a separate health-check interval for clusters that have never been used.
Update: corrected the linked issue. @ggreenway what's your opinion on my proposal #20873? I'd like to build a notification/callback on …
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions. |
I took a look, and I don't think the semantics and security posture of DFP match the Istio use case very well. I would love to investigate whether we can actually achieve on-demand DNS resolution for these clusters, carving out just a piece of DFP functionality.
@keithmattix When a STRICT_DNS cluster is used for the first time, we trigger a DNS resolution on the first request, hold the request until that resolution completes (and use it to make calls), and then initiate async DNS resolution so that it continues resolving in the background. Is that what you are thinking? Do you plan to work on this?
I just found this issue, and I think this is exactly what we need! … but after registering about 100 services, CoreDNS request volume skyrocketed.
That is only used for DNS queries made by the application. DNS queries that Envoy itself resolves (as in your ServiceEntry case) will still go through CoreDNS.
Sorry I missed this! I'd be happy to work/collaborate on this. I think that direction makes sense because an Envoy that never directs traffic to a cluster will never have async DNS resolution started for that cluster. @ramaraochavali let's talk about how to make this happen |
/assign @keithmattix |
@ggreenway can you or another maintainer reopen this? |
Just found this issue, let me give my 2 cents: …
@ramaraochavali, @keithmattix asked me to work on this issue. Are you still looking into this? |
No. Please go ahead |
Description:
Currently, DNS clusters (LOGICAL_DNS and STRICT_DNS) constantly resolve DNS in the background.
We see fairly common reports of DNS servers being overloaded due to an explosion of these clusters across many Envoy workloads sharing the same DNS server (typically kube-dns).
Often, these workloads send requests to these clusters infrequently or never. However, due to the operational complexity involved in fine-grained configuration, the cluster is still present on an excessive number of Envoy instances. Even with perfect configuration, we may have a service we call once per hour but need to resolve repeatedly.
We use the respect_dns_ttl field, so sometimes this can be fixed with a larger TTL. However, the TTL is not always configurable, and even when it is high, it can still lead to thundering-herd problems.
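For reference, the relevant knobs on such a cluster look roughly like this (values illustrative):

```yaml
# Illustrative STRICT_DNS cluster showing the refresh/TTL knobs.
clusters:
- name: googleapis
  type: STRICT_DNS
  respect_dns_ttl: true    # re-resolve per record TTL rather than a fixed rate
  dns_refresh_rate: 5s     # used when no TTL is available
  load_assignment:
    cluster_name: googleapis
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: googleapis.com
              port_value: 443
```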
It would be ideal if we could support on-demand DNS resolution. This could be implemented in a few ways:
1. (preferred) On the first request to a cluster, do DNS resolution and set a timer to re-resolve (based on the TTL). When the timer fires, re-resolve only if there were any requests during that interval. This means that as long as 1/QPS < TTL (i.e., requests arrive at least once per TTL), we block on DNS only for the first request. If requests are infrequent, we resolve once per request. A sketch of this option follows the list.
2. On the first request to a cluster, start DNS resolution like normal (i.e., kick off the existing periodic background resolution only once the cluster is first used).
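A small Python sketch of option 1, just to pin down the intended behavior (illustrative pseudocode, not Envoy code; all names are made up):

```python
import threading

class LazyDnsCache:
    """Resolve on first use; keep refreshing only while requests keep arriving."""

    def __init__(self, resolve, ttl_seconds):
        self._resolve = resolve   # callable: hostname -> list of addresses
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._entries = {}        # hostname -> [addresses, used_since_refresh]

    def get(self, hostname):
        with self._lock:
            entry = self._entries.get(hostname)
            if entry is not None:
                entry[1] = True   # record the use; keeps the refresh timer alive
                return entry[0]
        # First request for this host: pay one synchronous (blocking) resolution.
        # (A real implementation would coalesce concurrent first lookups.)
        addresses = self._resolve(hostname)
        with self._lock:
            self._entries[hostname] = [addresses, False]
        threading.Timer(self._ttl, self._on_timer, args=(hostname,)).start()
        return addresses

    def _on_timer(self, hostname):
        with self._lock:
            if not self._entries[hostname][1]:
                # Idle for a whole TTL: stop refreshing; the next request pays
                # one synchronous resolution again (resolve once per request).
                del self._entries[hostname]
                return
            self._entries[hostname][1] = False
        addresses = self._resolve(hostname)   # refresh off the request path
        with self._lock:
            self._entries[hostname][0] = addresses
        threading.Timer(self._ttl, self._on_timer, args=(hostname,)).start()

# Example: back the cache with the system resolver.
import socket
cache = LazyDnsCache(lambda host: socket.gethostbyname_ex(host)[2], ttl_seconds=30)
```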
If we had full on-demand CDS, this would look a lot like (2). However, I expect that will take a long time, and even when we have it, this is still likely useful.