
On-demand DNS resolution #20562

Open
howardjohn opened this issue Mar 28, 2022 · 30 comments
Assignees
Labels
area/dns enhancement Feature requests. Not bugs or questions.

Comments

@howardjohn
Contributor

Currently, DNS-based clusters (LOGICAL_DNS and STRICT_DNS) constantly resolve DNS in the background.

We see fairly common reports of DNS servers being overloaded when a large number of these clusters exists across many Envoy workloads sharing the same DNS server (typically kube-dns).

Often, these workloads send requests to these clusters infrequently or never. However, due to the operational complexity of maintaining fine-grained configuration, the cluster is still present on an excessive number of Envoy instances. Even with perfect configuration, we may have a service we call once per hour but need to resolve repeatedly.

We use the respect_dns_ttl field, so sometimes this can be fixed with a larger TTL. However, the TTL is not always configurable, and even a high TTL can still lead to thundering-herd problems.
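For context (not from the issue text), these are the knobs on a DNS cluster today, per the Envoy v3 Cluster API; the cluster name and values here are illustrative:

```yaml
clusters:
- name: example_dns_cluster          # illustrative name
  type: STRICT_DNS
  respect_dns_ttl: true              # honor the TTL returned by the DNS server
  dns_refresh_rate: 5s               # refresh interval used when no TTL applies
```

Even with respect_dns_ttl set, resolution still happens continuously in the background at whatever cadence the TTL dictates, regardless of whether the cluster receives traffic.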

It would be ideal if we could support on-demand DNS resolution. This could be implemented in a few ways:

  1. (preferred) On the first request to a cluster, do DNS resolution and set a timer to re-resolve (based on the TTL). Once the timer fires, re-resolve only if there were any requests during this period. This means that as long as 1/QPS > TTL, we are blocked on DNS only for the first request. If we have infrequent requests, we only resolve once per request.

  2. On the first request to a cluster, start DNS resolution like normal.

If we have full on-demand CDS, this would look a lot like (2). However, I expect that will take a long time, and even when we have it, this is still likely useful.

@howardjohn howardjohn added enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Mar 28, 2022
@daixiang0 daixiang0 added area/dns and removed triage Issue requires triage labels Mar 29, 2022
@daixiang0
Member

daixiang0 commented Mar 29, 2022

Could you share more details about the case where requests are never actually sent? I have not heard of it before. Do you mean health checks?

This means that as long as 1/QPS > TTL, we are blocked on DNS only for the first request. If we have infrequent requests, we only resolve once per request.

Is there a cache inconsistency issue when 1/QPS > TTL if I understand correctly?

@howardjohn
Contributor Author

I mean a user creates a config that results in an Envoy cluster for googleapis.com being sent to a bunch of Envoy instances in the mesh. However, only half of their applications actually call googleapis.com, so we have a bunch of excess DNS requests.

Even in the 50% of instances we do depend on googleapis.com, we may call googleapis.com infrequently.

Is there a cache inconsistency issue when 1/QPS > TTL if I understand correctly?

No, this just means we would need to perform a synchronous DNS request.

@mattklein123
Member

Could you use dynamic forward proxy for this? Dynamic forward proxy is effectively on demand DNS. :)
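For readers unfamiliar with the feature: a rough sketch of a dynamic forward proxy setup, based on the shape of the upstream Envoy v3 APIs. The cache name is illustrative, and this is a fragment, not a complete bootstrap config:

```yaml
# HTTP filter that resolves the request's host header on demand,
# pausing the request until resolution completes.
http_filters:
- name: envoy.filters.http.dynamic_forward_proxy
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.dynamic_forward_proxy.v3.FilterConfig
    dns_cache_config:
      name: dynamic_forward_proxy_cache_config   # illustrative cache name
      dns_lookup_family: V4_ONLY

# Paired cluster that serves endpoints out of the same DNS cache.
clusters:
- name: dynamic_forward_proxy_cluster
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      dns_cache_config:
        name: dynamic_forward_proxy_cache_config
        dns_lookup_family: V4_ONLY
```

The key property for this issue is that resolution is triggered by traffic rather than by cluster creation.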

@howardjohn
Contributor Author

Interesting, I hadn't thought of that before, but it does seem close to what we want. One concern is that it only works for HTTP (I think?), and we would likely need cluster-per-hostname in many cases to have more control. The dynamic forward proxy could be used in some cases, though, I think.

@mattklein123
Member

L4 has a variant called the SNI forward proxy which works the same way. I would take a look at that also. Almost certainly we can make one of these work how you want it to.
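A rough sketch of what the L4 variant looks like, assuming the Envoy SNI dynamic forward proxy network filter; the upstream port, cache name, and cluster name are illustrative. The tls_inspector listener filter is what extracts the SNI used for resolution:

```yaml
listener_filters:
- name: envoy.filters.listener.tls_inspector
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector
filter_chains:
- filters:
  # Resolves the SNI hostname on demand before proxying bytes.
  - name: envoy.filters.network.sni_dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.sni_dynamic_forward_proxy.v3.FilterConfig
      port_value: 443                              # assumed upstream port
      dns_cache_config:
        name: dynamic_forward_proxy_cache_config   # illustrative cache name
        dns_lookup_family: V4_ONLY
  - name: envoy.filters.network.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: sni_proxy
      cluster: dynamic_forward_proxy_cluster
```

Because this keys on SNI, it only helps for TLS traffic, which is exactly the gap discussed below for raw TCP.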

@howardjohn
Contributor Author

We also support just raw TCP without TLS. An example in the Istio API:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-database
spec:
  addresses:
  - 1.2.3.4
  ports:
  - number: 1234
  resolution: DNS
  endpoints:
  - address: some-sql-instance.aws.com

What this sets up in Envoy terms is a Listener on 1.2.3.4:1234 which forwards to a STRICT_DNS cluster resolving some-sql-instance.aws.com.
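A sketch of that translation in Envoy config terms, under the assumption that the upstream port matches the listener port; the listener, cluster, and stat names are illustrative:

```yaml
static_resources:
  listeners:
  - name: external-database
    address:
      socket_address: { address: 1.2.3.4, port_value: 1234 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: external_database
          cluster: outbound_external-database
  clusters:
  - name: outbound_external-database
    type: STRICT_DNS              # resolved continuously in the background today
    respect_dns_ttl: true
    load_assignment:
      cluster_name: outbound_external-database
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: some-sql-instance.aws.com, port_value: 1234 }
```

There is no TLS here, so there is no SNI for an SNI-based forward proxy to key on.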

That may just be a minor edge case, though. Let me play around with the HTTP/SNI forms and see if there are any gaps.

@mattklein123
Member

If you really needed it, we could add an option to the "SNI" filter to just hard-code a target if no SNI is supplied or the connection is not TLS, etc.

@ggreenway
Contributor

It may be generally useful to have a way to delay service discovery until a cluster is used (even if it's EDS, not DNS). I think it's similar to why we have a separate healthcheck interval for clusters that have never been used.

@lambdai
Contributor

lambdai commented Apr 27, 2022

Update: corrected the linked issue.
I previously explored delaying the EDS request until an upstream connection is desired. There is huge complexity there, and the gain is marginal on top of on-demand CDS.

@ggreenway what's your opinion on my proposal #20873?

I'd like to build a notification/callback for endpoint readiness in the Cluster. The initial goal was to resolve the race condition in the current on-demand CDS, and my hunch is that delayed discovery is another potential use case.

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label May 28, 2022
@github-actions

github-actions bot commented Jun 4, 2022

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@keithmattix
Contributor

keithmattix commented Jul 1, 2024

I took a look, and I don't think the semantics and security posture of DFP match the Istio use case very well. Would love to investigate if we can actually achieve on-demand DNS resolution for these clusters, carving out just a piece of DFP functionality.

@ramaraochavali
Contributor

Would love to investigate if we can actually achieve on-demand DNS resolution for these clusters, carving out just a piece of DFP functionality.

@keithmattix When a STRICT_DNS cluster is used for the first time, we trigger a DNS resolution on the first request, hold the request, and then initiate async DNS resolution (using the first resolution's result to make calls) so that resolution continues in the background. Is that what you are thinking? Do you plan to work on this?

@nessa829

I just found this issue, and I think this is exactly what we need!
Due to architectural history and issues, despite being in the same cluster, all communication is routed through domains registered in Route 53 for now. As a result, traffic passes unnecessarily through the NLB / ingress gateway for communication between microservices.
To address this easily, I tried introducing ServiceEntries as below.

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: sre-test-entry
  namespace: sre-test
spec:
  endpoints:
  - address: sre-test.sre-test.svc.cluster.local
  hosts:
  - sre-test.example.com
  location: MESH_INTERNAL
  ports:
  - name: http
    number: 80
    protocol: HTTP
  resolution: DNS

but after registering about 100 services, CoreDNS request volume skyrocketed.
I do not understand why this behavior is intended.

This happens even if the proxy never sends any requests to these applications.

@ramaraochavali
Contributor

I do not understand why this is intended.

That is only used for DNS queries made by the application. DNS queries that are resolved by Envoy itself (as in your ServiceEntry case) will go through CoreDNS.

@keithmattix
Contributor

keithmattix commented Jul 16, 2024

Would love to investigate if we can actually achieve on-demand DNS resolution for these clusters, carving out just a piece of DFP functionality.

@keithmattix When a STRICT_DNS cluster is used for the first time, we trigger a DNS resolution on the first request, hold the request, and then initiate async DNS resolution (using the first resolution's result to make calls) so that resolution continues in the background. Is that what you are thinking? Do you plan to work on this?

Sorry I missed this! I'd be happy to work/collaborate on this. I think that direction makes sense, because an Envoy that never directs traffic to a cluster will never have async DNS resolution started for that cluster. @ramaraochavali let's talk about how to make this happen.

@keithmattix
Contributor

/assign @keithmattix

@keithmattix
Contributor

@ggreenway can you or another maintainer reopen this?

@ggreenway ggreenway reopened this Jul 16, 2024
@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Jul 16, 2024
@keithmattix
Contributor

Not stale

@keithmattix
Contributor

Not stale

@aitorpazos

Just found this issue; let me give my two cents:

  • The existing behavior, while it may put additional load on DNS infrastructure, has nice failure-mode characteristics.
  • IMO it would be great to provide opt-in config to change/tweak the current behavior, but it would be a disappointment if it went away entirely.
  • It makes Envoy resilient to issues/outages in the DNS infrastructure. Since DNS traffic in many environments is still UDP, it is not uncommon for a packet drop to time out a query on the first try (environments with spikes of 1%-5% packet drops are not that rare). Having background resolution navigate that for you gives first requests a greater chance of success.
  • In very dynamic (e.g. auto-scaled) environments, a first request to an Envoy instance is not that uncommon.
  • Envoy can withstand somewhat extended DNS outages without impacting traffic, as the recently resolved and health-checked backends remain fresh.

@Stevenjin8
Contributor

@ramaraochavali, @keithmattix asked me to work on this issue. Are you still looking into this?

@ramaraochavali
Contributor

No. Please go ahead

@keithmattix
Contributor

Not stale

@ramaraochavali
Contributor

not stale

@maazghani

maazghani commented Dec 11, 2024

+1 definitely not stale
