
On-demand DNS resolution #20562

Open
howardjohn opened this issue Mar 28, 2022 · 30 comments
Assignees
Labels
area/dns enhancement Feature requests. Not bugs or questions.

Comments

@howardjohn
Contributor

Currently, DNS-based clusters (LOGICAL_DNS and STRICT_DNS) constantly resolve DNS in the background.

We see fairly common reports of DNS servers being overloaded when a large number of these clusters exists across many Envoy workloads sharing the same DNS server (typically kube-dns).

Often, these workloads send requests to these clusters infrequently or never. However, due to the operational complexity of maintaining fine-grained configuration, the cluster is still present on an excessive number of Envoy instances. Even with perfect configuration, we may have a service we call once per hour but need to resolve repeatedly.

We use the respect_dns_ttl field, so sometimes this can be fixed with a larger TTL. However, the TTL is not always configurable, and even a high TTL can still lead to thundering-herd problems.
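For context (not from the issue text), these are the knobs on a DNS cluster today, per the Envoy v3 Cluster API; the cluster name and values here are illustrative:

```yaml
clusters:
- name: example_dns_cluster          # illustrative name
  type: STRICT_DNS
  respect_dns_ttl: true              # honor the TTL returned by the DNS server
  dns_refresh_rate: 5s               # refresh interval used when no TTL applies
```

Even with respect_dns_ttl set, resolution still happens continuously in the background at whatever cadence the TTL dictates, regardless of whether the cluster receives traffic.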

It would be ideal if we could support on-demand DNS resolution. This could be implemented in a few ways:

  1. (preferred) On the first request to a cluster, do DNS resolution and set a timer to re-resolve (based on the TTL). Once the timer fires, re-resolve only if there were any requests during this period. This means that as long as 1/QPS > TTL, we are blocked on DNS only for the first request. If we have infrequent requests, we only resolve once per request.

  2. On the first request to a cluster, start DNS resolution like normal.

If we have full on-demand CDS, this would look a lot like (2). However, I expect that will take a long time, and even when we have it, this is still likely useful.

@howardjohn howardjohn added enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Mar 28, 2022
@daixiang0 daixiang0 added area/dns and removed triage Issue requires triage labels Mar 29, 2022
@daixiang0
Member

daixiang0 commented Mar 29, 2022

Could you share more details about the case where requests are never actually sent? I have not heard of it before. Do you mean health checks?

This means that as long as 1/QPS > TTL, we are blocked on DNS only for the first request. If we have infrequent requests, we only resolve once per request.

Is there a cache inconsistency issue when 1/QPS > TTL if I understand correctly?

@howardjohn
Contributor Author

I mean a user creates a config that results in an Envoy cluster for googleapis.com being sent to a bunch of Envoy instances in the mesh. However, only half of their applications actually call googleapis.com, so we have a bunch of excess DNS requests.

Even in the 50% of instances we do depend on googleapis.com, we may call googleapis.com infrequently.

Is there a cache inconsistency issue when 1/QPS > TTL if I understand correctly?

No, this just means we would need to perform a synchronous DNS request.

@mattklein123
Member

Could you use dynamic forward proxy for this? Dynamic forward proxy is effectively on demand DNS. :)
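For readers unfamiliar with the feature: a rough sketch of a dynamic forward proxy setup, based on the shape of the upstream Envoy v3 APIs. The cache name is illustrative, and this is a fragment, not a complete bootstrap config:

```yaml
# HTTP filter that resolves the request's host header on demand,
# pausing the request until resolution completes.
http_filters:
- name: envoy.filters.http.dynamic_forward_proxy
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.dynamic_forward_proxy.v3.FilterConfig
    dns_cache_config:
      name: dynamic_forward_proxy_cache_config   # illustrative cache name
      dns_lookup_family: V4_ONLY

# Paired cluster that serves endpoints out of the same DNS cache.
clusters:
- name: dynamic_forward_proxy_cluster
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      dns_cache_config:
        name: dynamic_forward_proxy_cache_config
        dns_lookup_family: V4_ONLY
```

The key property for this issue is that resolution is triggered by traffic rather than by cluster creation.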

@howardjohn
Contributor Author

Interesting, I hadn't thought of that before, but it does seem close to what we want. One concern is that it only works for HTTP (I think?), and we would likely need cluster-per-hostname in many cases to have more control. The dynamic forward proxy could be used in some cases, though, I think.

@mattklein123
Member

L4 has a variant called the SNI forward proxy which works the same way. I would take a look at that also. Almost certainly we can make one of these work how you want it to.
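A rough sketch of what the L4 variant looks like, assuming the Envoy SNI dynamic forward proxy network filter; the upstream port, cache name, and cluster name are illustrative. The tls_inspector listener filter is what extracts the SNI used for resolution:

```yaml
listener_filters:
- name: envoy.filters.listener.tls_inspector
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector
filter_chains:
- filters:
  # Resolves the SNI hostname on demand before proxying bytes.
  - name: envoy.filters.network.sni_dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.sni_dynamic_forward_proxy.v3.FilterConfig
      port_value: 443                              # assumed upstream port
      dns_cache_config:
        name: dynamic_forward_proxy_cache_config   # illustrative cache name
        dns_lookup_family: V4_ONLY
  - name: envoy.filters.network.tcp_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
      stat_prefix: sni_proxy
      cluster: dynamic_forward_proxy_cluster
```

Because this keys on SNI, it only helps for TLS traffic, which is exactly the gap discussed below for raw TCP.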

@howardjohn
Contributor Author

We also support just raw TCP without TLS. An example in the Istio API:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-database
spec:
  addresses:
  - 1.2.3.4
  ports:
  - number: 1234
  resolution: DNS
  endpoints:
  - address: some-sql-instance.aws.com

What this sets up in Envoy terms is a Listener on 1.2.3.4:1234 which forwards to a STRICT_DNS cluster resolving some-sql-instance.aws.com.
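A sketch of that translation in Envoy config terms, under the assumption that the upstream port matches the listener port; the listener, cluster, and stat names are illustrative:

```yaml
static_resources:
  listeners:
  - name: external-database
    address:
      socket_address: { address: 1.2.3.4, port_value: 1234 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: external_database
          cluster: outbound_external-database
  clusters:
  - name: outbound_external-database
    type: STRICT_DNS              # resolved continuously in the background today
    respect_dns_ttl: true
    load_assignment:
      cluster_name: outbound_external-database
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: some-sql-instance.aws.com, port_value: 1234 }
```

There is no TLS here, so there is no SNI for an SNI-based forward proxy to key on.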

That may just be a minor edge case, though. Let me play around with the HTTP/SNI forms and see if there are any gaps.

@mattklein123
Member

If you really needed it, we could add an option to the "SNI" filter to just hard-code a target if no SNI is supplied or the connection is not TLS, etc.

@ggreenway
Contributor

It may be generally useful to have a way to delay service discovery until a cluster is used (even if it's EDS, not DNS). I think it's similar to why we have a separate healthcheck interval for clusters that have never been used.

@lambdai
Contributor

lambdai commented Apr 27, 2022

Update: corrected the linked issue.
I previously explored delaying the EDS request until an upstream connection is desired. There is huge complexity there, and the gain is marginal on top of on-demand CDS.

@ggreenway what's your opinion on my proposal #20873?

I'd like to build a notification/callback for endpoint readiness in the Cluster. The initial goal was to resolve the race condition in the current on-demand CDS, and my hunch is that delayed discovery is another potential use case.

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label May 28, 2022
@github-actions

github-actions bot commented Jun 4, 2022

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@keithmattix
Contributor

keithmattix commented Jul 1, 2024

I took a look, and I don't think the semantics and security posture of DFP match the Istio use case very well. Would love to investigate if we can actually achieve on-demand DNS resolution for these clusters, carving out just a piece of DFP functionality.

@ramaraochavali
Contributor

Would love to investigate if we can actually achieve on-demand DNS resolution for these clusters, carving out just a piece of DFP functionality.

@keithmattix When a STRICT_DNS cluster is used for the first time, we trigger a DNS resolution on the first request, hold the request, and then initiate async DNS resolution (using the first resolution's result to make calls) so that resolution continues in the background. Is that what you are thinking? Do you plan to work on this?

@nessa829

I just found this issue, and I think this is exactly what we need!
Due to architectural history and issues, despite being in the same cluster, all communication is routed through domains registered in Route 53 for now. As a result, traffic passes unnecessarily through the NLB / ingress gateway for communication between microservices.
To address this easily, I tried introducing ServiceEntries as below.

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: sre-test-entry
  namespace: sre-test
spec:
  endpoints:
  - address: sre-test.sre-test.svc.cluster.local
  hosts:
  - sre-test.example.com
  location: MESH_INTERNAL
  ports:
  - name: http
    number: 80
    protocol: HTTP
  resolution: DNS

but after registering about 100 services, CoreDNS request volume skyrocketed.
I do not understand why this behavior is intended.

This happens even if the proxy never sends any requests to these applications.

@ramaraochavali
Contributor

I do not understand why this is intended.

That is only used for DNS queries made by the application. DNS queries that are resolved by Envoy itself (as in your ServiceEntry case) will go through CoreDNS.

@keithmattix
Contributor

keithmattix commented Jul 16, 2024

Would love to investigate if we can actually achieve on-demand DNS resolution for these clusters, carving out just a piece of DFP functionality.

@keithmattix When a STRICT_DNS cluster is used for the first time, we trigger a DNS resolution on the first request, hold the request, and then initiate async DNS resolution (using the first resolution's result to make calls) so that resolution continues in the background. Is that what you are thinking? Do you plan to work on this?

Sorry I missed this! I'd be happy to work/collaborate on this. I think that direction makes sense, because an Envoy that never directs traffic to a cluster will never have async DNS resolution started for that cluster. @ramaraochavali let's talk about how to make this happen.

@keithmattix
Contributor

/assign @keithmattix

@keithmattix
Contributor

@ggreenway can you or another maintainer reopen this?

@ggreenway ggreenway reopened this Jul 16, 2024
@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Jul 16, 2024
@keithmattix
Contributor

Not stale

@keithmattix
Contributor

Not stale

@aitorpazos

Just found this issue; let me give my two cents:

  • The existing behavior, while it may put additional load on DNS infrastructure, has nice failure-mode characteristics.
  • IMO it would be great to provide opt-in config to change/tweak the current behavior, but it would be a disappointment if it went away entirely.
  • It makes Envoy resilient to issues/outages in the DNS infrastructure. Since DNS traffic in many environments is still UDP, it is not uncommon for a packet drop to time out a query on the first try (environments with spikes of 1%-5% packet drops are not that rare). Having background resolution navigate that for you gives first requests a greater chance of success.
  • In very dynamic (e.g. auto-scaled) environments, a first request to an Envoy instance is not that uncommon.
  • Envoy can withstand somewhat extended DNS outages without impacting traffic, as the recently resolved and health-checked backends remain fresh.

@Stevenjin8
Contributor

@ramaraochavali, @keithmattix asked me to work on this issue. Are you still looking into this?

@ramaraochavali
Contributor

No. Please go ahead

@keithmattix
Contributor

Not stale

@ramaraochavali
Contributor

not stale

@maazghani

maazghani commented Dec 11, 2024

+1 definitely not stale
