Refactor dns cluster api #36353
base: main
Conversation
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to
Force-pushed 072cab0 to e1a1e5f.
@markdroth @wbpcode Just wanted to check in that I'm going in the right direction.
Force-pushed a305ccc to 928ef02.
Force-pushed 928ef02 to 2f5e253.
Force-pushed 2f5e253 to 9048a69.
@markdroth I would also appreciate some guidance on what you expect for testing.
Sorry, can you provide some context on this? /wait
@mattklein123 For context, see #35479 (comment).
@Stevenjin8, I'm not the right person to ask about testing for the Envoy implementation, so I'll let one of the Envoy maintainers do that. My feedback here is just on the xDS API changes.
Please let me know if you have any questions. Thanks!
api/envoy/extensions/common/dynamic_forward_proxy/v3/dns_cache.proto
/retest
This looks great from an API perspective!
I'll let one of the Envoy maintainers weigh in for the actual implementation review.
// This field is only considered when the ``name`` field of the ``TypedConfig`` is ``envoy.cluster.dns``.
DnsDiscoveryType dns_discovery_type = 9;

// If true, perform logical DNS resolution. Otherwise, perform strict DNS resolution.
bool logical = 9;
From an API perspective, this is really hard to understand. What does "logical DNS resolution" mean?
I think we should instead use something like what I suggested earlier:
// If true, all returned addresses are considered to be associated with a single endpoint.
// Otherwise, each address is considered to be a separate endpoint.
bool all_addresses_in_single_endpoint = 9;
I agree that "logical" is confusing, but I also don't see how "all_addresses_in_single_endpoint" captures the semantics of logical vs strict dns clusters. I could be wrong, but my understanding is as follows:
[logical dns] is optimal for large scale web services that must be accessed via DNS. Such services typically use round robin DNS to return many different IP addresses. Typically a different result is returned for each query. If strict DNS were used in this scenario, Envoy would assume that the cluster’s members were changing during every resolution interval which would lead to draining connection pools, connection cycling, etc. Instead, with logical DNS, connections stay alive until they get cycled.
It seems to me that the crucial difference between the two is how existing hosts are treated after a DNS query, not whether all addresses are in a single endpoint (the docs don't even mention the latter). Some people may even want strict DNS resolution when all addresses are in a single endpoint.
I'm not familiar with the Envoy implementation, but from having implemented LOGICAL_DNS support in gRPC, I think the key difference is actually the fact that all addresses are considered part of a single endpoint.
In a LOGICAL_DNS cluster, Envoy creates a single Host (which is the object it uses internally to represent an endpoint) with all addresses. Since there is only one Host, all requests will always be sent to that Host, regardless of which LB policy is used. Every time the Host is asked for a new connection, it uses whichever address is currently at the front of the list.
In a STRICT_DNS cluster, we create a separate Host for each address. The LB policy is used to pick which Host each request should be sent to, and the chosen Host manages connections to that address.
@mattklein123 can correct me if I'm wrong here, but the above is my understanding.
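To make the distinction concrete, here is a minimal sketch of how the two cluster types are configured today with the existing enum-based API; the cluster names and the hostname are illustrative, not from this PR. The configuration shape is nearly identical, and the difference described above lives entirely in how the resolved addresses are turned into Hosts.

# Illustrative only: the same hostname under both discovery types. With STRICT_DNS,
# every resolved address becomes its own Host; with LOGICAL_DNS, a single Host is
# created and each new connection uses the most recently resolved address.
clusters:
- name: service_strict
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  load_assignment:
    cluster_name: service_strict
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: service.example.com
              port_value: 80
- name: service_logical
  type: LOGICAL_DNS
  lb_policy: ROUND_ROBIN
  load_assignment:
    cluster_name: service_logical
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: service.example.com
              port_value: 80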
@markdroth I think I see what you mean. Logical DNS clusters have a single Host with one address. This address may change based on results of DNS queries.
My question for @mattklein123 (or anyone familiar with the matter) is: why the requirement for there to be a single Host? Is there a fundamental reason why we can't have a config like:
name: my_cluster
type: LOGICAL_DNS
lb_policy: ROUND_ROBIN
hosts:
- socket_address:
    address: eastus.service.com
    port_value: 80
- socket_address:
    address: westus.service.com
    port_value: 80
This would create two Hosts under the hood, one for eastus.service.com and one for westus.service.com, and Envoy would round robin between them. When any Host is asked for an address, it would return the one "at the front of [its] list".
In principle, I don't see any reason why we couldn't add a mechanism to do something like that, but it's not something we have today, and we probably wouldn't want to actually do that without a concrete requirement.
But I also think that that feature isn't really relevant to the question of how we indicate the difference between LOGICAL_DNS and STRICT_DNS clusters. If we were going to introduce the ability for a single DNS cluster to look up multiple names, we could do it for both types, but there would still be the same difference in behavior between LOGICAL_DNS and STRICT_DNS: I would expect STRICT_DNS to still create a separate Host for each address, and we could define the LOGICAL_DNS behavior to be either (a) have one Host for each name, each of which has all of the addresses for that name, or (b) have a single Host containing all addresses for both names -- which behavior we chose would depend on the concrete use-case we'd be trying to address.
From the perspective of the xDS data model and familiarity with the gRPC implementation, I think that's the only difference. However, I'm not familiar with Envoy's implementation, which is why I was asking for confirmation.
If you don't know, is there someone else we can ask who is more familiar with the Envoy implementation? I just want to make sure we're not introducing a misleading API here.
Thanks!
I would have to page it all back in and look. I can try to find some time for that.
Unless anyone can think of any reason why bool all_addresses_in_single_endpoint doesn't work, I'd like to see this changed. I really don't think bool logical makes sense as an API here.
@markdroth sounds good
@markdroth I think we've resolved the last outstanding item. With your and @mattklein123's approval, I'm ready to merge whenever.
// extension point and configuring it with :ref:`DnsCluster<envoy_v3_api_msg_extensions.clusters.dns.v3.DnsCluster>`.
// If :ref:`cluster_type<envoy_v3_api_field_config.cluster.v3.Cluster.cluster_type>` is configured with
// :ref:`DnsCluster<envoy_v3_api_msg_extensions.clusters.dns.v3.DnsCluster>`, this field will be ignored.
google.protobuf.Duration dns_refresh_rate = 16 [
Do these fields still need to be used for other cluster types, like redis? If so, we probably can't mark them as deprecated yet.
Both Redis and dynamic forward proxy clusters define their own refresh rates:
envoy/api/envoy/extensions/clusters/redis/v3/redis_cluster.proto
Lines 58 to 85 in 983545f
message RedisClusterConfig {
  option (udpa.annotations.versioning).previous_message_type =
      "envoy.config.cluster.redis.RedisClusterConfig";

  // Interval between successive topology refresh requests. If not set, this defaults to 5s.
  google.protobuf.Duration cluster_refresh_rate = 1 [(validate.rules).duration = {gt {}}];

  // Timeout for topology refresh request. If not set, this defaults to 3s.
  google.protobuf.Duration cluster_refresh_timeout = 2 [(validate.rules).duration = {gt {}}];

  // The minimum interval that must pass after triggering a topology refresh request before a new
  // request can possibly be triggered again. Any errors received during one of these
  // time intervals are ignored. If not set, this defaults to 5s.
  google.protobuf.Duration redirect_refresh_interval = 3;

  // The number of redirection errors that must be received before
  // triggering a topology refresh request. If not set, this defaults to 5.
  // If this is set to 0, topology refresh after redirect is disabled.
  google.protobuf.UInt32Value redirect_refresh_threshold = 4;

  // The number of failures that must be received before triggering a topology refresh request.
  // If not set, this defaults to 0, which disables the topology refresh due to failure.
  uint32 failure_refresh_threshold = 5;

  // The number of hosts became degraded or unhealthy before triggering a topology refresh request.
  // If not set, this defaults to 0, which disables the topology refresh due to degraded or
  // unhealthy host.
  uint32 host_degraded_refresh_threshold = 6;
google.protobuf.Duration dns_refresh_rate = 3 |
Couldn't find any other ones.
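For reference, a rough sketch of what opting into the new extension point might look like. The ``envoy.cluster.dns`` name and the DnsCluster message come from the proto snippets quoted in this review; the exact fields inside DnsCluster (a refresh rate, the proposed all_addresses_in_single_endpoint flag) are assumptions for illustration and may differ in the final API.

# Sketch only: cluster configured through the new cluster_type extension point.
name: my_dns_cluster
lb_policy: ROUND_ROBIN
cluster_type:
  name: envoy.cluster.dns
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.clusters.dns.v3.DnsCluster
    dns_refresh_rate: 5s                      # assumed field; supersedes Cluster.dns_refresh_rate
    all_addresses_in_single_endpoint: false   # proposed replacement for ``logical``
load_assignment:
  cluster_name: my_dns_cluster
  endpoints:
  - lb_endpoints:
    - endpoint:
        address:
          socket_address:
            address: service.example.com      # hypothetical hostname
            port_value: 80

Per the comment under discussion above, when a cluster is configured this way the legacy DNS fields on Cluster (such as dns_refresh_rate) are ignored.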
Network::DnsResolverSharedPtr dns_resolver,
absl::Status& creation_status)
absl::StatusOr<std::unique_ptr<LogicalDnsCluster>>
LogicalDnsCluster::create(const envoy::config::cluster::v3::Cluster& cluster,
This looks really good to me!
Thanks, this looks great. Please merge main and we can get this in.
/wait
// :ref:`AUTO<envoy_v3_api_enum_value_extensions.clusters.common.dns.v3.DnsLookupFamily.AUTO>`.
common.dns.v3.DnsLookupFamily dns_lookup_family = 8;

// If true, perform logical DNS resolution. Otherwise, perform strict DNS resolution.
Can you write more what the difference here is for users and/or link back to the arch docs where this is discussed?
Added links to avoid repetition, but I can also be a bit more verbose.
Forgot to merge main. Should be ready for another review now (even though not much changed).
Head branch was pushed to by a user without write access
/retest
Just one minor comment improvement needed -- otherwise, this looks great from an API perspective!
This looks great from an API perspective! /lgtm api
TODO:
Commit Message: Refactor DNS cluster configuration into its own extension
Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]