
Refactor dns cluster api #36353

Open · wants to merge 55 commits into main
Conversation

@Stevenjin8 (Contributor) commented Sep 26, 2024:

TODO:

  • Create new DnsConfig proto
  • Update deprecation docs
  • Update strict dns cluster to consume new dns config
  • Update logical dns cluster to consume new dns config
  • Test changes

Commit Message: Refactor DNS cluster configuration into its own extension
Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]


As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!


Caused by: #36353 was opened by Stevenjin8.



CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @mattklein123
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).


Caused by: #36353 was opened by Stevenjin8.


@Stevenjin8 (Contributor, Author):

@markdroth @wbpcode Just wanted to check in that I'm going in the right direction.

@Stevenjin8 Stevenjin8 force-pushed the refactor/dns-api branch 2 times, most recently from a305ccc to 928ef02 Compare October 1, 2024 16:46
@Stevenjin8 Stevenjin8 marked this pull request as ready for review October 1, 2024 16:47
@Stevenjin8 Stevenjin8 changed the title Refactor/dns api Refactor dns cluster api Oct 2, 2024
@Stevenjin8 (Contributor, Author):

@markdroth I would also appreciate some guidance on what you expect for testing.

@mattklein123 (Member):

Sorry, can you provide some context on this?

/wait

@markdroth (Contributor) left a comment:

@mattklein123 For context, see #35479 (comment).

@Stevenjin8, I'm not the right person to ask about testing for the Envoy implementation, so I'll let one of the Envoy maintainers do that. My feedback here is just on the xDS API changes.

Please let me know if you have any questions. Thanks!

Review comments (resolved):

  • api/envoy/config/cluster/v3/cluster.proto
  • api/envoy/extensions/clusters/dns/v3/cluster.proto
Signed-off-by: Steven Jin Xuan <[email protected]>
@Stevenjin8 (Contributor, Author):

/retest

@markdroth (Contributor) left a comment:

This looks great from an API perspective!

I'll let one of the Envoy maintainers weigh in for the actual implementation review.

// This field is only considered when the ``name`` field of the ``TypedConfig`` is ``envoy.cluster.dns``.
DnsDiscoveryType dns_discovery_type = 9;
// If true, perform logical DNS resolution. Otherwise, perform strict DNS resolution.
bool logical = 9;
Contributor:

From an API perspective, this is really hard to understand. What does "logical DNS resolution" mean?

I think we should instead use something like what I suggested earlier:

// If true, all returned addresses are considered to be associated with a single endpoint.
// Otherwise, each address is considered to be a separate endpoint.
bool all_addresses_in_single_endpoint = 9;

Contributor (Author):

I agree that "logical" is confusing, but I also don't see how "all_addresses_in_single_endpoint" captures the semantics of logical vs strict dns clusters. I could be wrong, but my understanding is as follows:

[logical dns] is optimal for large scale web services that must be accessed via DNS. Such services typically use round robin DNS to return many different IP addresses. Typically a different result is returned for each query. If strict DNS were used in this scenario, Envoy would assume that the cluster’s members were changing during every resolution interval which would lead to draining connection pools, connection cycling, etc. Instead, with logical DNS, connections stay alive until they get cycled.

It seems to me that the crucial difference between the two is how existing hosts are treated after a DNS query, not whether all addresses are in a single endpoint (the docs don't even mention the latter). Some people may even want strict DNS resolution when all addresses are in a single endpoint.

Contributor:

I'm not familiar with the Envoy implementation, but from having implemented LOGICAL_DNS support in gRPC, I think the key difference is actually the fact that all addresses are considered part of a single endpoint.

In a LOGICAL_DNS cluster, Envoy creates a single Host (which is the object it uses internally to represent an endpoint) with all addresses. Since there is only one Host, all requests will always be sent to that Host, regardless of which LB policy is used. Every time the Host is asked for a new connection, it uses whichever address is currently at the front of the list.

In a STRICT_DNS cluster, we create a separate Host for each address. The LB policy is used to pick which Host each request should be sent to, and the chosen Host manages connections to that address.

@mattklein123 can correct me if I'm wrong here, but the above is my understanding.
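
To make the contrast concrete, here is a minimal sketch of the two deprecated-style cluster configs being discussed; cluster names, the hostname, and the refresh rate are illustrative:

```yaml
clusters:
- name: logical_dns_cluster
  # LOGICAL_DNS: a single Host backed by all resolved addresses; each new
  # connection uses whichever address is currently at the front of the list.
  type: LOGICAL_DNS
  dns_refresh_rate: 5s
  load_assignment:
    cluster_name: logical_dns_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: service.example.com
              port_value: 80
- name: strict_dns_cluster
  # STRICT_DNS: one Host per resolved address; the LB policy picks among
  # them, and cluster membership changes as DNS results change.
  type: STRICT_DNS
  dns_refresh_rate: 5s
  load_assignment:
    cluster_name: strict_dns_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: service.example.com
              port_value: 80
```

Both clusters resolve the same name; only the Host-creation behavior described above differs.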

Contributor (Author):

@markdroth I think I see what you mean. Logical DNS clusters have a single Host with one address. This address may change based on results of DNS queries.

My question for @mattklein123 (or anyone familiar with the matter) is why the requirement for there to be a single host? Is there a fundamental reason why we can't have a config like

      name: my_cluster
      type: LOGICAL_DNS
      lb_policy: ROUND_ROBIN 
      hosts:
        - socket_address:
            address: eastus.service.com
            port_value: 80
        - socket_address:
            address: westus.service.com
            port_value: 80       

This would create two Hosts under the hood, one for eastus.service.com and one for westus.service.com, and envoy would round robin between them. When any Host is asked for an address, it would return the one "at the front of [its] list".

Contributor:

In principle, I don't see any reason why we couldn't add a mechanism to do something like that, but it's not something we have today, and we probably wouldn't want to actually do that without a concrete requirement.

But I also think that that feature isn't really relevant to the question of how we indicate the difference between LOGICAL_DNS and STRICT_DNS clusters. If we were going to introduce the ability for a single DNS cluster to look up multiple names, we could do it for both types, but there would still be the same difference in behavior between LOGICAL_DNS and STRICT_DNS: I would expect STRICT_DNS to still create a separate Host for each address, and we could define the LOGICAL_DNS behavior to be either (a) have one Host for each name, each of which has all of the addresses for that name, or (b) have a single Host containing all addresses for both names -- which behavior we chose would depend on the concrete use-case we'd be trying to address.

Contributor:

From the perspective of the xDS data model and familiarity with the gRPC implementation, I think that's the only difference. However, I'm not familiar with Envoy's implementation, which is why I was asking for confirmation.

If you don't know, is there someone else we can ask who is more familiar with the Envoy implementation? I just want to make sure we're not introducing a misleading API here.

Thanks!

Member:

I would have to page it all back in and look. I can try to find some time for that.

Contributor:

Unless anyone can think of any reason why bool all_addresses_in_single_endpoint doesn't work, I'd like to see this changed. I really don't think bool logical makes sense as an API here.

Contributor (Author):

@markdroth sounds good

Contributor (Author):

@markdroth I think we've resolved the last outstanding item. With your and @mattklein123's approval, I'm ready to merge whenever.

// extension point and configuring it with :ref:`DnsCluster<envoy_v3_api_msg_extensions.clusters.dns.v3.DnsCluster>`.
// If :ref:`cluster_type<envoy_v3_api_field_config.cluster.v3.Cluster.cluster_type>` is configured with
// :ref:`DnsCluster<envoy_v3_api_msg_extensions.clusters.dns.v3.DnsCluster>`, this field will be ignored.
google.protobuf.Duration dns_refresh_rate = 16 [
Contributor:

Do these fields still need to be used for other cluster types, like redis? If so, we probably can't mark them as deprecated yet.

Contributor (Author):

Both the Redis and dynamic forward proxy clusters define their own refresh rates:

message RedisClusterConfig {
  option (udpa.annotations.versioning).previous_message_type =
      "envoy.config.cluster.redis.RedisClusterConfig";

  // Interval between successive topology refresh requests. If not set, this defaults to 5s.
  google.protobuf.Duration cluster_refresh_rate = 1 [(validate.rules).duration = {gt {}}];

  // Timeout for topology refresh request. If not set, this defaults to 3s.
  google.protobuf.Duration cluster_refresh_timeout = 2 [(validate.rules).duration = {gt {}}];

  // The minimum interval that must pass after triggering a topology refresh request before a new
  // request can possibly be triggered again. Any errors received during one of these
  // time intervals are ignored. If not set, this defaults to 5s.
  google.protobuf.Duration redirect_refresh_interval = 3;

  // The number of redirection errors that must be received before
  // triggering a topology refresh request. If not set, this defaults to 5.
  // If this is set to 0, topology refresh after redirect is disabled.
  google.protobuf.UInt32Value redirect_refresh_threshold = 4;

  // The number of failures that must be received before triggering a topology refresh request.
  // If not set, this defaults to 0, which disables the topology refresh due to failure.
  uint32 failure_refresh_threshold = 5;

  // The number of hosts became degraded or unhealthy before triggering a topology refresh request.
  // If not set, this defaults to 0, which disables the topology refresh due to degraded or
  // unhealthy host.
  uint32 host_degraded_refresh_threshold = 6;

google.protobuf.Duration dns_refresh_rate = 3

Couldn't find any other ones.
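
For illustration only, here is a hypothetical sketch of what a cluster config could look like once the top-level DNS fields move into the ``cluster_type`` extension point. The extension name, type URL, and field names are assumptions inferred from the proto paths and field suggestions quoted in this thread, not the final API:

```yaml
clusters:
- name: my_dns_cluster
  cluster_type:
    name: envoy.clusters.dns   # assumed extension name
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dns.v3.DnsCluster
      # DNS settings that previously lived as top-level Cluster fields.
      dns_refresh_rate: 5s
      dns_lookup_family: AUTO
      # Replaces the old LOGICAL_DNS vs STRICT_DNS cluster types:
      # true  -> one Host with all resolved addresses (logical behavior)
      # false -> one Host per resolved address (strict behavior)
      all_addresses_in_single_endpoint: true
  load_assignment:
    cluster_name: my_dns_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: service.example.com
              port_value: 80
```

With this shape, the legacy top-level fields such as ``dns_refresh_rate = 16`` would be ignored whenever the ``DnsCluster`` typed config is present, matching the comment quoted above.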

absl::StatusOr<std::unique_ptr<LogicalDnsCluster>>
LogicalDnsCluster::create(const envoy::config::cluster::v3::Cluster& cluster,
                          Network::DnsResolverSharedPtr dns_resolver,
                          absl::Status& creation_status)
Contributor:

This looks really good to me!

@mattklein123 (Member) left a comment:

Thanks, this looks great. Please merge main and we can get this in.

/wait

// :ref:`AUTO<envoy_v3_api_enum_value_extensions.clusters.common.dns.v3.DnsLookupFamily.AUTO>`.
common.dns.v3.DnsLookupFamily dns_lookup_family = 8;

// If true, perform logical DNS resolution. Otherwise, perform strict DNS resolution.
Member:

Can you write more what the difference here is for users and/or link back to the arch docs where this is discussed?

Contributor (Author):

Added links to avoid repetition, but I can also be a bit more verbose.

Signed-off-by: Steven Jin Xuan <[email protected]>
@Stevenjin8 (Contributor, Author):

Forgot to merge main. Should be ready for another review now (even though not much changed).

Signed-off-by: Steven Jin Xuan <[email protected]>
mattklein123
mattklein123 previously approved these changes Dec 16, 2024
@mattklein123 mattklein123 enabled auto-merge (squash) December 16, 2024 22:40
Signed-off-by: Steven Jin Xuan <[email protected]>
auto-merge was automatically disabled December 17, 2024 15:51

Head branch was pushed to by a user without write access

@Stevenjin8 (Contributor, Author):

/retest

Signed-off-by: Steven Jin Xuan <[email protected]>
Signed-off-by: Steven Jin Xuan <[email protected]>
Signed-off-by: Steven Jin Xuan <[email protected]>
@markdroth (Contributor) left a comment:

Just one minor comment improvement needed -- otherwise, this looks great from an API perspective!

api/envoy/extensions/clusters/dns/v3/dns_cluster.proto (resolved)
Signed-off-by: Steven Jin Xuan <[email protected]>
@Stevenjin8 Stevenjin8 requested a review from markdroth December 18, 2024 19:16
@markdroth (Contributor):

This looks great from an API perspective!

/lgtm api

Labels: deps (Approval required for changes to Envoy's external dependencies)

5 participants