diff --git a/A61-IPv4-IPv6-dualstack-backends.md b/A61-IPv4-IPv6-dualstack-backends.md new file mode 100644 index 000000000..8080305e5 --- /dev/null +++ b/A61-IPv4-IPv6-dualstack-backends.md @@ -0,0 +1,918 @@ +A61: IPv4 and IPv6 Dualstack Backend Support +---- +* Author(s): @markdroth +* Approver: @ejona86, @dfawley +* Status: Ready for Implementation +* Implemented in: C-core +* Last updated: 2023-12-15 +* Discussion at: https://groups.google.com/g/grpc-io/c/VjORlKP97cE/m/ihqyN32TAQAJ + +## Abstract + +gRPC clients currently support both IPv4 and IPv6. However, most +implementations do not have support for individual backends that have both +an IPv4 and IPv6 address. It is desirable to natively support such +backends in a way that correctly interacts with load balancing. + +## Background + +For background on the interaction between the resolver and LB policy in +the gRPC client channel, see [Load Balancing in +gRPC](https://github.com/grpc/grpc/blob/master/doc/load-balancing.md). + +In most gRPC implementations, the resolver returns a flat list of +addresses, where each address is assumed to be a different endpoint, and +the LB policy is expected to balance the load across those endpoints. +The list of addresses can include both IPv4 and IPv6 addresses, but it +has no way to represent the case where two addresses point to the same +endpoint, so the LB policy will treat them as two different endpoints, +sending each one its own share of the load. However, the actual desired +behavior in this case is for the LB policy to use only one of the +addresses for each endpoint at any given time. (Note that gRPC Java +already supports this.) + +Also, when connecting to an endpoint with multiple addresses, +it is desirable to use the "Happy Eyeballs" algorithm described in +[RFC-8305][RFC-8305] to minimize the time it takes to establish a working +connection by parallelizing connection attempts in a reasonable way. +Currently, all gRPC implementations perform connection attempts in a +completely serial manner in the pick_first LB policy. + +This work is being done in conjunction with an effort to add multiple +addresses per endpoint in xDS. We will support the new xDS APIs being +added for that effort as well. Note that this change has implications +for session affinity behavior in xDS. + +### Related Proposals: +* [Support for dual stack EDS endpoints in Envoy][envoy-design] +* [gRFC A17: Client-Side Health Checking][A17] +* [gRFC A27: xDS-Based Global Load Balancing][A27] +* [gRFC A58: Weighted Round Robin LB Policy][A58] +* [gRFC A48: xDS Least Request LB Policy][A48] +* [gRFC A42: Ring Hash LB Policy][A42] +* [gRFC A56: Priority LB Policy][A56] +* [gRFC A55: xDS-Based Stateful Session Affinity][A55] +* [gRFC A60: xDS-Based Stateful Session Affinity for Weighted Clusters][A60] +* [gRFC A62: pick_first: Sticky TRANSIENT_FAILURE and address order + randomization][A62] +* [gRFC A50: Outlier Detection Support][A50] +* [gRFC A51: Custom Backend Metrics][A51] + +## Proposal + +This proposal includes several parts: +- Allow resolvers to return multiple addresses per endpoint. +- Implement Happy Eyeballs. This will be done in the pick_first LB policy, + which will become the universal leaf policy. It will also need to + support client-side health checking. In Java and Go, the pick_first + logic will be moved out of the subchannel and into the pick_first + policy itself. 
+- In xDS, we will support the new fields in EDS to indicate multiple + addresses per endpoint, and we will extend the stateful session + affinity mechanism to support such endpoints. + +### Allow Resolvers to Return Multiple Addresses Per Endpoint + +Instead of returning a flat list of addresses, the resolver will be able +to return a list of endpoints, each of which can have multiple addresses. + +Because DNS does not have a way to indicate which addresses are +associated with the same endpoint, the DNS resolver will return each +address as a separate endpoint. + +#### Attributes Returned by the Resolver + +All gRPC implementations have a mechanism for the resolver to return +arbitrary attributes to be passed to the LB policies. Attributes can +be set at the top level, which is used for things like passing the +XdsClient instance from the resolver to the LB policies (as described in +[gRFC A27][A27]), or per-address, which is used for things like passing +hierarchical address information down through the LB policy tree (as +described in [gRFC A56][A56]). + +The exact semantics for these attributes currently vary across languages. +This proposal does not attempt to define unified semantics for these +attributes, although another proposal may attempt that in the future. +For now, this proposal only defines the changes to this +interface that are required to support multiple addresses per endpoint. + +Specifically, the resolver API must provide a mechanism for passing +attributes on a per-endpoint basis. Most of the attributes that are +currently per-address will now be per-endpoint instead. Implementations +may also support per-address attributes, but this is not required. + +### Happy Eyeballs in the pick_first LB Policy + +The pick_first LB policy currently attempts to connect to each address +serially, stopping at the first one that succeeds. We will change it to +instead use the Happy Eyeballs algorithm on the initial pass through the +address list. Specifically: + +- As per [RFC-8305 section + 5](https://www.rfc-editor.org/rfc/rfc8305#section-5), the default + Connection Attempt Delay value is 250ms. Implementations may provide + a channel arg to control this value, although the value must be between the + recommended lower bound of 100ms and upper bound of 2s. Any value + lower than 100ms should be treated as 100ms; any value higher than 2s + should be treated as 2s. +- Whenever we start a connection attempt on a given address, if it is not + the last address in the list, we start a timer for the Connection + Attempt Delay. +- If the timer fires before the connection attempt completes, we will + start a connection attempt on the next address in the list. Note that + we do not interrupt the previous connection attempt that is still in + flight; at this point, we will have in-flight connection attempts to + multiple addresses at once. Also note that, as per the previous + bullet, we will once again start a timer if this new address is not + the last address in the list. +- The first time any connection attempt succeeds (i.e., the subchannel + reports READY, which happens after all handshakes are complete), + we choose that connection. If there is a timer running, we cancel + the timer. +- We will wait for at least one connection attempt on every address to + fail before we consider the first pass to be complete. At that point, + we will request re-resolution. As per [gRFC A62][A62], we will report + TRANSIENT_FAILURE state and will continue trying to connect.
We will + stay in TRANSIENT_FAILURE until either (a) we become connected or (b) + the LB policy is destroyed by the channel shutting down or going IDLE. + +If the first pass completes without a successful connection attempt, we +will switch to a mode where we keep trying to connect to all addresses at +all times, with no regard for the order of the addresses. Each +individual subchannel will provide [backoff behavior][backoff-spec], +reporting TRANSIENT_FAILURE while in backoff and then IDLE when backoff +has finished. The pick_first policy will therefore automatically +request a connection whenever a subchannel reports IDLE. We will count +the number of connection failures, and when that number reaches the +number of subchannels, we will request re-resolution; note that because +the backoff state will differ across the subchannels, this may mean that +we have seen multiple failures of a single subchannel and no failures +from another subchannel, but this is a close enough approximation and +very simple to implement. + +Note that every time the LB policy receives a new address list, it will +start an initial Happy Eyeballs pass over the new list, even if some of +the subchannels are not actually new due to their addresses having been +present on both the old and new lists. This means that on the initial +pass through the address list for a subsequent address list update, when +pick_first decides to start a connection attempt on a given subchannel +(whether because it is the first subchannel in the list or because the +timer fired before the previous address' connection attempt completed), +that subchannel may not be in state IDLE, which is the only state in +which a connection attempt may be requested. (Note: This same problem +may occur in C-core even on the first address list update, due to +subchannels being shared with other channels.) Therefore, when we are +ready to start a connection attempt on a given subchannel: + +- If the subchannel is in state IDLE, we request a connection attempt + immediately. If it is not the last subchannel in the list, we will + start the timer; if it is the last subchannel in the list, we will + wait for the attempt to complete. +- If the subchannel is in state CONNECTING, we do not need to actually + request a connection, but we will treat it as if we did. If it is not + the last subchannel in the list, we will start the timer; if it is the + last subchannel in the list, we will wait for the attempt to complete. +- If the subchannel is in state TRANSIENT_FAILURE, then we know that + it is in backoff due to a recent connection attempt failure, so we + treat it as if we have already made a connection attempt on this + subchannel, and we will immediately move on to the next subchannel. + +Note that because we do not report TRANSIENT_FAILURE until after the +Happy Eyeballs pass has completed and we start a new Happy Eyeballs pass +whenever we receive a new address list, there is a potential failure +mode where we may never report TRANSIENT_FAILURE if we are receiving new +address lists faster than we are completing Happy Eyeballs passes. This +is a pre-existing problem, and each gRPC implementation currently deals +with it in its own way. This design does not propose any changes to +those existing approaches, although a future gRFC may attempt to achieve +further convergence here. + +Once a subchannel does become READY, pick_first will unref all other +subchannels, thus cancelling any connection attempts that were already +in flight. 
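+
+To make the first pass concrete, the following is a minimal, self-contained
+Go sketch of the staggered connection attempts described above. It dials
+raw TCP addresses with `net.Dialer` rather than creating subchannels,
+hard-codes the 250ms Connection Attempt Delay, and ignores backoff,
+re-resolution, and the channel arg that overrides the delay; the names
+`firstPass` and `connAttemptDelay` are illustrative only.
+
+```go
+package main
+
+import (
+	"context"
+	"errors"
+	"fmt"
+	"net"
+	"time"
+)
+
+// connAttemptDelay is the default Connection Attempt Delay from RFC 8305
+// section 5.
+const connAttemptDelay = 250 * time.Millisecond
+
+// firstPass starts staggered connection attempts: a new attempt starts when
+// the Connection Attempt Delay elapses or the previous attempt fails,
+// whichever comes first, without interrupting attempts already in flight.
+// The first attempt to succeed wins; an error is returned only after at
+// least one attempt to every address has failed.
+func firstPass(ctx context.Context, addrs []string) (net.Conn, error) {
+	if len(addrs) == 0 {
+		return nil, errors.New("no addresses")
+	}
+	ctx, cancel := context.WithCancel(ctx)
+	defer cancel() // Abandon any attempts still in flight once we return.
+
+	type result struct {
+		conn net.Conn
+		err  error
+	}
+	results := make(chan result, len(addrs)) // Buffered: no goroutine leaks.
+	start := func(addr string) {
+		go func() {
+			var d net.Dialer
+			conn, err := d.DialContext(ctx, "tcp", addr)
+			results <- result{conn, err}
+		}()
+	}
+
+	pending := 0 // Attempts started but not yet completed.
+	for i, addr := range addrs {
+		start(addr)
+		pending++
+		if i == len(addrs)-1 {
+			break // Last address: no timer, just wait for results below.
+		}
+		timer := time.NewTimer(connAttemptDelay)
+		select {
+		case r := <-results:
+			timer.Stop()
+			pending--
+			if r.err == nil {
+				return r.conn, nil // First success wins.
+			}
+			// The attempt failed, so start the next address immediately.
+		case <-timer.C:
+			// Delay elapsed: start the next address in parallel.
+		}
+	}
+
+	// Every address now has an attempt in flight; wait for the stragglers.
+	var lastErr error
+	for ; pending > 0; pending-- {
+		r := <-results
+		if r.err == nil {
+			return r.conn, nil
+		}
+		lastErr = r.err
+	}
+	return nil, fmt.Errorf("first pass failed on all addresses, last error: %w", lastErr)
+}
+
+func main() {
+	conn, err := firstPass(context.Background(), []string{"[::1]:50051", "127.0.0.1:50051"})
+	if err != nil {
+		fmt.Println(err)
+		return
+	}
+	defer conn.Close()
+	fmt.Println("connected to", conn.RemoteAddr())
+}
+```
+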
Note that the [connection backoff][backoff-spec] state is +stored in the subchannel, so this means that we will lose backoff state +for the subchannels that are unreffed when a connection is chosen (but see note for C-core below). In general, +this is expected to be okay, because once we see a READY subchannel, +we generally expect to maintain that connection for a while, after which +the backoff state for the other subchannels will no longer be relevant. +However, there could be pathological cases where a connection does not +last very long and we wind up making subsequent connection attempts +to the other addresses sooner than we ideally should. This should be +fairly rare, so we're willing to accept this; if it becomes a problem, +we can find ways to address it at that point. + +#### Implications of Subchannel Sharing in C-core + +In C-core, there are some additional details to handle due to the +existence of subchannel sharing between channels. Any given subchannel +that pick_first is using may also be used by other channel(s), and any +of those other channels may request a connection on the subchannel +at any time. This means that pick_first needs to be prepared for the +fact that any subchannel may report any connectivity state at any time +(even at the moment that pick_first starts using the subchannel), even +if it did not previously request a connection on the subchannel itself. +This has a couple of implications: + +- pick_first needs to be prepared for any subchannel to report READY at + any time, even if it did not previously request a connection on that + subchannel. Currently (prior to this design), pick_first immediately + chooses the first subchannel that reports READY. That behavior seems + consistent with the intent of Happy Eyeballs, so we will retain it. +- When we choose a subchannel that has become successfully connected, + we will unref all of the other subchannels. For any subchannel on + which we were the only channel holding a ref, this will cause any + pending connection attempt to be cancelled, and the subchannel will + be destroyed. However, if some other channel was holding a ref to the + subchannel, the connection attempt will continue, even if the other + channel did not want it. This is slightly sub-optimal, but it's not + really a new problem; the same thing can occur today if there are two + channels both using pick_first with overlapping sets of addresses. + We can find ways to address this in the future if and when it becomes + a problem. + +#### Move pick_first Logic Out of Subchannel (Java/Go) + +In Java and Go, the pick_first logic is currently implemented in the +subchannel. We will pull this logic out of the subchannel and move it +into the pick_first policy itself. This means that subchannels will +have only one address, and that address does not change over the +lifetime of the subchannel. It will also mean that connection backoff +will be done on a per-address basis rather than a per-endpoint basis. +This will move us closer to having a uniform architecture across all of +our implementations. + +#### Use pick_first as the Universal Leaf Policy + +There are two main types of LB policies in gRPC: leaf policies, which +directly interact with subchannels, and parent policies, which delegate +to other LB policies. Happy Eyeballs support is necessary only in leaf +policies.
+ +Because we do not want to implement Happy Eyeballs multiple times, we +will implement it only in pick_first, and we will change all other leaf +policies to delegate to pick_first instead of directly interacting with +subchannels. This set of policies, which we will refer to as +"[petiole](https://en.wikipedia.org/wiki/Petiole_(botany))" policies, +includes the following: +- round_robin (see [gRPC Load Balancing](https://github.com/grpc/grpc/blob/master/doc/load-balancing.md#round_robin)) +- weighted_round_robin (see [gRFC A58][A58]) +- ring_hash (see [gRFC A42][A42]) +- least_request (see [gRFC A48][A48] -- currently supported in Java and + Go only) + +The petiole policies will receive a list of endpoints, each of which +may contain multiple addresses. They will create a pick_first child +policy for each endpoint, to which they will pass a list containing a +single endpoint with all of its addresses. (See below for more details +on individual petiole policies.) + +Note that implementations should be careful to ensure that this +change does not make error messages less useful when a pick fails. +For example, today, when round_robin has all of its subchannels in state +TRANSIENT_FAILURE, it can return a picker that fails RPCs with the error +message reported by one of the subchannels (e.g., "failed to connect +to all addresses; last error: ipv4:127.0.0.1:443: Failed to connect to +remote host: Connection refused"), which tends to be more useful than +just saying something like "all subchannels failed". With this change, +round_robin will be delegating to pick_first instead of directly +interacting with subchannels, and the LB policy API in many gRPC +implementations does not have a mechanism to report an error message +along with the connectivity state. In those implementations, it may be +necessary for round_robin to return a picker that delegates to one of +the pick_first children's pickers, possibly modifying the error message +from the child picker before returning it to the channel. + +#### Address List Handling in pick_first + +As mentioned above, we are changing the LB policy API to take an address +list that contains a list of endpoints, each of which can contain one +or more addresses. However, the Happy Eyeballs algorithm assumes a flat +list of addresses, not this two-dimensional list. To address that, we +need to define how pick_first will flatten the list. We also need to +define how that flattening interacts with both the sorting described in +[RFC-8305 section 4](https://www.rfc-editor.org/rfc/rfc8305#section-4) +and with the optional shuffling described in [gRFC A62][A62]. + +There are three cases to consider here: + +A. If pick_first is used under a petiole policy, it will see a single + endpoint with one or more addresses. + +B. If pick_first is used as the top-level policy in the channel with the + DNS resolver, it will see one or more endpoints, each of which has + exactly one address. It should be noted that the DNS resolver does + not actually know which addresses might or might not be associated + with the same endpoint, so it assumes that each address is a separate + endpoint. + +C. If pick_first is used as the top-level policy in the channel with a + custom resolver implementation, it may see more than one endpoint, + each of which has one or more addresses. + +[RFC-8305 section 4](https://www.rfc-editor.org/rfc/rfc8305#section-4) +says to perform RFC-6724 sorting first.
In gRPC, that sorting happens +in the DNS resolver before the address list is passed to the LB policy, +so it will already be done by the time pick_first sees the address list. + +When the pick_first policy sees an address list, it will perform these +steps in the following order: + +1. Perform the optional shuffling described in [gRFC A62][A62]. The + shuffling will change the order of the endpoints but will not touch + the order of the addresses within each endpoint. This means that the + shuffling will work for cases B and C above, but it will not work for + case A; this is expected to be the right behavior, because we do not + have or anticipate any use cases where a petiole policy will need to + enable shuffling. + +2. Flatten the list by concatenating the ordered list of addresses for + each of the endpoints, in order. + +3. In the flattened list, interleave addresses from the two address + families, as per [RFC-8305 section + 4](https://www.rfc-editor.org/rfc/rfc8305#section-4). Doing this on + the flattened address list ensures the best behavior if only one of + the two address families is working. + +#### Generic Health Reporting Mechanism + +gRPC currently supports two mechanisms that provide a health signal for +a connection: client-side health checking, as described in [gRFC A17][A17], +and outlier detection, as described in [gRFC A50][A50]. Currently, both +mechanisms signal unhealthiness by essentially causing the subchannel to +report TRANSIENT_FAILURE to the leaf LB policy. However, that approach +will no longer work with this design, as explained in the +[Reasons for Generic Health Reporting](#reasons-for-generic-health-reporting) +section below. + +Instead, we need to make these health signals visible to the petiole +policies without affecting the underlying connectivity management of +the pick_first policy. However, since both of these mechanisms work on +individual subchannels rather than on endpoints with multiple subchannels, +this functionality is best implemented in pick_first itself, since +that's where we know which subchannel was actually chosen. Therefore, +pick_first will have an option to support these health signals, and +that option will be used only when pick_first is used as a child policy +underneath a petiole policy. + +Note that we do not want either of these mechanisms to actually work +when pick_first is used as an LB policy by itself, so we will implement +this functionality in a way that it can be triggered by a parent policy +such as round_robin but cannot be triggered by an external application. +(For example, in C-core, this will be triggered via an internal-only +channel arg that will be set by the petiole policies.) + +When this option is enabled in pick_first, it will be necessary for +pick_first to see both the "raw" connectivity state of each subchannel +and the state reflected by health checking. The connection management +behavior will continue to use the "raw" connectivity state, just as it +does today. Only once pick_first chooses a subchannel will it start +the health watch, and the connectivity state reported by that watch +is the state that pick_first will report to its parent. + +Although we need pick_first to be aware of the chosen subchannel's +health, we do not want it to have to be specifically aware of individual +health-reporting mechanisms like client-side health checking or outlier +detection (or any other such mechanism that we might add in the future).
+As a result, we will structure this as a general-purpose health-reporting +watch that will be started by pick_first without regard to whether any +individual health-reporting mechanism is actually configured. If no +health-reporting mechanisms are actually configured, the watch will +report the subchannel's raw connectivity state, so it will effectively +be a no-op. + +#### Address List Updates in Petiole Policies + +The algorithm used by petiole policies to handle address list updates +will need to be updated to reflect the new two-level nature of address +lists. + +Currently, there are differences between C-core and Java/Go in terms of +how address list updates are handled, so we need to specify how each +approach works and how it is going to be changed. + +##### Address List Updates in C-core + +In C-core, the channel provides a subchannel pool, which means that if +an LB policy creates multiple subchannels with the same address and +channel args, both of the returned subchannel objects will actually be +refs to the same underlying real subchannel. + +As a result, the normal way to handle an address list update today is to +create a whole new list of subchannels, ignoring the fact that some of +them may be duplicates of subchannels in the previous list; for those +duplicates, the new list will just wind up getting a new ref to the +existing subchannel, so there will not be any connection churn. Also, to +avoid adding unnecessary latency to RPCs being sent on the channel, we +wait to actually start using the new list until we have seen the initial +connectivity state update on all of those subchannels and they have been +given the chance to get connected, if necessary. + +With the changes described in this proposal, we will continue to take +the same basic approach, except that for each endpoint, we will create a +pick_first child policy instead of creating a subchannel. Note that the +subchannel pool will still be used by all pick_first child policies, so +creating a new pick_first child in the new list for the same address that +is already in use by a pick_first child in the old list will wind up +reusing the existing connection. + +##### Address List Updates in Java/Go + +In Java and Go, there is no subchannel pool, so when an LB policy gets +an updated address list, it needs to explicitly check whether any of +those addresses were already present on its previous list. It +effectively does a set comparison: for any address on the new list that +is not on the old list, it will create a new subchannel; for any address +that was on the old list but is not on the new list, it will remove the +subchannel; and for any address on both lists, it will retain the +existing subchannel. + +This algorithm will continue to be used, with the difference that each +entry in the list will now be a set of one or more addresses rather than +a single address. Note that the order of the addresses will not matter +when determining whether an endpoint is present on the list; if the old +list had an endpoint with address list `[A, B]` and the new list has an +endpoint with address list `[B, A]`, that endpoint will be considered to +be present on both lists. However, because the order of the addresses +will matter to the pick_first child when establishing a new connection, +the petiole policy will need to send an updated address list to the +pick_first child to ensure that it has the updated order.
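+
+As a rough illustration of this set comparison, here is a short Go sketch
+that keys endpoints by their unordered set of addresses (using a sorted,
+joined address list as the key). The `endpoint` type and the printed
+actions are stand-ins for the real resolver and LB policy types; real
+implementations may build the key differently and must also decide
+whether resolver attributes participate in the comparison.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+	"strings"
+)
+
+// endpoint is an ordered list of addresses for a single backend.
+type endpoint struct {
+	addresses []string
+}
+
+// endpointKey keys an endpoint by its *unordered* set of addresses, so
+// that [A, B] and [B, A] are treated as the same endpoint.
+func endpointKey(e endpoint) string {
+	addrs := append([]string(nil), e.addresses...)
+	sort.Strings(addrs)
+	return strings.Join(addrs, "|")
+}
+
+// applyUpdate performs the set comparison described above: endpoints with
+// new keys get a new pick_first child, endpoints whose keys are gone are
+// removed, and retained endpoints are re-sent to their existing child so
+// that it sees the (possibly reordered) address list.
+func applyUpdate(children map[string]endpoint, update []endpoint) {
+	seen := make(map[string]bool)
+	for _, e := range update {
+		key := endpointKey(e)
+		seen[key] = true
+		if _, ok := children[key]; !ok {
+			fmt.Println("creating pick_first child for", e.addresses)
+		} else {
+			fmt.Println("retaining child; forwarding new order", e.addresses)
+		}
+		children[key] = e // In a real policy, forward e to the child here.
+	}
+	for key := range children {
+		if !seen[key] {
+			fmt.Println("removing pick_first child keyed by", key)
+			delete(children, key)
+		}
+	}
+}
+
+func main() {
+	children := map[string]endpoint{}
+	applyUpdate(children, []endpoint{{addresses: []string{"10.0.0.1:443", "[::1]:443"}}})
+	// The same endpoint with its addresses reordered is retained, not
+	// recreated, but the new order is still forwarded to the child.
+	applyUpdate(children, []endpoint{{addresses: []string{"[::1]:443", "10.0.0.1:443"}}})
+}
+```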
+ +Note that in this algorithm, the unordered set of addresses must be the +same on both the old and new list for an endpoint to be considered the +same. This means that if an address is added or removed from an +existing endpoint, it will be considered a completely new endpoint, +which may cause some unnecessary connection churn. For this design, we +are accepting this limitation, but we may consider optimizing this in +the future if it becomes a problem. + +Except for the cases noted below (Ring Hash and Outlier Detection), +it is up to the implementation whether a given LB policy takes resolver +attributes into account when comparing endpoints from the old list and +the new list. + +#### Weighted Round Robin + +In the `weighted_round_robin` policy described in [gRFC A58][A58], some +additional state is needed to track the weight of each endpoint. + +##### WRR in C-core + +In C-core, WRR currently has a map of address weights, keyed by the +associated address. The weight objects are ref-counted and remove +themselves from the map when their ref-count reaches zero. When a +subchannel is created for a given address, it takes a new ref to the +weight object for its address. This structure allows the weight +information to be retained when we create a new subchannel list in +response to an updated address list. + +With the changes in this proposal, this map will instead be keyed by the +unordered set of addresses for each endpoint. This will use the same +semantics as address list updates in Java/Go, described above: an +endpoint on the old list with addresses `[A, B]` will be considered +identical to an endpoint on the new list with addresses `[B, A]`. + +Note that in order to start the ORCA OOB watcher for backend metrics +on the subchannel (see [gRFC A51][A51]), WRR will need to intercept +subchannel creation via the helper that it passes down into the pick_first +policy. It will unconditionally start the watch for each subchannel +as it is created, all of which will update the same subchannel weight. +However, once pick_first chooses a subchannel, it will unref the other +subchannels, so only one OOB watcher will remain in steady state. + +##### WRR in Java/Go + +In Java and Go, WRR stores the subchannel weight in the individual +subchannel. We will continue to use this same structure, except that +instead of using a map from a single address to a subchannel, we will +store a map from an unordered set of addresses to a pick_first child, +and the endpoint weight will be stored alongside that pick_first child. + +Just like in C-core, in order to start the ORCA OOB watcher for backend +metrics on the subchannel, WRR will need to intercept subchannel creation +via the helper that it passes down into the pick_first policy. However, +unlike C-core, Java and Go will need to wrap the subchannels and store +them, so that they can start or stop the ORCA OOB watcher as needed by a +subsequent config change. + +#### Least Request + +The least-request LB policy (Java and Go only, described in [gRFC +A48][A48]) will work essentially the same way as WRR. The only difference +is that the data it is storing on a per-endpoint basis is outstanding +request counts rather than weights. + +#### Ring Hash + +Currently, as described in [gRFC A42][A42], each entry in the ring is a +single address, positioned based on the hash of that address. With this +design, that will change such that each entry in the ring is an endpoint, +positioned based on the hash of the endpoint's first address. 
However, +once an entry in the ring is selected, we may wind up connecting to the +endpoint on a different address than the one that we hashed to. + +Note that this means that if the order of the addresses for a given +endpoint changes, that will change the position of the endpoint in +the ring. This is considered acceptable, since ring_hash is already +subject to churn in the ring whenever the address list changes. + +Because ring_hash establishes connections lazily, but pick_first will +attempt to connect as soon as it receives its initial address list, the +ring_hash policy will lazily create the pick_first child when it wants +to connect. + +Note that as of [gRFC A62][A62], pick_first has sticky-TF behavior in +all languages: when a connection attempt fails, it continues retrying +indefinitely with appropriate [backoff][backoff-spec], staying in +TRANSIENT_FAILURE state until either it establishes a connection or the +pick_first policy is destroyed. This means that the ring_hash picker no +longer needs to explicitly trigger connection attempts on subchannels in +state TRANSIENT_FAILURE, which makes the logic much simpler. The picker +pseudo-code now becomes: + +``` +first_index = ring.FindIndexForHash(request.hash); +for (i = 0; i < ring.size(); ++i) { + index = (first_index + i) % ring.size(); + if (ring[index].state == READY) { + return ring[index].picker->Pick(...); + } + if (ring[index].state == IDLE) { + ring[index].endpoint.TriggerConnectionAttemptInControlPlane(); + return PICK_QUEUE; + } + if (ring[index].state == CONNECTING) { + return PICK_QUEUE; + } +} +``` + +As per [gRFC A42][A42], the ring_hash policy normally requires pick +requests to trigger subchannel connection attempts, but if it is +being used as a child of the priority policy, it will not be getting +any picks once it reports TRANSIENT_FAILURE. To work around this, it +currently makes sure that it is attempting to connect (after applicable +backoff period) to at least one subchannel at any given time. After +a given subchannel fails a connection attempt, it moves on to the +next subchannel in the ring. This approach allows the policy to recover +if any one endpoint becomes reachable, while also minimizing the number +of endpoints it is trying to connect to simultaneously, so that it does +not wind up with a lot of unnecessary connections when connectivity is +restored. However, with the sticky-TF behavior, it will not be possible +to attempt to connect to only one endpoint at a time, because when a +given pick_first child reports TRANSIENT_FAILURE, it will automatically +try reconnecting after the backoff period without waiting for a connection +to be requested. Proposed pseudo-code for this logic is: + +``` +if (in_transient_failure && endpoint_entered_transient_failure) { + first_idle_index = -1; + for (i = 0; i < endpoints.size(); ++i) { + if (endpoints[i].connectivity_state() == CONNECTING) { + first_idle_index = -1; + break; + } + if (first_idle_index == -1 && endpoints[i].connectivity_state() == IDLE) { + first_idle_index = i; + } + } + if (first_idle_index != -1) { + endpoints[first_idle_index].RequestConnection(); + } +} +``` + +Note that this means that after an extended connectivity outage, +ring_hash will now often wind up with many unnecessary connections. +However, this situation is also possible via the picker if ring_hash is +the last child under the priority policy, so we are willing to live with +this behavior for now.
If it becomes a problem in the future, we can +consider ways to ameliorate it at that time. + +Note that in C-core, the normal approach for handling address list +updates described [above](#address-list-updates-in-c-core) won't work, +because if we are creating the pick_first children lazily, then we will +wind up not creating the children in the new endpoint list and thus +never swapping over to it. As a result, ring_hash in C-core will use an +approach more like that of [Java and Go](#address-list-updates-in-javago): +it will maintain a map of endpoints by the set of addresses, and it will +update that set in place when it receives an updated address list. + +Because ring_hash chooses which endpoint to use via a hash function based +solely on the first address of the endpoint, it does not make sense to +have multiple endpoints with the same address that are differentiated +only by the resolver attributes. Thus, resolver attributes are ignored +when de-duping endpoints. + +#### Outlier Detection + +The goal of the outlier detection policy is to temporarily stop sending +traffic to servers that are returning an unusually large error rate. +The kinds of problems that it is intended to catch are primarily things +that are independent of which address is used to connect to the server; +a problem with the reachability of a particular address is more likely to +cause connectivity problems than individual RPC failures, and problems +that cause RPC failures are generally just as likely to occur on any +address. Therefore, this design changes the outlier detection policy +to make ejection decisions on a per-endpoint basis, instead of on a +per-address basis as it does today. RPCs made to any address associated +with an endpoint will count as activity on that endpoint, and ejection +or unejection decisions for an endpoint will affect subchannels for all +addresses of an endpoint. + +As described in [gRFC A50][A50], the outlier detection policy currently +maintains a map keyed by individual address. The map values contain both +the set of currently existing subchannels for a given address as well +as the ejection state for that address. This map will be split into +two maps: a map of currently existing subchannels, keyed by individual +address, and a map of ejection state, keyed by the unordered set of +addresses on the endpoint. + +The entry in the subchannel map will hold a ref to the corresponding +entry in the endpoint map. This ref will be updated when the LB policy +receives an updated address list, when the list of addresses in the +endpoint changes. It will be used to update the successful and failed +call counts as each RPC finishes. Note that appropriate synchronization +is required for those two different accesses. + +The entry in the endpoint map may hold a pointer to the entries in the +subchannel map for the addresses associated with the endpoint, or the +implementation may simply look up each of the endpoint's addresses in +the subchannel map separately. These accesses from the endpoint map +to the subchannel map will be performed by the LB policy when ejecting +or unejecting the endpoint, to send health state notifications to the +corresponding subchannels. Note that if the ejection timer runs in the +same synchronization context as the rest of the activity in the LB policy, +no additional synchronization should be needed here. + +The set of entries in both maps will continue to be set based on the +address list that the outlier detection policy receives from its parent. 
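+
+The relationship between the two maps can be sketched as follows (a rough
+Go sketch with illustrative types and field names; the real per-endpoint
+entry carries the rest of the ejection bookkeeping, and ejection would
+also push the new health state to each address's subchannel through the
+generic health reporting mechanism):
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+	"strings"
+	"sync"
+)
+
+// endpointState is the per-endpoint entry: the call counters inspected by
+// the outlier detection algorithm plus the current ejection state. One
+// entry is shared by every address of the endpoint.
+type endpointState struct {
+	mu        sync.Mutex // Call results arrive as RPCs complete.
+	successes int64
+	failures  int64
+	ejected   bool
+}
+
+// subchannelState is the per-address entry; it holds a ref to the endpoint
+// entry so that activity on any address is counted against the endpoint.
+type subchannelState struct {
+	endpoint *endpointState
+}
+
+type outlierDetection struct {
+	subchannels map[string]*subchannelState // Keyed by individual address.
+	endpoints   map[string]*endpointState   // Keyed by unordered address set.
+}
+
+func endpointKey(addrs []string) string {
+	sorted := append([]string(nil), addrs...)
+	sort.Strings(sorted)
+	return strings.Join(sorted, "|")
+}
+
+// recordResult is called as each RPC finishes, via the subchannel map.
+func (od *outlierDetection) recordResult(addr string, ok bool) {
+	sc := od.subchannels[addr]
+	if sc == nil {
+		return
+	}
+	sc.endpoint.mu.Lock()
+	defer sc.endpoint.mu.Unlock()
+	if ok {
+		sc.endpoint.successes++
+	} else {
+		sc.endpoint.failures++
+	}
+}
+
+// setEjected is called from the ejection timer via the endpoint map; a real
+// policy would also notify the subchannels for each of the endpoint's
+// addresses here.
+func (od *outlierDetection) setEjected(addrs []string, ejected bool) {
+	if ep := od.endpoints[endpointKey(addrs)]; ep != nil {
+		ep.mu.Lock()
+		ep.ejected = ejected
+		ep.mu.Unlock()
+	}
+}
+
+func main() {
+	ep := &endpointState{}
+	od := &outlierDetection{
+		subchannels: map[string]*subchannelState{
+			"10.0.0.1:443": {endpoint: ep},
+			"[::1]:443":    {endpoint: ep},
+		},
+		endpoints: map[string]*endpointState{
+			endpointKey([]string{"10.0.0.1:443", "[::1]:443"}): ep,
+		},
+	}
+	od.recordResult("10.0.0.1:443", true)
+	od.recordResult("[::1]:443", false)
+	od.setEjected([]string{"[::1]:443", "10.0.0.1:443"}, true)
+	fmt.Printf("successes=%d failures=%d ejected=%v\n", ep.successes, ep.failures, ep.ejected)
+}
+```
+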
+And the map keys will continue to use only the addresses, not taking +resolver attributes into account. + +Currently, the outlier detection policy wraps the subchannels and ejects +them by reporting their connectivity state as TRANSIENT_FAILURE. +As described [above](#generic-health-reporting-mechanism), we will +change the outlier detection policy to instead eject endpoints by +wrapping the subchannel's generic health reporting mechanism. + +### Support Multiple Addresses Per Endpoint in xDS + +The EDS resource has been updated to support multiple addresses per +endpoint in +[envoyproxy/envoy#27881](https://github.com/envoyproxy/envoy/pull/27881). +Specifically, that PR adds a new `AdditionalAddress` message, which +contains a single `address` field, and it adds a repeated +`additional_addresses` field of that type to the `Endpoint` proto. + +When validating the EDS resource, when processing the `Endpoint` proto, +we validate each entry of `additional_addresses` as follows: +- If the `address` field is unset, we reject the resource. +- If the `address` field *is* set, then we validate it exactly the same + way that we already validate the `Endpoint.address` field. + +#### Changes to Stateful Session Affinity + +We need to support endpoints with multiple addresses in stateful session +affinity (see gRFCs [A55][A55] and [A60][A60]). We want to add one +additional property here, which is that we do not want affinity to break +if an endpoint has multiple addresses and then one of those addresses +is removed in an EDS update. This will require some changes to the +original design. + +First, the session cookie, which currently contains a single endpoint +address, will be changed to contain a list of endpoint addresses. As per +gRFC A60, the cookie's format is a base64-encoded string of the form +`
<address>;<cluster>`. This design changes that format such that the +address part will be a comma-delimited list of addresses. The +`StatefulSession` filter currently sets a call attribute that +communicates the address from the cookie to the `xds_override_host` LB +policy; that call attribute will now contain the list of addresses from +the cookie. + +Next, the entries in the address map in the `xds_override_host` LB policy +need to contain the actual address list to be used in the cookie when +a given address is picked. Note that the original design already described +how we would represent endpoints with multiple addresses in this map, +since that was already possible in Java (see the description in A55 of +handling EquivalentAddressGroups when constructing the map). However, +the original design envisioned that we would store a list of addresses +that would be looked up as keys in the map when finding alternative +addresses to use, which we no longer need now that we will be encoding +the list of addresses in the cookie itself. Instead, what we need from +the map entry is the information necessary to construct the list of +addresses to be encoded in the cookie when the address for that map +entry is picked. Implementations will likely want to store this as a +single string instead of a list, since that will avoid the need to +construct the string on a per-RPC basis. + +As per the original design, when returning the server's initial metadata +to the application, the `StatefulSession` filter may need to set a cookie +indicating which endpoint was chosen for the RPC. However, now that the +cookie needs to include all of the endpoint's addresses and not just the +specific one that is used, we need to communicate that information from +the `xds_override_host` LB policy back to the `StatefulSession` filter. +This will be done via the same call attribute that the `StatefulSession` +filter creates to communicate the list of addresses from the cookie to +the `xds_override_host` policy. That attribute will be given a new +method to allow the `xds_override_host` policy to set the list of +addresses to be encoded in the cookie, based on the address chosen by +the picker. The `StatefulSession` filter will then update the cookie if +the address list in the cookie does not match the address list reported +by the `xds_override_host` policy. Note that when encoding the cookie, +the address that is actually used must be the first address in the list. + +In accordance with those changes, the picker logic will now look like this: + +``` +def Pick(pick_args): + override_host_attribute = pick_args.call_attributes.get(attribute_key) + if override_host_attribute is not None: + idle_subchannel = None + found_connecting = False + for address in override_host_attribute.cookie_address_list: + entry = lb_policy.address_map[address] + if entry found: + if (entry.subchannel is set AND + entry.health_status is in policy_config.override_host_status): + if entry.subchannel.connectivity_state == READY: + override_host_attribute.set_actual_address_list(entry.address_list) + return entry.subchannel as pick result + elif entry.subchannel.connectivity_state == IDLE: + if idle_subchannel is None: + idle_subchannel = entry.subchannel + elif entry.subchannel.connectivity_state == CONNECTING: + found_connecting = True + # No READY subchannel found. If we found an IDLE subchannel, + # trigger a connection attempt and queue the pick until that attempt + # completes.
+ if idle_subchannel is not None: + hop into control plane to trigger connection attempt for idle_subchannel + return queue as pick result + # No READY or IDLE subchannels. If we found a CONNECTING + # subchannel, queue the pick and wait for the connection attempt + # to complete. + if found_connecting: + return queue as pick result + # pick_args.override_addresses not set or did not find a matching subchannel, + # so delegate to the child picker. + result = child_picker.Pick(pick_args) + if result.type == PICK_COMPLETE: + entry = lb_policy.address_map[result.subchannel.address()] + if entry found: + override_host_attribute.set_actual_address_list(entry.address_list) + return result +``` + +### Temporary environment variable protection + +The code that reads the new EDS fields will be initially guarded by an +environment variable called `GRPC_EXPERIMENTAL_XDS_DUALSTACK_ENDPOINTS`. +This environment variable guard will be removed once this feature has +proven stable. + +Note that we will not use this environment variable to guard the Happy +Eyeballs functionality, because that functionality will be on by +default, not something that is enabled via external input. + +## Rationale + +### Happy Eyeballs Functionality + +Note that we will not support all parts of "Happy Eyeballs" as described +in [RFC-8305][RFC-8305]. For example, because our resolver API does +not provide a way to return some addresses without others, we will not +start trying to connect before all of the DNS queries have returned. + +### Java and Go Pick First Restructuring + +In Java and Go, pick_first is currently implemented inside the subchannel +rather than at the LB policy layer. In those implementations, it +might work to implement Happy Eyeballs inside the subchannel, which +would avoid the need to make pick_first the universal leaf policy, +and in Go, it would avoid the need to move the health-checking code +out of the subchannel. However, that approach won't work for C-core, +and we would like to take this opportunity to move toward a more +uniform cross-language architecture. Also, moving pick_first up +to the LB policy layer in Java and Go will have the nice effect of +making their backoff work per-address instead of across all addresses, +which is what C-core does and what the (poorly specified) [connection +backoff spec][backoff-spec] seems to have originally envisioned. + +### Reasons for Generic Health Reporting + +Currently, client-side health checking and outlier detection +signal unhealthiness by essentially causing the subchannel to report +TRANSIENT_FAILURE to the leaf LB policy. This existing approach works +reasonably when petiole policies directly create and manage subchannels, +but it will not work when pick_first is the universal leaf policy. +When pick_first sees its chosen subchannel transition from READY to +TRANSIENT_FAILURE, it will interpret that as the connection failing, so +it will unref the subchannel and report IDLE to its parent. This causes +two problems. + +The first problem is that we don't want unhealthiness to trigger +connection churn, but pick_first would react in this case by dropping +the existing connection unnecessarily. Note that, as described in [gRFC +A17](A17-client-side-health-checking.md#pick_first), the client-side +health checking mechanism does not work with pick_first, for this exact +reason. In hindsight, we should have imposed the same restriction for +outlier detection, but that was not explicitly stated in [gRFC A50][A50]. 
+However, that gRFC does say that outlier detection will ignore subchannels +with multiple addresses, which is the case in Java and Go. In C-core, +it should have worked with pick_first, although it turns out that there +was a bug that prevented it from working, which means that we can know +that no users were actually counting on this behavior. This means that +we can retroactively say that outlier detection should never have worked +with pick_first with minimal risk of affecting users that might have +been counting on this use-case. (It might affect Java/Go channels that +use pick_first and happen to have only one address, and it might have +been used in Node.) + +The second problem is that this would cause pick_first to report IDLE +instead of TRANSIENT_FAILURE up to the petiole policy. This could +affect the aggregated connectivity state that the petiole policy reports +to *its* parent. And parent policies like the priority policy (see +[gRFC A56][A56]) may then make the wrong routing decision based on that +incorrect state. + +These problems are solved via the introduction of the [Generic Health +Reporting Mechanism](#generic-health-reporting-mechanism). + +## Implementation + +### C-core + +- move client-side health checking out of subchannel so that it can be + controlled by pick_first (https://github.com/grpc/grpc/pull/32709) +- assume LB policies start in CONNECTING state + (https://github.com/grpc/grpc/pull/33009) +- prep for outlier detection ejecting via health watch + (https://github.com/grpc/grpc/pull/33340) +- move pick_first off of the subchannel_list library that it previously + shared with petiole policies, and add generic health watch support + (https://github.com/grpc/grpc/pull/34218) +- change petiole policies to use generic health watch, and change outlier + detection to eject via health state instead of raw connectivity state + (https://github.com/grpc/grpc/pull/34222) +- change ring_hash to delegate to pick_first + (https://github.com/grpc/grpc/pull/34244) +- add endpoint_list library for petiole policies, and use it to change + round_robin to delegate to pick_first + (https://github.com/grpc/grpc/pull/34337) +- change WRR to delegate to pick_first + (https://github.com/grpc/grpc/pull/34245) +- implement happy eyeballs in pick_first + (https://github.com/grpc/grpc/pull/34426 and + https://github.com/grpc/grpc/pull/34717) +- implement address interleaving for happy eyeballs + (https://github.com/grpc/grpc/pull/34615 and + https://github.com/grpc/grpc/pull/34804) +- change resolver and LB policy APIs to support multiple addresses per + endpoint, and update most LB policies + (https://github.com/grpc/grpc/pull/33567) +- support new xDS fields (https://github.com/grpc/grpc/pull/34506) +- change outlier detection to handle multiple addresses per endpoint + (https://github.com/grpc/grpc/pull/34526) +- change stateful session affinity to handle multiple addresses per endpoint + (https://github.com/grpc/grpc/pull/34472) + +### Java + +- move pick_first logic out of subchannel and into pick_first policy +- make pick_first the universal leaf policy, including client-side + health checking support +- implement happy eyeballs in pick_first +- fix ring_hash to support endpoints with multiple addresses +- support new xDS fields + +### Go + +- change subchannel connectivity state API (maybe) +- move pick_first logic out of subchannel and into pick_first 
policy +- make pick_first the universal leaf policy, including client-side + health checking support (includes moving health checking logic out of + the subchannel) +- change address list to support multiple addresses per endpoint and + change LB policies to handle this (including ring_hash) +- implement happy eyeballs in pick_first +- support new xDS fields + +## Open issues (if applicable) + +N/A + +[envoy-design]: https://docs.google.com/document/d/1AjmTcMWwb7nia4rAgqE-iqIbSbfiXCI4h1vk-FONFdM/edit +[A17]: A17-client-side-health-checking.md +[A27]: A27-xds-global-load-balancing.md +[A42]: A42-xds-ring-hash-lb-policy.md +[A48]: A48-xds-least-request-lb-policy.md +[A50]: A50-xds-outlier-detection.md +[A51]: A51-custom-backend-metrics.md +[A55]: A55-xds-stateful-session-affinity.md +[A60]: A60-xds-stateful-session-affinity-weighted-clusters.md +[A56]: A56-priority-lb-policy.md +[A58]: A58-client-side-weighted-round-robin-lb-policy.md +[RFC-8305]: https://www.rfc-editor.org/rfc/rfc8305 +[A62]: A62-pick-first.md +[backoff-spec]: https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md