Skip to content

Commit

Permalink
A42 update: fix ring_hash connectivity state aggregation rules (#296)
Browse files Browse the repository at this point in the history
* A42 update: fix ring_hash connectivity state aggregation rules

* clarify that ring_hash usually starts in IDLE

* clarify behavior change in priority policy

* further clarify the priority change

* clarify wording

* swap the order of aggregation rules 3 and 4

* clarify wording
  • Loading branch information
markdroth authored Apr 14, 2022
1 parent c2230e9 commit de70e83
Showing 1 changed file with 82 additions and 27 deletions.
109 changes: 82 additions & 27 deletions A42-xds-ring-hash-lb-policy.md
Original file line number Diff line number Diff line change
Expand Up @@ -347,8 +347,10 @@ ring will be stored within the picker, so any time a new ring is
generated, a new picker will be returned.

The picker will contain not just the ring but also the current state of
every subchannel in the ring. Whenever a subchannel's state changes, a
new picker will need to be generated.
every subchannel in the ring. Implementations may either generate a new
picker every time a subchannel's state changes, or they may provide a
mechanism to synchronize subchannel state data between the LB policy
itself and the picker.

##### LB Policy Config

Expand Down Expand Up @@ -406,36 +408,88 @@ parent.

##### Aggregated Connectivity State

The `ring_hash` policy will use the normal aggregation rules for reporting
the overall connectivity state to its parent (i.e., the same rules used
by `weighted_target`), but with one special case. If there are 2 or more
subchannels in state `TRANSIENT_FAILURE` and all other subchannels are in
state `IDLE` or `CONNECTING`, then the policy will report `TRANSIENT_FAILURE`.
This heuristic ensures that the priority policy will fail over to the
next priority quickly when there's an outage.

So the overall aggregation rules here are:
Because the `ring_hash` policy does not proactively connect to
subchannels but rather triggers connection attempts based on picks, it
cannot use the aggregation rules used by most LB policies (e.g., those used
by the `weighted_target` policy) as-is. Instead, it has two differences
in how it handles `TRANSIENT_FAILURE`, both of which are motivated by
ensuring that the `ring_hash` policy will report reasonable connectivity
state transitions to its parent (which today is always the `priority`
policy but might at some point in the future be the channel itself).

The `priority` policy essentially has an ordered list of child policies and
will send picks to the highest priority child that is currently reporting
`READY` or `IDLE`. This means that for `ring_hash` to function as a child
of the `priority` policy, it needs to report `TRANSIENT_FAILURE` when its
subchannels are not reachable. However, `ring_hash` attempts to connect
only to those subchannels that pick requests hash to, and if the first
subchannel fails to connect, it then sequentially attempts to connecto to
subsequent subchannels in ring order, so it may take a very long time
for all of the subchannels to report `TRANSIENT_FAILURE` instead of `IDLE`.
Under the normal aggregation rules, that means that the `ring_hash` policy
would take far too long to report `TRANSIENT_FAILURE`. And more
generally, if `ring_hash` has only attempted to connect to a small
subset of its subchannels, it cannot truly know the overall reachability
of all of the subchannels. To address these problems, the `ring_hash`
policy will use a heuristic that if there are two or more subchannels
reporting `TRANSIENT_FAILURE` and none in state `READY`, it will report
an aggregated connectivity state of `TRANSIENT_FAILURE`. This heuristic
is an attempt to balance the need to allow the `priority` policy to
quickly failover to the next priority and the desire to avoid reporting
the entire `ring_hash` policy as having failed when the problem is just
one individual subchannel that happens to be unreachable.

In addition, once the `ring_hash` policy reports `TRANSIENT_FAILURE`, it
needs some way to recover from that state. The `ring_hash` policy
normally requires pick requests to trigger subchannel connection
attempts, but if it is being used as a child of the `priority` policy,
it will not be getting any picks once it reports `TRANSIENT_FAILURE`.
To work around this, it will make sure that it is attempting to
connect (after applicable backoff period) to at least one subchannel at
any given time. After a given subchannel fails a connection attempt, it
will move on to the next subchannel in the ring. It will keep doing this
until one of the subchannels successfully connects, at which point it
will report `READY` and stop proactively trying to connect.

Another issue is the way that the `priority` policy handles its failover
timer. The failover timer is used to apply an upper bound to the amount
of time that `priority` policy waits for a child policy to become
connected before it gives up and creates the child policy for the next
priority. The failover timer is started when a child is first created
and is cancelled when the child reports any state other than `CONNECTING`.
To allow this timer to work properly, the `ring_hash` policy should
remain in state `CONNECTING` until it transitions to either
`TRANSIENT_FAILURE` or `READY`. Specifically, after the first subchannel
reports `TRANSIENT_FAILURE` and all other subchannels are in `IDLE`, it
should continue to report `CONNECTING` instead of `IDLE`. In this case,
just as in the `TRANSIENT_FAILURE` case above, it will proactively attempt
to connect to at least one subchannel at all times while it is reporting
`CONNECTING`, so that it does not stay in state `CONNECTING` indefinitely
if it is not receiving picks (e.g., if the application is only occassionally
starting RPCs and giving them very short deadlines).

Note that when the `ring_hash` policy first starts up with a completely
new set of subchannels that are all in state `IDLE`, it will report `IDLE`
as a consequence of the aggregation rules shown below. This is different
from most policies, which start in state `CONNECTING`, and it will
prevent the failover timer in the `priority` policy from working
correctly, because the timer will be started when the child is created
but then immediately cancelled when it reports `IDLE`. To address this,
we will change the `priority` policy to restart the failover timer when a
child reports `CONNECTING`, if that child has not reported `TRANSIENT_FAILURE`
more recently than it reported `READY` or `IDLE`.

Taking all of the above into account, the aggregation rules for
the `ring_hash` policy are as follows:
1. If there is at least one subchannel in `READY` state, report `READY`.
2. If there are 2 or more subchannels in `TRANSIENT_FAILURE` state, report
`TRANSIENT_FAILURE`.
3. If there is at least one subchannel in `CONNECTING` state, report
`CONNECTING`.
4. If there is at least one subchannel in `IDLE` state, report `IDLE`.
5. Otherwise, report `TRANSIENT_FAILURE`.

While the `ring_hash` policy is reporting `TRANSIENT_FAILURE`, it will not be
getting any pick requests from the priority policy. However, because the
`ring_hash` policy does not attempt to reconnect to subchannels unless it
is getting pick requests, it will need special handling to ensure that it
will eventually recover from `TRANSIENT_FAILURE` state once the problem is
resolved. Specifically, it will make sure that it is attempting to
connect (after applicable backoff period) to at least one subchannel at
any given time. After a given subchannel fails a connection attempt, it
will move on to the next subchannel in the ring. It will keep doing this
until one of the subchannels successfully connects, at which point it
will report `READY` and stop proactively trying to connect. The policy
will remain in `TRANSIENT_FAILURE` until at least one subchannel becomes
connected, even if subchannels are in state `CONNECTING` during that time.
4. If there is one subchannel in `TRANSIENT_FAILURE` and there is more than
one subchannel, report state `CONNECTING`.
5. If there is at least one subchannel in `IDLE` state, report `IDLE`.
6. Otherwise, report `TRANSIENT_FAILURE`.

##### Picker Behavior

Expand Down Expand Up @@ -504,6 +558,7 @@ if (ring[first_index].state == CONNECTING) {
return PICK_QUEUE;
}
if (ring[first_index].state == TRANSIENT_FAILURE) {
ring[first_index].subchannel.TriggerConnectionAttemptInControlPlane();
first_subchannel = ring[first_index].subchannel;
found_second_subchannel = false;
found_first_non_failed = false;
Expand Down

0 comments on commit de70e83

Please sign in to comment.