A34 weighted_round_robin lb_policy for per endpoint weight from ClusterLoadAssignment response #202

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
111 changes: 111 additions & 0 deletions A34-edf-weighted-round-robin.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
`edf_weighted_round_robin` lb_policy for per endpoint `load_balancing_weight` from `ClusterLoadAssignment` response
----
* Author(s): Yi-Shu Tai ([email protected])
* Approver: markdroth
* Status: In Review
* Implemented in: N/A
* Last updated: 2020-09-20
* Discussion at: https://groups.google.com/g/grpc-io/c/j76bnPgpHYo

## Abstract
This proposal is for carrying the per-endpoint weight from [`ClusterLoadAssignment`](https://github.com/envoyproxy/envoy/blob/2dcf20f4baf5de71ba1d8afbd76b0681613e13f2/api/envoy/config/endpoint/v3/endpoint.proto#L34) as an address attribute, and for introducing an `edf_weighted_round_robin` policy based on the [earliest deadline first (EDF) scheduling algorithm](https://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling) that takes advantage of the per-endpoint weight information.

This proposal is based on [A27: xDS-Based Global Load Balancing](https://github.com/grpc/proposal/blob/master/A27-xds-global-load-balancing.md).

## Background
[A27: xDS-Based Global Load Balancing](https://github.com/grpc/proposal/blob/master/A27-xds-global-load-balancing.md) describes the resolver/LB architecture and xDS client behavior. This proposal specifically extends the behavior of the EDS section of A27: we pass the [`load_balancing_weight`](https://github.com/envoyproxy/envoy/blob/2dcf20f4baf5de71ba1d8afbd76b0681613e13f2/api/envoy/config/endpoint/v3/endpoint_components.proto#L108) of each [`LbEndpoint`](https://github.com/envoyproxy/envoy/blob/2dcf20f4baf5de71ba1d8afbd76b0681613e13f2/api/envoy/config/endpoint/v3/endpoint_components.proto#L76) to the `lb_policy` by carrying it as a per-address attribute, so that the `lb_policy` can make use of this information for better load balancing.

To best utilize this information, we also propose a new `lb_policy`, `edf_weighted_round_robin`, which operates on the `LbEndpoint`s within the same [`LocalityLbEndpoints`](https://github.com/envoyproxy/envoy/blob/2dcf20f4baf5de71ba1d8afbd76b0681613e13f2/api/envoy/config/endpoint/v3/endpoint_components.proto#L116).

This proposal has two parts. The first part is the new `lb_policy`, `edf_weighted_round_robin`. The second part discusses how we handle the per-endpoint `load_balancing_weight` from the `ClusterLoadAssignment` response.

### Related Proposals:
* [A27: xDS-Based Global Load Balancing](https://github.com/grpc/proposal/blob/master/A27-xds-global-load-balancing.md).

## Proposal
The proposal is to carry the [`load_balancing_weight`](https://github.com/envoyproxy/envoy/blob/2dcf20f4baf5de71ba1d8afbd76b0681613e13f2/api/envoy/config/endpoint/v3/endpoint_components.proto#L108) of each [`LbEndpoint`](https://github.com/envoyproxy/envoy/blob/2dcf20f4baf5de71ba1d8afbd76b0681613e13f2/api/envoy/config/endpoint/v3/endpoint_components.proto#L76) from the [`ClusterLoadAssignment`](https://github.com/envoyproxy/envoy/blob/2dcf20f4baf5de71ba1d8afbd76b0681613e13f2/api/envoy/config/endpoint/v3/endpoint.proto#L34) response to the `lb_policy`, and to introduce a new `edf_weighted_round_robin` policy whose picker is based on the [earliest deadline first (EDF) scheduling algorithm](https://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling). Of the traffic routed to a locality, each endpoint receives a fraction equal to its `load_balancing_weight` divided by the sum of the `load_balancing_weight` of all `LbEndpoint`s within the same [`LocalityLbEndpoints`](https://github.com/envoyproxy/envoy/blob/2dcf20f4baf5de71ba1d8afbd76b0681613e13f2/api/envoy/config/endpoint/v3/endpoint_components.proto#L116).

### Overview of `edf_weighted_round_robin` policy
`edf_weighted_round_robin` distributes traffic so that each endpoint receives a fraction of traffic equal to the weight associated with the endpoint divided by the sum of the weights of all endpoints. The core of `edf_weighted_round_robin` is the [EDF](https://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling) picker.
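As a concrete illustration of this weight-to-traffic mapping, the following sketch computes the expected traffic fraction for each endpoint (the `traffic_fractions` helper is illustrative, not part of the proposed API):

```python
def traffic_fractions(weights):
    """Expected fraction of the locality's traffic per endpoint:
    weight / sum(weights)."""
    total = sum(weights)
    return [w / total for w in weights]

# Endpoints with load_balancing_weight 3 and 1 receive 75% and 25%.
print(traffic_fractions([3, 1]))  # [0.75, 0.25]
```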

#### Weight of each endpoint
`edf_weighted_round_robin` needs one extra piece of information: the `weight` of each endpoint. We pass the weight to the `lb_policy` as a per-address attribute.

#### Overview of EDF scheduler
The EDF picker maintains a priority queue of `EdfEntry`. The key of the priority queue is the `(deadline, order_offset)` pair; the entry with the lowest `deadline` is at the top of the queue. `order_offset` is the tiebreaker that maintains FIFO order: if two entries have the same `deadline`, the one with the smaller `order_offset` has higher priority.

Proposed `EdfEntry`:
```
struct EdfEntry {
  // Primary key for the priority queue. The entry with the least
  // deadline is at the top of the queue.
  double deadline;

  // Secondary key for the priority queue, used as a tiebreaker for equal
  // deadlines to maintain FIFO order: if two entries have the same
  // deadline, the one with the smaller `order_offset` has higher priority.
  // `order_offset` is assigned to the entry when the priority queue is
  // first constructed and is immutable. The assigned `order_offset`
  // values are strictly increasing; no two entries share the same
  // `order_offset`.
  uint64 order_offset;

  // `load_balancing_weight` of this endpoint, from its address attribute.
  uint32 weight;

  // Subchannel data structure for this endpoint.
  Subchannel subchannel;
}
```
Initialization
- At the very beginning, `deadline` of an entry `e` is equal to `1.0/e.weight`.
- We assign an `order_offset` to each entry while constructing the priority queue. The `order_offset` assigned to an entry is a distinct, nonnegative integer. During the whole lifecycle of the picker, the `order_offset` of an entry is unchanged.

Pick
- On each call to `Pick`, the EDF picker picks the entry `e` at the top of the queue and returns the subchannel associated with it. The picker then updates the `deadline` of `e` to `e.deadline + 1.0/e.weight` and either pops the entry and pushes it back onto the queue, or performs an increase-key operation.

Notes
- If all endpoints have the same `load_balancing_weight`, the EDF picker degenerates to the `round_robin` picker: the order of picked subchannels is decided purely by `order_offset`. This is easier to reason about and consistent with Envoy.
- Endpoints that do not have a `load_balancing_weight` are assigned a weight of 1 (the smallest possible weight). This is consistent with the [behavior of Envoy on missing weight assignment](https://github.com/envoyproxy/envoy/blob/5d95032baa803f853e9120048b56c8be3dab4b0d/source/common/upstream/upstream_impl.cc#L359).
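The initialization, pick, and tiebreaking rules above can be sketched with a binary heap. This is a minimal Python illustration; the `EdfPicker` class, its method names, and the use of plain strings as subchannels are assumptions of this sketch, not the proposed gRPC API:

```python
import heapq
import itertools


class EdfPicker:
    def __init__(self, weighted_subchannels):
        # weighted_subchannels: iterable of (subchannel, weight) pairs.
        # Heap entries are keyed on (deadline, order_offset); order_offset
        # is a distinct, strictly increasing integer assigned once at
        # construction time, so ties on deadline resolve in FIFO order.
        counter = itertools.count()
        self._queue = []
        for subchannel, weight in weighted_subchannels:
            weight = max(weight, 1)  # missing/zero weight defaults to 1
            # Initial deadline is 1.0 / weight.
            heapq.heappush(
                self._queue, (1.0 / weight, next(counter), weight, subchannel))

    def pick(self):
        # Take the entry with the least (deadline, order_offset) key,
        # advance its deadline by 1.0 / weight, and reinsert it
        # (pop + push in a single heapreplace step).
        deadline, order_offset, weight, subchannel = self._queue[0]
        heapq.heapreplace(
            self._queue,
            (deadline + 1.0 / weight, order_offset, weight, subchannel))
        return subchannel


# With weights a:2 and b:1, "a" receives two thirds of the picks.
picker = EdfPicker([("a", 2), ("b", 1)])
print([picker.pick() for _ in range(6)])  # ['a', 'a', 'b', 'a', 'a', 'b']
```

Note how equal weights reduce to plain round robin: every entry advances its deadline by the same amount, so ordering is decided entirely by `order_offset`.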


#### Subchannel connectivity management
`edf_weighted_round_robin` proactively monitors the connectivity of each subchannel. `edf_weighted_round_robin` always tries to keep one connection open to each address in the address list at all times. When `edf_weighted_round_robin` is first instantiated, it immediately tries to connect to all addresses, and whenever a subchannel becomes disconnected, it immediately tries to reconnect.

#### Service Config
The `edf_weighted_round_robin` LB policy takes an empty config message:
```
{
  load_balancing_config: { edf_weighted_round_robin: {} }
}
```

### Handling per endpoint `load_balancing_weight` from `ClusterLoadAssignment` response

This part extends the behavior of the EDS section in [A27: xDS-Based Global Load Balancing](https://github.com/grpc/proposal/blob/master/A27-xds-global-load-balancing.md). Instead of discarding the per-endpoint `load_balancing_weight`, we add it to the per-address attributes and pass it along to the `lb_policy`.

#### `lb_policy` for per endpoint `load_balancing_weight` from `ClusterLoadAssignment`
When the `lb_policy` field in the CDS response is `ROUND_ROBIN`, we use `edf_weighted_round_robin` as the `lb_policy`.

As of today, we only accept `ROUND_ROBIN` as the `lb_policy` in the CDS response, per [A27: xDS-Based Global Load Balancing](https://github.com/grpc/proposal/blob/master/A27-xds-global-load-balancing.md). Therefore, `edf_weighted_round_robin` will always be used.

#### On update of `ClusterLoadAssignment`
When an EDS update is received, an update is sent to the `lb_policy`, which creates a new picker. This is slightly different from [Envoy](https://github.com/envoyproxy/envoy/blob/51551ae944c642e6fc61563cbea8653087e70f1f/source/common/upstream/load_balancer_impl.cc#L733-L737): we update the EDF priority queue so that new weights apply immediately, even when the endpoint list is unchanged.
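One way to make weight-only changes take effect immediately is to rebuild the EDF queue from scratch on every EDS update. A minimal sketch, where the `build_edf_queue` helper and the tuple layout are assumptions of this sketch:

```python
import heapq


def build_edf_queue(weighted_endpoints):
    # Build a fresh priority queue from the latest ClusterLoadAssignment.
    # Rebuilding (rather than carrying over the old deadlines) is what
    # makes a weight-only change apply immediately, even when the
    # endpoint list itself is unchanged.
    queue = []
    for order_offset, (endpoint, weight) in enumerate(weighted_endpoints):
        weight = weight or 1  # missing weight defaults to 1
        heapq.heappush(queue, (1.0 / weight, order_offset, weight, endpoint))
    return queue


# Same endpoints, updated weights: the new queue reflects them at once.
queue = build_edf_queue([("a", 4), ("b", 1)])
print(queue[0])  # (0.25, 0, 4, 'a') -- "a" is first, deadline 1/4
```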

#### NOTE
- `edf_weighted_round_robin` should always be updated to the latest `ClusterLoadAssignment`. It is the xDS server's responsibility to maintain consistency.

## Rationale

Several applications can be built on this feature, e.g. utilization-based load balancing, blackholing erroring endpoints, and load testing.

The reason we refresh the EDF picker even when only the weights of some endpoints change (which differs from Envoy) is that we want real-time traffic shifts for use cases such as load testing and blackholing erroring endpoints.

The reasons to introduce a new algorithm instead of reusing the algorithm of the `weighted_target` policy are:
- [EDF](https://en.wikipedia.org/wiki/Earliest_deadline_first_scheduling) maintains FIFO order for endpoints with the same weight, which is easier to reason about.
- We want to be consistent with Envoy's behavior in how the lb_policy picks the backend for a request. One difference remains: `edf_weighted_round_robin` actively monitors the connectivity of each subchannel, while Envoy's `edf_scheduler` does not.

## Implementation

N/A

## Open issues (if applicable)
* How do we desynchronize clients so that they do not all make synchronized picks?
**Member:** I don't understand what this is referring to. Can you explain the problem here?

**Author:** Updated. Please let me know if it's still not clear.

**Member:** Okay, I understand the problem. I see two possible solutions here:

* The simple approach is, as you suggest, to randomize the starting point in the list whenever we create a new picker, just like `round_robin` does. This avoids having all clients hammer the same endpoints at the same time (and keep in mind that the control plane could actually send different results to different clients: each client could see different endpoints, or the same endpoints in a different order or with different weights). But it will disrupt the expected scheduling whenever the picker changes.

* A more complicated solution would be to maintain the current scheduler state in the LB policy and use a mutex to synchronize access to it between the picker and the LB policy. This is more complicated to implement, and the synchronization imposes a performance penalty (you have to acquire the mutex for every pick), but it should work.
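The first option above (randomizing the starting point) could be sketched as rotating the endpoint list by a random offset before the picker assigns `order_offset` values. The helper name is illustrative:

```python
import random


def randomize_start(endpoints):
    # Rotate the endpoint list by a random offset so that different
    # clients, given the same assignment, start their EDF schedules at
    # different endpoints instead of all hammering the same one.
    if not endpoints:
        return endpoints
    offset = random.randrange(len(endpoints))
    return endpoints[offset:] + endpoints[:offset]
```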