Skip to content

Commit

Permalink
Add ADR #7: Nested invocations
Browse files Browse the repository at this point in the history
This ADR explains in a bit more detail what the background behind the
following commit is, and what kinds of changes were made:

buildbarn/bb-remote-execution@744c9a6
  • Loading branch information
EdSchouten committed Dec 27, 2021
1 parent e46729a commit 2aaac27
Showing 1 changed file with 127 additions and 0 deletions.
127 changes: 127 additions & 0 deletions 0007-nested-invocations.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@
# Buildbarn Architecture Decision Record #7: Nested invocations

Author: Ed Schouten<br/>
Date: 2021-12-27

# Context

On 2020-11-16, bb\_scheduler was extended to take the Bazel invocation
ID into account ([commit](https://github.com/buildbarn/bb-remote-execution/commit/1a8407a0e62bf559abcd4006e0ecc9d3de9838c7)).
With this change applied, bb\_scheduler no longer places all incoming
operations in a single FIFO-like queue (disregarding priorities).
Instead, every invocation of Bazel gets its own queue. All of these
queues combined are placed in another queue that determines the order of
execution. The goal of this change was to improve fairness, by making
the scheduler give every invocation an equal number of workers.

Under the hood, [`InMemoryBuildQueue` calls into `ActionRouter`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/in_memory_build_queue.go#L378), which then [calls into `invocation.KeyExtractor`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/routing/simple_action_router.go#L40)
to obtain an [`invocation.Key`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/invocation/key.go).
The latter acts as [a map key](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/in_memory_build_queue.go#L1201)
for all invocations known to the scheduler.
[The default implementation](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/invocation/request_metadata_key_extractor.go)
is one that extracts [the `tool_invocation_id` field from the REv2 `RequestMetadata` message](https://github.com/bazelbuild/remote-apis/blob/636121a32fa7b9114311374e4786597d8e7a69f3/build/bazel/remote/execution/v2/remote_execution.proto#L1836).

What is pretty elegant about this design is that `InMemoryBuildQueue`
doesn't really care about what kind of logic is used to compute
invocation keys. One may, for example, provide a custom
`invocation.KeyExtractor` that causes operations to be grouped by
username, or by [the `correlated_invocations_id` field](https://github.com/bazelbuild/remote-apis/blob/636121a32fa7b9114311374e4786597d8e7a69f3/build/bazel/remote/execution/v2/remote_execution.proto#L1840).

On 2021-07-16, bb\_scheduler was extended once more to use Bazel
invocation IDs to provide worker locality ([commit](https://github.com/buildbarn/bb-remote-execution/commit/902aab278baa45c92d48cad4992c445cfc588e32)).
This change causes the scheduler to keep track of invocation ID of the
last operation to run on a worker. When scheduling successive tasks, the
scheduler will try to reuse a worker with a matching invocation ID. This
increases the probability of input root cache hits, as it is not
uncommon for clients to run many actions having similar input root
contents.

At the same time, this change introduced [the notion of 'stickiness'](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/proto/configuration/bb_scheduler/bb_scheduler.proto#L177-L196),
where a cluster operator can effectively quantify the overhead of
switching between actions belonging to different invocations. Such
overhead could include the startup time of virtual machines, download
times of container images specified in REv2 platform properties. In the
case embedded hardware testing, it could include the time it takes to
flash an image to EEPROM/NAND and boot from it. By using a custom
`invocation.KeyExtractor` that extracts these properties from incoming
execution requests, the scheduler can reorder operations in such a way
that the number of restarts, reboots, flash erasure cycles, etc. is
reduced.

One major limitation of how this feature is implemented right now, is
that invocations are a flat namespace. The scheduler only provides
fairness by keying execution requests by a single property. For example,
there is no way to configure the scheduler to provide the following:

- Fairness by username, followed by fairness by Bazel invocation ID for builds
with an equal username.
- Fairness by team/project/department, followed by fairness by username.
- Fairness both by correlated invocations ID and tool invocation ID.
- Fairness/stickiness by worker container image, followed by fairness by
Bazel invocation ID.

In this ADR we propose to remove this limitation, with the goal of
improving fairness of multi-tenant Buildbarn setups.

# Proposed changes

This ADR mainly proposes that operations no longer have a single
`invocation.Key`, but a list of them. The flat map of invocations
managed by `InMemoryBuildQueue` will be changed to a
[trie](https://en.wikipedia.org/wiki/Trie), where every level of the
trie corresponds to a single `invocation.Key`. More concretely, if the
scheduler is configured to not apply any fairness whatsoever, all
operations will simply be queued within the root invocation. If the
`ActionRouter` is configured to always return two `invocation.Key`s, the
trie will have height two, having operations queued only at the leaves.

When changing this data structure, we'll also need to rework all of the
algorithms that are applied against it. For example, we currently use a
[binary heap](https://en.wikipedia.org/wiki/Binary_heap) to store all
invocations that have one or more queued operations, which is used for
selecting the next operation to run. This will be replaced with a heap
of heaps that mimics the layout of the trie. These algorithms will thus
all become a bit more complex than before.

Fortunately, there are also some places where the algorithms become
simpler. For the worker locality, we currently have to deal with the
case where the last task that ran on a worker was associated with
multiple invocations (due to in-flight deduplication). Or it may not be
associated with any invocation if the worker was created recently. In
the new model we can simply pick the
[lowest common ancestor](https://en.wikipedia.org/wiki/Lowest_common_ancestor)
of all invocations in case of in-flight deduplication, and pick the root
invocation for newly created workers. This ensures that every worker is
always associated with exactly one invocation.

For `InMemoryBuildQueue` to obtain multiple `invocation.Key`s for a
given operation, we will need to alter `ActionRouter` to return a list
instead of a single instance. The underlying `invocation.KeyExtractor`
interface can remain as is, though we can remove
[`invocation.EmptyKeyExtractor`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/invocation/empty_key_extractor.go).
The same behaviour can now be achieved by not using any key extractors
at all.

Finally we have to extend the worker invocation stickiness logic to work
with this nested model. As every level of the invocation tree now has a
different cost when switching to a different branch, we now need to let
the configuration accept a list of durations, each indicating the cost
at that level.

All of the changes proposed above have been implemented in
[this commit](https://github.com/buildbarn/bb-remote-execution/commit/744c9a65215179f19f20523acd522937a03aad56).

# Future work

With `invocation.EmptyKeyExtractor` removed, only
`invocation.RequestMetadataKeyExtractor` remains. This type is capable
of creating an `invocation.Key` based on the `tool_invocation_id`. We
should add more types, such as one capable of extracting the
`correlated_invocations_id` or authentication related properties (e.g.,
username).

Even though the scheduling policy of `InMemoryBuildQueue` is fair, there
is no support for [preemption](https://en.wikipedia.org/wiki/Preemption_(computing)).
At no point will the scheduler actively move operations from the
`EXECUTING` back to the `QUEUED` stage if it prevents other invocations
from making progress.

0 comments on commit 2aaac27

Please sign in to comment.