# client: make batching of allocation updates smarter #9451
For the last couple of weeks I've been testing out a variety of solutions here, and finally have some benchmark data to recommend a best approach. I've developed spikes for three approaches:

### Client Backoff

The server makes 50ms batches of updates (ref). Use the size of the server's most recently-applied batch of client updates to adjust the time between client batches. This is similar to the suggested scaling based on the size of the cluster above, but it backs off based on a more meaningful value for the operation at hand.

### Prioritized Updates

Each allocation sends many updates during the startup process, but only those that change the `ClientStatus` field are critical for progressing a deployment or recovering from failures.

### Shared Buffer

This is the "split batching from sending" option described above. This is more efficient on the client, but it wasn't clear to me that it would improve much of anything on the server. (That being said, it would also make productionizing the prioritized updates approach nicer, so I assumed that if it was at least net-neutral we'd be able to combine the two.)

### Not Reviewed

I did not test counting allocation updates as heartbeats, simply because I was more focused on Raft updates and heartbeats do not pass through Raft. This would be good for future work.

### Benchmark

I used our E2E environment, but with m5.large instances (8GiB of RAM) and the RPC connection limits removed from the servers.

**Terraform diff**

```diff
diff --git a/e2e/terraform/etc/nomad.d/base.hcl b/e2e/terraform/etc/nomad.d/base.hcl
index 5bdffae59f..af64af10da 100644
--- a/e2e/terraform/etc/nomad.d/base.hcl
+++ b/e2e/terraform/etc/nomad.d/base.hcl
@@ -27,3 +27,10 @@ telemetry {
publish_allocation_metrics = true
publish_node_metrics = true
}
+
+limits {
+ https_handshake_timeout = "5s"
+ http_max_conns_per_client = 100000
+ rpc_handshake_timeout = "5s"
+ rpc_max_conns_per_client = 100000
+}
diff --git a/e2e/terraform/terraform.tfvars b/e2e/terraform/terraform.tfvars
index b4568f467e..433b080cb3 100644
--- a/e2e/terraform/terraform.tfvars
+++ b/e2e/terraform/terraform.tfvars
@@ -2,11 +2,11 @@
# SPDX-License-Identifier: MPL-2.0
region = "us-east-1"
-instance_type = "t3a.medium"
+instance_type = "m5.large"
server_count = "3"
-client_count_ubuntu_jammy_amd64 = "4"
-client_count_windows_2016_amd64 = "1"
-volumes = true
+client_count_ubuntu_jammy_amd64 = "1"
+client_count_windows_2016_amd64 = "0"
+volumes = false
nomad_local_binary = "../../pkg/linux_amd64/nomad"
nomad_local_binary_client_windows_2016_amd64 = ["../../pkg/windows_amd64/nomad.exe"]
```

Each of the three approaches was benchmarked, along with a baseline from the current implementation.
Because scheduling performance itself could be a factor, I used two different job templates: one with the expensive `spread` block and one with a `distinct_hosts` constraint instead.

**benchmark job (spread)**

```hcl
job "JOBID" {
group "group" {
count = 100
spread {
attribute = "${node.unique.id}"
}
ephemeral_disk {
migrate = true
size = 10
sticky = true
}
task "task_one" {
driver = "mock_driver"
kill_timeout = "5s"
config {
exit_code = 0
exit_err_msg = "error on exit"
exit_signal = 9
kill_after = "3s"
run_for = "30s"
signal_error = "got signal"
start_block_for = "1s"
stdout_repeat = 1
stdout_repeat_duration = "10s"
stdout_string = "hello, world!\n"
}
resources {
cpu = 1
memory = 10
}
logs {
disabled = true
max_files = 1
max_file_size = 1
}
}
task "task_two" {
driver = "mock_driver"
kill_timeout = "5s"
config {
exit_code = 0
exit_err_msg = "error on exit"
exit_signal = 9
kill_after = "3s"
run_for = "30s"
signal_error = "got signal"
start_block_for = "1s"
stdout_repeat = 1
stdout_repeat_duration = "10s"
stdout_string = "hello, world!\n"
}
resources {
cpu = 1
memory = 10
}
logs {
disabled = true
max_files = 1
max_file_size = 1
}
}
}
}
```

**benchmark job (no spread)**

```hcl
job "JOBID" {
group "group" {
count = 100
constraint {
distinct_hosts = true
}
ephemeral_disk {
migrate = true
size = 10
sticky = true
}
task "task_one" {
driver = "mock_driver"
kill_timeout = "5s"
config {
exit_code = 0
exit_err_msg = "error on exit"
exit_signal = 9
kill_after = "3s"
run_for = "30s"
signal_error = "got signal"
start_block_for = "1s"
stdout_repeat = 1
stdout_repeat_duration = "10s"
stdout_string = "hello, world!\n"
}
resources {
cpu = 1
memory = 10
}
logs {
disabled = true
max_files = 1
max_file_size = 1
}
}
task "task_two" {
driver = "mock_driver"
kill_timeout = "5s"
config {
exit_code = 0
exit_err_msg = "error on exit"
exit_signal = 9
kill_after = "3s"
run_for = "30s"
signal_error = "got signal"
start_block_for = "1s"
stdout_repeat = 1
stdout_repeat_duration = "10s"
stdout_string = "hello, world!\n"
}
resources {
cpu = 1
memory = 10
}
logs {
disabled = true
max_files = 1
max_file_size = 1
}
}
}
}
```
### Analysis

Overall these tests confirmed that client updates are by far the majority of Raft load during job deployments on an otherwise steady-state cluster where nodes aren't coming or going. All these benchmarks pushed the cluster to near 100% CPU utilization, and a few attempts had to be made to dial in the right number of jobs so that RPC and Raft traffic didn't exceed the available network throughput of the hosts.

The most important thing to look at is whether any of these techniques reduced the load on Raft from client updates. The chart below shows each technique for each benchmark, with client updates written to Raft in the bottom bar and server updates in the top bar. The prioritized updates approach is a clear winner here, reducing client updates by ~10% and overall Raft load by ~6%. (It's also entirely possible we could reduce this further by increasing the wait time for non-critical updates.)

(Note: I've tried my best to use accessible colors here, but it's entirely possible I've been misled by some of the online sources I consulted on how to do so. Please feel free to nudge me in a better direction if you're reading this and having trouble!)

Next I wanted to confirm we weren't reducing the number of Raft events at the cost of latency on those Raft entries. The following graphs show

But Raft latency and throughput don't mean anything if the user experience is bad! So I extracted data from the event stream showing the difference between the
### Next Steps

The next steps for me when I come back on Tuesday after the long weekend are as follows:
A couple of notes about approaches that didn't pan out, so that we don't retread that ground in future investigations...
The allocrunner sends several updates to the server during the early lifecycle of an allocation and its tasks. Clients batch up allocation updates every 200ms, but experiments like the C2M challenge have shown that even with this batching, servers can be overwhelmed with client updates during high-volume deployments. Benchmarking done in #9451 has shown that client updates can easily represent ~70% of all Nomad Raft traffic.

Each allocation sends many updates during its lifetime, but only those that change the `ClientStatus` field are critical for progressing a deployment or kicking off a reschedule to recover from failures. Add a priority to the client allocation sync and update the `syncTicker` receiver so that we only send an update if there's a high-priority update waiting, or on every 5th tick. This means when there are no high-priority updates, the client will send updates at most every 1s instead of every 200ms. Benchmarks have shown this can reduce overall Raft traffic by 10%, as well as reduce client-to-server RPC traffic.

This changeset also switches from a channel-based collection of updates to a shared buffer, so as to split batching from sending and prevent backpressure onto the allocrunner when the RPC is slow. This doesn't have a major performance benefit in the benchmarks, but makes the implementation of the prioritized update simpler.

Fixes: #9451
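For illustration, here is a minimal sketch of the batching logic described above. The `pendingUpdates` type, its field names, and the string payload are placeholders invented for this example, not Nomad's actual client types; only the 200ms tick, the every-5th-tick rule, and the `ClientStatus`-based priority come from the changeset description.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pendingUpdates is an illustrative shared buffer of allocation updates,
// keyed by allocation ID so newer updates overwrite older ones.
type pendingUpdates struct {
	mu     sync.Mutex
	allocs map[string]string // alloc ID -> latest ClientStatus (stand-in payload)
	high   bool              // set when a ClientStatus change is waiting
}

func (p *pendingUpdates) add(id, status string, statusChanged bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.allocs[id] = status
	if statusChanged {
		p.high = true
	}
}

func (p *pendingUpdates) drain() map[string]string {
	p.mu.Lock()
	defer p.mu.Unlock()
	batch := p.allocs
	p.allocs, p.high = make(map[string]string), false
	return batch
}

func (p *pendingUpdates) hasHighPriority() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.high
}

// syncLoop mimics the syncTicker receiver described above: on each 200ms
// tick it sends only if a high-priority update is waiting, or on every 5th
// tick, so low-priority updates go out at most every 1s instead of 200ms.
func syncLoop(p *pendingUpdates, send func(map[string]string), stop <-chan struct{}) {
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()

	ticks := 0
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			ticks++
			if !p.hasHighPriority() && ticks%5 != 0 {
				continue // let low-priority updates accumulate
			}
			if batch := p.drain(); len(batch) > 0 {
				send(batch)
			}
		}
	}
}

func main() {
	p := &pendingUpdates{allocs: make(map[string]string)}
	stop := make(chan struct{})
	go syncLoop(p, func(batch map[string]string) { fmt.Println("sync:", batch) }, stop)

	p.add("alloc-1", "pending", false) // low priority: sent on the 5th tick (~1s)
	time.Sleep(1100 * time.Millisecond)
	p.add("alloc-1", "running", true) // ClientStatus change: sent on the very next tick
	time.Sleep(300 * time.Millisecond)
	close(stop)
}
```

Running this sketch shows the low-priority update waiting for the 1s boundary, while the `ClientStatus` change goes out on the next 200ms tick.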
### Nomad version

Nomad v1.0.0-beta3

### Issue
Load testing revealed pathological behavior in the client loop that sends allocation updates to servers. See #9435 for details.
#9435 is the most naive improvement to the client allocation updating logic. Further improvements could have better performance characteristics while lowering the latency for operations that depend on allocation updates (`nomad alloc status`, deployments, drains, and rescheduling).

### Split batching from sending
As of #9435 the alloc update batching and sending are all performed in a single goroutine. While the code is easy to understand and eases error+retry handling, it potentially causes unnecessary backpressure into alloc updates if the chan's buffer ever fills. Newer updates should always overwrite older updates, so there's no reason to keep a serialized buffer of updates and consume them one at a time.
A more efficient design would be to share the `update` map and directly update it (using a mutex) instead of using a chan for serialization. This removes the potential for processing stale updates and minimizes the number of allocations in memory.
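A rough sketch of that shared-map design, assuming a simplified stand-in for `*structs.Allocation`; the type and method names are invented for illustration:

```go
package allocsync

import "sync"

// Allocation is a simplified stand-in for Nomad's *structs.Allocation.
type Allocation struct {
	ID           string
	ClientStatus string
}

// updateBuffer is shared between the allocrunners (writers) and the sync
// goroutine (reader). A write is just a map assignment under a mutex, so a
// newer update for the same alloc overwrites the stale one and the writer
// never blocks on a full channel.
type updateBuffer struct {
	mu      sync.Mutex
	updates map[string]*Allocation
}

func newUpdateBuffer() *updateBuffer {
	return &updateBuffer{updates: make(map[string]*Allocation)}
}

// Put records the latest known state of an allocation, replacing any
// not-yet-sent update for the same ID.
func (b *updateBuffer) Put(alloc *Allocation) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.updates[alloc.ID] = alloc
}

// Drain hands the current batch to the sender and resets the buffer, so
// batching is decoupled from sending: a slow Node.UpdateAlloc RPC delays
// the next send but never backs up the allocrunners.
func (b *updateBuffer) Drain() []*Allocation {
	b.mu.Lock()
	defer b.mu.Unlock()
	batch := make([]*Allocation, 0, len(b.updates))
	for _, alloc := range b.updates {
		batch = append(batch, alloc)
	}
	b.updates = make(map[string]*Allocation)
	return batch
}
```

Because `Put` is just a map assignment under a mutex, an allocrunner that updates the same allocation twice before a send simply overwrites its earlier entry, which is exactly the "newer updates overwrite older updates" behavior described above.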
### Scale Alloc Update Rate like Heartbeats

The node heartbeat rate is proportional to cluster size: the larger the cluster, the longer between updates. Alloc updates should apply similar logic, although potentially with a small upper bound (1s? 3s?) to cap the staleness of alloc information, which is used for other time-sensitive operations (e.g. deployments and drains).
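As a sketch of what that scaling could look like (the base interval, per-node increment, and 3s cap are placeholder numbers, not values taken from Nomad):

```go
package allocrate

import "time"

// allocUpdateInterval scales the client's update batch interval with
// cluster size, like heartbeat intervals do, but caps it so alloc data
// used by deployments and drains never gets too stale. The constants are
// illustrative placeholders, not Nomad's real tuning.
func allocUpdateInterval(numNodes int) time.Duration {
	const (
		base     = 200 * time.Millisecond // today's fixed batch interval
		perNode  = 1 * time.Millisecond   // grow slowly with cluster size
		maxDelay = 3 * time.Second        // upper bound on staleness
	)

	interval := base + time.Duration(numNodes)*perNode
	if interval > maxDelay {
		return maxDelay
	}
	return interval
}
```

With these placeholder numbers, a 100-node cluster would batch every 300ms, while a 5,000-node cluster would hit the 3s cap.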
### Count alloc updates as heartbeats
Right now alloc updates and heartbeats are entirely separate code paths. It is possible, although it has not been observed, that a node's alloc update succeeds while its heartbeat fails, resulting in a false `down` node.

If alloc updates were counted as heartbeats on both the clients and servers, it could reduce a lot of spurious RPC calls and reduce the opportunities for false `down` detection.
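This idea hasn't been implemented or benchmarked here, but a rough server-side sketch might look like the following; the `tracker` type and `handleUpdateAlloc` helper are invented for illustration and don't reflect Nomad's actual heartbeat code:

```go
package heartbeat

import (
	"sync"
	"time"
)

// tracker is an illustrative stand-in for the servers' per-node heartbeat
// TTL bookkeeping.
type tracker struct {
	mu   sync.Mutex
	seen map[string]time.Time // node ID -> last time we heard from it
	ttl  time.Duration
}

// touch records proof of liveness for a node.
func (t *tracker) touch(nodeID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.seen[nodeID] = time.Now()
}

// isDown reports whether the node's TTL has expired.
func (t *tracker) isDown(nodeID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	last, ok := t.seen[nodeID]
	return !ok || time.Since(last) > t.ttl
}

// handleUpdateAlloc sketches the proposal: a successful alloc update RPC
// from a node also counts as a heartbeat, so the node isn't marked down
// just because a separate heartbeat RPC was lost.
func handleUpdateAlloc(t *tracker, nodeID string, applyUpdates func() error) error {
	if err := applyUpdates(); err != nil {
		return err
	}
	t.touch(nodeID) // count the alloc update as a heartbeat
	return nil
}
```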
### Prioritized updates

Not all alloc updates are equal: task events are purely for human consumption, whereas fields like `Allocation.ClientStatus` affect a wide range of scheduler behavior (deployments, drains, rescheduling, and more).

Therefore we could adjust the batch interval based on the highest-priority update in the current batch. For two allocations placed on the same node, the optimal behavior would be something like:
1. `Received` task event is created for both (2 updates)
2. `Task dir created` event is created for both (2 updates)
3. `Image downloading` event is created for both (2 updates)

(For realistic jobs with artifacts, templates, and deployments there would be a lot more updates.)
The current logic would likely call `Node.UpdateAllocs` at least 3 times. Optimally we would call `Node.UpdateAllocs` 2 times: once after `Image downloading` occurred for both allocs and once after both allocs started.

Steps 1-4 are "low priority" task events: intended only for human consumption, and therefore they can be delayed by hundreds of milliseconds or even seconds!

Steps 5 & 6 are "high priority" task events: they affect deployments and rescheduling, and ideally are only delayed minimally for batching (tens to hundreds of milliseconds).
### Reproduction steps

Produce a high rate of allocation updates on a client, either with low-latency batch jobs or a high number of failing and retrying service jobs. Observe client and server CPU load and `Node.UpdateAlloc` RPC latency.