# client: make batching of allocation updates smarter #9451
For the last couple of weeks I've been testing out a variety of solutions here, and finally have some benchmark data to recommend a best approach. I've developed spikes for three approaches:

### Client Backoff

The server makes 50ms batches of updates (ref). Use the size of the server's most recently-applied batch of client updates to adjust the time between client batches. This is similar to the suggested scaling based on the size of the cluster above, but it backs off based on a more meaningful value for the operation at hand.

### Prioritized Updates

Each allocation sends many updates during the startup process, but only those that change the `ClientStatus` field are critical for progressing a deployment or recovering from failures.

### Shared Buffer

This is the "split batching from sending" option described above. This is more efficient on the client, but it wasn't clear to me that it would improve much of anything on the server. (That being said, it would also make productionizing the prioritized updates approach nicer, so I assumed that if it was at least net-neutral we'd be able to combine the two.)

### Not Reviewed

I did not test counting allocation updates as heartbeats, simply because I was more focused on Raft updates and heartbeats do not pass through Raft. This would be good for future work.

### Benchmark

I used our E2E environment, but with m5.large instances (8GiB of RAM) and the RPC connection limits removed from the servers.

**Terraform diff**

```diff
diff --git a/e2e/terraform/etc/nomad.d/base.hcl b/e2e/terraform/etc/nomad.d/base.hcl
index 5bdffae59f..af64af10da 100644
--- a/e2e/terraform/etc/nomad.d/base.hcl
+++ b/e2e/terraform/etc/nomad.d/base.hcl
@@ -27,3 +27,10 @@ telemetry {
publish_allocation_metrics = true
publish_node_metrics = true
}
+
+limits {
+ https_handshake_timeout = "5s"
+ http_max_conns_per_client = 100000
+ rpc_handshake_timeout = "5s"
+ rpc_max_conns_per_client = 100000
+}
diff --git a/e2e/terraform/terraform.tfvars b/e2e/terraform/terraform.tfvars
index b4568f467e..433b080cb3 100644
--- a/e2e/terraform/terraform.tfvars
+++ b/e2e/terraform/terraform.tfvars
@@ -2,11 +2,11 @@
# SPDX-License-Identifier: MPL-2.0
region = "us-east-1"
-instance_type = "t3a.medium"
+instance_type = "m5.large"
server_count = "3"
-client_count_ubuntu_jammy_amd64 = "4"
-client_count_windows_2016_amd64 = "1"
-volumes = true
+client_count_ubuntu_jammy_amd64 = "1"
+client_count_windows_2016_amd64 = "0"
+volumes = false
nomad_local_binary = "../../pkg/linux_amd64/nomad"
nomad_local_binary_client_windows_2016_amd64 = ["../../pkg/windows_amd64/nomad.exe"]
```

Each of the three approaches was benchmarked, along with a baseline from the current implementation.
Because scheduling performance itself could be a factor, I used two different job templates: one with the expensive `spread` block and one with a `distinct_hosts` constraint instead.

**benchmark job (spread)**

```hcl
job "JOBID" {
group "group" {
count = 100
spread {
attribute = "${node.unique.id}"
}
ephemeral_disk {
migrate = true
size = 10
sticky = true
}
task "task_one" {
driver = "mock_driver"
kill_timeout = "5s"
config {
exit_code = 0
exit_err_msg = "error on exit"
exit_signal = 9
kill_after = "3s"
run_for = "30s"
signal_error = "got signal"
start_block_for = "1s"
stdout_repeat = 1
stdout_repeat_duration = "10s"
stdout_string = "hello, world!\n"
}
resources {
cpu = 1
memory = 10
}
logs {
disabled = true
max_files = 1
max_file_size = 1
}
}
task "task_two" {
driver = "mock_driver"
kill_timeout = "5s"
config {
exit_code = 0
exit_err_msg = "error on exit"
exit_signal = 9
kill_after = "3s"
run_for = "30s"
signal_error = "got signal"
start_block_for = "1s"
stdout_repeat = 1
stdout_repeat_duration = "10s"
stdout_string = "hello, world!\n"
}
resources {
cpu = 1
memory = 10
}
logs {
disabled = true
max_files = 1
max_file_size = 1
}
}
}
}
```

**benchmark job (no spread)**

```hcl
job "JOBID" {
group "group" {
count = 100
constraint {
distinct_hosts = true
}
ephemeral_disk {
migrate = true
size = 10
sticky = true
}
task "task_one" {
driver = "mock_driver"
kill_timeout = "5s"
config {
exit_code = 0
exit_err_msg = "error on exit"
exit_signal = 9
kill_after = "3s"
run_for = "30s"
signal_error = "got signal"
start_block_for = "1s"
stdout_repeat = 1
stdout_repeat_duration = "10s"
stdout_string = "hello, world!\n"
}
resources {
cpu = 1
memory = 10
}
logs {
disabled = true
max_files = 1
max_file_size = 1
}
}
task "task_two" {
driver = "mock_driver"
kill_timeout = "5s"
config {
exit_code = 0
exit_err_msg = "error on exit"
exit_signal = 9
kill_after = "3s"
run_for = "30s"
signal_error = "got signal"
start_block_for = "1s"
stdout_repeat = 1
stdout_repeat_duration = "10s"
stdout_string = "hello, world!\n"
}
resources {
cpu = 1
memory = 10
}
logs {
disabled = true
max_files = 1
max_file_size = 1
}
}
}
}
```
### Analysis

Overall these tests confirmed that client updates are by far the majority of Raft load during job deployments on an otherwise steady-state cluster where nodes aren't coming or going. All these benchmarks pushed the cluster to near 100% CPU utilization, and a few attempts had to be made to dial in the right number of jobs so that RPC and Raft traffic didn't exceed the available network throughput of the hosts.

The most important thing to look at is whether any of these techniques reduced the load on Raft from client updates. The chart below shows each technique for each benchmark, with client updates written to Raft in the bottom bar and server updates in the top bar. The prioritized updates approach is a clear winner here, reducing client updates by ~10% and overall Raft load by ~6%. (It's also entirely possible we could reduce this further by increasing the wait time for non-critical updates.)

(Note: I've tried my best to use accessible colors here, but it's entirely possible I've been misled by some of the online sources I consulted on how to do so. Please feel free to nudge me in a better direction if you're reading this and having trouble!)

Next I wanted to confirm we weren't reducing the number of Raft events at the cost of latency on those Raft entries. The following graphs show

But Raft latency and throughput don't mean anything if the user experience is bad! So I extracted data from the event stream showing the difference between the
### Next Steps

The next steps for me when I come back on Tuesday after the long weekend are as follows:
A couple of notes about approaches that didn't pan out, so that we don't retread that ground in future investigations...
The allocrunner sends several updates to the server during the early lifecycle of an allocation and its tasks. Clients batch up allocation updates every 200ms, but experiments like the C2M challenge have shown that even with this batching, servers can be overwhelmed with client updates during high-volume deployments. Benchmarking done in #9451 has shown that client updates can easily represent ~70% of all Nomad Raft traffic.

Each allocation sends many updates during its lifetime, but only those that change the `ClientStatus` field are critical for progressing a deployment or kicking off a reschedule to recover from failures. Add a priority to the client allocation sync and update the `syncTicker` receiver so that we only send an update if there's a high-priority update waiting, or on every 5th tick. This means when there are no high-priority updates, the client will send updates at most every 1s instead of every 200ms. Benchmarks have shown this can reduce overall Raft traffic by 10%, as well as reduce client-to-server RPC traffic.

This changeset also switches from a channel-based collection of updates to a shared buffer, so as to split batching from sending and prevent backpressure onto the allocrunner when the RPC is slow. This doesn't have a major performance benefit in the benchmarks, but makes the implementation of the prioritized update simpler.

Fixes: #9451
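For illustration, here is a minimal sketch of the batching logic described above. The `pendingUpdates` type, its field names, and the string payload are placeholders invented for this example, not Nomad's actual client types; only the 200ms tick, the every-5th-tick rule, and the `ClientStatus`-based priority come from the changeset description.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pendingUpdates is an illustrative shared buffer of allocation updates,
// keyed by allocation ID so newer updates overwrite older ones.
type pendingUpdates struct {
	mu     sync.Mutex
	allocs map[string]string // alloc ID -> latest ClientStatus (stand-in payload)
	high   bool              // set when a ClientStatus change is waiting
}

func (p *pendingUpdates) add(id, status string, statusChanged bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.allocs[id] = status
	if statusChanged {
		p.high = true
	}
}

func (p *pendingUpdates) drain() map[string]string {
	p.mu.Lock()
	defer p.mu.Unlock()
	batch := p.allocs
	p.allocs, p.high = make(map[string]string), false
	return batch
}

func (p *pendingUpdates) hasHighPriority() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.high
}

// syncLoop mimics the syncTicker receiver described above: on each 200ms
// tick it sends only if a high-priority update is waiting, or on every 5th
// tick, so low-priority updates go out at most every 1s instead of 200ms.
func syncLoop(p *pendingUpdates, send func(map[string]string), stop <-chan struct{}) {
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()

	ticks := 0
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			ticks++
			if !p.hasHighPriority() && ticks%5 != 0 {
				continue // let low-priority updates accumulate
			}
			if batch := p.drain(); len(batch) > 0 {
				send(batch)
			}
		}
	}
}

func main() {
	p := &pendingUpdates{allocs: make(map[string]string)}
	stop := make(chan struct{})
	go syncLoop(p, func(batch map[string]string) { fmt.Println("sync:", batch) }, stop)

	p.add("alloc-1", "pending", false) // low priority: sent on the 5th tick (~1s)
	time.Sleep(1100 * time.Millisecond)
	p.add("alloc-1", "running", true) // ClientStatus change: sent on the very next tick
	time.Sleep(300 * time.Millisecond)
	close(stop)
}
```

Running this sketch shows the low-priority update waiting for the 1s boundary, while the `ClientStatus` change goes out on the next 200ms tick.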
### Nomad version

Nomad v1.0.0-beta3

### Issue
Load testing revealed pathological behavior in the client loop that sends allocation updates to servers. See #9435 for details.
#9435 is the most naive improvement to the client allocation updating logic. Further improvements could have better performance characteristics while lowering the latency for operations that depend on allocation updates (`nomad alloc status`, deployments, drains, and rescheduling).

### Split batching from sending
As of #9435 the alloc update batching and sending are all performed in a single goroutine. While the code is easy to understand and eases error+retry handling, it potentially causes unnecessary backpressure into alloc updates if the chan's buffer ever fills. Newer updates should always overwrite older updates, so there's no reason to keep a serialized buffer of updates and consume them one at a time.
A more efficient design would be to share the `update` map and directly update it (using a mutex) instead of using a chan for serialization. This removes the potential for processing stale updates and minimizes the number of allocations in memory.
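A rough sketch of that shared-map design, assuming a simplified stand-in for `*structs.Allocation`; the type and method names are invented for illustration:

```go
package allocsync

import "sync"

// Allocation is a simplified stand-in for Nomad's *structs.Allocation.
type Allocation struct {
	ID           string
	ClientStatus string
}

// updateBuffer is shared between the allocrunners (writers) and the sync
// goroutine (reader). A write is just a map assignment under a mutex, so a
// newer update for the same alloc overwrites the stale one and the writer
// never blocks on a full channel.
type updateBuffer struct {
	mu      sync.Mutex
	updates map[string]*Allocation
}

func newUpdateBuffer() *updateBuffer {
	return &updateBuffer{updates: make(map[string]*Allocation)}
}

// Put records the latest known state of an allocation, replacing any
// not-yet-sent update for the same ID.
func (b *updateBuffer) Put(alloc *Allocation) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.updates[alloc.ID] = alloc
}

// Drain hands the current batch to the sender and resets the buffer, so
// batching is decoupled from sending: a slow Node.UpdateAlloc RPC delays
// the next send but never backs up the allocrunners.
func (b *updateBuffer) Drain() []*Allocation {
	b.mu.Lock()
	defer b.mu.Unlock()
	batch := make([]*Allocation, 0, len(b.updates))
	for _, alloc := range b.updates {
		batch = append(batch, alloc)
	}
	b.updates = make(map[string]*Allocation)
	return batch
}
```

Because `Put` is just a map assignment under a mutex, an allocrunner that updates the same allocation twice before a send simply overwrites its earlier entry, which is exactly the "newer updates overwrite older updates" behavior described above.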
### Scale Alloc Update Rate like Heartbeats

The node heartbeat rate is proportional to cluster size: the larger the cluster, the longer between updates. Alloc updates should apply similar logic, although potentially with a small upper bound (1s? 3s?) to cap the staleness of alloc information, which is used for other time-sensitive operations (e.g. deployments and drains).
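As a sketch of what that scaling could look like (the base interval, per-node increment, and 3s cap are placeholder numbers, not values taken from Nomad):

```go
package allocrate

import "time"

// allocUpdateInterval scales the client's update batch interval with
// cluster size, like heartbeat intervals do, but caps it so alloc data
// used by deployments and drains never gets too stale. The constants are
// illustrative placeholders, not Nomad's real tuning.
func allocUpdateInterval(numNodes int) time.Duration {
	const (
		base     = 200 * time.Millisecond // today's fixed batch interval
		perNode  = 1 * time.Millisecond   // grow slowly with cluster size
		maxDelay = 3 * time.Second        // upper bound on staleness
	)

	interval := base + time.Duration(numNodes)*perNode
	if interval > maxDelay {
		return maxDelay
	}
	return interval
}
```

With these placeholder numbers, a 100-node cluster would batch every 300ms, while a 5,000-node cluster would hit the 3s cap.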
### Count alloc updates as heartbeats
Right now alloc updates and heartbeats are entirely separate code paths. It is possible, although it has not been observed, that a node's alloc update succeeds while its heartbeat fails, resulting in a false `down` node.

If alloc updates were counted as heartbeats on both the clients and servers, it could reduce a lot of spurious RPC calls and reduce the opportunities for false `down` detection.
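This idea hasn't been implemented or benchmarked here, but a rough server-side sketch might look like the following; the `tracker` type and `handleUpdateAlloc` helper are invented for illustration and don't reflect Nomad's actual heartbeat code:

```go
package heartbeat

import (
	"sync"
	"time"
)

// tracker is an illustrative stand-in for the servers' per-node heartbeat
// TTL bookkeeping.
type tracker struct {
	mu   sync.Mutex
	seen map[string]time.Time // node ID -> last time we heard from it
	ttl  time.Duration
}

// touch records proof of liveness for a node.
func (t *tracker) touch(nodeID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.seen[nodeID] = time.Now()
}

// isDown reports whether the node's TTL has expired.
func (t *tracker) isDown(nodeID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	last, ok := t.seen[nodeID]
	return !ok || time.Since(last) > t.ttl
}

// handleUpdateAlloc sketches the proposal: a successful alloc update RPC
// from a node also counts as a heartbeat, so the node isn't marked down
// just because a separate heartbeat RPC was lost.
func handleUpdateAlloc(t *tracker, nodeID string, applyUpdates func() error) error {
	if err := applyUpdates(); err != nil {
		return err
	}
	t.touch(nodeID) // count the alloc update as a heartbeat
	return nil
}
```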
### Prioritized updates

Not all alloc updates are equal: task events are purely for human consumption, whereas fields like `Allocation.ClientStatus` affect a wide range of scheduler behavior (deployments, drains, rescheduling, and more).

Therefore we could adjust the batch interval based on the highest-priority update in the current batch. For two allocations placed on the same node, the optimal behavior would be something like:
1. `Received` task event is created for both (2 updates)
2. `Task dir created` event is created for both (2 updates)
3. `Image downloading` event is created for both (2 updates)

(For realistic jobs with artifacts, templates, and deployments there would be a lot more updates.)
The current logic would likely call `Node.UpdateAllocs` at least 3 times. Optimally we would call `Node.UpdateAllocs` 2 times: once after `Image downloading` occurred for both allocs and once after both allocs started.

Steps 1-4 are "low priority" task events: intended only for human consumption, and therefore they can be delayed by hundreds of milliseconds or even seconds!

Steps 5 & 6 are "high priority" task events: they affect deployments and rescheduling, and ideally are only delayed minimally for batching (tens to hundreds of milliseconds).
### Reproduction steps

Produce a high rate of allocation updates on a client, either with low-latency batch jobs or a high number of failing and retrying service jobs. Observe client and server CPU load and `Node.UpdateAlloc` RPC latency.