client: always wait 200ms before sending updates #9435

Merged
merged 3 commits into master from f-allocupdate-timer
Dec 1, 2020

Conversation

schmichael
Member

Always wait 200ms before calling the Node.UpdateAlloc RPC to send
allocation updates to servers.

Prior to this change we only reset the update ticker when an error was
encountered. This meant the 200ms ticker was running while the RPC was
being performed. If the RPC was slow due to network latency or server
load and took >=200ms, the ticker would tick during the RPC.

Then on the next loop the select would randomly choose between the
two viable cases: receive an update or fire the RPC again.

If the RPC case won it would immediately loop again due to there being
no updates to send.

When the update chan receive is selected a single update is added to the
slice. The odds are then 50/50 that the subsequent loop will send the
single update instead of receiving any more updates.

This could cause a couple of problems:

  1. Since only a small number of updates are sent, the chan buffer may
    fill, applying backpressure, and slowing down other client
    operations.
  2. The small number of updates sent may already be stale and not
    represent the current state of the allocation locally.

A risk here is that it's hard to reason about how this will interact
with the 50ms batches on servers when the servers are under load.

A further improvement would be to completely remove the alloc update
chan and instead use a mutex to build a map of alloc updates. I wanted
to test the lowest risk possible change on loaded servers first before
making more drastic changes.
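For reference, here is a minimal sketch of the pre-change loop shape in Go. The names (`Allocation`, `allocSyncOld`, `sendRPC`) are illustrative stand-ins, not the actual client identifiers; the point is only that the 200ms ticker keeps running while the RPC is in flight and was only restarted on error.

```go
package main

import "time"

// Allocation stands in for Nomad's allocation struct in this sketch.
type Allocation struct{ ID string }

// allocSyncOld sketches the pre-change behavior: the ticker keeps
// running during the RPC and is only restarted on error.
func allocSyncOld(updates <-chan *Allocation, sendRPC func([]*Allocation) error) {
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()

	var batch []*Allocation
	for {
		select {
		case alloc := <-updates:
			// Batch updates until the ticker fires.
			batch = append(batch, alloc)
		case <-ticker.C:
			if len(batch) == 0 {
				continue
			}
			// If this RPC takes >=200ms, the ticker fires again while
			// we block here, so the next select sees both cases ready.
			if err := sendRPC(batch); err != nil {
				// Only on error was the ticker restarted (the real
				// code stopped and re-created it for a staggered retry).
				ticker.Reset(200 * time.Millisecond)
				continue
			}
			batch = nil
		}
	}
}

func main() {} // sketch only; not wired up
```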

Contributor

@krishicks krishicks left a comment

In looking at the old code vs. the new, it's not clear to me how it is functionally different. It looks like the same behavior occurs, with what used to be a ticker Stop/re-initialization being converted to a Reset.

It is possible I am missing something, however. The difference between Stop/re-initialization and Reset is lost on me, and the godoc doesn't help.

@schmichael
Member Author

schmichael commented Nov 25, 2020

In looking at the old code vs. the new, it's not clear to me how it is functionally different.

It's super subtle and my commit message is probably far too verbose.

The case I ran into was:

  1. client batches 200ms worth of alloc updates by receiving on the update chan until the ticker ticks
  2. ticker ticks, Node.UpdateAlloc is called with the batch of updates
  3. the RPC takes >=200ms (I observed this when a server was under load, but it could also be a slow network or even local load)
  4. while this is happening alloc updates are being buffered in the alloc update chan
  5. Node.UpdateAlloc returns and the for loop continues
  6. Here's the problem: at this point there's a backlog of alloc updates waiting in the update chan, but the ticker has already ticked! Since select randomly chooses between eligible cases we may only receive a single alloc update before calling Node.UpdateAlloc again, completely defeating the intended batching in the client.

If Node.UpdateAlloc continues to take >200ms, we continue to send 1 (or maybe 2 or 3 if we're very lucky) updates per Node.UpdateAlloc call.

But it gets worse! At this point not only might the update chan buffer be full (64 updates), but it's plausible many of the updates are now stale and should be replaced by allocs deeper in the chan's buffer (or blocking on sending to the buffer!).

So this timing approach causes a cascading load effect when servers are under such load that Node.UpdateAlloc takes >200ms: we end up writing more allocs to raft because clients aren't able to coalesce a series of updates to a single allocation into the latest update. Instead each update is applied 1 at a time despite being out of date with reality.
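The "1 at a time" behavior falls out of Go's select choosing pseudo-randomly among ready cases. This small standalone program (not Nomad code; the channel names and sizes are just for illustration, though 64 matches the update chan buffer) simulates a backlogged update chan plus an already-fired "tick" and shows the roughly even split:

```go
package main

import "fmt"

func main() {
	// Simulate a backlogged update chan (buffer of 64).
	updates := make(chan int, 64)
	for i := 0; i < 64; i++ {
		updates <- i
	}

	// Simulate a ticker that already fired during a slow RPC.
	tick := make(chan struct{}, 1)

	received, ticked := 0, 0
	for i := 0; i < 1000; i++ {
		// Keep both cases ready on every iteration.
		select {
		case tick <- struct{}{}:
		default:
		}
		select {
		case <-updates:
			received++
			updates <- 0 // keep the backlog topped up
		case <-tick:
			ticked++
		}
	}

	// Expect a roughly even split: with a tick already pending,
	// batching collapses to ~1 update received per "RPC".
	fmt.Println("updates received:", received, "ticks:", ticked)
}
```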

The fix

Stop including the Node.UpdateAlloc call in the 200ms timer!

That's it. That's the whole fix.

Now it doesn't matter how long Node.UpdateAlloc takes, we will always spend an additional 200ms batching allocation updates before sending them. Since this is such a tight loop a client would have to be under severe load to not consume the entire update chan's buffer (max 64 items) within 200ms, so this change should be sufficient to prevent the applying-stale-updates problem.
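Sketching the post-change loop shape (reusing the illustrative Allocation stand-in and import from the sketch in the description above; the real change in client/client.go differs in details): the 200ms window is restarted after the RPC returns, so it begins after each attempt rather than running alongside it.

```go
// allocSyncNew sketches the post-change behavior: reset the ticker
// after the RPC so a full 200ms batching window follows each attempt.
func allocSyncNew(updates <-chan *Allocation, sendRPC func([]*Allocation) error) {
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()

	var batch []*Allocation
	for {
		select {
		case alloc := <-updates:
			batch = append(batch, alloc)
		case <-ticker.C:
			if len(batch) > 0 {
				if err := sendRPC(batch); err == nil {
					batch = nil
				}
			}
			// Drain any tick that fired while the RPC was in flight
			// (the real code may handle this differently; this is only
			// a sketch), then restart the 200ms window so it starts
			// after the attempt rather than during it.
			select {
			case <-ticker.C:
			default:
			}
			ticker.Reset(200 * time.Millisecond)
		}
	}
}
```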

The downside is that now we're delaying update allocs further -- I observed this RPC taking 1-2.5s in high load tests:
[Graph: max of nomad client update_alloc 95th percentile over time]

This means that alloc updates would be delayed by up to 2700ms in that cluster. That affects not only operators monitoring nomad alloc status but more importantly it affects how quickly the scheduler makes progress on drains, deployments, and rescheduling.

The reason I think this is an ok tradeoff is because it provides natural backpressure on heavily loaded servers. If the server is under so much load that Node.UpdateAlloc takes 2s, it may help alleviate that load to slow down drains, deployments, and rescheduling. The pathological case for heavily loaded servers is two-fold:

  1. A heavily loaded server may be unable to process client heartbeats quickly enough, causing the clients to be treated as down, marking their allocations as lost, and causing a significant amount of extra server load to create and process evaluations. This is the nightmare scenario as it could lead to widespread Nomad and service outages.
  2. A heavily loaded server may be unable to heartbeat raft quickly enough and either drop out as a follower or lose leadership if it is the leader. A follower dropping out means extra load is now put on the other servers, which will likely cause them to fail as well. A leader dropping out similarly creates extra load on the remaining servers.

Effectively Node.UpdateAlloc becomes a gross (as-in naive and coarse-grained... but also a little icky) server->client backpressure mechanism: the slower a server is to respond, the longer clients wait before resubmitting updates, and the more opportunity clients have to batch those updates.

This is not optimal (as alluded to by the "gross" description). Servers should have intrinsic mechanisms for avoiding the 2 pathological overload cases I mention above. They should not rely on well-behaved clients. Ideas here are welcome but outside the scope of this modest improvement.

On the client-side we could be smarter about when we update allocs. I put those ideas in a new issue here: #9451

@schmichael schmichael marked this pull request as ready for review November 25, 2020 20:24
Contributor

@krishicks krishicks left a comment

The update looks good, and makes it more clear that the critical change is removing the if staggered bit.

@schmichael schmichael merged commit c60d9a9 into master Dec 1, 2020
@schmichael schmichael deleted the f-allocupdate-timer branch December 1, 2020 16:45
schmichael added a commit that referenced this pull request Sep 26, 2022
Extension of #14673

Once Vault is initially fingerprinted, extend the period since changes
should be infrequent and the fingerprint is relatively expensive since
it is contacting a central Vault server.

Also move the period timer reset *after* the fingerprint. This is
similar to #9435 where the idea is to ensure the retry period starts
*after* the operation is attempted. 15s will be the *minimum* time
between fingerprints now instead of the *maximum* time between
fingerprints.

In the case of Vault fingerprinting, the original behavior might cause
the following:

1. Timer is reset to 15s
2. Fingerprint takes 16s
3. Timer has already elapsed so we immediately Fingerprint again

Even if fingerprinting Vault only takes a few seconds, that may very
well be due to excessive load and backing off our fingerprints is
desirable. The new behavior ensures we always wait at least 15s between
fingerprint attempts and should allow some natural jittering based on
server load and network latency.
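A minimal sketch of this "reset after the operation" pattern in Go, with hypothetical names (periodicFingerprint and fingerprintVault are stand-ins, not the actual fingerprinter API):

```go
package main

import "time"

// periodicFingerprint sketches the pattern: the next attempt is
// scheduled only after the current one finishes, so 15s is the
// minimum gap between attempts rather than the maximum.
func periodicFingerprint(stopCh <-chan struct{}, fingerprintVault func() error) {
	const period = 15 * time.Second
	timer := time.NewTimer(period)
	defer timer.Stop()

	for {
		select {
		case <-stopCh:
			return
		case <-timer.C:
			// However long this takes (slow Vault, loaded server,
			// slow network), the timer restarts only once it returns.
			_ = fingerprintVault()
			timer.Reset(period)
		}
	}
}

func main() {} // sketch only; not wired up
```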
@github-actions

github-actions bot commented Dec 8, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 8, 2022