Re-think deployment watcher rate limiting #10760
Labels
stage/accepted
Confirmed, and intend to work on. No timeline committment though.
stage/needs-investigation
type/bug
Nomad's deployment watcher monitors progress of deployments using blocking queries; then enqueues evaluations to have the scheduler re-evaluate the deployment job so new allocations can be placed (or allocations can be substituted). The deployment watcher implements a rate limiter that slows the creations of state store blocking queries, which can disrupt service if hit. See #10756 for more info.
The rate limiter only slows down creation of new blocking queries. It does not limit the number of concurrent blocking queries in flight. It simply spreads
The rate limiter benefits are questionable. The state store uses an in-memory immutable radix tree for storing data, which support concurrent multiple-readers even with mutations. I conjecture that blocking queries ongoing costs (e.g. tracking updates) dominate the initial cost of query creation; so it's very odd that Nomad rate limits creations but not in-flight count. The scheduler makes order of magnitude more queries in a typical cluster without having a rate limiter.
The downsides are more obvious. Clusters with large number of deployments risk hitting the rate limit and having an outage, as highlighted in #10756. The conditions where rate limiting takes effect are subtle and unpredictable: it's not simply the count of active deployments; but is the spike of new deployment creations or a burst of updates affecting many deployments. Even though the limit is tunable, operators are ill-equipped to set appropriate limit.
The rate limiter was introduced in the original deployments PR without context (back in the days where ~26k PRs can be merged without review!). The internal RFC did not call it out either.
Possible next steps
We ought to examine its benefit, if it is not established after investigation, the rate limiter should be removed.
Also, we should benchmark/profile the code and consider optimizing it. For example, we note that deployments typically have many correlated events: If a job is placed with high group count, many allocations will be created and have a burst of updates early on in their lifecycle - firing deployments watcher logic on each individual alloc update somewhat wasteful. Possibly having a per-deployment throttle so it process updates in bulk (e.g. all alloc updates within 1s) rather than individually may make sense.
The text was updated successfully, but these errors were encountered: