feat: use rate limited queue #15480
Conversation
Signed-off-by: Soumya Ghosh Dastidar <[email protected]>
Codecov Report

@@            Coverage Diff             @@
##           master   #15480      +/-   ##
==========================================
- Coverage   49.68%   49.67%    -0.02%
==========================================
  Files         267      267
  Lines       46362    46387      +25
==========================================
+ Hits        23036    23043       +7
- Misses      21065    21084      +19
+ Partials     2261     2260       -1

View full report in Codecov by Sentry.
UI has a notable difference. Without rate limiting: Screen.Recording.2023-09-18.at.8.25.11.AM.mov. With rate limiting: Screen.Recording.2023-09-18.at.8.26.19.AM.mov. Can you please check if we can tweak the rate limiting params to avoid it?
@alexmt we change
In addition to reducing the max delay, I think the backoff can also be less aggressive, so that it only throttles apps that have continuous and sustained reconciles, as opposed to apps that have sudden spikes in activity.
@jessesuen I don't think that can be configured directly; one field we can tune for similar results is the
@jannfis your concerns are very valid, but one thing to note: the bucket limiter is there by default, and it is big enough (default bucket size of 500) to handle almost all existing Argo CD users without changing their existing experience. The per-item limiter (the exponential one) is disabled by default and can be enabled per user requirements.
LGTM!
To summarize this PR, what is being introduced are new controller tuning options to have a per-item rate-limited workqueue. This is disabled by default (WORKQUEUE_FAILURE_COOLDOWN_NS=0) and, unless enabled, will behave exactly the same as before. But at least now we have the ability to protect the controller when we hit situations like #15233, where a misbehaving third-party controller could negatively impact the Argo CD application controller due to unthrottled reconciliations.
Now that we will have this in place, the way I see this moving forward, is for people to experiment with values in real-world setups in order to come up with a good default. Once we are comfortable with that number, in future releases we will turn this on by default with some proven values.
@jannfis are your concerns met?
--wq-backoff-factor float    Set Workqueue Per Item Rate Limiter Backoff Factor, default is 1.5 (default 1.5)
--wq-basedelay-ns duration   Set Workqueue Per Item Rate Limiter Base Delay duration in nanoseconds, default 1000000 (1ms) (default 1ms)
--wq-bucket-qps int          Set Workqueue Rate Limiter Bucket QPS, default 50 (default 50)
--wq-bucket-size int         Set Workqueue Rate Limiter Bucket Size, default 500 (default 500)
--wq-cooldown-ns duration    Set Workqueue Per Item Rate Limiter Cooldown duration in ns, default 0 (per item rate limiter disabled)
--wq-maxdelay-ns duration    Set Workqueue Per Item Rate Limiter Max Delay duration in nanoseconds, default 1000000000 (1s) (default 1s)
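As a rough aid for experimenting with these flags, here is a small standalone Go sketch (not Argo CD code) of how the per-item delay would grow with consecutive failures, given a base delay, backoff factor, and max delay like the flag defaults above:

```go
package main

import "fmt"

// perItemDelayMs returns the exponential backoff delay (in milliseconds) for
// an item after `failures` consecutive requeues, capped at maxMs. Hypothetical
// helper for reasoning about flag values, not Argo CD's implementation.
func perItemDelayMs(failures int, baseMs, maxMs, factor float64) float64 {
	d := baseMs
	for i := 0; i < failures; i++ {
		d *= factor
		if d > maxMs {
			return maxMs
		}
	}
	return d
}

func main() {
	// Flag defaults: base 1ms, factor 1.5, max 1s (1000ms).
	for n := 0; n <= 5; n++ {
		fmt.Printf("failures=%d delay=%.2fms\n", n, perItemDelayMs(n, 1, 1000, 1.5))
	}
}
```

With the defaults, the delay stays tiny for occasional requeues and only approaches the 1s cap for items that keep failing, which matches the stated goal of throttling sustained reconcile loops rather than brief spikes.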
IMO we could really use a paragraph or two in the docs explaining these options and how to experiment with them. If we'd like to solicit feedback from the community, we should make it super easy for them to understand how to tune these values so they can report back what they found to be effective.
Where do you think would be a good place to put the doc? A separate file under the operator manual, or in some existing file?
I think the best place would be along where we document other controller tuning variables:
### argocd-application-controller
I know the file is called "high availability," but there are already performance-oriented tuning variables documented and so it wouldn't be totally out of place.
Makes sense, will add the docs for rate limiting there.
Updated the PR to include the docs, PTAL.
Sorry for coming back to this so late. The change overall looks good to me.
@jessesuen Thanks, that makes sense!
* feat: use rate limited queue Signed-off-by: Soumya Ghosh Dastidar <[email protected]>
Fixes #15233
This PR adds a custom rate limiter that is a combination of a BucketRateLimiter and an ItemExponentialRateLimiter, just like the previously used default limiter. However, the custom limiter's exponential backoff auto-resets for an item once the cooldown period is over for that item. This is needed because the default exponential limiter needs a manual Forget() call to reset the backoff time, but in issue #15233, which is caused by external factors, we cannot determine when Forget() should be called, so we delegate it to the limiter to auto-forget once the cooldown period is reached.

Environment variables to configure rate limits:

WORKQUEUE_BUCKET_SIZE: default 500
WORKQUEUE_BUCKET_QPS: default 50
WORKQUEUE_FAILURE_COOLDOWN_NS: default 0s; if the duration is set to 0, individual rate limiting is disabled (the default)
WORKQUEUE_BASE_DELAY_NS: default 1000 (1μs)
WORKQUEUE_MAX_DELAY_NS: default 3s (3 * 10^9)
WORKQUEUE_BACKOFF_FACTOR: default 1.5