Unable to accelerate cleanup of succeeded workflows with TTL workers #12206
Comments
Here are some further screenshots from our monitoring system. You can see that the number of busy TTL workers never surpasses one, and often no worker is busy at all; we would expect this metric to go up toward the configured number of TTL workers. You can also see the number of succeeded workflows, which are not cleaned up fast enough. Strangely, the metric for the depth of the workflows queue is always at zero, yet we see a high frequency of the "Queueing Succeeded workflow [...] for delete [...] due to TTL" logs.
I'd like to add a point to that. It seems like the workflow garbage collection sometimes just stops for an hour or so, until I delete the workflow-controller pod. Then it works for a while, cleaning up a lot of succeeded workflows, before stopping again. We currently use this as a workaround when too many succeeded workflows have accumulated. Apart from setting [...], I read through some other issues, and it seems to be related to this (or at least the comment under this issue): #4634 (comment)
So this corresponds directly to the number of goroutines launched for the GC controller -- quite literally [...]. Whether it utilizes them effectively, I can't quite say; sounds like it may not, based on your report.
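For context, the sketch below shows the usual client-go controller pattern being described here: the configured worker count becomes the number of goroutines, all draining one shared workqueue. This is an illustration under that assumption, not a verbatim excerpt of the Argo source; `startTTLWorkers` and `processNextItem` are made-up names.

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

// startTTLWorkers sketches the typical controller pattern: the configured
// worker count decides how many goroutines are started, all draining the
// same rate-limited workqueue.
func startTTLWorkers(ttlWorkers int, queue workqueue.RateLimitingInterface, stopCh <-chan struct{}) {
	for i := 0; i < ttlWorkers; i++ {
		go wait.Until(func() {
			for processNextItem(queue) {
			}
		}, time.Second, stopCh)
	}
}

// processNextItem handles a single key from the queue and reports whether
// the worker loop should keep running.
func processNextItem(queue workqueue.RateLimitingInterface) bool {
	key, shutdown := queue.Get()
	if shutdown {
		return false
	}
	defer queue.Done(key)
	// ... look up the Workflow for this key and delete it if its TTL expired ...
	_ = key
	return true
}
```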
I was gonna say to check the [...]. Does the queue depth of the other queues make sense, i.e. [...]?
Yea, I would expect this too... The logic looks correct to me and is near identical to other parts of the codebase 🤔 😕 Same as above: does the [...]?
That's a good note since fully utilized CPU would certainly limit concurrency.
This appears to be a slight misnomer; in the code it refers specifically to the periodicity of the GC for node status offloads. The period for GC of Workflow TTL / retention is located in the same [...].
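To illustrate the distinction, here is a sketch only (the function names, the placeholder periods, and the idea of two plain timer loops are simplifications, not the actual Argo implementation): the two kinds of GC run on separate timers, so tuning one period does not speed up the other.

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// runGCLoops illustrates the point above: node-status-offload GC and
// workflow TTL / retention GC run as independent periodic loops with
// separately configured periods. Both period values are placeholders,
// not Argo's defaults.
func runGCLoops(offloadGCPeriod, ttlGCPeriod time.Duration, stopCh <-chan struct{}) {
	go wait.Until(func() {
		// ... garbage-collect orphaned node status offload records ...
	}, offloadGCPeriod, stopCh)

	go wait.Until(func() {
		// ... find Workflows past their TTL / retention limit and enqueue them for deletion ...
	}, ttlGCPeriod, stopCh)
}
```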
It looks like the user there did not necessarily confirm the fix from #4736, which added a small 1s delay to GC as well as the configurable number of TTL workers.
So this graph does look a little suspicious... it seems to log / run every 15-20 minutes? Any chance you can confirm the exact time period? 20 minutes corresponds to the Workflow Informer's resync period, which is when Informers rebuild their cache. This may be the same root cause as #11948 and therefore fixed by #12133. Can you try the tag published there, [...]?
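For background on what the resync period does, here is a minimal client-go sketch (Pods stand in for the Workflow CRD, and the 20-minute value is just the figure mentioned above): every resync, the shared informer re-delivers all cached objects to `UpdateFunc`, so handlers that enqueue work on updates fire in bursts aligned with the resync.

```go
package sketch

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newResyncingInformer shows what the resync period means: every
// resyncPeriod the shared informer re-delivers all cached objects to
// UpdateFunc, producing periodic bursts of handler activity.
func newResyncingInformer(client kubernetes.Interface, enqueue func(obj interface{})) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactory(client, 20*time.Minute)
	informer := factory.Core().V1().Pods().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) { enqueue(newObj) },
	})
	return informer
}
```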
Thanks for your extensive reply!
Either it does not utilize them effectively, or the metric reports incorrect values.
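On the second possibility, a busy-worker gauge is commonly instrumented along the lines of the generic sketch below (the metric name and the `withBusy` helper are hypothetical, not Argo's actual metrics code): because the gauge only reflects workers that are busy at the instant of a scrape, work items that finish in milliseconds can make a large worker pool look almost idle.

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// workersBusy is a generic busy-worker gauge; the metric name is
// hypothetical and not Argo's actual metric.
var workersBusy = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "example_workers_busy_count",
		Help: "Number of workers currently processing an item.",
	},
	[]string{"worker_type"},
)

func init() {
	prometheus.MustRegister(workersBusy)
}

// withBusy brackets one unit of work. The gauge only captures workers
// that are mid-item when Prometheus scrapes, so short-lived work items
// keep the observed value low even with many workers configured.
func withBusy(workerType string, work func()) {
	workersBusy.WithLabelValues(workerType).Inc()
	defer workersBusy.WithLabelValues(workerType).Dec()
	work()
}
```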
When I look at the [...]
When looking at the [...]
Yes, we have set appropriate [...].
Thanks for the clarification. In the docs for the environment variables, this is described as "The periodicity for GC of workflows", hence the confusion.
Yes, this is very consistently logged in batches every 18-20 minutes. That seems to align with the Workflow Informer's resync period of 20 minutes.
We currently only have this problem in our production environment because of the increased load, so we are naturally very hesitant to test there. I'll get back to you on that, though. Do you have any idea when this fix will be released?
The informer fix has been released as part of 3.4.14 now (and 3.5.2). Are you able to try the new version and see if it is fixed?
So you could check in a staging environment whether the periodic frequency of the "Queueing Succeeded workflow [...] for delete [...] due to TTL" logs changes from roughly every 20 minutes to something more frequent.
I tested 3.4.14 in a staging environment and the frequency of the "Queueing Succeeded workflow [...] for delete [...] due to TTL" logs definitely increased. I don't see this happening periodically any more, that's great! However, the workflow queue depth metric [...]
#12659 means that twice as many pods could be going into the cleanup queue as they should.
What happened/what you expected to happen?
We are experiencing difficulties in increasing the number of TTL (time-to-live) workers to clean up succeeded workflows faster. We have set the --workflow-ttl-workers flag of the controller to higher values, such as 64 or 96, but it appears that the workflow controller does not recognize or utilize these values effectively. Despite configuring a higher number of TTL workers, the Prometheus metric "argo_workflows_workers_busy_count" consistently shows only one busy TTL worker, not the expected 64 or 96.

This is causing a build-up of succeeded workflows over time, currently peaking at around 10,000, because we are creating workflows faster than we can clean them up. Our intention was to leverage multiple TTL workers to accelerate the clean-up process and reduce the backlog of succeeded workflows. It's worth noting that the controller's CPU is only utilized at a fraction of capacity (approximately 5% of a 16-core machine).
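For reference, the data flow being relied on here looks roughly like the following hypothetical sketch (standard library flag package, not the actual Argo CLI wiring): the flag value should simply become the size of the TTL worker pool.

```go
package main

import (
	"flag"
	"fmt"
)

// A hypothetical sketch of the expected data flow: the value of a
// --workflow-ttl-workers style flag ends up as the size of the TTL
// worker pool. This is not the actual Argo CLI code, and the default
// of 4 is a placeholder.
func main() {
	ttlWorkers := flag.Int("workflow-ttl-workers", 4, "number of TTL worker goroutines")
	flag.Parse()

	fmt.Printf("starting %d TTL workers\n", *ttlWorkers)
	// startTTLWorkers(*ttlWorkers, queue, stopCh) // see the worker-pool sketch earlier in the thread
}
```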
Additionally, we are observing frequent logs indicating that workflows are being cleaned up late, as seen below.
Version
3.4.8
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container