argo crashed kubernetes #10441
For @sarabala1979, @juliev0, @dpadhiar and others: we think a potential memory size issue might be related to the workflow queue filling up. There was a question of whether the queue gracefully handles duplicates, and also how we could scale it past a fixed size, potentially by offloading it to an external queue or another Kubernetes deployment.
I am struggling through this problem myself while trying to vet Argo Workflows/Events as a viable solution for a project I am working on. I realized that Argo Workflows itself really can't help here, because creating a new workflow is done against the k8s API, not the workflow controller/Argo Server. But that helps, because it means we can use k8s to put a limiter in place via a ResourceQuota. For example, for testing I set an absurdly low limit of five workflows:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
spec:
  hard:
    count/workflows.argoproj.io: "5"
```

Now when I try to trigger six workflows off events, I see the sixth logged by Argo Events as a failure:
This is great, because it means we can enforce a limit to protect the cluster, but on its own it may be problematic since the event is effectively dropped. I tried setting up retries on the trigger but I haven't got them working yet. I'm also not sold on the retry behavior available with Events even if I can get it working, particularly in this case. It seems like a circuit-breaker pattern would be required if workflow events are so numerous that your cluster would otherwise be falling over. I am also new to NATS JetStream, but it seems like even with indefinite retries, events will still be lost if new events saturate the stream and old ones are evicted before their workflows can be triggered.
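For reference, the trigger-level retry I was attempting looks roughly like this. It's only a sketch: the event source, dependency, and workflow are placeholders, and the retryStrategy fields (steps/duration/factor) are my understanding of the Argo Events trigger backoff, so treat them as an assumption rather than a verified config.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: workflow-sensor            # placeholder name
spec:
  dependencies:
    - name: example-dep            # placeholder event dependency
      eventSourceName: example-source
      eventName: example
  triggers:
    - template:
        name: submit-workflow
        k8s:
          operation: create
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: event-wf-
              spec:
                entrypoint: main
                templates:
                  - name: main
                    container:
                      image: alpine:3.19
                      command: [echo, hello]
      # Retry the trigger when workflow creation fails, e.g. when it is
      # rejected by the ResourceQuota above. Field names and value formats
      # are assumed from the Argo Events trigger spec; values are illustrative.
      retryStrategy:
        steps: 5          # number of retry attempts
        duration: 10s     # initial backoff between attempts
        factor: 2         # exponential backoff multiplier
```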
I am also struggling with this problem. I am trying to trigger 400 concurrent workflows at once by sending messages to GCP Pub/Sub and having Argo Events listen, pull, and submit the workflows. I have the offload database activated and working well. I also set up workflow archiving as well as automated PodGC to get rid of pods as fast as possible after workflow completion. I did multiple tests using semaphores, with reduced or increased concurrency.
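For concreteness, the per-workflow cleanup I'm describing is roughly along these lines (a sketch only; the TTL values here are illustrative, not my exact settings):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: high-volume-    # placeholder
spec:
  entrypoint: main
  # Delete the Workflow object shortly after it finishes so completed
  # workflows don't pile up in etcd (the archive keeps the history).
  ttlStrategy:
    secondsAfterCompletion: 60
    secondsAfterSuccess: 60
    secondsAfterFailure: 300
  # Delete step pods as soon as they complete.
  podGC:
    strategy: OnPodCompletion
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [echo, done]
```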
What I mean by "slowing down" is that k8s is extremely slow to create each step's pods. It just takes ages. Even when enough nodes are pre-provisioned before submitting the workflows, it sometimes takes minutes for pods to really start.

I tried to boost the Argo workflow controller, giving it about 32 CPU and tons of memory (way too much for this test scenario, I know), just to make sure it was not a resource problem on the Argo Workflows side. It did not change a single thing; the CPU and memory footprint stayed pretty low, which is a good thing.

The only progress I could make was by upgrading to the latest version of GKE (1.27), since it brings improvements at the k8s API server level that raise the requests-per-second rate limit. I believe by default it is something like 5 with a burst of 10; now it is 50 with a burst of 100. This has the benefit that more workflows show up as "Running" rather than "Pending" in the Argo Workflows metrics (probably because their requests get accepted faster / more easily), but it did not change a single thing about the time required for a step pod to start. This really seems to be on the k8s side.

Still, Argo Workflows sells the idea that it can run thousands of workflows with thousands of steps at a time. The question is: has anyone out there managed to run a lot of workflows at once (not talking about large workflows) and can help us understand how to address this? As a recap, there are two issues to address: the cluster falling over when too many workflows are created at once, and the very long time it takes for step pods to actually start once a workflow is running.
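One mitigation for the first issue is to cap how many workflows the controller runs at once, so the rest simply wait in Pending rather than hammering the API server. A minimal sketch, assuming the parallelism / namespaceParallelism keys of the workflow-controller-configmap (the numbers are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # Maximum number of Workflows that may run concurrently, cluster-wide.
  parallelism: "60"
  # Optional additional cap per namespace.
  namespaceParallelism: "20"
```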
not a solution... another user shared their cleanup steps on slack:
Does anyone have thoughts on this? I saw @crenshaw-dev mention needing something in front of Argo to prevent an etcd crash: https://stackoverflow.com/a/64105331/8874837
My answer is probably a little out of date. With semaphore support now, it's probably possible to enforce a global parallelism limit. Not exactly a queue, but it has a similar effect on k8s load.
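A rough sketch of that approach, assuming the ConfigMap-keyed semaphore from the synchronization feature (the names and the limit value are placeholders):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: semaphore-config
data:
  workflow: "20"    # at most 20 workflows holding this semaphore run at once
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: semaphore-wf-
spec:
  entrypoint: main
  # Workflow-level semaphore: acts as a rough global cap across all
  # workflows that reference the same ConfigMap key.
  synchronization:
    semaphore:
      configMapKeyRef:
        name: semaphore-config
        key: workflow
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [echo, hello]
```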
#13042 would slightly assist
Pre-requisites
:latest
What happened/what you expected to happen?
parallelism: 60 is set on the controller.
I launched over 13,000 workflows at once (via an Argo Events sensor).
The Argo Workflows UI showed a message about over 500 workflows in a pending state.
As can be seen below, the etcd size has blown out.
But now the whole Kubernetes cluster is unresponsive: kubectl commands time out (argo-linux-amd64 delete --all also times out) and loads of pods are crashing (including the workflow controller). It doesn't seem to have improved after a few hours. Any suggestions on how to clear this up and prevent it in future?

Version
3.4.4
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
n/a
Logs from the workflow controller
Logs from your workflow's wait container