
argo crashed kubernetes #10441

Open
2 of 3 tasks
tooptoop4 opened this issue Feb 1, 2023 · 8 comments
Labels
solution/workaround (There's a workaround, might not be great, but exists), type/feature (Feature request)

Comments

@tooptoop4
Contributor

tooptoop4 commented Feb 1, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

parallelism: 60 is set on the controller.
I launched over 13000 workflows at once (via an Argo Events sensor).

The Argo Workflows UI showed a message about over 500 workflows in the Pending state.

As can be seen below, the etcd size has blown out:

[screenshots: etcd database size]

But now the whole Kubernetes cluster is unresponsive, with kubectl commands timing out (argo-linux-amd64 delete --all also times out) and loads of pods crashing (including the workflow controller). It doesn't seem to have improved after a few hours. Any suggestions to clear this up and prevent it in future?
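(For context, the controller-level limit referred to above is configured in the workflow controller's ConfigMap. A minimal sketch only, assuming the stock workflow-controller-configmap in the argo namespace:)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # Maximum number of Workflows the controller will progress concurrently;
  # Workflows beyond this limit sit in the Pending state.
  parallelism: "60"
```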

Version

3.4.4

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@JPZ13
Member

JPZ13 commented Feb 16, 2023

@tooptoop4

  1. Do you have Workflow offloading enabled?
  2. Also, are you noticing the resources for the workflow controller increasing as well, or is it just etcd?
  3. What Kubernetes distribution are you using? Is it a managed offering like EKS, GKE, or AKS? We think this could be related to the Kubernetes API being rate limited, but we want some additional information to determine the root cause.

For @sarabala1979, @juliev0, @dpadhiar and others, we think a potential memory size issue might be related to the workflow queue filling up. There was a question of whether the queue gracefully handles duplicates and also how we could scale it past a fixed size, potentially offloading it to an external queue or another Kubernetes deployment.

@JPZ13 added the P2 label (Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important) and removed the type/bug label Feb 16, 2023
@tooptoop4
Contributor Author

@JPZ13

  1. yes
  2. unfortunately I no longer have those metrics
  3. EKS v1.23

@jkeifer

jkeifer commented Mar 31, 2023

I am struggling through this problem myself while trying to vet Argo Workflows/Events as a viable solution for a project I am working on. I realized that Argo Workflows really can't help here, because creating a new workflow is done against the k8s API, not the workflow controller/Argo server. But that actually helps, because it means we can use k8s itself to put a limiter in place via a ResourceQuota.

For example, for testing I set an absurdly low limit of five workflows:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
spec:
  hard:
    count/workflows.argoproj.io: "5"

Now when I try to trigger six workflows off events, I see the sixth logged by Argo Events as a failure:

namespace=argo-events, sensorName=workflow-builder, triggerName=workflow-orchestrator, level=info, time=2023-03-31T18:12:24Z, msg=creating the object...
namespace=argo-events, sensorName=workflow-builder, triggerName=workflow-orchestrator, level=error, time=2023-03-31T18:12:24Z, msg=Failed to execute a trigger

This is great, because it means we can enforce a limit to protect the cluster, but on its own it may be problematic, as the rejected event is effectively dropped.

I tried setting up retries on the trigger, but I haven't gotten them working yet. I'm also not sold on the retry behavior available with Argo Events even if I can get it working, particularly in this case. It seems like a circuit breaker pattern would be required if workflow events are so numerous that your cluster would otherwise be falling over.
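For reference, this is roughly the shape of the trigger-level retry I have been experimenting with. This is only a sketch: the event source, sensor, and workflow names are made up, the exact trigger fields depend on your Argo Events version, and I haven't confirmed how these retries behave under a real flood:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: workflow-builder
  namespace: argo-events
spec:
  dependencies:
    - name: msg
      eventSourceName: example-source   # hypothetical event source
      eventName: example
  triggers:
    - template:
        name: workflow-orchestrator
        k8s:
          operation: create
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: orchestrated-
              spec:
                entrypoint: main
                templates:
                  - name: main
                    container:
                      image: alpine:3.18
                      command: [echo, hello]
      # Retry the trigger with backoff if the create is rejected,
      # e.g. by the ResourceQuota above.
      retryStrategy:
        steps: 5
        duration: 5s
        factor: 2
```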

I am also new to NATS JetStream, but it seems like even with indefinite retries, events will be lost if new events saturate the stream and old ones are evicted before their workflows can be triggered.

@nicolas-vivot

nicolas-vivot commented Jun 21, 2023

I am also struggling on this problem.

I am trying to trigger 400 concurrent workflows at once, sending messages to GCP Pub/Sub and having Argo Events listen, pull, and submit the workflows.

I have the offload database activated and working well. I also set up workflow archiving as well as automated PodGC to get rid of pods as fast as possible after workflow completion.

I did multiple tests using semaphores, with reduced or increased concurrency:

  • When the semaphore is too low compared to the "flood", Argo Events gets rejected by the Argo Workflows controller and logs some errors. This is pretty similar to what jkeifer observed using a quota instead of workflow semaphores, so this makes sense.
  • When the semaphore is big enough to let the flood through to k8s, the whole API server is "slowing down" somehow.

What I mean by "slowing down" is that k8s is extremely slow to create the steps' pods. It just takes ages. Even when pre-provisioning enough nodes before submitting the workflows, it sometimes takes minutes for pods to really start.

I tried to boost the Argo workflow controller, giving it about 32 CPUs and tons of memory (way too much for this test scenario, I know), just to make sure it was not a matter of resources on the Argo Workflows side. It did not change a single thing; the controller's CPU and memory footprint stayed pretty low anyway, which is a good thing.

The only progress I could make was by upgrading to the latest version of GKE (1.27), since it brings improvements at the k8s API server level that increase the requests-per-second rate limit. I believe by default it was something like 5 with a burst of 10; now it is 50 with a burst of 100.

This has the benefit that more workflows show up in a "running" state rather than "pending" in the Argo Workflows metrics (probably because requests get accepted faster / more easily), but it did not change a single thing about the time required for a step pod to start. This really seems to be on the K8s side.
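As a side note, there is also a client-side throttle inside the workflow controller itself, separate from the server-side limits above: its Kubernetes client is rate limited and can be tuned via the controller's --qps / --burst flags. A rough sketch of the relevant container args (values are illustrative only, and defaults vary by Argo Workflows version):

```yaml
# Fragment of the workflow-controller Deployment's container spec (sketch only).
containers:
  - name: workflow-controller
    args:
      - --qps=50      # client-side queries per second to the k8s API server
      - --burst=100   # short-term burst above the QPS limit
```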

Still, Argo Workflows sells the idea that it can run thousands of workflows with thousands of steps at a time.
Seeing that it can barely handle 400 workflows with just 3 consecutive steps each is a huge disappointment.

The question is: is there anyone out there who has been able to run a lot of workflows at once (not talking about large workflows) and who could help us understand how to address this?

As a recap, there are two issues to address:

  • the ability to have a component on top of Argo Workflows to better control the workload, since this is clearly not the original intention / target of Argo Workflows
  • the ability to put a more reasonable load on the K8s API server, so that pods start fast enough (a couple of seconds is fine, as long as it is not several minutes like now)

@tooptoop4
Contributor Author

tooptoop4 commented Jul 1, 2023

Not a solution... but another user shared their cleanup steps on Slack:

We ended up with the following cleanup process:

  1. Get a list of workflows with k get wf ...; the call would not get everything and would time out at some point, but it would return a large number.
  2. Take the list, chunk it into groups of 200 IDs, and run k delete wf <list_of_200_ids>.
  3. Execute the deletions in multiple tabs (poor man's parallelism).
  4. Repeat from step 1 until there is nothing left.
  5. Start the controller; it worked.
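A rough shell equivalent of that process (a sketch only; the namespace, chunk size, and parallelism are assumptions):

```sh
# Delete workflows in chunks of 200, four chunks at a time.
# Repeat until `kubectl get wf -n argo` returns nothing.
kubectl get wf -n argo -o name \
  | xargs -n 200 -P 4 kubectl delete -n argo --wait=false
```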

@tooptoop4
Contributor Author

tooptoop4 commented Aug 20, 2023

Does anyone think that apiserver_storage_objects{resource="events"} and leases are consuming this space in etcd? i.e. from churning through 10000s of pods and workflows.

I saw @crenshaw-dev mention needing something in front of Argo to prevent an etcd crash: https://stackoverflow.com/a/64105331/8874837
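One way to check that theory, assuming you have permission to read the API server's /metrics endpoint (a sketch; metric availability depends on the Kubernetes version):

```sh
# List apiserver object counts per resource type and pick out the suspects.
kubectl get --raw /metrics \
  | grep '^apiserver_storage_objects' \
  | grep -E 'events|leases|workflows'
```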

@crenshaw-dev
Member

My answer is probably a little out of date. With semaphore support now, it's probably possible to enforce a global parallelism limit. Not exactly a queue, but similar effect on k8s load.
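For anyone landing here later, a minimal sketch of that semaphore approach (the ConfigMap name, key, and limit are made up; this uses the synchronization feature so that only N Workflows hold the semaphore at once and the rest wait):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: semaphore-config
  namespace: argo
data:
  workflow: "50"    # at most 50 Workflows may hold the semaphore at once
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: limited-
  namespace: argo
spec:
  entrypoint: main
  # Workflow-level semaphore: excess Workflows wait instead of all running at once.
  synchronization:
    semaphore:
      configMapKeyRef:
        name: semaphore-config
        key: workflow
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [echo, hello]
```

To make this behave like a global limit, the synchronization block would presumably have to be applied to every Workflow, e.g. via the controller's workflowDefaults, rather than per-Workflow as sketched here.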

@agilgur5 added the solution/workaround label (There's a workaround, might not be great, but exists) Apr 13, 2024
@tooptoop4
Contributor Author

#13042 would slightly assist

6 participants