
argo crashed kubernetes #10441

Open
2 of 3 tasks
tooptoop4 opened this issue Feb 1, 2023 · 8 comments
Labels
solution/workaround (There's a workaround, might not be great, but exists), type/feature (Feature request)

Comments

@tooptoop4
Contributor

tooptoop4 commented Feb 1, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

parallelism: 60 is set on the controller.
I launched over 13000 workflows at once (via an Argo Events sensor).

The Argo Workflows UI showed a message about over 500 workflows in the Pending state.

As can be seen below, the etcd size has blown out:

[screenshots: etcd database size]

But now the whole Kubernetes cluster is unresponsive, with kubectl commands timing out (argo-linux-amd64 delete --all also times out) and loads of pods crashing (including the workflow controller). It doesn't seem to have improved after a few hours. Any suggestions to clear this up and prevent it in future?
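(For context, the controller-level limit referred to above is configured in the workflow controller's ConfigMap. A minimal sketch only, assuming the stock workflow-controller-configmap in the argo namespace:)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # Maximum number of Workflows the controller will progress concurrently;
  # Workflows beyond this limit sit in the Pending state.
  parallelism: "60"
```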

Version

3.4.4

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@JPZ13
Member

JPZ13 commented Feb 16, 2023

@tooptoop4

  1. Do you have Workflow offloading enabled?
  2. Also, are you noticing the resources for the workflow controller increasing as well, or is it just etcd?
  3. What Kubernetes distribution are you using? Is it a managed offering like EKS, GKE, or AKS? We think this could be related to the Kubernetes API being rate limited, but we want some additional information to determine the root cause.

For @sarabala1979, @juliev0, @dpadhiar and others, we think a potential memory size issue might be related to the workflow queue filling up. There was a question of whether the queue gracefully handles duplicates and also how we could scale it past a fixed size, potentially offloading it to an external queue or another Kubernetes deployment.

@JPZ13 added the P2 label (Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important) and removed the type/bug label Feb 16, 2023
@tooptoop4
Contributor Author

@JPZ13

  1. yes
  2. unfortunately I no longer have those metrics
  3. EKS v1.23

@jkeifer

jkeifer commented Mar 31, 2023

I am struggling through this problem myself while trying to vet Argo Workflows/Events as a viable solution for a project I am working on. I realized that Argo Workflows really can't help here, because creating a new workflow is done against the k8s API, not the workflow controller/Argo server. But that actually helps, because it means we can use k8s itself to put a limiter in place via a ResourceQuota.

For example, for testing I set an absurdly low limit of five workflows:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
spec:
  hard:
    count/workflows.argoproj.io: "5"

Now when I try to trigger six workflows off events, I see the sixth logged by Argo Events as a failure:

namespace=argo-events, sensorName=workflow-builder, triggerName=workflow-orchestrator, level=info, time=2023-03-31T18:12:24Z, msg=creating the object...
namespace=argo-events, sensorName=workflow-builder, triggerName=workflow-orchestrator, level=error, time=2023-03-31T18:12:24Z, msg=Failed to execute a trigger

This is great, because it means we can enforce a limit to protect the cluster, but on its own it may be problematic, as the rejected event is effectively dropped.

I tried setting up retries on the trigger, but I haven't gotten them working yet. I'm also not sold on the retry behavior available with Argo Events even if I can get it working, particularly in this case. It seems like a circuit breaker pattern would be required if workflow events are so numerous that your cluster would otherwise be falling over.
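For reference, this is roughly the shape of the trigger-level retry I have been experimenting with. This is only a sketch: the event source, sensor, and workflow names are made up, the exact trigger fields depend on your Argo Events version, and I haven't confirmed how these retries behave under a real flood:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: workflow-builder
  namespace: argo-events
spec:
  dependencies:
    - name: msg
      eventSourceName: example-source   # hypothetical event source
      eventName: example
  triggers:
    - template:
        name: workflow-orchestrator
        k8s:
          operation: create
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: orchestrated-
              spec:
                entrypoint: main
                templates:
                  - name: main
                    container:
                      image: alpine:3.18
                      command: [echo, hello]
      # Retry the trigger with backoff if the create is rejected,
      # e.g. by the ResourceQuota above.
      retryStrategy:
        steps: 5
        duration: 5s
        factor: 2
```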

I am also new to NATS JetStream, but it seems like even with indefinite retries, events will be lost if new events saturate the stream and old ones are evicted before their workflows can be triggered.

@nicolas-vivot

nicolas-vivot commented Jun 21, 2023

I am also struggling on this problem.

I am trying to trigger 400 concurrent workflows at once, sending messages to GCP Pub/Sub and having Argo Events listen, pull, and submit the workflows.

I have the offload database activated and working well. I also set up workflow archiving as well as automated PodGC to get rid of pods as fast as possible after workflow completion.

I did multiple tests using semaphores, with reduced or increased concurrency:

  • When the semaphore is too low compared to the "flood", Argo Events gets rejected by the Argo Workflows controller and logs some errors. This is pretty similar to what jkeifer observed using a quota instead of workflow semaphores, so this makes sense.
  • When the semaphore is big enough to let the flood through to k8s, the whole API server is "slowing down" somehow.

What I mean by "slowing down" is that k8s is extremely slow to create the steps' pods. It just takes ages. Even when pre-provisioning enough nodes before submitting the workflows, it sometimes takes minutes for pods to really start.

I tried to boost the Argo workflow controller, giving it about 32 CPUs and tons of memory (way too much for this test scenario, I know), just to make sure it was not a matter of resources on the Argo Workflows side. It did not change a single thing; the controller's CPU and memory footprint stayed pretty low anyway, which is a good thing.

The only progress I could make was by upgrading to the latest version of GKE (1.27), since it brings improvements at the k8s API server level that increase the requests-per-second rate limit. I believe by default it was something like 5 with a burst of 10; now it is 50 with a burst of 100.

This has the benefit that more workflows show up in a "running" state rather than "pending" in the Argo Workflows metrics (probably because requests get accepted faster / more easily), but it did not change a single thing about the time required for a step pod to start. This really seems to be on the K8s side.
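As a side note, there is also a client-side throttle inside the workflow controller itself, separate from the server-side limits above: its Kubernetes client is rate limited and can be tuned via the controller's --qps / --burst flags. A rough sketch of the relevant container args (values are illustrative only, and defaults vary by Argo Workflows version):

```yaml
# Fragment of the workflow-controller Deployment's container spec (sketch only).
containers:
  - name: workflow-controller
    args:
      - --qps=50      # client-side queries per second to the k8s API server
      - --burst=100   # short-term burst above the QPS limit
```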

Still, Argo Workflows sells the idea that it can run thousands of workflows with thousands of steps at a time.
Seeing that it can barely handle 400 workflows with just 3 consecutive steps each is a huge disappointment.

The question is: is there anyone out there who has been able to run a lot of workflows at once (not talking about large workflows) and who could help us understand how to address this?

As a recap, there are two issues to address:

  • the ability to have a component on top of Argo Workflows to better control the workload, since this is clearly not the original intention / target of Argo Workflows
  • the ability to put a more reasonable load on the K8s API server, so that pods start fast enough (a couple of seconds is fine, as long as it is not several minutes like now)

@tooptoop4
Contributor Author

tooptoop4 commented Jul 1, 2023

Not a solution... but another user shared their cleanup steps on Slack:

We ended up with the following cleanup process:

  1. Get a list of workflows with k get wf ...; the call would not get everything and would time out at some point, but it would return a large number.
  2. Take the list, chunk it into groups of 200 IDs, and run k delete wf <list_of_200_ids>.
  3. Execute the deletions in multiple tabs (poor man's parallelism).
  4. Repeat from step 1 until there is nothing left.
  5. Start the controller; it worked.
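A rough shell equivalent of that process (a sketch only; the namespace, chunk size, and parallelism are assumptions):

```sh
# Delete workflows in chunks of 200, four chunks at a time.
# Repeat until `kubectl get wf -n argo` returns nothing.
kubectl get wf -n argo -o name \
  | xargs -n 200 -P 4 kubectl delete -n argo --wait=false
```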

@tooptoop4
Contributor Author

tooptoop4 commented Aug 20, 2023

Does anyone think that apiserver_storage_objects{resource="events"} and leases are consuming this space in etcd? i.e. from churning through 10000s of pods and workflows.

I saw @crenshaw-dev mention needing something in front of Argo to prevent an etcd crash: https://stackoverflow.com/a/64105331/8874837
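One way to check that theory, assuming you have permission to read the API server's /metrics endpoint (a sketch; metric availability depends on the Kubernetes version):

```sh
# List apiserver object counts per resource type and pick out the suspects.
kubectl get --raw /metrics \
  | grep '^apiserver_storage_objects' \
  | grep -E 'events|leases|workflows'
```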

@crenshaw-dev
Member

My answer is probably a little out of date. With semaphore support now, it's probably possible to enforce a global parallelism limit. Not exactly a queue, but similar effect on k8s load.
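For anyone landing here later, a minimal sketch of that semaphore approach (the ConfigMap name, key, and limit are made up; this uses the synchronization feature so that only N Workflows hold the semaphore at once and the rest wait):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: semaphore-config
  namespace: argo
data:
  workflow: "50"    # at most 50 Workflows may hold the semaphore at once
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: limited-
  namespace: argo
spec:
  entrypoint: main
  # Workflow-level semaphore: excess Workflows wait instead of all running at once.
  synchronization:
    semaphore:
      configMapKeyRef:
        name: semaphore-config
        key: workflow
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [echo, hello]
```

To make this behave like a global limit, the synchronization block would presumably have to be applied to every Workflow, e.g. via the controller's workflowDefaults, rather than per-Workflow as sketched here.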

@agilgur5 added the solution/workaround label (There's a workaround, might not be great, but exists) Apr 13, 2024
@tooptoop4
Contributor Author

#13042 would slightly assist

6 participants