Master-Worker Distributed processing on Workflow Controller #4318

Closed
sarabala1979 opened this issue Oct 18, 2020 · 7 comments
Labels
area/controller Controller issues, panics type/feature Feature request

Comments

@sarabala1979
Member

sarabala1979 commented Oct 18, 2020

Summary

Currently, a single controller deployment suffers a performance impact when the controller processes 50+ concurrent workflows or many large workflows. There are a few workarounds to mitigate this issue:

  1. Namespaced installation
  2. Install multiple controllers with instanceId
  3. Rely on the max workflow processing deadline, which makes sure every workflow gets processed.

Options 1 and 2 carry a huge operational cost, because multiple deployments have to be maintained (install, upgrade, monitoring).
Option 3 does not help with a really big workflow, which will always hit the processing timeout.

What change needs making?
Support Master-Worker distributed architecture on workflow-controller.

image

Master

1. It will be responsible for distributing workflows to workers.
2. It will maintain the list of active workers for distribution.
3. It will reassign a workflow to an active worker if the previous worker is no longer active.
4. It can act as a worker too.

Worker

  1. Keep reporting its active status to the master
  2. Process the assigned workflows
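
The Master and Worker responsibilities listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the workflow-controller: the `Master` type, its methods, the heartbeat TTL, and the least-loaded assignment rule are all hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Master tracks worker heartbeats and workflow ownership.
// Every name in this sketch is hypothetical, not an Argo Workflows API.
type Master struct {
	ttl        time.Duration        // a worker counts as inactive after this long without a heartbeat
	lastSeen   map[string]time.Time // worker name -> time of last heartbeat
	assignment map[string]string    // workflow name -> owning worker
}

func NewMaster(ttl time.Duration) *Master {
	return &Master{ttl: ttl, lastSeen: map[string]time.Time{}, assignment: map[string]string{}}
}

// Heartbeat records that a worker reported itself active at time now.
func (m *Master) Heartbeat(worker string, now time.Time) { m.lastSeen[worker] = now }

// activeWorkers returns the workers whose last heartbeat is within ttl, sorted by name.
func (m *Master) activeWorkers(now time.Time) []string {
	var active []string
	for w, seen := range m.lastSeen {
		if now.Sub(seen) <= m.ttl {
			active = append(active, w)
		}
	}
	sort.Strings(active)
	return active
}

// Assign returns the worker that owns the workflow. A new workflow, or one whose
// owner is no longer active, goes to the least-loaded active worker (ties go to
// the first name). It returns false when no worker is active.
func (m *Master) Assign(workflow string, now time.Time) (string, bool) {
	active := m.activeWorkers(now)
	if len(active) == 0 {
		return "", false
	}
	// Count workflows currently held by each active worker.
	load := make(map[string]int, len(active))
	for _, w := range active {
		load[w] = 0
	}
	for wf, w := range m.assignment {
		if _, ok := load[w]; ok && wf != workflow {
			load[w]++
		}
	}
	// Keep the current owner while it is still active.
	if owner, ok := m.assignment[workflow]; ok {
		if _, alive := load[owner]; alive {
			return owner, true
		}
	}
	best := active[0]
	for _, w := range active[1:] {
		if load[w] < load[best] {
			best = w
		}
	}
	m.assignment[workflow] = best
	return best, true
}

// at is a small helper that builds a timestamp from a number of seconds.
func at(sec int) time.Time { return time.Unix(int64(sec), 0) }

func main() {
	m := NewMaster(10 * time.Second)
	m.Heartbeat("worker-a", at(0))
	m.Heartbeat("worker-b", at(0))
	w1, _ := m.Assign("wf-1", at(1))
	w2, _ := m.Assign("wf-2", at(1))
	fmt.Println(w1, w2) // worker-a worker-b

	// worker-a stops sending heartbeats; worker-b keeps going.
	m.Heartbeat("worker-b", at(20))
	w1, _ = m.Assign("wf-1", at(21))
	fmt.Println(w1) // worker-b
}
```

In this sketch the master could also act as a worker simply by sending heartbeats under its own name. A real implementation would additionally need leader election and persisted assignments, which are left out here.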

Detailed flow diagram:

image

Use Cases

When would you use this?


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@hadim

hadim commented Oct 18, 2020

Have you made some benchmarks showing the limits to which a given workflow controller can be pushed (for a defined amount of CPU and memory allocated to the controller)? I guess the most pertinent metrics for such benchmarks are the number of workflows and the distribution of the number of nodes per workflow.

@TekTimmy

TekTimmy commented Oct 19, 2020

If I understand it right, this is our concrete current problem.
Our Kubernetes cluster (AWS EKS) starts struggling as soon as we are above ~500 workflows. By struggling I mean that workflows fail with one of the following error messages:

failed to save outputs: the server was unable to return a response in the time allotted, but may still be processing the request (patch pods bx59k-552586920)

failed to save outputs: Patch https://10.100.0.1:443/api/v1/namespaces/v1-0/pods/5r49l-2608568875: net/http: TLS handshake timeout

failed to save outputs: Patch https://10.100.0.1:443/api/v1/namespaces/v1-0/pods/nx5b7-2156588813: http2: server sent GOAWAY and closed the connection; LastStreamID=5, ErrCode=NO_ERROR, debug=""

failed to save outputs: Patch https://10.100.0.1:443/api/v1/namespaces/v1-0/pods/9q96d-4078545465: stream error: stream ID 5; INTERNAL_ERROR

failed to save outputs: Patch https://10.100.0.1:443/api/v1/namespaces/v1-0/pods/ztglr-1531658454: dial tcp 10.100.0.1:443: i/o timeout

failed to save outputs: Patch https://10.100.0.1:443/api/v1/namespaces/v1-0/pods/wg9qh-2738839668: unexpected EOF

In my experience this depends on the number of workflows listed by "argo list" and less on their state, or does it?

@alexec
Contributor

alexec commented Oct 19, 2020

In Argo Events, the master controller effectively creates a single slave controller per namespace. This is something we could consider.

Anyone could write this master controller today and it would be possible to use it with any version of Argo Workflows.

This would be the operator pattern; of course, the CRD you'd be operating on would be v1/Namespace. I bet you could build it in a couple of days.

@alexec
Contributor

alexec commented Oct 19, 2020

@alexec
Contributor

alexec commented Nov 7, 2022

#9990

@agilgur5

I think built-in/automatic sharding -- basically what #9990 proposes -- would be a simpler approach to this and require less change.
I believe Argo CD's controller implements sharding very similarly with a StatefulSet. CD and Workflows do have similar controller/server architectures already. It would be good to have similar scaling mechanisms that would be familiar to users. Would be even better if we could reuse code for that too. Mentioned this in Slack recently as well.

@agilgur5

Reading the diagram specifics, this and #9990 are actually fairly similar proposals with an assignmentController. The main difference is that there is no specifically designated "master" in #9990. But if the "master can also be a worker", then these are mostly the same, just that #9990 has better deployment semantics as it is a single StatefulSet where any replica can become the leader if needed (which is better for HA as well).

My improvements in #9990 (comment) could take it a few steps further as well. If the coordination-free implementation is possible, where no leader is necessary, that would significantly simplify the architecture and effectively make it infinitely scalable.
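
To illustrate what a coordination-free scheme might look like, here is a minimal sketch in which every replica decides locally whether it owns a workflow. It is not taken from #9990 or from Argo CD. It assumes each replica knows the total replica count and can read its own StatefulSet ordinal from its pod name; `shardFor`, `owns`, and `ordinalFromHostname` are hypothetical names, and the hash-modulo rule is just one possible choice.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
	"strings"
)

// shardFor maps a workflow key (for example "namespace/name") to a shard in [0, replicas).
// A replica count below 1 is treated as 1 to avoid a division by zero.
func shardFor(key string, replicas int) int {
	if replicas < 1 {
		replicas = 1
	}
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(replicas))
}

// ordinalFromHostname extracts the StatefulSet ordinal from a pod name
// such as "workflow-controller-2".
func ordinalFromHostname(hostname string) (int, error) {
	i := strings.LastIndex(hostname, "-")
	if i < 0 {
		return 0, fmt.Errorf("hostname %q has no ordinal suffix", hostname)
	}
	return strconv.Atoi(hostname[i+1:])
}

// owns reports whether the replica with the given ordinal should process the workflow.
// No leader is involved: every replica evaluates the same pure function.
func owns(key string, ordinal, replicas int) bool {
	return shardFor(key, replicas) == ordinal
}

func main() {
	// The pod name is hard-coded here; a real replica would read its own hostname.
	ordinal, err := ordinalFromHostname("workflow-controller-2")
	if err != nil {
		panic(err)
	}
	const replicas = 3
	for _, key := range []string{"argo/wf-a", "argo/wf-b", "argo/wf-c", "argo/wf-d"} {
		fmt.Printf("%s -> shard %d (mine: %v)\n", key, shardFor(key, replicas), owns(key, ordinal, replicas))
	}
}
```

One trade-off of plain hash-modulo assignment is that changing the replica count moves most workflows to a different shard, so a real design would probably need consistent hashing or a hand-over step during scaling.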

As such, I'm going to close this out in favor of #9990

5 participants