Master-Worker Distributed processing on Workflow Controller #4318

sarabala1979 · 2020-10-18T03:19:08Z

Summary

Currently, a single-controller deployment suffers degraded performance when the controller is processing 50+ concurrent workflows or many large workflows. There are a few workarounds to mitigate this issue:

1. Namespaced installation
2. Install multiple controllers with instanceId
3. There is a max workflow processing deadline to make sure all workflows get processed.

Options 1 and 2 have a huge operational cost to maintain multiple deployments (install, upgrade, monitoring).
With option 3, a really big workflow will always hit the processing timeout.

What change needs making?
Support Master-Worker distributed architecture on workflow-controller.

Master

1. It will be responsible for distributing workflows to workers.
2. It will maintain the active worker list for distribution.
3. It will reassign a workflow to an active worker if the previous worker is no longer active.
4. It can act as a worker too.

Worker

Keep updating its active status to the master
Process the assigned workflows
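
The Master and Worker responsibilities listed above can be illustrated with a minimal in-memory sketch. It is not taken from the workflow-controller code base: the `Master` class, its method names, the 10-second heartbeat timeout, and the least-loaded assignment policy are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical master: tracks worker heartbeats and workflow assignments."""
    heartbeat_timeout: float = 10.0                   # assumed: seconds until a silent worker counts as inactive
    last_seen: dict = field(default_factory=dict)     # worker name -> time of last heartbeat
    assignments: dict = field(default_factory=dict)   # workflow name -> worker name

    def heartbeat(self, worker, now=None):
        """Called by a worker to report that it is still active."""
        self.last_seen[worker] = time.monotonic() if now is None else now

    def active_workers(self, now=None):
        """Return the sorted names of workers whose heartbeat is recent enough."""
        now = time.monotonic() if now is None else now
        return sorted(w for w, t in self.last_seen.items()
                      if now - t <= self.heartbeat_timeout)

    def assign(self, workflow, now=None):
        """Assign a workflow to the active worker with the fewest workflows."""
        active = self.active_workers(now)
        if not active:
            raise RuntimeError("no active workers")
        load = {w: 0 for w in active}
        for w in self.assignments.values():
            if w in load:                             # ignore workflows held by inactive workers
                load[w] += 1
        worker = min(active, key=lambda w: (load[w], w))
        self.assignments[workflow] = worker
        return worker

    def reassign_orphans(self, now=None):
        """Move workflows away from workers that are no longer active."""
        active = set(self.active_workers(now))
        moved = {}
        for workflow, worker in list(self.assignments.items()):
            if worker not in active:
                moved[workflow] = self.assign(workflow, now)
        return moved
```

In this sketch the master can act as a worker simply by sending heartbeats for its own name, so that it appears in the active worker list like any other worker.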

Detailed flow diagram:

Use Cases

When would you use this?

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

hadim · 2020-10-18T15:23:58Z

Have you run benchmarks showing how far a given workflow controller can be pushed (for a defined amount of CPU and memory allocated to the controller)? I guess the most pertinent metrics for such benchmarks are the number of workflows and the distribution of the number of nodes per workflow.

TekTimmy · 2020-10-19T12:05:55Z

If I understand it right, this is exactly our current problem.
Our Kubernetes cluster (AWS EKS) starts struggling as soon as we are above ~500 workflows. By struggling I mean that workflows fail with one of the following error messages:

failed to save outputs: the server was unable to return a response in the time allotted, but may still be processing the request (patch pods bx59k-552586920)

failed to save outputs: Patch https://10.100.0.1:443/api/v1/namespaces/v1-0/pods/5r49l-2608568875: net/http: TLS handshake timeout

failed to save outputs: Patch https://10.100.0.1:443/api/v1/namespaces/v1-0/pods/nx5b7-2156588813: http2: server sent GOAWAY and closed the connection; LastStreamID=5, ErrCode=NO_ERROR, debug=""

failed to save outputs: Patch https://10.100.0.1:443/api/v1/namespaces/v1-0/pods/9q96d-4078545465: stream error: stream ID 5; INTERNAL_ERROR

failed to save outputs: Patch https://10.100.0.1:443/api/v1/namespaces/v1-0/pods/ztglr-1531658454: dial tcp 10.100.0.1:443: i/o timeout

failed to save outputs: Patch https://10.100.0.1:443/api/v1/namespaces/v1-0/pods/wg9qh-2738839668: unexpected EOF

In my experience this depends on the number of workflows shown by "argo list" and less on their state. Or does it?

alexec · 2020-10-19T15:43:46Z

In Argo Events, the master controller effectively creates a single slave controller per namespace. This is something we could consider.

Anyone could write this master controller today, and it would be possible to use it with any version of Argo Workflows.

This would be the operator pattern; of course, the CRD you'd be operating on would be v1/Namespace. I bet you could build it in a couple of days.
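
The core of such an operator is a reconcile step that compares the namespaces that exist with the per-namespace controllers that are running. A minimal sketch of that comparison follows, using plain lists rather than a Kubernetes client; the `reconcile` function name and its inputs are hypothetical.

```python
def reconcile(namespaces, running_controllers):
    """Return (to_create, to_delete) so each namespace ends up with exactly one controller.

    namespaces: names of the namespaces that currently exist.
    running_controllers: names of the namespaces that already have a controller.
    """
    desired = set(namespaces)
    actual = set(running_controllers)
    to_create = sorted(desired - actual)   # namespaces missing a controller
    to_delete = sorted(actual - desired)   # controllers whose namespace is gone
    return to_create, to_delete


to_create, to_delete = reconcile(["team-a", "team-b"], ["team-b", "team-c"])
print(to_create, to_delete)
```

A real operator would run this on every Namespace add or delete event and then create or remove the corresponding controller Deployments.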

alexec · 2020-10-19T19:13:09Z

kubernetes/kubernetes#24946 (comment)

alexec · 2022-11-07T16:46:20Z

#9990

agilgur5 · 2023-08-22T23:50:51Z

I think built-in/automatic sharding -- basically what #9990 proposes -- would be a simpler approach to this and require less change.
I believe Argo CD's controller implements sharding very similarly with a StatefulSet. CD and Workflows do have similar controller/server architectures already. It would be good to have similar scaling mechanisms that would be familiar to users. Would be even better if we could reuse code for that too. Mentioned this in Slack recently as well.

agilgur5 · 2023-09-24T14:40:35Z

Reading the diagram specifics, this and #9990 are actually fairly similar proposals with an assignmentController. The main difference is that there is no specifically designated "master" in #9990. But if the "master can also be a worker", then these are mostly the same, just that #9990 has better deployment semantics as it is a single StatefulSet where any replica can become the leader if needed (which is better for HA as well).

My improvements in #9990 (comment) could take it a few steps further as well. If the coordination-free implementation is possible, where no leader is necessary, that would significantly simplify the architecture and effectively make it infinitely scalable.
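
One way to picture the coordination-free idea is hash-based shard ownership: each replica decides on its own whether a workflow belongs to it, so no leader has to hand out assignments. The sketch below is an assumption for illustration, not the design in #9990. It relies on StatefulSet pods being named `<name>-<ordinal>`; the function names and the choice of SHA-256 over the workflow UID are hypothetical.

```python
import hashlib


def ordinal_from_pod_name(pod_name: str) -> int:
    """StatefulSet pods are named <name>-<ordinal>; return the ordinal."""
    return int(pod_name.rsplit("-", 1)[1])


def shard_for(workflow_uid: str, replicas: int) -> int:
    """Map a workflow UID to a shard index in [0, replicas) with a stable hash."""
    digest = hashlib.sha256(workflow_uid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % replicas


def owns(pod_name: str, workflow_uid: str, replicas: int) -> bool:
    """True if the replica running in pod_name should process this workflow."""
    return shard_for(workflow_uid, replicas) == ordinal_from_pod_name(pod_name)
```

Every replica evaluates `owns` with the same inputs, so exactly one replica claims each workflow. The trade-off of plain modulo hashing is that changing the replica count moves many workflows between shards; consistent hashing would reduce that movement.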

As such, I'm going to close this out in favor of #9990

sarabala1979 added the type/feature and epic/scaling labels Oct 18, 2020

alexec mentioned this issue Oct 19, 2020

feat(controller): Enable hot-standby. Closes #4077 #4185

Closed

6 tasks

sarabala1979 mentioned this issue Dec 22, 2020

Workflows Operator (ex. Super-controller) #4790

Closed

alexec removed the epic/scaling label Sep 21, 2021

alexec added the area/controller label Feb 7, 2022

agilgur5 closed this as completed Sep 24, 2023
