Horizontally Scalable Controller #9990
Comments
I think your HPA could be trivial to set up - just based on memory usage. |
Jotting down some implementation / architectural notes here. I was interested in implementing this and discussed it a fair bit with @juliev0 offline -- thanks for being a sounding board and finding very relevant edge cases and counter-examples already! 🙂 For searchability purposes, I also defined this as a "sharding" proposal since, more specifically, each replica acts as a shard in this proposal.

### Alex's Proposal

I'm largely in favor of the above proposal, such as using `StatefulSet`s. For reference, Argo CD uses a `StatefulSet` for its sharded application controller as well. There are some edge cases that don't quite work in this proposal, and idealistic improvements I'd like to make if possible as well.

As a simplification, below I will refer to "Workflows", but as Alex stated above, this may include Workflow Pods, WorkflowTemplates, ClusterWorkflowTemplates, etc. If sharding is necessary for these resources, the implementation should work largely identically.

### Edge Cases

The modulus for scale-down is actually a bit too simplistic. Specifically, it will fail when the replicas scale back up. Julie asked about a specific case that turns out to be a very good counter-example: say you scale from 14 -> 2 -> 3.
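To make the counter-example concrete, here's a minimal sketch (an illustration only, assuming the scale-down rule above, i.e. that replica `i` also takes work labeled `r` where `r % replicas == i`) of how ownership of existing labels shifts as the replica count changes:

```go
package main

import "fmt"

// owner returns which replica handles a Workflow labeled with shard r when there
// are n replicas, under the modulus-based scale-down rule.
func owner(r, n int) int { return r % n }

func main() {
	// Workflows were originally labeled for 14 replicas (labels 0-13).
	for _, n := range []int{14, 2, 3} {
		fmt.Printf("replicas=%2d:", n)
		for r := 0; r < 14; r++ {
			fmt.Printf(" %d->%d", r, owner(r, n))
		}
		fmt.Println()
	}
	// A Workflow labeled 4 is handled by replica 4, then replica 0 (4 % 2), then
	// replica 1 (4 % 3): in-progress work changes hands purely because the replica
	// count changed, and around the transition two replicas' selectors can overlap.
}
```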
### CALM Theorem reduction

While thinking through sharding designs, I was specifically looking at possible coordination-free designs (which includes the space of lock-less designs). The CALM theorem in particular is one of my favorite recent reads and a very useful generalization (sometimes referred to as a generalization of CAP). The most basic gist that it boils down to is: if your function is monotonically increasing, then there is a coordination-free design that can implement it (and if there is a coordination-free design, it is not limited by CAP).

This was totally incidental, but the function in Alex's proposal above reduces down to a modulus over the number of replicas, which is not monotonic (its output changes whenever the replica count changes), so it cannot be implemented coordination-free. Or, putting it another way, there must be a "relabeler" or a "reassigner" that coordinates the work assignment when replicas change. Totally unintentional proofs here, but it makes for definitive arguments!

All that being said, there are potentially other ways of formulating this problem that are monotonically increasing and therefore would not require coordination.

### Reassigner details

If we were to go with this proposal, the simplest implementation of a reassigner would not change any in-progress Workflows that have already been assigned to a shard. In the case of scale-down, it would take the no-longer-assigned Workflows and reassign them (i.e. change their label) to existing shards. In the case of scale-up, it would only look at new work and try to balance that out amongst new and existing shards.

This being the simplest implementation, it does not do any rebalancing of work, which could be suboptimal in some situations. Rebalancing, however, would be significantly more complicated to implement, as shards would need to be able to drop working on any in-progress Workflows at any time. There are a lot of possible race conditions to handle there as well.

### Improvements

This space of different problem formulations contains some possible improvements to this proposal.

#### Simplest approach: greedy assignment

The simplest possible coordination-free design follows a classic greedy lock-less design: when a shard is able to take on more work, it checks for unassigned Workflows and attempts to assign them to itself. Whichever shard gets a Workflow first takes it; the rest move on to the next one. This is very similar to lock-less designs where you attempt an operation and retry if it fails, but instead of retrying, the shards in this case just move on. In pseudo-code:

```go
for workflow := range unassignedWorkflows {
	if err := tryAssign(workflow); err != nil {
		woc.log.WithError(err).Debugf("failed to assign Workflow %s to Shard %s", workflow.id, shardId)
		continue
	}
	woc.log.Infof("assigned Workflow %s to Shard %s", workflow.id, shardId)
}
```

#### Problem: Informer watch selectors

The main problem with the greedy assignment is how it interacts with Informers. Informers need to act on a list option (i.e. a label selector), which is fixed when the Informer is created.

#### Possible solutions

##### Cache Eviction

As an Informer is a cache, it's possible that maybe some cache eviction can be done to improve this. But, uh, cache eviction is notoriously a hard problem. I don't know Informers well enough to know if there are potentially some easy options to get around this though.

##### Greedy
|
@agilgur5 there is already a proposed design for horizontal scaling: https://github.com/timebertt/kubernetes-controller-sharding. |
I already posted in #argo-contributors, #argo-wf-contributors, and #argo-sig-scalability. I also presented at the SIG Scalability meeting last week. This is primarily about implementation details, which are not user-facing, so I don't think it is particularly valuable to general users. We already know that users want horizontal scaling as a general feature. |
Thanks for the reference! The design there follows the single |
I totally agree these are implementation details. But we should understand how valuable this feature is to Argo Workflows users. If nobody is using this feature, there is no reason to implement it. I can see only one star. |
The related issues have plenty of upvotes:
Total: ~21 upvotes. Argo CD contributors and maintainers were interested in a similar feature for CD as well. |
Wow, great. Can you create a proposal doc PR with the above details?
|
What would the purpose of a duplicative doc and duplicative diagrams be? Especially as this proposal has already had duplicates as well. The purpose of a proposal is generally for a feature that is not fleshed out, whereas this feature has been fleshed out multiple times at this point (at least 3 times, and all 3 are very similar). That seems very unnecessary and redundant to me, as well as a poor use of already very limited contributor time... @sarabala1979 I would ask again that you read the existing information. I have pointed to the existing information a few times in my responses already. |
It is a shame that this initiative looks like it is going to fizzle out. It would have been a selling point for organizations looking to migrate from Jenkins to Argo Workflows. Jenkins can't scale horizontally, so the solution to handle the load is to spin up another Jenkins controller, which is very burdensome. This would be the number one reason for large deployments of Jenkins to migrate. |
I second @agilgur5 's motion to prioritize this as a feature on the roadmap - there seems to be enough community interest, and all core contributors see the benefit. It would be great if we could organize contributor & maintainer efforts around this for a 3.6 or 3.7 release. Shall we add it to our next Contributor Meeting agenda on Feb 6? cc @Joibel so he's aware as well.

I agree with @ryancurrah that this would help solidify a clear advantage for using Argo for CI over tools like Jenkins that aren't as cloud-native. Argo already outperforms Jenkins on K8s for most use cases, but this would bolster that position while we are seeing dozens of migrations from Jenkins to Argo for CI/CD use cases now. |
Yes, let's discuss this in our next contributors call. I'll review the proposal when I get a chance as well |
I'm a bit worried about reassigning workflows between controller shards in a cluster. At the point reassignment is occurring, we are either scaling up or down.

I like using labels to limit the scope of the informers. I think cache eviction can be done through the informer filterFunc in the right design. It's hard to be sure of the cost of unscoped informers. I'm not sure which informers beyond the workflow informer matter - the pod informer seems the most likely to want the same scoping.

### Database requirement

Almost anyone who's at the scale of needing this would probably already have considered getting a database in play to do status offloading. We could leverage this and require a database for sharded controller co-ordination. It provides nice features for atomic global updates and the like, but I can't see a way for it to properly work with label scoping of the informers. I'm leaving these words in here in case someone else can think of a way to make this work.

### Big shard numbers

If, instead of labelling workflows with a specific shard number that matches the actual number of workflow-controller shards, we run shard numbers up to a much bigger number (`maxVirtualShards`), each controller can watch the set of virtual shards that map to it. For a `maxVirtualShards` of 12 (I'd expect this number to be bigger, but I don't want to make an even bigger table):
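A rough sketch of the idea (assuming, for illustration only, that virtual shard `v` maps to controller `v % numControllers`):

```go
package main

import "fmt"

const maxVirtualShards = 12

// controllerFor maps a virtual shard to a real controller.
// Assumption for illustration: a simple modulus over the live controller count.
func controllerFor(virtualShard, numControllers int) int {
	return virtualShard % numControllers
}

func main() {
	for numControllers := 1; numControllers <= 6; numControllers++ {
		// Which virtual shard labels each controller would watch.
		perController := make([][]int, numControllers)
		for v := 0; v < maxVirtualShards; v++ {
			c := controllerFor(v, numControllers)
			perController[c] = append(perController[c], v)
		}
		fmt.Printf("%d controller(s): %v\n", numControllers, perController)
	}
	// With 5 controllers, controllers 0 and 1 watch 3 virtual shards each while
	// the rest watch 2, since 12 does not divide evenly by 5 -- the imbalance
	// mentioned below.
}
```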
The maximum virtual shard number should be a balance between the number of labels that need to be watched and the expected maximum number of controllers. For small numbers of expected maximum controllers, this stays manageable. In the above example, 5 controllers is bad, because controllers 0 and 1 run more workflows than the others, since 12 doesn't divide evenly by 5. To get it to play nicely for up to 6 controllers, maxVirtualShards would be 6! = 720, which is quite a lot of labels (360) to watch when we're down at 2 controllers. At 1 controller we'd stop using label selection. I don't know whether this is useful or not. |
### Jenkins replacement
Personally I'd be pretty surprised if this was the "number one reason". IMO there are much better reasons to migrate off Jenkins to not just Argo, but plenty of other more modern CI systems.
If that enterprise were to switch to using Argo, it may very well still have multiple controllers, either for dedicated performance for certain teams / departments, or for security purposes to limit blast radius and access. Also, to be clear, Argo does support manual sharding already, which is significantly simpler than setting up new Jenkins controllers -- it requires one flag and a label on your Workflows, that's it.
To Caelan's point though, I do usually expect that cloud-native tools can scale horizontally, so the lack of this was something that surprised me with Argo. To be fair, when it comes to Argo and other k8s projects, sometimes scaling involves a lot more than just more Pods, and requires tuning your k8s control plane quite heavily. Some folks just spin up new clusters to get around this, at which point you're back to square one on needing more controllers (and in a multi-cluster scenario, if that were natively supported a la #3523, which I am still a bit interested in and do still need to give a talk on, it is more performant and less race-heavy to run a controller in each cluster, as I did myself). It depends. I do think there is a use-case for this.
I do also still have a whole window with like 10 tabs open on this, investigating some existing approaches and the informer source code, but the initiative definitely went on hiatus. We did discuss this in SIG Scalability as well and I still need to write some more on that. In an incredibly brief nutshell: CD shards per cluster, but now has multiple algorithms available, which is an interesting and potentially useful touch depending on the characteristics of your usage. CD does not usually have high memory, but high CPU, so it is optimized quite differently and shards quite differently (i.e. not necessarily applicable to Workflows).

### design review
In the "Reassigner details" section, I wrote that to avoid complex rebalancing scenarios (which would be loaded with race conditions), the simplest version would only assign new Workflows to new shards. As I wrote there, that is ofc not ideal for all scenarios, but I think it could still be sufficient for many. Complex rebalancing schemes could be added on top if desired, potentially opt-in with algorithm selection like in CD.
This isn't really specific to the design I wrote above -- this can happen in any horizontally scaled application, which is why configuring HPAs properly is important.
Afaik, the informer `filterFunc` only filters which events get handled; it doesn't limit what the underlying cache stores. Note that cache eviction (or similar cache limitation) is only needed for a full coordination-free & memory-sharded design.
Yes, agreed. I have a sentence, "As a simplification", that ignores these for the purpose of understanding the design, but the Pod informer should only watch for Pods of Workflows that its shard is assigned to. This might not require too much code change. The WorkflowTemplate and ConfigMap informers etc. are significantly smaller and can extend across shards, so I think at least the initial implementation can leave those as-is.
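A rough sketch of what shard-scoped Pod watching could look like with a standard client-go informer factory (assuming the shard label is propagated onto the Pods the controller creates; the `workflows.argoproj.io/shard` key is an assumption, not an existing Argo label):

```go
package sharding

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newShardScopedPodInformer watches only Pods labelled for this shard, so the
// Pod cache (usually the other large informer) is also bounded per shard.
func newShardScopedPodInformer(client kubernetes.Interface, shard int) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		20*time.Minute, // resync period
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = fmt.Sprintf("workflows.argoproj.io/shard=%d", shard)
		}),
	)
	return factory.Core().V1().Pods().Informer()
}
```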
### DB option

Not necessarily. Status offloading is particularly necessary with larger Workflows, but if you have lots of small Workflows, you wouldn't need it. One of the fleets I maintained was exactly this use case.
My entire approach was to be as simple and coordination-free as possible (which are typically better designs, 9 times out of 10, from both a theoretical and practical perspective). Requiring a DB throws that out of the water big time and is a pretty significant infra requirement as well. It also adds a new layer to the failure scenarios and could easily be a single point of failure in and of itself if not designed or scaled correctly. IMO, I don't think this adds much usefulness for the amount of complexity it requires. I also tend to expect that cloud-native tools use k8s resources/etcd effectively and don't usually need separate DBs except for certain features (status offload makes sense, though it could also be split across multiple entries instead of needing a DB, as discussed in #7121 (comment), which would be a nice feature as well; workflow archive is a bigger use-case for a DB IMO).
This would also limit its usefulness to situations where the bottleneck is not memory. |
Some brief thoughts.

### FilterFunc

I was assuming that the informer filters would be rebuilt whenever the shard count changes.

### Resharding

How do you propose to make scaling up work with new workflows being greedily run on the new controllers - or is it just a case where we're hoping the new controller is less busy and so "wins" the race to grab them?

How will acquisition work? We attempt to assign ourselves a workflow by observing an unassigned workflow, putting our shard number onto it, writing it out, and then if that works we've got control and can process it? That seems viable.

### Scale up

In a simplistic model we could emit a metric of the number of currently controlled active workflows and have a maximum that we'll ever attempt to run; the HPA can then know when we're approaching that number and scale up.

### Scale down

I realize oscillating isn't specific to this scenario, but this does seem more likely than normal to cause it, hence wanting to find a way to make it most likely to work well. I am still not sure how you're intending the actual mechanics of reassignment on scale down to work. |
### filter + scale down

Oh. No, I wasn't planning on that and I don't think that would happen without an explicit change. An active informer's shard number and label stay consistent, so it wouldn't rebuild unless we added additional logic having it do so (which I hadn't mentioned).

Were you thinking this was the case because of the above rebuild? Without that, I don't see why this would be more likely. Rebuilds at the shard change boundary would certainly increase load, and hence would be better to avoid, but I didn't even plan for the implementation to do that.

### resharding + scale up
If we can run an assignment controller on each shard (which would be optimal), then yes, effectively that's how a greedy scenario would play out. There are potentially more advanced techniques we could use to help that go as planned, but I was going for the simplest approach first.
Correct, it's a pretty standard lock-less pattern to do that as I mentioned in the proposal.
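For concreteness, a minimal sketch of that acquisition step using optimistic concurrency (the `tryAssign` name comes from the pseudo-code above; the shard label key and the use of a dynamic client here are illustrative assumptions, not the actual implementation):

```go
package sharding

import (
	"context"
	"strconv"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var workflowGVR = schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows"}

// tryAssign attempts to claim an unassigned Workflow for this shard by writing the
// shard label back with the Workflow's current resourceVersion. If another shard
// claimed it first, the update fails with a conflict and the caller simply moves on.
func tryAssign(ctx context.Context, client dynamic.Interface, wf *unstructured.Unstructured, shard int) error {
	labels := wf.GetLabels()
	if labels == nil {
		labels = map[string]string{}
	}
	if _, assigned := labels["workflows.argoproj.io/shard"]; assigned {
		return nil // already claimed (possibly by us)
	}
	labels["workflows.argoproj.io/shard"] = strconv.Itoa(shard)
	wf.SetLabels(labels)
	// resourceVersion is preserved on wf, so a concurrent claim causes a 409 Conflict.
	_, err := client.Resource(workflowGVR).Namespace(wf.GetNamespace()).Update(ctx, wf, metav1.UpdateOptions{})
	if apierrors.IsConflict(err) {
		return err // lost the race; move on to the next unassigned Workflow
	}
	return err
}
```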
This is perhaps overly simplistic in that there are several variables that could determine the maximum (resources given per shard, workers given, etc.). I'm less concerned about the specific implementation details of an HPA; it is primarily important that it can be used consistently per shard, which is a bit more complicated when there is a leader, as that shard will have more load by definition. |
I haven't followed up on this issue in a while, but I did discuss some updates with SIG Scalability last month. Detailing that and some other updates below.

### Greedy
|
We currently run a single controller per cluster or per namespace. Per cluster we only run one and use leader election to provide HA. Per namespace you have to manually configure each namespace (big operational overhead).
Instead, we could provide a way to horizontally scale the workflow controller. To do this, we need to consider what each controller keeps in memory; each controller uses K8s informers to keep data in memory.
Let's assume we use StatefulSets, and each replica is identified by an int from 0 up to the number of replicas (exclusive).
Workflows can be assigned to a replica by labelling them. Then we can use a label filter in the informers so each replica only loads its own work.
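A rough sketch of what that label filter could look like with a dynamic informer factory (the `workflows.argoproj.io/shard` label key and the wiring here are illustrative assumptions, not the actual implementation):

```go
package sharding

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
)

// newShardScopedWorkflowInformer builds an informer factory whose LIST and WATCH
// calls are restricted to this replica's shard label, so the cache only holds
// this replica's Workflows.
func newShardScopedWorkflowInformer(cfg *rest.Config, shard int) (dynamicinformer.DynamicSharedInformerFactory, error) {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(
		client,
		20*time.Minute, // resync period
		metav1.NamespaceAll,
		func(opts *metav1.ListOptions) {
			opts.LabelSelector = fmt.Sprintf("workflows.argoproj.io/shard=%d", shard)
		},
	)
	// Register an informer for the Workflow CRD.
	gvr := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows"}
	factory.ForResource(gvr)
	return factory, nil
}
```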
We need to cater for scale-down, so we should assign each replica the work of `replica % 4` too.

We need to assign work to replicas. We need a special controller to do that. It can listen to workflows without an assigned replica and assign them.
Workflows and templates need to be sticky-assigned. This could be `hash(namespace) % N`. Not a bad first idea, but it does not cater for "hot namespaces", e.g. where a large amount of work goes into one namespace. Work should be distributed round-robin.

Scale-up results in work being spread about.
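A minimal sketch of that special assignment controller's core decision, distributing unassigned Workflows round-robin (the helper and its wiring are illustrative assumptions, not an actual implementation):

```go
package sharding

// assignRoundRobin hands out unassigned Workflows to replicas 0..numReplicas-1 in
// round-robin order, returning the proposed shard for each Workflow name.
// Writing the label back to the cluster (and handling conflicts) is left to the caller.
func assignRoundRobin(unassigned []string, numReplicas int) map[string]int {
	if numReplicas <= 0 {
		return nil
	}
	assignments := make(map[string]int, len(unassigned))
	next := 0
	for _, name := range unassigned {
		assignments[name] = next
		next = (next + 1) % numReplicas
	}
	return assignments
}
```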