
Add NodePool support to Cluster resource #188

Merged: 35 commits merged into main from jb/nodepools on Nov 15, 2024

Conversation

birdayz (Contributor) commented Aug 15, 2024

Operator v1: support for multiple Node Pools

Overall design notes

  • Everything is driven by spec.nodePools. The default node pool, driven by the existing fields spec.replicas, spec.storage, etc., is still fully supported. It is automatically injected into the internal data structures as a node pool named "default". This way, the code needs (almost) no special treatment for the default node pool but still supports it in full (see the sketch after this list).
  • Deleted node pools, i.e. node pools removed from spec.nodePools, are handled by detecting their StatefulSet and injecting a synthetic NodePoolSpec into the internal data structures.
  • Replaced AdminAPIFactory with a factory that works with pods instead of ordinals, as we need to support pods with different prefixes (node pools).
  • Reduced cross-node-pool dependencies as much as possible and tried to avoid code that requires the overall replica count. This is not possible everywhere, since, for example, we support only one pod in decommission at a time.
  • Cloud storage percentage handling has been reworked. We no longer want to set the bytes-specific cluster property; instead, only the percentage-based property is used. This avoids problems with multiple node pools: the bytes-based value changes drastically when disk size changes (for example, when moving between different cloud tiers), whereas the percentage-based value stays mostly stable (typically exactly 15%) and will not cause immediate churn on the old node pool if disk size goes down.
  • Previously, if a broker ID was marked for decommission (status.decommissionBrokerID) but the pod for that broker ID was eventually not found, the decommission was flagged as done and the status field cleared. However, this is not safe: in a race condition the broker may still get registered while no longer being decommissioned, making it a ghost broker.
  • Made the feature gate EmptySeedStartCluster mandatory. Clusters <22.3.0 are not supported anymore. This significantly reduces complexity for multi-node-pool support: we would have to start the first node on initial startup and wait for it, but with multiple node pools, all node pools would have to coordinate this. Since such old versions are no longer supported anyway, this support is removed to simplify the code.
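
A minimal sketch of the default-pool injection, using simplified stand-in types rather than the operator's actual API (only the shape of the idea, not the real implementation):

package nodepools

// Simplified stand-in types; the real ClusterSpec / NodePoolSpec carry many
// more fields (storage, resources, etc.).
type NodePoolSpec struct {
    Name     string
    Replicas int32
}

type ClusterSpec struct {
    Replicas  int32
    NodePools []NodePoolSpec
}

// DefaultNodePoolName is the synthetic name given to the pool driven by the
// legacy top-level fields (spec.replicas, spec.storage, ...).
const DefaultNodePoolName = "default"

// getNodePoolsFromSpec returns all pools declared in the spec, with the
// legacy fields injected as a pool named "default", so the rest of the code
// needs no special casing for it.
func getNodePoolsFromSpec(spec ClusterSpec) []NodePoolSpec {
    pools := make([]NodePoolSpec, 0, len(spec.NodePools)+1)
    pools = append(pools, NodePoolSpec{
        Name:     DefaultNodePoolName,
        Replicas: spec.Replicas,
    })
    return append(pools, spec.NodePools...)
}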

Not directly related changes, made to reduce test flakiness:

  • Changed tests to require fewer resources (down from 1G/1 CPU). Some tests run longer, and if two such tests run concurrently (we have a concurrency limit of 2), one of them sometimes gets starved on CPU/memory and fails to schedule until timeout.
  • Bumped the readiness probe timeout to 5s (none was configured, so the default of 1s applied). This helped make the lower resource allocations work. It is also rarely a problem in production, so IMHO it is a no-brainer to bump it a little (see the sketch after this list).
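
For reference, the probe change is only the timeout; a minimal sketch with the corev1 types (handler and other values omitted, as they are unchanged):

import corev1 "k8s.io/api/core/v1"

// Readiness probe with an explicit 5s timeout; previously TimeoutSeconds was
// unset, so the Kubernetes default of 1s applied.
var readinessProbe = &corev1.Probe{
    TimeoutSeconds: 5,
    // ProbeHandler, InitialDelaySeconds, etc. omitted; only the timeout changed.
}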

CLAassistant commented Aug 15, 2024

CLA assistant check
All committers have signed the CLA.

birdayz force-pushed the jb/nodepools branch 5 times, most recently from b4b88ea to 386ec21 on August 20, 2024 17:54
birdayz force-pushed the jb/nodepools branch 3 times, most recently from aa676ba to 50b59d3 on August 26, 2024 13:56
birdayz force-pushed the jb/nodepools branch 7 times, most recently from 55a961b to 19a6c99 on October 9, 2024 07:34
birdayz marked this pull request as ready for review on October 9, 2024 08:13
birdayz requested a review from andrewstucki as a code owner on October 9, 2024 08:13
birdayz changed the title from "[WIP] Add NodePool support to Cluster resource" to "[DO NOT MERGE YET] Add NodePool support to Cluster resource" on Oct 9, 2024
birdayz (Contributor, Author) commented Nov 7, 2024

kind ping for review @chrisseto

// It returns 1 when not initialized (as fresh clusters start from 1 replica)
func (r *Cluster) GetCurrentReplicas() int32 {
if r == nil {
func (r *Cluster) SumNodePoolReplicas() int32 {
Contributor:

nit: This doesn't indicate if it's returning the current replicas or desired replicas. Please add a comment or rename the method.

Contributor Author:

Renamed to GetDesiredReplicas. It looks like it in the diff, but this function is not a replacement for GetCurrentReplicas; this can be seen in metric_controller.go, where a reference to Spec.Replicas is replaced with a call to this function.
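
For context, a rough sketch of what GetDesiredReplicas amounts to, reusing the simplified stand-in types from the sketch in the description (the real method lives on the operator's Cluster API type and may differ in detail):

type Cluster struct {
    Spec ClusterSpec
}

// GetDesiredReplicas (the rename of SumNodePoolReplicas) sums the replicas
// requested across all node pools, including the injected "default" pool.
// It is not a replacement for GetCurrentReplicas, which reflects observed
// rather than desired state.
func (r *Cluster) GetDesiredReplicas() int32 {
    if r == nil {
        return 0
    }
    var total int32
    for _, np := range getNodePoolsFromSpec(r.Spec) {
        total += np.Replicas
    }
    return total
}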

@@ -1422,3 +1419,41 @@ func (r *Cluster) GetDecommissionBrokerID() *int32 {
func (r *Cluster) SetDecommissionBrokerID(id *int32) {
r.Status.DecommissioningNode = id
}

// getNodePoolsFromSpec returns the NodePools defined in the spec.
// This contains the primary NodePool (driven by spec.replicas for example), and the
Contributor:

Suggested change
// This contains the primary NodePool (driven by spec.replicas for example), and the
// This contains the default NodePool (driven by spec.replicas for example), and the

Not sure what the terminology is but let's be consistent. If the primary pool has the name "default", let's rename the constant otherwise update this comment.

Contributor Author:

Right, this is an oversight; "default NodePool" is the correct terminology. Updated the comment.

operator/pkg/nodepools/pools.go (resolved review thread)
"k8s.io/apimachinery/pkg/api/resource"
"k8s.io/utils/ptr"
"sigs.k8s.io/controller-runtime/pkg/client"

Contributor:

nit: extra newline (we'll need to reconfigure our linters).

operator/pkg/nodepools/pools.go (resolved review thread)
operator/pkg/resources/statefulset_update.go (resolved review thread)
Comment on lines +12 to +17
- role: worker
image: kindest/node:v1.29.8@sha256:d46b7aa29567e93b27f7531d258c372e829d7224b25e3fc6ffdefed12476d3aa
- role: worker
image: kindest/node:v1.29.8@sha256:d46b7aa29567e93b27f7531d258c372e829d7224b25e3fc6ffdefed12476d3aa
- role: worker
image: kindest/node:v1.29.8@sha256:d46b7aa29567e93b27f7531d258c372e829d7224b25e3fc6ffdefed12476d3aa
Contributor:

+1 to Rafal. This isn't super sustainable. Can we not just schedule multiple brokers onto a single node for the purpose of testing?

@@ -17,11 +17,14 @@ import (
"io"
"time"

corev1 "k8s.io/api/core/v1"

Contributor:

nit: extra newline

Contributor Author:

fixed

// NodePoolKey is used to document the node pool associated with the StatefulSet.
NodePoolKey = "cluster.redpanda.com/nodepool"

// PodLabelNodeIDKey
Contributor:

nit: incomplete comment. Seems like PodNodeIDKey would be more consistent with the naming convention here? Label is already implied by the package.

Contributor Author:

done

@@ -30,6 +30,11 @@ const (
PartOfKey = "app.kubernetes.io/part-of"
// The tool being used to manage the operation of an application
ManagedByKey = "app.kubernetes.io/managed-by"
// NodePoolKey is used to document the node pool associated with the StatefulSet.
Contributor:

Suggested change
// NodePoolKey is used to document the node pool associated with the StatefulSet.
// NodePoolKey is used to denote the node pool associated with the StatefulSet.

Struggling to explain why "document" isn't the right word here. I think "document" implies a separation between the information and the thing being described. A label on something wouldn't document something about itself but rather denote something about itself.

Contributor Author:

updated

chrisseto (Contributor):

Thanks for the ping :)

chrisseto changed the title from "[DO NOT MERGE YET] Add NodePool support to Cluster resource" to "Add NodePool support to Cluster resource" on Nov 7, 2024
birdayz (Contributor, Author) commented Nov 8, 2024

Thanks for the review! Addressed most of the comments.
Ready for the next round.

Mostly, these are open and awaiting further feedback:

Also waiting for tests to finish; there is some problem with flakiness (here, but also in main).

- Change param to non-pointer, do conversion from pointer at the
  caller's site, where they already checked for nil
- Use loop to simplify
- The term Label is obsolete, as it's in the labels package.
- Update comment
birdayz (Contributor, Author) commented Nov 12, 2024

@chrisseto kind ping for the next round. It would be really important to bring this over the finish line this week!

birdayz force-pushed the jb/nodepools branch 3 times, most recently from ab001c3 to 9e08973 on November 13, 2024 16:08
chrisseto (Contributor) left a comment

Approving so I'm not in a blocking path. Some refactoring of GetNodePools before merging would be greatly appreciated.

I can't seem to find the comment again but I'm still curious about the management of the seed servers list and why it's populated from Status instead of the Spec.

Before merging could you either point me to the answer or comment again?

})
}

// Also add "virtual NodePools" based on StatefulSets found.
Contributor:

Ah, wow, I really misread this function the first time around. I think what's really throwing me for a loop (aside from using side-by-side view in GitHub) is the inconsistency in how the loops inside this loop break out, and the sheer number of them.

Could we pick a single method of iterating and short-circuiting, be it labeled breaks, slices.Contains, or a helper findFunc[T any]([]T, func(T) bool) (*T, bool) function?
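
A generic helper along these lines might look like the following (a sketch of the suggestion, not what the PR ended up using):

// findFunc returns a pointer to the first element of items for which match
// returns true, and a flag indicating whether such an element was found.
func findFunc[T any](items []T, match func(T) bool) (*T, bool) {
    for i := range items {
        if match(items[i]) {
            return &items[i], true
        }
    }
    return nil, false
}

Call sites could then short-circuit with a single if statement instead of nested labeled breaks.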

operator/pkg/nodepools/pools.go (resolved review thread)
operator/pkg/nodepools/pools.go (resolved review thread)
for i := range stsList.Items {
sts := stsList.Items[i]

for _, ownerRef := range sts.OwnerReferences {
Contributor:

I find this check and the later check for the existence of a redpanda container a bit odd. We're already using a label selector to filter our list of Sts's; is there a reason we're performing these extra checks? Are they checking for different things? If they're not, why does one error but the other does not?

I like to avoid "anxious" code when possible as it creates an ethical dilemma when it comes time to refactor. I'll have all the above questions with potentially no one to answer them nor test cases to validate them.

If these checks are in place for a reason, document what exactly they're checking for and ideally the situations in which they might trigger.

If these checks are just anxiety or paranoia that you can't shake, group them together and label them as such. A simple // Paranoid check to make sure we don't delete other STSs can save so much head-aching in the future.

Contributor Author (birdayz), Nov 14, 2024:

  1. Ownerref check

I replaced it with strings.ContainsFunc and improved the comment.

  2. Redpanda container check

There is no guarantee that a redpanda container exists. It's unlikely, but the STS could simply not have it for some reason, so I check for it and error out. It's not specifically paranoid; the container is required to build the NodePoolSpec of that deleted node pool (to extract resource requirements from it). I'll keep that unchanged.
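
Roughly what that check amounts to (an illustrative sketch; the actual function name and error message differ, and the container name "redpanda" is assumed):

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
)

// redpandaContainer looks up the redpanda container in the StatefulSet's pod
// template; it is needed to reconstruct the resource requirements of a
// deleted node pool, so its absence is an error rather than something to
// silently skip.
func redpandaContainer(sts *appsv1.StatefulSet) (*corev1.Container, error) {
    for i := range sts.Spec.Template.Spec.Containers {
        c := &sts.Spec.Template.Spec.Containers[i]
        if c.Name == "redpanda" {
            return c, nil
        }
    }
    return nil, fmt.Errorf("statefulset %s has no redpanda container", sts.Name)
}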

birdayz (Contributor, Author) commented Nov 14, 2024

> I can't seem to find the comment again but I'm still curious about the management of the seed servers list and why it's populated from Status instead of the Spec.

Based on your comment, I refactored it; it is no longer based on status.
See commit 7fce0e8.

It used to use status because the spec does not contain deleted node pools. However, a helper function was later added that always returns all node pools, including the "virtual pools" built from the deleted ones. Now we're using that, and status is not required anymore.
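
A hedged sketch of the resulting shape, again with the simplified stand-in types (helper names and the host format are illustrative assumptions, not the operator's actual naming):

import "fmt"

// allNodePools merges the pools declared in the spec with the "virtual"
// pools reconstructed from StatefulSets whose pool was removed from
// spec.nodePools.
func allNodePools(spec ClusterSpec, virtualPools []NodePoolSpec) []NodePoolSpec {
    return append(getNodePoolsFromSpec(spec), virtualPools...)
}

// seedHosts derives one host entry per replica across all pools. The naming
// scheme here is purely illustrative; the point is that seed servers are
// computed from the full pool list (declared plus virtual) instead of from
// status.
func seedHosts(clusterName string, pools []NodePoolSpec) []string {
    var hosts []string
    for _, p := range pools {
        for ord := int32(0); ord < p.Replicas; ord++ {
            hosts = append(hosts, fmt.Sprintf("%s-%s-%d", clusterName, p.Name, ord))
        }
    }
    return hosts
}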

birdayz merged commit 385cee7 into main on Nov 15, 2024
5 checks passed
RafalKorepta deleted the jb/nodepools branch on December 2, 2024 13:25