Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add annotation to pause autoscaling at a fixed count #2765

Merged
merged 5 commits into from
Apr 14, 2022

Conversation

aryan9600
Copy link
Contributor

@aryan9600 aryan9600 commented Mar 16, 2022

Signed-off-by: Sanskar Jaiswal [email protected]

This PR adds a user configurable annotation autoscaling.keda.sh/paused-replicas, which lets users scale the target to a fixed replica count and pauses autoscaling.

Checklist

  • Commits are signed with Developer Certificate of Origin (DCO - learn more)
  • Tests have been added
  • A PR is opened to update our Helm chart (repo) (if applicable, ie. when deployment manifests are modified)
  • A PR is opened to update the documentation on (repo) (if applicable)
  • Changelog has been updated and is aligned with our changelog requirements

Relates to #944

@aryan9600 aryan9600 marked this pull request as ready for review March 22, 2022 11:02
@aryan9600 aryan9600 requested a review from a team as a code owner March 22, 2022 11:02
@tomkerkhove
Copy link
Member

Thanks for the PR! Would you mind opening a PR for the docs please?

@tomkerkhove
Copy link
Member

Does this support pausing without defining an amount (the scenario with just paused) or only to a fixed amount of replicas?

@aryan9600
Copy link
Contributor Author

It pauses at a fixed amount of replicas (which can be 0 as well)

@JorTurFer
Copy link
Member

Maybe we could add another subsection under https://keda.sh/docs/2.6/concepts/scaling-deployments/

@zroubalik
Copy link
Member

Do you think you can add e2e tests for this feature. A couple of scenarios come to my mind, initiate the scaling then pause to 0 replicas (and N replicas), then disable the pause and check that scaling is working. And another one is to start with paused scaling, then enable.

@aryan9600
Copy link
Contributor Author

I think we can merge these two tests into one. We can have a flow such as:

paused-replicas: 0 -> check if deployment is at 0 replicas -> remove annotation -> check if auto scaling is working -> paused-replicas: 1 -> check if deployment is at 1 replica

@zroubalik
Copy link
Member

zroubalik commented Mar 23, 2022

I think we can merge these two tests into one. We can have a flow such as:

paused-replicas: 0 -> check if deployment is at 0 replicas -> remove annotation -> check if auto scaling is working -> paused-replicas: 1 -> check if deployment is at 1 replica

Yeah, though it would be great if we can add another simple scenario:

standard scaledObject -> check if auto scaling is working -> paused-replicas: N (where N is greater than 1, so for example 3) -> check if deployment is at N replicas -> remove paused-replicas -> check if auto scaling is working 

This way we have scenarios for different paused-replicas (0, 1 and N -> those are all distinct cases) and we have scenarios for creating ScaledObject with and without paused-replicas.

You can use azure-queue test for this, there are already some additional tests around this scaler and it doesn't require creating and deploying the service (as you would have to do with Prometheus for example).

@aryan9600
Copy link
Contributor Author

I think it's redundant to test for 1 replica and N replicas separately. We could just set paused replicas as 3 instead of 1 in the last stage of the flow I suggested. Also, I think it'd be better for this test to be live separately. I was thinking of keeping it minimal using a simple memory scaler.

@zroubalik
Copy link
Member

zroubalik commented Mar 23, 2022

I think it's redundant to test for 1 replica and N replicas separately. We could just set paused replicas as 3 instead of 1 in the last stage of the flow I suggested. Also, I think it'd be better for this test to be live separately. I was thinking of keeping it minimal using a simple memory scaler.

1 and N replicas are distinct cases, because the way we KEDA scales workloads, we should really cover this. And I wasn't suggesting to extend the current azure-queue test but use it as a base. Memory scaler is a special case (it doesn't fully rely on KEDA and external metrics) so please don't use this one.

@tomkerkhove
Copy link
Member

It pauses at a fixed amount of replicas (which can be 0 as well)

Hm, I'd love to have the paused: true annotation as well so that people don't have to check what current count as they just want to pause, regardless of current replica count.

But if I recall correctly @zroubalik and @JorTurFer were not a fan of this?

This is another use-case than specifying the amount imo which is equally important that others raised as well but I don't want to push this

@aryan9600 aryan9600 force-pushed the pause-scaling branch 3 times, most recently from e756d83 to 526c261 Compare March 23, 2022 12:16
@zroubalik
Copy link
Member

zroubalik commented Mar 23, 2022

/run-e2e pause*
Update: You can check the progres here

tests/scalers/pause-at-n.test.ts Outdated Show resolved Hide resolved
tests/scalers/pause-at-n.test.ts Outdated Show resolved Hide resolved
tests/scalers/pause.test.ts Outdated Show resolved Hide resolved
pkg/scaling/executor/scale_scaledobjects.go Outdated Show resolved Hide resolved
pkg/scaling/executor/scale_scaledobjects.go Show resolved Hide resolved
Copy link
Member

@JorTurFer JorTurFer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
Thanks for the contribution! ❤️ ❤️
Please, open a PR documenting the change in keda-docs 🙏

@JorTurFer
Copy link
Member

But if I recall correctly @zroubalik and @JorTurFer were not a fan of this?

you're right. IMO, having only 1 annotation is easier than having more than one, but we can see the feedback from the users about how this approach is and iterate over it

@JorTurFer
Copy link
Member

JorTurFer commented Mar 24, 2022

/run-e2e pause
Update: You can check the progres here

@tomkerkhove
Copy link
Member

But if I recall correctly @zroubalik and @JorTurFer were not a fan of this?

you're right. IMO, having only 1 annotation is easier than having more than one, but we can see the feedback from the users about how this approach is and iterate over it

Different use cases though. My intention of this feature was "stop scaling, regardless of how many that we have now" which is not easy today as you have to check the amount of current replicas but not sure about @zroubalik's thoughts

@zroubalik
Copy link
Member

But if I recall correctly @zroubalik and @JorTurFer were not a fan of this?

you're right. IMO, having only 1 annotation is easier than having more than one, but we can see the feedback from the users about how this approach is and iterate over it

Different use cases though. My intention of this feature was "stop scaling, regardless of how many that we have now" which is not easy today as you have to check the amount of current replicas but not sure about @zroubalik's thoughts

Aren't we overcomplicating the stuff? Getting the current number of replicas is very easy, even if user would like to automate the stuff 🤷‍♂️ I am not against adding a separate option for this, but it shouldn't overcomplicate the UX (too many options might be confusing).

@tomkerkhove
Copy link
Member

I don't see it as too complicated, either you just pause it and don't care or just specify the annotation with an amount next to it.

It's not hard indeed, but another step that some people would want to avoid (see issue)

Anyway let's leave it out and we can add it later on.

Copy link
Member

@zroubalik zroubalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aryan9600 e2e tests are failiing :(

pkg/scaling/executor/scale_scaledobjects.go Show resolved Hide resolved
pkg/scaling/executor/scale_scaledobjects.go Outdated Show resolved Hide resolved
@zroubalik
Copy link
Member

zroubalik commented Apr 6, 2022

/run-e2e azure-queue-pause*
Update: You can check the progres here

@zroubalik
Copy link
Member

@aryan9600 is the problem in the e2e test or on the implementation? ie. if you run the test locally, do you see the failure? If you try to reproduce the behavior manually on some ScaledObject and Deployment, do you see the problem?

@aryan9600
Copy link
Contributor Author

I tried running the test locally, but after a lot of hours spent struggling, I couldn't get Azure storage to work as required on my machine and gave up. I used a CPU scaler to see if KEDA is actually scaling the deployment up when the load increases and that's working fine.

@zroubalik
Copy link
Member

I tried running the test locally, but after a lot of hours spent struggling, I couldn't get Azure storage to work as required on my machine and gave up. I used a CPU scaler to see if KEDA is actually scaling the deployment up when the load increases and that's working fine.

So can you try with another scaler? (not CPU/Mem though)

@aryan9600
Copy link
Contributor Author

@zroubalik works perfectly fine with prometheus as well

@tomkerkhove
Copy link
Member

Let's use prometheus in the e2e tests then 👍

@JorTurFer
Copy link
Member

JorTurFer commented Apr 11, 2022

/run-e2e azure-queue-pause*
Update: You can check the progres here

@zroubalik
Copy link
Member

zroubalik commented Apr 12, 2022

/run-e2e azure-queue-pause*
Update: You can check the progres here

@@ -28,6 +29,8 @@ import (
"sigs.k8s.io/controller-runtime/pkg/client"

kedav1alpha1 "github.com/kedacore/keda/v2/apis/keda/v1alpha1"
kedacontrollerutil "github.com/kedacore/keda/v2/controllers/keda/util"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just my thinking, we should unify the functions for Status & Conditions handling. Referencing controller related stuff here is suboptimal.

There are two locations with similar functions:

func SetStatusConditions(ctx context.Context, client runtimeclient.StatusClient, logger logr.Logger, object interface{}, conditions *kedav1alpha1.Conditions) error {
var patch runtimeclient.Patch
runtimeObj := object.(runtimeclient.Object)
switch obj := runtimeObj.(type) {
case *kedav1alpha1.ScaledObject:
patch = runtimeclient.MergeFrom(obj.DeepCopy())
obj.Status.Conditions = *conditions
case *kedav1alpha1.ScaledJob:
patch = runtimeclient.MergeFrom(obj.DeepCopy())
obj.Status.Conditions = *conditions
default:
err := fmt.Errorf("unknown scalable object type %v", obj)
logger.Error(err, "Failed to patch Objects Status with Conditions")
return err
}
err := client.Status().Patch(ctx, runtimeObj, patch)
if err != nil {
logger.Error(err, "Failed to patch Objects Status with Conditions")
}
return err
}
// UpdateScaledObjectStatus patches the given ScaledObject with the updated status passed to it or returns an error.
func UpdateScaledObjectStatus(ctx context.Context, client runtimeclient.StatusClient, logger logr.Logger, scaledObject *kedav1alpha1.ScaledObject, status *kedav1alpha1.ScaledObjectStatus) error {
patch := runtimeclient.MergeFrom(scaledObject.DeepCopy())
scaledObject.Status = *status
err := client.Status().Patch(ctx, scaledObject, patch)
if err != nil {
logger.Error(err, "Failed to patch ScaledObjects Status")
}
return err
}

func (e *scaleExecutor) updateLastActiveTime(ctx context.Context, logger logr.Logger, object interface{}) error {
var patch runtimeclient.Patch
now := metav1.Now()
runtimeObj := object.(runtimeclient.Object)
switch obj := runtimeObj.(type) {
case *kedav1alpha1.ScaledObject:
patch = runtimeclient.MergeFrom(obj.DeepCopy())
obj.Status.LastActiveTime = &now
case *kedav1alpha1.ScaledJob:
patch = runtimeclient.MergeFrom(obj.DeepCopy())
obj.Status.LastActiveTime = &now
default:
err := fmt.Errorf("unknown scalable object type %v", obj)
logger.Error(err, "Failed to patch Objects Status")
return err
}
err := e.client.Status().Patch(ctx, runtimeObj, patch)
if err != nil {
logger.Error(err, "Failed to patch Objects Status")
}
return err
}
func (e *scaleExecutor) setCondition(ctx context.Context, logger logr.Logger, object interface{}, status metav1.ConditionStatus, reason string, message string, setCondition func(kedav1alpha1.Conditions, metav1.ConditionStatus, string, string)) error {
var patch runtimeclient.Patch
runtimeObj := object.(runtimeclient.Object)
switch obj := runtimeObj.(type) {
case *kedav1alpha1.ScaledObject:
patch = runtimeclient.MergeFrom(obj.DeepCopy())
setCondition(obj.Status.Conditions, status, reason, message)
case *kedav1alpha1.ScaledJob:
patch = runtimeclient.MergeFrom(obj.DeepCopy())
setCondition(obj.Status.Conditions, status, reason, message)
default:
err := fmt.Errorf("unknown scalable object type %v", obj)
logger.Error(err, "Failed to patch Objects Status")
return err
}
err := e.client.Status().Patch(ctx, runtimeObj, patch)
if err != nil {
logger.Error(err, "Failed to patch Objects Status")
}
return err
}
func (e *scaleExecutor) setReadyCondition(ctx context.Context, logger logr.Logger, object interface{}, status metav1.ConditionStatus, reason string, message string) error {
active := func(conditions kedav1alpha1.Conditions, status metav1.ConditionStatus, reason string, message string) {
conditions.SetReadyCondition(status, reason, message)
}
return e.setCondition(ctx, logger, object, status, reason, message, active)
}
func (e *scaleExecutor) setActiveCondition(ctx context.Context, logger logr.Logger, object interface{}, status metav1.ConditionStatus, reason string, message string) error {
active := func(conditions kedav1alpha1.Conditions, status metav1.ConditionStatus, reason string, message string) {
conditions.SetActiveCondition(status, reason, message)
}
return e.setCondition(ctx, logger, object, status, reason, message, active)
}
func (e *scaleExecutor) setFallbackCondition(ctx context.Context, logger logr.Logger, object interface{}, status metav1.ConditionStatus, reason string, message string) error {
fallback := func(conditions kedav1alpha1.Conditions, status metav1.ConditionStatus, reason string, message string) {
conditions.SetFallbackCondition(status, reason, message)
}
return e.setCondition(ctx, logger, object, status, reason, message, fallback)
}

We should refactor this to have it in one place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could move keda/controllers/keda/util/status.go to pkg/util/status.go, this way the SetStatusConditions and UpdateScaldObjectStatus can be used across the project. As for unifying the functions for Status and Conditions handling, I think that should be tackled in a different PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah a change like that. And I concur this should be tackled in a different PR.

@zroubalik
Copy link
Member

zroubalik commented Apr 12, 2022

There are 2 things we should cover, update documentation and implement similar functionality, it should be relatively trivial : a check on the annotation in the ScaleJob controller loop.

@aryan9600
Copy link
Contributor Author

There's a PR open for docs already: https://github.com/kedacore/keda-docs/pull/728/files. I think the only thing left is to update the CHANGELOG, let me do that 👍

@zroubalik
Copy link
Member

Sorry, I missed the docs PR 🤦 😄

@zroubalik
Copy link
Member

Also could you please solve the conflict in the Changelog?

Copy link
Member

@zroubalik zroubalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

CHANGELOG.md Outdated Show resolved Hide resolved
Signed-off-by: Sanskar Jaiswal <[email protected]>
Signed-off-by: Sanskar Jaiswal <[email protected]>
Copy link
Member

@zroubalik zroubalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

waiting for the merge for other to take a look, @kedacore/keda-contributors

@zroubalik
Copy link
Member

zroubalik commented Apr 13, 2022

/run-e2e azure-queue-pause*
Update: You can check the progres here

Copy link
Member

@JorTurFer JorTurFer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
Only small nit in the changelog and one question about e2e tests:
What's the difference between them? I fell pause-at as a case inside pause.
I mean, in the first one you do the normal scaling cycle, and then pause in 5 so the instance count goes to 5, and in the second one, you pause in 1, and the instances goes to 1. In fact, the second one is more complete because it continues after the pause.
Am I missing anything?

Thanks a ton for this amazing feature! ❤️ ❤️ ❤️

CHANGELOG.md Outdated Show resolved Hide resolved
Signed-off-by: Sanskar Jaiswal <[email protected]>
@aryan9600
Copy link
Contributor Author

@JorTurFer yeah there is some overlap between them, but the tests together cover all possible scenarios as suggested by @zroubalik in #2765 (comment)

@JorTurFer
Copy link
Member

I was missing something xD
Thanks for clarifying it! ❤️

@zroubalik zroubalik merged commit 99619e3 into kedacore:main Apr 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants