
update Scheduled condition when scheduling fails #988

Merged
1 commit merged into karmada-io:master on Dec 7, 2021

Conversation

mrlihanbo

@mrlihanbo mrlihanbo commented Nov 19, 2021

Signed-off-by: lihanbo [email protected]

What type of PR is this?
/kind cleanup

What this PR does / why we need it:

  1. update the Scheduled condition when scheduling fails
  2. remove FirstSchedule; reconcile the schedule once the applied policy differs from the latest propagation policy.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

  conditions:
  - lastTransitionTime: "2021-11-19T07:34:38Z"
    message: 'failed to assignReplicas: failed to scaleUp: clusters resources are
      not enough to schedule, max 6 replicas are support'
    reason: BindingFailedScheduling
    status: "False"
    type: Scheduled

Does this PR introduce a user-facing change?:
"NONE"

@karmada-bot karmada-bot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Nov 19, 2021
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 19, 2021
Status: metav1.ConditionTrue,
Reason: scheduleSuccessReason,
Message: scheduleSuccessMessage,
})
Contributor

Why not make the success condition a function so it can be reused?

@qianjun1993
Contributor

qianjun1993 commented Nov 22, 2021

In the scheduler, there are two reasons that may cause the schedule to fail. One is that the algorithm has an error, and you add the failed condition for that. Another is that the placement has some problems; maybe we should also add a failed condition there?

func (s *Scheduler) scheduleResourceBinding(resourceBinding *workv1alpha2.ResourceBinding) (err error) {
	placement, placementStr, err := s.getPlacement(resourceBinding)
	if err != nil {
		// update condition here ?
		return err
	}

@mrlihanbo
Author

/hold waiting for pr #967 merged, may need to refactor

@karmada-bot karmada-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 24, 2021
@dddddai dddddai mentioned this pull request Nov 26, 2021
@dddddai
Member

dddddai commented Nov 27, 2021

There is one more concern:
If the schedule failed, the binding controller would not sync the binding, which means it would not aggregate status (please refer to #768)

isReady := helper.IsBindingReady(&binding.Status)
if !isReady {
	klog.Infof("ResourceBinding(%s/%s) is not ready to sync", binding.GetNamespace(), binding.GetName())
	return controllerruntime.Result{}, nil
}
return c.syncBinding(binding)

I think we can replace it with

isReady := binding.Annotations[util.PolicyPlacementAnnotation] != ""

@RainbowMango
Member

Ping @mrlihanbo

@mrlihanbo
Author

mrlihanbo commented Nov 30, 2021

/remove-hold @qianjun1993 @dddddai @RainbowMango @Garrybest

@mrlihanbo
Author

/hold cancel

@karmada-bot karmada-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 30, 2021
@mrlihanbo mrlihanbo force-pushed the schedule-condition branch 2 times, most recently from 3d1218c to 6cf7f3f on November 30, 2021 06:48
@dddddai
Member

dddddai commented Nov 30, 2021

As I mentioned above, I guess we should update (cluster) binding controller as well

@mrlihanbo
Author

> As I mentioned above, I guess we should update (cluster) binding controller as well

Sorry, I forgot it.

@mrlihanbo
Author

> As I mentioned above, I guess we should update (cluster) binding controller as well

@dddddai hi, when binding.Annotations[util.PolicyPlacementAnnotation] == "", it may mean matching all clusters. So I chose to restore the original logic: len(targetClusters) != 0

@dddddai
Member

dddddai commented Nov 30, 2021

> @dddddai hi, when binding.Annotations[util.PolicyPlacementAnnotation] == "", it may mean matching all clusters.

I remember the annotation value would be "{}" if the placement is empty.

> So I chose to restore the original logic: len(targetClusters) != 0

There might be a problem, for example:
If there is no cluster that fits an updated propagation policy, len(targetClusters) would be 0, and then the binding controller would not be able to aggregate status.

@mrlihanbo
Author

> @dddddai hi, when binding.Annotations[util.PolicyPlacementAnnotation] == "", it may mean matching all clusters.
>
> I remember the annotation value would be "{}" if the placement is empty.
>
> So I chose to restore the original logic: len(targetClusters) != 0
>
> There might be a problem, for example: If there is no cluster that fits an updated propagation policy, len(targetClusters) would be 0, and then the binding controller would not be able to aggregate status.

If there is no cluster that fits an updated propagation policy, the scheduler will retain the schedule result of the previous propagation policy. Thus, len(targetClusters) won't be 0.

@dddddai
Member

dddddai commented Nov 30, 2021

> If there is no cluster that fits an updated propagation policy, the scheduler will retain the schedule result of the previous propagation policy. Thus, len(targetClusters) won't be 0.

Yes, but I doubt whether that is the expected behavior (it may be fixed in the future), so I was thinking we might want to consider that case:

if len(feasibleClusters) == 0 {
	return result, fmt.Errorf("no clusters fit")
}

Anyway, using len(util.PolicyPlacementAnnotation) == 0 does no harm, does it? Please correct me if I'm wrong.

@mrlihanbo
Author

> If there is no cluster that fits an updated propagation policy, the scheduler will retain the schedule result of the previous propagation policy. Thus, len(targetClusters) won't be 0.
>
> Yes, but I doubt whether that is the expected behavior (it may be fixed in the future), so I was thinking we might want to consider that case:
>
> if len(feasibleClusters) == 0 {
> 	return result, fmt.Errorf("no clusters fit")
> }
>
> Anyway, using len(util.PolicyPlacementAnnotation) == 0 does no harm, does it? Please correct me if I'm wrong.

If a user creates a propagation policy like:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
  namespace: default
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
    namespace: default

With this propagation policy, len(util.PolicyPlacementAnnotation) will be 0, but the workload is actually scheduled to all clusters:

apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  annotations:
    policy.karmada.io/applied-placement: '{}'
  creationTimestamp: "2021-11-30T03:01:30Z"
  finalizers:
  - karmada.io/binding-controller
  generation: 97
  labels:
    propagationpolicy.karmada.io/name: nginx-propagation
    propagationpolicy.karmada.io/namespace: default
  name: nginx-deployment
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: Deployment
    name: nginx
    uid: 96c98f99-49ae-46e7-8bec-af83c9dec13a
  resourceVersion: "54740"
  uid: 29486e35-1dcf-4574-8af7-43cd61d51ae1
spec:
  clusters:
  - name: member1
  - name: member3
  replicaRequirements:
    resourceRequest:
      cpu: "1"
  replicas: 3
  resource:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
    namespace: default
    resourceVersion: "54737"

@mrlihanbo
Author

> With this propagation policy, len(util.PolicyPlacementAnnotation) will be 0, but the workload is actually scheduled to all clusters.

I am not sure though: would the length of policy.karmada.io/applied-placement: '{}' be 0?

@dddddai
Member

dddddai commented Nov 30, 2021

> I am not sure though: would the length of policy.karmada.io/applied-placement: '{}' be 0?

It can't be 0; anyway, if you worry about this, you may use:

_, isReady := binding.Annotations[util.PolicyPlacementAnnotation]

@mrlihanbo
Author

> I am not sure though: would the length of policy.karmada.io/applied-placement: '{}' be 0?
>
> It can't be 0; anyway, if you worry about this, you may use:
>
> _, isReady := binding.Annotations[util.PolicyPlacementAnnotation]

Sounds good to me.

@@ -69,7 +69,7 @@ func (c *ResourceBindingController) Reconcile(ctx context.Context, req controlle
 	return c.removeFinalizer(binding)
 }

-	isReady := helper.IsBindingReady(&binding.Status)
+	_, isReady := binding.Annotations[util.PolicyPlacementAnnotation]
Member
It looks weird: why does having the annotation mean the binding is ready?

Member
@dddddai dddddai Dec 1, 2021

@RainbowMango Precisely, it does NOT mean the schedule succeeded, but that the binding has been scheduled at least once.

We have to aggregate status no matter whether the schedule succeeded or not, don't we?

Member

Please look at the latest code. This check has been postponed to syncBinding.

Comment on lines 398 to 414
defer func() {
	if err != nil {
		failedSchedulingCondition := util.NewCondition(workv1alpha2.Scheduled, scheduleFailedReason, err.Error(), metav1.ConditionFalse)
		if updateErr := s.updateBindingScheduledConditionIfNeeded(rb, failedSchedulingCondition); updateErr != nil {
			klog.Errorf("failed to set failed scheduling condition for binding(%s/%s): %v", rb.Namespace, rb.Name, updateErr)
			err = fmt.Errorf("attempted to set failed scheduling condition after error %v, but failed due to error: %v", err, updateErr)
		}
	} else {
		successSchedulingCondition := util.NewCondition(workv1alpha2.Scheduled, scheduleSuccessReason, scheduleSuccessMessage, metav1.ConditionTrue)
		if updateErr := s.updateBindingScheduledConditionIfNeeded(rb, successSchedulingCondition); updateErr != nil {
			klog.Errorf("failed to set success scheduling condition for binding(%s/%s): %v", rb.Namespace, rb.Name, updateErr)
			err = fmt.Errorf("attempted to set success scheduling condition, but failed due to error: %v", updateErr)
		}
	}
}()

Member
@RainbowMango RainbowMango Dec 1, 2021

Suggested change:

// Update "Scheduled" condition according to schedule result.
defer func() {
	var condition metav1.Condition
	if err == nil {
		condition = util.NewCondition(workv1alpha2.Scheduled, scheduleSuccessReason, scheduleSuccessMessage, metav1.ConditionTrue)
	} else {
		condition = util.NewCondition(workv1alpha2.Scheduled, scheduleFailedReason, err.Error(), metav1.ConditionFalse)
	}
	if updateErr := s.updateBindingScheduledConditionIfNeeded(rb, condition); updateErr != nil {
		klog.Errorf("Failed update condition(%s) for ResourceBinding(%s/%s)", workv1alpha2.Scheduled, rb.Namespace, rb.Name)
		if err == nil {
			// schedule succeed but update condition failed, return err in order to retry in next loop.
			err = updateErr
		}
	}
}()

How about this?

@mrlihanbo mrlihanbo force-pushed the schedule-condition branch 2 times, most recently from 806f6e7 to cb7f351 on December 1, 2021 09:39
// the binding has not been scheduled, need schedule
klog.Infof("Start scheduling ResourceBinding(%s/%s)", namespace, name)
err = s.scheduleResourceBinding(rb)
metrics.BindingSchedule(string(FirstSchedule), metrics.SinceInSeconds(start), err)
Member

@qianjun1993
Here FirstSchedule is merged into ReconcileSchedule because they share the same schedule logic. Are you OK with it?

Contributor

I am OK with that, since they share the same schedule logic now. But as mentioned in #900, there may be some differences between the two kinds of schedule.

@RainbowMango
Member

/assign @qianjun1993 @dddddai
What do you say?

@dddddai
Member

dddddai commented Dec 2, 2021

lgtm

@RainbowMango
Member

/lgtm
/approve

@dddddai
All robot commands start with /.
Here is the command help document.

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Dec 7, 2021
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 7, 2021
@karmada-bot karmada-bot merged commit 09c0449 into karmada-io:master Dec 7, 2021
@mrlihanbo mrlihanbo deleted the schedule-condition branch March 2, 2022 07:31