
ResourceBinding aggregated status is not updated after unjoining a target cluster #768

Closed
dddddai opened this issue Sep 26, 2021 · 12 comments · Fixed by #817
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@dddddai
Member

dddddai commented Sep 26, 2021

What happened:
ResourceBinding aggregated status is not updated after unjoining a target cluster

What you expected to happen:
ResourceBinding aggregated status (along with the workload status) should be updated after unjoining a target cluster

How to reproduce it (as minimally and precisely as possible):
1. Set up environment

root@myserver:~/karmada# hack/local-up-karmada.sh

root@myserver:~/karmada# hack/create-cluster.sh member1 $HOME/.kube/karmada.config

root@myserver:~/karmada# kubectl config use-context karmada-apiserver

root@myserver:~/karmada# karmadactl join member1 --cluster-kubeconfig=$HOME/.kube/karmada.config

root@myserver:~/karmada# kubectl apply -f samples/nginx

root@myserver:~/karmada# kubectl get deploy
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   1/1     1            1           16m

2. Unjoin member1

root@myserver:~/karmada# karmadactl unjoin member1

root@myserver:~/karmada# kubectl get clusters
No resources found

3. Check the ResourceBinding aggregated status and Deployment status

root@myserver:~/karmada# kubectl get deploy
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   1/1     1            1           16m

Both ResourceBinding aggregated status and Deployment status show there is still one ready pod.
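For reference, one way to inspect the aggregated status directly (the binding name default/nginx-deployment is taken from the controller logs below, and the jsonpath assumes the aggregatedStatus field of the ResourceBinding status):

root@myserver:~/karmada# kubectl get resourcebinding nginx-deployment -o jsonpath='{.status.aggregatedStatus}'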

Anything else we need to know?:
It turns out that ResourceBindingController always fails in ensureWork and requeues the request since the target cluster has been removed, so it never gets a chance to aggregate status.

Related snippet:

err = ensureWork(c.Client, workload, c.OverrideManager, binding, apiextensionsv1.NamespaceScoped)
if err != nil {
    klog.Errorf("Failed to transform resourceBinding(%s/%s) to works. Error: %v.",
        binding.GetNamespace(), binding.GetName(), err)
    // Returning here requeues the request, so the aggregation below is never reached.
    return controllerruntime.Result{Requeue: true}, err
}
err = helper.AggregateResourceBindingWorkStatus(c.Client, binding, workload)

Here is the karmada-controller-manager log.

E0926 09:09:13.420846       1 binding_controller.go:97] Failed to transform resourceBinding(default/nginx-deployment) to works. Error: Cluster.cluster.karmada.io "member1" not found.
E0926 09:11:57.314073       1 overridemanager.go:50] Failed to get member cluster: member1, error: Cluster.cluster.karmada.io "member1" not found
E0926 09:11:57.314168       1 common.go:78] Failed to apply overrides for Deployment/default/nginx, err is: Cluster.cluster.karmada.io "member1" not found
E0926 09:11:57.314186       1 binding_controller.go:97] Failed to transform resourceBinding(default/nginx-deployment) to works. Error: Cluster.cluster.karmada.io "member1" not found.
E0926 09:17:24.998822       1 overridemanager.go:50] Failed to get member cluster: member1, error: Cluster.cluster.karmada.io "member1" not found
E0926 09:17:24.998878       1 common.go:78] Failed to apply overrides for Deployment/default/nginx, err is: Cluster.cluster.karmada.io "member1" not found

Environment:

  • Karmada version: v0.8.0
  • Others:
@dddddai added the kind/bug label on Sep 26, 2021
@dddddai
Member Author

dddddai commented Sep 26, 2021

I suppose this is a bug rather than intentional, isn't it? If so, I'm willing to fix it :)

@RainbowMango
Member

Yeah, I think it's a bug too. But any idea about how to fix it?

@dddddai
Member Author

dddddai commented Sep 27, 2021

> Yeah, I think it's a bug too. But any idea about how to fix it?

Please see dddddai@981201c: ResourceBindingController should aggregate status regardless of whether ensureWork succeeded. WDYT?
I have tested this patch and it works.
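
For illustration, a minimal sketch of the idea against the snippet above (it reuses the names from that snippet; the actual commit may differ in details):

// Sketch: don't return early when ensureWork fails; aggregate the work
// statuses first, then requeue with the original error.
ensureErr := ensureWork(c.Client, workload, c.OverrideManager, binding, apiextensionsv1.NamespaceScoped)
if ensureErr != nil {
    klog.Errorf("Failed to transform resourceBinding(%s/%s) to works. Error: %v.",
        binding.GetNamespace(), binding.GetName(), ensureErr)
}
if err := helper.AggregateResourceBindingWorkStatus(c.Client, binding, workload); err != nil {
    return controllerruntime.Result{Requeue: true}, err
}
if ensureErr != nil {
    return controllerruntime.Result{Requeue: true}, ensureErr
}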

@XiShanYongYe-Chang
Member

Is it possible to move the status aggregation into an independent controller?

@dddddai
Member Author

dddddai commented Sep 27, 2021

> Is it possible to move the status aggregation into an independent controller?

Agreed, status aggregation should be separated from ResourceBindingController, just as ExecutionController pairs with WorkStatusController and ClusterController with ClusterStatusController.

May I commit dddddai@981201c first as a quick fix, and move status aggregation to an independent controller in a follow-up?

@dddddai
Member Author

dddddai commented Oct 14, 2021

@RainbowMango @XiShanYongYe-Chang Any ideas about this?
Shall we add (Cluster)ResourceBindingStatusController to aggregate status?

@RainbowMango
Member

RainbowMango commented Oct 14, 2021

Oh, sorry, I missed this issue.

I looked at the commit but can't remember why the Binding controller gets a chance to reconcile after a cluster is removed. Can anyone remind me?

> Shall we add (Cluster)ResourceBindingStatusController to aggregate status?

Perhaps it's not worth introducing a new controller just to aggregate status yet.

Also, I'm wondering: after the cluster has been removed, why doesn't karmada-scheduler clean the cluster out of the binding?

@RainbowMango
Member

/assign @dddddai
As per #768 (comment), and much appreciated!

@dddddai
Member Author

dddddai commented Oct 14, 2021

> I looked at the commit but can't remember why the Binding controller gets a chance to reconcile after a cluster is removed. Can anyone remind me?

@RainbowMango The procedure is as follows:

cluster removed -> execution namespace removed (by ClusterController) -> work removed ->
BindingController.Reconcile -> ensureWork (fails since the target cluster has been removed!) ->
requeue the request (it never reaches helper.AggregateResourceBindingWorkStatus)

As I mentioned above, BindingController always fails in ensureWork and requeues the request since the target cluster has been removed, so it can never reach helper.AggregateResourceBindingWorkStatus.

err = ensureWork(c.Client, workload, c.OverrideManager, binding, apiextensionsv1.NamespaceScoped)
if err != nil {
    klog.Errorf("Failed to transform resourceBinding(%s/%s) to works. Error: %v.",
        binding.GetNamespace(), binding.GetName(), err)
    return controllerruntime.Result{Requeue: true}, err
}
err = helper.AggregateResourceBindingWorkStatus(c.Client, binding, workload)

So the commit delays the requeue, which lets the controller aggregate status before returning.

> Also, I'm wondering: after the cluster has been removed, why doesn't karmada-scheduler clean the cluster out of the binding?

Yes, I've been thinking about this problem too, and what confuses me is that if the scheduler cleaned the cluster out of the binding, BindingController would have no chance to run syncBinding, which means it could not aggregate status...

isReady := helper.IsBindingReady(binding.Spec.Clusters)
if !isReady {
    // An empty cluster list is treated as "not ready", so syncBinding
    // (and the status aggregation inside it) is skipped entirely.
    klog.Infof("ResourceBinding(%s/%s) is not ready to sync", binding.GetNamespace(), binding.GetName())
    return controllerruntime.Result{}, nil
}
return c.syncBinding(binding)

@RainbowMango
Member

Got it. Thanks.
@dddddai Please file a PR with your commit to solve the issue first. Then work out the solution to improve the scheduler.

@RainbowMango
Member

> Yes, I've been thinking about this problem too, and what confuses me is that if the scheduler cleaned the cluster out of the binding, BindingController would have no chance to run syncBinding, which means it could not aggregate status...

Yeah. Perhaps it's time to add a schedule condition, I think; len(status.clusters) == 0 would then be a normal schedule result.
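
For illustration only, such a condition might look roughly like this (the condition type, reason, and message are hypothetical, not a settled API; it assumes the binding status carries a standard metav1.Condition list and uses meta.SetStatusCondition from k8s.io/apimachinery/pkg/api/meta):

// Hypothetical "Scheduled" condition set by karmada-scheduler after every
// scheduling pass, so an empty target cluster list still counts as a result.
meta.SetStatusCondition(&binding.Status.Conditions, metav1.Condition{
    Type:    "Scheduled",
    Status:  metav1.ConditionTrue,
    Reason:  "ScheduleSucceeded",
    Message: "scheduling completed; an empty cluster list is a valid result",
})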

@RainbowMango
Member

@dddddai I created #821 to track the left issues. Feel free to take it if you are interested in it.
