Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] node has erased its taint,but broadcastjob won't make a pod on that node #1199

Closed
weldonlwz opened this issue Mar 1, 2023 · 3 comments · Fixed by #1204
Closed

[BUG] node has erased its taint,but broadcastjob won't make a pod on that node #1199

weldonlwz opened this issue Mar 1, 2023 · 3 comments · Fixed by #1204
Assignees
Labels
kind/bug Something isn't working kind/good-first-issue Good for newcomers

Comments

@weldonlwz
Copy link
Contributor

weldonlwz commented Mar 1, 2023

What happened:
k8s集群有一个节点有unschedule taint, 此时提交了一个broadcastjob, pod没有在该 node 上创建,这是符合预期的。
但是当该节点taint被消除后,broadcastjob仍然不会在该node上创建pod。

What you expected to happen:
broadcastjob会在节点taint被清除后在该节点上创建pod

How to reproduce it (as minimally and precisely as possible):
cordon一个节点。
提交一个 broadcastjob

apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
  name: busybox
spec:
  template:
    spec:
      containers:
        - name: main
          image: busybox:latest
          imagePullPolicy: IfNotPresent
          command: ["sleep", "100"]        
          resources:
            limits:
              cpu: 10m
              memory: 20Mi
            requests:
              cpu: 10m
              memory: 20Mi
      restartPolicy: OnFailure
  completionPolicy:
    type: Never
  failurePolicy:
    type: Continue

然后再 uncordon 该节点,没有pod被新建出来

Anything else we need to know?:
我分析了代码发现:
broadcastjob_event_handler.go line141这里 canOldNodeFit, err := checkNodeFitness(mockPod, oldNode)
返回err就会直接continue,而不会去检查新的node状态是否已经把taint清除了,这个是不是不妥?

func (p *enqueueBroadcastJobForNode) updateNode(q workqueue.RateLimitingInterface, old, cur runtime.Object) {
	oldNode := old.(*v1.Node)
	curNode := cur.(*v1.Node)
	if shouldIgnoreNodeUpdate(*oldNode, *curNode) {
		return
	}
	jobList := &v1alpha1.BroadcastJobList{}
	err := p.reader.List(context.TODO(), jobList)
	if err != nil {
		klog.Errorf("Error enqueueing broadcastjob on updateNode %v", err)
	}
	for _, bcj := range jobList.Items {
		mockPod := NewMockPod(&bcj, oldNode.Name)
		canOldNodeFit, err := checkNodeFitness(mockPod, oldNode)
		if err != nil {
			klog.Errorf("failed to checkNodeFitness for job %s/%s, on old node %s, %v", bcj.Namespace, bcj.Name, oldNode.Name, err)
			continue
		}

		canCurNodeFit, err := checkNodeFitness(mockPod, curNode)
		if err != nil {
			klog.Errorf("failed to checkNodeFitness for job %s/%s, on cur node %s, %v", bcj.Namespace, bcj.Name, curNode.Name, err)
			continue
		}

		if canOldNodeFit != canCurNodeFit {
			// enqueue the broadcast job for matching node
			q.Add(reconcile.Request{
				NamespacedName: types.NamespacedName{
					Namespace: bcj.Namespace,
					Name:      bcj.Name}})
		}
	}
}

Environment:

  • Kruise version: master branch
  • Kubernetes version: 1.19.16
@weldonlwz weldonlwz added the kind/bug Something isn't working label Mar 1, 2023
@weldonlwz weldonlwz changed the title node has erased its taint,but broadcastjob won't make a pod on that node [BUG] node has erased its taint,but broadcastjob won't make a pod on that node Mar 1, 2023
@veophi
Copy link
Member

veophi commented Mar 3, 2023

@weldonlwz cloud you help us fix it?

@weldonlwz
Copy link
Contributor Author

yes sure

@zmberg
Copy link
Member

zmberg commented Mar 9, 2023

/unassign @FillZpp
/assign @weldonlwz

@kruise-bot kruise-bot assigned weldonlwz and unassigned FillZpp Mar 9, 2023
@zmberg zmberg added this to the v1.4 milestone Mar 9, 2023
@zmberg zmberg added the kind/good-first-issue Good for newcomers label Mar 9, 2023
@zmberg zmberg removed this from the v1.4 milestone Mar 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working kind/good-first-issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants