[BUG] node has erased its taint，but broadcastjob won't make a pod on that node #1199

weldonlwz · 2023-03-01T10:00:02Z

What happened:
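A node in the k8s cluster has the unschedulable taint. A BroadcastJob submitted at this point does not create a pod on that node, which is expected.
However, after the taint is removed from the node, the BroadcastJob still does not create a pod on it.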

What you expected to happen:
The BroadcastJob should create a pod on the node once the node's taint has been removed.

How to reproduce it (as minimally and precisely as possible):
Cordon a node.
Submit a BroadcastJob:

apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
  name: busybox
spec:
  template:
    spec:
      containers:
        - name: main
          image: busybox:latest
          imagePullPolicy: IfNotPresent
          command: ["sleep", "100"]        
          resources:
            limits:
              cpu: 10m
              memory: 20Mi
            requests:
              cpu: 10m
              memory: 20Mi
      restartPolicy: OnFailure
  completionPolicy:
    type: Never
  failurePolicy:
    type: Continue

Then uncordon the node. No new pod is created.

Anything else we need to know?:
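From reading the code, I found the following:
In broadcastjob_event_handler.go, line 141, canOldNodeFit, err := checkNodeFitness(mockPod, oldNode)
When this call returns an err, the loop simply hits continue and never checks whether the new node state has already had the taint removed. Is this incorrect?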

func (p *enqueueBroadcastJobForNode) updateNode(q workqueue.RateLimitingInterface, old, cur runtime.Object) {
	oldNode := old.(*v1.Node)
	curNode := cur.(*v1.Node)
	if shouldIgnoreNodeUpdate(*oldNode, *curNode) {
		return
	}
	jobList := &v1alpha1.BroadcastJobList{}
	err := p.reader.List(context.TODO(), jobList)
	if err != nil {
		klog.Errorf("Error enqueueing broadcastjob on updateNode %v", err)
	}
	for _, bcj := range jobList.Items {
		mockPod := NewMockPod(&bcj, oldNode.Name)
		canOldNodeFit, err := checkNodeFitness(mockPod, oldNode)
		if err != nil {
			klog.Errorf("failed to checkNodeFitness for job %s/%s, on old node %s, %v", bcj.Namespace, bcj.Name, oldNode.Name, err)
			continue
		}

		canCurNodeFit, err := checkNodeFitness(mockPod, curNode)
		if err != nil {
			klog.Errorf("failed to checkNodeFitness for job %s/%s, on cur node %s, %v", bcj.Namespace, bcj.Name, curNode.Name, err)
			continue
		}

		if canOldNodeFit != canCurNodeFit {
			// enqueue the broadcast job for matching node
			q.Add(reconcile.Request{
				NamespacedName: types.NamespacedName{
					Namespace: bcj.Namespace,
					Name:      bcj.Name}})
		}
	}
}
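
One possible direction, sketched below, is to treat an error from the fitness check as "does not fit" instead of skipping the job, so that the old and current node states are still compared. This is only a minimal illustration under stated assumptions, not the change that was merged: the node type, checkFit, and shouldEnqueue are hypothetical stand-ins for *v1.Node, checkNodeFitness, and the body of the loop above, and checkFit is assumed to return an error for a cordoned node, as described in this report.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical stand-in for *v1.Node; only the cordon flag matters here.
type node struct {
	name          string
	unschedulable bool
}

// checkFit is a hypothetical stand-in for checkNodeFitness. It is assumed to
// return an error (not just false) when the node cannot host the pod.
func checkFit(n node) (bool, error) {
	if n.unschedulable {
		return false, errors.New("node " + n.name + " is unschedulable")
	}
	return true, nil
}

// shouldEnqueue reports whether a node update changed the node's fitness.
// An error from either check is treated as "does not fit" rather than as a
// reason to skip the comparison.
func shouldEnqueue(oldNode, curNode node) bool {
	canOldNodeFit, err := checkFit(oldNode)
	if err != nil {
		canOldNodeFit = false
	}
	canCurNodeFit, err := checkFit(curNode)
	if err != nil {
		canCurNodeFit = false
	}
	return canOldNodeFit != canCurNodeFit
}

func main() {
	cordoned := node{name: "node-1", unschedulable: true}
	ready := node{name: "node-1", unschedulable: false}

	fmt.Println(shouldEnqueue(cordoned, ready))    // uncordon: true
	fmt.Println(shouldEnqueue(ready, cordoned))    // cordon: true
	fmt.Println(shouldEnqueue(cordoned, cordoned)) // no change: false
}
```

With this shape, the uncordon case (old node does not fit, current node fits) produces a difference and the job would be enqueued. In the real handler the errors would probably still be worth logging.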

Environment:

Kruise version: master branch
Kubernetes version: 1.19.16

veophi · 2023-03-03T02:14:51Z

@weldonlwz could you help us fix it?

weldonlwz · 2023-03-06T02:15:07Z

yes sure

zmberg · 2023-03-09T02:16:01Z

/unassign @FillZpp
/assign @weldonlwz

weldonlwz added the kind/bug Something isn't working label Mar 1, 2023

weldonlwz assigned FillZpp Mar 1, 2023

weldonlwz changed the title ~~node has erased its taint，but broadcastjob won't make a pod on that node~~ [BUG] node has erased its taint，but broadcastjob won't make a pod on that node Mar 1, 2023

weldonlwz mentioned this issue Mar 6, 2023

fix: bcj doesn't make pod on node that has erased taint #1204

Merged

kruise-bot assigned weldonlwz and unassigned FillZpp Mar 9, 2023

zmberg added this to the v1.4 milestone Mar 9, 2023

zmberg added the kind/good-first-issue Good for newcomers label Mar 9, 2023

zmberg removed this from the v1.4 milestone Mar 9, 2023

kruise-bot closed this as completed in #1204 Mar 23, 2023
