-
Notifications
You must be signed in to change notification settings - Fork 39.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add pvc as part of equivalence hash #56577
Conversation
continue | ||
} | ||
pvcName := volume.PersistentVolumeClaim.ClaimName | ||
result.Insert(pod.GetNamespace() + "-" + pvcName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use the PVC UID? I understand this is probably equivalent, but using the UID might make it more obvious. Is there a reason to avoid using the UID? If @bsalamat is ok with this, then maybe we can add a comment explaining it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A user could delete a PVC and create a new one with the same name, but different UID. The PVC contents could be different, so we would not want to use the cached results for the old PVC.
But actually, how do entries in the equivalence cache get removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that UID will be more reliable to ensure uniqueness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msau42 Yes when PVC is updated, the volume predicate cache will be invalidated so it still works.
I was just trying to avoid fetching UID it from scheduler cache since we may need to change get ecache function's interface a bit. Not sure if it's allowed during code freeze ...
Anyway let me update the patch to a more reliable way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see what you mean, the Pod object only contains the PVC name, so you have to lookup the PVC object from the informer cache to get the UID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so now I've already refactored it to a factory like NewEquivalencePodGenerator(args.PVCInfo)
, will update the PR soon after test passes.
b746d6d
to
2932f4c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the NoVolumeZoneConflict predicate needs to be invalidated on PVC delete
2932f4c
to
cdb1295
Compare
@msau42 Thanks, just updated. Also cleaned up and fixed the related |
cdb1295
to
dcfd861
Compare
if pvc.Spec.VolumeName != "" { | ||
c.equivalencePodCache.InvalidateCachedPredicateItemOfAllNodes(maxPDVolumeCountPredicateSet) | ||
} | ||
// TODO(harry): make all predicate key to be global const, e.g. predicates.NoVolumeZoneConflict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment that we need to do this here because the ecache uses PVC uid to hash the pod
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess for that same reason, VolumeBind predicate also needs to be invalidated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack, just updated.
@@ -384,16 +384,7 @@ func (c *configFactory) onPvDelete(obj interface{}) { | |||
} | |||
|
|||
func (c *configFactory) invalidatePredicatesForPv(pv *v1.PersistentVolume) { | |||
invalidPredicates := sets.NewString("MaxPDVolumeCountPredicate") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to stay. You could have a PVC that points to a PV, but the PV object doesn't exist. So when the PV object gets added, we can recount.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, and also updated other comments to clarify the invalidation op reasosn.
@@ -188,12 +188,29 @@ func (ec *EquivalenceCache) InvalidateCachedPredicateItemForPodAdd(pod *v1.Pod, | |||
|
|||
// GeneralPredicates: will always be affected by adding a new pod | |||
invalidPredicates := sets.NewString("GeneralPredicates") | |||
|
|||
// MaxPDVolumeCountPredicate: we check the volumes of pod to make decision. | |||
for _, vol := range pod.Spec.Volumes { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need to be added here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed MaxPDVolumeCountChecker
actually relies on volumes count of specify Pod to do predicate. So when a pod with PV is add/delete on a node. The MaxPDVolumeCountChecker
e-cache of this node should be re-calculated for subsequent pods.
Hope I am understanding right :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the volume count checker counts the volumes on pods that are assigned on nodes. I guess for the case where you create the Pod with NodeName already set, then yes, the predicate should be invalidated.
dcfd861
to
8fbc5f2
Compare
lgtm from volume perspective. Have you tried testing your change with a StatefulSet with PVCs to make sure it is fixed? |
@@ -1492,3 +1492,14 @@ func (c *VolumeBindingChecker) predicate(pod *v1.Pod, meta algorithm.PredicateMe | |||
glog.V(5).Info("All PVCs found matches for pod %v/%v, node %q", pod.Namespace, pod.Name, node.Name) | |||
return true, nil, nil | |||
} | |||
|
|||
type FakePersistentVolumeClaimInfo []v1.PersistentVolumeClaim |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this block moved to this file? It does not belong here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be used by equivalence_cache_test.go, maybe create a test helper pkg?
// For now we only consider pods: | ||
// 1. OwnerReferences is Controller | ||
// 2. with same OwnerReferences | ||
// 3. with same PVC claim | ||
// to be equivalent | ||
if len(pod.OwnerReferences) != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you are making changes, please drop this if statement. The for loop takes care of this.
// For now we only consider pods: | ||
// 1. OwnerReferences is Controller | ||
// 2. with same OwnerReferences | ||
// 3. with same PVC claim | ||
// to be equivalent | ||
if len(pod.OwnerReferences) != 0 { | ||
for _, ref := range pod.OwnerReferences { | ||
if *ref.Controller { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't ref.Controller
ever be nil? We should probably check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch ... just fixed
// PVC volume binding has changed | ||
invalidPredicates.Insert(predicates.CheckVolumeBinding) | ||
} | ||
// The binded volume type may change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/binded/bound/ here an other places in your PR
8fbc5f2
to
055b3a5
Compare
@msau42 Yes, I've tested the patch with a StatefulSet with volumeClaimTemplates:
- metadata:
name: www
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 1Gi And when set cluster/kubectl.sh get pods
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 4s
web-1 0/1 Pending 0 1s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 0s (x6 over 15s) default-scheduler pod has unbound PersistentVolumeClaims |
The eclass feature do have e2e tests but normally skipped due to it's |
Use factory to generat get equivalence pod func
055b3a5
to
b3bb74e
Compare
The second pod fails to schedule with or without your patch? The original reported 1.8 bug was that the second pod did get scheduled, but in the wrong zone. You would need a multizone cluster to test this. Or the other way is to enable my alpha features in 1.9 for local volumes and to run my local e2e test. Or I can run it with your patch. Also in general, for e2e alpha tests, you have to explicitly add it to the regex for the alpha suite. See kubernetes/test-infra#5679 |
if *ref.Controller { | ||
// a pod can only belongs to one controller | ||
for _, ref := range pod.OwnerReferences { | ||
if ref.Controller != nil && *ref.Controller { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you also need to *ref.Controller
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msau42 ref.Controller
is a *bool
. Its value can be false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks, makes sense
lgtm from volume side |
matchInterPodAffinitySet = sets.NewString("MatchInterPodAffinity") | ||
generalPredicatesSets = sets.NewString("GeneralPredicates") | ||
noDiskConflictSet = sets.NewString("NoDiskConflict") | ||
maxPDVolumeCountPredicateKeys = []string{"MaxGCEPDVolumeCount", "MaxAzureDiskVolumeCount", "MaxEBSVolumeCount"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I had asked this question before, but looks like I haven't:
why do we need to break MaxPDVolumeCountPredicate
to three keys? Is it only for optimization purposes? If so, I doubt if it has a major effect on performance. It will probably be rare to have more than one of these volume types on a single node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the registered predicates are three separated names instead of "MaxPDVolumeCount".
Otherwise we may need to check the type of volume before invalidating, while just as you pointed out, it's rare to have more than one of them as PV, so I chose to invalidate them all in most cases unless we can check the PV type very easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thanks for pointing it out.
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bsalamat, resouer Associated issue: 56265 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
/kind bug |
/priority critical-urgent |
[MILESTONENOTIFIER] Milestone Pull Request Current Note: This pull request is marked as Example update:
Pull Request Labels
|
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue (batch tested with PRs 56688, 56577). If you want to cherry-pick this change to another branch, please follow the instructions here. |
@resouer: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
What this PR does / why we need it:
Should add PVC as part of equivalence hash so that
StatefulSe
t andOperator
will always run the volume predicate, while theReplicaSet
can still re-use cached ones.Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #56265
Special notes for your reviewer:
Release note: