Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return existing reservation if podRef matches #296

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

xagent003
Copy link
Contributor

@xagent003 xagent003 commented Jan 18, 2023

#291

As mentioned, the Pod has entered a state where the container has died/been killed. There was already an existing IP reservation for this Pod. But for whatever reason, the DEL is missed. But whereabouts should be smart enough to work around this and tie a reservation to podRef and NOT container id.

On restart/post upgrade of our stack, kubelet detects no sandbox container for Pod and tries to start it back up sending an ADD to cni/ipam. In the case of a non-full IP Pool, whereabouts will allocate a NEW IP for this Pod. Now IP Pool has 2 IPs allocated to same PodRef but different container IDs. So we leak IPs.

In the case of a full IP Pool, the ADD fails because no more IPs in range. However as mentioned... there is already an IP rez for same podRef. So why can't it just re-use it?

ip-reconciler does not help in this case as Pod is running and kubelet is stuck in a container retry loop. The graceful cleanup was missed due to container dying out of band and/or kubelet being down.

The only way to get out of this state is by manually deleting the IP reservation in both IPPool and overlappingranges, or, by this fix which returns an existing reservation if PodRef matches.

@xagent003 xagent003 requested a review from dougbtv as a code owner January 18, 2023 21:39
@xagent003 xagent003 force-pushed the private/arjun/matchPodRefAdd branch from cc1fa6a to 31583e5 Compare January 18, 2023 21:54
Copy link
Collaborator

@nicklesimba nicklesimba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good and thanks for the detailed breakdown for the PR description. Just have one question...also will try to bring this up to others for further review.

Comment on lines +215 to +219
if reserved[i.String()] != "" {
if reserved[i.String()] != podRef {
continue
}
logging.Debugf("Found existing reservation %v with matching podRef %s", i.String(), podRef)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the case where the reserved map is nonempty but doesn't contain the podRef? In other words, when will the "continue" get hit in the code?

Copy link
Member

@s1061123 s1061123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. The diff makes sense to me. Could you please add unit-test in allocate_test.go for allocate.go changes, if you don't mind it?

@moradiyaashish
Copy link

Till fix is available, is there any workaround available to recover from this situation ?

@maiqueb
Copy link
Collaborator

maiqueb commented Aug 3, 2023

@xagent003 hey o/

This slipped me by; could you rebase ? Do you need help sorting out the unit tests ?

@caribbeantiger
Copy link
Contributor

@xagent003 hey o/

This slipped me by; could you rebase ? Do you need help sorting out the unit tests ?

if @xagent003 doesn't mind , i can open a new merge request with this fix, as we are running into this issue often while testing high availability and node crash scenarios.

@maiqueb is that okay?

@maiqueb
Copy link
Collaborator

maiqueb commented Aug 29, 2023

@xagent003 hey o/
This slipped me by; could you rebase ? Do you need help sorting out the unit tests ?

if @xagent003 doesn't mind , i can open a new merge request with this fix, as we are running into this issue often while testing high availability and node crash scenarios.

@maiqueb is that okay?

Please do. We'll review it. Please take into account @s1061123 's request in #296 (review)

Thanks for offering !

@Ritwik037
Copy link

@dougbtv do you have any plan merging this PR because we are facing this issue with version 0.6.2 also and @xagent003 can you please confirm in which whereabouts version you made this changes and is it compatible with kubernetes 1.25

@Ritwik037
Copy link

@maiqueb we are facing the issue that during the rolling upgrade of the kubernetes whenever a pods shift to a new worker node it got stuck the whereabout is not able to assign the ips to we tried the latest version 0.6.2 also but it didn't make any difference

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants