fix: make the process of patching pods exclusive #12596
Conversation
Currently, the following flaky errors occur when the …

I don't know why, but the controller tries to apply the finalizer-removal patch twice, so the patching fails flakily when the action is …. This issue causes the failure of the …. Therefore, I am attempting to debug it.
Looks like the E2E tests passed once you added the mutex fix -- great to solve it at the root cause instead of increasing timeouts for race conditions! I haven't yet reviewed the code in depth to check whether mutexes are the most optimal solution here, though. Unit tests are now failing on pod cleanup as well; judging from the logs, the test is getting a nil pointer error, so I imagine the test needs to be altered.
The tests seem alright after dealing with the issues, so I have fixed the PR title and description.
podNameMutex.Lock()
defer func() {
	podNameMutex.Unlock()
	wfc.podNameLocks.Delete(podName)
}()
nice remembering to remove it from memory
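For context, here is a minimal, self-contained sketch of the per-pod-name locking pattern this diff uses, assuming a `sync.Map` that holds one `*sync.Mutex` per pod name; the function body and comments are illustrative, not the PR's exact code:

```go
package controller

import "sync"

// One *sync.Mutex per pod name. The PR keeps this map on the workflow
// controller (wfc.podNameLocks); a package-level variable is used here
// only to keep the sketch self-contained.
var podNameLocks sync.Map

func enablePodForDeletion(podName string) {
	// LoadOrStore returns the existing mutex for this pod name, or
	// atomically registers a fresh one.
	l, _ := podNameLocks.LoadOrStore(podName, &sync.Mutex{})
	podNameMutex := l.(*sync.Mutex)

	podNameMutex.Lock()
	defer func() {
		podNameMutex.Unlock()
		// Drop the entry afterwards so the map does not grow with
		// every pod the controller has ever seen.
		podNameLocks.Delete(podName)
	}()

	// ... Get the pod, check whether the finalizer is present, and
	// Patch it here; the lock serializes this sequence per pod. ...
}
```

With the lock held, the Get-then-Patch sequence for a given pod is serialized, so concurrent callers such as `DeleteFunc` and `labelPodCompleted` can no longer race against each other.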
I think this is probably good. I actually just did an analysis of this type of thing on another repo. As you noticed, the underlying issue is multiple goroutines doing a "Get" at the same time, then a "Patch" based on some calculation from the "Get" (in this case, determining if there was a finalizer): if one does a "Get" and then makes a calculation, the live state may have changed in the meantime and that calculation may no longer be valid. I determined that there were 2 options:

1. Put a lock around the whole thing, like you are doing now.
2. Use an `Update()` (PUT) instead of a `Patch`. Doing a …

To be extra cautious, we can also consider anywhere else in the code that these Pods could get updated, to make sure any places that can run concurrently with this will work, if any.
Okay, I just searched for …
In other words, a lack of atomicity. So a lock is used to make it more like a transaction in this case.
I'm wondering if an …
Looking at the code more closely, it looks like we get the live state, then determine if this particular finalizer is in there and get its index. I suppose there is a very small risk of somebody else inserting their own finalizer or removing one, such that we would remove the wrong finalizer? Is that a realistic thing to be worried about, do you think?
It's an edge case of an edge case, but I do think it is quite possible, especially for a fundamental resource like Pods, which various admission controllers etc. could apply to.
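To make the stale-index risk concrete: removing a finalizer by position is typically done with a JSON patch along these lines (illustrative, not necessarily the PR's exact patch body):

```json
[
  { "op": "remove", "path": "/metadata/finalizers/0" }
]
```

If an admission controller or another operator inserts or removes a finalizer between the Get and the Patch, index 0 may now point at someone else's finalizer, and the patch would remove the wrong one.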
What do you think @sakai-ast? If we were to write an issue related to this, would you be open to converting the logic to use …?
I'm open to it and appreciate the suggestion. I have tried the following implementation using …
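The snippet itself did not survive here, but an `Update`-based version built on client-go's `retry.RetryOnConflict` could look roughly like the sketch below; the function shape and the finalizer constant are assumptions for illustration, not the PR's actual code:

```go
package controller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/util/retry"
)

// Stand-in name; the real finalizer constant lives elsewhere in the repo.
const finalizerName = "workflows.argoproj.io/status"

// removeFinalizer retries the whole Get-modify-Update sequence whenever the
// Update fails with a resourceVersion conflict, so each attempt recomputes
// the finalizer list from fresh live state.
func removeFinalizer(ctx context.Context, pods typedcorev1.PodInterface, podName string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := pods.Get(ctx, podName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Filter by name rather than index, so a concurrent change to the
		// list cannot make us remove the wrong entry.
		kept := make([]string, 0, len(pod.Finalizers))
		for _, f := range pod.Finalizers {
			if f != finalizerName {
				kept = append(kept, f)
			}
		}
		pod.Finalizers = kept
		_, err = pods.Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```

Compared to the mutex, this trades the cheap `Patch` for optimistic concurrency: the API server rejects the `Update` if anyone else modified the Pod in between, and the retry recomputes against the new state, which also sidesteps the wrong-index concern above.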
very nice! yes, thanks :) I'll assign that one to myself and review more in depth later.
Motivation
This is to avoid timeout errors such as the following recent CI failures:
https://github.com/argoproj/argo-workflows/actions/runs/7707372476/job/21004514734
https://github.com/argoproj/argo-workflows/actions/runs/7707667909/job/21005263284
https://github.com/argoproj/argo-workflows/actions/runs/7709093677/job/21009561377
https://github.com/argoproj/argo-workflows/actions/runs/7721063793/job/21047011775
This seems to have happened after merging #12413.
The reason this happens is that `DeleteFunc` and `labelPodCompleted` call `enablePodForDeletion` at the same time, the patching fails, and eventually the pod ends up in an unexpected state and fails the tests. It's described in these comments:
#12596 (comment)
#12596 (comment)
Modifications
I have implemented mutual exclusion based on pod names in `enablePodForDeletion`.

Verification
Unit and E2E tests.