
fix: make the process of patching pods exclusive #12596

Merged · 31 commits merged into argoproj:main from the ci/bump-timeout branch on Feb 5, 2024

Conversation

@sakai-ast (Contributor) commented Jan 31, 2024

Motivation

This is to avoid timeout errors such as the following recent CI failures.

https://github.com/argoproj/argo-workflows/actions/runs/7707372476/job/21004514734
https://github.com/argoproj/argo-workflows/actions/runs/7707667909/job/21005263284
https://github.com/argoproj/argo-workflows/actions/runs/7709093677/job/21009561377
https://github.com/argoproj/argo-workflows/actions/runs/7721063793/job/21047011775

This seems to have happened after merging #12413.

The reason this happens is that DeleteFunc and labelPodCompleted call enablePodForDeletion at the same time; the patch fails, and the pod eventually ends up in an unexpected state that makes the tests fail. This is described in the comments below.
#12596 (comment)
#12596 (comment)

Modifications

I have implemented mutual exclusion keyed on pod name in enablePodForDeletion.
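
For illustration, here is a minimal sketch of the per-pod-name locking pattern (the podLocker wrapper type and the patch callback are hypothetical stand-ins added for this sketch; podNameLocks and podNameMutex mirror the names visible in the reviewed snippet further below):

```go
package controller

import "sync"

// podLocker serializes pod-patching work per pod name so that concurrent
// cleanup paths (e.g. DeleteFunc and labelPodCompleted) cannot patch the
// same pod at the same time.
type podLocker struct {
	// podNameLocks maps pod name -> *sync.Mutex; entries are created and
	// removed dynamically, hence a sync.Map rather than a plain map.
	podNameLocks sync.Map
}

func (l *podLocker) enablePodForDeletion(podName string, patch func() error) error {
	// LoadOrStore returns the existing mutex for this pod name, or stores
	// and returns a new one if none exists yet.
	value, _ := l.podNameLocks.LoadOrStore(podName, &sync.Mutex{})
	podNameMutex := value.(*sync.Mutex)

	podNameMutex.Lock()
	defer func() {
		podNameMutex.Unlock()
		// Drop the entry so the map does not grow with every pod ever seen.
		l.podNameLocks.Delete(podName)
	}()

	// Only the goroutine holding this pod's mutex reaches the patch call,
	// so concurrent callers for the same pod name are serialized.
	return patch()
}
```

With this in place, DeleteFunc and labelPodCompleted serialize on the same mutex when they target the same pod, so only one finalizer-removal patch is in flight at a time.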

Verification

Unit and e2e tests.

@sakai-ast sakai-ast changed the title ci: Bump E2E_WAIT_TIMEOUT to avoid E2E test failures in CI ci: Bump timeout to avoid E2E test failures in CI Jan 31, 2024
@sakai-ast sakai-ast changed the title ci: Bump timeout to avoid E2E test failures in CI ci: Bump timeouts to avoid E2E test failures in CI Jan 31, 2024
@sakai-ast sakai-ast changed the title ci: Bump timeouts to avoid E2E test failures in CI ci: Bump E2E_WAIT_TIMEOUT to avoid E2E test failures in CI Jan 31, 2024
sakai-ast and others added 15 commits January 31, 2024 23:48
@agilgur5 added the area/controller and area/build labels Feb 2, 2024
@agilgur5 commented Feb 2, 2024

Hit this in #12609 and #12610 too. It seems to be happening extremely frequently, so this is a blocker to fix, or at least to skip temporarily.

@sakai-ast (Contributor, Author) commented
Currently, the following flaky errors occur when the test-functional tests (especially PodCleanupSuite) fail.

controller: time="2024-02-03T09:08:29.485Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/test-pod-cleanup-ljrzv/terminateContainers
controller: time="2024-02-03T09:08:30.512Z" level=info msg="patching pod" data="[{\"op\":\"remove\",\"path\":\"/metadata/finalizers/0\"}]" podName=test-pod-cleanup-ljrzv
controller: time="2024-02-03T09:08:30.516Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/test-pod-cleanup-ljrzv/labelPodCompleted
controller: time="2024-02-03T09:08:30.521Z" level=info msg="patching pod" data="[{\"op\":\"remove\",\"path\":\"/metadata/finalizers/0\"},{\"op\":\"replace\",\"path\":\"/metadata/labels/workflows.argoproj.io~1completed\",\"value\":\"true\"}]" podName=test-pod-cleanup-ljrzv
controller: time="2024-02-03T09:08:30.523Z" level=info msg="insignificant pod change" key=argo/test-pod-cleanup-ljrzv
controller: time="2024-02-03T09:08:30.524Z" level=warning msg="failed to clean-up pod" action=labelPodCompleted error="the server rejected our request due to an error in our request" key=argo/test-pod-cleanup-ljrzv/labelPodCompleted
controller: time="2024-02-03T09:08:30.524Z" level=warning msg="Non-transient error: the server rejected our request due to an error in our request"
controller: time="2024-02-03T09:08:31.488Z" level=info msg="cleaning up pod" action=killContainers key=argo/test-pod-cleanup-ljrzv/killContainers

I don't know why, but the controller tries to apply the finalizer-removal patch twice, so "the server rejected our request due to an error in our request" occurs on the second attempt, presumably because /metadata/finalizers/0 no longer exists after the first patch succeeds. The error reference is here.

The patch flakily fails when the action is labelPodCompleted, resulting in a pod state that the tests do not expect, as shown below.
metadata.labels.workflows.argoproj.io/completed: "false"

This issue causes the test-functional tests to fail and is difficult to reproduce locally, where the tests always pass.

Therefore, I am attempting to debug it.

@agilgur5 commented Feb 3, 2024

After adding some debug log settings, I found that the cause of the finalizer-removal patch being applied twice is that DeleteFunc and labelPodCompleted are calling enablePodForDeletion at the same time.

Looks like the E2E tests passed once you added the mutex fix -- great to solve it at the root cause instead of increasing timeouts for race conditions! I didn't review the code in depth to check if mutexes are the most optimal solution here yet though.

Unit tests are now failing on pod cleanup as well. The test is getting a nil pointer error from the logs, so I imagine the test needs to be altered.

@sakai-ast sakai-ast changed the title ci: Bump E2E_WAIT_TIMEOUT to avoid E2E test failures in CI fix: make the process of patching pods exclusive Feb 4, 2024
@sakai-ast (Contributor, Author) commented
The tests look alright after addressing the issues, so I have updated the PR title and description.

@sakai-ast sakai-ast marked this pull request as ready for review February 4, 2024 06:35
@agilgur5 agilgur5 requested a review from juliev0 February 4, 2024 17:38
podNameMutex.Lock()
defer func() {
	podNameMutex.Unlock()
	wfc.podNameLocks.Delete(podName)
}()
A Contributor commented on the lines above:

nice remembering to remove it from memory

@juliev0 (Contributor) commented Feb 5, 2024

I think this is probably good.

I actually just did an analysis of this type of thing on another repo. As you noticed, the underlying issue is multiple goroutines concurrently doing a "Get", then a "Patch" based on a calculation made from the "Get" (in this case, determining if there was a finalizer) - if one does a "Get" and then makes a calculation, then the live state may have changed in the meantime and that calculation may no longer be valid.

I determined that there were 2 options: 1. put a lock around the whole thing like you are now; 2. Use an Update() (PUT) instead of a Patch. Doing a Get, modifying the Resource, and then doing an Update() on that Resource means that there's a ResourceVersion embedded in the object, and the K8S API will notice if the live ResourceVersion has changed in the meantime, so the whole Get() -> calculation -> Update() sequence can be retried.
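
For context, here is a minimal sketch of that second option using client-go's conflict-retry helper. The function name removeFinalizerWithUpdate and its clientset/namespace/finalizer parameters are illustrative assumptions, not code from this PR or from #12632:

```go
package controller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// removeFinalizerWithUpdate re-reads the live Pod, recomputes the finalizer
// list from that fresh read, and writes it back with Update(). If the live
// ResourceVersion changed between the Get and the Update, the API server
// returns a Conflict and RetryOnConflict re-runs the whole
// Get -> calculation -> Update sequence.
func removeFinalizerWithUpdate(ctx context.Context, clientset kubernetes.Interface, namespace, podName, finalizer string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := clientset.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Recompute the finalizer list against the freshly read object rather
		// than relying on a previously calculated index.
		kept := make([]string, 0, len(pod.Finalizers))
		for _, f := range pod.Finalizers {
			if f != finalizer {
				kept = append(kept, f)
			}
		}
		pod.Finalizers = kept
		_, err = clientset.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```

On a Conflict response the closure runs again from the Get(), so the finalizer calculation is always made against the latest ResourceVersion.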

To be extra cautious, we could also look at anywhere else in the code where these Pods get updated, to make sure that any places which can run concurrently with this will still work, if there are any.

@juliev0 (Contributor) commented Feb 5, 2024

To be extra cautious, we could also look at anywhere else in the code where these Pods get updated, to make sure that any places which can run concurrently with this will still work, if there are any.

Okay, I just searched for Pods( in the code. The only other place I see a Pod getting updated is in the executor running on the Pod (wait container), where an Annotation can be set. I think there is probably no issue there, since the two are touching/looking at totally separate things.

@juliev0 juliev0 merged commit e771bde into argoproj:main Feb 5, 2024
27 checks passed
@sakai-ast sakai-ast deleted the ci/bump-timeout branch February 5, 2024 05:51
@agilgur5 commented Feb 5, 2024

if one does a "Get" and then makes a calculation, then the live state may have changed in the meantime and that calculation may no longer be valid.

In other words, a lack of atomicity. So a lock is used to make it more like a transaction in this case.

I determined that there were 2 options: 1. put a lock around the whole thing like you are now; 2. Use an Update() (PUT) instead of a Patch.

I'm wondering if an update might be better. The mutex lock ensures that the Controller's goroutines won't overwrite each other (i.e. makes it thread-safe), but not that something else outside of Argo doesn't overwrite the Pod. It sounds like, per your analysis, the semantics of update are safer?

@juliev0 (Contributor) commented Feb 5, 2024

I'm wondering if an update might be better. The mutex lock ensures that the Controller's goroutines won't overwrite each other (i.e. makes it thread-safe), but not that something else outside of Argo doesn't overwrite the Pod. It sounds like, per your analysis, the semantics of update are safer?

Looking at the code more closely, it looks like we get the live state, then determine if this particular Finalizer is in there and get its Index. I suppose there is a very small risk of somebody else maybe inserting their own Finalizer or removing one, such that we would remove the wrong Finalizer? Is it a realistic thing to be worried about do you think?

@agilgur5 commented Feb 5, 2024

I suppose there is a very small risk of somebody else maybe inserting their own Finalizer or removing one, such that we would remove the wrong Finalizer? Is it a realistic thing to be worried about do you think?

It's an edge case of an edge case, but I do think it is quite possible, especially for a fundamental resource like Pods that could have various admission controllers etc apply to it.

@juliev0 (Contributor) commented Feb 5, 2024

I suppose there is a very small risk of somebody else maybe inserting their own Finalizer or removing one, such that we would remove the wrong Finalizer? Is it a realistic thing to be worried about do you think?

It's an edge case of an edge case, but I do think it is quite possible, especially for a fundamental resource like Pods that could have various admission controllers etc apply to it.

What do you think @sakai-ast ? If we were to write an issue related to this, would you be open to converting the logic to use Update() rather than Patch(), with Retry capability in the case that the K8S API rejects the request due to ResourceVersion being outdated?

@sakai-ast (Contributor, Author) commented Feb 6, 2024

I suppose there is a very small risk of somebody else maybe inserting their own Finalizer or removing one, such that we would remove the wrong Finalizer? Is it a realistic thing to be worried about do you think?

It's an edge case of an edge case, but I do think it is quite possible, especially for a fundamental resource like Pods that could have various admission controllers etc apply to it.

What do you think @sakai-ast ? If we were to write an issue related to this, would you be open to converting the logic to use Update() rather than Patch(), with Retry capability in the case that the K8S API rejects the request due to ResourceVersion being outdated?

I'm open to it and appreciate the suggestion. I have tried the following implementation using Update() with retry logic. Does this meet your expectations?
#12632

@juliev0 (Contributor) commented Feb 7, 2024

I'm open to it and appreciate the suggestion. I have tried the following implementation using Update() with retry logic. Does this meet your expectations? #12632

very nice! yes, thanks :) I'll assign that one to myself and review more in depth later.

isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request Feb 27, 2024
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request Feb 27, 2024
@agilgur5 agilgur5 added this to the v3.6.0 milestone Apr 12, 2024