
fix: make the process of patching pods exclusive #12596

Merged · 31 commits merged into argoproj:main from the ci/bump-timeout branch on Feb 5, 2024

Conversation

@sakai-ast (Contributor) commented Jan 31, 2024

Motivation

This is to avoid timeout errors such as the following recent CI failures.

https://github.com/argoproj/argo-workflows/actions/runs/7707372476/job/21004514734
https://github.com/argoproj/argo-workflows/actions/runs/7707667909/job/21005263284
https://github.com/argoproj/argo-workflows/actions/runs/7709093677/job/21009561377
https://github.com/argoproj/argo-workflows/actions/runs/7721063793/job/21047011775

This seems to have happened after merging #12413.

The reason this happens is that DeleteFunc and labelPodCompleted call enablePodForDeletion at the same time; the patch fails, and the pod eventually ends up in an unexpected state that makes the tests fail. This is described in the comments below.
#12596 (comment)
#12596 (comment)

Modifications

I have implemented mutual exclusion keyed on pod name in enablePodForDeletion.
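
For illustration, here is a minimal sketch of the per-pod-name locking pattern (the podLocker wrapper type and the patch callback are hypothetical stand-ins added for this sketch; podNameLocks and podNameMutex mirror the names visible in the reviewed snippet further below):

```go
package controller

import "sync"

// podLocker serializes pod-patching work per pod name so that concurrent
// cleanup paths (e.g. DeleteFunc and labelPodCompleted) cannot patch the
// same pod at the same time.
type podLocker struct {
	// podNameLocks maps pod name -> *sync.Mutex; entries are created and
	// removed dynamically, hence a sync.Map rather than a plain map.
	podNameLocks sync.Map
}

func (l *podLocker) enablePodForDeletion(podName string, patch func() error) error {
	// LoadOrStore returns the existing mutex for this pod name, or stores
	// and returns a new one if none exists yet.
	value, _ := l.podNameLocks.LoadOrStore(podName, &sync.Mutex{})
	podNameMutex := value.(*sync.Mutex)

	podNameMutex.Lock()
	defer func() {
		podNameMutex.Unlock()
		// Drop the entry so the map does not grow with every pod ever seen.
		l.podNameLocks.Delete(podName)
	}()

	// Only the goroutine holding this pod's mutex reaches the patch call,
	// so concurrent callers for the same pod name are serialized.
	return patch()
}
```

With this in place, DeleteFunc and labelPodCompleted serialize on the same mutex when they target the same pod, so only one finalizer-removal patch is in flight at a time.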

Verification

Unit and e2e tests.

@sakai-ast sakai-ast changed the title ci: Bump E2E_WAIT_TIMEOUT to avoid E2E test failures in CI ci: Bump timeout to avoid E2E test failures in CI Jan 31, 2024
@sakai-ast sakai-ast changed the title ci: Bump timeout to avoid E2E test failures in CI ci: Bump timeouts to avoid E2E test failures in CI Jan 31, 2024
@sakai-ast sakai-ast changed the title ci: Bump timeouts to avoid E2E test failures in CI ci: Bump E2E_WAIT_TIMEOUT to avoid E2E test failures in CI Jan 31, 2024
sakai-ast and others added 15 commits January 31, 2024 23:48
@agilgur5 added the area/controller and area/build labels Feb 2, 2024
@agilgur5 commented Feb 2, 2024

Hit this in #12609 and #12610 too. It seems to be happening extremely frequently, so this is a blocker to fix, or at least to skip temporarily.

@sakai-ast (Contributor, Author) commented
Currently, the following flaky errors occur when the test-functional tests (especially PodCleanupSuite) fail.

controller: time="2024-02-03T09:08:29.485Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/test-pod-cleanup-ljrzv/terminateContainers
controller: time="2024-02-03T09:08:30.512Z" level=info msg="patching pod" data="[{\"op\":\"remove\",\"path\":\"/metadata/finalizers/0\"}]" podName=test-pod-cleanup-ljrzv
controller: time="2024-02-03T09:08:30.516Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/test-pod-cleanup-ljrzv/labelPodCompleted
controller: time="2024-02-03T09:08:30.521Z" level=info msg="patching pod" data="[{\"op\":\"remove\",\"path\":\"/metadata/finalizers/0\"},{\"op\":\"replace\",\"path\":\"/metadata/labels/workflows.argoproj.io~1completed\",\"value\":\"true\"}]" podName=test-pod-cleanup-ljrzv
controller: time="2024-02-03T09:08:30.523Z" level=info msg="insignificant pod change" key=argo/test-pod-cleanup-ljrzv
controller: time="2024-02-03T09:08:30.524Z" level=warning msg="failed to clean-up pod" action=labelPodCompleted error="the server rejected our request due to an error in our request" key=argo/test-pod-cleanup-ljrzv/labelPodCompleted
controller: time="2024-02-03T09:08:30.524Z" level=warning msg="Non-transient error: the server rejected our request due to an error in our request"
controller: time="2024-02-03T09:08:31.488Z" level=info msg="cleaning up pod" action=killContainers key=argo/test-pod-cleanup-ljrzv/killContainers

I don't know why, but the controller tries to apply the finalizer-removal patch twice, so "the server rejected our request due to an error in our request" occurs on the second attempt, presumably because /metadata/finalizers/0 no longer exists after the first patch succeeds. The error reference is here.

The patch flakily fails when the action is labelPodCompleted, resulting in a pod state that the tests do not expect, as shown below.
metadata.labels.workflows.argoproj.io/completed: "false"

This issue causes the test-functional tests to fail and is difficult to reproduce locally, where the tests always pass.

Therefore, I am attempting to debug it.

@agilgur5 commented Feb 3, 2024

After adding some debug log settings, I found that the cause of the finalizer-removal patch being applied twice is that DeleteFunc and labelPodCompleted are calling enablePodForDeletion at the same time.

Looks like the E2E tests passed once you added the mutex fix -- great to solve it at the root cause instead of increasing timeouts for race conditions! I didn't review the code in depth to check if mutexes are the most optimal solution here yet though.

Unit tests are now failing on pod cleanup as well. The test is getting a nil pointer error from the logs, so I imagine the test needs to be altered.

@sakai-ast sakai-ast changed the title ci: Bump E2E_WAIT_TIMEOUT to avoid E2E test failures in CI fix: make the process of patching pods exclusive Feb 4, 2024
@sakai-ast (Contributor, Author) commented
The tests look alright after addressing the issues, so I have updated the PR title and description.

@sakai-ast sakai-ast marked this pull request as ready for review February 4, 2024 06:35
@agilgur5 agilgur5 requested a review from juliev0 February 4, 2024 17:38
podNameMutex.Lock()
defer func() {
	podNameMutex.Unlock()
	wfc.podNameLocks.Delete(podName)
}()
A Contributor commented on the lines above:

nice remembering to remove it from memory

@juliev0 (Contributor) commented Feb 5, 2024

I think this is probably good.

I actually just did an analysis of this type of thing on another repo. As you noticed, the underlying issue is multiple goroutines concurrently doing a "Get", then a "Patch" based on a calculation made from the "Get" (in this case, determining if there was a finalizer) - if one does a "Get" and then makes a calculation, then the live state may have changed in the meantime and that calculation may no longer be valid.

I determined that there were 2 options: 1. put a lock around the whole thing like you are now; 2. Use an Update() (PUT) instead of a Patch. Doing a Get, modifying the Resource, and then doing an Update() on that Resource means that there's a ResourceVersion embedded in the object, and the K8S API will notice if the live ResourceVersion has changed in the meantime, so the whole Get() -> calculation -> Update() sequence can be retried.
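
For context, here is a minimal sketch of that second option using client-go's conflict-retry helper. The function name removeFinalizerWithUpdate and its clientset/namespace/finalizer parameters are illustrative assumptions, not code from this PR or from #12632:

```go
package controller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// removeFinalizerWithUpdate re-reads the live Pod, recomputes the finalizer
// list from that fresh read, and writes it back with Update(). If the live
// ResourceVersion changed between the Get and the Update, the API server
// returns a Conflict and RetryOnConflict re-runs the whole
// Get -> calculation -> Update sequence.
func removeFinalizerWithUpdate(ctx context.Context, clientset kubernetes.Interface, namespace, podName, finalizer string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := clientset.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Recompute the finalizer list against the freshly read object rather
		// than relying on a previously calculated index.
		kept := make([]string, 0, len(pod.Finalizers))
		for _, f := range pod.Finalizers {
			if f != finalizer {
				kept = append(kept, f)
			}
		}
		pod.Finalizers = kept
		_, err = clientset.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```

On a Conflict response the closure runs again from the Get(), so the finalizer calculation is always made against the latest ResourceVersion.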

To be extra cautious, we could also look at anywhere else in the code where these Pods get updated, to make sure that any places which can run concurrently with this will still work, if there are any.

@juliev0 (Contributor) commented Feb 5, 2024

To be extra cautious, we could also look at anywhere else in the code where these Pods get updated, to make sure that any places which can run concurrently with this will still work, if there are any.

Okay, I just searched for Pods( in the code. The only other place I see a Pod getting updated is in the executor running on the Pod (wait container), where an Annotation can be set. I think there is probably no issue there, since the two are touching/looking at totally separate things.

@juliev0 juliev0 merged commit e771bde into argoproj:main Feb 5, 2024
27 checks passed
@sakai-ast sakai-ast deleted the ci/bump-timeout branch February 5, 2024 05:51
@agilgur5 commented Feb 5, 2024

if one does a "Get" and then makes a calculation, then the live state may have changed in the meantime and that calculation may no longer be valid.

In other words, a lack of atomicity. So a lock is used to make it more like a transaction in this case.

I determined that there were 2 options: 1. put a lock around the whole thing like you are now; 2. Use an Update() (PUT) instead of a Patch.

I'm wondering if an update might be better. The mutex lock ensures that the Controller's goroutines won't overwrite each other (i.e. makes it thread-safe), but not that something else outside of Argo doesn't overwrite the Pod. It sounds like, per your analysis, the semantics of update are safer?

@juliev0 (Contributor) commented Feb 5, 2024

I'm wondering if an update might be better. The mutex lock ensures that the Controller's goroutines won't overwrite each other (i.e. makes it thread-safe), but not that something else outside of Argo doesn't overwrite the Pod. It sounds like, per your analysis, the semantics of update are safer?

Looking at the code more closely, it looks like we get the live state, then determine if this particular Finalizer is in there and get its Index. I suppose there is a very small risk of somebody else maybe inserting their own Finalizer or removing one, such that we would remove the wrong Finalizer? Is it a realistic thing to be worried about do you think?

@agilgur5 commented Feb 5, 2024

I suppose there is a very small risk of somebody else maybe inserting their own Finalizer or removing one, such that we would remove the wrong Finalizer? Is it a realistic thing to be worried about do you think?

It's an edge case of an edge case, but I do think it is quite possible, especially for a fundamental resource like Pods that could have various admission controllers etc apply to it.

@juliev0 (Contributor) commented Feb 5, 2024

I suppose there is a very small risk of somebody else maybe inserting their own Finalizer or removing one, such that we would remove the wrong Finalizer? Is it a realistic thing to be worried about do you think?

It's an edge case of an edge case, but I do think it is quite possible, especially for a fundamental resource like Pods that could have various admission controllers etc apply to it.

What do you think @sakai-ast ? If we were to write an issue related to this, would you be open to converting the logic to use Update() rather than Patch(), with Retry capability in the case that the K8S API rejects the request due to ResourceVersion being outdated?

@sakai-ast (Contributor, Author) commented Feb 6, 2024

I suppose there is a very small risk of somebody else maybe inserting their own Finalizer or removing one, such that we would remove the wrong Finalizer? Is it a realistic thing to be worried about do you think?

It's an edge case of an edge case, but I do think it is quite possible, especially for a fundamental resource like Pods that could have various admission controllers etc apply to it.

What do you think @sakai-ast ? If we were to write an issue related to this, would you be open to converting the logic to use Update() rather than Patch(), with Retry capability in the case that the K8S API rejects the request due to ResourceVersion being outdated?

I'm open to it and appreciate the suggestion. I have tried the following implementation using Update() with retry logic. Does this meet your expectations?
#12632

@juliev0 (Contributor) commented Feb 7, 2024

I'm open to it and appreciate the suggestion. I have tried the following implementation using Update() with retry logic. Does this meet your expectations? #12632

very nice! yes, thanks :) I'll assign that one to myself and review more in depth later.

isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request Feb 27, 2024
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request Feb 27, 2024
@agilgur5 agilgur5 added this to the v3.6.0 milestone Apr 12, 2024