
Add service ctr cleanup to PlayKubeDown #17821

Merged
merged 1 commit into containers:main on Mar 21, 2023

Conversation

umohnani8
Member

@umohnani8 umohnani8 commented Mar 16, 2023

Since we can't guarantee when the worker queue will come
and clean up the service container in the remote case when
podman kube play --wait is called, clean up the service container
at the end of PlayKubeDown() to ensure that it is removed right
after all the containers, pods, volumes, etc. are removed.

[NO NEW TESTS NEEDED]

Fixes #17803
Fixes #17820

Signed-off-by: Urvashi Mohnani [email protected]

Does this PR introduce a user-facing change?

None
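
Below is a minimal Go sketch of the teardown ordering the description refers to: remove pods and volumes first, then remove the service container last, before the call returns. The teardownReport type and the removePod/removeVolume/removeContainer helpers are hypothetical stand-ins, not podman's actual libpod calls.

```go
// Illustrative sketch only: hypothetical helpers stand in for libpod's real
// teardown calls, which this example does not use.
package main

import "fmt"

// teardownReport is a hypothetical stand-in for the report returned by the
// kube-play teardown path.
type teardownReport struct {
	RemovedPods    []string
	RemovedVolumes []string
}

// playKubeDown sketches the ordering described in the commit message:
// pods, volumes, and other resources are removed first, and the service
// container is removed last so that `podman kube play --wait` leaves
// nothing behind once it returns.
func playKubeDown(pods, volumes []string, serviceCtr string) (*teardownReport, error) {
	report := &teardownReport{}

	for _, pod := range pods {
		if err := removePod(pod); err != nil {
			return nil, err
		}
		report.RemovedPods = append(report.RemovedPods, pod)
	}

	for _, vol := range volumes {
		if err := removeVolume(vol); err != nil {
			return nil, err
		}
		report.RemovedVolumes = append(report.RemovedVolumes, vol)
	}

	// Remove the service container last, after everything that depends on it.
	// Doing it here, rather than leaving it to the server's worker queue,
	// means the removal has finished before the API response is sent.
	if serviceCtr != "" {
		if err := removeContainer(serviceCtr); err != nil {
			return nil, err
		}
	}
	return report, nil
}

// Hypothetical stubs so the sketch compiles; the real code calls into libpod.
func removePod(name string) error       { fmt.Println("removing pod", name); return nil }
func removeVolume(name string) error    { fmt.Println("removing volume", name); return nil }
func removeContainer(name string) error { fmt.Println("removing container", name); return nil }

func main() {
	if _, err := playKubeDown([]string{"mypod"}, []string{"myvol"}, "svc-ctr"); err != nil {
		fmt.Println("teardown failed:", err)
	}
}
```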

@openshift-ci openshift-ci bot added the release-note-none and approved labels Mar 16, 2023
@umohnani8
Member Author

@edsantiago I think this should fix the race. I will re-run the test a bunch of times in the f37 aarch64 root environment to verify.

@edsantiago
Member

This may be a stupid question, but what exactly is the purpose of --wait if it doesn't actually wait?

If --wait only guarantees that the containers are stopped, then maybe the fix is to remove -a from the podman ps -aq at the end? (Actually, removing the -q would be pretty helpful for future errors, too. --noheading would be better).

If --wait should guarantee that the containers are removed, then maybe these flakes are actually showing a real bug that needs to be fixed?

Either way, I think the documentation needs to be fixed to specify what --wait is intended to do.
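
As a rough illustration of the check being discussed (the real test is a shell script, not Go, and the output handling here is made up), listing containers with --noheading instead of -q keeps full rows for any leftovers while still making an empty result easy to assert on:

```go
// Illustrative only: shows the kind of post-teardown check being discussed —
// list all containers without the header line and fail if anything remains.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// --noheading keeps full rows (useful for debugging leftovers) while
	// avoiding the header line, unlike -q which hides everything but IDs.
	out, err := exec.Command("podman", "ps", "-a", "--noheading").CombinedOutput()
	if err != nil {
		fmt.Fprintf(os.Stderr, "podman ps failed: %v\n%s", err, out)
		os.Exit(1)
	}
	if leftovers := strings.TrimSpace(string(out)); leftovers != "" {
		fmt.Fprintf(os.Stderr, "containers still present after kube play --wait:\n%s\n", leftovers)
		os.Exit(1)
	}
	fmt.Println("no containers left behind")
}
```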

@umohnani8
Member Author

--wait is supposed to clean up the resources created once the pods have exited, so the -a is needed to verify that. I have tested this countless times and the resources have always been removed. I have no real way of reproducing this flake given that it is only happening in one environment. My guess is that remote might be a bit slow in this environment in finishing the removal by the time we do a ps -aq to verify it.
I can add some more debug output to the test to see what may be happening if this flake continues to happen after this patch.

@rhatdan
Member

rhatdan commented Mar 17, 2023

Seems most likely that the containers were marked for removal but not fully removed. Is the remote side waiting for the content to be removed? Is there a way to tell containers to be removed and return without waiting for them to be removed?

@Luap99
Member

Luap99 commented Mar 17, 2023

I think the problem is that the serviceContainer is removed via the worker queue. This is not a problem for local podman because it waits for all queue jobs to be completed before it exits. However, in the remote case the service will finish the API response, but there is no way of controlling which jobs have been done by the worker queue.

I think the client should do the equivalent of podman wait --condition removing serviceContainer in this case to ensure everything is cleaned up before it exits.

cc @vrothberg
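
A rough sketch of that client-side idea follows; the container name is made up, and shelling out to the CLI here is purely illustrative (the real remote client would go through the API bindings). The "removing" condition value is the one suggested in the comment above.

```go
// Sketch of the alternative described above: have the client block until the
// service container reaches the "removing" state before returning, so nothing
// is left behind when `podman kube play --wait` exits.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	serviceCtr := "my-service-ctr" // hypothetical service container name

	// Equivalent of `podman wait --condition removing <ctr>`. A real
	// implementation would treat a "no such container" error as the
	// container already having been removed.
	cmd := exec.Command("podman", "wait", "--condition", "removing", serviceCtr)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "wait failed:", err)
		os.Exit(1)
	}
	fmt.Println("service container cleanup observed; safe to exit")
}
```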

@vrothberg
Member

> I think the problem is that the serviceContainer is removed via the worker queue. This is not a problem for local podman because it waits for all queue jobs to be completed before it exits. However, in the remote case the service will finish the API response, but there is no way of controlling which jobs have been done by the worker queue.

Very thorough analysis, @Luap99!

> I think the client should do the equivalent of podman wait --condition removing serviceContainer in this case to ensure everything is cleaned up before it exits.

Alternatively, the service container could be removed in PlayKubeDown() in the backend. Note that it must be removed last and after all containers, networks, volumes, etc.

@umohnani8 umohnani8 changed the title from Fix wait test to avoid race to Add service ctr cleanup to PlayKubeDown on Mar 20, 2023
@umohnani8
Member Author

Thanks @vrothberg and @Luap99 - added service container cleanup to PlayKubeDown()

Member

@Luap99 Luap99 left a comment


LGTM

Member

@vrothberg vrothberg left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm label Mar 21, 2023
@openshift-ci
Contributor

openshift-ci bot commented Mar 21, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: umohnani8, vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [umohnani8,vrothberg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit d8265f0 into containers:main Mar 21, 2023
@github-actions github-actions bot added the locked - please file new issue/PR label Sep 5, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 5, 2023
Successfully merging this pull request may close these issues:
  • kube play --wait test: looks like a race