Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Use Kubernetes GC to clean kubevirt VMs (packet-* jobs) #11530

Merged
merged 10 commits into from
Nov 14, 2024

Conversation

VannTen
Copy link
Contributor

@VannTen VannTen commented Sep 13, 2024

What type of PR is this?
/kind feature

What this PR does / why we need it:
We regularly have CI flakes where the job failed to delete k8s namespace in the CI cluster.
It's not much, but it's a little hiccup in the PR process which I'd like to eliminate.

I'm not sure what the exact reason is, probably some race between the jobs and the time between fetching the list of namespace and the deletion.
Regardless, a simpler way to delete the VMs is to let them be dependants (in the kubernetes sense) of the job pod. This way, once the job pod is deleted, kubernetes garbage collection in the CI cluster will take care of removing the associated VMs

Special notes for your reviewer:
PR on the ci infra kubespray/kspray-infra#1 (private repo, maintainers have access)

Does this PR introduce a user-facing change?:

NONE

/label tide/merge-method-merge

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. tide/merge-method-merge Denotes a PR that should use a standard merge by tide when it merges. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 13, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: VannTen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 13, 2024
@VannTen
Copy link
Contributor Author

VannTen commented Sep 13, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Sep 13, 2024
@VannTen
Copy link
Contributor Author

VannTen commented Sep 13, 2024

/cc @ant31

@k8s-ci-robot k8s-ci-robot requested a review from ant31 September 13, 2024 11:03
@VannTen
Copy link
Contributor Author

VannTen commented Sep 13, 2024

/retest
(now that I fixed the gitlab-runner config)

@VannTen
Copy link
Contributor Author

VannTen commented Sep 13, 2024 via email

@VannTen VannTen force-pushed the ci/cleanup_with_k8s_gc branch 2 times, most recently from fd43216 to d70f8e2 Compare September 20, 2024 13:45
@VannTen
Copy link
Contributor Author

VannTen commented Sep 20, 2024

/label ci-full

(To test it works correctly for everything)

@k8s-ci-robot
Copy link
Contributor

@VannTen: The label(s) /label ci-full cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label ci-full

(To test it works correctly for everything)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@VannTen VannTen force-pushed the ci/cleanup_with_k8s_gc branch from d70f8e2 to bbfd93a Compare September 20, 2024 13:50
@VannTen
Copy link
Contributor Author

VannTen commented Sep 20, 2024 via email

@ant31
Copy link
Contributor

ant31 commented Sep 23, 2024

I think the PR to add them via /label is still in review

For now you can add them manually
image

@VannTen
Copy link
Contributor Author

VannTen commented Sep 23, 2024

I'll do that once the initial set of tests pass then 👍

@VannTen VannTen force-pushed the ci/cleanup_with_k8s_gc branch from 25bb5d0 to 32a0dfc Compare September 23, 2024 10:17
@VannTen VannTen force-pushed the ci/cleanup_with_k8s_gc branch 3 times, most recently from 81ac78a to d1ca52f Compare October 4, 2024 07:18
@ant31
Copy link
Contributor

ant31 commented Nov 5, 2024

we probably want to add some delay here and there and have some kind of retry at this step

@VannTen
Copy link
Contributor Author

VannTen commented Nov 5, 2024 via email

@ant31
Copy link
Contributor

ant31 commented Nov 5, 2024

maybe upgrading kubevirt ?

@VannTen
Copy link
Contributor Author

VannTen commented Nov 5, 2024 via email

VMI in Kubevirt are the abstraction below VirtualMachine.

- We don't really need the extra abstraction of VirtualMachine objects
- Convert the waiting for VMs ip address to use kubernetes.core.k8s_info
  and no shell pipeline
Not constraining the inventory to .ini allows us to use dynamic
inventory, which is needed for simplifying kubevirt jobs inventory.

Also reduces the scope of the ANSIBLE_INVENTORY variable.
@VannTen
Copy link
Contributor Author

VannTen commented Nov 13, 2024

/label ci-extended

@k8s-ci-robot k8s-ci-robot added the ci-extended Run additional tests label Nov 13, 2024
@VannTen VannTen force-pushed the ci/cleanup_with_k8s_gc branch from 13f0332 to a4e517d Compare November 13, 2024 16:41
@VannTen
Copy link
Contributor Author

VannTen commented Nov 13, 2024

I believe I have found a way to make this work: instead of using the kubevirt inventory plugin, I'm creating a static inventory out the the valid state
(which we have in "Wait for vms to have IP addresses")

@VannTen VannTen force-pushed the ci/cleanup_with_k8s_gc branch from a4e517d to f0f437e Compare November 13, 2024 16:55
@VannTen
Copy link
Contributor Author

VannTen commented Nov 13, 2024 via email

@VannTen
Copy link
Contributor Author

VannTen commented Nov 13, 2024 via email

@VannTen
Copy link
Contributor Author

VannTen commented Nov 14, 2024

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 14, 2024
This allows a single source of truth for the virtual machines in a
kubevirt ci-run.

`etcd_member_name` should be correctly handled in kubespray-defaults for
testing the recover cases.
VirtualMachineInstance resources sometimes temporarily loose their
IP (at least as far as the kubevirt controllers can see).
See kubevirt/kubevirt#12698 for the upstream
bug.

This does not seems to affect actual connection (if it did, our current
CI would not work).
However, our CI execute multiple playbooks, and in particular:
1. The provisioning playbook (which checks that the IPs have been
   provisioned by querying the K8S API)
2. Kubespray itself

If any of the VirtualMachineInstance looses its IP between after 1
checked for it, and before 2 starts, the dynamic inventory (which is
invoked when the playbook is launched by ansible-playbook) will not have
an ip for that host, and will try to use the name for ssh, which of
course will not work.

Instead, when we have a valid state during provisioning (all IPs
presents), use it to construct a static inventory which will be used for
the rest of the CI run.
We should not rollback our test setup during upgrade test.
The only reason to do that would be for incompatible changes in the test
inventory, and we already checkout master for those (${CI_JOB_NAME}.yml)

Also do some cleanup by removing unnecessary intermediary variables
The new CI does not define k8s_cluster group, so it relies on
kubernetes-sigs#11559.

This does not work for upgrade testing (which use the previous release).
We can revert this commit after 2.27.0
@VannTen VannTen force-pushed the ci/cleanup_with_k8s_gc branch from f0f437e to 47f6781 Compare November 14, 2024 08:41
@VannTen
Copy link
Contributor Author

VannTen commented Nov 14, 2024

@ant31 @tico88612 this should be ready for review (finally ^^)

@ant31
Copy link
Contributor

ant31 commented Nov 14, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 14, 2024
@ant31
Copy link
Contributor

ant31 commented Nov 14, 2024

great work, thanks @VannTen

@k8s-ci-robot k8s-ci-robot merged commit 05e2b47 into kubernetes-sigs:master Nov 14, 2024
52 checks passed
@VannTen VannTen deleted the ci/cleanup_with_k8s_gc branch November 29, 2024 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. ci-extended Run additional tests cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. tide/merge-method-merge Denotes a PR that should use a standard merge by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants