Skip to content
This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

Vagrant: boot q35 machine and other improvements #3965

Merged
merged 8 commits into from
Nov 12, 2021

Conversation

wainersm
Copy link
Contributor

I started this string of commits by changing the Vagrantfile so that the VM boots with q35 machine type (fix #3752). As q35 is required to run the vfio tests (because it needs iommu and one extra virtio-net device) and the Vagrantfile had no support for such as test, I also changed it so that if the user exports CI_JOB=VFIO then extra arguments are passed to QEMU to allow testing Kata + VFIO.

Other changes came after to address the issues #3757 and #3942.

I tested those changes on my laptop x86_64 (Intel) with Fedora 33. I booted the fedora and ubuntu guests for VFIO and non-VFIO tests (CI_JOB=CRIO_K8S) normally (although sometimes the SSH connection fails after reboot; usually I destroy/up an 2nd time and it works). I don't have a ubuntu bare-metal to test the VFIO job but I will give a try with nested VMs.

Still about the VFIO arguments being passed to QEMU...now I realized that they are specific for Intel processors, I should be working on the AMD equivalents when/if time permits (if that is your case, i.e., you'd like to run VFIO tests on AMD, then please let me know).

Warn: if you have any vagrant boxes created from our old Vagrantfile then I recommend you to destroy them all before apply this pull request. The reason is because I changed the prefix name which is used to create the domain in libvirt; so it may leave orphan domains on your machine.

@wainersm wainersm added no-backport-needed Changed do not need to be applied to an older branch / repository area/vagrant-local-ci labels Sep 14, 2021
@wainersm wainersm requested a review from a team as a code owner September 14, 2021 23:27
Copy link
Contributor

@dgibson dgibson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good, to the extent that I understand them, with the exception of a couple of nits I've commented on. Currently trying the changes out as well, I'll follow up when I have some results from that.

Vagrantfile Outdated Show resolved Hide resolved
Vagrantfile Outdated Show resolved Hide resolved
@wainersm
Copy link
Contributor Author

I just updated this pull request, where I:
a) addressed @dgibson 's the comments by simplifying the -machine argurment string for VFIO tests; also the virtio-net is attached to a higher PCIe slot number
b) added the commit 3b286bf

Vagrantfile Outdated Show resolved Hide resolved
Vagrantfile Outdated
@@ -44,6 +44,23 @@ Vagrant.configure("2") do |config|
if host_arch == "x86_64"
lv.machine_type = "q35"
end
# The VM needs one additional virtio-net device and iommu enabled
# for the vfio tests.
if job == "VFIO" and host_arch == "x86_64"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there actually a compelling reason to do this only if the VFIO job is specified. That means that it's not possible to start up the vagrant box with the default options, then run the VFIO job inside manually (possibly after running other tests firsts). Since it takes a pretty long time for vagrant to initialize the box, having to redo it between different types of tests is pretty painful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason is that for Ubuntu it will require a reboot, and reboots sometimes fail. Fedora will always reboot as it needs the cgroups tweak, so not a concern for Fedora.

Indeed it is time consuming and painful the switch between jobs. I will accept your comment and always pass those flags regardless whether VFIO or not. If it turns out a problem when I can implement another solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to rollback that idea... in Ubuntu host vagrant fails to detect reboot. If I always set the variables for VFIO tests then the ubuntu guest will always reboot (so it won't work for any CI_JOB).

cat <<-EOF
This script helps you to clean up the VMs and Vagrant's control
directories.
By default it will attempt to destroy all VMs. For additionally
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of horrible that this isn't built into vagrant :(.

@dgibson dgibson linked an issue Sep 27, 2021 that may be closed by this pull request
Currently the VM is configured with QEMU's i440FX machine type which has no
PCIe support (among other restrictions). This changed the Vagrantfile so that
the it boots a q35 machine instead.

Fixes kata-containers#3752
Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
@wainersm
Copy link
Contributor Author

I just updated this PR to address two comments from @dgibson :

  • Passing -machine kernel_irqchip=split instead of -machine q35,kernel_irqchip=split
  • QEMU and kernel parameters required for VFIO tests are passed by default (i.e. for any type of CI_JOB)

@dgibson while this PR may not be perfect, it seems good enough (okay, I am suspect...). Can I have your approval?

README.md Outdated
then please [open an enhancement request](https://github.com/kata-containers/tests/issues/new/choose)
to let us know.

|Host \ Guest | fedora | ubuntu |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • s!\!/!g
  • I don't understand this table ;(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jodh-intel , I will rephrase to:

Currently it supports the creation of Fedora 32 and Ubuntu 20.04 VM, as shown on the table
below. The Vagrantfile was tested on Fedora 33 and Ubuntu 20.04 hosts, and it is
known to fail the boot of Fedora VM on
Ubuntu host. If you have the need of testing on a different guest or it fails to work
on your host's distro then please open an issue
to let us know.

Host / Guest Fedora 32 Ubuntu 20.04
Fedora 33 Yes Yes
Ubuntu 20.04 No Yes

Does it looks better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - I see what you were trying to do with the backslash now. Assuming I'm now understanding this, I'd make it clearer still, even if it does involve a bit of repetition:

Host Fedora 32 guest Ubuntu 20.04 guest
Fedora 33 Yes Yes
Ubuntu 20.04 No Yes

Maybe one day markdown will support cell spans natively, like reStructuredText has for years.

done

if ! command -v vagrant &>/dev/null; then
echo "ERROR: missing 'vagrant' command. Run $0 -h for help."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error messages should be redirected to stderr (here and elsewhere).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


vagrant_destroy "$vm"
if [ $? -ne 0 ]; then
echo "Attempt to clean up the domain on libvirt"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To the user, it's unclear if this is a warning or informational message. Maybe add a "WARNING" or "INFO" prefix to clarify?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I will print as a warn message.

@wainersm wainersm force-pushed the vagrant_q35 branch 2 times, most recently from dfe3f2f to e1b3ae5 Compare October 28, 2021 14:23
@wainersm
Copy link
Contributor Author

@jodh-intel @dgibson can we get this merged? I think it is good enough... I can work on follow up PRs if needed.

@dgibson
Copy link
Contributor

dgibson commented Nov 1, 2021

/test
/test-vfio

cidir="$(dirname $(readlink -f "$0"))"

# Print message to stderr. Add the 'ERROR:' prefix.
perror() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spot the C hacker! 😄

Should this call exit 1 though? It looks like the majority of calls immediately exit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I just updated the PR but I renamed perror to die :(

if vagrant destroy --force "$vm"; then
echo "VM '$vm' destroyed: OK"
else
echo "VM '$vm' destroyed: FAILED"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/echo/perror/ ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I just want to info the user.

done

[ $ret -eq 0 ] || \
perror "Failed the removal of some VM."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the script die at this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wainersm.

@jodh-intel
Copy link
Contributor

/test

@wainersm
Copy link
Contributor Author

/test-ubuntu

@wainersm
Copy link
Contributor Author

/retest-clh-k8s-containerd

@wainersm
Copy link
Contributor Author

/test-vfio

@wainersm wainersm added the wip Work in Progress (PR incomplete - needs more work or rework) label Nov 10, 2021
@wainersm
Copy link
Contributor Author

Moved to wip to fix the job jenkins-ci-ubuntu-1804-containerd-k8s-e2e-minimal. This job is running on Ubuntu 18.04, which apparently hasn't the driverctl package (see commit a3d903a of this PR).

The commit a043363 introduced improvements to the vfio
functional test which now requires the driverctl tool. The
driverctl package is installed in the host via .ci/vfio_jenkins_job_build.sh,
but that script is never called by the .ci/setup.sh; as a result, the
package isn't installed when I run the Vagrant-based job locally.

This commit adds the driverctl (also pciutils) to the list of packages
installed in the host Fedora and Ubuntu.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
The Vagrantfile was changed so that the VM will have IOMMU enabled
and an additional virtio-net device in case of the VFIO job.

Additional Kernel command-line arguments are required to enable IOMMU,
therefore GRUB is reconfigured and the VM should reboot.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
The vagrant configuration sometimes can get into inconsistent state. For
instance, the domain on libvirt was created by the framework that thinks
the box is not initialized yet. Or eventually the developer wants to
simply wipe out all vagrant files and created resources from his/her
workstation.

In this commit it was introduced the vagrant-cleaner.sh script which
aims on help developers to clean up the vagrant stuffs.

Fixes kata-containers#3757
Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
Added on README.md a table with the combinations of tested host
versus guest so developers are aware of what is (or not) expected
to work.

Fixes kata-containers#3942
Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
By default Vagrant creates the domain in libvirt prefixed with
the project's directory name, in our case it is "tests-". Let's
use meaningful name instead.

Below is shown the new domain names in libvirt.

$ virsh list --all
 Id   Name                          State
---------------------------------------------
 1    kata_containers_test-ubuntu   running
 4    kata_containers_test-fedora   running

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
The grubby package is already installed in the pulled VM image.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
The installation of packages in the generic/fedora32 box is way faster
than the currently used fedora/32-cloud-base. In my laptop it can save
approximately 30min of the time to `vagrant up fedora`. So in this
commit I am switching the box used.

Signed-off-by: Wainer dos Santos Moschetta <[email protected]>
@wainersm
Copy link
Contributor Author

Updated the commit fe4bc7f so that driverctl package is only installed on Ubuntu 20.04 (or newer) because that isn't available on 18.04 thus some CI jobs failing.

@wainersm wainersm removed the wip Work in Progress (PR incomplete - needs more work or rework) label Nov 11, 2021
@wainersm
Copy link
Contributor Author

/test

@wainersm
Copy link
Contributor Author

Hi @jodh-intel , updated the commit fe4bc7f , mind to review again? :)

Copy link
Contributor

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wainersm.

lgtm

@jodh-intel jodh-intel merged commit 3f2a663 into kata-containers:main Nov 12, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/vagrant-local-ci no-backport-needed Changed do not need to be applied to an older branch / repository
Projects
None yet
3 participants