Future of Power CI under P10/PowerVM #2473

Closed
ravanelli opened this issue Oct 1, 2021 · 17 comments
Labels
enhancement New feature or request jira for syncing to jira

Comments

@ravanelli
Member

I'm creating this issue so we have a common place to discuss the next steps for Power CI, get more insight from the multi-arch folks, and decide the best way to move forward.

With gangplank we are improving our CI to create a more multi-arch world for FCOS/RHCOS/COSA, and also to eliminate some issues such as duplicated CIs. arm64 was successfully added, and now we are looking for Power and s390x to be part of this beautiful world.

Unfortunately, Power faces some struggles going forward. As we know, P10 dropped bare metal support (PowerVM only), and RHEL 9 also dropped support for KVM on Power.

Our entire CI is based on QEMU/KVM. It would be really hard to change it just to accommodate Power.

Recently, I was trying to enable gangplank remote on Power, using a server provided by IBM in IBM Cloud. However, this server is a P9 running PowerVM, and this is where we start to feel the issues of combining PowerVM and KVM.

I reached out to folks at IBM to better understand the options we have here, and the feedback I got so far is:

  • kvm_pr has not been supported for a long time, and Red Hat removed it from the RHEL tree a few months ago (should be available on RHEL 8.4). There's no upstream support either.

  • As for TCG, pseries+tcg works on PowerVM without problems. The problem is that it is considerably slower than pseries+kvm. Not officially supported by IBM/Red Hat, only upstream support is provided.

    I was also able to build FCOS with a couple of TCG warnings. However, --basic-qemu-scenarios kept running for more than 1 hour without returning any results.

  • kvm_hv never ran on PowerVM. There may be plans to make this happen, but that depends on the PowerVM roadmap.
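The KVM-versus-TCG choice described in these bullets could be sketched roughly as follows. This is only an illustration, not something cosa or kola does today: the function name `pick_accel` and its optional path argument are hypothetical, and the only facts relied on are that Linux exposes KVM as the `/dev/kvm` device node and that QEMU accepts `-accel kvm` and `-accel tcg`.

```shell
#!/usr/bin/env bash
# Sketch: choose a QEMU accelerator based on whether a KVM device node is usable.
# pick_accel and its optional path argument are hypothetical names for illustration.
pick_accel() {
    local dev="${1:-/dev/kvm}"
    # KVM is usable only if the node is a character device we can read and write.
    if [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]; then
        echo "kvm"
    else
        # No usable KVM (for example, kvm_hv is not available): fall back to TCG,
        # which works but is considerably slower.
        echo "tcg"
    fi
}

# Example (not executed here):
#   qemu-system-ppc64 -accel "$(pick_accel)" -machine pseries ...
```

On a PowerVM LPAR without KVM this would print `tcg`, which matches the slow-but-working path described above.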

Looking at these scenarios, it looks like we are not really able to run KVM under PowerVM.

More details:
https://bugzilla.redhat.com/show_bug.cgi?id=2008271

@dustymabe
Member

Thank you for writing this up @ravanelli.

Looking at these scenarios, it looks like we are not really able to run KVM under PowerVM.

Ouch.. That really breaks our existing model and will force us to carry quite the delta just to add that architecture.

@mkumatag

mkumatag commented Oct 1, 2021

cc @clnperez @manojnkumar

@laggarcia

Here is a summary of the discussion we had with Renata on this topic. If I got something wrong, please let me know, as I am not knowledgeable about COSA/FCOS/RHCOS.

The CI infrastructure controller you have today runs in an x86 environment. At some point in the process, this controller contacts a Power server to actually build the Power images and run basic build verification tests on them. There are two requirements on the Power server so that it can integrate seamlessly with your infrastructure:

  • It needs to have a public IP address.
  • It needs to support running KVM/QEMU guests, as the built images will be tested by launching a KVM/QEMU VM on the Power server.

To fulfill these requirements, you will have to run your build process on a POWER9 bare metal machine, and you will need to find one that is available with a public IP address. Once you have that, you should have no issues running the build process on that machine and spawning VMs with the built image to do your basic verification of the build process.

The lack of a Power10 system with KVM support should not be an impediment here. The build process usually targets older processor versions for compatibility and support reasons. Just as an example, IIRC, RHEL 8 is built targeting POWER8 processors because it needs to run on both POWER8 and POWER9 processors. So, for the foreseeable future, using a POWER9 bare metal machine to build the FCOS image and test the build process with KVM should be enough. This environment will be supported for many years to come.

Please, let me know in case you have any additional questions on this.

@ravanelli
Member Author

Thanks @laggarcia for all the discussion related to this topic.

Right now, we don't have any bare metal Power server with public IP access that would allow us to continue with the FCOS improvements for Power. Unless we can find one, there is no other option but to wait.

@jcajka
Collaborator

jcajka commented Oct 4, 2021

@laggarcia my understanding has been that FCOS CI/pipeline requires openstack/aws/ocp(nowadays it should be just the first two) like cloud infra and is not really able to work with stable VMs/hosts. @dustymabe please correct me if I'm wrong.
@ravanelli We should have KVM-based POWER9 VMs that can be provided (if there is no issue with them being outside of the Fedora infra), hosted at Brno University of Technology. Possibly even one whole bare metal P8 box. AFAIK nested KVM should work there.

@ravanelli
Member Author

ravanelli commented Oct 4, 2021

@jcajka How reliable is the support at Brno University? I tried to use the minicloud at Unicamp, but the lack of support is really an issue there. I had to wait more than a month to get a firmware update.

@clnperez

clnperez commented Oct 4, 2021

You can also get an openstack environment from OSU: https://osuosl.org/services/powerdev/request_hosting/. I've only ever requested standalone VMs, but have had very good stability and support from them. Not suggesting over Brno, but if we need another option that's one to consider. I believe this project falls under the "Free and Open Source" restriction.

@dustymabe
Member

@laggarcia my understanding has been that FCOS CI/pipeline requires openstack/aws/ocp(nowadays it should be just the first two) like cloud infra and is not really able to work with stable VMs/hosts. @dustymabe please correct me if I'm wrong.

We can work with a single bare metal machine and talk to it over SSH. That's what we're doing currently for aarch64

@jcajka
Collaborator

jcajka commented Oct 5, 2021

@dustymabe cool, good to know. I had assumed that being in AWS was essential for various reasons, mostly redeployment, etc.
@ravanelli what are your expectations and requirements? For most issues, if a solution (such as new FW) is available from the HW vendor, I can probably resolve it in under a week (I'm one of the admins there). But formally it is not a commercial offering, so it is best effort.

@clnperez

Can we pick this conversation back up? We're getting a couple of new pings from customers about OKD.

@mtarsel
Contributor

mtarsel commented Sep 11, 2024

So I have built the Fedora CoreOS images for ppc64le on a Power10 Rainier with firmware 1060.10, running Fedora 40 with kernel version 6.10.7-200.fc40.ppc64le. KVM has been enabled from the HMC and the kvm_hv module is loaded.

I thought this issue would be the best place to update my status about this effort but I am available on slack to discuss next steps if that's easier.

I followed the instructions from the docs…

Ran build.sh
create new dir
cosa init fcos-url
cosa fetch; cosa build

This machine is using Legacy Compatibility interrupt mode which is referred to as XICS in QEMU. As such, the following warning happens when running the tests:

qemu-system-ppc64: warning: kernel_irqchip allowed but unavailable: IRQ_XIVE capability must be present for KVM
Falling back to kernel-irqchip=off

Currently, KVM on an LPAR doesn't support native XIVE, so QEMU has no in-kernel irqchip support, which means the KVM interrupt controller is turned off. A suggested set of flags would be something like:

qemu-system-ppc64 -accel kvm -machine pseries,ic-mode=xics
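One way to pick that flag automatically could look like the sketch below. It is a heuristic under stated assumptions, not existing cosa or kola behavior: the function name `pseries_machine_arg` is hypothetical, and it simply keys off the warning text quoted above. `ic-mode=xics` itself is the pseries machine option already shown in the suggested command.

```shell
#!/usr/bin/env bash
# Sketch: derive the -machine argument for a pseries guest from QEMU's own stderr.
# Assumption: if an earlier run printed the IRQ_XIVE warning, the host cannot
# provide native XIVE to KVM, so XICS is requested explicitly.
pseries_machine_arg() {
    local qemu_log="$1"   # text captured from an earlier qemu-system-ppc64 run
    case "$qemu_log" in
        *"IRQ_XIVE capability must be present"*)
            echo "pseries,ic-mode=xics" ;;
        *)
            # No warning seen: leave the interrupt controller mode at QEMU's default.
            echo "pseries" ;;
    esac
}

# Example (not executed here):
#   qemu-system-ppc64 -accel kvm -machine "$(pseries_machine_arg "$captured_stderr")" ...
```

Matching on a log message is fragile if the wording changes between QEMU versions, so a real patch would probably want a more direct capability check.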

I ran the tests like this

cosa kola run --parallel 4

However, for the past couple of weeks I have not been able to get a complete test run. The tests stall and I'm not sure how to debug this further.

In my build dir, the ./tmp/kola/reports dir is empty and in test.tap I see:


[root@f40-de kola]# cat test.tap 
1..89
ok - ext.config.networking.ifname-karg.udev-rule-firstboot-propagation
ok - ext.config.networking.nameserver
ok - fcos.users.shells
ok - coreos.unique.boot.failure
ok - ext.config.gshadow
ok - ext.config.boot.grub2-install
ok - ext.config.var-mount.luks
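To quantify how far a stalled run got, the TAP plan line can be compared against the results that were actually written. The helper below is a small sketch; `tap_summary` is a hypothetical name, and it assumes the file follows the plain TAP layout shown above (a `1..N` plan line followed by `ok` / `not ok` lines).

```shell
#!/usr/bin/env bash
# Sketch: summarize a TAP stream to see how many planned tests never reported.
# Reads TAP on stdin and prints "planned=N ok=N not_ok=N missing=N".
tap_summary() {
    awk '
        /^1\.\.[0-9]+/ { sub(/^1\.\./, "", $1); planned = $1 + 0 }  # plan line, e.g. 1..89
        /^ok /         { ok++ }                                     # passed tests
        /^not ok /     { not_ok++ }                                 # failed tests
        END {
            printf "planned=%d ok=%d not_ok=%d missing=%d\n",
                   planned, ok, not_ok, planned - ok - not_ok
        }
    '
}

# Example (not executed here), run from the directory containing test.tap:
#   tap_summary < test.tap
```

For the listing above this reports `planned=89 ok=7 not_ok=0 missing=82`, i.e. 82 of the 89 planned tests never produced a result before the run stalled.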

Is there another output dir where the tests would be stored?
Is there an existing deny-list for ppc64le?

Additionally, Oregon State University Open Source Lab (OSU OSL) does have Power10 machines available that will have kvm enabled. I’m hoping to replicate this setup at OSU on an LPAR and this could provide us with a p10 kvm setup without a vpn to run tests long term. More info

@jlebon
Member

jlebon commented Sep 13, 2024

Thank you for working on this!

This machine is using Legacy Compatibility interrupt mode which is referred to as XICS in QEMU. As such, the following warning happens when running the tests:

qemu-system-ppc64: warning: kernel_irqchip allowed but unavailable: IRQ_XIVE capability must be present for KVM
Falling back to kernel-irqchip=off

Currently, KVM on an LPAR doesn't support native XIVE, so QEMU has no in-kernel irqchip support, which means the KVM interrupt controller is turned off. A suggested set of flags would be something like:

qemu-system-ppc64 -accel kvm -machine pseries,ic-mode=xics

Yeah, we've seen that warning for a while now and haven't dug into it. Feel free to submit a patch to choose the right set of arguments based on $factors.

I ran the tests like this

cosa kola run --parallel 4

However, for the past couple of weeks I have not been able to get a complete test run. The tests stall and I'm not sure how to debug this further.

In my build dir, the ./tmp/kola/reports dir is empty and in test.tap I see:


[root@f40-de kola]# cat test.tap 
1..89
ok - ext.config.networking.ifname-karg.udev-rule-firstboot-propagation
ok - ext.config.networking.nameserver
ok - fcos.users.shells
ok - coreos.unique.boot.failure
ok - ext.config.gshadow
ok - ext.config.boot.grub2-install
ok - ext.config.var-mount.luks

Is there another output dir where the tests would be stored? Is there an existing deny-list for ppc64le?

Which tests stall? You should see log files under e.g. tmp/kola/qemu-latest/. You can upload that directory.

@mtarsel
Contributor

mtarsel commented Oct 10, 2024

Tests have been consistently passing on my Power10 machine with the following command: cosa kola run --tag '!reprovision'

@jlebon
Member

jlebon commented Oct 30, 2024

@mtarsel Where are we on the reprovisioning tests? Have you been able to get those to pass now?

@mtarsel
Contributor

mtarsel commented Oct 30, 2024

I now have a P10 KVM-enabled box at OSU OSL for development, and yesterday all tests passed consistently. I'm not sure what changed on this new box. My only guess at this point is that the previous box might not have had enough disk space.
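If disk space really was the culprit, a pre-flight check before starting a test run could rule it out early. The following is only a sketch: `free_kib` and `require_free_kib` are hypothetical helper names, and the minimum is whatever the caller chooses, since nothing in this thread states how much space a kola run needs.

```shell
#!/usr/bin/env bash
# Sketch: check free space on the filesystem holding a directory before testing.

# Print the available space, in KiB, for the filesystem containing $1.
free_kib() {
    # df -P keeps each filesystem on one line; with -k, column 4 is "Available" in KiB.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Return non-zero (with a message on stderr) if less than $2 KiB is free under $1.
require_free_kib() {
    local dir="$1" min_kib="$2" avail
    avail="$(free_kib "$dir")"
    if [ "$avail" -lt "$min_kib" ]; then
        echo "only ${avail} KiB free under ${dir}, need ${min_kib}" >&2
        return 1
    fi
}

# Example (not executed here); the 20 GiB figure is an arbitrary placeholder:
#   require_free_kib . $((20 * 1024 * 1024)) && cosa kola run
```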

@jlebon
Member

jlebon commented Oct 30, 2024

Sweet, that's great to hear!
Overall, it seems like we can close this issue?

@mtarsel
Contributor

mtarsel commented Oct 30, 2024

Yes I think we can close this issue since there exists a P10 environment where all tests are passing.

I'm still investigating 2cf91c9; however, I don't think that is directly related to the future of CI with P10.

@jlebon jlebon closed this as completed Nov 6, 2024

9 participants