WIP: Add RHCOS oscontainer into payload, render to 00-$role-osimageurl MC #273
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cgwalters. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Force-pushed from a346594 to 0271d40.
The high-level idea with this so far (still writing this code) is that we have a ConfigMap that points to the oscontainer and is applied by the CVO. Then, the controller renders that into `machineconfigs/00-$role-osimageurl`. This supersedes #258 and also takes a small bit of #228 in that we're taking the first non-empty osImageURL.
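The "first non-empty osImageURL" merge rule can be sketched roughly like this (a simplified stand-in type; the real MachineConfig type lives in the MCO's API package):

```go
package main

import "fmt"

// MachineConfig is a pared-down stand-in for the real MCO type; only
// the field relevant to this sketch is shown.
type MachineConfig struct {
	Name       string
	OSImageURL string
}

// mergeOSImageURL implements the merge rule discussed above: take the
// first non-empty osImageURL, so we don't have to be picky about the
// ordering of the fragments being merged.
func mergeOSImageURL(configs []MachineConfig) string {
	for _, c := range configs {
		if c.OSImageURL != "" {
			return c.OSImageURL
		}
	}
	return ""
}

func main() {
	// Hypothetical names and pull spec, for illustration only.
	configs := []MachineConfig{
		{Name: "00-master", OSImageURL: ""},
		{Name: "00-master-osimageurl", OSImageURL: "registry.example.com/rhcos@sha256:abc"},
	}
	fmt.Println(mergeOSImageURL(configs)) // prints the osimageurl fragment's value
}
```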
A very important thing here is that we need to make an immediate choice: do we pin the oscontainer in this repo and then update it via PRs? (Would be very noisy and painful, but give us CI gating.) Or in the short term do we have it float? Or (this would be my preference probably), matching discussion in openshift/installer#987, do we have the release controller "gather" it?
(Hm, since AIUI the release controller reacts to …)
Force-pushed from 0271d40 to 1882c9e.
I would recommend pushing to the image stream because it is exactly what ART will do, so we can then get the RHCOS image integrated into OCP exactly like origin more quickly. You can always control how often you push.

Instead of pinning or pushing, you can also just manually tag a new one in, or we can set up a periodic job that grabs the latest maipo, tests it, and then promotes it if the test passes (I could set that up in < 1hr).

This one sounds good! I'd be interested in seeing the code and understanding how that works... there's a lot of magic in the ci-operator/release stuff that I haven't yet fully grasped. The "test it" part is blocked a bit on this PR landing though, right?

And it turns out we aren't pushing a … Although... maybe we really do need to separate out "add rhcos to payload" from "apply updates"? Here's a possibility: we could add a config option or so to ignore …

Yes - once you have an image, with a payload, it's really easy to script overlaying that test. I can go over what would be necessary there. That would block promoting a new image, but we would still need a job for PRs that overrides the image appropriately (which is not difficult).
Force-pushed from 1882c9e to 79b5001.
OK, this is updated 🆕 - we now have a …
Today each MC will contain both an Ignition fragment and an `osImageURL`. Define "merging" as using the first non-empty `osImageURL` so we don't have to be very picky about ordering. This is a smaller version of openshift#228 Prep for: openshift#273
Hm:
Hmm, but it worked in #258
/test images

/hold

/retest
This PR is held - it's probably OK to just leave it as-is, since it won't be merging yet. We have more issues to work through.
Success! Retested using a different oscontainer and it worked (using image files up to commit 7999a25).
OK, so #301 is definitely happening in some of the CI runs from this PR - but is it actually a new issue? As far as I understand it... The other thing (in the new PR status at the top) is... actually, I think I am wrong: there's no "initial master" versus "secondary masters"; the installer creates them all at once. Which means they should all consistently do the early pivot. So I think the design here is going to work.
@cgwalters IME with AWS clusters (which is what I'm running everything on), while I see the error-syncing messages, my nodes do not degrade because of them. Is this perhaps a libvirt issue?
Yeah; those error messages aren't relevant to the real problem in #301 - I edited the initial message to remove them. The actual problem in #301 is that if a MC is garbage collected while a node is booting targeting it, the MCD will fail to load it on start, and the node will go degraded. And that's what I saw in at least one CI run.
I think we have this in CI but we're not noticing it. If it's happening we need to fix it. Ref: openshift#301
Force-pushed from 7999a25 to 662d0fe.
OK, I rebased 🏄 this, squashed the two commits into one for clarity, and did … This is still waiting for openshift/pivot#25 and the corresponding MCD work, though, but we can at least run what we have through e2e here.
Well, that's a kind of progress! New test successfully failing. But the cluster got GC'd before I could really debug it...
Why do I see changes to …? Secondly, I would like to avoid any more diff that is served by the server apart from the generated MachineConfig contents.
It's to handle the difference between the "bootimage" (e.g. AMI) versus the CVO-driven payload. Will push this in a bit:
How would we pass the target data other than writing a file in Ignition?
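For concreteness, "writing a file in Ignition" would look something like this sketch of an Ignition (spec 2.x) fragment. The file path comes from the commit message in this PR; the exact pull spec and encoding here are illustrative assumptions, not the actual generated config:

```json
{
  "ignition": { "version": "2.2.0" },
  "storage": {
    "files": [
      {
        "filesystem": "root",
        "path": "/etc/rhcos-initial-pivot-target",
        "mode": 420,
        "contents": {
          "source": "data:,registry.example.com%2Frhcos%40sha256%3Aabc"
        }
      }
    ]
  }
}
```

Ignition inlines small file contents as a `data:` URL, so the target image pull spec is URL-encoded into `contents.source`.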
And to elaborate, this is now required because of #245 - previously the MCD would have just said "oh, it's a config difference" and updated, but the problem with that approach was we could get into reboot loops and other issues as described in that PR's commit message. (And an advantage of this is that we reboot before the node joins the cluster; it doesn't make sense to land cluster workloads and then immediately evict them and reboot.)
Why is … getting served as an addition by the server, not through the MachineConfig?
You're arguing for having the MCD handle it? Then we'd need to special-case the "initial pivot" there - maybe a good way to do that would be to have "bootstrap node annotations exist" as a flag specifying that. That has some advantages; it's more consistent. But one downside, as I noted, is that the node comes online and then reboots right away. Does anyone else have a strong opinion on this?
I think I'm confused; any doc detailing the exact flow that this PR is moving towards would help me judge the approach or even provide feedback (if that matters). Currently I am not following what's happening and why :(
Have the MCC take `osImageURL` as provided by the cluster update/release payload and generate a `00-{master,worker}-osimageurl` MC from it, which ensures the MCD will update the node to it. However, we need special handling for the *initial* case where we boot into a target config, but we may be using an old OS image. Change the MCC to write the target osImageURL from the MC it uses for bootstrapping to `/etc/rhcos-initial-pivot-target`. This will then be handled by the `rhcos-initial-pivot.service` systemd unit. Closes: openshift#183
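A rough sketch of what the `rhcos-initial-pivot.service` unit described above might look like. Only the unit name and the `/etc/rhcos-initial-pivot-target` path come from the commit message; the condition, ordering, and the way `pivot` is invoked are assumptions for illustration:

```ini
# Sketch only - not the actual shipped unit.
[Unit]
Description=Pivot to the initial RHCOS target before joining the cluster
# Only run when the MCC wrote a target during bootstrapping (assumed).
ConditionPathExists=/etc/rhcos-initial-pivot-target
Before=kubelet.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Hypothetical invocation: pivot the host to the image URL the MCC wrote.
ExecStart=/bin/sh -c 'pivot "$(cat /etc/rhcos-initial-pivot-target)"'

[Install]
WantedBy=multi-user.target
```

This matches the flow in the commit message: the node reboots into the target OS image before the kubelet joins the cluster, rather than joining first and pivoting afterwards.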
Force-pushed from 662d0fe to 01c309d.
No worries; we've gone through a lot of architecture changes here, and there are now a whole lot of comments on this PR that GitHub is hiding. Did you see #273 (comment)? To rephrase a bit: if we landed the bit of this PR to add … And see #273 (comment) again for the semantics of "validate". Options:
/test e2e-aws-op
@cgwalters: The following test failed, say …

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@abhinavdahiya @cgwalters it may be worth chatting about this in a higher-bandwidth system
@cgwalters: PR needs rebase.
This is obsoleted by #363
PR Status: This closes the gap in getting the RHCOS oscontainer into the update payload, and having the MCO then render that down into the MachineConfigs.

Introduce a new ConfigMap `machine-config-osimageurl` that points to the oscontainer and is applied by the CVO. Then, the controller renders that into `machineconfigs/00-$role-osimageurl`, which then finally goes into the rendered config, and should be applied by the daemon.

Closes: #183
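A minimal sketch of what the `machine-config-osimageurl` ConfigMap could look like. The name comes from the description above; the namespace, data key, and pull spec are assumptions for illustration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: machine-config-osimageurl
  namespace: openshift-machine-config-operator
data:
  # Hypothetical pull spec; in practice the CVO substitutes the
  # oscontainer reference from the release payload.
  osImageURL: registry.example.com/rhcos@sha256:abc
```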