reboot coordination: locksmith successor #3
Comments
I think there are several interlocking questions here:
Would it make sense to break these out into separate issues? |
We had some out-of-band discussion on this, and here I'm summarizing the points we covered:
This is not yet a final design, but if there are no controversies or radically different suggestions we can move forward with it. |
Isn't that just |
@cgwalters from my shallow understanding of |
Yeah, you're right. That said, I lament the lack of reboot management in rpm-ostree itself for the single-node case today; you can see that in the discussion threads. I'd like to support simple logic like "reboot if an update is ready and there are no active sessions" as a systemd timer unit that we can also render in rpm-ostree status. |
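To make the "reboot if an update is ready and there are no active sessions" rule concrete, here is a minimal sketch of the decision as a pure function. The `HostState` type and the `should_reboot` name are hypothetical and not part of rpm-ostree; how the two inputs would actually be gathered (for example from rpm-ostree and the session manager) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostState:
    """Hypothetical snapshot of the inputs to the reboot decision."""
    update_staged: bool    # a new deployment is ready and only needs a reboot
    active_sessions: int   # number of currently logged-in sessions


def should_reboot(state: HostState) -> bool:
    """Reboot only when an update is waiting and nobody is logged in."""
    return state.update_staged and state.active_sessions == 0


if __name__ == "__main__":
    # A timer unit could run a check like this periodically.
    print(should_reboot(HostState(update_staged=True, active_sessions=0)))
```

Keeping the rule as a side-effect-free check is one way to let the same logic both drive a timer and be reported in a status view, assuming the status output only needs the boolean result.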
That makes sense to me. We could add a trivial |
For reference, here is the |
Maybe... we were discussing this in the rpm-ostree ticket too. There are also the "headless IoT" and "desktop" cases. |
I'm wary of having separate reboot flows for the reboot-coordinated and uncoordinated cases, not only because of the code complexity but because it'd be another point of confusion when configuring the system: do I enable rpm-ostree reboots or locksmith reboots? What if I enable both? @cgwalters How would you feel about moving all reboot handling into whatever is replacing locksmith? Edit: or at least disabling/hiding the rpm-ostree knob on FCOS. |
Do we see "locksmith2" handling the degenerate case of a single node system? That's one possible approach; if you deploy a single node it skips using etcd and all of that and just talks directly to the local rpm-ostreed.
Upstream rpm-ostree will definitely continue to support being driven entirely by an external agent using the DBus API. The current |
If we scope in not just reboot management but also "channel management" (see #22), then one approach is for this agent to point rpm-ostree at commit objects rather than refs. That UX is better now, and we're going to be using this for RHCOS, where host updates are always cluster-driven. |
I'd say that:
Followup with some feedback after an initial exploratory experiment. In #83 (comment) we discussed keeping the on-host logic to a minimum and moving the etcd semaphore management into a container reachable over HTTP. The latter would be the locksmith successor (locksmith2?), which I tried to explore at https://github.com/lucab/exp-locksmith2. These are my experimental findings, starting from original locksmith code:
I sketched an experimental on-host agent in parallel for double-checking; more follow-ups on this will come later in #83. Huge thanks to @s-urbaniak for quick-pairing on historical locksmith code ❤️. |
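For readers unfamiliar with the reboot semaphore mentioned above, the following is a minimal in-memory sketch of the idea: at most a fixed number of nodes may hold a reboot slot at once. It is an assumption-laden illustration only. The `RebootSemaphore` class and its method names are hypothetical, it keeps state in a Python set rather than in etcd, and it does not reflect the HTTP API of the experimental service.

```python
class RebootSemaphore:
    """Hypothetical counting semaphore limiting concurrent reboots."""

    def __init__(self, max_holders=1):
        if max_holders < 1:
            raise ValueError("max_holders must be at least 1")
        self.max_holders = max_holders
        self.holders = set()  # IDs of nodes currently holding a slot

    def lock(self, node_id):
        """Try to take a reboot slot; return True if the node holds one."""
        if node_id in self.holders:
            return True   # idempotent: re-locking keeps the same slot
        if len(self.holders) >= self.max_holders:
            return False  # all slots taken, the node must retry later
        self.holders.add(node_id)
        return True

    def unlock(self, node_id):
        """Release the slot after reboot; a no-op if none is held."""
        self.holders.discard(node_id)


if __name__ == "__main__":
    sem = RebootSemaphore(max_holders=1)
    print(sem.lock("node-a"), sem.lock("node-b"))
```

Making both operations idempotent is a design choice in this sketch: a node that retries a request after a network error, or that reboots and asks to unlock twice, does not corrupt the slot count.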
Out-of-band request: this "containerized locksmith" replacement is going to manage fleet-wide reboot locks, but it is not carrying over all locksmith functionality. As such, it should have its own proper name which is not linked back to the original "locksmith". We recently went through a similar renaming exercise, so we could just pick a name from the list: coreos/afterburn#126 |
What about |
If we go with something at least partially descriptive, I like |
Chipping in with a +1 for |
Due diligence for naming: there are no |
Created coreos/airlock. |
The components implementing the two ends of this discussion are up at https://github.com/coreos/zincati and https://github.com/coreos/airlock in a minimum-functionality form (additive, non-breaking changes will follow with each new iteration). The only remaining piece of work is closing the loop in zincati with coreos/zincati#37. Closing in favor of that. |
Container Linux (CL) encourages using locksmith + etcd by default as a "cluster". Do we want to do that out of the box, or focus on e.g. https://github.com/ashcrow/container-linux-update-operator/tree/spike ?
Another option is to document how to "roll your own" coordination with e.g. Ansible; we have APIs.