Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFE: Move user-data secret creation from the installer into the MCO #683

Closed
wking opened this issue Apr 30, 2019 · 19 comments
Closed

RFE: Move user-data secret creation from the installer into the MCO #683

wking opened this issue Apr 30, 2019 · 19 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@wking
Copy link
Member

wking commented Apr 30, 2019

Currently the installer creates pointer Ignition configs which end up as user data secrets referenced by the installer-generated Machine(Sets). That makes things like the Ignition v2 (spec 3) transition tricky, because you need to touch both the installer (openshift/installer#1468) and MCO (#583) with some kind of ratcheting.

I think we should drop the installer's pointer Ignition-config secrets and instead have the MCO create pointer Ignition-config secrets for any roles it supports (currently master and worker). This consolidates knowledge of supported roles their configuration in the MCO and simplifies the installer->MCO interface. The MCO can put the MCS cert into the pointer configs without having to worry about cross-repo communication (and it is better positioned to update the secrets if/when the MCS cert rotates). The MCO can switch Ignition versions without involving the installer. The MCO can move the MCS to a new URI scheme without involving the installer. The interface is reduced to just the installer relying on "the cluster will have master-user-data and worker-user-data secrets".

Thoughts?

@cgwalters
Copy link
Member

I had forgotten about this issue, but actually came back to this idea as part of #736

@cgwalters
Copy link
Member

I thought about this more, and it makes sense to do this - but I don't see how we get out of the installer understanding Ignition because it needs to generate it for the bootstrap machine and currently the masters.

And improving this actually introduces a new cross-dependency between the MCO and machineAPI operator. Which is a good thing I think but currently the installer pregenerating the userdata implicitly avoids a dependency.

cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue May 15, 2019
This is prep for fixing
openshift#683

Today the installer generates this "pointer Ignition" which ends up
as e.g. user-data in AWS, which is just a pair of (MCS URL, root CA).
We need to do this because of size limitations on AWS user data.

It makes a lot of sense for the MCO to be in control of generating
the pointer ignition config too, as it helps centralize knowledge
of Ignition.

Ultimately, the goal here is tighter integration between machineAPI
and the the MCO; e.g. the machineAPI asks to generate the pointer
Ignition config, rather than having it stored as a static secret.
This could then unlock the ability for the MCO to inject e.g.
one time auth tokens.
@cgwalters
Copy link
Member

PR in #759

@cgwalters
Copy link
Member

Discussed this with some machineAPI folks today. One thing that came out of that is that a good likely design is for the MCO to watch for creation of new machine objects (note: targeting our own cluster), and then create a secret/userdata-$machine in the openshift-machine-api namespace. The machineAPI would spin until that secret was created, and then launch the underlying instance.

And we'd clean up the generated secrets when a machine was deleted.

@derekwaynecarr
Copy link
Member

@cgwalters i like that option. seems promising.

@runcom
Copy link
Member

runcom commented Jun 5, 2020

The machineAPI would spin until that secret was created, and then launch the underlying instance.

by reading the code in the machine-api-operator - it's not the MAO which would spin but all the providers must be tweaked to spin (https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/machines.go#L106) which makes it trickier afaict - I'll sync with Alberto on this

@runcom
Copy link
Member

runcom commented Jun 5, 2020

and then create a secret/userdata-$machine

the change of name here to be per-machine will imply also that MachineSets aren't holding that anymore (meaning more changes in MachineSets as well I guess)

@enxebre
Copy link
Member

enxebre commented Jun 5, 2020

by reading the code in the machine-api-operator - it's not the MAO which would spin but all the providers must be tweaked to spin (https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/machines.go#L106) which makes it trickier afaict - I'll sync with Alberto on this

yeh you are right, but that's our burden, it's a limitation in our API and we'll have to deal with it, e.g we could let providers return a well known error type and let the core machine controller react accordingly, or consider deprecating the providers userData field, or have a new backward compatible machine.enableAutomaticBootstrapping=true...

One thing that came out of that is that a good likely design is for the MCO to watch for creation of new machine objects (note: targeting our own cluster), and then create a secret/userdata-$machine in the openshift-machine-api namespace. The machineAPI would spin until that secret was created, and then launch the underlying instance.

sgtm. Do you think we could try to extend this identity awareness to completely remove the need for the machineApprover?
One thing to keep in mind while making MCO aware of machines is that MCO still needs to support scenarios where the machine api might not be operational.

Would the MCO have the ability to track if that owned secret resulted in a successful node or a failure? If so could we let this new controller this signal back to the machine object i.e as a condition to leave a breadcrumb for users on why that machine didn't become a node?

@JoelSpeed
Copy link
Contributor

As far as I understand this, as Machine API, we would need to drop the user-data secret reference in each of our providers and instead have some sort of role field so that MCO knows whether the machine is a worker or master. This would probably be better on the machine rather than the providerspec so that MCO can read it easily and MAPI can spin waiting for it before calling to the actuators. We wouldn't need to update the actuator interface, they would just need to know how to find the secret from the MachineSpec rather than ProviderSpec.

If we didn't deprecate the user-data-secret field, it would somehow need to be modified as machines are created wouldn't it? We would still want it to point to a valid secret name, but the MachineSet can't have it set to the right name. A webhook could do this, but it feels a little clunky?

sgtm. Do you think we could try to extend this identity awareness to completely remove the need for the machineApprover?

Would your suggestion be to move to unique boostrap tokens per machine and move back to the built in K8s CRS approving mechanisms?

Would the MCO have the ability to track if that owned secret resulted in a successful node or a failure? If so could we let this new controller this signal back to the machine object i.e as a condition to leave a breadcrumb for users on why that machine didn't become a node?

If it didn't track that a node joined the cluster, how would it clean up resources that are no longer needed? Time based? Owner Ref to Machine?

@wking
Copy link
Member Author

wking commented Jun 5, 2020

This issue predates openshift/enhancements, but if there is renewed interest it might make sense to port to an enhancement so you can freeze out places which have consensus. I'm fine either way, but thought I'd mention it in case folks were interested.

@runcom
Copy link
Member

runcom commented Jun 8, 2020

I'm going to create an enhancements but I've been thinking a lot about this the past few days. Given the current schedule I think coupling MCO/MAO is going to be non-trivial to accomplish in the timeframe we have to ship ignition v3, or well, it will require more than than this idea I came up with w/o tightly coupling Machines/Sets into MCO:

  • The secret changes namespace to openshift-machine-config-operator
  • The MAO is adjusted to read the secret from that namespace (1)
  • MCO manages the lifecycle of that secret directly in the MCO (as it would have been anyway) so we can support v3 upgrades
  • The MAO grab the secret
  • Everything works the same w/o renaming the secret to be per-machine (so the MCO doesn't depend on MAO and viceversa)

I think the above is a good approach and we leverage the controllers we already have in place in MCO/MAO and we use the Status's of the CRD to synchronize between controllers and it requires less changes to both projects (MAO: 2 major? changes)

Let me know what you all think meanwhile (/cc @yuqi-zhang)

@cgwalters
Copy link
Member

Overall 👍 from me! This sounds like a simple integration, which is good because it'll be easy to implement and understand, and we can do something more sophisticated later.

But, let's at least briefly look at

Basically if we can create a flow where it's more like the MAO can use an annotation on its machineset that selects the secret (which we need for control plane vs worker anyways) then we can extend it to generic pools and ideally eventually machine-specific configs.

@runcom
Copy link
Member

runcom commented Jun 8, 2020

But, let's at least briefly look at

Basically if we can create a flow where it's more like the MAO can use an annotation on its machineset that selects the secret (which we need for control plane vs worker anyways) then we can extend it to generic pools and ideally eventually machine-specific configs.

don't get me wrong, I 100% support that per-machine secret and it indeed would allow for more sophisticated scenarios like the one highlighted there! I think that design would be best once we move forward but I feel the migration to ignition v3 brings alone a lot of complexity and pitfalls that I'd love to remedy by starting with something simpler (hope it makes sense). I also think the per-machine secret would still be valid once we go the simpler path, the simpler path itself isn't a stopgap but a simpler implementation so it makes sense in the future to allow per-machine secrets to implement things like #1619 #1720

@runcom
Copy link
Member

runcom commented Jun 9, 2020

I think we should drop the installer's pointer Ignition-config secrets and instead have the MCO create pointer Ignition-config secrets for any roles it supports (currently master and worker).

I just realized we can do this only in the code but the installer should still create the master|worker.ign for UPI to work correctly and the MCO isn't running until bootkube.

@michaelgugino
Copy link
Contributor

Discussed this with some machineAPI folks today. One thing that came out of that is that a good likely design is for the MCO to watch for creation of new machine objects (note: targeting our own cluster), and then create a secret/userdata-$machine in the openshift-machine-api namespace. The machineAPI would spin until that secret was created, and then launch the underlying instance.

And we'd clean up the generated secrets when a machine was deleted.

The abstraction that I'd like to see implemented can be summarized as such:

ObjRef:
    Name: string
    Namespace: string
    Kind: string (or some other concrete kind type)
    KeyName: string
    PrependSelfName: bool

Machine:
...
ProviderSpec: {}
UserObjects: [] ObjRef

The machine-controller could fetch 1 or more objects per machine with the dynamic client during create. These objects would be passed to each actuator (cloud provider) as map[string]interface. Each actuator would verify the type and presence of the items fetch.

Consider the fields in ObjRef. Name and Namespace are obvious. KeyName is what goes in the key portion of 'map[string]interface'. The actuator knows already what keys it wants. PrependSelfName is meant to prepend the machine's own name to the object's Name it's looking for.

Example:

Machine:
  ObjectMeta:
    Name: worker-1
  Spec:
  ...
  userObjects:
  - Name: "-user-data"
    Namespace: "openshift-machine-api"
    Kind: "secret"
    KeyName: "instance-user-data"
    PrependSelfName: true

This would cause the machine-controller to request the secret named "worker-1-user-data" from namespace openshift-machine-api. If that object is found, that object is placed into the map we pass to the actuator. EG:

c.Get(mything)
userObjects["instance-user-data"] = mything
actuator.Create(userObjects ...)

This kind of abstraction could be widely supported, I'm sure there are other k8s components that need to do similar things.

In this manner, we could support all the clouds similarly, and since it supports arbitrary objects, is quite extensible.

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2020
@openshift-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 22, 2020
@openshift-bot
Copy link
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Copy link
Contributor

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zaneb zaneb mentioned this issue Nov 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
None yet
Development

No branches or pull requests

9 participants