Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenized Cloud Auth Enablement For Operators #1339

Merged
merged 72 commits into from
Jun 13, 2023

Conversation

bentito
Copy link
Contributor

@bentito bentito commented Feb 6, 2023

I think this is ready for approval review: @sdodson, @joelanford, @bparees (you have one comment thread that seems maybe unaddressed as of 6/13 ?), @JoelSpeed

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 6, 2023
@openshift-ci openshift-ci bot requested review from jsafrane and sjenning February 6, 2023 19:23
Copy link
Contributor

@gallettilance gallettilance left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting this started!

Copy link
Contributor

@jharrington22 jharrington22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this up @bentito ! I've added some initial feedback, more to come. I'd like to generalize this a little more to be cloud agnostic. STS is a AWS service not really the name of an industry standard.

Copy link
Member

@sdodson sdodson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, submitting pending comments from quite some time ago.

Also, I think we should broadly change from "STS" to "Token authentication" or "Token Based Workload Identity" or something in between unless we're specifically talking about AWS when we say STS.

@JoelSpeed
Copy link
Contributor

If I read this correctly, there's an assumption that this new role-arn field could be populated by either the user or a controller, what happens if the user manually created the roles and then hasn't set the field?

Copy link
Contributor

@bparees bparees left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things that are still unclear to me:

  1. what is responsible for creating the CredReq object? Is this just another resource defined in the CSV? as a custom resource alongside the CSV? Either way, if so, it sounds like olm needs updates to support it. Other possible components to create it could be the console(ew, relies on user having permission to create credreqs), the olm operator itself(creates its own credreq which could be messy)

  2. discussion of how olm operators advertise the permissions/roles they need created(not just that they are STS-enabled), especially in light of the packagemanifest only showing information from the head bundles which may not be what the user is installing. Note there might be a phase 1 vs phase 2 aspect this (phase 1 is just install instructions, phase 2 is programmatic metadata)

  3. what is responsible for creating the roles from (2) (possible answers include manually by the user, automatically by the ROSA cli, maybe other possible answers? Again, may have a phase1 and phase2 answer, but if users are creating this the roles/perms manually it's not clear to me how we've significantly moved the needle beyond where we are today by offering some minor ARN value passing)

@bentito
Copy link
Contributor Author

bentito commented Apr 19, 2023

If I read this correctly, there's an assumption that this new role-arn field could be populated by either the user or a controller, what happens if the user manually created the roles and then hasn't set the field?

Well-behaved operator will sit there waiting for info about those roles to be added the CredentialsRequest, that is waiting on the Secret to be present so it can use it to use the role info.

Comment on lines 135 to 138
Subscription objects. The Subscription fields will allow UX for the information needed by CCO or webhook, for input of
the cloud credentials. Generate a manual ack for operator upgrade when cloud resources are involved. This can be
determined by parsing the Subscription objects. Ensures admin signs off that cloud resources are provisioned and working
as intended before an operator upgrade.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new ... Subscription fields

Is this proposing a change to the OLM Subscription API?

Generate a manual ack for operator upgrade when cloud resources are involved.

It isn't clear to me what this means. If the Subscription has spec.installPlanApproval: Automatic, but an upgrade requires cloud API permission changes, it seems like we would be violating user intent by injecting a manual ack.

But this leads me to a boarder question: at what point of the OLM install/upgrade progression would this manual ack be injected, and how would it be surfaced to a user? Would OLM be expected to check the permissions so that the ack is only required if permissions need to change, or are we expecting users to manually ack every single install/upgrade involving an STS-supporting bundle deployed on an STS-enabled cluster?

- Add CredentialRequests to the bundle, known location;
- Add "role-arn"-type fields to the CredentialRequests
- Add references to the CredentialRequests in the CRD spec;
- Use SDK to validate bundle to catch permission changes or mis-configuration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this some sort of static validation of the annotations? Or is this suggesting that SDK somehow assesses the requests the operator intends to make and warns/errors if they require increased permission?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this needs more detail about what validation is being done and where/when/how

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I need to re-read this again, but having just read through, there's lots of discussion about putting information in various places to trigger actions, eg putting role arn annotations, but, I couldn't work out onto what object or what specific annotations I'm supposed to be using.
Can we make sure the mechanics of this are clarified, perhaps within the workflows? or API changes?

I'd also like to understand why we are using annotations rather than fields on the objects (though I guess if those objects aren't owned by us this may be why).

bentito added 3 commits May 18, 2023 08:55
Mostly to check workflow graph Mermaid is working in Github
To start fleshing out the proposed solution.
@sdodson
Copy link
Member

sdodson commented Jun 9, 2023

@bentito I fixed it in 9926a34 but that commit got stomped, just re apply those changes and it should be fine

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 13, 2023

@bentito: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sdodson
Copy link
Member

sdodson commented Jun 13, 2023

@bparees I approve of this in its current form.

I understand that future refinements will include clarification around Hypershift to ensure CCO is compatible with use on management cluster and extension to add Azure as a supported platform, but those all seem very minor to me.

Copy link
Contributor

@bparees bparees left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bunch of nits that i'd like to see cleaned up post-merge, but i'm going to merge this so it can land now. Thanks for the persistence, @bentito and everyone else who contributed

Mint mode to automatically create the IAM user and credentials required for the operator to
authenticate with the Cloud Provider.

In Manual mode or more generally on clusters that support TAT (CCO not necessarily installed), an admin installing an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: supporting TAT isn't "more general" than clusters in manual mode, it's a different want of managing auth.


Extract the CredentialsRequest from the operator's image or codebase in order to know what IAM role is appropriate for
the operator to assume
if using the [Cloud Credential Operator Utility](https://github.com/openshift/cloud-credential-operator/blob/master/docs/ccoctl.md), `ccoctl`:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit...."Extract the CredentialsRequest from the operator's image or codebase in order to know what IAM role is appropriate for the operator to assume and, if using the....... : "

(if they aren't using ccoctl, then they need to do other steps to create the roles and secret)

(via OLM) simpler on clusters where TAT authentication is supported.

For an example of how operators are currently navigating accessing cloud resources today, see the AWS STS install
instructions for the [AWS EFS CSI Driver Operator](https://docs.openshift.com/container-platform/4.11/storage/container_storage_interface/persistent-storage-csi-aws-efs.html#efs-sts_persistent-storage-csi-aws-efs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would be better to make this a permalink since if we succeed w/ this EP, the EFS install instructions are likely going to change.

Validation of CredentialsRequest by this new token-aware CCO? Maybe. Currently, CCO gets some permissions under ROSA
(unused), could expand these permissions to include tools like AWS's Policy Simulator such that it could validate a role
has permissions, and with this CCO could be the alerting mechanism for a changed CredentialsRequest without sufficient
permissions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we doing this in the initial implementation or not? I assume we are not. Can we update this to make it clear what's being done and what is being left as future improvements?

**OperatorHub and Console changes**: Allow for input from user of additional fields during install depending on the
Cloud Provider which will result in ENV variables on the Subscription object that are required in the
CredentialsRequests created by the Operator. Setting the Subscription config.ENV will allow UX for the information
needed by CCO or webhook, for input of the cloud credentials while not having to change the Subscription API.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: needed by CCO. no webhook, since we abandoned that approach, right?

don't automatically get upgraded without first having the admin verify the permissions required by the next version
and making the requisit changes if needed prior to upgrade.

Additionally, a doc will be created to:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: by doc you mean a future enhancement?

2. Extract the operator description where it should be clear what cloud permissions the operator requires.

**Operator team changes**: Follow new guidelines for allowing for the operator to work on token auth
enabled cluster. New guidelines would include the following to use CCO has detected cluster is using time-based tokens:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New guidelines would include the following to use CCO has detected cluster is using time-based tokens:

nit: there's a typo or word missing here or something.

- Changing CSV to add CredentialRequest RBAC
- If loading the CredentialRequests from a yaml, place it into a discoverable location in the codebase
- Add a bundle annotation claiming token-based authentication support
- Add a script in the operator description, per supported cloud provider, to help set up the correct role
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: i assume you mean cloud provider roles and not cluster rbac roles?

- If loading the CredentialRequests from a yaml, place it into a discoverable location in the codebase
- Add a bundle annotation claiming token-based authentication support
- Add a script in the operator description, per supported cloud provider, to help set up the correct role
- (Optional) Add the projected ServiceAccount volume to the Deployment embedded in the CSV;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't optional, it's not needed at all because it always happens automatically.

cloud platform.

For Operator Administrator:
- Supply cloud credentials for STS (ARN ID, etc) in the known location.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by "in the known location" you mean in the ARN field the Console displays and/or the Subscription object? if not, i don't know what this is referring to.

@bparees bparees added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jun 13, 2023
@bparees
Copy link
Contributor

bparees commented Jun 13, 2023

/approve

@bparees
Copy link
Contributor

bparees commented Jun 13, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 13, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 13, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bparees

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.