Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

need to represent role assignments for resources #890

Closed
Tracked by #849
davepacheco opened this issue Apr 7, 2022 · 10 comments
Closed
Tracked by #849

need to represent role assignments for resources #890

davepacheco opened this issue Apr 7, 2022 · 10 comments

Comments

@davepacheco
Copy link
Collaborator

davepacheco commented Apr 7, 2022

Recall that:

  • actors are the term we use for a user or program that can get access to something (e.g., "alice")
  • resources are the things in the system that actors can get access to (e.g., a Project)
  • roles define a set of permissions for a specific kind of resource (e.g., "project viewer" grants "read" permissions on a Project and everything inside it)
  • role assignments combine a specific resource, role, and actor, saying: "alice has the 'project viewer' role for Project 'maze-war'"

We need a way to grant role assignments on resources. I expect it'll be pretty straightforward to do this on the database side, but I'm not clear yet how to represent this in the API.

RFD 43 touches on this (heavily inspired by GCP). But much of that RFD has been deferred to post-MVP (including custom roles and custom policies) and it also doesn't actually say how policies are managed (i.e., is there a /policies endpoint where you create them, get back an id, then set a policy_id field on a resource?)

Option 1: separate API endpoints underneath each resource (e.g., .../roles)

e.g., /organization/:org_name/roles, with the usual CRUD to create or remove role assignments for that resource. If you granted access to three Groups and two Users, you'd have five HTTP resources under this one. The HTTP resources under this endpoint would have a representation of the actor and role, with the resource being implied by the path. e.g., GET /organizations/maze-war/roles would show a list of objects with actor and role.

Nice:

  • matches the current database representation pretty well -- CRUD on these corresponds to CRUD of a row in role_assignments
  • common operations are very straightforward: create/remove a role assignment grants/revokes access
  • seems pretty similar to what GitHub does with repo/team assignments

Not so nice:

  • cannot modify the whole ACL for a resource atomically

Risks:

  • are there resources today where we can't do this (e.g., because there's no place to put child resources), or where it's awkward to do it (e.g., because they don't look like a resource that should have child resources)? Edit: I don't think so today, since we only let you assign roles on the Fleet, Silos, Organizations, and Projects.
  • if/when we add custom policies, what will these endpoints do?

Option 2: one separate API endpoint underneath each resource (e.g., .../policy)

This is the same as option 1, but we put the entire list of roles into one resource called policy. So if you have three Groups and two Users with access, you'd have one Policy object with an array of five assignments (rather than five separate HTTP resources, as in the previous case).

Nice:

  • can update the whole ACL atomically
  • it's clear how this could be evolved for custom policies, though it might be awkward (this might turn into an object with one policy_id field that points to some other thing)

Not so nice:

  • More annoying for clients that just want to grant/revoke access to one actor -- they have to do a read-modify-write
  • More complex implementation (every change involves fetching a bunch of records, updating a bunch of records, and maybe doing this in batches)
  • Size limits will be needed to avoid requests being huge or triggering huge database queries. GCP appears to apply such limits so that shouldn't be unreasonable.

Risks:

  • Are there resources today where we can't do this, or where it's awkward to do it? (Same as above, and for the same reason, I don't think so.)

Option 3: a top-level "policy" property on every resource with the policy contents

This looks just like option 2, except instead of the Policy being a separate resource, it's a new top-level property called policy on resources that are allowed to have their own policy. That property directly contains the list of role assignments.

Nice:

  • can easily fit into any resource (whereas putting a "policy" resource under some resources might look awkward)
  • can update the whole ACL atomically
  • it's clear and smooth to evolve this for totally custom policies

Not so nice:

  • object representations may get large, and a large fraction of them might be taken up with these assignments
  • it means fetching a potentially large number of database records every time you want to get every resource, even if the user doesn't want or need those things. This seems like kind of a big deal.

Option 4: Policies as separate HTTP resources

This seems closest to what RFD 43 envisions, where each resource is associated with a policy that's managed via some other endpoint (e.g., /policies, with associated CRUD)

Nice:

  • Smooth transition to custom policies (but possibly only because we have to do all the work we're trying to punt on by not doing custom policies now).

Not so nice:

  • This model seems a lot more annoying to users and clients. If you just want to grant/revoke access to a resource: you need to fetch the resource, look at its policy id, fetch that policy, compute the new value for the policy, and then either update the policy directly or else create a new policy and then update the original resource to point to the new policy.

I lean towards option 1 as it seems to provide a clear, simple model for clients and end users and I think it's also the easiest to implement :) though it does mean creating a bunch more API endpoints, with associated boilerplate, tests, etc.


Note that I think there's a somewhat deeper question here, which is about the model we want to expose to end users.

The model of GitHub, Google Drive, etc. is:

  • There's a relatively small number of "important" resources (for GitHub: organizations and repos; for Drive: directories and documents). These are the main resources that you can grant access to.
  • For each resource, there's basically a list of users or groups with access and a specific level of access. These are basically a small number of canned roles (like "admin", "editor", "commenter", "viewer").

This is similar to how things work in our API today, and we've said this is what we're going to do for MVP because it's much simpler to implement. Personally, I find it pretty intuitive as a user.

The model of AWS and GCP seems to be more like: you can assign arbitrarily complex policies to virtually any resource. This is much more flexible. But we've also heard feedback that IAM is nightmarishly complex in AWS and GCP -- is that partly because in order to grant someone access, you need to create a new policy that grants them a specific role and attach that policy to a resource? Feels like a lot of paperwork for the common case.

Whatever we pick here, I don't think we're foreclosing on custom roles and policies, because we can implement compatibility APIs that read or modify the corresponding bits of a resource's policy. On the other hand, if we want the simpler model for the longer term, it doesn't make sense to do option 4 now, and maybe not 2-3 either.

@jessfraz
Copy link
Contributor

jessfraz commented Apr 7, 2022

1 seeems fine to me and easy to generate everywhere else if its a known thing that mostly always exists by the same name

@smklein
Copy link
Collaborator

smklein commented Apr 7, 2022

I don't have much to add (thanks for the great writeup!) but IMO options 1 + 2 both seem reasonable - and option 1 seems the most intuitive, even if it does have the minor "atomic batch updates" issue

@davepacheco
Copy link
Collaborator Author

Based on that feedback, I was all set on option 1. But as I go to implement it, I found a critical part that I had overlooked: if there's a separate resource for each role assignment, how are they identified? If you look at the resources:

/organizations/my-org/roles
/organizations/my-org/roles/$role_assignment_identifier_of_some_kind
                            ^ what actually goes here?

Internally, role assignments are uniquely identified by (resource, actor, role_name). We could add a uuid for them, but it's kind of lame, since the whole row is unique and immutable already. One answer is to encode the actor and role name, so it'd be something like /organizations/my-org/roles/099e2f95-fc1b-c9ed-b6b0-eb9a9c166436-admin (which would identify a role assignment for user "099e2f95-fc1b-c9ed-b6b0-eb9a9c166436" role "admin" on organization "my-org"). This is practically kind of nice because you could easily check whether someone has a specific role with a HEAD or GET on that resource, and you could idempotently remove access by just deleting it (without a read-modify-write). But it looks really awkward to me, and raises questions around "what if you want to reference the user by username instead of id", etc.

Another idea would be to create a whole hierarchy there, so it's:

/organizations/my-org/roles/$role_name/$user_id

but that feels unnecessarily bureaucratic (especially if a client wants to fetch all role assignments -- they'd need to make several requests) and would constrain the implementation in annoying ways (like having a way to efficiently list role assignments on a resource for a particular role).

This whole thing has me coming around to option 2. If we expose one resource with PUT/GET, this question goes away. One of the downsides we said with option 2 is that it's more annoying for people trying to delete one role (revoke access for one person), but I don't think it actually is any harder for that case. If we used uuids, you'd still have to list the roles to get the uuid to delete. If we used an identifier that encodes the user id and role name, you'd probably have to list all the role assignments to make sure you found all the roles this user had and remove them.

The other downsides were around batching and limits. I'd propose that we set a pretty conservative limit (like 32 role assignments per resource) and not bother with pagination internally. This is consistent with GCP, which has a limit of 1500 principals. (1,500 is obviously much larger than 32, but we can always bump this up if folks want it. Then we can evaluate whether we need pagination internally.)

@david-crespo
Copy link
Contributor

Sorry for not getting to this sooner — I've had this tab open for weeks.

Like with firewall rules, I lean toward a pragmatic approach that combines 1 and 2. Both are useful for all the reasons you give. I agree it's a bummer to create UUIDs for the already-unique (resource, actor, role_name) tuples. It's also hideous to put those in path params. Unlike with resource hierarchies, the order of the params would be arbitrary. How about using a POST body to avoid that?

POST /organizations/my-org/roles/add (or /remove)

{
  "role_name": "abc",
  "actor_id": "def"
}

(Do we need an actor_type or something? How do we give a role to a group?)

I agree about putting limits on the number of role assignments so we don't have to paginate, but 32 seems a little small.

@davepacheco
Copy link
Collaborator Author

Sorry for not getting to this sooner — I've had this tab open for weeks.

Like with firewall rules, I lean toward a pragmatic approach that combines 1 and 2. Both are useful for all the reasons you give. I agree it's a bummer to create UUIDs for the already-unique (resource, actor, role_name) tuples. It's also hideous to put those in path params. Unlike with resource hierarchies, the order of the params would be arbitrary. How about using a POST body to avoid that?

POST /organizations/my-org/roles/add (or /remove)

{
  "role_name": "abc",
  "actor_id": "def"
}

I think we can do this in addition to Option 2. That is, we can have PUT/GET /organizations/my-org/roles, as well as POST to add/remove individual roles if folks really want it.

(Do we need an actor_type or something? How do we give a role to a group?)

Yes, we do.

I agree about putting limits on the number of role assignments so we don't have to paginate, but 32 seems a little small.

Fair enough. My thought was to start small because it's much easier to make larger than smaller, and I imagine folks will generally be using groups for this sort of thing anyway. I'm happy to choose a larger number here though. What value do you think makes sense?

@david-crespo
Copy link
Contributor

It doesn't fundamentally change things and it's easy to change later, but how about 64 then? That feels a little harder to accidentally run into during RAP.

@davepacheco
Copy link
Collaborator Author

My suggestion looks a lot like what GCP does, as far as I can tell.

https://cloud.google.com/iam/docs/manage-access-other-resources

Add/remove individual role assignments

https://cloud.google.com/sdk/gcloud/reference/compute/instances/add-iam-policy-binding https://cloud.google.com/sdk/gcloud/reference/compute/instances/remove-iam-policy-binding

GET/PUT entire policy

No pagination.

https://cloud.google.com/compute/docs/reference/rest/v1/instances/getIamPolicy https://cloud.google.com/compute/docs/reference/rest/v1/instances/setIamPolicy

Note that those first two links are for command-line tools, I think. It's not clear to me whether the API itself supports add/remove of individual bindings.


Here are some random thoughts as I dig deeper into the implementation.

I've got an implementation of option 2. The problem is that eventually we're going to want to let you do a conditional PUT so you can avoid clobbering a dueling administrator. How do we implement this? The obvious solution is to hang some property off the resource itself that identifies the version of the role assignments (or policy) that are in-effect right now. This would also make it possible to scale this further if we needed because we could build a new policy in batches, then flip the resource's "policy_id" as the last step.

This also feels related to the Zanzibar "new enemy" problem, in that Zanzibar solves it by having consumers update their resource record with the zookie representing the up-to-date Zanzibar state.

This starts to feel a lot like option 3 or 4, but maybe it's just an implementation detail. We could implement option 2 for now, but under the hood, we could put a policy_id (or even policy_generation) on each resource. When we replace the policy, we generate a new id/generation, write out the new role assignments, and update the id/generation on the parent resource. That step can be conditional on previous policy id/generation when we go to support conditional PUT.

In terms of implementation, we may want to update the LookupPath stuff to fetch the policy associated with the resource that you're looking up. That way, when we go to do an authz check, we know we're using the same policy that was there when you fetched it. This gets closer to nailing the New Enemy problem, though there's still nothing to guarantee that the code after the authz check is looking at the same state of the resource that was authz-checked -- and that seems a lot harder. For example: suppose you revoked someone's access to a Project, then someone creates a new Instance in the Project. Concurrently, that person attempts to list Instances in the Project, they pass the authz check (because their access hasn't been revoked yet), but then we go to list Instances and show them the new one that they shouldn't see. Fixing this seems hard.

@david-crespo
Copy link
Contributor

Interesting point. It hadn't occurred to me that the reason I couldn't find the API endpoint for that is they're faking it client-side. That is very plausible.

Is the clobbering problem the same problem we have on all PUTs? I thought we wanted to use ETags for that. On the other hand, you made the good point here that using an ETag (for the whole policy) doesn't make sense on the add/remove single assignment endpoints because the whole point of single add/remove is to avoid having to fetch the whole policy. I don't really see any way around that and it goes some way toward explaining why GCP are handling it on the client. I'm fine saying forget the add/remove for now then.

@davepacheco
Copy link
Collaborator Author

Closing this for now because we've got something concrete enough to move forward. I'm sure we'll revisit later.

leftwo pushed a commit that referenced this issue Sep 11, 2023
Crucible:
update rust crate base64 to 0.21.3 (#913)
update rust crate slog-async to 2.8 (#915)
update rust crate async-recursion to 1.0.5 (#912)
Move active jobs into a separate data structure and optimize `ackable_work` (#908)
Check repair IDs correctly (#910)
update actions/checkout action to v4 (#903)
update rust crate tokio to 1.32 (#890)
Remove single-item Vecs (#898)
Move "extent under repair" into a helper function (#907)
offset_mod shouldn't be randomized (#905)
Only rehash if a write may have failed (#899)
Make negotiation state an enum (#901)
Test update for fast write ack and gather errors on test failure (#897)

Propolis:
Update cpuid-gen util for better coverage
Make storage backend config more flexible and consistent
Use correct register sizes for PIIX3 PM device
Update bitflags dependency
fix softnpu port order (#517)
Use hex formatting for unhandled MMIO/PIO/MSRs
Update deps for new crucible and oximeter
Update standalone-with-crucible docs (#514)
leftwo added a commit that referenced this issue Sep 12, 2023
Crucible:
update rust crate base64 to 0.21.3 (#913)
update rust crate slog-async to 2.8 (#915)
update rust crate async-recursion to 1.0.5 (#912)
Move active jobs into a separate data structure and optimize
`ackable_work` (#908) Check repair IDs correctly (#910)
update actions/checkout action to v4 (#903)
update rust crate tokio to 1.32 (#890)
Remove single-item Vecs (#898)
Move "extent under repair" into a helper function (#907) offset_mod
shouldn't be randomized (#905)
Only rehash if a write may have failed (#899)
Make negotiation state an enum (#901)
Test update for fast write ack and gather errors on test failure (#897)

Propolis:
Update cpuid-gen util for better coverage
Make storage backend config more flexible and consistent Use correct
register sizes for PIIX3 PM device
Update bitflags dependency
fix softnpu port order (#517)
Use hex formatting for unhandled MMIO/PIO/MSRs
Update deps for new crucible and oximeter
Update standalone-with-crucible docs (#514)

Co-authored-by: Alan Hanson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants