Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

instance create sagas look up multiple things by name multiple times #1536

Open
davepacheco opened this issue Aug 2, 2022 · 2 comments
Open
Labels
security Related to security.
Milestone

Comments

@davepacheco
Copy link
Collaborator

There are three places in the instance-create saga where we look up things by name well into the saga:

  1. Towards the end, when we go to attach disks (or detach them, during undo), we use Nexus::instance_attach_disk()/Nexus::instance_detach_disk() and pass along the organization_name, project_name, instance_name, and disk_name from the instance create request. This causes a new lookup of the organization/project/instance/disk path. That seems bad: there's no guarantee that it will find the same disk (or instance or project or organization, for that matter). So we might attach or detach the wrong disk, or we might fail spuriously.
  2. At the end, during instance_ensure(), we look up the instance by name in order to call Nexus::instance_set_runtime(). This could find a different instance than the one we created. We should be able to use the instance_id stored from a previous action instead.
  3. In sic_delete_instance_record, the undo action for the (early) create-instance-record action, we look up the instance name that we created. We should use the instance id instead.

RFD 192 talks about this:

When Nexus receives an API request, we’ll usually want to resolve the name to an id up front and do most internal work using ids. Otherwise, we can wind up with nasty concurrency issues (i.e., serializability violations) if somebody renames a resource while we’re working on it. This can lead to security vulnerabilities, if we were to do an access check of some resource and then somebody renames it and we wind up operating on a different resource than the one we checked.

In terms of fixing: my first thought here is to do the lookup of organization and project once early in the saga to get the project_id. Then (again, still early in the saga) look up anything else we need (like all the disks) using LookupPath(...).project_id(...).disk_name(...) and store their ids. Thereafter, we can always use the ids.

We'll probably want to change Nexus::instance_attach_disk() and Nexus::instance_detach_disk() to accept an authz::Instance and authz::Disk (and have the lookups done in the caller).

Then I think we can remove the "instance_name" output from any saga action (so we're not ever tempted to use the instance name). We can also remove organization_name and project_name from the Params (for the same reason).

@davepacheco davepacheco added the security Related to security. label Aug 2, 2022
@davepacheco davepacheco mentioned this issue Aug 2, 2022
5 tasks
@davepacheco
Copy link
Collaborator Author

This seems easy to get wrong. It makes me wonder if we should have the Nexus layer only accept authz resources (rather than ever accepting names) and having callers (mostly the HTTP layer, but sometimes the saga layer) use LookupPath directly. I think it'd be fair to say that the mapping of an API URL path to a resource is part of the API layer anyway, not Nexus.

I'm reminded too that right after that quote from RFD 192, it also says that some operations could be done without a separate lookup step altogether. To do that, you do want to pass the names all the way down to the datastore. Doing this correctly also requires integrating authz into the query, which feels like a long way off. So maybe we could consider this an optimization to be re-added later.

A middle ground might be to accept LookupPaths. That would avoid encouraging people to pass a name without actually preventing it. It would create its own problem, which is that if you've already got the authz object, you'll wind up needing to make a LookupPath out of it and then we'll wind up looking it up again. This too could be addressed: we could have a LookupPath::instance_value or something that, when you go do the lookup(), just returns what it was given. This approach seems like a can of worms not worth opening now. Maybe we can look into this if we decide that optimization is important.

@zephraph
Copy link
Contributor

Once RFD-322 starts getting implemented I believe we should solve this at the API level. E.g. we should at least move everything to operate as id by default. Id or LookupPath might be a viable option depending, but I think we should restrict the usage of names (as much as possible) to go no deeper than the http_entrypoints file.

@davepacheco davepacheco added this to the MVP milestone Feb 2, 2023
leftwo pushed a commit that referenced this issue Nov 19, 2024
No Propolis changes other than to update Crucible

Crucible changes are:
Add debug/timeout to test_memory.sh (#1563)
Consolidate ack checking (#1561)
Rename for crutest: RegionInfo -> DiskInfo (#1562)
Fix dtrace system level scripts (#1560)
Remove `ackable_work`; ack immediately instead (#1552)
No more New jobs, no more New jobs column (#1559)
Remove delay-based backpressure in favor of explicit queue limits (#1515)
Only send flushes when Downstairs is idle; send Barrier otherwise (#1505)
Update Rust crate reqwest to v0.12.9 (#1536)
Update Rust crate omicron-zone-package to 0.11.1 (#1535)
Remove separate validation array (#1522)
Remove more unnecessary `DsState` variants (#1550)
Consolidate `DownstairsClient::reinitialize` (#1549)
Update Rust crate uuid to v1.11.0 (#1546)
Update Rust crate reedline to 0.36.0 (#1544)
Update Rust crate bytes to v1.8.0 (#1541)
Update Rust crate thiserror to v1.0.66 (#1539)
Update Rust crate serde_json to v1.0.132 (#1538)
Update Rust crate serde to v1.0.214 (#1537)
Remove transient states in `DsState` (#1526)
Update Rust crate libc to v0.2.161 (#1534)
Update Rust crate futures to v0.3.31 (#1532)
Update Rust crate clap to v4.5.20 (#1531)
Update Rust crate async-trait to 0.1.83 (#1530)
Update Rust crate anyhow to v1.0.92 (#1529)
Remove obsolete crutest perf test (#1528)
Update dependency rust to v1.82.0 (#1512)
Still more updates to support Volume layer activities. (#1508)
Remove remaining IOPS/bandwidth limiting code (#1525)
Add unit test for VersionMismatch (#1524)
Removing panic paths by only destructuring once (#1523)
Update actions/checkout digest to 11bd719 (#1518)
Switch to using `Duration` for times (#1520)
leftwo added a commit that referenced this issue Nov 20, 2024
No Propolis changes other than to update Crucible

Crucible changes are:
Add debug/timeout to test_memory.sh (#1563)
Consolidate ack checking (#1561)
Rename for crutest: RegionInfo -> DiskInfo (#1562) Fix dtrace system
level scripts (#1560)
Remove `ackable_work`; ack immediately instead (#1552) No more New jobs,
no more New jobs column (#1559)
Remove delay-based backpressure in favor of explicit queue limits
(#1515) Only send flushes when Downstairs is idle; send Barrier
otherwise (#1505) Update Rust crate reqwest to v0.12.9 (#1536)
Update Rust crate omicron-zone-package to 0.11.1 (#1535) Remove separate
validation array (#1522)
Remove more unnecessary `DsState` variants (#1550) Consolidate
`DownstairsClient::reinitialize` (#1549) Update Rust crate uuid to
v1.11.0 (#1546)
Update Rust crate reedline to 0.36.0 (#1544)
Update Rust crate bytes to v1.8.0 (#1541)
Update Rust crate thiserror to v1.0.66 (#1539)
Update Rust crate serde_json to v1.0.132 (#1538)
Update Rust crate serde to v1.0.214 (#1537)
Remove transient states in `DsState` (#1526)
Update Rust crate libc to v0.2.161 (#1534)
Update Rust crate futures to v0.3.31 (#1532)
Update Rust crate clap to v4.5.20 (#1531)
Update Rust crate async-trait to 0.1.83 (#1530)
Update Rust crate anyhow to v1.0.92 (#1529)
Remove obsolete crutest perf test (#1528)
Update dependency rust to v1.82.0 (#1512)
Still more updates to support Volume layer activities. (#1508) Remove
remaining IOPS/bandwidth limiting code (#1525) Add unit test for
VersionMismatch (#1524)
Removing panic paths by only destructuring once (#1523) Update
actions/checkout digest to 11bd719 (#1518)
Switch to using `Duration` for times (#1520)

Co-authored-by: Alan Hanson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
security Related to security.
Projects
None yet
Development

No branches or pull requests

2 participants