instance creation & the create saga should avoid refetching DB records where possible #2945

gjcolombo · 2023-04-28T00:13:18Z

When Nexus creates an instance in a project, it runs the creation saga, then refetches the instance's record from CRDB and returns the refetched record. But once the instance has moved out of the "Creating" state, it can be stopped and deleted, which could cause instance creation to 404, as described in this comment:

omicron/nexus/src/app/instance.rs

Lines 180 to 224 in 8682616

    
           let instance_id = saga_outputs 
        
               .lookup_node_output::<Uuid>("instance_id") 
        
               .map_err(|e| Error::internal_error(&format!("{:#}", &e))) 
        
               .internal_context("looking up output from instance create saga")?; 
        
           // TODO-correctness TODO-robustness TODO-design It's not quite correct 
        
           // to take this instance id and look it up again.  It's possible that 
        
           // it's been modified or even deleted since the saga executed.  In that 
        
           // case, we might return a different state of the Instance than the one 
        
           // that the user created or even fail with a 404!  Both of those are 
        
           // wrong behavior -- we should be returning the very instance that the 
        
           // user created. 
        
           // 
        
           // How can we fix this?  Right now we have internal representations like 
        
           // Instance and analaogous end-user-facing representations like 
        
           // Instance.  The former is not even serializable.  The saga 
        
           // _could_ emit the View version, but that's not great for two (related) 
        
           // reasons: (1) other sagas might want to provision instances and get 
        
           // back the internal representation to do other things with the 
        
           // newly-created instance, and (2) even within a saga, it would be 
        
           // useful to pass a single Instance representation along the saga, 
        
           // but they probably would want the internal representation, not the 
        
           // view. 
        
           // 
        
           // The saga could emit an Instance directly.  Today, Instance 
        
           // etc. aren't supposed to even be serializable -- we wanted to be able 
        
           // to have other datastore state there if needed.  We could have a third 
        
           // InstanceInternalView...but that's starting to feel pedantic.  We 
        
           // could just make Instance serializable, store that, and call it a 
        
           // day.  Does it matter that we might have many copies of the same 
        
           // objects in memory? 
        
           // 
        
           // If we make these serializable, it would be nice if we could leverage 
        
           // the type system to ensure that we never accidentally send them out a 
        
           // dropshot endpoint.  (On the other hand, maybe we _do_ want to do 
        
           // that, for internal interfaces!  Can we do this on a 
        
           // per-dropshot-server-basis?) 
        
           // 
        
           // TODO Even worse, post-authz, we do two lookups here instead of one. 
        
           // Maybe sagas should be able to emit `authz::Instance`-type objects. 
        
           let (.., db_instance) = LookupPath::new(opctx, &self.db_datastore) 
        
               .instance_id(instance_id) 
        
               .fetch() 
        
               .await?; 
        
           Ok(db_instance)

The create saga itself also refetches records in a mildly dangerous way. If the saga creates an instance but doesn't start it, the saga will have a step that tries to move the instance's state to "Stopped" unconditionally. This is incorrect if the step gets replayed, as described in this comment:

omicron/nexus/src/app/sagas/instance_create.rs

Lines 1340 to 1364 in 8682616

    
           // TODO-correctness: This is dangerous if this step is replayed, since 
        
           // a user can discover this instance and ask to start it in between 
        
           // attempts to run this step. One way to fix this is to avoid refetching 
        
           // the previous runtime state each time this step is taken, such that 
        
           // once this update is applied once, subsequent attempts to apply it 
        
           // will have an already-used generation number. 
        
           let runtime_state = db::model::InstanceRuntimeState { 
        
               state: db::model::InstanceState::new(InstanceState::Stopped), 
        
               // Must update the generation, or the database query will fail. 
        
               // 
        
               // The runtime state of the instance record is only changed as a 
        
               // result of the successful completion of the saga (i.e. after 
        
               // ensure which we're skipping in this case) or during saga 
        
               // unwinding. So we're guaranteed that the cached generation in the 
        
               // saga log is the most recent in the database. 
        
               gen: db::model::Generation::from( 
        
                   db_instance.runtime_state.gen.next(), 
        
               ), 
        
               ..db_instance.runtime_state 
        
           }; 
        
           let updated = datastore 
        
               .instance_update_runtime(&instance_id, &runtime_state) 
        
               .await 
        
               .map_err(ActionError::action_failed)?;

These cases can be fixed by taking advantage of the fact that nexus_db_model::Instances are (now) serializable (at one point they weren't), such that a serialized instance description can be passed from saga step to saga step (or provided as saga output).

This may help address some cases of #1536 (if we don't do a DB fetch, we can't use the wrong key!), but the two issues are distinct and this one won't totally address that one.

The text was updated successfully, but these errors were encountered:

gjcolombo added bug Something that isn't working. nexus Related to nexus labels Apr 28, 2023

gjcolombo mentioned this issue Apr 28, 2023

nexus: revamp live migration saga #2892

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

instance creation & the create saga should avoid refetching DB records where possible #2945

instance creation & the create saga should avoid refetching DB records where possible #2945

gjcolombo commented Apr 28, 2023

instance creation & the create saga should avoid refetching DB records where possible #2945

instance creation & the create saga should avoid refetching DB records where possible #2945

Comments

gjcolombo commented Apr 28, 2023