
NVMe keepalive feature #210

Open
wants to merge 33 commits into base: main

Conversation

yupavlen-ms commented Oct 29, 2024

NVMe keepalive feature implementation. The goal is to keep attached NVMe devices intact during OpenVMM servicing (e.g. reloading a new OpenVMM binary from the host OS).

The memory allocated for NVMe queues and buffers must be preserved during servicing, including the queue head/tail pointers. The memory region must be marked as reserved so that the VTL2 Linux kernel will not try to use it; a new device tree property, dma-preserve-pages, is passed from the host to indicate the preserved memory size (the memory is carved out of the available VTL2 memory; it is not additional memory).
The NVMe device must not go through any resets; from the physical device's point of view, nothing changed during servicing.
The new OpenHCL module receives the saved state from the host and continues from there.
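For orientation, here is a minimal sketch of the kind of per-queue state that has to survive the blackout. The type and field names below are illustrative only, not the actual saved-state types in this PR:

```rust
/// Illustrative sketch of per-queue-pair state that must survive servicing.
/// Names and fields are hypothetical; the real types are defined in the PR.
pub struct QueuePairSavedStateSketch {
    /// Queue identifier assigned by the driver.
    pub qid: u16,
    /// Number of entries in the submission/completion queues.
    pub len: u16,
    /// Submission queue tail, as last written to the doorbell.
    pub sq_tail: u16,
    /// Completion queue head, as last written to the doorbell.
    pub cq_head: u16,
    /// Expected completion queue phase bit.
    pub cq_phase: bool,
    /// Base GPA of the preserved DMA region backing this queue pair.
    pub dma_base: u64,
    /// Size of that region in 4K pages.
    pub dma_len_pages: u64,
}

/// Illustrative top-level saved state: the admin queue plus all IO queues.
pub struct NvmeDriverSavedStateSketch {
    pub admin: QueuePairSavedStateSketch,
    pub io: Vec<QueuePairSavedStateSketch>,
}
```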

yupavlen-ms requested review from a team as code owners October 29, 2024 19:39
Contributor

mattkur commented Oct 29, 2024

Thanks for submitting this, Yuri. I'm starting to take a look.

@@ -88,6 +88,8 @@ open_enum! {
VTL0_MMIO = 5,
/// This range is mmio for VTL2.
VTL2_MMIO = 6,
/// Memory preserved during servicing.
VTL2_PRESERVED = 7,
Member

How do these entries make it into the tables? Does this mean we tell the host about these ranges somehow?

Member

Or does the host just know what the preserved range is? Is this some new IGVM concept?

Author

The host knows what the DMA preserved size should be but does not calculate the ranges. The host provides the size through the device tree; the boot shim selects the top of the VTL2 memory range and marks it as reserved with the new type.
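A minimal sketch of that carve-out, assuming a flat (base, length) view of VTL2 memory and a 4K page count taken from the device tree property. The function and names are illustrative, not the boot shim's actual API:

```rust
const PAGE_SIZE: u64 = 4096;

/// Carve `preserve_pages` 4K pages off the top of the VTL2 memory range and
/// return (remaining_range, preserved_range) as (base, len) pairs. Returns
/// None if nothing should be preserved or the range is too small. Purely
/// illustrative; the boot shim uses its own range types.
fn carve_preserved_region(
    vtl2_base: u64,
    vtl2_len: u64,
    preserve_pages: u64,
) -> Option<((u64, u64), (u64, u64))> {
    let preserve_len = preserve_pages.checked_mul(PAGE_SIZE)?;
    if preserve_len == 0 || preserve_len > vtl2_len {
        return None;
    }
    let preserved_base = vtl2_base + (vtl2_len - preserve_len);
    Some((
        (vtl2_base, vtl2_len - preserve_len),
        (preserved_base, preserve_len),
    ))
}
```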

Member

@chris-oo Nov 15, 2024

I think we need some way to enforce that the next version of OpenHCL selects the same size and location as the previous one. That is outside the scope of this PR, but I'll incorporate it in some changes I'm making going forward.

command: FromBytes::read_from_prefix(cmd.command.as_bytes()).unwrap(),
respond: send,
};
pending.push((cmd.cid as usize, pending_command));
Member

It seems like pending commands need to save the fact that they allocated memory, somehow.

Specifically, any PRP list allocation needs to be saved/restored. Is this done somewhere?

We also need to re-lock any guest memory/re-reserve any double buffer memory that was in use.

And of course, all this needs to be released after the pending command completes.

Author

If the PRP entry points to the bounce buffer (the VTL2 buffer), its GPN should not change; it still points to the same physical page, which is preserved during servicing.
The same should be true if the PRP entry points to a region in VTL0 memory, since we don't touch VTL0 during servicing.

However, I may have missed where we allocate the PRP list itself when there are multiple entries. If it is not part of the preserved memory, that change needs to be added.

}

/// Restore queue data after servicing.
pub fn restore(&mut self, saved_state: &QueuePairSavedState) -> anyhow::Result<()> {
Member

@jstarks Oct 29, 2024

A key thing to reason about is what happens to requests that were in flight at save time. It seems the design is that we keep those requests in flight (i.e., we won't wait for them to complete before saving or after restoring), right? And so storvsp's existing restore path will reissue any IOs that were in flight, but it won't wait for the old IOs to complete.

I think this is a problem, because new IOs can race with the original IO and cause data loss. Here's an example of what can go wrong:

  1. Before servicing, the guest writes "A" to block 0x1000 with CID 1.
  2. Before CID 1 completes, a servicing operation starts. CID 1 makes it into the saved state.
  3. At restore, storvsp reissues the IO, so it writes "A" to block 0x1000 with CID 2.
  4. Due to reordering in the backend, CID 2 completes before CID 1.
  5. Storvsp completes the original guest IO.
  6. The guest issues another IO, this time writing "B" to the same block 0x1000. Our driver issues this with CID 3.
  7. CID 3 completes before CID 1.
  8. Finally, some backend code gets around to processing CID 1. It writes "A" to block 0x1000.

So the guest thought it wrote "A" and then "B", but actually what happens is "A" then "B" then "A". Corruption.

There are variants on this, where the guest changes the memory that CID 1 was using and the device DMAs it late, after the IO was completed to the guest. This could even affect reads by corrupting guest memory.

The most reasonable solution, in my opinion, is to avoid issuing any more commands to the SQ until all IOs that were in flight at save time have completed. This should be simple to implement and to reason about. A more complicated solution would be to reason about what the in flight commands are doing and only block IOs that alias those pages. I don't think that's necessary for v1.
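A minimal sketch of that simple option: hold back new submissions until every command that was in flight at save time has completed. The gate type and names below are hypothetical, not the driver's actual implementation:

```rust
/// Illustrative gate: new IOs are held until all commands that were pending
/// at save time have completed. Not the actual queue-pair implementation.
struct RestoreGate {
    /// CIDs that were in flight when the save was taken.
    inflight_at_save: std::collections::HashSet<u16>,
}

impl RestoreGate {
    /// Returns true once it is safe to issue new commands to the SQ.
    fn can_issue_new(&self) -> bool {
        self.inflight_at_save.is_empty()
    }

    /// Called for every completion observed after restore.
    fn on_completion(&mut self, cid: u16) {
        self.inflight_at_save.remove(&cid);
    }
}
```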

Member

Did I miss where this is handled?

Author

The most reasonable solution, in my opinion, is to avoid issuing any more commands to the SQ until all IOs that were in flight at save time have completed. This should be simple to implement and to reason about.

I think we cut off receiving mesh commands once the servicing save request is received; let me double-check that and at which point it is done.

Regarding draining the outstanding (in-flight) I/Os: eventually we want to bypass this, but for v1 it makes sense to drain them for simplicity. Let me confirm the path. The draining happens on the non-keepalive path.

Contributor

I think two things are important in the conversation above:

  1. It's okay to drain IOs before issuing new IOs. But: unlike the non-keepalive path, the overall servicing operation should not block on outstanding IOs. This means that the replay will need to remain asynchronous.
  2. I agree with John's analysis. It is unsafe to complete any IOs back to the guest until all outstanding IOs (IOs in progress at the save time to the underlying nvme device) complete from the NVMe device to the HCL environment. I think it is fine to start replaying the saved IOs while the in flight IOs are still in progress. This means you have two choices: (a) wait to begin replaying IOs until everything in flight completes, or (b) build some logic in the storvsp layer to not notify the guest until all the in flight IOs complete. (a) seems simpler to me.

In future changes, I think we will need to consider some sort of filtering (e.g. hold new IOs from guest that overlap LBA ranges, as John suggested in an offline conversation).

Contributor

So the in-flight IOs will be executed twice if my understanding is correct?

Author

We save the pending-commands slab so we can correctly match a CID to its outstanding response once we're back from the blackout. We do not resubmit those commands.
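In other words, the saved entries exist only so that a completion arriving after the blackout can be routed back to its originator. A sketch of that lookup, with hypothetical types (not the driver's real pending-command structures):

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

/// Hypothetical pending-command table rebuilt from saved state.
struct PendingCommands {
    /// Keyed by the (masked) CID that was on the wire before servicing.
    by_cid: HashMap<u16, PendingCommand>,
}

/// Hypothetical pending entry: just enough to complete the original request.
struct PendingCommand {
    /// Channel used to deliver the completion back to the storage layer.
    respond: Sender<u16>,
}

impl PendingCommands {
    /// Route a completion that may belong to a command issued before servicing.
    /// The command itself is never resubmitted; only its completion is delivered.
    fn complete(&mut self, cid: u16) {
        if let Some(cmd) = self.by_cid.remove(&cid) {
            let _ = cmd.respond.send(cid);
        }
    }
}
```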

Contributor

I mean that in these steps ("Before CID 1 completes, a servicing operation starts. CID 1 makes it into the saved state. At restore, storvsp reissues the IO, so it writes "A" to block 0x1000 with CID 2.") the same IO was issued twice, once with CID 1 and once with CID 2. Is that still true?

Author

That is true; in the instrumented firmware I can observe that the same command may be issued twice, before and after servicing. Let me double-check whether the CID is still the same, as there is a recent change to randomize CIDs in the NVMe queue.

@@ -16,7 +16,7 @@ use zerocopy::LE;
use zerocopy::U16;

#[repr(C)]
#[derive(Debug, AsBytes, FromBytes, FromZeroes, Inspect)]
#[derive(Debug, AsBytes, FromBytes, FromZeroes, Inspect, Clone)]
Contributor

can you share what fails to compile (if anything) if you remove Clone?

Author

It fails in namespace.rs line 85, where we restore the namespace object if the Identify structure was provided from the saved state.

Contributor

A lot of code has changed in the last two weeks; bumping to check whether this is still necessary (and if so, what code requires the Clone bound).

Author

@daprilik it is used in NvmeDriver::restore, even though the new direction is to re-query Identify Namespace instead; I think we should keep that code for future improvements. If you see a better way to implement it without clone(), please let me know.

respond: send,
};
// Remove high CID bits to be used as a key.
let cid = cmd.cdw0.cid() & Self::CID_KEY_MASK;
Member

Should we include this mask in the saved state (or perhaps the max # of CIDs, and recompute the mask) so that we have the flexibility to change this?

Author

Added saving of Self::CID_KEY_BITS, which is the basis for the other calculations; open to suggestions.
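For reference, a sketch of recomputing the mask from a saved bit count rather than saving the mask itself. CID_KEY_BITS follows the naming above, but the value and code are illustrative:

```rust
/// Number of low CID bits used as the pending-command key. The value here is
/// illustrative; the real constant (and its saved copy) live in the driver.
const CID_KEY_BITS: u32 = 10;

/// Recompute the key mask from the saved bit count so the layout can change
/// between versions without also saving the mask.
const fn cid_key_mask(key_bits: u32) -> u16 {
    ((1u32 << key_bits) - 1) as u16
}

/// Reduce a wire CID to the slab key used for pending-command lookup.
fn cid_key(cid: u16) -> u16 {
    cid & cid_key_mask(CID_KEY_BITS)
}
```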

if let Some(m) = self.nvme_manager.as_mut() {
// Only override if explicitly disabled from host.
if !nvme_keepalive {
m.override_nvme_keepalive_flag(nvme_keepalive);
Member

Instead of storing a flag in the NvmeManager, let's just take this as a parameter to shutdown.

And instead of having save fail if nvme_keepalive is false, just don't call save() below when you don't want to save.
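That is, something shaped roughly like the following; the types are stand-ins, not the actual NvmeManager API:

```rust
/// Stand-in types; not the real OpenVMM definitions.
struct NvmeManagerSavedState;
struct NvmeManager;

impl NvmeManager {
    async fn save(&self) -> NvmeManagerSavedState {
        NvmeManagerSavedState
    }

    /// Hypothetical shape of the suggestion: keepalive is a parameter to
    /// shutdown, and save() is only called when keepalive is requested.
    async fn shutdown(self, nvme_keepalive: bool) -> Option<NvmeManagerSavedState> {
        if nvme_keepalive {
            Some(self.save().await)
        } else {
            None
        }
    }
}
```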

Author

It used to be a parameter to shutdown, but I had to redo it this way to cover some failing cases. Let me get back to this with details once I review.

Author

To resolve this, I need to add the flag to both save() and shutdown(), and that approach was rejected during earlier reviews. Let me add it back and see if it is better this time.

identify,
)?;

// Spawn a task, but detach is so that it doesn't get dropped while NVMe
Member

Don't copy/paste this code. Can this not be in new_from_identify?

Author

Hmmm I think I had issues with async/non-async, let me check again

// TODO: Current approach is to re-query namespace data after servicing
// and this array will be empty. Once we confirm that we can process
// namespace change notification AEN, the restore code will be re-added.
this.namespaces.push(Arc::new(Namespace::restore(
Member

If we're not going to do this yet, then let's remove this namespaces field and put #[allow(dead_code)] on Namespace::restore.

Author

ok

Author

Added #[allow(dead_code)] to Namespace::save instead; save is not called anywhere in the code, and restore will be skipped because the vector is empty.

pub fn required_dma_size(expect_q_count: usize) -> usize {
QueuePair::required_dma_size() * expect_q_count
}

/// Change device's behavior when servicing.
pub fn update_servicing_flags(&mut self, nvme_keepalive: bool) {
Member

This function should just become a parameter to shutdown.

let admin = self.admin.as_ref().unwrap().save().await?;
let mut io: Vec<QueuePairSavedState> = Vec::new();
for io_q in self.io.iter() {
io.push(io_q.save().await?);
Member

I don't see how this is safe at all. The IO queue workers are still running at this point, right? So if there is anything in their in-memory queues, those items may concurrently, at any point during or after the save, get put into the SQ. But we won't save that fact.

We need to stop all the queues before saving, and leave them stopped. save needs to be a destructive operation, that puts the device in a quiesced state.

Do you agree with this analysis?

Author

save needs to be a destructive operation, that puts the device in a quiesced state

Agreed; after today's discussion I have the same thought.

Author

Added a break statement to QueueHandler::run after save completes; it appears to work as expected in my servicing tests with I/O running. Please let me know if I missed any other safeguards.
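Roughly the following pattern (a sketch only; the real QueueHandler::run handles many more events): the worker loop exits once a save has been taken, so nothing further can reach the submission queue.

```rust
use std::sync::mpsc::Receiver;

/// Illustrative events handled by the queue worker.
enum QueueEvent {
    /// A new request to issue to the submission queue.
    Request(u16),
    /// A servicing save has been requested.
    Save,
}

/// Sketch of the worker loop: once a save is taken, the loop breaks so no
/// new entries can be written to the submission queue afterwards.
fn run(events: Receiver<QueueEvent>) {
    loop {
        match events.recv() {
            Ok(QueueEvent::Request(cid)) => {
                // Normal path: issue the command to the submission queue.
                let _ = cid;
            }
            Ok(QueueEvent::Save) => {
                // Snapshot the queue state here, then stop processing so the
                // queue pair stays quiesced until restore.
                break;
            }
            // Channel closed: the handler is being torn down.
            Err(_) => break,
        }
    }
}
```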

@@ -331,6 +334,19 @@ fn reserved_memory_regions(
ReservedMemoryType::Vtl2Reserved,
));
}
let dma_4k_pages = partition_info.preserve_dma_4k_pages.unwrap_or(0);
// If DMA reserved hint was provided by Host, allocate top of VTL2 memory range
Member

This is actually insufficient: you need to report to the sidecar allocation in start_sidecar that these ranges are not available for allocation. This fn is called afterwards.

Author

Looking into this; I have questions.

  1. There are other reserved memory regions, none of which is passed to start_sidecar, because we calculate those regions afterwards.
  2. Sidecar itself is a reserved region from our point of view.

So how do we restrict sidecar from using today's VFIO DMA, for example? My assumption was that sidecar cannot access any memory beyond its assigned space (e.g. VFIO DMA is inaccessible), and that this space is defined in IGVM. Do you mean that fixed_pool memory could be part of the sidecar range?

Member

It's a bit confusing today, because some of the regions are captured as part of the ShimParams::used range, which describes the whole range used by the loader. The existing reserved ranges are accounted for by that.

However, any new ranges we allocate (like pool memory) need to be accounted for, so that when we load sidecar we can tell it which ranges are legal for it to allocate from. I have it on my list to clean up all of this address-space management in the bootshim very soon, but you can see how I solved it in my WIP branch: chris-oo@33cb4ea. What I've done there is add new accounting to track all ranges at allocation time, since we need to track all of them.

I may solve this for you if I get the pool PR in first.
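A minimal sketch of that kind of allocation-time accounting, with hypothetical types (the real version is in the linked branch):

```rust
/// Illustrative tracker: every range the boot shim allocates is recorded so
/// it can later be excluded from the memory sidecar is allowed to use.
#[derive(Default)]
struct AllocatedRanges {
    /// (base, len) pairs of every range handed out so far.
    ranges: Vec<(u64, u64)>,
}

impl AllocatedRanges {
    /// Record a newly allocated range (e.g. pool memory or the preserved DMA region).
    fn record(&mut self, base: u64, len: u64) {
        self.ranges.push((base, len));
    }

    /// Returns true if `base..base + len` overlaps any recorded range, i.e.
    /// it is not legal for sidecar to allocate from it.
    fn overlaps(&self, base: u64, len: u64) -> bool {
        self.ranges
            .iter()
            .any(|&(b, l)| base < b + l && b < base + len)
    }
}
```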
