
NVMe keepalive feature #210

Open
wants to merge 33 commits into base: main

Conversation

yupavlen-ms commented Oct 29, 2024

NVMe keepalive feature implementation. The goal is to keep attached NVMe devices intact during OpenVMM servicing (e.g. reloading a new OpenVMM binary from the host OS).

The memory allocated for NVMe queues and buffers must be preserved during servicing, including the queue head/tail pointers. The memory region must be marked as reserved so that the VTL2 Linux kernel will not try to use it; a new device tree property, dma-preserve-pages, is passed from the host to indicate the preserved memory size (the memory is carved out of the available VTL2 memory; it is not additional memory).
The NVMe device must not go through any resets; from the physical device's point of view, nothing changed during servicing.
The new OpenHCL module receives the saved state from the host and continues from there.
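For orientation, here is a minimal sketch of the kind of per-queue state that has to survive the blackout. The type and field names below are illustrative only, not the actual saved-state types in this PR:

```rust
/// Illustrative sketch of per-queue-pair state that must survive servicing.
/// Names and fields are hypothetical; the real types are defined in the PR.
pub struct QueuePairSavedStateSketch {
    /// Queue identifier assigned by the driver.
    pub qid: u16,
    /// Number of entries in the submission/completion queues.
    pub len: u16,
    /// Submission queue tail, as last written to the doorbell.
    pub sq_tail: u16,
    /// Completion queue head, as last written to the doorbell.
    pub cq_head: u16,
    /// Expected completion queue phase bit.
    pub cq_phase: bool,
    /// Base GPA of the preserved DMA region backing this queue pair.
    pub dma_base: u64,
    /// Size of that region in 4K pages.
    pub dma_len_pages: u64,
}

/// Illustrative top-level saved state: the admin queue plus all IO queues.
pub struct NvmeDriverSavedStateSketch {
    pub admin: QueuePairSavedStateSketch,
    pub io: Vec<QueuePairSavedStateSketch>,
}
```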

yupavlen-ms requested review from a team as code owners October 29, 2024 19:39
Contributor

mattkur commented Oct 29, 2024

Thanks for submitting this, Yuri. I'm starting to take a look.

@@ -88,6 +88,8 @@ open_enum! {
VTL0_MMIO = 5,
/// This range is mmio for VTL2.
VTL2_MMIO = 6,
/// Memory preserved during servicing.
VTL2_PRESERVED = 7,
Member

How do these entries make it into the tables? Does this mean we tell the host about these ranges somehow?

Member

Or does the host just know what the preserved range is? Is this some new IGVM concept?

Author

The host knows what the DMA preserved size should be but does not calculate the ranges. The host provides the size through the device tree; the boot shim selects the top of the VTL2 memory range and marks it as reserved with the new type.
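A minimal sketch of that carve-out, assuming a flat (base, length) view of VTL2 memory and a 4K page count taken from the device tree property. The function and names are illustrative, not the boot shim's actual API:

```rust
const PAGE_SIZE: u64 = 4096;

/// Carve `preserve_pages` 4K pages off the top of the VTL2 memory range and
/// return (remaining_range, preserved_range) as (base, len) pairs. Returns
/// None if nothing should be preserved or the range is too small. Purely
/// illustrative; the boot shim uses its own range types.
fn carve_preserved_region(
    vtl2_base: u64,
    vtl2_len: u64,
    preserve_pages: u64,
) -> Option<((u64, u64), (u64, u64))> {
    let preserve_len = preserve_pages.checked_mul(PAGE_SIZE)?;
    if preserve_len == 0 || preserve_len > vtl2_len {
        return None;
    }
    let preserved_base = vtl2_base + (vtl2_len - preserve_len);
    Some((
        (vtl2_base, vtl2_len - preserve_len),
        (preserved_base, preserve_len),
    ))
}
```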

Member

@chris-oo Nov 15, 2024

I think we need some way to enforce that the next version of OpenHCL selects the same size and location as the previous one. That is outside the scope of this PR, but I'll incorporate it in some changes I'm making going forward.

command: FromBytes::read_from_prefix(cmd.command.as_bytes()).unwrap(),
respond: send,
};
pending.push((cmd.cid as usize, pending_command));
Member

It seems like pending commands need to save the fact that they allocated memory, somehow.

Specifically, any PRP list allocation needs to be saved/restored. Is this done somewhere?

We also need to re-lock any guest memory/re-reserve any double buffer memory that was in use.

And of course, all this needs to be released after the pending command completes.

Author

If the PRP entry points to the bounce buffer (the VTL2 buffer), its GPN should not change; it still points to the same physical page, which is preserved during servicing.
The same should be true if the PRP entry points to a region in VTL0 memory, since we don't touch VTL0 during servicing.

However, I may have missed where we allocate the PRP list itself when there are multiple entries. If it is not part of the preserved memory, that change needs to be added.

}

/// Restore queue data after servicing.
pub fn restore(&mut self, saved_state: &QueuePairSavedState) -> anyhow::Result<()> {
Member

@jstarks Oct 29, 2024

A key thing to reason about is what happens to requests that were in flight at save time. It seems the design is that we keep those requests in flight (i.e., we won't wait for them to complete before saving or after restoring), right? And so storvsp's existing restore path will reissue any IOs that were in flight, but it won't wait for the old IOs to complete.

I think this is a problem, because new IOs can race with the original IO and cause data loss. Here's an example of what can go wrong:

  1. Before servicing, the guest writes "A" to block 0x1000 with CID 1.
  2. Before CID 1 completes, a servicing operation starts. CID 1 makes it into the saved state.
  3. At restore, storvsp reissues the IO, so it writes "A" to block 0x1000 with CID 2.
  4. Due to reordering in the backend, CID 2 completes before CID 1.
  5. Storvsp completes the original guest IO.
  6. The guest issues another IO, this time writing "B" to the same block 0x1000. Our driver issues this with CID 3.
  7. CID 3 completes before CID 1.
  8. Finally, some backend code gets around to processing CID 1. It writes "A" to block 0x1000.

So the guest thought it wrote "A" and then "B", but actually what happens is "A" then "B" then "A". Corruption.

There are variants on this, where the guest changes the memory that CID 1 was using and the device DMAs it late, after the IO was completed to the guest. This could even affect reads by corrupting guest memory.

The most reasonable solution, in my opinion, is to avoid issuing any more commands to the SQ until all IOs that were in flight at save time have completed. This should be simple to implement and to reason about. A more complicated solution would be to reason about what the in flight commands are doing and only block IOs that alias those pages. I don't think that's necessary for v1.
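A minimal sketch of that simple option: hold back new submissions until every command that was in flight at save time has completed. The gate type and names below are hypothetical, not the driver's actual implementation:

```rust
/// Illustrative gate: new IOs are held until all commands that were pending
/// at save time have completed. Not the actual queue-pair implementation.
struct RestoreGate {
    /// CIDs that were in flight when the save was taken.
    inflight_at_save: std::collections::HashSet<u16>,
}

impl RestoreGate {
    /// Returns true once it is safe to issue new commands to the SQ.
    fn can_issue_new(&self) -> bool {
        self.inflight_at_save.is_empty()
    }

    /// Called for every completion observed after restore.
    fn on_completion(&mut self, cid: u16) {
        self.inflight_at_save.remove(&cid);
    }
}
```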

Member

Did I miss where this is handled?

Author

The most reasonable solution, in my opinion, is to avoid issuing any more commands to the SQ until all IOs that were in flight at save time have completed. This should be simple to implement and to reason about.

I think we cut off receiving mesh commands once the servicing save request is received; let me double-check that and at which point it is done.

Regarding draining the outstanding (in-flight) I/Os: eventually we want to bypass this, but for v1 it makes sense to drain them for simplicity. Let me confirm the path. The draining happens on the non-keepalive path.

Contributor

I think two things are important in the conversation above:

  1. It's okay to drain IOs before issuing new IOs. But: unlike the non-keepalive path, the overall servicing operation should not block on outstanding IOs. This means that the replay will need to remain asynchronous.
  2. I agree with John's analysis. It is unsafe to complete any IOs back to the guest until all outstanding IOs (IOs in progress at the save time to the underlying nvme device) complete from the NVMe device to the HCL environment. I think it is fine to start replaying the saved IOs while the in flight IOs are still in progress. This means you have two choices: (a) wait to begin replaying IOs until everything in flight completes, or (b) build some logic in the storvsp layer to not notify the guest until all the in flight IOs complete. (a) seems simpler to me.

In future changes, I think we will need to consider some sort of filtering (e.g. hold new IOs from guest that overlap LBA ranges, as John suggested in an offline conversation).

Contributor

So the in-flight IOs will be executed twice if my understanding is correct?

Author

We save the pending-commands slab so we can correctly match a CID to its outstanding response once we're back from the blackout. We do not resubmit those commands.
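In other words, the saved entries exist only so that a completion arriving after the blackout can be routed back to its originator. A sketch of that lookup, with hypothetical types (not the driver's real pending-command structures):

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

/// Hypothetical pending-command table rebuilt from saved state.
struct PendingCommands {
    /// Keyed by the (masked) CID that was on the wire before servicing.
    by_cid: HashMap<u16, PendingCommand>,
}

/// Hypothetical pending entry: just enough to complete the original request.
struct PendingCommand {
    /// Channel used to deliver the completion back to the storage layer.
    respond: Sender<u16>,
}

impl PendingCommands {
    /// Route a completion that may belong to a command issued before servicing.
    /// The command itself is never resubmitted; only its completion is delivered.
    fn complete(&mut self, cid: u16) {
        if let Some(cmd) = self.by_cid.remove(&cid) {
            let _ = cmd.respond.send(cid);
        }
    }
}
```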

Contributor

I mean that in these steps ("Before CID 1 completes, a servicing operation starts. CID 1 makes it into the saved state. At restore, storvsp reissues the IO, so it writes "A" to block 0x1000 with CID 2.") the same IO was issued twice, once with CID 1 and once with CID 2. Is that still true?

Author

That is true; in the instrumented firmware I can observe that the same command may be issued twice, before and after servicing. Let me double-check whether the CID is still the same, as there is a recent change to randomize CIDs in the NVMe queue.

@@ -16,7 +16,7 @@ use zerocopy::LE;
use zerocopy::U16;

#[repr(C)]
#[derive(Debug, AsBytes, FromBytes, FromZeroes, Inspect)]
#[derive(Debug, AsBytes, FromBytes, FromZeroes, Inspect, Clone)]
Contributor

can you share what fails to compile (if anything) if you remove Clone?

Author

It fails in namespace.rs line 85, where we restore the namespace object if the Identify structure was provided from the saved state.

Contributor

A lot of code has changed in the last two weeks; bumping to check whether this is still necessary (and if so, what code requires the Clone bound).

Author

@daprilik it is used in NvmeDriver::restore, even though the new direction is to re-query Identify Namespace instead; I think we should keep that code for future improvements. If you see a better way to implement it without clone(), please let me know.

respond: send,
};
// Remove high CID bits to be used as a key.
let cid = cmd.cdw0.cid() & Self::CID_KEY_MASK;
Member

Should we include this mask in the saved state (or perhaps the max # of CIDs, and recompute the mask) so that we have the flexibility to change this?

Author

Added saving of Self::CID_KEY_BITS, which is the basis for the other calculations; open to suggestions.
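For reference, a sketch of recomputing the mask from a saved bit count rather than saving the mask itself. CID_KEY_BITS follows the naming above, but the value and code are illustrative:

```rust
/// Number of low CID bits used as the pending-command key. The value here is
/// illustrative; the real constant (and its saved copy) live in the driver.
const CID_KEY_BITS: u32 = 10;

/// Recompute the key mask from the saved bit count so the layout can change
/// between versions without also saving the mask.
const fn cid_key_mask(key_bits: u32) -> u16 {
    ((1u32 << key_bits) - 1) as u16
}

/// Reduce a wire CID to the slab key used for pending-command lookup.
fn cid_key(cid: u16) -> u16 {
    cid & cid_key_mask(CID_KEY_BITS)
}
```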

if let Some(m) = self.nvme_manager.as_mut() {
// Only override if explicitly disabled from host.
if !nvme_keepalive {
m.override_nvme_keepalive_flag(nvme_keepalive);
Member

Instead of storing a flag in the NvmeManager, let's just take this as a parameter to shutdown.

And instead of having save fail if nvme_keepalive is false, just don't call save() below when you don't want to save.
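That is, something shaped roughly like the following; the types are stand-ins, not the actual NvmeManager API:

```rust
/// Stand-in types; not the real OpenVMM definitions.
struct NvmeManagerSavedState;
struct NvmeManager;

impl NvmeManager {
    async fn save(&self) -> NvmeManagerSavedState {
        NvmeManagerSavedState
    }

    /// Hypothetical shape of the suggestion: keepalive is a parameter to
    /// shutdown, and save() is only called when keepalive is requested.
    async fn shutdown(self, nvme_keepalive: bool) -> Option<NvmeManagerSavedState> {
        if nvme_keepalive {
            Some(self.save().await)
        } else {
            None
        }
    }
}
```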

Author

It used to be a parameter to shutdown, but I had to redo it this way to cover some failing cases. Let me get back to this with details once I review.

Author

To resolve this, I need to add the flag to both save() and shutdown(), and that approach was rejected during earlier reviews. Let me add it back and see if it is better this time.

identify,
)?;

// Spawn a task, but detach is so that it doesn't get dropped while NVMe
Member

Don't copy/paste this code. Can this not be in new_from_identify?

Author

Hmmm I think I had issues with async/non-async, let me check again

// TODO: Current approach is to re-query namespace data after servicing
// and this array will be empty. Once we confirm that we can process
// namespace change notification AEN, the restore code will be re-added.
this.namespaces.push(Arc::new(Namespace::restore(
Member

If we're not going to do this yet, then let's remove this namespaces field and put #[allow(dead_code)] on Namespace::restore.

Author

ok

Author

Added #[allow(dead_code)] to Namespace::save instead; save is not called anywhere in the code, and restore will be skipped because the vector is empty.

pub fn required_dma_size(expect_q_count: usize) -> usize {
QueuePair::required_dma_size() * expect_q_count
}

/// Change device's behavior when servicing.
pub fn update_servicing_flags(&mut self, nvme_keepalive: bool) {
Member

This function should just become a parameter to shutdown.

let admin = self.admin.as_ref().unwrap().save().await?;
let mut io: Vec<QueuePairSavedState> = Vec::new();
for io_q in self.io.iter() {
io.push(io_q.save().await?);
Member

I don't see how this is safe at all. The IO queue workers are still running at this point, right? So if there is anything in their in-memory queues, those items may concurrently, at any point during or after the save, get put into the SQ. But we won't save that fact.

We need to stop all the queues before saving, and leave them stopped. save needs to be a destructive operation, that puts the device in a quiesced state.

Do you agree with this analysis?

Author

save needs to be a destructive operation, that puts the device in a quiesced state

Agreed; after today's discussion I have the same thought.

Author

Added a break statement to QueueHandler::run after save completes; it appears to work as expected in my servicing tests with I/O running. Please let me know if I missed any other safeguards.
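Roughly the following pattern (a sketch only; the real QueueHandler::run handles many more events): the worker loop exits once a save has been taken, so nothing further can reach the submission queue.

```rust
use std::sync::mpsc::Receiver;

/// Illustrative events handled by the queue worker.
enum QueueEvent {
    /// A new request to issue to the submission queue.
    Request(u16),
    /// A servicing save has been requested.
    Save,
}

/// Sketch of the worker loop: once a save is taken, the loop breaks so no
/// new entries can be written to the submission queue afterwards.
fn run(events: Receiver<QueueEvent>) {
    loop {
        match events.recv() {
            Ok(QueueEvent::Request(cid)) => {
                // Normal path: issue the command to the submission queue.
                let _ = cid;
            }
            Ok(QueueEvent::Save) => {
                // Snapshot the queue state here, then stop processing so the
                // queue pair stays quiesced until restore.
                break;
            }
            // Channel closed: the handler is being torn down.
            Err(_) => break,
        }
    }
}
```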

@@ -331,6 +334,19 @@ fn reserved_memory_regions(
ReservedMemoryType::Vtl2Reserved,
));
}
let dma_4k_pages = partition_info.preserve_dma_4k_pages.unwrap_or(0);
// If DMA reserved hint was provided by Host, allocate top of VTL2 memory range
Member

This is actually insufficient: you need to report to the sidecar allocation in start_sidecar that these ranges are not available for allocation. This fn is called afterwards.

Author

Looking into this; I have questions.

  1. There are other reserved memory regions, none of which is passed to start_sidecar, because we calculate those regions afterwards.
  2. Sidecar itself is a reserved region from our point of view.

So how do we restrict sidecar from using today's VFIO DMA, for example? My assumption was that sidecar cannot access any memory beyond its assigned space (e.g. VFIO DMA is inaccessible), and that this space is defined in IGVM. Do you mean that fixed_pool memory could be part of the sidecar range?

Member

It's a bit confusing today, because some of the regions are captured as part of the ShimParams::used range, which describes the whole range used by the loader. The existing reserved ranges are accounted for by that.

However, any new ranges we allocate (like pool memory) need to be accounted for, so that when we load sidecar we can tell it which ranges are legal for it to allocate from. I have it on my list to clean up all of this address-space management in the bootshim very soon, but you can see how I solved it in my WIP branch: chris-oo@33cb4ea. What I've done there is add new accounting to track all ranges at allocation time, since we need to track all of them.

I may solve this for you if I get the pool PR in first.
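A minimal sketch of that kind of allocation-time accounting, with hypothetical types (the real version is in the linked branch):

```rust
/// Illustrative tracker: every range the boot shim allocates is recorded so
/// it can later be excluded from the memory sidecar is allowed to use.
#[derive(Default)]
struct AllocatedRanges {
    /// (base, len) pairs of every range handed out so far.
    ranges: Vec<(u64, u64)>,
}

impl AllocatedRanges {
    /// Record a newly allocated range (e.g. pool memory or the preserved DMA region).
    fn record(&mut self, base: u64, len: u64) {
        self.ranges.push((base, len));
    }

    /// Returns true if `base..base + len` overlaps any recorded range, i.e.
    /// it is not legal for sidecar to allocate from it.
    fn overlaps(&self, base: u64, len: u64) -> bool {
        self.ranges
            .iter()
            .any(|&(b, l)| base < b + l && b < base + len)
    }
}
```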
