NVMe keepalive feature #210
base: main
Conversation
Thanks for submitting this, Yuri. I'm starting to take a look.
vm/loader/loader_defs/src/shim.rs
Outdated
@@ -88,6 +88,8 @@ open_enum! {
    VTL0_MMIO = 5,
    /// This range is mmio for VTL2.
    VTL2_MMIO = 6,
    /// Memory preserved during servicing.
    VTL2_PRESERVED = 7,
How do these entries make it into the tables? Does this mean we tell the host about these ranges somehow?
Or does the host just know what the preserved range is? Is this some new IGVM concept?
The host knows what the DMA preserved size should be but does not calculate the ranges. The host provides the size through the device tree; the boot shim selects the top of the VTL2 memory range and marks it as reserved with the new type.
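For illustration, a minimal sketch of that carve-out, assuming 4 KiB pages; `MemoryRange`, its fields, and `carve_preserved_range` are hypothetical names, not the actual loader types:

```rust
// Hypothetical illustration of the carve-out described above; the boot shim's
// real types and bookkeeping differ.
const PAGE_SIZE_4K: u64 = 4096;

#[derive(Clone, Copy, Debug)]
struct MemoryRange {
    start: u64,
    end: u64, // exclusive
}

/// Split the VTL2 RAM range so the top `preserve_dma_4k_pages` pages become
/// the reserved range (the new VTL2_PRESERVED type); the rest stays usable.
fn carve_preserved_range(
    vtl2_ram: MemoryRange,
    preserve_dma_4k_pages: u64,
) -> (MemoryRange, MemoryRange) {
    let reserved_len = preserve_dma_4k_pages * PAGE_SIZE_4K;
    assert!(reserved_len <= vtl2_ram.end - vtl2_ram.start);
    let split = vtl2_ram.end - reserved_len;
    (
        MemoryRange { start: vtl2_ram.start, end: split }, // usable VTL2 RAM
        MemoryRange { start: split, end: vtl2_ram.end },   // preserved for DMA
    )
}
```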
I think we need some way to enforce that the next version of OpenHCL selects the same size and location as the previous one. Outside the scope of this PR, but I'll incorporate it in some changes I'm making moving forward.
command: FromBytes::read_from_prefix(cmd.command.as_bytes()).unwrap(),
respond: send,
};
pending.push((cmd.cid as usize, pending_command));
It seems like pending commands need to save the fact that they allocated memory, somehow.
Specifically, any PRP list allocation needs to be saved/restored. Is this done somewhere?
We also need to re-lock any guest memory/re-reserve any double buffer memory that was in use.
And of course, all this needs to be released after the pending command completes.
If the PRP points to the bounce buffer (VTL2 buffer), its GPN should not change and it still points to the same physical page, which is preserved during servicing.
The same should be true if the PRP points to a region in VTL0 memory, as we don't touch VTL0 during servicing.
However, maybe I missed where we allocate the PRP list itself when there are multiple entries. If it is not part of the preserved memory, then such a change needs to be added.
}

/// Restore queue data after servicing.
pub fn restore(&mut self, saved_state: &QueuePairSavedState) -> anyhow::Result<()> {
A key thing to reason about is what happens to requests that were in flight at save. It seems that the design is that we will keep those requests in flight (i.e., we won't wait for them to complete before saving/after restoring), right? And so storvsp's existing restore path will reissue any IOs that were in flight, but it won't wait for the old IOs.
I think this is a problem, because new IOs can race with the original IO and cause data loss. Here's an example of what can go wrong:
- Before servicing, the guest writes "A" to block 0x1000 with CID 1.
- Before CID 1 completes, a servicing operation starts. CID 1 makes it into the saved state.
- At restore, storvsp reissues the IO, so it writes "A" to block 0x1000 with CID 2.
- Due to reordering in the backend, CID 2 completes before CID 1.
- Storvsp completes the original guest IO.
- The guest issues another IO, this time writing "B" to the same block 0x1000. Our driver issues this with CID 3.
- CID 3 completes before CID 1.
- Finally, some backend code gets around to processing CID 1. It writes "A" to block 0x1000.
So the guest thought it wrote "A" and then "B", but actually what happens is "A" then "B" then "A". Corruption.
There are variants of this, where the guest changes the memory that CID 1 was using and ASAP DMAs it late, after the IO was completed to the guest. This could even affect reads, by corrupting guest memory.
The most reasonable solution, in my opinion, is to avoid issuing any more commands to the SQ until all IOs that were in flight at save time have completed. This should be simple to implement and to reason about. A more complicated solution would be to reason about what the in flight commands are doing and only block IOs that alias those pages. I don't think that's necessary for v1.
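As a rough sketch of that gating idea (`QueuePairState` and its fields are made up for illustration, not the driver's actual types), new submissions could be deferred until every CID that was outstanding at save time has completed:

```rust
use std::collections::HashSet;

// Illustrative state for a restored queue pair.
struct QueuePairState {
    /// CIDs that were outstanding when the saved state was captured.
    pending_at_save: HashSet<u16>,
    /// New submissions held back until the pre-servicing IOs drain.
    deferred: Vec<Vec<u8>>, // opaque command buffers for illustration
}

impl QueuePairState {
    /// Called for each completion read from the CQ after restore. Returns any
    /// deferred commands that are now safe to put on the SQ.
    fn on_completion(&mut self, cid: u16) -> Vec<Vec<u8>> {
        self.pending_at_save.remove(&cid);
        if self.pending_at_save.is_empty() {
            std::mem::take(&mut self.deferred)
        } else {
            Vec::new()
        }
    }

    /// Called when the upper layer wants to submit a new command. Returns the
    /// command if it can go on the SQ immediately, otherwise defers it.
    fn submit(&mut self, cmd: Vec<u8>) -> Option<Vec<u8>> {
        if self.pending_at_save.is_empty() {
            Some(cmd)
        } else {
            self.deferred.push(cmd);
            None
        }
    }
}
```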
Did I miss somewhere where this is handled?
> The most reasonable solution, in my opinion, is to avoid issuing any more commands to the SQ until all IOs that were in flight at save time have completed. This should be simple to implement and to reason about.
I think we cut off receiving mesh commands once the servicing save request is received; let me double-check that and at which point it is done.
Regarding draining of the outstanding (in-flight) I/Os: eventually we want to bypass this, but for v1 it makes sense to drain them, for simplicity. Let me confirm the path. The draining happens for the non-keepalive path.
I think two things are important in the conversation above:
- It's okay to drain IOs before issuing new IOs. But: unlike the non-keepalive path, the overall servicing operation should not block on outstanding IOs. This means that the replay will need to remain asynchronous.
- I agree with John's analysis. It is unsafe to complete any IOs back to the guest until all outstanding IOs (IOs in progress at the save time to the underlying nvme device) complete from the NVMe device to the HCL environment. I think it is fine to start replaying the saved IOs while the in flight IOs are still in progress. This means you have two choices: (a) wait to begin replaying IOs until everything in flight completes, or (b) build some logic in the storvsp layer to not notify the guest until all the in flight IOs complete. (a) seems simpler to me.
In future changes, I think we will need to consider some sort of filtering (e.g. hold new IOs from guest that overlap LBA ranges, as John suggested in an offline conversation).
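A rough sketch of option (a), using a plain thread and channel purely for illustration (the real code uses its own task infrastructure): the restore path returns immediately so servicing is not blocked, while a background task waits for the pre-servicing IOs to drain before replaying the saved IOs:

```rust
use std::sync::mpsc;
use std::thread;

/// `drained` fires once every IO that was in flight at save time has completed;
/// `saved_ios` stand in for the IOs captured in the saved state.
fn replay_after_drain(saved_ios: Vec<u64>, drained: mpsc::Receiver<()>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // Wait for all pre-servicing IOs to complete before reissuing anything,
        // so a stale write cannot race with a replayed or new IO.
        let _ = drained.recv();
        for lba in saved_ios {
            // Placeholder for the real reissue path.
            println!("replaying saved IO at LBA {lba:#x}");
        }
    })
}
```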
So the in-flight IOs will be executed twice if my understanding is correct?
We are saving the pending-commands slab so we can correctly find the CID for outstanding responses once we're back from blackout. We do not resubmit those commands.
I mean in these steps: "Before CID 1 completes, a servicing operation starts. CID 1 makes it into the saved state. At restore, storvsp reissues the IO, so it writes "A" to block 0x1000 with CID 2." The same IO was issued twice, with CID 1 and CID 2. Is that still true?
That is true; I can observe in the instrumented FW that the same command may be issued twice, before and after servicing. Let me double-check whether the CID is still the same, as there is a recent change to randomize CIDs in the NVMe queue.
@@ -16,7 +16,7 @@ use zerocopy::LE;
use zerocopy::U16;

#[repr(C)]
#[derive(Debug, AsBytes, FromBytes, FromZeroes, Inspect)]
#[derive(Debug, AsBytes, FromBytes, FromZeroes, Inspect, Clone)]
can you share what fails to compile (if anything) if you remove Clone?
Fails in namespace.rs line 85 - when we restore the namespace object, if the Identify structure was provided from the saved state.
Lots of code has changed in 2 weeks - bumping to check whether this is still necessary (and if so, what code requires the Clone bound).
@daprilik it is used in NvmeDriver::restore. Even though there is a new direction to re-query Identify Namespace instead, I think we should keep that code for future improvements. If you see a better way to implement that without clone(), please let me know.
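For context, a minimal illustration (not the actual driver types) of why the restore path wants Clone: the saved state only lends a reference to the Identify structure, so storing it in the new namespace object requires an owned copy:

```rust
// Hypothetical stand-ins for the Identify Namespace structure and the
// namespace object; the real field and type names differ.
#[derive(Debug, Clone)]
struct IdentifyNs {
    nsze: u64,
}

struct NamespaceSketch {
    identify: IdentifyNs,
}

impl NamespaceSketch {
    /// Restore path: the saved state provides `&IdentifyNs`, so keeping it in
    /// the restored object needs `IdentifyNs: Clone`.
    fn restore(saved_identify: &IdentifyNs) -> Self {
        Self {
            identify: saved_identify.clone(),
        }
    }
}
```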
respond: send,
};
// Remove high CID bits to be used as a key.
let cid = cmd.cdw0.cid() & Self::CID_KEY_MASK;
Should we include this mask in the saved state (or perhaps the max # of CIDs, and recompute the mask) so that we have the flexibility to change this?
Added save of Self::CID_KEY_BITS, which is the base for other calculations; open for suggestions.
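A small sketch of the idea (field and type names are illustrative, not the actual saved-state schema): save the number of key bits and recompute the mask at restore time, so the split between slab-index bits and the free high bits can change across versions:

```rust
// Illustrative saved-state fragment.
struct QueueSavedStateSketch {
    /// Number of low CID bits used as the pending-command key at save time.
    cid_key_bits: u32,
}

fn cid_key_mask(saved: &QueueSavedStateSketch) -> u16 {
    debug_assert!(saved.cid_key_bits >= 1 && saved.cid_key_bits <= 16);
    // The low `cid_key_bits` bits index the pending-command slab; the
    // remaining high bits are free for sequencing or randomization.
    ((1u32 << saved.cid_key_bits) - 1) as u16
}
```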
if let Some(m) = self.nvme_manager.as_mut() {
    // Only override if explicitly disabled from host.
    if !nvme_keepalive {
        m.override_nvme_keepalive_flag(nvme_keepalive);
Instead of storing a flag in the NvmeManager, let's just take this as a parameter to shutdown. And instead of having save fail if nvme_keepalive is false, just don't call save() below when you don't want to save.
It used to be a parameter to shutdown, but I had to redo it this way to cover some failing cases. Let me get back to this with details once I review.
To resolve this, I need to add this flag to both save() and shutdown(), and this approach was rejected during earlier reviews. Let me add it back and see if it is better this time.
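A sketch of the suggested shape, with hypothetical types: pass the keepalive decision into shutdown() and skip save() entirely when keepalive is off, rather than storing a flag on the manager and having save() fail:

```rust
// Hypothetical stand-ins; the real NvmeManager and saved-state types differ.
struct NvmeManagerSketch;
struct NvmeSavedStateSketch;

impl NvmeManagerSketch {
    fn save(&mut self) -> NvmeSavedStateSketch {
        NvmeSavedStateSketch
    }

    /// Only capture saved state when the host asked for keepalive; the
    /// non-keepalive path tears the device down as before.
    fn shutdown(&mut self, nvme_keepalive: bool) -> Option<NvmeSavedStateSketch> {
        let saved = if nvme_keepalive { Some(self.save()) } else { None };
        // ... quiesce or tear down the device as appropriate ...
        saved
    }
}
```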
identify,
)?;

// Spawn a task, but detach it so that it doesn't get dropped while NVMe
Don't copy/paste this code. Can this not be in new_from_identify?
Hmmm I think I had issues with async/non-async, let me check again
// TODO: Current approach is to re-query namespace data after servicing
// and this array will be empty. Once we confirm that we can process
// namespace change notification AEN, the restore code will be re-added.
this.namespaces.push(Arc::new(Namespace::restore(
If we're not going to do this yet, then let's remove this namespaces field and put #[allow(dead_code)] on Namespace::restore.
ok
Added #[allow(dead_code)] to Namespace::save instead - save is not called anywhere in the code, and restore will be skipped because the vector is empty.
pub fn required_dma_size(expect_q_count: usize) -> usize {
    QueuePair::required_dma_size() * expect_q_count
}

/// Change device's behavior when servicing.
pub fn update_servicing_flags(&mut self, nvme_keepalive: bool) {
This function should just become a parameter to shutdown.
let admin = self.admin.as_ref().unwrap().save().await?;
let mut io: Vec<QueuePairSavedState> = Vec::new();
for io_q in self.io.iter() {
    io.push(io_q.save().await?);
I don't see how this is safe at all. The IO queue workers are still running at this point, right? So if there is anything in their in-memory queues, those items may concurrently, at any point during or after the save, get put into the SQ. But we won't save that fact.
We need to stop all the queues before saving, and leave them stopped. save needs to be a destructive operation that puts the device in a quiesced state.
Do you agree with this analysis?
> save needs to be a destructive operation that puts the device in a quiesced state

Agree - after today's discussion I have the same thought.
Added a break statement to QueueHandler::run after save completes; it appears to be working as expected during my service tests with I/O running. Please let me know if I missed some other safeguards.
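A rough sketch of that quiesce-on-save behavior, with illustrative message and type names: the queue worker loop stops once a save request has been handled, so nothing new can reach the submission queue after the state snapshot:

```rust
use std::sync::mpsc::Receiver;

// Illustrative message type for the queue worker.
enum QueueMessage {
    Submit(Vec<u8>), // an opaque command to place on the SQ
    Save,            // capture state and quiesce
}

fn run_queue_worker(rx: Receiver<QueueMessage>) {
    loop {
        match rx.recv() {
            Ok(QueueMessage::Submit(cmd)) => {
                // ... write the command to the submission queue ...
                let _ = cmd;
            }
            Ok(QueueMessage::Save) => {
                // ... snapshot queue pointers and pending commands ...
                // Leave the queue quiesced: no further submissions after save.
                break;
            }
            Err(_) => break, // channel closed
        }
    }
}
```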
@@ -331,6 +334,19 @@ fn reserved_memory_regions(
        ReservedMemoryType::Vtl2Reserved,
    ));
}
let dma_4k_pages = partition_info.preserve_dma_4k_pages.unwrap_or(0);
// If DMA reserved hint was provided by Host, allocate top of VTL2 memory range
This is actually insufficient - you need to report to the sidecar allocation in start_sidecar that these ranges are not available for allocation. This fn is called after.
Looking into this and have questions.
- There are other reserved memory regions, none of which is passed to start_sidecar, because we calculate those regions afterwards.
- Sidecar itself is the reserved region from our point of view.

So how do we restrict sidecar from using today's VFIO DMA, for example? My assumption was that sidecar cannot access any memory beyond its assigned space (e.g. VFIO DMA is inaccessible), and that space is defined in IGVM. Do you mean that fixed_pool memory could be part of the sidecar range?
It's a bit confusing today, because some of the regions are captured as part of the ShimParams::used range, which describes the whole range used by the loader. Those existing reserved ranges got accounted for by that.
However, any new ranges we allocate (like pool memory) need to be accounted for, so when we load sidecar we tell it which ranges are legal for it to allocate from. I have on my list to clean all of this address space management in the bootshim up very soon, but you can see how I solved this in my WIP branch: chris-oo@33cb4ea. What I've done there is add some new accounting to track all ranges at allocation time, since we need to track all of it.
I may solve this for you if I get the pool PR in first.
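For illustration, a small sketch of the "track ranges at allocation time" idea; the types are made up, not the bootshim's real accounting: record every bootshim-time allocation so the sidecar loader can be told which ranges are off-limits:

```rust
// Illustrative accounting structure; the WIP branch referenced above uses
// its own types.
#[derive(Clone, Copy, Debug)]
struct Range {
    start: u64,
    end: u64, // exclusive
}

#[derive(Default)]
struct AddressSpaceAccounting {
    reserved: Vec<Range>,
}

impl AddressSpaceAccounting {
    /// Record a bootshim-time allocation (pool memory, preserved DMA, ...).
    fn reserve(&mut self, r: Range) {
        self.reserved.push(r);
    }

    /// True if `r` overlaps any recorded reservation; the sidecar loader can
    /// use this to reject candidate allocation ranges.
    fn overlaps_reserved(&self, r: Range) -> bool {
        self.reserved.iter().any(|x| r.start < x.end && x.start < r.end)
    }
}
```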
NVMe keepalive feature implementation. The goal is to keep attached NVMe devices intact during OpenVMM servicing (e.g. reloading a new OpenVMM binary from the host OS).
The memory allocated for NVMe queues and buffers must be preserved during servicing, including queue head/tail pointers. The memory region must be marked as reserved so the VTL2 Linux kernel will not try to use it. A new device tree property, dma-preserve-pages, is passed from the host to indicate the preserved memory size (the memory is carved out of the available VTL2 memory; it is not additional memory).
The NVMe device should not go through any resets; from the physical device's point of view, nothing changed during servicing.
The new OpenHCL module gets the saved state from the host and continues from there.
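For a concrete picture of what "including queue head/tail pointers" implies, here is an illustrative (not the actual) shape of the per-queue-pair state that has to survive servicing; the queue memory itself stays at fixed GPAs inside the preserved region:

```rust
// Illustrative only; the real saved-state schema in the driver differs.
struct QueuePairSavedStateSketch {
    /// Base GPA of the submission queue inside the preserved VTL2 region.
    sq_gpa: u64,
    /// Base GPA of the completion queue inside the preserved VTL2 region.
    cq_gpa: u64,
    /// Submission queue tail as last written to the doorbell.
    sq_tail: u16,
    /// Completion queue head as last written to the doorbell.
    cq_head: u16,
    /// Current completion queue phase bit.
    cq_phase: bool,
    /// CIDs of commands still outstanding at save time.
    pending_cids: Vec<u16>,
}
```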