Fix: Convert volume to another directory instead of copying it while taking volume snapshots on KVM #8041
Conversation
Fix: Convert volume to another directory instead of copying it while taking volume snapshots on KVM
I am still working on the unit tests and smoke tests. Manual tests with a stopped VM will be done soon.
Codecov Report
@@ Coverage Diff @@
## 4.18 #8041 +/- ##
============================================
+ Coverage 13.02% 13.07% +0.05%
- Complexity 9032 9109 +77
============================================
Files 2720 2720
Lines 257080 257519 +439
Branches 40088 40151 +63
============================================
+ Hits 33476 33667 +191
- Misses 219400 219621 +221
- Partials 4204 4231 +27
... and 9 files with indirect coverage changes
plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java
} catch (IOException ex) {
    return String.format("Unable to copy %s snapshot [%s] to [%s] due to [%s].", volume, baseFile, snapshotPath, ex.getMessage());
} catch (QemuImgException | LibvirtException ex) {
    return String.format("Failed to convert %s snapshot of volume [%s] to [%s] due to [%s].", volume, baseFile, snapshotPath, ex.getMessage());
Wouldn't it be convenient to throw the CloudRuntimeException(convertResult) here, instead of in the validate method?
@DaanHoogland I will check if we can refactor this part.
@DaanHoogland Here we return a String instead of throwing an exception, so we can handle the merge of the delta before throwing the exception. We could refactor KVMStorageProcessor#createSnapshot a little bit to work with a try-catch-finally. Though, I think we could do it afterwards, so we focus on fixing the issue and reduce the scope of changes in this PR. What do you think?
Co-authored-by: dahn <[email protected]>
plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java
Thanks for quickly raising a PR @GutoVeronezi. Left some questions; please advise when this is ready for review/testing.
@blueorangutan package
I did some more testing with stopped VMs and the process occurs as expected there too: logs
@rohityadavcloud Just for sanity:
@GutoVeronezi a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 7230
@blueorangutan test
@shwstppr a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
Thanks @GutoVeronezi for raising the PR.
I have a question here. As I understand it, with this PR we are now backing up the consolidated snapshot file, which is good. How is this PR different from reverting the original PR #5297, which introduced a lot more behavioural changes? Do we also need to address code in other places?
[SF] Trillian test result (tid-7846)
LGTM
- Deployed a test VM. Stopped it. Took ROOT volume snapshot. Verified the snapshot does not have any backing_file
- Deleted the VM. This deleted cached template files on the primary store when storage GC ran.
- Created a new template from the snapshot <- Successful
- Create a new volume from the snapshot <- Successful
- Tried deploying a VM with the new template <- Successful
I've not tested revert operations. I hope they still work fine. @GutoVeronezi
@shwstppr @GutoVeronezi have you tested whether the qemu-img conversion happens on the primary storage (where the snapshot file is created) or after it is copied to secondary storage?
@rohityadavcloud happens on primary store before copy/backup
@harikrishna-patnala However, as explained in #8034 (comment), due to a misconception in the implementation of the specific snapshot creation workflow, we have a reference to the volume's template on the snapshot. This reference only exists if we take snapshots of
@rohityadavcloud, as @shwstppr reported in #8041 (comment), it happens on the primary storage. To better illustrate the changes, following is an overview of the current workflow and the workflow with the fix, with the actual changes highlighted:
clgtm
@DaanHoogland a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
Just for the record, following is a quick use case to reproduce the situation with the current patch and to validate the PR:
With the current patch, step 5 is expected to fail, as hosts in
[SF] Trillian test result (tid-7861)
Thanks for explaining @GutoVeronezi |
Fix: Convert volume to another directory instead of copying it while taking volume snapshots on KVM (apache#8041) (cherry picked from commit 9b8eaee) Signed-off-by: Rohit Yadav <[email protected]>
Hello guys, just for the record: I reproduced the issue without the patch and validated the changes in this PR. Following are the tests I made and their details. First, I created two clusters as described here.
Before the patch
qemu-img info -U /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/7c354408-3fed-4113-92b1-4f59f6f97ce0
image: /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/7c354408-3fed-4113-92b1-4f59f6f97ce0
file format: qcow2
virtual size: 8 GiB (8589934592 bytes)
disk size: 324 KiB
cluster_size: 65536
backing file: /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/bb4b7b6f-5249-4c18-a5f6-dd886683b702
backing file format: qcow2
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
qemu-img info /mnt/secondary-storage/snapshots/2/275/165748b5-6ac3-4f95-aac5-3df84945074c
image: /mnt/secondary-storage/snapshots/2/275/165748b5-6ac3-4f95-aac5-3df84945074c
file format: qcow2
virtual size: 8 GiB (8589934592 bytes)
disk size: 22.9 MiB
cluster_size: 65536
backing file: /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/bb4b7b6f-5249-4c18-a5f6-dd886683b702
backing file format: qcow2
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
After applying the patch
qemu-img info -U /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/cee7704d-07cc-452f-b6eb-7342ccd87f2c
image: cee7704d-07cc-452f-b6eb-7342ccd87f2c
file format: qcow2
virtual size: 8 GiB (8589934592 bytes)
disk size: 25.9 MiB
cluster_size: 65536
backing file: /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/bb4b7b6f-5249-4c18-a5f6-dd886683b702
backing file format: qcow2
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
qemu-img info -U /mnt/secondary-storage/snapshots/2/268/60e80be4-ec59-41dc-aaeb-915abe08ec38
image: snapshots/2/268/60e80be4-ec59-41dc-aaeb-915abe08ec38
file format: qcow2
virtual size: 8 GiB (8589934592 bytes)
disk size: 2.6 GiB
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
I also tested the consolidation of volumes after migrating them, described here. The results were the same before and after applying the patch.
Before the patch
qemu-img info -U /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/7c354408-3fed-4113-92b1-4f59f6f97ce0
image: /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/7c354408-3fed-4113-92b1-4f59f6f97ce0
file format: qcow2
virtual size: 8 GiB (8589934592 bytes)
disk size: 324 KiB
cluster_size: 65536
backing file: /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/bb4b7b6f-5249-4c18-a5f6-dd886683b702
backing file format: qcow2
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
qemu-img info -U /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/a61c6c88-73ae-4597-be3d-e489b77db09d
image: /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/a61c6c88-73ae-4597-be3d-e489b77db09d
file format: qcow2
virtual size: 5 GiB (5368709120 bytes)
disk size: 196 KiB
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
qemu-img info -U /mnt/e0560866-421e-313d-8602-320c58842a7e/458e15ce-61cd-476a-88da-660c28c0e940
image: /mnt/e0560866-421e-313d-8602-320c58842a7e/458e15ce-61cd-476a-88da-660c28c0e940
file format: qcow2
virtual size: 8 GiB (8589934592 bytes)
disk size: 2.6 GiB
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
qemu-img info -U /mnt/e0560866-421e-313d-8602-320c58842a7e/bfdc7820-d090-4622-b9aa-f7c780a96ddc
image: /mnt/e0560866-421e-313d-8602-320c58842a7e/bfdc7820-d090-4622-b9aa-f7c780a96ddc
file format: qcow2
virtual size: 5 GiB (5368709120 bytes)
disk size: 196 KiB
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
After applying the patch
qemu-img info -U /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/cee7704d-07cc-452f-b6eb-7342ccd87f2c
image: cee7704d-07cc-452f-b6eb-7342ccd87f2c
file format: qcow2
virtual size: 8 GiB (8589934592 bytes)
disk size: 25.9 MiB
cluster_size: 65536
backing file: /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/bb4b7b6f-5249-4c18-a5f6-dd886683b702
backing file format: qcow2
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
qemu-img info -U /mnt/c3f4c767-72fe-39e0-a4d1-176c7c19dc74/4bc334f5-95e7-43c6-bede-245f22ef7aa5
image: 4bc334f5-95e7-43c6-bede-245f22ef7aa5
file format: qcow2
virtual size: 5 GiB (5368709120 bytes)
disk size: 196 KiB
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
qemu-img info -U /mnt/e0560866-421e-313d-8602-320c58842a7e/7491cf25-7364-4f6a-bf62-b0f1dca16e61
image: 7491cf25-7364-4f6a-bf62-b0f1dca16e61
file format: qcow2
virtual size: 8 GiB (8589934592 bytes)
disk size: 2.6 GiB
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
qemu-img info -U /mnt/e0560866-421e-313d-8602-320c58842a7e/5776344c-1e74-4653-9fb1-ca10cdca3ab6
image: /mnt/e0560866-421e-313d-8602-320c58842a7e/5776344c-1e74-4653-9fb1-ca10cdca3ab6
file format: qcow2
virtual size: 5 GiB (5368709120 bytes)
disk size: 196 KiB
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
Finally, I tested all the workflows described in the original PR #5297 twice, once with the global setting
Hello everyone! Sorry to open a discussion on an already merged PR, but has anybody tested it with encrypted volumes?
Fix: Convert volume to another directory instead of copying it while taking volume snapshots on KVM (apache#8041) (cherry picked from commit 9b8eaee) Signed-off-by: Rohit Yadav <[email protected]>
Description
This PR addresses issue #8034.
While taking a volume snapshot on KVM, we call the method copySnapshotToPrimaryStorageDir. This method copies the base snapshot file to another directory, to enable the block commit of the disk's delta right after taking the snapshot and to avoid creating chains in the primary storage. However, copying a file (via the operating system) that has a backing store will also copy the reference; therefore, it causes problems when the volume has not been consolidated yet (consolidation is always executed on KVM during the volume migration process). If we create a template from the snapshot and the deployment destination does not have access to the backing store file, the template will not be usable, as it will have a backing store pointing to the template that generated the volume.
This PR intends to solve this situation by converting the base volume to the snapshot file instead of just copying it.
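To illustrate the difference between the two approaches, here is a minimal sketch, assuming a reasonably recent qemu-img and made-up file names (base.qcow2, overlay.qcow2); it only demonstrates how cp preserves the backing-file reference while qemu-img convert consolidates the chain:

qemu-img create -f qcow2 base.qcow2 8G
qemu-img create -f qcow2 -b base.qcow2 -F qcow2 overlay.qcow2

# Copying via the OS keeps the backing-file reference in the copy:
cp overlay.qcow2 copied.qcow2
qemu-img info copied.qcow2       # still shows "backing file: base.qcow2"

# Converting produces a standalone, consolidated image:
qemu-img convert -f qcow2 -O qcow2 overlay.qcow2 converted.qcow2
qemu-img info converted.qcow2    # no "backing file" line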
How Has This Been Tested?
I created a VM from a template and checked the volume's chain:
Then, I triggered a manual volume snapshot with the VM still running:
logs
After that, I checked the snapshot file on the secondary storage and identified that it did not have a chain (i.e., it was consolidated):
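For reference, a check along these lines can be done with qemu-img info; the path below is a made-up placeholder, and the check looks for the absence of a "backing file" line in the output:

qemu-img info /mnt/secondary-storage/snapshots/<account-id>/<volume-id>/<snapshot-uuid> | grep 'backing file'
# No output means the snapshot is a standalone (consolidated) qcow2 file.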
Then, I created a template from the snapshot and a VM with the template:
With that, it was concluded that the template was a single file, without references to the template of the volume that generated the snapshot.