Broken zvol links #183

Closed
piexil opened this issue Sep 23, 2021 · 23 comments
Labels
bug Something isn't working

Comments

@piexil

piexil commented Sep 23, 2021

I'm having the same issue experienced by this person on the proxmox forums
https://forum.proxmox.com/threads/no-zvol-device-link-found-after-300-sec-found-issue.94514/

tl;dr VMs fail to start with error 'no zvols found after 300 secs'

fabianishere added the "bug" label Sep 23, 2021
@fabianishere
Owner

Which kernel version are you using?

@piexil
Author

piexil commented Sep 23, 2021

happens on both pve-kernel-5.13-edge and pve-kernel-5.14-edge. pve-kernel-5.12-edge works fine

fabianishere changed the title from "broken zvol links" to "Broken zvol links" Sep 24, 2021
@fabianishere
Owner

I am now updating the ZFS version to 2.1.1. Hopefully that resolves the issue.

@piexil
Author

piexil commented Sep 27, 2021

TASK ERROR: timeout: no zvol device link for 'vm-103-disk-0' found after 300 sec found.
Still happens on v5.14.7-2
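
For readers unfamiliar with the message: it says that a wait for the zvol device link gave up after 300 seconds. A minimal sketch of such a wait, assuming plain polling of the link path. The name `wait_for_link` and its parameters are hypothetical, and this is not Proxmox's actual code.

```python
import os
import time


def wait_for_link(path: str, timeout: float = 300.0, interval: float = 1.0) -> bool:
    """Poll until `path` exists (as a symlink or otherwise) or `timeout` seconds pass.

    Hypothetical helper for illustration only; the 300 second default mirrors
    the timeout quoted in the error message.
    """
    deadline = time.monotonic() + timeout
    while True:
        # lexists() is true for a symlink even if its target is missing.
        if os.path.lexists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling it with a path such as `/dev/zvol/rpool/data/vm-103-disk-0` and a short timeout is a quick way to tell whether the link ever shows up, without waiting for the VM start to fail.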

@piexil
Author

piexil commented Sep 27, 2021

ok it seemed to work after a second boot of the VM this time...weird.

@piexil
Author

piexil commented Sep 27, 2021

fyi 5.14.7-2 still reports as -1
root@epyc:~# uname -a
Linux epyc 5.14.7-1-edge #1 SMP 5.14.7-1-edge generic (Thu, 24 Sep 2021 12:30:00 +0000) x86_64 GNU/Linux
root@epyc:~# pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.14.7-1-edge)
pve-manager: 7.0-9 (running version: 7.0-9/228c9caa)
pve-kernel-helper: 7.0-4
pve-kernel-5.11: 7.0-3
pve-kernel-5.14.7-1-edge: 5.14.7-2

@fabianishere
Owner

fabianishere commented Sep 27, 2021

Glad to hear that you got it working. Let me know if the issue reappears.

> fyi 5.14.7-2 still reports as -1
> root@epyc:~# uname -a
> Linux epyc 5.14.7-1-edge #1 SMP 5.14.7-1-edge generic (Thu, 24 Sep 2021 12:30:00 +0000) x86_64 GNU/Linux
> root@epyc:~# pveversion -v
> proxmox-ve: 7.0-2 (running kernel: 5.14.7-1-edge)
> pve-manager: 7.0-9 (running version: 7.0-9/228c9caa)
> pve-kernel-helper: 7.0-4
> pve-kernel-5.11: 7.0-3
> pve-kernel-5.14.7-1-edge: 5.14.7-2

That number represents the kernel ABI revision and not the Debian release. I currently do not track ABI changes, so I will remove this number from v5.15.x onwards.
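
To make the distinction concrete, here is a small sketch that splits a kernel release string such as `5.14.7-1-edge` into its parts. The `<upstream>-<abi>-<flavour>` layout is an assumption read off the output quoted above, and `KernelRelease` and `parse_release` are hypothetical names, not part of any Proxmox tooling.

```python
import re
from typing import NamedTuple


class KernelRelease(NamedTuple):
    upstream: str  # upstream kernel version, e.g. "5.14.7"
    abi: int       # kernel ABI revision, e.g. 1 (not the Debian release)
    flavour: str   # build flavour, e.g. "edge" or "pve"


def parse_release(release: str) -> KernelRelease:
    """Split a release string assumed to look like '<upstream>-<abi>-<flavour>'."""
    match = re.fullmatch(r"(\d+\.\d+\.\d+)-(\d+)-(\w+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return KernelRelease(match.group(1), int(match.group(2)), match.group(3))
```

With the values from this thread, `parse_release("5.14.7-1-edge").abi` is 1, while the package version `5.14.7-2` carries the Debian release 2, which is why the two numbers do not have to match.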

@piexil
Author

piexil commented Nov 4, 2021

Happens again on 5.15.0, reverting back to 5.14.16 works.

@piexil
Author

piexil commented Nov 5, 2021

Actually, it seems reverting to 5.14.16 did not work; it is happening there too, after attempting 5.15.0 again.

@fabianishere
Owner

Does it work again if you try to reboot the VM?

@piexil
Author

piexil commented Nov 5, 2021

Nope. However, sometimes rebooting the whole machine makes it work. It's working right now, but I'm scared to reboot it; it took quite a few reboots to get it working.

@piexil
Author

piexil commented Nov 6, 2021

OK, I don't know what causes this, but the problem is that not every link that is supposed to be created actually gets created. After booting, for VM-103 I have

root@epyc:~# ls -la /dev/zvol/rpool/data/ | grep -i vm-103
lrwxrwxrwx 1 root root  15 Nov  6 00:39 vm-103-disk-0-part1 -> ../../../zd48p1
lrwxrwxrwx 1 root root  15 Nov  6 00:39 vm-103-disk-0-part2 -> ../../../zd48p2
lrwxrwxrwx 1 root root  15 Nov  6 00:39 vm-103-disk-0-part3 -> ../../../zd48p3

when it should be

root@epyc:~# ls -la /dev/zvol/rpool/data/ | grep -i vm-103
lrwxrwxrwx 1 root root  13 Nov  6 00:51 vm-103-disk-0 -> ../../../zd48
lrwxrwxrwx 1 root root  15 Nov  6 00:39 vm-103-disk-0-part1 -> ../../../zd48p1
lrwxrwxrwx 1 root root  15 Nov  6 00:39 vm-103-disk-0-part2 -> ../../../zd48p2
lrwxrwxrwx 1 root root  15 Nov  6 00:39 vm-103-disk-0-part3 -> ../../../zd48p3

If I manually create the link
ln -s ../../../zd48 /dev/zvol/rpool/data/vm-103-disk-0
the VM boots
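
The manual fix above can be generalised into a check for every affected disk. The following is a rough sketch, assuming that partition links (`-partN` pointing at `zdXpN`) are present whenever the base link is missing, as in the listing above; a zvol without partitions would not be detected. All function names are hypothetical.

```python
import os
import re
from typing import Dict, List, Tuple

# Partition links look like "vm-103-disk-0-part1" -> "../../../zd48p1".
PART_LINK = re.compile(r"^(?P<base>.+)-part\d+$")
PART_TARGET = re.compile(r"^(?P<dev>.*zd\d+)p\d+$")


def missing_base_links(links: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return sorted (link_name, target) pairs for base links that are absent.

    `links` maps each symlink name in the zvol directory to its target.
    """
    missing = {}
    for name, target in links.items():
        m_name = PART_LINK.match(name)
        m_target = PART_TARGET.match(target)
        if m_name and m_target and m_name.group("base") not in links:
            missing[m_name.group("base")] = m_target.group("dev")
    return sorted(missing.items())


def read_links(directory: str) -> Dict[str, str]:
    """Collect the symlinks in `directory` as a name -> target mapping."""
    result = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            result[name] = os.readlink(path)
    return result


def suggest_commands(directory: str) -> List[str]:
    """Build `ln -s` commands for missing base links without running them."""
    return [
        f"ln -s {target} {os.path.join(directory, name)}"
        for name, target in missing_base_links(read_links(directory))
    ]
```

Calling `suggest_commands("/dev/zvol/rpool/data")` on the affected host would return the `ln -s` line shown above for vm-103-disk-0. The sketch only builds the commands so that they can be reviewed before anything under /dev is changed.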

@fabianishere
Owner

Your issue looks similar to the one reported upstream: openzfs/zfs#12507.
It is also mentioned there that this issue has possibly existed since Linux 5.13: openzfs/zfs#12301

This probably means that you'll have to wait until openzfs/zfs#12301 is fixed.

@amoiseiev
Contributor

The problem seems severe enough, wondering if it makes sense to add a temporary patch reverting:

torvalds/linux@a8ed1a0

and then dropping it when it's fixed in ZFS upstream. Obviously, not ideal but likely better than having random VM failures

@fabianishere
Owner

@amoiseiev I agree, I'll create a patch reverting this change for the v5.15.x and v5.14.x branches.

fabianishere added a commit that referenced this issue Nov 8, 2021
This change adds a patch that reverts a change to the Linux kernel that
was intended to clean up some internal behavior. Unfortunately, ZFS
currently relies on this behavior to perform correctly.

Without this patch, zvol links might not be created correctly due to a
race condition.

See #183 for more information about this issue.
@fabianishere
Owner

fabianishere commented Nov 9, 2021

@piexil Could you check whether the issue still appears in v5.14.17-1 or v5.15.1-1?

@piexil
Author

piexil commented Nov 19, 2021

sorry for the late reply, have been away.
5.15.2 seems to be okay so far

@fabianishere
Owner

This issue should be resolved in the latest builds.

@dac2020

dac2020 commented Sep 2, 2022

What's going on that this still happened to me using pve 7.1.7 and 7.2?
no zvol device link for...

I lost a bunch of guests out of the blue on a node in a cluster. Does it mean the others will vanish too?
Never had this happen on vmware esx, what does it mean?

@fabianishere
Owner

@dac2020 What kernel version are you using?

@dac2020

dac2020 commented Sep 2, 2022

Hi,

I started this thread: https://forum.proxmox.com/threads/cannot-migrate-guests.114340/

It's now this:
proxmox-ve: 7.2-1 (running kernel: 5.15.39-4-pve)
pve-manager: 7.2-7 (running version: 7.2-7/d0dd0e85)
pve-kernel-5.15: 7.2-9
pve-kernel-helper: 7.2-9
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.39-4-pve: 5.15.39-4
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4

But the problem started while the system was still on 7.1.7, which I've since upgraded to 7.2 thinking it might help, but it didn't.
If I've done something wrong, so be it; I hope to learn from it. But as far as I recall, these guests were simply shut off and there were no problems whatsoever until I wanted to migrate them before upgrading this node. The thread explains it all.

I can simply rebuild it but wanted to share in case it's something I didn't do and important to the proxmox devs.

@fabianishere
Owner

Since you are using the stock kernel (and not this project), I am unable to help with this issue.

@dac2020

dac2020 commented Sep 2, 2022

No problem, wanted to share because it seems important before I rebuild.
