kernel crashes on Oracle Linux 8 #178
vmcore-dmesg.txt.tar.gz
Looks like this is similar to LINBIT/drbd#86
Hello! Thanks for the report. I guess it would be a good idea to add that information to the DRBD issue, as that seems to be the root cause. We have seen it internally, but have never been able to reproduce it reliably. Adding more context seems like a good idea.
Thanks for the answer. Should I add more details about how I reproduced it?
Also, does it make sense to try an older piraeus version? The issue also reproduces with DRBD 9.2.6 and piraeus v2.3.0.
You could try DRBD 9.1.18. That does mean you have to use host networking, but you already do use that. |
@WanzenBug hello. Here are our reproduction steps: we have a 5-node k8s cluster with SSD storage pools of 100 GB each (thin LVM). All queues are processed with 1 parallel operation.
When such a scheme is launched in a continuous cycle, we almost invariably see several node reboots per day. The operating system does not seem to matter; we have encountered a similar problem with various 5.x and 6.x kernels from different distributions. However, the issue is definitely reproducible on the current LTS Ubuntu 22.04. STS spec:
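The original STS spec is not reproduced here; as a rough illustration, a StatefulSet of the kind described might look like the sketch below (the storage class name, replica count, image, and sizes are assumptions, not the reporter's actual values):

```yaml
# Illustrative sketch only, not the reporter's actual spec.
# Assumes a LINSTOR-backed storage class named "linstor-thin".
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: drbd-repro
spec:
  serviceName: drbd-repro
  replicas: 8
  selector:
    matchLabels:
      app: drbd-repro
  template:
    metadata:
      labels:
        app: drbd-repro
    spec:
      containers:
        - name: writer
          image: busybox
          # Keep some I/O going on the volume while pods are created and deleted.
          command: ["sh", "-c", "while true; do dd if=/dev/zero of=/data/test bs=1M count=64; sync; sleep 5; done"]
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: linstor-thin
        resources:
          requests:
            storage: 1Gi
```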
Thanks! You could also try switching to DRBD 9.1.18. We suspect there is a race condition introduced in the 9.2 branch. |
Another idea on what might be causing the issue, with a workaround in the CSI driver: piraeusdatastore/linstor-csi#256. You might try that by using the …
We have tested with DRBD 9.1.18. It looks like the issue does not reproduce with this version.
I'm also testing 9.1.18 now.
Yes, it is safe.
@WanzenBug it looks like v1.5.0-2-g16c206a solves the node restart problem. Could you please create a tagged release with it? (maybe 1.5.1)
Thank you for testing! So just to confirm, you tested with DRBD 9.2.8 and the above CSI version and did not observe the crash? Then it must have something to do with removing a volume from a resource, as I expected. I will use that to try to reproduce the behaviour.
We tested this with 9.2.5 and 9.2.8, and the above CSI version. Yes, there were no crashes anymore. Thank you, I'll wait for your solution. Can you tell whether the fix from v1.5.0-2-g16c206a will be included in 1.5.1?
Yes, there will be a 1.5.1 with that. We still intend to fix the issue in DRBD, too. |
We will also test with 1.5.1 and DRBD 9.2.8 when 1.5.1 is released.
Just wanted to let you know that we think we have tracked down the issue. No fix yet, but we should have something ready for the next DRBD release.
Fixed on the DRBD side with LINBIT/drbd@857db82 and LINBIT/drbd@343e077. |
Kubernetes v1.27.5
Bare metal nodes
LVM Thinpool
piraeus-operator v2.4.1
Oracle Linux 8
Kernel 5.15.0-204.147.6.2.el8uek.x86_64 + default drbd image drbd9-jammy
Also reproduced with kernel 4.18 + drbd image drbd9-almalinux8
How to reproduce:
Create and attach a number of volumes, then delete them again. I tested with about 8 PVCs and pods and ran around 20 rounds of creating and then deleting them. Randomly the server reboots because of a kernel crash. Most often it happened during volume deletion, but it was also reproduced while creating a new PVC.
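For context, a LINSTOR storage class backed by a thin LVM pool (as used in this setup) typically looks roughly like the following; the pool name and placement count are assumptions, not values taken from this report:

```yaml
# Rough sketch of a LINSTOR-backed storage class; the pool name and
# placement count are assumptions, not the values used in this report.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-thin
provisioner: linstor.csi.linbit.com
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  linstor.csi.linbit.com/storagePool: ssd-thin
  linstor.csi.linbit.com/placementCount: "2"
  csi.storage.k8s.io/fstype: ext4
```

PVCs created and deleted against such a class are what drive the LINSTOR resource create/delete cycle described above.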
The UEK kernel Makefile (/usr/src/kernels/5.15.0-204.147.6.2.el8uek.x86_64/Makefile) was patched to be able to build DRBD.