LVM activation controller needs to wait for udevd #9505
Comments
@jfroy as you seem to be running with your changes on top, just to make sure - do you have the fixes?
I believe I am. I've rebased my release branches on yours, and the behavior has improved, which suggests I do have the current fixes. I used to see LVM controller errors and none of the VGs would get activated. Now the controller seems to work and all the VGs get activated. It feels like this issue may be an interaction with udev somehow missing or dropping events, or not being aware of all the activated VGs. I've seen the big udev changes coming, and maybe those will help. I can try to rebase them on 1.8 and test, though I don't know how mature the patches are. Or I could be wrong and this is something else. I also want to go see what that refresh command does; maybe it will suggest a further line of investigation.
The way you described it isn't something I've seen while debugging this issue - the usual failure mode is that the detection logic falsely reports the volume as activated.
I can't reproduce the issue on both.
I wonder if we could dig more from a support bundle.
Thanks for looking into it some more. I will get a bundle done and also investigate on my own. I can definitely reproduce it on that node in my cluster, so hopefully I can find the root cause.
OK, turns out the root cause was pretty straightforward. The LVM controller can start activating volumes before udevd is running. Those volumes will not have their dev nodes created, because that is done by udevd in response to LVM activation events. See the log below: two activations happen before udevd is running, and those are the two volumes missing their nodes for that particular boot sequence.
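The ordering described above can be sketched as a small wait-then-activate step. This is a hypothetical Python sketch, not the actual Talos Go controller code; `udevd_running` and `activate` are assumed stand-ins for the real readiness check and the activation call (the `vgchange -ay` equivalent):

```python
import time

def activate_volumes(udevd_running, activate, max_wait=30.0, poll=0.5):
    """Defer LVM activation until udevd is up (hypothetical sketch).

    udevd_running: callable returning True once udevd is ready.
    activate: callable performing the activation step.
    Raises TimeoutError if udevd never comes up within max_wait seconds.
    """
    waited = 0.0
    while not udevd_running():
        if waited >= max_wait:
            raise TimeoutError("udevd did not come up in time")
        time.sleep(poll)
        waited += poll
    # Only now is it safe to activate: udevd will observe the activation
    # events and create the /dev nodes for the logical volumes.
    return activate()
```

The key point matches the log analysis above: activation performed before udevd is running produces volumes whose device nodes are never created, because nobody is listening for the events.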
Oh, thanks for digging into this - this should be easy to fix!
@jfroy not totally related, but what is the NVMe hardware in your box? (Or rather, I wonder how the devices come into existence before udevd is running.)
Ah, my kernel has most drivers built in rather than as modules. I arguably should modularize it a bit more, but the ordering issue should probably still be fixed. I checked, and your kernel config indeed differs there.
The drives are Micron 7300 Pro [8x].
OK, that explains why I can't reproduce it. We'll still get a fix in for the next release, 1.8.2 - thank you!
Fixes siderolabs#9505 Signed-off-by: Andrey Smirnov <[email protected]> (cherry picked from commit b7801df)
Fixes siderolabs#9505 Signed-off-by: Andrey Smirnov <[email protected]> Signed-off-by: Utku Ozdemir <[email protected]>
Bug Report
Description
Talos 1.8.1 improved LVM activation quite a bit, but I am still seeing some number of VGs missing when booting a node. These VGs are encrypted Ceph OSDs on NVMe drives. The machine has a fair number of I/O devices (10 NVMe drives, 15 SATA hard drives), so maybe it's a race or a timeout.
The issue manifests as missing /dev nodes for some VGs, usually the same two. Issuing a vgchange --refresh command in a node shell creates the missing nodes and allows the Ceph OSDs to start.
Logs
I see one such log entry for each of the 8 Ceph OSDs:
So the controller is not missing any. There are no additional logs, and no differences in the log lines between the VGs that do get a device node and those that don't.
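The manual workaround described above (refreshing when nodes are missing) can be expressed as a small check. This is a hypothetical Python sketch, not part of Talos; the expected path list and the `run` hook are assumptions for illustration:

```python
import os
import subprocess

def refresh_missing_vgs(expected_lv_paths, run=subprocess.run):
    """If any expected LV device node is missing under /dev, run
    `vgchange --refresh` so udevd re-creates the nodes.

    Sketch of the manual workaround from the report; expected_lv_paths
    is assumed to be known from the Ceph OSD configuration.
    """
    missing = [p for p in expected_lv_paths if not os.path.exists(p)]
    if missing:
        # Re-emit the device-mapper events so udevd creates the nodes.
        run(["vgchange", "--refresh"], check=True)
    return missing
```

Passing a stub for `run` makes the logic testable without LVM tooling installed; in the real workaround the command is simply issued once from a node shell.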
Environment