
LVM activation controller needs to wait for udevd #9505

Closed
jfroy opened this issue Oct 16, 2024 · 11 comments · Fixed by #9574

jfroy commented Oct 16, 2024

Bug Report

Description

Talos 1.8.1 improved LVM activation quite a bit, but I am still seeing some VGs missing when booting a node. These VGs are encrypted Ceph OSDs on NVMe drives. The machine has a fair number of I/O devices (10 NVMe drives, 15 SATA hard drives), so it may be a race or a timeout.

The issue manifests as missing /dev nodes for some VGs, usually the same 2. Issuing a vgchange --refresh command in a node-shell creates the missing nodes and allows the Ceph OSDs to start.
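
The manual workaround can be sketched as a small shell snippet (a sketch, assuming a node-shell with the LVM tools on the PATH; it is guarded so it is a no-op where they are absent):

```shell
#!/bin/sh
# Refresh all volume groups so device-mapper re-creates any /dev nodes
# that were skipped during activation. Guarded so this is a no-op on
# systems without the LVM userspace tools installed.
refresh_lvm_nodes() {
	if command -v vgchange >/dev/null 2>&1; then
		vgchange --refresh
	else
		echo "lvm tools not found; skipping" >&2
	fi
}

refresh_lvm_nodes
```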

Logs

I see one such log entry for each of the 8 Ceph OSDs:

192.168.1.13: user: warning: [2024-10-15T23:59:03.258425978Z]: [talos] activating LVM volume {"component": "controller-runtime", "controller": "block.LVMActivationController", "name": "ceph-41785a27-5010-4418-becf-d6d52fd93f9c"}

So the controller is not missing any. There are no additional logs or differences in the log lines between the VGs that do get device nodes and those that don't.

Environment

  • Talos version:
    Client:
        Tag:         v1.6.6
        SHA:         7dceba06
        Built:       
        Go version:  go1.21.8 X:loopvar
        OS/Arch:     darwin/arm64
    Server:
        NODE:        192.168.1.13
        Tag:         v1.8.1-jfroy.2
        SHA:         e398d8a0
        Built:       
        Go version:  go1.22.8
        OS/Arch:     linux/amd64
        Enabled:     RBAC
    
  • Kubernetes version: 1.31.1
  • Platform: baremetal amd64

jfroy commented Oct 16, 2024

See #9422 and #9333 for recent work in this area.


smira commented Oct 16, 2024

@jfroy as you seem to be running with your changes on top, just to make sure - do you have $PKGS matching v1.8.1, as there were several fixes in pkgs as well


jfroy commented Oct 16, 2024

> @jfroy as you seem to be running with your changes on top, just to make sure - do you have $PKGS matching v1.8.1, as there were several fixes in pkgs as well

I believe I am. I've rebased my release branches on yours (release-1.8) and rebuilt. But I do not build PKGS myself (with a few exceptions, e.g. base and kernel), so I assume I pick up your upstream builds. Maybe the PKGS ref is stale in my forks; I will check.

But the behavior has improved, which suggests I do have the current fixes. I used to see LVM controller errors and get none of the VGs activated. Now the controller seems to work and all the VGs get activated. It feels like this issue may be an interaction with udev: it is somehow missing or dropping events, or is not aware of all the activated VGs.

I've seen the big changes to udev coming and maybe that will help. I can try to rebase that on 1.8 and test, though I don't know how mature the patches are.

Or I could be wrong and this is something else. I also want to look at what that refresh command does; maybe it will suggest a further line of investigation.


smira commented Oct 16, 2024

I haven't seen the behavior you describe while debugging this issue; the usual failure mode is that the detection logic falsely reports the volume as activated.

smira self-assigned this Oct 17, 2024

smira commented Oct 21, 2024

I can't reproduce the issue on either main or Talos 1.8.1. I tested with 4 NVMe drives and Ceph with encryption enabled. On every reboot, all 3 worker nodes came up with all LVM volumes activated.

172.20.0.7   runtime     DiscoveredVolume   dm-0      1         disk        11 GB    luks                                                  
172.20.0.7   runtime     DiscoveredVolume   dm-1      1         disk        11 GB    luks                                                  
172.20.0.7   runtime     DiscoveredVolume   dm-2      1         disk        11 GB    luks                                                  
172.20.0.7   runtime     DiscoveredVolume   dm-3      1         disk        11 GB    luks                                                  
172.20.0.7   runtime     DiscoveredVolume   dm-4      1         disk        11 GB    bluestore                                             
172.20.0.7   runtime     DiscoveredVolume   dm-5      1         disk        11 GB    bluestore                                             
172.20.0.7   runtime     DiscoveredVolume   dm-6      1         disk        11 GB    bluestore                                             
172.20.0.7   runtime     DiscoveredVolume   dm-7      1         disk        11 GB    bluestore                                             
172.20.0.7   runtime     DiscoveredVolume   loop0     1         disk        77 MB    squashfs                                              
172.20.0.7   runtime     DiscoveredVolume   nvme0n1   1         disk        11 GB    lvm2-pv      0PsH8G-afuZ-mS2J-A35D-FirN-Qpfd-OyTv2a   
172.20.0.7   runtime     DiscoveredVolume   nvme0n2   1         disk        11 GB    lvm2-pv      QFHt0k-ka8i-KX1t-kTGk-WWww-SGCK-GPpHZ8   
172.20.0.7   runtime     DiscoveredVolume   nvme0n3   1         disk        11 GB    lvm2-pv      K3nU4P-u2hf-6wAy-orfN-bSju-h0By-Upx64T   
172.20.0.7   runtime     DiscoveredVolume   nvme0n4   1         disk        11 GB    lvm2-pv      E64OOb-lK4c-Dzfi-xiGa-9bNC-vW4T-aeQhL7 

I wonder if we could dig up more from a talosctl support bundle.


jfroy commented Oct 22, 2024

Thanks for looking into it some more. I will get a bundle done and also investigate more on my own. I can definitely reproduce it on that node in my cluster, so hopefully I can root-cause it.

jfroy changed the title from "LVM activation is not entirely working" to "LVM activation controller needs to wait for udevd" on Oct 24, 2024

jfroy commented Oct 24, 2024

OK, it turns out the root cause was pretty straightforward. The LVM activation controller can start activating volumes before udevd is running. Those volumes will not have their /dev nodes created, because node creation is performed by udevd in response to LVM activation events.

See the log below. Two activations happen before udevd is running; those are the two volumes missing their nodes for that particular boot sequence.

192.168.1.13: user: warning: [2024-10-23T21:58:15.064749698Z]: [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "META", "phase": "waiting -> ready"}
192.168.1.13: user: warning: [2024-10-23T21:58:15.086447698Z]: [talos] task writeIMAPolicy (1/1): done, 166.680011ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.090304698Z]: [talos] phase integrity (2/12): done, 183.238453ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.094078698Z]: [talos] phase etc (3/12): 3 tasks(s)
192.168.1.13: user: warning: [2024-10-23T21:58:15.097845698Z]: [talos] task setUserEnvVars (3/3): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.101460698Z]: [talos] task setUserEnvVars (3/3): done, 3.630649ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.104960698Z]: [talos] task CreateSystemCgroups (1/3): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.108488698Z]: [talos] task createOSReleaseFile (2/3): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.112089698Z]: [talos] task CreateSystemCgroups (1/3): using cgroups root: /
192.168.1.13: user: warning: [2024-10-23T21:58:15.115654698Z]: [talos] task createOSReleaseFile (2/3): done, 14.329314ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.136053698Z]: [talos] activating LVM volume {"component": "controller-runtime", "controller": "block.LVMActivationController", "name": "ceph-c033ff07-8e3d-439b-94bf-2b8f94445df4"}
192.168.1.13: user: warning: [2024-10-23T21:58:15.156689698Z]: [talos] task CreateSystemCgroups (1/3): done, 58.826575ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.160161698Z]: [talos] phase etc (3/12): done, 66.085972ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.163461698Z]: [talos] phase earlyServices (4/12): 4 tasks(s)
192.168.1.13: user: warning: [2024-10-23T21:58:15.166679698Z]: [talos] task startContainerd (4/4): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.169868698Z]: [talos] service[containerd](Starting): Starting service
192.168.1.13: user: warning: [2024-10-23T21:58:15.172979698Z]: [talos] task startUdevd (1/4): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.176138698Z]: [talos] task startMachined (2/4): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.179168698Z]: [talos] task startSyslogd (3/4): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.182115698Z]: [talos] service[containerd](Preparing): Running pre state
192.168.1.13: user: warning: [2024-10-23T21:58:15.185049698Z]: [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "STATE", "phase": "waiting -> failed", "error": "volume is encrypted, but no encryption config provided"}
192.168.1.13: user: warning: [2024-10-23T21:58:15.193783698Z]: [talos] service[containerd](Preparing): Creating service runner
192.168.1.13: user: warning: [2024-10-23T21:58:15.196659698Z]: [talos] service[udevd](Starting): Starting service
192.168.1.13: user: warning: [2024-10-23T21:58:15.199467698Z]: [talos] service[udevd](Preparing): Running pre state
192.168.1.13: user: warning: [2024-10-23T21:58:15.202199698Z]: [talos] service[syslogd](Starting): Starting service
192.168.1.13: user: warning: [2024-10-23T21:58:15.204872698Z]: [talos] service[containerd](Running): Process Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"]) started with PID 1896
192.168.1.13: user: warning: [2024-10-23T21:58:15.212845698Z]: [talos] service[syslogd](Waiting): Waiting for service "machined" to be "up"
192.168.1.13: user: warning: [2024-10-23T21:58:15.215571698Z]: [talos] task startSyslogd (3/4): done, 38.144554ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.218198698Z]: [talos] service[machined](Starting): Starting service
192.168.1.13: user: warning: [2024-10-23T21:58:15.220825698Z]: [talos] service[machined](Preparing): Running pre state
192.168.1.13: user: warning: [2024-10-23T21:58:15.223438698Z]: [talos] service[udevd](Preparing): Creating service runner
192.168.1.13: user: warning: [2024-10-23T21:58:15.226068698Z]: [talos] service[machined](Preparing): Creating service runner
192.168.1.13: user: warning: [2024-10-23T21:58:15.228763698Z]: [talos] service[udevd](Running): Process Process(["/sbin/udevd" "--resolve-names=never"]) started with PID 1902
192.168.1.13: user: warning: [2024-10-23T21:58:15.234298698Z]: [talos] service[machined](Running): Service started as goroutine
192.168.1.13: user: warning: [2024-10-23T21:58:15.296740698Z]: [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "STATE", "phase": "failed -> failed", "error": "volume is encrypted, but no encryption config provided"}
192.168.1.13: user: warning: [2024-10-23T21:58:15.407836698Z]: [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "STATE", "phase": "failed -> failed", "error": "volume is encrypted, but no encryption config provided"}
192.168.1.13: user: warning: [2024-10-23T21:58:15.454934698Z]: [talos] activating LVM volume {"component": "controller-runtime", "controller": "block.LVMActivationController", "name": "ceph-cd59b9e6-3943-4927-b7a5-c95d920a35ed"}
192.168.1.13: daemon:    info: [2024-10-23T21:58:15.513383698Z]: udevd[1902]: starting version 3.2.14
192.168.1.13: daemon:    info: [2024-10-23T21:58:15.520355698Z]: udevd[1902]: starting eudev-3.2.14


smira commented Oct 24, 2024

Oh, thanks for digging into this; it should be easy to fix!


smira commented Oct 24, 2024

@jfroy not totally related, but what is the NVMe hardware in your box?

(or talosctl get pcidevices)

I wonder how they come into existence before udevd starts (as it should load the required modules)


jfroy commented Oct 24, 2024

> @jfroy not totally related, but what is the NVMe hardware in your box?
>
> (or talosctl get pcidevices)
>
> I wonder how they come into existence before udevd starts (as it should load the required modules)

Ah, my kernel has most drivers built in rather than as modules. I arguably should modularize it a bit more, but the ordering issue should probably still be fixed.

I checked and indeed your kernel config has CONFIG_BLK_DEV_NVME=m and mine has CONFIG_BLK_DEV_NVME=y.
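
For reference, the kernel config difference in question (the values are as quoted above; the comments are my interpretation):

```
# Talos upstream kernel: NVMe driver built as a module, loaded on demand
# by udevd, so NVMe block devices cannot appear before udevd runs.
CONFIG_BLK_DEV_NVME=m

# Custom kernel in this report: driver built in, so NVMe devices (and the
# LVM PVs on them) exist before udevd starts, exposing the ordering bug.
CONFIG_BLK_DEV_NVME=y
```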

They are Micron 7300 Pro [8x].

@smira
Copy link
Member

smira commented Oct 24, 2024

OK, that explains why I can't reproduce it, but we'll still get a fix in for the next 1.8.2 release. Thank you!

smira added a commit to smira/talos that referenced this issue Oct 25, 2024
Fixes siderolabs#9505

Signed-off-by: Andrey Smirnov <[email protected]>
(cherry picked from commit b7801df)
utkuozdemir pushed a commit to utkuozdemir/talos that referenced this issue Oct 25, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 25, 2024