
LVM activation controller needs to wait for udevd #9505

Closed
jfroy opened this issue Oct 16, 2024 · 11 comments · Fixed by #9574

jfroy commented Oct 16, 2024

Bug Report

Description

Talos 1.8.1 improved LVM activation quite a bit, but I am still seeing some VGs missing when booting a node. These VGs are encrypted Ceph OSDs on NVMe drives. The machine has a fair number of I/O devices (10 NVMe drives, 15 SATA hard drives), so it may be a race or a timeout.

The issue manifests as missing /dev nodes for some VGs, usually the same 2. Issuing a vgchange --refresh command in a node-shell creates the missing nodes and allows the Ceph OSDs to start.
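
The manual workaround can be sketched as a small shell snippet (a sketch, assuming a node-shell with the LVM tools on the PATH; it is guarded so it is a no-op where they are absent):

```shell
#!/bin/sh
# Refresh all volume groups so device-mapper re-creates any /dev nodes
# that were skipped during activation. Guarded so this is a no-op on
# systems without the LVM userspace tools installed.
refresh_lvm_nodes() {
	if command -v vgchange >/dev/null 2>&1; then
		vgchange --refresh
	else
		echo "lvm tools not found; skipping" >&2
	fi
}

refresh_lvm_nodes
```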

Logs

I see one such log entry for each of the 8 Ceph OSDs:

192.168.1.13: user: warning: [2024-10-15T23:59:03.258425978Z]: [talos] activating LVM volume {"component": "controller-runtime", "controller": "block.LVMActivationController", "name": "ceph-41785a27-5010-4418-becf-d6d52fd93f9c"}

So the controller is not missing any. There are no additional logs or differences in the log lines between the VGs that do get device nodes and those that don't.

Environment

  • Talos version:
    Client:
        Tag:         v1.6.6
        SHA:         7dceba06
        Built:       
        Go version:  go1.21.8 X:loopvar
        OS/Arch:     darwin/arm64
    Server:
        NODE:        192.168.1.13
        Tag:         v1.8.1-jfroy.2
        SHA:         e398d8a0
        Built:       
        Go version:  go1.22.8
        OS/Arch:     linux/amd64
        Enabled:     RBAC
    
  • Kubernetes version: 1.31.1
  • Platform: baremetal amd64

jfroy commented Oct 16, 2024

See #9422 and #9333 for recent work in this area.


smira commented Oct 16, 2024

@jfroy as you seem to be running with your changes on top, just to make sure - do you have $PKGS matching v1.8.1, as there were several fixes in pkgs as well


jfroy commented Oct 16, 2024

> @jfroy as you seem to be running with your changes on top, just to make sure - do you have $PKGS matching v1.8.1, as there were several fixes in pkgs as well

I believe I am. I've rebased my release branches on yours (release-1.8) and rebuilt. But I do not build PKGS myself (with a few exceptions, e.g. base and kernel), so I assume I pick up your upstream builds. Maybe the PKGS ref is stale in my forks; I will check.

But the behavior has improved, which suggests I do have the current fixes. I used to see LVM controller errors and get none of the VGs activated. Now the controller seems to work and all the VGs get activated. It feels like this issue may be an interaction with udev: it is somehow missing or dropping events, or is not aware of all the activated VGs.

I've seen the big changes to udev coming and maybe that will help. I can try to rebase that on 1.8 and test, though I don't know how mature the patches are.

Or I could be wrong and this is something else. I also want to look at what that refresh command does; maybe it will suggest a further line of investigation.


smira commented Oct 16, 2024

I haven't seen the behavior you describe while debugging this issue; the usual failure mode is that the detection logic falsely reports the volume as activated.

smira self-assigned this Oct 17, 2024

smira commented Oct 21, 2024

I can't reproduce the issue on either main or Talos 1.8.1. I tested with 4 NVMe drives and Ceph with encryption enabled. On every reboot, all 3 worker nodes came up with all LVM volumes activated.

172.20.0.7   runtime     DiscoveredVolume   dm-0      1         disk        11 GB    luks                                                  
172.20.0.7   runtime     DiscoveredVolume   dm-1      1         disk        11 GB    luks                                                  
172.20.0.7   runtime     DiscoveredVolume   dm-2      1         disk        11 GB    luks                                                  
172.20.0.7   runtime     DiscoveredVolume   dm-3      1         disk        11 GB    luks                                                  
172.20.0.7   runtime     DiscoveredVolume   dm-4      1         disk        11 GB    bluestore                                             
172.20.0.7   runtime     DiscoveredVolume   dm-5      1         disk        11 GB    bluestore                                             
172.20.0.7   runtime     DiscoveredVolume   dm-6      1         disk        11 GB    bluestore                                             
172.20.0.7   runtime     DiscoveredVolume   dm-7      1         disk        11 GB    bluestore                                             
172.20.0.7   runtime     DiscoveredVolume   loop0     1         disk        77 MB    squashfs                                              
172.20.0.7   runtime     DiscoveredVolume   nvme0n1   1         disk        11 GB    lvm2-pv      0PsH8G-afuZ-mS2J-A35D-FirN-Qpfd-OyTv2a   
172.20.0.7   runtime     DiscoveredVolume   nvme0n2   1         disk        11 GB    lvm2-pv      QFHt0k-ka8i-KX1t-kTGk-WWww-SGCK-GPpHZ8   
172.20.0.7   runtime     DiscoveredVolume   nvme0n3   1         disk        11 GB    lvm2-pv      K3nU4P-u2hf-6wAy-orfN-bSju-h0By-Upx64T   
172.20.0.7   runtime     DiscoveredVolume   nvme0n4   1         disk        11 GB    lvm2-pv      E64OOb-lK4c-Dzfi-xiGa-9bNC-vW4T-aeQhL7 

I wonder if we could dig up more from a talosctl support bundle.


jfroy commented Oct 22, 2024

Thanks for looking into it some more. I will get a bundle done and also investigate more on my own. I can definitely reproduce it on that node in my cluster, so hopefully I can root-cause it.

jfroy changed the title from "LVM activation is not entirely working" to "LVM activation controller needs to wait for udevd" on Oct 24, 2024

jfroy commented Oct 24, 2024

OK, it turns out the root cause was pretty straightforward. The LVM activation controller can start activating volumes before udevd is running. Those volumes will not have their /dev nodes created, because node creation is performed by udevd in response to LVM activation events.

See the log below. Two activations happen before udevd is running; those are the two volumes missing their nodes for that particular boot sequence.

192.168.1.13: user: warning: [2024-10-23T21:58:15.064749698Z]: [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "META", "phase": "waiting -> ready"}
192.168.1.13: user: warning: [2024-10-23T21:58:15.086447698Z]: [talos] task writeIMAPolicy (1/1): done, 166.680011ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.090304698Z]: [talos] phase integrity (2/12): done, 183.238453ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.094078698Z]: [talos] phase etc (3/12): 3 tasks(s)
192.168.1.13: user: warning: [2024-10-23T21:58:15.097845698Z]: [talos] task setUserEnvVars (3/3): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.101460698Z]: [talos] task setUserEnvVars (3/3): done, 3.630649ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.104960698Z]: [talos] task CreateSystemCgroups (1/3): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.108488698Z]: [talos] task createOSReleaseFile (2/3): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.112089698Z]: [talos] task CreateSystemCgroups (1/3): using cgroups root: /
192.168.1.13: user: warning: [2024-10-23T21:58:15.115654698Z]: [talos] task createOSReleaseFile (2/3): done, 14.329314ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.136053698Z]: [talos] activating LVM volume {"component": "controller-runtime", "controller": "block.LVMActivationController", "name": "ceph-c033ff07-8e3d-439b-94bf-2b8f94445df4"}
192.168.1.13: user: warning: [2024-10-23T21:58:15.156689698Z]: [talos] task CreateSystemCgroups (1/3): done, 58.826575ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.160161698Z]: [talos] phase etc (3/12): done, 66.085972ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.163461698Z]: [talos] phase earlyServices (4/12): 4 tasks(s)
192.168.1.13: user: warning: [2024-10-23T21:58:15.166679698Z]: [talos] task startContainerd (4/4): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.169868698Z]: [talos] service[containerd](Starting): Starting service
192.168.1.13: user: warning: [2024-10-23T21:58:15.172979698Z]: [talos] task startUdevd (1/4): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.176138698Z]: [talos] task startMachined (2/4): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.179168698Z]: [talos] task startSyslogd (3/4): starting
192.168.1.13: user: warning: [2024-10-23T21:58:15.182115698Z]: [talos] service[containerd](Preparing): Running pre state
192.168.1.13: user: warning: [2024-10-23T21:58:15.185049698Z]: [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "STATE", "phase": "waiting -> failed", "error": "volume is encrypted, but no encryption config provided"}
192.168.1.13: user: warning: [2024-10-23T21:58:15.193783698Z]: [talos] service[containerd](Preparing): Creating service runner
192.168.1.13: user: warning: [2024-10-23T21:58:15.196659698Z]: [talos] service[udevd](Starting): Starting service
192.168.1.13: user: warning: [2024-10-23T21:58:15.199467698Z]: [talos] service[udevd](Preparing): Running pre state
192.168.1.13: user: warning: [2024-10-23T21:58:15.202199698Z]: [talos] service[syslogd](Starting): Starting service
192.168.1.13: user: warning: [2024-10-23T21:58:15.204872698Z]: [talos] service[containerd](Running): Process Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"]) started with PID 1896
192.168.1.13: user: warning: [2024-10-23T21:58:15.212845698Z]: [talos] service[syslogd](Waiting): Waiting for service "machined" to be "up"
192.168.1.13: user: warning: [2024-10-23T21:58:15.215571698Z]: [talos] task startSyslogd (3/4): done, 38.144554ms
192.168.1.13: user: warning: [2024-10-23T21:58:15.218198698Z]: [talos] service[machined](Starting): Starting service
192.168.1.13: user: warning: [2024-10-23T21:58:15.220825698Z]: [talos] service[machined](Preparing): Running pre state
192.168.1.13: user: warning: [2024-10-23T21:58:15.223438698Z]: [talos] service[udevd](Preparing): Creating service runner
192.168.1.13: user: warning: [2024-10-23T21:58:15.226068698Z]: [talos] service[machined](Preparing): Creating service runner
192.168.1.13: user: warning: [2024-10-23T21:58:15.228763698Z]: [talos] service[udevd](Running): Process Process(["/sbin/udevd" "--resolve-names=never"]) started with PID 1902
192.168.1.13: user: warning: [2024-10-23T21:58:15.234298698Z]: [talos] service[machined](Running): Service started as goroutine
192.168.1.13: user: warning: [2024-10-23T21:58:15.296740698Z]: [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "STATE", "phase": "failed -> failed", "error": "volume is encrypted, but no encryption config provided"}
192.168.1.13: user: warning: [2024-10-23T21:58:15.407836698Z]: [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "STATE", "phase": "failed -> failed", "error": "volume is encrypted, but no encryption config provided"}
192.168.1.13: user: warning: [2024-10-23T21:58:15.454934698Z]: [talos] activating LVM volume {"component": "controller-runtime", "controller": "block.LVMActivationController", "name": "ceph-cd59b9e6-3943-4927-b7a5-c95d920a35ed"}
192.168.1.13: daemon:    info: [2024-10-23T21:58:15.513383698Z]: udevd[1902]: starting version 3.2.14
192.168.1.13: daemon:    info: [2024-10-23T21:58:15.520355698Z]: udevd[1902]: starting eudev-3.2.14


smira commented Oct 24, 2024

Oh, thanks for digging into this; it should be easy to fix!


smira commented Oct 24, 2024

@jfroy not totally related, but what is the NVMe hardware in your box?

(or talosctl get pcidevices)

I wonder how they come into existence before udevd starts (as it should load the required modules)


jfroy commented Oct 24, 2024

> @jfroy not totally related, but what is the NVMe hardware in your box?
>
> (or talosctl get pcidevices)
>
> I wonder how they come into existence before udevd starts (as it should load the required modules)

Ah, my kernel has most drivers built in rather than as modules. I arguably should modularize it a bit more, but the ordering issue should probably still be fixed.

I checked and indeed your kernel config has CONFIG_BLK_DEV_NVME=m and mine has CONFIG_BLK_DEV_NVME=y.
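
For reference, the kernel config difference in question (the values are as quoted above; the comments are my interpretation):

```
# Talos upstream kernel: NVMe driver built as a module, loaded on demand
# by udevd, so NVMe block devices cannot appear before udevd runs.
CONFIG_BLK_DEV_NVME=m

# Custom kernel in this report: driver built in, so NVMe devices (and the
# LVM PVs on them) exist before udevd starts, exposing the ordering bug.
CONFIG_BLK_DEV_NVME=y
```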

They are Micron 7300 Pro [8x].

@smira
Copy link
Member

smira commented Oct 24, 2024

OK, that explains why I can't reproduce it, but we'll still get a fix in for the next 1.8.2 release. Thank you!

smira added a commit to smira/talos that referenced this issue Oct 25, 2024
Fixes siderolabs#9505

Signed-off-by: Andrey Smirnov <[email protected]>
(cherry picked from commit b7801df)
utkuozdemir pushed a commit to utkuozdemir/talos that referenced this issue Oct 25, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 25, 2024