Losing quorum as soon as a node goes down #162

Ulrar · 2023-12-07T09:23:10Z

Hi,

I have 3 nodes, and a placementCount of 2. After quite a bit of fiddling, the third node got 'TieBreaker' volumes (or Diskless, for some) set up on it, so I'd assume I'm okay to lose one node.

But sadly, as soon as any of the nodes goes down, I lose quorum and the remaining two nodes get tainted with drbd.linbit.com/lost-quorum:NoSchedule.

╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node          ┊ Port ┊ Usage  ┊ Conns                     ┊      State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-1a9e5a5e-fdba-4b8e-ae9f-1a7acd048184 ┊ talos-00r-fu9 ┊ 7001 ┊ Unused ┊ Connecting(talos-ozt-z3h) ┊   Diskless ┊ 2023-11-19 15:36:20 ┊
┊ pvc-1a9e5a5e-fdba-4b8e-ae9f-1a7acd048184 ┊ talos-813-fn2 ┊ 7001 ┊ InUse  ┊ Connecting(talos-ozt-z3h) ┊   UpToDate ┊ 2023-11-07 18:03:48 ┊
┊ pvc-1a9e5a5e-fdba-4b8e-ae9f-1a7acd048184 ┊ talos-ozt-z3h ┊ 7001 ┊        ┊                           ┊    Unknown ┊ 2023-10-27 12:04:28 ┊
┊ pvc-56924ed3-7815-4655-9536-6b64792182ca ┊ talos-00r-fu9 ┊ 7004 ┊ Unused ┊ Connecting(talos-ozt-z3h) ┊   Diskless ┊ 2023-11-19 15:36:23 ┊
┊ pvc-56924ed3-7815-4655-9536-6b64792182ca ┊ talos-813-fn2 ┊ 7004 ┊ Unused ┊ Connecting(talos-ozt-z3h) ┊   UpToDate ┊ 2023-11-19 09:47:12 ┊
┊ pvc-56924ed3-7815-4655-9536-6b64792182ca ┊ talos-ozt-z3h ┊ 7004 ┊        ┊                           ┊    Unknown ┊ 2023-10-27 12:04:32 ┊
┊ pvc-86499a05-3ba9-4722-9bb1-69ae47406263 ┊ talos-00r-fu9 ┊ 7005 ┊ Unused ┊ Connecting(talos-ozt-z3h) ┊ TieBreaker ┊ 2023-11-19 15:36:23 ┊
┊ pvc-86499a05-3ba9-4722-9bb1-69ae47406263 ┊ talos-813-fn2 ┊ 7005 ┊ InUse  ┊ Connecting(talos-ozt-z3h) ┊   UpToDate ┊ 2023-11-12 18:11:43 ┊
┊ pvc-86499a05-3ba9-4722-9bb1-69ae47406263 ┊ talos-ozt-z3h ┊ 7005 ┊        ┊                           ┊    Unknown ┊ 2023-11-12 18:11:43 ┊
┊ pvc-c7bdfa9e-e3c2-4dd3-ac9c-b7b2e847d30b ┊ talos-00r-fu9 ┊ 7003 ┊ Unused ┊ Connecting(talos-ozt-z3h) ┊   Diskless ┊ 2023-11-19 15:36:23 ┊
┊ pvc-c7bdfa9e-e3c2-4dd3-ac9c-b7b2e847d30b ┊ talos-813-fn2 ┊ 7003 ┊ Unused ┊ Connecting(talos-ozt-z3h) ┊   UpToDate ┊ 2023-11-07 18:03:50 ┊
┊ pvc-c7bdfa9e-e3c2-4dd3-ac9c-b7b2e847d30b ┊ talos-ozt-z3h ┊ 7003 ┊        ┊                           ┊    Unknown ┊ 2023-10-27 12:04:33 ┊
┊ pvc-e57930e5-6772-41e4-8c98-99105b77970a ┊ talos-00r-fu9 ┊ 7002 ┊ Unused ┊ Connecting(talos-ozt-z3h) ┊   Diskless ┊ 2023-11-19 15:36:23 ┊
┊ pvc-e57930e5-6772-41e4-8c98-99105b77970a ┊ talos-813-fn2 ┊ 7002 ┊ InUse  ┊ Connecting(talos-ozt-z3h) ┊   UpToDate ┊ 2023-11-07 18:03:49 ┊
┊ pvc-e57930e5-6772-41e4-8c98-99105b77970a ┊ talos-ozt-z3h ┊ 7002 ┊        ┊                           ┊    Unknown ┊ 2023-10-27 12:04:33 ┊
┊ pvc-fbdf5c3c-2d49-49b8-ac10-f8e1212c7788 ┊ talos-00r-fu9 ┊ 7000 ┊ Unused ┊ Connecting(talos-ozt-z3h) ┊ TieBreaker ┊ 2023-11-19 15:36:23 ┊
┊ pvc-fbdf5c3c-2d49-49b8-ac10-f8e1212c7788 ┊ talos-813-fn2 ┊ 7000 ┊ InUse  ┊ Connecting(talos-ozt-z3h) ┊   UpToDate ┊ 2023-11-08 08:47:17 ┊
┊ pvc-fbdf5c3c-2d49-49b8-ac10-f8e1212c7788 ┊ talos-ozt-z3h ┊ 7000 ┊        ┊                           ┊    Unknown ┊ 2023-10-27 12:05:17 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

I have no idea why the above leads to losing quorum; there are clearly two connected nodes (even if one is the TieBreaker).

I'm not sure what I'm doing wrong, but tainting the nodes like that makes recovery pretty difficult, as most pods won't get re-scheduled. Depending on what went down, I sometimes have to manually untaint a node to let pods come back up, then slowly recover by hand, using drbdadm to decide which copy to keep for every volume.

Thanks

WanzenBug · 2023-12-07T09:53:43Z

Have you checked with drbdsetup status on the remaining nodes that they indeed have quorum? If they do have it, it seems like a bug in the HA controller.

Ulrar · 2023-12-07T10:02:26Z

Yes, they do lose quorum. For example, just now:

pvc-e57930e5-6772-41e4-8c98-99105b77970a role:Secondary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  talos-00r-fu9 role:Secondary
    peer-disk:Diskless
  talos-813-fn2 connection:Connecting

It has an UpToDate and a Diskless node, and yet it thinks it lost quorum. That's the only volume that lost quorum; the other ones look the same but with quorum, and the local node became Primary. Maybe it has something to do with that specific volume.
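Checking every volume by eye gets tedious, so here is a minimal sketch that scans text shaped like the drbdsetup status output above and lists the resources reporting quorum:no. The function name and the sample constant are hypothetical, and the parsing assumes only the layout seen in this thread (unindented resource line, indented detail lines).

```python
# Sample copied from the drbdsetup status output quoted in this thread.
SAMPLE = """\
pvc-e57930e5-6772-41e4-8c98-99105b77970a role:Secondary suspended:quorum
  disk:UpToDate quorum:no blocked:upper
  talos-00r-fu9 role:Secondary
    peer-disk:Diskless
  talos-813-fn2 connection:Connecting
"""


def resources_without_quorum(status_text: str) -> list[str]:
    """Return the names of resources that have a 'quorum:no' token."""
    lost = []
    current = None
    for line in status_text.splitlines():
        if not line.strip():
            continue  # blank lines separate resource blocks
        if not line[0].isspace():
            # Unindented lines start a new resource: "<name> role:<role> ..."
            current = line.split()[0]
        elif current is not None and "quorum:no" in line.split():
            if current not in lost:
                lost.append(current)
    return lost


if __name__ == "__main__":
    for name in resources_without_quorum(SAMPLE):
        print(name)
```

In practice the text would come from running drbdsetup status on each remaining node; how that output is collected is left out here.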

WanzenBug · 2023-12-07T10:11:36Z

Very weird. Probably something for the DRBD folks to look at.

If you just want to disable the taints, you can disable the HA Controller since 2.3.0: https://github.com/piraeusdatastore/piraeus-operator/blob/v2/docs/reference/linstorcluster.md#spechighavailabilitycontroller

Ulrar · 2023-12-07T18:04:51Z

It looks like the TieBreaker / Diskless node doesn't count towards the quorum when changing primary, so if the Primary for a volume goes down (even cleanly, it appears) the other one can't become primary anymore, and goes into a lost quorum state.

That is probably a DRBD issue, but when the primary goes down cleanly, I wonder if the operator could make sure the secondary switches first, while it still has quorum?
Or maybe I should just go to a placement count of 3 to avoid this.
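For reference, the expectation behind this thread can be written down as a small model. This is a deliberately simplified sketch of a majority rule with a diskless tiebreaker, not DRBD's actual implementation, which applies further conditions that are not modelled here. The function name and parameters are hypothetical.

```python
def expected_quorum(diskful_total: int, diskful_reachable: int,
                    tiebreaker_reachable: bool) -> bool:
    """Simplified majority rule (an assumption, not DRBD's real logic).

    diskful_total: number of replicas with a backing disk.
    diskful_reachable: diskful replicas in this partition, including the
        local node if it has a disk.
    tiebreaker_reachable: whether a diskless tiebreaker is connected.
    """
    if diskful_reachable * 2 > diskful_total:
        return True  # strict majority of diskful replicas
    if diskful_reachable * 2 == diskful_total:
        return tiebreaker_reachable  # exact tie: the tiebreaker decides
    return False  # minority partition


if __name__ == "__main__":
    # placementCount 2 plus tiebreaker, one diskful node down.
    print(expected_quorum(2, 1, True))
    # Same situation without a reachable tiebreaker.
    print(expected_quorum(2, 1, False))
```

Under this model, a surviving diskful node that can still reach the tiebreaker would be expected to keep quorum, which is why the observed quorum:no is surprising.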

Ulrar · 2024-08-30T15:28:12Z

Nothing fancy, I'm using 3 Talos nodes, with scheduling on control plane nodes (since there's only 3 nodes) and a replica 3.

But this actually seems to have fixed itself; I suspect DRBD 9.2.9 is what did it. At least, I used to run into this all the time, and since that upgrade I haven't seen it once, so I think this was it:

  - Fix a kernel crash that is sometimes triggered when downing drbd
    resources in a specific, unusual order (was triggered by the
    Kubernetes CSI driver)

WanzenBug · 2024-09-02T07:41:43Z

Check that the right DRBD version is in use: cat /proc/drbd should report a version > 9.0.0.
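The version check above can be scripted. The following is a minimal sketch that assumes /proc/drbd contains a line of the form "version: X.Y.Z ..."; the helper names are hypothetical, and the sample strings used with it are illustrations rather than output captured from this cluster.

```python
import re
from pathlib import Path


def drbd_version(proc_drbd_text: str):
    """Extract (major, minor, patch) from /proc/drbd-style text, or None."""
    match = re.search(r"version:\s*(\d+)\.(\d+)\.(\d+)", proc_drbd_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_newer_than_9_0_0(proc_drbd_text: str) -> bool:
    """True if the reported version is strictly greater than 9.0.0."""
    version = drbd_version(proc_drbd_text)
    return version is not None and version > (9, 0, 0)


if __name__ == "__main__":
    proc_file = Path("/proc/drbd")
    if proc_file.exists():
        print(is_newer_than_9_0_0(proc_file.read_text()))
    else:
        print("/proc/drbd not found; is the DRBD module loaded?")
```

Comparing tuples of integers avoids the pitfalls of comparing version strings, where "9.10.0" would sort before "9.2.9".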

I'm not sure what exact steps you run when you cordon, could you please elaborate a bit on that?

Ulrar mentioned this issue Dec 13, 2023

Pods crashing and disconnections during resync piraeusdatastore/piraeus-operator#579

Open
