This repository has been archived by the owner on Feb 29, 2024. It is now read-only.
Work around OVN cluster failure during update when the schema changes.
During an update the ovndb server can have a schema change. The problem
is that an updated slave ovndb will not connect to a master which still
has the old db schema. After a timeout (200000ms) pacemaker puts the
resource in a Timed Out error state. It then waits for the operator to
clean up the resource.

This means that the update can go like this, where each node is written
as <node>-<role>-<schema> and the role is M (Master), S (Slave) or
F (Failed):

 - Original state: nothing updated
   - ctl0-M-old
   - ctl1-S-old
   - ctl2-S-old
 - First state: after update of ctl0
   - ctl0-F-new
   - ctl1-M-old
   - ctl2-S-old
 - Second state: after update of ctl1
   - ctl0-F-new
   - ctl1-F-new
   - ctl2-M-old
 - Third and final state: after update of ctl2
   - ctl0-F-new
   - ctl1-F-new
   - ctl2-M-new

During the third state we have a cut in the control plane, as ctl2 is
the master and there is no slave to fall back to. We then end up losing
HA as only one node is active. The error persists after reboot. Only a
pcs resource cleanup will bring the cluster back online.

The real solution will come from ovndb and the associated ocf agent,
but in the meantime we work around it with:

 - cleanup
 - ban the resource;

in step 1 and:

 - cleanup
 - unban the resource

in step 5.

This has the net effect of preventing the cut in the control plane for
the last node, as we move the master to an updated controller, which
forms a cluster of one master and one slave (as two nodes are updated).
The last node then happily joins once it is updated.

That means:

 - we always have either 1 or 2 nodes working;
 - we end the update with the cluster converged back to a stable state.

The problems are:

 - we could hide a real ovndb cluster issue;
 - if the update breaks midway, we could have a leftover ban on one of
   the nodes.

But, all things considered, this looks like the best compromise for the
time being.

Change-Id: I8f71bf83ddafca167deae1a38ca819f7d930fb80
Closes-Bug: #1847780
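
The failure sequence described in the message (without the workaround) can be replayed with a small toy model. This is only an illustrative sketch, not code from the change: the `update_node` and `fmt` helpers are hypothetical, and the rule that the first remaining slave is promoted when the master is updated is an assumption made to match the states listed above.

```python
# Toy model of the rolling update with no workaround applied.
# Roles: "M" master, "S" slave, "F" failed. Schemas: "old" / "new".
# All names here are hypothetical; this is not pacemaker or OVN code.

def update_node(cluster, name):
    """Update one node to the new schema and return the new cluster state."""
    state = {n: dict(v) for n, v in cluster.items()}  # work on a copy
    node = state[name]
    was_master = node["role"] == "M"
    node["schema"] = "new"

    if was_master:
        # Assumption: the first remaining slave is promoted to master.
        slaves = [n for n, v in state.items() if v["role"] == "S"]
        if slaves:
            state[slaves[0]]["role"] = "M"
            node["role"] = "S"  # the updated node comes back as a slave
        # With no slave left, the node stays master on the new schema.

    master = next(v for v in state.values() if v["role"] == "M")
    # An updated slave cannot connect to a master that still has the old
    # schema; the resource times out and is marked failed until cleanup.
    for v in state.values():
        if v["role"] == "S" and v["schema"] == "new" and master["schema"] == "old":
            v["role"] = "F"
    return state


def fmt(state):
    """Render the state as a list of <node>-<role>-<schema> strings."""
    return [f"{n}-{v['role']}-{v['schema']}" for n, v in state.items()]


cluster = {
    "ctl0": {"role": "M", "schema": "old"},
    "ctl1": {"role": "S", "schema": "old"},
    "ctl2": {"role": "S", "schema": "old"},
}

for name in ("ctl0", "ctl1", "ctl2"):
    cluster = update_node(cluster, name)
    active = sum(v["role"] != "F" for v in cluster.values())
    print(f"after update of {name}: {fmt(cluster)} ({active} active)")
```

In this model the cluster ends with a single active node and two failed ones, which is the loss of HA that the cleanup and ban/unban steps are meant to avoid.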