
Etcd action to recover from majority failure #177

Draft: wants to merge 5 commits into main

Conversation

justinmclark

Bug: [1]

A few questions before I convert this to a real PR:

  1. I wanted some way of waiting for the configuration to be loaded before writing over the changed config parameter force-new-cluster. Waiting a sufficiently high number of seconds would also work, but it didn't seem like a super robust solution to me. Does this make sense?
  2. I hard-coded a timeout, but this could be a parameter to the action. Does this seem necessary? Through my testing, I didn't encounter any situation where we were even remotely close to the timeout, but I figured I'd add it just in case there's some weird reason the other loop condition is not met (changing log text in the future?).
  3. ETCD_CONFIG (line 14) just matches what we see in [2] - is there a better way of defining this?

[1] https://bugs.launchpad.net/charm-etcd/+bug/1842332
[2]

default: /var/snap/etcd/common
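For question 3, here is a minimal sketch of deriving the path from charm configuration rather than hard-coding it; the option name etcd_conf_dir and the etcd.conf.yml filename are assumptions for illustration, not taken from this PR.

# Hypothetical sketch: look up the snap's config directory from charm config
# instead of hard-coding it; 'etcd_conf_dir' and 'etcd.conf.yml' are assumed names.
ETCD_CONF_DIR=$(config-get etcd_conf_dir)   # default: /var/snap/etcd/common
ETCD_CONFIG="${ETCD_CONF_DIR}/etcd.conf.yml"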

kwmonroe (Contributor) left a comment
Looks mostly good, but I'm curious if we should handle the cases where a restart doesn't show the journalctl messages that we expect, and whether or not we should restart again once we set force-new-cluster back to false.

# before changing 'force-new-cluster' back to false
# if $TIMEOUT seconds have passed, break from the loop
LOADED_CONF=0
while [ $LOADED_CONF -eq 0 ] && [ $(($(date +%s) - $START)) -lt $TIMEOUT ]
kwmonroe (Contributor) commented Aug 7, 2020

i see you punting here to the developers in the year 2038. nice.

Edit: i'm just razzing you here; this is fine.

done

# Step 3: set force-new-cluster back to false in the config file
sed -i 's/force-new-cluster.*/force-new-cluster: false/g' $ETCD_CONFIG
Contributor
Do we need to do another restart/wait (and potential fail) if the cluster doesn't come back with the new config?

justinmclark (Author)
I'm not sure about this one. Part of me thinks if it doesn't work the first time, then something else might be wrong and some other action is necessary.

I could be persuaded either way. You're much more familiar with this, though, so I can definitely add something if you think it's needed.
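For what a second restart-and-verify pass could look like, here is a rough sketch; the systemd service name, etcdctl path, and retry/sleep values are assumptions, not part of this PR, and "cluster is healthy" is the summary line etcdctl v2 prints for a healthy cluster.

# Hypothetical sketch of a second restart/verify pass after force-new-cluster
# has been reverted; service name and retry/sleep values are assumptions.
systemctl restart snap.etcd.etcd
for i in $(seq 1 30); do
    if /snap/bin/etcd.etcdctl cluster-health 2>/dev/null | grep -q 'cluster is healthy'; then
        exit 0
    fi
    sleep 10
done
action-fail "cluster did not report healthy after reverting force-new-cluster"
exit 1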

Cynerva (Contributor) commented Aug 7, 2020

  1. I wanted some way of waiting for the configuration to be loaded before writing over the changed config parameter force-new-cluster. Waiting a sufficiently high number of seconds would also work, but it didn't seem like a super robust solution to me. Does this make sense?

I think what you've done makes sense. Better than "sleep and hope". The only other idea I have would be to check etcdctl member list or something like that, but that doesn't seem any better than what you've done, really.

  2. I hard-coded a timeout, but this could be a parameter to the action. Does this seem necessary? Through my testing, I didn't encounter any situation where we were even remotely close to the timeout, but I figured I'd add it just in case there's some weird reason the other loop condition is not met (changing log text in the future?).

Good call. 10 minutes seems plenty. I don't see a need to parameterize the timeout, but then again, it might save someone who's in a weird situation somewhere down the line. If it's not too much trouble, I would do it.

  3. ETCD_CONFIG (line 14) just matches what we see in [2] - is there a better way of defining this?

I don't think so.
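Combining the two ideas above (an etcdctl-based readiness check and a parameterized timeout), a rough sketch might look like the following; the 'timeout' action parameter, its default, and the etcdctl path are assumptions for illustration only.

# Hypothetical sketch: poll 'etcdctl member list' instead of grepping journalctl,
# with the timeout read from an (assumed) 'timeout' action parameter.
TIMEOUT=$(action-get timeout)   # e.g. default: 600
START=$(date +%s)
until /snap/bin/etcd.etcdctl member list >/dev/null 2>&1; do
    if [ $(($(date +%s) - START)) -ge "$TIMEOUT" ]; then
        action-fail "etcd did not answer 'member list' within ${TIMEOUT}s"
        exit 1
    fi
    sleep 5
done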

justinmclark (Author)

@kwmonroe / @Cynerva I like the idea of using etcdctl directly and tried out member list and cluster-health as alternatives to the current loop criteria. In testing this out, I found something potentially problematic but probably unrelated to this.

Basically, juju status shows that the cluster is healthy, but etcdctl cluster-health shows something else.

$ juju status
Model    Controller           Cloud/Region         Version  SLA          Timestamp
default  localhost-localhost  localhost/localhost  2.8.1    unsupported  10:07:28-04:00

App      Version  Status  Scale  Charm    Store       Rev  OS      Notes
easyrsa  3.0.1    active      1  easyrsa  jujucharms  318  ubuntu  
etcd     3.3.19   active      3  etcd     local         0  ubuntu  

Unit        Workload  Agent  Machine  Public address  Ports     Message
easyrsa/0*  active    idle   0        10.56.64.110              Certificate Authority connected.
etcd/0*     active    idle   1        10.56.64.32     2379/tcp  Healthy with 3 known peers
etcd/11     active    idle   12       10.56.64.243    2379/tcp  Healthy with 3 known peers
etcd/12     active    idle   13       10.56.64.254    2379/tcp  Healthy with 3 known peers

Machine  State    DNS           Inst id         Series  AZ  Message
0        started  10.56.64.110  juju-89af4b-0   bionic      Running
1        started  10.56.64.32   juju-89af4b-1   bionic      Running
12       started  10.56.64.243  juju-89af4b-12  bionic      Running
13       started  10.56.64.254  juju-89af4b-13  bionic      Running

Then SSH into machine 1:

ubuntu@juju-89af4b-1:~$ /snap/bin/etcd.etcdctl cluster-health
failed to check the health of member 67110542bddd46f4 on https://10.56.64.243:2379: Get https://10.56.64.243:2379/health: x509: certificate signed by unknown authority
member 67110542bddd46f4 is unreachable: [https://10.56.64.243:2379] are all unreachable
failed to check the health of member cd8d8b5931626662 on https://10.56.64.32:2379: Get https://10.56.64.32:2379/health: x509: certificate signed by unknown authority
member cd8d8b5931626662 is unreachable: [https://10.56.64.32:2379] are all unreachable
failed to check the health of member d5d2d06a22f29594 on https://10.56.64.254:2379: Get https://10.56.64.254:2379/health: x509: certificate signed by unknown authority
member d5d2d06a22f29594 is unreachable: [https://10.56.64.254:2379] are all unreachable

Looking further, I noticed that the criteria used for the juju status output might not cover this case, since the only thing checked is whether etcdctl['status'] contains "Unhealthy" or not [1].

I'm mostly just pointing this out to you both since it's a fairly hidden problem and might cause unknown issues.

[1] https://github.com/justinmclark/layer-etcd/blob/933fa7ce5bbe1536ec815ca8869a5c3a65da5352/reactive/etcd.py#L99
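As a side note on that check, one way to avoid reporting a healthy cluster when etcdctl itself cannot authenticate is to rely on cluster-health's exit status (it exits non-zero when it cannot confirm a healthy cluster) while passing the charm's TLS material explicitly. The flag names below are the standard etcdctl v2 global options; the certificate paths are assumptions for illustration.

# Hypothetical cross-check: pass the CA and client cert/key explicitly (paths
# are assumptions) and use the exit status, so a TLS failure like the x509
# error above is not mistaken for a healthy cluster.
if /snap/bin/etcd.etcdctl \
    --ca-file /var/snap/etcd/common/ca.crt \
    --cert-file /var/snap/etcd/common/client.crt \
    --key-file /var/snap/etcd/common/client.key \
    cluster-health >/dev/null 2>&1; then
    echo "cluster healthy"
else
    echo "cluster unhealthy or unreachable with these credentials"
fi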
