
Etcd action to recover from majority failure #177

Draft: wants to merge 5 commits into main

Conversation

justinmclark

Bug: [1]

A few questions before I convert this to a real PR:

  1. I wanted some way of waiting for the configuration to be loaded before writing over the changed config parameter force-new-cluster. Waiting a sufficiently high number of seconds would also work, but it didn't seem like a super robust solution to me. Does this make sense?
  2. I hard-coded a timeout, but this could be a parameter to the action. Does this seem necessary? Through my testing, I didn't encounter any situation where we were even remotely close to the timeout, but I figured I'd add it just in case there's some weird reason the other loop condition is not met (changing log text in the future?).
  3. ETCD_CONFIG (line 14) just matches what we see in [2] - is there a better way of defining this?

[1] https://bugs.launchpad.net/charm-etcd/+bug/1842332
[2]

default: /var/snap/etcd/common
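For question 3, here is a minimal sketch of deriving the path from charm configuration rather than hard-coding it; the option name etcd_conf_dir and the etcd.conf.yml filename are assumptions for illustration, not taken from this PR.

# Hypothetical sketch: look up the snap's config directory from charm config
# instead of hard-coding it; 'etcd_conf_dir' and 'etcd.conf.yml' are assumed names.
ETCD_CONF_DIR=$(config-get etcd_conf_dir)   # default: /var/snap/etcd/common
ETCD_CONFIG="${ETCD_CONF_DIR}/etcd.conf.yml"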

kwmonroe (Contributor) left a comment
Looks mostly good, but I'm curious if we should handle the cases where a restart doesn't show the journalctl messages that we expect, and whether or not we should restart again once we set force-new-cluster back to false.

# before changing 'force-new-cluster' back to false
# if $TIMEOUT seconds have passed, break from the loop
LOADED_CONF=0
while [ $LOADED_CONF -eq 0 ] && [ $(($(date +%s) - $START)) -lt $TIMEOUT ]
kwmonroe (Contributor) commented Aug 7, 2020

i see you punting here to the developers in the year 2038. nice.

Edit: i'm just razzing you here; this is fine.

done

# Step 3: set force-new-cluster back to false in the config file
sed -i 's/force-new-cluster.*/force-new-cluster: false/g' $ETCD_CONFIG
Contributor
Do we need to do another restart/wait (and potential fail) if the cluster doesn't come back with the new config?

justinmclark (Author)
I'm not sure about this one. Part of me thinks if it doesn't work the first time, then something else might be wrong and some other action is necessary.

I could be persuaded either way. You're much more familiar with this, though, so I can definitely add something if you think it's needed.
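For what a second restart-and-verify pass could look like, here is a rough sketch; the systemd service name, etcdctl path, and retry/sleep values are assumptions, not part of this PR, and "cluster is healthy" is the summary line etcdctl v2 prints for a healthy cluster.

# Hypothetical sketch of a second restart/verify pass after force-new-cluster
# has been reverted; service name and retry/sleep values are assumptions.
systemctl restart snap.etcd.etcd
for i in $(seq 1 30); do
    if /snap/bin/etcd.etcdctl cluster-health 2>/dev/null | grep -q 'cluster is healthy'; then
        exit 0
    fi
    sleep 10
done
action-fail "cluster did not report healthy after reverting force-new-cluster"
exit 1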

Cynerva (Contributor) commented Aug 7, 2020

  1. I wanted some way of waiting for the configuration to be loaded before writing over the changed config parameter force-new-cluster. Waiting a sufficiently high number of seconds would also work, but it didn't seem like a super robust solution to me. Does this make sense?

I think what you've done makes sense. Better than "sleep and hope". The only other idea I have would be to check etcdctl member list or something like that, but that doesn't seem any better than what you've done, really.

  2. I hard-coded a timeout, but this could be a parameter to the action. Does this seem necessary? Through my testing, I didn't encounter any situation where we were even remotely close to the timeout, but I figured I'd add it just in case there's some weird reason the other loop condition is not met (changing log text in the future?).

Good call. 10 minutes seems plenty. I don't see a need to parameterize the timeout, but then again, it might save someone who's in a weird situation somewhere down the line. If it's not too much trouble, I would do it.

  3. ETCD_CONFIG (line 14) just matches what we see in [2] - is there a better way of defining this?

I don't think so.
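Combining the two ideas above (an etcdctl-based readiness check and a parameterized timeout), a rough sketch might look like the following; the 'timeout' action parameter, its default, and the etcdctl path are assumptions for illustration only.

# Hypothetical sketch: poll 'etcdctl member list' instead of grepping journalctl,
# with the timeout read from an (assumed) 'timeout' action parameter.
TIMEOUT=$(action-get timeout)   # e.g. default: 600
START=$(date +%s)
until /snap/bin/etcd.etcdctl member list >/dev/null 2>&1; do
    if [ $(($(date +%s) - START)) -ge "$TIMEOUT" ]; then
        action-fail "etcd did not answer 'member list' within ${TIMEOUT}s"
        exit 1
    fi
    sleep 5
done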

justinmclark (Author)

@kwmonroe / @Cynerva I like the idea of using etcdctl directly and tried out member list and cluster-health as alternatives to the current loop criteria. In testing this out, I found something potentially problematic but probably unrelated to this.

Basically, juju status shows that the cluster is healthy, but etcdctl cluster-health shows something else.

$ juju status
Model    Controller           Cloud/Region         Version  SLA          Timestamp
default  localhost-localhost  localhost/localhost  2.8.1    unsupported  10:07:28-04:00

App      Version  Status  Scale  Charm    Store       Rev  OS      Notes
easyrsa  3.0.1    active      1  easyrsa  jujucharms  318  ubuntu  
etcd     3.3.19   active      3  etcd     local         0  ubuntu  

Unit        Workload  Agent  Machine  Public address  Ports     Message
easyrsa/0*  active    idle   0        10.56.64.110              Certificate Authority connected.
etcd/0*     active    idle   1        10.56.64.32     2379/tcp  Healthy with 3 known peers
etcd/11     active    idle   12       10.56.64.243    2379/tcp  Healthy with 3 known peers
etcd/12     active    idle   13       10.56.64.254    2379/tcp  Healthy with 3 known peers

Machine  State    DNS           Inst id         Series  AZ  Message
0        started  10.56.64.110  juju-89af4b-0   bionic      Running
1        started  10.56.64.32   juju-89af4b-1   bionic      Running
12       started  10.56.64.243  juju-89af4b-12  bionic      Running
13       started  10.56.64.254  juju-89af4b-13  bionic      Running

Then SSH into machine 1:

ubuntu@juju-89af4b-1:~$ /snap/bin/etcd.etcdctl cluster-health
failed to check the health of member 67110542bddd46f4 on https://10.56.64.243:2379: Get https://10.56.64.243:2379/health: x509: certificate signed by unknown authority
member 67110542bddd46f4 is unreachable: [https://10.56.64.243:2379] are all unreachable
failed to check the health of member cd8d8b5931626662 on https://10.56.64.32:2379: Get https://10.56.64.32:2379/health: x509: certificate signed by unknown authority
member cd8d8b5931626662 is unreachable: [https://10.56.64.32:2379] are all unreachable
failed to check the health of member d5d2d06a22f29594 on https://10.56.64.254:2379: Get https://10.56.64.254:2379/health: x509: certificate signed by unknown authority
member d5d2d06a22f29594 is unreachable: [https://10.56.64.254:2379] are all unreachable

Looking further, I noticed that the criteria used for the juju status output might not cover this case, since the only thing checked is whether etcdctl['status'] contains "Unhealthy" or not [1].

I'm mostly just pointing this out to you both since it's a fairly hidden problem and might cause unknown issues.

[1] https://github.com/justinmclark/layer-etcd/blob/933fa7ce5bbe1536ec815ca8869a5c3a65da5352/reactive/etcd.py#L99
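As a side note on that check, one way to avoid reporting a healthy cluster when etcdctl itself cannot authenticate is to rely on cluster-health's exit status (it exits non-zero when it cannot confirm a healthy cluster) while passing the charm's TLS material explicitly. The flag names below are the standard etcdctl v2 global options; the certificate paths are assumptions for illustration.

# Hypothetical cross-check: pass the CA and client cert/key explicitly (paths
# are assumptions) and use the exit status, so a TLS failure like the x509
# error above is not mistaken for a healthy cluster.
if /snap/bin/etcd.etcdctl \
    --ca-file /var/snap/etcd/common/ca.crt \
    --cert-file /var/snap/etcd/common/client.crt \
    --key-file /var/snap/etcd/common/client.key \
    cluster-health >/dev/null 2>&1; then
    echo "cluster healthy"
else
    echo "cluster unhealthy or unreachable with these credentials"
fi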
