Etcd action to recover from majority failure #177
Looks mostly good, but I'm curious if we should handle the cases where a restart doesn't show the journalctl messages that we expect, and whether or not we should restart again once we set force-new-cluster back to false.
# before changing 'force-new-cluster' back to false
# if $TIMEOUT seconds have passed, break from the loop
LOADED_CONF=0
while [ $LOADED_CONF -eq 0 ] && [ $(($(date +%s) - $START)) -lt $TIMEOUT ]
I see you punting here to the developers in the year 2038. Nice.
Edit: I'm just razzing you here; this is fine.
done

# Step 3
sed -i 's/force-new-cluster.*/force-new-cluster: false/g' $ETCD_CONFIG
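For illustration, Step 3 can be wrapped so that the script confirms the edit actually landed before moving on. This is a minimal sketch, not the code from this PR: the helper name reset_force_new_cluster and the throwaway demo file are assumptions, and it relies on GNU sed for the -i flag.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): flip force-new-cluster back to
# false in the given config file and confirm the change is present.
# Assumes GNU sed, which supports in-place editing with -i.
reset_force_new_cluster() {
    config_path=$1
    sed -i 's/force-new-cluster.*/force-new-cluster: false/g' "$config_path"
    # Succeed only if a line now ends with the flag set to false
    grep -q 'force-new-cluster: false$' "$config_path"
}

# Demo on a throwaway file rather than the real $ETCD_CONFIG
demo_config=$(mktemp)
printf 'name: etcd0\nforce-new-cluster: true\n' > "$demo_config"
if reset_force_new_cluster "$demo_config"; then
    echo "flag reset"
else
    echo "flag not found in $demo_config"
fi
rm -f "$demo_config"
```

Returning the grep status gives the caller a way to fail the action early if the key was missing from the config, instead of silently continuing.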
Do we need to do another restart/wait (and potential fail) if the cluster doesn't come back with the new config?
I'm not sure about this one. Part of me thinks if it doesn't work the first time, then something else might be wrong and some other action is necessary.
I could be persuaded either way. You're much more familiar with this, though, so I can definitely add something if you think it's needed.
I think what you've done makes sense. Better than "sleep and hope". The only other idea I have would be to check
Good call. 10 minutes seems plenty. I don't see a need to parameterize the timeout, but then again, it might save someone who's in a weird situation somewhere down the line. If it's not too much trouble, I would do it.
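A parameterized timeout along these lines could look like the sketch below. It is only an illustration under stated assumptions: the wait_until helper and the TIMEOUT variable name are hypothetical, the 600-second default mirrors the 10 minutes mentioned above, and the polled condition is a stand-in for the real journalctl check.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): run a command repeatedly until it
# succeeds or the timeout (in seconds) elapses.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Returns 0 if the command succeeded, 1 if the timeout was reached first.
wait_until() {
    wu_timeout=$1
    wu_interval=$2
    shift 2
    wu_start=$(date +%s)
    while ! "$@"; do
        if [ $(( $(date +%s) - $wu_start )) -ge "$wu_timeout" ]; then
            return 1
        fi
        sleep "$wu_interval"
    done
    return 0
}

# TIMEOUT is an assumed parameter name; default to 10 minutes if unset.
TIMEOUT=${TIMEOUT:-600}

# Stand-in condition: a marker file that already exists, so this returns at once.
marker=$(mktemp)
if wait_until "$TIMEOUT" 1 test -e "$marker"; then
    echo "condition met"
else
    echo "timed out after ${TIMEOUT}s"
fi
rm -f "$marker"
```

Keeping the condition as an arbitrary command means the same loop can wrap whatever log check the action ends up using, and the caller can tell a timeout apart from success by the return code.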
I don't think so.
@kwmonroe / @Cynerva I like the idea of using Basically,
Then SSH into
Looking further, I noticed that the criteria used for the
I'm mostly just pointing this out to you both since it's a fairly hidden problem and might cause unknown issues.
Bug: [1]
A few questions before I convert this to a real PR:
force-new-cluster. Waiting a sufficiently high number of seconds would also work, but it didn't seem like a super robust solution to me. Does this make sense?
ETCD_CONFIG (line 14) just matches what we see in [2] - is there a better way of defining this?
[1] https://bugs.launchpad.net/charm-etcd/+bug/1842332
[2] layer-etcd/layer.yaml, line 23 in aca040b