WARN: "update about identity with same prefix as ours, declaring it down" #30
Yeah, foca does gossip when leaving the cluster, but consuming the instance on `leave_cluster` […]. I think we should make […].

As for the traces: ack. I'm quite unhappy with their state atm. I think everything debug level and lower is ok, but the other ones have a tendency of getting in the way, and going down the route of filtering traces via subscribers and whatnot is annoying af. I'll try to come up with something less unpleasant in the future. For now, I'll lower the level of this one to Debug. I'll try to ship these along with something for #28 early this w/e.
I've shipped v0.14.0 with a rework on traces. Now foca never emits anything higher than DEBUG: at that level, only high-level traces are emitted (membership changes, probes, etc.), and the TRACE level exposes the innards (messages being sent, timer events, etc.). And as I write this I realize I forgot about changing […].
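(For anyone tuning this from the consuming side: a minimal sketch, assuming the application already uses `tracing-subscriber` with its env-filter feature, of scoping how much of foca's output gets through.)

```rust
use tracing_subscriber::EnvFilter;

fn init_tracing() {
    tracing_subscriber::fmt()
        // Keep the application at INFO; switch the foca directive to
        // "foca=debug" for the high-level membership traces, or "foca=trace"
        // to also see messages being sent and timer events.
        .with_env_filter(EnvFilter::new("info,foca=info"))
        .init();
}
```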
I won't see the noise anymore; however, I do notice double the cluster size for a while when I restart my cluster. I'm sort of attributing that to the unclean leave.
hmm... The leave is definitely not unclean, and since the higher counts go down after a while it suggests that the knowledge persists in the cluster. I've released v0.15.0 right now so you can see if gossiping a bunch after […] helps.

I wonder if persisting recent exits would help here: idk how you're doing the store/load of the state during restarts, but if you have a global storage you could also save "node X decided to leave at time T" and then feed the recent ones to `apply_many`.
The simplest way to explain how we do it is: before leaving the cluster, we iterate all members and store their serialized state per identity in sqlite. On start we only pick the Alive or Down states and `apply_many` them. We're replacing them entirely within a transaction, so there shouldn't be any stray members.
What do you mean by this? I can add a timestamp to the persisted states for sure. Or are you saying something else?
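(A rough sketch of the restore-on-start step described above. `MyId`, `load_members_from_sqlite`, `foca` and `runtime` are hypothetical stand-ins for the application's own types and plumbing; only `apply_many` and the `Member`/`State` types come from foca, and the `state()` accessor name is an approximation.)

```rust
use foca::{Member, State};

// Rows written to sqlite before the restart, decoded back into foca members.
let saved: Vec<Member<MyId>> = load_members_from_sqlite()?;

// Feed back only the Alive and Down entries, as described above.
foca.apply_many(
    saved
        .into_iter()
        .filter(|member| matches!(member.state(), State::Alive | State::Down)),
    &mut runtime,
)?;
```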
Heh, sorry, it got confusing since I jumped straight to an attempt at solving the problem.

Let's say you're taking members A and B out of the cluster for a restart and doing the state save + load that you describe. By the time they come back, B is now B' and A is now A'; however B' thinks that A is still alive (and A' thinks that B is). So they feed this information back to the cluster.

^ That's where I believe your double counting is coming from. And then, when A' learns that the cluster thinks A is still alive, the noise starts. Thankfully the cluster recovers nicely after a while, since we learned our lesson with previous bugs 😀

So one solution to minimize this would be to record the leave in a table that everyone can access: say, you write {timestamp, identity} to the table when you're leaving, and any instance can use this information during its restart. It's ok if there's some replication delay, because of said self-correcting behaviour.
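(A minimal sketch of that shared leave-log idea. `leave_log`, `unix_now`, `LEAVE_WINDOW_SECS`, `my_identity`, `foca` and `runtime` are hypothetical application-side pieces; the foca calls mirror the `apply_many` usage discussed later in this thread.)

```rust
use foca::{Incarnation, Member, State};

// On the way out, before leaving the cluster: record who left and when.
leave_log.insert(unix_now(), my_identity.clone())?;

// On any restart, after building the fresh Foca instance: replay recent
// leavers as Down so they don't get resurrected by stale knowledge.
let recently_left = leave_log.identities_since(unix_now() - LEAVE_WINDOW_SECS)?;
foca.apply_many(
    recently_left
        .into_iter()
        .map(|id| Member::new(id, Incarnation::default(), State::Down)),
    &mut runtime,
)?;
```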
Since we're talking about persisting the Down state, what I think would be best is: […]

- less moving pieces on your side, and pretty trivial to expose in foca
- forgetting the down members during a restart re-enables all the issues we had with double counting again
Turns out the thing that uses […].

What's odd to me is: when I'm restarting the cluster, every currently known identity is going down, assuredly. The order is random though, and which members see a new state for a left member is also random. So I don't really know what the solution is, except that maybe it could be a new type of message: when leaving, send the new down state to every other member, not just a few random members. I'm now using QUIC instead of UDP / TCP, so updates are pretty reliable (pretty much as reliable as TCP). I figure all nodes would get the "leave" message. Ultimately, I wouldn't have to use […].
I'm pretty sure the knowledge is getting disseminated correctly and fast enough. The problem is not that they never learn that the member went down, it's that they forget :)

When you're restarting the cluster, the scenario from a couple of comments ago repeats itself multiple times (but instead of 2 members, it's whatever your rolling restart batch size is). It doesn't look like a problem when you think about the first time this happens, but in the second batch of nodes it becomes evident. Let's say you're restarting your cluster in 3 batches. B1, the first one, just completed and now you're doing B2. At this point in time you have: […]
So when B2 starts coming back online, if a node from B1 talks to B2 and still has updates in its backlog, B2 will think that an identity that just declared itself as down is actually alive. The larger the batch size and the number of batches, the higher the likelihood that it reintroduces down nodes as alive. If you persist the down members the problem mostly disappears; only the nodes within a batch will (possibly) have outdated knowledge.

To be clear: the problem here is asymmetric knowledge of the terminal (Down) state of identities. It's the same problem we had before with forgetting down members too early due to configuration.

So, getting back to the meat of it: you're trying to speed up discovery of cluster members. The reason foca doesn't do this very well is because it's limited to a maximum packet size. You, however, aren't. I think you should consider the approach that memberlist uses: periodically, a member connects to another member and they do a full sync (i.e.: member A sends its full list of members, including down ones; member B applies these to its own state THEN sends its full list back to A). It's very similar to what foca does with announce, but having a proper connection between members enables giving the whole state and ensuring a reply.

Whether you stick to the current approach or try a different one, foca can facilitate this by exposing the full state directly (think foca.iter_members(), but without filtering for liveness); I'll get this done.
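(A rough sketch of that push/pull sync, written against the full-state iterator that ends up shipping as `iter_membership_state`. The connection, the `encode`/`decode` helpers and `MyId` are hypothetical; only `iter_membership_state` and `apply_many` are the foca calls from this thread.)

```rust
use foca::Member;

// Initiator: push our complete membership, down members included...
let ours: Vec<Member<MyId>> = foca.iter_membership_state().cloned().collect();
peer_conn.send(encode(&ours)?)?;

// ...then pull the peer's complete membership and merge it into ours.
let theirs: Vec<Member<MyId>> = decode(&peer_conn.recv()?)?;
foca.apply_many(theirs.into_iter(), &mut runtime)?;

// The peer does the mirror image (apply what it received, then reply with its
// own full state), so both ends converge to the union of what they knew.
```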
Would this help on batch restarts? I suppose it could self-correct by merging states in a way that dismisses a lot of "bad" identities?
Huh, I thought I'd released v0.16.0 and replied to this before. Apologies.

Doing this sync between live members pretty much eliminates any knowledge disparity because it guarantees that the nodes will have the exact same state. I understand the need to converge to the full cluster size as fast as possible, so you'll have to address the problem somehow. Possible approaches, useful for any scenario related to converging: […]
My tiny cluster uses an identity with a timestamp (I golfed it and it actually has a bug if I restart during a year change, but I digress 😅), so I go for the last approach, as I'd rather not introduce TCP or something similar here.
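(A self-contained sketch of the identity-with-a-timestamp idea: the address is the stable prefix, the timestamp makes every restart a distinct identity, so the previous incarnation gets declared down instead of colliding with the new one. The trait methods shown, `renew` and `has_same_prefix`, match the foca version discussed in this thread; later releases reshaped the `Identity` trait, so treat this as a sketch.)

```rust
use std::net::SocketAddr;
use std::time::{SystemTime, UNIX_EPOCH};

use foca::Identity;

#[derive(Clone, Debug, PartialEq, Eq)]
struct TimestampedId {
    addr: SocketAddr,
    // Refreshed on every (re)start, so a restarted node never reuses the
    // identity it had before going down.
    started_at: u64,
}

impl TimestampedId {
    fn new(addr: SocketAddr) -> Self {
        let started_at = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("clock is set before the unix epoch")
            .as_secs();
        Self { addr, started_at }
    }
}

impl Identity for TimestampedId {
    // Rejoining is simply "same address, fresher timestamp".
    fn renew(&self) -> Option<Self> {
        Some(Self::new(self.addr))
    }

    // Identities sharing an address belong to the same node; foca uses this to
    // declare the stale one down when a fresher one shows up.
    fn has_same_prefix(&self, other: &Self) -> bool {
        self.addr == other.addr
    }
}
```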
Thanks, that's helpful! I'm tempted to try the timestamp technique. That should work fine for us.
Do you mean declare it down with Foca, or declare it down internally (like the […])?
With foca. The idea here is that someone is spreading this knowledge to the cluster, some might have learned it already, and you want it to stop; so you teach foca the correct state and it disseminates. (As soon as you do the `apply_many` […])
I think the timestamp change helped! The cluster seems to eventually coalesce to the same number of members for each node. I've also started using the new `iter_membership_state` […].

How can I tell foca […]?
You've just done it by feeding the output of iter_membership_state to apply_many :) Any member you insert using this becomes part of the distributed state and Down is a state members can't transition out of so: foca.apply_many(core::iter::once(
Member::new(identity_to_kill, Incarnation::default(), State::Down)
), &mut runtime) Makes foca declare |
closing stale issues that seem resolved. feel free to reopen
In our setup, I'm cleanly leaving the cluster by using `leave_cluster` and waiting 2 seconds w/ the hope that the update propagates to as many nodes as possible. Since `leave_cluster` moves the `Foca` instance, we can't call `handle_data` and such on it anymore. We've lost control of it. Does foca still handle dispatching the leave message thoroughly?

We're often restarting the cluster w/ a concurrency of 6 (or more) nodes at a time. I figure it's possible for nodes to not receive the leave / down message and consider this node as up. So when they start again, they might `apply_many` an up state for the node, and it might be outdated. For example, there's no way to store the current state of the cluster past the `leave_cluster` call, therefore as other nodes are also leaving at the same time, we'll have stored the wrong identity for them.

When there's a deploy (and therefore a restart), we keep getting these log lines: […]
I know these are mostly harmless, but I wonder if there's a way to either avoid them or to reduce their log level.
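(For context, the leave-and-linger flow described above, as a hedged sketch: it assumes the foca version from this thread, where `leave_cluster` consumes the instance and hands the goodbye packets to the `Runtime`; `flush_outgoing`, `runtime` and `socket` are hypothetical application-side pieces.)

```rust
// `foca` is moved here; no more handle_data / handle_timer afterwards.
foca.leave_cluster(&mut runtime)?;
// Actually put the queued "I'm down" packets on the wire.
flush_outgoing(&mut runtime, &socket).await?;
// Linger so the update has a chance to propagate before the process exits.
tokio::time::sleep(std::time::Duration::from_secs(2)).await;
```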