gossip: demonstrate problem with loopback info propagation #34472

Merged

Conversation

petermattis
Collaborator

Release note: None

@petermattis petermattis requested a review from a team January 31, 2019 23:08
@cockroach-teamcity
Member

This change is Reviewable

@petermattis petermattis requested review from tbg and bdarnell January 31, 2019 23:08
@tbg
Member

tbg commented Feb 1, 2019

Glad the problem is where we hoped it was going to be. Is the "fix" as simple as making sure that the first time a node receives Gossip, it receives everything that isn't expired?

More generally, infinite TTLs are problematic in themselves because we can't delete from Gossip (and I'd rather not introduce it), but at the same time this isn't a problem with infinite TTLs, right? It would affect any TTL that was longer-lived. In the case of node liveness, we could Gossip liveness record X with a TTL that makes the gossip record expire at the liveness entry's expiration + one week, meaning that a week after decommissioning, the node would disappear from Gossip.
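To make the proposal above concrete, here is a tiny sketch of the TTL arithmetic (plain Go with made-up values, not an existing gossip API): the gossip record's TTL is the time until the liveness entry itself expires, plus one week.

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        now := time.Now()
        // Hypothetical example: the liveness entry expires 9s from now.
        livenessExpiration := now.Add(9 * time.Second)

        // Gossip TTL measured from now: liveness expiration plus a week, so a
        // decommissioned node drops out of gossip roughly a week later.
        gossipTTL := livenessExpiration.Sub(now) + 7*24*time.Hour
        fmt.Println(gossipTTL) // ~168h0m9s
    }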

@petermattis
Collaborator Author

Glad the problem is where we hoped it was going to be. Is the "fix" as simple as making sure that the first time a node receives Gossip, it receives everything that isn't expired?

Is that sufficient? What happens if a node starts up and connects to another node that is partitioned? It might only receive the non-expired entries from that node. Or were you suggesting that we send all non-expired entries every time a gossip connection is made, disregarding the highwater stamps?

More generally, infinite TTLs are problematic in themselves because we can't delete from Gossip (and I'd rather not introduce it), but at the same time this isn't a problem with infinite TTLs, right? It would affect any TTL that was longer-lived. In the case of node liveness, we could Gossip liveness record X with a TTL that makes the gossip record expire at the liveness entry's expiration + one week, meaning that a week after decommissioning, the node would disappear from Gossip.

The problem here is bad for any gossip record that is gossiped by more than one node and that isn't gossiped on a regular interval. I think the liveness records might be the most problematic case, but I need to do an audit of the other keys.

I'm hesitant to replace infinite TTLs with ones that expire after a week. Figuring out that a gossip record expired after a week and that the expiration is causing problems in a cluster isn't something I want to debug. Super subtle problems in gossip like the ones we're continuing to discover make me want to replace it, not keep applying bandaids.
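To illustrate the loopback problem under discussion, here is a minimal, self-contained sketch (hypothetical types, not the real gossip package): a restarted node immediately originates fresh infos of its own, which pushes its reported highwater stamp for itself past the timestamps of infos it originated before the restart, so peers filter those older infos out of every delta and they never loop back.

    package main

    import "fmt"

    // Info is a simplified gossip info: a value originated by a node at some
    // logical timestamp.
    type Info struct {
        OrigNodeID int
        Timestamp  int64
        Value      string
    }

    // Peer holds infos plus a highwater stamp per originating node: the
    // largest timestamp of any info it has seen from that node.
    type Peer struct {
        infos     map[string]Info
        highwater map[int]int64
    }

    func NewPeer() *Peer {
        return &Peer{infos: map[string]Info{}, highwater: map[int]int64{}}
    }

    func (p *Peer) Add(key string, info Info) {
        p.infos[key] = info
        if info.Timestamp > p.highwater[info.OrigNodeID] {
            p.highwater[info.OrigNodeID] = info.Timestamp
        }
    }

    // DeltaFor returns only infos newer than the requester's highwater stamps
    // for their originating nodes.
    func (p *Peer) DeltaFor(reqHighwater map[int]int64) map[string]Info {
        delta := map[string]Info{}
        for k, info := range p.infos {
            if info.Timestamp > reqHighwater[info.OrigNodeID] {
                delta[k] = info
            }
        }
        return delta
    }

    func main() {
        // n2 still holds an info that n1 originated before restarting, at ts=10.
        n2 := NewPeer()
        n2.Add("system-db", Info{OrigNodeID: 1, Timestamp: 10, Value: "cfg"})

        // After restarting, n1 gossips fresh infos of its own (e.g. its node
        // descriptor) at ts=50, so the highwater stamp it reports for node 1
        // is already past the old info's timestamp.
        n1 := NewPeer()
        n1.Add("node:1", Info{OrigNodeID: 1, Timestamp: 50, Value: "descriptor"})

        // n2 computes the delta for n1 and filters out the ts=10 info: it
        // never loops back to the restarted n1.
        fmt.Println(n2.DeltaFor(n1.highwater)) // map[]
    }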

@petermattis
Collaborator Author

I think the liveness records might be the most problematic case, but I need to do an audit of the other keys.

This might also affect the table-stat-added infos and the system-config info. I don't think the table-stat-added infos are problematic, because that key simply causes the table stat cache to invalidate an entry, and if a node restarts its cache will be empty anyway. The system-config info may also not be problematic, but I think that is only because there is careful logic to re-gossip the system config if the leaseholder for the range changes. See:

	// Skip re-gossiping when the already-gossiped config matches the one we
	// just loaded and this node was the original source of that gossip.
	if gossipedCfg := r.store.Gossip().GetSystemConfig(); gossipedCfg != nil && gossipedCfg.Equal(loadedCfg) &&
		r.store.Gossip().InfoOriginatedHere(gossip.KeySystemConfig) {
		log.VEventf(ctx, 2, "not gossiping unchanged system config")
		return nil
	}

I also just discovered that we're aware that a node's gossiped infos won't necessarily come back to it on restart. See 1b43a93:

storage: Always gossip system config after lease changes hands

Without doing this, it's possible for n1 to gossip the system config,
lose its lease, and restart with the source node of the system config
gossip still being n1. If this happens, other nodes will not provide the
system config gossip back to n1 because its watermark for itself will be
higher than the system config's timestamp.

@bdarnell
Contributor

bdarnell commented Feb 1, 2019

I'm hesitant to replace infinite TTLs with ones that expire after a week.

Yeah, a week is long enough to be very hard to debug. I'd suggest that we have a limit of say an hour for any non-infinite TTL (so anything that needs to live longer than that needs to be re-gossiped periodically).

Is that sufficient? What happens if a node starts up and connects to another node that is partitioned? It might only receive the non-expired entries from that node. Or were you suggesting that we send all non-expired entries every time a gossip connection is made, disregarding the highwater stamps?

I think what we probably need is to start identifying gossip connections by a tuple like (node id, start time) so that a restarted node is not seen as a continuation of its previous incarnation.
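As an illustration of the (node id, start time) idea, here is a minimal sketch (hypothetical types, not a concrete proto change): if highwater stamps are keyed by incarnation rather than by node ID alone, a restarted node reports no stamp for its previous incarnation, so peers still ship back the infos that the old incarnation originated.

    package main

    import (
        "fmt"
        "time"
    )

    // Incarnation identifies one run of a node's gossip instance.
    type Incarnation struct {
        NodeID    int
        StartTime int64 // e.g. process start time in unix nanos
    }

    // Highwater maps each incarnation to the largest timestamp seen from it.
    type Highwater map[Incarnation]int64

    func main() {
        oldInc := Incarnation{NodeID: 1, StartTime: 1000}
        newInc := Incarnation{NodeID: 1, StartTime: time.Now().UnixNano()}

        // The restarted n1 only has stamps for its new incarnation; it
        // reports nothing for the old one.
        reported := Highwater{newInc: 50}

        // A peer deciding whether to send an info originated by the old
        // incarnation at ts=10 compares against that incarnation's stamp,
        // which is 0, so the info is included in the delta after all.
        infoTS := int64(10)
        fmt.Println(infoTS > reported[oldInc]) // true — the info loops back
    }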

@petermattis
Collaborator Author

I think what we probably need is to start identifying gossip connections by a tuple like (node id, start time) so that a restarted node is not seen as a continuation of its previous incarnation.

Heh, I just filed a bug where I mention that as being an ideal solution. I think there are complexities in actually implementing this and handling the backwards compatibility, though. I need to think about this more.

Similar to cockroachdb#17285. We need to always gossip node liveness infos after
the node-liveness range lease changes hands due to the way gossip
handles propagation of infos which "loopback" to the sender. Due to the
highwater stamp mechanism, such infos are never sent back to the
sender. If the gossip info is subsequently never re-gossiped, a node in
the cluster may never see the info. This is only a problem for gossip
keys which can be gossiped from multiple nodes, such as the node
liveness keys and the system config key. The other keys which can be
gossiped from multiple nodes are the table-stats-added key and the
table-disable-merges key, neither of which appears to have problematic
scenarios associated with the failure to propagate them back to the
sender after a restart.

See cockroachdb#34494

Release note: None
@petermattis petermattis force-pushed the pmattis/gossip-loopback-info-propagation branch from cf99eef to 4034f80 on February 1, 2019 19:09
@petermattis
Collaborator Author

I added a 2nd commit which works around the problem with loopback propagation of infos. This is similar to what was done in #17285. I consider it a temporary bandaid; I'm going with it because it is much simpler and more easily backported than a fix inside gossip itself.

PTAL
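For context, a rough, self-contained sketch of the shape of that workaround (hypothetical names, not the actual diff in this PR): when the node-liveness range lease changes hands, the new leaseholder unconditionally re-gossips the liveness records, which changes the infos' originating node and lets them reach peers whose highwater stamps would otherwise filter them, such as a restarted originator.

    package main

    import "fmt"

    // livenessRangeLeaseholder is a stand-in for the replica that holds the
    // lease on the node-liveness range.
    type livenessRangeLeaseholder struct {
        nodeID int
        gossip func(key, value string)
    }

    // onLeaseAcquired re-gossips every liveness record, even if unchanged.
    // Re-gossiping from the new leaseholder changes each info's originating
    // node, so peers whose highwater stamps filtered the old copy (e.g. a
    // restarted originator) receive it again.
    func (l *livenessRangeLeaseholder) onLeaseAcquired(records map[string]string) {
        for key, val := range records {
            l.gossip(key, val)
        }
    }

    func main() {
        lh := &livenessRangeLeaseholder{
            nodeID: 2,
            gossip: func(key, value string) { fmt.Println("gossip", key, "=", value) },
        }
        lh.onLeaseAcquired(map[string]string{"liveness:1": "live", "liveness:2": "live"})
    }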

@petermattis
Collaborator Author

@tbg or @bdarnell I'd like to get this merged and backported. Any concerns with the hackiness of this "fix"?

Member

@tbg tbg left a comment


Reviewed 1 of 1 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)

@petermattis
Collaborator Author

bors r=tbg

craig bot pushed a commit that referenced this pull request Feb 5, 2019
34472: gossip: demonstrate problem with loopback info propagation r=tbg a=petermattis

Release note: None

Co-authored-by: Peter Mattis <[email protected]>
@craig
Contributor

craig bot commented Feb 5, 2019

Build succeeded

@craig craig bot merged commit 4034f80 into cockroachdb:master Feb 5, 2019
@petermattis petermattis deleted the pmattis/gossip-loopback-info-propagation branch February 5, 2019 17:25