Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gossip: remove race on client peer id #122440

Merged
merged 1 commit into from
Apr 17, 2024

Conversation

kvoli
Copy link
Collaborator

@kvoli kvoli commented Apr 16, 2024

The gossip client peerID was accessed outside of the gossip mutex,
leading to concurrent access. Serialize access by acquiring the gossip
mutex.

Fixes: #122435
Release note: None

@kvoli kvoli added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. backport-24.1.x Flags PRs that need to be backported to 24.1. labels Apr 16, 2024
@kvoli kvoli self-assigned this Apr 16, 2024
Copy link

blathers-crl bot commented Apr 16, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andrewbaptist
Copy link
Collaborator

Nice catch!

Interestingly enough we introduced by an attempt to address incorrect unlocking: #108064.
To note, the file was added to the list of files we don't lint on here: #107577. In nogo_config.json is still marked as

            "pkg/gossip/client.go": "flagged by linter, should be evaluated",

I don't think this lint would have helped here though although I'm not positivev. In gossip we are acquiring the g.mu lock but then accessing c.peerIdD, and I don't know if it is possible to know programmatically which lock needed to be held.

@andrewbaptist
Copy link
Collaborator

Looking at this a little more, there is at least one more usage of c.peerID outside of the lock in the RunAsyncTask goroutine. That is also not safe, but I don't know the exact situation to hit it. Maybe we can lift that log line to before the RunAsyncTask call (although it might always be 0 before requestGossip is called.

There are also uses of c.addr, but that doesn't appear to change after construction.

Can you also add a comment on client.peerID to note that it can only be accessed while holding the gossip.mu lock.

I think there is also a bug in gossip client.go:333 where it sets this outside of a lock.

@kvoli kvoli marked this pull request as ready for review April 16, 2024 15:34
@kvoli kvoli requested a review from a team as a code owner April 16, 2024 15:34
@kvoli kvoli requested a review from arulajmani April 16, 2024 15:35
@kvoli kvoli force-pushed the 240416.rm-gossip-race branch from a09ec3f to cfbd9f6 Compare April 16, 2024 15:54
Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: modulo comment

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli)


pkg/gossip/client.go line 36 at r1 (raw file):

	createdAt time.Time
	// peerID is the node ID of the peer we're connected to. This is set when we
	// receive a response from the peer. The gossip mu should be held when

I found it a bit surprising that we're using gossip mu here, but then I came across this comment:

// Note that access to each client's internal state is serialized by the
// embedded server's mutex. This is surprising!

Is this the "surprise" it's talking about?


pkg/gossip/client.go line 335 at r1 (raw file):

					initCh <- struct{}{}
				}
				if peerID == 0 {

We don't seem to be doing anything meaningful with this. Should we just get rid of peerID entirely?

The gossip client `peerID` was accessed outside of the gossip mutex,
leading to concurrent access. Serialize access by acquiring the gossip
mutex.

Fixes: cockroachdb#122435
Release note: None
@kvoli kvoli force-pushed the 240416.rm-gossip-race branch from cfbd9f6 to d357a32 Compare April 16, 2024 21:09
Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TYFTR

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)


pkg/gossip/client.go line 36 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I found it a bit surprising that we're using gossip mu here, but then I came across this comment:

// Note that access to each client's internal state is serialized by the
// embedded server's mutex. This is surprising!

Is this the "surprise" it's talking about?

Indeed.


pkg/gossip/client.go line 335 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

We don't seem to be doing anything meaningful with this. Should we just get rid of peerID entirely?

Nice, I didn't notice -- removed.

@kvoli
Copy link
Collaborator Author

kvoli commented Apr 17, 2024

bors r=arulajmani

@kvoli kvoli removed backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. labels Apr 17, 2024
@craig craig bot merged commit 7011271 into cockroachdb:master Apr 17, 2024
21 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-24.1.x Flags PRs that need to be backported to 24.1.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv/kvclient/kvcoord: TestPartialPartition failed
4 participants