grpcproxy: TestLaunchDuplicateMemberShouldFail: unexpect successful launch #7267

Closed
heyitsanthony opened this issue Feb 2, 2017 · 3 comments

@heyitsanthony
Contributor

via https://jenkins-etcd-public.prod.coreos.systems/job/etcd-proxy/620/console

--- FAIL: TestLaunchDuplicateMemberShouldFail (0.80s)
	member_test.go:81: unexpect successful launch

Possibly related to #7239

gyuho added a commit to gyuho/etcd that referenced this issue May 4, 2017
Fix etcd-io#7267.

On machines with slow CPUs, it often takes more than 10ms, so the request times out.

Signed-off-by: Gyu-Ho Lee <[email protected]>
@gyuho gyuho added this to the v3.2.0 milestone May 4, 2017
@gyuho
Contributor

gyuho commented May 4, 2017

Root cause: the duplicate member launched and got a "Client.Timeout exceeded while awaiting headers" error (https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L342, https://github.com/coreos/etcd/blob/master/etcdserver/cluster_util.go#L66-L73) from the peer handler (https://github.com/coreos/etcd/blob/master/etcdserver/api/v2http/peer.go). The existing members should already be serving /members in their peer handlers. The duplicate member fetches client URLs from these /members endpoints and finds out that it conflicts with the existing members.

Problem: the duplicate member's GET requests to /members can time out, in which case etcd mistakenly concludes there is no conflict and bootstraps successfully. Since the duplicate member boots with the same token, the conflict is not caught even after a successful boot.
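To make the failure mode concrete, here is a minimal Go sketch. It is a simplified model, not etcd's actual code: `fetchFunc`, `conflictDetected`, `healthyFetch`, `slowFetch`, the member names, and the peer URLs are all hypothetical stand-ins for the GET to each peer's /members endpoint. The point is only that when every request fails, the check returns the same answer as "no conflict".

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout stands in for the "Client.Timeout exceeded while awaiting
// headers" error seen in the failing test run.
var errTimeout = errors.New("Client.Timeout exceeded while awaiting headers")

// fetchFunc is a hypothetical stand-in for a GET to one peer's /members
// endpoint. It returns the member names that peer knows about.
type fetchFunc func(peerURL string) ([]string, error)

// conflictDetected reports whether any reachable peer already lists self.
// Fetch errors are skipped, so if every request times out the result is
// false, which is indistinguishable from "no conflict".
func conflictDetected(self string, peers []string, fetch fetchFunc) bool {
	for _, p := range peers {
		members, err := fetch(p)
		if err != nil {
			continue // timeout is dropped without a trace
		}
		for _, m := range members {
			if m == self {
				return true
			}
		}
	}
	return false
}

// healthyFetch simulates a peer that answers in time (hypothetical data).
func healthyFetch(string) ([]string, error) {
	return []string{"m0", "m1", "m2"}, nil
}

// slowFetch simulates a peer whose response exceeds the client timeout.
func slowFetch(string) ([]string, error) {
	return nil, errTimeout
}

func main() {
	peers := []string{"http://peer-1", "http://peer-2"}
	fmt.Println("peers respond: ", conflictDetected("m0", peers, healthyFetch))
	fmt.Println("peers time out:", conflictDetected("m0", peers, slowFetch))
}
```

In this model the duplicate "m0" is caught only when at least one peer answers before the timeout, which matches the flaky behavior on a slow CI machine.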

I assume this would fail with TCP port conflicts in a regular setup. I'm not sure how this is handled with unix ports.

@gyuho
Contributor

gyuho commented May 4, 2017

I feel like just bumping up the bootstrap timeout for tests is enough. The problem was that all GET requests to the peers' /members endpoints timed out. But in the real world, this never happens. A member with a duplicate config will never boot with such conflicting TCP ports.

@heyitsanthony
Contributor Author

A member could have more than one IP/port, so there won't be a bind conflict. There may be split-brain issues if the member is killed, has its WAL dir removed, then tries to boot as a new cluster while partitioned. This path needs some more testing...

I'm thinking:

  • log the bootstrap error with infof so there's some indication that the member is assuming it's not bootstrapped
  • a test that partitions a duplicate member from a booted cluster so it bootstraps, then confirms it shuts down when the partition is lifted
  • either a boosted bootstrap timeout for TestLaunchDuplicateMemberShouldFail, or a check that confirms the duplicate eventually exits
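The first bullet could look roughly like the Go sketch below. This is an illustration under assumptions, not a patch against etcd: `checkDuplicate`, the `result` values, `fetchFunc`, and the fake fetchers are hypothetical names, and `logf` stands in for whatever info-level logger the server uses. It logs each failed /members request and reports "no peer answered" as its own outcome instead of folding it into "no conflict".

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// result separates "peers answered and there is no conflict" from
// "no peer answered at all".
type result int

const (
	noConflict result = iota
	conflict
	unreachable
)

// fetchFunc is a hypothetical stand-in for a GET to one peer's /members.
type fetchFunc func(peerURL string) ([]string, error)

// checkDuplicate logs every fetch error and returns unreachable when
// none of the peers could be queried.
func checkDuplicate(self string, peers []string, fetch fetchFunc,
	logf func(format string, args ...interface{})) result {
	reached := 0
	for _, p := range peers {
		members, err := fetch(p)
		if err != nil {
			logf("could not get members from %s: %v", p, err)
			continue
		}
		reached++
		for _, m := range members {
			if m == self {
				return conflict
			}
		}
	}
	if len(peers) > 0 && reached == 0 {
		logf("no peer answered; assuming the cluster is not bootstrapped")
		return unreachable
	}
	return noConflict
}

// timeoutFetch simulates every request exceeding the client timeout.
func timeoutFetch(string) ([]string, error) {
	return nil, errors.New("Client.Timeout exceeded while awaiting headers")
}

// okFetch simulates a peer that already lists member "m0".
func okFetch(string) ([]string, error) {
	return []string{"m0", "m1"}, nil
}

func main() {
	peers := []string{"http://peer-1", "http://peer-2"}
	fmt.Println(checkDuplicate("m0", peers, timeoutFetch, log.Printf) == unreachable)
	fmt.Println(checkDuplicate("m0", peers, okFetch, log.Printf) == conflict)
}
```

What the caller does with `unreachable` (retry, wait longer, or still bootstrap) is left open here; the sketch only makes the assumption visible in the logs and in the return value.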

@gyuho gyuho modified the milestones: v3.3.0, v3.2.0 Jun 5, 2017
@gyuho gyuho modified the milestones: v3.4.0, v3.3.0 Aug 14, 2017
@gyuho gyuho closed this as completed Apr 20, 2018
3 participants
@gyuho @heyitsanthony and others