TestDecommission is flaky #17995

Closed
tbg opened this issue Aug 29, 2017 · 0 comments · Fixed by #18220 or #23082

tbg commented Aug 29, 2017

The test times out during the second decommission call:

// Run a second time to wait until the replicas have all been GC'ed.
// Note that we specify "all" because even though the first node is
// now running, it may not be live by the time the command runs.
o, err = decommission(ctx, c, 2, target, "decommission", "--wait", "all")

https://teamcity.cockroachdb.com/repository/download/Cockroach_UnitTests_Acceptance/337493:id/acceptance/acceptance.log

@tbg tbg added the C-test-failure Broken test (automatically or manually discovered). label Aug 29, 2017
@tbg tbg self-assigned this Aug 29, 2017
tbg added a commit to tbg/cockroach that referenced this issue Aug 29, 2017
@petermattis petermattis added this to the 1.1 milestone Sep 6, 2017
tbg added a commit to tbg/cockroach that referenced this issue Sep 6, 2017
The test frequently got stuck during `decommission --wait` for a
freshly killed node, due to the following:

- node 1 gets killed
- node 3 increments node 1's epoch
- everything from now on happens on node 2:
  - calls IncrementEpoch, which stalls for now
  - sets node 1's decommissioning flag
  - gossip fires and updates node 1's liveness map entry
  - IncrementEpoch continues
  - it fails (since the decommissioning flag mismatches)
  - blindly updates the liveness[1], effectively clobbering its in-memory state
- calls to Decommission for node1 now turn into no-ops on node2, but
  when the liveness is queried, it looks like node1 isn't decommissioning.
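
The interleaving above can be replayed deterministically in a toy model. This is a sketch, not CockroachDB code: `Liveness`, `node`, `condPut`, and `replay` are hypothetical names, node 3's earlier epoch increment is left out for brevity, and the sketch assumes the blind update writes back the snapshot that `IncrementEpoch` took before it stalled. It only illustrates how a failed conditional write followed by an unconditional in-memory update can leave the cached view disagreeing with the stored record.

```go
package main

import "fmt"

// Liveness is a hypothetical, simplified liveness record.
type Liveness struct {
	Epoch           int
	Decommissioning bool
}

// node models node 2: kv stands in for the stored record, cache for the
// in-memory liveness map that gossip keeps up to date.
type node struct {
	kv    map[int]Liveness
	cache map[int]Liveness
}

// condPut writes next only if the stored record still equals expected.
func (n *node) condPut(id int, expected, next Liveness) bool {
	if n.kv[id] != expected {
		return false
	}
	n.kv[id] = next
	return true
}

// replay runs the interleaving step by step and returns the stored and
// cached records for node 1.
func replay() (stored, cached Liveness) {
	n := &node{
		kv:    map[int]Liveness{1: {Epoch: 1}},
		cache: map[int]Liveness{1: {Epoch: 1}},
	}

	// IncrementEpoch starts: it snapshots the record, then stalls.
	snapshot := n.cache[1]

	// Decommission sets the flag; gossip refreshes the cache.
	rec := n.kv[1]
	rec.Decommissioning = true
	n.kv[1] = rec
	n.cache[1] = rec

	// IncrementEpoch resumes: the conditional write fails because the
	// decommissioning flag no longer matches the snapshot.
	next := snapshot
	next.Epoch++
	if !n.condPut(1, snapshot, next) {
		// Assumed bug: the cache is overwritten unconditionally.
		n.cache[1] = snapshot
	}
	return n.kv[1], n.cache[1]
}

func main() {
	stored, cached := replay()
	fmt.Printf("stored=%+v cached=%+v\n", stored, cached)
}
```

In this model the stored record ends up with the decommissioning flag set while the cached copy has lost it, which matches the symptom of node 1 not looking like it is decommissioning.
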

Interestingly, when I looked at the liveness output as seen by the other two
nodes, they always had the same problem. Perhaps something else is at work on
top of the above? I filed an issue to investigate & fix the problematic behavior
of IncrementEpoch:

cockroachdb#18219

For this PR, I am simply using the semaphore to avoid interleavings between
IncrementEpoch and Decommission. With that change I ran the test 10 times
without a hang; before, it would hang roughly every other run, maybe more often.
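
As a rough illustration of the approach, the sketch below serializes two kinds of updates with a one-slot semaphore built from a buffered channel. It is not the code from the PR: the `NodeLiveness` type here, its `sem` field, and the `update`, `Get`, and `run` helpers are assumptions made for the example, and the real semaphore may be acquired differently.

```go
package main

import (
	"fmt"
	"sync"
)

// Liveness is a hypothetical, simplified liveness record.
type Liveness struct {
	Epoch           int
	Decommissioning bool
}

// NodeLiveness is a stand-in type; sem is a one-slot semaphore that
// guards every access to records.
type NodeLiveness struct {
	sem     chan struct{}
	records map[int]Liveness
}

func NewNodeLiveness() *NodeLiveness {
	return &NodeLiveness{
		sem:     make(chan struct{}, 1),
		records: map[int]Liveness{},
	}
}

// update performs a read-modify-write on one record while holding the
// semaphore, so two updates can never interleave.
func (nl *NodeLiveness) update(id int, fn func(*Liveness)) {
	nl.sem <- struct{}{}        // acquire
	defer func() { <-nl.sem }() // release
	rec := nl.records[id]
	fn(&rec)
	nl.records[id] = rec
}

func (nl *NodeLiveness) IncrementEpoch(id int) {
	nl.update(id, func(l *Liveness) { l.Epoch++ })
}

func (nl *NodeLiveness) Decommission(id int) {
	nl.update(id, func(l *Liveness) { l.Decommissioning = true })
}

// Get reads a record under the same semaphore.
func (nl *NodeLiveness) Get(id int) Liveness {
	nl.sem <- struct{}{}
	defer func() { <-nl.sem }()
	return nl.records[id]
}

// run fires n concurrent epoch increments alongside one decommission
// and returns the final record for node 1.
func run(n int) Liveness {
	nl := NewNodeLiveness()
	var wg sync.WaitGroup
	wg.Add(n + 1)
	go func() { defer wg.Done(); nl.Decommission(1) }()
	for i := 0; i < n; i++ {
		go func() { defer wg.Done(); nl.IncrementEpoch(1) }()
	}
	wg.Wait()
	return nl.Get(1)
}

func main() {
	fmt.Printf("%+v\n", run(100))
}
```

Because each read-modify-write holds the semaphore for its whole duration, neither the epoch increments nor the decommissioning flag can be lost, regardless of how the goroutines are scheduled.
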

Fixes cockroachdb#17995.

[1]: https://github.com/cockroachdb/cockroach/blob/a9ef8342681baa571ab552d2dc93185ac8eccb49/pkg/storage/node_liveness.go#L566-L572
@tbg tbg closed this as completed in #18220 Sep 6, 2017
tbg added a commit to tbg/cockroach that referenced this issue Feb 26, 2018
When a node is decommissioned in absentia, the gateway node may not be
aware of its most recent liveness entry. Before this bug fix, the
gateway would fail the decommissioning process. Instead, it now falls
back to reading the liveness record from the KV store.
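
A minimal sketch of that fallback, under stated assumptions: the `gateway` type, its `cache` and `kv` maps, and the `lookup` and `decommission` helpers are hypothetical names for illustration, and plain maps stand in for the in-memory liveness view and the KV store. The actual change may read from the KV store under different conditions than a simple cache miss.

```go
package main

import (
	"errors"
	"fmt"
)

// Liveness is a hypothetical, simplified liveness record.
type Liveness struct {
	Epoch           int
	Decommissioning bool
}

var errUnknownNode = errors.New("no liveness record for node")

// gateway models the node handling the request: cache is its in-memory
// view, kv stands in for the authoritative records in the KV store.
type gateway struct {
	cache map[int]Liveness
	kv    map[int]Liveness
}

// lookup prefers the in-memory record and falls back to the KV store
// when the gateway has no entry for the node.
func (g *gateway) lookup(id int) (Liveness, error) {
	if rec, ok := g.cache[id]; ok {
		return rec, nil
	}
	rec, ok := g.kv[id]
	if !ok {
		return Liveness{}, errUnknownNode
	}
	g.cache[id] = rec
	return rec, nil
}

// decommission marks the node as decommissioning in both places.
func (g *gateway) decommission(id int) error {
	rec, err := g.lookup(id)
	if err != nil {
		return err
	}
	rec.Decommissioning = true
	g.kv[id] = rec
	g.cache[id] = rec
	return nil
}

func main() {
	// The gateway has never heard of node 3, but the KV store has.
	g := &gateway{
		cache: map[int]Liveness{},
		kv:    map[int]Liveness{3: {Epoch: 4}},
	}
	fmt.Println(g.decommission(3), g.kv[3])
	fmt.Println(g.decommission(9))
}
```

Without the fallback in `lookup`, the first call would fail simply because the gateway's cache is empty, which is the "remembered" dependency the release note describes.
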

Fixes cockroachdb#17995.

Release note (bug fix): Decommissioning a node that has already been
terminated now works in all cases. Success previously depended on
whether the gateway node "remembered" the absent decommissionee.
tbg added a commit to tbg/cockroach that referenced this issue Feb 26, 2018
When a node is decommissioned in absentia, the gateway node may not be
aware of its most recent liveness entry. Before this bug fix, the
gateway would fail the decommissioning process. Instead, we now read the
liveness record from the KV store in this path.

Fixes cockroachdb#17995.

Release note (bug fix): Decommissioning a node that has already been
terminated now works in all cases. Success previously depended on
whether the gateway node "remembered" the absent decommissionee.
tbg added a commit to tbg/cockroach that referenced this issue Mar 5, 2018