TestDecommission is flaky #17995

tbg · 2017-08-29T17:02:15Z

The test times out during this step:

// Run a second time to wait until the replicas have all been GC'ed.
// Note that we specify "all" because even though the first node is
// now running, it may not be live by the time the command runs.
o, err = decommission(ctx, c, 2, target, "decommission", "--wait", "all")

https://teamcity.cockroachdb.com/repository/download/Cockroach_UnitTests_Acceptance/337493:id/acceptance/acceptance.log



The test frequently got stuck during `decommission --wait` for a freshly killed node, due to the following:

- node 1 gets killed
- node 3 increments node 1's epoch
- everything from now on happens on node 2
- calls IncrementEpoch, which stalls for now
- sets node 1's decommissioning flag
- gossip fires and updates node 1's liveness map entry
- IncrementEpoch continues
- it fails (since the decommissioning flag mismatches)
- blindly updates the liveness[1], effectively clobbering its in-memory state
- calls to Decommission for node 1 now turn into no-ops on node 2, but when the liveness is queried, it looks like node 1 isn't decommissioning.

Interestingly, when I looked at the liveness output as seen by the other two nodes, they always had the same problem. Perhaps something else is at work on top of the above?

I filed an issue to investigate & fix the problematic behavior of IncrementEpoch: cockroachdb#18219

For this PR, I am simply using the semaphore to avoid interleavings between IncrementEpoch and Decommission, and was able to run the test 10x without it hanging (usually would hang every other time before, maybe more often).

Fixes cockroachdb#17995.

[1]: https://github.com/cockroachdb/cockroach/blob/a9ef8342681baa571ab552d2dc93185ac8eccb49/pkg/storage/node_liveness.go#L566-L572
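The serialization idea can be sketched roughly as follows. This is a minimal illustration, assuming a capacity-1 channel used as the semaphore; the `NodeLiveness` type, its `store` map, the `update` helper, and `SetDecommissioning` are hypothetical stand-ins and not the actual code in `pkg/storage/node_liveness.go`.

```go
package main

import "sync"

// Liveness is a simplified, hypothetical stand-in for a liveness record.
type Liveness struct {
	Epoch           int64
	Decommissioning bool
}

// NodeLiveness is a toy model: store plays the role of the authoritative
// liveness records, and sem serializes read-modify-write operations.
type NodeLiveness struct {
	sem   chan struct{} // capacity-1 semaphore
	mu    sync.Mutex    // protects store
	store map[int]Liveness
}

func NewNodeLiveness() *NodeLiveness {
	return &NodeLiveness{
		sem:   make(chan struct{}, 1),
		store: make(map[int]Liveness),
	}
}

// Get returns the current record (the zero value if none exists).
func (nl *NodeLiveness) Get(nodeID int) Liveness {
	nl.mu.Lock()
	defer nl.mu.Unlock()
	return nl.store[nodeID]
}

// condPut writes next only if the stored record still equals expected.
func (nl *NodeLiveness) condPut(nodeID int, expected, next Liveness) bool {
	nl.mu.Lock()
	defer nl.mu.Unlock()
	if nl.store[nodeID] != expected {
		return false
	}
	nl.store[nodeID] = next
	return true
}

// update holds the semaphore across the whole read-modify-write cycle, so
// no other update can slip in between the read and the conditional write.
func (nl *NodeLiveness) update(nodeID int, mutate func(*Liveness)) bool {
	nl.sem <- struct{}{}        // acquire
	defer func() { <-nl.sem }() // release
	cur := nl.Get(nodeID)
	next := cur
	mutate(&next)
	return nl.condPut(nodeID, cur, next)
}

// IncrementEpoch bumps the epoch of the given node's record.
func (nl *NodeLiveness) IncrementEpoch(nodeID int) bool {
	return nl.update(nodeID, func(l *Liveness) { l.Epoch++ })
}

// SetDecommissioning sets the decommissioning flag of the given node.
func (nl *NodeLiveness) SetDecommissioning(nodeID int, v bool) bool {
	return nl.update(nodeID, func(l *Liveness) { l.Decommissioning = v })
}
```

Because both operations go through `update`, an epoch increment that has read a record cannot be overtaken by a flag change before its conditional write, which is the interleaving described in the list above. The sketch leaves out gossip and the in-memory cache entirely.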

When a node is decommissioned in absentia, the gateway node may not be aware of its most recent liveness entry. Before this bug fix, the gateway would fail the decommissioning process. Instead, it now falls back to reading the liveness record from the KV store.

Fixes cockroachdb#17995.

Release note (bug fix): Decommissioning a node that has already been terminated now works in all cases. Success previously depended on whether the gateway node "remembered" the absent decommissionee.
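A rough sketch of that fallback, under the assumption that the gateway keeps an in-memory view of liveness records next to the authoritative copy in the KV store. The `Gateway` type, its two maps, and `ErrNoLivenessRecord` are hypothetical names chosen for illustration and do not reflect the real implementation:

```go
package main

import "errors"

// Liveness is a simplified, hypothetical liveness record.
type Liveness struct {
	NodeID          int
	Epoch           int64
	Decommissioning bool
}

// ErrNoLivenessRecord is a hypothetical error for a node with no record at all.
var ErrNoLivenessRecord = errors.New("no liveness record for node")

// Gateway models the node that receives the decommission request.
type Gateway struct {
	cache map[int]Liveness // in-memory view, may be missing entries
	kv    map[int]Liveness // stand-in for the KV store
}

// livenessFor returns the in-memory record if present and otherwise
// falls back to reading the record from the KV stand-in.
func (g *Gateway) livenessFor(nodeID int) (Liveness, error) {
	if l, ok := g.cache[nodeID]; ok {
		return l, nil
	}
	l, ok := g.kv[nodeID]
	if !ok {
		return Liveness{}, ErrNoLivenessRecord
	}
	g.cache[nodeID] = l // remember what was read
	return l, nil
}

// Decommission marks the node as decommissioning, even when the gateway
// had no in-memory record for it beforehand.
func (g *Gateway) Decommission(nodeID int) error {
	l, err := g.livenessFor(nodeID)
	if err != nil {
		return err
	}
	l.Decommissioning = true
	g.kv[nodeID] = l
	g.cache[nodeID] = l
	return nil
}
```

In this model, a gateway with an empty `cache` can still decommission node 3 as long as `kv` holds a record for it; only a node with no record anywhere produces an error.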


tbg added the C-test-failure label (Broken test, automatically or manually discovered) Aug 29, 2017

tbg self-assigned this Aug 29, 2017

tbg added a commit to tbg/cockroach that referenced this issue Aug 29, 2017

acceptance: (again) skip TestDecommission

f13967f

See cockroachdb#17995.

This was referenced Aug 29, 2017

acceptance: (again) skip TestDecommission #17998

Merged

cli: add command or flag to actively remove node from cluster #6198

Closed

benesch mentioned this issue Sep 2, 2017

server,storage,sql: record node {de,re}commissioning in the event log #18178

Merged

tbg mentioned this issue Sep 5, 2017

storage: node liveness' IncrementEpoch can clobber "newer" state #18219

Closed

tbg mentioned this issue Sep 5, 2017

acceptance: de-flake TestDecommission #18220

Merged

petermattis added this to the 1.1 milestone Sep 6, 2017

tbg closed this as completed in #18220 Sep 6, 2017

tbg mentioned this issue Feb 26, 2018

storage: fix decommissioning of absent node #23082

Merged

tbg mentioned this issue Mar 5, 2018

cherrypick-2.0: storage: fix decommissioning of absent node #23378

Merged
