TestDecommission is flaky #17995
Comments
tbg added a commit to tbg/cockroach that referenced this issue Aug 29, 2017
This was referenced Aug 29, 2017
tbg added a commit to tbg/cockroach that referenced this issue Sep 6, 2017
The test frequently got stuck during `decommission --wait` for a freshly killed node, due to the following:

- node 1 gets killed
- node 3 increments node 1's epoch
- everything from now on happens on node 2
- calls IncrementEpoch, which stalls for now
- sets node 1's decommissioning flag
- gossip fires and updates node1's liveness map entry
- IncrementEpoch continues
- it fails (since the decommissioning flag mismatches)
- blindly updates the liveness[1], effectively clobbering its in-memory state
- calls to Decommission for node1 now turn into no-ops on node2, but when the liveness is queried, it looks like node1 isn't decommissioning.

Interestingly, when I looked at the liveness output as seen by the other two nodes, they always had the same problem. Perhaps something else is at work on top of the above?

I filed an issue to investigate & fix the problematic behavior of IncrementEpoch: cockroachdb#18219

For this PR, I am simply using the semaphore to avoid interleavings between IncrementEpoch and Decommission, and was able to run the test 10x without it hanging (usually would hang every other time before, maybe more often).

Fixes cockroachdb#17995.

[1]: https://github.com/cockroachdb/cockroach/blob/a9ef8342681baa571ab552d2dc93185ac8eccb49/pkg/storage/node_liveness.go#L566-L572
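The idea of using a semaphore to keep IncrementEpoch and Decommission from interleaving can be illustrated with a minimal Go sketch. This is not the CockroachDB code: the `liveness` and `registry` types, their fields, and the `update` helper are hypothetical stand-ins, and the real liveness records live in the KV store rather than in a map.

```go
package main

import "fmt"

// liveness is a hypothetical, simplified stand-in for a liveness record.
type liveness struct {
	Epoch           int
	Decommissioning bool
}

// String formats the record for printing.
func (l liveness) String() string {
	return fmt.Sprintf("epoch=%d decommissioning=%t", l.Epoch, l.Decommissioning)
}

// registry is a hypothetical in-memory liveness cache. The one-slot
// channel acts as a semaphore: at most one update is in flight at a time.
type registry struct {
	sem     chan struct{}
	records map[int]liveness
}

func newRegistry() *registry {
	return &registry{sem: make(chan struct{}, 1), records: map[int]liveness{}}
}

// update runs a read-modify-write on one record while holding the semaphore,
// so a concurrent update cannot overwrite it with stale state.
func (r *registry) update(nodeID int, fn func(*liveness)) {
	r.sem <- struct{}{}        // acquire
	defer func() { <-r.sem }() // release
	l := r.records[nodeID]
	fn(&l)
	r.records[nodeID] = l
}

// IncrementEpoch bumps the epoch of the given node's record.
func (r *registry) IncrementEpoch(nodeID int) {
	r.update(nodeID, func(l *liveness) { l.Epoch++ })
}

// Decommission sets the decommissioning flag on the given node's record.
func (r *registry) Decommission(nodeID int) {
	r.update(nodeID, func(l *liveness) { l.Decommissioning = true })
}

// Get returns a copy of the record, also under the semaphore.
func (r *registry) Get(nodeID int) liveness {
	r.sem <- struct{}{}
	defer func() { <-r.sem }()
	return r.records[nodeID]
}

func main() {
	r := newRegistry()
	done := make(chan struct{})
	go func() { r.IncrementEpoch(1); done <- struct{}{} }()
	go func() { r.Decommission(1); done <- struct{}{} }()
	<-done
	<-done
	fmt.Println(r.Get(1)) // prints "epoch=1 decommissioning=true"
}
```

Whichever goroutine acquires the semaphore first, the other sees its completed write, so neither the epoch bump nor the decommissioning flag is lost.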
tbg added a commit to tbg/cockroach that referenced this issue Feb 26, 2018
When a node is decommissioned in absentia, the gateway node may not be aware of its most recent liveness entry. Before this bug fix, the gateway would fail the decommissioning process. Instead, it now falls back to reading the liveness record from the KV store.

Fixes cockroachdb#17995.

Release note (bug fix): Decommissioning a node that has already been terminated now works in all cases. Success previously depended on whether the gateway node "remembered" the absent decommissionee.
tbg added a commit to tbg/cockroach that referenced this issue Feb 26, 2018
When a node is decommissioned in absentia, the gateway node may not be aware of its most recent liveness entry. Before this bug fix, the gateway would fail the decommissioning process. Instead, we now read the liveness record from the KV store in this path.

Fixes cockroachdb#17995.

Release note (bug fix): Decommissioning a node that has already been terminated now works in all cases. Success previously depended on whether the gateway node "remembered" the absent decommissionee.
tbg added a commit to tbg/cockroach that referenced this issue Mar 5, 2018
When a node is decommissioned in absentia, the gateway node may not be aware of its most recent liveness entry. Before this bug fix, the gateway would fail the decommissioning process. Instead, we now read the liveness record from the KV store in this path. Fixes cockroachdb#17995. Release note (bug fix): Decommissioning a node that has already been terminated now works in all cases. Success previously depended on whether the gateway node "remembered" the absent decommissionee.
it times out during
https://teamcity.cockroachdb.com/repository/download/Cockroach_UnitTests_Acceptance/337493:id/acceptance/acceptance.log