Fix testFollowerCheckerDetectsUnresponsiveNodeAfterMasterReelection #84200

DaveCTurner
Contributor

This test would fail if we introduced the network partition while the
master was still publishing a cluster state update and had not yet
received the ack from the victim node. In this case the default publish
timeout means that the master waits for 30s before completing the
stalled publication and moving on to the `node-left` one, but
`ensureStableCluster` also times out after 30s, which leaves little
time for the master to remove the victim node.

This commit reduces the publish timeout to 10s so that the master
recovers well before `ensureStableCluster` times out.
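
The timing argument can be illustrated with a small sketch. This is not code from the commit: the class and method names below are hypothetical, and it assumes the worst case in which the publication stalls at roughly the same moment the stability wait begins, so that the two timers run side by side.

```java
import java.time.Duration;

public class PublishTimeoutSlack {

    /**
     * Time left for the master to remove the victim node after the stalled
     * publication has timed out, assuming both timers start together
     * (a simplifying worst-case assumption, not something the test guarantees).
     */
    static Duration slack(Duration publishTimeout, Duration stabilityTimeout) {
        return stabilityTimeout.minus(publishTimeout);
    }

    public static void main(String[] args) {
        // Values taken from the description above.
        Duration stabilityTimeout = Duration.ofSeconds(30);      // ensureStableCluster timeout
        Duration defaultPublishTimeout = Duration.ofSeconds(30); // default publish timeout
        Duration reducedPublishTimeout = Duration.ofSeconds(10); // value chosen by this commit

        System.out.println("slack with default timeout: "
            + slack(defaultPublishTimeout, stabilityTimeout).getSeconds() + "s");
        System.out.println("slack with reduced timeout: "
            + slack(reducedPublishTimeout, stabilityTimeout).getSeconds() + "s");
    }
}
```

Under that assumption the default leaves no margin at all, while the 10s timeout leaves about 20s for the `node-left` publication to complete.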

Closes #84172

@DaveCTurner DaveCTurner added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v7.17.1 v8.2.0 v8.1.1 v8.0.2 labels Feb 21, 2022
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Feb 21, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner DaveCTurner merged commit e853bf5 into elastic:master Feb 22, 2022
@DaveCTurner DaveCTurner deleted the 2022-02-21-fix-testFollowerCheckerDetectsUnresponsiveNodeAfterMasterReelection branch February 22, 2022 08:21
@elasticsearchmachine
Collaborator

💔 Backport failed

The backport operation could not be completed due to the following error:
An unexpected error occurred when attempting to backport this PR.

You can use sqren/backport to backport manually by running `backport --upstream elastic/elasticsearch --pr 84200`

DaveCTurner added a commit that referenced this pull request Feb 22, 2022
DaveCTurner added a commit that referenced this pull request Feb 22, 2022
DaveCTurner added a commit that referenced this pull request Feb 22, 2022
probakowski pushed a commit to probakowski/elasticsearch that referenced this pull request Feb 23, 2022
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test Issues or PRs that are addressing/adding tests v7.17.1 v8.0.2 v8.1.1 v8.2.0
Development

Successfully merging this pull request may close these issues.

[CI] StableMasterDisruptionIT testFollowerCheckerDetectsUnresponsiveNodeAfterMasterReelection failing
4 participants