Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle master failure in NodeSeenService #77220

Merged

Conversation

gwbrown
Copy link
Contributor

@gwbrown gwbrown commented Sep 2, 2021

NodeSeenService can miss seeing nodes if the master changes while it's
processing the cluster state update which adds the nodes to the cluster.
This caused occasional test failures in the test intended to check that
NodeSeenService is working as intended.

This commit adjusts NodeSeenService's early returns to ensure that, if
the master changed, the new master checks for seen nodes even if nodes
were not added in that particular cluster state update.

Follow-up to #75750
Fixes #76689

NodeSeenService can miss seeing nodes if the master changes while it's
processing the cluster state update which adds the nodes to the cluster.
This caused occasional test failures in the test intended to check that
NodeSeenService is working as intended.

This commit adjusts NodeSeenService's early returns to ensure that, if
the master changed, the new master checks for seen nodes even if nodes
were not added in that particular cluster state update.
@gwbrown gwbrown added >bug v8.0.0 :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown v7.16.0 v7.15.1 labels Sep 2, 2021
@gwbrown gwbrown requested a review from dakrone September 2, 2021 22:42
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Sep 2, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@gwbrown
Copy link
Contributor Author

gwbrown commented Sep 2, 2021

Prior to the change to NodeSeenService, I was seeing a roughly 1%-2% failure rate running locally. After the change, I ran several hundred runs with zero failures.

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for fixing this

@gwbrown
Copy link
Contributor Author

gwbrown commented Sep 8, 2021

@dakrone I realized that I was missing an == false on the condition - as it was, the cluster state would get updated and the test would pass, but submit update tasks when not necessary to do so. I've updated it to be easier to read, and run the new version several hundred times as well.

@gwbrown gwbrown requested a review from dakrone September 8, 2021 22:11
Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying it by pulling it into a variable, after removing the logging it LGTM

final boolean thisNodeJustBecameMaster = event.previousState().nodes().isLocalNodeElectedMaster() == false
&& event.state().nodes().isLocalNodeElectedMaster();
if ((event.nodesAdded() || thisNodeJustBecameMaster) == false) {
logger.error("GWB> Bailing early");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably remove this line :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I had those in there to help verify my reading that the == false was missing 🤦

Removed.

@@ -67,6 +70,7 @@ public void clusterChanged(ClusterChangedEvent event) {
.collect(Collectors.toUnmodifiableSet());

if (nodesNotPreviouslySeen.isEmpty() == false) {
logger.error("GWB> Submitting update task for nodes [{}]", nodesNotPreviouslySeen);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one :)

@@ -86,6 +90,7 @@ public ClusterState execute(ClusterState currentState) throws Exception {

final NodesShutdownMetadata newNodesMetadata = new NodesShutdownMetadata(newShutdownMetadataMap);
if (newNodesMetadata.equals(currentShutdownMetadata)) {
logger.error("GWB> Bailing update task as it's a no-op");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one :)

@gwbrown gwbrown added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 9, 2021
@gwbrown
Copy link
Contributor Author

gwbrown commented Sep 9, 2021

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit 62fed2f into elastic:master Sep 9, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Sep 9, 2021
* Handle master failure in NodeSeenService

NodeSeenService can miss seeing nodes if the master changes while it's
processing the cluster state update which adds the nodes to the cluster.
This caused occasional test failures in the test intended to check that
NodeSeenService is working as intended.

This commit adjusts NodeSeenService's early returns to ensure that, if
the master changed, the new master checks for seen nodes even if nodes
were not added in that particular cluster state update.

* Clarify & correct "just-became-master" check

* Remove leftover debug logging (d'oh!)

Co-authored-by: Elastic Machine <[email protected]>
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Sep 9, 2021
* Handle master failure in NodeSeenService

NodeSeenService can miss seeing nodes if the master changes while it's
processing the cluster state update which adds the nodes to the cluster.
This caused occasional test failures in the test intended to check that
NodeSeenService is working as intended.

This commit adjusts NodeSeenService's early returns to ensure that, if
the master changed, the new master checks for seen nodes even if nodes
were not added in that particular cluster state update.

* Clarify & correct "just-became-master" check

* Remove leftover debug logging (d'oh!)

Co-authored-by: Elastic Machine <[email protected]>
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
7.x
7.15

elasticsearchmachine pushed a commit that referenced this pull request Sep 9, 2021
* Handle master failure in NodeSeenService

NodeSeenService can miss seeing nodes if the master changes while it's
processing the cluster state update which adds the nodes to the cluster.
This caused occasional test failures in the test intended to check that
NodeSeenService is working as intended.

This commit adjusts NodeSeenService's early returns to ensure that, if
the master changed, the new master checks for seen nodes even if nodes
were not added in that particular cluster state update.

* Clarify & correct "just-became-master" check

* Remove leftover debug logging (d'oh!)

Co-authored-by: Elastic Machine <[email protected]>

Co-authored-by: Elastic Machine <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Sep 9, 2021
* Handle master failure in NodeSeenService

NodeSeenService can miss seeing nodes if the master changes while it's
processing the cluster state update which adds the nodes to the cluster.
This caused occasional test failures in the test intended to check that
NodeSeenService is working as intended.

This commit adjusts NodeSeenService's early returns to ensure that, if
the master changed, the new master checks for seen nodes even if nodes
were not added in that particular cluster state update.

* Clarify & correct "just-became-master" check

* Remove leftover debug logging (d'oh!)

Co-authored-by: Elastic Machine <[email protected]>

Co-authored-by: Elastic Machine <[email protected]>
@jakelandis jakelandis removed the v8.0.0 label Sep 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown Team:Core/Infra Meta label for core/infra team v7.15.1 v7.16.0 v8.0.0-alpha2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] NodeShutdownShardsIT testShardStatusStaysCompleteAfterNodeLeavesIfRegisteredWhileNodeOffline failing
5 participants