Don’t ack if unable to remove failing replica #39584

Merged
merged 6 commits into elastic:master on Mar 5, 2019

Conversation


@dnhatn dnhatn commented Mar 1, 2019

Today when a replicated write operation fails to execute on a replica, the primary reaches out to the master to fail that replica (and mark it stale). We then won't ack the request until the master has removed the failing replica; otherwise, we would lose the acked operation if the failed replica is still in the in-sync set. However, if the node holding the primary is shutting down, we might ack such a request even though we were unable to send a shard-failure request to the master. This happens because we ignore the NodeClosedException that is thrown when the ClusterService is being closed.

Closes #39467

/cc @bleskes @martijnvg @jasontedor
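
For illustration, here is a minimal, self-contained sketch of the acking decision described above. The class, method, and variable names are hypothetical, and the two exception types are stubbed locally so the sketch compiles on its own; it is not the actual Elasticsearch implementation.

public final class ReplicaFailureAckSketch {

    // local stand-ins for the real exception types, so the sketch is self-contained
    static class NodeClosedException extends RuntimeException {}
    static class TransportException extends RuntimeException {}

    /**
     * Decides whether a replicated write may be acked after a replica failed.
     * shardFailureError is null if the master acknowledged removing the replica,
     * otherwise it is the error hit while trying to fail the shard.
     */
    static boolean safeToAck(Exception shardFailureError) {
        if (shardFailureError == null) {
            // the master removed the failing replica (marked it stale); acking is safe
            return true;
        }
        if (shardFailureError instanceof NodeClosedException
                || shardFailureError instanceof TransportException) {
            // the node holding the primary is shutting down, so the master may never
            // learn about the stale replica; do not ack, or the acked write could be lost
            return false;
        }
        // anything else is unexpected and merits investigation
        throw new AssertionError(shardFailureError);
    }

    public static void main(String[] args) {
        System.out.println(safeToAck(null));                      // true
        System.out.println(safeToAck(new NodeClosedException())); // false
    }
}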

@dnhatn dnhatn added the >bug, blocker, v7.0.0, :Distributed Indexing/Distributed, v6.7.0, v8.0.0, v7.2.0, v6.6.2, and v5.6.16 labels Mar 1, 2019
@dnhatn dnhatn requested a review from ywelsch March 1, 2019 20:05
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@@ -436,4 +442,54 @@ public void testIndicesDeleted() throws Exception {
assertFalse(client().admin().indices().prepareExists(idxName).get().isExists());
}

public void testRestartPrimaryNodeWhileIndexing() throws Exception {
Member Author

I will merge this newly added test into testAckedIndexing in a follow-up.

Contributor

was this test able to expose the bug without the fix?

Member Author

I ran 5,000 iterations without the fix and this test failed twice. The test is essentially copied from

public void testFollowIndexAndCloseNode() throws Exception {

primary.failShard(message, failure);
} else {
// these can occur if the node is shutting down and are okay
// any other exception here is not expected and merits investigation
assert failure instanceof NodeClosedException || failure instanceof TransportException : failure;
Contributor

@bleskes mentioned that the TransportException here should be coming from

if (lifecycle.stoppedOrClosed()) {
// if we are not started the exception handling will remove the RequestHolder again and calls the handler to notify
// the caller. It will only notify if the toStop code hasn't done the work yet.
throw new TransportException("TransportService is closed stopped can't send request");
}

We should verify this here with an assertion and (I think in a follow-up) look into throwing a more appropriate exception in TransportService. Possible options are AlreadyClosedException or a custom subclass of TransportException.
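
As an aside, a minimal sketch of the kind of assertion being suggested, assuming the check simply matches the message quoted from TransportService; the class and helper names are hypothetical and the exception types are local stubs, not the actual change (which landed in the commit mentioned in the reply below).

public final class ShardFailureAssertionSketch {

    // local stubs so the sketch compiles on its own
    static class NodeClosedException extends RuntimeException {}
    static class TransportException extends RuntimeException {
        TransportException(String message) { super(message); }
    }

    // accept the failure only if the node is closing or the transport service
    // reported the specific shutdown message quoted above
    static void assertExpectedFailure(Exception failure) {
        assert failure instanceof NodeClosedException
                || (failure instanceof TransportException
                        && "TransportService is closed stopped can't send request".equals(failure.getMessage()))
                : failure;
    }

    public static void main(String[] args) {
        // run with -ea to enable assertions; both calls pass
        assertExpectedFailure(new NodeClosedException());
        assertExpectedFailure(new TransportException("TransportService is closed stopped can't send request"));
    }
}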

Member Author

I pushed 6afb094.


for (ShardRouting shardRouting : clusterState.routingTable().allShards(index)) {
if (shardRouting.primary()) {
String nodeName = clusterState.nodes().get(shardRouting.currentNodeId()).getName();
internalCluster().restartNode(nodeName, new InternalTestCluster.RestartCallback());
Contributor

perhaps just restart a random node instead of explicitly the one with the primary?

Member Author

Yep, I pushed ab65f4b


dnhatn commented Mar 4, 2019

@ywelsch Thanks for looking. It's ready again.

@dnhatn dnhatn requested a review from ywelsch March 4, 2019 17:40
@dnhatn dnhatn added the v6.6.2 label Mar 4, 2019

@ywelsch ywelsch left a comment


LGTM.

@ywelsch ywelsch removed the v6.6.2 label Mar 5, 2019

dnhatn commented Mar 5, 2019

@ywelsch Thanks!

@dnhatn dnhatn merged commit ecf6af4 into elastic:master Mar 5, 2019
@dnhatn dnhatn deleted the primary-shutdown branch March 5, 2019 16:22

jpountz commented Mar 6, 2019

Thanks for letting me know!

dnhatn added a commit that referenced this pull request Mar 6, 2019
We need to unwrap and use the actual cause when determining whether the node
with the primary shard is shutting down, because TransportService throws a
TransportException wrapped in a SendRequestTransportException.

Relates #39584
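
To illustrate why the unwrapping matters, here is a small self-contained sketch: the sender surfaces a wrapper exception (SendRequestTransportException in the real code) whose cause is the exception we actually want to classify. All types below are local stubs and the helper is hypothetical; the real code relies on Elasticsearch's own exception handling.

public final class UnwrapCauseSketch {

    // local stubs standing in for the real Elasticsearch exception types
    static class NodeClosedException extends RuntimeException {}
    static class TransportException extends RuntimeException {
        TransportException(String message) { super(message); }
        TransportException(String message, Throwable cause) { super(message, cause); }
    }
    static class SendRequestTransportException extends TransportException {
        SendRequestTransportException(Throwable cause) { super("failed to send request", cause); }
    }

    /** Simplified classification: does this failure mean the node is shutting down? */
    static boolean nodeIsClosing(Throwable failure) {
        // the sender wraps the real error, so classify the cause, not the wrapper
        Throwable cause = failure.getCause() != null ? failure.getCause() : failure;
        return cause instanceof NodeClosedException
                || (cause instanceof TransportException
                        && String.valueOf(cause.getMessage()).startsWith("TransportService is closed"));
    }

    public static void main(String[] args) {
        Throwable wrapped = new SendRequestTransportException(
                new TransportException("TransportService is closed stopped can't send request"));
        // the wrapper's own message ("failed to send request") carries no shutdown hint;
        // only the unwrapped cause does
        System.out.println(nodeIsClosing(wrapped)); // true
    }
}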
dnhatn added a commit that referenced this pull request Mar 6, 2019
Today when a replicated write operation fails to execute on a replica,
the primary will reach out to the master to fail that replica (and mark
it stale). We then won't ack that request until the master removes the
failing replica; otherwise, we will lose the acked operation if the
failed replica is still in the in-sync set. However, if a node with the
primary is shutting down, we might ack such request even though we are
unable to send a shard-failure request to the master. This happens
because we ignore NodeClosedException which is triggered when the
ClusterService is being closed.

Closes #39467
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 7, 2019
* 6.7:
  Fix CCR HLRC docs
  Introduce forget follower API (elastic#39718)
  6.6.2 release notes.
  Update release notes for 6.7.0
  Add documentation for min_hash filter (elastic#39671)
  Unmute testIndividualActionsTimeout
  Unmute testFollowIndexAndCloseNode
  Use unwrapped cause to determine if node is closing (elastic#39723)
  Don’t ack if unable to remove failing replica (elastic#39584)
  Wipe Snapshots Before Indices in RestTests (elastic#39662) (elastic#39765)
  Bug fix for AnnotatedTextHighlighter (elastic#39525)
  Fix Snapshot BwC with Version 5.6.x (elastic#39737)
  Fix occasional SearchServiceTests failure (elastic#39697)
  Correct date in daterange-aggregation.asciidoc (elastic#39727)
  Add a note to purge the ingest-geoip plugin (elastic#39553)
dnhatn added a commit that referenced this pull request Mar 8, 2019
If TransportService is stopped before a shard-failure request is sent
but after the request is registered, TransportService notifies
ReplicationOperation with a TransportException whose error message is:
"transport stop, action: internal:cluster/shard/failure".

Relates #39584
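
For reference, a tiny self-contained sketch that recognizes either shutdown-time message quoted in this thread; the helper is hypothetical and only the two message strings come from the discussion above.

public final class TransportStopMessageSketch {

    // both messages indicate the TransportService was already shutting down when
    // the shard-failure request was attempted
    static boolean isShutdownMessage(String message) {
        return message.startsWith("TransportService is closed stopped can't send request")
                || message.startsWith("transport stop, action:");
    }

    public static void main(String[] args) {
        System.out.println(isShutdownMessage("transport stop, action: internal:cluster/shard/failure")); // true
        System.out.println(isShutdownMessage("TransportService is closed stopped can't send request"));  // true
    }
}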
Labels
blocker >bug :Distributed Indexing/Distributed v6.6.3 v6.7.0 v7.0.0-rc2 v7.2.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

FollowerFailOverIT.testFailOverOnFollower fails on CI
5 participants