Propagate shard failures to index task and auto-pause on failures. #56

krishna-ggk · 2021-07-16T04:27:13Z

Description

This commit introduces FailedState for ShardTask. Any failure in the
shard-replication task (even via a nested coroutine) is caught and captured as
FailedState.

The IndexReplicationTask notices the failure while it is in MonitoringState and
triggers a pause. If the pause fails, it marks the overall replication state as
failed.
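
The flow described above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `ShardState`, `IndexOutcome`, `onShardStates`, and the `pause` callback are hypothetical names, and the reason strings only loosely imitate the example output further down.

```kotlin
// Hypothetical, simplified model of the shard/index state handling.
sealed class ShardState {
    object Following : ShardState()
    data class Failed(val cause: Exception) : ShardState()
}

sealed class IndexOutcome {
    object Monitoring : IndexOutcome()
    data class Paused(val reason: String) : IndexOutcome()
    data class Failed(val reason: String) : IndexOutcome()
}

// Called by the index-level task while it monitors its shard tasks.
// `pause` stands in for the pause action (an assumption) and may itself throw.
fun onShardStates(states: List<ShardState>, pause: (String) -> Unit): IndexOutcome {
    val failures = states.filterIsInstance<ShardState.Failed>()
    if (failures.isEmpty()) return IndexOutcome.Monitoring

    val original = failures.joinToString(", ") { "${it.cause.message}" }
    return try {
        pause(original)
        IndexOutcome.Paused(original)
    } catch (e: Exception) {
        // Pausing failed, so the overall replication is marked as failed.
        IndexOutcome.Failed("Pause failed with \"${e.message}\". Original failure - $original")
    }
}
```

With one failed shard, `onShardStates(states) { }` yields `Paused`, while a `pause` callback that throws yields `Failed` with both messages in the reason.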

This also required fixing the nesting of coroutines. The outer coroutine
was being used to trigger the actor and wait for cancellation, which
broke the intuitive parent-child coroutine relationship that nesting provides.
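
To illustrate why the parent-child relationship matters, here is a minimal sketch using kotlinx.coroutines. It is not the code from this PR: `TaskResult`, `runCapturingFailure`, and `demoNestedFailure` are hypothetical names. When all work is launched as children of one `coroutineScope`, a failure in any nested coroutine surfaces at a single try/catch.

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// Hypothetical result type for this sketch.
sealed class TaskResult {
    object Completed : TaskResult()
    data class Failed(val cause: Throwable) : TaskResult()
}

// Runs `block` inside one scope. coroutineScope waits for all children and
// rethrows the first child failure, so nested failures end up in the catch.
suspend fun runCapturingFailure(block: suspend CoroutineScope.() -> Unit): TaskResult =
    try {
        coroutineScope { block() }
        TaskResult.Completed
    } catch (e: CancellationException) {
        throw e // cancellation is not treated as a failure here
    } catch (e: Exception) {
        TaskResult.Failed(e)
    }

// Usage: the exception is thrown two levels down but is still captured.
fun demoNestedFailure(): TaskResult = runBlocking {
    runCapturingFailure {
        launch {          // child
            launch {      // grandchild
                throw IllegalStateException("failed to replay changes")
            }
        }
    }
}
```

If the inner work were started outside the scope's hierarchy instead, the try/catch around `coroutineScope` would not see its failure.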

Signed-off-by: Gopala Krishna Ambareesh [email protected]

Testing

Since there is no failure injection framework yet, I resorted to manual testing and ran the following tests:

1. Ensured existing integ tests passed fully.
2. Injected failure in TranslogSequencer and ensured replication paused and verified the logging.
3. Verified replication resumes after resuming post step 2.
4. Injected failure in TranslogSequencer and in TransportPauseIndexReplicationAction. Verified that the pause is attempted and moves the overall replication to FAILED state.
5. Injected failure in ShardReplicationTask and ensured replication paused.
6. Verified replication resumes after resuming post step 4.

Example output

Paused due to failure

$ curl "$FOLLOWER/_plugins/_replication/$FOLLOWER_INDEX/_status?pretty"
{
  "status" : "PAUSED",
  "reason" : "[[fdgexbtoaq][0] - com.amazon.elasticsearch.replication.ReplicationException - \"failed to replay changes\"], ",
  "remote_cluster" : "source",
  "leader_index" : "ohxnmlauwx",
  "follower_index" : "fdgexbtoaq",
  "syncing_details" : {
    "remote_checkpoint" : 0,
    "local_checkpoint" : 0,
    "seq_no" : 1
  }
}

Failure to pause on failure

$ curl "$FOLLOWER/_plugins/_replication/$FOLLOWER_INDEX/_status?pretty"
{
  "status" : "FAILED",
  "reason" : "Pause failed with \"Index is in restore phase currently for index: zmptafwtwc. You can pause after restore completes.\". Original failure for initiating pause - [[zmptafwtwc][0] - com.amazon.elasticsearch.replication.ReplicationException - \"failed to replay changes\"], ",
  "remote_cluster" : "source",
  "leader_index" : "rzsrrcncmx",
  "follower_index" : "xfpdeazkrx",
  "syncing_details" : {
    "remote_checkpoint" : 1,
    "local_checkpoint" : 0,
    "seq_no" : 1
  }
}

Check List

Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

ankitkala · 2021-07-16T04:49:50Z

LGTM.

One minor comment: We can use e.stackTraceToString() while logging the exceptions everywhere.

src/main/kotlin/com/amazon/elasticsearch/replication/task/shard/ShardReplicationTask.kt

gbbafna

Why are we not pausing in Shard Task itself?

krishna-ggk · 2021-07-18T18:53:54Z

> LGTM.
>
> One minor comment: We can use e.stackTraceToString() while logging the exceptions everywhere.

Sure. However, we should reverse this trend at some point: I realized that with string interpolation we lose the benefit of not evaluating the exception when that log level is disabled.
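
A small sketch of the evaluation-order concern. `TinyLogger` and `countRenders` are hypothetical and exist only to make the effect observable; they are not the logger used in this repository. The interpolated string is built before the logger can check the level, whereas a lambda is only invoked when the level is enabled.

```kotlin
// Hypothetical minimal logger, only to make the evaluation order visible.
class TinyLogger(private val debugEnabled: Boolean) {
    val lines = mutableListOf<String>()
    fun debug(message: String) { if (debugEnabled) lines += message }
    fun debug(message: () -> String) { if (debugEnabled) lines += message() }
}

// Returns how many times the stack trace was rendered for the eager
// (interpolated) call and for the lazy (lambda) call, in that order.
fun countRenders(debugEnabled: Boolean): Pair<Int, Int> {
    val log = TinyLogger(debugEnabled)
    val e = RuntimeException("failed to replay changes")
    var eager = 0
    var lazy = 0
    // Eager: the template is evaluated before debug() is even called.
    log.debug("Shard task failed: ${e.also { eager++ }.stackTraceToString()}")
    // Lazy: the lambda body runs only if debug logging is enabled.
    log.debug { "Shard task failed: ${e.also { lazy++ }.stackTraceToString()}" }
    return eager to lazy
}
```

With debug disabled, `countRenders(false)` returns `(1, 0)`: the eager call still pays for rendering the stack trace, the lazy one does not.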

krishna-ggk · 2021-07-18T18:56:55Z

@gbbafna: Why are we not pausing in Shard Task itself?

Good point - I added a comment in the source as well. The main reason is to keep index-level state transitions in IndexReplicationTask, especially since we may need to resolve ties when multiple shards fail with different exceptions.
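
As one illustration of why this is easier at the index level, the sketch below combines failures from several shards into a single reason string. The policy (report every failure, ordered by shard id) and the names `ShardFailure` and `combinedReason` are assumptions made for this example; the PR may resolve ties differently.

```kotlin
// Hypothetical record of one failed shard.
data class ShardFailure(val index: String, val shard: Int, val cause: Exception)

// Builds a single index-level reason from all shard failures, ordered by
// shard id so the result does not depend on which shard failed first.
fun combinedReason(failures: List<ShardFailure>): String =
    failures
        .sortedBy { it.shard }
        .joinToString(", ") { f ->
            "[[${f.index}][${f.shard}] - ${f.cause.javaClass.name} - \"${f.cause.message}\"]"
        }
```

Only the index-level task sees all shard states at once, so a decision like this has a natural home there.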

saikaranam-amazon reviewed Jul 16, 2021

src/main/kotlin/com/amazon/elasticsearch/replication/task/shard/ShardReplicationTask.kt

gbbafna reviewed Jul 16, 2021

krishna-ggk closed this Jul 18, 2021

krishna-ggk reopened this Jul 18, 2021

saikaranam-amazon approved these changes Jul 19, 2021

ankitkala approved these changes Jul 19, 2021

krishna-ggk merged commit 14c3058 into opensearch-project:main Jul 19, 2021