Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate shard failures to index task and auto-pause on failures. #56

Merged
merged 1 commit into from
Jul 19, 2021
Merged

Propagate shard failures to index task and auto-pause on failures. #56

merged 1 commit into from
Jul 19, 2021

Conversation

krishna-ggk
Copy link
Collaborator

@krishna-ggk krishna-ggk commented Jul 16, 2021

Description

This commit introduces FailedState for ShardTask. Any failure in
shard-replication task (even via nested coroutine) are caught and captured as
FailedState.

The IndexReplicationTask notices the failure while it is in MonitoringState and
triggers a pause. If the pause fails, it marks the overall replication state as
failed.

This also required fixing the nesting of coroutines. The outer coroutine
was being used to trigger actor and waiting for cancellation, which was
breaking intuitive parent-child coroutine relationship via nesting.

Signed-off-by: Gopala Krishna Ambareesh [email protected]

Testing

Since there is no failure injection framework yet, resorted to manual testing. Ran the following manual tests

  1. Ensured existing integ tests passed fully
  2. Injected failure in TranslogSequencer and ensured replication paused and verified the logging.
  3. Verified replication resumes after resuming post step 2.
  4. Injected failure in TranslogSequencer and in TransportPauseIndexReplicationAction. Verified that the pause is attempted and moves the overall replication to FAILED state.
  5. Injected failure in ShardReplicationTask and ensured replication paused.
  6. Verified replication resumes after resuming post step 4.

Example output

Paused due to failure

$ curl "$FOLLOWER/_plugins/_replication/$FOLLOWER_INDEX/_status?pretty"
{
  "status" : "PAUSED",
  "reason" : "[[fdgexbtoaq][0] - com.amazon.elasticsearch.replication.ReplicationException - \"failed to replay changes\"], ",
  "remote_cluster" : "source",
  "leader_index" : "ohxnmlauwx",
  "follower_index" : "fdgexbtoaq",
  "syncing_details" : {
    "remote_checkpoint" : 0,
    "local_checkpoint" : 0,
    "seq_no" : 1
  }
}

Failure to pause on failure

# [email protected] K M in ~/code
$ curl "$FOLLOWER/_plugins/_replication/$FOLLOWER_INDEX/_status?pretty"
{
  "status" : "FAILED",
  "reason" : "Pause failed with \"Index is in restore phase currently for index: zmptafwtwc. You can pause after restore completes.\". Original failure for initiating pause - [[zmptafwtwc][0] - com.amazon.elasticsearch.replication.ReplicationException - \"failed to replay changes\"], ",
  "remote_cluster" : "source",
  "leader_index" : "rzsrrcncmx",
  "follower_index" : "xfpdeazkrx",
  "syncing_details" : {
    "remote_checkpoint" : 1,
    "local_checkpoint" : 0,
    "seq_no" : 1
  }
}

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@ankitkala
Copy link
Member

LGTM.

One minor comment: We can use e.stackTraceToString() while logging the exceptions everywhere.

Copy link
Collaborator

@gbbafna gbbafna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we not pausing in Shard Task itself ?

This commit introduces FailedState for ShardTask. Any failure in
shard-replication task (even via nested coroutine) are caught and captured as
FailedState.

The IndexReplicationTask notices the failure while it is in MonitoringState and
triggers a pause. If the pause fails, it marks the overall replication state as
failed.

This also required fixing the nesting of coroutines. The outer coroutine
was being used to trigger actor and waiting for cancellation, which was
breaking intuitive parent-child coroutine relationship via nesting.

Signed-off-by: Gopala Krishna Ambareesh <[email protected]>
@krishna-ggk
Copy link
Collaborator Author

krishna-ggk commented Jul 18, 2021

LGTM.

One minor comment: We can use e.stackTraceToString() while logging the exceptions everywhere.

Sure. However, we should reverse the trend sometime later as I realized we lose the benefit of non-evaluation of exception due to string-interpolation if that log-level is disabled.

@krishna-ggk krishna-ggk reopened this Jul 18, 2021
@krishna-ggk
Copy link
Collaborator Author

@gbbafna : Why are we not pausing in Shard Task itself ?

Good point - I added a comment in source as well. The main reason is to keep index level state transitions at IndexReplicationTask, especially since there could be need to resolve ties when multiple shards fail with different exceptions.

@krishna-ggk krishna-ggk merged commit 14c3058 into opensearch-project:main Jul 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants