
[Segment Replication] Add new background task to fail stale replica shards. #6850

Merged

Conversation

Rishikesh1159
Member

Description

-> This PR adds a new async task that sends a remote shard failure for lagging replica shards.
-> It works only when segment replication backpressure is enabled.
-> The async task fails any replica shard that is stale (a few checkpoints behind the primary shard) and whose current replication time exceeds twice the max replication time limit, as sketched below.
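
A minimal sketch of the failure condition described above. The type and field names are illustrative assumptions, not the exact classes added in this PR:

import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model of the check the background task performs.
public class StaleReplicaCheck {

    // Simplified per-replica stats; illustrative, not the OpenSearch class.
    record ReplicaStats(String allocationId, long currentReplicationTimeMillis, boolean isStale) {}

    // Default of the real MAX_REPLICATION_TIME_SETTING (configurable in practice).
    static final long MAX_REPLICATION_TIME_MILLIS = TimeUnit.MINUTES.toMillis(5);

    // A replica is failed once it is stale AND its current replication time
    // exceeds twice the max replication time limit.
    static boolean shouldFail(ReplicaStats stats) {
        return stats.isStale()
            && stats.currentReplicationTimeMillis() > 2 * MAX_REPLICATION_TIME_MILLIS;
    }

    public static void main(String[] args) {
        List<ReplicaStats> replicas = List.of(
            new ReplicaStats("r1", TimeUnit.MINUTES.toMillis(11), true),  // failed: stale and past the 2x limit
            new ReplicaStats("r2", TimeUnit.MINUTES.toMillis(7), true)    // kept: still within the 2x window
        );
        replicas.forEach(r -> System.out.println(r.allocationId() + " fail=" + shouldFail(r)));
    }
}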

Issues Resolved

Resolves #6606

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Rishikesh1159 <[email protected]>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.index.SegmentReplicationPressureIT.testFailStaleReplica

@codecov-commenter

codecov-commenter commented Mar 28, 2023

Codecov Report

Merging #6850 (11d82d1) into main (95c6ed9) will decrease coverage by 0.07%.
The diff coverage is 67.69%.

@@             Coverage Diff              @@
##               main    #6850      +/-   ##
============================================
- Coverage     70.78%   70.71%   -0.07%     
+ Complexity    59305    59255      -50     
============================================
  Files          4813     4822       +9     
  Lines        283781   283926     +145     
  Branches      40924    40947      +23     
============================================
- Hits         200864   200788      -76     
- Misses        66420    66637     +217     
- Partials      16497    16501       +4     
Impacted Files Coverage Δ
...ch/index/codec/customcodecs/CustomCodecPlugin.java 0.00% <0.00%> (ø)
...h/index/codec/customcodecs/CustomCodecService.java 0.00% <0.00%> (ø)
.../codec/customcodecs/CustomCodecServiceFactory.java 0.00% <0.00%> (ø)
...ustomcodecs/PerFieldMappingPostingFormatCodec.java 0.00% <0.00%> (ø)
...rc/main/java/org/opensearch/index/IndexModule.java 81.92% <ø> (ø)
...c/main/java/org/opensearch/index/IndexService.java 74.71% <0.00%> (+0.84%) ⬆️
...in/java/org/opensearch/indices/IndicesService.java 64.74% <ø> (+0.70%) ⬆️
...s/replication/SegmentReplicationTargetService.java 48.10% <0.00%> (ø)
.../java/org/opensearch/plugins/IndexStorePlugin.java 100.00% <ø> (ø)
...customcodecs/Lucene95CustomStoredFieldsFormat.java 25.00% <25.00%> (ø)
... and 13 more

... and 481 files with indirect coverage changes

entry.getValue().getReplicaStats()
);
for (SegmentReplicationShardStats staleReplica : staleReplicas) {
if (staleReplica.getCurrentReplicationTimeMillis() > highestCurrentReplicationTimeMillis) {
Member

highestCurrentReplicationTimeMillis is never set to a value other than zero, is it? I think you can use the stream API, something like this:

stats.getShardStats().entrySet().stream()
    .flatMap(entry -> pressureService.getStaleReplicas(entry.getValue().getReplicaStats()).stream()
        .map(r -> Tuple.tuple(entry.getKey(), r.getCurrentReplicationTimeMillis())))
    .max(Comparator.comparingLong(Tuple::v2))
    .map(Tuple::v1)
    .ifPresent(shardId -> {
        ...
    });

Member Author

Sorry, you are right; I missed reassigning highestCurrentReplicationTimeMillis in the if condition. As you suggested, using the stream API makes this entire logic much cleaner. Thanks, I have updated the PR.

Signed-off-by: Rishikesh1159 <[email protected]>
@github-actions
Contributor

github-actions bot commented Apr 4, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationIT.testReplicaHasDiffFilesThanPrimary
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

@andrross
Member

Asking again, but why is the limit implemented as double the "max replication time limit" setting?

@@ -154,4 +182,95 @@ public void setMaxAllowedStaleReplicas(double maxAllowedStaleReplicas) {
public void setMaxReplicationTime(TimeValue maxReplicationTime) {
this.maxReplicationTime = maxReplicationTime;
}

@Override
public void close() throws IOException {
Member

Where does this newly added method get called?

Member Author

Rishikesh1159 commented Apr 4, 2023

Currently, it is not called directly from anywhere. I took the reference from PersistentTasksClusterService, just to make sure that when the service is closed this async task is also closed.
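
A hedged sketch of that lifecycle pattern, with a plain JDK scheduler standing in for OpenSearch's internal thread pool (all names here are illustrative):

import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative only: a service that owns a periodic background task and
// cancels it on close(), mirroring the pattern in PersistentTasksClusterService.
class FailStaleReplicaService implements Closeable {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final ScheduledFuture<?> task;

    FailStaleReplicaService(Runnable failStaleReplicas, long intervalSeconds) {
        // Re-run the stale-replica check on a fixed interval.
        task = scheduler.scheduleAtFixedRate(failStaleReplicas, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    @Override
    public void close() throws IOException {
        task.cancel(false); // stop scheduling further iterations
        scheduler.shutdown();
    }
}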

@Rishikesh1159
Member Author

Rishikesh1159 commented Apr 4, 2023

Asking again, but why is the limit implemented as double the "max replication time limit" setting?

Sorry, I missed this question previously. The "max replication time limit" (MAX_REPLICATION_TIME_SETTING) is the maximum time a replica can take to catch up to the primary shard without triggering the backpressure mechanism. Once both the MAX_REPLICATION_TIME_SETTING and MAX_INDEXING_CHECKPOINTS limits are hit, the backpressure mechanism kicks in: we temporarily stop indexing/write requests to the primary shard so that the replica shards can catch up with all the previous checkpoints.

Currently the default value of MAX_REPLICATION_TIME_SETTING is 5 min; @mch2 came up with this number after running a few benchmarks, but users have the option to change this setting.

Now, to answer your question of why the limit for failing replicas is double the "max replication time limit" setting:

Once the backpressure mechanism kicks in we stop writes to the primary shard, so replicas can catch up. So essentially, if the default of 5 min is used, after backpressure kicks in we give each replica 5 more minutes to finish the replication; if it is not able to finish within those 5 minutes, the replica shard is probably stuck (taking too long), so we just fail the shard and a new shard can recover directly from the primary and catch up.
The reason we wait for only double the MAX_REPLICATION_TIME_SETTING is that we don't want primary shards to sit idle and block write requests for too long. If one replica shard in a replication group is having trouble catching up, then instead of waiting forever for it, it is better to just fail the replica shard and move on.
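
To make the timeline concrete under the 5-minute default (values are illustrative; the setting is configurable):

import java.time.Duration;

// Illustrative timeline under the default MAX_REPLICATION_TIME_SETTING.
public class ReplicationTimeline {
    public static void main(String[] args) {
        Duration maxReplicationTime = Duration.ofMinutes(5);
        // Backpressure starts rejecting writes once a replica is stale past the limit.
        Duration backpressureAt = maxReplicationTime;
        // The background task fails the replica once it is past double the limit.
        Duration failAt = maxReplicationTime.multipliedBy(2);
        System.out.println("backpressure kicks in after " + backpressureAt.toMinutes() + " min");
        System.out.println("stale replica failed after " + failAt.toMinutes() + " min");
    }
}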

@andrross
Member

andrross commented Apr 4, 2023

@Rishikesh1159 Great explanation, thanks! I suggest documenting this in the code, probably either the SegmentReplicationPressureService classdoc or on the SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED setting definition. Something like:

When enabled, writes will be rejected when a replica shard falls behind by both the MAX_REPLICATION_TIME_SETTING time value and MAX_INDEXING_CHECKPOINTS number of checkpoints. Once a shard falls behind double the MAX_REPLICATION_TIME_SETTING time value it will be marked as failed.

@Rishikesh1159
Member Author

@Rishikesh1159 Great explanation, thanks! I suggest documenting this in the code, probably either the SegmentReplicationPressureService classdoc or on the SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED setting definition. Something like:

When enabled, writes will be rejected when a replica shard falls behind by both the MAX_REPLICATION_TIME_SETTING time value and MAX_INDEXING_CHECKPOINTS number of checkpoints. Once a shard falls behind double the MAX_REPLICATION_TIME_SETTING time value it will be marked as failed.

Sure, I will add this code doc in the next commit. Thanks @andrross for your review of this PR.

@Rishikesh1159 Rishikesh1159 merged commit 59e881b into opensearch-project:main Apr 5, 2023
@Rishikesh1159 Rishikesh1159 added the backport 2.x Backport to 2.x branch label Apr 5, 2023
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-6850-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 59e881b705c29b7bf809740e918955349942e397
# Push it to GitHub
git push --set-upstream origin backport/backport-6850-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-6850-to-2.x.

mitrofmep pushed a commit to mitrofmep/OpenSearch that referenced this pull request Apr 5, 2023
[Segment Replication] Add new background task to fail stale replica shards. (opensearch-project#6850)

* Add new background task to fail stale replica shards.

Signed-off-by: Rishikesh1159 <[email protected]>

* Add condition to check if backpressure is enabled.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix failing tests.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix failing tests by adding manual refresh.

Signed-off-by: Rishikesh1159 <[email protected]>

* Address comments on PR.

Signed-off-by: Rishikesh1159 <[email protected]>

* Addressing comments on PR.

Signed-off-by: Rishikesh1159 <[email protected]>

* Update background task logic to fail stale replicas of only one shardId in a single iteration of the background task.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix failing import.

Signed-off-by: Rishikesh1159 <[email protected]>

* Address comments.

Signed-off-by: Rishikesh1159 <[email protected]>

* Add code doc to SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED setting.

Signed-off-by: Rishikesh1159 <[email protected]>

---------

Signed-off-by: Rishikesh1159 <[email protected]>
Signed-off-by: Valentin Mitrofanov <[email protected]>
@dreamer-89
Member

@Rishikesh1159: Looks like the backport workflow has some trouble with backporting. Care to raise a manual backport?

Labels
backport 2.x Backport to 2.x branch skip-changelog
Development

Successfully merging this pull request may close these issues.

[Segment Replication] Send Remote shard failure for lagging replicas