
[Segment Replication] Add new background task to fail stale replica shards. #6850

Merged

Conversation

Rishikesh1159
Member

Description

-> This PR adds a new async task that sends a remote shard failure for lagging replica shards.
-> It works only when segment replication backpressure is enabled.
-> The async task fails any replica shard that is stale (a few checkpoints behind the primary shard) and whose current replication time exceeds twice the max replication time limit, as sketched below.
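
A minimal sketch of the failure condition described above. The type and field names are illustrative assumptions, not the exact classes added in this PR:

import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model of the check the background task performs.
public class StaleReplicaCheck {

    // Simplified per-replica stats; illustrative, not the OpenSearch class.
    record ReplicaStats(String allocationId, long currentReplicationTimeMillis, boolean isStale) {}

    // Default of the real MAX_REPLICATION_TIME_SETTING (configurable in practice).
    static final long MAX_REPLICATION_TIME_MILLIS = TimeUnit.MINUTES.toMillis(5);

    // A replica is failed once it is stale AND its current replication time
    // exceeds twice the max replication time limit.
    static boolean shouldFail(ReplicaStats stats) {
        return stats.isStale()
            && stats.currentReplicationTimeMillis() > 2 * MAX_REPLICATION_TIME_MILLIS;
    }

    public static void main(String[] args) {
        List<ReplicaStats> replicas = List.of(
            new ReplicaStats("r1", TimeUnit.MINUTES.toMillis(11), true),  // failed: stale and past the 2x limit
            new ReplicaStats("r2", TimeUnit.MINUTES.toMillis(7), true)    // kept: still within the 2x window
        );
        replicas.forEach(r -> System.out.println(r.allocationId() + " fail=" + shouldFail(r)));
    }
}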

Issues Resolved

Resolves #6606

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Rishikesh1159 <[email protected]>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.index.SegmentReplicationPressureIT.testFailStaleReplica

@codecov-commenter

codecov-commenter commented Mar 28, 2023

Codecov Report

Merging #6850 (11d82d1) into main (95c6ed9) will decrease coverage by 0.07%.
The diff coverage is 67.69%.

@@             Coverage Diff              @@
##               main    #6850      +/-   ##
============================================
- Coverage     70.78%   70.71%   -0.07%     
+ Complexity    59305    59255      -50     
============================================
  Files          4813     4822       +9     
  Lines        283781   283926     +145     
  Branches      40924    40947      +23     
============================================
- Hits         200864   200788      -76     
- Misses        66420    66637     +217     
- Partials      16497    16501       +4     
Impacted Files Coverage Δ
...ch/index/codec/customcodecs/CustomCodecPlugin.java 0.00% <0.00%> (ø)
...h/index/codec/customcodecs/CustomCodecService.java 0.00% <0.00%> (ø)
.../codec/customcodecs/CustomCodecServiceFactory.java 0.00% <0.00%> (ø)
...ustomcodecs/PerFieldMappingPostingFormatCodec.java 0.00% <0.00%> (ø)
...rc/main/java/org/opensearch/index/IndexModule.java 81.92% <ø> (ø)
...c/main/java/org/opensearch/index/IndexService.java 74.71% <0.00%> (+0.84%) ⬆️
...in/java/org/opensearch/indices/IndicesService.java 64.74% <ø> (+0.70%) ⬆️
...s/replication/SegmentReplicationTargetService.java 48.10% <0.00%> (ø)
.../java/org/opensearch/plugins/IndexStorePlugin.java 100.00% <ø> (ø)
...customcodecs/Lucene95CustomStoredFieldsFormat.java 25.00% <25.00%> (ø)
... and 13 more

... and 481 files with indirect coverage changes

entry.getValue().getReplicaStats()
);
for (SegmentReplicationShardStats staleReplica : staleReplicas) {
if (staleReplica.getCurrentReplicationTimeMillis() > highestCurrentReplicationTimeMillis) {
Member

highestCurrentReplicationTimeMillis is never set to a value other than zero, is it? I think you can use the stream API, something like this:

stats.getShardStats().entrySet().stream()
    .flatMap(entry -> pressureService.getStaleReplicas(entry.getValue().getReplicaStats()).stream()
        .map(r -> Tuple.tuple(entry.getKey(), r.getCurrentReplicationTimeMillis())))
    .max(Comparator.comparingLong(Tuple::v2))
    .map(Tuple::v1)
    .ifPresent(shardId -> {
        ...
    });

Member Author

Sorry, you are right; I missed reassigning highestCurrentReplicationTimeMillis in the if condition. As you suggested, using the stream API makes this entire logic much cleaner. Thanks, I have updated the PR.

Signed-off-by: Rishikesh1159 <[email protected]>
@github-actions
Contributor

github-actions bot commented Apr 4, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationIT.testReplicaHasDiffFilesThanPrimary
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

@andrross
Member

Asking again, but why is the limit implemented as double the "max replication time limit" setting?

@@ -154,4 +182,95 @@ public void setMaxAllowedStaleReplicas(double maxAllowedStaleReplicas) {
public void setMaxReplicationTime(TimeValue maxReplicationTime) {
this.maxReplicationTime = maxReplicationTime;
}

@Override
public void close() throws IOException {
Member

Where does this newly added method get called?

Member Author

Rishikesh1159 commented Apr 4, 2023

Currently, it is not called directly from anywhere. I took the reference from PersistentTasksClusterService, just to make sure that when the service is closed this async task is also closed.
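
A hedged sketch of that lifecycle pattern, with a plain JDK scheduler standing in for OpenSearch's internal thread pool (all names here are illustrative):

import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative only: a service that owns a periodic background task and
// cancels it on close(), mirroring the pattern in PersistentTasksClusterService.
class FailStaleReplicaService implements Closeable {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final ScheduledFuture<?> task;

    FailStaleReplicaService(Runnable failStaleReplicas, long intervalSeconds) {
        // Re-run the stale-replica check on a fixed interval.
        task = scheduler.scheduleAtFixedRate(failStaleReplicas, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    @Override
    public void close() throws IOException {
        task.cancel(false); // stop scheduling further iterations
        scheduler.shutdown();
    }
}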

@Rishikesh1159
Member Author

Rishikesh1159 commented Apr 4, 2023

Asking again, but why is the limit implemented as double the "max replication time limit" setting?

Sorry, I missed this question previously. The "max replication time limit" (MAX_REPLICATION_TIME_SETTING) is the maximum time a replica can take to catch up to the primary shard without triggering the backpressure mechanism. Once both the MAX_REPLICATION_TIME_SETTING and MAX_INDEXING_CHECKPOINTS limits are hit, the backpressure mechanism kicks in: we temporarily stop indexing/write requests to the primary shard so that the replica shards can catch up with all the previous checkpoints.

Currently the default value of MAX_REPLICATION_TIME_SETTING is 5 min; @mch2 came up with this number after running a few benchmarks, but users have the option to change this setting.

Now, to answer your question of why the limit for failing replicas is double the "max replication time limit" setting:

Once the backpressure mechanism kicks in we stop writes to the primary shard, so replicas can catch up. So essentially, if the default of 5 min is used, after backpressure kicks in we give each replica 5 more minutes to finish the replication; if it is not able to finish within those 5 minutes, the replica shard is probably stuck (taking too long), so we just fail the shard and a new shard can recover directly from the primary and catch up.
The reason we wait for only double the MAX_REPLICATION_TIME_SETTING is that we don't want primary shards to sit idle and block write requests for too long. If one replica shard in a replication group is having trouble catching up, then instead of waiting forever for it, it is better to just fail the replica shard and move on.
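
To make the timeline concrete under the 5-minute default (values are illustrative; the setting is configurable):

import java.time.Duration;

// Illustrative timeline under the default MAX_REPLICATION_TIME_SETTING.
public class ReplicationTimeline {
    public static void main(String[] args) {
        Duration maxReplicationTime = Duration.ofMinutes(5);
        // Backpressure starts rejecting writes once a replica is stale past the limit.
        Duration backpressureAt = maxReplicationTime;
        // The background task fails the replica once it is past double the limit.
        Duration failAt = maxReplicationTime.multipliedBy(2);
        System.out.println("backpressure kicks in after " + backpressureAt.toMinutes() + " min");
        System.out.println("stale replica failed after " + failAt.toMinutes() + " min");
    }
}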

@andrross
Member

andrross commented Apr 4, 2023

@Rishikesh1159 Great explanation, thanks! I suggest documenting this in the code, probably either the SegmentReplicationPressureService classdoc or on the SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED setting definition. Something like:

When enabled, writes will be rejected when a replica shard falls behind by both the MAX_REPLICATION_TIME_SETTING time value and MAX_INDEXING_CHECKPOINTS number of checkpoints. Once a shard falls behind double the MAX_REPLICATION_TIME_SETTING time value it will be marked as failed.

@Rishikesh1159
Member Author

@Rishikesh1159 Great explanation, thanks! I suggest documenting this in the code, probably either the SegmentReplicationPressureService classdoc or on the SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED setting definition. Something like:

When enabled, writes will be rejected when a replica shard falls behind by both the MAX_REPLICATION_TIME_SETTING time value and MAX_INDEXING_CHECKPOINTS number of checkpoints. Once a shard falls behind double the MAX_REPLICATION_TIME_SETTING time value it will be marked as failed.

Sure, I will add this code doc in the next commit. Thanks @andrross for your review of this PR.

@Rishikesh1159 Rishikesh1159 merged commit 59e881b into opensearch-project:main Apr 5, 2023
@Rishikesh1159 Rishikesh1159 added the backport 2.x Backport to 2.x branch label Apr 5, 2023
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-6850-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 59e881b705c29b7bf809740e918955349942e397
# Push it to GitHub
git push --set-upstream origin backport/backport-6850-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-6850-to-2.x.

mitrofmep pushed a commit to mitrofmep/OpenSearch that referenced this pull request Apr 5, 2023
[Segment Replication] Add new background task to fail stale replica shards. (opensearch-project#6850)

* Add new background task to fail stale replica shards.

Signed-off-by: Rishikesh1159 <[email protected]>

* Add condition to check if backpressure is enabled.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix failing tests.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix failing tests by adding manual refresh.

Signed-off-by: Rishikesh1159 <[email protected]>

* Address comments on PR.

Signed-off-by: Rishikesh1159 <[email protected]>

* Addressing comments on PR.

Signed-off-by: Rishikesh1159 <[email protected]>

* Update background task logic to fail stale replicas of only one shardId in a single iteration of the background task.

Signed-off-by: Rishikesh1159 <[email protected]>

* Fix failing import.

Signed-off-by: Rishikesh1159 <[email protected]>

* Address comments.

Signed-off-by: Rishikesh1159 <[email protected]>

* Add code doc to SEGMENT_REPLICATION_INDEXING_PRESSURE_ENABLED setting.

Signed-off-by: Rishikesh1159 <[email protected]>

---------

Signed-off-by: Rishikesh1159 <[email protected]>
Signed-off-by: Valentin Mitrofanov <[email protected]>
@dreamer-89
Member

@Rishikesh1159: Looks like the backport workflow has some trouble with backporting. Care to raise a manual backport?

Labels
backport 2.x Backport to 2.x branch skip-changelog
Development

Successfully merging this pull request may close these issues.

[Segment Replication] Send Remote shard failure for lagging replicas