[Segment Replication] Trigger a round of replication for replica shards during peer recovery when segment replication is enabled #5332
Conversation
Codecov Report

```diff
@@             Coverage Diff              @@
##               main    #5332      +/-   ##
============================================
- Coverage     71.06%   70.93%     -0.14%
+ Complexity    58136    58092        -44
============================================
  Files          4704     4704
  Lines        277244   277270        +26
  Branches      40137    40142         +5
============================================
- Hits         197025   196669       -356
- Misses        64095    64544       +449
+ Partials      16124    16057        -67
```
```java
for (int i = 0; i < 10; i++) {
    client().prepareIndex(INDEX_NAME).setId(Integer.toString(i)).setSource("field", "value" + i).execute().actionGet();
}
logger.info("--> flush so we have an actual index");
```
By `actual index`, did you mean index/segment files on disk?
Yes, let me change the terminology here; `actual index` might be confusing.
```java
 */
public void testAddNewReplica() throws Exception {
    logger.info("--> starting [node1] ...");
    final String node_1 = internalCluster().startNode();
```
nit: I usually find it better to name nodes by their role. That makes it easier to follow node-specific actions (e.g. restart(primary), stop(replica), etc.); otherwise we have to look back to see when node_i was created and what its role is.
Sure, makes sense. I will rename both nodes accordingly
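For illustration, a minimal sketch of that renaming (variable names are hypothetical; `internalCluster()` is the internal-test-cluster helper already used in this test):

```java
// Name the test nodes by role so later node-specific actions read clearly.
final String primary = internalCluster().startNode();   // will host the primary shard
final String replica = internalCluster().startNode();   // will host the replica shard
```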
```java
// is marked as Started.
if (indexShard.indexSettings().isSegRepEnabled()
    && shardRouting.primary() == false
    && ShardRoutingState.RELOCATING != shardRouting.state()) {
```
Should this condition be `ShardRoutingState.STARTED == shardRouting.state()`? The existing condition also applies to UNASSIGNED and INITIALIZING shards; is that correct?
No, I think `ShardRoutingState.RELOCATING != shardRouting.state()` is an edge-case check so that a relocating shard doesn't receive any checkpoints.
`ShardRoutingState.STARTED == shardRouting.state()` will be false at this point, because we perform a round of replication before marking the shard as STARTED, so the shard routing will never be in the STARTED state here.
Yes, the existing condition works for the INITIALIZING routing state, which is the state the shard routing will be in at this point. I'm not sure the state can be UNASSIGNED; after peer recovery completes, the shard routing is usually INITIALIZING.
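To make the state reasoning above concrete, here is a hedged sketch of the guard (adapted from the diff quoted above; the comments are editorial):

```java
// At this point peer recovery has finished but the shard is not yet marked
// STARTED, so the routing state is INITIALIZING. Skip primaries, and skip
// RELOCATING shards so a relocating shard never receives checkpoints here.
if (indexShard.indexSettings().isSegRepEnabled()
    && shardRouting.primary() == false
    && ShardRoutingState.RELOCATING != shardRouting.state()) {
    // force one round of segment replication before marking the shard STARTED
}
```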
```java
    )
);
if (sendShardFailure == true) {
    logger.error("replication failure", e);
```
nit: These are logged at debug level in the failShard call. Maybe we can remove it from here.
Minor, but I would change the commit message/PR title to explain what you've done, as opposed to the side effect you're fixing. Something like "Trigger a round of replication during recovery" or whatever makes sense. In the description you can describe the bug you're fixing and any other details, but the message header should be a clear and concise description of what is changed.
Thanks @andrross for pointing that out. Sure, what you said makes sense. I will update the commit message and PR title.
…ication is enabled. Signed-off-by: Rishikesh1159 <[email protected]>
Merge branch 'seg-rep/force-replication' of https://github.com/Rishikesh1159/OpenSearch into seg-rep/force-replication
Signed-off-by: Rishikesh1159 <[email protected]>
Thanks @Rishikesh1159 for this quick fix. LGTM!
```java
IndexShard indexShard = (IndexShard) indexService.getShardOrNull(shardRouting.id());
// For Segment Replication enabled indices, we want replica shards to start a replication event to fetch latest segments before it
// is marked as Started.
if (indexShard.indexSettings().isSegRepEnabled()
```
You will need a null check here, given you are invoking `getShardOrNull` above.
Thanks for pointing this out. Sure, I will add a null check.
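A minimal sketch of that guard (hedged; the early return stands in for whatever control flow the surrounding handler actually uses):

```java
IndexShard indexShard = (IndexShard) indexService.getShardOrNull(shardRouting.id());
if (indexShard == null) {
    // The shard is not (or no longer) allocated on this node; nothing to do.
    return;
}
```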
```java
final String primary = internalCluster().startNode();

logger.info("--> creating test index ...");
prepareCreate(INDEX_NAME, Settings.builder().put("index.number_of_shards", 1).put("index.number_of_replicas", 1)).get();
```
Please use the actual setting constants instead of strings, e.g. `IndexMetadata.SETTING_NUMBER_OF_SHARDS`.
Yes sure
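A sketch of the suggested change, assuming the standard `IndexMetadata` setting constants:

```java
import org.opensearch.cluster.metadata.IndexMetadata;

// Use the setting constants rather than raw strings.
prepareCreate(
    INDEX_NAME,
    Settings.builder()
        .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)
        .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 1)
).get();
```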
```java
 * We don't perform any refresh on index and assert new replica shard on doc hit count.
 * This test makes sure that when a new replica is added to an existing cluster it gets all latest segments from primary even without a refresh.
 */
public void testAddNewReplica() throws Exception {
```
This test is very similar to `testStartReplicaAfterPrimaryIndexesDocs`; can we reuse that test? That test currently indexes a doc after the replica is recovered to force another round of replication, but you could assert that the doc count is sync'd on line 412, after ensureGreen().
Yes, I think you're right. Let me see if we can reuse it.
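If the tests are merged, the added assertion could look roughly like this (a hedged sketch: `replica` and the doc count of 10 come from this PR's test, and `assertHitCount` is the usual OpenSearchAssertions helper):

```java
ensureGreen(INDEX_NAME);
// The new replica should already have the primary's latest segments,
// even though no refresh was performed after it was added.
assertHitCount(client(replica).prepareSearch(INDEX_NAME).setSize(0).setPreference("_only_local").get(), 10);
```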
A few small changes are required here, particularly the null check in `handleRecoveryDone`.
```java
IndexShard indexShard = (IndexShard) indexService.getShardOrNull(shardRouting.id());
// For Segment Replication enabled indices, we want replica shards to start a replication event to fetch latest segments before it
// is marked as Started.
if (indexShard.indexSettings().isSegRepEnabled()
```
You could also read the setting from the index settings before fetching a reference to the IndexShard: `indexService.getIndexSettings().isSegRepEnabled()`.
Sure I can add that
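A hedged sketch of that reordering (method names as discussed in this thread; surrounding logic abbreviated):

```java
// Consult the index-level setting first, so we only dereference the shard
// on segment-replication-enabled indices.
if (indexService.getIndexSettings().isSegRepEnabled() == false) {
    // non-segrep indices keep the existing start-shard path
} else {
    IndexShard indexShard = (IndexShard) indexService.getShardOrNull(shardRouting.id());
    // ... null check plus the replica/RELOCATING checks discussed above
}
```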
```java
);
if (sendShardFailure == true) {
    logger.error("replication failure", e);
    indexShard.failShard("replication failure", e);
```
I think we can reuse `handleRecoveryFailure` here instead of this added block.
Err, sorry, I'm off here. We'll need both: `indexShard.failShard("replication failure", e);`, which fails the engine, followed by `handleRecoveryFailure`, which removes the shard.
On that note, could you please add a test here for the failure case?
Yes, this is important. Thanks for catching this. I will update it and add a unit/integration test for the failure case.
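A hedged sketch of the failure path being agreed on here (names from this thread; the exact merged code may differ):

```java
// On replication failure: fail the engine first, then remove the shard.
if (sendShardFailure) {
    logger.error("replication failure", e);
    indexShard.failShard("replication failure", e);   // fails the engine
}
handleRecoveryFailure(shardRouting, sendShardFailure, e); // removes the shard
```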
Signed-off-by: Rishikesh1159 <[email protected]>
Signed-off-by: Rishikesh1159 <[email protected]>
Signed-off-by: Rishikesh1159 <[email protected]>
Signed-off-by: Rishikesh1159 <[email protected]>
Only a nit so approving. Thanks for this change.
```java
        );
    }
} else {
    shardStateAction.shardStarted(
```
Nit: this is now invoked 3x. You could clean this up by using a StepListener that, when complete, marks the shard as started.
Thanks @mch2, sure, I can do that.
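A hedged sketch of the StepListener idea (the wiring is illustrative, not the exact merged code; `SHARD_STATE_ACTION_LISTENER` stands in for whatever listener the surrounding method already passes to shardStarted):

```java
// Complete one listener from all three code paths instead of calling
// shardStateAction.shardStarted(...) in three places.
final StepListener<Void> forcedSegRepListener = new StepListener<>();
forcedSegRepListener.whenComplete(
    r -> shardStateAction.shardStarted(shardRouting, primaryTerm, "after recovery", SHARD_STATE_ACTION_LISTENER),
    e -> handleRecoveryFailure(shardRouting, true, e)
);
// Paths that don't force a replication round complete the listener directly:
forcedSegRepListener.onResponse(null);
```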
Signed-off-by: Rishikesh1159 <[email protected]>
The backport to `2.x` failed.

To backport manually, run these commands in your terminal:

```bash
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-5332-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 0cf67979064c6c8be95299911db0d1bf1ea5ed68
# Push it to GitHub
git push --set-upstream origin backport/backport-5332-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x
```

Then, create a pull request where the base branch is `2.x` and the compare/head branch is `backport/backport-5332-to-2.x`.
Signed-off-by: Rishikesh1159 <[email protected]>
…ds during peer recovery when segment replication is enabled (#5332)

* Fix new added replica shards falling behind primary.
* Trigger a round of replication during peer recovery when segment replication is enabled.
* Remove unnecessary start replication overloaded method.
* Add test for failure case and refactor some code.
* Apply spotless check.
* Addressing comments on the PR.
* Remove unnecessary condition check.
* Apply spotless check.
* Add step listeners to resolve forcing round of segment replication.

Signed-off-by: Rishikesh1159 <[email protected]>
Signed-off-by: Rishikesh1159 <[email protected]>
…ature/identity (#5581)

* Fix flaky ShardIndexingPressureConcurrentExecutionTests (#5439): add conditional check on assertNull to fix flaky tests. Signed-off-by: Rishikesh1159 <[email protected]>
* Fix bwc for cluster manager throttling settings (#5305). Signed-off-by: Dhwanil Patel <[email protected]>
* Update ingest-attachment plugin dependencies: Apache Tika 3.6.0, Apache Mime4j 0.8.8, Apache Poi 5.2.3, Apache PdfBox 2.0.27 (#5448). Signed-off-by: Andriy Redko <[email protected]>
* Enhance CheckpointState to support no-op replication (#5282). Signed-off-by: Ashish Singh <[email protected]> Co-authored-by: Bukhtawar Khan <[email protected]>
* [BUG] org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials} flaky: randomizing basePath (#5482). Signed-off-by: Andriy Redko <[email protected]>
* [Bug] fix case sensitivity for wildcard queries (#5462): fixes the wildcard query to not normalize the pattern when case_insensitive is set by the user. This is achieved by creating a new normalizedWildcardQuery method so that query_string queries (which do not support case sensitivity) can still normalize the pattern when the default analyzer is used, maintaining existing behavior. Signed-off-by: Nicholas Walter Knize <[email protected]>
* Support OpenSSL Provider with default Netty allocator (#5460). Signed-off-by: Andriy Redko <[email protected]>
* Revert "build no-jdk distributions as part of release build (#4902)" (#5465): this reverts commit 8c9ca4e. It seems that this wasn't entirely the correct way and is currently blocking us from removing the `build.sh` from the `opensearch-build` repository (i.e. this `build.sh` here is not yet being used). See the discussion in opensearch-project/opensearch-build#2835 for further details. Signed-off-by: Ralph Ursprung <[email protected]>
* Add max_shard_size parameter for Shrink API (fix supported version after backport) (#5503). Signed-off-by: Andriy Redko <[email protected]>
* Sync CODEOWNERS with MAINTAINERS. (#5501) Signed-off-by: Daniel (dB.) Doubrovkine <[email protected]>
* Added jackson dependency to server (#5366): updated CHANGELOG and build.gradle files, and added RuntimePermission to fix errors. Signed-off-by: Ryan Bogan <[email protected]>
* Fix flaky test BulkIntegrationIT.testDeleteIndexWhileIndexing (#5491). Signed-off-by: Poojita Raj <[email protected]>
* Add release notes for 2.4.1 (#5488). Signed-off-by: Xue Zhou <[email protected]>
* Properly skip OnDemandBlockSnapshotIndexInputTests.testVariousBlockSize on Windows (#5511): PR #5397 skipped this test in the @before block but still frequently throws a TestCouldNotBeSkippedException. This is caused by the after block still executing and throwing an exception while cleaning the directory created at the path in @before. Moving the assumption to the individual test prevents this exception by ensuring the path exists. Signed-off-by: Marc Handalian <[email protected]>
* Merge first batch of feature/extensions into main (#5347): fixed CHANGELOG and newline errors, refactored extension loading into a private method, removed skipValidation and added a connectToExtensionNode method, removed unnecessary feature flag calls, renaming and exception handling, changed latches to CompletableFuture, removed an unnecessary validateSettingKey call, fixed the azure-core dependency, removed dynamic settings registration and the info() method, added a functioning NoopExtensionsManager with javadoc, removed a forbidden API, changed logger.info to logger.error in handleException, fixed ExtensionsManagerTests, applied spotless, and updated SHAs. Signed-off-by: Ryan Bogan <[email protected]>
* Bump commons-compress from 1.21 to 1.22 (#5520). Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* [Segment Replication] Trigger a round of replication for replica shards during peer recovery when segment replication is enabled (#5332): fix new added replica shards falling behind primary, trigger a round of replication during peer recovery, remove an unnecessary startReplication overload, add a test for the failure case, and add step listeners to resolve forcing a round of segment replication. Signed-off-by: Rishikesh1159 <[email protected]>
* Adding support to register settings dynamically (#5495): updated CHANGELOG, removed unnecessary registerSetting methods, changed setting registration order, added an unregisterSettings method, and removed an unnecessary feature flag. Signed-off-by: Ryan Bogan <[email protected]>
* Updated 1.3.7 release notes date (#5536). Signed-off-by: owaiskazi19 <[email protected]>
* Pre conditions check before updating weighted routing metadata (#4955): allow weight updates for a non-decommissioned attribute. Signed-off-by: Rishab Nahata <[email protected]>
* Atomically update cluster state with decommission status and corresponding action (#5093): update the cluster state with the decommission status and its corresponding action in the same execute call. Signed-off-by: Rishab Nahata <[email protected]>
* Update Netty to 4.1.86.Final (#5529). Signed-off-by: Andriy Redko <[email protected]>
* Update release date in 2.4.1 release notes (#5549). Signed-off-by: Suraj Singh <[email protected]>
* Update 2.4.1 release notes (#5552). Signed-off-by: Andriy Redko <[email protected]>
* Refactor fuzziness interface on query builders (#5433): refactor Object to Fuzziness type for all query builders, revise for bwc, update change log. Signed-off-by: noCharger <[email protected]> Co-authored-by: Daniel (dB.) Doubrovkine <[email protected]>
* Upgrade lucene version (#5570): added bwc version 2.4.2, updated the Lucene snapshot to 9.5.0-snapshot-d5cef1c, updated the changelog entry, and made internal changes post lucene upgrade. Signed-off-by: Daniel (dB.) Doubrovkine <[email protected]> Signed-off-by: Suraj Singh <[email protected]> Co-authored-by: opensearch-ci-bot <[email protected]>
* Add CI bundle pattern to distribution download (#5348): add CI bundle pattern for the ivy repo, gradle update, extract path, change with customDistributionDownloadType, add a default for exception handling, and add documentation. Signed-off-by: Zelin Hao <[email protected]>
* Bump protobuf-java from 3.21.9 to 3.21.11 in /plugins/repository-hdfs (#5519): bumps [protobuf-java](https://github.com/protocolbuffers/protobuf) from 3.21.9 to 3.21.11 ([release notes](https://github.com/protocolbuffers/protobuf/releases), [changelog](https://github.com/protocolbuffers/protobuf/blob/main/generate_changelog.py), [commits](protocolbuffers/protobuf@v3.21.9...v3.21.11)); updating SHAs and changelog. Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Owais Kazi <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Suraj Singh <[email protected]>

Co-authored-by: Rishikesh Pasham <[email protected]>, Dhwanil Patel <[email protected]>, Andriy Redko <[email protected]>, Ashish <[email protected]>, Nick Knize <[email protected]>, Ralph Ursprung <[email protected]>, Daniel (dB.) Doubrovkine <[email protected]>, Ryan Bogan <[email protected]>, Poojita Raj <[email protected]>, Xue Zhou <[email protected]>, Marc Handalian <[email protected]>, Owais Kazi <[email protected]>, Rishab Nahata <[email protected]>, Suraj Singh <[email protected]>, Louis Chu <[email protected]>, opensearch-ci-bot <[email protected]>, Zelin Hao <[email protected]>, dependabot[bot] <dependabot[bot]@users.noreply.github.com>
Description
This PR adds logic to trigger a round of replication during peer recovery, before the shard is marked as STARTED. It fixes the bug where, with segment replication enabled, newly added replica shards fall behind the primary shard until an operation is performed on the index. More detail about the bug is in issue #5313.
Solution used to fix the bug
With segment replication enabled, when a new replica is added to the cluster it goes through the peer recovery process. After peer recovery completes, and before the replica shard is marked as STARTED, we trigger a replication event on the replica to copy all the latest segments from the primary shard.
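A hedged sketch of that flow (the listener-based `startReplication` call is an illustrative simplification of the target-service API; names such as `segmentReplicationTargetService` follow this thread, and the exact merged code differs in detail):

```java
if (indexService.getIndexSettings().isSegRepEnabled()
    && shardRouting.primary() == false
    && ShardRoutingState.RELOCATING != shardRouting.state()) {
    // New replica: force one round of segment replication so it catches up
    // to the primary's latest segments before being marked STARTED.
    segmentReplicationTargetService.startReplication(indexShard, ActionListener.wrap(r -> {
        shardStateAction.shardStarted(shardRouting, primaryTerm, "after forced segment replication", SHARD_STATE_ACTION_LISTENER);
    }, e -> {
        indexShard.failShard("replication failure", e);  // fail the engine first
        handleRecoveryFailure(shardRouting, true, e);    // then remove the shard
    }));
} else {
    shardStateAction.shardStarted(shardRouting, primaryTerm, "after recovery", SHARD_STATE_ACTION_LISTENER);
}
```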
Issues Resolved
Resolves #5313
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.