-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Segment Replication] For replica recovery, force segment replication sync from peer recovery source #5746
[Segment Replication] For replica recovery, force segment replication sync from peer recovery source #5746
Conversation
Gradle Check (Jenkins) Run Completed with:
|
There was a 2.x version bump to [Edit]: This is already done in #5745. This PR needs rebase against
|
0d133e6
to
ce607cb
Compare
This comment was marked as outdated.
This comment was marked as outdated.
2d7ff7b
to
300f141
Compare
As this PR change recovery flow for replica, it also needed changes in unit tests, which were failing in last gradle check. There are still 3 legitimate unit test failures related to
|
This comment was marked as outdated.
This comment was marked as outdated.
300f141
to
60ca0ac
Compare
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
60ca0ac
to
00ee972
Compare
Gradle Check (Jenkins) Run Completed with:
|
00ee972
to
c1837d6
Compare
Gradle Check (Jenkins) Run Completed with:
|
@@ -847,7 +847,7 @@ protected DiscoveryNode getFakeDiscoNode(String id) { | |||
} | |||
|
|||
protected void recoverReplica(IndexShard replica, IndexShard primary, boolean startReplica) throws IOException { | |||
recoverReplica(replica, primary, startReplica, (a) -> null); | |||
recoverReplica(replica, primary, startReplica, getReplicationFunc(replica)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love this add to our tests!
...ernalClusterTest/java/org/opensearch/indices/replication/SegmentReplicationRelocationIT.java
Outdated
Show resolved
Hide resolved
...ernalClusterTest/java/org/opensearch/indices/replication/SegmentReplicationRelocationIT.java
Show resolved
Hide resolved
.actionGet(); | ||
assertTrue(clusterHealthResponse.isTimedOut()); | ||
ensureYellow(INDEX_NAME); | ||
IndicesService indicesService = internalCluster().getInstance(IndicesService.class, replicaNode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the flakiness in these tests is asserting that IndicesService exists. After the round of SR fails the shard will fail and node will be yellow, it will then try and spin up and recover the shard again, causing this assertion to trip?
to confirm you could maybe flip this assertion to true and wrap it in an assertBusy and see that it always succeeds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the flakiness in these tests is asserting that IndicesService exists.
Yes, it was flaky and fails asserting index on replica doesn't exist.
After the round of SR fails the shard will fail and node will be yellow, it will then try and spin up and recover the shard again, causing this assertion to trip?
Bingo, yes you are right. The recovery kicks in again post failure. Modified the test to wait for first round of recovery before assertion.
… sync from source Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Suraj Singh <[email protected]>
c1837d6
to
a19dccc
Compare
Gradle Check (Jenkins) Run Completed with:
|
Signed-off-by: Suraj Singh <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks!
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-5746-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 53d54de90423e31af7ca2514b953d77df4ac2be4
# Push it to GitHub
git push --set-upstream origin backport/backport-5746-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x Then, create a pull request where the |
… sync from peer recovery source (opensearch-project#5746) * [Segment Replication] For replica recovery, force segment replication sync from source Signed-off-by: Suraj Singh <[email protected]> * Rebase against main Signed-off-by: Suraj Singh <[email protected]> * Fix unit test Signed-off-by: Suraj Singh <[email protected]> * PR feedback Signed-off-by: Suraj Singh <[email protected]> * Fix remote store recovery test Signed-off-by: Suraj Singh <[email protected]> --------- Signed-off-by: Suraj Singh <[email protected]>
… sync from peer recovery source (#5746) (#6149) * [Segment Replication] For replica recovery, force segment replication sync from source * Rebase against main * Fix unit test * PR feedback * Fix remote store recovery test --------- Signed-off-by: Suraj Singh <[email protected]>
… sync from peer recovery source (opensearch-project#5746) * [Segment Replication] For replica recovery, force segment replication sync from source Signed-off-by: Suraj Singh <[email protected]> * Rebase against main Signed-off-by: Suraj Singh <[email protected]> * Fix unit test Signed-off-by: Suraj Singh <[email protected]> * PR feedback Signed-off-by: Suraj Singh <[email protected]> * Fix remote store recovery test Signed-off-by: Suraj Singh <[email protected]> --------- Signed-off-by: Suraj Singh <[email protected]>
Description
This change is a clean up and improvement during peer recovery with segment replication. The change was originally introduced in #5332. This change performs
IndicesClusterStateService
fromSegmentReplicationTargetService
SegmentReplicationSourceHandlerTests
to use correct engine factory and updates existing unit test classes to consider replica recoverytestNewlyAddedReplicaIsUpdated
which verifies happy path scenario. The change moves this test along withtestAddNewReplicaFailure
toSegmentReplicationRelocationIT
which is more natural class for these ITs. Both tests are muted as they tend to fail on CI. These relocation test failures are tracked in [BUG] failing IT test : SegmentReplicationRelocationIT #6065Issues Resolved
#5743
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.