
[Segment Replication] For replica recovery, force segment replication sync from peer recovery source #5746

Merged

Conversation

@dreamer-89 dreamer-89 commented Jan 7, 2023

Description

This change is a cleanup and an improvement to peer recovery with segment replication; it was originally introduced in #5332. This change:

  • Unties IndicesClusterStateService from SegmentReplicationTargetService.
  • Fails fast during recovery.
  • Updates SegmentReplicationSourceHandlerTests to use the correct engine factory, and updates existing unit test classes to account for replica recovery.
  • Adds one integration test, testNewlyAddedReplicaIsUpdated, which verifies the happy-path scenario. The change moves this test, along with testAddNewReplicaFailure, to SegmentReplicationRelocationIT, which is a more natural class for these ITs. Both tests are muted because they tend to fail on CI; these relocation test failures are tracked in [BUG] failing IT test: SegmentReplicationRelocationIT #6065.
  • Fixes some unit test assertions identified while making this change.
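The recovery flow described above can be sketched roughly as follows. This is a hypothetical illustration, not OpenSearch's actual classes or signatures (ReplicationSource, finalizeRecovery, and the rest are made up): after peer recovery copies files from the source, a round of segment replication is forced from that same source, and the shard fails fast instead of starting if the sync leaves it behind.

```java
// Hypothetical sketch of forcing a segment-replication round after peer
// recovery; class and method names are illustrative, not OpenSearch's API.
import java.util.function.Consumer;

public class RecoverySketch {
    interface ReplicationSource { long latestSegmentGeneration(); }

    static class Replica {
        long segmentGeneration = -1;
        boolean started = false;
    }

    // Peer recovery finishes by forcing a sync from the same source node,
    // failing fast if the replica is still behind afterwards.
    static void finalizeRecovery(Replica replica, ReplicationSource source,
                                 Consumer<Replica> forcedSync) {
        forcedSync.accept(replica);           // forced segrep round
        if (replica.segmentGeneration < source.latestSegmentGeneration()) {
            throw new IllegalStateException("replica behind source; fail fast");
        }
        replica.started = true;               // only now is the shard started
    }

    public static void main(String[] args) {
        ReplicationSource source = () -> 7L;
        Replica replica = new Replica();
        finalizeRecovery(replica, source, r -> r.segmentGeneration = 7L);
        System.out.println(replica.started); // prints "true"
    }
}
```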

Issues Resolved

#5743

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


github-actions bot commented Jan 7, 2023

Gradle Check (Jenkins) Run Completed with:


dreamer-89 commented Jan 8, 2023

Gradle Check (Jenkins) Run Completed with:

There was a 2.x version bump from 2.5.0 to 2.6.0 in 06c1712. This also needs a change in main.

[Edit]: This is already done in #5745. This PR needs a rebase against main.

FAILURE: Build completed with 2 failures.

1: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':distribution:bwc:minor:buildBwcLinuxTar'.
> Building 2.5.0 didn't generate expected file /var/jenkins/workspace/gradle-check/search/distribution/bwc/minor/build/bwc/checkout-2.x/distribution/archives/linux-tar/build/distributions/opensearch-min-2.5.0-SNAPSHOT-linux-x64.tar.gz

@dreamer-89 dreamer-89 force-pushed the update_replica_recovery_segrep branch from 0d133e6 to ce607cb Compare January 8, 2023 02:53

@dreamer-89 dreamer-89 force-pushed the update_replica_recovery_segrep branch 2 times, most recently from 2d7ff7b to 300f141 Compare January 9, 2023 20:00
@dreamer-89

As this PR changes the recovery flow for replicas, it also needed changes in unit tests, which were failing in the last gradle check. There are still 3 legitimate unit test failures related to SegmentReplicationIndexShardTests which need a deeper dive. These unit tests fail on doc count assertions: the target's translog contains all documents rather than only the ones ingested after the round of segment replication.

assertEquals(additonalDocs, nextPrimary.translogStats().estimatedNumberOfOperations());
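A toy model of what that assertion expects (hypothetical classes; the real test operates on IndexShard and translogStats()): after a completed round of segment replication, the translog should only count the operations ingested after that round.

```java
// Toy model (hypothetical) of the failing assertion: after a round of
// segment replication, the shard's translog should retain only the
// operations ingested after that round, not every operation since start.
import java.util.ArrayList;
import java.util.List;

public class TranslogSketch {
    static class Shard {
        final List<String> translog = new ArrayList<>();
        void index(String doc) { translog.add(doc); }
        // A completed segment-replication round lets the shard trim
        // operations that are now durable in replicated segments.
        void afterSegmentReplicationRound() { translog.clear(); }
        int estimatedNumberOfOperations() { return translog.size(); }
    }

    public static void main(String[] args) {
        Shard shard = new Shard();
        for (int i = 0; i < 5; i++) shard.index("doc-" + i);
        shard.afterSegmentReplicationRound();
        int additionalDocs = 2;
        for (int i = 0; i < additionalDocs; i++) shard.index("extra-" + i);
        // Mirrors the assertion quoted above on translogStats().
        System.out.println(shard.estimatedNumberOfOperations()); // prints "2"
    }
}
```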


@dreamer-89 dreamer-89 force-pushed the update_replica_recovery_segrep branch from 300f141 to 60ca0ac Compare January 9, 2023 20:10
@dreamer-89 dreamer-89 changed the title [Segment Replication] Force segment replication sync from peer recovery source [Segment Replication] For replica recovery, force segment replication sync from peer recovery source Jan 28, 2023
@dreamer-89 dreamer-89 force-pushed the update_replica_recovery_segrep branch from 60ca0ac to 00ee972 Compare January 28, 2023 22:57
@dreamer-89 dreamer-89 force-pushed the update_replica_recovery_segrep branch from 00ee972 to c1837d6 Compare February 1, 2023 01:34
@@ -847,7 +847,7 @@ protected DiscoveryNode getFakeDiscoNode(String id) {
     }

     protected void recoverReplica(IndexShard replica, IndexShard primary, boolean startReplica) throws IOException {
-        recoverReplica(replica, primary, startReplica, (a) -> null);
+        recoverReplica(replica, primary, startReplica, getReplicationFunc(replica));
Member:
love this add to our tests!

.actionGet();
assertTrue(clusterHealthResponse.isTimedOut());
ensureYellow(INDEX_NAME);
IndicesService indicesService = internalCluster().getInstance(IndicesService.class, replicaNode);
Member:

I wonder if the flakiness in these tests comes from asserting that IndicesService exists. After the round of SR fails, the shard will fail and the node will go yellow; it will then try to spin up and recover the shard again, causing this assertion to trip?

To confirm, you could flip this assertion to true, wrap it in an assertBusy, and see whether it always succeeds.
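The assertBusy pattern suggested here retries an assertion until it passes or a timeout elapses. A minimal stand-in (OpenSearch's real helper lives in its test framework; this version is only illustrative):

```java
// A minimal stand-in for the assertBusy(...) pattern: retry an assertion
// until it holds or a timeout elapses. Illustrative, not OpenSearch's helper.
import java.util.concurrent.TimeUnit;

public class AssertBusySketch {
    static void assertBusy(Runnable assertion, long timeout, TimeUnit unit) throws Exception {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        AssertionError last = null;
        while (System.nanoTime() < deadline) {
            try {
                assertion.run();
                return; // assertion passed within the timeout
            } catch (AssertionError e) {
                last = e;
                Thread.sleep(10); // back off briefly before retrying
            }
        }
        throw last != null ? last : new AssertionError("condition never checked");
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // Condition becomes true after ~50 ms, as a shard recovery might.
        assertBusy(() -> {
            if (System.currentTimeMillis() - start < 50) {
                throw new AssertionError("shard not recovered yet");
            }
        }, 1, TimeUnit.SECONDS);
        System.out.println("ok"); // prints "ok"
    }
}
```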

Member Author:

I wonder if the flakiness in these tests is asserting that IndicesService exists.

Yes, it was flaky and failed on the assertion that the index does not exist on the replica.

After the round of SR fails the shard will fail and node will be yellow, it will then try and spin up and recover the shard again, causing this assertion to trip?

Bingo, yes, you are right. The recovery kicks in again after the failure. I modified the test to wait for the first round of recovery before the assertion.

Signed-off-by: Suraj Singh <[email protected]>
@dreamer-89 dreamer-89 force-pushed the update_replica_recovery_segrep branch from c1837d6 to a19dccc Compare February 1, 2023 18:55

@mch2 mch2 (Member) left a comment:

LGTM thanks!

@dreamer-89 dreamer-89 merged commit 53d54de into opensearch-project:main Feb 1, 2023
@dreamer-89 dreamer-89 added the backport 2.x Backport to 2.x branch label Feb 1, 2023
@opensearch-trigger-bot

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-5746-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 53d54de90423e31af7ca2514b953d77df4ac2be4
# Push it to GitHub
git push --set-upstream origin backport/backport-5746-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-5746-to-2.x.

dreamer-89 added a commit to dreamer-89/OpenSearch that referenced this pull request Feb 2, 2023
… sync from peer recovery source (opensearch-project#5746)

* [Segment Replication] For replica recovery, force segment replication sync from source

Signed-off-by: Suraj Singh <[email protected]>

* Rebase against main

Signed-off-by: Suraj Singh <[email protected]>

* Fix unit test

Signed-off-by: Suraj Singh <[email protected]>

* PR feedback

Signed-off-by: Suraj Singh <[email protected]>

* Fix remote store recovery test

Signed-off-by: Suraj Singh <[email protected]>

---------

Signed-off-by: Suraj Singh <[email protected]>
dreamer-89 added a commit that referenced this pull request Feb 2, 2023
… sync from peer recovery source (#5746) (#6149)
mch2 pushed a commit to mch2/OpenSearch that referenced this pull request Mar 4, 2023
… sync from peer recovery source (opensearch-project#5746)
Labels
backport 2.x Backport to 2.x branch skip-changelog