Ensure latest replication checkpoint post failover has correct operational primary term #11990

linuxpi
Collaborator

@linuxpi linuxpi commented Jan 23, 2024

Description

  • During primary failover, after primary mode is activated on the new primary, the new primary can still upload segment metadata with the older primary term.
  • This happens because the ReplicationCheckpoint used to read the primaryTerm during upload might itself carry the older primaryTerm.
  • This is likely more prominent if indexing has stopped during failover (still to be verified).
  • Currently the replication checkpoint appears to be updated only during recovery, failover, and refresh, and even then only if the latest segmentInfos version and generation do not match the current replication checkpoint.
  • After failover we trigger a refresh, which should update the replication checkpoint, but since there is no new data, the cached replication checkpoint is reused.
  • This PR adds a check on primaryTerm so that the replication checkpoint is also updated during failover and picks up the new primaryTerm.
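
The caching behaviour described above can be illustrated with a minimal sketch. It is not the OpenSearch implementation: the class `CheckpointCacheSketch`, its nested `Checkpoint` type, and both method names are hypothetical stand-ins, and the only assumption carried over from the description is that the cached checkpoint is reused whenever generation and version are unchanged.

```java
// Simplified, hypothetical model of the check described above.
// None of these names are taken from the OpenSearch code base.
final class CheckpointCacheSketch {

    /** Minimal stand-in for a replication checkpoint. */
    static final class Checkpoint {
        final long primaryTerm;
        final long segmentsGen;
        final long segmentInfosVersion;

        Checkpoint(long primaryTerm, long segmentsGen, long segmentInfosVersion) {
            this.primaryTerm = primaryTerm;
            this.segmentsGen = segmentsGen;
            this.segmentInfosVersion = segmentInfosVersion;
        }
    }

    private Checkpoint cached;

    CheckpointCacheSketch(Checkpoint initial) {
        this.cached = initial;
    }

    /** Behaviour described before this PR: only generation and version are compared. */
    Checkpoint refreshWithoutTermCheck(long operationPrimaryTerm, long gen, long version) {
        if (cached.segmentsGen == gen && cached.segmentInfosVersion == version) {
            return cached; // no new segments, so the stale primary term survives
        }
        cached = new Checkpoint(operationPrimaryTerm, gen, version);
        return cached;
    }

    /** Behaviour described for this PR: a changed primary term also invalidates the cache. */
    Checkpoint refreshWithTermCheck(long operationPrimaryTerm, long gen, long version) {
        if (cached.primaryTerm == operationPrimaryTerm
            && cached.segmentsGen == gen
            && cached.segmentInfosVersion == version) {
            return cached;
        }
        cached = new Checkpoint(operationPrimaryTerm, gen, version);
        return cached;
    }
}
```

With a cached checkpoint at term 1 and unchanged generation and version, `refreshWithoutTermCheck(2, ...)` still returns a checkpoint at term 1, while `refreshWithTermCheck(2, ...)` returns one at term 2.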

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@linuxpi linuxpi changed the title Force update operation primary term in replication checkout post failover Ensure latest replication checkpoint post failover has correct operational primary term Jan 23, 2024
Collaborator Author

linuxpi commented Jan 23, 2024

Tagging @sachinpkale @ashking94 @mch2 @Bukhtawar for review

Collaborator Author

linuxpi commented Jan 23, 2024

Working on adding/fixing tests if required.

Contributor

github-actions bot commented Jan 23, 2024

Compatibility status:

Checks if related components are compatible with change ebb5b4a

Incompatible components

Incompatible components: [https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/k-nn.git]

Contributor

❌ Gradle check result for 77d94ac: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Collaborator

@Bukhtawar Bukhtawar left a comment

Can we put an assertion during upload to ensure writes only happen on the latest primary term?

Collaborator Author

linuxpi commented Jan 29, 2024

Can we put an assertion during upload to ensure writes only happen on the latest primary term?

Sure @Bukhtawar, will add an assertion.
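
One possible shape for such an assertion is sketched below. This is an illustration only and not the code added in this PR: `UploadTermGuardSketch` and both of its methods are hypothetical names, and the sketch assumes the upload path has access to both the checkpoint's primary term and the shard's current operation primary term.

```java
// Hypothetical sketch; names are illustrative, not OpenSearch APIs.
final class UploadTermGuardSketch {

    /** True when the checkpoint used for upload carries the shard's current primary term. */
    static boolean isOnLatestPrimaryTerm(long checkpointPrimaryTerm, long operationPrimaryTerm) {
        return checkpointPrimaryTerm == operationPrimaryTerm;
    }

    /** Returns the term to stamp on uploaded metadata, asserting that it is current. */
    static long primaryTermForUpload(long checkpointPrimaryTerm, long operationPrimaryTerm) {
        // Java assertions only fire when the JVM runs with -ea (typical for test runs).
        assert isOnLatestPrimaryTerm(checkpointPrimaryTerm, operationPrimaryTerm)
            : "upload attempted with primary term " + checkpointPrimaryTerm
                + " but operation primary term is " + operationPrimaryTerm;
        return checkpointPrimaryTerm;
    }
}
```

Because a plain `assert` is skipped when assertions are disabled, a guard like this would catch a stale term in tests without changing production behaviour.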

Contributor

❌ Gradle check result for 69b325b: FAILURE


Contributor

❌ Gradle check result for 9610d2d: FAILURE


Contributor

❌ Gradle check result for a59a276: FAILURE


@linuxpi linuxpi force-pushed the update-latest-primary-term-repl-ckp branch from 1b762da to 1b90a1b Compare January 30, 2024 10:05
Contributor

❌ Gradle check result for 1b762da: FAILURE


Contributor

❌ Gradle check result for 1b90a1b: FAILURE


Contributor

❌ Gradle check result for d39fc54: FAILURE


Contributor

✅ Gradle check result for ebb5b4a: SUCCESS


codecov bot commented Jan 30, 2024

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (6012504) 71.28% compared to head (ebb5b4a) 71.31%.
Report is 3 commits behind head on main.

Files | Patch % | Lines
...ava/org/opensearch/index/mapper/IpFieldMapper.java | 65.71% | 10 Missing and 2 partials ⚠️
...in/java/org/opensearch/index/shard/IndexShard.java | 50.00% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #11990      +/-   ##
============================================
+ Coverage     71.28%   71.31%   +0.02%     
- Complexity    59414    59444      +30     
============================================
  Files          4925     4925              
  Lines        279479   279513      +34     
  Branches      40635    40643       +8     
============================================
+ Hits         199226   199325      +99     
+ Misses        63731    63605     -126     
- Partials      16522    16583      +61     


@sachinpkale sachinpkale requested a review from Bukhtawar January 30, 2024 13:30
@sachinpkale sachinpkale merged commit c55af66 into opensearch-project:main Jan 30, 2024
32 of 33 checks passed
@sachinpkale sachinpkale added the backport 2.x Backport to 2.x branch label Jan 30, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 30, 2024
…ional primary term (#11990)

* Force update operation primary term in replication checkout post failover

Signed-off-by: bansvaru <[email protected]>
(cherry picked from commit c55af66)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
gbbafna pushed a commit that referenced this pull request Jan 30, 2024
…ional primary term (#11990) (#12083)

* Force update operation primary term in replication checkout post failover


(cherry picked from commit c55af66)

Signed-off-by: bansvaru <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
peteralfonsi pushed a commit to peteralfonsi/OpenSearch that referenced this pull request Mar 1, 2024
…ional primary term (opensearch-project#11990)

* Force update operation primary term in replication checkout post failover

Signed-off-by: bansvaru <[email protected]>
rayshrey pushed a commit to rayshrey/OpenSearch that referenced this pull request Mar 18, 2024
…ional primary term (opensearch-project#11990)

* Force update operation primary term in replication checkout post failover

Signed-off-by: bansvaru <[email protected]>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…ional primary term (opensearch-project#11990)

* Force update operation primary term in replication checkout post failover

Signed-off-by: bansvaru <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Labels
backport 2.x Backport to 2.x branch skip-changelog

5 participants