Prevent CCR recovery from missing documents #38237
Pinging @elastic/es-distributed |
This is not ready for production. I pushed it up here so we can discuss. This commit:
|
@tbrooks8 I've fixed things up in line with how I think we should handle this, and I've also added a unit test. The integration test While this fixes the problem here, I think it exposes another issue: the primary will start off (i.e. be marked as started) with a history that contains gaps, i.e. local checkpoint != max sequence number. This can be problematic for replicas, because peer recovery only completes if all gaps are filled on the primary (see call to |
LGTM
@ywelsch your changes look good to me. There is a test failing (reproducibly). Looks like we assert that the sequence numbers match between Lucene and the translog. Did you mean for this to be the assertion for
instead of:
I assume we should not be asserting the ops are the same for the target since there will be some no-ops? I can update the PR, I just wanted to check if that is what you intended. |
No, I only wanted the source index to be leniently closed (because we forcibly inject a gap). The problem was that the test was not ensuring that the index is restored with soft-deletes enabled when it is snapshotted with soft-deletes enabled. I've fixed this now. |
@elasticmachine run elasticsearch-ci/default-distro |
@elasticmachine run elasticsearch-ci/2 |
…nto ccr_initial_global_checkpoint
* master: (23 commits) Lift retention lease expiration to index shard (elastic#38380) Make Ccr recovery file chunk size configurable (elastic#38370) Prevent CCR recovery from missing documents (elastic#38237) re-enables awaitsfixed datemath tests (elastic#38376) Types removal fix FullClusterRestartIT warnings (elastic#38445) Make sure to reject mappings with type _doc when include_type_name is false. (elastic#38270) Updates the grok patterns to be consistent with logstash (elastic#27181) Ignore type-removal warnings in XPackRestTestHelper (elastic#38431) testHlrcFromXContent() should respect assertToXContentEquivalence() (elastic#38232) add basic REST test for geohash_grid (elastic#37996) Remove DiscoveryPlugin#getDiscoveryTypes (elastic#38414) Fix the clock resolution to millis in GetWatchResponseTests (elastic#38405) Throw AssertionError when no master (elastic#38432) `if_seq_no` and `if_primary_term` parameters aren't wired correctly in REST Client's CRUD API (elastic#38411) Enable CronEvalToolTest.testEnsureDateIsShownInRootLocale (elastic#38394) Fix failures in BulkProcessorIT#testGlobalParametersAndBulkProcessor. (elastic#38129) SQL: Implement CURRENT_DATE (elastic#38175) Mute testReadRequestsReturnLatestMappingVersion (elastic#38438) [ML] Report index unavailable instead of waiting for lazy node (elastic#38423) Update Rollup Caps to allow unknown fields (elastic#38339) ...
Currently the snapshot/restore process manually sets the global checkpoint to the max sequence number from the restored segments. This does not work for CCR, because it prevents documents that would be recovered during the normal following operation from being recovered. This commit fixes the issue by setting the initial global checkpoint to the existing local checkpoint.
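To illustrate the difference between the two candidate values for the initial global checkpoint, here is a minimal, self-contained sketch. It is not Elasticsearch code: the class `CheckpointSketch` and its methods are hypothetical, the restored history is modeled as a plain set of sequence numbers, and `-1` is assumed to stand for "no operations performed". It only shows why a gap below the max sequence number would be skipped if the follower started from the max sequence number rather than from the local checkpoint.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; none of these names are Elasticsearch APIs.
final class CheckpointSketch {
    // Assumed sentinel meaning "no operations have been processed yet".
    static final long NO_OPS_PERFORMED = -1L;

    // Highest sequence number present in the restored history.
    static long maxSeqNo(Set<Long> seqNos) {
        long max = NO_OPS_PERFORMED;
        for (long seqNo : seqNos) {
            max = Math.max(max, seqNo);
        }
        return max;
    }

    // Highest sequence number such that every sequence number at or below it is present.
    static long localCheckpoint(Set<Long> seqNos) {
        long checkpoint = NO_OPS_PERFORMED;
        while (seqNos.contains(checkpoint + 1)) {
            checkpoint++;
        }
        return checkpoint;
    }

    public static void main(String[] args) {
        // Restored history with a gap: operation 3 is missing.
        Set<Long> restored = new HashSet<>(Arrays.asList(0L, 1L, 2L, 4L, 5L));

        long max = maxSeqNo(restored);
        long local = localCheckpoint(restored);

        // Starting after the max sequence number would never fetch operation 3.
        System.out.println("next op if starting from max seq no: " + (max + 1));
        // Starting after the local checkpoint fetches operation 3 first.
        System.out.println("next op if starting from local checkpoint: " + (local + 1));
    }
}
```

With the history above, the max sequence number is 5 and the local checkpoint is 2, so a follower that resumes from the local checkpoint requests operation 3 next instead of skipping it.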