POC: Handling failover with remote segment storage #2481
Sort of thinking out loud here, but I see at least two possible broad approaches. At the time of failover, the new primary:
Are there any other options that you're thinking about here? I think both of these approaches will potentially make failover a more heavyweight operation than it currently is. |
@andrross Yes, these are two of the potential approaches. Let me list all the potential approaches below. |
Potential Approaches

As mentioned in the description of the issue, these approaches handle the failover case with document-based replication. For segment-based replication it would be comparatively easy, since segment files do not change between primary and replicas (we would still want the same approach to work for both replication strategies).

Sync all data from remote store to new primary before failover is complete
Sync all data from new primary to remote store before failover is complete
Sync all data from new primary to remote store in the background, commit only after data sync completes
Every node keeps its own copy in remote store
Keeping 3 copies of data from 3 different nodes
Keep a completely different copy of segments on the remote store
Incremental upload only new segments of new primary to remote store
|
Next steps:
|
For the sake of argument, is it truly a requirement that remote storage works with both replication strategies, particularly for the initial version? It seems like a whole bunch of complexity could be avoided if this feature were only supported with segment replication. Just from a usability perspective, enabling remote storage with document replication introduces performance and cost tradeoffs that can be avoided entirely with segment replication. |
This POC will help us in the design of the remote store, and IMO the remote store design should not be tied to a replication strategy. From a release point of view, I agree with you. We can consider an experimental first release of the remote store limited to segment replication. But the interfaces should be defined in such a way that supporting document-based replication is an extension. |
Thank you for the thorough walkthrough of the various approaches above! IMO, we shouldn't rule out the two "sync all data" approaches (the first two options listed above) from the get-go, especially considering that they are the most straightforward to implement. I'm sure the use of remote segment storage will introduce a number of other tradeoffs, so it would be worth considering the impact of failover latency alongside those. Building on what @andrross said above, I don't think we can integrate remote segment storage alongside document/segment replication. If the purpose of the remote segment store is to be the authoritative source of data and guard against data loss when nodes go down, then it must work in tandem with primaries and replicas. We would need a new replication strategy, distinct from document and segment replication - one where the primary processes documents and ultimately writes segments only to the remote store. Replica shards would simply pull data from the remote store and only serve to minimize failover time should the primary become unavailable. Thoughts? |
One final, pie-in-the-sky thought 😄 Down the line, it would be worth considering the benefits of allowing the remote store to independently optimize its stored data rather than simply mirroring the primary. That way, we don't need to expend network bandwidth just to mirror the primary shard's process of slowly merging down to larger and larger segments. |
@kartg makes a good point about a possible architecture where replicas pull from the remote store and essentially use the remote store as the replication method. There are a few variables here: remote segment store, remote translog, and replication method. It seems to me that there are a few permutations that probably don't make sense. For example, would you ever want to use document replication along with a remote translog? I would think not because a benefit of a remote translog is that you don't need to replicate the translog to replicas at all. Maybe I'm wrong about that, but it might be helpful to detail all the use cases being targeted here to help answer some of these questions and refine the design. |
I agree with @andrross: why should remote storage work with document replication? What is the benefit of doing this? I think we need to trade off the benefit of enabling this against the complexity we will add. |
We are combining two things here: durability and replication. Durability should be achieved irrespective of which replication strategy we choose. The durability feature will make sure there is no data loss when an outage happens. Let me know if you disagree with this.
@kartg This is a feature that can be built using remote segment store but again not related to durability feature.
Is it similar to
@andrross Yes, to provide durability guarantees, we need to use remote translog along with document replication.
@anasalkouz It is required to provide durability guarantees. |
Durability and replication are separate considerations as long as we're only changing how we make things durable, or how we're replicating. With remote storage, we're changing where we're making things durable which affects where we replicate from. I think we have an opportunity to build a really efficient implementation if we work together to ensure that remote segment storage and segment replication play well with each other, rather than trying to build them completely independent of one another. wdyt? |
Completely agree. Not just replication, we can integrate remote store with other constructs/features of OpenSearch (like snapshot). While designing the remote store, we have to make sure the use case is extensible. We have started drafting design proposal here: #2700 |
Approach: Incremental upload only new segments of new primary to remote store

Segments are uploaded to the remote storage in the directory format: cluster_UUID/primary_term/index_hash/shard_number/. The segments file (segments_N) per commit will be used to keep track of the max sequence number that is committed. The processed local checkpoint is added to the segments_N file as part of the commit data. Currently, OpenSearch keeps the segments_N file only for the last successful commit and deletes the older ones (code reference). In this approach, we will keep all the segments_N files. We need to check the impact on performance or scale: can this result in reaching the max open file limit earlier than the current implementation? We will also upload a segments_latest file which points to the latest segments_N file. A special primary_term_latest will be added under cluster_UUID/ which holds the value of the latest primary term.
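A minimal sketch of the key scheme described above, assuming the directory format cluster_UUID/primary_term/index_hash/shard_number/ with segments_latest and primary_term_latest marker files. All class and function names here are illustrative, not the actual OpenSearch implementation:

```python
# Illustrative sketch of the proposed remote key layout. The names
# (ShardCoordinates, segment_prefix, etc.) are hypothetical helpers,
# not OpenSearch APIs.
from dataclasses import dataclass

@dataclass
class ShardCoordinates:
    cluster_uuid: str
    primary_term: int
    index_hash: str
    shard_number: int

def segment_prefix(c: ShardCoordinates) -> str:
    """Remote key prefix for one shard's segments in a given primary term."""
    return f"{c.cluster_uuid}/{c.primary_term}/{c.index_hash}/{c.shard_number}/"

def latest_commit_key(c: ShardCoordinates) -> str:
    """segments_latest points at the newest segments_N file for the shard."""
    return segment_prefix(c) + "segments_latest"

def latest_primary_term_key(cluster_uuid: str) -> str:
    """primary_term_latest under cluster_UUID/ records the latest primary term."""
    return f"{cluster_uuid}/primary_term_latest"

coords = ShardCoordinates("4f2a", primary_term=3, index_hash="a1b2", shard_number=0)
print(segment_prefix(coords))  # 4f2a/3/a1b2/0/
```

Because each failover increments the primary term, a new primary writes under a fresh prefix and never has to reconcile file names with segments uploaded by the old primary; restore resolves primary_term_latest first, then segments_latest under that prefix.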
Test case for restore and checking duplicates

3-node cluster - no dedicated master nodes

Indexing Flow (value of X can be changed as per the run)
Restore Flow
Conclusion
|
Recommended Approach: Sync all data from new primary to remote store in the background, commit only after data sync completes

We can use a variant of this approach, taking help from the remote translog. Commits will be triggered on the new primary the same way they are triggered today. The local translog on the new primary will be purged on these commits, but the remote translog will not be purged until the segment upload completes. |
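The gating described above can be sketched as follows. This is a toy model under stated assumptions (the TranslogGate class and its methods are hypothetical, not OpenSearch code): a commit trims the local translog immediately, while the remote translog entries for that commit are held until the background segment upload reports completion.

```python
# Toy sketch of the recommended gating: purge local translog on commit,
# purge remote translog only after the segment upload for that commit
# completes. Class and method names are hypothetical.
class TranslogGate:
    def __init__(self):
        self.local = []           # ops pending local durability
        self.remote = []          # ops pending remote durability
        self.pending_upload = {}  # commit generation -> ops awaiting upload

    def add(self, op):
        self.local.append(op)
        self.remote.append(op)

    def on_commit(self, generation, committed_ops):
        # The commit made these ops durable on local disk, so the local
        # translog can be trimmed right away.
        self.local = [op for op in self.local if op not in committed_ops]
        # Hold the remote copies until the segments backing this commit
        # are safely in the remote store.
        self.pending_upload[generation] = list(committed_ops)

    def on_upload_complete(self, generation):
        # Segments for this commit reached the remote store; now it is
        # safe to drop the corresponding remote translog entries.
        done = set(self.pending_upload.pop(generation, []))
        self.remote = [op for op in self.remote if op not in done]

gate = TranslogGate()
for op in ["op1", "op2", "op3"]:
    gate.add(op)
gate.on_commit(1, ["op1", "op2"])
print(gate.local)   # ['op3']  -- local trimmed at commit
print(gate.remote)  # ['op1', 'op2', 'op3']  -- remote still intact
gate.on_upload_complete(1)
print(gate.remote)  # ['op3']  -- remote trimmed only after upload
```

The point of the variant is visible in the window between on_commit and on_upload_complete: if the new primary dies there, the remote translog still covers the committed-but-not-uploaded operations, so no data is lost.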
Adding more details in the design proposal: #2700 |
The way we store segments in remote segment storage (Feature Proposal) depends on how we handle failover.
In document-based replication, segments are created separately on the primary and replicas, and the process is not in sync. This means the number of segments can differ. Also, if we inspect the segments on the primary and a replica, there can be differences, depending on when each segment was created on the given node and the translog checkpoint at that time. This does not mean the data will be inconsistent: with the help of the translog, the end state still remains the same.
As there isn't a consistent state of segments across the primary and replicas, when the primary goes down and one of the replicas becomes the primary, the segments in the remote store and the segments on the new primary will differ. Once the new primary starts uploading new segments to the remote store, we need to make sure a consistent state is maintained. This becomes tricky once a segment merge happens on the new primary and older segments need to be deleted.
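The divergence described above can be shown with a toy model (this is an illustration only; the Node class and flush policy are invented for the example, and real segment creation is far more involved): two nodes apply the same operations but flush at different moments, so their segment boundaries differ even though the logical document set is identical.

```python
# Toy illustration of segment divergence under document-based
# replication: same operations, different flush timing, different
# segment layout, same logical documents.
class Node:
    def __init__(self, flush_every):
        self.flush_every = flush_every
        self.buffer, self.segments = [], []

    def index(self, doc):
        self.buffer.append(doc)
        # Flush the in-memory buffer into a new segment once it fills.
        if len(self.buffer) == self.flush_every:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

ops = [f"doc{i}" for i in range(6)]
primary, replica = Node(flush_every=2), Node(flush_every=3)
for op in ops:
    primary.index(op)
    replica.index(op)

print(primary.segments)  # [('doc0', 'doc1'), ('doc2', 'doc3'), ('doc4', 'doc5')]
print(replica.segments)  # [('doc0', 'doc1', 'doc2'), ('doc3', 'doc4', 'doc5')]
# Different files, same documents:
assert ({d for s in primary.segments for d in s}
        == {d for s in replica.segments for d in s})
```

This is exactly why, after failover, the new primary's segment files cannot simply be matched file-by-file against what the old primary already uploaded to the remote store.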
The goal of this POC is to list potential approaches to handle failover and recommend one based on pros and cons. This failover approach will dictate the overall design of how segments are stored.