[Segment Replication] Remote store integration high level design. #4555
Also, how do we ensure that the segments which are being copied by replicas don't get deleted by the primary? |
Based on the discussion on writes on an NRT replica with translog, it does not look like the primary and replicas are completely decoupled for translog writes. What would we then gain by decoupling the primary and replicas completely during segment copy? I believe we should start from this question, as it will inform and potentially simplify a lot of the design trade-offs you captured above. |
I definitely think we should evaluate a poll based model for checkpoints and segment transfer on the replica. This should simplify cases not only with the remote segment store but also with CCR use cases, where we currently have the remote cluster poll for operations on the leader cluster.
If we go with a polling mechanism, the replicas can request segments from the primary or the remote store alike, based on the diff between the latest checkpoint and the last processed checkpoint.
I guess we should do that; @sachinpkale will be sharing the related issue around this, since translog replay might be expensive.
@itiyamas I guess remote translogs would only be relying on a replication proxy for the control flow (single writer invariant); the data flow, however, is decoupled. This is only to support existing blob stores which don't support OCC, but eventually the goal is to provide extension points for stores that do support OCC for single writer use cases, which might totally decouple the translog flow. Decoupling primary and replica for segment copy will help optimise lots of cases around recovery, primarily deployments/failovers, CCR replication etc. With a remote store we could also think about publishing checkpoints to a remote queue like Kafka or ActiveMQ, which would effectively decouple the segment copy and could be consumed not only internally within the cluster but across clusters for CCR, with the appropriate access controls. |
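To make the remote-queue idea concrete, here is a minimal sketch assuming hypothetical CheckpointEvent and CheckpointQueue types (none of this is an existing OpenSearch API): the primary publishes a lightweight checkpoint event and never blocks on consumers, so in-cluster replicas and CCR followers can each react at their own pace.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative types only; a real deployment would back this with Kafka/ActiveMQ.
record CheckpointEvent(String shardId, long primaryTerm, long segmentGeneration) {}

interface CheckpointQueue {
    void publish(CheckpointEvent event);                 // primary side
    CheckpointEvent next() throws InterruptedException;  // consumer side
}

final class InMemoryCheckpointQueue implements CheckpointQueue {
    private final BlockingQueue<CheckpointEvent> queue = new LinkedBlockingQueue<>();

    @Override
    public void publish(CheckpointEvent event) {
        queue.offer(event); // the primary never waits on consumers: fully decoupled
    }

    @Override
    public CheckpointEvent next() throws InterruptedException {
        return queue.take(); // a consumer drains events at its own pace
    }
}
```

Note that a single shared queue like this gives competing-consumer semantics (each event goes to one consumer); for multiple independent consumers that each need every checkpoint, a pub/sub system with per-consumer offsets (as Kafka provides) would be the fit.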
Created an issue to track this: #4578. Downloading segments for a lagging replica from the remote segment store would definitely be faster than replaying all the operations from the translog. We may need to think about a common abstraction which will be used to download segments from the remote segment store. Once SegRep is integrated with the remote segment store, we can use the same APIs to download segments during failover. |
Do you mean that the primary term check is the control flow and the actual documents or segments are the data flow? |
Yes, that's how I would look at it :) |
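As an illustration of that split, here is a minimal sketch (the RemoteSegmentStore interface is assumed for the example, not a real API): the primary term check is the control-flow gate enforcing the single writer invariant, while the segment bytes themselves are the data flow.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical store interface separating coordination metadata from bulk data.
interface RemoteSegmentStore {
    long latestPrimaryTerm() throws IOException;                        // control plane
    void upload(Path segmentFile, long primaryTerm) throws IOException; // data plane
}

final class GuardedUploader {
    private final RemoteSegmentStore store;
    private final long myPrimaryTerm;

    GuardedUploader(RemoteSegmentStore store, long myPrimaryTerm) {
        this.store = store;
        this.myPrimaryTerm = myPrimaryTerm;
    }

    void uploadAll(List<Path> files) throws IOException {
        // Control flow: refuse to write if a newer primary has taken over.
        if (store.latestPrimaryTerm() > myPrimaryTerm) {
            throw new IllegalStateException("stale primary, aborting upload");
        }
        // Data flow: the bulk bytes move independently of the coordination check.
        for (Path file : files) {
            store.upload(file, myPrimaryTerm);
        }
    }
}
```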
Thank you folks for the inputs. The comments really helped in identifying the different problem areas and the probable solutions. From a POC perspective, a pull based model where the replica directly polls the remote store for new replication checkpoints seems a better fit. Data deletion from the remote store is not considered as part of this POC, though the remote store currently purges commit points older than N. Also, there are a few changes which may be needed on the remote store side; created an issue for tracking: #4628. Captured the design choices below, please review and provide feedback.

Design consideration

Replication model
A pull based mechanism provides certain advantages, works well with the CCR issue (#4577), and provides simplification (#4555 (comment)). It is better to provide both options and evaluate based on benchmarking results. As part of this POC, the poll based mechanism will be attempted.

1. Push based mechanism
The primary is in charge of starting replication by sending shard refresh checkpoints. The replica compares these checkpoints with its store and initiates replication when it falls behind. The primary maintains the state of on-going replications to prevent data deletion. This is the existing approach.
Pros
Cons
2. Pull based mechanism
The replica is in charge of updating its local copy of the data. The replica pings the remote store to check if there is any new data. We need to trade off network consumption vs. how soon data is available (a sketch of this poll loop appears below).
Pros
Cons
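A minimal sketch of the pull loop, assuming a hypothetical RemoteStoreReader interface; the poll interval is the knob for the network consumption vs. staleness trade-off mentioned above.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Assumed interface exposing the newest replication checkpoint in the store.
interface RemoteStoreReader {
    long latestCheckpoint(); // e.g. segment generation in the remote store
}

final class ReplicaPoller {
    private final RemoteStoreReader remote;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private volatile long lastProcessed = -1;

    ReplicaPoller(RemoteStoreReader remote) {
        this.remote = remote;
    }

    void start(long pollIntervalMillis) {
        // Shorter interval => fresher replica but more remote-store reads; longer => the opposite.
        scheduler.scheduleWithFixedDelay(this::pollOnce, 0, pollIntervalMillis, TimeUnit.MILLISECONDS);
    }

    private void pollOnce() {
        long latest = remote.latestCheckpoint();
        if (latest > lastProcessed) {        // new data: the replica falls behind otherwise
            copySegmentsUpTo(latest);
            lastProcessed = latest;
        }
    }

    private void copySegmentsUpTo(long checkpoint) {
        // Incremental segment file copy would go here.
    }
}
```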
Refresh Vs Commit
The existing implementation of segment replication supports incremental back up of segment files on the replica. This is meaningful for disk based stores as it saves

Data purging
Data, once stored on the remote store, needs to be cleaned up as well. This deletion shouldn't impact the on-going file copy operations on the replica node. From a POC perspective, data will not be deleted from the remote store. Below are some options which can be considered as future enhancements.

1. Background process on primary
On the primary node, a background process pings the replicas to fetch on-going replication checkpoints and cleans up commit points older than N, where N is the least replication checkpoint active in replication among all replicas. This approach couples the primary and replica nodes and is not a scalable solution.

2. Distributed queue
Similar to the previous approach, but instead of transport calls from the primary to the replicas, the primary uses a distributed queue to synchronize active replication checkpoints. When an event is processed by all replicas, the event can be purged (exact N semantics) and cleaned up from the remote store.
Pros:
Cons:
3. Deletion policies (cleanup after M days, or keep only the last N commits)
No synchronization among nodes to identify active replication checkpoints; data is instead deleted based on a data deletion policy. This completely decouples the primary and replica nodes and simplifies the design. The remote store by default keeps the last 10 commits on the store. Replica nodes handle a file-not-found exception and retry for the latest commit point (sketched below). The solution can result in replica starvation when the primary is aggressively committing. This solution is attempted for the POC and doesn't need any extra work on the integration side. |
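A rough illustration of option 3 with hypothetical types (not the actual remote store code): the store retains only the last N commits with no coordination, and the replica handles the resulting file-not-found by retrying at the latest commit point.

```java
import java.io.FileNotFoundException;
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a store that keeps only the last N commit points.
final class KeepLastNCommitsStore {
    private final int retained;                  // e.g. 10, the current remote store default
    private final Deque<Long> commits = new ArrayDeque<>();

    KeepLastNCommitsStore(int retained) {
        this.retained = retained;
    }

    synchronized void addCommit(long generation) {
        commits.addLast(generation);
        while (commits.size() > retained) {
            commits.removeFirst();               // purge oldest, with no coordination with replicas
        }
    }

    synchronized byte[] fetchCommit(long generation) throws FileNotFoundException {
        if (!commits.contains(generation)) {
            throw new FileNotFoundException("commit " + generation + " was purged");
        }
        return new byte[0];                      // placeholder for the real commit payload
    }

    synchronized long latestCommit() {
        return commits.isEmpty() ? -1 : commits.peekLast();
    }
}

final class ReplicaFetcher {
    byte[] fetchWithRetry(KeepLastNCommitsStore store, long wanted) throws FileNotFoundException {
        try {
            return store.fetchCommit(wanted);
        } catch (FileNotFoundException purged) {
            // The commit was deleted under us: retry at the latest commit point. With an
            // aggressively committing primary, this retry loop is where starvation shows up.
            return store.fetchCommit(store.latestCommit());
        }
    }
}
```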
Overall the proposal looks good to me. A few open questions that we probably need to elaborate on:
|
We need to make sure we are not breaking the refresh semantics with the remote store. So in essence it's not one vs. the other; we need to ensure we add features without changing staleness.
I think we also need to think from a failover perspective: we cannot have commits only in memory, as it will increase the failover time and impact availability.
A reasonable retention policy should be fine as long as there are no additional costs or trade-offs. |
Does this replace …? Also, this would require … |
Thanks @ankitkala for the comments. Please find my response below.
The commit is performed on shard refreshes.
Can you please elaborate more on this?
I am thinking of having a poller process on the replica nodes which scans the remote store for commit points. The poller captures the latest commit point and passes it on to the TargetService on the replica node. This service compares the commit point with the last seen one and starts file copy operations for new commit points.
I think the poller process on the replica should be able to read commit points on the remote store. With this, I think separate replication checkpoints will not be needed. |
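A rough sketch of that handoff, with assumed CommitPoint and SegmentTargetService types: the poller only discovers the newest commit point, while the service owns the last-seen comparison and the incremental file diff.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical shape of a commit point: a generation plus the segment files it references.
record CommitPoint(long generation, List<String> fileNames) {}

final class SegmentTargetService {
    private volatile long lastSeenGeneration = -1;
    private final Set<String> localFiles = new HashSet<>();

    // Called by the poller with the latest commit point it found on the remote store.
    void onNewCommitPoint(CommitPoint commit) {
        if (commit.generation() <= lastSeenGeneration) {
            return;                              // stale or duplicate notification
        }
        // Incremental copy: fetch only the files the replica does not already have.
        for (String file : commit.fileNames()) {
            if (localFiles.add(file)) {
                download(file);
            }
        }
        lastSeenGeneration = commit.generation();
    }

    private void download(String file) {
        // Actual remote-store read would go here.
    }
}
```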
The existing implementations of both segrep & remote store operate on refresh, not commit. So in order for replicas to fetch the latest checkpoint (which is only in memory on the primary after a refresh operation with no flush), we'd need to create a new file representing the in-memory SegmentInfos at a refresh point and write/upload it. Right now, with the node->node implementation, that is sent directly from memory as a byte[]. The idea is to create an fsync-less commit instead of the normal refresh and avoid creating the new file. The fsync removal makes this about as expensive as a regular refresh, which we are able to do with the durability guarantee of the remote store. So the flow would be: during failover, replicas can either fetch from the remote store & then replay from the xlog, or directly replay a larger set of ops from the xlog. All ops should be durably persisted. The alternative is that we replicate after a normal commit, which would cause unacceptable staleness.
With what we are proposing, the refresh operation would need to be extended, when these features are enabled, to effectively execute what a flush does today minus the xlog purge. This should not impact staleness, however. (A toy sketch of the fsync-less write follows below.)
@Bukhtawar Are you referring to the absence of fsync here, or the need to write the incremental refresh's SegmentInfos to disk if we are pushing a new file to the store? |
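To illustrate the fsync-less idea in isolation (plain java.nio, not the actual Lucene/OpenSearch commit path, which is where the real change would live): the refresh-point file is written but FileChannel.force() is skipped, with durability instead coming from the subsequent upload to the remote store.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class RefreshPointWriter {

    /** Writes the serialized SegmentInfos bytes for a refresh point. */
    static void writeRefreshPoint(Path file, byte[] segmentInfosBytes, boolean fsync) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            channel.write(ByteBuffer.wrap(segmentInfosBytes));
            if (fsync) {
                channel.force(true); // normal commit: pay the fsync cost
            }
            // fsync == false: roughly as cheap as a refresh; durability comes from
            // uploading the same bytes to the remote store afterwards.
        }
    }
}
```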
Thank you @Bukhtawar for the feedback and for calling this out. Yes, there will be changes in the refresh mechanism, but the advantages we seek from commit-only backup seem to outweigh this shortcoming, as also mentioned in the last comment by @mch2. We plan to call this out clearly in the documentation and perform benchmarking/tests to verify the claim.
In order to quantify the failover time and availability impact, we need to add tests/benchmarks. We can make fsync-less commits bounded in order to keep failover/recovery time bounded. One way is to ensure that at least every Nth commit is fsynced (a small sketch follows this comment).
The deletion policy |
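Assuming the bounded approach above, a hypothetical policy class could look like this; the bound caps how much un-fsynced state a recovery could ever need to rebuild.

```java
// Counts commits and forces a fully fsynced commit at least every Nth time.
// Illustrative only; the real hook would sit in the engine's commit path.
final class BoundedFsyncPolicy {
    private final int maxFsyncLessCommits;
    private int sinceLastFsync = 0;

    BoundedFsyncPolicy(int maxFsyncLessCommits) {
        this.maxFsyncLessCommits = maxFsyncLessCommits;
    }

    /** Returns true when the next commit must be a full, fsynced commit. */
    synchronized boolean shouldFsync() {
        if (sinceLastFsync >= maxFsyncLessCommits) {
            sinceLastFsync = 0;
            return true;      // bound reached: pay the fsync to cap failover time
        }
        sinceLastFsync++;
        return false;         // cheap fsync-less commit
    }
}
```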
Closing this issue; the POC work is tracked in #4536. |
Before starting a POC for this integration, we should start with a high level design.
Some questions that come to mind:
Lower level questions: