-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable indexing optimization using sequence numbers on replicas #43616
Conversation
Pinging @elastic/es-distributed |
server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java
Outdated
Show resolved
Hide resolved
Thanks for tackling this. I wonder if we should go one step further and remove the |
I remember that we discussed and agreed to proceed with this change. However, when I was working on this PR I realized we should not remove the append-only optimization on 7.x as 6.x indices don't have soft-deletes. Are you okay if we remove the append-only optimization from 8.x only in a follow-up? |
Can't we activate the new optimization also for the case where there are no soft-deletes? |
With #41161 where we initialize max_seq_no_of_updates in the constructor of InternalEngine, we don't strictly need the LocalCheckpointTracker containing all operations when we open an engine. Thus, we can activate the new optimization without soft-deletes. |
@ywelsch I have deactivated the append-only optimization and removed the related code on replicas. Sadly we can't remove the logic that transfers auto_id timestamp from primary to replicas in peer recovery or resync as this will be required (when a replica becomes primary) if we switch peer recovery to use soft-deletes instead of translog (see #33693). Please take another look when you have some cycles. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ywelsch I have deactivated the append-only optimization and removed the related code on replicas. Sadly we can't remove the logic that transfers auto_id timestamp from primary to replicas in peer recovery or resync as this will be required (when a replica becomes primary) if we switch peer recovery to use soft-deletes instead of translog (see #33693).
I'm not sure I'm following this. Would it only be a problem if the original append-only request would go the newly promoted primary whereas the retry had gone to the previous primary? Is that a scenario that can happen?
Perhaps we should disable the append-only optimization for any other case than where origin == Primary
?
server/src/test/java/org/elasticsearch/index/replication/IndexLevelReplicationTests.java
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/index/replication/IndexLevelReplicationTests.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java
Outdated
Show resolved
Hide resolved
Yeah, I was thinking of that scenario. If we are sure that can't happen, we can remove the transfer timestamp logic. |
It's very difficult to construct such a scenario at least. On the off-chance, however, best keep it in for the time being... Not in this PR, but I wonder if we should simplify the append-only optimization logic so that when we generate the timestamp, we also attach it to the identity of the primary shard instance that we want to send this write to, and only apply the optimization if this matches. This would make our life easier, as there is no need to persist or transfer the max unsafe timestamp. Writes would only be deoptimized during the short time of primary failover or relocation. WDYT? |
This is a great idea :). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Let's benchmark this for both an append-only workload with _id and without _id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
This PR increases the indexing throughput around 13% for append-only indices with external Ids. There's no regression for append-only indices with auto-generated Ids. Below is the detail. 1. geopoint with external Ids
2. geonames with external Ids
3. geonames with auto-generated Ids
|
Unrelated test failure (tracked at #43889). @elasticmachine run elasticsearch-ci/2 |
Thanks everyone! |
This PR enables the indexing optimization using sequence numbers on replicas. With this optimization, indexing on replicas should be faster and use less memory as it can forgo the version lookup when possible. This change also deactivates the append-only optimization on replicas. Relates #34099
Hi~I hava some question to ask: When a replica receives an index operation O, it first ensure its own MSU at least MSU(O), and then compares its MSU to its local checkpoint (LCP).
But if there is a gap, which mean LCP < MSU(O) < seqNo(O), what errors will be caused by directly append? Any response would be appreciated! |
@LiuDui it is not clear whether you are after extending the optimization to handle more cases or whether you think there is a problem here? Can you clarify the intentions of your question>
If LCP < MSU, a check will be made. If we were to direct append in the case where the operation has already been updated, we risk having two documents for the same |
@henningandersen Thanks so much for your reply! I can understand that when LCP >= MSU, we can append operation without any problems on replica. It's safe and is the current logic. According to the documentation, If LCP < MSU then there's a gap: there may be some operations that act on docID(O) about which we do not yet know, so we cannot perform an add. But I can't find a situation which will keep two documents for the same
What's the meaning about the operation has already been updated? Is there some example? |
@henningandersen I've suddenly figured it out! Thank you very much! |
This PR enables the indexing optimization using sequence numbers on replicas. With this optimization, indexing on replicas should be faster and use less memory as it can forgo the version lookup when possible. This change also deactivates the append-only optimization on replicas.
Relates #34099