Support writes with previous major lucene versions #12391
Comments
For reference, there was a similar discussion at #10274.
I'm confused as to why the solution described won't work with segment replication. As I understand it, that solution has some writer nodes writing segments in the old version while the replicas' (readers') software gets upgraded. Since replicas only read, they can continue to read these old (version N-1) index segments. Even if you wanted to allow them to do merging (why, I don't know), you could, and they would write new-format segments, which is fine as long as they never share them with other replicas. Then once all replicas are on version N, you can upgrade the writer nodes to version N and start publishing version N segments. Where is the hitch? What am I missing?
Thank you @msokolov for the comment, and apologies for not making the description clear enough. The issue is the other way round: during a major Lucene version upgrade, readers are on non-upgraded nodes while writers are on upgraded nodes. It does work for minor Lucene version upgrades, though.
@msokolov @mikemccand What is your recommendation on this for minor version upgrades? Lucene does not support writes in the old codec format for minor versions either, but I can potentially override the codec to use the old writer in my application. Since Lucene has not tested IndexWriter with the old codec format, this could result in unknown bugs. The challenge for OpenSearch is that it stores the data in a remote store, so upgrades are not seamless unless we build forward compatibility for old codec versions into OpenSearch; even minor Lucene version upgrades become tricky.
I thought "OpenSearch attempted to solve mixed cluster issue by updating primary shard copies to keep using older codec until all replica copies are on latest software" was solving this issue? How does a remote store alter the picture? I guess you need a remote store for the old version and a separate remote store for the new version?
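OpenSearch never went ahead with the proposed solution since it does not work for major versions. I am wondering whether we should even rely on it for minor versions and need your help on the same. OpenSearch downgrades are not seamless due to codec version compatibility issues during deployments.
Current deployment process
Transition from OpenSearch 2.x (Lucene 9.4) to 2.x+1 (Lucene 9.9) involves moving all replicas followed by primaries.
Downgrading primaries is unsupported as there is no version of the software that understands both old and new codec formats.
Proposed 2 phase deployment process
Phase 1: Transition to OpenSearch 2.x+1 with Lucene 9.9, setting default write version to 9_4.
Phase 2: Enable codec version 9_9 for OpenSearch 2.x+1.
In this phased deployment, there is a version of software at every stage which you can rollback to.
Challenges with supporting intermediate stage
1. Old codecs are not supported (https://github.com/apache/lucene/blob/branch_9_10/lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsFormat.java#L115) out of the box for writes even for minor version upgrades.
2. Since the writer logic is available in backward codecs for testing, I can still go ahead and override the codec write methods to work with old writers. But the path is not tested well in Lucene e.g. IndexWriter may not work with old codec version for writes, even for minor version upgrades. So, I am not comfortable in relying on this mechanism unless I am aware of compatibility risks.
Questions
1. Does Lucene officially support only the latest codec version for write operations?
2. Can I assess the compatibility risks associated with older codec versions on new Lucene software by running the entire test suite with older codec versions? Is this method sufficient for identifying potential issues?
3. For applications relying on Lucene's segment replication model, and lacking a stable software version for rollback, how can they address deployment risks without independently verifying compatibility? Alternatively, how can they manage deployments without a stable fallback option, potentially risking downtime during rollbacks? Should Lucene consider supporting this officially?
Remote store
It is a special case for segment replication, so same problems exist.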
> Since the writer logic is available in backward codecs for testing, I can
> still go ahead and override the codec write methods to work with old
> writers. But the path is not tested well in Lucene e.g. IndexWriter may not
> work with old codec version for writes, even for minor version upgrades.
Although it is possible that IndexWriter would somehow stop being able to
write an older version of a codec, that seems unlikely for a minor release.
It is true that nothing enforces that that works. However running unit
tests with your backwards-supporting codec should be enough to have
confidence that it works.
I think the answers to your yes/no questions are 1. Yes, 2. Yes. For 3, I'm
not sure. It does seem like a difficult situation. I don't see how Lucene
would support writing two index versions at the same time though. I think
it sometimes happens that the backwards-codec implementations even drop
support for writing, so it might not be a reliable solution to (2) in the
general case.
…On Tue, Apr 30, 2024 at 10:31 AM itiyama ***@***.***> wrote:
Opensearch never went ahead with the proposed solution since it does not
work for major versions. I am wondering whether we should even rely on it
for minor versions and need your help on the same. Opensearch downgrades
are not seamless due to codec version compatibility issues during
deployments.
*Current deployment process*
Transition from OpenSearch 2.x (Lucene 9.4) to 2.x+1 (Lucene 9.9) involves
moving all replicas followed by primaries.
Downgrading primaries is unsupported as there is no version of the
software that understands both old and new codec formats.
*Proposed 2 phase deployment process*
Phase 1: Transition to OpenSearch 2.x+1 with Lucene 9.9, setting default
write version to 9_4.
Phase 2: Enable codec version 9_9 for OpenSearch 2.x+1.
In this phased deployment, there is a version of software at every stage
which you can rollback to.
*Challenges with supporting intermediate stage*
1. Old codecs are not supported
<https://github.com/apache/lucene/blob/branch_9_10/lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsFormat.java#L115>
out of the box for writes even for minor version upgrades.
2. Since the writer logic is available in backward codecs for testing,
I can still go ahead and override the codec write methods to work with old
writers. But the path is not tested well in Lucene e.g. IndexWriter may not
work with old codec version for writes, even for minor version upgrades.
So, I am not comfortable in relying on this mechanism unless I am aware of
compatibility risks.
*Questions*
1. Does Lucene officially support only the latest codec version for
write operations?
2. Can I assess the compatibility risks associated with older codec
versions on new Lucene software by running the entire test suite with older
codec versions? Is this method sufficient for identifying potential issues?
3. For applications relying on Lucene's segment replication model, and
lacking a stable software version for rollback, how can they address
deployment risks without independently verifying compatibility?
Alternatively, how can they manage deployments without a stable fallback
option, potentially risking downtime during rollbacks? Should Lucene
consider supporting this officially?
*Remote store*
It is a special case for segment replication, so same problems exist.
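The rollback argument in the quoted proposal can be sketched as a small model. This is only an illustration under stated assumptions: `Stage`, `RollbackCheck`, and `canRollBack` are hypothetical names, not Lucene or OpenSearch APIs, and a stage is reduced to the newest Lucene 9 minor format its software can read plus the minor format it writes.

```java
/**
 * Hypothetical deployment stage (not a Lucene or OpenSearch type).
 * newestReadableMinor: newest Lucene 9.x codec format the software can read.
 * writeMinor: Lucene 9.x codec format the software writes at this stage.
 */
record Stage(String software, int newestReadableMinor, int writeMinor) {}

final class RollbackCheck {
    private RollbackCheck() {}

    /**
     * Assumption: rolling back from one stage to another is safe only if the
     * target software can read the newest format written so far.
     */
    static boolean canRollBack(Stage from, Stage to) {
        return from.writeMinor() <= to.newestReadableMinor();
    }

    // Stages taken from the quoted proposal: 2.x ships Lucene 9.4, 2.x+1 ships Lucene 9.9.
    static final Stage OLD = new Stage("OpenSearch 2.x", 4, 4);
    static final Stage PHASE_1 = new Stage("OpenSearch 2.x+1, write 9_4", 9, 4);
    static final Stage PHASE_2 = new Stage("OpenSearch 2.x+1, write 9_9", 9, 9);
}
```

In this model, `canRollBack(PHASE_1, OLD)` and `canRollBack(PHASE_2, PHASE_1)` both hold, while the current single-step process corresponds to `canRollBack(PHASE_2, OLD)`, which does not.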
Lucene does support reading N-1 index versions at the same time, and it should theoretically be feasible to enable writing for N-1 versions, though not concurrently. The software boots up with one index version and maintains it throughout its runtime. This means the software doesn't need to handle cases where the writer version is switched in memory; it can perform all necessary checks at boot time. While this approach may involve high maintenance overhead for Lucene, I want to emphasize the feasibility aspect and understand it better.
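A minimal sketch of the boot-time check described in this comment, assuming a hypothetical `IndexWriteVersion` holder. Lucene does not provide such a class; the names and the "current or N-1" rule are taken from the comment for illustration only.

```java
/**
 * Hypothetical holder for the index version a process writes.
 * It is fixed once at boot and never switched while the process runs.
 */
final class IndexWriteVersion {
    private final int writeMajor;

    private IndexWriteVersion(int writeMajor) {
        this.writeMajor = writeMajor;
    }

    /**
     * Validates the requested write version once, at boot.
     * Assumption: only the running major version N or N-1 may be written.
     */
    static IndexWriteVersion atBoot(int runningMajor, int requestedWriteMajor) {
        if (requestedWriteMajor != runningMajor && requestedWriteMajor != runningMajor - 1) {
            throw new IllegalStateException(
                    "Cannot write index version " + requestedWriteMajor
                            + " with software version " + runningMajor);
        }
        return new IndexWriteVersion(requestedWriteMajor);
    }

    int writeMajor() {
        return writeMajor;
    }
}
```

Because the field is final and there is no setter, every write path can rely on a single version for the lifetime of the process, which is the property the comment argues keeps the approach feasible.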
Description
Allow write capabilities with previous major backward-compatible (bwc) Lucene versions to support rolling upgrades [2] on OpenSearch.
Background
Customers using OpenSearch perform upgrades to move to the latest version of the OpenSearch software. One possible upgrade path is a rolling upgrade [2], where nodes are upgraded one at a time. This process results in an intermediate state where a few nodes are on the latest OpenSearch version while others are still running the older version, creating a mixed-version cluster. This state does not work well for indices with segment replication [1] enabled, because the primary shard copies its segment files over to the replica shard copies. This works fine when all nodes are running the same version, but during upgrades a replica shard copy may be running on an older-version node and therefore not understand segment files written with the newer codec on the primary shard. This results in replica shard failures and impacts search availability.
Solution attempted and issue
OpenSearch attempted to solve the mixed-cluster issue by having primary shard copies keep using the older codec until all replica copies are on the latest software. Segment files are then written with the older codec, which the replica shards can read. This works for upgrades with a minor Lucene version bump, but not for a major Lucene version bump. We identified from [4] and from manual testing that Lucene moves older codecs into bwc-codecs and only allows reads with previous major versions. Thus, there is no write compatibility with previous major Lucene versions, and the solution attempted on OpenSearch will not work for major Lucene version upgrades.
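The attempted approach and its limit can be sketched as follows. This is a simplified model, not OpenSearch's implementation: `LuceneVersion` and `MixedClusterCodecPolicy` are hypothetical names, and the rule that a primary can only write formats from its own major version is the assumption stated in this section.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of a Lucene version; not a Lucene or OpenSearch class. */
record LuceneVersion(int major, int minor) implements Comparable<LuceneVersion> {
    @Override
    public int compareTo(LuceneVersion other) {
        return major != other.major
                ? Integer.compare(major, other.major)
                : Integer.compare(minor, other.minor);
    }
}

final class MixedClusterCodecPolicy {
    private MixedClusterCodecPolicy() {}

    /** The format every copy can read: the oldest version among primary and replicas. */
    static LuceneVersion writeTarget(LuceneVersion primary, List<LuceneVersion> replicas) {
        List<LuceneVersion> all = new ArrayList<>(replicas);
        all.add(primary);
        return Collections.min(all);
    }

    /**
     * Assumption from the issue text: the primary can be made to write an older
     * format only within its own major version, because codecs of previous
     * major versions are read-only.
     */
    static boolean primaryCanServe(LuceneVersion primary, List<LuceneVersion> replicas) {
        return writeTarget(primary, replicas).major() == primary.major();
    }
}
```

With a primary on 9.9 and a replica still on 9.4, `primaryCanServe` holds and the write target is 9.4. With a primary on 10.0 and a replica on 9.9, it does not, which is the major-upgrade case this issue is about.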
References
[1] Segment replication in OpenSearch
[2] Rolling upgrades in OpenSearch
[3] OpenSearch engine issue
[4] Backward codecs in Lucene