[Segment Replication][Rolling upgrade] Create integration tests to simulate rolling upgrade scenarios #7490

Closed
4 tasks
Poojita-Raj opened this issue May 9, 2023 · 4 comments
Labels: distributed framework, enhancement (Enhancement or improvement to existing feature or request)

@Poojita-Raj
Contributor

Is your feature request related to a problem? Please describe.

Create integration tests that simulate the behavior of a rolling upgrade.
This would include:

  1. Setting up different scenarios of mixed-version clusters: varying the cluster size, the number of primaries, the number of replicas, the order in which nodes are upgraded, etc.
  2. Ensuring we can pull distributions that have a definite Lucene codec change between them to initialize the nodes.
  3. Testing that all indexing operations take place as expected in a mixed cluster state (downgrading the Lucene version of segments if needed).
  4. Testing that the codec in use is updated once all nodes are on the higher version and the upgrade is complete.
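
The expectation in items 1 and 4 can be sketched as a small simulation. This is only an illustrative model, not OpenSearch test code: the class and method names are hypothetical, codecs are represented as plain integers (94 for Lucene94, 95 for Lucene95), and it assumes the cluster-wide codec is the lowest one supported by any node.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a rolling upgrade; none of these names are OpenSearch APIs.
final class RollingUpgradeSketch {

    // Assumption: the codec usable by the whole cluster is the lowest one any node supports.
    static String clusterCodec(Map<String, Integer> nodeCodecs) {
        return "Lucene" + Collections.min(nodeCodecs.values());
    }

    // Upgrades nodes one at a time in node-name order and records the
    // cluster-wide codec before the upgrade and after each step.
    static List<String> simulate(List<String> nodeNames, int oldCodec, int newCodec) {
        TreeMap<String, Integer> nodes = new TreeMap<>();
        for (String name : nodeNames) {
            nodes.put(name, oldCodec);
        }
        List<String> observed = new ArrayList<>();
        observed.add(clusterCodec(nodes));
        // Iterate over a copy of the sorted keys so the map can be updated safely.
        for (String name : new ArrayList<>(nodes.keySet())) {
            nodes.put(name, newCodec);
            observed.add(clusterCodec(nodes));
        }
        return observed;
    }

    public static void main(String[] args) {
        System.out.println(simulate(List.of("node-2", "node-0", "node-1"), 94, 95));
    }
}
```

For a three-node cluster the recorded codec stays at Lucene94 while the cluster is mixed and only moves to Lucene95 after the last node is upgraded, which is the property item 4 asks the tests to verify.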
@dreamer-89
Member

We already have rolling upgrade and mixed cluster test support. Tests for both were added as part of PR #7537.

@dreamer-89
Member

  1. Setting up different scenarios of mixed version clusters - varying the cluster size, number of primaries, number of replicas, the order in which nodes upgraded, etc.

PR #7537 covers the use case of a 3-node cluster with 3 primary shards and 1 replica. This setup simulates the required behavior, where the replica and the primary run on different OpenSearch (OS) versions. The nodes are always upgraded in a deterministic order based on node names. @Poojita-Raj: Let me know if you feel that adding tests with a different setup would be useful.

  2. Ensuring we can pull distributions that have a definite lucene codec change between them to init the nodes.

I think this is better (and easier) done via unit tests rather than a bwc integration test.

  3. Testing that all indexing operations take place as expected in a mixed cluster state (downgrading the lucene version of segments if needed).

This is verified in the test.

  4. Testing that the codec used is updated once all nodes are on a higher version and upgrade is complete.

This is verified via the rolling upgrade tests.

@dreamer-89
Member

To run the bwc tests across different Lucene codec versions, I hacked the 2.x branch to use Lucene94Codec. This needed the changes below:

  1. Revert the backport of the Lucene upgrade to 9.6 (Update Apache Lucene to 9.6.0 #7505)
  2. Revert the backport of the Lucene upgrade to 9.5 ([Upgrade] Lucene 9.5.0 release #6078)
  3. Change the Lucene codec version inside Version.java

With this setup, functionality that breaks when the codecs actually differ during an upgrade shows up immediately:

  1. The test first breaks with a replica shard failure (failed recovery). The cause is this check, which cancels the round of segment replication on the source when the replica is running a different Lucene codec.
[2023-05-16T14:11:54,161][WARN ][o.o.c.r.a.AllocationService] [v2.8.0-2] failing shard [failed shard, shard [test-index-segrep][1], node[-_S5jn77Sp6N4iuVYhgrAw], relocating [9TaegTklQ0eGE38bwx606w], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=CkmjnQ1xR_-P_Mk80g5l9w, rId=rBnU6JGFTvypCnO59Gq4Dw], expected_shard_size[4272], message [failed recovery], failure [RecoveryFailedException[[test-index-segrep][1]: Recovery failed from {v2.8.0-1}{JTOdWb9QQ3-GjUE-dHquqQ}{JKv64zmEQhuyKK9sp4HgRA}{127.0.0.1}{127.0.0.1:60674}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} into {v2.8.0-0}{-_S5jn77Sp6N4iuVYhgrAw}{lSZ_ejd-QMaWpq7VVjzaSA}{127.0.0.1}{127.0.0.1:60834}{dimr}{upgraded=true, testattr=test, shard_indexing_pressure_enabled=true} ([test-index-segrep][1]: Recovery failed from {v2.8.0-1}{JTOdWb9QQ3-GjUE-dHquqQ}{JKv64zmEQhuyKK9sp4HgRA}{127.0.0.1}{127.0.0.1:60674}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} into {v2.8.0-0}{-_S5jn77Sp6N4iuVYhgrAw}{lSZ_ejd-QMaWpq7VVjzaSA}{127.0.0.1}{127.0.0.1:60834}{dimr}{upgraded=true, testattr=test, shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[test-index-segrep][1]: Recovery failed from {v2.8.0-1}{JTOdWb9QQ3-GjUE-dHquqQ}{JKv64zmEQhuyKK9sp4HgRA}{127.0.0.1}{127.0.0.1:60674}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} into {v2.8.0-0}{-_S5jn77Sp6N4iuVYhgrAw}{lSZ_ejd-QMaWpq7VVjzaSA}{127.0.0.1}{127.0.0.1:60834}{dimr}{upgraded=true, testattr=test, shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[v2.8.0-1][127.0.0.1:60674][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[v2.8.0-0][127.0.0.1:60834][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[[test-index-segrep][1]: Replication failed on ]; nested: ExecutionCancelledException[ParameterizedMessage[messagePattern=Requested unsupported codec version {}, stringArgs=[Lucene95], throwable=null]]; ], markAsStale [true]]
  2. After disabling the above check, the test breaks because the replica does not have the same number of docs as the primary. This is due to this check on the replica, which prevents segment replication when the Lucene codec name differs from the primary's. The log below shows a replica running a lower OS version with the older codec deciding to skip the published checkpoint. The same holds the other way around, i.e. when the replica runs the higher version and the primary is on the lower codec version.
[2023-05-17T11:17:58,748][TRACE][o.o.i.s.IndexShard       ] [v2.8.0-2] [test-index-segrep][1] Shard does not support the received lucene codec version Lucene95
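
The second failure can be illustrated with a minimal sketch of a replica that ignores checkpoints written with a codec other than its own. This is an assumed model for illustration only: the class names, fields, and the `onNewCheckpoint` method are hypothetical and do not reproduce the actual OpenSearch check; only the log message text is taken from the trace above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the replica-side check; none of these names are OpenSearch APIs.
final class CheckpointCodecCheckSketch {

    // A published checkpoint: the primary's doc count and the codec its segments were written with.
    static final class Checkpoint {
        final int docCount;
        final String codecName;

        Checkpoint(int docCount, String codecName) {
            this.docCount = docCount;
            this.codecName = codecName;
        }
    }

    // A replica that only syncs when the checkpoint's codec name matches its own.
    static final class Replica {
        final String codecName;
        final List<String> log = new ArrayList<>();
        int docCount = 0;

        Replica(String codecName) {
            this.codecName = codecName;
        }

        void onNewCheckpoint(Checkpoint checkpoint) {
            if (!codecName.equals(checkpoint.codecName)) {
                // Skipping the checkpoint leaves the replica behind the primary.
                log.add("Shard does not support the received lucene codec version " + checkpoint.codecName);
                return;
            }
            docCount = checkpoint.docCount;
        }
    }

    public static void main(String[] args) {
        Replica replica = new Replica("Lucene94");
        replica.onNewCheckpoint(new Checkpoint(10, "Lucene95"));
        System.out.println("replica docs: " + replica.docCount + ", log: " + replica.log);
    }
}
```

Because the check compares names for equality, the skip happens in both directions, which matches the observation that the doc counts diverge whether the replica or the primary is the upgraded node.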

2.x branch (bwc branch): https://github.com/dreamer-89/OpenSearch/commits/2.8.0_Lucene94Codec

@dreamer-89
Member

Verified that existing segment replication breaks while building the SegmentInfos object from the file copied over from the primary.

Caused by: java.lang.IllegalArgumentException: An SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene95' does not exist.  You need to add the corresponding JAR file supporting this SPI to your classpath.  The current classpath supports the following names: [Lucene94, Lucene80, Lucene84, Lucene86, Lucene87, Lucene70, Lucene90, Lucene91, Lucene92]
	at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:113) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.apache.lucene.codecs.Codec.forName(Codec.java:118) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:511) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:404) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:363) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:310) ~[lucene-core-9.4.2.jar:9.4.2 858d9b437047a577fa9457089afff43eefa461db - jpountz - 2022-11-17 12:56:39]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$finalizeReplication$4(SegmentReplicationTarget.java:226) ~[opensearch-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
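
The trace shows a lookup by codec name failing because the older node's classpath has no codec registered under 'Lucene95'. The sketch below models that name-based lookup with a plain set of names, using the names listed in the exception. It is a stand-in for illustration, not Lucene's implementation, and `CodecLookupSketch`, `lookup`, and `canRead` are hypothetical names.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a name-based codec registry; not Lucene code.
final class CodecLookupSketch {

    // Codec names reported as available in the exception message above.
    static final Set<String> SUPPORTED = Set.of(
        "Lucene94", "Lucene80", "Lucene84", "Lucene86", "Lucene87",
        "Lucene70", "Lucene90", "Lucene91", "Lucene92");

    // Fails the same way the trace does when the name is not registered.
    static String lookup(Set<String> supported, String name) {
        if (!supported.contains(name)) {
            throw new IllegalArgumentException(
                "A codec with name '" + name + "' does not exist. Supported names: "
                    + new TreeSet<>(supported));
        }
        return name;
    }

    // A guard a caller could run before trying to read copied segment files.
    static boolean canRead(Set<String> supported, String name) {
        return supported.contains(name);
    }

    public static void main(String[] args) {
        System.out.println(canRead(SUPPORTED, "Lucene95"));
        System.out.println(lookup(SUPPORTED, "Lucene94"));
    }
}
```

The point of the model is that segments written by a newer primary carry a codec name the older replica cannot resolve, so the failure occurs as soon as the copied commit is parsed.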
