
[BUG][Segment Replication] Manual reroute of shards from lower codec node to higher version fails #7489

Closed
Poojita-Raj opened this issue May 9, 2023 · 1 comment
Labels: bug, distributed framework

Poojita-Raj commented May 9, 2023

Describe the bug

Related to #3881
After adding a higher-version node to an OpenSearch cluster and manually rerouting a shard onto it, the compatibility check fails because the new node uses a newer Lucene codec than the source node.

For example, node1, node2, and node3 use codec Lucene94, and we add a node "new-1" running a newer version that uses codec Lucene95.
We then reroute a shard from node1 to new-1.
The replication fails with this root cause:

Caused by: org.opensearch.common.util.CancellableThreads$ExecutionCancelledException: ParameterizedMessage[messagePattern=Requested unsupported codec version {}, stringArgs=[Lucene95], throwable=null]
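
A minimal sketch of how an exact string comparison of codec names could produce this message. This is not the OpenSearch implementation: the class `StrictCodecCheck` and the method `check` are hypothetical, and only the message text is taken from the exception above.

```java
// Hypothetical illustration only; the real check lives in OpenSearch's
// segment replication code and may differ in structure.
class StrictCodecCheck {

    // Returns an error message when the requested codec name does not
    // exactly match the source node's codec name, otherwise null.
    static String check(String sourceCodec, String requestedCodec) {
        if (!sourceCodec.equals(requestedCodec)) {
            return "Requested unsupported codec version " + requestedCodec;
        }
        return null;
    }

    public static void main(String[] args) {
        // Source node on Lucene94, target node requesting Lucene95.
        System.out.println(check("Lucene94", "Lucene95"));
    }
}
```

With a comparison like this, any difference in codec name is rejected, including the upgrade case where the target is newer than the source.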

To Reproduce
Steps to reproduce the behavior:

  1. Create a cluster whose nodes use different Lucene codec versions.
  2. Manually reroute a shard from a lower-codec node to a higher-codec node.
  3. A ReplicationFailedException is thrown.
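
For step 2, the manual reroute can be issued through the cluster reroute API (`POST /_cluster/reroute`) with a `move` command. The sketch below only builds the request body. The helper name `moveCommand` is hypothetical, the node names come from the example above, and the index name and shard number are taken from the stack trace.

```java
// Builds the JSON body for a cluster reroute "move" command.
// Sending it (for example with an HTTP client) is left out on purpose.
class RerouteBody {

    static String moveCommand(String index, int shard, String fromNode, String toNode) {
        return String.format(
            "{\"commands\":[{\"move\":{\"index\":\"%s\",\"shard\":%d,"
                + "\"from_node\":\"%s\",\"to_node\":\"%s\"}}]}",
            index, shard, fromNode, toNode);
    }

    public static void main(String[] args) {
        // Move shard 1 of my-index1 from the Lucene94 node to the Lucene95 node.
        System.out.println(moveCommand("my-index1", 1, "node1", "new-1"));
    }
}
```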

Expected behavior
This behavior is faulty: shards must always be able to move onto higher-version nodes, most commonly during an upgrade. The check should let the shard movement go through when the target node is on a higher version.
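
One way to picture the expected behavior is a check that permits the move whenever the target codec is the same as or newer than the source codec. The sketch below is an assumption for illustration, not the fix that was applied: `CodecVersionCheck`, `codecNumber`, and `allowsMove` are hypothetical names, and parsing the number out of names such as `Lucene94` is a simplification that would not hold for every codec naming scheme.

```java
// Hypothetical sketch: allow shard movement onto an equal or newer codec.
class CodecVersionCheck {

    // Extracts the numeric part of a codec name, e.g. "Lucene95" -> 95.
    static int codecNumber(String codecName) {
        String digits = codecName.replaceAll("\\D", "");
        if (digits.isEmpty()) {
            throw new IllegalArgumentException("No version number in codec name: " + codecName);
        }
        return Integer.parseInt(digits);
    }

    // True when the target node's codec is at least as new as the source's.
    static boolean allowsMove(String sourceCodec, String targetCodec) {
        return codecNumber(targetCodec) >= codecNumber(sourceCodec);
    }

    public static void main(String[] args) {
        System.out.println(allowsMove("Lucene94", "Lucene95")); // upgrade direction
        System.out.println(allowsMove("Lucene95", "Lucene94")); // downgrade direction
    }
}
```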

Additional context
Full stack trace:

Caused by: org.opensearch.transport.RemoteTransportException: [node-2][172.31.6.116:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.opensearch.transport.RemoteTransportException: [new-2][172.31.4.178:9300][internal:index/shard/replication/segments_sync] 
Caused by: org.opensearch.indices.replication.common.ReplicationFailedException: [my-index1][1]: Replication failed on 
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:362) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-3.0.0.jar:3.0.0]
	at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:218) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:210) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) ~[opensearch-3.0.0.jar:3.0.0]
	... 6 more
Caused by: org.opensearch.common.util.CancellableThreads$ExecutionCancelledException: ParameterizedMessage[messagePattern=Requested unsupported codec version {}, stringArgs=[Lucene95], throwable=null]
	at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:154) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:138) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:119) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-3.0.0.jar:3.0.0]
	... 3 more
Poojita-Raj added the bug and untriaged labels and removed the untriaged label on May 9, 2023.
Poojita-Raj self-assigned this on May 9, 2023.

Poojita-Raj commented

Fixed by removing the string-matching check for codec compatibility.
