Node fails to perform data write on replica shards while upgrading to OS 2.12 #4085

Closed
Dhruvan1217 opened this issue Feb 28, 2024 · 7 comments
Labels: bug (Something isn't working), triaged (Issues labeled as 'Triaged' have been reviewed and are deemed actionable)

Dhruvan1217 commented Feb 28, 2024

What is the bug?

While upgrading the OpenSearch cluster from 1.3.9 to 2.12, shards are failing to recover. Data cannot be replicated to the replicas, with the following exception in the logs.

NOTE: This issue was also showing up in 2.11.x and was expected to be fixed in 2.12.0, Reference: opensearch-project/OpenSearch#11491

  • In an ongoing in-place upgrade (where some nodes are on 1.3 and others are on 2.12), I have noticed this happening when the primary shard assigned to a new/upgraded node tries to replicate data to a replica assigned to another upgraded node.
[2024-02-28T06:56:09,731][WARN ][org.opensearch.action.bulk.TransportShardBulkAction] [[Index_x_y_z][5]] failed to perform indices:data/write/bulk[s] on replica [Index_x_y_z][5], node[WGf0Uf3fSBC9Cvu8a0OwFg], [R], s[STARTED], a[id=tkTI2NBcQGKedZRRAi6_Hg]
org.opensearch.transport.RemoteTransportException: [sysdigcloud-elasticsearch-1][p.q.r.s:9300][indices:data/write/bulk[s][r]]
Caused by: org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: -84 is not a valid id
	at org.opensearch.security.support.Base64CustomHelper.deserializeObject(Base64CustomHelper.java:136) ~[?:?]
	at org.opensearch.security.support.Base64Helper.deserializeObject(Base64Helper.java:46) ~[?:?]
	at org.opensearch.security.transport.SecurityRequestHandler.messageReceivedDecorate(SecurityRequestHandler.java:187) ~[?:?]
	at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:154) ~[?:?]
	at org.opensearch.security.OpenSearchSecurityPlugin$6$1.messageReceived(OpenSearchSecurityPlugin.java:795) ~[?:?]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.InboundHandler.handleRequest(InboundHandler.java:271) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:144) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:127) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:770) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:175) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:150) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) [transport-netty4-client-2.12.0.jar:2.12.0]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) [netty-handler-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[netty-codec-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) [netty-handler-4.1.106.Final.jar:4.1.106.Final]
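
For context on the "-84 is not a valid id" message: -84 is the signed value of the byte 0xAC, and JDK object-serialization streams begin with the magic bytes 0xAC 0xED. One plausible reading of the trace (an assumption on my part, not confirmed in the plugin code) is that a header written with the older JDK/Base64 serialization path is being handed to Base64CustomHelper, which treats the first byte as a type id and rejects it. A minimal, self-contained Java sketch of that byte-level mismatch (the class and messages below are illustrative only, not the plugin's actual code):

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;

public class SerializationHeaderDemo {
    public static void main(String[] args) throws Exception {
        // Serialize any value with plain JDK serialization (as older plugin versions reportedly did).
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject("some transport header value");
        }
        byte first = bos.toByteArray()[0];
        System.out.println(first);                // prints -84, i.e. (byte) 0xAC from the 0xACED stream magic
        System.out.println(first == (byte) 0xAC); // prints true
        // A decoder that reads this first byte as a small type id, as a custom
        // serializer might, has to reject it:
        if (first < 0) {
            throw new IllegalArgumentException(first + " is not a valid id");
        }
    }
}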

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Set up a 1.3.x cluster (e.g., a 3-node cluster).
  2. Ingest some data into indices and let data continue to flow in (possibly also create new indices between node restarts).
  3. Set shard allocation to primaries (see the sketch after this list).
  4. Start the in-place upgrade to 2.12.0 by upgrading the nodes one by one.
  5. Reboot the first node; once it is initialized, let the cluster turn green (by setting allocation back to all) before restarting the next node.
  6. Set allocation to primaries again and reboot the second node.
  7. As soon as the second node is initialized and allocation is set back to all, if a replica shard is assigned to this node whose primary is on the first upgraded node, you will see errors in the logs when the first node tries to replicate writes to that replica shard.
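
For reference, a minimal sketch of the allocation toggling used in steps 3, 5, and 6, sent to the standard _cluster/settings API with java.net.http (the endpoint, the use of persistent settings, and the absence of TLS/auth are assumptions for illustration; adjust for a security-enabled cluster):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AllocationToggle {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Sets cluster.routing.allocation.enable to "primaries" or "all".
    static void setAllocation(String mode) throws Exception {
        String body = "{\"persistent\":{\"cluster.routing.allocation.enable\":\"" + mode + "\"}}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_cluster/settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(mode + " -> " + response.body());
    }

    public static void main(String[] args) throws Exception {
        setAllocation("primaries"); // before restarting/upgrading a node
        // ... restart the node, wait for it to rejoin ...
        setAllocation("all");       // let replicas recover and the cluster turn green
    }
}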

What is the expected behavior?
Data writes to shard replicas should succeed.

What is your host/environment?

  • OS: AMD64
  • Plugins
$ /usr/share/opensearch/bin/opensearch-plugin list
opensearch-index-management
opensearch-job-scheduler
opensearch-security
repository-gcs
repository-s3
Dhruvan1217 added the bug and untriaged labels on Feb 28, 2024
Dhruvan1217 (Author):

@cwperks Tagging you in case you can take a further look at this: opensearch-project/OpenSearch#11491

Dhruvan1217 changed the title from "Node fails to perform data write on replica shards while upgrading to OS >=2.11" to "Node fails to perform data write on replica shards while upgrading to OS 2.12" on Feb 28, 2024
peternied (Member):

This issue is in performance-analyzer, transferring to that repo

peternied transferred this issue from opensearch-project/security on Feb 28, 2024
cwperks (Member) commented Feb 28, 2024

When the second node is being rebooted, could there be an in-flight transport request that gets resumed when the second node is brought back up as a 2.12 node?

I'd have to dig into it further, but this line determines which serialization to use when sending a transport request. It could be that the first node, already replaced with a 2.12 node, is in the middle of sending a request to the second node (still on 1.3) when that node is rebooted before it can reply to the 2.12 node. When the node comes back online, could that transport request be getting replayed?
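
To make the mechanism concrete: what is being described is, as I understand it, a version check on the destination channel that decides between the older JDK serialization and the newer custom serialization. A deliberately simplified Java sketch of that selection pattern, not the plugin's actual code (the cutover version and all names here are hypothetical):

// Hypothetical names and cutover version, for illustration only.
enum SerializationFormat { JDK, CUSTOM }

final class SerializerChoice {
    // Assume custom serialization is used only for peers on or after some cutover version.
    static final int CUSTOM_SERIALIZATION_MIN_VERSION_ID = 2_110_099; // e.g. "2.11.0"

    static SerializationFormat forPeer(int peerVersionId) {
        return peerVersionId >= CUSTOM_SERIALIZATION_MIN_VERSION_ID
                ? SerializationFormat.CUSTOM
                : SerializationFormat.JDK;
    }
}

If the sender and receiver disagree about which side of that cutover the peer is on, for example because a request serialized before the reboot is replayed to the node after it comes back as 2.12, the receiver would try to decode a JDK-serialized payload with the custom deserializer, which would be consistent with the -84/0xAC failure above.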

Dhruvan1217 (Author) commented Feb 29, 2024

@peternied We don't have PA enabled; this is happening without that plugin, and it appears to be a generic security module problem. We should move it back to the security repo. Thanks

peternied (Member) commented Feb 29, 2024

@opensearch-project/admin Please transfer this issue to the security repo.

bbarani transferred this issue from opensearch-project/performance-analyzer on Feb 29, 2024
stephen-crawford (Contributor):

[Triage] Hi @Dhruvan1217, thanks for filing this issue and providing detailed reproduction steps. Someone will take a look, identify the problem, and note next steps.

stephen-crawford added the triaged label and removed the untriaged label on Mar 4, 2024
Dhruvan1217 (Author) commented Mar 21, 2024

Also, AFAIU there were no issues in OpenSearch 2.10, so maybe we can look at what changed after that (I believe the introduction of custom serialization/deserialization).
