Node fails to perform data write on replica shards while upgrading to OS 2.12 #4085

Closed
Dhruvan1217 opened this issue Feb 28, 2024 · 7 comments
Labels: bug (Something isn't working), triaged (Issues labeled as 'Triaged' have been reviewed and are deemed actionable)

Dhruvan1217 commented Feb 28, 2024

What is the bug?

While upgrading the OpenSearch cluster from 1.3.9 to 2.12, shards are failing to recover. Data cannot be replicated to the replicas, with the following exception in the logs.

NOTE: This issue was also showing up in 2.11.x and was expected to be fixed in 2.12.0, Reference: opensearch-project/OpenSearch#11491

  • In an ongoing in-place upgrade (where some nodes are on 1.3 and others are on 2.12), I have noticed this happening when the primary shard assigned to a new/upgraded node tries to replicate data to a replica assigned to another upgraded node.
[2024-02-28T06:56:09,731][WARN ][org.opensearch.action.bulk.TransportShardBulkAction] [[Index_x_y_z][5]] failed to perform indices:data/write/bulk[s] on replica [Index_x_y_z][5], node[WGf0Uf3fSBC9Cvu8a0OwFg], [R], s[STARTED], a[id=tkTI2NBcQGKedZRRAi6_Hg]
org.opensearch.transport.RemoteTransportException: [sysdigcloud-elasticsearch-1][p.q.r.s:9300][indices:data/write/bulk[s][r]]
Caused by: org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: -84 is not a valid id
	at org.opensearch.security.support.Base64CustomHelper.deserializeObject(Base64CustomHelper.java:136) ~[?:?]
	at org.opensearch.security.support.Base64Helper.deserializeObject(Base64Helper.java:46) ~[?:?]
	at org.opensearch.security.transport.SecurityRequestHandler.messageReceivedDecorate(SecurityRequestHandler.java:187) ~[?:?]
	at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:154) ~[?:?]
	at org.opensearch.security.OpenSearchSecurityPlugin$6$1.messageReceived(OpenSearchSecurityPlugin.java:795) ~[?:?]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.InboundHandler.handleRequest(InboundHandler.java:271) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:144) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:127) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:770) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:175) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:150) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) [transport-netty4-client-2.12.0.jar:2.12.0]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) [netty-handler-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[netty-codec-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) [netty-handler-4.1.106.Final.jar:4.1.106.Final]
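
For context on the "-84 is not a valid id" message: -84 is the signed value of the byte 0xAC, and JDK object-serialization streams begin with the magic bytes 0xAC 0xED. One plausible reading of the trace (an assumption on my part, not confirmed in the plugin code) is that a header written with the older JDK/Base64 serialization path is being handed to Base64CustomHelper, which treats the first byte as a type id and rejects it. A minimal, self-contained Java sketch of that byte-level mismatch (the class and messages below are illustrative only, not the plugin's actual code):

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;

public class SerializationHeaderDemo {
    public static void main(String[] args) throws Exception {
        // Serialize any value with plain JDK serialization (as older plugin versions reportedly did).
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject("some transport header value");
        }
        byte first = bos.toByteArray()[0];
        System.out.println(first);                // prints -84, i.e. (byte) 0xAC from the 0xACED stream magic
        System.out.println(first == (byte) 0xAC); // prints true
        // A decoder that reads this first byte as a small type id, as a custom
        // serializer might, has to reject it:
        if (first < 0) {
            throw new IllegalArgumentException(first + " is not a valid id");
        }
    }
}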

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Set up a 1.3.x cluster (e.g., a 3-node cluster).
  2. Ingest some data into indices and let data continue to flow in (possibly also create new indices between node restarts).
  3. Set shard allocation to primaries (see the sketch after this list).
  4. Start the in-place upgrade to 2.12.0 by upgrading the nodes one by one.
  5. Reboot the first node; once it is initialized, let the cluster turn green (by setting allocation back to all) before restarting the next node.
  6. Set allocation to primaries again and reboot the second node.
  7. As soon as the second node is initialized and allocation is set back to all, if a replica shard is assigned to this node whose primary is on the first upgraded node, you will see errors in the logs when the first node tries to replicate writes to that replica shard.
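
For reference, a minimal sketch of the allocation toggling used in steps 3, 5, and 6, sent to the standard _cluster/settings API with java.net.http (the endpoint, the use of persistent settings, and the absence of TLS/auth are assumptions for illustration; adjust for a security-enabled cluster):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AllocationToggle {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Sets cluster.routing.allocation.enable to "primaries" or "all".
    static void setAllocation(String mode) throws Exception {
        String body = "{\"persistent\":{\"cluster.routing.allocation.enable\":\"" + mode + "\"}}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_cluster/settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(mode + " -> " + response.body());
    }

    public static void main(String[] args) throws Exception {
        setAllocation("primaries"); // before restarting/upgrading a node
        // ... restart the node, wait for it to rejoin ...
        setAllocation("all");       // let replicas recover and the cluster turn green
    }
}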

What is the expected behavior?
Data writes to shard replicas should succeed.

What is your host/environment?

  • OS: AMD64
  • Plugins
$ /usr/share/opensearch/bin/opensearch-plugin list
opensearch-index-management
opensearch-job-scheduler
opensearch-security
repository-gcs
repository-s3
Dhruvan1217 added the bug and untriaged labels on Feb 28, 2024
Dhruvan1217 (Author):

@cwperks Tagging you in case you can take a further look at this: opensearch-project/OpenSearch#11491

Dhruvan1217 changed the title from "Node fails to perform data write on replica shards while upgrading to OS >=2.11" to "Node fails to perform data write on replica shards while upgrading to OS 2.12" on Feb 28, 2024
peternied (Member):

This issue is in performance-analyzer, transferring to that repo

peternied transferred this issue from opensearch-project/security on Feb 28, 2024
cwperks (Member) commented Feb 28, 2024

When the second node is being rebooted, could there be an in-flight transport request that gets resumed when the second node is brought back up as a 2.12 node?

I'd have to dig into it further, but this line determines which serialization to use when sending a transport request. It could be that the first node, already replaced with a 2.12 node, is in the middle of sending a request to the second node (still on 1.3) when that node is rebooted before it can reply to the 2.12 node. When the node comes back online, could that transport request be getting replayed?
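
To make the mechanism concrete: what is being described is, as I understand it, a version check on the destination channel that decides between the older JDK serialization and the newer custom serialization. A deliberately simplified Java sketch of that selection pattern, not the plugin's actual code (the cutover version and all names here are hypothetical):

// Hypothetical names and cutover version, for illustration only.
enum SerializationFormat { JDK, CUSTOM }

final class SerializerChoice {
    // Assume custom serialization is used only for peers on or after some cutover version.
    static final int CUSTOM_SERIALIZATION_MIN_VERSION_ID = 2_110_099; // e.g. "2.11.0"

    static SerializationFormat forPeer(int peerVersionId) {
        return peerVersionId >= CUSTOM_SERIALIZATION_MIN_VERSION_ID
                ? SerializationFormat.CUSTOM
                : SerializationFormat.JDK;
    }
}

If the sender and receiver disagree about which side of that cutover the peer is on, for example because a request serialized before the reboot is replayed to the node after it comes back as 2.12, the receiver would try to decode a JDK-serialized payload with the custom deserializer, which would be consistent with the -84/0xAC failure above.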

Dhruvan1217 (Author) commented Feb 29, 2024

@peternied We don't have PA enabled; this is happening without that plugin, and it appears to be a generic security module problem. We should move it back to the security repo. Thanks

peternied (Member) commented Feb 29, 2024

@opensearch-project/admin Please transfer this issue to the security repo.

bbarani transferred this issue from opensearch-project/performance-analyzer on Feb 29, 2024
stephen-crawford (Contributor):

[Triage] Hi @Dhruvan1217, thanks for filing this issue and providing detailed reproduction steps. Someone will take a look, identify the problem, and note next steps.

stephen-crawford added the triaged label and removed the untriaged label on Mar 4, 2024
Dhruvan1217 (Author) commented Mar 21, 2024

Also, AFAIU there were no issues in OpenSearch 2.10, so maybe we can look at what changed after that (I believe the introduction of custom serialization/deserialization).
