[BUG][OpenSearch] Cluster-Manager discovery not working #499

Closed
felix185 opened this issue Nov 24, 2023 · 1 comment
Labels
bug (Something isn't working), untriaged (Issues that have not yet been triaged)

Comments


felix185 commented Nov 24, 2023

Describe the bug
I tried to deploy a simple OpenSearch cluster with the provided Helm charts in our Kubernetes cluster (1.26) using Helm (3.13.1). As soon as I increase the number of replicas for the cluster-manager/master from 1 to 2 or 3, the cluster does not start successfully.
I had to make a few adjustments in values.yaml (i.e. providing the URL of our private image registry and a corresponding imagePullSecret). All other values are left at their defaults.

To Reproduce
Steps to reproduce the behavior (a sketch of the value overrides follows the list):

  1. Clone this repository.
  2. Open values.yaml from charts/opensearch.
  3. Change global.registry to the private registry.
  4. Add the image pull secret to imagePullSecrets.
  5. Deploy via helm upgrade opensearch ./charts/opensearch --install.
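
The same overrides can roughly be expressed on the command line; the registry URL and secret name below are placeholders, and the exact value keys may differ between chart versions:

    helm upgrade opensearch ./charts/opensearch --install \
      --set global.registry=registry.example.com \
      --set 'imagePullSecrets[0].name=my-registry-secret'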

Expected behavior
OpenSearch starts without errors, and the request to verify the OpenSearch installation as described here succeeds.
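
For reference, the verification request is roughly the following; the service name is inferred from the pod names in the logs below, the admin password is a placeholder, and -k assumes the chart's demo self-signed certificates:

    kubectl port-forward svc/opensearch-cluster-master 9200:9200 &
    curl -sk -u 'admin:<admin-password>' https://localhost:9200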

Chart Name
Specify the chart that is affected:
OpenSearch

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • Helm Version: 3.13.1
  • Kubernetes Version: 1.26.7

Additional context
Add any other context about the problem here.
If I run the Helm chart with replicas set to 1, everything is set up as I would expect. But as soon as I increase the number of replicas to 2 or 3 (3 is the default from the initial clone), I get the following exception:

org.opensearch.discovery.ClusterManagerNotDiscoveredException: null
	at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350) [opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) [opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) [opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707) [opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-2.11.0.jar:2.11.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-11-23T13:39:57,962][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-cluster-master-0] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: ClusterManagerNotDiscoveredException[null]
	at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:184) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.11.0.jar:2.11.0]
	at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.11.0.jar:2.11.0]
	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103) ~[opensearch-2.11.0.jar:2.11.0]
Caused by: org.opensearch.discovery.ClusterManagerNotDiscoveredException
	at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[opensearch-2.11.0.jar:2.11.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
uncaught exception in thread [main]
ClusterManagerNotDiscoveredException[null]
	at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350)
	at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394)
	at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294)
	at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

With trace logging enabled, I also see the following:

[2023-11-24T08:11:40,318][TRACE][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] startProbe(192.168.128.139:9300) not probing local node
[2023-11-24T08:11:40,319][TRACE][o.o.d.SeedHostsResolver  ] [opensearch-cluster-master-0] resolved host [opensearch-cluster-master-headless] to [192.168.128.139:9300, 192.168.129.236:9300]
[2023-11-24T08:11:40,319][TRACE][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] probing resolved transport addresses [192.168.129.236:9300]
[2023-11-24T08:11:40,350][DEBUG][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] Peer{transportAddress=192.168.129.236:9300, discoveryNode=null, peersRequestInFlight=false} connection failed
org.opensearch.transport.ConnectTransportException: [][192.168.129.236:9300] connect_timeout[3s]
	at org.opensearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1083) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[opensearch-2.11.0.jar:2.11.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-11-24T08:11:41,319][TRACE][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] probing cluster-manager nodes from cluster state: nodes: 
   {opensearch-cluster-master-0}{6ln17rDKRuS80Z40Rmt8Og}{ibQZHrlmQ_SFaIIVNythJQ}{192.168.128.139}{192.168.128.139:9300}{dimr}{shard_indexing_pressure_enabled=true}, local

It looks like each pod can discover itself as a cluster-manager, but as soon as it has to peer with/discover the other cluster-manager pods, it cannot reach them. If I exec into a pod and curl the IP of one of the other pods, that request also times out.
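
For illustration, this is the kind of connectivity check I ran; the pod name and peer IP are taken from the trace output above and will differ per deployment:

    # list the cluster-manager pods and their IPs
    kubectl get pods -o wide
    # from inside one pod, probe the transport port (9300) of a peer pod
    kubectl exec -it opensearch-cluster-master-0 -- curl -v --max-time 5 telnet://192.168.129.236:9300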

I'm working in a clean namespace, so there is no NetworkPolicy deployed. The only thing deployed besides the Helm chart is the secret used to pull the images from the private registry.
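
One way to double-check that (the namespace name is a placeholder):

    kubectl get networkpolicy -n <namespace>
    kubectl get secrets -n <namespace>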

felix185 added the bug (Something isn't working) and untriaged (Issues that have not yet been triaged) labels on Nov 24, 2023
felix185 (Author) commented

Closing this issue, as it has nothing to do with the Helm charts but with the network configuration of our managed Kubernetes cluster. Sorry for any inconvenience caused.
