[BUG][OpenSearch] Cluster-Manager discovery not working #499

Closed
felix185 opened this issue Nov 24, 2023 · 1 comment
Labels
bug (Something isn't working), untriaged (Issues that have not yet been triaged)

Comments


felix185 commented Nov 24, 2023

Describe the bug
I tried to deploy a simple OpenSearch cluster with the provided Helm charts in our Kubernetes cluster (1.26) using Helm (3.13.1). As soon as I increase the number of replicas for the cluster-manager/master from 1 to 2 or 3, the cluster does not start successfully.
I had to make a few adjustments in values.yaml (i.e. providing the URL of our private image registry and a corresponding imagePullSecret). All other values are left at their defaults.

To Reproduce
Steps to reproduce the behavior (a sketch of the value overrides follows the list):

  1. Clone this repository.
  2. Open values.yaml from charts/opensearch.
  3. Change global.registry to the private registry.
  4. Add the image pull secret to imagePullSecrets.
  5. Deploy via helm upgrade opensearch ./charts/opensearch --install.
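
The same overrides can roughly be expressed on the command line; the registry URL and secret name below are placeholders, and the exact value keys may differ between chart versions:

    helm upgrade opensearch ./charts/opensearch --install \
      --set global.registry=registry.example.com \
      --set 'imagePullSecrets[0].name=my-registry-secret'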

Expected behavior
OpenSearch starts without errors, and the request to verify the OpenSearch installation as described here succeeds.
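
For reference, the verification request is roughly the following; the service name is inferred from the pod names in the logs below, the admin password is a placeholder, and -k assumes the chart's demo self-signed certificates:

    kubectl port-forward svc/opensearch-cluster-master 9200:9200 &
    curl -sk -u 'admin:<admin-password>' https://localhost:9200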

Chart Name
Specify the chart that is affected:
OpenSearch

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • Helm Version: 3.13.1
  • Kubernetes Version: 1.26.7

Additional context
Add any other context about the problem here.
If I run the Helm chart with replicas set to 1, everything is set up as I would expect. But as soon as I increase the number of replicas to 2 or 3 (3 is the default from the initial clone), I get the following exception:

org.opensearch.discovery.ClusterManagerNotDiscoveredException: null
	at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350) [opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) [opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) [opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707) [opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-2.11.0.jar:2.11.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-11-23T13:39:57,962][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-cluster-master-0] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: ClusterManagerNotDiscoveredException[null]
	at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:184) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.11.0.jar:2.11.0]
	at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.11.0.jar:2.11.0]
	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103) ~[opensearch-2.11.0.jar:2.11.0]
Caused by: org.opensearch.discovery.ClusterManagerNotDiscoveredException
	at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[opensearch-2.11.0.jar:2.11.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
uncaught exception in thread [main]
ClusterManagerNotDiscoveredException[null]
	at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350)
	at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394)
	at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294)
	at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

With trace logging enabled, I also see the following:

[2023-11-24T08:11:40,318][TRACE][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] startProbe(192.168.128.139:9300) not probing local node
[2023-11-24T08:11:40,319][TRACE][o.o.d.SeedHostsResolver  ] [opensearch-cluster-master-0] resolved host [opensearch-cluster-master-headless] to [192.168.128.139:9300, 192.168.129.236:9300]
[2023-11-24T08:11:40,319][TRACE][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] probing resolved transport addresses [192.168.129.236:9300]
[2023-11-24T08:11:40,350][DEBUG][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] Peer{transportAddress=192.168.129.236:9300, discoveryNode=null, peersRequestInFlight=false} connection failed
org.opensearch.transport.ConnectTransportException: [][192.168.129.236:9300] connect_timeout[3s]
	at org.opensearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1083) ~[opensearch-2.11.0.jar:2.11.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[opensearch-2.11.0.jar:2.11.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-11-24T08:11:41,319][TRACE][o.o.d.PeerFinder         ] [opensearch-cluster-master-0] probing cluster-manager nodes from cluster state: nodes: 
   {opensearch-cluster-master-0}{6ln17rDKRuS80Z40Rmt8Og}{ibQZHrlmQ_SFaIIVNythJQ}{192.168.128.139}{192.168.128.139:9300}{dimr}{shard_indexing_pressure_enabled=true}, local

It looks like each pod can discover itself as a cluster-manager, but as soon as it has to peer with/discover the other cluster-manager pods, it cannot reach them. If I exec into a pod and curl the IP of one of the other pods, that request also times out.
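
For illustration, this is the kind of connectivity check I ran; the pod name and peer IP are taken from the trace output above and will differ per deployment:

    # list the cluster-manager pods and their IPs
    kubectl get pods -o wide
    # from inside one pod, probe the transport port (9300) of a peer pod
    kubectl exec -it opensearch-cluster-master-0 -- curl -v --max-time 5 telnet://192.168.129.236:9300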

I'm working in a clean namespace, so there is no NetworkPolicy deployed. The only thing deployed besides the Helm chart is the secret used to pull the images from the private registry.
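
One way to double-check that (the namespace name is a placeholder):

    kubectl get networkpolicy -n <namespace>
    kubectl get secrets -n <namespace>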

felix185 added the bug (Something isn't working) and untriaged (Issues that have not yet been triaged) labels on Nov 24, 2023
felix185 (Author) commented

Closing this issue, as it has nothing to do with the Helm charts but with the network configuration of our managed Kubernetes cluster. Sorry for any inconvenience caused.
