Describe the bug
I tried to deploy a simple OpenSearch cluster with the provided Helm charts on our Kubernetes cluster (1.26) with Helm (3.13.1). As soon as I increase the number of replicas for the cluster-manager/master nodes from 1 to 2 or 3, the cluster does not start successfully.
I had to make a few adjustments in the values.yaml (e.g. providing the URL of our private image registry and a corresponding image pull secret). All other values are left at their defaults.
To Reproduce
Steps to reproduce the behavior:
clone this repository
open values.yaml from charts/opensearch
change global.registry to the private registry
add the image pull secret to imagePullSecrets (see the sketch after these steps)
deploy via helm upgrade opensearch ./charts/opensearch --install
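For reference, the adjustments described in the steps above amount to roughly the following; the registry hostname and secret name are placeholders, and the value keys (global.registry, imagePullSecrets) are the ones mentioned above:

# Sketch only: replace registry.example.com and my-registry-secret with your own values.
helm upgrade opensearch ./charts/opensearch --install \
  --set global.registry=registry.example.com \
  --set 'imagePullSecrets[0].name=my-registry-secret'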
Expected behavior
OpenSearch starts without errors and the request to verify the OpenSearch installation, as described here, succeeds.
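For context, that verification request amounts to roughly the following; the service name is inferred from the pod names in the logs below, and the admin credentials depend on your setup:

# Sketch only: forward the HTTP port and query the cluster.
kubectl port-forward svc/opensearch-cluster-master 9200:9200 &
curl -k -u 'admin:<admin-password>' https://localhost:9200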
Chart Name
Specify the Chart which is affected?
OpenSearch
Host/Environment (please complete the following information):
Helm Version: 3.13.1
Kubernetes Version: 1.26.7
Additional context
If I run the Helm chart with replicas set to 1, everything is set up as I would expect. But as soon as I increase the number of replicas to 2 or 3 (3 is the default from the initial clone), I get the following exception:
org.opensearch.discovery.ClusterManagerNotDiscoveredException: null
at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350) [opensearch-2.11.0.jar:2.11.0]
at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) [opensearch-2.11.0.jar:2.11.0]
at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) [opensearch-2.11.0.jar:2.11.0]
at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707) [opensearch-2.11.0.jar:2.11.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-2.11.0.jar:2.11.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-11-23T13:39:57,962][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-cluster-master-0] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: ClusterManagerNotDiscoveredException[null]
at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:184) ~[opensearch-2.11.0.jar:2.11.0]
at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171) ~[opensearch-2.11.0.jar:2.11.0]
at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.11.0.jar:2.11.0]
at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.11.0.jar:2.11.0]
at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.11.0.jar:2.11.0]
at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137) ~[opensearch-2.11.0.jar:2.11.0]
at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103) ~[opensearch-2.11.0.jar:2.11.0]
Caused by: org.opensearch.discovery.ClusterManagerNotDiscoveredException
at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350) ~[opensearch-2.11.0.jar:2.11.0]
at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394) ~[opensearch-2.11.0.jar:2.11.0]
at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294) ~[opensearch-2.11.0.jar:2.11.0]
at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707) ~[opensearch-2.11.0.jar:2.11.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[opensearch-2.11.0.jar:2.11.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
uncaught exception in thread [main]
ClusterManagerNotDiscoveredException[null]
at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onTimeout(TransportClusterManagerNodeAction.java:350)
at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:394)
at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:294)
at org.opensearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:707)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
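For reference, discovery trace logging like the output below can be enabled through the chart's config value, roughly as follows; treat this as a sketch, and note that overriding config.opensearch.yml replaces the chart's default opensearch.yml contents, so keep the existing entries alongside the logger line:

# Sketch only: merge this with the chart's default opensearch.yml entries.
cat > trace-logging-values.yaml <<'EOF'
config:
  opensearch.yml: |
    logger.org.opensearch.discovery: TRACE
EOF
helm upgrade opensearch ./charts/opensearch --install -f trace-logging-values.yaml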
With trace logging turned on, I also see the following:
[2023-11-24T08:11:40,318][TRACE][o.o.d.PeerFinder ] [opensearch-cluster-master-0] startProbe(192.168.128.139:9300) not probing local node
[2023-11-24T08:11:40,319][TRACE][o.o.d.SeedHostsResolver ] [opensearch-cluster-master-0] resolved host [opensearch-cluster-master-headless] to [192.168.128.139:9300, 192.168.129.236:9300]
[2023-11-24T08:11:40,319][TRACE][o.o.d.PeerFinder ] [opensearch-cluster-master-0] probing resolved transport addresses [192.168.129.236:9300]
[2023-11-24T08:11:40,350][DEBUG][o.o.d.PeerFinder ] [opensearch-cluster-master-0] Peer{transportAddress=192.168.129.236:9300, discoveryNode=null, peersRequestInFlight=false} connection failed
org.opensearch.transport.ConnectTransportException: [][192.168.129.236:9300] connect_timeout[3s]
at org.opensearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1083) ~[opensearch-2.11.0.jar:2.11.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[opensearch-2.11.0.jar:2.11.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-11-24T08:11:41,319][TRACE][o.o.d.PeerFinder ] [opensearch-cluster-master-0] probing cluster-manager nodes from cluster state: nodes:
{opensearch-cluster-master-0}{6ln17rDKRuS80Z40Rmt8Og}{ibQZHrlmQ_SFaIIVNythJQ}{192.168.128.139}{192.168.128.139:9300}{dimr}{shard_indexing_pressure_enabled=true}, local
It seems like the pods can find themselves as cluster-manager, but as soon as they have to peer with/discover the other cluster-manager pods, they cannot find them. If I exec into a pod and try to reach the IP of one of the other pods with curl, there is also a timeout.
I'm working in a clean namespace, so there is no NetworkPolicy deployed. The only thing deployed apart from the Helm charts is the secret for pulling images from the private registry.
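The connectivity check described above is roughly the following; the peer IP is taken from the trace logs and will differ per deployment:

# Sketch only: test the HTTP (9200) and transport (9300) ports of the peer pod.
kubectl exec opensearch-cluster-master-0 -- curl -skv --connect-timeout 3 https://192.168.129.236:9200
kubectl exec opensearch-cluster-master-0 -- curl -sv --connect-timeout 3 telnet://192.168.129.236:9300

In this environment both commands time out, which points at pod-to-pod traffic being blocked outside the chart itself.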
Closing this issue, as it has nothing to do with the Helm charts but with the network configuration of the managed k8s cluster. Sorry for any inconvenience caused.