fix cluster outage, add masterService template #41

kimxogus · 2019-01-21T05:19:01Z

This PR includes

fix cluster outage [stable/elasticsearch] fix cluster outage during master termination helm/charts#10687 ([stable/elasticsearch] Terminating current master pod causes cluster outage of more than 30 seconds helm/charts#8785)
add masterService template in _helpers.tpl

elasticmachine · 2019-01-21T05:19:02Z

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

kimxogus · 2019-01-21T05:23:02Z

signed cla

desaintmartin · 2019-01-21T09:07:00Z

elasticsearch/values.yaml

@@ -37,7 +37,7 @@ extraEnvs:
 # A list of secrets and their paths to mount inside the pod
 # This is useful for mounting certificates for security and for mounting
 # the X-Pack license
-secretMounts: 
+secretMounts:


In order for this to pass validations, shouln't it be set to []?

desaintmartin · 2019-01-21T09:10:38Z

elasticsearch/templates/statefulset.yaml

+          privileged: true
+        image: "{{ .Values.image }}:{{ .Values.imageTag }}"
+        command:
+        - /bin/bash


Maybe add some comment here to state the purpose of the initContainer?

desaintmartin · 2019-01-21T09:15:42Z

elasticsearch/templates/statefulset.yaml

+          HOST="${!HOSTVAR}"
+
+          if [ ! -f /usr/share/elasticsearch/config/elasticsearch.yml ]; then
+              echo "" > /usr/share/elasticsearch/config/elasticsearch.yml


Very nitpicky, but "touch" would be more elegant.

desaintmartin · 2019-01-21T09:21:57Z

elasticsearch/templates/statefulset.yaml

@@ -78,6 +78,8 @@ spec:
          secret:
            secretName: {{ .name }}
        {{- end }}
+        - name: config
+          emptyDir: {}


Why should we store the configuration here instead of regenerating it at each start?

Crazybus · 2019-01-21T14:04:30Z

@kimxogus

The issue you are linking to is referring to a different helm chart. The readinessProbes and service setup of this chart are quite different and have been designed and tested to not cause downtime during rolling upgrades and restarts.

Were you able to reproduce the same problem with this helm chart? There is a rolling upgrade script that I used when initially developing the chart to make sure it remained available. It isn't currently running as part of the automated testing (and is currently broken by the path changes to the proxy api) but this seems like a good time for it to be re-enabled to ensure this behaviour remains.

I have just done some testing locally and wasn't able to cause any failed search requests during a rolling upgrade, killing the master pod, or doing a kill 1 in the master pod.

Here is how I'm testing it

# Deploy the default example
cd elasticsearch/examples/default
make

# Start kubectl proxy
kubectl proxy

# Run constant search requests against the service
watch -n 0.5 'curl -s --fail http://localhost:8001/api/v1/namespaces/default/services/elasticsearch-master:9200/proxy/_search'

If you are able to cause a failure could you please open an issue first with details on how you reproduced the problem?

rendhalver · 2019-01-24T19:52:12Z

elasticsearch/templates/master-announce-svc.yaml

@@ -0,0 +1,29 @@
+{{ if eq .Values.roles.master "true" }}
+{{- range $i := until (int .Values.replicas) }}


Do we really need this?
The headless service is used for service discovery and includes all members in the cluster even the unready ones
https://github.com/elastic/helm-charts/blob/master/elasticsearch/templates/service.yaml#L31

kimxogus · 2019-01-29T10:07:04Z

@Crazybus I wrote this PR because I had cluster outage during rolling upgrade when I dived into elastic/elasticsearch#36822, and I thought this is not fixed yet because I couldn't find changes related to connection timeout problem(https://discuss.elastic.co/t/timed-out-waiting-for-all-nodes-to-process-published-state-and-cluster-unavailability/138590/2 and elastic/elasticsearch#36822 (comment)). but it seems already fixed here without announce service.

This PR doesn't seem necessary any more. Thanks.

desaintmartin · 2019-01-29T11:15:41Z

I am curious, where has it been fixed? I don't see any related change into the chart itself, is it from a new version of ES?
I've seen PRs in latest version of kube about solving such problems when deleting a service, maybe it also solves our problem?

kimxogus · 2019-01-30T01:12:30Z

I'm curious too. I also tested with es 6.5.3 which I used when I had cluster outage, but curl 127.0.0.1:9200 responses immediately too which means cluster outage no longer exists.
I upgraded k8s from 1.10 to 1.11 recently, so I think some changes in k8s affected this issue.

Crazybus · 2019-01-30T09:04:20Z

I am curious, where has it been fixed? I don't see any related change into the chart itself, is it from a new version of ES?

I'm curious too. I also tested with es 6.5.3 which I used when I had cluster outage

Did you reproduce the outage using this chart (elastic/helm-charts)? Or was it with the helm/charts/elasticsearch version which you linked to in the original comment? This chart is not based on the helm/charts version so it wouldn't make sense for it to share the same issues. Nothing has been changed from 6.5.3-alpha1 to the current version that could have affected anything related to this.

desaintmartin · 2019-01-30T09:10:16Z

What 1.11 version do you have?

kimxogus · 2019-01-31T01:35:17Z

@Crazybus I reproduced the outage using both charts (elastic/elasticsearch#36822 (comment))

@desaintmartin 1.11.7 and canal with calico 3.2.3 and flannel 0.9.0 as cni.

fix cluster outage, add masterService template

ed67799

bump chart version

17fe76d

desaintmartin reviewed Jan 21, 2019

View reviewed changes

rendhalver suggested changes Jan 24, 2019

View reviewed changes

kimxogus added 3 commits January 29, 2019 17:06

Merge branch 'master' into fix/cluster-outage

f2ba6e0

default value for secretMounts

1048ce3

use touch instead of echo ""

95cf412

kimxogus closed this Jan 29, 2019

andreykaipov mentioned this pull request Feb 18, 2019

Slow re-election when elected master pod is deleted #63

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix cluster outage, add masterService template #41

fix cluster outage, add masterService template #41

kimxogus commented Jan 21, 2019 •

edited

Loading

elasticmachine commented Jan 21, 2019

kimxogus commented Jan 21, 2019

desaintmartin Jan 21, 2019

desaintmartin Jan 21, 2019

desaintmartin Jan 21, 2019

desaintmartin Jan 21, 2019

Crazybus commented Jan 21, 2019

rendhalver Jan 24, 2019

kimxogus commented Jan 29, 2019

desaintmartin commented Jan 29, 2019 •

edited

Loading

kimxogus commented Jan 30, 2019

Crazybus commented Jan 30, 2019

desaintmartin commented Jan 30, 2019

kimxogus commented Jan 31, 2019

		@@ -0,0 +1,29 @@
		{{ if eq .Values.roles.master "true" }}
		{{- range $i := until (int .Values.replicas) }}

fix cluster outage, add masterService template #41

fix cluster outage, add masterService template #41

Conversation

kimxogus commented Jan 21, 2019 • edited Loading

elasticmachine commented Jan 21, 2019

kimxogus commented Jan 21, 2019

desaintmartin Jan 21, 2019

Choose a reason for hiding this comment

desaintmartin Jan 21, 2019

Choose a reason for hiding this comment

desaintmartin Jan 21, 2019

Choose a reason for hiding this comment

desaintmartin Jan 21, 2019

Choose a reason for hiding this comment

Crazybus commented Jan 21, 2019

rendhalver Jan 24, 2019

Choose a reason for hiding this comment

kimxogus commented Jan 29, 2019

desaintmartin commented Jan 29, 2019 • edited Loading

kimxogus commented Jan 30, 2019

Crazybus commented Jan 30, 2019

desaintmartin commented Jan 30, 2019

kimxogus commented Jan 31, 2019

kimxogus commented Jan 21, 2019 •

edited

Loading

desaintmartin commented Jan 29, 2019 •

edited

Loading