fix cluster outage, add masterService template #41
Conversation
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?
signed cla
elasticsearch/values.yaml
Outdated
@@ -37,7 +37,7 @@ extraEnvs:
# A list of secrets and their paths to mount inside the pod
# This is useful for mounting certificates for security and for mounting
# the X-Pack license
secretMounts:
secretMounts:
In order for this to pass validations, shouldn't it be set to []?
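For reference, the empty-list default the reviewer is suggesting would look like this in `values.yaml` (a sketch; the surrounding comments are taken from the visible diff):

```yaml
# A list of secrets and their paths to mount inside the pod
# This is useful for mounting certificates for security and for mounting
# the X-Pack license
secretMounts: []
```

An explicit `[]` keeps the value a valid (empty) YAML list, so templates that `range` over it render cleanly instead of receiving a null value.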
privileged: true
image: "{{ .Values.image }}:{{ .Values.imageTag }}"
command:
- /bin/bash
Maybe add some comment here to state the purpose of the initContainer?
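Following that suggestion, the initContainer with an explanatory comment might look like this (a sketch assembled from the visible diff lines; the container name and field nesting are assumptions, not the PR's exact code):

```yaml
initContainers:
# Writes the elasticsearch.yml configuration into a shared emptyDir
# volume before the main Elasticsearch container starts
- name: configure  # name is an assumption, not taken from the diff
  image: "{{ .Values.image }}:{{ .Values.imageTag }}"
  securityContext:
    privileged: true
  command:
  - /bin/bash
```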
HOST="${!HOSTVAR}"

if [ ! -f /usr/share/elasticsearch/config/elasticsearch.yml ]; then
  echo "" > /usr/share/elasticsearch/config/elasticsearch.yml
Very nitpicky, but "touch" would be more elegant.
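The reviewer's point, sketched with a temp directory standing in for the chart's real config path: `echo "" > file` writes a newline (a one-byte file) and overwrites an existing file, while `touch` creates a truly empty file only when it is missing and leaves existing contents alone.

```shell
# Illustrative only: a temp dir stands in for
# /usr/share/elasticsearch/config used in the actual diff
config_dir=$(mktemp -d)
config_file="$config_dir/elasticsearch.yml"

# Creates an empty file if missing; does not clobber existing contents
touch "$config_file"
```
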
@@ -78,6 +78,8 @@ spec:
secret:
secretName: {{ .name }}
{{- end }}
- name: config
  emptyDir: {}
Why should we store the configuration here instead of regenerating it at each start?
The issue you are linking to is referring to a different helm chart. The readinessProbes and service setup of this chart are quite different and have been designed and tested not to cause downtime during rolling upgrades and restarts. Were you able to reproduce the same problem with this helm chart?

There is a rolling upgrade script that I used when initially developing the chart to make sure it remained available. It isn't currently running as part of the automated testing (and is currently broken by the path changes to the proxy api), but this seems like a good time for it to be re-enabled to ensure this behaviour remains.

I have just done some testing locally and wasn't able to cause any failed search requests during a rolling upgrade, killing the master pod, or doing a

Here is how I'm testing it:
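The testing script itself did not survive the page extraction. A hypothetical availability checker of this shape (the endpoint, interval, and duration are assumptions, not the maintainer's actual script) would surface failed search requests while pods are killed or upgraded:

```shell
# Hypothetical: repeatedly hit the cluster and count failed requests.
# Endpoint, request count, and sleep interval are assumptions.
check_availability() {
  local url="${1:-http://localhost:9200/_search}"
  local requests="${2:-60}"
  local failures=0
  for _ in $(seq 1 "$requests"); do
    curl -sf --max-time 2 "$url" >/dev/null || failures=$((failures + 1))
    sleep 1
  done
  echo "failed requests: $failures"
}
```

Running this in one terminal while deleting the master pod or running a rolling upgrade in another would make any window of unavailability visible as a non-zero failure count.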
If you are able to cause a failure could you please open an issue first with details on how you reproduced the problem?
@@ -0,0 +1,29 @@
{{ if eq .Values.roles.master "true" }}
{{- range $i := until (int .Values.replicas) }}
Do we really need this?
The headless service is used for service discovery and includes all members in the cluster, even the unready ones:
https://github.com/elastic/helm-charts/blob/master/elasticsearch/templates/service.yaml#L31
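For context, a headless discovery service of the kind being referenced generally looks like this (a sketch based on the linked `service.yaml`, not a verbatim copy; the template and selector names are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: "{{ template "uname" . }}-headless"
spec:
  clusterIP: None
  # Include pods in DNS even before they pass readiness checks,
  # so new nodes can discover the cluster during startup
  publishNotReadyAddresses: true
  selector:
    app: "{{ template "uname" . }}"
  ports:
  - name: transport
    port: 9300
```

Because unready pods are still resolvable through this service, a separate per-replica announce service is usually unnecessary for discovery.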
@Crazybus I wrote this PR because I had a cluster outage during a rolling upgrade when I dived into elastic/elasticsearch#36822, and I thought it was not fixed yet because I couldn't find any changes related to the connection timeout problem (https://discuss.elastic.co/t/timed-out-waiting-for-all-nodes-to-process-published-state-and-cluster-unavailability/138590/2 and elastic/elasticsearch#36822 (comment)). But it seems it is already fixed here without the announce service. This PR doesn't seem necessary any more. Thanks.
I am curious, where has it been fixed? I don't see any related change in the chart itself; is it from a new version of ES?
I'm curious too. I also tested with ES 6.5.3, which I used when I had the cluster outage, but
Did you reproduce the outage using this chart (elastic/helm-charts)? Or was it with the helm/charts/elasticsearch version which you linked to in the original comment? This chart is not based on the helm/charts version, so it wouldn't make sense for it to share the same issues. Nothing has been changed from
What 1.11 version do you have?
@Crazybus I reproduced the outage using both charts (elastic/elasticsearch#36822 (comment)) @desaintmartin
This PR includes:

- fix cluster outage ([stable/elasticsearch] fix cluster outage during master termination helm/charts#10687, [stable/elasticsearch] Terminating current master pod causes cluster outage of more than 30 seconds helm/charts#8785)
- add `masterService` template in `_helpers.tpl`
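A `masterService` helper of the kind the PR adds might look like this in `_helpers.tpl` (a sketch under assumed value names such as `.Values.masterService` and `.Values.clusterName`; not the PR's exact code):

```yaml
{{/* Name of the service that master-eligible nodes register with */}}
{{- define "masterService" -}}
{{- if empty .Values.masterService -}}
{{ .Values.clusterName }}-master
{{- else -}}
{{ .Values.masterService }}
{{- end -}}
{{- end -}}
```

Other templates could then reference the master service uniformly with `{{ template "masterService" . }}`, rather than hard-coding the name in each manifest.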