
Elasticsearch 8.0.0-SNAPSHOT fails at startup due to volume permissions #2791

Closed
sebgl opened this issue Mar 31, 2020 · 20 comments
Labels: >bug Something isn't working

@sebgl
Contributor

sebgl commented Mar 31, 2020

Exception raised at startup:

["org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [/usr/share/elasticsearch/d
ata]; maybe these locations are not writable or multiple nodes were started on the same data path?",

I think that's because the Docker image now runs as the elasticsearch user by default, whereas it previously ran as root (even though the elasticsearch process itself runs as the elasticsearch user):

⟩ docker run -ti docker.elastic.co/elasticsearch/elasticsearch:7.6.0 id
uid=0(root) gid=0(root) groups=0(root)
⟩ docker run -ti docker.elastic.co/elasticsearch/elasticsearch:8.0.0-SNAPSHOT id
uid=1000(elasticsearch) gid=0(root) groups=0(root)
@sebgl sebgl added the >bug Something isn't working label Mar 31, 2020
@sebgl
Contributor Author

sebgl commented Mar 31, 2020

We run an init container to change the owner of the data volume to elasticsearch, but only if the init container runs as the root user:

# chown the data and logs volume to the elasticsearch user
# only done when running as root, other cases should be handled
# with a proper security context
chown_start=$(date +%s)
if [[ $EUID -eq 0 ]]; then
    {{range .ChownToElasticsearch}}
        echo "chowning {{.}} to elasticsearch:elasticsearch"
        chown -v elasticsearch:elasticsearch {{.}}
    {{end}}
fi

In 8.0.0-SNAPSHOT the init container runs as the elasticsearch user, and hence does not have permission to chown the volume.
If I comment out the if condition above, the init container fails with:

chowning /usr/share/elasticsearch/data to elasticsearch:elasticsearch
chown: changing ownership of '/usr/share/elasticsearch/data': Operation not permitted
failed to change ownership of '/usr/share/elasticsearch/data' from root:root to elasticsearch:elasticsearch

@pebrc
Collaborator

pebrc commented Mar 31, 2020

Related to #2599

@sebgl sebgl self-assigned this Mar 31, 2020
@sebgl
Contributor Author

sebgl commented Mar 31, 2020

I think the way we currently deal with volume permissions is not great: we run an init container to chown the mounted volumes (which only the root user can write to) so they belong to the elasticsearch user instead.

I think this would be better dealt with via securityContext.fsGroup in the pod spec.
This modified podTemplate allows files to be written in the mounted volume by a user with group ID 1000, and works fine with Elasticsearch 8.0.0-SNAPSHOT:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 8.0.0-SNAPSHOT
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        securityContext:
          fsGroup: 1000

I think the right thing to do is to replace our custom init container chown mechanism with a default fsGroup that works out of the box.

However I'm not exactly sure of the implications on Openshift. This documentation gives more details.

@sebgl
Contributor Author

sebgl commented Apr 1, 2020

The difference in Elasticsearch behaviour comes from elastic/elasticsearch#50277, where tini was added to the image as a process manager. The image no longer defaults to running as the root user, which is a good thing IMO.


I've done some tests regarding fsGroup on Openshift 3.11 (using minishift).

First, there is no problem running 8.0.0-SNAPSHOT on Openshift. Openshift changes the default user the container runs with to an arbitrary one for security reasons (in my example: UID 1000140000). It also ensures this arbitrary user is a member of the root group (but is not the root user), which gives it write access to our mounted volume.

In "regular" Kubernetes, by contrast, the elasticsearch user in the container cannot write to the mounted volumes owned by root. To fix this, we can set fsGroup: 1000 in the pod spec, which allows the user with UID 1000 to write to the mounted volumes.

Setting fsGroup: 1000 on Openshift leads to the Pod not being created at all:

create Pod elasticsearch-sample-es-default-0 in StatefulSet elasticsearch-sample-es-default failed error: pods "elasticsearch-sample-es-default-0" is forbidden: unable to validate against any security context constraint: [fsGroup: Invalid value: []int64{1000}: 1000 is not an allowed group]

One solution to this, detailed in the Openshift docs, is to not use the default restricted SCC, but to create a custom one where group 1000 is allowed (or part of a range that is allowed).
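
For illustration only, such a custom SCC could look roughly like the sketch below. The name and the service account binding are made up; in practice you would probably start from a copy of the restricted SCC and only adjust the fsGroup stanza:

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: elasticsearch-fsgroup          # hypothetical name
allowPrivilegedContainer: false
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
fsGroup:
  type: MustRunAs
  ranges:
  - min: 1000                          # allow group 1000 used by the elasticsearch user
    max: 1000
users:
- system:serviceaccount:default:default   # hypothetical service account the SCC is granted to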

tl;dr:

  • we can use fsGroup on most k8s distributions
  • we cannot use fsGroup on Openshift unless using a custom SCC

Let's see if we can find a common solution here. In any case, changing permissions in the init container does not feel like the right thing to do.

Related k8s issue: kubernetes/kubernetes#2630.

@barkbay
Contributor

barkbay commented Apr 1, 2020

we can use fsGroup on most k8s distributions

I think it will work as long as the cluster is not secured. If there is a PSP that restricts the range for the fsGroup (which I would expect on production clusters), chances are it will fail the same way, I guess.
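
For illustration, a PSP restriction of that kind could look roughly like this (the name and the range are made up; the point is that the allowed fsGroup range does not include 1000):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-psp-example         # hypothetical name
spec:
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: MustRunAs
    ranges:
    - min: 2000                        # 1000 falls outside the allowed range
      max: 3000
  volumes:
  - '*'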

@sebgl
Contributor Author

sebgl commented Apr 1, 2020

You're right @barkbay, this goes beyond the scope of Openshift vs. not Openshift.

The question is more: is there a PSP (or SCC on Openshift) or not?

If I understand this doc correctly, setting an fsGroup automatically makes all processes running in the container (which may or may not be using a runAsUser range) part of the fsGroup supplementary group. So as long as the default PSP specifies an fsGroup (which can be a range), the assigned arbitrary user will be able to write to the mounted volumes. So we should just do nothing in this situation.

When no PSP/SCC is enforced, we probably need to set fsGroup: 1000.

Should we default to one or the other, or try to auto-detect what's best? Users can still override the securityContext in the podTemplate, but picking a default seems hard :(

@sebgl
Contributor Author

sebgl commented Apr 3, 2020

Assuming we want to rely on securityContext.fsGroup, we can write a dedicated documentation page that explains:

  • what is the default applied by ECK
  • how to remove that default (if there's one) if you don't want any fsGroup set, because you already rely on a PSP/SCC
  • how to override the podTemplate to set your own fsGroup

Regarding ECK defaults, we have several options.

1. Don't set a default securityContext.fsGroup

If we don't set a default value, it is likely that:

  • Vanilla K8s users with no PSP set will run into trouble
  • Openshift users running with the default SCC will not have any problem
  • Vanilla K8s users or Openshift users with a custom PSP/SCC may run into trouble if that PSP does not set the fsGroup

The first point seems quite representative of a quickstart experience, so we probably have to adapt the quickstart to explain what fsGroup is and why it matters.

cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.6.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
    # allow the elasticsearch user in the container to write on mounted volumes
    # remove this if you already have an SCC (Openshift) or PSP (Kubernetes) set
    podTemplate:
      spec:
        securityContext:
          fsGroup: 1000
EOF

It's important to note that this example does not "just work": it will probably work on a basic Kubernetes setup, but not on a basic Openshift setup. Users have to understand that they need to remove or comment out some lines in the yaml.

I think if we go down this path we also have to adapt other examples in the rest of the documentation, as well as the recipes we have in the Github repository.

2. Set a default securityContext.fsGroup: 1000

If we set a default value, it is likely that:

  • Vanilla K8s users with no PSP set will not have any problem
  • Openshift users running with the default SCC will run into trouble
  • Vanilla K8s users or Openshift users with a custom PSP/SCC may run into trouble if that PSP conflicts with the default securityContext

We probably need to adapt the quickstart example to mention the securityContext, especially for Openshift users.

We can either mention it explicitly in the quickstart (and other examples):

cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.6.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
    # if a PSP (Kubernetes) or SCC (Openshift) is set, ensure ECK does not set any custom
    # securityContext by uncommenting the lines below. This is likely to be the case on Openshift.
    # podTemplate:
    #   spec:
    #     securityContext: {}

EOF

Or add a note about it:

cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.6.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
EOF
NOTE: ECK sets a default security context to allow the `elasticsearch` user to access mounted volumes. On Openshift, this conflicts with the default `restricted` SCC. It may also conflict with any custom PSP or SCC configured on your Kubernetes cluster. If that is the case, you can disable the default pod security context. See [this page](link) for more details.

We also need to highlight this on the documentation page dedicated to OpenShift.

3. Attempt to set the best default depending on the environment

We could attempt the following:

  • If we detect ECK is deployed on Openshift, don't set a default pod securityContext.
  • Otherwise, set a default securityContext.fsGroup: 1000.

Detecting Openshift can probably be done in various ways (an example), but there does not seem to be a documented, robust way of doing it.
This has to be done in agreement with the existing RBAC permissions. Requiring additional RBAC permissions for this call feels wrong.
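
For illustration, one rough heuristic (not necessarily what the operator should do) is to check whether the OpenShift-specific API groups are registered, e.g. from the command line:

# looks for the OpenShift-only security.openshift.io API group
kubectl api-versions | grep -q security.openshift.io && echo "looks like OpenShift" || echo "looks like plain Kubernetes"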

It is also an option to list the existing PSPs/SCCs in the cluster, and not set the securityContext if the returned list is not empty, but that would require an explicit RBAC permission that may differ between K8s and Openshift.

It is also possible to pass an explicit flag to the operator so it never sets a default securityContext, which would have to be documented, especially for Openshift.

In any case, we probably still need to add the following to the quickstart:

NOTE: ECK sets a default pod `securityContext` to allow the `elasticsearch` user to access mounted volumes, unless deployed on Openshift. This may conflict with existing PSP and SCC. See [this page](link) for more information on how to disable it.

@sebgl
Contributor Author

sebgl commented Apr 3, 2020

My inclination would be for option 3. It makes things tricky since there is an implicit/non-robust decision being made by the operator, but it gives the easiest quickstart experience with the least overhead in the documentation.
I think it is worth trying to find the best way to do a best-effort Openshift detection on operator startup, with limited RBAC permissions.

If that ends up being too complicated, my second choice would be option 2. In short: favor a quickstart experience on an unsecured k8s cluster, and try to redirect other users (including Openshift users) to a dedicated doc page about disabling the securityContext.

@sebgl
Contributor Author

sebgl commented Apr 7, 2020

A few things we discussed today with @nkvoll @pebrc @anyasabo. No decision reached yet:

  • Elasticsearch moving to not use root as a default user feels like a good thing. We don't want to revert that decision.
  • Auto-detecting if we're running on Openshift (option 3 above) feels wrong, and will be hard to do reliably.
  • Option 1 above (don't set a default fsGroup, specify one in the quickstart) makes the quickstart quite verbose. Also, if we remove the current init container chown process, it will break existing clusters on ECK upgrade.
  • Option 2 above (default to fsGroup: 1000, can be overridden to empty in the podTemplate) is tempting since Openshift users already have to go through a specific documentation page. In the quickstart we would just add a note with a link to a documentation page dedicated to that setting. Existing Openshift clusters and/or secured k8s clusters will likely be broken on ECK upgrade.
  • We could make this a setting at the operator level, either as a boolean flag (--set-default-elasticsearch-fsgroup) or as a more complete parameter ({securityContext: fsGroup: 1000}). The latter probably requires a dedicated configuration file. Also, we may want to extend this setting to other resources (Kibana, APMServer, etc.) for consistency.
  • If we end up picking option 1 or 2, existing clusters will be rolled out with the new settings. The rolling upgrade will fail on the first Pod if the securityContext conflicts with the default SCC/PSP. To avoid this situation, we may want to synchronise that change with Elasticsearch 8.0.0? Understanding the mapping between ECK defaults and ES versions feels complicated, though.

@sebgl
Contributor Author

sebgl commented Apr 27, 2020

I think I'm leaning towards the following:

  • Add a boolean flag to the operator arguments: --set-default-fsgroup=true. Why a boolean and not an int (e.g. 1000)? Because that value may differ for each resource managed by ECK, so it would then need to be a per-Kind flag, or a more complex yaml configuration.
  • Set this flag to true by default in ECK manifests. Add a dedicated doc page explaining this flag, and how its behaviour can also be overridden in the podTemplate, which then takes priority. Specify in the Openshift docs that this flag should most likely be overridden to false.
  • The above implicitly means that we optimize our defaults for non-secured k8s setups, and not for Openshift and k8s clusters with PSPs. We can eventually decide to provide different manifests for Openshift.
  • Attempt a best-effort Pod creation dry-run to better surface the error in operator logs and in events attached to the resource if an incompatible PSP/SCC exists (a kubectl illustration follows below).

I'm not sure whether this should apply to all stack versions, or only apply to 8.0+ so we don't break compatibility with existing pre-8.0 clusters.
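
For context, a server-side dry run goes through the full admission chain (including PSP/SCC validation) without persisting the object, so it can surface this kind of rejection early. A rough kubectl equivalent of what the operator would do through the API client, assuming a recent kubectl and a hypothetical pod manifest:

# validates the Pod against admission (PSP/SCC included) without creating it; requires Kubernetes 1.13+ on the API server
kubectl create -f elasticsearch-pod.yaml --dry-run=server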

@anyasabo
Contributor

I think I agree Seb. I'm less sure about only implementing it for 8.0+. It's nice because it is "simple" from a user-experience perspective -- if an existing user with PSPs upgrades to the version of ECK that includes the automatic fsGroup setting, it only fails for new 8.x clusters, or fails on the first pod during a rolling upgrade to 8.x. So the impact is minimal and should be relatively easy for users to notice.

I think it's less nice because of the complexity involved. We're arguably not doing the "right thing" now in <8.0 by using the init container instead of the native feature that does what we want. There are even more differences between 6.x/7.x and 8.x that aren't really related to Elasticsearch itself and are more related to how Elasticsearch is packaged. Being consistent wherever we can is nice just to minimize mental load, both for us and our users (who have to keep a mental map of what behavior we default to across different ES versions).

Downsides of defaulting fsGroups for all versions:

  • existing users with PSPs/SCCs who blindly upgrade to a new ECK version without flipping the toggle will have their existing clusters broken as the pods cannot start
    • this is mitigated by the rolling upgrade process -- only one pod will go unavailable (by default). That said, users who did not read the upgrade notes may not notice that one pod is unavailable, since from their perspective they did not change anything in Elasticsearch
    • We can also mitigate this in K8s 1.13+ with the dry run as mentioned. Even if they still have the fsGroup toggle enabled, if we can detect that a pod cannot be created because of the fsGroup (I'm not sure what kind of feedback it gives you), then we can proceed as if the user had switched the fsGroup toggle off. Users on k8s <1.13 would not have this mitigation and would experience the single pod going down.

Overall I think I'm okay with making this change for all ES versions, but could still be persuaded otherwise.

@pebrc
Collaborator

pebrc commented Apr 29, 2020

👍 on using the flag @sebgl proposed.
Arguments for using the fsGroup mechanism only as of ES 8.0+, imo:

  • less potential for disruption on existing clusters
  • a major version upgrade is usually something users plan to a certain extent, and chances are higher that they realise/read about the necessity to have ECK configured correctly before moving forward. A minor version upgrade or an ECK upgrade might be taken more lightly and lead to surprises if suddenly all clusters are stuck in a rolling upgrade

Side-note: I am still a bit worried about the number of flags we add to the binary (17 atm) and still think we should consider a configuration file. But maybe not for the ones that are feature toggles like this one; rather for configuration values like cert validity and such.

@barkbay
Contributor

barkbay commented Jun 30, 2020

If we use a flag, I think we will have to make a choice regarding the operator hub:

Either:

  • the flag is not set and the upgrade from the operator hub will break the experience for the non-(openshift|psp) users
  • we do the opposite and it breaks the experience for Openshift users

@sebgl
Contributor Author

sebgl commented Jul 1, 2020

A few things we discussed with the team:

  • There seems to be no way around making life slightly harder for some users (Openshift users, or vanilla k8s users, or k8s PSP users).
  • Upgrading ECK should not break existing clusters. Binding the fsGroup change to stack version 8.0.0 seems to be a reasonable way to alleviate this concern.
  • It feels more natural to change the value of an operator-level setting in a configMap rather than changing the operator binary args in the StatefulSet spec (a sketch follows after this list).
  • We could add a note in the quickstart docs, right after installing ECK, about how the setting needs to be changed for some users (Openshift and k8s with PSP enabled). So far we have optimized ECK for a smooth quickstart experience on vanilla k8s clusters with no particular security enforcement.
  • We could decide to force users to make an explicit choice for that setting (set-default-fsgroup=true|false). If not set, we would error out in the reconciliation of Elasticsearch 8.0.0. The downside of this approach is that people need to care about and understand this fsGroup thing when installing ECK for the first time (or upgrading). It moves us away from a very simple quickstart experience, which would be a big loss.
  • We can try to surface any error coming from the fsGroup setting not being set to the right value. If an SCC/PSP enforces fsGroup to not be 1000, the Pod creation will fail, and an error will be reported in the StatefulSets events. Unfortunately that's not easy to discover, even from within the operator. Pod creation dry-run can help grab a better error message, but the feature is not available on all k8s environments. We could also try to detect, at ECK startup, if we're running on Openshift. If that's the case, and set-default-fsgroup=true, we can output an explicit warning.
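
For illustration only, an operator-level setting living in a ConfigMap could look like the sketch below; the ConfigMap name, namespace, file key and setting name are assumptions, not a final design:

apiVersion: v1
kind: ConfigMap
metadata:
  name: elastic-operator               # hypothetical name
  namespace: elastic-system
data:
  eck.yaml: |
    # flip to false on Openshift or on PSP-secured clusters that restrict fsGroup
    set-default-fsgroup: true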

@sebgl
Contributor Author

sebgl commented Jul 28, 2020

What we agreed on with the team (basically summarizes the discussion above):

  • operator-level --set-default-fsgroup=true|false flag (defaults: true, also in operator hub) - Add set-default-security-context flag to handle runAs user in ES 8.0+ #3342
  • only set the Elasticsearch fsGroup starting 8.0.0
  • marked as breaking change, since it breaks expectations for Openshift users starting 8.0.0
  • dedicated documentation page to explain how to disable this for Openshift & PSP users, linked from the quickstart,
    as part of ECK 1.3 so we release well before 8.0.0 (new Openshift users will likely notice it in the quickstart while deploying a new 7.x)
  • emit warnings if we suspect the setting is wrongly set
  • best-effort autodetect openshift at ECK startup
  • double-check pod creation dry-run fails if available
  • eventually a setting available in a configMap (not only a flag)

@pebrc
Collaborator

pebrc commented Nov 30, 2020

  • operator-level --set-default-fsgroup=true|false flag (defaults: true, also in operator hub) - Add set-default-security-context flag to handle runAs user in ES 8.0+ #3342
  • only set the Elasticsearch fsGroup starting 8.0.0
  • marked as breaking change, since it breaks expectations for Openshift users starting 8.0.0
  • dedicated documentation page to explain how to disable this for Openshift & PSP users, linked from the quickstart,
    as part of ECK 1.3 so we release well before 8.0.0 (new Openshift users will likely notice it in the quickstart while deploying a new 7.x)
  • emit warnings if we suspect the setting is wrongly set
  • best-effort autodetect openshift at ECK startup
  • double-check pod creation dry-run fails if available
  • eventually a setting available in a configMap (not only a flag)

@sebgl just trying to figure out where we stand with this issue. I think I ticked all the right boxes.

@sebgl
Contributor Author

sebgl commented Nov 30, 2020

@pebrc yes! Unassigning myself here since not really working on this at the moment.

@sebgl sebgl removed their assignment Nov 30, 2020
@david-kow david-kow self-assigned this Oct 29, 2021
@david-kow
Contributor

As we are getting close to the release of 8.0.0, I'm looking at this again. Based on the comments above, I'd propose the following:

  • don't set a default value for the set-default-security-context flag in our manifests (i.e. rely on the flag's default in the operator process)
  • best-effort detection of OCP (i.e. the already-mentioned method kubevirt uses, or what we use in E2E tests)

Based on the above:

  • if the flag is not explicitly set, use the detection result
  • if the flag is explicitly set to true and detection sees OCP, issue a warning, but run as the flag indicates
  • if the flag is explicitly set to false and detection sees plain k8s, do not issue a warning (this can be the case of non-OCP with PSP), and run as the flag indicates

I'd like to use the fact that we actually have three possible values for the flag: false, true and not set. We could make this very explicit by adding a third possible value to the flag (autodetect?) and defaulting to it in the operator.
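
Purely as an illustration of the three-valued flag (the container layout and the exact value spelling are assumptions), an explicit setting on the operator could look like:

# excerpt of a hypothetical operator pod spec
containers:
- name: manager
  args:
  - manager
  - --set-default-security-context=autodetect   # or explicitly "true" / "false"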

The experience for different user groups will be the following:

  • no flag set/autodetect:
    • non-secured k8s/ocp/psp-secured k8s allowing fsgroup 1000 - everything works without any config changes
    • psp-secured k8s disallowing fsgroup 1000 - breaks only if documentation was not followed (ie. flag was not set to false by the user)
  • flag explicitly set to true:
    • non-secured k8s/psp-secured k8s allowing fsgroup 1000 - ok
    • ocp - not ok, warning
    • psp-secured k8s disallowing fsgroup 1000 - not ok
  • flag explicitly set to false:
    • non-secured k8s/psp-secured k8s allowing fsgroup 1000 - not ok
    • ocp/psp-secured k8s disallowing fsgroup 1000 - ok

Tbh, I'd challenge the usefulness of the warning. Because of non-OCP clusters with PSPs, we can't warn if users have this flag set to false. This means the warning is only helpful if an OCP user misconfigures the flag.

@david-kow
Contributor

We've discussed offline and decided to:

  • introduce autodetect flag value and default to it explicitly
  • not issue any warnings as they might be misleading
  • document the flag behavior for 1.9.0 release

@david-kow
Contributor

As this issue discussed the operator bug with Elastic Stack 8.0.0, which is now fixed, I'm going to close it. The follow-up decided on above will be implemented in #5061 and #5062.
