Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add KEP for SELinux label change #1621

Merged
merged 2 commits into from
May 19, 2020
Merged

Conversation

jsafrane
Copy link
Member

@jsafrane jsafrane commented Mar 19, 2020

This KEP tries to speed up the way how volumes (incl. persistent volumes) are made available to Pods on systems with SELinux in enforcing mode.

Current way includes recursive relabeling of all files on a volume before a container can be started. This is slow for large volumes.

Familiarity with fsGroupChangePolicy KEP is suggested

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Mar 19, 2020
@jsafrane jsafrane force-pushed the selinux branch 3 times, most recently from 354a12f to 03052e7 Compare March 19, 2020 13:48
keps/sig-storage/20200319-selinux-relabeling.md Outdated Show resolved Hide resolved
keps/sig-storage/20200319-selinux-relabeling.md Outdated Show resolved Hide resolved
keps/sig-storage/20200319-selinux-relabeling.md Outdated Show resolved Hide resolved
keps/sig-storage/20200319-selinux-relabeling.md Outdated Show resolved Hide resolved
@jsafrane
Copy link
Member Author

jsafrane commented Apr 1, 2020

Passed the first round of internal review. Adding more people from the community:

Explicitly @tallclair @derekwaynecarr for kubelet expertise and strong security background.
@haircommander @rhatdan for CRI and SELinux
@kubernetes/sig-node-proposals
@kubernetes/sig-storage-pr-reviews

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/design Categorizes issue or PR as related to design. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 1, 2020
@haircommander
Copy link
Contributor

LGTM from runtime side 👍
cc @mrunalp @umohnani8 @saschagrunert to see if they have any comments

@derekwaynecarr
Copy link
Member

@sjenning or @mrunalp can you help do an initial review?

@saschagrunert
Copy link
Member

LGTM too (unfortunately I'm not a SELinux expert)

@rhatdan
Copy link

rhatdan commented Apr 29, 2020

It should be noted that not all file systems support SELinux labeling. Specifically file systems without XAttr support. For these cases the volume needs to be mounted with the context mount or the container has to be run without SELinux separation. In the future we are shipping UDICA (https://github.com/containers/udica) which is a mechanism to have more then on container process type. "container_t", This would allow users to have access to random labels on the system, but the policy needs to be installed on the system.

@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label May 5, 2020
@jsafrane
Copy link
Member Author

jsafrane commented May 5, 2020

Thinking again about shared volumes, I needed to add CSIDriver.Spec.SupportsSELinux to tell kubelet that it should not use mount -o context=xyz for volumes provided by a particular CSI driver. For some CSI drivers (e.g. NFS-client, mount -o context=A nfs-server.com:/export/vol1 <somewhere> sets SELinux context of all mounts from the same NFS share. mount -o context=B nfs-server.com:/export/vol2 <somewhere else> will fail, because the NFS export is already mounted with context A.

Not sure if SupportsSELinux is a great name.

### Graduation Criteria

* Alpha:
* Provided all tests defined above are passing and gated by the feature gate `SELinuxRelabelPolicy` and set to a default of `false`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean default to Always right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is Kubernetes feature gate.

// on a NodePublished directory. If "seclabel" is present (i.e. kernel supports SELinux
// labels for this volume", container runtime may change label of all files on the volume
// to match the Pod requirements.
SupportsSELinux *bool;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is perhaps - to have a field called StorageIsolationPolicy or SecurityPolicy and define a enum for it with possible value of selinux for now. In future this field could be expanded to include other security policies such as apparmour or something else in windows.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I pointed above, the name (and actually also its description) is misleading. The flag is uses to determine if mount -o context on a volume can affect other volumes of the same CSI driver on the same node.

If nfs-client provisioner had a CSI driver, mounting one of its volume with -o context=A sets the context for all other volumes on the node, because they come from the same NFS export.

If nfs provisioner had a CSI driver, mounting one of its volume with -o context=A does not affect the other volumes, because they use separate NFS exports [here I am not 100% sure what ganesha actually does, but I believe there are other NFS servers where it would work].

So the flag is actually about the server and independence of its volumes, not about SELinux or apparmor. It needs a better name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more description for SupportsSELinux and table with examples.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following my review comment for the fsgroup field, I think it would be better to have explicit enums for the 3 behaviors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed the field to SELinuxMountSupported to emphasize that it is only about mounts with -o context. It reduces number of cases to handle in kubelet / CSI volume plugin. If the mounted filesystem supports SELinux or not is always autodetected by presence of seclabel mount option, as it is already done now.

Therefore the values are true or false. nil means false. Do you still want enum? It would be something line "Supported" and "Unsupported", which looks odd.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 12, 2020
@jsafrane jsafrane force-pushed the selinux branch 2 times, most recently from 2a27648 to fb838f9 Compare May 12, 2020 10:31

const (
OnVolumeMount SELinuxRelabelPolicy = "OnVolumeMount"
AlwaysRelabel SELinuxRelabelPolicy = "Always"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can debate naming later, but I prefer something like "Recursive" instead of "Always", because we will relabel in all cases (if driver supports), regardless of what value is set here.

keps/sig-storage/1710-selinux-relabeling/README.md Outdated Show resolved Hide resolved
// podSecurityContext.seLinuxRelabelPolicy "OnVolumeMount" is silently ignored.
//
// Default is "false".
SELinuxMountSupporteded *bool;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the case where the driver doesn't support it and we don't want to recursively relabel?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently autodetected by presence of seclabel mount option. If it's there after NodePublish, the mounted volume supports SELinux and CRI relabels, even in Kubernetes 1.18 and earlier.

This behavior will be the same with this KEP. I haven't heard requests not to relabel when seclabel is present.

@jsafrane jsafrane force-pushed the selinux branch 2 times, most recently from 5a7fbdc to 2c3081d Compare May 13, 2020 15:12
Copy link
Member

@msau42 msau42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly lgtm. just some nits

// Defines behavior of changing SELinux labels of the volume before being exposed inside Pod.
// Valid values are "OnVolumeMount" and "Always". If not specified, "Always" is used.
// "Always" policy recursively changes SELinux labels on all files on all volumes used by the Pod.
// "OnVolumeMount" tries to mount volumes used by the Pod with the right context and skip recursive ownership
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention here that this option may still fallback to recursive mode if the driver doesn't support volume mount mode?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this:

Kubernetes may fall back to policy "Always" if a storage backed does not support this policy.

We don't think that this is a bug in the design.
Only one pod will have access to the volume, this KEP only changes the selection.

The only regression is when two pods with different SELinux context use the same volume, but different SubPath - they were working before, as the container runtime relabeled only the subpaths, now the whole volume must have the same context.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This isn't a "regression". Users using the default mode (today's behavior) will continue to work. It's a difference in behavior for the new mode.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reworded to:

The only different behavior is when two pods with different SELinux context use the same volume, but different SubPath - they are working with Always policy, as the container runtime relabeled only the subpaths, with OnVolumeMount the whole volume must have the same context.

keps/sig-storage/1710-selinux-relabeling/README.md Outdated Show resolved Hide resolved
@jsafrane
Copy link
Member Author

Updated and squashed everything to a single commit.

@jsafrane jsafrane force-pushed the selinux branch 2 times, most recently from 50280e6 to ddbc204 Compare May 15, 2020 16:47
@msau42
Copy link
Member

msau42 commented May 15, 2020

LGTM will let @tallclair also review


## Summary

This KEP tries to speed up the way how volumes (incl. persistent volumes) are made available to Pods on systems with SELinux in enforcing mode.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This KEP tries to speed up the way how volumes (incl. persistent volumes) are made available to Pods on systems with SELinux in enforcing mode.
This KEP tries to speed up the way that volumes (incl. persistent volumes) are made available to Pods on systems with SELinux in enforcing mode.

- "@jsafrane"
owning-sig: sig-storage
participating-sigs:
- sig-auth
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think this KEP is relevant to sig-auth.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

// change. Kubernetes may fall back to policy "Always" if a storage backed does not support this policy.
// This field is ignored for Pod's volumes that do not support SELinux.
// + optional
SELinuxRelabelPolicy *SELinuxRelabelPolicy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about combining SELinuxRelabelPolicy and FSGroupChangePolicy into a single option? For example, something like this:

const (
  // The heuristic policy acts like setting both the OnVolumeMount policy and the OnRootMismatch policy.
  HeuristicVolumeChangePolicy VolumeChangePolicy = "Heuristic"
  RecursiveVolumeChangePolicy VolumeChangePolicy = "Recursive"
)

type PodSeucrityContext struct {
  ...
  VolumeChangePolicy *VolumeChangePolicy
  ...
}

The motivation is that these settings seem very closely related, and would probably typically be set together. This decreases the flexibility, but simplifies the API and feature usage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are targeting beta for fsGroupChangePolicy feature this quarter - https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/695-skip-permission-change . If we were to rename the field, we will have to probably put a halt on that. Nothing too problematic, I think it is fine if it stays alpha in one more release but just wanted to give heads up.

Copy link
Member

@gnufied gnufied May 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking through - there are some downsides of combining selinux and fsGroup options which are worth considering:

  1. Having them behind one option kind of makes API hard to understand without reading the documentation in detail. IMO this is lot of detail to hide behind one field which is not very self-explanatory.
  2. There are use cases where one may want to set OnRootMistach but Always for selinux or vice-versa. For example for a volume driver that supports selinux(such as gce-pd), using OnVolumeMount is perfectly fine default for all use cases even when you bring a volume with data on it. On other hand there could be cases where a user may have to use Always as default fsGroupChangePolicy but not Always selinux policy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see benefits of a single field, however, "Heuristic" won't work well with fsGroup and a PV that's used also outside of Kubernetes, without fsGroup knowledge. If Heuristic was used, mounting -o context will work without issues, but skipping fsGroup chown / chmod may not work, because the top dir may have the right owner, while something outside of Kubernetes did create files on the volume with a wrong owner.

So, user may want full fsGroup change, but skip SELinux. I admit it may be a minor artificial use case, still, I don't want to paint us in the corner by combining the fields together.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added as "considered alternative".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this reasoning makes sense to me. It's unfortunate though, as the majority of the time I forsee the non-"always" option being used when the volume is too large and the recursive rewriting is too slow. It's unfortunate that these implementation details are getting surfaced in the API.

Copy link
Member

@msau42 msau42 May 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the SELinux case, would anyone want the Recursive approach over the VolumeMount approach? I'm wondering if we can get rid of the SELinux policy in PodSecurityContext, and only use the CSIDriver field to figure out what to do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the SELinux case, would anyone want the Recursive approach over the VolumeMount approach

I can think only about shared volumes + subpaths, they behave a bit differently with OnMount.

BTW, there is still the first alternative - follow fsGroup / OnRootMismatch. Then we can merge the API easily, save lot of code and make everything consistent... Just the speed will suffer, as kubelet has to relabel the volume when used for the first time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first alternative of having the same behavior of the fsgroup change policy sounds nice if that means we can also combine the API fields into one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the penalty of relabeling on the first mount is acceptable.

@jsafrane jsafrane force-pushed the selinux branch 3 times, most recently from 6033f88 to a991a50 Compare May 19, 2020 09:31
@jsafrane
Copy link
Member Author

/retest

@msau42
Copy link
Member

msau42 commented May 19, 2020

Discussed offline, we'll continue for alpha as proposed, but

  • We'll keep fsgroup policy in alpha for 1.19
  • consider if we can merge the two APIs before going to beta

@msau42
Copy link
Member

msau42 commented May 19, 2020

/lgtm
/approve
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 19, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane, msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit c65cfdb into kubernetes:master May 19, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants