Add KEP for SELinux label change #1621
Conversation
354a12f to 03052e7 (Compare)
Passed the first round of internal review. Adding more people from the community: explicitly @tallclair @derekwaynecarr for kubelet expertise and strong security background.
LGTM from runtime side 👍
LGTM too (unfortunately I'm not a SELinux expert)
It should be noted that not all file systems support SELinux labeling, specifically file systems without xattr support. For these cases the volume needs to be mounted with a context mount, or the container has to be run without SELinux separation. In the future we are shipping udica (https://github.com/containers/udica), which is a mechanism to have more than one container process type than just "container_t". This would allow users to have access to random labels on the system, but the policy needs to be installed on the system.
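As an editor's illustration of the xattr dependency mentioned above (not part of the KEP): SELinux stores per-file labels in the `security.selinux` extended attribute, so a filesystem without xattr support cannot hold them. A minimal Go sketch that reads a file's label, Linux only, using `golang.org/x/sys/unix`:

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/sys/unix"
)

func main() {
	buf := make([]byte, 256)
	// Getxattr fails (e.g. ENOTSUP/ENODATA) on filesystems without
	// xattr support or when no label is present.
	n, err := unix.Getxattr("/etc/passwd", "security.selinux", buf)
	if err != nil {
		fmt.Println("no SELinux label:", err)
		return
	}
	// The stored value is typically NUL-terminated.
	fmt.Println("label:", strings.TrimRight(string(buf[:n]), "\x00"))
}
```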
Thinking again about shared volumes, I needed to add … Not sure if …
### Graduation Criteria

* Alpha:
  * Provided all tests defined above are passing and gated by the feature gate `SELinuxRelabelPolicy` and set to a default of `false`.
You mean default to `Always`, right?
No, this is Kubernetes feature gate.
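For readers unfamiliar with Kubernetes feature gates, a small self-contained sketch (editor's illustration; the gate name comes from the KEP text above, everything else is assumed) using `k8s.io/component-base/featuregate`:

```go
package main

import (
	"fmt"

	"k8s.io/component-base/featuregate"
)

const SELinuxRelabelPolicy featuregate.Feature = "SELinuxRelabelPolicy"

func main() {
	// Register the gate as alpha, defaulting to false, as the
	// graduation criteria above require.
	gates := featuregate.NewFeatureGate()
	if err := gates.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		SELinuxRelabelPolicy: {Default: false, PreRelease: featuregate.Alpha},
	}); err != nil {
		panic(err)
	}
	fmt.Println(gates.Enabled(SELinuxRelabelPolicy)) // false until enabled
}
```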
```go
// on a NodePublished directory. If "seclabel" is present (i.e. the kernel supports SELinux
// labels for this volume), the container runtime may change labels of all files on the volume
// to match the Pod requirements.
SupportsSELinux *bool
```
Another option is perhaps to have a field called `StorageIsolationPolicy` or `SecurityPolicy` and define an enum for it with a possible value of `selinux` for now. In the future this field could be expanded to include other security policies, such as AppArmor or something else on Windows.
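A rough sketch of that shape (editor's hypothetical illustration; the field name, enum type, and values below were never adopted by the KEP):

```go
// StorageIsolationPolicy names a security mechanism a CSI driver
// supports for its volumes.
type StorageIsolationPolicy string

const (
	// SELinuxIsolation: the driver can mount volumes with an SELinux
	// context (mount -o context=...). Future values could cover
	// AppArmor or Windows-specific policies.
	SELinuxIsolation StorageIsolationPolicy = "selinux"
)
```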
As I pointed out above, the name (and actually also its description) is misleading. The flag is used to determine if `mount -o context` on a volume can affect other volumes of the same CSI driver on the same node.

If nfs-client provisioner had a CSI driver, mounting one of its volumes with `-o context=A` sets the context for all other volumes on the node, because they come from the same NFS export.

If nfs provisioner had a CSI driver, mounting one of its volumes with `-o context=A` does not affect the other volumes, because they use separate NFS exports [here I am not 100% sure what ganesha actually does, but I believe there are other NFS servers where it would work].

So the flag is actually about the server and the independence of its volumes, not about SELinux or AppArmor. It needs a better name.
Added more description for `SupportsSELinux` and a table with examples.
Following my review comment for the fsgroup field, I think it would be better to have explicit enums for the 3 behaviors.
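For illustration only, an explicit enum might have looked like the sketch below (editor's hypothetical; per the reply that follows, the field ultimately stayed a `*bool`, and the value names mirror the "Supported"/"Unsupported" strings discussed there):

```go
type SELinuxMountSupport string

const (
	// Hypothetical values; never part of the proposal.
	SELinuxMountSupportedValue   SELinuxMountSupport = "Supported"
	SELinuxMountUnsupportedValue SELinuxMountSupport = "Unsupported"
)
```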
I renamed the field to `SELinuxMountSupported` to emphasize that it is only about mounts with `-o context`. It reduces the number of cases to handle in kubelet / the CSI volume plugin. Whether the mounted filesystem supports SELinux is always autodetected by the presence of the `seclabel` mount option, as it is already done now.

Therefore the values are `true` or `false`; `nil` means `false`. Do you still want an enum? It would be something like `"Supported"` and `"Unsupported"`, which looks odd.
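For illustration, a tiny sketch (editor's addition, not KEP code) of that tri-state interpretation:

```go
// seLinuxMountSupported interprets the proposed *bool field:
// nil and false both mean "not supported".
func seLinuxMountSupported(supported *bool) bool {
	return supported != nil && *supported
}
```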
2a27648 to fb838f9 (Compare)
```go
const (
	OnVolumeMount SELinuxRelabelPolicy = "OnVolumeMount"
	AlwaysRelabel SELinuxRelabelPolicy = "Always"
)
```
We can debate naming later, but I prefer something like "Recursive" instead of "Always", because we will relabel in all cases (if driver supports), regardless of what value is set here.
```go
// podSecurityContext.seLinuxRelabelPolicy "OnVolumeMount" is silently ignored.
//
// Default is "false".
SELinuxMountSupported *bool
```
What about the case where the driver doesn't support it and we don't want to recursively relabel?
This is currently autodetected by the presence of the `seclabel` mount option. If it's there after NodePublish, the mounted volume supports SELinux and CRI relabels it, even in Kubernetes 1.18 and earlier. This behavior will be the same with this KEP. I haven't heard requests not to relabel when `seclabel` is present.
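A minimal sketch (editor's illustration, not KEP code) of how that `seclabel` autodetection could look: scan `/proc/mounts` for the mount point and check whether the kernel added the `seclabel` option. The helper name is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// hasSeclabel reports whether the mount at mountPoint lists "seclabel"
// among its options in /proc/mounts.
func hasSeclabel(mountPoint string) (bool, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// /proc/mounts format: device mountpoint fstype options dump pass
		fields := strings.Fields(scanner.Text())
		if len(fields) < 4 || fields[1] != mountPoint {
			continue
		}
		for _, opt := range strings.Split(fields[3], ",") {
			if opt == "seclabel" {
				return true, nil
			}
		}
	}
	return false, scanner.Err()
}

func main() {
	ok, err := hasSeclabel("/")
	fmt.Println(ok, err)
}
```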
5a7fbdc to 2c3081d (Compare)
mostly lgtm. just some nits
```go
// Defines behavior of changing SELinux labels of the volume before being exposed inside Pod.
// Valid values are "OnVolumeMount" and "Always". If not specified, "Always" is used.
// "Always" policy recursively changes SELinux labels on all files on all volumes used by the Pod.
// "OnVolumeMount" tries to mount volumes used by the Pod with the right context and skip recursive ownership change.
```
Should we mention here that this option may still fallback to recursive mode if the driver doesn't support volume mount mode?
Added this:

> Kubernetes may fall back to policy "Always" if a storage backend does not support this policy.
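A minimal sketch of the fallback logic that sentence implies (editor's assumption about kubelet behavior, not code from the KEP):

```go
type SELinuxRelabelPolicy string

const (
	OnVolumeMount SELinuxRelabelPolicy = "OnVolumeMount"
	AlwaysRelabel SELinuxRelabelPolicy = "Always"
)

// effectivePolicy degrades OnVolumeMount to Always when the storage
// backend cannot mount with -o context.
func effectivePolicy(requested SELinuxRelabelPolicy, seLinuxMountSupported bool) SELinuxRelabelPolicy {
	if requested == OnVolumeMount && !seLinuxMountSupported {
		// Fall back to recursive relabeling by the container runtime.
		return AlwaysRelabel
	}
	return requested
}
```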
We don't think that this is a bug in the design. Only one pod will have access to the volume; this KEP only changes the selection.

The only regression is when two pods with different SELinux contexts use the same volume but a different SubPath - they were working before, as the container runtime relabeled only the subpaths; now the whole volume must have the same context.
nit: This isn't a "regression". Users using the default mode (today's behavior) will continue to work. It's a difference in behavior for the new mode.
reworded to:

> The only different behavior is when two pods with different SELinux contexts use the same volume but a different SubPath - they work with the `Always` policy, as the container runtime relabels only the subpaths; with `OnVolumeMount` the whole volume must have the same context.
Updated and squashed everything to a single commit.
50280e6 to ddbc204 (Compare)
LGTM, will let @tallclair also review
## Summary

This KEP tries to speed up the way how volumes (incl. persistent volumes) are made available to Pods on systems with SELinux in enforcing mode.
Suggested change:

```diff
- This KEP tries to speed up the way how volumes (incl. persistent volumes) are made available to Pods on systems with SELinux in enforcing mode.
+ This KEP tries to speed up the way that volumes (incl. persistent volumes) are made available to Pods on systems with SELinux in enforcing mode.
```
- "@jsafrane" | ||
owning-sig: sig-storage | ||
participating-sigs: | ||
- sig-auth |
nit: I don't think this KEP is relevant to sig-auth.
removed
```go
// change. Kubernetes may fall back to policy "Always" if a storage backend does not support this policy.
// This field is ignored for Pod's volumes that do not support SELinux.
// +optional
SELinuxRelabelPolicy *SELinuxRelabelPolicy
```
What do you think about combining `SELinuxRelabelPolicy` and `FSGroupChangePolicy` into a single option? For example, something like this:

```go
type VolumeChangePolicy string

const (
	// The heuristic policy acts like setting both the OnVolumeMount policy and the OnRootMismatch policy.
	HeuristicVolumeChangePolicy VolumeChangePolicy = "Heuristic"
	RecursiveVolumeChangePolicy VolumeChangePolicy = "Recursive"
)

type PodSecurityContext struct {
	...
	VolumeChangePolicy *VolumeChangePolicy
	...
}
```

The motivation is that these settings seem very closely related, and would probably typically be set together. This decreases the flexibility, but simplifies the API and feature usage.
We are targeting beta for the `fsGroupChangePolicy` feature this quarter - https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/695-skip-permission-change. If we were to rename the field, we would probably have to put a halt on that. Nothing too problematic; I think it is fine if it stays alpha for one more release, but I just wanted to give a heads-up.
Thinking through - there are some downsides of combining the selinux and fsGroup options which are worth considering:

- Having them behind one option kind of makes the API hard to understand without reading the documentation in detail. IMO this is a lot of detail to hide behind one field which is not very self-explanatory.
- There are use cases where one may want to set `OnRootMismatch` but `Always` for selinux, or vice versa. For example, for a volume driver that supports selinux (such as gce-pd), using `OnVolumeMount` is a perfectly fine default for all use cases, even when you bring a volume with data on it. On the other hand, there could be cases where a user may have to use `Always` as the default `fsGroupChangePolicy` but not the `Always` selinux policy.
I see the benefits of a single field; however, "Heuristic" won't work well with `fsGroup` and a PV that's also used outside of Kubernetes, without `fsGroup` knowledge. If `Heuristic` was used, mounting with `-o context` will work without issues, but skipping the `fsGroup` `chown`/`chmod` may not work, because the top dir may have the right owner while something outside of Kubernetes created files on the volume with a wrong owner.

So a user may want the full `fsGroup` change, but skip SELinux. I admit it may be a minor, artificial use case; still, I don't want to paint us into a corner by combining the fields together.
Added as "considered alternative".
OK, this reasoning makes sense to me. It's unfortunate though, as the majority of the time I foresee the non-`Always` option being used when the volume is too large and the recursive rewriting is too slow. It's unfortunate that these implementation details are getting surfaced in the API.
In the SELinux case, would anyone want the Recursive approach over the VolumeMount approach? I'm wondering if we can get rid of the SELinux policy in PodSecurityContext, and only use the CSIDriver field to figure out what to do.
> In the SELinux case, would anyone want the Recursive approach over the VolumeMount approach?

I can think only about shared volumes + subpaths; they behave a bit differently with `OnMount`.

BTW, there is still the first alternative - follow `fsGroup` / `OnRootMismatch`. Then we can merge the API easily, save a lot of code and make everything consistent... Just the speed will suffer, as kubelet has to relabel the volume when it is used for the first time.
The first alternative, having the same behavior as the fsgroup change policy, sounds nice if that means we can also combine the API fields into one.
I think the penalty of relabeling on the first mount is acceptable.
6033f88 to a991a50 (Compare)
/retest
Discussed offline, we'll continue for alpha as proposed, but …
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jsafrane, msau42. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
This KEP tries to speed up the way that volumes (incl. persistent volumes) are made available to Pods on systems with SELinux in enforcing mode. The current way includes recursive relabeling of all files on a volume before a container can be started; this is slow for large volumes.

Familiarity with the `fsGroupChangePolicy` KEP is suggested.