Add an optional flag to kubelet for vols outside of kubelet control #81805
Conversation
Welcome @matthias50! |
Hi @matthias50. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: matthias50. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. |
/sig storage |
/assign @derekwaynecarr |
/ok-to-test Is the limit you are hitting specific to EBS volumes, or is it a limit on the number of mounts? |
Correct, the number of EBS volumes. |
Allows kubelet to account for volumes, other than root, which exist on the node when reporting the number of allowable volume mounts.
force-pushed from 71b2431 to b8c0c79
/test pull-kubernetes-bazel-test |
You didn't answer my question. Is the limit specific to EBS, or is it generic to all volume types? |
Sorry, my particular use case is EBS, but the idea of a limit is generic to all volume types. So regardless of the volume type, it is possible that some volume or volumes have been mounted to the node by some node bootstrap process that kubelet doesn't know anything about. This PR allows one to tell kubelet about this fact so that it reports the limit correctly. |
There are a couple of issues with this proposal. a. It does not work for CSI volumes, or when in-tree to CSI migration is enabled for EBS. But this might be an okay "in the meanwhile" solution. cc @msau42 @leakingtapan @justinsb |
Thanks @gnufied. I suspected this is an issue we would want to solve per-CSI-driver... |
@kubernetes/sig-storage-pr-reviews |
From my perspective, it allows for solving 100% of the attach limit issues for EC2. At least it is solving 100% of the problems we are facing :). For background, our clusters have a heterogeneous mix of instances: non-Nitro instances, and Nitro instances with and without instance storage. Here is what we want to have happen and how we will get it with this PR in place. Non-Nitro: 2 EBS volumes, root and container storage; pass --num-non-kube-volumes=1 so there can be a max of 38 volumes for PVCs. As I see it, it allows for solving #80967 by letting tooling outside of Kubernetes account for the presence or absence of non-root volumes. This tooling has to be aware of them anyway, because it created the volumes in the first place, so it is in a good place to tell kubelet about the extra volumes if they exist. This should also allow operators on other clouds such as Azure to account for the same issue in the same way, by passing a value for --num-non-kube-volumes corresponding to the number of volumes attached to the node by their bootstrap process. |
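A minimal sketch of the mechanism described above, assuming a hypothetical helper applyNonKubeVolumes fed by the proposed --num-non-kube-volumes flag; with the in-tree EBS default of 39 attachable volumes on a non-Nitro instance, subtracting 1 leaves the 38 PVC slots mentioned above:

// applyNonKubeVolumes reduces each plugin-reported attach limit by the number of
// volumes the node bootstrap attached outside of kubelet's knowledge.
// Sketch only; numNonKubeVolumes would come from the proposed --num-non-kube-volumes flag.
func applyNonKubeVolumes(attachLimits map[string]int64, numNonKubeVolumes int64) {
	for limitKey, value := range attachLimits {
		adjusted := value - numNonKubeVolumes // e.g. 39 - 1 = 38 slots left for PVCs on non-Nitro EBS
		if adjusted < 0 {
			adjusted = 0
		}
		attachLimits[limitKey] = adjusted
	}
}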
So I took another look at #80967 and the earlier proposal I made doesn't address that issue. I think, as a short-term solution, instead of taking some number of volumes as a flag to kubelet and subtracting, we could take a kubelet-level equivalent of KUBE_MAX_PD_VOLS which would act as an override if set. Then, in kubernetes/pkg/kubelet/nodestatus/setters.go (line 725 at c9f1b57), the reported limit could be overridden.
Could something like this be an ok "in the meanwhile" solution? |
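A rough sketch of that interim idea, assuming kubelet read KUBE_MAX_PD_VOLS itself (it does not today) and assuming os, strconv, and klog imports; the returned value would cap each plugin-reported limit in the VolumeLimits setter:

// maxPDVolsOverride returns a kubelet-side KUBE_MAX_PD_VOLS override, if set.
// Sketch only: this mirrors the scheduler-style semantics of the variable.
func maxPDVolsOverride() (int64, bool) {
	raw := os.Getenv("KUBE_MAX_PD_VOLS")
	if raw == "" {
		return 0, false
	}
	override, err := strconv.ParseInt(raw, 10, 64)
	if err != nil || override <= 0 {
		klog.Warningf("ignoring invalid KUBE_MAX_PD_VOLS value %q", raw)
		return 0, false
	}
	return override, true
}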
Any solution we propose must work for both CSI and in-tree volumes. At this point, I am afraid we can't have a solution that works for one but not for the other. So, assuming this is implemented for both CSI and in-tree volumes, and assuming the volume type supports such limits, will this create problems if a user is using two different types of volumes on the node? Usually each plugin can freely specify its own limits, but having an environment variable like KUBE_MAX_PD_VOLS |
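To make the concern concrete: when two plugins both report limits on the same node, a single override value cannot target just one of them. A tiny illustration (the limit key names follow the existing attachable-volumes-* convention; the numbers are made up):

// Both an in-tree plugin and a CSI driver publish their own limit key on the same node.
var attachLimits = map[string]int64{
	"attachable-volumes-aws-ebs":             39,
	"attachable-volumes-csi-ebs.csi.aws.com": 25,
}
// A single KUBE_MAX_PD_VOLS-style value cannot express "38 for the in-tree
// key but leave the CSI key alone"; one number would flatten both limits.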
Got it. What about passing kubelet a list of mappings from volume type limit key to value? For instance, we could pass a comma-separated list such as: --kube-node-max-pd-vols-map=attachable-volumes-aws-ebs=38,attachable-volumes-some-other-type=15 This worms its way through Kubelet until we end up with:
// VolumeLimits returns a Setter that updates the volume limits on the node.
func VolumeLimits(volumePluginListFunc func() []volume.VolumePluginWithAttachLimits, // typically Kubelet.volumePluginMgr.ListVolumePluginWithLimits
	kubeNodeMaxPDVolsMap map[string]int64,
) Setter {
	return func(node *v1.Node) error {
		if node.Status.Capacity == nil {
			node.Status.Capacity = v1.ResourceList{}
		}
		if node.Status.Allocatable == nil {
			node.Status.Allocatable = v1.ResourceList{}
		}
		pluginWithLimits := volumePluginListFunc()
		for _, volumePlugin := range pluginWithLimits {
			attachLimits, err := volumePlugin.GetVolumeLimits()
			if err != nil {
				klog.V(4).Infof("Error getting volume limit for plugin %s", volumePlugin.GetPluginName())
				continue
			}
			for limitKey, value := range attachLimits {
				// An operator-supplied entry in kubeNodeMaxPDVolsMap overrides the plugin-reported limit.
				override, overrideExists := kubeNodeMaxPDVolsMap[limitKey]
				if overrideExists {
					value = override
				}
				node.Status.Capacity[v1.ResourceName(limitKey)] = *resource.NewQuantity(value, resource.DecimalSI)
				node.Status.Allocatable[v1.ResourceName(limitKey)] = *resource.NewQuantity(value, resource.DecimalSI)
			}
		}
		return nil
	}
}
I tried this out, and for the in-tree AWS plugin, it works. As for the CSI case, it seems that this would require a similar but different solution. In that case, we somehow need to provide this information to the daemonset pod running on each of the nodes. As @gnufied mentioned in kubernetes-sigs/aws-ebs-csi-driver#347, perhaps this could be solved by having multiple daemonsets with different command-line arguments. Regardless of the approach for CSI, it seems like that is a totally separate case. Can we instead treat these as separate problems and use something like this for in-tree volume drivers? |
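For illustration only, one way the proposed --kube-node-max-pd-vols-map value could be turned into the kubeNodeMaxPDVolsMap consumed by the setter above; the function name and error handling are a sketch, assuming fmt, strconv, and strings imports:

// parseMaxPDVolsMap parses "attachable-volumes-aws-ebs=38,attachable-volumes-foo=15"
// into the per-limit-key override map. Sketch only; not part of this PR.
func parseMaxPDVolsMap(flagValue string) (map[string]int64, error) {
	overrides := map[string]int64{}
	if flagValue == "" {
		return overrides, nil
	}
	for _, pair := range strings.Split(flagValue, ",") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed entry %q, expected key=value", pair)
		}
		limit, err := strconv.ParseInt(strings.TrimSpace(kv[1]), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid limit in %q: %v", pair, err)
		}
		overrides[strings.TrimSpace(kv[0])] = limit
	}
	return overrides, nil
}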
@gnufied This issue is causing us pain because we have to do custom builds of k8s components every time we update Kubernetes to keep from running into production issues where too many volumes are mounted to nodes. Can I get some feedback on the above approach of separating these into an in-tree solution and an out-of-tree solution? Perhaps we could bring some unity here by using a similarly named flag for the CSI provider, as you suggested here: kubernetes-sigs/aws-ebs-csi-driver#347 I'm open to other approaches if someone has a better idea, but right now we seem to be stuck doing local builds of k8s binaries every time we want to update to a new version of Kubernetes. |
@gnufied, @leakingtapan, is there a path forward here yet? I also consider KUBE_MAX_PD_VOLS (as it stands today) to be an interim hack. If no "proper" solution is imminent, then I would consider extending KUBE_MAX_PD_VOLS to kubelet (and not just kube-controller-manager) a small, reasonable improvement on it. |
Just FYI, once CSI migration is GA (it's beta now, but disabled by default), the CSI driver will be the source of the volume limits value. I updated the CSI docs to recommend that CSI drivers allow customization of volume limits: https://kubernetes-csi.github.io/docs/volume-limits.html |
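For context on that model, the per-node limit under CSI comes from the driver's NodeGetInfo response, so a driver that wants the limit to be operator-configurable can expose its own flag and report the value there. A minimal sketch, assuming a hypothetical nodeService struct whose maxVolumesPerNode field is populated from such a flag (the flag name --volume-attach-limit is illustrative):

// Sketch of a CSI node plugin reporting a configurable attach limit.
// nodeService, nodeID, and maxVolumesPerNode are hypothetical fields;
// maxVolumesPerNode would be set from a driver flag such as --volume-attach-limit.
func (d *nodeService) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	return &csi.NodeGetInfoResponse{
		NodeId:            d.nodeID,
		MaxVolumesPerNode: d.maxVolumesPerNode, // surfaces as the attachable-volumes-csi-<driver> limit
	}, nil
}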
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@matthias50: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Allows kubelet to account for volumes, other than root, which exist on the node when reporting the number of allowable volume mounts.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Allows a cluster operator to inform kubelet of how many volumes exist on a node besides the root volume. This covers use cases such as creating separate volumes for container storage. Without this information, kubelet reports more volume mounts as possible than actually are, since it assumes the only volume in use is the root volume.
Which issue(s) this PR fixes:
Fixes #81533
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: