Add an optional flag to kubelet for vols outside of kubelet control #81805
Conversation
Welcome @matthias50! |
Hi @matthias50. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: matthias50. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. |
/sig storage |
/assign @derekwaynecarr |
/ok-to-test Is the limit you are hitting specific to EBS volumes, or is it a limit on the number of mounts? |
Correct, the number of EBS volumes. |
Allows kubelet to account for volumes, other than root, which exist on the node when reporting the number of allowable volume mounts.
force-pushed from 71b2431 to b8c0c79
/test pull-kubernetes-bazel-test |
You didn't answer my question. Is the limit specific to EBS, or is it generic to all volume types? |
Sorry, my particular use case is EBS, but the idea of a limit is generic to all volume types. So regardless of the volume type, it is possible that some volume or volumes have been mounted to the node by some node bootstrap process that kubelet doesn't know anything about. This PR allows one to tell kubelet about this fact so that it reports the limit correctly. |
There are a couple of issues with this proposal. a. It does not work for CSI volumes, or when in-tree to CSI migration is enabled for EBS. But this might be an okay "in the meanwhile" solution. cc @msau42 @leakingtapan @justinsb |
Thanks @gnufied. I suspected this is an issue we would want to solve per-CSI-driver... |
@kubernetes/sig-storage-pr-reviews |
From my perspective, it allows for solving 100% of the attach limit issues for EC2. At least it is solving 100% of the problems we are facing :). For background, our clusters have a heterogeneous mix of instances: non-Nitro instances, and Nitro instances with and without instance storage. Here is what we want to have happen and how we will get it with this PR in place. Non-Nitro: 2 EBS volumes, root and container storage; pass --num-non-kube-volumes=1 so there can be a max of 38 volumes for PVCs. As I see it, it allows for solving #80967 by letting tooling outside of Kubernetes account for the presence or absence of non-root volumes. This tooling has to be aware of them anyway, because it created the volumes in the first place, so it is in a good place to tell kubelet about the extra volumes if they exist. This should also allow operators on other clouds such as Azure to account for the same issue in the same way, by passing a value for --num-non-kube-volumes corresponding to the number of volumes attached to the node by their bootstrap process. |
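A minimal sketch of the mechanism described above, assuming a hypothetical helper applyNonKubeVolumes fed by the proposed --num-non-kube-volumes flag; with the in-tree EBS default of 39 attachable volumes on a non-Nitro instance, subtracting 1 leaves the 38 PVC slots mentioned above:

// applyNonKubeVolumes reduces each plugin-reported attach limit by the number of
// volumes the node bootstrap attached outside of kubelet's knowledge.
// Sketch only; numNonKubeVolumes would come from the proposed --num-non-kube-volumes flag.
func applyNonKubeVolumes(attachLimits map[string]int64, numNonKubeVolumes int64) {
	for limitKey, value := range attachLimits {
		adjusted := value - numNonKubeVolumes // e.g. 39 - 1 = 38 slots left for PVCs on non-Nitro EBS
		if adjusted < 0 {
			adjusted = 0
		}
		attachLimits[limitKey] = adjusted
	}
}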
So I took another look at #80967 and the earlier proposal I made doesn't address that issue. I think, as a short-term solution, instead of taking some number of volumes as a flag to kubelet and subtracting, we could take a kubelet-level equivalent of KUBE_MAX_PD_VOLS which would act as an override if set. Then, in kubernetes/pkg/kubelet/nodestatus/setters.go (line 725 at c9f1b57), the reported limit could be overridden.
Could something like this be an ok "in the meanwhile" solution? |
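A rough sketch of that interim idea, assuming kubelet read KUBE_MAX_PD_VOLS itself (it does not today) and assuming os, strconv, and klog imports; the returned value would cap each plugin-reported limit in the VolumeLimits setter:

// maxPDVolsOverride returns a kubelet-side KUBE_MAX_PD_VOLS override, if set.
// Sketch only: this mirrors the scheduler-style semantics of the variable.
func maxPDVolsOverride() (int64, bool) {
	raw := os.Getenv("KUBE_MAX_PD_VOLS")
	if raw == "" {
		return 0, false
	}
	override, err := strconv.ParseInt(raw, 10, 64)
	if err != nil || override <= 0 {
		klog.Warningf("ignoring invalid KUBE_MAX_PD_VOLS value %q", raw)
		return 0, false
	}
	return override, true
}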
Any solution we propose must work for both CSI and in-tree volumes. At this point, I am afraid we can't have a solution that works for one but not for the other. So, assuming this is implemented for both CSI and in-tree volumes, and assuming the volume type supports such limits, will this create problems if a user is using two different types of volumes on the node? Usually each plugin can freely specify its own limits, but having an environment variable like KUBE_MAX_PD_VOLS |
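To make the concern concrete: when two plugins both report limits on the same node, a single override value cannot target just one of them. A tiny illustration (the limit key names follow the existing attachable-volumes-* convention; the numbers are made up):

// Both an in-tree plugin and a CSI driver publish their own limit key on the same node.
var attachLimits = map[string]int64{
	"attachable-volumes-aws-ebs":             39,
	"attachable-volumes-csi-ebs.csi.aws.com": 25,
}
// A single KUBE_MAX_PD_VOLS-style value cannot express "38 for the in-tree
// key but leave the CSI key alone"; one number would flatten both limits.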
Got it. What about passing kubelet a list of mappings from volume type limit key to value? For instance, we could pass a comma-separated list such as: --kube-node-max-pd-vols-map=attachable-volumes-aws-ebs=38,attachable-volumes-some-other-type=15 This worms its way through Kubelet until we end up with:
// VolumeLimits returns a Setter that updates the volume limits on the node.
func VolumeLimits(volumePluginListFunc func() []volume.VolumePluginWithAttachLimits, // typically Kubelet.volumePluginMgr.ListVolumePluginWithLimits
	kubeNodeMaxPDVolsMap map[string]int64,
) Setter {
	return func(node *v1.Node) error {
		if node.Status.Capacity == nil {
			node.Status.Capacity = v1.ResourceList{}
		}
		if node.Status.Allocatable == nil {
			node.Status.Allocatable = v1.ResourceList{}
		}
		pluginWithLimits := volumePluginListFunc()
		for _, volumePlugin := range pluginWithLimits {
			attachLimits, err := volumePlugin.GetVolumeLimits()
			if err != nil {
				klog.V(4).Infof("Error getting volume limit for plugin %s", volumePlugin.GetPluginName())
				continue
			}
			for limitKey, value := range attachLimits {
				// An operator-supplied entry in kubeNodeMaxPDVolsMap overrides the plugin-reported limit.
				override, overrideExists := kubeNodeMaxPDVolsMap[limitKey]
				if overrideExists {
					value = override
				}
				node.Status.Capacity[v1.ResourceName(limitKey)] = *resource.NewQuantity(value, resource.DecimalSI)
				node.Status.Allocatable[v1.ResourceName(limitKey)] = *resource.NewQuantity(value, resource.DecimalSI)
			}
		}
		return nil
	}
}
I tried this out, and for the in-tree AWS plugin, it works. As for the CSI case, it seems that this would require a similar but different solution. In that case, we somehow need to provide this information to the daemonset pod running on each of the nodes. As @gnufied mentioned in kubernetes-sigs/aws-ebs-csi-driver#347, perhaps this could be solved by having multiple daemonsets with different command-line arguments. Regardless of the approach for CSI, it seems like that is a totally separate case. Can we instead treat these as separate problems and use something like this for in-tree volume drivers? |
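For illustration only, one way the proposed --kube-node-max-pd-vols-map value could be turned into the kubeNodeMaxPDVolsMap consumed by the setter above; the function name and error handling are a sketch, assuming fmt, strconv, and strings imports:

// parseMaxPDVolsMap parses "attachable-volumes-aws-ebs=38,attachable-volumes-foo=15"
// into the per-limit-key override map. Sketch only; not part of this PR.
func parseMaxPDVolsMap(flagValue string) (map[string]int64, error) {
	overrides := map[string]int64{}
	if flagValue == "" {
		return overrides, nil
	}
	for _, pair := range strings.Split(flagValue, ",") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed entry %q, expected key=value", pair)
		}
		limit, err := strconv.ParseInt(strings.TrimSpace(kv[1]), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid limit in %q: %v", pair, err)
		}
		overrides[strings.TrimSpace(kv[0])] = limit
	}
	return overrides, nil
}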
@gnufied This issue is causing us pain because we have to do custom builds of k8s components every time we update Kubernetes to keep from running into production issues where too many volumes are mounted to nodes. Can I get some feedback on the above approach of separating these into an in-tree solution and an out-of-tree solution? Perhaps we could bring some unity here by using a similarly named flag for the CSI provider, as you suggested here: kubernetes-sigs/aws-ebs-csi-driver#347 I'm open to other approaches if someone has a better idea, but right now we seem to be stuck doing local builds of k8s binaries every time we want to update to a new version of Kubernetes. |
@gnufied, @leakingtapan, is there a path forward here yet? I also consider KUBE_MAX_PD_VOLS (as it stands today) to be an interim hack. If no "proper" solution is imminent, then I would consider extending KUBE_MAX_PD_VOLS to kubelet (and not just kube-controller-manager) a small, reasonable improvement on it. |
Just FYI, once CSI migration is GA (it's beta now, but disabled by default), the CSI driver will be the source of the volume limits value. I updated the CSI docs to recommend that CSI drivers allow customization of volume limits: https://kubernetes-csi.github.io/docs/volume-limits.html |
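For context on that model, the per-node limit under CSI comes from the driver's NodeGetInfo response, so a driver that wants the limit to be operator-configurable can expose its own flag and report the value there. A minimal sketch, assuming a hypothetical nodeService struct whose maxVolumesPerNode field is populated from such a flag (the flag name --volume-attach-limit is illustrative):

// Sketch of a CSI node plugin reporting a configurable attach limit.
// nodeService, nodeID, and maxVolumesPerNode are hypothetical fields;
// maxVolumesPerNode would be set from a driver flag such as --volume-attach-limit.
func (d *nodeService) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	return &csi.NodeGetInfoResponse{
		NodeId:            d.nodeID,
		MaxVolumesPerNode: d.maxVolumesPerNode, // surfaces as the attachable-volumes-csi-<driver> limit
	}, nil
}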
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@matthias50: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Allows kubelet to account for volumes, other than root, which exist on the node when reporting the number of allowable volume mounts.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Allows a cluster operator to inform kubelet of how many volumes exist on a node besides the root volume. This covers use cases such as creating separate volumes for container storage. Without this information, kubelet reports more volume mounts as possible than actually are, since it assumes the only volume in use is the root volume.
Which issue(s) this PR fixes:
Fixes #81533
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: