Remove block-device-mapping from attach limit #1843
Conversation
meta-data/block-device-mapping reports devices that were attached to the VM at the time the VM was started. It does not reflect devices attached/detached later. In addition, even when it showed the attachments accurately, the CSI driver uses this information only at the driver startup. At this time, there may be some CSI volumes attached from the previous node boot. Those volumes are then counted as permanently attached to the node, while Kubernetes may detach them later. Therefore do not count block-device-mapping at all and always reserve just 1 attachment for the root disk. Assume all other attachments are CSI volumes, and let Kubernetes do the calculations of attached volumes instead.
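A minimal sketch of the calculation described above, with assumed names (`allocatableVolumes`, `instanceMaxAttachments`) rather than the driver's actual identifiers:

```go
// Sketch of the attach-limit calculation described in this PR, not the
// driver's literal code.
package main

import "fmt"

const reservedForRootDisk = 1 // the only non-CSI attachment still accounted for

// allocatableVolumes ignores meta-data/block-device-mapping entirely:
// reserve one slot for the root disk and treat every other attachment as a
// CSI volume whose lifecycle Kubernetes already tracks.
func allocatableVolumes(instanceMaxAttachments int) int {
	return instanceMaxAttachments - reservedForRootDisk
}

func main() {
	// e.g. an instance type allowing 28 volume attachments
	fmt.Println(allocatableVolumes(28)) // 27 slots reported to Kubernetes
}
```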
Thank you @jsafrane /lgtm
At the very least, `blockDevMappings` should definitely not take into account CSI-managed volumes. Users relying on the current behavior can make use of the `node.volumeAttachLimit` parameter to "reserve" attachment slots.
cc: @ConnorJC3 @AndrewSirenko to review before approval.
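As a hedged illustration of that workaround (the option and function names below are assumptions; the point is only that an explicitly configured limit takes precedence over the computed one):

```go
// Illustrative only: an explicitly configured limit (what the
// node.volumeAttachLimit Helm parameter ultimately feeds into) overrides the
// value computed from the instance type, letting operators "reserve" slots
// for attachments made outside of Kubernetes.
package main

import "fmt"

type driverOptions struct {
	volumeAttachLimit int64 // < 0 means "not set"
}

func effectiveAttachLimit(opts driverOptions, computed int64) int64 {
	if opts.volumeAttachLimit >= 0 {
		return opts.volumeAttachLimit // operator reserved slots manually
	}
	return computed
}

func main() {
	fmt.Println(effectiveAttachLimit(driverOptions{volumeAttachLimit: -1}, 27)) // 27
	fmt.Println(effectiveAttachLimit(driverOptions{volumeAttachLimit: 20}, 27)) // 20
}
```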
Big thank you @jsafrane for diving deep on #1808.
^^ is quite the workload interruption to pay without this PR.
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: torredil

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@jsafrane I don't fully understand this particular fix, but does it take into account network interfaces? They also contribute to the same limit. E.g. if you attach 10 more network interfaces, the number of EBS volumes that can be attached drops by the same amount.
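For context, a rough sketch of the shared budget the question refers to (the ceiling of 28 is just an example of a Nitro-style shared ENI+EBS limit, not a value queried from AWS):

```go
// Rough illustration of how ENIs and EBS volumes compete for the same
// attachment budget. Numbers are examples only; real limits depend on the
// instance type and are not fetched here.
package main

import "fmt"

func main() {
	const sharedAttachmentLimit = 28 // example shared ENI + EBS ceiling
	const rootDisk = 1

	for _, enis := range []int{1, 5, 10} {
		ebsSlots := sharedAttachmentLimit - enis - rootDisk
		fmt.Printf("%2d ENIs attached -> %2d slots left for EBS volumes\n", enis, ebsSlots)
	}
}
```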
I also need to express concerns. Wouldn't this PR break the use of any non-boot volume that is not attached via Kubernetes? For example, some of our users in Gardener opt for additional data volumes, which will completely break. It is quite an assumption that no other volume can be used at creation time, and this is a highly breaking change that should be marked as such.
I didn't get a chance to review this PR before it was merged, but I do agree that it breaks the ability to use non-CSI volumes (other than the root volume) on nodes where the EBS CSI driver is running. That's a significant enough breaking change that we need to be looking into a fix or revert of this PR before the next release.
Hi @kon-angelo, thanks for chiming in and starting this discussion, really appreciate the feedback. How are you folks solving this problem with other providers? It would be great to work on a solution that works well with the broader community - currently, K8s does not support managing both non-CSI and CSI volumes simultaneously. Other CSI drivers do not take into account non-CSI managed volumes.

Not at all familiar with Gardener, so pardon if this is completely unfeasible or does not meet your requirements - given that the driver is deployed by default on every cluster already: would it be possible to follow the standard and use the driver to manage these additional data volumes?

I'm not inclined to revert the changes made by this PR - it fixes a critical bug in the driver that affects all users relying on it to manage the lifecycle of their volumes (the supported, common case) - and as previously mentioned, the resulting failure mode requires manual intervention to fix (once the scheduler incorrectly schedules, the stateful pod needs to be deleted + node cordoned so that the same mistake does not happen again).
For clarity: the changes in this PR do not inherently break the ability to use non-CSI volumes on nodes where the EBS CSI driver is running. If the node has capacity to support the attachment, it will succeed (these data volumes are practically guaranteed to successfully grab a slot - I'd be more worried about the consequences of that). That said, this has been actively discouraged in the community: kubernetes/kubernetes#81533 (comment) cc: @gnufied

As Jan already mentioned, the driver reports the allocatable count on the node only once, during driver initialization. There is no reconciliation. Even when reverting, if users attach (or detach) volumes or ENIs outside of K8s, such as via the CLI, the allocatable count will not be updated after plugin registration.

Finally, I also wanted to highlight that since the driver is unaware of block devices attached outside of K8s, attempting to force the CSI driver to be aware of non-CSI attachments means the hard-coded table of values specifying device names is unreliable - if the driver attempts to re-use a device name, the attachment will fail indefinitely. This async failure mode is extra troublesome because it's not very obvious and quite hard to track down.
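For readers unfamiliar with where that one-time report happens: kubelet copies `MaxVolumesPerNode` from the CSI `NodeGetInfo` response into the CSINode object at plugin registration and does not refresh it afterwards. A minimal sketch, with `nodeService` and `getVolumesLimit` as illustrative stand-ins rather than the driver's actual code:

```go
// Minimal sketch of where the allocatable count leaves the driver: the CSI
// NodeGetInfo response. kubelet reads MaxVolumesPerNode once at plugin
// registration; there is no later reconciliation.
package node

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

type nodeService struct {
	nodeID           string
	availabilityZone string
}

// getVolumesLimit stands in for the computed (or operator-overridden) limit.
func (d *nodeService) getVolumesLimit() int64 { return 26 }

func (d *nodeService) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	return &csi.NodeGetInfoResponse{
		NodeId:            d.nodeID,
		MaxVolumesPerNode: d.getVolumesLimit(), // read once by kubelet
		AccessibleTopology: &csi.Topology{
			Segments: map[string]string{"topology.ebs.csi.aws.com/zone": d.availabilityZone},
		},
	}, nil
}
```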
There is no Kubernetes way to count volume attachments done outside of Kubernetes. Different CSI drivers solve it in different ways; a very quick check of a few random cloud-based CSI drivers shows:
None of these CSI drivers counts existing attachments at driver startup. Does AWS provide a reliable way to distinguish CSI from non-CSI attachments in instance metadata, or with very limited AWS permissions? And is driver pod start an acceptable time to count the non-CSI attachments? It can happen at an arbitrary time, so those non-CSI attachments basically need to stay attached forever, which does not sound very reliable to me. But it would work for Gardener's data disks, if I understood it correctly.

I agree this PR could break some clusters and it should be treated as a breaking change. In the worst case, Gardener and other clusters that need reserved attachment slots could set `node.volumeAttachLimit`.
Hey @torredil
Not necessary. I will be happy to answer any points.
As @jsafrane analyzed, it is not that easy to support this for other providers. But it was working as expected for AWS. The behavior of the driver before this PR made it so that EBS volumes present at boot time would be taken into account for the attach-limit calculation.

We had faced a different (but related) issue in the past where the AWS API was breaking its convention and was reporting EBS volumes via the metadata service even without a node reboot. In Gardener we also don't do node reboots; as part of normal operation we would opt to create new nodes, which is why the old behavior made more sense, at least for us. We treat boot-time volumes as part of an immutable node/VM spec.
I think you already made the example I would use. Before this PR, you could rely on the kube-scheduler correctly scheduling pods even on nodes with non-CSI volumes. Now you will face the issue that a pod is scheduled and everything looks "healthy" from the k8s side, but the volume can never be attached.
That comment is also dangerous then, as it mentions a "docker" volume, which is already a problem with this PR.
Which is what made the behavior correct, as this allows you to separate CSI and non-CSI volumes by virtue of the former being there at VM creation and the metadata service accounting for them.
At least not from the metadata service. But this would be the optimal solution.
This may be a bit Gardener-specific, but non-CSI attachments are there for the node's lifetime. They are also used in some cases to set up the node, e.g. for kubelet, containerd, additional drivers, etc. Gardener just sets the line at VM boot-up, and everything past that happens in the k8s/CSI world only.
That is actually the problem here. AWS instance metadata reports attachments not at VM creation time, but at node startup time. If there were CSI volumes attached to the node at startup time, then the instance metadata contains both CSI and non-CSI volumes, and the driver has no way to tell which is which. We can teach cluster admins to always drain nodes before shutdown/reboot, but in reality there are always some nodes that get shut down ungracefully.
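For reference, a small sketch of reading that metadata path (uses the aws-sdk-go v1 `ec2metadata` client; running it on a node returns only the devices that were mapped when the instance last started, which is exactly the staleness described above):

```go
// Prints meta-data/block-device-mapping from the instance metadata service.
// The listing reflects the device mapping from the last instance start, not
// attachments made or removed afterwards.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws/ec2metadata"
	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	sess := session.Must(session.NewSession())
	imds := ec2metadata.New(sess)

	// Returns one mapping name per line, e.g. "ami", "root", "ebs1", ...
	mappings, err := imds.GetMetadata("block-device-mapping/")
	if err != nil {
		log.Fatalf("reading block-device-mapping: %v", err)
	}
	fmt.Println(mappings)
}
```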
… limit Signed-off-by: Eddie Torres <[email protected]>
Is this a bug fix or adding new feature?
/kind bug
Fixes: #1808
What is this PR about? / Why do we need it?
`meta-data/block-device-mapping` reports devices that were attached to the VM at the time the VM was started. It does not reflect devices attached/detached later. In addition, even when it showed the attachments accurately, the CSI driver uses this information only at the driver startup. At this time, there may be some CSI volumes attached from the previous node boot. Those volumes are then counted as permanently attached to the node, while Kubernetes may detach them later.
Therefore do not count block-device-mapping at all and always reserve just 1 attachment for the root disk. Assume all other attachments are CSI volumes, and let Kubernetes do the calculations of attached volumes instead.
What testing is done?
Tested with Kubernetes 1.28 (OpenShift): a node was shut down without draining it first and then started again with some CSI driver volumes still attached.
Before this PR:
After this PR: