CAPA applies AWS cloud-controller-manager tag to user-owned subnet, but does not delete the tag on cluster delete, leaving the subnet unusable by future clusters #3854

dlipovetsky · 2022-11-15T20:32:29Z

/kind bug

What steps did you take and what happened:
The cloud-controller-manager (CCM) can only use tagged resources; if they're missing, it will not create the ELB to back the LoadBalancer-type Service.

The tag, which I'll refer to as the "CCM tag," is: kubernetes.io/cluster/<name>: [owned|shared]. Importantly, a resource may have exactly one such tag. Otherwise, the CCM will return an error.

CAPA applies the CCM tag to various resources that it owns. That's fine. However, with some recent changes on the main branch, CAPA applies this tag to some resources that the user owns. These resources outlive the cluster, and CAPA does not remove the CCM tag on cluster delete. Until the CCM tag is removed from them, these resources can't be used for other clusters. That's a problem.

Part of the reason CAPA does not remove the CCM tag from these resources is because it does not know whether it applied the CCM tag, or whether the user applied it themselves. Without storing some state (e.g. using yet another tag), CAPA cannot know this.

What did you expect to happen:
CAPA should not apply the CCM tag to user-owned resources, or it should delete the CCM tag when reconciling the cluster delete.

Anything else you would like to add:
CAPA begin to apply the CCM tag to a user-owned subnet in #2715.

Environment:

Cluster-api-provider-aws version:
v1.5.1

#3854 (comment)

The text was updated successfully, but these errors were encountered:

dlipovetsky · 2022-11-15T20:47:24Z

Fixing this bug means changing tagging behavior that's been present since v1.0. The possible solutions:

(a) CAPA stops applying the CCM tag to user-owned subnets
(b) CAPA applies (on cluster create) and removes (on cluster delete) the CCM tag to user-owned subnets, security groups, and route tables.

If we cannot do either (a) or (b) in a minor release, then I think we need to address this issue before the v2.0.0.

richardcase · 2022-11-16T08:14:05Z

Thanks for creating this for discussion @dlipovetsky. My thoughts are:

(a) we can't stop applying the tag to user-owned subnets because this was a specific feature request (#2584, from @scottslowe). Reading the original request, this seems like a reasonable feature to have imho.
(b) If we add the CCM tag to anything, then we should be removing the added tag when we delete the cluster. And as you've pointed out, we'd need to know that CAPA added the tag and not the user (by using another tag?).

I think that (b) is the way to go out of the 2 options.

If we cannot do either (a) or (b) in a minor release, then I think we need to address this issue before the v2.0.0.

If we go with either option, then we are not changing the API and so from that point of view changing in a minor version is OK. The question is then are we changing the behaviour in a non-backwards compatible way? I think as long as we have a way to determine that we added the CCM tag and only act on delete when we see we added the tag, then i think the behaviour stays the same for existing clusters but changes for newly created clusters. If we then start adding/deleting ccm tags on sg and route tables this is new behaviour which we add in minor versions.

That's a long way of saying that I think we can do (b) in a minor version bump, so v2.1.0

What does everyone else think?

richardcase · 2022-11-16T09:23:09Z

And from the discussions on slack (https://kubernetes.slack.com/archives/CD6U2V71N/p1668469376275609 and https://kubernetes.slack.com/archives/CD6U2V71N/p1668587787725479):

should we even be adding the tag to BYO infra? (@MarcusNoble)

If we reverted the behaviour of adding the tags to the subnets then this is a change in behaviour that was requested in #2584.

richardcase · 2022-11-16T09:26:10Z

If we did want to remove the tagging of user managed resources we could introduce this via a feature flag?

Ankitasw · 2022-11-16T09:53:53Z

Even I am not in favor of option (a) because of the same reason listed by @richardcase. So option (b) seems to be the way forward to me as well.

should we even be adding the tag to BYO infra? (@MarcusNoble)

Since this is a supported behavior, IMO we should not change it at this point. And we have seen users benefitting from this, so -1 from me.

AverageMarcus · 2022-11-16T10:26:13Z

I was unaware of the previous user request to support this. While I’d still prefer if user provided infra was wholly set up by the user, I can see it is beneficial to others already so I agree to go with option B.

richardcase · 2022-11-16T10:41:45Z

@MarcusNoble @Ankitasw - if we had an option (c) which was to gather details of usage of this current behaviour from people like @scottslowe and then consider option (a) in the future - if we give notice of deprecation of this behaviour to give time for users to change their usage and update any automation. Would this change your thoughts?

pydctw · 2022-11-16T16:07:42Z

Just catching up on the issue.

Regarding the PR that added CCM tags to BYOI subnets, I implemented them per a user request and it was only done for subnets per doc, meaning there were no mention of security groups or other resources.

Agree with @dlipovetsky - we should have deleted the tags CAPA added to the subnets - it was one of my early PRs in CAPA so haven't thought about it too carefully.

Now, regarding options suggested

(a) CAPA stops applying the CCM tag to user-owned subnets
(b) CAPA applies (on cluster create) and removes (on cluster delete) the CCM tag to user-owned subnets, security groups, and route tables.

I am ok with either options as long as the pattern of applying/removing CCM tags are consistent across resources. The only thing is that I know some of our customers rely on (a) right now so we will need to give advance warning/time for them.

yastij · 2022-11-16T16:13:55Z

catching up on the thread, I think (b) is reasonable, I don't think we can consider it as an API breaking change so I think 2.1.0 would make sense

scottslowe · 2022-11-16T17:05:07Z

Also just catching up (I'm on holiday in Spain): Since my request seems to be the culprit that started a portion of this, I figured I should weigh in. 😄

The use case for me was (and is) that I use an infrastructure-as-code tool to stand up a set of shared infrastructure (VPC, subnets, route tables, gateways, etc.) that are then used/leveraged by multiple workload clusters. It's possible that I could modify my IaC code to better handle the tags, but it would be far better if CAPA could manage the tags as part of cluster lifecycle. For me, therefore, option B (removing tags on cluster delete) would be preferable/ideal.

I'm happy to provide additional information if needed.

dlipovetsky · 2022-11-16T17:44:53Z

Thanks all for your thoughtful comments!

It sounds like we agree to go with

(b) CAPA applies (on cluster create) and removes (on cluster delete) the CCM tag to user-owned subnets, security groups, and route tables.

And we agree that the remaining work () can be addressed in a 2.x minor release.

Apply the CCM tag (kubernetes.io/cluster/<name>: shared) to user-owned security groups, route tables, and any other resources that the CCM might use.
Remove the CCM tag from user-owned resources when reconciling cluster delete.

I will create GH issues for these.

Update: I created #3860 for route tables, and amended #3481 for security groups.

dlipovetsky · 2022-11-16T18:01:11Z

I’d still prefer if user provided infra was wholly set up by the user

@AverageMarcus I recognize your concern. But I'm not sure that's what users expect. For example, for every LoadBalancer-type Service, the CCM itself creates a rule in the user-owned security group. Users don't expect to create these rules on behalf of the CCM.

Moreover, without the CCM tag applied to user-owned resources, the CCM will fail to function. Since the CCM tag is specific to the cluster, it should be applied on cluster create, and removed on cluster delete. While we can make the user responsible for these two actions, it seems like CAPA is in a good position to make them on behalf of the user.

I would like to rely on the "principle of least surprise" here: Let's document why CAPA applies the CCM tag, namely, to ensure CCM can do its job with user-owned resources, and why CAPA removes this tag on cluster delete.

dlipovetsky · 2022-11-17T20:04:51Z

/triage accepted
/priority important-longterm

k8s-triage-robot · 2024-01-19T07:58:53Z

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot · 2024-04-18T08:15:57Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-05-18T08:41:19Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

scottslowe · 2024-05-18T11:39:24Z

/remove-lifecycle rotten

k8s-triage-robot · 2024-08-16T12:34:17Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-09-15T13:06:35Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 15, 2022

dlipovetsky mentioned this issue Nov 15, 2022

feat: Tag for CCM the security group provided as an override for the "sg-lb" security group role #3844

Closed

4 tasks

dlipovetsky mentioned this issue Nov 15, 2022

Add CCM tag to routing tables #3838

Merged

4 tasks

This was referenced Nov 17, 2022

Unmanaged SecurityGroups should also get tagged for CCM #3481

Closed

Manage CCM tag on user-owned AWS Route Tables #3860

Open

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 18, 2024

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 18, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 16, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CAPA applies AWS cloud-controller-manager tag to user-owned subnet, but does not delete the tag on cluster delete, leaving the subnet unusable by future clusters #3854

CAPA applies AWS cloud-controller-manager tag to user-owned subnet, but does not delete the tag on cluster delete, leaving the subnet unusable by future clusters #3854

dlipovetsky commented Nov 15, 2022 •

edited

Loading

dlipovetsky commented Nov 15, 2022

richardcase commented Nov 16, 2022 •

edited

Loading

richardcase commented Nov 16, 2022

richardcase commented Nov 16, 2022

Ankitasw commented Nov 16, 2022

AverageMarcus commented Nov 16, 2022

richardcase commented Nov 16, 2022

pydctw commented Nov 16, 2022

yastij commented Nov 16, 2022

scottslowe commented Nov 16, 2022

dlipovetsky commented Nov 16, 2022 •

edited

Loading

dlipovetsky commented Nov 16, 2022

dlipovetsky commented Nov 17, 2022

k8s-triage-robot commented Jan 19, 2024

k8s-triage-robot commented Apr 18, 2024

k8s-triage-robot commented May 18, 2024

scottslowe commented May 18, 2024

k8s-triage-robot commented Aug 16, 2024

k8s-triage-robot commented Sep 15, 2024

CAPA applies AWS cloud-controller-manager tag to user-owned subnet, but does not delete the tag on cluster delete, leaving the subnet unusable by future clusters #3854

CAPA applies AWS cloud-controller-manager tag to user-owned subnet, but does not delete the tag on cluster delete, leaving the subnet unusable by future clusters #3854

Comments

dlipovetsky commented Nov 15, 2022 • edited Loading

dlipovetsky commented Nov 15, 2022

richardcase commented Nov 16, 2022 • edited Loading

richardcase commented Nov 16, 2022

richardcase commented Nov 16, 2022

Ankitasw commented Nov 16, 2022

AverageMarcus commented Nov 16, 2022

richardcase commented Nov 16, 2022

pydctw commented Nov 16, 2022

yastij commented Nov 16, 2022

scottslowe commented Nov 16, 2022

dlipovetsky commented Nov 16, 2022 • edited Loading

dlipovetsky commented Nov 16, 2022

dlipovetsky commented Nov 17, 2022

k8s-triage-robot commented Jan 19, 2024

k8s-triage-robot commented Apr 18, 2024

k8s-triage-robot commented May 18, 2024

scottslowe commented May 18, 2024

k8s-triage-robot commented Aug 16, 2024

k8s-triage-robot commented Sep 15, 2024

dlipovetsky commented Nov 15, 2022 •

edited

Loading

richardcase commented Nov 16, 2022 •

edited

Loading

dlipovetsky commented Nov 16, 2022 •

edited

Loading