Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CAPA applies AWS cloud-controller-manager tag to user-owned subnet, but does not delete the tag on cluster delete, leaving the subnet unusable by future clusters #3854

Open
dlipovetsky opened this issue Nov 15, 2022 · 19 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@dlipovetsky
Copy link
Contributor

dlipovetsky commented Nov 15, 2022

/kind bug

What steps did you take and what happened:
The cloud-controller-manager (CCM) can only use tagged resources; if they're missing, it will not create the ELB to back the LoadBalancer-type Service.

The tag, which I'll refer to as the "CCM tag," is: kubernetes.io/cluster/<name>: [owned|shared]. Importantly, a resource may have exactly one such tag. Otherwise, the CCM will return an error.

CAPA applies the CCM tag to various resources that it owns. That's fine. However, with some recent changes on the main branch, CAPA applies this tag to some resources that the user owns. These resources outlive the cluster, and CAPA does not remove the CCM tag on cluster delete. Until the CCM tag is removed from them, these resources can't be used for other clusters. That's a problem.

Part of the reason CAPA does not remove the CCM tag from these resources is because it does not know whether it applied the CCM tag, or whether the user applied it themselves. Without storing some state (e.g. using yet another tag), CAPA cannot know this.

What did you expect to happen:
CAPA should not apply the CCM tag to user-owned resources, or it should delete the CCM tag when reconciling the cluster delete.

Anything else you would like to add:
CAPA begin to apply the CCM tag to a user-owned subnet in #2715.

Environment:

  • Cluster-api-provider-aws version:
    v1.5.1

#3854 (comment)

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 15, 2022
@dlipovetsky
Copy link
Contributor Author

Fixing this bug means changing tagging behavior that's been present since v1.0. The possible solutions:

(a) CAPA stops applying the CCM tag to user-owned subnets
(b) CAPA applies (on cluster create) and removes (on cluster delete) the CCM tag to user-owned subnets, security groups, and route tables.

If we cannot do either (a) or (b) in a minor release, then I think we need to address this issue before the v2.0.0.

@richardcase
Copy link
Member

richardcase commented Nov 16, 2022

Thanks for creating this for discussion @dlipovetsky. My thoughts are:

(a) we can't stop applying the tag to user-owned subnets because this was a specific feature request (#2584, from @scottslowe). Reading the original request, this seems like a reasonable feature to have imho.
(b) If we add the CCM tag to anything, then we should be removing the added tag when we delete the cluster. And as you've pointed out, we'd need to know that CAPA added the tag and not the user (by using another tag?).

I think that (b) is the way to go out of the 2 options.

If we cannot do either (a) or (b) in a minor release, then I think we need to address this issue before the v2.0.0.

If we go with either option, then we are not changing the API and so from that point of view changing in a minor version is OK. The question is then are we changing the behaviour in a non-backwards compatible way? I think as long as we have a way to determine that we added the CCM tag and only act on delete when we see we added the tag, then i think the behaviour stays the same for existing clusters but changes for newly created clusters. If we then start adding/deleting ccm tags on sg and route tables this is new behaviour which we add in minor versions.

That's a long way of saying that I think we can do (b) in a minor version bump, so v2.1.0

What does everyone else think?

@richardcase
Copy link
Member

And from the discussions on slack (https://kubernetes.slack.com/archives/CD6U2V71N/p1668469376275609 and https://kubernetes.slack.com/archives/CD6U2V71N/p1668587787725479):

  • should we even be adding the tag to BYO infra? (@MarcusNoble)

If we reverted the behaviour of adding the tags to the subnets then this is a change in behaviour that was requested in #2584.

@richardcase
Copy link
Member

If we did want to remove the tagging of user managed resources we could introduce this via a feature flag?

@Ankitasw
Copy link
Member

Even I am not in favor of option (a) because of the same reason listed by @richardcase. So option (b) seems to be the way forward to me as well.

should we even be adding the tag to BYO infra? (@MarcusNoble)

Since this is a supported behavior, IMO we should not change it at this point. And we have seen users benefitting from this, so -1 from me.

@AverageMarcus
Copy link
Member

I was unaware of the previous user request to support this. While I’d still prefer if user provided infra was wholly set up by the user, I can see it is beneficial to others already so I agree to go with option B.

@richardcase
Copy link
Member

@MarcusNoble @Ankitasw - if we had an option (c) which was to gather details of usage of this current behaviour from people like @scottslowe and then consider option (a) in the future - if we give notice of deprecation of this behaviour to give time for users to change their usage and update any automation. Would this change your thoughts?

@pydctw
Copy link
Contributor

pydctw commented Nov 16, 2022

Just catching up on the issue.

Regarding the PR that added CCM tags to BYOI subnets, I implemented them per a user request and it was only done for subnets per doc, meaning there were no mention of security groups or other resources.

Agree with @dlipovetsky - we should have deleted the tags CAPA added to the subnets - it was one of my early PRs in CAPA so haven't thought about it too carefully.

Now, regarding options suggested

(a) CAPA stops applying the CCM tag to user-owned subnets
(b) CAPA applies (on cluster create) and removes (on cluster delete) the CCM tag to user-owned subnets, security groups, and route tables.

I am ok with either options as long as the pattern of applying/removing CCM tags are consistent across resources. The only thing is that I know some of our customers rely on (a) right now so we will need to give advance warning/time for them.

@yastij
Copy link
Member

yastij commented Nov 16, 2022

catching up on the thread, I think (b) is reasonable, I don't think we can consider it as an API breaking change so I think 2.1.0 would make sense

@scottslowe
Copy link
Contributor

Also just catching up (I'm on holiday in Spain): Since my request seems to be the culprit that started a portion of this, I figured I should weigh in. 😄

The use case for me was (and is) that I use an infrastructure-as-code tool to stand up a set of shared infrastructure (VPC, subnets, route tables, gateways, etc.) that are then used/leveraged by multiple workload clusters. It's possible that I could modify my IaC code to better handle the tags, but it would be far better if CAPA could manage the tags as part of cluster lifecycle. For me, therefore, option B (removing tags on cluster delete) would be preferable/ideal.

I'm happy to provide additional information if needed.

@dlipovetsky
Copy link
Contributor Author

dlipovetsky commented Nov 16, 2022

Thanks all for your thoughtful comments!

It sounds like we agree to go with

(b) CAPA applies (on cluster create) and removes (on cluster delete) the CCM tag to user-owned subnets, security groups, and route tables.

And we agree that the remaining work () can be addressed in a 2.x minor release.

  1. Apply the CCM tag (kubernetes.io/cluster/<name>: shared) to user-owned security groups, route tables, and any other resources that the CCM might use.
  2. Remove the CCM tag from user-owned resources when reconciling cluster delete.

I will create GH issues for these.

Update: I created #3860 for route tables, and amended #3481 for security groups.

@dlipovetsky
Copy link
Contributor Author

I’d still prefer if user provided infra was wholly set up by the user

@AverageMarcus I recognize your concern. But I'm not sure that's what users expect. For example, for every LoadBalancer-type Service, the CCM itself creates a rule in the user-owned security group. Users don't expect to create these rules on behalf of the CCM.

Moreover, without the CCM tag applied to user-owned resources, the CCM will fail to function. Since the CCM tag is specific to the cluster, it should be applied on cluster create, and removed on cluster delete. While we can make the user responsible for these two actions, it seems like CAPA is in a good position to make them on behalf of the user.

I would like to rely on the "principle of least surprise" here: Let's document why CAPA applies the CCM tag, namely, to ensure CCM can do its job with user-owned resources, and why CAPA removes this tag on cluster delete.

@dlipovetsky
Copy link
Contributor Author

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Nov 17, 2022
@k8s-triage-robot
Copy link

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 18, 2024
@scottslowe
Copy link
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 18, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 16, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Projects
None yet
Development

No branches or pull requests

9 participants