
feat: Allow communication between pods on workers and pods using the primary cluster security group #892

Merged: 6 commits, May 30, 2020

Conversation

@itssimon (Contributor) commented May 26, 2020

PR o'clock

Description

Adds two security group rules allowing pods running on worker nodes to communicate with pods using the primary cluster security group (such as pods running on Fargate).

See #866 (comment).
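
As a rough sketch, the pair of rules could look like the following. The resource name cluster_primary_ingress_workers and the port/protocol values match the plan output quoted later in this thread; the counterpart resource name and the two locals are assumptions about the module's internals, not necessarily the exact code in this PR.

resource "aws_security_group_rule" "cluster_primary_ingress_workers" {
  description              = "Allow pods running on workers to send communication to cluster primary security group (e.g. Fargate pods)."
  type                     = "ingress"
  protocol                 = "-1"
  from_port                = 0
  to_port                  = 65535
  security_group_id        = local.cluster_primary_security_group_id # primary cluster SG
  source_security_group_id = local.worker_security_group_id          # assumed local name
}

resource "aws_security_group_rule" "workers_ingress_cluster_primary" {
  description              = "Allow pods running on workers to receive communication from cluster primary security group (e.g. Fargate pods)."
  type                     = "ingress"
  protocol                 = "-1"
  from_port                = 0
  to_port                  = 65535
  security_group_id        = local.worker_security_group_id          # assumed local name
  source_security_group_id = local.cluster_primary_security_group_id # primary cluster SG
}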

Checklist

@itssimon changed the title from "Allow communication between pods on workers and pods using the primary cluster security group" to "feat: Allow communication between pods on workers and pods using the primary cluster security group" on May 26, 2020
@itssimon mentioned this pull request on May 26, 2020
@dpiddockcmp (Contributor)

This is a potentially breaking change as it may conflict with rules put in place by users already running mixed clusters. I think there should be a variable, default false, to enable creation of these rules. We can then remove it in a future major release.
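
A minimal sketch of that opt-in flag, assuming the variable name that shows up later in this thread; each of the two rules would then get count = var.worker_create_cluster_primary_security_group_rules ? 1 : 0 (whether create_eks is also folded into that expression is not shown here).

variable "worker_create_cluster_primary_security_group_rules" {
  description = "Whether to create security group rules to allow communication between pods on workers and pods using the primary cluster security group."
  type        = bool
  default     = false
}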

@itssimon (Contributor, Author)

Agreed @dpiddockcmp, and done.

(Resolved review comments on workers.tf and local.tf, now outdated)
@dpiddockcmp (Contributor) left a comment


Was playing around testing this PR.

Created a 1.13 cluster with worker_create_cluster_primary_security_group_rules = true. Everything worked great. I then tried to upgrade to 1.14, and after the eks_cluster upgrade finished, Terraform threw an error:

Error: Provider produced inconsistent final plan

When expanding the plan for
module.eks.aws_security_group_rule.cluster_primary_ingress_workers[0] to
include new values learned so far during apply, provider
"registry.terraform.io/-/aws" produced an invalid new value for
.security_group_id: was cty.StringVal(""), but now
cty.StringVal("sg-0892893c8410071f4").

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

Applying a second time was fine.

I don't think this is something to worry about long term. The variable that enables these security group rules defaults to false, so users upgrading 1.13 clusters won't naturally hit the bug. And 1.13 is being removed at the end of June, just a month away.

@barryib (Member) commented May 29, 2020

Error: Provider produced inconsistent final plan

When expanding the plan for
module.eks.aws_security_group_rule.cluster_primary_ingress_workers[0] to
include new values learned so far during apply, provider
"registry.terraform.io/-/aws" produced an invalid new value for
.security_group_id: was cty.StringVal(""), but now
cty.StringVal("sg-0892893c8410071f4").

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

This means that module.eks.aws_security_group_rule.cluster_primary_ingress_workers[0].security_group_id was "" during the plan phase and became sg-0892893c8410071f4 during the apply.

I think it's because of:

cluster_primary_security_group_id = var.cluster_version >= 1.14 ? element(concat(aws_eks_cluster.this[*].vpc_config[0].cluster_security_group_id, list("")), 0) : null

local.cluster_primary_security_group_id is empty during the plan phase for all clusters created before Kubernetes 1.14 / platform version eks.3. So during the apply phase, the cluster is updated and the primary security group is created by AWS. That is probably the inconsistency.

You can see in a terraform plan that there is no security_group_id for the cluster_primary_ingress_workers SG rule:

  # module.eks.aws_security_group_rule.cluster_primary_ingress_workers[0] will be created
  + resource "aws_security_group_rule" "cluster_primary_ingress_workers" {
      + description              = "Allow pods running on workers to send communication to cluster primary security group (e.g. Fargate pods)."
      + from_port                = 0
      + id                       = (known after apply)
      + protocol                 = "-1"
      + self                     = false
      + source_security_group_id = "sg-0d741dba799316523"
      + to_port                  = 65535
      + type                     = "ingress"
    }

So we have to ensure correct ordering during resource creation, so that Terraform waits for the SG to be created, or at least marks the security_group_id as unknown.

Here is how I made it work (but I don't know if it breaks something else):

locals {
  # On 1.14+ clusters, read the primary security group ID through the
  # null_data_source so Terraform treats it as unknown until the cluster exists.
  cluster_primary_security_group_id = var.cluster_version >= 1.14 ? coalescelist(data.null_data_source.cluster[*].outputs["cluster_primary_security_group_id"], [""])[0] : null
}

data "null_data_source" "cluster" {
  count = var.create_eks ? 1 : 0

  inputs = {
    # Forces a dependency on the EKS cluster's cluster_security_group_id attribute.
    cluster_primary_security_group_id = aws_eks_cluster.this[0].vpc_config[0].cluster_security_group_id
  }
}

With the null_data_source, Terraform waits for the cluster_security_group_id attribute to be available.

You can see the difference in the plan output (security_group_id = (known after apply)):

  # module.eks.aws_security_group_rule.cluster_primary_ingress_workers[0] will be created
  + resource "aws_security_group_rule" "cluster_primary_ingress_workers" {
      + description              = "Allow pods running on workers to send communication to cluster primary security group (e.g. Fargate pods)."
      + from_port                = 0
      + id                       = (known after apply)
      + protocol                 = "-1"
      + security_group_id        = (known after apply)
      + self                     = false
      + source_security_group_id = "sg-01ee228ae43275c29"
      + to_port                  = 65535
      + type                     = "ingress"
    }

This will also make the output cluster_primary_security_group_id wait for the SG (cluster_security_group_id) to be available.
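
For reference, a sketch of what that output looks like when it is fed from the local (the description text here is an assumption):

output "cluster_primary_security_group_id" {
  description = "The cluster primary security group ID created by the EKS cluster on 1.14 or later."
  value       = local.cluster_primary_security_group_id
}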

@dpiddockcmp @itssimon Does this work for you? Do you see any downsides or side effects of doing this?

@dpiddockcmp I assigned this PR to you, so you can squash and merge it when it's ready ^^

@dpiddockcmp (Contributor)

I don't think the added complexity of using a null_data_source here is necessary. I also tried removing the local so that the security group rules directly referenced aws_eks_cluster.this[0].vpc_config[0].cluster_security_group_id. Same result when updating a 1.13 cluster, but only for the aws_security_group_rule.cluster_primary_ingress_workers resource. The other rule applied correctly.

I guess don't enable this on a 1.13 cluster 😉
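
For comparison, a sketch of the direct reference described above, with the local removed (a simplification of the idea, not the exact diff; the source_security_group_id local is still an assumed name):

resource "aws_security_group_rule" "cluster_primary_ingress_workers" {
  count = var.worker_create_cluster_primary_security_group_rules ? 1 : 0

  description              = "Allow pods running on workers to send communication to cluster primary security group (e.g. Fargate pods)."
  type                     = "ingress"
  protocol                 = "-1"
  from_port                = 0
  to_port                  = 65535
  security_group_id        = aws_eks_cluster.this[0].vpc_config[0].cluster_security_group_id
  source_security_group_id = local.worker_security_group_id # assumed local name
}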

@barryib (Member) commented May 30, 2020

I don't think the added complexity of using a null_data_source here is necessary. I also tried removing the local so that the security group rules directly referenced aws_eks_cluster.this[0].vpc_config[0].cluster_security_group_id. Same result when updating a 1.13 cluster, but only for the aws_security_group_rule.cluster_primary_ingress_workers resource. The other rule applied correctly.

Ohh, that was the first thing I tried, but no luck. Maybe I didn't test it correctly ^^

@dpiddockcmp merged commit 3fefc2a into terraform-aws-modules:master on May 30, 2020
@github-actions (bot)

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators on Nov 18, 2022