aws_ecs_service no longer retries when target group is not attached to load balancer #3495
Comments
@tomelliff can you please provide the full error message?
We removed the blanket `InvalidParameterException` retry. I would personally say that banking on retries for Terraform eventually creating the appropriate resource is not an ideal scenario. There are many factors that would contribute to this error still occurring even if retries are put in place:
Adding the retries for the load balancer listener might provide some convenience and help sometimes, but it does not fix the underlying requirement for ECS wanting the explicit ordering of resources before it.
I would personally disagree here, as the explicit `depends_on` is what guarantees the ordering ECS requires. Until Terraform core supports configuring something like waiting for all child resources of a parent resource to complete, the explicit `depends_on` is the recommended workaround. I hope this makes sense!
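For illustration only, a minimal sketch of the explicit ordering being described, using hypothetical resource names (`aws_alb_listener.front_end`, `aws_ecs_service.app`) that are not taken from this issue:

```hcl
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = "${aws_ecs_cluster.main.id}"
  task_definition = "${aws_ecs_task_definition.app.arn}"
  desired_count   = 1

  load_balancer {
    target_group_arn = "${aws_alb_target_group.app.arn}"
    container_name   = "app"
    container_port   = 80
  }

  # Wait for the listener so the target group is already attached to the
  # load balancer by the time ECS validates the service.
  depends_on = ["aws_alb_listener.front_end"]
}
```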
Full error was:
I think I saw something in the code base a long time back that looked at the message as well as the status code, so maybe that could be used here if you want to minimise how wide that retry logic is. I'm not sure if there's a good way to have different retry timeouts in Terraform either, but the listener creation is barely slower than instant, so if there is, dropping that to a tiny amount would work for me.

The explanation makes sense, but I'm stuck right now because I can't see a way of linking the listener from the parent module to the child module, even in a hacky way, so I can't force a dependency in Terraform without core enabling modules to depend on things and not do anything in the module until the depends_on is complete. Right now 1.9.0 breaks any time we deploy a new environment/service and we need to retry. I could probably force GitLab CI to retry the job automatically, but I'd rather not have that there long term.

Longer term, the plan is to move to a single ALB per ECS cluster and add listener rules for each service and environment once the auto priority PR is merged, but I'm not sure if I'm going to have the same race condition there.
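As a rough illustration of that per-service listener rule pattern (hypothetical names, not from this issue), with `priority` still set explicitly since the automatic priority assignment mentioned above was not yet available:

```hcl
resource "aws_alb_listener_rule" "service" {
  listener_arn = "${aws_alb_listener.shared.arn}"
  priority     = 100 # would no longer need managing once auto priority lands

  action {
    type             = "forward"
    target_group_arn = "${aws_alb_target_group.service.arn}"
  }

  condition {
    field  = "host-header"
    values = ["service.example.com"]
  }
}
```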
Sorry for the delay here. 😅 The pull request to allow some retries for this condition has been merged into master and will release with version 1.33.0 of the AWS provider, likely middle of next week. Please note: we'll continue to recommend the usage of `depends_on` for this situation.
This has been released in version 1.33.0 of the AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.
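For reference, a minimal provider version constraint for picking up that release (the region is just an example, not from this issue):

```hcl
provider "aws" {
  version = "~> 1.33"
  region  = "eu-central-1" # example region
}
```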
FWIW @bflad, I just tested this and still got the error...

```hcl
resource "aws_ecs_service" "app" {
  # ...

  load_balancer {
    target_group_arn = "${aws_alb_target_group.main.id}"
    container_name   = "${var.container_name}"
    container_port   = "${var.container_port}"
  }

  depends_on = [
    "aws_alb_target_group.main",
  ]
}
```

Tested with:
I believe this issue is not fixed. I get the above error on every run. I'm using
I can't add an explicit `depends_on` in my setup.
Can confirm this is not fixed, or the retries are not working.
Can you share a minimal, complete example of Terraform code that errors out with that error? I've just run the acceptance test again. I have since refactored how we deploy ECS services behind ALBs so I don't have this issue any more, but it was definitely working fine and no longer erroring for us as of a month ago, when previously it would error regularly.

Test output:
Thanks for your work on this so far. Here's a much reduced version of my setup. I'm afraid it's still fairly long since it uses a third-party module to create a VPC to isolate it from other resources. Also, I've kept the ECS service in a separate module.

./vars.tf

```hcl
variable "aws_region" {
  default = "eu-central-1"
}

variable "azs" {
  default = [
    "eu-central-1a",
    "eu-central-1b",
  ]
}

variable "cidr_private" {
  default = [
    "10.0.1.0/24",
    "10.0.2.0/24",
  ]
}

variable "cidr_public" {
  default = [
    "10.0.101.0/24",
    "10.0.102.0/24",
  ]
}
```

./main.tf

```hcl
provider "aws" {
  region = "${var.aws_region}"
}

module "base_vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "1.46.0"

  name = "tf-vpc"
  cidr = "10.0.0.0/16"

  azs = [
    "${var.azs}",
  ]

  private_subnets = [
    "${var.cidr_private}",
  ]

  public_subnets = [
    "${var.cidr_public}",
  ]

  enable_nat_gateway = true
  single_nat_gateway = true
}

resource "aws_alb" "alb_for_fargate" {
  internal = false

  subnets = [
    "${module.base_vpc.public_subnets}",
  ]
}

resource "aws_ecs_cluster" "my_cluster" {
  name = "mycluster"
}

module "myapp_instance" {
  source = "myapp/"

  alb_arn         = "${aws_alb.alb_for_fargate.arn}"
  ecs_cluster_arn = "${aws_ecs_cluster.my_cluster.arn}"
  private_subnets = ["${module.base_vpc.private_subnets}"]
  vpc_arn         = "${module.base_vpc.vpc_id}"
}
```

./myapp/vars.tf

```hcl
variable "alb_arn" {}

variable "ecs_cluster_arn" {}

variable "private_subnets" {
  type = "list"
}

variable "vpc_arn" {}
```

./myapp/main.tf

```hcl
resource "aws_alb_target_group" "my_tg" {
  protocol    = "HTTP"
  port        = "80"
  vpc_id      = "${var.vpc_arn}"
  target_type = "ip"
}

resource "aws_ecr_repository" "my_repo" {
  name = "myapp"
}

resource "aws_ecs_task_definition" "my_td" {
  family                   = "myapp"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512

  container_definitions = <<EOF
[
  {
    "name": "myapp",
    "image": "myapp:latest",
    "networkMode": "awsvpc",
    "portMappings": [
      {
        "containerPort": 80,
        "protocol": "tcp"
      }
    ],
    "requiresCompatibilities": [
      "FARGATE"
    ]
  }
]
EOF
}

resource "aws_ecs_service" "my_service" {
  name            = "myapp"
  cluster         = "${var.ecs_cluster_arn}"
  launch_type     = "FARGATE"
  task_definition = "${aws_ecs_task_definition.my_td.arn}"

  network_configuration = {
    subnets = ["${var.private_subnets}"]
  }

  load_balancer {
    target_group_arn = "${aws_alb_target_group.my_tg.arn}"
    container_name   = "myapp"
    container_port   = 80
  }
}
```
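A hedged sketch of the workaround the maintainers recommend, applied to a config shaped like the one above; the listener resource and its name are hypothetical additions, not part of the original report:

```hcl
# ./myapp/main.tf (hypothetical addition)
resource "aws_alb_listener" "my_listener" {
  load_balancer_arn = "${var.alb_arn}"
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = "${aws_alb_target_group.my_tg.arn}"
  }
}

resource "aws_ecs_service" "my_service" {
  # ...arguments as above...

  # Explicit ordering: the listener attaches the target group to the ALB
  # before ECS validates the service's load_balancer block.
  depends_on = ["aws_alb_listener.my_listener"]
}
```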
This is not fixed! Using the `load_balancer` properties for `aws_ecs_service` fails with: `InvalidParameterException: The target group with targetGroupArn arn:aws:elasticloadbalancing:us-east-1:367555685970:targetgroup/depends-on-test-dev/b96ad7836a2fe8e7 does not have an associated load balancer.` The problem is the `load_balancer` option for `aws_ecs_service`: it doesn't matter if it depends on target group creation, it fails trying to attach the service to the target group. If you run this twice, the second run will work because the target group was already attached during the first run.
I still have this issue too. It fails the first time, but the second time it works.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!
Terraform Version
1.9.0 AWS provider, all TF versions
Affected Resource(s)
aws_ecs_service
Terraform Configuration Files
Taken from testAccAWSEcsService_healthCheckGracePeriodSeconds but removing the `depends_on` on the `lb_listener` resource:
Expected Behavior
The ECS service should be created at the same time as the LB listener because they both depend on the LB target group. At this point the target group may not yet be attached to the load balancer because the LB listener resource hasn't finished being created. This throws an `InvalidParameterException` which, before @bflad's change in #3240, was then retried.
Actual Behavior
Now it just throws the error and doesn't retry.
Steps to Reproduce
terraform apply
References
Looks like we just need to re-add the `InvalidParameterException` retry, but I'm wary of doing so without understanding why @bflad removed it in the first place. We should probably remove that `depends_on` in the acceptance tests as well, although I think it's still needed for the IAM role policy.

Note that I'm unable to add a `depends_on` to the listener rule because I have nested modules where one module creates an ECS service (potentially a worker-based, non-load-balanced service) and another module, which creates the load balancer and sets up security groups etc., uses the ECS service module, telling it to use the load-balanced service resource. I can provide the config if necessary, but ultimately I don't think we should be forcing people to put in a `depends_on` for a race condition that will resolve itself if we simply retry for as much as a couple of seconds.
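To make that module constraint concrete, a hedged sketch of the nested-module shape described above, with hypothetical module and variable names (none of these are from the actual config):

```hcl
# ./modules/ecs-service/main.tf -- knows nothing about the listener,
# so it cannot declare a depends_on against it.
resource "aws_ecs_service" "this" {
  name            = "${var.name}"
  cluster         = "${var.cluster_arn}"
  task_definition = "${var.task_definition_arn}"
  desired_count   = 1

  load_balancer {
    target_group_arn = "${var.target_group_arn}"
    container_name   = "${var.container_name}"
    container_port   = "${var.container_port}"
  }
}

# ./modules/load-balanced-service/main.tf -- creates the listener and
# wraps the service module.
resource "aws_alb_listener" "this" {
  load_balancer_arn = "${var.alb_arn}"
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = "${aws_alb_target_group.this.arn}"
  }
}

module "service" {
  source              = "../ecs-service"
  name                = "${var.name}"
  cluster_arn         = "${var.cluster_arn}"
  task_definition_arn = "${var.task_definition_arn}"
  target_group_arn    = "${aws_alb_target_group.this.arn}"
  container_name      = "${var.container_name}"
  container_port      = "${var.container_port}"

  # Terraform 0.11 modules do not support depends_on, so there is no way
  # to make the service inside this module wait for aws_alb_listener.this.
}
```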