Fix AWS EMR Instance Group Deletion Errors #10425

joelthompson · 2019-10-08T17:45:18Z

In situations where Terraform needs to replace an aws_emr_cluster
resource that has aws_emr_instance_group resources associated with it,
Terraform tries to execute a destroy on the instance group, but it fails
as the notion of a "destroy" on an instance group is to set the number
of instances to zero, but AWS doesn't let you modify the count of
instances in an instance group on an EMR cluster. This fixes the issue
by treating an instance group that has been terminated as no longer
existing, so Terraform won't try to execute a "destroy" and not error
out.

Community Note

Please vote on this pull request by adding a 👍 reaction to the original pull request comment to help the community and maintainers prioritize this request
Please do not leave "+1" comments, they generate extra noise for pull request followers and do not help prioritize the request

Closes #1355
Closes #9400

Release note for CHANGELOG:

Gracefully handle deletion of aws_emr_instance_group resources when the underlying cluster has been terminated.

Output from acceptance testing:

Note that one test failed due to EC2 resource limits in my test account but succeeded on a retry:

$ make testacc TESTARGS='-run=TestAccAWSEMRInstanceGroup'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -count 1 -parallel 20 -run=TestAccAWSEMRInstanceGroup -timeout 120m
go: finding github.com/terraform-providers/terraform-provider-tls v2.1.1+incompatible
go: finding github.com/terraform-providers/terraform-provider-tls v2.1.1+incompatible
?       github.com/terraform-providers/terraform-provider-aws   [no test files]
=== RUN   TestAccAWSEMRInstanceGroup_basic
=== PAUSE TestAccAWSEMRInstanceGroup_basic
=== RUN   TestAccAWSEMRInstanceGroup_BidPrice
=== PAUSE TestAccAWSEMRInstanceGroup_BidPrice
=== RUN   TestAccAWSEMRInstanceGroup_AutoScalingPolicy
=== PAUSE TestAccAWSEMRInstanceGroup_AutoScalingPolicy
=== RUN   TestAccAWSEMRInstanceGroup_InstanceCount
=== PAUSE TestAccAWSEMRInstanceGroup_InstanceCount
=== RUN   TestAccAWSEMRInstanceGroup_EbsConfig_EbsOptimized
=== PAUSE TestAccAWSEMRInstanceGroup_EbsConfig_EbsOptimized
=== CONT  TestAccAWSEMRInstanceGroup_basic
=== CONT  TestAccAWSEMRInstanceGroup_EbsConfig_EbsOptimized
=== CONT  TestAccAWSEMRInstanceGroup_InstanceCount
=== CONT  TestAccAWSEMRInstanceGroup_BidPrice
=== CONT  TestAccAWSEMRInstanceGroup_AutoScalingPolicy
--- FAIL: TestAccAWSEMRInstanceGroup_BidPrice (6.17s)
    testing.go:569: Step 0 error: errors during apply:
        
        Error: Error creating VPC: VpcLimitExceeded: The maximum number of VPCs has been reached.
                status code: 400, request id: ea160175-440a-4c9f-899b-4420e0b0d07e
        
          on /tmp/tf-test382298782/main.tf line 28:
          (source code not available)
        
        
--- PASS: TestAccAWSEMRInstanceGroup_EbsConfig_EbsOptimized (558.46s)
--- PASS: TestAccAWSEMRInstanceGroup_basic (565.65s)
--- PASS: TestAccAWSEMRInstanceGroup_AutoScalingPolicy (589.18s)
--- PASS: TestAccAWSEMRInstanceGroup_InstanceCount (852.11s)
FAIL
FAIL    github.com/terraform-providers/terraform-provider-aws/aws       852.157s
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/flatmap      0.003s [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/keyvaluetags 0.006s [no tests to run]
FAIL
make: *** [testacc] Error 1

$ make testacc TESTARGS='-run=TestAccAWSEMRInstanceGroup_BidPrice'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -count 1 -parallel 20 -run=TestAccAWSEMRInstanceGroup_BidPrice -timeout 120m
go: finding github.com/terraform-providers/terraform-provider-tls v2.1.1+incompatible
go: finding github.com/terraform-providers/terraform-provider-tls v2.1.1+incompatible
?       github.com/terraform-providers/terraform-provider-aws   [no test files]
=== RUN   TestAccAWSEMRInstanceGroup_BidPrice
=== PAUSE TestAccAWSEMRInstanceGroup_BidPrice
=== CONT  TestAccAWSEMRInstanceGroup_BidPrice
--- PASS: TestAccAWSEMRInstanceGroup_BidPrice (549.58s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       549.608s
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/flatmap      0.002s [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/keyvaluetags 0.004s [no tests to run]

In situations where Terraform needs to replace an aws_emr_cluster resource that has aws_emr_instance_group resources associated with it, Terraform tries to execute a destroy on the instance group, but it fails as the notion of a "destroy" on an instance group is to set the number of instances to zero, but AWS doesn't let you modify the count of instances in an instance group on an EMR cluster. This fixes the issue by treating an instance group that has been terminated as no longer existing, so Terraform won't try to execute a "destroy" and not error out. Fixes hashicorp#1355 Fixes hashicorp#9400

bflad

Hey @joelthompson 👋 Hope things are well after HashiConf and thanks for contributing this! Other than the code change below, it would be fantastic if we could also ensure there was acceptance testing covering this scenario. 👍 Once these are in this should be good to go. Please reach out with any questions or if you do not have time to implement the feedback.

To cover this functionality, we need to enhance the aws/resource_aws_emr_cluster_test.go a little to include a new testing function that terminates a cluster (and since we're doing this, why not also verify that the aws_emr_cluster resource itself does the right thing when its terminated outside Terraform):

// In aws/resource_aws_emr_cluster_test.go:
func TestAccAWSEMRCluster_disappears(t *testing.T) {
	var cluster emr.Cluster
	r := acctest.RandInt()
	resource.ParallelTest(t, resource.TestCase{
		PreCheck:     func() { testAccPreCheck(t) },
		Providers:    testAccProviders,
		CheckDestroy: testAccCheckAWSEmrDestroy,
		Steps: []resource.TestStep{
			{
				Config: testAccAWSEmrClusterConfig(r),
				Check: resource.ComposeTestCheckFunc(
					testAccCheckAWSEmrClusterExists("aws_emr_cluster.tf-test-cluster", &cluster),
					testAccCheckAWSEmrClusterDisappears(&cluster),
				),
				ExpectNonEmptyPlan: true,
			},
		},
	})
}

func testAccCheckAWSEmrClusterDisappears(cluster *emr.Cluster) resource.TestCheckFunc {
	return func(s *terraform.State) error {
		conn := testAccProvider.Meta().(*AWSClient).emrconn
		id := aws.StringValue(cluster.Id)

		terminateJobFlowsInput := &emr.TerminateJobFlowsInput{
			JobFlowIds: []*string{cluster.Id},
		}

		_, err := conn.TerminateJobFlows(terminateJobFlowsInput)

		if err != nil {
			return err
		}

		input := &emr.ListInstancesInput{
			ClusterId: cluster.Id,
		}
		var output *emr.ListInstancesOutput
		var instanceCount int

		err = resource.Retry(20*time.Minute, func() *resource.RetryError {
			var err error
			output, err = conn.ListInstances(input)

			if err != nil {
				return resource.NonRetryableError(err)
			}

			instanceCount = countEMRRemainingInstances(output, id)

			if instanceCount != 0 {
				return resource.RetryableError(fmt.Errorf("EMR Cluster (%s) has (%d) Instances remaining", id, instanceCount))
			}

			return nil
		})

		if isResourceTimeoutError(err) {
			output, err = conn.ListInstances(input)

			if err == nil {
				instanceCount = countEMRRemainingInstances(output, id)
			}
		}

		if instanceCount != 0 {
			return fmt.Errorf("EMR Cluster (%s) has (%d) Instances remaining", id, instanceCount)
		}

		if err != nil {
			return fmt.Errorf("error waiting for EMR Cluster (%s) Instances to drain: %s", id, err)
		}

		return nil
	}
}

I verified this works as expected:

--- PASS: TestAccAWSEMRCluster_disappears (451.10s)

Now we can use that new function in aws/resource_aws_emr_instance_group_test.go to verify what we are doing here. 🎉 Adding this new test:

// In aws/resource_aws_emr_instance_group_test.go:
func TestAccAWSEMRInstanceGroup_disappears_EmrCluster(t *testing.T) {
	var cluster emr.Cluster
	var ig emr.InstanceGroup
	rInt := acctest.RandInt()
	emrClusterResourceName := "aws_emr_cluster.tf-test-cluster"
	resourceName := "aws_emr_instance_group.task"

	resource.ParallelTest(t, resource.TestCase{
		PreCheck:     func() { testAccPreCheck(t) },
		Providers:    testAccProviders,
		CheckDestroy: testAccCheckAWSEmrInstanceGroupDestroy,
		Steps: []resource.TestStep{
			{
				Config: testAccAWSEmrInstanceGroupConfig_basic(rInt),
				Check: resource.ComposeTestCheckFunc(
					testAccCheckAWSEmrClusterExists(emrClusterResourceName, &cluster),
					testAccCheckAWSEmrInstanceGroupExists(resourceName, &ig),
					testAccCheckAWSEmrClusterDisappears(&cluster),
				),
				ExpectNonEmptyPlan: true,
			},
		},
	})
}

Yields an error, which hopefully this pull request addresses. 😉

--- FAIL: TestAccAWSEMRInstanceGroup_disappears_EmrCluster (432.45s)
    testing.go:630: Error destroying resource! WARNING: Dangling resources
        may exist. The full state and error is shown below.

        Error: errors during apply: error draining EMR Instance Group (ig-1O27W6AYK3EQ7): ValidationException: An instance group may only be modified when the cluster is running or waiting.

aws/resource_aws_emr_instance_group.go

joelthompson · 2019-10-10T04:39:25Z

Hey @bflad -- great to see you at HashiConf and catch up! And I appreciate the really helpful suggestions!

Also a quick note -- I'm being a bit conservative here and only implementing the one known state that we've run into. There could be other states that should trigger ignoring the instance group (e.g., TERMINATING -- see https://github.com/aws/aws-sdk-go/blob/master/service/emr/api.go#L11030 for the complete list of states), but I'd be hesitant to be more speculative and treat other states as effectively being deleted when they actually aren't which could cause Terraform to leak resources. Thoughts? Does this make sense?

Anyway, I've implemented the suggestions and verified the two tests pass as expected:

$ make testacc TESTARGS='-run=TestAccAWSEMRCluster_disappears'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -count 1 -parallel 20 -run=TestAccAWSEMRCluster_disappears -timeout 120m
?       github.com/terraform-providers/terraform-provider-aws   [no test files]
=== RUN   TestAccAWSEMRCluster_disappears
=== PAUSE TestAccAWSEMRCluster_disappears
=== CONT  TestAccAWSEMRCluster_disappears
--- PASS: TestAccAWSEMRCluster_disappears (441.29s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       441.315s
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/flatmap      0.002s [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/keyvaluetags 0.026s [no tests to run]

$ make testacc TESTARGS='-run=TestAccAWSEMRInstanceGroup_EmrClusterDisappears'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -count 1 -parallel 20 -run=TestAccAWSEMRInstanceGroup_EmrClusterDisappears -timeout 120m
?       github.com/terraform-providers/terraform-provider-aws   [no test files]
=== RUN   TestAccAWSEMRInstanceGroup_EmrClusterDisappears
=== PAUSE TestAccAWSEMRInstanceGroup_EmrClusterDisappears
=== CONT  TestAccAWSEMRInstanceGroup_EmrClusterDisappears
--- PASS: TestAccAWSEMRInstanceGroup_EmrClusterDisappears (378.58s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       378.608s
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/flatmap      0.002s [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/keyvaluetags 0.004s [no tests to run]

bflad · 2019-10-10T04:45:36Z

There could be other states that should trigger ignoring the instance group (e.g., TERMINATING

We go either which way on implementing these -- usually we'll automatically remove things on states like TERMINATING since its likely that the operator would want to recreate it since the API is telling us its about to disappear, but its also such an edge case that likely does not matter too much.

joelthompson · 2019-10-10T05:18:36Z

There could be other states that should trigger ignoring the instance group (e.g., TERMINATING

We go either which way on implementing these -- usually we'll automatically remove things on states like TERMINATING since its likely that the operator would want to recreate it since the API is telling us its about to disappear, but its also such an edge case that likely does not matter too much.

This would be so much easier if AWS would just post the state machine that they promise to abide by and we wouldn't have to speculate on things like this. OK, I know, I sometimes have wild and unrealistic fantasies. I personally tend to be overly paranoid and worry about things like, "Is it possible to cancel a termination?" At least with TERMINATING, it should eventually go into TERMINATED and thus will get cleaned if we just look for TERMINATED.

Anyway, I'll go ahead and add just TERMINATING and leave the others for now.

joelthompson · 2019-10-10T05:50:40Z

Anyway, I'll go ahead and add just TERMINATING and leave the others for now.

OK, done. Not sure if there's a good way to test for TERMINATING (versus TERMINATED), though.

$ make testacc TESTARGS='-run=TestAccAWSEMRInstanceGroup_EmrClusterDisappears'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -count 1 -parallel 20 -run=TestAccAWSEMRInstanceGroup_EmrClusterDisappears -timeout 120m
?       github.com/terraform-providers/terraform-provider-aws   [no test files]
=== RUN   TestAccAWSEMRInstanceGroup_EmrClusterDisappears
=== PAUSE TestAccAWSEMRInstanceGroup_EmrClusterDisappears
=== CONT  TestAccAWSEMRInstanceGroup_EmrClusterDisappears
--- PASS: TestAccAWSEMRInstanceGroup_EmrClusterDisappears (358.62s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       358.647s
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/flatmap      0.007s [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/keyvaluetags 0.005s [no tests to run]

bflad

LGTM, thanks @joelthompson 🚀

--- PASS: TestAccAWSEMRInstanceGroup_EmrClusterDisappears (369.87s)
--- PASS: TestAccAWSEMRInstanceGroup_AutoScalingPolicy (556.13s)
--- PASS: TestAccAWSEMRInstanceGroup_basic (569.22s)
--- PASS: TestAccAWSEMRInstanceGroup_EbsConfig_EbsOptimized (581.03s)
--- PASS: TestAccAWSEMRInstanceGroup_InstanceCount (589.60s)
--- PASS: TestAccAWSEMRInstanceGroup_BidPrice (596.88s)
--- PASS: TestAccAWSEMRInstanceGroup_ConfigurationsJson (810.57s)

joelthompson · 2019-10-10T13:57:24Z

Thanks so much @bflad :)

ghost · 2019-10-10T20:43:49Z

This has been released in version 2.32.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

ghost · 2019-11-09T14:51:40Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

joelthompson requested a review from a team October 8, 2019 17:45

ghost added size/XS Managed by automation to categorize the size of a PR. service/emr Issues and PRs that pertain to the emr service. labels Oct 8, 2019

bflad added the bug Addresses a defect in current functionality. label Oct 10, 2019

bflad suggested changes Oct 10, 2019

View reviewed changes

aws/resource_aws_emr_instance_group.go Outdated Show resolved Hide resolved

bflad self-assigned this Oct 10, 2019

bflad added the waiting-response Maintainers are waiting on response from community or contributor. label Oct 10, 2019

joelthompson added 2 commits October 9, 2019 23:59

Address PR Feedback

978cee7

Merge remote-tracking branch 'origin/master' into emr_ig_destroy

a17d633

ghost added size/M Managed by automation to categorize the size of a PR. tests PRs: expanded test coverage. Issues: expanded coverage, enhancements to test infrastructure. and removed size/XS Managed by automation to categorize the size of a PR. labels Oct 10, 2019

joelthompson added 2 commits October 10, 2019 00:03

gofmt

a27a520

Fix missing import

e2dd6c0

ghost removed the waiting-response Maintainers are waiting on response from community or contributor. label Oct 10, 2019

joelthompson requested a review from bflad October 10, 2019 04:39

bflad added this to the v2.32.0 milestone Oct 10, 2019

Also handle TERMINATING state

bfb2dea

bflad approved these changes Oct 10, 2019

View reviewed changes

bflad merged commit 6c5520d into hashicorp:master Oct 10, 2019

bflad added a commit that referenced this pull request Oct 10, 2019

Update CHANGELOG for #10425

3bd0318

joelthompson deleted the emr_ig_destroy branch October 10, 2019 13:57

ghost locked and limited conversation to collaborators Nov 9, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix AWS EMR Instance Group Deletion Errors #10425

Fix AWS EMR Instance Group Deletion Errors #10425

joelthompson commented Oct 8, 2019

bflad left a comment

joelthompson commented Oct 10, 2019

bflad commented Oct 10, 2019

joelthompson commented Oct 10, 2019

joelthompson commented Oct 10, 2019

bflad left a comment

joelthompson commented Oct 10, 2019

ghost commented Oct 10, 2019

ghost commented Nov 9, 2019

Fix AWS EMR Instance Group Deletion Errors #10425

Fix AWS EMR Instance Group Deletion Errors #10425

Conversation

joelthompson commented Oct 8, 2019

Community Note

bflad left a comment

Choose a reason for hiding this comment

joelthompson commented Oct 10, 2019

bflad commented Oct 10, 2019

joelthompson commented Oct 10, 2019

joelthompson commented Oct 10, 2019

bflad left a comment

Choose a reason for hiding this comment

joelthompson commented Oct 10, 2019

ghost commented Oct 10, 2019

ghost commented Nov 9, 2019