Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix AWS EMR Instance Group Deletion Errors #10425

Merged
merged 6 commits into from
Oct 10, 2019

Conversation

joelthompson
Copy link
Contributor

In situations where Terraform needs to replace an aws_emr_cluster
resource that has aws_emr_instance_group resources associated with it,
Terraform tries to execute a destroy on the instance group, but it fails
as the notion of a "destroy" on an instance group is to set the number
of instances to zero, but AWS doesn't let you modify the count of
instances in an instance group on an EMR cluster. This fixes the issue
by treating an instance group that has been terminated as no longer
existing, so Terraform won't try to execute a "destroy" and not error
out.

Community Note

  • Please vote on this pull request by adding a 👍 reaction to the original pull request comment to help the community and maintainers prioritize this request
  • Please do not leave "+1" comments, they generate extra noise for pull request followers and do not help prioritize the request

Closes #1355
Closes #9400

Release note for CHANGELOG:

Gracefully handle deletion of aws_emr_instance_group resources when the underlying cluster has been terminated.

Output from acceptance testing:

Note that one test failed due to EC2 resource limits in my test account but succeeded on a retry:

$ make testacc TESTARGS='-run=TestAccAWSEMRInstanceGroup'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -count 1 -parallel 20 -run=TestAccAWSEMRInstanceGroup -timeout 120m
go: finding github.com/terraform-providers/terraform-provider-tls v2.1.1+incompatible
go: finding github.com/terraform-providers/terraform-provider-tls v2.1.1+incompatible
?       github.com/terraform-providers/terraform-provider-aws   [no test files]
=== RUN   TestAccAWSEMRInstanceGroup_basic
=== PAUSE TestAccAWSEMRInstanceGroup_basic
=== RUN   TestAccAWSEMRInstanceGroup_BidPrice
=== PAUSE TestAccAWSEMRInstanceGroup_BidPrice
=== RUN   TestAccAWSEMRInstanceGroup_AutoScalingPolicy
=== PAUSE TestAccAWSEMRInstanceGroup_AutoScalingPolicy
=== RUN   TestAccAWSEMRInstanceGroup_InstanceCount
=== PAUSE TestAccAWSEMRInstanceGroup_InstanceCount
=== RUN   TestAccAWSEMRInstanceGroup_EbsConfig_EbsOptimized
=== PAUSE TestAccAWSEMRInstanceGroup_EbsConfig_EbsOptimized
=== CONT  TestAccAWSEMRInstanceGroup_basic
=== CONT  TestAccAWSEMRInstanceGroup_EbsConfig_EbsOptimized
=== CONT  TestAccAWSEMRInstanceGroup_InstanceCount
=== CONT  TestAccAWSEMRInstanceGroup_BidPrice
=== CONT  TestAccAWSEMRInstanceGroup_AutoScalingPolicy
--- FAIL: TestAccAWSEMRInstanceGroup_BidPrice (6.17s)
    testing.go:569: Step 0 error: errors during apply:
        
        Error: Error creating VPC: VpcLimitExceeded: The maximum number of VPCs has been reached.
                status code: 400, request id: ea160175-440a-4c9f-899b-4420e0b0d07e
        
          on /tmp/tf-test382298782/main.tf line 28:
          (source code not available)
        
        
--- PASS: TestAccAWSEMRInstanceGroup_EbsConfig_EbsOptimized (558.46s)
--- PASS: TestAccAWSEMRInstanceGroup_basic (565.65s)
--- PASS: TestAccAWSEMRInstanceGroup_AutoScalingPolicy (589.18s)
--- PASS: TestAccAWSEMRInstanceGroup_InstanceCount (852.11s)
FAIL
FAIL    github.com/terraform-providers/terraform-provider-aws/aws       852.157s
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/flatmap      0.003s [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/keyvaluetags 0.006s [no tests to run]
FAIL
make: *** [testacc] Error 1

$ make testacc TESTARGS='-run=TestAccAWSEMRInstanceGroup_BidPrice'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -count 1 -parallel 20 -run=TestAccAWSEMRInstanceGroup_BidPrice -timeout 120m
go: finding github.com/terraform-providers/terraform-provider-tls v2.1.1+incompatible
go: finding github.com/terraform-providers/terraform-provider-tls v2.1.1+incompatible
?       github.com/terraform-providers/terraform-provider-aws   [no test files]
=== RUN   TestAccAWSEMRInstanceGroup_BidPrice
=== PAUSE TestAccAWSEMRInstanceGroup_BidPrice
=== CONT  TestAccAWSEMRInstanceGroup_BidPrice
--- PASS: TestAccAWSEMRInstanceGroup_BidPrice (549.58s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       549.608s
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/flatmap      0.002s [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/keyvaluetags 0.004s [no tests to run]

In situations where Terraform needs to replace an aws_emr_cluster
resource that has aws_emr_instance_group resources associated with it,
Terraform tries to execute a destroy on the instance group, but it fails
as the notion of a "destroy" on an instance group is to set the number
of instances to zero, but AWS doesn't let you modify the count of
instances in an instance group on an EMR cluster. This fixes the issue
by treating an instance group that has been terminated as no longer
existing, so Terraform won't try to execute a "destroy" and not error
out.

Fixes hashicorp#1355
Fixes hashicorp#9400
@joelthompson joelthompson requested a review from a team October 8, 2019 17:45
@ghost ghost added size/XS Managed by automation to categorize the size of a PR. service/emr Issues and PRs that pertain to the emr service. labels Oct 8, 2019
@bflad bflad added the bug Addresses a defect in current functionality. label Oct 10, 2019
Copy link
Contributor

@bflad bflad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @joelthompson 👋 Hope things are well after HashiConf and thanks for contributing this! Other than the code change below, it would be fantastic if we could also ensure there was acceptance testing covering this scenario. 👍 Once these are in this should be good to go. Please reach out with any questions or if you do not have time to implement the feedback.

To cover this functionality, we need to enhance the aws/resource_aws_emr_cluster_test.go a little to include a new testing function that terminates a cluster (and since we're doing this, why not also verify that the aws_emr_cluster resource itself does the right thing when its terminated outside Terraform):

// In aws/resource_aws_emr_cluster_test.go:
func TestAccAWSEMRCluster_disappears(t *testing.T) {
	var cluster emr.Cluster
	r := acctest.RandInt()
	resource.ParallelTest(t, resource.TestCase{
		PreCheck:     func() { testAccPreCheck(t) },
		Providers:    testAccProviders,
		CheckDestroy: testAccCheckAWSEmrDestroy,
		Steps: []resource.TestStep{
			{
				Config: testAccAWSEmrClusterConfig(r),
				Check: resource.ComposeTestCheckFunc(
					testAccCheckAWSEmrClusterExists("aws_emr_cluster.tf-test-cluster", &cluster),
					testAccCheckAWSEmrClusterDisappears(&cluster),
				),
				ExpectNonEmptyPlan: true,
			},
		},
	})
}

func testAccCheckAWSEmrClusterDisappears(cluster *emr.Cluster) resource.TestCheckFunc {
	return func(s *terraform.State) error {
		conn := testAccProvider.Meta().(*AWSClient).emrconn
		id := aws.StringValue(cluster.Id)

		terminateJobFlowsInput := &emr.TerminateJobFlowsInput{
			JobFlowIds: []*string{cluster.Id},
		}

		_, err := conn.TerminateJobFlows(terminateJobFlowsInput)

		if err != nil {
			return err
		}

		input := &emr.ListInstancesInput{
			ClusterId: cluster.Id,
		}
		var output *emr.ListInstancesOutput
		var instanceCount int

		err = resource.Retry(20*time.Minute, func() *resource.RetryError {
			var err error
			output, err = conn.ListInstances(input)

			if err != nil {
				return resource.NonRetryableError(err)
			}

			instanceCount = countEMRRemainingInstances(output, id)

			if instanceCount != 0 {
				return resource.RetryableError(fmt.Errorf("EMR Cluster (%s) has (%d) Instances remaining", id, instanceCount))
			}

			return nil
		})

		if isResourceTimeoutError(err) {
			output, err = conn.ListInstances(input)

			if err == nil {
				instanceCount = countEMRRemainingInstances(output, id)
			}
		}

		if instanceCount != 0 {
			return fmt.Errorf("EMR Cluster (%s) has (%d) Instances remaining", id, instanceCount)
		}

		if err != nil {
			return fmt.Errorf("error waiting for EMR Cluster (%s) Instances to drain: %s", id, err)
		}

		return nil
	}
}

I verified this works as expected:

--- PASS: TestAccAWSEMRCluster_disappears (451.10s)

Now we can use that new function in aws/resource_aws_emr_instance_group_test.go to verify what we are doing here. 🎉 Adding this new test:

// In aws/resource_aws_emr_instance_group_test.go:
func TestAccAWSEMRInstanceGroup_disappears_EmrCluster(t *testing.T) {
	var cluster emr.Cluster
	var ig emr.InstanceGroup
	rInt := acctest.RandInt()
	emrClusterResourceName := "aws_emr_cluster.tf-test-cluster"
	resourceName := "aws_emr_instance_group.task"

	resource.ParallelTest(t, resource.TestCase{
		PreCheck:     func() { testAccPreCheck(t) },
		Providers:    testAccProviders,
		CheckDestroy: testAccCheckAWSEmrInstanceGroupDestroy,
		Steps: []resource.TestStep{
			{
				Config: testAccAWSEmrInstanceGroupConfig_basic(rInt),
				Check: resource.ComposeTestCheckFunc(
					testAccCheckAWSEmrClusterExists(emrClusterResourceName, &cluster),
					testAccCheckAWSEmrInstanceGroupExists(resourceName, &ig),
					testAccCheckAWSEmrClusterDisappears(&cluster),
				),
				ExpectNonEmptyPlan: true,
			},
		},
	})
}

Yields an error, which hopefully this pull request addresses. 😉

--- FAIL: TestAccAWSEMRInstanceGroup_disappears_EmrCluster (432.45s)
    testing.go:630: Error destroying resource! WARNING: Dangling resources
        may exist. The full state and error is shown below.

        Error: errors during apply: error draining EMR Instance Group (ig-1O27W6AYK3EQ7): ValidationException: An instance group may only be modified when the cluster is running or waiting.

aws/resource_aws_emr_instance_group.go Outdated Show resolved Hide resolved
@bflad bflad self-assigned this Oct 10, 2019
@bflad bflad added the waiting-response Maintainers are waiting on response from community or contributor. label Oct 10, 2019
@ghost ghost added size/M Managed by automation to categorize the size of a PR. tests PRs: expanded test coverage. Issues: expanded coverage, enhancements to test infrastructure. and removed size/XS Managed by automation to categorize the size of a PR. labels Oct 10, 2019
@joelthompson
Copy link
Contributor Author

Hey @bflad -- great to see you at HashiConf and catch up! And I appreciate the really helpful suggestions!

Also a quick note -- I'm being a bit conservative here and only implementing the one known state that we've run into. There could be other states that should trigger ignoring the instance group (e.g., TERMINATING -- see https://github.com/aws/aws-sdk-go/blob/master/service/emr/api.go#L11030 for the complete list of states), but I'd be hesitant to be more speculative and treat other states as effectively being deleted when they actually aren't which could cause Terraform to leak resources. Thoughts? Does this make sense?

Anyway, I've implemented the suggestions and verified the two tests pass as expected:

$ make testacc TESTARGS='-run=TestAccAWSEMRCluster_disappears'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -count 1 -parallel 20 -run=TestAccAWSEMRCluster_disappears -timeout 120m
?       github.com/terraform-providers/terraform-provider-aws   [no test files]
=== RUN   TestAccAWSEMRCluster_disappears
=== PAUSE TestAccAWSEMRCluster_disappears
=== CONT  TestAccAWSEMRCluster_disappears
--- PASS: TestAccAWSEMRCluster_disappears (441.29s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       441.315s
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/flatmap      0.002s [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/keyvaluetags 0.026s [no tests to run]

$ make testacc TESTARGS='-run=TestAccAWSEMRInstanceGroup_EmrClusterDisappears'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -count 1 -parallel 20 -run=TestAccAWSEMRInstanceGroup_EmrClusterDisappears -timeout 120m
?       github.com/terraform-providers/terraform-provider-aws   [no test files]
=== RUN   TestAccAWSEMRInstanceGroup_EmrClusterDisappears
=== PAUSE TestAccAWSEMRInstanceGroup_EmrClusterDisappears
=== CONT  TestAccAWSEMRInstanceGroup_EmrClusterDisappears
--- PASS: TestAccAWSEMRInstanceGroup_EmrClusterDisappears (378.58s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       378.608s
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/flatmap      0.002s [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/keyvaluetags 0.004s [no tests to run]

@ghost ghost removed the waiting-response Maintainers are waiting on response from community or contributor. label Oct 10, 2019
@joelthompson joelthompson requested a review from bflad October 10, 2019 04:39
@bflad
Copy link
Contributor

bflad commented Oct 10, 2019

There could be other states that should trigger ignoring the instance group (e.g., TERMINATING

We go either which way on implementing these -- usually we'll automatically remove things on states like TERMINATING since its likely that the operator would want to recreate it since the API is telling us its about to disappear, but its also such an edge case that likely does not matter too much.

@joelthompson
Copy link
Contributor Author

There could be other states that should trigger ignoring the instance group (e.g., TERMINATING

We go either which way on implementing these -- usually we'll automatically remove things on states like TERMINATING since its likely that the operator would want to recreate it since the API is telling us its about to disappear, but its also such an edge case that likely does not matter too much.

This would be so much easier if AWS would just post the state machine that they promise to abide by and we wouldn't have to speculate on things like this. OK, I know, I sometimes have wild and unrealistic fantasies. I personally tend to be overly paranoid and worry about things like, "Is it possible to cancel a termination?" At least with TERMINATING, it should eventually go into TERMINATED and thus will get cleaned if we just look for TERMINATED.

Anyway, I'll go ahead and add just TERMINATING and leave the others for now.

@bflad bflad added this to the v2.32.0 milestone Oct 10, 2019
@joelthompson
Copy link
Contributor Author

Anyway, I'll go ahead and add just TERMINATING and leave the others for now.

OK, done. Not sure if there's a good way to test for TERMINATING (versus TERMINATED), though.

$ make testacc TESTARGS='-run=TestAccAWSEMRInstanceGroup_EmrClusterDisappears'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -count 1 -parallel 20 -run=TestAccAWSEMRInstanceGroup_EmrClusterDisappears -timeout 120m
?       github.com/terraform-providers/terraform-provider-aws   [no test files]
=== RUN   TestAccAWSEMRInstanceGroup_EmrClusterDisappears
=== PAUSE TestAccAWSEMRInstanceGroup_EmrClusterDisappears
=== CONT  TestAccAWSEMRInstanceGroup_EmrClusterDisappears
--- PASS: TestAccAWSEMRInstanceGroup_EmrClusterDisappears (358.62s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       358.647s
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/flatmap      0.007s [no tests to run]
testing: warning: no tests to run
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws/internal/keyvaluetags 0.005s [no tests to run]

Copy link
Contributor

@bflad bflad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @joelthompson 🚀

--- PASS: TestAccAWSEMRInstanceGroup_EmrClusterDisappears (369.87s)
--- PASS: TestAccAWSEMRInstanceGroup_AutoScalingPolicy (556.13s)
--- PASS: TestAccAWSEMRInstanceGroup_basic (569.22s)
--- PASS: TestAccAWSEMRInstanceGroup_EbsConfig_EbsOptimized (581.03s)
--- PASS: TestAccAWSEMRInstanceGroup_InstanceCount (589.60s)
--- PASS: TestAccAWSEMRInstanceGroup_BidPrice (596.88s)
--- PASS: TestAccAWSEMRInstanceGroup_ConfigurationsJson (810.57s)

@bflad bflad merged commit 6c5520d into hashicorp:master Oct 10, 2019
bflad added a commit that referenced this pull request Oct 10, 2019
@joelthompson
Copy link
Contributor Author

Thanks so much @bflad :)

@joelthompson joelthompson deleted the emr_ig_destroy branch October 10, 2019 13:57
@ghost
Copy link

ghost commented Oct 10, 2019

This has been released in version 2.32.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

@ghost
Copy link

ghost commented Nov 9, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Nov 9, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Addresses a defect in current functionality. service/emr Issues and PRs that pertain to the emr service. size/M Managed by automation to categorize the size of a PR. tests PRs: expanded test coverage. Issues: expanded coverage, enhancements to test infrastructure.
Projects
None yet
2 participants