Fix AWS EMR Instance Group Deletion Errors #10425
When Terraform needs to replace an aws_emr_cluster resource that has aws_emr_instance_group resources associated with it, it tries to destroy the instance group. That fails: a "destroy" on an instance group means setting its number of instances to zero, but AWS doesn't let you modify the instance count of an instance group on an EMR cluster that is no longer running or waiting. This fixes the issue by treating an instance group that has been terminated as no longer existing, so Terraform won't try to execute a "destroy" and won't error out. Fixes hashicorp#1355 Fixes hashicorp#9400
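As a rough illustration of the idea (not the actual provider code), the read path can be thought of as dropping the resource ID whenever the instance group is missing or reports the TERMINATED state. The `instanceGroup` type and `refreshID` helper below are hypothetical stand-ins for `emr.InstanceGroup` and the resource's read function:

```go
package main

import "fmt"

// instanceGroup is a hypothetical stand-in for the fields of
// emr.InstanceGroup that matter for this sketch.
type instanceGroup struct {
	ID    string
	State string
}

// stateTerminated mirrors the EMR instance group state "TERMINATED".
const stateTerminated = "TERMINATED"

// refreshID returns the ID that should remain in Terraform state after a
// read. An empty string means "remove from state", so no destroy is
// attempted against the API later. (Assumed behavior, for illustration.)
func refreshID(ig *instanceGroup) string {
	if ig == nil || ig.State == stateTerminated {
		return ""
	}
	return ig.ID
}

func main() {
	running := &instanceGroup{ID: "ig-EXAMPLE", State: "RUNNING"}
	terminated := &instanceGroup{ID: "ig-EXAMPLE", State: stateTerminated}

	fmt.Printf("running:    %q\n", refreshID(running))
	fmt.Printf("terminated: %q\n", refreshID(terminated))
}
```

With the ID cleared on read, the subsequent plan sees nothing left to destroy, which is what avoids the ValidationException on the drain call.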
Hey @joelthompson 👋 Hope things are well after HashiConf and thanks for contributing this! Other than the code change below, it would be fantastic if we could also ensure there was acceptance testing covering this scenario. 👍 Once these are in this should be good to go. Please reach out with any questions or if you do not have time to implement the feedback.
To cover this functionality, we need to enhance aws/resource_aws_emr_cluster_test.go a little to include a new testing function that terminates a cluster (and since we're doing this, why not also verify that the aws_emr_cluster resource itself does the right thing when it's terminated outside Terraform):
// In aws/resource_aws_emr_cluster_test.go:

func TestAccAWSEMRCluster_disappears(t *testing.T) {
	var cluster emr.Cluster
	r := acctest.RandInt()

	resource.ParallelTest(t, resource.TestCase{
		PreCheck:     func() { testAccPreCheck(t) },
		Providers:    testAccProviders,
		CheckDestroy: testAccCheckAWSEmrDestroy,
		Steps: []resource.TestStep{
			{
				Config: testAccAWSEmrClusterConfig(r),
				Check: resource.ComposeTestCheckFunc(
					testAccCheckAWSEmrClusterExists("aws_emr_cluster.tf-test-cluster", &cluster),
					testAccCheckAWSEmrClusterDisappears(&cluster),
				),
				ExpectNonEmptyPlan: true,
			},
		},
	})
}

func testAccCheckAWSEmrClusterDisappears(cluster *emr.Cluster) resource.TestCheckFunc {
	return func(s *terraform.State) error {
		conn := testAccProvider.Meta().(*AWSClient).emrconn
		id := aws.StringValue(cluster.Id)

		// Terminate the cluster outside of Terraform.
		terminateJobFlowsInput := &emr.TerminateJobFlowsInput{
			JobFlowIds: []*string{cluster.Id},
		}

		_, err := conn.TerminateJobFlows(terminateJobFlowsInput)
		if err != nil {
			return err
		}

		input := &emr.ListInstancesInput{
			ClusterId: cluster.Id,
		}
		var output *emr.ListInstancesOutput
		var instanceCount int

		// Poll until the cluster has no remaining instances.
		err = resource.Retry(20*time.Minute, func() *resource.RetryError {
			var err error
			output, err = conn.ListInstances(input)
			if err != nil {
				return resource.NonRetryableError(err)
			}

			instanceCount = countEMRRemainingInstances(output, id)
			if instanceCount != 0 {
				return resource.RetryableError(fmt.Errorf("EMR Cluster (%s) has (%d) Instances remaining", id, instanceCount))
			}

			return nil
		})

		// On timeout, check once more before reporting failure.
		if isResourceTimeoutError(err) {
			output, err = conn.ListInstances(input)
			if err == nil {
				instanceCount = countEMRRemainingInstances(output, id)
			}
		}

		if instanceCount != 0 {
			return fmt.Errorf("EMR Cluster (%s) has (%d) Instances remaining", id, instanceCount)
		}

		if err != nil {
			return fmt.Errorf("error waiting for EMR Cluster (%s) Instances to drain: %s", id, err)
		}

		return nil
	}
}
I verified this works as expected:
--- PASS: TestAccAWSEMRCluster_disappears (451.10s)
Now we can use that new function in aws/resource_aws_emr_instance_group_test.go to verify what we are doing here. 🎉 Adding this new test:
// In aws/resource_aws_emr_instance_group_test.go:

func TestAccAWSEMRInstanceGroup_disappears_EmrCluster(t *testing.T) {
	var cluster emr.Cluster
	var ig emr.InstanceGroup
	rInt := acctest.RandInt()
	emrClusterResourceName := "aws_emr_cluster.tf-test-cluster"
	resourceName := "aws_emr_instance_group.task"

	resource.ParallelTest(t, resource.TestCase{
		PreCheck:     func() { testAccPreCheck(t) },
		Providers:    testAccProviders,
		CheckDestroy: testAccCheckAWSEmrInstanceGroupDestroy,
		Steps: []resource.TestStep{
			{
				Config: testAccAWSEmrInstanceGroupConfig_basic(rInt),
				Check: resource.ComposeTestCheckFunc(
					testAccCheckAWSEmrClusterExists(emrClusterResourceName, &cluster),
					testAccCheckAWSEmrInstanceGroupExists(resourceName, &ig),
					testAccCheckAWSEmrClusterDisappears(&cluster),
				),
				ExpectNonEmptyPlan: true,
			},
		},
	})
}
Yields an error, which hopefully this pull request addresses. 😉
--- FAIL: TestAccAWSEMRInstanceGroup_disappears_EmrCluster (432.45s)
testing.go:630: Error destroying resource! WARNING: Dangling resources
may exist. The full state and error is shown below.
Error: errors during apply: error draining EMR Instance Group (ig-1O27W6AYK3EQ7): ValidationException: An instance group may only be modified when the cluster is running or waiting.
Hey @bflad -- great to see you at HashiConf and catch up! And I appreciate the really helpful suggestions! Also a quick note -- I'm being a bit conservative here and only implementing the one known state that we've run into. There could be other states that should trigger ignoring the instance group (e.g., TERMINATING -- see https://github.com/aws/aws-sdk-go/blob/master/service/emr/api.go#L11030 for the complete list of states), but I'd be hesitant to be more speculative and treat other states as effectively being deleted when they actually aren't which could cause Terraform to leak resources. Thoughts? Does this make sense? Anyway, I've implemented the suggestions and verified the two tests pass as expected:
We go either which way on implementing these -- usually we'll automatically remove things on states like |
This would be so much easier if AWS would just post the state machine that they promise to abide by and we wouldn't have to speculate on things like this. OK, I know, I sometimes have wild and unrealistic fantasies. I personally tend to be overly paranoid and worry about things like, "Is it possible to cancel a termination?" At least with Anyway, I'll go ahead and add just |
OK, done. Not sure if there's a good way to test for
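To make the trade-off in this thread concrete, here is a minimal sketch of a conservative policy: only states explicitly listed as "gone" cause the instance group to be dropped, and every other state keeps it tracked so Terraform cannot leak it. The `goneStates` set and `treatAsDeleted` helper are hypothetical names for illustration, and the set shown contains only the one state from the original description; which states the provider should ultimately include is exactly the judgment call being discussed.

```go
package main

import "fmt"

// goneStates is a hypothetical allow-list of EMR instance group states
// that are treated as "already deleted". Keeping it small is the
// conservative choice: an unlisted state is never assumed to be gone.
var goneStates = map[string]bool{
	"TERMINATED": true,
}

// treatAsDeleted reports whether an instance group in the given state
// should be removed from Terraform state instead of being drained.
func treatAsDeleted(state string) bool {
	return goneStates[state]
}

func main() {
	// A few of the instance group states defined by the EMR API.
	for _, s := range []string{"RUNNING", "RESIZING", "TERMINATING", "TERMINATED"} {
		fmt.Printf("%-12s treat as deleted: %t\n", s, treatAsDeleted(s))
	}
}
```

Widening the policy is then a one-line change to the set, which keeps the decision about speculative states easy to review.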
LGTM, thanks @joelthompson 🚀
--- PASS: TestAccAWSEMRInstanceGroup_EmrClusterDisappears (369.87s)
--- PASS: TestAccAWSEMRInstanceGroup_AutoScalingPolicy (556.13s)
--- PASS: TestAccAWSEMRInstanceGroup_basic (569.22s)
--- PASS: TestAccAWSEMRInstanceGroup_EbsConfig_EbsOptimized (581.03s)
--- PASS: TestAccAWSEMRInstanceGroup_InstanceCount (589.60s)
--- PASS: TestAccAWSEMRInstanceGroup_BidPrice (596.88s)
--- PASS: TestAccAWSEMRInstanceGroup_ConfigurationsJson (810.57s)
Thanks so much @bflad :)
This has been released in version 2.32.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!
Community Note
Closes #1355
Closes #9400
Release note for CHANGELOG:
Output from acceptance testing:
Note that one test failed due to EC2 resource limits in my test account but succeeded on a retry: