Add Retry Mechanism to E2E EC2 Terraform Deployment #635

Merged: 4 commits, Dec 14, 2023


@harrryr (Contributor) commented Dec 7, 2023

Issue #, if available:
The EC2 canary occasionally fails because of transient issues. Recurring errors include `Max attempts reached` in the step "Wait for Endpoint to Come Online" and `Timeout while waiting for state to become running`. Both occur because the endpoint and the EC2 instances sometimes take longer than expected to become ready.

Description of changes:

  • Change the endpoint wait time from 5 minutes to 10 minutes.
  • Add a retry mechanism: if either the Terraform deployment or the endpoint connection fails, destroy the Terraform deployment and try again. This ensures that the endpoint is deleted and recreated.
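A minimal sketch of the longer endpoint wait, assuming the check is a polling loop with a fixed attempt budget (the `Max attempts reached` error suggests this shape). The helper name `wait_for_endpoint`, the 10-second interval, and the `curl` health-check URL are hypothetical, not taken from the workflow. With 60 attempts spaced 10 s apart, the total wait is roughly 10 minutes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a check command until it succeeds or the
# attempt budget runs out. Arguments: max_attempts, interval (seconds),
# then the check command and its arguments.
wait_for_endpoint() {
  local max_attempts=$1 interval=$2
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Max attempts reached" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "Endpoint online after $attempt attempt(s)"
}

# Example with a hypothetical URL: 60 attempts x 10 s = about 10 minutes.
# wait_for_endpoint 60 10 curl --silent --fail --output /dev/null "http://$INSTANCE_IP:8080/health"
```

Passing the check as a command keeps the polling logic separate from how readiness is probed, so the same helper could wrap a `curl` call or any other probe.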

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@harrryr requested a review from a team as a code owner on December 7, 2023 23:40
@codecov-commenter commented Dec 8, 2023

Codecov Report

Attention: 102 lines in your changes are missing coverage. Please review.

Comparison is base (09e6487) 85.71% compared to head (184678f) 50.73%.
Report is 177 commits behind head on main.

| File | Patch % | Lines |
|---|---|---|
| ...ent/providers/AwsAppSignalsCustomizerProvider.java | 24.00% | 35 Missing and 3 partials |
| ...gent/providers/AwsSpanMetricsProcessorBuilder.java | 0.00% | 20 Missing |
| ...ders/AttributePropagatingSpanProcessorBuilder.java | 0.00% | 16 Missing |
| ...viders/AwsMetricAttributesSpanExporterBuilder.java | 0.00% | 11 Missing |
| ...try/javaagent/providers/AwsSpanProcessingUtil.java | 90.16% | 1 Missing and 5 partials |
| ...vaagent/providers/AwsMetricAttributeGenerator.java | 96.89% | 2 Missing and 3 partials |
| ...y/javaagent/providers/AwsSpanMetricsProcessor.java | 91.48% | 0 Missing and 4 partials |
| ...t/providers/AttributePropagatingSpanProcessor.java | 94.59% | 2 Missing |


Additional details and impacted files
@@              Coverage Diff              @@
##               main     #635       +/-   ##
=============================================
- Coverage     85.71%   50.73%   -34.99%     
- Complexity       19      264      +245     
=============================================
  Files             3       39       +36     
  Lines            49     1301     +1252     
  Branches          5      141      +136     
=============================================
+ Hits             42      660      +618     
- Misses            3      609      +606     
- Partials          4       32       +28     


exit 1
fi
echo "Attempt $retry_counter"
success=0
Contributor (review comment on the lines above):

`success` should be set to 0 after the `terraform apply` command has completed. That is the assumption in the following code.

@harrryr (Contributor, Author) commented Dec 13, 2023

A `success` value of 1 indicates that setting up App Signals on the sample app failed, while 0 indicates that everything ran successfully.

The logic is:

1. Set `success` to 1 (the initial value is 1 so that the while loop runs).
2. While `success` is 1 (the Terraform deployment or the endpoint connection failed, so try again):
     2a: Set `success` to 0; any later failure changes it back to 1.
     2b: Run `terraform apply`. If the deployment fails, `success` changes to 1.
     2c: If `success` is still 0, install App Signals and check the endpoint connection.
     2d: If the endpoint connection fails, change `success` to 1.
     2e: If `success` is 1 at this point, either the deployment or the connection failed, so run the loop again. If it is still 0, everything succeeded, so exit the loop.

Setting `success` to 0 after `terraform apply` would overwrite the result of the deployment, so a failed deployment would look successful. If `success` is 1 after the Terraform deployment, we want to skip the endpoint connection step and redeploy with Terraform.
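The steps above can be sketched as a shell loop. This is a hedged reconstruction, not the workflow's actual code: `deploy_infra`, `setup_and_check_endpoint`, and `destroy_infra` are hypothetical placeholders for `terraform apply`, the App Signals install plus endpoint check, and `terraform destroy`. The `max_retry` cap is an assumption suggested by the `exit 1` and `retry_counter` lines in the review excerpt, and the destroy-before-retry step comes from the PR description.

```shell
#!/usr/bin/env bash
# Hypothetical retry loop following steps 1-2e. The caller must define
# deploy_infra, setup_and_check_endpoint and destroy_infra (placeholders).
run_with_retry() {
  local max_retry=$1
  local retry_counter=0
  local success=1                          # Step 1: start at 1 so the loop runs

  while [ "$success" -eq 1 ]; do           # Step 2: 1 means the last attempt failed
    if [ "$retry_counter" -ge "$max_retry" ]; then
      echo "Max retry attempts reached" >&2
      return 1
    fi
    retry_counter=$((retry_counter + 1))
    echo "Attempt $retry_counter"
    success=0                              # 2a: assume success, flip on failure

    deploy_infra || success=1              # 2b: a failed deployment sets 1

    if [ "$success" -eq 0 ]; then          # 2c: only check if the deploy worked
      setup_and_check_endpoint || success=1   # 2d: a failed check sets 1
    fi

    if [ "$success" -eq 1 ]; then          # 2e: tear down so the endpoint is recreated
      destroy_infra
    fi
  done
  echo "Succeeded after $retry_counter attempt(s)"
}
```

Because `success` is reset at the top of each iteration (2a) rather than after `terraform apply`, a failed deployment leaves `success` at 1 and the endpoint check is skipped, which is the behavior described in the reply.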

@srprash (Contributor) commented Dec 13, 2023

Is there a sample run where we can see this change in action?

@PaurushGarg merged commit 5d7feed into aws-observability:main on Dec 14, 2023
4 checks passed
PaurushGarg added a commit that referenced this pull request Dec 15, 2023
* E2E Test: Ensure the use of IMDSv2 in EC2 instances (#621)

* Add e2e canary to public preview regions (#623)

* Fix trace validation error follow up fix (#626)

* Fix Terraform Destroy Error on EKS Canary (#628)

* fix-e2e-eks-terraform-destroy-error

* Add region as parameter to terraform destroy

* Bump nebula.release from 17.2.2 to 18.0.6 (#631)

Bumps nebula.release from 17.2.2 to 18.0.6.

---
updated-dependencies:
- dependency-name: nebula.release
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump actions/setup-java from 3 to 4 (#629)

Bumps [actions/setup-java](https://github.com/actions/setup-java) from 3 to 4.
- [Release notes](https://github.com/actions/setup-java/releases)
- [Commits](actions/setup-java@v3...v4)

---
updated-dependencies:
- dependency-name: actions/setup-java
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump hashicorp/setup-terraform from 2 to 3 (#586)

Bumps [hashicorp/setup-terraform](https://github.com/hashicorp/setup-terraform) from 2 to 3.
- [Release notes](https://github.com/hashicorp/setup-terraform/releases)
- [Changelog](https://github.com/hashicorp/setup-terraform/blob/main/CHANGELOG.md)
- [Commits](hashicorp/setup-terraform@v2...v3)

---
updated-dependencies:
- dependency-name: hashicorp/setup-terraform
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump rust from 1.73 to 1.74 (#611)

Bumps rust from 1.73 to 1.74.

---
updated-dependencies:
- dependency-name: rust
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump actions/setup-node from 3 to 4 (#574)

Bumps [actions/setup-node](https://github.com/actions/setup-node) from 3 to 4.
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](actions/setup-node@v3...v4)

---
updated-dependencies:
- dependency-name: actions/setup-node
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump tempfile from 3.8.0 to 3.8.1 in /tools/cp-utility (#585)

Bumps [tempfile](https://github.com/Stebalien/tempfile) from 3.8.0 to 3.8.1.
- [Changelog](https://github.com/Stebalien/tempfile/blob/master/CHANGELOG.md)
- [Commits](https://github.com/Stebalien/tempfile/commits)

---
updated-dependencies:
- dependency-name: tempfile
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Provide aws-region for the e2e test in workflow (#643)

* Provide aws-region for the e2e test in workflow

* Update region to us-east-1 and add concurrency

* Revert "Provide aws-region for the e2e test in workflow (#643)" (#645)

This reverts commit 44b5b68.

* E2E Testing: Add concurrency tag to test in main build and nightly build (#646)

* Use aws-region in the workflow (#649)

* Add Retry Mechanism to E2E EKS Terraform Deployment (#634)

* Add Retry Mechanism to E2E EKS Terraform Deployment

* Add Extra Comments

* Call Test APIs First before Validation

* Add clean-app-signals to retry logic

* Change App Signal Download Directory and modify if statement for validation

* Modify while loop and refactor code

* Dynamic input RPM link by region setting (#647)

* Dynamic input RPM link by region setting

* Remove unneeded env variable

* Fix an issue in echo shell command

* Revert previous wrong 'fix' regarding variable call

* Add Retry Mechanism to E2E EC2 Terraform Deployment (#635)

* Add Retry Mechanism to E2E EC2 Terraform Deployment

* Add Extra Comments

* Refactor code

* Change App Signals Directory (#650)

* change dep config to compileOnly to fix high cardinality metrics (#651)

* E2E Testing: Fix EKS test candidate image override (#652)

This change checks if there is an adot image passed to the workflow and patches the App Signals deployment to update the image and restarts the cloudwatch pods.

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Mahad Janjua <[email protected]>
Co-authored-by: Harry <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Vasi Vasireddy <[email protected]>
Co-authored-by: XinRan Zhang <[email protected]>
Co-authored-by: Mengyi Zhou (bjrara) <[email protected]>
Labels: None yet
Projects: None yet
5 participants