Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Retry Mechanism to E2E EC2 Terraform Deployment #635

Merged
merged 4 commits into from
Dec 14, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
84 changes: 59 additions & 25 deletions .github/workflows/appsignals-e2e-ec2-test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -48,18 +48,68 @@ jobs:
with:
terraform_wrapper: false

- name: Deploy sample app via terraform
- name: Deploy sample app via terraform and wait for endpoint to come online
working-directory: testing/terraform/ec2
run: |
terraform init
terraform validate
terraform apply -auto-approve \
-var="aws_region=${{ env.AWS_DEFAULT_REGION }}" \
-var="test_id=${{ env.TESTING_ID }}" \
-var="sample_app_jar=${{ env.SAMPLE_APP_FRONTEND_SERVICE_JAR }}" \
-var="sample_remote_app_jar=${{ env.SAMPLE_APP_REMOTE_SERVICE_JAR }}" \
-var="cw_agent_rpm=${{ env.APP_SIGNALS_CW_AGENT_RPM }}" \
-var="adot_jar=${{ env.APP_SIGNALS_ADOT_JAR }}"

# Attempt to deploy the sample app on an EC2 instance and wait for its endpoint to come online.
# There may be occasional failures due to transitivity issues, so try up to 2 times.
# Success of 0 indicates that both the terraform deployment and the endpoint are running, while 1 indicates
# that it failed at some point
retry_counter=0
max_retry=2
success=1
while [ $success -eq 1 ]; do
if [ $retry_counter -ge $max_retry ]; then
echo "Max retry reached, failed to deploy terraform and connect to the endpoint. Exiting code"
exit 1
fi
echo "Attempt $retry_counter"
success=0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Success should be made 0 after the terraform apply command has completed. That's is the assumption in the following code.

Copy link
Contributor Author

@harrryr harrryr Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Success of 1 indicates that the setting up App Signals on the sample app failed, while 0 indicates that everything ran successfully.

The logic is:

1. Set Success to 1 (Set initial value to 1 so that the while loop runs)
2. While Success is 1 (Indicates that terraform deployment/endpoint connection failed and will try again):
     2a: Set Success to 0 (Set the value to 0 and if there were any failures change it to 1)
     2b: Run Terraform apply (If the deployment failed, then success will change to 1)
     2c: If Success is still 0, then install App Signals and check endpoint connection
     2d: If endpoint connection failed, change success to 1
     2e: If Success is 1 at this point, then either the deployment or connection failed and run the while loop again. If it is still 0, then the code ran successfully and exit the while loop

If the success is made 0 after the terraform apply, then it will override whether terraform deployment succeeded or not. If after the terraform deployment the success is 1, we want to skip the endpoint connection step and redeploy the terraform again.

terraform apply -auto-approve \
-var="aws_region=${{ env.AWS_DEFAULT_REGION }}" \
-var="test_id=${{ env.TESTING_ID }}" \
-var="sample_app_jar=${{ env.SAMPLE_APP_FRONTEND_SERVICE_JAR }}" \
-var="sample_remote_app_jar=${{ env.SAMPLE_APP_REMOTE_SERVICE_JAR }}" \
-var="cw_agent_rpm=${{ env.APP_SIGNALS_CW_AGENT_RPM }}" \
-var="adot_jar=${{ env.APP_SIGNALS_ADOT_JAR }}" || success=$?

if [ $success -eq 1 ]; then
echo "Terraform deployment was unsuccessful. Will attempt to retry deployment."
fi

# If the success is still 0, then the terraform deployment succeeded and now try to connect to the endpoint.
# Attempts to connect will be made for up to 10 minutes
if [ $success -eq 0 ]; then
echo "Attempting to connect to the endpoint"
sample_app_endpoint=http://$(terraform output sample_app_main_service_public_dns):8080
attempt_counter=0
max_attempts=60
until $(curl --output /dev/null --silent --head --fail $(echo "$sample_app_endpoint" | tr -d '"')); do
if [ ${attempt_counter} -eq ${max_attempts} ];then
echo "Failed to connect to endpoint. Will attempt to redeploy sample app."
success=1
break
fi

printf '.'
attempt_counter=$(($attempt_counter+1))
sleep 10
done
fi

# If the success is 1 then either the terraform deployment or the endpoint connection failed, so first destroy the
# resources created from terraform and try again.
if [ $success -eq 1 ]; then
echo "Destroying terraform"
terraform destroy -auto-approve \
-var="test_id=${{ env.TESTING_ID }}"

retry_counter=$(($retry_counter+1))
fi
done

- name: Get the ec2 instance ami id
run: |
Expand All @@ -72,22 +122,6 @@ jobs:
echo "REMOTE_SERVICE_IP=$(terraform output sample_app_remote_service_public_ip)" >> $GITHUB_ENV
working-directory: testing/terraform/ec2

- name: Wait for app endpoint to come online
id: endpoint-check
run: |
attempt_counter=0
max_attempts=30
until $(curl --output /dev/null --silent --head --fail http://${{ env.MAIN_SERVICE_ENDPOINT }}); do
if [ ${attempt_counter} -eq ${max_attempts} ];then
echo "Max attempts reached"
exit 1
fi

printf '.'
attempt_counter=$(($attempt_counter+1))
sleep 10
done

# This steps increases the speed of the validation by creating the telemetry data in advance
- name: Call all test APIs
continue-on-error: true
Expand Down Expand Up @@ -174,4 +208,4 @@ jobs:
working-directory: testing/terraform/ec2
run: |
terraform destroy -auto-approve \
-var="test_id=${{ env.TESTING_ID }}"
-var="test_id=${{ env.TESTING_ID }}"
2 changes: 1 addition & 1 deletion .github/workflows/appsignals-e2e-eks-test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -274,4 +274,4 @@ jobs:
--name service-account-${{ env.TESTING_ID }} \
--namespace ${{ env.SAMPLE_APP_NAMESPACE }} \
--cluster ${{ inputs.test-cluster-name }} \
--region ${{ env.AWS_DEFAULT_REGION }} \
--region ${{ env.AWS_DEFAULT_REGION }}