Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Jenkins] Add delete environment pipeline #1055

Merged
merged 6 commits into from
Oct 11, 2024

Conversation

lewijacn
Copy link
Collaborator

@lewijacn lewijacn commented Oct 8, 2024

Description

Adds logic for deleting our CFN stacks across two CDK packages we currently use (this projects CDK and sample EC2 source CDK). This logic can be used locally with python, but is also included as a Jenkins pipeline in this change to allow manually deleting an environment as well as scheduled deletion of environments.

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-2054

Testing

Local testing and Jenkins: https://migrations.ci.opensearch.org/job/delete-environment/11/console

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link

codecov bot commented Oct 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.16%. Comparing base (5709add) to head (9cf66fb).
Report is 15 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1055      +/-   ##
============================================
- Coverage     80.55%   80.16%   -0.40%     
- Complexity     2736     2744       +8     
============================================
  Files           365      367       +2     
  Lines         13614    13743     +129     
  Branches        941      949       +8     
============================================
+ Hits          10967    11017      +50     
- Misses         2070     2149      +79     
  Partials        577      577              
Flag Coverage Δ
gradle-test 78.07% <ø> (-0.52%) ⬇️
python-test 90.24% <ø> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Tanner Lewis <[email protected]>
class StackDeletionRequest:
def __init__(self, stack_name, client_request_token=None):
self.stack_name = stack_name
self.client_request_token = client_request_token
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something,client_request_token is never set or used?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to take out, removed now


def retry_delete_stack(cfn_client, deletion_request: StackDeletionRequest):
if deletion_request.retry_count >= MAX_DELETE_STACK_RETRIES:
raise RuntimeError(f"Max attempts of {MAX_DELETE_STACK_RETRIES} have failed to delete stack: "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind making a custom exception type to use instead of RuntimeError? It's really easy to do and provides an easy-to-grep, stable interface for articulating an exceptional circumstance up the call stack (and in the logs).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 👍

Comment on lines +13 to +15
INDEPENDENT_STACKS = ['MigrationConsole', 'ReindexFromSnapshot', 'TrafficReplayer', 'TargetClusterProxy',
'CaptureProxy', 'KafkaBroker', 'OpenSearchContainer', 'CaptureProxyES', 'Elasticsearch']
CORE_STACKS_ORDERED = ['MigrationInfra', 'OpenSearchDomain', 'NetworkInfra', 'infra-stack', 'network-stack']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoding all these stack names and their relationships in a completely separate context doesn't feel great. Maybe it's necessary though. Does cdk destroy work to tear down these stack collections?

Copy link
Member

@peternied peternied Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these test accounts are there any stacks that shouldn't be fully deleted? If that is the case, can we just tear down everything?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few issues that caused me to go in this direction, we have two separate CDKs that we want to clean up, the stacks these CDKs will deploy depend on the context provided. To avoid having to match a particular context with a cdk destroy for a given environment I went this route of deleting possible stacks. We want the ability to cleanup a particular stage deployment only and not all our stacks completely in the case of a botched environment.

Comment on lines +69 to +72
elif stack_status == 'DELETE_FAILED':
logger.error(f"Stack {delete_request.stack_name} deletion failed, retrying...")
retry_delete_stack(cfn_client=cfn_client, deletion_request=delete_request)
in_progress_requests.append(delete_request) # Keep for further checks after retry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious what scenarios you've seen where retrying a failed Cfn deletion succeeds on the second attempt? Most of time this occurs for me, I have to go in manually and jigger things in the AWS Console.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main case I see is around custom resources. Some internal CDK logic gets confused during deletion and one or two retries usually resolves the issue

self.retry_count = 0


def delete_stack(cfn_client, stack_name: str) -> StackDeletionRequest:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably handle stacks that have already been deleted more gracefully here. I think we'd toss an exception on the Describe call currently, yeah? Or would that return an empty list and we'd get an out-of-bounds exception on the status retrieval?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be fine to continue if a stack has already been deleted before we go to delete it. I have added logging for this and raising an error if its for a different reason besides not existing

Comment on lines +99 to +100
for stack_name in stack_names
if any(stack_id in stack_name for stack_id in INDEPENDENT_STACKS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand the need for the final if in this list comprehension? I'd have expected us to just simply pull the names from the INDEPENDENT_STACKS list.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The INDEPENDENT_STACKS list contains they key identifier for a given stack while stack_name is the full stack name. The if condition is checking if any of the key identifiers are in the stack name before deleting it, as we only want to delete our independent stacks in this section

Comment on lines +132 to +133
if re.match(rf".*-{stage_name}-.*|.*-{stage_name}$", name):
stage_stack_names.append(name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't understand the need for a regex match here; mind explaining what you're thinking?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From our list of all stack names we use regex to get stacks that have stage name in middle(-stage-) or at end(-stage) of stack name. Added a comment for this as well

@@ -0,0 +1,47 @@
def call(Map config = [:]) {

pipeline {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you envision this Pipeline's actions being initiated? Does a human operator click a button to start the pipeline when needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A human operator can simply input a stage name and trigger the delete process through the UI. We can also in the future schedule a job to delete a specified set of stages

Signed-off-by: Tanner Lewis <[email protected]>
@lewijacn lewijacn merged commit 18d55fb into opensearch-project:main Oct 11, 2024
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants