[Jenkins] Add delete environment pipeline #1055
Conversation
Signed-off-by: Tanner Lewis <[email protected]>
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@             Coverage Diff              @@
##               main    #1055      +/-   ##
============================================
- Coverage     80.55%   80.16%    -0.40%
- Complexity     2736     2744        +8
============================================
  Files           365      367        +2
  Lines         13614    13743      +129
  Branches        941      949        +8
============================================
+ Hits          10967    11017       +50
- Misses         2070     2149       +79
  Partials        577      577
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Signed-off-by: Tanner Lewis <[email protected]>
```python
class StackDeletionRequest:
    def __init__(self, stack_name, client_request_token=None):
        self.stack_name = stack_name
        self.client_request_token = client_request_token
```
Unless I'm missing something, `client_request_token` is never set or used?
Forgot to take it out, removed now.
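With the token removed, the request object reduces to a small holder for the stack name and a retry counter. A minimal sketch of the post-review shape, reconstructed from the diff fragments in this review rather than the exact final code:

```python
class StackDeletionRequest:
    """Tracks one CloudFormation stack deletion attempt and its retry count."""

    def __init__(self, stack_name):
        self.stack_name = stack_name
        self.retry_count = 0  # bumped by retry_delete_stack on each DELETE_FAILED
```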
```python
def retry_delete_stack(cfn_client, deletion_request: StackDeletionRequest):
    if deletion_request.retry_count >= MAX_DELETE_STACK_RETRIES:
        raise RuntimeError(f"Max attempts of {MAX_DELETE_STACK_RETRIES} have failed to delete stack: "
```
Would you mind making a custom exception type to use instead of RuntimeError? It's really easy to do and provides an easy-to-grep, stable interface for articulating an exceptional circumstance up the call stack (and in the logs).
Added 👍
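For example, the bare `RuntimeError` could be replaced with a dedicated exception type along these lines (the name `StackDeletionFailedError` and the retry limit of 3 are illustrative assumptions, not necessarily what the PR settled on):

```python
MAX_DELETE_STACK_RETRIES = 3  # assumed value for illustration


class StackDeletionFailedError(Exception):
    """Raised when a stack still fails to delete after the allowed retries."""

    def __init__(self, stack_name, attempts):
        super().__init__(f"Max attempts of {attempts} have failed to delete stack: {stack_name}")
        self.stack_name = stack_name
        self.attempts = attempts


def check_retry_budget(deletion_request):
    # A typed, grep-able failure instead of a generic RuntimeError
    if deletion_request.retry_count >= MAX_DELETE_STACK_RETRIES:
        raise StackDeletionFailedError(deletion_request.stack_name, MAX_DELETE_STACK_RETRIES)
```

Callers (and log scrapers) can then catch or search for the specific type rather than pattern-matching on a message string.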
```python
INDEPENDENT_STACKS = ['MigrationConsole', 'ReindexFromSnapshot', 'TrafficReplayer', 'TargetClusterProxy',
                      'CaptureProxy', 'KafkaBroker', 'OpenSearchContainer', 'CaptureProxyES', 'Elasticsearch']
CORE_STACKS_ORDERED = ['MigrationInfra', 'OpenSearchDomain', 'NetworkInfra', 'infra-stack', 'network-stack']
```
Hardcoding all these stack names and their relationships in a completely separate context doesn't feel great. Maybe it's necessary, though. Does `cdk destroy` work to tear down these stack collections?
In these test accounts are there any stacks that shouldn't be fully deleted? If that is the case, can we just tear down everything?
A few issues caused me to go in this direction: we have two separate CDKs that we want to clean up, and the stacks these CDKs deploy depend on the context provided. To avoid having to match a particular context with a `cdk destroy` for a given environment, I went the route of deleting possible stacks. We also want the ability to clean up a particular stage deployment only, rather than all our stacks completely, in the case of a botched environment.
```python
elif stack_status == 'DELETE_FAILED':
    logger.error(f"Stack {delete_request.stack_name} deletion failed, retrying...")
    retry_delete_stack(cfn_client=cfn_client, deletion_request=delete_request)
    in_progress_requests.append(delete_request)  # Keep for further checks after retry
```
I'm curious what scenarios you've seen where retrying a failed CFN deletion succeeds on the second attempt? Most of the time this occurs for me, I have to go in manually and jigger things in the AWS Console.
The main case I see is around custom resources. Some internal CDK logic gets confused during deletion and one or two retries usually resolves the issue
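The status-handling branch above can be sketched as pure retry bookkeeping, with no AWS calls (the function shape and `max_retries` default are illustrative; the real code re-issues the `delete_stack` call where the comment indicates):

```python
def handle_deletion_status(delete_request, stack_status, in_progress_requests, max_retries=3):
    """Decide the next step for one stack based on its CloudFormation status."""
    if stack_status == 'DELETE_COMPLETE':
        return  # finished; drop the request from further polling
    if stack_status == 'DELETE_FAILED':
        if delete_request.retry_count >= max_retries:
            raise RuntimeError(f"Max attempts of {max_retries} have failed to delete stack: "
                               f"{delete_request.stack_name}")
        delete_request.retry_count += 1  # the real code re-issues delete_stack here
    # DELETE_IN_PROGRESS, or DELETE_FAILED with retries remaining: keep polling
    in_progress_requests.append(delete_request)
```

This keeps the custom-resource hiccups described above inside the normal polling loop: a transient `DELETE_FAILED` is retried and re-queued, and only a persistent failure escapes as an error.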
```python
        self.retry_count = 0


def delete_stack(cfn_client, stack_name: str) -> StackDeletionRequest:
```
We should probably handle stacks that have already been deleted more gracefully here. I think we'd toss an exception on the Describe call currently, yeah? Or would that return an empty list and we'd get an out-of-bounds exception on the status retrieval?
I think we should be fine to continue if a stack has already been deleted before we go to delete it. I have added logging for this, and we raise an error if the failure is for a different reason besides not existing.
```python
for stack_name in stack_names
if any(stack_id in stack_name for stack_id in INDEPENDENT_STACKS)
```
Can you help me understand the need for the final `if` in this list comprehension? I'd have expected us to just pull the names from the `INDEPENDENT_STACKS` list.
The `INDEPENDENT_STACKS` list contains the key identifier for a given stack, while `stack_name` is the full stack name. The `if` condition checks whether any of the key identifiers appear in the stack name before deleting it, as we only want to delete our independent stacks in this section.
```python
if re.match(rf".*-{stage_name}-.*|.*-{stage_name}$", name):
    stage_stack_names.append(name)
```
I didn't understand the need for a regex match here; mind explaining what you're thinking?
From our list of all stack names, we use a regex to get stacks that have the stage name in the middle (`-stage-`) or at the end (`-stage`) of the stack name. Added a comment for this as well.
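Concretely, with `stage_name = "dev"` the pattern matches `-dev-` anywhere or a trailing `-dev`, but not names that merely contain `dev` without the surrounding hyphens (example names below are invented for illustration):

```python
import re

stage_name = "dev"
all_stack_names = [
    "opensearch-infra-dev-network",  # stage in the middle: matches .*-dev-.*
    "opensearch-infra-dev",          # stage at the end: matches .*-dev$
    "opensearch-infra-prod",         # different stage: no match
    "devtools-prod",                 # 'dev' without hyphens around it: no match
]

stage_stack_names = [
    name for name in all_stack_names
    if re.match(rf".*-{stage_name}-.*|.*-{stage_name}$", name)
]
```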
```groovy
def call(Map config = [:]) {

    pipeline {
```
How do you envision this Pipeline's actions being initiated? Does a human operator click a button to start the pipeline when needed?
A human operator can simply input a stage name and trigger the delete process through the UI. In the future we can also schedule a job to delete a specified set of stages.
Signed-off-by: Tanner Lewis <[email protected]>
Description
Adds logic for deleting our CFN stacks across the two CDK packages we currently use (this project's CDK and the sample EC2 source CDK). This logic can be used locally with Python, but it is also included as a Jenkins pipeline in this change, allowing manual deletion of an environment as well as scheduled deletion of environments.
Issues Resolved
https://opensearch.atlassian.net/browse/MIGRATIONS-2054
Testing
Local testing and Jenkins: https://migrations.ci.opensearch.org/job/delete-environment/11/console
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.