
CLIP-1701: Wait until pods in Helm release are terminated before destroying nfs module #282

Merged · 4 commits · Nov 15, 2022

Conversation

bianchi2 (Collaborator) commented on Nov 10, 2022

This PR fixes an issue with pods stuck in a Terminating state when destroying infrastructure. The problem was the deletion of the shared home PVC, which stays pending as long as its pod is stuck in Terminating.

Helm release server pods often get stuck in Terminating (unless the termination grace period is set to 0) for the following reason:

Destruction of the nfs module (which includes the nfs Helm release and the EBS volume with its PV and PVC) and of the product Helm releases happens almost simultaneously. As a result, while, say, the Confluence pod is still in a Terminating state (the preStop hook can take some time), the EBS volume that backs the underlying PV and PVC gets destroyed too. The Confluence container then enters a broken state, and kubelet cannot kill the pod because of the following error:

tried to kill container, but did not receive an exit event

When trying to delete a docker container directly from the node, it turns out that the container is indeed unresponsive:

[root@ip-10-0-0-198 ~]# docker stop 92f3653bd6f1
Error response from daemon: cannot stop container: 92f3653bd6f1: tried to kill container, but did not receive an exit event

As a result, the Confluence pod is stuck in Terminating, the shared home PVC deletion is blocked for as long as the pod exists, and Terraform eventually gives up waiting for the PVC deletion.

After investigating the issue, and not being able to reproduce it manually with helm delete, it became obvious that helm_release destruction needs to wait for all pods to be wiped out; otherwise there is a chance that critical pieces of infrastructure are destroyed while a pod is still terminating.

Unfortunately, the Helm provider cannot wait for pods to be deleted; it only waits for the release itself to be deleted. See: hashicorp/terraform-provider-helm#593 and helm/helm#2378

The workaround is:

  • make sure helm_release depends on the nfs module so that it is deleted first (deletion happens in reverse dependency order)
  • make sure Terraform waits n seconds before tearing down the nfs module, where n seconds == terminationGracePeriod (see the sketch below)
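
A minimal Terraform sketch of that wiring, assuming the hashicorp/time provider; the variable and resource names below are illustrative, not necessarily the ones used in this repo:

# Hypothetical variable: pod termination grace period, in seconds.
variable "termination_grace_period" {
  type    = number
  default = 30
}

# On destroy, this sleep runs after the product release is gone but
# before anything it depends on (the nfs module) is torn down.
resource "time_sleep" "wait_product_termination" {
  depends_on       = [module.nfs]
  destroy_duration = "${var.termination_grace_period}s"
}

# The product release is destroyed first; the sleep then holds back
# destruction of the EBS volume, PV and PVC while the pods terminate.
resource "helm_release" "product" {
  name  = "confluence"
  chart = "confluence" # chart repository omitted for brevity

  depends_on = [time_sleep.wait_product_termination]
}

Since Terraform destroys in reverse dependency order, the pause sits between the product release teardown and the nfs module teardown.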

This way the following deletion order is achieved:

module.confluence[0].helm_release.confluence: Destroying... [id=confluence]
module.confluence[0].helm_release.confluence: Still destroying... [id=confluence, 10s elapsed]
module.confluence[0].helm_release.confluence: Still destroying... [id=confluence, 20s elapsed]
module.confluence[0].helm_release.confluence: Still destroying... [id=confluence, 30s elapsed]
module.confluence[0].helm_release.confluence: Destruction complete after 32s
module.confluence[0].time_sleep.wait_confluence_termination: Destroying... [id=2022-11-09T20:39:37Z]
module.confluence[0].kubernetes_secret.rds_secret: Destroying... [id=atlassian/confluence-db-cred]
module.confluence[0].kubernetes_secret.license_secret: Destroying... [id=atlassian/confluence-license]
module.confluence[0].kubernetes_job.pre_install[0]: Destroying... [id=atlassian/confluence-pre-install]
module.confluence[0].kubernetes_secret.license_secret: Destruction complete after 1s
module.confluence[0].kubernetes_secret.rds_secret: Destruction complete after 1s
module.confluence[0].kubernetes_job.pre_install[0]: Destruction complete after 2s
module.confluence[0].module.database.module.security_group.aws_security_group_rule.ingress_with_source_security_group_id[0]: Destroying... [id=sgrule-2573860818]
module.confluence[0].module.database.module.db.module.db_instance.aws_db_instance.this[0]: Destroying... [id=atlas-eugenetest-confluence-db]
module.confluence[0].module.database.module.security_group.aws_security_group_rule.ingress_with_source_security_group_id[0]: Destruction complete after 0s
module.confluence[0].time_sleep.wait_confluence_termination: Still destroying... [id=2022-11-09T20:39:37Z, 10s elapsed]
module.confluence[0].module.database.module.db.module.db_instance.aws_db_instance.this[0]: Still destroying... [id=atlas-eugenetest-confluence-db, 10s elapsed]
module.confluence[0].time_sleep.wait_confluence_termination: Still destroying... [id=2022-11-09T20:39:37Z, 20s elapsed]
module.confluence[0].module.database.module.db.module.db_instance.aws_db_instance.this[0]: Still destroying... [id=atlas-eugenetest-confluence-db, 20s elapsed]
module.confluence[0].time_sleep.wait_confluence_termination: Still destroying... [id=2022-11-09T20:39:37Z, 30s elapsed]
module.confluence[0].module.database.module.db.module.db_instance.aws_db_instance.this[0]: Still destroying... [id=atlas-eugenetest-confluence-db, 30s elapsed]
module.confluence[0].time_sleep.wait_confluence_termination: Destruction complete after 35s
module.confluence[0].module.nfs.kubernetes_persistent_volume_claim.product_shared_home_pvc: Destroying... [id=atlassian/confluence-shared-home-pvc]
module.confluence[0].module.nfs.helm_release.nfs: Destroying... [id=confluence-nfs]
module.confluence[0].module.nfs.kubernetes_persistent_volume_claim.product_shared_home_pvc: Destruction complete after 1s
module.confluence[0].module.nfs.kubernetes_persistent_volume.product_shared_home_pv: Destroying... [id=confluence-shared-home-pv]
module.confluence[0].module.nfs.kubernetes_persistent_volume.product_shared_home_pv: Destruction complete after 1s

This gives the Helm release pods time to terminate, and only then are the EBS volume, PV and PVC deleted.

Checklist

  • I have run successful end-to-end tests (with & without domain)
  • I have added unit tests (if applicable)
  • I have added user documentation (if applicable)

jjeongatl (Collaborator) commented on Nov 10, 2022

make sure terraform waits n seconds before tearing down nfs module. n seconds == terminationGracePeriod + 5 seconds

Out of curiosity, where can I find the part that waits the extra 5 seconds?

bianchi2 (Collaborator, Author) replied:

@jjeongatl oops, my bad. That was the original plan, but just termination_grace_period turned out to be enough, since deleting the Helm release itself takes some time and the pod is already terminating by then.

Review thread on the changed depends_on block (diff shown truncated):

-  depends_on = [kubernetes_job.import_dataset]
+  depends_on = [
+    kubernetes_job.import_dataset,
+    module.nfs,
A Collaborator commented:

Do we need module.nfs here? time_sleep already depends on nfs, so we can remove it here since the dependency is already implied.

bianchi2 (Collaborator, Author) replied on Nov 15, 2022:

Nice catch. Yes, the dependency on the nfs module is implied here; it was a leftover from trying to get away with just a dependency on nfs (without the sleep). Removed the redundant dependency.
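
For illustration only, the trimmed list would look roughly like this (the time_sleep name is assumed, matching the sketch earlier in this PR description):

# module.nfs is already reachable transitively through the time_sleep
# resource, so listing it again here does not change destroy ordering.
depends_on = [
  kubernetes_job.import_dataset,
  time_sleep.wait_product_termination,
]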

bianchi2 added and then removed the e2e label on Nov 15, 2022
bianchi2 merged commit 79c9aff into main on Nov 15, 2022
bianchi2 deleted the CLIP-1707-helm-wait-for-pod-termination branch on Nov 15, 2022 at 23:36