
Plan stalls due to failed tiller during helm_resource state refresh #315

Closed
joatmon08 opened this issue Aug 7, 2019 · 13 comments

Terraform Version

Terraform v0.12.6

  • provider.helm v0.10.2

Affected Resource(s)

  • helm_resource
  • tiller

Terraform Configuration Files

main.tf

terraform {
  required_version = "~> 0.12"
}

provider "helm" {
  version = "~> 0.10"
  install_tiller = true
}

module "helm-consul" {
  source    = "./helm-consul"
  name      = "consul"
  namespace = var.namespace
  enable    = true
}

module file (located in ./helm-consul)

resource "helm_release" "consul" {
  name      = var.name
  chart     = "${path.module}/consul-helm"  ## official Helm Consul chart, local
  namespace = var.namespace

  set {
    name  = "server.replicas"
    value = var.replicas
  }

  set {
    name  = "server.bootstrapExpect"
    value = var.replicas
  }

  set {
    name  = "server.connect"
    value = true
  }

  provisioner "local-exec" {
    command = "helm test ${var.name}"
  }
}

Debug Output

https://gist.github.com/joatmon08/c77de83d65709c06e5313331f3aa8c4a

Expected Behavior

The Tiller pod should be re-initialized, or the provider should return an error such as "could not find a ready tiller pod" instead of hanging.

Actual Behavior

$ terraform plan
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

module.helm-consul.helm_release.consul[0]: Refreshing state... [id=consul]

Error: timeout while waiting for state to become 'Running' (last state: 'Pending', timeout: 5m0s)

Steps to Reproduce

  1. Create a Kubernetes cluster.

  2. Run terraform init with install_tiller = true. Tiller initializes correctly in the cluster.

  3. Successfully deploy a helm_resource using terraform apply. This gets recorded in Terraform state.

  4. Scale the Tiller deployment down using kubectl scale deployment/tiller-deploy -n kube-system --replicas=0 (to mimic a failed Tiller).

  5. Run terraform plan. It waits for an available Tiller pod and times out.

Important Factoids

Initially, we discovered this when we created a managed Kubernetes cluster and then updated some configuration, which caused the cluster to be destroyed and re-created. When the cluster re-initialized, Tiller was stuck in a failed state. Running helm init again re-deployed the Tiller pod and allowed the plan to complete.
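
For reference, a minimal sketch of that recovery step, assuming Tiller runs in the default kube-system namespace and was installed with a service account named tiller (adjust to your setup):

# Re-deploy Tiller; --upgrade re-creates the tiller-deploy deployment if it already exists.
helm init --upgrade --service-account tiller

# Wait for the new Tiller pod to become ready before re-running terraform plan.
kubectl -n kube-system rollout status deployment/tiller-deploy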

While this does not apply to Helm v3, any cluster running Helm v2 that is re-created could end up with a failed Tiller pod, causing the plan to stall. I initially discussed this with @alexsomesan and am posting here to collect input.

References

N/A

@alexsomesan
Member

Able to reproduce this. I'll be working on a fix in the coming days.

@mmclane

mmclane commented Oct 10, 2019

I am seeing this issue. Any updates?

@fl-max

fl-max commented Oct 31, 2019

I'm seeing this same issue; however, Tiller never failed and is still healthy.

Stuck on:
helm_release.airflow: Refreshing state... [id=heroic-seahorse]

In Tiller, I see:

[storage] 2019/10/31 17:14:19 getting last revision of "heroic-seahorse"
[storage] 2019/10/31 17:14:19 getting release history for "heroic-seahorse"
[storage] 2019/10/31 17:14:20 getting last revision of "heroic-seahorse"
[storage] 2019/10/31 17:14:20 getting release history for "heroic-seahorse"

Cancelling and running the plan again seems to fix it.

@zzzuzik

zzzuzik commented Oct 31, 2019

+1 constantly stuck with
getting last revision... getting release ...

@ryudice

ryudice commented Nov 4, 2019

Hi, same issue here

@zzzuzik

zzzuzik commented Nov 4, 2019

By the way, since my resource is recyclable, I worked around the problem by issuing terraform destroy for the resource and re-creating it under a different Terraform name, e.g. resource.mysql -> resource.mysql2 (see the sketch below).
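
A rough sketch of that workaround, with hypothetical resource names (mysql / mysql2 stand in for whatever your release is called):

# Destroy only the stuck release.
terraform destroy -target=helm_release.mysql

# Rename the resource block in the configuration, e.g.
#   resource "helm_release" "mysql"  ->  resource "helm_release" "mysql2"
# then create it again under the new name.
terraform apply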

@suheb

suheb commented Nov 7, 2019

I found a workaround to this. You need to delete the failed tiller-deploy pod in your cluster.
Run kubectl -n kube-system get pods | grep tiller to get the pod name.
Then, run kubectl -n kube-system delete pod <pod>.

After this, terraform plan should run normally.
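
If you prefer not to look up the pod name, the same thing can usually be done in one step with a label selector; this sketch assumes the default labels that helm init puts on the Tiller pod (app=helm, name=tiller):

# Delete the Tiller pod by label; the tiller-deploy deployment re-creates it.
kubectl -n kube-system delete pod -l app=helm,name=tiller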

@cliedeman

In my case, I had made a mistake and the deployment never started the Tiller pod because the service account could not be found:

kubectl -n kube-system get deploy | grep tiller-deploy

Deleting the deployment fixed my issue. Can't wait for Helm 3.
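
For anyone in the same situation, a sketch of that cleanup, assuming the default tiller-deploy deployment in kube-system and the conventional tiller service account name:

# Remove the broken Tiller deployment.
kubectl -n kube-system delete deployment tiller-deploy

# Re-install Tiller with a service account that actually exists.
helm init --service-account tiller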

@ArthurSens

ArthurSens commented Dec 4, 2019

I'm experiencing a similar issue.

I created an EKS cluster with Terraform and deployed Tiller and some helm_release resources with the helm provider.

After that, I deleted my EKS cluster with terraform destroy -target=module.eks.

Of course, all my pods were deleted with the cluster, and now none of terraform plan, terraform apply, or terraform destroy will run. They all fail with the following output:

module.helm-releases.helm_release.metrics-server: Refreshing state... [id=metrics-server]


Error: timeout while waiting for state to become 'Running' (last state: 'Pending', timeout: 5m0s)

Just to let you guys know, the issue was solved with:

terraform state rm module.helm-releases.helm_release.metrics-server
terraform destroy -auto-approve

@pio2pio

pio2pio commented Dec 30, 2019

After removing the Tiller pod manually, Terraform was unable to refresh the state and got stuck at:
module.kubernetes.module.config.module.gitlab-ci.helm_release.gitlab[0]: Refreshing state... [id=gitlab-runner]

Removing the missing resource from the state file resolved the issue:

$ terraform state rm module.kubernetes.module.config.module.gitlab-ci.helm_release.gitlab[0]
Removed module.kubernetes.module.config.module.gitlab-ci.helm_release.gitlab[0]
Successfully removed 1 resource instance(s).

@nidhi5885

The issue is intermittent; I am also facing the same problem.

@mcuadros
Collaborator

Closing this issue since it references a version based on Helm 2. If this is still valid against the master branch, please reopen it. Thanks.

@ghost

ghost commented May 11, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost ghost locked and limited conversation to collaborators May 11, 2020