Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"boot_disk.0.kms_key_self_link": conflicts with boot_disk.0.disk_encryption_key_raw #7934

Closed
ronjarrell opened this issue Dec 4, 2020 · 23 comments

Comments

@ronjarrell
Copy link

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

0.14.0
(also failing under 0.13.5 and 0.12.29)
Provider 3.49.0
(also tried 2-3 random versions clear back to 3.19)

Affected Resource(s)

  • google_compute_instance

Terraform Configuration Files

resource "google_compute_instance" "this" {
  count = var.instance_count

  project = var.gcp_project

  allow_stopping_for_update = true

  name = module.names.shortname[count.index]


  machine_type = var.machine_type
  zone         = element(data.google_compute_zones.available.names, lookup({ "a" = 1, "b" = 2, "c" = 3, "d" = 4 }, var.az) - 1)

  tags = var.tags

  labels = merge(
    {
      "managed"     = "true"
      "environment" = lower(var.env_name)
      "hostname" = substr(replace(module.names.hostname[count.index],
      ".", "_"), 0, 62)
    },
    var.labels,
  )

  shielded_instance_config {
    enable_vtpm                 = true
    enable_integrity_monitoring = true
    enable_secure_boot          = false
  }

  metadata = merge(
    var.metadata,
    map(local.startup_script_key_name, data.template_file.boot_script[count.index].rendered),
    { block-project-ssh-keys = true },
    { enable-oslogin = true }
  )
  boot_disk {
    initialize_params {
      size  = var.boot_disk_size
      image = local.image
    }
    disk_encryption_key_raw = data.terraform_remote_state.atom.outputs.csek-raw-key == "" ? null : data.terraform_remote_state.atom.outputs.csek-raw-key
  }

  dynamic "service_account" {
    for_each = var.service_account
    content {
      email  = lookup(service_account.value, "email", null)
      scopes = lookup(service_account.value, "scopes", null)
    }
  }

  network_interface {
    subnetwork = var.subnet_id
    network_ip = element(concat(var.ip_address, [""]), count.index)

    dynamic "access_config" {
      for_each = var.access_config
      content {
        network_tier = access_config.value.network_tier
      }
    }
  }

  lifecycle {
    ignore_changes = [
      metadata["startup-script"],
      boot_disk,
      attached_disk,
      id,
    ]
  }
}

# Copy-paste your Terraform configurations here.
#
# For large Terraform configs, please use a service like Dropbox and share a link to the ZIP file.
# For security, you can also encrypt the files using our GPG public key:
#    https://www.hashicorp.com/security
#
# If reproducing the bug involves modifying the config file (e.g., apply a config,
# change a value, apply the config again, see the bug), then please include both:
# * the version of the config before the change, and
# * the version of the config after the change.

Debug Output

Error: "boot_disk.0.disk_encryption_key_raw": conflicts with boot_disk.0.kms_key_self_link

  on .terraform/modules/app-demo_gcp_ams.atom_a/main.tf line 78, in resource "google_compute_instance" "this":
  78: resource "google_compute_instance" "this" {



Error: "boot_disk.0.disk_encryption_key_sha256": this field cannot be set

  on .terraform/modules/app-demo_gcp_ams.atom_a/main.tf line 78, in resource "google_compute_instance" "this":
  78: resource "google_compute_instance" "this" {



Error: "boot_disk.0.kms_key_self_link": conflicts with boot_disk.0.disk_encryption_key_raw

  on .terraform/modules/app-demo_gcp_ams.atom_a/main.tf line 78, in resource "google_compute_instance" "this":
  78: resource "google_compute_instance" "this" {

Note, that in tf 12 or 13 the provider segfaults instead of giving an error.

Panic Output

Expected Behavior

Trying to do a
terraform apply -var-file=b1-scratch.tfvars in a green field situation works fine. Doing a destroy works fine.
Doing another terraform apply to check for changes (there are none) causes that error to happen every time.

Should have told me there were no changes (or if there were, showed me)

Actual Behavior

The error above.

Steps to Reproduce

  1. terraform apply -var-file=b1-scratch.tfvars
  2. terraform apply -var-file=b1-scratch.tfvars

Note, in the code, the reference to the csek key yields a string that's the base64 encoded raw key, since the provider has no way of accepting the wrapped key. At no point do we provide a kms key as you can see, or try to change the sha256 value.

The raw string, in a different terraform file entirely was decrypted by using a kms key to decrypt the value in the tf file, then the plaintext version was written into state, where it's being referenced here.

Important Factoids

References

  • #0000
@ghost ghost added the bug label Dec 4, 2020
@edwardmedia edwardmedia self-assigned this Dec 4, 2020
@edwardmedia
Copy link
Contributor

@ronjarrell can you provide all data for the variables and steps how to repro the issue?

@mhenderson-t2
Copy link

I had this same issue. I'm not able to provide a complete code sample due to the block being way too large to filter down to a useful subset. However, I rolled Terraform back to 0.13.5 and the issue went away.

My disk was defined like this:

  boot_disk {
    initialize_params {
      type = "pd-balanced"
      image = data.google_compute_image.my_image.self_link
    }
  }

That's it - no encryption specified at all.

@ghost ghost removed the waiting-response label Dec 8, 2020
@edwardmedia
Copy link
Contributor

@ronjarrell Is your case the same as Mark's? If yours is still an issue, can you provide data so I can repro it?

@edwardmedia
Copy link
Contributor

@ronjarrell has your issue been resolved?

@ronjarrell
Copy link
Author

No, I;ve tried it under, as i said, 12.29, 13.5 and 14, and the provider fails each time. I segfaults in 12 15, but got an actuall error message from 14.0 about the message variables. Note it's complaining I'm using kms keys I'm not using and and that I'm writing to the sha256 output value, which Im not.

Some of those values are generated a cross a few thousand likes of distributed code in different modules and state files.
Here's a state show of one particular instance that gets the error on a repeat apply, I've only elided out the user-data startup script. I think nothing else is particular sensitive.

resource "google_compute_instance" "this" {
    allow_stopping_for_update = true
    can_ip_forward            = false
    cpu_platform              = "Intel Skylake"
    current_status            = "RUNNING"
    deletion_protection       = false
    enable_display            = false
    guest_accelerator         = []
    id                        = "projects/XX/zones/europe-west4-a/instances/app-demo-1-a-ams-g-XX"
    instance_id               = "8177554915564758281"
    label_fingerprint         = "GlylWb2FaSg="
    labels                    = {
        "billing-use"    = "rnd"
        "cost_group"     = "b1fs"
        "cost_lifecycle" = "production"
        "cost_product"   = "infra"
        "cost_purpose"   = "demo"
        "environment"    = "XX"
        "hostname"       = "app-demo-1-a-ams-g-XX_node_a-ams-g-XX_int_b1fs"
        "managed"        = "true"
        "role"           = "app-demo"
    }
    machine_type              = "g1-small"
    metadata                  = {
        "block-project-ssh-keys" = "true"
        "enable-oslogin"         = "true"
        "user-data"              = <<-EOT
            #!/bin/bash
           
            echo "@@@@@@@@@@@@@@@@ System Startup Script Finished @@@@@@@@@@@@@@@@@"
        EOT
    }
    metadata_fingerprint      = "byXGZQud16U="
    name                      = "app-demo-1-a-ams-g-XX"
    project                   = "XX"
    self_link                 = "https://www.googleapis.com/compute/v1/projects/XX/zones/europe-west4-a/instances/app-demo-1-a-ams-g-XX"
    tags_fingerprint          = "42WmSpB8rSM="
    zone                      = "europe-west4-a"

    boot_disk {
        auto_delete                = true
        device_name                = "persistent-disk-0"
        disk_encryption_key_raw    = (sensitive value)
        disk_encryption_key_sha256 = "4Vr7NMN8VgvBIeD69vbivDt1kXyUjZYbeoDaKYNoeTA="
        mode                       = "READ_WRITE"
        source                     = "https://www.googleapis.com/compute/v1/projects/XX/zones/europe-west4-a/disks/app-demo-1-a-ams-g-XX"

        initialize_params {
            image  = "https://www.googleapis.com/compute/v1/projects/XX/global/images/b1-centos7-cis-1603764417"
            labels = {}
            size   = 20
            type   = "pd-standard"
        }
    }

    network_interface {
        name               = "nic0"
        network            = "https://www.googleapis.com/compute/v1/projects/XX/global/networks/XX-private"
        network_ip         = "10.x.x.x
        subnetwork         = "https://www.googleapis.com/compute/v1/projects/XX/regions/europe-west4/subnetworks/XX-europe-west4-atoma-private-subnet"
        subnetwork_project = "XX"
    }

    scheduling {
        automatic_restart   = true
        on_host_maintenance = "MIGRATE"
        preemptible         = false
    }

    shielded_instance_config {
        enable_integrity_monitoring = true
        enable_secure_boot          = false
        enable_vtpm                 = true
    }
}

@ghost ghost removed waiting-response labels Dec 9, 2020
@ronjarrell
Copy link
Author

The good news is I just tried it with 0.14.2, and while were holding off temporarily on going to 14, it seems to work ok in .2. with google 3.50.0_x5. That's the doing 2 apply's in a row test. Then I went in and changed the boot disk size from 20 to 21 and ran it, and segfaulted.

2020/12/09 15:02:23 [TRACE] dag/walk: upstream of "meta.count-boundary (EachMode fixup)" errored, so skipping
2020/12/09 15:02:23 [TRACE] dag/walk: upstream of "root" errored, so skipping
2020-12-09T15:02:23.265-0500 [DEBUG] plugin: plugin exited
panic: runtime error: invalid memory address or nil pointer dereference
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x16ad25b]
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: 
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: goroutine 721 [running]:
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: github.com/hashicorp/terraform-plugin-sdk/v2/internal/helper/plugin.(*GRPCProviderServer).PlanResourceChange(0xc000119ea0, 0x3df2be0, 0xc00130fb40, 0xc0005d2700, 0xc000119ea0, 0xc000119eb0, 0x394fe98)
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: 	/opt/teamcity-agent/work/5d79fe75d4460a2f/pkg/mod/github.com/hashicorp/terraform-plugin-sdk/[email protected]/internal/helper/plugin/grpc_provider.go:796 +0x109b
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: github.com/hashicorp/terraform-plugin-sdk/v2/internal/tfplugin5._Provider_PlanResourceChange_Handler.func1(0x3df2be0, 0xc00130fb40, 0x3746280, 0xc0005d2700, 0xc00130fb40, 0x33fd6e0, 0xc0014cd601, 0xc00102cd80)
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: 	/opt/teamcity-agent/work/5d79fe75d4460a2f/pkg/mod/github.com/hashicorp/terraform-plugin-sdk/[email protected]/internal/tfplugin5/tfplugin5.pb.go:3294 +0x86
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: github.com/hashicorp/terraform-plugin-sdk/v2/plugin.Serve.func3.1(0x3df2ca0, 0xc0004c9bf0, 0x3746280, 0xc0005d2700, 0xc00102cd60, 0xc00102cd80, 0xc001217ba0, 0x108fb78, 0x35a5e60, 0xc0004c9bf0)
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: 	/opt/teamcity-agent/work/5d79fe75d4460a2f/pkg/mod/github.com/hashicorp/terraform-plugin-sdk/[email protected]/plugin/serve.go:76 +0x87
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: github.com/hashicorp/terraform-plugin-sdk/v2/internal/tfplugin5._Provider_PlanResourceChange_Handler(0x37a04c0, 0xc000119ea0, 0x3df2ca0, 0xc0004c9bf0, 0xc0014cd680, 0xc000da2520, 0x3df2ca0, 0xc0004c9bf0, 0xc001058000, 0x2f04)
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: 	/opt/teamcity-agent/work/5d79fe75d4460a2f/pkg/mod/github.com/hashicorp/terraform-plugin-sdk/[email protected]/internal/tfplugin5/tfplugin5.pb.go:3296 +0x14b
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: google.golang.org/grpc.(*Server).processUnaryRPC(0xc0001521c0, 0x3e2c640, 0xc000185500, 0xc000b7a200, 0xc0006151a0, 0x4f5a408, 0x0, 0x0, 0x0)
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: 	/opt/teamcity-agent/work/5d79fe75d4460a2f/pkg/mod/google.golang.org/[email protected]/server.go:1180 +0x50a
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: google.golang.org/grpc.(*Server).handleStream(0xc0001521c0, 0x3e2c640, 0xc000185500, 0xc000b7a200, 0x0)
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: 	/opt/teamcity-agent/work/5d79fe75d4460a2f/pkg/mod/google.golang.org/[email protected]/server.go:1503 +0xcfd
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc0000440e0, 0xc0001521c0, 0x3e2c640, 0xc000185500, 0xc000b7a200)
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: 	/opt/teamcity-agent/work/5d79fe75d4460a2f/pkg/mod/google.golang.org/[email protected]/server.go:843 +0xa1
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: created by google.golang.org/grpc.(*Server).serveStreams.func1
2020-12-09T15:02:23.187-0500 [DEBUG] plugin.terraform-provider-google_v3.50.0_x5: 	/opt/teamcity-agent/work/5d79fe75d4460a2f/pkg/mod/google.golang.org/[email protected]/server.go:841 +0x204

@edwardmedia
Copy link
Contributor

@ronjarrell can you post the full debug log?

@edwardmedia
Copy link
Contributor

@ronjarrell closing this issue as we need to be able to repro the issue. Please feel free to reopen it once you are able to help repro

@ronjarrell
Copy link
Author

Ok, so, this is affecting most of our jobs, and is completely replicable on our end, but other than a guess that in involves the KMS key logic (because that's the only runs it blows up on) I'm at a loss to isolate it. I can't upload you the hundreds of lines of modules and submodules that our environment uses, can I get some suggestions for narrowing it down?

@ghost ghost removed the waiting-response label Dec 21, 2020
@ronjarrell
Copy link
Author

Verified it still fails with 14.3 btw.

@ronjarrell
Copy link
Author

@edwardmedia

@ronjarrell
Copy link
Author

https://gist.github.com/ronjarrell/c755419d32b13b6b3c0a7ddf0e6d9b67
Latest crash. updated provider to 3.51, tf to 14.3. Works fine on a greenfield apply, crashes on a second apply with or without changes, works fine for a destroy.

@ronjarrell
Copy link
Author

Also, tried doing a state list, then doing an apply -target= each item on that list. THey all work great except the compute instances - that consistently crashes. But if I do a state show on one of the instances that crash, copy that to a tf file, make some slight changes to it to get rid of immutable variables, I can run apply to my hearts content and never fail.

@ronjarrell
Copy link
Author

@edwardmedia Can we reopen this before the bot locks it and we have to start over... I'm having serious production issues every time I touch an instance. We crash terraform 10-15 times a day now.

@rileykarson
Copy link
Collaborator

@slevenick: Mind taking a look at this?

@slevenick
Copy link
Collaborator

@ronjarrell so this is a tricky issue. It's pretty difficult to reproduce (I have not been able to do it) and it changes between versions of Terraform that you are running. That makes me think that there is something deeper than a provider issue going on here.

The debug output you reported in the initial issue points to a config being provided with conflicting disk_encryption_key_raw and kms_key_self_link fields, as well as disk_encryption_key_sha256 being specified when it should not be. But the config that you provided does not include any conflicts, or the sha256 field.

The concerning part for me is that enforcement on conflicting fields and specifying the sha256 is not done within the
provider, but happens in Terraform core/SDK itself.

The crash you started seeing in 0.14 again points to a potential issue in the SDK as all of the lines in the stack trace are from the SDK library rather than any code in this repository.

I have a couple of ideas:

  1. Something strange is happening between runs of terraform apply due to the variables you are using. You mentioned that copying the actual values from state makes terraform apply to work successfully. It looks like you are provisioning a large environment with a lot of resources, variables and modules. Is it possible that something is changing between runs of terraform apply within the variables themselves? Can you attempt to reproduce this using variables in a smaller context so that you can share the variables you are running Terraform with?

  2. Something strange is happening within the SDK because of the use of variables. I'm not sure why or how this would happen, but the segfault you shared points to something going wrong in the SDK itself. I'd suggest sharing the stack trace and some information about what you are seeing here: https://github.com/hashicorp/terraform-plugin-sdk/issues

@paddycarver
Copy link
Contributor

Hey y'all, my ears were burning. :)

I don't have a lot to add (yet--I'm still poking at this) but It appears the crash is eerily similar to hashicorp/terraform-plugin-sdk#548. I can't say whether that's the root of the problem or just a confounding symptom and a red herring, but it seems related. I haven't fully gotten to the bottom of that issue yet, but I'm still digging. It would be really helpful if I could get a reproduction against hashicorp/terraform-plugin-sdk#686 and let me know any and all of the new [WARN] Field "foo" was null, not modifying its NewExtra log lines that appear. (The change should prevent the crash, and instead let us know which fields are null.

So far all I've got is this seems related to the use of CustomizeDiff, and the crash happens because an attribute is unexpectedly nil. I'll keep digging and see if I can come up with a more useful explanation of what's happening and why.

@ronjarrell
Copy link
Author

I'd be happy to run a custom provider to test this. I can replicate it by running twice in a row, with no one else changing anything, so it's not stuff changing between runs.

paddycarver added a commit to hashicorp/terraform-plugin-sdk that referenced this issue Jan 27, 2021
Previously, we'd assign the result of finalizeDiff to the resource diff
without checking its return. This caused problems because a "finalized"
diff for any given attribute could, in fact, be no diff at all. Which we
represent as `nil`. But some consumers of the resource diff expect every
attribute in the map to be non-`nil`, and so crash on these attributes
that have diff entries but no diffs. See for example
hashicorp/terraform-provider-google#7934, which would crash when a
config had an explicit empty string as the value for a field that a
CustomizeDiff function set to ForceNew. Technically, there was a diff,
but finalizeDiff decided it wasn't a "real" diff, because the SDK still
interprets empty strings as "unset" for computed fields to align with
legacy behavior. But that meant a nil in the resource's map of attribute
diffs, which then was dereferenced when populating the response to
PlanResourceChange. This caused a crash.

This commit fixes that issue by updating all our usages of finalizeDiff
to check for a nil diff _before_ writing it to the resource's map of
attribute diffs. This was easier than tracking down all the usages of a
ResourceAttributeDiff and trying to ensure they were ignoring nil
values.
@slevenick
Copy link
Collaborator

Looks like this will be solved by hashicorp/terraform-plugin-sdk#686 (comment)

Will require a SDK upgrade

paddycarver added a commit to hashicorp/terraform-plugin-sdk that referenced this issue Jan 27, 2021
Previously, we'd assign the result of finalizeDiff to the resource diff
without checking its return. This caused problems because a "finalized"
diff for any given attribute could, in fact, be no diff at all. Which we
represent as `nil`. But some consumers of the resource diff expect every
attribute in the map to be non-`nil`, and so crash on these attributes
that have diff entries but no diffs. See for example
hashicorp/terraform-provider-google#7934, which would crash when a
config had an explicit empty string as the value for a field that a
CustomizeDiff function set to ForceNew. Technically, there was a diff,
but finalizeDiff decided it wasn't a "real" diff, because the SDK still
interprets empty strings as "unset" for computed fields to align with
legacy behavior. But that meant a nil in the resource's map of attribute
diffs, which then was dereferenced when populating the response to
PlanResourceChange. This caused a crash.

This commit fixes that issue by updating all our usages of finalizeDiff
to check for a nil diff _before_ writing it to the resource's map of
attribute diffs. This was easier than tracking down all the usages of a
ResourceAttributeDiff and trying to ensure they were ignoring nil
values.
@paddycarver
Copy link
Contributor

Sorry for the delayed response here, was waiting for confirmation. Yes, we got to the bottom of this, and it turns out a very specific confluence of events was needed to provoke this, which is why we didn't see it until now. I haven't confirmed this myself yet, but any google_compute_instance with a network_interface.network_ip set explicitly to "" in the config would provoke this crash after the initial apply.

hashicorp/terraform-plugin-sdk#686 (comment) has a more in-depth write-up of what's going on and why, and how we're fixing it.

Thanks for the patience on this, it was a bit of a dig to isolate and trace the problem!

@paddycarver
Copy link
Contributor

Version 2.4.2 of the SDK is out and should resolve this issue.

@ronjarrell
Copy link
Author

Google provider 3.55, which is using the new sdk, resolves our issue.

@ghost
Copy link

ghost commented Mar 4, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost ghost locked as resolved and limited conversation to collaborators Mar 4, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants