Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

google_container_cluster.node_config subfields can't be updated in-place #19225

Closed
wyardley opened this issue Aug 22, 2024 · 34 comments · Fixed by GoogleCloudPlatform/magic-modules#12014, #20038 or hashicorp/terraform-provider-google-beta#8522

Comments

@wyardley
Copy link

wyardley commented Aug 22, 2024

When defined as part of the default nodepool (via node_config.kubelet_config within a google_container_cluster resource, cpu_manager_policy (and, potentially, other kubelet_config settings) will not get updated properly if changed (note: they should function correctly when a cluster / nodepool are created initially.

This seems likely to be because the code here and here is not in the path at all when the resource is invoked this way.

Note: this may affect other settings within node_config; it's just more urgent with this specific field because it's Required, (though maybe it shouldn't need to be)?

Edit: Also, I think there are contexts where cpu_manager_policy can / should be unset? I'm not sure exactly why the provider currently lists it as required to be set when kubelet_config is. It seems like there are cases where it's valid for it to be unset, and in fact, the provider doesn't seem to handle setting it properly in any of the cases / places I tested it. This is now fixed as part of #11572

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to a user, that user is claiming responsibility for the issue.
  • Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally to expedite investigation and resolution of this issue.

Terraform Version & Provider Version(s)

OpenTofu v1.7.3
on darwin_arm64
+ provider registry.opentofu.org/hashicorp/google v5.41.0

Note: should be reproducible with regular Terraform and latest provider version as well.

Affected Resource(s)

google_container_cluster

Terraform Configuration

resource "google_container_cluster" "test_insecure_kubelet_readonly_port" {
  name               = "test-insecure-kubelet-readonly-port"
  location           = "us-central1-f"
  initial_node_count = 1

  node_config {
    kubelet_config {
      # Note: create the cluster first, and then try to update the value to this
      cpu_manager_policy = "static"
    }
  }
  deletion_protection = false
  network             = "default"
  subnetwork          = "default"
}

Debug Output

2024-08-22T11:02:24.603-0700 [WARN]  Provider "provider[\"registry.opentofu.org/hashicorp/google\"]" produced an unexpected new value for google_container_cluster.test_insecure_kubelet_readonly_port, but we are tolerating it because it is using the legacy plugin SDK.
    The following problems may be the cause of any confusing errors from downstream operations:
      - .node_config[0].kubelet_config[0].cpu_manager_policy: was cty.StringVal("static"), but now cty.StringVal("")

Expected Behavior

  • When the value is set to "static", Terraform should update the settings on the default node-pool in place to the expected value.
  • When the value is set to "none" or unset, a permadiff should not be observed when the value is not set at the API level

Actual Behavior

  • A permadiff is observed.
  • As best I can tell, Terraform does not actually try to update the node pool settings when the value is changed from default for the default pool.

Note: the undocumented, but allowed, value "" resolves the permadiff, presumably because the API is not returning it at all vs. returning the value "none" with default settings, and because the provider doesn't seem to actually be trying to update the default nodepool settings when the setting is changed after the resource is initially created.

Steps to reproduce

  1. terraform apply

Important Factoids

See references below for more details

References

GoogleCloudPlatform/magic-modules#11272
#15767

b/361634104

@wyardley wyardley added the bug label Aug 22, 2024
@github-actions github-actions bot added forward/review In review; remove label to forward service/container labels Aug 22, 2024
@trodge trodge removed the forward/review In review; remove label to forward label Aug 22, 2024
@wyardley wyardley changed the title Inconsistent behavior (and inability to set) cpu_manager_policy Inconsistent behavior (and inability to set) cpu_manager_policy on default nodepool Aug 22, 2024
@wyardley
Copy link
Author

wyardley commented Aug 27, 2024

So I think what needs to happen is roughly something like

"when anything in this block changes, call nodePoolUpdate() (https://github.com/GoogleCloudPlatform/magic-modules/blob/61f619ede9c1e30516810a5d6799fbfe29cd91da/mmv1/third_party/terraform/services/container/resource_container_node_pool.go.erb#L1349C6-L1349C20) for default-pool. After the operation is complete, it may be sufficient (at least for some of the attributes) to infer that it was successful via the deprecated values directly in nodeConfig.nodeKubeletConfig, but I'm guessing it might be best to actually look at default-pool's settings directly for some things?

It may be necessary to update the cluster for some settings and the nodepool for others.

@wyardley
Copy link
Author

GoogleCloudPlatform/magic-modules@db731f7

POC fix for insecure_kubelet_readonly_port_enabled only, but probably a broader fix can be made for all the items within node_config.kubelet_config.

@wyardley
Copy link
Author

GoogleCloudPlatform/magic-modules@efb71a9
Seems like it was sort of arbitrary that this field was marked Required at all

wyardley added a commit to wyardley/magic-modules that referenced this issue Aug 29, 2024
Partial fix for hashicorp/terraform-provider-google#19225

This should resolve some confusing behavior with `cpu_manager_policy`

* It frequently will show a permadrift when it can't be set
* It also doesn't seem to accept the documented value of "none" as an
  empty value, though the previously undocumented empty string (`""`)
  seems to work.

This doesn't resolve all of the issues, but resolves other issues where
it must be set where it's not actually needed (for example, if
`insecure_kubelet_readonly_port_enabled` is set).

It appears that it was marked as `Required` somewhat arbitrarily, and
it's also possible that some of what's in place is tied to an API level
bug that may have since been resolved.
@melinath melinath changed the title Inconsistent behavior (and inability to set) cpu_manager_policy on default nodepool google_container_cluster.node_config subfields can't be updated in-place Aug 29, 2024
@melinath
Copy link
Collaborator

Updated the title based on the discussion above and at GoogleCloudPlatform/magic-modules#11272 (comment) - the main thing to fix here is going to be that update-in-place is broken for most subfields of node_config. (Fixing the cpu_manager_policy issue is sort of separate but happy to track it here since there's already a PR open for it.)

@wyardley
Copy link
Author

Thanks, had the same thought about updating the title to be more clear. When I created this, it wasn't quite as clear how related these issues all were, but it's becoming marginally clearer now.

@wyardley
Copy link
Author

wyardley commented Sep 4, 2024

GoogleCloudPlatform/magic-modules#11553 is another one within node_config that probably needs to be fixed in here now that force replacement was removed.

@melinath

This comment was marked as outdated.

@wyardley

This comment was marked as outdated.

wyardley added a commit to wyardley/magic-modules that referenced this issue Sep 11, 2024
Partial fix for hashicorp/terraform-provider-google#19225

This should resolve some confusing behavior with `cpu_manager_policy`

* It frequently will show a permadrift when it can't be set
* It also doesn't seem to accept the documented value of "none" as an
  empty value, though the previously undocumented empty string (`""`)
  seems to work.

This doesn't resolve all of the issues, but resolves other issues where
it must be set where it's not actually needed (for example, if
`insecure_kubelet_readonly_port_enabled` is set).

It appears that it was marked as `Required` somewhat arbitrarily, and
it's also possible that some of what's in place is tied to an API level
bug that may have since been resolved.
@wyardley
Copy link
Author

@melinath @rileykarson may be best for someone from Google to take this on, but if there's clear guidance about whether (and how) the node pool update stuff can be DRYed up / shared between the node_pool and container_cluster resources, I might be able to take a stab at it.

@melinath
Copy link
Collaborator

@wyardley I don't think we have specific guidance about it, but you're welcome to take a stab. As long as it works and is easy to understand, there's a good chance we'll accept the change.

Heads up that we're planning to switch the core engine from Ruby to Go in a couple weeks; I think that wouldn't impact this ticket too badly since it's working with handwritten files. The .erb formatting will change to go template formatting, but it might not be horrible to port over? Or you could wait until after the port lands.

@wyardley
Copy link
Author

I guess the big thing I was looking for ideas on is whether the code can be shared between the two resources (and if so, what files should it be in). And yeah, I can work around the cutover one way or another.

@rileykarson
Copy link
Collaborator

I also have GoogleCloudPlatform/magic-modules#11978 out which fixes the specific cause of that error (but still doesn't accurately model the API- we send the message we expect, and still have a permadiff).

wyardley added a commit to wyardley/magic-modules that referenced this issue Oct 15, 2024
This sets the stage for consolidating the logic for updating nodepools
between the `container_cluster` and `container_node_pool` resources, by
creating a new `nodePoolNodeConfigUpdate()` function.

Part of hashicorp/terraform-provider-google#19225
wyardley added a commit to wyardley/magic-modules that referenced this issue Oct 16, 2024
This resolves an issue where many subfields of `node_config` in a
cluster (which affects the default node-pool "default-pool" when
`remove_default_node_pool` is not set to `false`) don't support updates
properly.

Fixes hashicorp/terraform-provider-google#19225

Followup to GoogleCloudPlatform#11826 where the code used by the regular node pool update
code was broken out.
@wyardley
Copy link
Author

I've added a change in the linked PR that I think will fix this for all remaining cases, and at that point, any additional DRYing up within that shared function could be totally optional and could be handled by anyone who felt like taking it on in the future.

@wyardley
Copy link
Author

Also just did a quick verification that (even without these changes) when node_pool {} (within google_container_cluster) is used to define the first node pool vs the implicit settings (e.g., node_config), which I just was reminded is also a way this can work 🙃, updates do already seem to work as expected.

wyardley added a commit to wyardley/magic-modules that referenced this issue Oct 19, 2024
This resolves an issue where many subfields of `node_config` in a
cluster (which affects the default node-pool "default-pool" when
`remove_default_node_pool` is not set to `false`) don't support updates
properly.

Fixes hashicorp/terraform-provider-google#19225

Followup to GoogleCloudPlatform#11826 where the code used by the regular node pool update
code was broken out.
@wyardley
Copy link
Author

wyardley commented Oct 19, 2024

While working on the fixes, I had a thought: at some point (not as part of this set of changes), I imagine it might be possible to build a single nodeConfig object for some or all of the changed fields and do a single update, just logging the top level keys of all the changed fields? If that works, it would help speed updates up significantly, since all the updates would be accomplished in one pass vs. serially, but maybe there's a reason it doesn't already work that way?

@wyardley
Copy link
Author

For anyone following, one thing I discovered is that labels, resource_manager_tags and workload_metadata_config would be updateable, but only because they were set to force recreation. Presumably the intent was for other non-updateable fields to be added to this list as well, and people just didn't realize they needed to do so when those fields were added to node_config.

Since I'm pretty sure these should be allowed to be updated in-place in the default node pool when defined in node_config, just as they could be when defined externally or in node_pool {}, I simplified the code to remove that bit, and updated the changelog notes to reflect both a fix (of the fields that were missing entirely) and an enhancement (allowing those fields to be updated in place).

wyardley added a commit to wyardley/magic-modules that referenced this issue Oct 21, 2024
This resolves an issue where many subfields of `node_config` in a
cluster (which affects the default node-pool "default-pool" when
`remove_default_node_pool` is not set to `false`) don't support updates
properly.

Fixes hashicorp/terraform-provider-google#19225

Followup to GoogleCloudPlatform#11826 where the code used by the regular node pool update
code was broken out.
wyardley added a commit to wyardley/magic-modules that referenced this issue Oct 23, 2024
This resolves an issue where many subfields of `node_config` in a
cluster (which affects the default node-pool "default-pool" when
`remove_default_node_pool` is not set to `false`) don't support updates
properly.

Fixes hashicorp/terraform-provider-google#19225

Followup to GoogleCloudPlatform#11826 where the code used by the regular node pool update
code was broken out.
wyardley added a commit to wyardley/magic-modules that referenced this issue Oct 24, 2024
This resolves an issue where many subfields of `node_config` in a
cluster (which affects the default node-pool "default-pool" when
`remove_default_node_pool` is not set to `false`) don't support updates
properly.

Fixes hashicorp/terraform-provider-google#19225

Followup to GoogleCloudPlatform#11826 where the code used by the regular node pool update
code was broken out.
wyardley added a commit to wyardley/magic-modules that referenced this issue Oct 25, 2024
This resolves an issue where many subfields of `node_config` in a
cluster (which affects the default node-pool "default-pool" when
`remove_default_node_pool` is not set to `false`) don't support updates
properly.

Fixes hashicorp/terraform-provider-google#19225

Followup to GoogleCloudPlatform#11826 where the code used by the regular node pool update
code was broken out.
wyardley added a commit to wyardley/magic-modules that referenced this issue Oct 25, 2024
It appears that in-place updates were never implemented for
`node_config.containerd_config`, when contained within
either `google_container_cluster` or `google_container_node_pool`

Fixes hashicorp/terraform-provider-google#19056
Related to hashicorp/terraform-provider-google#19225

Signed-off-by: William Yardley <[email protected]>
wyardley added a commit to wyardley/magic-modules that referenced this issue Oct 25, 2024
It appears that in-place updates were never implemented for
`node_config.containerd_config`, when contained within
either `google_container_cluster` or `google_container_node_pool`

Fixes hashicorp/terraform-provider-google#19056
Related to hashicorp/terraform-provider-google#19225

Signed-off-by: William Yardley <[email protected]>
wyardley added a commit to wyardley/magic-modules that referenced this issue Oct 26, 2024
It appears that in-place updates were never implemented for
`node_config.containerd_config`, when contained within
either `google_container_cluster` or `google_container_node_pool`

Fixes hashicorp/terraform-provider-google#19056
Related to hashicorp/terraform-provider-google#19225

Signed-off-by: William Yardley <[email protected]>
Copy link

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.