Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unknown values should not block successful planning #30937

Open
apparentlymart opened this issue Apr 26, 2022 · 35 comments
Open

Unknown values should not block successful planning #30937

apparentlymart opened this issue Apr 26, 2022 · 35 comments
Labels
config enhancement unknown-values Issues related to Terraform's treatment of unknown values

Comments

@apparentlymart
Copy link
Contributor

apparentlymart commented Apr 26, 2022

The idea of "unknown values" is a crucial part of how Terraform implements planning as a separate step from applying.

An unknown value is a placeholder for a value that Terraform (most often, a Terraform provider) cannot know until the apply step. Unknown values allow Terraform to still keep track of type information where possible, even if the exact values aren't known, and allow Terraform to be explicit in its proposed plan output about which values it can predict and which values it cannot.

Internally, Terraform performs checks to ensure that the final arguments for a resource instance at the apply step conform to the arguments previously shown in the plan: known values must remain exactly equal, while unknown values must be replaced by known values matching the unknown value's type constraint. Through this mechanism, Terraform aims to promise that the apply phase will use the same settings as were used during planning, or Terraform will return an error explaining that it could not.

(For a longer and deeper overview of what unknown values represent and how Terraform treats them, see my blog post Unknown Values: The Secret to Terraform Plan.)

The design goal for unknown values is that Terraform should always be able to produce some sort of plan, even if parts of it are not yet known, and then it's up to the user to review the plan and decide either to accept the risk that the unknown values might not be what's expected, or to apply changes from a smaller part of the configuration (e.g. using -target) in order to learn more final values and thus produce a plan with fewer unknowns.

However, Terraform currently falls short of that goal in a couple different situations:

  • The Terraform language runtime does not allow an unknown value to be assigned to either of the two resource repetition meta-arguments, count and for_each.

    In that situation, Terraform cannot even predict how many instances of a resource are being declared, and it isn't clear how exactly Terraform should explain that degenenerate situation in a plan and so currently Terraform gives up and returns an error:

    │ Error: Invalid for_each argument
    │
    │ ...
    │
    │ The "for_each" value depends on resource attributes that cannot
    │ be determined until apply, so Terraform cannot predict how many
    │ instances will be created. To work around this, use the -target
    │ argument to first apply only the resources that the for_each
    │ depends on.
    
    │ Error: Invalid count argument
    │
    │ ...
    │
    │ The "count" value depends on resource attributes that cannot be
    │ determined until apply, so Terraform cannot predict how many
    │ instances will be created. To work around this, use the -target
    │ argument to first apply only the resources that the count depends
    │ on.
    
  • If any unknown values appear in a provider block for configuring a provider, Terraform will pass those unknown values to the provider's "Configure" function.

    Although Terraform Core handles this in an arguably-reasonable way, we've never defined how exactly a provider ought to react to crucial arguments being unknown, and so existing providers tend to fail or behave strangely in that situation.

    For example, some providers (due to quirks of the old Terraform SDK) end up treating an unknown value the same as an unset value, causing the provider to try to connect to somewhere weird like a port on localhost.

    Providers built using the modern Provider Framework don't run into that particular malfunction, but it still isn't really clear what a provider ought to do when a crucial argument is unknown and so e.g. the AWS Cloud Control provider -- a flagship use of the new framework -- reacts to unknown provider arguments by returning an error, causing a similar effect as we see for count and for_each above.

Although the underlying causes for the errors in these two cases are different, they both lead to a similar problem: planning is blocked entirely by the resulting error and the user has to manually puzzle out how to either change the configuration to avoid the unknown values appearing in "the wrong places", or alternatively puzzle out what exactly to pass to -target to select a suitable subset of the configuration to cause the problematic values to be known in a subsequent untargeted plan.

Terraform should ideally treat unknown values in these locations in a similar way as it does elsewhere: it should successfully produce a plan which describes what's certain and is explicit about what isn't known yet. The user can then review that plan and decide whether to proceed.

Ideally in each situation where an unknown value appears there should be some clear feedback on what unknown value source it was originally derived from, so that in situations where the user doesn't feel comfortable proceeding without further information they can more easily determine how to use -target (or some other similar capabililty yet to be designed) to deal with only a subset of resources at first and thus create a more complete subsequent plan.


This issue is intended as a statement of a problem to be solved and not as a particular proposed solution to that problem. However, there are some specific questions for us to consider on the path to designing a solution:

  • Is it acceptable for Terraform to produce a plan which can't even say how many instances of a particular resource will be created?

    That's a line we've been loathe to cross so far because the difference between a couple instances and tens of instances can be quite an expensive bill, but the same could be said for other values that Terraform is okay with leaving unknown in the plan output, such as the "desired count" of an EC2 autoscaling group. Maybe it's okay as long as Terraform is explicit about it in the plan output?

    A particularly "interesting" case to consider here is if some instances of a resource already exist and then subsequent changes to the configuration cause the count or for_each to become retroactively unknown. In that case, the final result of count or for_each could mean that there should be more instances of the resource (create), fewer instances of the resource (destroy), or no change to the number of instances (no-op). I personally would feel uncomfortable applying a plan that can't say for certain whether it will destroy existing objects.

  • Conversely, is it acceptable for Terraform to automatically produce a plan which explicitly covers only a subset of the configuration, leaving the user to run terraform apply again to pick up where it left off?

    This was essence of the earlier proposal Partial/Progressive Configuration Changes #4149, which is now closed due to its age and decreasing relevance to modern Terraform. That proposal made the observation that, since we currently suggest folks work around unknown value errors by using -target, Terraform could effectively synthesize its own -target settings to carve out the maximum possible set of actions that can be taken without tripping over the two problematic situations above.

  • Should providers (probably with some help from the Plugin Framework) be permitted to return an entirely-unknown response to the UpgradeResourceState, ReadResource, ReadDataSource, and PlanResourceChange operations for situations where the provider isn't configured completely enough to even attempt these operations?

    These are the four operations that Terraform needs to be able to ask a partially-configured provider to perform. If we allow a provider to signal that it isn't configured enough to even try at those, what should Terraform Core do in order to proceed with that incomplete or stale information?

  • We most frequently encounter large numbers of unknown values when planning the initial creation of a configuration, when nothing at all exists yet. That is definitely the most common scenario where these problems arise, but a provider can potentially return unknown values even as part of an in-place update if that is the best representation of the remote API's behavior -- for example, perhaps one of the output attributes is derived from an updated argument in a way that the provider cannot predict or simulate.

    Do we need to take any extra care to deal with the situation where an unknown value cascades downstream from an updated or replaced resource instance?

    For example, if I've used an attribute from a vendor-specific Kubernetes cluster resource to provide an API URL to the hashicorp/kubernetes provider and the user changes the configuration of the cluster itself in a way that causes the API URL to change, how should Terraform and the Kubernetes provider react to the cluster URL being unknown even though there are existing objects bound to resources belonging to that provider which we will need to refresh and plan?

  • What sort of analysis would we need to implement in order to answer questions like "why is this value unknown?" and "what subset of actions could I take in order to make this value be known?"?.

@pdecat
Copy link
Contributor

pdecat commented Apr 27, 2022

Isn't that a typo?

but a provider can potentially return known values even as part of an in-place update

s/known/unknown

@BzSpi
Copy link

BzSpi commented Apr 27, 2022

Same typo here I think

If any known values appear in a provider block for configuring a provider, Terraform will pass those unknown values to the provider's "Configure" function.

s/known/unknown

@apparentlymart
Copy link
Contributor Author

Thanks for the corrections! I've updated the original comment to fix them.

@pdecat
Copy link
Contributor

pdecat commented Apr 27, 2022

Hi @apparentlymart, I believe you made a similar typo in your blog post :

Terraform deals with these by just making them return known values of the appropriate type during planning.

@apparentlymart
Copy link
Contributor Author

Apparently it's easy for my fingers to confuse those! I'll update my blog at some point, too. Thanks!

@Nuru
Copy link

Nuru commented May 11, 2022

@apparentlymart Is this why sort() causes plans to fail?

Oversimplified example. With var.a_list of type list(string)

count = length(var.a_list)

works, even when the values in the a_list are unknown, I understand why

count = length(compact(var.a_list))

fails, because compact causes the length of the list to depend on the values in it. Fair enough. What I don't understand is why

count = length(sort(var.a_list))

fails. The cardinality of the list is not changed by sorting it, so why should it break count?

@apparentlymart
Copy link
Contributor Author

apparentlymart commented May 11, 2022

I expect that the sort function is forced to return a wholly-unknown list because it cannot predict which of the input elements will be at which indices in order to return a known list with a mixture of known and unknown elements. In the language today there is no such thing as an unknown list with a known length -- it's all bundled together into one concept -- and so the length is lost by the time the value reaches the length function.

I imagine we might be able to change sort so that it returns a known list of a particular number of values that are all unknown instead, thus creating the effect of an unknown list with a known length, but I'm not sure about the compatibility consequences of doing so. If you'd like to discuss that more, please open a new enhancement request issue for it and we can discuss the details of how it might work over in that other issue.

@KamalAman
Copy link

KamalAman commented May 13, 2022

It would be great to have providers lazy load if a value it is waiting on to be initialized. This way you can create a cluster and provisioning the kuberentes provider can work in the same apply in the same workspace. #1102

I think this issue also effects for key, value loops, but surprisingly does not effect for value loops.

For example first apply when creating a many tfe_workspace.workspace using a for_count results in a bunch of partial values in this trivial example

locals {
   ids = [
      for key, workspace in tfe_workspace.workspace : "${key}-${workspace.id}"
    ]
}

where as works just fine

locals {
   ids = [
      for workspace in tfe_workspace.workspace : workspace.id
    ]
}

@jbardin
Copy link
Member

jbardin commented May 17, 2022

I wanted to add a note here which may or may not be related depending on what solutions we find.

There is one place where unknown values are allowed, but can drastically alter the behavior of the plan -- resource attributes which force replacement. Feeding an unknown into an attribute of this type always forces replacement, which itself may cause more values to become unknown, cascading replacements through the config even when the end result may have little to no differences from the initial state.

Technically a solution to the problem posed here would generally have no negative effect on this type of resource replacement, but we need to keep this in mind if we start allowing users to more haphazardly throw around unknown values. (A solution here may even help with this problem, but we can brainstorm about that and other possibilities in other threads)

@apparentlymart
Copy link
Contributor Author

Thanks for adding that, @jbardin. I think you're describing the situation where if a value of an existing argument becomes unknown later then we'll conservatively assume it will have a different value than it had before, which will at least cause the provider to plan an update but may also sometimes cause the provider to plan a replace instead, if it's in an argument that can't update in place.

The specific case where folks find that confusing / frustrating is where the new value happens to end up being the same as the old value, and so Terraform ends up needlessly updating the object to the values it already had, or replacing the object with one configured in the same way as the previous one.

This tends to be exacerbated today by the fact that our old SDK design made it awkward for providers to offer predictions of what values attributes would have during planning -- for example, to use the documented syntax for a particular kind of ARN to predict an AWS resource type's arn attribute during planning rather than applying -- and so it might also be good to try to make that more ergonomic in the new framework and encourage provider developers to make use of that functionality so that we aren't propagating unknown values that don't really need to be unknown, and instead reserving that mechanism only for truly unpredictable values like a dynamically-selected IP address, or similar.

@stevehipwell
Copy link

My two biggest requests in this area are as follows, I could have listed many more but these two and the count & for_each issues already mentioned cause us the majority of our pain.

  1. Changing a tag on an EKS cluster shouldn't cause any resource using an output to have unknown data unless it's actually using the tags. This is a general pattern we see far too often, illustrated via the aws_eks_cluster resource which tends to be the resource we see it happen most on.
  2. The Kubernetes provider kubernetes_manifest resource shouldn't need API access at plan time as it should be fine for it to be unknown which is to be expected if the cluster isn't created yet. If somehow the cluster exists and is connected during apply this should result in an error as it'd be an update and not an apply happening.

@apparentlymart
Copy link
Contributor Author

Hi @stevehipwell! Thanks for sharing those.

I'm familiar with the behavior you're describing for kubernetes_manifest, but I'm not familiar with the situation you're describing about aws_eks_cluster having data become unknown as a result of changing tags. Do you have a reference to an AWS provider issue describing that?

I'm asking because I'd like to see more details on what's going on so we can understand whether this is a situation where the provider must return unknown for some reason -- in which case this issue is relevant as a way to make Terraform handle that better -- or if the AWS provider should already have enough information to avoid producing an unknown value in that case, and so we could potentially fix that inside the AWS provider with Terraform as it exists today.

For all of the AWS provider resource types I'm familiar with, changing a tag can be achieved using an in-place update which should therefore avoid anything else becoming unknown, but I don't have any direct experience with the EKS resource types in particular, and there's possibly something different about the EKS API that means the AWS provider needs to treat it differently than tagging of other resource types.

@stevehipwell
Copy link

@apparentlymart I'll add something here next time I see this, but there is nothing special with an EKS cluster to make it need custom behaviour. We see it elsewhere but for us the EKS cluster is the resource at the top of the biggest graphs. We spent a significant engineering effort to change how we use Terraform to limit the number of incorrect "known at apply" fields showing up in the plan. Our code is much harder to understand and develop but our end users get a cleaner plan, that said there are still a significant number of cases where this will happen.

This has also reminded me of a number of other very obvious issues in this area.

  • Using a data aws_iam_policy_document with dynamic values coming from a deep enough graph will result in a lot of known at apply (we can no longer use these even with careful code structuring)
  • Some resources (such as ACL rules or SG rules) don't have a consistent representation and always show changes

@apparentlymart
Copy link
Contributor Author

IAM policy documents are a particularly tricky case because it ends up folding a bunch of separate identifiers into a single string, which must therefore end up being unknown if any part of it is unknown. The AWS provider could potentially help with this by generating ARNs during the planning step based on the syntax specified in the AWS documentation, but I'm aware that most of the time arn attributes end up being unknown today even though in principle the provider could predict what the final ARN string would be.

This issue is focused only on the situations where unknown values cause Terraform to return an error, and so the problem of "limiting the number of known after apply" in the plan is not in scope for this issue, but if you see situations where a provider is returning an unknown value even though it should have all of the information to return a known value I'd suggest opening an enhancement request with the relevant provider about that, because providers are already empowered to return final values during planning if they include the logic necessary to do so.

For the sake of this issue, we'd benefit from there being fewer situations where providers return unknown values, but it will never be possible to eliminate them entirely -- some values really are determined by the remote system only during the apply step -- and so this issue is about making sure that those situations cannot cause Terraform to fail planning altogether, so that it's always possible to make progress towards a converged result.

@stevehipwell
Copy link

stevehipwell commented May 23, 2022

@apparentlymart whenever we've opened issues about these issues they get closed, ignored, or both.

If you want another scenario where this error happens; if you create a new resource in the plan and compare an input value with the resource identifier to create a list of exclusive identifiers it will error even though it should be able to determine that the input value isn't equal. This usually happens in a multi module scenario but the following convoluted example should have the same behaviour.

variable "identifier" {
  type = string
}
resource "resource_type_a" "my_resource" {
}
resource "resource_type_b" "my_other_resource" {
  count = length(concat([resource_type_a.my_resource.id], var.identifier != "" && var.identifier != resource_type_a.my_resource.id ? [var.identifier] : []))
}

@apparentlymart
Copy link
Contributor Author

apparentlymart commented May 23, 2022

Thanks @stevehipwell.

For the purpose of the framing of this issue the answer in that case would be for count to accept the value being unknown rather than returning an error, rather than to make the result be known.

Making the Terraform language's analysis of unknown values more precise is also possible, but is something separate from what this issue is about. What you have there is essentially a Terraform language version of the situation we were originally describing, where something is unknown even though it could technically be known. Feel free to open a new issue in this repository about that case if you like, and we could see about addressing it separately from what this issue is covering, although I will say that at first glance it seems correct to me that this would return unknown if resource_type_a.my_resource.id isn't known, because the list could have either one or two elements depending on that result. If I've misunderstood what you meant, let's talk about that in a separate issue.

The intended scope of this issue is that there should be no situation where an unknown value leads to an error saying that the value must be known, regardless of why the unknown value is present. Situations where a value turns out unknown but needn't be are also good separate enhancement requests (either in this repository for the core language, or in provider repositories for providers), but are not in scope for this issue.

@stevehipwell
Copy link

Making the Terraform language's analysis of unknown values more precise is also possible, but is something separate from what this issue is about. What you have there is essentially a Terraform language version of the situation we were originally describing, where something is unknown even though it could technically be known. Feel free to open a new issue in this repository about that case if you like, and we could see about addressing it separately from what this issue is covering, although I will say that at first glance it seems correct to me that this would return unknown if resource_type_a.my_resource.id isn't known, because the list could have either one or two elements depending on that result. If I've misunderstood what you meant, let's talk about that in a separate issue.

I updated the example slightly to check that the var is set for completeness, but that doesn't change the behaviour. The point here is that if data about the types isn't lost (the identifier of a resource which doesn't exist can't be passed in) this should evaluate correctly, but if the type system falls back to the lowest common denominator (known or unknown) it's a tricky problem. Shouldn't the type system be evaluated and "fixed" if possible before changing the behaviour of the system as a whole? A potential outcome might be the lack of need to change the system, but a more likely outcome would be to change the scope or shift the context.

@apparentlymart
Copy link
Contributor Author

To be clear, I'm not saying that both aren't valid improvements to make, just that I don't want this issue to grow into an overly generalized "everything that's wrong with unknown values" situation which could never be resolved.

More than happy to discuss improvements to the precision of Terraform's handling of unknown values in other issues, but this one is intended to have a very specific "acceptance criteria" after which it'd be time to close it, which is that Terraform doesn't block planning when count or for_each are unknown, and that there is a clear story for how providers ought to handle unknown values in provider configuration (though actually making providers follow that recommendation would be a change to make elsewhere, of course, since providers are separate codebases).

@jtv8
Copy link

jtv8 commented Jun 13, 2022

It would be great to have providers lazy load if a value it is waiting on to be initialized. This way you can create a cluster and provisioning the kuberentes provider can work in the same apply in the same workspace. #1102

+1 for this - and/or to add an equivalent of depends_on for provider configuration blocks. This is my biggest pain point with Terraform so far - because dynamically configuring providers works perfectly well for creating resources but leaves you stuck with an unstable configuration if you have to replace, say, a Kubernetes cluster or PostgreSQL database later on.

In the case of Kubernetes, the behavior I'm looking for is pretty simple: if the cluster that the provider depends on needs to be created or replaced, we can safely assume that all of the resources that are defined using that provider will also need to be created from scratch, and the plan can accurately reflect this. There should be a way of passing a provider the known information that a planned operation will destroy all of its existing resources.

In the more complex case where we can't be certain if a resource will still exist or not if some of its provider's values are unknown, I'd be fine with Terraform expressing that uncertainty in the plan as "may be replaced", etc.

Having some uncertainty in the plan, and letting the end user decide whether to accept that uncertainty, is infinitely preferable over a perfectly valid configuration being impossible to apply!

If some users want to always have a 100% deterministic plan, what about letting them choose the following configurable behaviors?

  1. Deterministic: as at present, terraform plan and terraform apply will both fail if it's not unknown exactly how many resources will be created. More complex configurations will require selective apply or separate states.
  2. Lazy apply, with human-in-loop: terraform apply will lazily load providers as their configurations become known, and prompt the user to accept or refuse changes at each stage.
  3. Lazy apply, fully automated: as above, but presuming the user's approval for each stage.

@daviewales
Copy link

daviewales commented May 20, 2024

I have a module which optionally assigns a Key Vault role to a group, depending on whether a Key Vault ID is specified when the module is called:

resource "azurerm_role_assignment" "pipeline_key_vault" {
  # Only create the role assignment if the variable is not null
  count                = var.pipeline_key_vault_id != null ? 1 : 0
  scope                = var.pipeline_key_vault_id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azuread_group.pipeline.object_id
}

However, because the Key Vault does not yet exist, I get the error described above.
I'm not sure if there is an elegant solution to this.
The only workaround I can think of is to add an additional flag variable to pass to count, instead of using the non-existent Key Vault ID.

I guess the other option would be to do what the error message says, and -target the creation of the Key Vault, then re-run the plan. But this is not ideal, as it introduces manual steps into the plan if it ever needs to be re-deployed.

EDIT:

Using an object type variable is a good workaround, because you can check for the existence of the object instead of the Key Vault ID.

variable "pipeline_key_vault" {
  type = object({
    name = string
    id   = string
  })
  description = "Pipeline Key Vault object"
  default     = null
}

@MoussaBangre
Copy link

Any progress on that issue ?

@apparentlymart
Copy link
Contributor Author

Yes, there has been considerable progress on this issue. It's under active development.

@colans
Copy link

colans commented Sep 5, 2024

Is hashicorp/terraform-provider-kubernetes#1775 a side effect of this issue? Would it make sense to mark that one as a duplicate?

@crw
Copy link
Contributor

crw commented Sep 5, 2024

Yes, although they may have work to do to close that issue after the core functionality is finished here, and that issue is already marked upstream, so I think it is OK to leave both open for now.

@andrew-bulford-form3
Copy link

Yes, there has been considerable progress on this issue. It's under active development.

@apparentlymart This is great to hear! Are you able to provide a sneak peak into what changes are being introduced and how this might affect what options we have for unknown values being used in providers? Are you implementing a mechanism to allow Terraform to converge on the desired state over multiple applies, or are you solving this another way?

I'm at a fork in the road for some refactoring I'm working on - either provision an EKS cluster in one root module and bootstrap it with the kubernetes provider in another, or provision and bootstrap the cluster in the same root module with some hacky workarounds to the provider issues described in this ticket. If there are changes coming soon which will mean that provisioning and bootstrapping a cluster can be done cleanly under the same root module then I'm inclined to go for the hacky, single module option now and clean it up once these improvements are available, to avoid needing the separate root modules.

Also, is there any expectation that this work will make it possible for child resources to be deleted using the old provider configuration and created using the new one? So taking your happycloud example, if happycloud_compute_cluster.example needs recreating, the compute_deployment.example would first be destroyed using the old provider configuration (pointing at the existing cluster), the cluster recreated, then the deployment created again with the new provider configuration (pointing at the newly created cluster). Is that something being addressed by this ticket or do you see this as a separate issue?

@apparentlymart
Copy link
Contributor Author

Sorry to leave you hanging @andrew-bulford-form3... your mention of my username has been in my notifications for a while but I didn't reply to it previously because I'm no longer on the Terraform team and so I was hoping someone who still is would respond to you instead. 😀

The work I was referring to when I left my comment earlier was hashicorp/terraform-provider-kubernetes#2510, and some other things in the vicinity of that.

I have no information on what might have happened since then, or what's planned, since I'm no longer involved in this work.

@andrew-bulford-form3
Copy link

No problem @apparentlymart, sorry to hear you're no longer involved with this but thanks for the PR link! We've gone for the split workspace approach for now but I'll continue to keep an eye on this issue to see how things progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
config enhancement unknown-values Issues related to Terraform's treatment of unknown values
Projects
None yet
Development

No branches or pull requests