
Terraform passes incorrect provider through when provider depends on resources which need replacing #31520

Closed
gtmtech opened this issue Jul 26, 2022 · 9 comments
Labels: bug, v1.2, waiting for reproduction, waiting-response

gtmtech commented Jul 26, 2022

Terraform Version

Terraform 1.2.5
AWS Provider 4.23.0

Description

I found an obscure but critical case where Terraform does not use the correct provider when refreshing a resource. It is quite convoluted, but since Terraform should never mix providers up, I feel it's definitely worth raising.

The problem happens with so-called dynamic providers, that is, providers that are not instantiated immediately but midway through a Terraform run.

Recall that a provider without an alias is instantiated at the beginning of the Terraform run, but a provider with an alias can be made to instantiate partway through the run if its definition depends on resources that do not yet exist.

This is a very useful feature: it lets us create a role, and then a provider based off that role, to use in creating further resources as that role. It's a handy way of provisioning multiple accounts, for example, all in one Terraform run.

However, I experienced a critical issue where, all of a sudden, Terraform switched back to using the initial provider to try to refresh the later resources that are supposed to be provisioned using the midway dynamic provider.

Here is some code to illustrate.

Terraform Configuration Files

provider "aws" {
    alias = "foo"
    assume_role {
        role_arn     = "arn:aws:iam::111111111111:role/rolefoo"
        session_name = "terraform"
    }   
}

# create a role using the foo provider
resource "aws_iam_role" "bar" {
    provider = aws.foo
    name = "rolebar"
    managed_policy_arns = [ "arn:aws:iam::aws:policy/IAMFullAccess" ]
}

# now wait for role to be provisioned (because of aws eventual consistency)
resource "time_sleep" "wait_for_rolebar" {
    triggers = {
        name = aws_iam_role.bar.name
    }
    create_duration = "10s"
    depends_on = [ aws_iam_role.bar ]
}

# now dynamically create a bar provider; the no-op conditional forces
# a dependency on the time_sleep resource
provider "aws" {
    alias = "bar"
    assume_role {
        role_arn     = time_sleep.wait_for_rolebar.id != "" ? aws_iam_role.bar.arn : aws_iam_role.bar.arn
        session_name = "terraform"
    }
}

# now use the bar provider midrun to create something else
resource "aws_iam_policy" "policy1" {
    provider = aws.bar
    ...
}

All this works the first time you use it: provider foo creates rolebar, and provider bar, based off the new rolebar, creates policy1. Great!

But then I realised my triggers were wrong, so I changed the time_sleep resource like this:

resource "time_sleep" "wait_for_rolebar" {
    triggers = {
        id = aws_iam_role.bar.id        # changes from name to id
    }
    create_duration = "10s"
    depends_on = [ aws_iam_role.bar ]
}

Now when I run terraform, I get very strange behaviour like this:

╷
│ Error: reading IAM Policy (arn:aws:iam::111111111111:policy/policy1): AccessDenied: User: arn:aws:sts::111111111111:assumed-role/rolefoo/[email protected] is not authorized to access this resource
│ 	status code: 403

That's odd, I thought: the code hasn't changed at all, and the IAM policy policy1 is clearly set to use the aws.bar provider, not the aws.foo provider, so why is Terraform trying to use the foo provider to refresh it?

When I turned on TF_LOG=debug, I saw no requests at all to the STS endpoint trying to assume the bar role.

What is happening here is that provider bar depends on the time_sleep resource, and in this case the time_sleep resource needs to be recreated, so the aws.bar provider, which depends on that time_sleep resource, cannot be evaluated until that happens.

But when refreshing the whole plan, Terraform has to refresh the IAM policy policy1. With no aws.bar provider available at that point (the time_sleep must be applied before policy1 can be refreshed), Terraform unilaterally fell back to the original foo provider.

This warrants a fix of some sort. It took ages to track down, and I would much rather have seen a message that Terraform couldn't refresh the policy1 resource because the underlying provider couldn't be created; that would have been far more helpful than assigning the wrong provider to the refresh task.

When I finally figured out what was going wrong, I was able to run a targeted apply against the time_sleep resource to fix the new triggers, and this "freed up" Terraform's internal logic to create the aws.bar provider and refresh policy1 with the correct provider again.
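
For reference, the targeted workaround amounts to roughly this (a sketch using the resource address from the configuration above):

# apply the replaced dependency first, so the aws.bar provider
# configuration becomes fully known
terraform apply -target=time_sleep.wait_for_rolebar

# then run the full apply with the correct provider available
terraform apply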

gtmtech added the labels bug and new (new issue not yet triaged) on Jul 26, 2022
jbardin (Member) commented Jul 27, 2022

Hi @gtmtech,

I'm not sure exactly what's going on here from the description alone. Would it be possible to supply a trace log of the failed plan? If you use TF_LOG_CORE=trace it will skip all the provider output, which keeps the AWS API calls out of the logs.

(Probably unrelated, but depends_on = [ aws_iam_role.bar ] here is not doing anything, since time_sleep.wait_for_rolebar already refers to the aws_iam_role.bar resource, in case you intended it to perform some other function.)
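
In other words, the following would behave identically (a sketch of the same resource with the redundant depends_on dropped):

resource "time_sleep" "wait_for_rolebar" {
    triggers = {
        name = aws_iam_role.bar.name   # this reference already creates the dependency
    }
    create_duration = "10s"
}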

Thanks!

jbardin added the labels waiting-response and waiting for reproduction, and removed the label new, on Jul 27, 2022
apparentlymart added the label v1.2 on Sep 16, 2022
gtmtech (Author) commented Nov 4, 2022

@jbardin, sorry, just coming back to this.

So when you have the following in a dependency chain:

  1. Provider X
  2. An IAM role that uses provider X
  3. A 10s time_sleep dependent on the IAM role
  4. A provider Y which references 2 as the role to assume, and 3 via some function call (to create a dependency on 3)
  5. A resource that uses provider Y

.. and you have successfully terraformed the lot, but then a change to 2 requires 3 to be recreated (e.g. via triggers{} on 3 referencing an attribute of 2)...

If 3 needs to be recreated, Terraform can't create provider Y for the refresh phase (it depends on 3, but 3 now needs recreating). During the refresh phase, Terraform incorrectly assigns provider X to try to refresh 5, rather than provider Y, as Y is no longer available. It should error or do something else, not assign the wrong provider.
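
Condensed into a skeleton (a sketch only; the names are placeholders for the numbered items above, and the no-op conditional stands in for the function-call dependency):

provider "aws" { alias = "x" }                           # 1: provider X
resource "aws_iam_role" "r" { provider = aws.x ... }     # 2: role created via X
resource "time_sleep" "s" {                              # 3: sleep on the role
    triggers        = { id = aws_iam_role.r.id }
    create_duration = "10s"
}
provider "aws" {                                         # 4: provider Y, depends on 2 and 3
    alias = "y"
    assume_role {
        role_arn = time_sleep.s.id != "" ? aws_iam_role.r.arn : aws_iam_role.r.arn
    }
}
resource "aws_iam_policy" "p" { provider = aws.y ... }   # 5: resource created via Y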

jbardin (Member) commented Nov 4, 2022

Hi @gtmtech,

I don't think the problem here is the use of an incorrect provider (at least there has been no evidence yet that that could happen, though the logs would verify it); rather, the configuration for the bar provider is going to contain unknown values during the next plan. While it may work in some cases, referencing a managed resource from a provider configuration is not supported. The applicable excerpt from the docs:

You can use expressions in the values of these configuration arguments, but can only reference values that are known before the configuration is applied. This means you can safely reference input variables, but not attributes exported by resources.

If the provider requires the full configuration in order to complete the plan operation, you won't be able to plan and apply a configuration like this without the use of `-target`. When given a partially known configuration, most providers opt to attempt whatever operations they need during plan, but generally act as if the unknown value were unset, so they may fail outright or produce unexpected results.

gtmtech (Author) commented Nov 22, 2022

Thanks @jbardin. I certainly did not realise that basing a provider on attributes of a managed resource was unsupported behaviour, since in most cases it works, but I appreciate the docs you mention.

However, if it's not supported behaviour, I would argue an error needs to be thrown. Assigning the wrong provider to a resource is, in my view, of critical importance and should not be tolerated under any circumstances; for example, it may lead to API requests being made against the wrong AWS account, with all sorts of implications.

jbardin (Member) commented Dec 7, 2022

I don't see any evidence here (yet) of the incorrect provider being assigned within Terraform, but an incomplete configuration could make a provider act like another provider configuration with the same defaults. If that's the case, there's not much we can do within the current protocol, because that is also a perfectly valid mode of operation for some providers.

gtmtech (Author) commented Dec 20, 2022

I will try to record a video sometime to show you this error, @jbardin 👍

wrschneider commented

I have seen exactly the same problem with Databricks workspaces.

  • azurerm_databricks_workspace resource
  • databricks provider depends on azurerm_databricks_workspace
  • databricks_cluster resource in databricks provider

If you make a change that causes the workspace resource to change, Terraform isn't able to refresh the databricks_cluster resource because the provider doesn't exist yet. Often the error looks authentication-related, which causes an enormous amount of confusion.

Kubernetes providers tend to have the same problem.

I will file a feature request for a warning or a more meaningful error: the docs say not to do this, but the initial Terraform apply does work, and the error on a subsequent apply is completely non-obvious.

It would be better to report a warning during the parse / validation phase.
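
For illustration, that Databricks shape is roughly the following (a sketch; the resource names and attribute values here are illustrative, not taken from a real configuration):

resource "azurerm_databricks_workspace" "example" {
    name                = "example-workspace"
    resource_group_name = "example-rg"
    location            = "westeurope"
    sku                 = "standard"
}

# the provider is configured from an attribute of a managed resource,
# the same unsupported pattern as the AWS example above
provider "databricks" {
    host = azurerm_databricks_workspace.example.workspace_url
}

resource "databricks_cluster" "example" {
    cluster_name  = "example-cluster"
    spark_version = "11.3.x-scala2.12"
    node_type_id  = "Standard_DS3_v2"
    num_workers   = 1
}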

jbardin (Member) commented Jan 27, 2023

Since we don't have any evidence of an incorrect provider instance being used, I'm going to close this under the assumption that the behaviour was the result of an incomplete provider configuration.

Thanks!

jbardin closed this as completed on Jan 27, 2023
github-actions bot commented
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Feb 27, 2023