
regression error in tofu/kubernetes/talos/image.tf: "for_each" set includes values derived from resource attributes that cannot be determined until apply #106

Open
sebiklamar opened this issue Oct 18, 2024 · 7 comments

Comments

@sebiklamar
Contributor

sebiklamar commented Oct 18, 2024

Hi Vegard, the recent commit 856cef4 on tofu/kubernetes/talos/image.tf (use new talos_image_factory_schematic resource) produces a tofu plan error when starting from scratch -- while with an existing cluster tofu throws no error.

Error message

│ Error: Invalid for_each argument
│
│   on talos/image.tf line 20, in resource "proxmox_virtual_environment_download_file" "this":
│   20:   for_each = toset(distinct([for k, v in var.nodes : "${v.host_node}_${v.update == true ? local.update_image_id : local.image_id}"]))
│     ├────────────────
│     │ local.image_id is a string, known only after apply
│     │ local.update_image_id is a string, known only after apply
│     │ var.nodes is map of object with 5 elements
│
│ The "for_each" set includes values derived from resource attributes that cannot be determined until apply, and so OpenTofu cannot determine the full set of keys that will identify the instances of this resource.
│
│ When working with unknown values in for_each, it's better to use a map value where the keys are defined statically in your configuration and where only the values contain apply-time results.
│
│ Alternatively, you could use the -target planning option to first apply only the resources that the for_each value depends on, and then apply a second time to fully converge.

How to reproduce:

  • git clone
  • provide credentials.auto.tfvars and openssl certs
  • tofu plan

As an alternative to creating a copy of your repo without Terraform state, you can simply move the TF state aside in an existing repo connected to a cluster ("mv terraform.tfstate terraform.tfstate.bak").

Unfortunately, I cannot propose a fix as I'm still new to K8s and tofu.

That said, thank you for sharing your knowledge and homelab! It helps me a lot in getting a quick start into IaC!

Kind regards -- Sebastian

@sebiklamar sebiklamar changed the title regression error in tofu/kubernetes/talos/image.tf: regression error in tofu/kubernetes/talos/image.tf: "for_each" set includes values derived from resource attributes that cannot be determined until apply Oct 18, 2024
@vehagn
Owner

vehagn commented Oct 19, 2024

Hi @sebiklamar,

I unfortunately never tested the change with a fresh plan and you're not the first to get the error.

Jacob Melton was able to solve the issue by running

tofu apply -target=module.talos.talos_image_factory_schematic.this

before a full apply.

I haven't had the time to update the article and readme with the fix yet.

@sebiklamar
Contributor Author

Hi @vehagn,

thank you for the quick response and the quick fix. That helped.
I don't have any issue with untested regressions, as we're consuming your work for free and thus can give you also valuable feedback.

Do you welcome PRs for the README, or would you rather handle it yourself since you need to update the article anyway? In my version I've stripped out the "tofu output" commands and added the commands for creating the sealed-secrets openssl key and credentials.auto.tfvars, as well as the config merges for talosconfig and kubeconfig.

@vehagn
Owner

vehagn commented Oct 19, 2024

PRs of any kind are very welcome, I only ask that you try to adhere to the conventional commits specification and add a descriptive commit message.

Signed commits are also appreciated, but I won't require it.

@sebiklamar
Contributor Author

sebiklamar commented Nov 4, 2024

Hi, I'm still on it... I'm struggling with the PR as I'm new to git, and I already made changes on the main branch of my fork, so AFAIK I cannot use the typical fork-and-branch workflow. To my understanding I can either:

  1. create my "upstream" branch from a previous commit that predates my changes (3 weeks ago), then pull all your changes to have a recent base, or
  2. branch "upstream" from HEAD and merge from the upstream remote, overwriting all my changes, or
  3. create a second fork to have a vanilla repo without my changes, or
  4. create a cherry-picked PR covering only the files in question (some READMEs for PR 1, and 2 .tf files for PR 2).

I prefer approach no. 2. For no. 4 I have no clue how. What do you think?

By the way, I've created a workaround for the for_each issue because your -target workaround does not work with terragrunt: I've duplicated the proxmox_virtual_environment_download_file resource and given the copies two static names. As this is not DRY yet, I'm aiming for better code that leverages the (Talos) version as instance id and hence creates two resources dynamically if update_version differs. I'd also like to get rid of these ${split("_", each.key)[2]} expressions, as they are not very readable and not DRY; however, I'm struggling to create locals for them, since variables are not allowed in a resource block, only in a locals block... (also new to TF).
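For what it's worth, the locals-block idea could look roughly like this -- a minimal sketch only, with variable names (var.nodes, var.image.version, var.image.update_version) assumed from the error output earlier in this thread:

```hcl
# Hypothetical sketch (names assumed): precompute per-node host/version pairs
# in a locals block -- locals are only valid at file scope, not inside
# resource blocks -- so resources can reference local.node_images[...]
# instead of repeated split("_", each.key) expressions.
locals {
  node_images = {
    for name, node in var.nodes : name => {
      host    = node.host_node
      version = node.update == true ? var.image.update_version : var.image.version
    }
  }
}
```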

So, TL;DR: this will take some time as I'm still a newbie. Expect 2 PRs -- for the README PR I'll create a dedicated issue, if you prefer.

@vehagn
Owner

vehagn commented Nov 4, 2024

I have a habit of force pushing and rebasing to keep a tidy history, which makes it easier for me to go back in time and review when and how I did what. This unfortunately makes it more complicated to create PRs, as you have to rebase your commits on top of the new history.

I'm contemplating reverting this commit so as not to use the new talos_image_factory_schematic resource introduced in v0.6 of the Talos provider.
The way I first did it, with http data sources, works the same by making a call to the Talos factory, but doesn't require running the targeted apply first. (Alternatively, the new talos_image_factory_schematic resources could maybe be put in the root module.)

I don't mind you opening a PR with merge conflicts, as long as you don't mind me editing the PR to fix them. If you want to try fixing it yourself, though, git rebase -i is a great place to start!

Thank you again for taking the time to contribute back <3

sebiklamar added a commit to sebiklamar/terraform-modules that referenced this issue Nov 17, 2024
…_environment_download_file

this commit implements solution option #4 for circumventing vehagn/homelab#106
@sebiklamar
Contributor Author

sebiklamar commented Nov 17, 2024

Hi,

TL;DR: My initial approach of removing the schematic portion from proxmox_virtual_environment_download_file is not appropriate. Now aiming for a mixture of both (see option [3] below).

I'm contemplating reverting this commit to not use the new talos_image_factory_schematic

What are your reasons for (maybe) moving away from the new talos_image_factory_schematic resource from v0.6? If it's (only) because of the new dependency to run -target first: we'll get that managed, i.e. we don't strictly need -target any longer -- see below, though with some other drawbacks.

I see benefits in using the new talos_image_factory_schematic instead of your data http POST calls -- no offence to your code! -- because the code base becomes more readable and maintainable when using simple resource calls of a commonly available provider instead of (okay, not so complex) http calls that you'd have to adapt whenever something changes on the Talos side.
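For reference, the pre-v0.6 data-source approach looked roughly like this -- a sketch only, assuming the hashicorp/http provider (≥ 3.2, which supports method and request_body) and a schematic definition held in var.image.schematic:

```hcl
# Sketch of the earlier approach: create the schematic via a POST to the
# image factory using the hashicorp/http data source (names assumed).
data "http" "schematic" {
  url          = "https://factory.talos.dev/schematics"
  method       = "POST"
  request_body = var.image.schematic
}

locals {
  # The factory responds with JSON like {"id": "<schematic-id>"}
  schematic_id = jsondecode(data.http.schematic.response_body).id
}
```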

Root Cause Analysis

Re. proxmox_virtual_environment_download_file, which is causing the tofu plan/apply error:

[Error message]
The "for_each" set includes values derived from resource attributes that cannot be determined until apply, and so OpenTofu cannot determine the full set of keys that will identify the instances of this resource.

When working with unknown values in for_each, it's better to use a map value where the keys are defined statically in your configuration and where only the values contain apply-time results.

Alternatively, you could use the -target planning option to first apply only the resources that the for_each value depends on, and then apply a second time to fully converge.

The issue is that proxmox_virtual_environment_download_file's ID is dynamic: it depends on the schematic id retrieved from the image factory, because its ID format is <node>_<schematic>_<version>. The <schematic> is not known before a tofu apply -- more precisely, before a run of the talos_image_factory_schematic resource. Hence the -target workaround proposed by tofu (and Jacob Melton).

NB: This dependency has existed for months, including in your custom implementation using data http "schematic" POST calls before the new talos_image_factory_schematic resource from v0.6. Hence, I wonder why it worked before without a -target run. Maybe because the dependency was on Terraform data elements rather than resource elements? That could be another solution idea to pursue (option [3] in the list below).
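The map-based shape the error message hints at could be sketched as follows -- a hypothetical sketch only (attribute names assumed, and note it changes the dedup semantics to one entry per node rather than per host/image pair). The point is that the keys derive solely from var.nodes and are known at plan time, while the unknown image id moves into the values:

```hcl
# Sketch: static keys from var.nodes; apply-time-unknown values in the body.
resource "proxmox_virtual_environment_download_file" "this" {
  for_each = {
    for name, node in var.nodes :
    "${node.host_node}_${name}" => node.update == true ? local.update_image_id : local.image_id
  }
  node_name = split("_", each.key)[0]
  file_name = "${each.value}.img" # unknown-until-apply values are fine here
  # ... remaining arguments (content_type, datastore_id, url) as in image.tf
}
```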

Solution options

I currently see 4 solution options -- the "map" approach mentioned by tofu is similar to option [4], and the -target workaround would be a 5th option, but it's off my table as it doesn't work with the terragrunt wrapper:

  1. Your option: revert the implementation by getting rid of the new talos_image_factory_schematic resource in favor of data http POST calls, as implemented before commit 856cef4 (a.k.a. 7152493 after your recent renovate rebase squashing).
    Yeah, that would be an easy solution. However, I'd like to use code from another module without inventing (and maintaining) it again -- though you did invent it first ;-)
  2. Your alternative: "the new talos_image_factory_schematic resources could maybe be put in the root module".
    Hmm, I'm not as experienced with tofu as you, yet. I wonder how moving the image.tf code to main.tf (or in parallel on the filesystem level) would change anything. Can you elaborate a bit?
  3. Smart detour using data sources in between: keep the new talos_image_factory_schematic and shift the dependency of proxmox_virtual_environment_download_file from the talos_image_factory_schematic resource onto data counterparts, similar to the previous behaviour. Just a vague idea; I don't have any implementation clue yet. Maybe it's sufficient to store the output of talos_image_factory_schematic in data sources and then refer to those.
  4. Remove the schematic id from the resource ID: keep the new talos_image_factory_schematic and remove the dynamic part of proxmox_virtual_environment_download_file's ID, i.e. strip the schematic id string from the resource id. A simple example of the id format change follows below.

In the following let me elaborate on the 4th option (remove schematic id from resource ID) a bit more.

Option 4 -- challenges of a (dynamic) id

The approach of option 4 is to remove the dynamic <schematic> part from the resource id, i.e. the format changes from <node>_<schematic>_<version> to <node>_<version> (or <node>_foo_<version>). In your example: abel_dcac6b92c17d1d8947a0cee5e0e6b6904089aa878c70d66196bb1138dbd05d1a_v1.8.3 becomes abel_v1.8.3. That makes the resource's ID predictable for tofu, so tofu apply works without having to run tofu apply -target=module.talos.talos_image_factory_schematic.this beforehand.
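As a sketch (variable names assumed from the error output quoted above), option 4 would amount to something like:

```hcl
# Sketch of option 4: the schematic id is dropped from the key, so every
# for_each key is derivable from variables at plan time (<node>_<version>).
resource "proxmox_virtual_environment_download_file" "this" {
  for_each = toset(distinct([
    for name, node in var.nodes :
    "${node.host_node}_${node.update == true ? var.image.update_version : var.image.version}"
  ]))
  # ... resource arguments otherwise unchanged; file_name no longer embeds
  # the schematic id, which is exactly the drawback discussed next.
}
```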

Though, there's a drawback with this approach: we would lose the possibility of evolving the schematic id by adding or removing Talos extensions. As your logic is flexible enough to support changing the schematic (var.image.update_schematic versus var.image.schematic) as part of an update (where even the version could stay the same), we would lose this possibility. That's a very tricky one.

I didn't manage to find a solution for keeping the 'possibility to change schematic' feature, because that would bring a dynamic part into the ID again. Even using some schematic information in the ID, e.g. orig/update strings instead of <schematic> in the ID's format (resulting in <node>_orig_<version> and <node>_update_<version>), won't work in the end: the tofu hammer will knock down all your nodes after an update, once you remove the update = true stanza and the node is doomed to switch its image id from <node>_update_<version> back to <node>_orig_<version> (with the version staying the same).

Another challenge -- a showstopper for users with a running cluster -- is the change of the resource id format itself. Once option 4 is implemented, tofu would knock down all nodes in order to change their image, due to the change of the VM's file_id (e.g. from abel_dcac6b92c17d1d8947a0cee5e0e6b6904089aa878c70d66196bb1138dbd05d1a_v1.8.3 to abel_v1.8.3). I don't have a safe upgrade-path proposal for that.

  • tofu apply -parallelism=1 doesn't help.
  • Changing the ID only on update (e.g. from abel_dcac6b92c17d1d8947a0cee5e0e6b6904089aa878c70d66196bb1138dbd05d1a_v1.8.3 to abel_v1.9.0) would require detecting whether a running cluster still uses the old format (e.g. by checking the length of the middle part of the id) -- no clue how to implement this -- or defining 'the' version where the change should happen. One cannot hard-code e.g. 'v1.8.3' as the last 'legacy' version -- it would not work for users who forked your repo and pull the new logic a couple of weeks/months from now, after they've already upgraded their Talos to v1.9 in between.
  • As there's no way to ignore certain resources in a run, the only option I see on the table is targeting the nodes individually, e.g. running tofu apply -target=module.talos.proxmox_virtual_environment_download_file.this["abel_dcac6b92c17d1d8947a0cee5e0e6b6904089aa878c70d66196bb1138dbd05d1a_v1.8.3"] and tofu apply -target=module.talos.proxmox_virtual_environment_vm.this["ctrl-00"] for each Proxmox host and Talos node, respectively (not tested).
    Too much fiddling for the users...

Looks like we have to find a way which keeps the schematic as part of the resource id. That makes sense in terms of a proper resource definition.

Closing... still WIP: option 4 ruled out, exploring option 3 (depend on data sources)

In closing, option 4 removes the possibility of changing Talos extensions in a running cluster -- which should be okay for 99 % of users. However, as there's also no simple upgrade path for a running cluster -- a no-go for all users with one -- I'm not creating a PR yet; option 4 is not appropriate.

Nevertheless, I've implemented option 4 for my environment (image.tf), as I don't have the pain of keeping a prod cluster alive yet ;-)
Maybe I can explore option 3 (smart detour using data sources) by evolving my terragrunt (terraform) module version of your homelab repo. I need to get to know data sources and inputs/outputs a bit more to advance my terraform modules*

If you have some hints on option 2 (or 3), please let me know.

FYI terragrunt feat. my homelab and terraform-modules

evolving my terragrunt (terraform) module version of your homelab ...
I need to get to know data sources and inputs/outputs a bit more for implementing the sealed secrets key+crt input.

*The reason behind this complexity: as terragrunt is a wrapper for terraform/tofu, it behaves a bit differently -- with both advantages and disadvantages (or call it removing "hacky" methods): terragrunt provides neither a refresh=false nor a -target option, and it cannot access local files (created manually / not version-controlled together with the terraform/tofu module) -- unless you put them into the .terragrunt-cache/foo/bar/ folder.

FYI: If you're interested in what I mean, feel free to have a look at my PoC (proof-of-concept) implementations of your homelab (Talos part only, yet). There are 2 variants:

  1. terraform-modules/talos-proxmox is my first try at implementing your tofu/kubernetes/talos folder without its sister ../bootstrap module (decoupling all the module logic for creating a cluster leveraging Proxmox and Talos from the logic enhancing a running cluster, which could perhaps even be built with the k3s distribution) -- where I failed at cluster bootstrapping because the kube_config/talos_config outputs were not available for subsequent calls (so the cluster did not bootstrap successfully). Likely I'd need to adapt the kube_config/raw output definitions.
  2. terraform-modules/vehagn-k8s is my second attempt, copy-pasting your full tofu/kubernetes/talos folder. Basic bootstrapping works; I still need to solve sealed-secrets key+crt bootstrapping.
    I need to find a method for providing the sealed-secrets key+crt (created manually via an openssl run in bootstrap/sealed-secrets/certificate in your solution). The easy (similar) way would be to still generate them manually and put them in the assets/ folder (next to terragrunt.hcl), which can be referenced by the (downloaded) module, as already done for schematic.yaml (and the upcoming cilium-vars.yaml file). Another approach would be full automation: decrypting a SOPS-encrypted counterpart of both files on the fly using SOPS data sources. The manual approach would be sufficient for me; however, the SOPS way is nice for getting to know data sources -- thus helping to explore schematic solution option [3].

In case terragrunt is new to you: both terraform modules are referenced by terragrunt.hcl files, typically residing in a separate "infrastructure-live" repo (in my case the folder homelab/terragrunt), thus allowing multiple (and versioned) incarnations of a module's logic with variations per environment, e.g. poc, dev, staging, prod...

Kind regards -- Sebastian

PS: Thank you for the hint on git rebase -i, which I had only known for changing commit messages until now.
I struggled at the beginning for a while because you did some commit squashing for changes on 10 Oct. That caused trouble both for removing my commits via git rebase -i to get a clean branch, and for the other approach of creating an upstream-vanilla branch at the last commit before my changes hit main (option 2 of my git approaches). I initially used commit 7a6cf8c (ref. in my fork: sebiklamar/vehagn-homelab@7a6cf8c, in case the commit gets garbage-collected in your repo in between), but in the end going further back in history was the right way (using c2aa58a).

@vehagn
Owner

vehagn commented Nov 17, 2024

I love your enthusiasm! There's a lot of text here, which makes it difficult to address everything in one reply, but I'll try.

What are your reasons for (maybe) moving away from the new talos_image_factory_schematic resource from v0.6? If it's (only) because of the new dependency to run -target first: we'll get that managed, i.e. we don't strictly need -target any longer -- see below, though with some other drawbacks.

The motivation is to remove the dependency on running with -target first, and it's also what I've done in the current iteration of the article. Ideally I'd like to use the resource, as it should ensure longer-term stability: the GET call might change in the future if they decide to implement more features in the factory, e.g. versioned system extensions. It could be that a future OpenTofu release fixes the issue. I should probably also create an issue in the terraform-provider-talos repo; maybe they have an explanation of the behaviour.

I see benefits of using the new talos_image_factory_schematic instead of your data http POST calls -- not offending your code! --

No offence taken, I like having my ideas challenged! I think we agree that using the provider resource is the better long-term solution. The difference lies in my more "pragmatic" approach of not caring too much.

I didn't manage to find a solution for keeping the 'possibility to change schematic' feature because that would bring a dynamic part into the ID again. Even when using some schematic information in the ID, e.g. using orig/update strings instead of <schematic> in the ID's format (resulting in <node>_orig_<version> and <node>_update_<version>), this won't work in the end.

You're rediscovering the main issue I had in getting this to work. That's my fault for leaving it as an implicit "the proof is left as an exercise for the reader" task. It all boils down to this file_id.
A solution could be to compute the schematic ID explicitly, rather than using the ID computed by the resource and/or the Talos factory. If we don't want to implement the exact same algorithm Talos uses (digging through the source code to produce the same SHA value), we could implement our own hashing only for the updated version, and then switch after the cluster is upgraded.
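That explicit-hash idea could be sketched like this -- purely hypothetical names, and the hash is deliberately our own stand-in, not the factory's schematic ID:

```hcl
# Sketch: derive a plan-time-known key by hashing the schematic definition
# ourselves. sha256() is evaluated during plan when its input is known, so
# for_each keys built from it are fully known -- unlike the id returned by
# the talos_image_factory_schematic resource.
locals {
  schematic_key        = substr(sha256(var.image.schematic), 0, 12)
  update_schematic_key = substr(sha256(var.image.update_schematic), 0, 12)
}
```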

FYI: If you're interested in what I mean, feel free having a look on my PoC (proof-of-concept) implementations of your homelab (talos part only, yet).

I haven't had much time to tinker with my homelab lately, but I'll try to take a look at your work when I have the time.
