Replace EKS test-infra with example #1192
I would prefer that we leave out the first phrase. I feel it's presenting some facts inaccurately.
My understanding is that all providers are configured at the same time, BEFORE any of the resources have been created. The intermittent issues we generally label "progressive apply" arise because the configured attribute values are consumed by the Kubernetes / Helm provider CRUD logic only at a later stage of the apply. This sequencing is not predictable and is influenced by the resource dependencies in the graph in very unintuitive ways. I think the first phrase here gives users the false impression that this aspect is somehow predictable. I worry that it is not.
Maybe there's a way I can represent this more accurately. Let's discuss the nuances of it, and maybe I can learn from your perspective. There is probably some truth to both of our understandings of this scenario, since my understanding is based on first-hand experience and observation. The scenario I'm describing in this document is thoroughly tested. It is repeatable and predictable. I'm less certain about the "why" behind it though.
I think this is because, when you initialize the Kubernetes provider using a data source, like `google_container_cluster`, the reading of the data source can be deferred. In deferring the data source read, you get the predictable deferred provider initialization. This is what allows our initial builds of GKE/EKS/AKS clusters to succeed in our test-infra and examples. It's the predictable single-apply scenario.

However, you can run into problems later, such as during cluster replacement, when the Kubernetes provider initialization is no longer delayed by the initial read of the data source. The Kubernetes provider is initialized before the data source has read the new GKE credentials, since the credentials were replaced during the apply phase. In this scenario, the data source was never re-read to give the Kubernetes provider valid credentials, and so the apply fails. In those cases, as far as I'm aware, the only way forward is to remove the Kubernetes provider's resources from state, as described in this document. (Related: I opened a bug upstream to see if they can help us with this scenario.)
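A minimal sketch of the data-source pattern being described, assuming a GKE cluster (all names and locations here are illustrative, not from the PR):

```hcl
# Illustrative only: the Kubernetes provider takes its credentials from
# data sources, so its first read of the Kubernetes API is deferred
# until those data sources have been resolved.
data "google_container_cluster" "this" {
  name     = "example-cluster" # assumed name
  location = "us-central1"     # assumed location
}

data "google_client_config" "default" {}

provider "kubernetes" {
  host  = "https://${data.google_container_cluster.this.endpoint}"
  token = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(
    data.google_container_cluster.this.master_auth[0].cluster_ca_certificate
  )
}
```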
My goal with this README is to convey the information accurately to users, to help them succeed in establishing the dependency between their GKE cluster and their Kubernetes resources.
How about this instead... I edited it to say the providers will "not attempt to read from the Kubernetes API" instead of saying they won't be initialized. Perhaps that information is more accurate and more relevant to the users.
Stef, thanks for breaking things down in such detail.
I appreciate the overall solution that you are proposing here, and I think removing the state of the Kubernetes resources deliberately via the `state rm` command before replacing the cluster is a clever and very valid solution. I think we should promote that and encourage users to adopt it. It follows good practice patterns, by making actions explicit rather than implicit and limiting the blast radius of each action. I like it.

What I'm not comfortable with is the messaging in the first phrase only. It can be interpreted by users as stating for a fact that Terraform offers any kind of sequencing guarantee that "the Kubernetes and Helm providers will not attempt to read from the Kubernetes API until authentication details are created for the cluster".
This causality, I'm afraid, is not sustained by the way Terraform is currently designed. After multiple conversations with Terraform engineers, the conclusion has always been that this is not by-design behaviour. The fact that it works (as you have clearly experienced first-hand) is a side-effect, and side-effects are dangerous to promote as reliable fact.
In fact, Terraform documentation clearly advises against relying on this assumption here: https://www.terraform.io/docs/language/providers/configuration.html#provider-configuration-1. About midway in that section, it reads:
"You can use expressions in the values of these configuration arguments, but can only reference values that are known before the configuration is applied. This means you can safely reference input variables, but not attributes exported by resources (with an exception for resource arguments that are specified directly in the configuration)."
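To illustrate the distinction the docs are drawing, a hedged sketch (variable and resource names are only examples):

```hcl
# Safe per the docs: the provider references only values known before
# the configuration is applied (input variables).
provider "kubernetes" {
  host  = var.cluster_endpoint # known at plan time
  token = var.cluster_token    # known at plan time
}

# Risky: referencing an attribute exported by a resource created in the
# same apply. It may appear to work, but that is a side-effect, not a
# documented guarantee.
provider "kubernetes" {
  alias = "risky"
  host  = google_container_cluster.primary.endpoint
}
```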
I think it's in our users' best interest (but also ours) for us to send a clear and cohesive message about how to properly use Terraform.
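For reference, the deliberate state-removal workflow endorsed above might look like this before a cluster replacement (the resource addresses are illustrative, not taken from the PR):

```sh
# Remove the Kubernetes/Helm resources from state so Terraform forgets
# them before the cluster (and therefore its credentials) is replaced.
terraform state rm 'kubernetes_namespace.example'
terraform state rm 'helm_release.example'

# Replace the cluster; the removed resources are then created fresh
# against the new cluster's credentials.
terraform apply
```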
Ok, I'll give this more thought. Maybe for now, we can just remove this README from the test-infra dir. It was just a copy/paste of our example README anyway. We can continue the discussion and tune this file in the PR where I'm making changes to this README in a more user-facing way. This discussion is also something we can think about for our upcoming meeting regarding the Kubernetes provider authentication issues. (Phil scheduled it for us yesterday -- we'll meet in about 2 weeks.)
I removed the paragraph in question. It's ok to just wait until the meeting to discuss it further, since we might not do much work on the other PR until we have that talk.
I think that's a great way forward! I'm happy to do a deep dive in this topic online and exchange ideas.
gp2 volumes are generally more expensive than gp3.
Is there a technical need for nodes to run on GP2 volumes? We're likely not putting much workload on the nodes during our tests.
gp2 was the default in the EKS module, but they recently updated it to gp3. I put it back to the old default because gp3 isn't available in all regions.
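If we ever need to pin it explicitly, something like the following should work; note that `root_volume_type` is my recollection of the argument the module exposes for this, so verify against the terraform-aws-eks module docs for the pinned version:

```hcl
# Assumed worker-group schema; check the module's documented inputs.
module "eks" {
  source = "terraform-aws-modules/eks/aws"
  # ...other required arguments elided...

  worker_groups = [
    {
      instance_type    = "m5.large"
      root_volume_type = "gp2" # pin to gp2; gp3 isn't available in all regions
    },
  ]
}
```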
This is the commit where it was added: terraform-aws-modules/terraform-aws-eks@76537d1
Oh, if that's the case then disregard my comment. I must have based it on old information and likely GPx volumes are now the only choice. Sorry for the confusion :)