nvidia taint alongside custom taints in google_container_node_pool #7928
@andre-lx help me understand how it should work after you uncomment the block?
Hi @edwardmedia. I don't know if I understood your question correctly. After uncommenting the nvidia taint, everything works correctly on updates. The problem is with the first deploy. I will provide a more extensive example. First:
This configuration will work, and the pool is created correctly.
The next configuration:
If I don't include the GPU taint together with our own taints, as in the previous file, Terraform will "force replace" my pools every time, since the taint is not present in the configuration file.
That's why I need to comment it out on the first deploy and uncomment it on the subsequent deploys.
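(For illustration, a minimal sketch of the two-phase workflow described above; the resource name and the custom taint are assumptions, not the reporter's actual config:)

resource "google_container_node_pool" "gpu_pool" {
  # ... cluster, node count, accelerator config ...
  node_config {
    # ...
    taint = [
      # First deploy: leave the nvidia taint commented out, since GKE
      # adds it automatically. Uncomment it before subsequent applies
      # so the config matches the state and no replacement is forced:
      # {
      #   key    = "nvidia.com/gpu"
      #   value  = "present"
      #   effect = "NO_SCHEDULE"
      # },
      {
        key    = "custom_taint"
        value  = "true"
        effect = "NO_SCHEDULE"
      },
    ]
  }
}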
@andre-lx I have tested cases by providing either one of the taints below:
resource "google_container_node_pool" "gpu_pool_test" {
...
taint = [
{
effect = "NO_SCHEDULE"
key = "nvidia.com/gpu"
value = "present"
},
{
key = "another_taint"
value = "true"
effect = "NO_SCHEDULE"
},
]
....
} |
Hi @edwardmedia. Thanks for the quick response. Since the nvidia taint is the default for GPU node pools created by GKE itself (even if you create the node pools manually), the only configuration missing from my examples that can actually affect this is the …
Thanks!
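(For reference, what makes a pool a GPU pool in this provider is its accelerator configuration; a minimal sketch, where the accelerator type, count, and resource names are assumptions:)

resource "google_container_node_pool" "gpu_pool" {
  name    = "gpu-pool"
  cluster = google_container_cluster.primary.name
  node_config {
    machine_type = "n1-standard-4"
    # The accelerator block is what makes GKE treat this as a GPU pool
    # and attach the nvidia.com/gpu taint automatically on creation.
    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1
    }
  }
}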
@andre-lx below is the state from my first run. Did I miss anything? There are many incompatible configs, but that seems beyond what the Terraform provider can control. If you see other cases, can you share your FULL Terraform code so I can repro the issue? Another thing you may want to try is to see if you can create the pools using the gcloud container ... command.
Hi @edwardmedia. You didn't miss anything. Below is my full config:
I just copied and pasted your …
The full …
Creating the pool using the …
Output:
This makes sense, since the nvidia taint is already added to GPU node pools by default by GKE itself. On the Terraform side, if you don't add this taint, the GPU pool is created successfully. The problem, as I already described, is with updates, since Terraform always shows "forces replacement". It's important to note that if you don't need custom taints (that is, without specifying the taint block in the config file), creation and updates currently work fine, and the nvidia taint is added by Terraform to the state file, as shown below. First:
Subsequent:
Getting the …
In the next … Getting the …
So it's really strange that Terraform thinks the GPU pool needs a replacement:
The question is: why does Terraform force replacement over an array that is equal to the one already in the state file for the same resource when custom taints are used? With only the nvidia taint, the taint array is successfully added to the state file, and in the subsequent … Thanks!
Hi @edwardmedia. In short: why is the pool recreated if the nvidia taint is the GKE default? And why is the pool not recreated if no custom taints are used (or rather, if only the nvidia taint exists)?
@andre-lx I am not sure if I understand what you said correctly. In my tests, I have tried to put 1) both nvidia and a custom taint …
Where do you see …? From the provider's perspective, any change to taints will trigger pool recreation, because I don't see that the GCP API provides a way to update taints directly. Instead, if you run …
@edwardmedia the nvidia taint is created by default on GPU node pools, as you can see here: … That's why (I think) I can't add the taint to node pools at creation time, as I explained in the other comments, and that's why Terraform and gcloud give me the error:
Because of this, I don't understand how you managed to create the GPU pool with the nvidia taint specified. I understand that if you change the taints either in the Google Cloud console or via Terraform, Terraform will recreate the pool; that makes a lot of sense and I was not expecting otherwise (since the state file differs from the resource itself). The problem here is that, using custom taints, I can't create the pool with the nvidia taint, and I can't … And that's why I need to comment out the nvidia taint on creation (since it is added by GKE itself) and uncomment it on the subsequent … I will put this into a few examples; maybe that makes it easier:
1 - No taints in the config file: …
2 - With both the nvidia and a custom taint: …
2.2 - Solution: create the pool as above (example 3).
3 - Only one custom taint in the config (see the sketch after this list): …
3.2 - The pool is created successfully, and the custom taint as well as the nvidia taint are added to the state file (again, since the latter is created automatically by GKE).
3.5 - All the future …
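(A minimal sketch of case 3, with names assumed: only the custom taint is declared, and GKE appends the nvidia taint on its own after creation, which is what later appears in the state and triggers the forced replacement on the next plan:)

resource "google_container_node_pool" "gpu_pool" {
  # ...
  node_config {
    # ...
    taint = [
      # Only the custom taint is declared here. After creation, GKE also
      # attaches nvidia.com/gpu=present:NO_SCHEDULE, so the state ends up
      # with two taints while the config declares only one.
      {
        key    = "custom_taint"
        value  = "true"
        effect = "NO_SCHEDULE"
      },
    ]
  }
}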
@andre-lx I see. Thanks for the link. In my tests, all node pools were added to a new cluster, which is different from adding pools to an existing cluster. That explains why it works for mine and not for yours.
All the behaviors you have experienced appear to be controlled by GKE/Kubernetes. I don't think the provider has much room to act here. I am glad you have found a workaround.
@andre-lx closing this issue then. Feel free to reopen if you see something the provider can help with. Thank you.
Hi. An update from my side. As a workaround, from the docs: …
So you can set only your own taint (without the nvidia taint) and ignore changes to it with the ignore_changes lifecycle argument:
With this, you are able to create and update without losing the nvidia taint. The only problem I found is if you need to update the taints in your Terraform configuration: those changes are also ignored.
Just adding my voice here as well. This is a problem and it's very annoying, since Terraform tries to re-create the node pool every time because the taints do not match. The workarounds are: …
More people are discussing the problem here: terraform-google-modules/terraform-google-kubernetes-engine#703
@andre-lx ignoring taint changes with ignore_changes in the lifecycle block works:

resource "google_container_node_pool" "kubeflow_primary_gpu" {
# ...
node_config {
# ...
taint = [
{
key = "preemptible"
value = "true"
effect = "NO_EXECUTE"
},
{
key = "cloud.google.com/gke-preemptible"
value = "true"
effect = "NO_SCHEDULE"
},
]
}
lifecycle {
ignore_changes = [
node_config[0].taint,
]
}
}
Taints are likely to get fixed in a future major release. The current model for them has proven difficult enough to work with that I don't think we can fix it by adding behaviours in a backwards-compatible way.
Closed in GoogleCloudPlatform/magic-modules#9011
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. |
Community Note
If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.
Terraform Version
terraform -v
Affected Resource(s)
google_container_node_pool
Terraform Configuration Files
Debug Output
Right now we have a lot of pools, and in our GPU pools we have our own taints, but we need to comment out this taint on the first deploy:
Otherwise, Terraform will output the error:
After the first deploy, we need to uncomment it for the subsequent deploys (terraform apply), or Terraform will replace the node_pool each time we run the apply command.
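(A minimal sketch of what the uncommented file might look like for the subsequent applies; the resource name and taint values are assumptions, chosen to match the taints GKE adds automatically:)

resource "google_container_node_pool" "gpu_pool" {
  # ...
  node_config {
    # ...
    taint = [
      # Uncommented after the first deploy so the config matches the
      # taints already present in the state:
      {
        key    = "nvidia.com/gpu"
        value  = "present"
        effect = "NO_SCHEDULE"
      },
      {
        key    = "custom_taint"
        value  = "true"
        effect = "NO_SCHEDULE"
      },
    ]
  }
}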
Important Factoids
Authenticating as a service account instead of a user.
b/299312479