Cluster provisioning fails due to invalid NVIDIA keys during apt-get update in the CSE for agents and master #3933
Comments
OK, here is a workaround, tested and verified: deploy using ARM templates (run acs-engine generate first, then use az group deployment create), and pass an extra parameters file with content corresponding to the kubernetes.json above:
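A rough sketch of such an override file (the osImageVersion parameter name is taken from later comments in this thread and may differ between acs-engine versions; the "latest" value is assumed from the follow-up below):

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "osImageVersion": {
      "value": "latest"
    }
  }
}
```

Passing "latest" asks the platform for the newest published 16.04-LTS image instead of the pinned version baked into the generated template.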
This will use the latest 16.04-LTS image, which seems to work. Currently acs-engine uses by default an old image: 16.04.201804050. From the ID it seems to be from the 5th of April, which I think is rather old.
@nikolajbrinch thanks for reporting. What do you mean by "Currently acs-engine uses by default an old image: 16.04.201804050"? acs-engine in fact uses
Does the AKS 0.16.0 image version resolve the key problem? Unfortunately this isn't in an acs-engine release yet, but you could always build your own binary in the interim (or better yet, just hack the image version in parameters.json).
@grenzr good call, it most likely does, as our tests are not seeing the same error (running from the master branch). Also, as @davidmatson pointed out, the expiration points to a Unix timestamp.
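A quick way to verify the expiry on a node is to inspect the apt keyring; a minimal sketch, assuming the NVIDIA key is installed in the default keyring:

```bash
# The `pub` line just above the nvidia/cudatools uid shows an [expired: ...]
# marker once the signing key has lapsed; apt-get update then fails for that repo.
sudo apt-key list 2>/dev/null | grep -B 2 -i nvidia
```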
Confirmed - using
Also confirming that I was able to get a working cluster by switching osImageVersion to 0.16.0 in azuredeploy.parameters.json. (Just had to switch from acs-engine deploy to acs-engine generate to have the option to tweak this value.)
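A sketch of that generate-then-deploy flow (the resource group name and DNS prefix below are placeholders):

```bash
# Generate ARM templates from the API model instead of deploying directly.
acs-engine generate kubernetes.json

# Edit _output/<dnsPrefix>/azuredeploy.parameters.json and set osImageVersion to 0.16.0,
# then deploy the generated template with the Azure CLI.
az group deployment create \
  --resource-group <my-resource-group> \
  --template-file _output/<dnsPrefix>/azuredeploy.json \
  --parameters @_output/<dnsPrefix>/azuredeploy.parameters.json
```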
Great, thanks @davidmatson and @grenzr. A patch release is on its way.
@CecileRobertMichon I mean that, using acs-engine 0.22.3 and generating templates, the following is in my azuredeploy.json, and I was under the impression that azuredeploy.json states the defaultValues for the current acs-engine version. That would be nice though, so I wouldn't have to go to the code...
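For illustration, a parameter with a pinned default in the generated azuredeploy.json would look roughly like this (the parameter name and metadata text are assumptions; the 16.04.201804050 value is the default mentioned above):

```json
"osImageVersion": {
  "type": "string",
  "defaultValue": "16.04.201804050",
  "metadata": {
    "description": "Version of the Ubuntu 16.04-LTS image used for cluster VMs."
  }
}
```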
@CecileRobertMichon on the docker version thing: the reason for deploying this cluster again was that I crashed the API server and decided to just rebuild the cluster (it is a test cluster). BTW, the Troubleshooting documentation helped me figure out the problem, so many thanks for writing that up.
@nikolajbrinch my initial intuition about the newer docker version was wrong; the problem was definitely that the nvidia key used in the 0.15.0 VHD image had expired. The only way your docker version affected this is that, because that version is not preinstalled on our AKS curated image, it was being pulled at deployment time, which included running an apt-get update, and that was hitting the nvidia key error. Thanks for the feedback on the default value in the template; that version is actually not being used, but I will make sure it's removed or updated to prevent confusion.
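For context, a custom Docker version is requested in the API model, which is what triggers the extra package installation during provisioning. A rough sketch of such a kubernetes.json (the dockerEngineVersion field name and value format are assumptions based on the vlabs API of that era and may differ in your acs-engine version):

```json
{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "kubernetesConfig": {
        "dockerEngineVersion": "17.05.*"
      }
    }
  }
}
```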
@CecileRobertMichon Thank you for clarifying everything and making a quick fix. Tested with a small cluster provisioning and it works fine now (0.22.3). We use the newer docker version as it supports multistage builds and other newer features. Our Jenkins servers run on this and do the image builds, and we like using multistage builds and setting ownership and group during copy :-)
This is already closed, but another way might be to update the keys as described here: https://nvidia.github.io/nvidia-docker/
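For reference, the key refresh that page describes boils down to re-adding NVIDIA's GPG key before running apt-get update; a minimal sketch for an Ubuntu node:

```bash
# Re-import the current NVIDIA package-signing key into apt's keyring,
# then refresh the package lists.
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
sudo apt-get update
```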
@d-demirci Yes, that would be ideal, but that is not up to the user; it's up to the maintainer. The CSE is failing at provisioning, so there is nothing I can do, as far as I know?
Is this a request for help?:
yes
Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE
What version of acs-engine?:
0.22.2
kubernetes.json
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
What happened:
Command output:
/var/log/azure/cluster-provisioning.log
What you expected to happen:
A new cluster should be built
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
I logged in to the master and executed
with the following result
As can be seen from the above, something is wrong with the nvidia repository keys.
Please advise as I'm stuck at the moment.
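A hedged sketch of the manual check described above, assuming (per the issue title) that the command run on the master was apt-get update:

```bash
# With an expired repository key, apt typically reports EXPKEYSIG /
# "The following signatures were invalid" errors for the nvidia-docker sources.
sudo apt-get update
```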