Trouble attaching volume #884
@andyzhangx any thoughts? I'm a bit uncomfortable just force-detaching my PVCs.
One of my PVCs that failed to attach, described with: kubectl describe pvc claim-resX -n res-jhub
@yvan could you check the status of the VM?
kubectl get no
there is no such PVC as:
This seems to refer to:
All my PVCs always have status Bound, even when they are not in use by a user or an app. It never caused an issue like this before; I just started experiencing this this morning.
@yvan, I mean could you go to the Azure portal to check the status of the VM?
There's a problem with the node aks-agentpool-57634498-0: it has status 'Running', yet I actually see no such data disk (kubernetes-dynamic-pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9) in the portal. Here are all my PVCs with names of the form 'kubernetes-dynamic-pvc', listed per node:
aks-agentpool-57634498-0
aks-agentpool-57634498-1
aks-agentpool-57634498-2
aks-agentpool-57634498-3
There is one with a VERY similar name on
could you run
I gave it a go; the result: az vm update -g MC_risc-ml_ds-cluster_westeurope -n aks-agentpool-57634498-0
could you help find that
You would get the full resource path of that disk; check whether that disk exists or not.
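As a rough sketch of that check (using the node resource group name from the az vm update command earlier in this thread, so adjust to your own cluster), you could list the managed disks and which VM currently holds each of them:
az disk list -g MC_risc-ml_ds-cluster_westeurope --query "[].{name:name, attachedTo:managedBy}" -o table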
Ok so it exists if I show all namespaces: kubectl get pvc --all-namespaces
But if I check the namespace where it should be I get: kubectl get pvc pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9 -n res-jhub
@yvan Your PV has "pvc" in the name, creating some confusion. Contrast
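For illustration, using the names already mentioned in this thread: the claim is a namespaced PVC, while the volume it binds to is a cluster-scoped PV whose generated name happens to be "pvc-" plus the claim's UID, which is why the two are easy to mix up:
kubectl get pvc claim-resX -n res-jhub
kubectl get pv pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9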
The names are just generated by jupyterhub. I agree it's mildly annoying. At the end of the day I want to understand why this happened and care a lot less about the names.
@yvan could you run
Ok I think this is what you wanted to locate: kubectl get pv pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9 -o yaml -n res-jhub
I checked to see if the diskURI exists and found 2 disks with similar names:
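As a side note, a quick way to pull just the disk URI out of that PV (assuming the in-tree azureDisk volume source that AKS dynamic provisioning used at the time) would be:
kubectl get pv pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9 -o jsonpath='{.spec.azureDisk.diskURI}'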
could you check which node that disk is attached to?
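One way to answer that from the Azure side, using the disk and resource-group names from earlier in the thread (the managedBy property is empty when the disk is not attached to any VM):
az disk show -g MC_risc-ml_ds-cluster_westeurope -n kubernetes-dynamic-pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9 --query managedBy -o tsv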
1 - The disk is attached to node: Also, mysteriously, a bunch of disk-pressure messages (that definitely had not been popping up over the last 14-22 days) have appeared in my event log for my 4th node: kubectl describe no aks-agentpool-57634498-3
seems related to pulling images. no events for
Small update: in the end I just waited a day and the cluster eventually cleared up the resources. It seems connected to a broader service outage/issue mounting disks on AKS.
A similar issue arose in my cluster: a pod could not start because the PV corresponding to its PVC was bound to another node and the disk (PV) could not be detached. This happened after changing the service principal on all my nodes. It seems very non-deterministic in nature, yet it happens occasionally in aks-engine-generated clusters.
@Gangareddy you may also need to change the service principal on the master node, otherwise the detach-disk operation cannot succeed. In that case, you could manually detach that disk from the agent node, and k8s will automatically attach the disk (PV) to the new node.
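A sketch of that manual detach via the Azure CLI, with placeholders for the names that depend on your cluster:
az vm disk detach -g <node-resource-group> --vm-name <agent-vm-name> -n <disk-name>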
@andyzhangx: I have updated the service principal on the master nodes as well. I thought manually detaching the disk would complicate the self-healing nature of AKS. However, I was able to create additional disks from snapshots of the disks that were stuck to the VM, and I changed my PVs (persistent volumes) to use the newly created disks from those snapshots. But I wonder: is a VM reboot necessary after changing the service principal on the VM?
@Gangareddy please follow this guide to reset the service principal: https://docs.microsoft.com/en-us/azure/aks/update-credentials. On the agent nodes, you only need to restart kubelet:
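On a systemd-based agent node that would typically be something like:
sudo systemctl restart kubelet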
Unable to attach PVCs to a basic K8s deployment in Azure. When is K8s going to be production-ready? This is just sad. This cluster is brand new, created today.
@mcurry-brierley could you provide more details about this issue? e.g.
You can close this, as I have removed all resources and we have decided not to go forward with Azure as a result.
Azure is not an enterprise environment. When I feel it is, I will try again...
@mcurry-brierley I would say you may have happened to hit this VMSS disk-attach issue, which has only shown up in the last two weeks; the VMSS team is hotfixing it. Just set up an Availability Set (non-VMSS) based cluster; I am pretty sure it won't have this issue. Ping me on Slack if you hit such an issue again.
@andyzhangx, we seem to be hit by this in our production environments as well. I'll ping you on Teams (internal) to see what we can do about this problem. In general, I agree with @mcurry-brierley on how annoying this is. We have had way too many issues with VMSS in the last 6 months, and I am really tempted to track their team down and ask them for their SLA and where they have been with respect to it over those 6 months.
@vijaygos @andyzhangx Has there been any progress on this? VMSS is totally unusable with AKS, which obviously means multiple node pools are out of the window. Following a lengthy support call with Andy Wu in your support team, he advised I give up on my previous cluster and create a new one. As a test I've deleted a pod to reschedule it elsewhere, and I'm already seeing the same errors. Surely loads of people must be seeing this; is it ever going to be fixed?
@davestephens Your issue
@vijaygos and I have a point. This is a production product?
We are actually experiencing something similar here. We had a failing instance in a VMSS-based cluster. After deleting the instance, it seems the Kubernetes control plane still sees the disks as attached. Looking in the Azure portal (or using the Azure CLI) the disks are unattached; however, starting up the pod we get the following status:
It's like Kubernetes has cached the information that this disk is attached to a node which is no longer part of the cluster. We are running the latest non-preview version of AKS, 1.14.8.
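One way to see what the control plane still believes is attached to a given node (node name is a placeholder) is to read the attach status recorded on the Node object:
kubectl get node <node-name> -o jsonpath='{.status.volumesAttached}'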
Hi all, I'm not sure if this is the same issue or not... but I performed a k8s version upgrade on our non-prod cluster today. During that, one of the nodes died and caused problems. After restarting that node, the service that runs on that node wouldn't redeploy and is stuck in a perpetual
I'm also experiencing this problem intermittently when making deployments. The following error is given:
We aren't in production yet but are quite hesitant to go further unless this issue is resolved.
The error info is from the k8s volume controller, which means the volume was not unmounted from the previous node. Did the volume attach succeed eventually?
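A simple way to check whether the attach eventually succeeded is to look through the pod's events for FailedAttachVolume or Multi-Attach messages, for example:
kubectl describe pod <pod-name> -n <namespace> | grep -iE 'attach|mount'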
back to this question again, there are two kinds of
I will close this issue. Let me know if you have any questions.
Happened to me as well when deploying several Helm charts. Created a cluster without VMSS and the problem was solved.
Having an issue where I'm getting multi-attach errors when I try to attach a PVC. This issue was already brought up again 6 days ago in #615; I'm just reopening it here as per the instructions on that thread.
What happened:
Pods cannot attach a PVC because it's bound somewhere else (though it should not be).
What I expect to happen:
Pods should be able to bind a PVC.
How to reproduce:
Not sure, as I don't know why PVCs that are not in use would be attached or seen as attached by k8s.
k8s version:
azure region:
west europe
kubectl describe pod hub-7476649468-qfj75 -n res-jhub:
how many disks mounting into one VM in parallel:
The hub pod (whose describe is posted above) mounts the 1Gi hub-db-dir claim. Every user pod that tries to spawn mounts one of claim-res(1-4) and also mounts jupyterhub-shares-res-volume, which is an azurefile.
what vms:
4 nodes/VMs with spec: Standard D16s v3 (16 vCPUs, 64 GB memory)
No disk, CPU, or memory pressure is shown in the node descriptions.
Other Similar Issues:
#477
#615
Not sure if related, but my image puller also seems to be failing because it is looking for a file in the kubelet folder that it expects but cannot find.