
Two PVC-s bound to the same PV #2456

Closed · 1 task done
IvanKuchin opened this issue Feb 21, 2022 · 27 comments

Comments

@IvanKuchin

IvanKuchin commented Feb 21, 2022

Describe the bug

Hello team,

The reconciliation process creates the new pod before deleting the old one. If the pod has a PVC in its volumes section, that ordering creates a double claim on the same PV.
IMO the order of operations should be:

  1. Remove the old pod
  2. Create the new one

Steps to reproduce

The easiest way to reproduce is to follow the "Automate image updates to Git" guide, with the following additions to podinfo-deployment.yaml.

Step 1) Add a PV/PVC and attach the volume to the pod.

      volumes:
        - name: empty-dir-vol
          persistentVolumeClaim:
            claimName: empty-dir-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: empty-dir-pvc
  namespace: podinfo-image-updater
spec:
  storageClassName: slow
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  labels:
    type: nfs
  name: podinfoimageupdater-emptydir-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  claimRef:
    name: empty-dir-pvc
    namespace: podinfo-image-updater
  nfs:
    path: /storage_local/podinfo-image-updater/empty-dir
    server: 192.168.170.36
  storageClassName: slow

If that is confusing, the full manifest is here.

Step 2) Change the image version to trigger deployment reconciliation.

Step 3) Observe the problem: the PVC ends up in the Lost state.

$ kubectl get pvc -w
NAME            STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
empty-dir-pvc   Bound     podinfoimageupdater-emptydir-pv   10Gi       RWO            slow           11s
empty-dir-pvc   Lost      podinfoimageupdater-emptydir-pv   0                         slow           2m26s
ikuchin@microk8s-test:~$ microk8s.kubectl describe pvc 
Name:          empty-dir-pvc
Namespace:     podinfo-image-updater
StorageClass:  slow
Status:        Lost
Volume:        podinfoimageupdater-emptydir-pv
Labels:        kustomize.toolkit.fluxcd.io/name=flux-system
               kustomize.toolkit.fluxcd.io/namespace=flux-system
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      0
Access Modes:  
VolumeMode:    Filesystem
Used By:       podinfo-9ccf96ff5-6d8nx    <----------- notice podID
Events:
  Type     Reason         Age   From                         Message
  ----     ------         ----  ----                         -------
  Warning  ClaimMisbound  26s   persistentvolume-controller  Two claims are bound to the same volume, this one is bound incorrectly

The PV ends up in the Available state:

$ kubectl get pv -w
podinfoimageupdater-emptydir-pv     10Gi       RWO            Retain           Bound       podinfo-image-updater/empty-dir-pvc   slow                    2m23s
podinfoimageupdater-emptydir-pv     10Gi       RWO            Retain           Available   podinfo-image-updater/empty-dir-pvc   slow                    2m23s

The reason is the order of the pod update operations:

$ kubectl get pod -w
NAME                       READY   STATUS    RESTARTS       AGE
podinfo-844777597c-hhj8g   1/1     Running   1 (114m ago)   11h <----- this pod owns PVC
podinfo-9ccf96ff5-6d8nx    0/1     Pending   0              0s
podinfo-9ccf96ff5-6d8nx    0/1     Pending   0              0s
podinfo-9ccf96ff5-6d8nx    0/1     Pending   0              14s
podinfo-9ccf96ff5-6d8nx    0/1     ContainerCreating   0              14s
podinfo-9ccf96ff5-6d8nx    0/1     ContainerCreating   0              15s
podinfo-9ccf96ff5-6d8nx    1/1     Running             0              15s   <--------- this pod creates duplicate PVC
podinfo-844777597c-hhj8g   1/1     Terminating         1 (116m ago)   11h
podinfo-844777597c-hhj8g   0/1     Terminating         1 (116m ago)   11h
podinfo-844777597c-hhj8g   0/1     Terminating         1 (116m ago)   11h
podinfo-844777597c-hhj8g   0/1     Terminating         1 (116m ago)   11h

Expected behavior

Successful image update even with PV/PVC attached to the pod

Screenshots and recordings

No response

OS / Distro

Ubuntu 20.04.3 LTS (Focal Fossa)

Flux version

flux version 0.27.0

Flux check

$ flux check
► checking prerequisites
✔ Kubernetes 1.22.6-3+7ab10db7034594 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.17.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.22.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.16.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.20.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.21.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.21.2
✔ all checks passed

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@IvanKuchin IvanKuchin changed the title Two PVC bound to the same PV Two PVC-s bound to the same PV Feb 21, 2022
@stefanprodan
Member

The reconciliation process creates the new pod before deleting the old one. If the pod has a PVC in its volumes section, that ordering creates a double claim on the same PV.

It's not Flux that manages pods, but the Kubernetes Deployment and ReplicaSet controllers. PVCs should be used with StatefulSets; Kubernetes knows how to handle the rolling upgrade accordingly. If you want to use PVCs with Kubernetes Deployments, then set the strategy to Recreate, which deletes the old pod before creating the new one.
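
For illustration, a minimal sketch of that setting on the podinfo Deployment from the guide (the container name, labels and image are assumptions based on the steps above):

# Sketch: strategy.type Recreate deletes the old pod before creating the
# new one, so the PVC is only ever claimed by a single pod at a time.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
  namespace: podinfo-image-updater
spec:
  strategy:
    type: Recreate   # the default, RollingUpdate, overlaps old and new pods
  selector:
    matchLabels:
      app: podinfo
  template:
    metadata:
      labels:
        app: podinfo
    spec:
      containers:
        - name: podinfod
          image: ghcr.io/stefanprodan/podinfo:5.1.0
      volumes:
        - name: empty-dir-vol
          persistentVolumeClaim:
            claimName: empty-dir-pvc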

@IvanKuchin
Author

Thank you Stefan. Sorry that I didn't do enough exploration before opening the issue.

@IvanKuchin
Author

IvanKuchin commented Feb 24, 2022

Hello Stefan,

Sorry for bugging you again and re-opening the issue.
I did a lot of testing, and you are right that the order of operations can be influenced by Deployment.spec.strategy.type == RollingUpdate or Recreate. But it doesn't change the success or failure of the deployment.
I'll try to keep this post short, with descriptions only. Please let me know if you want to look at the actual outputs.

Regardless of the deployment strategy, a plain-Kubernetes image upgrade works perfectly fine:

  • RollingUpdate strategy + kubectl set image deployment/podinfo podinfod=ghcr.io/stefanprodan/podinfo:5.1.4 - no problem
  • Recreate strategy + kubectl set image deployment/podinfo podinfod=ghcr.io/stefanprodan/podinfo:5.1.0 - no problem
  • RollingUpdate strategy + kustomize build . | kubectl apply/delete -f - - no problem
  • Recreate strategy + kustomize build . | kubectl apply/delete -f - - no problem

With Flux, though:

  • RollingUpdate strategy + flux reconcile kustomization flux-system - puts the PVC in the Lost state with a ClaimMisbound warning
  • Recreate strategy + flux reconcile kustomization flux-system - puts the PVC in the Lost state without warnings

I think Flux is doing something unusual compared to a standard kubectl set image or kustomize build . | kubectl apply.

Could you comment?


A couple of other things I tried:

  • Annotating the PVC with kustomize.toolkit.fluxcd.io/reconcile: disabled and then reconciling still put the PVC in a Lost state.
  • Changing the PVC/PV access mode to RWX/RWO also put the PVC in a Lost state.

Sorry for bothering you again.

@IvanKuchin IvanKuchin reopened this Feb 24, 2022
@damyan

damyan commented Apr 13, 2022

Same here: using a PVC with a deployment (the combination is unfortunately not under our control) leads to a ClaimMisbound and phase: Lost.

@daogilvie

daogilvie commented Apr 25, 2022

+1 from me for reports of this in the wild. Really happy with Flux generally, thanks for making a cool bit of software!

As for this issue specifically: we are also experiencing something similar, immediately after doing a rolling update of our Kubernetes cluster.
We have just updated to 1.22.6 on GKE, so is it possible there's something in this K8s version that is problematic for Flux? We have a flux2-managed Kustomization, and this issue appears even though we have tried completely removing the app with a flux delete followed by re-adding the Kustomization, i.e. we get this when starting from an empty cluster.
We have always had the relevant Deployments use a "Recreate" strategy too, so that has not changed.
Our app itself actually works fine, and is able to access and use the volumes, so we're mostly just confused 😄

❯ flux check
► checking prerequisites
✔ Kubernetes 1.22.6-gke.300 >=1.20.6-0
► checking controllers
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.22.3
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.23.2
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.22.5
✔ all checks passed

@jmriebold

jmriebold commented Apr 30, 2022

@kingdonb, RE: your suggestion in #2250, you were absolutely correct: claimRef is not required, and without it Flux has no problem with the PV(C)s. In hindsight this makes perfect sense since the PVC has a selector, which is enough to bind, but since I'd always had the field there and it was referenced in some articles I read, I assumed it was necessary.

Thanks a bunch, this solves our issue completely.

@IvanKuchin, if you remove claimRef from your PV and add spec.selector to your PVC (e.g. spec.selector.matchLabels with a corresponding label on the PV) that ought to fix your problem as well.
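
A minimal sketch of that shape, reusing the manifests from the issue description (the label key/value here are illustrative, not required names):

# Sketch: no claimRef on the PV; the PVC selects the PV by label instead.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: podinfoimageupdater-emptydir-pv
  labels:
    pv-name: podinfoimageupdater-emptydir-pv   # matched by the PVC selector below
spec:
  storageClassName: slow
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  nfs:
    path: /storage_local/podinfo-image-updater/empty-dir
    server: 192.168.170.36
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: empty-dir-pvc
  namespace: podinfo-image-updater
spec:
  storageClassName: slow
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
  selector:
    matchLabels:
      pv-name: podinfoimageupdater-emptydir-pv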

@Robbilie

Robbilie commented May 1, 2022

I am using local-path PVs, and I want to prevent PVCs from namespaces other than the one I configured from being bound to them; that's why I was using claimRef. I don't really see an alternative for my problem though 😕

@jmriebold

@Robbilie, would a selector not work for some reason?

@Robbilie

Robbilie commented May 1, 2022

I can pin the PVC to a PV (just by setting the volume name; it doesn't even need the selector), but I want to enforce that only this one PVC is allowed to be bound to the PV. The claimRef is on the PV side, so no other PVC is allowed to bind to it. The volume name/selector on the PVC side doesn't prevent this...

This way, if I allow someone to create PVCs, they can simply bind to the PV and write to the local-path volume they shouldn't have access to.

@jmriebold

Kubernetes itself enforces that though. From the docs:

Once bound, PersistentVolumeClaim binds are exclusive, regardless of how they were bound.

The claimRef field, if unset on PV creation, will be set as soon as a PVC binds the volume (e.g. via a selector). I just tested this a few minutes ago, so unless local path volumes work differently, a volume name or selector on the PVC should be enough.

@Robbilie

Robbilie commented May 1, 2022

Yeah, but what if there are two PVCs? Isn't that a race condition where it's possible the "malicious" one binds first?

@jmriebold

Well yeah I guess so, but if that's really the situation you've likely got bigger problems than PVC binding.

Regardless, @kingdonb, after reviewing the docs, I think Flux's behavior here is still a bug of some sort. According to the docs:

If other PersistentVolumeClaims could use the PV that you specify, you first need to reserve that storage volume. Specify the relevant PersistentVolumeClaim in the claimRef field of the PV so that other PVCs can not bind to it.

This is essentially the situation that @Robbilie is talking about, and it suggests that the field is not for control plane usage only, so it follows that Flux should be able to handle this situation.

That said, removing the claimRef from our PVs was an easy workaround, so we, at least, aren't blocked by this.

@kingdonb
Member

kingdonb commented May 2, 2022

@jmriebold Thanks for finding that! I did not know, maybe this is an issue we need to handle somehow.

You might try the ssa: merge annotation described here:
https://fluxcd.io/docs/faq/#how-to-patch-coredns-and-other-pre-installed-addons
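
For example, on the PV (a minimal sketch; the annotation tells the kustomize-controller to merge with, rather than override, fields set by other cluster actors, such as the claimRef filled in at bind time):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: podinfoimageupdater-emptydir-pv
  annotations:
    kustomize.toolkit.fluxcd.io/ssa: merge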

In general, the disposition of Flux has changed from "merge with externally provided configuration" to "override it, unless you specifically opt in to merging behavior, or the change comes from a known cluster actor".

Well, in this case it is a known cluster actor. It seems possible that the bad behavior is outside of Flux: if you should be able to set claimRef, then it's the storage driver that needs to know how to handle that without losing the volume (the Lost phase).

I think someone who is affected by this is going to have to figure it out and tell us what needs to change. I haven't found any situations in my own clusters where I need to set up volumes on the same cluster for tenants that do not trust each other but all need to create volumes from the same pool. I would probably try to separate them harder, with a storage class that has a separate configuration per user and admission controllers that block users from selecting PVs in storage classes that don't belong to them, but this might not be possible with every storage provider.

@IvanKuchin
Author

@kingdonb thank you for the workaround.

I'm out of the office now; please give me a few days to test it out.

@daogilvie

daogilvie commented May 3, 2022

... if you remove claimRef from your PV and add spec.selector to your PVC (e.g. spec.selector.matchLabels with a corresponding label on the PV) that ought to fix your problem as well.

@jmriebold would you expect this to work if we were using the volumeName field as per Reserving a PersistentVolume? Or is that going to fall foul of the same problems as the claimRef approach? I ask because we are using volumeName and the issue persists. I'll also try your suggestion to use a selector today. Thanks!

@jmriebold

@daogilvie are you using both volumeName and claimRef? If so, yes, I would think removing the claimRef field would be enough, but haven't tested myself.

@daogilvie

@daogilvie are you using both volumeName and claimRef? If so, yes, I would think removing the claimRef field would be enough, but haven't tested myself.

Thanks for replying so quickly!

We are using only volumeName.

We were using just claimRef when this issue first hit us (as we had been for some time), and then we swapped to volumeName when we thought that might work better. It is also the "more gooder" way of doing our use case, apparently.

I'll report back if the selector solves the problem for us or not 👍

@jmriebold

Just volumeName and your volumes are still getting Lost? That's surprising, I would've expected that to work fine.

@daogilvie

Yes, we expected it to work too 😄
It is entirely possible we have got some sort of mixed state, or other issues with the configuration. I'll confirm as best I can exactly what works and what doesn't over the course of today, just as soon as my team agrees to let me spin things up and down on the staging cluster with wild abandon.

@daogilvie

daogilvie commented May 3, 2022

Using selectors doesn't work for us either, but it fails in a new and different way.

When the following input is provided to Flux in the form of a kustomization overlay (just the relevant PV/PVC shown here):

---
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    app_version: 2022.4.2
  labels:
    app_name: foyer
    generator: kustomize
    vol: flower-pv
  name: flower-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100G
  claimRef:
    namespace: default
  gcePersistentDisk:
    fsType: ext4
    pdName: foyer-pd
  storageClassName: ""
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    app_version: 2022.4.2
  labels:
    app_name: foyer
    generator: kustomize
  name: flower-pv-claim
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100G
  selector:
    matchLabels:
      vol: flower-pv
  storageClassName: ""
  volumeName: flower-pv
---

...we now get the pods and the PVC stuck in Pending: the pods aren't scheduled onto a node because the claim won't bind, and the claim won't bind because:

 Warning  FailedBinding  2m27s (x26 over 8m40s)  persistentvolume-controller  volume "flower-pv" already bound to a different claim.

And when we go to look at this "already bound" volume itself, it has the status of Available 🤷

I had run a flux delete kustomization ..., then a corresponding create, to clear out and reinstall the cluster. I don't really want to completely remove Flux, but if I have to try that then I can.

Please let me know what information you might want/need here, but we are going to go back to our workaround, which is manually creating the PV/PVC with a direct apply and then having Flux manage everything else.

@kingdonb
Member

kingdonb commented May 3, 2022

Now I see what you're doing is not what I expected:

  claimRef:
    namespace: default

Is your intention here to enforce that the claim must come from the named namespace? I do not think you can count on this behavior to work the way you want; I've never seen it documented to work as you describe.

If you were going to specify a claimRef, then I think you'd need to specify the whole claimRef for it to have any effect. But we have only this one thin reference in the docs that even asserts a claimRef is something that can be specified by a user, rather than being instantiated into the spec by the kube-controller-manager or whoever is in charge of the CSI driver.
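
Something along these lines, reusing the names from the manifests in this thread (a sketch only; note the control plane still fills in uid and resourceVersion on the claimRef at bind time, so some drift is unavoidable):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: flower-pv
spec:
  # (other PV fields omitted for brevity)
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: flower-pv-claim
    namespace: default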

@daogilvie

daogilvie commented May 3, 2022

Now I see what you're doing is not what I expected:

  claimRef:
    namespace: default

Is your intention here to enforce that the claim must come from the named namespace? I do not think you can count on this behavior to work the way you want; I've never seen it documented to work as you describe.

If you were going to specify a claimRef, then I think you'd need to specify the whole claimRef for it to have any effect. But we have only this one thin reference in the docs that even asserts a claimRef is something that can be specified by a user, rather than being instantiated into the spec by the kube-controller-manager or whoever is in charge of the CSI driver.

Hi @kingdonb — thank you for this! I had missed that, as it is actually added by a Kustomization config and is not included in the main base YAML that I've been editing. I'll try this whole thing again having taken that out — we found that we needed it previously for the claimRef method to work, when we were using it as the means of binding.
It isn't something we've deliberately left in while trying the direct-bind methods. It could well be causing an issue though, because in the specs we apply directly with kubectl it naturally isn't there 👍

@daogilvie

Thank you @kingdonb, @jmriebold. Now that I have removed that claimRef kustomization, using volumeName works perfectly. We no longer see the Lost status, and everything is ship-shape again for us.
Really appreciate you taking the time and helping us continue using a great bit of software.
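
For anyone else hitting this, a sketch of the shape that works for us (the manifests above with the claimRef patch removed; the PVC pins the volume via volumeName):

# Sketch: no claimRef on the PV; the PVC pins the volume by name only.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: flower-pv
  labels:
    vol: flower-pv
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 100G
  gcePersistentDisk:
    fsType: ext4
    pdName: foyer-pd
  storageClassName: ""
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: flower-pv-claim
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100G
  storageClassName: ""
  volumeName: flower-pv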

@Robbilie

Robbilie commented May 4, 2022

@kingdonb I created a PV and a PVC with claimRef on the PV and the ssa: merge annotation, and it's Lost too :/

@kingdonb
Member

kingdonb commented May 6, 2022

If your claimRef does not match exactly what the controller-manager decides to do with the PVC, then it's probably going to mark the PVC as Lost. @Robbilie, we have a number of reports here from people who had this issue and were able to resolve it in Flux, but the behavior can differ from one Kubernetes CSI environment to another. Do you have more information about your issue so we can see what's gone wrong? It might be good to open a separate issue and provide the details there.

Please provide as much information as you can: which cloud provider, and which specific configuration (all YAMLs needed to trigger the issue, as a minimal set) causes the volume to be lost.

I am skeptical that there is a Flux issue requiring any code change here, given how many people seem to have resolved it for themselves, but that does not mean we don't have a UX issue worth addressing with a doc that is super clear about how to handle PVs and PVCs in Flux when working with stateful workloads. It's worth documenting at least (and if some CSIs behave differently from others, we should try to capture the important parts for our users in a doc).

@Robbilie

Robbilie commented May 6, 2022

I don't think anyone got it to work while still keeping the claimRef, no? The solution was to drop the claimRef on the PV, so the original issue persists, I believe…

I am using RKE2 on Ubuntu 20.04 nodes and these resources:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: test
  annotations:
    kustomize.toolkit.fluxcd.io/ssa: merge
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  claimRef:
    name: "test"
    namespace: "dependency-track"
  persistentVolumeReclaimPolicy: Retain
  storageClassName: "local-path"
  local:
    path: /opt/test
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - decgn-pr-vmcq-main
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test
  namespace: dependency-track
  annotations:
    kustomize.toolkit.fluxcd.io/ssa: merge
spec:
  storageClassName: "local-path"
  volumeName: "test"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

@kingdonb
Member

kingdonb commented May 6, 2022

@Robbilie I don't think anyone else needed to set claimRef to get the binding to happen by volumeName in the end.

I noticed you set ssa: merge on your PersistentVolumeClaim. I don't think PVCs can be mutated after they are created. What is the purpose of that annotation there? I wonder, did you use Kustomization.spec.force to get it applied?

If so, it's not too surprising that your PVs go to the Lost phase. PVCs are immutable after creation, except for:

Kubernetes: PersistentVolumeClaim error, Forbidden: is immutable after creation except resources.requests for bound claims.

... that is, except for resources.requests, which can still be mutated after the PVC is created.

There may be something in common between your issue and the original issue reported here, but if you want to follow it up, it's better to create a new issue. I'm not sure whether any CSI requires the user to set claimRef for any reason, but maybe RKE2 does; in my experience it has never been necessary. Please include details of what happens if you do not set claimRef in your report, so we can understand why you've configured it this way, and include supporting docs if you have them for why this should work.

I don't want to spam the original poster of this issue if we're now talking about a different issue. I'm going to close this one, but we can reopen it if the original poster is still dealing with it.

In the meantime, I'd ask that anyone who is still struggling with this please open a separate issue; feel free to link back to this issue for context if applicable.

@kingdonb kingdonb closed this as completed May 6, 2022
cfergs added a commit to cfergs/kubernetes-homelab that referenced this issue May 26, 2022