
VolumeOp was not able to create PVC #5257

Closed
yuhuishi-convect opened this issue Mar 8, 2021 · 18 comments

@yuhuishi-convect

yuhuishi-convect commented Mar 8, 2021

What steps did you take:

A simple pipeline with one VolumeOp and one simple step that mounts the PVC it creates.

# minimal example

import kfp

@kfp.dsl.pipeline(
    name="data download and upload",
    description="Data IO test"
)
def volume_pipeline():
    # shared vol
    
    vop = kfp.dsl.VolumeOp(
        name="volume_creation",
        resource_name="sharedpvc",
        size="5Gi",
        modes=["RWO"]
    )
    
    # mount the vol
    
    simple_task = kfp.dsl.ContainerOp(
        name="simple task",
        image="bash",
        arguments=[
            "echo",
            "hello",
            ">/data/hello.text"
        ]
    ).add_pvolumes({
        "/data": vop.volume
    })
    
# run the pipeline

client = kfp.Client()

client.create_run_from_pipeline_func(volume_pipeline, arguments={})

What happened:

The VolumeOp was not able to create the PVC, so the dependent task complains about not finding the PVC.
kubectl get pvc -n kubeflow | grep sharedpvc didn't return any results.

What did you expect to happen:

The VolumeOp should create a PVC named sharedpvc.

Environment:

How did you deploy Kubeflow Pipelines (KFP)?
Deploying Kubeflow Pipelines on a local kind cluster

KFP version: 1.2.0

KFP SDK version: 1.4.0

Anything else you would like to add:


The log of the VolumeOp indicates

This step output is taken from cache.

I was trying to prevent it from using the cache but didn't succeed.
The manifest from the VolumeOp:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: '{{workflow.name}}-sharedpvc'
spec:
  accessModes:
  - RWO
  resources:
    requests:
      storage: 5Gi

The log of the dependent task indicates

This step is in Pending state with this message: Unschedulable: persistentvolumeclaim "{{tasks.volume-creation.outputs.parameters.volume-creation-name}}" not found

Storage class used

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"},"name":"standard"},"provisioner":"rancher.io/local-path","reclaimPolicy":"Delete","volumeBindingMode":"WaitForFirstConsumer"}
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2020-12-29T06:31:17Z"
  name: standard
  resourceVersion: "195"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/standard
  uid: 5dbc1bff-b488-4d3a-b45f-e710cf96a415
provisioner: rancher.io/local-path
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

/kind bug

@Bobgy
Contributor

Bobgy commented Mar 30, 2021

/assign @elikatsis
Can you help with this issue?

@munagekar
Contributor

munagekar commented Apr 5, 2021

Kubeflow caches steps, so given the same inputs, it skips the step and instead fetches the outputs from MinIO.
A VolumeOp has no outputs in MinIO, so ideally VolumeOp should not be cached.

In this case you would need to change the name of the PVC.

You can also refer to #5055 (comment).

@munagekar
Contributor

munagekar commented Apr 5, 2021

Does the following work?

    vop = kfp.dsl.VolumeOp(
        name="volume_creation",
        resource_name=f"{{{{workflow.name}}}}-sharedpvc",
        size="5Gi",
        modes=["RWO"]
    )

This will hopefully change the input to the ResourceOp and hence prevent caching.

Update: This workaround doesn't seem to work with Kubeflow 1.3 / the latest kfp version. I think this workaround worked with Kubeflow 1.2.

@yuhuishi-convect
Author

Does the following work?

    vop = kfp.dsl.VolumeOp(
        name="volume_creation",
        resource_name=f"{{{{workflow.name}}}}-sharedpvc",
        size="5Gi",
        modes=["RWO"]
    )

This will hopefully change the input to the ResourceOp and hence prevent caching.

Yes, the problem is fixed after I disabled the cache. Thanks for the help.
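
For later readers: besides disabling caching globally, caching can also be skipped per step. Below is a minimal sketch, not the author's exact fix: the VolumeOp uses the max_cache_staleness pod annotation that later comments in this thread confirm, and the container task uses the task.execution_options.caching_strategy.max_cache_staleness field documented on the KFP v1 caching page linked further down. Op names and the echo command are only illustrative, and the sketch uses dsl.VOLUME_MODE_RWO (which expands to ["ReadWriteOnce"]) for the access mode.

    from kfp import dsl

    @dsl.pipeline(name="volume pipeline without caching")
    def volume_pipeline():
        vop = dsl.VolumeOp(
            name="volume_creation",
            resource_name="sharedpvc",
            size="5Gi",
            modes=dsl.VOLUME_MODE_RWO,
        ).add_pod_annotation(
            # "P0D" = zero allowed staleness, so the cache server never reuses a cached result.
            name="pipelines.kubeflow.org/max_cache_staleness",
            value="P0D",
        )

        task = dsl.ContainerOp(
            name="write-hello",
            image="bash",
            command=["sh", "-c"],
            arguments=["echo hello > /data/hello.txt"],
        ).add_pvolumes({"/data": vop.volume})
        # Per-task equivalent of the annotation, per the KFP v1 caching docs.
        task.execution_options.caching_strategy.max_cache_staleness = "P0D"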

@elikatsis
Member

@Bobgy sorry I had totally missed this.

I think the problem here is that

  1. the mechanism is caching steps that it shouldn't, and
  2. there is no "user" selection of whether to cache a specific step or not, only the global API server configuration, which overrides any per-step setting:
    workflow.SetLabelsToAllTemplates(util.LabelKeyCacheEnabled, common.IsCacheEnabled())

@elikatsis
Member

I'll reopen this issue as it needs some fix apart from globally disabling the cache

/reopen

@google-oss-robot

@elikatsis: Reopened this issue.

In response to this:

I'll reopen this issue as it needs some fix apart from globally disabling the cache

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@skogsbrus

Can we rename this issue? The typo makes it hard to find.

@yuhuishi-convect yuhuishi-convect changed the title VolumnOp was not able to create PVC VolumeOp was not able to create PVC Jun 21, 2021
@pahask8
Contributor

pahask8 commented Jun 24, 2021

Does the following work?

    vop = kfp.dsl.VolumeOp(
        name="volume_creation",
        resource_name=f"{{{{workflow.name}}}}-sharedpvc",
        size="5Gi",
        modes=["RWO"]
    )

This will hopefully change the input to the ResourceOp and hence prevent caching.

Hi, for some reason this workaround does not work for me. KF version: 1.6.0, kfp SDK 1.6.3. I used the same pipeline from the issue description. My PVC was created only once.

Thank you

@mjurkus

mjurkus commented Jul 2, 2021

Facing the same issue. VolumeOp fails to create the PVC, with the log message "This step output is taken from cache."

Tried to change resource_name, but that workaround didn't work (if it worked before).

@Bobgy
Contributor

Bobgy commented Aug 9, 2021

I think KFP v1 caching should not cache volume op / resource op, because the side effect is intended.
/cc @Ark-kun

Contributions to fix this are welcome.
Caching webhook: https://github.com/kubeflow/pipelines/tree/master/backend/src/cache.

Besides that, caching for the KFP v2 compatible mode should no longer cache volume ops. You can consider trying it out when it's released and documented for your environment (currently documented for KFP standalone, but not full Kubeflow).

@munagekar
Contributor

Updated workaround for VolumeOp caching. This seems to work with kfp version 1.7.2.

REF: #4857 (comment)

  test_vop = kfp.dsl.VolumeOp(
    name="volume",
    resource_name="pvc-name",
    modes=['RWO'],
    storage_class="standard",
    size="10Gi"
  ).add_pod_annotation(name="pipelines.kubeflow.org/max_cache_staleness", value="P0D")
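
If every op in a pipeline should skip the cache, the same annotation can presumably be applied once for the whole pipeline with an op transformer instead of per op. This is only a sketch, assuming the v1 dsl.get_pipeline_conf().add_op_transformer API and that transformers are applied to VolumeOp/ResourceOp instances as well as ContainerOps; the op and resource names are illustrative.

    from kfp import dsl

    def disable_caching(op):
        # Annotate every op so the cache webhook never serves its result from cache.
        op.add_pod_annotation(name="pipelines.kubeflow.org/max_cache_staleness", value="P0D")
        return op

    @dsl.pipeline(name="no-cache pipeline")
    def no_cache_pipeline():
        dsl.get_pipeline_conf().add_op_transformer(disable_caching)

        vop = dsl.VolumeOp(
            name="volume",
            resource_name="pvc-name",
            modes=dsl.VOLUME_MODE_RWO,
            storage_class="standard",
            size="10Gi",
        )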

@boarder7395
Contributor

Not sure this is the correct place to bring this up, and I am not familiar with the v2 component configuration yet, but it looks like the mutation webhook in the cache-server backend looks for the key pipelines.kubeflow.org/enable_caching while the Python SDK creates the key pipelines.kubeflow.org/cache_enabled when using set_caching_enabled.

See following files:
https://github.com/kubeflow/pipelines/blob/master/backend/src/cache/server/mutation.go#L36

op.add_pod_label('pipelines.kubeflow.org/enable_caching',

It was also mentioned above to use .add_pod_annotation(name="pipelines.kubeflow.org/max_cache_staleness", value="P0D"), which looks like it should work based on the code in mutation.go. If that's the case, will the base_op function set_caching_enabled be deprecated in the future?

The change to the cache-server to make set_caching_enabled work would be easy. I made the changes and tested them on my fork of this repo. I can open a PR if set_caching_enabled is not going to be deprecated.
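
To make the key mismatch above concrete, here is roughly what the different opt-out attempts put on the pod metadata. A sketch only: the two label keys are the ones quoted in this comment, and which of them the webhook actually honors depends on the cache-server build, whereas the annotation is the form confirmed to work earlier in this thread; op and resource names are illustrative.

    from kfp import dsl

    @dsl.pipeline(name="label vs annotation sketch")
    def label_vs_annotation_pipeline():
        vop = dsl.VolumeOp(
            name="volume",
            resource_name="pvc-name",
            size="5Gi",
            modes=dsl.VOLUME_MODE_RWO,
        )
        # Label key the SDK reportedly writes via set_caching_enabled:
        vop.add_pod_label("pipelines.kubeflow.org/cache_enabled", "false")
        # Label key the mutation webhook reportedly checks:
        vop.add_pod_label("pipelines.kubeflow.org/enable_caching", "false")
        # Annotation-based opt-out confirmed to work earlier in this thread:
        vop.add_pod_annotation(name="pipelines.kubeflow.org/max_cache_staleness", value="P0D")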

@juliusvonkohout
Member

Updated workaround for VolumeOp caching. This seems to work with kfp version 1.7.2.

REF: #4857 (comment)

  test_vop = kfp.dsl.VolumeOp(
    name="volume",
    resource_name="pvc-name",
    modes=['RWO'],
    storage_class="standard",
    size="10Gi"
  ).add_pod_annotation(name="pipelines.kubeflow.org/max_cache_staleness", value="P0D")

Thank you very much, I was experiencing the same issue.

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Mar 2, 2022
@shashisingh

shashisingh commented Jun 22, 2024

What is the solution for this issue in Kubeflow 1.8.0 with version 1.8.22 of the kfp SDK? I want to be able to run multiple instances of a pipeline (these are model training jobs) and make sure persistent volumes are not shared between these jobs. I have tried the following so far:

volumeOp.add_pod_label("pipelines.kubeflow.org/enable_caching", "false") \
       .add_pod_label("pipelines.kubeflow.org/cache_enabled", "false") \
       .set_caching_options(enable_caching=False)

I have also tried to disable caching globally by following
https://www.kubeflow.org/docs/components/pipelines/legacy-v1/overview/caching/#disabling-caching-in-your-kubeflow-pipelines-deployment
but I can still see PVCs being reused across different runs.
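
For what it's worth, combining what is already in this thread: VolumeOp names the PVC {{workflow.name}}-<resource_name> (see the manifest in the issue description), so each run gets its own PVC once the step actually executes; the missing piece is keeping the VolumeOp out of the cache. A rough sketch, not verified against Kubeflow 1.8.0 / kfp 1.8.22; op names and the training command are illustrative.

    from kfp import dsl

    @dsl.pipeline(name="training pipeline with per-run pvc")
    def training_pipeline():
        vop = dsl.VolumeOp(
            name="volume_creation",
            resource_name="sharedpvc",  # created as {{workflow.name}}-sharedpvc, unique per run
            size="5Gi",
            modes=dsl.VOLUME_MODE_RWO,
        ).add_pod_annotation(
            # "P0D" = zero allowed staleness: the cache webhook should never reuse a cached result.
            name="pipelines.kubeflow.org/max_cache_staleness",
            value="P0D",
        )

        train = dsl.ContainerOp(
            name="train",
            image="bash",
            command=["sh", "-c"],
            arguments=["echo training > /data/out.txt"],
        ).add_pvolumes({"/data": vop.volume})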

@stale stale bot removed the lifecycle/stale label Jun 22, 2024
@juliusvonkohout
Member

@shashisingh the solution is in this thread and confirmed by others

/close

@google-oss-robot

@juliusvonkohout: Closing this issue.

In response to this:

@shashisingh the solution is in this thread and confirmed by others

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
