
[backend] Cached volume creation leads to unschedulable runs if the PVC is deleted in the pipeline #5844

Closed
skogsbrus opened this issue Jun 11, 2021 · 1 comment

Comments


skogsbrus commented Jun 11, 2021

Environment

  • How did you deploy Kubeflow Pipelines (KFP)? KF 1.3 Manifests release, on premises
  • KFP version: 1.5.0
  • KFP SDK version:
$ pip list | grep kfp
kfp                      1.6.2    
kfp-pipeline-spec        0.1.7    
kfp-server-api           1.5.0  

Steps to reproduce

  1. Define a pipeline that creates a volume, mounts it, and then destroys it:
import kfp
from kfp import dsl

def glob_volume(volume):
    @kfp.components.create_component_from_func
    def glob_files(directory: str) -> list:
        import pathlib
        paths = pathlib.Path(directory).glob('**/*')
        return [str(p) for p in paths if p.is_file()]

    # Mount the volume into the component so the step depends on it.
    return glob_files("/volume").add_pvolumes({"/volume": volume})

@dsl.pipeline(name="vol-cache-bug", description="Demonstrates a caching bug in KF 1.3")
def volume_caching_bug():
    # Create a PVC, use it in one step, then delete it at the end of the run.
    vop = dsl.VolumeOp(name='create-a-volume', resource_name='a-volume',
                       size="1Gi", modes=dsl.VOLUME_MODE_RWO)
    glob_vol_op = glob_volume(vop.volume)
    vop.delete().after(glob_vol_op)

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(volume_caching_bug, "volume_caching_bug.yaml")
  2. Create a pipeline and an experiment in the UI from the above sample
  3. Launch a run and let it complete
  4. (Delete mounting pods if the PVC is stuck at Terminating)
  5. Clone the run
  6. The first step is cached and never run
  7. The second step is never scheduled, because the volume was deleted in the previous run and is never created in this one

Expected result

I expected volume ops not to be cached at all: if the PVC already exists it should be reused, and if it doesn't it should be created. Alternatively, VolumeOp could support an equivalent of .execution_options.caching_strategy.max_cache_staleness = "P0D".
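For comparison, function-based components do expose that knob in the v1 SDK. A minimal sketch (the component here is illustrative, not from the pipeline above); VolumeOp has no equivalent:

import kfp
from kfp import dsl

@kfp.components.create_component_from_func
def hello() -> str:
    return "hello"

@dsl.pipeline(name="no-cache-example")
def no_cache_pipeline():
    op = hello()
    # "P0D" is an ISO-8601 zero-length duration: tolerate zero staleness,
    # i.e. never reuse a cached result for this step.
    op.execution_options.caching_strategy.max_cache_staleness = "P0D"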

FWIW: I recently upgraded from KF 1.1 to KF 1.3 and did not experience this on 1.1.

Workarounds

  1. Rename the volume creation step so it no longer hits the cache (see the sketch below). This needs to be done after each run, i.e. upload a new pipeline version before issuing a new run.
  2. Disable caching on the whole KF instance.
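A minimal sketch of workaround 1, assuming that renaming the step is enough to change its cache key (the random suffix and helper name below are illustrative, not part of the original pipeline):

import uuid
from kfp import dsl

def make_fresh_volume_op():
    # A fresh name on every compilation means the cache server should never
    # find a matching earlier execution for this step (assumption: the
    # template name feeds into the cache key).
    suffix = uuid.uuid4().hex[:8]
    return dsl.VolumeOp(name=f'create-a-volume-{suffix}',
                        resource_name='a-volume',
                        size="1Gi",
                        modes=dsl.VOLUME_MODE_RWO)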

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@elikatsis (Member)

Hi there,

this is a duplicate of #5257. Should we close this in favor of the other one?

Copying my comment here as well, for the sake of completeness of this issue:

I think the problem here is that

  1. the mechanism is caching steps that it shouldn't
  2. there is no "user" selection of whether or not to cache a specific step, only global API server configuration. The API server overrides any per-template setting:

    workflow.SetLabelsToAllTemplates(util.LabelKeyCacheEnabled, common.IsCacheEnabled())
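To illustrate point 2, here is what a per-step opt-out might look like from the SDK, assuming util.LabelKeyCacheEnabled resolves to the pipelines.kubeflow.org/cache_enabled pod label (an assumption on my part); the server-side call above would overwrite it anyway:

# Hypothetical per-step opt-out; the label key is an assumption based on
# util.LabelKeyCacheEnabled. Because the API server relabels every template
# with its global setting, this has no effect in KFP 1.5.
glob_vol_op.add_pod_label('pipelines.kubeflow.org/cache_enabled', 'false')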
