Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stage-level vars alternative in DVC 3.X #9639

Closed
observingClouds opened this issue Jun 21, 2023 · 15 comments · Fixed by #9647
Closed

Stage-level vars alternative in DVC 3.X #9639

observingClouds opened this issue Jun 21, 2023 · 15 comments · Fixed by #9647

Comments

@observingClouds
Copy link

Bug Report

Description

I updated DVC from 2.58.2 to 3.0.0 and now get an error when trying to reproduce previous stages.

'./dvc.yaml' validation failed.                                       

extra keys not allowed, in stages -> process -> vars, line 6, column 5
  5 │   vars:                                                                                                                                                                        
  6 │   - config.yaml

The entire content of dvc.yaml is:

stages:
  process:
    cmd: python apply.py
    wdir: ./
    vars:
    - config.yaml
    deps:
    - ${datastore}/data/MODIS/

Issue #9630 focus on this kind of break with respect to error handling. Here, I'm however interested in how I have to rewrite my dvc.yaml file to overcome this issue and retrieve the expected results. I appreciate your help.

Reproduce

git init
dvc init
vi dvc.yaml # paste above mentioned content
dvc repro process

Expected

I expected compatibility with DVC 3.X. However, I understand that breaking changes can occur for a major release.

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 3.0.0 (conda)
--------------------------
Platform: Python 3.10.11 on Linux-5.15.0-1038-azure-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 1.11.0
        dvc_objects = 0.23.0
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.0.4
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8)
Config:
        Global: /home/azureuser/.config/dvc
        System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: webdavs
Workspace directory: ext4 on /dev/sdb1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/64edfdfad8786fbbe7d0c2972fba2dba

Additional Information (if any):

@observingClouds
Copy link
Author

Okay, I just realise that stage-level support for vars has been dropped (#9556).
So, how can I migrate the following dvc.yaml, where a variable (here datastore) with the same name exist in several files? E.g.

stages:
  process_setup1:
    cmd: python apply.py
    wdir: ./
    vars:
    - config_setup1.yaml
    deps:
    - ${datastore}
  process_setup2:
    cmd: python apply.py
    wdir: ./
    vars:
    - config_setup2.yaml
    deps:
    - ${datastore}

@observingClouds observingClouds changed the title Make dvc.yaml compatible with DVC 3.X Stage-level vars alternative in DVC 3.X Jun 21, 2023
@skshetry
Copy link
Member

skshetry commented Jun 21, 2023

Hi, you can move the vars to the top-level, which will make it available for all stages.

vars:
- config_setup1.yaml
stages:
  process_setup1:
    cmd: python apply.py
    wdir: ./
    deps:
    - ${datastore}
  process_setup2:
    cmd: python apply.py
    wdir: ./
    deps:
    - ${datastore}

@observingClouds
Copy link
Author

Thanks @skshetry for your quick response. However, the variable datastore in my example has two different values in config_setup1.yaml and config_setup2.yaml. How can I reference a specific one? Does a syntax similar to ${config_setup2.yaml:datastore} work?

@skshetry
Copy link
Member

skshetry commented Jun 21, 2023

I think there are three ways:

  1. you can keep config in one location.
# config_setup.yaml
setup1:
   datastore: ...
setup2:
   datastore: ...
# dvc.yaml
vars:
- config_setup.yaml
stages:
  process_setup1:
     deps: ${setup1.datastore}
  1. You can also split the config-setup.yml into two files, and specify vars as [config_setup1.yaml, config_setup2.yaml]. dvc will merge them together like in 1).
# config_setup1.yaml
setup1:
  datastore: ...

# config_setup2.yaml
setup2:
  datastore: ...
  1. Also, you can just rename datastore into datastore1 and datastore2.

@observingClouds
Copy link
Author

Thank you for showing these other options. To my feeling the stage-level vars option was however much cleaner for my use-case and I would love to see this feature back in DVC.

I use for example a template yaml file that contains all necessary information to apply a trained neural network to a specific image dataset. For each image dataset I apply the neural network on, I create a new configuration file that contains for example the source-path. I like to think of these configurations as separate projects that share a common code base, but should be independent.

Having several variables with the same meaning (e.g. datastore1, datastore2) is therefore not my preferred solution and having one large config file with repetitive patterns that can at times be troublesome to diff neither.

Any chance this could be added again?

@skshetry
Copy link
Member

yes, we are discussing that in #9630. Will let you know if we decide on anything. 🙂

@dberenbaum
Copy link
Collaborator

@observingClouds @sjawhar As you can see, we have reverted dropping stage-level vars for now, but we are still discussing whether to move towards dropping it in the future with more warning. It would be great to get your thoughts here and brainstorm whether there are better alternatives we could develop.

@sjawhar
Copy link
Contributor

sjawhar commented Jun 22, 2023

I think all my use-cases of stage-level vars stem from not having support for parameter grids beyond a single foreach. I won't pretend to represent every use case (I reviewed @observingClouds 's post above and didn't quite understand all of @skshetry 's suggestions), but I think that implementing matrix would cover my needs (and probably many others').

@skshetry
Copy link
Member

I think all my use-cases of stage-level vars stem from not having support for parameter grids beyond a single foreach. I won't pretend to represent every use case (I reviewed @observingClouds 's post above and didn't quite understand all of @skshetry 's suggestions), but I think that implementing matrix would cover my needs (and probably many others').

We created vars for a scenario, for example, where you train multiple models, but they have slightly different additional parameters. But I haven't seen anyone using that, except you but as you have said, it's mostly as a replacement for matrix.

stages;
  train_model:
    foreach: [model1, model2]
    do:
       wdir: ${model}
       vars: params.yaml
       cmd: ./script ${thresh}

Here, wdir for stage is the key (model1/model2), and vars is loaded from ${wdir}/params.yaml. This is confusing because it is so dependent on wdir to load from.
Which is why I wanted to remove it (also has performance issues associated with it in the current implementation).

While above @observingClouds use leads to a cleaner definition, we did not create vars to have a local scoping just for sake of having a local scope (as in, it was a means to do something). My suggestion was to combine all parameters into one single file for easier maintainability. Or, separate them in small files but still declare them in global scope.
DVC is not a programming language, and even though we have a few constructs like foreach, i don't think we should become one.

@sjawhar
Copy link
Contributor

sjawhar commented Jun 23, 2023

DVC is not a programming language, and even though we have a few constructs like foreach, i don't think we should become one.

Absolutely, 100% agree 👍 The simplicity of a static YAML file is one of my favorite things about DVC, and my usage of YAML anchors feels much more like a hack than something clear and explicit like matrix would be.

@observingClouds
Copy link
Author

Is there a discussion/documentation about the planned matrix feature? Happy to look at it and see if this would also capture my use-case.

@skshetry
Copy link
Member

@observingClouds, yes, please take a look at #5172.

@observingClouds
Copy link
Author

Would it maybe easier to support the global vars that can be referenced with a syntax like ${config_setup1.yaml:datastore}. This would allow to define the same variable in different files with different values.

@skshetry
Copy link
Member

We discussed this previously in #3633. We decided against it because it gets ugly quite fast, as file paths tend to get longer and you have to repeat the file name each time (going against why we introduced parametrization in the first place: to reduce duplication).

But I don’t think I have any strong opposition to it, but at the same time, I’d prefer to keep things simple.

Still, I’ll encourage you create a feature request explaining your scenario.

@skshetry
Copy link
Member

(But if you have to use it sparingly, it’s a good solution, it probably should not be the default recommendation)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants