Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Future dvc.yaml 2.0 improvements #5312

Closed
Tracked by #5367
jorgeorpinel opened this issue Jan 22, 2021 · 12 comments
Closed
Tracked by #5367

Future dvc.yaml 2.0 improvements #5312

jorgeorpinel opened this issue Jan 22, 2021 · 12 comments
Labels
A: templating Related to the templating feature discussion requires active participation to reach a conclusion enhancement Enhances DVC product: VSCode Integration with VSCode extension

Comments

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 22, 2021

UPDATE: Jump to #5312 (comment) for remaining discussions

Consolidates #5164, #5165, and #5180.
For context, see existing docs in https://dvc.org/doc/user-guide/dvc-files/advanced-dvc-yaml.

Some remaining concerns in order or relevance (only 1 and 2.1 seem more or less worth considering in the immediate term).

The min. purpose of this issue is to at least decide which, if any, to address, and that we can do it in the future without breaking backward-compatibility.

1. Behavior: default params file & merging of values

We decided to always include * from params.yaml in vars. Trying to redefine values (via other file includes or write-in vars) fails except if they're objects that can be merged (no leaf node conflicts).

What about reconsidering the "merging" of vals/objects? Instead we could rename tree paths so there's no possibility of conflicts (@shcheklein's idea). E.g.:

# params.yaml
foo: "foo"
// params.json
{"foo": "bar"}
vars:
- foo # use as ${foo}, ${params.yaml.foo}, or ${paramsyaml.foo}
- params.json:foo # use as ${params.json.foo} or ${paramsjson.foo}
vars:
- foo # use as ${params.yaml:foo} or ${paramsyaml.foo}
- params.json:foo # use as ${params.json.foo} or ${paramsjson.foo}
- foo: "baz" # use as ${foo} or ${globals.foo}

2. Syntax (minor)

Is it worth renaming certain keywords for accuracy? The suggestions below come from the terminology that we ended up using (so far) in docs:

2.1 vars: aren't really variables (not using that term in docs). How about include: or load: to invoke an action; values: (we use that term a lot in docs),const:, or globals/locals: for a descriptive term?

The most accurate term would be params:, but that is taken for the local scope... I wonder if we could overload the section for both things (when it assigns a value it's a SET, then it just lists the key it's a GET).

2.2 The $ sign in the ${} expression makes it extra tricky to use in cmd (at least on Linux). I understand this was discussed already so no strong opinion, but from our docs-related research, the most common syntax for this is {{ }}. Of course {} (or any other brackets) can also be problematic, so users will need to worry about escaping anyway...
UPDATE: {} has a meaning in YAML so we can't use that anyway.

2.3 foreach: "for each {items} do {stage details}" is a great construct. My only concern is that the given order of the items is not respected (according to #5181 (comment)) so the "loop" analogy isn't that precise. My only alternative ideas are items:, set:, multi: (we currently use term "mutli-stage" in docs).

2.3.2 do: If we keep foreach, maybe gen/yield: could a) be slightly more accurate and b) hint that you're not using a typical imperative language loop.
If we depart from the loop analogy (use set/multi/expand:) do could just stay, or just be skipped to keep the YAML structure a bit shorter.

A possible recombination:

include:
- foo
- params.json:foo,bar
- foo: "value"

stages:
  mystage:
    wdir: {{foo}}
    include:
      - qux
      - params.toml:quux
      - quuz: "eulav"
    cmd: echo {{qux}} {{globals.foo}}
    deps:
      - {{qux}}
      - {{globals.foo}}
  mystages:
    items:
      - {{paramsjson.foo}}
      - {{bar}}
    cmd: echo {{item}}

3. Current/Known limitations (future)

3.1 It's not possible to put vars (or any other field for that matter) before foreach. dvc repro gives the following error msg: format error: extra keys not allowed @ data['stages']['mystage']['foreach']
Perhaps we can improve the message so users get that they should either use foreach/do OR a regular stage structure.

Depending on 2.3.2 this may not be relevant in the future.

3.2. wdir can't use ${values} from local write-in vars (because wdir is evaluated first, needed for file-based local vars)? But if we address 1 (no merging of objects) then maybe this can be implemented?

3.3 dvc run doesn't pre-process the commands sent to it (not compatible with vars). Should it? Probably not (stating here mainly for the record).

@jorgeorpinel jorgeorpinel added enhancement Enhances DVC product: VSCode Integration with VSCode extension labels Jan 22, 2021
@jorgeorpinel jorgeorpinel changed the title QA: dvc2.0 dvc.yaml QA: dvc.yaml 2.0 Jan 22, 2021
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Jan 22, 2021

Cc @dmpetrov @shcheklein @skshetry @dberenbaum @efiop 👋 (anyone else welcome of course)

@skshetry
Copy link
Member

The proposed syntax {{ }} needs to be quoted, as it is considered a mapping because of {} syntax, and is invalid. I did propose ${{ }} which is similar to {{ }} but does not require quoting. It was rejected because of the simplicity of ${}.

The $ sign in the ${} expression makes it extra tricky to use in cmd

You can escape with \${} (eg: \${TMPDIR}). This does mean that existing usages of ${} in 1.0 dvc.yaml file is now incompatible, and has to be updated to escape them manually in 2.0. This also affects other commands like metrics/params/plots diff and push/pull/fetch/gc --all-commits.

For other concerns, I think I have already replied to them already.

@skshetry
Copy link
Member

hint that you're not using a typical imperative language loop.

This is intentional, suggested by @dmpetrov. The reasoning was that our users will be more familiar with foreach loops than other functional/declarative "wordings" such as gen/yield/map, etc.

@dberenbaum
Copy link
Collaborator

This summary helps me a lot, @jorgeorpinel !

1. What about allowing users to choose either approach? If users want to be concise and are familiar with the merge resolution order, they can use plain foo (as long as the resolution orderj is clear in the docs). If they prefer to be explicit or want multiple values for the same key, they can use globals.foo, params.yaml.foo, etc.

2.3.2 I'm confused about whether the do command is necessary. In https://dvc.org/doc/user-guide/dvc-files/advanced-dvc-yaml, the example for using complex values in foreach actually omits do. Can there be anything under do besides cmd: ...?

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Jan 22, 2021

For other concerns, I think I have already replied to them already.

@skshetry concerns that are (partially) new here, or didn't have a reply from you in previous issues: 1 (and thus 3.2), 2.1 specific suggestions (see grayed note too), and 3.1 (improve error msg)

@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Jan 23, 2021

allowing users to choose either approach

@dberenbaum correct, that is the idea! The full/explicit tree paths would be optional generally, but if there are conflicting vars names, then you should be mindful of which one takes the default position (we would doc that).

In https://dvc.org/doc/user-guide/dvc-files/advanced-dvc-yaml, the example for using complex values in foreach actually omits do.

Oops, that's an error in docs (will fix in iterative/dvc.org/pull/2098). But yes, that's an alternative syntax I'm suggesting:

image

TBH I don't have a strong opinion on this (as long as we're keeping foreach, see 2.3), and since it's already implemented with do, we'd prob. need a good reason to change it. But WDYT?

Can there be anything under do besides cmd: ...?

Yes, everything that usually goes into single stages goes under do in foreach-stages. I'll try to clarify that in docs too.

jorgeorpinel added a commit to iterative/dvc.org that referenced this issue Jan 23, 2021
@skshetry
Copy link
Member

skshetry commented Jan 25, 2021

1 (and thus 3.2), 2.1 specific suggestions (see grayed note too), and 3.1 (improve error msg)

Merging is a feature. It's based on a scenario that @dmpetrov provided, that the user wants to build 3 models for 3 different markets and they share the basic configuration but are trained with market-dependent parameters additionally.

eg:

# params.yaml
train:
  epochs: 10

# models/us/params.yaml (for example)
train:
  threshold: 2

# dvc.yaml
stages:
  train-model:
    foreach: [us, gb, cn]
    do:
      wdir: models/${item}
      vars:
       - params.yaml  # from `models/${item}/params.yaml`
       # this would have been easier if `vars` supported parametrization
      cmd: python train.py --threshold ${train.threshold} --epochs ${train.epochs}
      outs:
      - model-${item}

It's a bit complicated example though. (3.2) is based on the order of evaluation.

(2.1) - We started with imports as that was the terminology that we were using, but it clashes with every other concepts: import library, dvc import, etc and was generic. Then we moved on to use/vars/params. But the use was confused with changing the defaults of params. After that we did discuss about renaming use/vars to be params at the top-level, but during the discussion, we noticed that they weren't params (our ordinary ones) at all, they are variables. This was the naming that was suggested in the original issue itself, and it has close connection with templating/parametrization too.
I am open to other namings, but @jorgeorpinel's suggested ones are too generic IMO. But, vars sounds like a good compromise to me.

(3.1) - Not a bug, just an enhancement. It's working as intended.

skshetry added a commit to skshetry/dvc that referenced this issue Feb 1, 2021
This was loosened when introducing parametrization.
But, `foreach`..`do` should be considered first
before stage's regular structure. And, cmd is made
required (as it should have been)

Related: iterative#5371, iterative#5370, iterative#5312
skshetry added a commit that referenced this issue Feb 1, 2021
* don't allow loading stages without `cmd`

This was loosened when introducing parametrization.
But, `foreach`..`do` should be considered first
before stage's regular structure. And, cmd is made
required (as it should have been)

Related: #5371, #5370, #5312

* fix schema
@skshetry skshetry added the A: templating Related to the templating feature label Feb 2, 2021
@skshetry skshetry mentioned this issue Feb 2, 2021
31 tasks
@jorgeorpinel
Copy link
Contributor Author

jorgeorpinel commented Feb 5, 2021

OKdk. Thanks for the explanations and sorry about the delay. Some final notes from me:

we noticed that they weren't params

But the feature was called parameterization (now "templating" though).

they are variables

I thought we agreed that vars: are not really variables (more like constants or set values). But OK, whatever users can relate to should work (I personally like values: given the language we ended up using in the docs).

(3.1) - Not a bug, just an enhancement

I didn't say it was a bug. I said the message may not be helpful. And yep, this issue is labeled enhancement.

@shcheklein
Copy link
Member

But the feature was called parameterization (now "templating" though).

I would add that it's not set in stone :) I would still love to see what is the best for users in this case, how do people call it. The only concern we had is the overlap with dvc params. But may be it's not that important.

@jcpsantiago
Copy link

FWIW we've been using this pretty much since it's available as a beta feature and everyone very naturally called "parameterized dvc pipeline".

@jorgeorpinel
Copy link
Contributor Author

Another user ref. on terminology - they call it "multistage feature" in https://discuss.dvc.org/t/versioning-predictions/656 (even when we're switched to "foreach stages" in the docs already — maybe they read that doc before the change).

@jorgeorpinel jorgeorpinel changed the title QA: dvc.yaml 2.0 Future dvc.yaml 2.0 tweaks? Feb 10, 2021
@jorgeorpinel jorgeorpinel added the discussion requires active participation to reach a conclusion label Feb 10, 2021
@jorgeorpinel jorgeorpinel changed the title Future dvc.yaml 2.0 tweaks? Future dvc.yaml 2.0 improvements Feb 10, 2021
@dberenbaum
Copy link
Collaborator

Closing. We can reopen a new issue with any of the remaining questions if they come up again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: templating Related to the templating feature discussion requires active participation to reach a conclusion enhancement Enhances DVC product: VSCode Integration with VSCode extension
Projects
None yet
Development

No branches or pull requests

5 participants