Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement foreach ... in loop in dvc.yaml #4734

Merged
merged 2 commits into from
Nov 2, 2020

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Oct 16, 2020

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Here's a wiki documenting the feature: https://github.com/iterative/dvc/wiki/Parametrization#foreach--in.
And, following are the examples for the foreach:

  1. Looping through a list
stages:
  build:
    foreach:
    - foo
    - bar
    - baz
    in:
      cmd: python script.py ${item}
  1. Looping through a dictionary
stages:
  build:
    foreach:
      models:
        us:
          thresh: 10
        gb:
          thresh: 15
    in:
      cmd: python script.py ${item}
  1. Looping throuhg values in vars or params.yaml
stages:
  build:
    foreach: ${models}
    in:
      cmd: python script.py --thresh ${item.thresh}

You might notice set in the wiki. That is currently not implemented.

@skshetry skshetry added feature is a feature skip-changelog Skips changelog labels Oct 16, 2020
@skshetry skshetry self-assigned this Oct 16, 2020
@shcheklein
Copy link
Member

This is excellent stuff, @skshetry !! I'll take a look this weekend and give some feedback if I have anything!

Should we invite some folks from the tickets related to this? To give it a try as well?

@skshetry
Copy link
Member Author

@shcheklein, there are few bugs that I need to fix. Then I'll write up in that ticket. Thanks.

@skshetry skshetry marked this pull request as draft October 19, 2020 14:10
@skshetry
Copy link
Member Author

skshetry commented Oct 19, 2020

Just noticed that set brokes auto-tracking. Keeping it as draft, as ideally this breaks aliasing too. Though, most of the examples will still work (that does not use set).

@skshetry
Copy link
Member Author

Btw, there's a wiki for this feature, as I have planned: https://github.com/iterative/dvc/wiki/Parametrization

@srp-31
Copy link

srp-31 commented Oct 19, 2020

This is great stuff. Waiting for it to be merged!

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Oct 20, 2020

Woah super interesting stuff. Definitely a common need from what I remember (support/feature requests). Also a lot to document eventually so thanks for maintaining that wiki for now @skshetry!

I have a bunch of QA-related thoughts but the only thing that jumps out to me at this stage is about the foreach syntax. I find it a little confusing to call the thing to do in:. How about copying Python's for syntax more or less?

...
for: item  # just a name to refer later
in: ${items}  # taken from `var` or params file in this case but cold be a hardcoded list/dict
do:
  cmd: # uses ${item}
  ...

^ This way also the user names the index so that item is not a special keyword.


Alternatively, just rename it foreach/do (no "in" needed with "foreach", I think).

@jorgeorpinel
Copy link
Contributor

OR something like

...
foreach:
  item:  # Could also be named by the user with this syntax
    - 2
    - 4
    ...
  cmd: # uses ${item}

@skshetry skshetry force-pushed the params-parametrization branch from add12f9 to 6c6af86 Compare October 20, 2020 16:37
@skshetry skshetry changed the title Support parametrizing from params/vars Implement foreach ... in loop in dvc.yaml Oct 20, 2020
@skshetry skshetry marked this pull request as ready for review October 20, 2020 16:44
@skshetry skshetry force-pushed the params-parametrization branch from 6c6af86 to 8be0d8f Compare October 21, 2020 00:34
@skshetry skshetry requested review from efiop, pared and pmrowla October 21, 2020 00:34
@skshetry skshetry mentioned this pull request Oct 21, 2020
2 tasks
Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature is cool! Left two comments.

resolver = DataResolver(dvc, PathInfo(str(tmp_dir)), d)
assert resolver.resolve() == {
"stages": {
f"build-{key}": {"cmd": f"python script.py {item}"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resolved cmd section is
'python script.py {\'us\': {\'thresh\': 10}, \'gb\': {\'thresh\': 15}}'. Not sure whether this was the anticipated result. I think It might be more explicit if we would provide expected and resolved dictionary instead of building it from iterable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that should have been item.thresh. Regarding interpolating dict, it will be handled later.

FOREACH_KWD = "foreach"
IN_KWD = "in"

DEFAULT_SENTINEL = object()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use None instead of DEFAULT_SENTINEL? Seems like it's used only to check whether the key has been provided?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not trying to assume to much about the parametrized data.

Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA round 0

Other than my loop syntax suggestions (see #4734 (comment) above), I noticed that the use and vars sections are already included in this PR (added a mention to the PR desc.) so I tested that first, along with params file support:

Using DVC 1.8.4+8be0d8 at the moment

  • Should use be called params? It's not clear what we mean by "use" e.g. dvc.yaml pipelines use all sort of things, like deps and outs.
  • use accepts inexistent paths, resulting in silently not loading any params file, even if params.yaml is present. Should it at least warn that the given file doesn't exist and so no params will be available in dvc.yaml as {$vars}? Or warn AND load params.yaml (if it exists, and include this note in the warning). (It's nice to have the use: none trick actually.)
  • If use is given a directory path, it prints a generic unexpected error - [Errno 13] Permission denied: '{path}'.
  • When vars used in dvc.yaml don't exist in params file or vars, you get ERROR: unexpected error - Could not find '{key}' in {}: '{key}' β€” the error could be a little more informative, I think.
  • vars is supposed to get recursively merged with the params file but I always get an error such as unexpected error - Cannot overwrite as key {keyname} already exists in {'{keyname}': {value}} (I'm using the same dictionary in both the params file and in vars, but with different values).
    The {value} shown comes from params.yaml if no use is given, except if I use: none (inexistent file) in which case I still get the error (weird) but the value from vars is printed in the error msg.

@skshetry skshetry force-pushed the params-parametrization branch from 8be0d8f to 0aeba39 Compare November 2, 2020 12:44
@skshetry
Copy link
Member Author

skshetry commented Nov 2, 2020

@jorgeorpinel, thanks for the QA.

Should use be called params? It's not clear what we mean by "use" e.g. dvc.yaml pipelines use all sort of things, like deps and outs.

params already has a different format/schema in stages, which might be confusing. But, maybe, it's not?
Let see, I'll focus on trying to unify them by some means (but so far I am not seeing it clearly).

use accepts inexistent paths, resulting in silently not loading any params file, even if params.yaml is present. Should it at least warn that the given file doesn't exist and so no params will be available in dvc.yaml as {$vars}? Or warn AND load params.yaml (if it exists, and include this note in the warning). (It's nice to have the use: none trick actually.)

It might be difficult to keep backward compatibility and try to give good error message. The behavior definitely needs to be improved/fixed, will be fixed in successive PRs.

If use is given a directory path, it prints a generic unexpected error - [Errno 13] Permission denied: '{path}'.

Same as above.

When vars used in dvc.yaml don't exist in params file or vars, you get ERROR: unexpected error - Could not find '{key}' in {}: '{key}' β€” the error could be a little more informative, I think.

Same as above.

vars is supposed to get recursively merged with the params file but I always get an error such as unexpected error - Cannot overwrite as key {keyname} already exists in {'{keyname}': {value}} (I'm using the same dictionary in both the params file and in vars, but with different values).

It recursively merges the dictionary, but if that's the same file, it'll have the same keys at the root level, which is not allowed.
So, we do merge things, but does not allow it to be overwritten.

OR something like

...
foreach:
  item:  # Could also be named by the user with this syntax
    - 2
    - 4
    ...
  cmd: # uses ${item}

This way, item will be a reserved keyword. Though, I'll consider it (syntax is of course not set in stone).

@skshetry skshetry merged commit e211423 into iterative:master Nov 2, 2020
@skshetry skshetry deleted the params-parametrization branch November 2, 2020 13:11
@jorgeorpinel
Copy link
Contributor

Np. Actually that was just the first round of QA. I have several other things but I didn't want to post everything at once. I see this is merged though, is there a PR that continues it, for the remaining QA rounds? Thanks.

For now, some answers to the comments above:

This way, item will be a reserved keyword

True. Actually there's a better way, since in: is always followed by cmd: what about just skip in and go straight to cmd? That addresses my concern around the semantics of "foreach ... in ...". Like this:

stages:
  something:
    foreach: ...
    cmd: python script.py --thresh ${...}

params already has a different format/schema in stages,

Also true. So maybe params-file, import, or something else? My main suggestion is that uses is not clear (again, semantics). It's not a major thing but why not improve that if we can.

use accepts inexistent paths, resulting in silently not loading any params file, even if params.yaml is present
If use is given a directory path, it prints a generic unexpected error

The behavior definitely needs to be improved/fixed, will be fixed in successive PRs.

OK, what PRs are those? I was asked QA this but I'm not sure how best to keep track of the issues I find if the PR is merged. The most obvious idea is to open issues for each of my check boxes above but... The point of QA is to avoid issues in the first place πŸ™‚

we do merge things, but does not allow it to be overwritten

Got it. So the error message could be a little better and probably overwriting would be valuable. But fine, that's not very important right now. Did you check why the strange logic in the last edge case I mentioned? That behavior seems a little unexpected:

The {value} shown comes from params.yaml if no use is given, except if I use: none (inexistent file) in which case I still get the error (weird) but the value from vars is printed

@jorgeorpinel
Copy link
Contributor

Continued in #4854

@jcpsantiago
Copy link

This does exactly what I needed and discussed in one of the original feature discussion issues. Thanks everyone πŸ™ πŸ”₯

@skshetry
Copy link
Member Author

skshetry commented Nov 9, 2020

@jcpsantiago, please note that this is an experimental feature, which is not stable yet and might change. Please do take a look, and tell us what you think. πŸ™‚

@jcpsantiago
Copy link

jcpsantiago commented Nov 9, 2020

sure, I'm modifying my pipeline to test it ;) @skshetry there is no mention about the behaviour of dvc metrics and dvc plots in the wiki. Are these commands already taking into account multiple metrics/plot files? They're not working yet, couldn't use template in metrics: or plots:

first thing that surprised me was:

cmd: python script.py ${models.some_arg}

works, but:

outs:

  • outputs/${models.another_arg}/mymodel.xgb

doesn't. Adding options such as persist: true is not working for me either even using a simplified ${models.another_arg} like in the wiki (works fine without persist option)

@skshetry
Copy link
Member Author

there is no mention about the behaviour of dvc metrics and dvc plots in the wiki.

@jcpsantiago, that should work, as this has nothing to do with dvc metrics and dvc plots.

Regarding mentioning them in dvc.yaml it should be working for metrics and plots as well. I guess the think that's not working is, interpolating stuffs for a key of the dictionary.
So, the following should work:

outputs: 
  - ${foo}
  - ${bar}

But, the following does not work yet:

outputs:
  - ${foo}:
      persist: True

@skshetry
Copy link
Member Author

@jcpsantiago, The latter case with keys being interpolated, should work now. PTAL.

@jcpsantiago
Copy link

@skshetry everything works. I'm close to tears by how beautiful this is 😭 I'm having a very nerdy moment seeing DVC plow through multiple models and then showing me the plots and metrics on the terminal all in one go. Awesome work everyone!

@jcpsantiago jcpsantiago mentioned this pull request Nov 25, 2020
11 tasks
@skshetry skshetry added the A: templating Related to the templating feature label Dec 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: templating Related to the templating feature feature is a feature skip-changelog Skips changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants