Implement foreach ... in loop in dvc.yaml #4734
This is excellent stuff, @skshetry!! I'll take a look this weekend and give some feedback if I have anything! Should we invite some folks from the tickets related to this, to give it a try as well?
@shcheklein, there are a few bugs that I need to fix. Then I'll write up in that ticket. Thanks.
Just noticed that |
Btw, there's a wiki for this feature, as I have planned: https://github.com/iterative/dvc/wiki/Parametrization
This is great stuff. Waiting for it to be merged!
Woah super interesting stuff. Definitely a common need from what I remember (support/feature requests). Also a lot to document eventually so thanks for maintaining that wiki for now @skshetry! I have a bunch of QA-related thoughts but the only thing that jumps out to me at this stage is about the foreach syntax. I find it a little confusing to call the thing to do ...
for: item # just a name to refer to later
in: ${items} # taken from `var` or params file in this case but could be a hardcoded list/dict
do:
  cmd: # uses ${item}
...
^ This way also the user names the index so that Alternatively, just rename it foreach/do (no "in" needed with "foreach", I think).
OR something like ...
foreach:
  item: # Could also be named by the user with this syntax
    - 2
    - 4
    ...
cmd: # uses ${item}
This feature is cool! Left two comments.
tests/func/test_stage_resolver.py
resolver = DataResolver(dvc, PathInfo(str(tmp_dir)), d)
assert resolver.resolve() == {
    "stages": {
        f"build-{key}": {"cmd": f"python script.py {item}"}
The resolved `cmd` section is `'python script.py {\'us\': {\'thresh\': 10}, \'gb\': {\'thresh\': 15}}'`. Not sure whether this was the anticipated result. I think it might be more explicit if we provided the expected and resolved dictionaries instead of building them from `iterable`.
Yup, that should have been `item.thresh`. Regarding interpolating dict, it will be handled later.
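To illustrate the exchange above, here is a minimal Python sketch of how a `foreach` over a dictionary might expand into `build-{key}` stages, and why interpolating a whole dict puts its repr into the command while `${item.thresh}` yields a scalar. The helpers `lookup`, `interpolate` and `expand_foreach` are hypothetical names made up for this example; this is not DVC's `DataResolver`, only a sketch of the general idea.

```python
import re

# Hypothetical helpers for illustration only; not DVC's actual resolver.
PATTERN = re.compile(r"\$\{([\w.]+)\}")


def lookup(context, dotted):
    """Follow a dotted path such as 'item.thresh' through nested dicts."""
    value = context
    for part in dotted.split("."):
        value = value[part]
    return value


def interpolate(template, context):
    """Replace every ${a.b} in template with str() of the looked-up value."""
    return PATTERN.sub(lambda m: str(lookup(context, m.group(1))), template)


def expand_foreach(name, iterable, cmd_template):
    """Generate one stage per key, named '<name>-<key>'."""
    return {
        f"{name}-{key}": {"cmd": interpolate(cmd_template, {"item": item})}
        for key, item in iterable.items()
    }


models = {"us": {"thresh": 10}, "gb": {"thresh": 15}}

# Interpolating the whole item embeds the dict's repr in the command.
whole = expand_foreach("build", models, "python script.py ${item}")
# Interpolating a leaf gives the intended scalar.
leaf = expand_foreach("build", models, "python script.py ${item.thresh}")
```

With this sketch, `whole["build-us"]["cmd"]` contains the text of the nested dict, whereas `leaf["build-us"]["cmd"]` is a plain `python script.py 10`, which is the kind of command the test presumably intended.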
FOREACH_KWD = "foreach"
IN_KWD = "in"

DEFAULT_SENTINEL = object()
Can't we use `None` instead of `DEFAULT_SENTINEL`? Seems like it's used only to check whether the key has been provided?
Not trying to assume too much about the parametrized data.
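A small Python sketch of the reasoning behind that answer: if user-supplied data may legitimately contain `None`, a unique sentinel object is the only reliable way to tell "no default was passed" apart from "the default is `None`". `DEFAULT_SENTINEL` mirrors the name in the diff, but `get_value` is a hypothetical helper written for this example, not code from the PR.

```python
# Illustrative only: DEFAULT_SENTINEL mirrors the name in the diff, but
# get_value is a hypothetical helper, not DVC's code.
DEFAULT_SENTINEL = object()


def get_value(data, key, default=DEFAULT_SENTINEL):
    """Return data[key]; fall back to default only if one was passed."""
    if key in data:
        return data[key]
    if default is DEFAULT_SENTINEL:
        # No default supplied, so a missing key is an error.
        raise KeyError(f"Could not find '{key}'")
    return default


params = {"dropout": None}  # None is a legitimate user-supplied value

present = get_value(params, "dropout")         # key exists, value is None
fallback = get_value(params, "missing", None)  # caller explicitly allows None
```

Had `None` been used as the marker, the call with an explicit `None` default would be indistinguishable from a call with no default at all.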
QA round 0
Other than my loop syntax suggestions (see #4734 (comment) above), I noticed that the `use` and `vars` sections are already included in this PR (added a mention to the PR desc.), so I tested those first, along with params file support.
Using DVC 1.8.4+8be0d8 at the moment:
- Should `use` be called `params`? It's not clear what we mean by "use", e.g. dvc.yaml pipelines use all sorts of things, like deps and outs.
- `use` accepts nonexistent paths, resulting in silently not loading any params file, even if params.yaml is present. Should it at least warn that the given file doesn't exist and so no params will be available in dvc.yaml as `{$vars}`? Or warn AND load params.yaml (if it exists, and include this note in the warning). (It's nice to have the `use: none` trick actually.)
- If `use` is given a directory path, it prints a generic `unexpected error - [Errno 13] Permission denied: '{path}'`.
- When vars used in dvc.yaml don't exist in the params file or `vars`, you get `ERROR: unexpected error - Could not find '{key}' in {}: '{key}'`. The error could be a little more informative, I think.
- `vars` is supposed to get recursively merged with the params file but I always get an error such as `unexpected error - Cannot overwrite as key {keyname} already exists in {'{keyname}': {value}}` (I'm using the same dictionary in both the params file and in `vars`, but with different values).
  The `{value}` shown comes from params.yaml if no `use` is given, except if I `use: none` (nonexistent file), in which case I still get the error (weird) but the value from `vars` is printed in the error msg.
@jorgeorpinel, thanks for the QA.
It might be difficult to keep backward compatibility and still give a good error message. The behavior definitely needs to be improved/fixed; this will be done in successive PRs.
Same as above.
Same as above.
It recursively merges the dictionary, but if that's the same file, it'll have the same keys at the root level, which is not allowed.
This way, |
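As a rough illustration of the merge behaviour discussed in this reply, the following Python sketch merges nested dictionaries recursively but refuses to replace an existing non-dict value. The function name `merge_no_overwrite` is hypothetical, and the exact rule DVC applies (for instance, which level a conflict is detected at) is an assumption here; only the error wording is borrowed from the QA report.

```python
def merge_no_overwrite(into, update):
    """Recursively merge update into into, refusing to replace values.

    Hypothetical sketch; DVC's real merge rules may differ.
    """
    for key, value in update.items():
        if key not in into:
            into[key] = value
        elif isinstance(into[key], dict) and isinstance(value, dict):
            # Both sides are mappings, so descend instead of overwriting.
            merge_no_overwrite(into[key], value)
        else:
            raise ValueError(
                f"Cannot overwrite as key {key} already exists in {into}"
            )
    return into


# Disjoint leaves merge cleanly.
merged = merge_no_overwrite({"model": {"lr": 0.1}}, {"model": {"epochs": 5}})

# The same leaf with a different value is rejected.
try:
    merge_no_overwrite({"model": {"lr": 0.1}}, {"model": {"lr": 0.5}})
    conflict = None
except ValueError as exc:
    conflict = str(exc)
```

Under these assumptions, defining the same dictionary with different values in both places would always hit the error branch, which matches the symptom reported in the QA round.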
Np. Actually that was just the first round of QA. I have several other things but I didn't want to post everything at once. I see this is merged though, is there a PR that continues it, for the remaining QA rounds? Thanks. For now, some answers to the comments above:
True. Actually there's a better way, since
stages:
  something:
    foreach: ...
    cmd: python script.py --thresh ${...}
Also true. So maybe
OK, what PRs are those? I was asked to QA this but I'm not sure how best to keep track of the issues I find if the PR is merged. The most obvious idea is to open issues for each of my check boxes above but... The point of QA is to avoid issues in the first place.
Got it. So the error message could be a little better and probably overwriting would be valuable. But fine, that's not very important right now. Did you check why the strange logic in the last edge case I mentioned? That behavior seems a little unexpected:
Continued in #4854
This does exactly what I needed and discussed in one of the original feature discussion issues. Thanks everyone!
@jcpsantiago, please note that this is an experimental feature, which is not stable yet and might change. Please do take a look, and tell us what you think.
sure, I'm modifying my pipeline to test it ;) @skshetry there is no mention about the behaviour of first thing that surprised me was:
works, but:
doesn't. Adding options such as |
@jcpsantiago, that should work, as this has nothing to do with
Regarding mentioning them in outputs:
- ${foo}
- ${bar}
But, the following does not work yet:
outputs:
- ${foo}:
    persist: True
@jcpsantiago, the latter case, with keys being interpolated, should work now. PTAL.
@skshetry everything works. I'm close to tears by how beautiful this is. I'm having a very nerdy moment seeing DVC plow through multiple models and then showing me the plots and metrics on the terminal all in one go. Awesome work everyone!
- I have followed the Contributing to DVC checklist.
- If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible.
Here's a wiki documenting the feature: https://github.com/iterative/dvc/wiki/Parametrization#foreach--in.
And, following are the examples for the foreach:
You might notice `set` in the wiki. That is currently not implemented.