Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repro: support glob/foreach-group to run at once through CLI #4976

Merged

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Nov 26, 2020

This PR adds a way to be able to run the following stage foreach group through a single target build:

stages:
  build:
    foreach:
      - 15
      - 25
      - 35
    do:
      cmd: echo ${item}

Eg:

$ dvc repro build

Also, this PR adds a --glob flag so that you can achieve the same with:

$ dvc repro "build@*" --glob

Also note that, globbing is only allowed for the stage name, not to the file.
So, the following would work:

$ dvc repro "dvc.yaml:build*" --glob

But, the below one won't:

$ dvc repro "**/dvc.yaml:build*" --glob

Regarding the implementation, I have moved all of the target parsing logic inside dvc.repo.stage in a StageLoad class.
So, enabling these features on a checkout or similar commands should be straightforward.

Fixes #4912
Fixes #4886
Fixes #4958

Tests will follow tomorrow.

@skshetry skshetry added enhancement Enhances DVC skip-changelog Skips changelog A: templating Related to the templating feature labels Nov 26, 2020
@skshetry skshetry requested review from pmrowla, efiop and pared November 26, 2020 13:36
@skshetry skshetry self-assigned this Nov 26, 2020
@@ -165,6 +166,7 @@ def __init__(

self.cache = Cache(self)
self.cloud = DataCloud(self)
self.stage = StageLoad(self)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion for better naming would be appreciated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StageResolver? Also, I think that naming the object stage while we have a class of this name will get confusing and some point. Maybe stage_load or stage_resolver will be ok, depending on what name we choose in the end.

Copy link
Member Author

@skshetry skshetry Nov 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pared, will make it StageLoader, and change the other one to MultiStageLoader.

Maybe stage_load or stage_resolver will be ok, depending on what name we choose in the end.

stage_load.load_one() looks a bit odd to me. :) But, I understand your concern.

@@ -133,6 +133,11 @@ def __len__(self):
def __contains__(self, name):
return name in self.resolved_data

def is_foreach_generated(self, name: str):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporary solution for now, will be fixed during optimization.

if not file:
# parsing is ambiguous when it does not have a colon
# or if it's not a dvcfile, as it can be a stage name
# in `dvc.yaml` or, an output in a stage.
logger.debug(
"Checking if stage '%s' is in '%s'", target, PIPELINE_FILE
)
if not (recursive and os.path.isdir(target)):
Copy link
Member Author

@skshetry skshetry Nov 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.path.isdir might have been a bug. Though, we don't collect one single file when using brancher.

@skshetry
Copy link
Member Author

Implemented a real globbing, rather than a regex.

@skshetry skshetry changed the title repro: support regex/foreach-group to run at once through CLI repro: support glob/foreach-group to run at once through CLI Nov 26, 2020
dvc/command/repro.py Outdated Show resolved Hide resolved
@@ -165,6 +166,7 @@ def __init__(

self.cache = Cache(self)
self.cloud = DataCloud(self)
self.stage = StageLoad(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StageResolver? Also, I think that naming the object stage while we have a class of this name will get confusing and some point. Maybe stage_load or stage_resolver will be ok, depending on what name we choose in the end.

@skshetry skshetry merged commit f4f0554 into iterative:master Nov 27, 2020
@skshetry skshetry deleted the foreach-group-repro-and-regex-filter branch November 27, 2020 13:28
@jorgeorpinel
Copy link
Contributor

this PR adds a --glob flag

Nice! But the same option name in add seems to mean something different? see #4864. We should probably rename one or both of them. Why "globbing" BTW? I'm not sure where this term comes from (I guess also the glob Python module) and I'm guessing most users will have no clue. Thanks

@johnnychen94
Copy link
Contributor

Why "globbing" BTW? I'm not sure where this term comes from (I guess also the glob Python module) and I'm guessing most users will have no clue. Thanks

It comes from the UNIX shell glob program and is a part of POSIX specification. https://pubs.opengroup.org/onlinepubs/009696899/functions/glob.html

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Dec 1, 2020

Thanks Johny, good to confirm. My comment is more about the UX. We can't assume users have such a specific UNIX/Linux or Python experience (or a Ph.D 😉).

glob sounds like "global" to me. --wildcard or something like that would seem more intuitive IMO.

@skshetry
Copy link
Member Author

skshetry commented Dec 2, 2020

@jorgeorpinel, our users might be more familiar with glob than someone like us.

@shcheklein
Copy link
Member

I agree with @jorgeorpinel that it's a technical term being used in UI, which is not ideal. You have to explain it in docs using usual, less technical terms - e.g. path matches pattern. Then why not call it --name-pattern, or something like this. Check other alternatives how CLI tools do this (if any). I would though mention that we are going to remove the option I hope? And make it a default behavior? (at least that's the idea for dvc add and other commands), not sure about this case.

off-topic @skshetry it reminds me the yield discussion that we had internally :) - it's good, technically correct, but not not widely known - thus we have do - which it intuitive and friendly. Same here to my mind.

@efiop
Copy link
Contributor

efiop commented Dec 2, 2020

@jorgeorpinel @shcheklein term globbing is more popular. wildcard is just 1 globbing pattern. --name-pattern is just odd. I don't see an alternative to --glob right now. Please don't focus on this too much, as we've discussed earlier, this option might even go away in the future, but even if not - there doesn't seem to be a better alternative anyway.

@shcheklein
Copy link
Member

I didn't know the glob term. Even today I doubt I would even start looking for a dvc stage names glob support. My 2cs - it's fine to keep it if we gonna remove (why don't we remove it for repro right away?). Otherwise I agree with @jorgeorpinel on this - if we can find a human readable option - let's do that.

@efiop
Copy link
Contributor

efiop commented Dec 2, 2020

For the record: We've discussed --glob in #4864 (comment) before. Just keeping it as is until we add support for it to more commands and enable it by default.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Dec 3, 2020

we are going to remove the option I hope? And make it a default behavior?

When will we know? Is there an issue to follow on this? I can create it otherwise. Please lmk

We've discussed --glob in #4864 (comment) before

We have but specifically for add. My first impression here is that it does something quite different for repro (using glob internally, but that's an implementation detail).

term globbing is more popular

This trend suggest otherwise but it's also not definitive proof. In any case it's also about intuitiveness I think.

Please don't focus on this too much, as this option might even go away in the future

OK let's move on! But sometimes I worry that we may underestimate important details that can hinder the product in the long-run, and may be very hard to even detect later.

To me the takeaway is whether we consider UI/UX part of the dev process, or whether it's OK that it's secondary. My bias is towards the former because like Ivan mentioned it's harder to ignore these confusions when writing/reviewing docs anyway. But I realize that we would need a slightly different inter-team workflow (and maybe timelines) so that everyone has time to think and give feedback on these things — not having to discuss after the PR is merged. Food for thought

@johnnychen94
Copy link
Contributor

johnnychen94 commented Dec 3, 2020

I think it should either be --glob or --regex. Even with the word --name-pattern you'll still need to explain what kind of rule it follows, is it glob rule, PCRE regex, or Python regex?

--glob-match, maybe?

@efiop
Copy link
Contributor

efiop commented Dec 3, 2020

@jorgeorpinel I understand your perspective, let's indeed try to sync on this. The issue for --glob is #4816 . We are just waiting for --glob to get supported by more commands and to actually test it better before releasing it into the wild.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: templating Related to the templating feature enhancement Enhances DVC
Projects
None yet
6 participants