dvc: wildcard outputs? #4254

Closed
jorgeorpinel opened this issue Jul 21, 2020 · 5 comments
Labels
awaiting response we are waiting for your reply, please respond! :) discussion requires active participation to reach a conclusion feature request Requesting a new feature

Comments

@jorgeorpinel
Contributor

jorgeorpinel commented Jul 21, 2020

It's a well-known limitation of DVC that two stages (or a stage and a .dvc file, etc.) can't have overlapping output paths (maybe also deps in some cases?). This applies to directories, of course. For example:

$ dvc add data/
$ dvc run -n clean -d data -o data python cleanup.py data
ERROR...

In the case above the dependency and the output are the same, perhaps because there are multiple raw data files in data/ and you don't want to use -d for each one. That may even be impossible if a variable number of raw data files comes from a previous, non-deterministic stage.
Similarly, the output may be hundreds of files (or a non-deterministic, variable number), so you just want to specify the whole directory.
For some external reason, you may need to avoid splitting the raw and clean data into separate directories. We've had support cases like this, e.g. this one.

Solution: Wildcards? E.g.

$ dvc add data/raw*
$ dvc run -n clean -d data/raw* -o data/**/clean* python cleanup.py data
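
To make the proposal concrete, the sketch below shows which files the two patterns would be expected to select, assuming shell-like glob semantics in which `**` matches any number of directory levels. The directory layout and file names are hypothetical, and `find` only stands in for pattern matching that DVC itself does not perform.

```shell
# Hypothetical layout, for illustration only (not from the issue).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir -p data/2020/07
touch data/raw_a.csv data/raw_b.csv data/2020/clean_a.csv data/2020/07/clean_b.csv

# Files that 'data/raw*' would select: names starting with "raw" directly in data/.
raw_files=$(find data -maxdepth 1 -type f -name 'raw*' | sort)

# Files that 'data/**/clean*' would select: names starting with "clean" at any depth under data/.
clean_files=$(find data -type f -name 'clean*' | sort)

echo "raw:";   echo "$raw_files"
echo "clean:"; echo "$clean_files"
```

With such semantics the dependency set and the output set would not overlap, even though both live under data/.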
@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Jul 21, 2020
@jorgeorpinel jorgeorpinel added discussion requires active participation to reach a conclusion feature request Requesting a new feature labels Jul 21, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Jul 21, 2020
@jorgeorpinel
Contributor Author

Cc @dmpetrov

@jorgeorpinel jorgeorpinel changed the title from "wildcard outputs" to "Wildcard outputs_" Jul 21, 2020
@jorgeorpinel jorgeorpinel changed the title from "Wildcard outputs_" to "Wildcard outputs?" Jul 21, 2020
@jorgeorpinel jorgeorpinel changed the title from "Wildcard outputs?" to "dvc: wildcard outputs?" Jul 21, 2020
@efiop
Contributor

efiop commented Jul 21, 2020

$ dvc add data/raw*
$ dvc run -n clean -d data/raw* -o data/**/clean* python cleanup.py data

These commands won't get you what you want. First of all, the wildcards are expanded by your shell and not by dvc. To make your shell pass them to dvc unchanged, you'll need to escape them somehow (e.g. wrap them in single quotes), which makes this quite error-prone.
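
The shell behaviour described here can be checked without DVC. The following is a minimal sketch: `count_args` is a hypothetical helper that stands in for any command receiving the arguments, and the file names are made up for illustration.

```shell
# Hypothetical scratch directory and file names, for illustration only.
tmp=$(mktemp -d)
mkdir -p "$tmp/data"
touch "$tmp/data/raw1.csv" "$tmp/data/raw2.csv"
cd "$tmp" || exit 1

# Stand-in for any command: reports how many arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell expands the glob before the command runs.
unquoted=$(count_args data/raw*)

# Single-quoted: the pattern is passed through as one literal argument.
quoted=$(count_args 'data/raw*')

echo "unquoted: $unquoted argument(s)"
echo "quoted:   $quoted argument(s)"
```

In the unquoted case the command sees one argument per matching file and never sees the pattern. In the quoted case it sees the pattern as a single string, and would have to interpret it itself.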

But this is discussed in #1462, so we can probably close this as a duplicate.

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jul 21, 2020
@jorgeorpinel
Contributor Author

To make your shell pass them to dvc, you'll need to escape it somehow

Sure, it's just a quick example.

But this is discussed in #1462

I see it there indeed, thanks for the reference. It's a big discussion issue from December 2018 though... Probably the best way to make it actionable is to break it into smaller ones? Like this one!

@efiop
Contributor

efiop commented Jul 21, 2020

@jorgeorpinel I wouldn't call this one simple. It is not only about supporting wildcards, but rather about supporting some scenarios that might need wildcards (or there might be a better way to do it). So I would keep the discussion there instead of also having this one.

@jorgeorpinel
Contributor Author

Yeah I didn't say it's simple. And sure, up to you. I already commented over there too. Thanks
