
dvc: consider introducing build matrix #1018

Closed
efiop opened this issue Aug 14, 2018 · 15 comments
Labels
enhancement Enhances DVC p3-nice-to-have It should be done this or next sprint
Comments

@efiop
Contributor

efiop commented Aug 14, 2018

#973 (comment)

I.e. something like:

matrix:
  include:
    - workdir: runs/gs1
    - workdir: runs/gs2
cmd: process.py input output
deps:
  - path: input
outs:
  - path: output
    cache: True
@efiop
Contributor Author

efiop commented Nov 15, 2018

Also maybe something like:

cmd: mycmd input output $PARAMS
matrix:
   - name: experiment1
     params: --option 1
   - name: experiment2
     params: --option 2

it will produce output.experiment1, output.experiment2 and so on for the stages down the pipeline.
So basically, output files down the pipeline would get suffixes corresponding to the experiment that produced them. Or maybe, instead of suffixes, directories would be created automatically to store those outputs for each experiment.
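Purely as an illustration (this is not DVC's actual API — the matrix structure, the `expand` helper, and the suffix scheme are all hypothetical), the expansion described above could look like:

```python
# Hypothetical sketch: expand a matrix spec into one concrete
# command per experiment, suffixing the output with the name.
matrix = [
    {"name": "experiment1", "params": "--option 1"},
    {"name": "experiment2", "params": "--option 2"},
]

cmd_template = "mycmd input output.{name} {params}"

def expand(matrix, template):
    """Return (output, command) pairs, one per matrix entry."""
    stages = []
    for entry in matrix:
        output = "output." + entry["name"]
        command = template.format(name=entry["name"], params=entry["params"])
        stages.append((output, command))
    return stages

for output, command in expand(matrix, cmd_template):
    print(output, "<-", command)
```

Stages further down the pipeline would then depend on `output.experiment1`, `output.experiment2`, and so on.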

@prihoda
Contributor

prihoda commented Nov 22, 2018

If I understand it correctly, this can already be handled by outputting a directory, using a command that contains a for loop, right?

Something like this:

mkdir output; for i in {0..100}; do mycmd input/gs${i}/options.json output/gs${i}; done

This approach also makes it possible to run all tasks in parallel, if you are able to submit asynchronously and wait for all tasks to finish:

dvc run -d input -o output 'mkdir output; for i in {1..100}; do mycmd input/gs${i}/options.json output/gs${i} & done; wait_for_results gs{1..100}'

# Formatted script:
mkdir output; 
for i in {1..100}; do 
    mycmd input/gs${i}/options.json output/gs${i} &
done; 
wait_for_results gs{1..100}
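Note that `wait_for_results` above is a placeholder; with plain bash, the built-in `wait` can play that role. A minimal self-contained sketch (with `sleep` standing in for `mycmd`, and only three iterations so it runs quickly):

```shell
# Run the per-experiment commands in the background and use
# bash's built-in `wait` to block until all of them have exited.
mkdir -p output
for i in 1 2 3; do
    sleep 0.1 > "output/gs${i}" &   # fake mycmd; redirection creates the output file
done
wait   # blocks until every background job has finished
ls output
```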

The problem with outputting a directory is that when you want to run an additional experiment, or if some of your experiments fail, you have to rerun all of the other ones as well. Therefore I think it's better to think in terms of one experiment = one DVC file. Making it possible to run these tasks in parallel #755 would make that usable.

For example:

mkdir output; 
# Move to output directory to create DVC files there
cd output;
for i in {1..100}; do 
    # Would have to execute in parallel
    dvc run -d ../input/gs${i}/options.json -o gs${i} mycmd ../input/gs${i}/options.json gs${i}; 
done; 

@efiop
Contributor Author

efiop commented Nov 23, 2018

@prihoda Great point! This #1214 should be useful for such scenarios as well, since you will be able to tell dvc to not remove output before reproduction.

@dmpetrov
Member

I'm still trying to understand the build matrix stuff, and I think we cannot solve this problem without introducing a concept of reconfigurable stages. Let me explain.

Parallelism

First, it looks like the build matrix could be part of the parallel execution problem #755, where parallel steps are specified in a single stage as a build matrix with a certain level of parallelism.

However, an ideal parallelization solution should be able to run commands even from different stages. So, I'd discuss the parallel execution problem and build-matrix problem separately.

Reconfiguration

Second, there are many issues pointing to a build matrix. Most of them are related to reconfiguring a step or a pipeline:

  1. In Make dvc run handle files with same name but different path #973 @Hong-Xiang was asking about reusing (reconfigurable) pipelines.
  2. pipelines: parametrize using environment variables / DVC properties #1416 asks to parametrize a pipeline or step with plain parameters rather than a config file.
  3. How to manage repetitive dvc run commands (like unpacking of many zip files)? #1119 is about repetitive commands. I see a similarity with parametrizable commands where only a single output is in use, without creating a separate directory for each experiment (./output.p instead of gs1/output.p).
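For item 2, the closest approximation today is letting the shell inject parameters from a variable. A tiny sketch (with a fake `mycmd` defined as a shell function, since the real command is hypothetical):

```shell
# Sketch: parametrizing the same command via a shell variable.
# `mycmd` is a stand-in, faked with echo so the script is runnable.
mycmd() { echo "ran with: $*"; }

for PARAMS in "--option 1" "--option 2"; do
    mycmd input output $PARAMS   # unquoted on purpose: splits into separate args
done
```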

To make a stage reconfigurable, many questions have to be answered (how to pass configs and params, how to specify inputs and outputs) and some assumptions have to be made. Reconfigurable stages are the problem we need to solve before introducing a build matrix and before trying to implement something like this:

cmd: mycmd input output $PARAMS
matrix:
   - name: experiment1
     params: --option 1
   - name: experiment2
     params: --option 2

Only after that will we be able to introduce a build matrix, or decide to just use loops as @prihoda said.

I've created a new issue #1462 for reconfigurable stages.

@efiop efiop removed their assignment Jul 23, 2019
@efiop efiop added the p3-nice-to-have It should be done this or next sprint label Jul 23, 2019
@hhoeflin

hhoeflin commented Aug 8, 2019

Have you looked into getting this behaviour using e.g. snakemake and building this into a dvc run step?

@efiop
Contributor Author

efiop commented Aug 8, 2019

@hhoeflin We didn't. Could you elaborate, please? How would that look? 🙂

@hhoeflin

hhoeflin commented Aug 9, 2019

@efiop My own experience with snakemake is limited, so take this with a grain of salt. But the command is basically just a rule. Outputs are the targets, where the experiment name would be encoded in the directory name of the output or in the target suffix (as you suggested before). Snakemake lets you easily parse these target names into their subcomponents. For the parameters, you could use a dictionary that injects different parameters into the rule depending on the experiment name.
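A rough Snakemake sketch of what that could look like (illustrative only; `mycmd`, the paths, and the parameter values are made up):

```python
# Snakemake sketch: the {exp} wildcard is parsed out of the requested
# target path, and a dict injects per-experiment parameters.
PARAMS = {"experiment1": "--option 1", "experiment2": "--option 2"}

rule run_experiment:
    input: "input/{exp}/options.json"
    output: "output/{exp}/result"
    params: opts=lambda wildcards: PARAMS[wildcards.exp]
    shell: "mycmd {input} {output} {params.opts}"
```

Requesting `snakemake output/experiment1/result` would then run the rule with that experiment's parameters.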

Hope this helps. The documentation of snakemake is really good. Have a look there.

@efiop
Contributor Author

efiop commented Aug 12, 2019

@hhoeflin Thanks for elaborating! 🙂 I don't think we will be able to natively integrate snakemake into dvcfiles, since we use pure YAML, but we could definitely study it and draw on it while implementing our own feature.

@hhoeflin

@efiop
One other interesting project to look into would be makepp (http://makepp.sourceforge.net/)

It is a make program that tracks inputs and outputs using md5 checksums stored in .makepp directories inside the project. It is a "drop-in" replacement for make.

@Wirg

Wirg commented Jul 2, 2020

Small bump out of curiosity.

Is there any plan to introduce such a feature?

We are currently producing 500+ .dvc files and updating them frequently (changing deps, outs, and cmd, not just the hashes).
With the introduction of multi-stage files in DVC v1, we will be able to reduce this to ~80 files. Kudos!
With this kind of change, we could reduce it to a single .dvc file with the 80 elements as a matrix.

For context:
We are currently using cookiecutter (https://github.com/cookiecutter/cookiecutter) to produce DVC pipelines.
This works, but having a matrix system would greatly improve readability, usage (repro), and maintainability.

@efiop
Contributor Author

efiop commented Jul 2, 2020

@Wirg I think the mechanisms for that will be introduced as a part of #2799 . We are actively working on that right now, though we are still in the early stages of development.

@skshetry
Member

skshetry commented Jul 2, 2020

@Wirg, can you use YAML anchors? It might not be sufficient considering our YAML structure, but for small cases (such as sharing wdir), it might work.
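For illustration, a generic YAML-anchor sketch (the stage layout here is made up, not DVC's actual schema — whether DVC's validation accepts the extra top-level key is exactly the caveat above):

```yaml
# An anchor (&defaults) defined once and merged into several
# stages with <<: *defaults, so wdir is written only once.
defaults: &defaults
  wdir: runs
stage1:
  <<: *defaults
  cmd: process.py a
stage2:
  <<: *defaults
  cmd: process.py b
```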

@Wirg

Wirg commented Jul 2, 2020

@efiop

Thanks for the lightning fast answer and the expectation management on the development stage.

I subscribed to the provided issue.

@Wirg

Wirg commented Jul 2, 2020

@skshetry thanks for your suggestion

@Wirg, can you use YAML anchors? It might not be sufficient considering our YAML structure, but for small cases (such as sharing wdir), it might work.

I am not clear on how YAML anchors would improve the situation.
We will probably use them for the multi-stage update with DVC v1,
but I don't see how they would replace a matrix feature.

As an example, we will run this kind of .dvc file with various configs: hyperparameters and/or input data (currently ~80):

cmd: >-
  python src/pipeline.py
  --sub_folder {{cookiecutter.sub_folder}}
  --base_dir {{cookiecutter.base_dir}}
  --input_data {{cookiecutter.input_data}}
  --config_file {{cookiecutter.config_file}}
wdir: {{cookiecutter.dvc_wdir}}
deps:
- path: {{cookiecutter.base_dir}}/{{cookiecutter.sub_folder}}/annotations/{{cookiecutter.input_data}}
- path: src/pipeline.py
- path: {{cookiecutter.config_file}}
outs:
- path: {{cookiecutter.base_dir}}/{{cookiecutter.sub_folder}}/outputs

Our current approach is to produce one such file per config with cookiecutter and run them with dvc repro -R.
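The generation step can be approximated like so (a sketch using plain `str.format`, not cookiecutter's actual API; the template fields and paths are simplified stand-ins):

```python
# Sketch: render one .dvc file per config, roughly what the
# cookiecutter-based generation does. Fields are illustrative.
TEMPLATE = """\
cmd: python src/pipeline.py --input_data {input_data} --config_file {config_file}
deps:
- path: {base_dir}/annotations/{input_data}
- path: {config_file}
outs:
- path: {base_dir}/outputs
"""

configs = [
    {"base_dir": "data/a", "input_data": "a.json", "config_file": "conf/a.yaml"},
    {"base_dir": "data/b", "input_data": "b.json", "config_file": "conf/b.yaml"},
]

def render(config):
    """Fill the template with one config's values."""
    return TEMPLATE.format(**config)

for cfg in configs:
    print(render(cfg))
```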

@dmpetrov
Member

That was a really helpful, long-running discussion. It helped us a lot in identifying a possible solution #3633 and the first implementation #4734.

Let's close this issue and move all the discussions to #3633.
