
dvc: consider introducing build matrix #1018

Closed
efiop opened this issue Aug 14, 2018 · 15 comments
Labels
enhancement Enhances DVC p3-nice-to-have It should be done this or next sprint
Comments

@efiop
Contributor

efiop commented Aug 14, 2018

#973 (comment)

I.e. something like:

matrix:
  include:
    - workdir: runs/gs1
    - workdir: runs/gs2
cmd: process.py input output
deps:
  - path: input
outs:
  - path: output
    cache: True
@efiop
Contributor Author

efiop commented Nov 15, 2018

Also maybe something like:

cmd: mycmd input output $PARAMS
matrix:
   - name: experiment1
     params: --option 1
   - name: experiment2
     params: --option 2

it will produce output.experiment1, output.experiment2 and so on for the stages down the pipeline.
So basically, output files down the pipeline would get suffixes corresponding to the experiment that produced them. Or maybe, instead of suffixes, directories would be created automatically to store those outputs for each experiment.
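Purely as an illustration (this is not DVC's actual API — the matrix structure, the `expand` helper, and the suffix scheme are all hypothetical), the expansion described above could look like:

```python
# Hypothetical sketch: expand a matrix spec into one concrete
# command per experiment, suffixing the output with the name.
matrix = [
    {"name": "experiment1", "params": "--option 1"},
    {"name": "experiment2", "params": "--option 2"},
]

cmd_template = "mycmd input output.{name} {params}"

def expand(matrix, template):
    """Return (output, command) pairs, one per matrix entry."""
    stages = []
    for entry in matrix:
        output = "output." + entry["name"]
        command = template.format(name=entry["name"], params=entry["params"])
        stages.append((output, command))
    return stages

for output, command in expand(matrix, cmd_template):
    print(output, "<-", command)
```

Stages further down the pipeline would then depend on `output.experiment1`, `output.experiment2`, and so on.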

@prihoda
Contributor

prihoda commented Nov 22, 2018

If I understand it correctly, this can already be handled by outputting a directory, using a command that contains a for loop, right?

Something like this:

mkdir output; for i in {0..100}; do mycmd input/gs${i}/options.json output/gs${i}; done

This approach also makes it possible to run all tasks in parallel, if you are able to submit asynchronously and wait for all tasks to finish:

dvc run -d input -o output 'mkdir output; for i in {1..100}; do mycmd input/gs${i}/options.json output/gs${i} & done; wait_for_results gs{1..100}'

# Formatted script:
mkdir output; 
for i in {1..100}; do 
    mycmd input/gs${i}/options.json output/gs${i} &
done; 
wait_for_results gs{1..100}
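Note that `wait_for_results` above is a placeholder; with plain bash, the built-in `wait` can play that role. A minimal self-contained sketch (with `sleep` standing in for `mycmd`, and only three iterations so it runs quickly):

```shell
# Run the per-experiment commands in the background and use
# bash's built-in `wait` to block until all of them have exited.
mkdir -p output
for i in 1 2 3; do
    sleep 0.1 > "output/gs${i}" &   # fake mycmd; redirection creates the output file
done
wait   # blocks until every background job has finished
ls output
```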

The problem with outputting a directory is that when you want to run an additional experiment, or if some of your experiments fail, you have to rerun all of the other ones as well. Therefore I think it's better to think in terms of one experiment = one DVC file. Making it possible to run these tasks in parallel #755 would make that usable.

For example:

mkdir output; 
# Move to output directory to create DVC files there
cd output;
for i in {1..100}; do 
    # Would have to execute in parallel
    dvc run -d ../input/gs${i}/options.json -o gs${i} mycmd ../input/gs${i}/options.json gs${i}; 
done; 

@efiop
Contributor Author

efiop commented Nov 23, 2018

@prihoda Great point! This #1214 should be useful for such scenarios as well, since you will be able to tell dvc to not remove output before reproduction.

@dmpetrov
Member

I'm still trying to understand the build matrix stuff, and I think we cannot solve this problem without introducing a concept of reconfigurable stages. Let me explain.

Parallelism

First, it looks like the build matrix could be part of the parallel execution problem #755, where parallel steps are specified in a single stage as a build matrix with a certain level of parallelism.

However, an ideal parallelization solution should be able to run commands even from different stages. So, I'd discuss the parallel execution problem and build-matrix problem separately.

Reconfiguration

Second, there are many issues pointing to a build matrix. Most of them are related to reconfiguring a step or a pipeline:

  1. In Make dvc run handle files with same name but different path #973 @Hong-Xiang was asking about reusing (reconfigurable) pipelines.
  2. pipelines: parametrize using environment variables / DVC properties #1416 asks to parametrize a pipeline or step with plain parameters rather than a config file.
  3. How to manage repetitive dvc run commands (like unpacking of many zip files)? #1119 is about repetitive commands. I see a similarity with parametrizable commands where only a single output is in use, without creating a separate directory for each experiment (./output.p instead of gs1/output.p).
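For item 2, the closest approximation today is letting the shell inject parameters from a variable. A tiny sketch (with a fake `mycmd` defined as a shell function, since the real command is hypothetical):

```shell
# Sketch: parametrizing the same command via a shell variable.
# `mycmd` is a stand-in, faked with echo so the script is runnable.
mycmd() { echo "ran with: $*"; }

for PARAMS in "--option 1" "--option 2"; do
    mycmd input output $PARAMS   # unquoted on purpose: splits into separate args
done
```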

To make a stage reconfigurable, many questions have to be answered (how to pass configs and params, how to specify inputs and outputs) and some assumptions have to be made. Reconfigurable stages are the problem we need to solve before introducing a build matrix and before trying to implement something like this:

cmd: mycmd input output $PARAMS
matrix:
   - name: experiment1
     params: --option 1
   - name: experiment2
     params: --option 2

Only after that will we be able to introduce a build matrix, or decide to just use loops as @prihoda said.

I've created a new issue #1462 for reconfigurable stages.

@efiop efiop removed their assignment Jul 23, 2019
@efiop efiop added the p3-nice-to-have It should be done this or next sprint label Jul 23, 2019
@hhoeflin

hhoeflin commented Aug 8, 2019

Have you looked into getting this behaviour using e.g. snakemake and building this into a dvc run step?

@efiop
Contributor Author

efiop commented Aug 8, 2019

@hhoeflin We didn't. Could you elaborate, please? How would that look? 🙂

@hhoeflin

hhoeflin commented Aug 9, 2019

@efiop My own experience with snakemake is limited, so take this with a grain of salt. But the command is basically just a rule. Outputs are the targets, where the experiment name would be encoded in the directory name of the output or in the target suffix (as you suggested before). Snakemake lets you easily parse these target names into their subcomponents. For the parameters, you could use a dictionary that injects different parameters into the rule depending on the experiment name.
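A rough Snakemake sketch of what that could look like (illustrative only; `mycmd`, the paths, and the parameter values are made up):

```python
# Snakemake sketch: the {exp} wildcard is parsed out of the requested
# target path, and a dict injects per-experiment parameters.
PARAMS = {"experiment1": "--option 1", "experiment2": "--option 2"}

rule run_experiment:
    input: "input/{exp}/options.json"
    output: "output/{exp}/result"
    params: opts=lambda wildcards: PARAMS[wildcards.exp]
    shell: "mycmd {input} {output} {params.opts}"
```

Requesting `snakemake output/experiment1/result` would then run the rule with that experiment's parameters.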

Hope this helps. The documentation of snakemake is really good. Have a look there.

@efiop
Contributor Author

efiop commented Aug 12, 2019

@hhoeflin Thanks for elaborating! 🙂 I don't think we will be able to natively integrate snakemake into dvcfiles, since we use pure YAML, but we could definitely study it and draw on it while implementing our own feature.

@hhoeflin

@efiop
One other interesting project to look into would be makepp (http://makepp.sourceforge.net/)

It is a make program that tracks inputs and outputs using md5 checksums stored in .makepp directories inside the project. It is a "drop-in" replacement for make.

@Wirg

Wirg commented Jul 2, 2020

Small bump out of curiosity.

Is there any plan to introduce such a feature?

We are currently producing 500+ .dvc files and updating them frequently (changing deps, outs, and cmd, not just the hashes).
With the introduction of multi-stage files in DVC v1, we will be able to reduce this to ~80 files. Kudos!
With this kind of change, we could reduce it to a single .dvc file with the 80 elements as a matrix.

For context:
We are currently using cookiecutter (https://github.com/cookiecutter/cookiecutter) to produce DVC pipelines.
This works, but having a matrix system would greatly improve readability, usage (repro), and maintainability.

@efiop
Contributor Author

efiop commented Jul 2, 2020

@Wirg I think the mechanisms for that will be introduced as a part of #2799 . We are actively working on that right now, though we are still in the early stages of development.

@skshetry
Member

skshetry commented Jul 2, 2020

@Wirg, can you use YAML anchors? It might not be sufficient considering our YAML structure, but for small cases (such as sharing wdir), it might work.
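For illustration, a generic YAML-anchor sketch (the stage layout here is made up, not DVC's actual schema — whether DVC's validation accepts the extra top-level key is exactly the caveat above):

```yaml
# An anchor (&defaults) defined once and merged into several
# stages with <<: *defaults, so wdir is written only once.
defaults: &defaults
  wdir: runs
stage1:
  <<: *defaults
  cmd: process.py a
stage2:
  <<: *defaults
  cmd: process.py b
```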

@Wirg

Wirg commented Jul 2, 2020

@efiop

Thanks for the lightning fast answer and the expectation management on the development stage.

I subscribed to the provided issue.

@Wirg

Wirg commented Jul 2, 2020

@skshetry thanks for your suggestion

@Wirg, can you use YAML anchors? It might not be sufficient considering our YAML structure, but for small cases (such as sharing wdir), it might work.

I am not clear on how YAML anchors would improve the situation.
We will probably use them for the multi-stage update with DVC v1,
but I don't see how they would replace a matrix feature.

As an example, we will run this kind of .dvc file with various configs: hyperparameters and/or input data (currently ~80):

cmd: >-
  python src/pipeline.py
  --sub_folder {{cookiecutter.sub_folder}}
  --base_dir {{cookiecutter.base_dir}}
  --input_data {{cookiecutter.input_data}}
  --config_file {{cookiecutter.config_file}}
wdir: {{cookiecutter.dvc_wdir}}
deps:
- path: {{cookiecutter.base_dir}}/{{cookiecutter.sub_folder}}/annotations/{{cookiecutter.input_data}}
- path: src/pipeline.py
- path: {{cookiecutter.config_file}}
outs:
- path: {{cookiecutter.base_dir}}/{{cookiecutter.sub_folder}}/outputs

Our current approach is to produce one such file per config with cookiecutter and run them with dvc repro -R.
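The generation step can be approximated like so (a sketch using plain `str.format`, not cookiecutter's actual API; the template fields and paths are simplified stand-ins):

```python
# Sketch: render one .dvc file per config, roughly what the
# cookiecutter-based generation does. Fields are illustrative.
TEMPLATE = """\
cmd: python src/pipeline.py --input_data {input_data} --config_file {config_file}
deps:
- path: {base_dir}/annotations/{input_data}
- path: {config_file}
outs:
- path: {base_dir}/outputs
"""

configs = [
    {"base_dir": "data/a", "input_data": "a.json", "config_file": "conf/a.yaml"},
    {"base_dir": "data/b", "input_data": "b.json", "config_file": "conf/b.yaml"},
]

def render(config):
    """Fill the template with one config's values."""
    return TEMPLATE.format(**config)

for cfg in configs:
    print(render(cfg))
```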

@dmpetrov
Member

That was a really helpful, long-running discussion. It helped us a lot in identifying a possible solution #3633 and the first implementation #4734.

Let's close this issue and move all the discussions to #3633.
