dvc: consider introducing build matrix #1018
Also maybe something like: … it will produce output.experiment1, output.experiment2, and so on for the stages down the pipeline.
If I understand it correctly, this can already be handled by outputting a directory, using a command that contains a for loop, right? Something like this:

```bash
mkdir output; for i in {1..100}; do mycmd input/gs${i}/options.json output/gs${i}; done
```

This approach also makes it possible to run all tasks in parallel, if you are able to submit asynchronously and wait for all tasks to finish:

```bash
dvc run -d input -o output 'mkdir output; for i in {1..100}; do mycmd input/gs${i}/options.json output/gs${i} & done; wait_for_results gs{1..100}'
```

```bash
# Formatted script:
mkdir output;
for i in {1..100}; do
    mycmd input/gs${i}/options.json output/gs${i} &
done;
wait_for_results gs{1..100}
```

The problem with outputting a directory is that when you want to run an additional experiment, or if some of your experiments fail, you have to rerun all of the other ones as well. Therefore I think it's better to think in terms of one experiment = one DVC file. Making it possible to run these tasks in parallel (#755) would make that usable. For example:

```bash
mkdir output;
# Move to the output directory to create the DVC files there
cd output;
for i in {1..100}; do
    # Would have to execute in parallel
    dvc run -d ../input/gs${i}/options.json -o gs${i} mycmd ../input/gs${i}/options.json gs${i};
done;
```
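A variant of the single-stage approach above that avoids the hypothetical wait_for_results step: a minimal sketch, assuming GNU coreutils (seq) and GNU xargs are available and that the mycmd invocations are independent of each other (names and paths reuse the example above):

```bash
# Run the 100 experiments 8 at a time inside one stage; xargs blocks
# until all children have exited, so no separate wait step is needed.
dvc run -d input -o output \
    'mkdir -p output; seq 1 100 | xargs -P 8 -I{} mycmd input/gs{}/options.json output/gs{}'
```

It still shares the drawback discussed above: adding or retrying a single experiment reruns the whole stage.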
I'm still trying to understand the build matrix stuff. And I think we cannot solve this problem without introducing a concept of reconfigurable stages. Let me explain.

Parallelism

First, it looks like a build matrix can be part of the parallel execution (#755) problem, when parallel steps are specified in a single stage as a build matrix with a certain level of parallelism. However, an ideal parallelization solution should be able to run commands even from different stages. So, I'd discuss the parallel execution problem and the build matrix problem separately.

Reconfiguration

Second, there are many issues that point to a build matrix. Most of them are related to reconfiguration of a step or a pipeline.

To make a stage reconfigurable, many questions have to be answered (how to pass configs and params, how to specify inputs and outputs) and some assumptions have to be made. Reconfigurable stages are the problem we need to solve first, before introducing a build matrix and before trying to implement something like this:
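For illustration only, a hypothetical sketch of such a declarative matrix stage; none of these keys or the ${item.*} interpolation existed in DVC at the time, so everything here is an assumption about a possible syntax:

```yaml
stages:
  train:
    matrix:                 # hypothetical key: one stage instance per value
      gs: [1, 2, 3]
    cmd: mycmd input/gs${item.gs}/options.json output/gs${item.gs}
    deps:
      - input/gs${item.gs}/options.json
    outs:
      - output/gs${item.gs}
```

Each combination would expand into its own reconfigured stage instance, which is exactly why reconfigurable stages need to be defined first.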
Only after that will we be able to introduce a build matrix, or decide to just use loops, as @prihoda said. I've created a new issue, #1462, for reconfigurable stages.
Have you looked into getting this behaviour using e.g. Snakemake, and building it into a dvc run step?
@hhoeflin We didn't. Could you elaborate, please? How would that look? 🙂
@efiop My own experience with Snakemake is limited, so take this with a grain of salt. But a command is basically just a rule. Outputs are the targets, where the experiment name would be encoded in the directory name of the output or in the target suffix (as you suggested before). Snakemake lets you easily parse these target names into their subcomponents. For the parameters, you could use a dictionary that injects different parameters into the rule depending on the experiment name. Hope this helps. The Snakemake documentation is really good; have a look there.
@hhoeflin Thanks for elaborating! 🙂 I don't think we will be able to natively integrate Snakemake into dvcfiles, since we use pure YAML, but we could definitely check it out and see whether we can draw some conclusions from it while implementing our own feature.
@efiop makepp is a make program that tracks inputs and outputs using md5 checksums, stored inside the project in .makepp directories. It is a "drop-in" replacement for make.
Small bump, out of curiosity: is there any plan to introduce such a feature? We are right now producing 500+ … For context: …
@Wirg, can you use YAML anchors? It might not be sufficient considering our YAML structure, but for small cases (such as sharing …).
Thanks for the lightning-fast answer and the expectation management on the development stage. I subscribed to the provided issue.
@skshetry thanks for your suggestion.

I am not clear on how YAML anchors would improve the situation. As an example, we will run this kind of cmd:

```bash
python src/pipeline.py \
    --sub_folder {{cookiecutter.sub_folder}} \
    --base_dir {{cookiecutter.base_dir}} \
    --input_data {{cookiecutter.input_data}} \
    --config_file {{cookiecutter.config_file}}
```

with a stage template like:

```yaml
wdir: {{cookiecutter.dvc_wdir}}
deps:
  - path: {{cookiecutter.base_dir}}/{{cookiecutter.sub_folder}}/annotations/{{cookiecutter.input_data}}
  - path: src/pipeline.py
  - path: {{cookiecutter.config_file}}
outs:
  - path: {{cookiecutter.base_dir}}/{{cookiecutter.sub_folder}}/outputs
```

Our current approach is to produce this kind of file for each config thanks to cookiecutter, and run them with …
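For what it's worth, plain YAML anchors only deduplicate whole nodes that repeat verbatim; they cannot interpolate values into strings the way cookiecutter does, which is presumably why they "might not be sufficient" here. A minimal sketch of the mechanism itself (not a valid Dvcfile; the stage names and config paths are illustrative), assuming the YAML parser supports 1.1 merge keys:

```yaml
# Experiment-independent settings defined once under an anchor:
defaults: &defaults
  wdir: .
  deps:
    - path: src/pipeline.py  # anchors copy whole nodes, so per-experiment
                             # deps cannot be appended to this shared list

experiment_a:
  <<: *defaults              # merge key: copies wdir and deps from the anchor
  cmd: python src/pipeline.py --config_file configs/a.json

experiment_b:
  <<: *defaults
  cmd: python src/pipeline.py --config_file configs/b.json
```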
#973 (comment)
I.e. something like: …