
repro: add scheduler for parallelising execution jobs #755

Open
yukw777 opened this issue Jun 8, 2018 · 54 comments
Labels
A: pipelines Related to the pipelines feature enhancement Enhances DVC p2-medium Medium priority, should be done, but less important research
Comments

@yukw777

yukw777 commented Jun 8, 2018

When I try to run multiple dvc run commands, I get the following error:

$ dvc run ...
Failed to lock before running a command: Cannot perform the cmd since DVC is busy and locked. Please retry the cmd later.

This is inconvenient because I'd love to run multiple experiments together using dvc. Is there any way we can be smarter about locking?

@efiop efiop self-assigned this Jun 8, 2018
@efiop efiop changed the title dvc run lock issue. Allow running multiple dvc run instances at the same time Jun 8, 2018
@efiop
Contributor

efiop commented Jun 8, 2018

Hi @yukw777 !

I agree, it is inconvenient. For now dvc can only run a single instance per repo, because it has to make sure that your dvc files/deps/outputs don't collide with each other and that you don't get race conditions in your project. For the dvc run case, all we need to do to enable "multiprocessing" is to make sure that dvcfiles, outputs and dependencies don't collide. That should be pretty easy to implement by creating per-file lock files that dvc will take into account when executing commands. I will rename this issue and add it to our TODO list, but I'm not sure whether it is going to make it into the 0.9.8 release. Btw, do you only need multiprocessing for dvc run, or for dvc repro as well?

If you really need to run a few heavy experiments simultaneously right now, I would suggest a rather ugly but simple workaround of just copying your project's directory and running one dvc run per copy. Would that be feasible for you?

Thanks,
Ruslan

@efiop efiop added the enhancement Enhances DVC label Jun 8, 2018
@efiop efiop added this to the 0.9.8 milestone Jun 8, 2018
@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jun 9, 2018
@yukw777
Author

yukw777 commented Jun 11, 2018

Hi Ruslan,

I think it makes sense to add multiprocessing for dvc repro as well. This will allow me to reproduce multiple experiments at once.

Yeah I was kind of doing what you suggested already. Thanks again for your speedy response! Very excited to use 0.9.8.

@efiop efiop removed the awaiting response we are waiting for your reply, please respond! :) label Jun 14, 2018
@efiop efiop modified the milestones: 0.9.8, 0.9.9 Jun 20, 2018
@efiop
Contributor

efiop commented Jun 20, 2018

Moving this to 0.9.9 TODO.

@prihoda
Contributor

prihoda commented Aug 8, 2018

Hi guys, is this still on the table? ;)

@efiop
Contributor

efiop commented Aug 8, 2018

Hi @prihoda !

It is, we just haven't had time to tackle it yet. With this much demand, we will try to move it up the priority list. 🙂

Thank you.

@prihoda
Contributor

prihoda commented Jan 7, 2019

Hi @efiop

For the dvc run case, all we need to do to enable "multiprocessing" is to make sure that dvcfiles, outputs and dependencies don't collide. That should be pretty easy to implement by creating per-file lock files that dvc will take into account when executing commands.

So all that is needed for parallel execution is to make sure that no running stage's output path is the same as another running stage's input path, right? Two running stages having the same input path should not be a problem. So a lock file would only be created for dependencies and only checked for outputs. Can that be solved using zc.lockfile, just like the global lock file?

@efiop
Contributor

efiop commented Jan 7, 2019

@prihoda Yes, I think so :)

@prihoda
Contributor

prihoda commented Jan 7, 2019

Looks like Portalocker provides shared/exclusive locks, which could be used as read/write locks: https://portalocker.readthedocs.io/en/latest/portalocker.html
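For illustration, a minimal sketch (not DVC code) of the per-path read/write locking idea, using portalocker's shared and exclusive flags: a stage would take shared locks on its dependencies and exclusive locks on its outputs, so stages that only read the same file can overlap while writers are serialized. The lock directory and the path_lock helper are made-up names.

import contextlib
import os

import portalocker

LOCK_DIR = ".dvc/tmp/path-locks"  # hypothetical location for per-path lock files

@contextlib.contextmanager
def path_lock(path, exclusive):
    os.makedirs(LOCK_DIR, exist_ok=True)
    lock_file = os.path.join(LOCK_DIR, path.replace(os.sep, "_") + ".lock")
    flags = portalocker.LOCK_EX if exclusive else portalocker.LOCK_SH
    with open(lock_file, "a") as fobj:
        portalocker.lock(fobj, flags)
        try:
            yield
        finally:
            portalocker.unlock(fobj)

# A stage would hold shared locks on its deps and exclusive locks on its outs:
# with path_lock("data/raw.xml", exclusive=False), \
#      path_lock("data/features.pkl", exclusive=True):
#     run_stage_command()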

@AlJohri

AlJohri commented Jan 30, 2019

Once the multiprocessing is added, hopefully it can extend to dvc repro as well, perhaps controlled via the --jobs parameter.

@efiop efiop added the p2-medium Medium priority, should be done, but less important label Mar 20, 2019
@mathemaphysics

I followed the steps verbatim on https://dvc.org/doc/tutorial/define-ml-pipeline and in the previous step. This is the official tutorial as far as I can see, and I'm getting a very, very long wait to run dvc add data/Posts.xml.zip. I've also tried deleting .dvc/lock and .dvc/updater.lock, which show up every single time I look, just to see if it does something.

What's going on?

@shcheklein
Member

This is the official tutorial as far as I can see. And I'm getting a very, very long wait to run dvc add data/Posts.xml.zip

https://discordapp.com/channels/485586884165107732/485596304961962003/575535709490905101 - NFS locking issue as was discussed on Discord

@ghost

ghost commented May 10, 2019

Related ^^ #1918

@mdekstrand

I would find this useful, particularly for dvc repro as well.

I run my experiments on an HPC cluster, and would like to be able to kick off multiple large, non-overlapping subcomponents of the experiment in parallel. Right now the locking prevents that; also, locking keeps me from performing other operations (such as uploading the results of a previous computation to a remote) while a job is working.

My workaround right now is to create Git worktrees. I have a script that creates a worktree, configures DVC to share a cache with the original repository, checks out the DVC outputs, and then starts the real job. When it is finished, I have to manually merge the results back into my project.

It works, but is quite ugly and tedious.
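For reference, a rough sketch along the lines of that script (illustrative only, not the exact script; the paths and the run_in_worktree helper are made up): create a git worktree, point its local DVC config at the main repository's cache, check the data out, and run the job there. Merging the results back is still a manual step.

import subprocess

def run_in_worktree(worktree_path, branch, cache_dir, cmd):
    # Create a separate working copy on its own branch.
    subprocess.run(["git", "worktree", "add", worktree_path, branch], check=True)
    # Share the original repository's DVC cache so data is not duplicated on disk.
    subprocess.run(["dvc", "config", "--local", "cache.dir", cache_dir],
                   cwd=worktree_path, check=True)
    # Materialize the tracked data in the worktree.
    subprocess.run(["dvc", "checkout"], cwd=worktree_path, check=True)
    # Launch the actual experiment inside the worktree.
    subprocess.run(cmd, cwd=worktree_path, check=True)

# Example (paths and branch name are made up):
# run_in_worktree("../exp-a", "exp-a", "/home/me/project/.dvc/cache", ["dvc", "repro"])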

@efiop
Contributor

efiop commented Aug 12, 2019

We've had a discussion about some quick-ish ways to help users with this and came up with two possible partial solutions:

  1. As a quick general solution, we could implement a --no-lock option for run and repro that would shift the responsibility to the user for making sure that their stages don't overlap. This solution seems trivial, except that we would need to revisit the way we acquire the lock for our state db, because currently we take it really early and sit on it while waiting for the command to finish. It seems like it would require a slight adjustment to take the state db lock only when we really need it (self.state.get/save/etc in RemoteBASE).

  2. As @shcheklein suggested, another option might be to implement locking per-pipeline, so you could at least run stages in unrelated pipelines. This one is simple, but not as trivial as 1) and would still require solving the state db lock problem. Also, this approach would only work in specific scenarios.

And the final option is to come up with a general solution, which would still require solving the state db lock issue from 1) and would probably look like per-output lock files.

What do you guys think?

@efiop efiop self-assigned this Aug 13, 2019
@charlesbaynham
Contributor

Nothing particular to add here other than a +1 from me! My workflow usually involves lots and lots of quite small steps, executing on an HPC cluster, so parallelism would be a great help.

@janvainer

I would also benefit very much from parallel execution.

@johnnychen94
Contributor

johnnychen94 commented Dec 1, 2020

Cross-post from #3633 (comment)


We could go one step further based on option 1 in #755 (comment) by letting users specify how jobs/stages are run:

# jobs.yaml
jobs:
  - name: extract_data@*
    limit: 1 # 1 means no concurrency, this can be set as default value
  - name: prepare@*
    limit: 4 # at most 4 concurrent jobs
    env:
      JULIA_NUM_THREADS: 8
  - name: train@*
    limit: 4
    env:
      # if it's a scalar, apply to all jobs
      JULIA_NUM_THREADS: 8
      # if it's a list, iterate one by one
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
        - 3
  - name: evaluate@*
    limit: 3
    env:
      JULIA_NUM_THREADS: 8
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
  - name: summary

extract_data@* follows the same glob syntax as in #4976

I have never used SLURM, so I don't know whether this is applicable to cluster cases. I feel like dvc should delegate distributed computation tasks to other existing tools.
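To make the idea concrete, here is a rough sketch (purely hypothetical, not an existing DVC feature) of how a scheduler could interpret a jobs.yaml like the one above: run each group's stages with the configured concurrency limit, apply scalar env values to every job, and cycle list values across jobs. It assumes that stage globs like prepare@* have already been expanded to concrete stage names (expand_stage_glob below is a made-up helper) and that per-stage dvc repro invocations can safely run concurrently, i.e. the repo-level lock problem discussed in this issue is solved.

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

def expand_env(env_spec, n_jobs):
    """Turn {VAR: scalar-or-list} into one env dict per job."""
    cycles = {
        key: cycle(val if isinstance(val, list) else [val])
        for key, val in (env_spec or {}).items()
    }
    return [
        {key: str(next(it)) for key, it in cycles.items()}
        for _ in range(n_jobs)
    ]

def run_group(group, stages):
    envs = expand_env(group.get("env"), len(stages))

    def run_one(stage, extra_env):
        # Run a single stage with the group's env applied on top of the current env.
        subprocess.run(
            ["dvc", "repro", "--single-item", stage],
            env={**os.environ, **extra_env},
            check=True,
        )

    # "limit" caps how many stages of this group run at the same time.
    with ThreadPoolExecutor(max_workers=group.get("limit", 1)) as pool:
        futures = [pool.submit(run_one, s, e) for s, e in zip(stages, envs)]
        for fut in futures:
            fut.result()  # propagate failures

# Example usage (reading jobs.yaml would use e.g. yaml.safe_load from PyYAML;
# expand_stage_glob is a made-up helper for the @* expansion):
# import yaml
# config = yaml.safe_load(open("jobs.yaml"))
# for group in config["jobs"]:
#     run_group(group, expand_stage_glob(group["name"]))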

@turian

turian commented Dec 4, 2020

Do I understand correctly that if one step of the pipeline (e.g. retrieving webpages) requires 100 micro-tasks running in parallel, then this step is a bad fit for doing within dvc?

@dmpetrov in #755 (comment) describes 3 methods of parallelization and I assume you need 2 for this. Let me know if there is a workaround.

@efiop
Contributor

efiop commented Dec 10, 2020

@turian The workaround is to simply parallelize it in your code (one stage in the dvc pipeline) rather than creating 100 micro-stages in the dvc pipeline.
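As an illustration of that suggestion, a minimal made-up example of doing the fan-out inside a single stage's own script (the file names urls.txt and pages.json and the requests dependency are assumptions, not anything prescribed by DVC):

import json
from concurrent.futures import ThreadPoolExecutor

import requests  # assumed dependency of this hypothetical stage script

def fetch(url):
    return url, requests.get(url, timeout=30).text

if __name__ == "__main__":
    with open("urls.txt") as fobj:
        urls = [line.strip() for line in fobj if line.strip()]
    # Fetch all pages concurrently inside this one stage.
    with ThreadPoolExecutor(max_workers=16) as pool:
        pages = dict(pool.map(fetch, urls))
    # A single dvc output for the whole stage, e.g. listed under outs in dvc.yaml.
    with open("pages.json", "w") as fobj:
        json.dump(pages, fobj)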

@turian

turian commented Dec 15, 2020

@efiop The issue is that each of the little jobs has its own pipeline: download webpage => extract text without boilerplate => further NLP.

@jorgeorpinel
Contributor

I haven't read the complete conversation, but my 2c here is that while it sounds like an amazing feature, we should probably think about this several times.

We've already had some ongoing performance issues (mainly around file system manipulation and network connectivity) which required pretty sophisticated implementations. These are expensive to develop and maintain. Do we want to also worry about multithreading in this repo?

We can still provide a solution for this, just not necessarily in the DVC product (cc @dberenbaum). There's CML, for example, so you could probably set up a distributed CI system that runs DVC pipelines in parallel (cc @DavidGOrtega?) and adds a bunch of other benefits. If needed, DVC could have small modifications to support this usage, or we could pursue other alternative solutions (e.g. a plugin/extension independent from this repo). And we can document/blog about setting up systems for parallel execution of DVC pipelines.

@jorgeorpinel
Contributor

That said, if a multithreading refactor could speed up regular operations automatically (more generally than just for executing commands), that would definitely be a welcome enhancement.

See user insight in https://groups.google.com/u/1/a/iterative.ai/g/support/c/zvJ4WnfTGKM/m/ZOckCIwIEwAJ

@tweddielin

Just want to check whether parallel execution is still on the roadmap. By parallel execution, I mean executing the stages (jobs) in a repo's defined pipeline (DAG) in parallel, rather than parallelizing multiple pipelines, just like what you can do with make -j4 or luigi --workers=4.

I appreciate what you guys have already done and hope to see more new functionality, but this is the only thing left unfulfilled for me to call dvc a real "git + make" for data and machine learning projects.

@dberenbaum
Collaborator

@tweddielin Thanks for the kind words and your support! It's still something we'd like to do, but a lot of effort right now is focused on data optimization rather than pipeline optimization. Hopefully we will have a chance to return to pipelines once we complete that work.

@dberenbaum
Collaborator

@dmpetrov @efiop Let's discuss the priority of this one as we start to think about planning for next quarter. I think we should probably move it to p2 for now as we are unlikely to work on it for at least the rest of this quarter, but we should keep it in mind as a priority for next quarter.

@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Feb 18, 2022
@Eisbrenner

Hi, let me add another use case here. I work partly on an HPC cluster featuring SLURM. With SLURM, I can and should define task dependencies directly when queuing jobs.

Assuming a graph like this:

A --> B --> C
  \-> D

I can schedule jobs of the form

#!/bin/bash
#slurm settings

dvc repro dvc.yaml:stageA

with

#!/bin/bash

# queue job A with no dependencies
stdout=$(sbatch <kwargs> job_A.sh)
id_A=${stdout##* } # the job ID is the last word of sbatch's output

# queue job B with id_A as dependency
stdout=$(sbatch <kwargs> --dependency=afterok:$id_A job_B.sh)
id_B=${stdout##* }

# queue job C with id_B as dependency
sbatch <kwargs> --dependency=afterok:$id_B job_C.sh

# queue job D with id_A as dependency
sbatch <kwargs> --dependency=afterok:$id_A job_D.sh

So here I can run B --> C and D in parallel on different nodes or otherwise separate units.

The only thing missing is a way to disable the dvc repo lock.

@osma

osma commented Mar 30, 2022

@tweddielin said above:

I appreciate what you guys have already done and hope to see more new functionality, but this is the only thing left unfulfilled for me to call dvc a real "git + make" for data and machine learning projects.

I second this 100%! DVC is amazing, but the lack of parallel execution of pipeline stages is disappointing. I'm working on a machine with many cores and it would be much more efficient to be able to use something like make -j8.

I've tried some workarounds. It seems to be possible to run individual stages in parallel, using dvc repro -s <stagename>, as long as you don't start them at the exact same moment, because then you will hit the lock contention problem. I even tried automating parallel execution of pending stages outside DVC using the jq tool and GNU Parallel, like this:

dvc status --json | jq -r 'keys | join("\n")' | parallel dvc repro -s

but this fails because all the parallel dvc repro -s commands try to acquire the lock at nearly the same time and usually only one of them will succeed. Since the lock appears to be held only for a short while, adding a retry loop with a timeout could help, as mentioned above in several comments (there's also a closed issue #2031 where this was suggested).
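For example, a rough sketch of such a retry wrapper (hypothetical, not anything DVC provides; it assumes the lock failure shows up as a non-zero exit code whose output contains "is busy and locked", as in the error quoted at the top of this issue):

import subprocess
import sys
import time

def repro_with_retry(stage, attempts=30, wait=5):
    for _ in range(attempts):
        proc = subprocess.run(
            ["dvc", "repro", "--single-item", stage],
            capture_output=True, text=True,
        )
        if proc.returncode == 0:
            return
        if "is busy and locked" not in (proc.stdout + proc.stderr):
            # A real failure, not lock contention: surface it and stop.
            sys.exit(proc.stderr or proc.returncode)
        time.sleep(wait)  # lock contention: back off and try again
    raise RuntimeError(f"gave up waiting for the DVC lock on stage {stage!r}")

if __name__ == "__main__":
    repro_with_retry(sys.argv[1])

Such a wrapper could then replace the bare dvc repro -s in the jq/parallel pipeline above.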

I've also used a lot of foreach statements, and at least for the use cases I can think of, all the iterations are independent of each other. So if it's difficult to schedule parallel execution of the whole pipeline/DAG, at least stages defined using foreach could be executed in parallel.

@itcarroll

itcarroll commented Nov 16, 2022

I am wondering whether the awesome dev team has decided whether the --jobs feature associated with dvc exp run --queue is intended to close this issue. You'd be awesome either way, but maybe more awesome if this issue is still on the table 😉.

Adding a jobs parameter to foreach blocks would be killer.

@dberenbaum
Collaborator

@itcarroll It's not in our short-term plans for the rest of this year, but it's a highly requested feature, so it's still on the table, and there's no intent to close this issue without addressing it more directly.
