repro: add scheduler for parallelising execution jobs #755
When I try to run multiple `dvc run` instances at the same time, I get the following error. This is inconvenient b/c I'd love to run multiple experiments together using dvc. Any way we can be smarter about locking?
Hi @yukw777 ! I agree, it is inconvenient. For now dvc can only run a single instance per repo, because it has to make sure that your dvc files/deps/outputs don't collide with each other and you don't get race conditions in your project. If you really need to run a few heavy experiments simultaneously right now, I would suggest a rather ugly but simple workaround of just copying your project's directory and running one experiment per copy. Thanks,
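A minimal sketch of that copy-the-directory workaround (the paths and the use of `dvc repro` here are illustrative, not an official recipe):

```bash
# Each copy gets its own repo lock, so the experiments can run side by side.
cp -r myproject myproject-exp1
cp -r myproject myproject-exp2

(cd myproject-exp1 && dvc repro) &   # experiment 1
(cd myproject-exp2 && dvc repro) &   # experiment 2
wait                                 # results must be merged back by hand
```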
Hi Ruslan, I think it makes sense to add multiprocessing for `dvc run`. Yeah, I was kind of doing what you suggested already. Thanks again for your speedy response! Very excited to use dvc!
Moving this to 0.9.9 TODO.
Hi guys, is this still on the table? ;)
Hi @prihoda ! It is, we just didn't have time to tackle it yet. With this much demand, we will try to move it up the priority list. 🙂 Thank you.
Hi @efiop
So all that is needed for parallel execution is to make sure no running stage's output path is the same as another running stage's input path, right? Two running stages having the same input path should not be a problem. So a lock file would only be created for dependencies and only checked for outputs. Can that be solved using file locks?
@prihoda Yes, I think so :)
Looks like Portalocker provides shared/exclusive locks which could be used as read/write locks: https://portalocker.readthedocs.io/en/latest/portalocker.html
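Portalocker is a Python library, so the following is only a shell-level sketch of the same shared/exclusive idea using `flock(1)`; the lock-file paths and the stage name are made up:

```bash
# A stage that reads data/raw and writes data/features: take a shared (read)
# lock on the dependency's lock file and an exclusive (write) lock on the
# output's lock file, so readers can overlap while writers wait for readers.
mkdir -p .locks
flock --shared .locks/data-raw \
  flock --exclusive .locks/data-features \
    dvc repro --single-item featurize.dvc
```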
Once the multiprocessing is added, hopefully it can extend to `dvc repro` as well.
I followed the steps verbatim on https://dvc.org/doc/tutorial/define-ml-pipeline and in the previous step. This is the official tutorial as far as I can see, and I'm getting a very, very long wait when running the commands. What's going on?
https://discordapp.com/channels/485586884165107732/485596304961962003/575535709490905101 - NFS locking issue, as was discussed on Discord
Related ^^ #1918
I would find this useful, particularly for `dvc repro`. I run my experiments on an HPC cluster, and would like to be able to kick off multiple large, non-overlapping subcomponents of the experiment in parallel. Right now the locking prevents that; also, locking keeps me from performing other operations (such as uploading the results of a previous computation to a remote) while a job is working. My workaround right now is to create Git worktrees. I have a script that creates a worktree, configures DVC to share a cache with the original repository, checks out the DVC outputs, and then starts the real job. When it is finished, I have to manually merge the results back into my project. It works, but is quite ugly and tedious.
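A sketch of that worktree workaround; the branch name, paths, and the shared-cache setup via `dvc config` are illustrative guesses at the script described above:

```bash
# One worktree per experiment, all pointing at the main repo's cache.
git worktree add ../exp-a my-experiment-branch
cd ../exp-a
# point the worktree at the original repo's cache (an absolute path is safest)
dvc config --local cache.dir "$HOME/work/myproject/.dvc/cache"
dvc checkout   # materialize the outputs recorded on this branch
dvc repro &    # kick off the real job
# when it finishes, the results still have to be merged back by hand
```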
We've had a discussion about some quick-ish ways to help users with this and came up with two possible partial solutions:
And the final option is to come up with a general solution that would still require solving 1) and would probably look like per-output lock files. What do you guys think?
Nothing particular to add here other than a +1 from me! My workflow usually involves lots and lots of quite small steps, executing on an HPC cluster, so parallelism would be a great help.
I would also benefit very much from parallel execution.
Cross-post from #3633 (comment). We could go one step further, based on option 1 in #755 (comment), by letting users specify how jobs/stages are run:

```yaml
# jobs.yaml
jobs:
  - name: extract_data@*
    limit: 1  # 1 means no concurrency; this can be set as the default value
  - name: prepare@*
    limit: 4  # at most 4 concurrent jobs
    env:
      JULIA_NUM_THREADS: 8
  - name: train@*
    limit: 4
    env:
      # if it's a scalar, apply it to all jobs
      JULIA_NUM_THREADS: 8
      # if it's a list, iterate one by one
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
        - 3
  - name: evaluate@*
    limit: 3
    env:
      JULIA_NUM_THREADS: 8
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
  - name: summary
```
I've never used SLURM, so I don't know if this is applicable to cluster cases. I feel like dvc should delegate distributed computation tasks to other existing tools.
Do I understand correctly that if one step of the pipeline (e.g. retrieving webpages) requires 100 micro-tasks running in parallel, this step is a bad fit for doing within dvc? @dmpetrov in #755 (comment) describes 3 methods of parallelization, and I assume you need method 2 for this. Let me know if there is a workaround.
@turian The workaround is to simply parallelize it in your code (one stage in the dvc pipeline) and not create 100 micro-stages in the dvc pipeline.
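For example, a single stage whose command does its own fan-out might look roughly like this; the stage name, file names, and level of parallelism are hypothetical:

```bash
# One DVC stage; the downloads run in parallel inside the stage command,
# so the DAG stays small and the repo lock is only taken once.
dvc run -n download_pages \
    -d urls.txt \
    -o pages \
    'mkdir -p pages && xargs -a urls.txt -P 8 -n 1 wget -q -P pages'
```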
@efiop The issue is that each of the little jobs has a pipeline. Download webpage => extract text without boilerplate => further NLP.
I haven't read the complete conversation, but my 2c here is that while it sounds like an amazing feature, we should probably think about this several times. We've already had some ongoing performance issues (mainly around file system manipulation and network connectivity) which required pretty sophisticated implementations. These are expensive to develop and maintain. Do we want to also worry about multithreading in this repo? We can still provide a solution for this, just not necessarily in the DVC product (cc @dberenbaum). There's CML, for example, so you can probably set up a distributed CI system that runs DVC pipelines in parallel (cc @DavidGOrtega?) and get a bunch of other benefits. If needed, DVC could have small modifications to support this usage, or other alternative solutions (e.g. a plugin/extension independent from this repo). And we can document/blog about setting up systems for parallel execution of DVC pipelines.
That said, if a multithreading refactor could speed up regular operations automatically (more general than just for executing commands), that would definitely be a welcome enhancement.
Just want to see if parallel execution is still on the roadmap? By parallel execution, I mean executing the stages (jobs) in the defined pipeline (DAG) of a repo in parallel, instead of parallelizing multiple pipelines, just like what you can do with `make -j`. I appreciate what you guys have already done and hope to see more new functionality, but this would be the only thing unfulfilled for me to call dvc a real "git + make for data and machine learning projects".
@tweddielin Thanks for the kind words and your support! It's still something we'd like to do, but a lot of effort right now is focused on data optimization rather than pipeline optimization. Hopefully, we will have a chance to return to pipelines once we complete that work.
Hi, I may add another use case here. I work partly using an HPC featuring SLURM. Using SLURM, I can and should define task dependencies directly while queuing jobs. Assuming a graph like this:
I can schedule jobs of the form
with
So here I can run the independent parts of the graph in parallel. The only thing missing is a way to disable the dvc repo lock.
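The exact commands are not shown above, but a dependency chain like the one described might look roughly like this, assuming each `*.sbatch` script runs a single dvc stage (script names are illustrative):

```bash
# Queue downstream jobs so SLURM starts them only after their upstream job
# succeeds; independent branches (train_a, train_b) then run in parallel.
prep=$(sbatch --parsable prepare.sbatch)
train_a=$(sbatch --parsable --dependency=afterok:$prep train_a.sbatch)
train_b=$(sbatch --parsable --dependency=afterok:$prep train_b.sbatch)
sbatch --dependency=afterok:$train_a:$train_b evaluate.sbatch
```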
@tweddielin said above:
I second this 100%! DVC is amazing, but the lack of parallel execution of pipeline stages is disappointing. I'm working on a machine with many cores, and it would be much more efficient to be able to use something like parallel make. I've tried some workarounds. It seems to be possible to run individual stages in parallel,
but this fails because all the parallel dvc processes run into the repo lock. I've also used a lot of
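The attempted commands are not preserved above, but the kind of attempt described might look like this sketch, assuming GNU parallel and `dvc repro --single-item` (stage file names are made up); with the current per-repo lock, the concurrent dvc processes end up failing or waiting on each other:

```bash
# Try to reproduce three independent stages at once; each invocation runs
# only its own stage, but all of them compete for the single repo lock.
parallel -j 3 dvc repro --single-item ::: stage_a.dvc stage_b.dvc stage_c.dvc
```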
I am wondering whether the awesome dev team has decided whether this is going to happen. Adding a
@itcarroll It's not in our short-term plans for the rest of this year, but it's a highly requested feature, so it's still on the table, and there's no intent to close this issue without addressing it more directly.