Slurm-pipeline occasionally hangs (likely when not enough resources are available) #45
Comments
What kinds of resources exactly? Job quota, requested memory, something else? I wonder if this is related to #1329. If you are exceeding available resources, that sounds like a problem at the platform level, not at the level of crew.cluster.
I don't have a cluster that I can max out, but I did try to reproduce it locally, and the pipeline runs fine.

library(targets)
library(crew)
targets::tar_option_set(
storage = "worker",
retrieval = "worker",
deployment = "worker",
controller = crew_controller_group(
crew_controller_local(workers = 20, name = "a"),
crew_controller_local(workers = 20, name = "b")
),
resources = targets::tar_resources(
crew = targets::tar_resources_crew(controller = "a")
)
)
list(
tar_target(index, seq_len(100)),
tar_target(a, {message("b"); Sys.sleep(5)}, pattern = map(index)),
tar_target(
b,
{message("b"); a},
pattern = map(a),
resources = tar_resources(crew = tar_resources_crew(controller = "b"))
)
)
Yes, it seems to be a SLURM-specific issue, since I was also not able to reproduce it using the local controller.
My initial reprex was with requested memory. I just tried with CPU cores, and the result is the same.
I can't seem to find that issue - can you provide a link?
Wouldn't this be an issue of how crew.cluster handles the finite resources on a SLURM cluster? I.e., it should be able to cope with tasks/targets sometimes being queued until more resources are available, still finish the tasks that have started, and eventually allow the remaining tasks to start and finish.
Please see the updated, much simpler reprex. I removed the branching points, Sys.sleep() calls, and target-target dependencies, and the issue still remains. I also deleted some incorrect comments and added clarifications.
I have a new package called
No.
This does help me understand what is happening.
Cool! I'll make sure to check it out!
I agree that it is poor practice to max out the cluster - however, it is not unrealistic for this to happen once in a while, because it is a shared cluster that others might occasionally max out. But this is also a moot point if it can be fixed by setting the seconds_idle parameter - I'll report back!
Can confirm that seconds_idle = 10 fixed this particular issue - thanks!
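For reference, here is a minimal sketch of where that setting goes. This is not the thread's original configuration; the SLURM controller type, controller names, worker counts, and the value of 10 seconds are illustrative assumptions only.

library(targets)
library(crew)
library(crew.cluster)

targets::tar_option_set(
  controller = crew_controller_group(
    crew_controller_slurm(
      name = "a",
      workers = 1,
      seconds_idle = 10 # idle workers exit and release their SLURM allocation
    ),
    crew_controller_slurm(
      name = "b",
      workers = 1,
      seconds_idle = 10
    )
  )
)

The key setting is seconds_idle: without it, an idle worker can hold on to its SLURM allocation indefinitely, which blocks the job that the other controller is trying to launch.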
Prework
I am reasonably sure this is a genuine bug in the crew.cluster package itself and not a user error, known limitation, or issue from another package that crew.cluster depends on.
Description
Hi!
I'm using the SLURM scheduler and have issues with the pipeline hanging at certain targets. At first I thought it was associated with the new error = "trim" option in targets, but that does not seem to be necessary for the bug to occur - I think it only makes the hang more frequent/easier to trigger.
I have experimented a bit with the conditions necessary for the bug to occur, and the following very simple scenario seems to provoke it: dispatching targets on different controllers whose combined resource requests (CPU cores or RAM) exceed the total available on the cluster/allowed node. This happens even if the second target depends on the first and therefore should run after the initial target.
Reproducible example
Below, I have set up a reprex that exhausts the available resources given the --nodelist requirement.
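The sketch below is an approximation of such a setup, not the exact script: the node name, CPU counts, and crew_controller_slurm() arguments such as slurm_cpus_per_task and script_lines are assumptions that depend on the crew.cluster version and the cluster at hand.

# _targets.R
library(targets)
library(crew)
library(crew.cluster)

targets::tar_option_set(
  storage = "worker",
  retrieval = "worker",
  deployment = "worker",
  controller = crew_controller_group(
    # Together, the two workers request more CPUs than the single node
    # allowed by --nodelist can provide.
    crew_controller_slurm(
      name = "a",
      workers = 1,
      slurm_cpus_per_task = 16,
      script_lines = "#SBATCH --nodelist=node01"
    ),
    crew_controller_slurm(
      name = "b",
      workers = 1,
      slurm_cpus_per_task = 16,
      script_lines = "#SBATCH --nodelist=node01"
    )
  ),
  resources = targets::tar_resources(
    crew = targets::tar_resources_crew(controller = "a")
  )
)

list(
  tar_target(a, Sys.getenv("SLURM_JOB_ID")),
  tar_target(
    b,
    a,
    resources = tar_resources(crew = tar_resources_crew(controller = "b"))
  )
)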
Expected result
The pipeline should be able to finish the jobs it starts, one at a time within the resource limits, eventually completing the whole pipeline.
Diagnostic information
Output from the pipeline:
Output from squeue:
This output shows how the initial worker that fit within the resource requirements was created and launched a SLURM job, but was never appropriately shut down (even though it actually finished its task), which would otherwise have freed up resources and allowed the other worker to launch and do its job.
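One way to cross-check this from the R side is targets::tar_crew(), which summarizes the crew workers recorded for the last tar_make() run (assuming a reasonably recent targets version; the exact columns vary). Comparing its output with squeue makes it easier to spot SLURM jobs that stay alive even though their workers have nothing left to run.

library(targets)
# Summarize the crew workers from the most recent tar_make() run
# (column names depend on the targets version).
tar_crew()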
Session info:
SHA-1 hash:
f22fd61
Essentially, it seems that there is some lack of communication between the controllers and the SLURM system regarding cases of transiently limited resources and how to handle them.