Slurm-pipeline occasionally hangs (likely when not enough resources are available) #45

Closed
4 tasks done
koefoeden opened this issue Oct 10, 2024 · 7 comments

@koefoeden

koefoeden commented Oct 10, 2024

Prework

Description

Hi!
I'm using the SLURM scheduler and have issues with the pipeline hanging at certain targets. At first I thought it was associated with the new error = "trim" in targets, but that setting does not seem to be necessary for the bug to occur; I think it only makes the hang more frequent.

I have experimented a bit with the conditions required for the bug to occur, and the following very simple scenario seems to provoke it: dispatching targets on different controllers whose combined requests exceed the total resources (CPU cores or RAM) on the cluster/allowed node. This happens even if the second target depends on the first and should therefore run after it.

Reproducible example

Below, I have set up a reprex that exhausts the available resources given the --nodelist requirement.

library(targets)

# Controller "a" requests 100 of the 128 cores on the allowed node.
a_ctrl <- crew.cluster::crew_controller_slurm(
  name = "a",
  workers = 1,
  slurm_memory_gigabytes_required = 1,
  slurm_cpus_per_task = 100,
  # Artificially limit to this single empty node to not block my colleagues.
  script_lines = "#SBATCH --nodelist=esrumcmpn10fl"
)

# Controller "b" requests 30 cores; 100 + 30 exceeds the node's 128 cores.
b_ctrl <- crew.cluster::crew_controller_slurm(
  name = "b",
  workers = 1,
  slurm_memory_gigabytes_required = 1,
  slurm_cpus_per_task = 30,
  script_lines = "#SBATCH --nodelist=esrumcmpn10fl"
)

tar_option_set(controller = crew::crew_controller_group(a_ctrl, b_ctrl))

# Two independent targets, one per controller.
list(
  tar_target(
    name = a,
    command = sessionInfo(),
    resources = tar_resources(crew = tar_resources_crew(controller = "a"))
  ),
  tar_target(
    name = b,
    command = sessionInfo(),
    resources = tar_resources(crew = tar_resources_crew(controller = "b"))
  )
)

Expected result

The pipeline should be able to finish the jobs that it starts, one at a time within the resource limits - eventually finishing the whole pipeline.

Diagnostic information

Output from the pipeline:

▶ dispatched target a
▶ dispatched target b
● completed target a [0.05 seconds, 38.972 kilobytes]
... hangs indefinitely

Output from squeue
This output shows that the initial worker, which fits within the resource limits, is created and launches a SLURM job but is never shut down afterwards (even though it actually finishes its task). Shutting it down would free up resources and allow the other worker to launch and do its job.
[image: screenshot of the squeue output]

Session info:

R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.10 (Ootpa)

Matrix products: default
BLAS/LAPACK: /maps/direct/software/openblas/0.3.24/lib/libopenblasp-r0.3.24.so; LAPACK version 3.11.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Copenhagen
tzcode source: system (glibc)

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[9] ggplot2_3.5.1 tidyverse_2.0.0 targets_1.8.0.9002

loaded via a namespace (and not attached):
[1] crew.cluster_0.3.2.9005 utf8_1.2.4 generics_0.1.3
[4] renv_1.0.10 xml2_1.3.6 stringi_1.8.4
[7] hms_1.1.3 magrittr_2.0.3 grid_4.3.3
[10] timechange_0.3.0 autometric_0.0.5.9000 processx_3.8.4
[13] backports_1.5.0 secretbase_1.0.3 promises_1.3.0
[16] ps_1.8.0 fansi_1.0.6 scales_1.3.0
[19] crew_0.9.5.9012 codetools_0.2-19 cli_3.6.3
[22] rlang_1.1.4 munsell_0.5.1 withr_3.0.1
[25] yaml_2.3.10 tools_4.3.3 tzdb_0.4.0
[28] getip_0.1-4 nanonext_1.3.0 colorspace_2.1-1
[31] base64url_1.4 vctrs_0.6.5 R6_2.5.1
[34] lifecycle_1.0.4 pkgconfig_2.0.3 callr_3.7.6
[37] later_1.3.2 pillar_1.9.0 gtable_0.3.5
[40] Rcpp_1.0.13 data.table_1.16.0 glue_1.8.0
[43] xfun_0.48 tidyselect_1.2.1 knitr_1.48
[46] mirai_1.2.0 igraph_2.0.3 compiler_4.3.3

SHA-1 hash:
f22fd61

Essentially, it seems that there is some lack of communication between the controllers and the SLURM-system regarding cases of transiently limited resources and how to handle it.

@wlandau
Owner

wlandau commented Oct 10, 2024

> Exceeding the current resources available on the cluster

What kinds of resources exactly? Job quota, requested memory, something else?

I wonder if this is related to #1329.

If you are exceeding available resources, that sounds like a problem at the platform level, not the level of crew, crew.cluster, or targets.

@wlandau
Owner

wlandau commented Oct 10, 2024

I don't have a cluster that I can max out, but I did try to reproduce it locally, and the pipeline runs fine.

library(targets)
library(crew)

targets::tar_option_set(
  storage = "worker",
  retrieval = "worker",
  deployment = "worker",
  controller = crew_controller_group(
    crew_controller_local(workers = 20, name = "a"),
    crew_controller_local(workers = 20, name = "b")
  ),
  resources = targets::tar_resources(
    crew = targets::tar_resources_crew(controller = "a")
  )
)

list(
  tar_target(index, seq_len(100)),
  tar_target(a, {message("a"); Sys.sleep(5)}, pattern = map(index)),
  tar_target(
    b,
    {message("b"); a},
    pattern = map(a), 
    resources = tar_resources(crew = tar_resources_crew(controller = "b"))
  )
)

@koefoeden
Author

koefoeden commented Oct 11, 2024

> I don't have a cluster that I can max out, but I did try to reproduce it locally, and the pipeline runs fine.

Yes, it seems that it is a SLURM-specific issue, since I was also not able to reproduce it using the local-controller.

> What kinds of resources exactly? Job quota, requested memory, something else?

My initial reprex was with requested memory. Just tried with CPU-cores, and the result is the same.

> I wonder if this is related to #1329.

I can't seem to find this issue - can you provide me a link?

> If you are exceeding available resources, that sounds like a problem at the platform level, not the level of crew, crew.cluster, or targets.

Wouldn't this be an issue of how crew.cluster appropriately handles the finite resources on a slurm-cluster? I.e. it should be able to handle that tasks/targets are sometimes queued until more resources are available, and still finish the tasks that are started, eventually allowing the remaining tasks to start and finish?

@koefoeden
Author

koefoeden commented Oct 11, 2024

Please see the updated, much simpler reprex. I removed the branching points, Sys.sleep calls, and target-to-target dependencies, and the issue still remains. Also, I deleted some wrong comments and added clarifications.

@wlandau
Owner

wlandau commented Oct 11, 2024

> My initial reprex was with requested memory.

I have a new package called autometric to prospectively log resource usage like memory. It is integrated with development crew: https://wlandau.github.io/crew/articles/logging.html
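
A minimal sketch of what such logging might look like with a local controller, following the linked logging article. It assumes a development version of crew that provides crew_options_metrics() and crew_options_local(); the controller name, the "worker_logs" directory, and the one-second interval are illustrative choices, and argument names may differ between versions.

```r
library(crew)

# Assumption: crew_options_metrics() and crew_options_local() are available
# in the installed (development) version of crew.
controller <- crew_controller_local(
  name = "example",
  workers = 1,
  # Write each worker's log to a file in this directory.
  options_local = crew_options_local(log_directory = "worker_logs"),
  # Have the worker print resource metrics to its standard output
  # (captured in the log file above) once per second.
  options_metrics = crew_options_metrics(
    path = "/dev/stdout",
    seconds_interval = 1
  )
)

controller$start()
controller$push(name = "task", command = Sys.sleep(3))
controller$wait()
controller$terminate()

# Parse the logged metrics into a data frame.
metrics <- autometric::log_read("worker_logs")
head(metrics)
```

Logs like these record what a worker actually used, which is a different question from what the SLURM job requested.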

> Wouldn't this be an issue of how crew.cluster appropriately handles the finite resources on a slurm-cluster?

No. crew requests resources based on how many tasks need to be done and how many workers you allow with the workers argument. It cannot find out what a system is capable of in the general case because there are too many different systems to track. Even if it does, it is poor practice to max out the resources on a system, whether on a shared cluster or on a local machine which also needs to perform interactive tasks. Users are responsible for setting a sensible value for workers.

> Please see the updated, much simpler reprex.

This does help me understand what is happening. b_ctrl cannot launch a worker because a_ctrl is already running one, and a_ctrl does not relinquish its worker because it uses the defaults seconds_idle = Inf and tasks_max = Inf. Either seconds_idle = 10 or tasks_max = 1 should resolve the deadlock.
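
A minimal sketch of that suggestion applied to the reprex from the original post, assuming the same node name and core counts. Only seconds_idle = 10 is new; tasks_max = 1 is shown as a commented-out alternative. The helper make_ctrl() is hypothetical and exists only to avoid repeating arguments.

```r
library(targets)

# Hypothetical helper (not part of crew.cluster): builds one SLURM controller
# with the settings from the reprex plus an idle timeout.
make_ctrl <- function(name, cpus) {
  crew.cluster::crew_controller_slurm(
    name = name,
    workers = 1,
    seconds_idle = 10, # worker exits after 10 idle seconds, ending its SLURM job
    # tasks_max = 1,   # alternative: worker exits after completing one task
    slurm_memory_gigabytes_required = 1,
    slurm_cpus_per_task = cpus,
    script_lines = "#SBATCH --nodelist=esrumcmpn10fl"
  )
}

tar_option_set(
  controller = crew::crew_controller_group(
    make_ctrl("a", cpus = 100),
    make_ctrl("b", cpus = 30)
  )
)

list(
  tar_target(
    name = a,
    command = sessionInfo(),
    resources = tar_resources(crew = tar_resources_crew(controller = "a"))
  ),
  tar_target(
    name = b,
    command = sessionInfo(),
    resources = tar_resources(crew = tar_resources_crew(controller = "b"))
  )
)
```

With a finite idle timeout, the worker for controller "a" should exit shortly after target a completes, which releases its 100 cores so that SLURM can start the queued 30-core job for controller "b".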

wlandau removed the type: bug label Oct 11, 2024
@koefoeden
Author

> I have a new package called autometric to prospectively log resource usage like memory. It is integrated with development crew: https://wlandau.github.io/crew/articles/logging.html

Cool! I'll make sure to check it out!

> No. crew requests resources based on how many tasks need to be done and how many workers you allow with the workers argument. It cannot find out what a system is capable of in the general case because there are too many different systems to track. Even if it does, it is poor practice to max out the resources on a system, whether on a shared cluster or on a local machine which also needs to perform interactive tasks. Users are responsible for setting a sensible value for workers.

I agree that it is poor practice to max out the cluster. However, it is not unrealistic for this to happen once in a while, because it is a shared cluster that others might occasionally max out. But this is also a moot point if it can be fixed by setting the seconds_idle parameter - I'll report back!

@koefoeden
Author

Can confirm that seconds_idle = 10 fixed this particular issue - thanks!

wlandau closed this as completed Oct 11, 2024