This issue was moved to a discussion.

Issue running with Singularity on SLURM #33

Closed
6 of 7 tasks
drejom opened this issue Jan 10, 2024 · 1 comment

drejom commented Jan 10, 2024

Prework

  • Read and agree to the Contributor Code of Conduct and contributing guidelines.
  • Confirm that your issue is a genuine bug in the crew.cluster package itself and not a user error, known limitation, or issue from another package that crew.cluster depends on.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • Post a minimal reproducible example like this one so the maintainer can troubleshoot the problems you identify. A reproducible example is:
    • Runnable: post enough R code and data so any onlooker can create the error on their own computer.
    • Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
    • Readable: format your code according to the tidyverse style guide.

Description

After updating a slew of packages recently, my SLURM-enabled targets pipeline has stopped running, with errors about seconds_timeout. I have a rather elaborate script to set up cluster operations, but I think I've narrowed the problem down to crew_controller_slurm(), so I am only posting that here:

Reproducible example

# working on it

Apologies @wlandau, my initial example was not in fact reproducible. While I see if I can make a minimal example, does this {targets} error give any clues as to what's going on? It occurs with or without seconds_timeout set in crew.cluster::crew_controller_slurm().

Error:
! Error running targets::tar_make()
Error messages: targets::tar_meta(fields = error, complete_only = TRUE)
Debugging guide: https://books.ropensci.org/targets/debugging.html
How to ask for help: https://books.ropensci.org/targets/help.html
Last error message:
    all(is.numeric(.)) && all(length(.) == 1L) && all(!anyNA(.)) && all(. >= 0) is not true on . = seconds_timeout
Last error traceback:
    tryCatch(withCallingHandlers({ NULL saveRDS(do.call(do.call, c(readRDS("...
    tryCatchList(expr, classes, parentenv, handlers)
    tryCatchOne(tryCatchList(expr, names[-nh], parentenv, handlers[-nh]), na...
    doTryCatch(return(expr), name, parentenv, handler)
    tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
    tryCatchOne(expr, names, parentenv, handlers[[1L]])
    doTryCatch(return(expr), name, parentenv, handler)
    withCallingHandlers({ NULL saveRDS(do.call(do.call, c(readRDS("/tmp/Rtmp...
    saveRDS(do.call(do.call, c(readRDS("/tmp/RtmpgNN5X8/callr-fun-136c8f4b59...
    do.call(do.call, c(readRDS("/tmp/RtmpgNN5X8/callr-fun-136c8f4b59c78e"), ...
    (function (what, args, quote = FALSE, envir = parent.frame()) { if (!is....
    (function (targets_function, targets_arguments, options, envir = NULL, s...
    tryCatch(out <- withCallingHandlers(targets::tar_callr_inner_try(targets...
    tryCatchList(expr, classes, parentenv, handlers)
    tryCatchOne(expr, names, parentenv, handlers[[1L]])
    doTryCatch(return(expr), name, parentenv, handler)
    withCallingHandlers(targets::tar_callr_inner_try(targets_function = targ...
    targets::tar_callr_inner_try(targets_function = targets_function, target...
    do.call(targets_function, targets_arguments)
    (function (pipeline, path_store, names_quosure, shortcut, reporter, seco...
    crew_init(pipeline = pipeline, meta = meta_init(path_store = path_store)...
    self$run_crew()
    self$iterate()
    if_any(queue$should_dequeue(), self$process_target(queue$dequeue()), sel...
    self$controller$wait(mode = "one", seconds_interval = interval, seconds_...
    if_any(identical(mode, "one"), private$.wait_one(controllers = control, ...
    private$.wait_one(controllers = control, seconds_interval = seconds_inte...
    crew_retry(fun = ~{ if (scale) { walk(controllers, ~.x$scale(throttle = ...
    crew_assert(seconds_timeout, is.numeric(.), length(.) == 1L, !anyNA(.), ...
    crew_error(message %|||% out)
    crew_stop(message = message, class = c("crew_error", "crew"))
    rlang::abort(message = message, class = class, call = emptyenv())
    signal_abort(cnd, .file)
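
The condition quoted in the last error message can be evaluated on its own to see which kinds of values would trip it. The following is a minimal base-R sketch, assuming a hypothetical helper named `is_valid_timeout()` that restates the logged condition. It is not crew's own implementation, and it does not show which value the pipeline actually passed.

```r
# Hypothetical helper (not part of crew): restates the condition from the
# error message, "is.numeric, length 1, no NA, and >= 0".
is_valid_timeout <- function(x) {
  is.numeric(x) && length(x) == 1L && !anyNA(x) && x >= 0
}

is_valid_timeout(3)         # TRUE:  a single non-negative number
is_valid_timeout(NULL)      # FALSE: NULL is not numeric
is_valid_timeout(NA_real_)  # FALSE: missing value
is_valid_timeout(-1)        # FALSE: negative
is_valid_timeout(c(1, 2))   # FALSE: length is not 1
is_valid_timeout("3")       # FALSE: character, not numeric
```

Any of the failing cases above would produce the same "is not true on . = seconds_timeout" message, so the message alone does not say which one occurred.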

Diagnostic information

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] crew.cluster_0.2.0 targets_1.4.1      lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2        readr_2.1.4        tidyr_1.3.0       
[10] tibble_3.2.1       ggplot2_3.4.4      tidyverse_2.0.0   

loaded via a namespace (and not attached):
 [1] utf8_1.2.4         generics_0.1.3     stringi_1.8.3      hms_1.1.3          digest_0.6.33      magrittr_2.0.3     grid_4.3.1         timechange_0.2.0   jsonlite_1.8.8    
[10] processx_3.8.3     backports_1.4.1    ps_1.7.5           fansi_1.0.6        scales_1.3.0       crew_0.8.0         codetools_0.2-19   cli_3.6.2          rlang_1.1.2       
[19] munsell_0.5.0      withr_2.5.2        yaml_2.3.8         parallel_4.3.1     tools_4.3.1        tzdb_0.4.0         getip_0.1-4        nanonext_0.11.0    colorspace_2.1-0  
[28] base64url_1.4      vctrs_0.6.5        R6_2.5.1           lifecycle_1.0.4    pkgconfig_2.0.3    callr_3.7.3        pillar_1.9.0       gtable_0.3.4       glue_1.6.2        
[37] data.table_1.14.10 xfun_0.41          tidyselect_1.2.0   knitr_1.45         mirai_0.11.3       igraph_1.6.0       compiler_4.3.1    
drejom commented Jan 10, 2024

Ok, so downgrading {targets} to 1.2.2 solved things for now and I can run my analysis.

However, I see a number of changes in 1.3.0 which I suspect account for the error.
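
For anyone comparing their own setup against the versions mentioned in this thread (1.4.1 fails, 1.2.2 works, 1.3.0 is the suspected release), here is a small base-R sketch. The helper `is_possibly_affected()` is hypothetical, and the 1.3.0 threshold is only the suspicion stated above, not a confirmed cause.

```r
# Suspected release, taken from the comment above (an assumption, not confirmed).
suspected <- package_version("1.3.0")

# Hypothetical helper: TRUE when a version string is at or after that release.
is_possibly_affected <- function(version) {
  package_version(version) >= suspected
}

is_possibly_affected("1.4.1")  # TRUE:  the version in the sessionInfo() output
is_possibly_affected("1.2.2")  # FALSE: the version that works after downgrading

# To check the locally installed version (requires targets to be installed):
# is_possibly_affected(as.character(utils::packageVersion("targets")))
```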

I can run the targets-minimal pipeline without issue, but when I include the following to run it on SLURM, I get errors.

library(targets)

# Name of the current node, passed to the controller as `host`.
nodename <- Sys.info()["nodename"]

# Shell text passed to `script_lines`; it invokes `singularity exec`.
# Note: `base_dir` is not defined in this excerpt.
singularity_exec <- glue::glue("cd {here::here()} \\
/{base_dir}/easy-build/software/singularity/3.7.0/bin/singularity exec \\
--env R_LIBS_USER=~/R/bioc-3.17 \\
--env R_LIBS_SITE=/{base_dir}/singularity/shared_cache/rbioc/rlibs/bioc-3.17 \\
-B /{base_dir}/singularity,/ref_genomes,/scratch \\
/{base_dir}/singularity/shared_cache/rbioc/vscode-rbioc_3.17.sif \\")

slurm <- crew.cluster::crew_controller_slurm(
  host = nodename,
  script_lines = singularity_exec
)

tar_option_set(
  controller = slurm,
  resources = tar_resources(
    crew = tar_resources_crew(seconds_timeout = 3)
  )
)
targets::tar_make()
▶ dispatched target raw_data_file
▶ completed pipeline [6.776 seconds]
Error:
! Error running targets::tar_make()
Error messages: targets::tar_meta(fields = error, complete_only = TRUE)
Debugging guide: https://books.ropensci.org/targets/debugging.html
How to ask for help: https://books.ropensci.org/targets/help.html
Last error message:
    target NA error: 'errorValue' int 5 | Timed out
Last error traceback:
    tryCatch(withCallingHandlers({ NULL saveRDS(do.call(do.call, c(readRDS("...
    tryCatchList(expr, classes, parentenv, handlers)
    tryCatchOne(tryCatchList(expr, names[-nh], parentenv, handlers[-nh]), na...
    doTryCatch(return(expr), name, parentenv, handler)
    tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
    tryCatchOne(expr, names, parentenv, handlers[[1L]])
    doTryCatch(return(expr), name, parentenv, handler)
    withCallingHandlers({ NULL saveRDS(do.call(do.call, c(readRDS("/tmp/Rtmp...
    saveRDS(do.call(do.call, c(readRDS("/tmp/RtmpgCy87w/callr-fun-15f2fb7d6d...
    do.call(do.call, c(readRDS("/tmp/RtmpgCy87w/callr-fun-15f2fb7d6d7032"), ...
    (function (what, args, quote = FALSE, envir = parent.frame()) { if (!is....
    (function (targets_function, targets_arguments, options, envir = NULL, s...
    tryCatch(out <- withCallingHandlers(targets::tar_callr_inner_try(targets...
    tryCatchList(expr, classes, parentenv, handlers)
    tryCatchOne(expr, names, parentenv, handlers[[1L]])
    doTryCatch(return(expr), name, parentenv, handler)
    withCallingHandlers(targets::tar_callr_inner_try(targets_function = targ...
    targets::tar_callr_inner_try(targets_function = targets_function, target...
    do.call(targets_function, targets_arguments)
    (function (pipeline, path_store, names_quosure, shortcut, reporter, seco...
    crew_init(pipeline = pipeline, meta = meta_init(path_store = path_store)...
    self$run_crew()
    self$iterate()
    self$conclude_worker_task()
    tar_assert_all_na(result$error, msg = paste("target", result$name, "erro...
    tar_throw_validate(msg %|||% default)
    tar_error(message = paste0(...), class = c("tar_condition_validate", "ta...
    rlang::abort(message = message, class = class, call = tar_empty_envir)
    signal_abort(cnd, .file)

If I remove the resources section from tar_option_set():

resources = tar_resources(
        crew = tar_resources_crew(seconds_timeout = 3)
        )

I get no error, but the pipeline never progresses beyond dispatching the first target:

targets::tar_make()
▶ dispatched target raw_data_file
/

Apologies if I'm missing something obvious, but are you able to provide any insight?

drejom changed the title from "Issue running on SLURM" to "Issue running with Singularity on SLURM" on Jan 10, 2024
Repository owner locked and limited conversation to collaborators on Jan 16, 2024
wlandau converted this issue into discussion #35 on Jan 16, 2024
