Always write number of nodes on expanse #827

Merged
cbkerr merged 3 commits into main from fix/expanse-nodes-on-shared on Feb 29, 2024


@cbkerr (Member) commented Feb 26, 2024

Description

Expanse expects the `--nodes` option on all partitions, including `shared`. The documentation only hints at this through its example scripts: https://www.sdsc.edu/support/user_guides/expanse.html

Using the Shared Partition

#!/bin/bash
#SBATCH -p shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=40G
#SBATCH -t 01:00:00
#SBATCH -J job.8
#SBATCH -A <<project*>>
#SBATCH -o job.8.%j.%N.out
#SBATCH -e job.8.%j.%N.err
#SBATCH --export=ALL

export SLURM_EXPORT_ENV=ALL

module purge
module load cpu
module load gcc
module load mvapich2
module load slurm

srun -n 8 ../hello_mpi
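
A quick pre-submission check can catch a generated script that lacks the node-count directive before `sbatch` rejects it. The following is a minimal sketch, not part of flow: the function name `has_nodes_directive` and the sample script strings are hypothetical, and the check only inspects the first option on each `#SBATCH` line.

```python
import re

# Matches "#SBATCH --nodes=1", "#SBATCH --nodes 1", "#SBATCH -N 1" and "#SBATCH -N1".
# Assumption: the node count is the first option on its #SBATCH line.
_NODES_DIRECTIVE = re.compile(
    r"^#SBATCH[ \t]+(?:--nodes(?:=|[ \t]+)|-N[ \t]*)\S+", re.MULTILINE
)


def has_nodes_directive(script_text: str) -> bool:
    """Return True if the script contains an #SBATCH node-count directive."""
    return _NODES_DIRECTIVE.search(script_text) is not None


# Hypothetical script headers, modeled on the shared-partition example above.
with_nodes = (
    "#!/bin/bash\n"
    "#SBATCH -p shared\n"
    "#SBATCH --nodes=1\n"
    "#SBATCH --ntasks-per-node=8\n"
)
without_nodes = (
    "#!/bin/bash\n"
    "#SBATCH -p shared\n"
    "#SBATCH --ntasks=8\n"
)

print(has_nodes_directive(with_nodes))
print(has_nodes_directive(without_nodes))
```

The second script is the kind that triggers the `bank_limit plugin` error shown under Motivation and Context.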

Motivation and Context

Was getting this error:

$ python project.py submit --partition=shared --template "myexpanse.sh" -n 1
Querying scheduler...
Submitting cluster job 'Project/e8f01a59fb6540081c69624e3f6d39f1/run/2497f00b00d4ee4751b82804f9c9f360':
 - Group: run(e8f01a59fb6540081c69624e3f6d39f1)
Traceback (most recent call last):
  File "/home/cbkerr/miniconda3/lib/python3.9/site-packages/flow/scheduling/base.py", line 216, in _call_submit
    subprocess.check_output(submit_cmd, stderr=subprocess.STDOUT, text=True)
  File "/home/cbkerr/miniconda3/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/cbkerr/miniconda3/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sbatch', '/tmp/tmpjuv4dllz']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/expanse/lustre/scratch/cbkerr/temp_project/microswimmer-assembly/signac/project.py", line 2056, in <module>
    Project().main()
  File "/home/cbkerr/miniconda3/lib/python3.9/site-packages/flow/project.py", line 5230, in main
    args.func(args)
  File "/home/cbkerr/miniconda3/lib/python3.9/site-packages/flow/project.py", line 4941, in _main_submit
    self.submit(jobs=aggregates, names=names, **kwargs)
  File "/home/cbkerr/miniconda3/lib/python3.9/site-packages/flow/project.py", line 4227, in submit
    status = self._submit_operations(
  File "/home/cbkerr/miniconda3/lib/python3.9/site-packages/flow/project.py", line 4144, in _submit_operations
    return self._environment.submit(
  File "/home/cbkerr/miniconda3/lib/python3.9/site-packages/flow/environment.py", line 346, in submit
    if cls.get_scheduler().submit(script, flags=flags, *args, **kwargs):
  File "/home/cbkerr/miniconda3/lib/python3.9/site-packages/flow/scheduling/slurm.py", line 148, in submit
    return _call_submit(submit_cmd, script, pretend)
  File "/home/cbkerr/miniconda3/lib/python3.9/site-packages/flow/scheduling/base.py", line 218, in _call_submit
    raise SubmitError(
flow.errors.SubmitError: Error when calling submission command sbatch:
sbatch: error: bank_limit plugin: Please set the nodes parameter
sbatch: error: Batch job submission failed: Requested operation not supported on this system
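
The traceback shows two layers: `subprocess.check_output` raises `CalledProcessError` when `sbatch` exits non-zero, and the scheduler code re-raises it with the captured output attached. A minimal sketch of that pattern follows. The `SubmitError` class and `call_submit` function here are hypothetical stand-ins, not flow's actual code, and a Python one-liner plays the role of a failing `sbatch` so the example runs anywhere.

```python
import subprocess
import sys


class SubmitError(RuntimeError):
    """Hypothetical stand-in for flow.errors.SubmitError."""


def call_submit(submit_cmd):
    """Run a submission command and surface its output on failure."""
    try:
        # Merge stderr into stdout so scheduler error messages are captured.
        return subprocess.check_output(
            submit_cmd, stderr=subprocess.STDOUT, text=True
        )
    except subprocess.CalledProcessError as error:
        raise SubmitError(
            f"Error when calling submission command {submit_cmd[0]}:\n"
            f"{error.output}"
        ) from error


# Simulated failing scheduler call (assumption: stands in for sbatch).
fake_sbatch = [
    sys.executable,
    "-c",
    "import sys; "
    "print('error: Please set the nodes parameter', file=sys.stderr); "
    "sys.exit(1)",
]

try:
    call_submit(fake_sbatch)
except SubmitError as error:
    print(error)
```

Because stderr is redirected into the captured output, the scheduler's own message ("Please set the nodes parameter") ends up in the final exception text, which is what made the root cause visible here.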

Checklist:

@cbkerr cbkerr added the bug (Something isn't working) and environments (Extending or updating supported environments) labels Feb 26, 2024
@cbkerr cbkerr requested review from a team as code owners February 26, 2024 17:48
@cbkerr cbkerr requested review from tcmoore3 and syjlee and removed request for a team February 26, 2024 17:48
@joaander (Member) commented

You need to run tests/generate_template_reference_data.py to update the test reference data. If you have an account in $HOME/.signacrc, temporarily remove that first.

@cbkerr cbkerr requested a review from joaander February 28, 2024 20:55
@joaander (Member) left a comment

Thanks!

Perhaps in #819 we should make this change in the base slurm template. I don't think this would cause a problem on any systems we support. flow requires an explicit partition selection and with homogeneous partitions, we can compute the number of nodes correctly.
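
With homogeneous partitions, the node count follows from the total core request by ceiling division. The sketch below illustrates the arithmetic only. It is not flow's implementation, the function name `compute_num_nodes` is hypothetical, and the value of 128 cores per node is an assumption used for illustration.

```python
def compute_num_nodes(num_tasks: int, cpus_per_task: int, cores_per_node: int) -> int:
    """Return the smallest node count that provides the requested cores.

    Assumes every node in the partition has the same number of cores.
    """
    if num_tasks < 1 or cpus_per_task < 1 or cores_per_node < 1:
        raise ValueError("all arguments must be positive integers")
    total_cores = num_tasks * cpus_per_task
    # Integer ceiling division avoids floating-point rounding.
    return (total_cores + cores_per_node - 1) // cores_per_node


# Assumed node size for illustration only.
CORES_PER_NODE = 128

print(compute_num_nodes(8, 1, CORES_PER_NODE))    # 8 cores fit on one node
print(compute_num_nodes(129, 1, CORES_PER_NODE))  # one core over a full node
print(compute_num_nodes(64, 4, CORES_PER_NODE))   # 256 cores fill two nodes
```

A shared-partition job that requests fewer cores than one node provides still yields a node count of 1, which is the value the `--nodes` directive needs.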

@cbkerr cbkerr enabled auto-merge (squash) February 29, 2024 20:59
@cbkerr cbkerr merged commit 42e8884 into main Feb 29, 2024
9 checks passed
@cbkerr cbkerr deleted the fix/expanse-nodes-on-shared branch February 29, 2024 21:42
@cbkerr cbkerr added this to the v0.29.0 milestone Mar 21, 2024