
EnTK hangs on Traverse when using multiple Nodes #138

Open
lsawade opened this issue Feb 8, 2021 · 69 comments

lsawade commented Feb 8, 2021

Hi,

I don't know whether this is related to #135. It is weird because I got everything running on a single node, but as soon as I use more than one node, EnTK seems to hang. I checked the submission script and it looks fine to me, and so does the node list.

The workflow already hangs in the submission of the first task, which is a single core, single thread task.

EnTK session: re.session.traverse.princeton.edu.lsawade.018666.0003
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.traverse.princeton.edu.lsawade.018666.0003]           \
database   : [mongodb://specfm:****@129.114.17.185/specfm]                    ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   princeton.traverse       90 cores      12 gpus           ok
All components created
create unit managerUpdate: pipeline.0000 state: SCHEDULING
Update: pipeline.0000.WriteSourcesStage state: SCHEDULING
Update: pipeline.0000.WriteSourcesStage.WriteSourcesTask state: SCHEDULING
Update: pipeline.0000.WriteSourcesStage.WriteSourcesTask state: SCHEDULED
Update: pipeline.0000.WriteSourcesStage state: SCHEDULED
MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
                                                           ok
submit: ########################################################################
Update: pipeline.0000.WriteSourcesStage.WriteSourcesTask state: SUBMITTING

[Ctrl + C]

close unit manager                                                            ok
...

Stack

  python               : /home/lsawade/.conda/envs/ve-entk/bin/python3
  pythonpath           : 
  version              : 3.8.2
  virtualenv           : ve-entk

  radical.entk         : [email protected]
  radical.gtod         : 1.5.0
  radical.pilot        : 1.5.12
  radical.saga         : 1.5.9
  radical.utils        : 1.5.12

Client zip

client.session.zip

Session zip

sandbox.session.zip

@andre-merzky

Hi @lsawade - this is a surprising one. The task stderr shows:

$ cat *err
srun: Job 126172 step creation temporarily disabled, retrying (Requested nodes are busy)

This one does look like a slurm problem. Is this reproducible?

andre-merzky self-assigned this Feb 8, 2021

lsawade commented Feb 8, 2021

Reproduced! The step creation message appears after a while: I kept checking the task's error file, and eventually the message showed up!


andre-merzky commented Feb 8, 2021

@lsawade, would you please open a ticket with Traverse support? Maybe our srun command is not well-formed for Traverse's Slurm installation? Please include the srun command:

/usr/bin/srun --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --nodelist=/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.018666.0003/pilot.0000/unit.000000//unit.000000.nodes --export=ALL,NODE_LFS_PATH="/tmp" write-sources "-f" "/tigress/lsawade/entkdatabase/C200709121110A/C200709121110A.cmt" "-p" "/home/lsawade/gcmt3d/workflow/params" 

and the nodelist file, which just contains:

traverse-k04g10


lsawade commented Feb 8, 2021

It throws the following error:

srun: error: Unable to create step for job 126202: Requested node configuration is not available

If I take out the nodelist argument, it runs.

@andre-merzky

Hmm, is that node name not valid somehow?


lsawade commented Feb 8, 2021

I tried running it with the node name as a string and that worked:

/usr/bin/srun --nodelist=traverse-k05g10 --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --export=ALL,NODE_LFS_PATH="/tmp" write-sources "-f" "/tigress/lsawade/entkdatabase/C200709121110A/C200709121110A.cmt" "-p" "/home/lsawade/gcmt3d/workflow/params"

Note that I'm using salloc and hence a different node name.


lsawade commented Feb 8, 2021

I found the solution. When Slurm takes a file for the node list, one has to use the --nodefile option:

/usr/bin/srun --nodefile=nodelistfile --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --export=ALL,NODE_LFS_PATH="/tmp" write-sources "-f" "/tigress/lsawade/entkdatabase/C200709121110A/C200709121110A.cmt" "-p" "/home/lsawade/gcmt3d/workflow/params"
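
In other words, a minimal sketch of the three variants as observed here (node name, file path, and the trailing options are placeholders):

# pointing --nodelist at a file path hung or failed on Traverse
srun --nodelist=/path/to/task.nodes --ntasks 1 ... <executable>

# --nodelist with a plain node name works ...
srun --nodelist=traverse-k04g10 --ntasks 1 ... <executable>

# ... and so does --nodefile with a file that lists one node name per line
srun --nodefile=/path/to/task.nodes --ntasks 1 ... <executable>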

@andre-merzky

Oh! Thanks for tracking that down, we'll fix this!


lsawade commented Feb 9, 2021

It is puzzling, though, that srun doesn't throw an error here. When I do it by hand, srun throws an error when I feed a nodelist file to the --nodelist= option.

@andre-merzky

@lsawade : the fix has been released, please let us know if that problem still happens!


lsawade commented Mar 19, 2021

@andre-merzky, will test!


lsawade commented Apr 2, 2021

Sorry for the extraordinarily late feedback, but the issue seems to persist. It already hangs on the Hello, World task.
Did I update correctly?


My stack:

  python               : /home/lsawade/.conda/envs/ve-entk/bin/python3
  pythonpath           : 
  version              : 3.8.2
  virtualenv           : ve-entk

  radical.entk         : 1.6.0
  radical.gtod         : 1.5.0
  radical.pilot        : 1.6.2
  radical.saga         : 1.6.1
  radical.utils        : 1.6.2

My script:

from radical.entk import Pipeline, Stage, Task, AppManager
import traceback, sys, os


hostname = os.environ.get('RMQ_HOSTNAME', 'localhost')
port = int(os.environ.get('RMQ_PORT', 5672))
password = os.environ.get('RMQ_PASSWORD', None)
username = os.environ.get('RMQ_USERNAME', None)


specfem = "/scratch/gpfs/lsawade/MagicScripts/specfem3d_globe"

if __name__ == '__main__':
    p = Pipeline()

    # Hello World########################################################
    test_stage = Stage()
    test_stage.name = "HelloWorldStage"

    # Create 'Hello world' task
    t = Task()
    t.cpu_reqs = {'cpu_processes': 1, 'cpu_process_type': None, 'cpu_threads': 1, 'cpu_thread_type': None}
    t.pre_exec = ['module load openmpi/gcc']
    t.name = "HelloWorldTask"
    t.executable = '/bin/echo'
    t.arguments = ['Hello world!']
    t.download_output_data = ['STDOUT', 'STDERR']

    # Add task to stage and stage to pipeline
    test_stage.add_tasks(t)
    p.add_stages(test_stage)

    #########################################################    
    specfem_stage = Stage()
    specfem_stage.name = 'SimulationStage'
    
    for i in range(2):

        # Create Task
        t = Task()
        t.name = f"SIMULATION.{i}"
        tdir = f"/home/lsawade/simple_entk_specfem/specfem_run_{i}"
        t.pre_exec = [
            # Load necessary modules
            'module load openmpi/gcc',
            'module load cudatoolkit/11.0',
            
            # Change to your specfem run directory
            f'rm -rf {tdir}',
            f'mkdir {tdir}',
            f'cd {tdir}',
            
            # Create data structure in place
            f'ln -s {specfem}/bin .',
            f'ln -s {specfem}/DATABASES_MPI .',
            f'cp -r {specfem}/OUTPUT_FILES .',
            'mkdir DATA',
            f'cp {specfem}/DATA/CMTSOLUTION ./DATA/',
            f'cp {specfem}/DATA/STATIONS ./DATA/',
            f'cp {specfem}/DATA/Par_file ./DATA/'
        ]
        t.executable = './bin/xspecfem3D'
        t.cpu_reqs = {'cpu_processes': 4, 'cpu_process_type': 'MPI', 'cpu_threads': 1, 'cpu_thread_type' : 'OpenMP'}
        t.gpu_reqs = {'gpu_processes': 4, 'gpu_process_type': 'MPI', 'gpu_threads': 1, 'gpu_thread_type' : 'CUDA'}
        t.download_output_data = ['STDOUT', 'STDERR']

        # Add task to stage
        specfem_stage.add_tasks(t)
        
    p.add_stages(specfem_stage)
        
    res_dict = {
        'resource': 'princeton.traverse', # 'local.localhost',
        'schema'   : 'local',
        'walltime':  20, #2 * 30,
        'cpus': 16, #2 * 10 * 1,
        'gpus': 8, #2 * 4 * 2,          
    }

    appman = AppManager(hostname=hostname, port=port, username=username, password=password, resubmit_failed=False)
    appman.resource_desc = res_dict
    appman.workflow = set([p])
    appman.run()        
    

Tarball:

sandbox.tar.gz


andre-merzky commented Apr 2, 2021

Bugger... the code, though, is using --nodefile=:

$ grep srun task.0000.sh
task.0000.sh:/usr/bin/srun --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --nodefile=/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.018719.0005/pilot.0000/task.0000//task.0000.nodes --export=ALL,NODE_LFS_PATH="/tmp" /bin/echo "Hello world!" 

but that task indeed never returns. Does that line work on an interactive node? FWIW, task.0000.nodes contains:

$ cat task.0000.nodes 
traverse-k02g1


lsawade commented Apr 2, 2021

Yes, in interactive mode, after changing the nodefile to the node I land on, it works.

Edit: In my interactive job I'm using one node only, let me try with two...

Update

It also works when using the two nodes in the interactive job and editing task.0000.nodes to contain one of the accessible nodes. Either node works, so this does not seem to be the problem.

@andre-merzky

Hmm, where does that leave us... so it is not the srun command format that is at fault after all?

Can you switch your workload to, say, /bin/date, to make sure we are not looking in the wrong place and that the application code behaves as expected when run under EnTK?


lsawade commented Apr 2, 2021

> Would you mind running one more test: interactively get two nodes, and run the command towards the other node than the one you land on.

See the Update above.

> You should see the allocated nodes via cat $SLURM_NODEFILE or something like that (env | grep SLURM will be helpful)

echo $SLURM_NODELIST works; I don't seem to have a nodefile environment variable.
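
For reference, a small sketch of expanding that compact node list into a one-host-per-line file (the scontrol pattern is the same one used in the batch scripts further below; the output file name is arbitrary):

scontrol show hostnames "${SLURM_NODELIST}" > nodes.list
cat nodes.list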


lsawade commented Apr 2, 2021

What do you mean by switching my workload to /bin/date?


lsawade commented Apr 2, 2021

I also tested running the entire task.0000.sh in interactive mode, and it had no problem.


mturilli commented Apr 2, 2021

Slurm on Traverse seems to be working in a strange way. Lucas is in contact with the research computing support at Princeton.


lsawade commented Apr 3, 2021

Two things that have come up:

  1. The srun command needs a -G0 flag (no GPUs) if a non-GPU task is executed within a resource set that contains GPUs. The command hangs only if the resource set contains GPUs; otherwise it runs.
  2. Make sure your print statement does not contain an unescaped !. My hello world task also ran into issues because I didn't properly escape the ! in "Hello, World!". Use "Hello, World\!" instead (see the sketch below). facepalm
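
A minimal sketch of both points (the -G0 flag and the escaping are taken from the two items above; the echo workload is just a placeholder):

# 1) non-GPU task inside an allocation that contains GPUs: explicitly request zero GPUs
srun -G0 --ntasks=1 --cpus-per-task=1 /bin/echo "hello"

# 2) escape the '!' so the shell's history expansion does not mangle the string
/bin/echo "Hello, World\!"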


lsawade commented Apr 3, 2021

Most of the quick debugging discussion was held on Slack, but here is a summary for posterity:
@andre-merzky published a quick fix for the srun command on one of the RP branches (radical-cybertools/radical.pilot@aee4fb8), but an error is now raised by EnTK when calling into the pilot.


Error

EnTK session: re.session.traverse.princeton.edu.lsawade.018720.0001
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.traverse.princeton.edu.lsawade.018720.0001]           \
database   : [mongodb://specfm:****@129.114.17.185/specfm]                    ok
All components terminated
Traceback (most recent call last):
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/execman/rp/resource_manager.py", line 147, in _submit_resource_request
    self._pmgr    = rp.PilotManager(session=self._session)
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/pilot/pilot_manager.py", line 93, in __init__
    self._pilots_lock = ru.RLock('%s.pilots_lock' % self._uid)
AttributeError: 'PilotManager' object has no attribute '_uid'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 428, in run
    self._rmgr._submit_resource_request()
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/execman/rp/resource_manager.py", line 194, in _submit_resource_request
    raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError: 'PilotManager' object has no attribute '_uid'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "solver.py", line 104, in <module>
    appman.run()        
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 459, in run
    raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError: 'PilotManager' object has no attribute '_uid'

Stack

  python               : /home/lsawade/.conda/envs/ve-entk/bin/python3
  pythonpath           : 
  version              : 3.8.2
  virtualenv           : ve-entk
  radical.entk         : 1.6.0
  radical.gtod         : 1.5.0
  radical.pilot        : 1.6.2-v1.6.2-78-gaee4fb886@fix-hpc_wf_138
  radical.saga         : 1.6.1
  radical.utils        : 1.6.2

@andre-merzky

My apologies, that error is now fixed in RP.


lsawade commented Apr 3, 2021

Getting a new one again!

EnTK session: re.session.traverse.princeton.edu.lsawade.018720.0008
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.traverse.princeton.edu.lsawade.018720.0008]           \
database   : [mongodb://specfm:****@129.114.17.185/specfm]                    ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   princeton.traverse       16 cores       8 gpus           ok
closing session re.session.traverse.princeton.edu.lsawade.018720.0008          \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
session lifetime: 16.1s                                                       ok
wait for 1 pilot(s)
              0                                                          timeout
All components terminated
Traceback (most recent call last):
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 428, in run
    self._rmgr._submit_resource_request()
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/execman/rp/resource_manager.py", line 177, in _submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/home/lsawade/thirdparty/python/radical.pilot/src/radical/pilot/pilot.py", line 558, in wait
    time.sleep(0.1)
KeyboardInterrupt

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "solver.py", line 104, in <module>
    appman.run()        
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.8/site-packages/radical/entk/appman/appmanager.py", line 453, in run
    raise KeyboardInterrupt from ex
KeyboardInterrupt


lsawade commented Apr 30, 2021

So, the way I install EnTK and the pilot at the moment is as follows:

# Install EnTK
conda create -n conda-entk python=3.7 -c conda-forge -y
conda activate conda-entk
pip install radical.entk

(Note: I'm not changing the pilot here, just keeping the default one.)

Then I clone the radical.pilot repo to create the static ve.rp. Log out, log in:

# Create environment
module load anaconda3
conda create -n ve -y python=3.7
conda activate ve

# Install Pilot
git clone [email protected]:radical-cybertools/radical.pilot.git
cd radical.pilot
pip install .

# Create static environment
./bin/radical-pilot-create-static-ve -p /scratch/gpfs/$USER/ve.rp/

Log out, Log in:

conda activate conda-entk
python workflow.py
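
As a quick sanity check after the install, the versions can be printed with radical-stack (the same command whose output is quoted elsewhere in this thread):

conda activate conda-entk
radical-stack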


lsawade commented May 6, 2021

Is there any news here?


lsawade commented May 11, 2021

@andre-merzky ?


lsawade commented May 13, 2021

Alright, I got the workflow manager to -- at least -- run. Not hanging, yay

One of the issues is that when I create the static environment using radical-pilot-create-static-ve, it does not install any dependencies, so I installed all the requirements into the ve.rp myself.
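
For completeness, a hypothetical sketch of what installing the requirements into the static VE could look like; the exact package set is not listed here, so treat this as an assumption:

# assumption: ve.rp is a regular virtualenv created by radical-pilot-create-static-ve
source /scratch/gpfs/$USER/ve.rp/bin/activate
pip install radical.pilot   # assumed to pull in the pilot's runtime dependencies
deactivate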

However, I'm sort of back to square one. A serial task executes, task.0000.out has "Hello World" in it, and the log shows that task.0000 does return with exit code 0, but the workflow manager marks it as failed, and task.0000.err contains the following line:

cpu-bind=MASK - traverse-k01g10, task 0 0 [140895]: mask 0xf set

I'll attach the tarball.

It is also important to note that the manager seems to stop scheduling further tasks upon failure of the first one. I wasn't able to find anything about it in the log.


sandbox.tar.gz

@andre-merzky

Hey @lsawade - the reason for this behavior eludes me completely. I can confirm that the same is observed on at least one other Slurm cluster (Expanse @ SDSC), and I opened a ticket there to hopefully get some useful feedback. At the moment I simply don't know how we can possibly resolve this. I am really sorry for that; I understand that this has been blocking progress for several months now :-/


lsawade commented Oct 15, 2021

Yeah, I have had a really long thread with the people from the research computing group, and they did not understand why this is not working either. Maybe we should contact the Slurm people?

@andre-merzky

Yes, I think we should resort to that. I'll open a ticket if the XSEDE support is not able to suggest a solution within a week.

@andre-merzky

We got some useful feedback from XSEDE after all: Slurm indeed seems to be unable to do correct auto-placement for non-node-local tasks. I find this surprising, and it may still be worthwhile to open a Slurm ticket about this. Either way though: a workaround is to start the job with a specific node file. From your example above:

srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 0 &
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 1 &

should work as expected with

export SLURM_HOSTFILE=host1.list
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 --distribution=arbitrary show_devices.sh 0 &
export SLURM_HOSTFILE=host2.list
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 --distribution=arbitrary show_devices.sh 1 &

where the host file looks like, for example:

$ cat host2.list
exp-1-57
exp-1-57
exp-6-58
exp-6-58
exp-6-58
exp-6-58

Now, that brings us back to RP / EnTK: we actually do use a hostfile, we just miss the --distribution=arbitrary flag. Before we include that, could you please confirm that the above in fact also works on Traverse?


lsawade commented Oct 21, 2021

Hi @andre-merzky,

I have been playing with this and I can't seem to get it to work. I explain what I do here:
https://github.com/lsawade/slurm-job-step-shared-res

I'm not sure whether it's me or Traverse.

Can you adjust this mini example to see whether it runs on XSEDE? Things you would have to change are the automatic writing of the hostfile and how many tasks per job step. If you give me the hardware setup of XSEDE, I could also adjust the script and give you something that should run out of the box to check.


andre-merzky commented Oct 25, 2021

The hardware setup on Expanse is really similar to Traverse: 4 GPUs/node.

I pasted something incorrect above, apologies! Too many scripts lying around :-/ The --gpus=6 flag was missing. Here is the corrected version, showing the same syntax working for both cyclic and block distributions:

This is the original script:

$ cat test2.slurm
#!/bin/bash

#SBATCH -t00:10:00
#SBATCH --account UNC100
#SBATCH --nodes 3
#SBATCH --gpus 12
#SBATCH -n 12
#SBATCH --output=test2.out
#SBATCH --error=test2.out

my_srun() {
  export SLURM_HOSTFILE="$1"
  srun -n 6 --gpus=6 --cpus-per-task=1 --gpus-per-task=1 --distribution=arbitrary show_devices.sh 
}

cyclic() {
  scontrol show hostnames "${SLURM_JOB_NODELIST}"  > host1.cyclic.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" >> host1.cyclic.list
  
  scontrol show hostnames "${SLURM_JOB_NODELIST}"  > host2.cyclic.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" >> host2.cyclic.list
  
  my_srun host1.cyclic.list > cyclic.1.out 2>&1 &
  my_srun host2.cyclic.list > cyclic.2.out 2>&1 &
  wait
}

block() {
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1  > host1.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 >> host1.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 >> host1.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 >> host1.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 >> host1.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 >> host1.block.list
  
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1  > host2.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 >> host2.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
  scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
  
  my_srun host1.block.list > block.1.out 2>&1 &
  my_srun host2.block.list > block.2.out 2>&1 &
  wait
}

block
cyclic

These are the resulting node files:

$ for f in *list; do echo $f; cat $f; echo; done
host1.block.list
exp-6-57
exp-6-57
exp-6-57
exp-6-57
exp-6-59
exp-6-59

host1.cyclic.list
exp-6-57
exp-6-59
exp-10-58
exp-6-57
exp-6-59
exp-10-58

host2.block.list
exp-6-59
exp-6-59
exp-10-58
exp-10-58
exp-10-58
exp-10-58

host2.cyclic.list
exp-6-57
exp-6-59
exp-10-58
exp-6-57
exp-6-59
exp-10-58

and these the resulting outputs:

$ for f in *out; do echo $f; cat $f; echo; done
block.1.out
6664389.1.2 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.1 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.0 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.3 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.5 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.1.4 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.1.2 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.1.1 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.1.0 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.1.3 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.1.4 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.1.5 STOP  Mon Oct 25 02:54:52 PDT 2021

block.2.out
6664389.0.1 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.0.2 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.0 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.0.3 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.5 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.4 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.0 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.0.1 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.0.4 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.0.3 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.0.5 STOP  Mon Oct 25 02:54:52 PDT 2021
6664389.0.2 STOP  Mon Oct 25 02:54:52 PDT 2021

cyclic.1.out
6664389.2.3 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.2.2 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.2.4 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.2.5 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.2.0 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.2.1 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.2.3 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.2.2 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.2.4 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.2.0 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.2.1 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.2.5 STOP  Mon Oct 25 02:55:02 PDT 2021

cyclic.2.out
6664389.3.3 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.3.5 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.3.4 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.3.0 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.3.2 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.3.1 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.3.5 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.3.3 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.3.4 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.3.2 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.3.1 STOP  Mon Oct 25 02:55:02 PDT 2021
6664389.3.0 STOP  Mon Oct 25 02:55:02 PDT 2021


lsawade commented Oct 25, 2021

So, I have some good news: I have also tested this on Andes, and it definitely works there as well. I added a batch_andes.sh batch script to the repo to test the arbitrary distribution for cyclic and block, with nodes [1,2], [1,2] and [1], [1,2,2], respectively.

The annoying news is that it does not seem to work on Traverse. At least I was able to confirm that it is not a user error...

So, how do we proceed? I'm sure it's a setting in the Slurm setup. Do we open a ticket with the Andes/Expanse support? I'll for sure open a ticket with PICSciE and see whether they can find a solution.

UPDATE:

The unexpected/unwanted output on Traverse:

block.1.out
srun: Job 258710 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 258710
258710.3.3 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.2 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.0 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.1 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.5 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g3: 0
258710.3.4 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g3: 0
258710.3.0 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.1 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.2 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.3 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.4 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.5 STOP Mon Oct 25 19:41:25 EDT 2021
block.2.out
258710.2.0 START Mon Oct 25 19:39:24 EDT 2021 @ traverse-k05g3: 0
258710.2.1 START Mon Oct 25 19:39:24 EDT 2021 @ traverse-k05g3: 0
258710.2.0 STOP Mon Oct 25 19:40:24 EDT 2021
258710.2.1 STOP Mon Oct 25 19:40:24 EDT 2021
cyclic.1.out
258710.0.1 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g3: 0
258710.0.0 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g2: 0
258710.0.3 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g3: 0
258710.0.2 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g2: 0
258710.0.1 STOP Mon Oct 25 19:38:23 EDT 2021
258710.0.3 STOP Mon Oct 25 19:38:23 EDT 2021
258710.0.0 STOP Mon Oct 25 19:38:23 EDT 2021
258710.0.2 STOP Mon Oct 25 19:38:23 EDT 2021
cyclic.2.out
srun: Job 258710 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 258710
258710.1.0 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g2: 0
258710.1.1 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g3: 0
258710.1.2 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g2: 0
258710.1.3 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g3: 0
258710.1.0 STOP Mon Oct 25 19:39:24 EDT 2021
258710.1.2 STOP Mon Oct 25 19:39:24 EDT 2021
258710.1.1 STOP Mon Oct 25 19:39:24 EDT 2021
258710.1.3 STOP Mon Oct 25 19:39:24 EDT 2021

Does it almost look like there is a misunderstanding between Slurm and CUDA, i.e. the visible devices should not all be the same CUDA_VISIBLE_DEVICES?

PS: I totally stole the way you made the block and cyclic functions as well as the printing. Why did I not think of that...?


lsawade commented Oct 26, 2021

Ok I can run things on Traverse using this setup. But there are some things I have learnt:

On Traverse, to not give a job step the entire CPU affinity of the involved nodes, I have to use the --exclusive flag in srun, which indicates that certain cpus/cores are used exclusively by that job step and nothing else.

Furthermore, I cannot use --cpus-per-task=1, which makes a lot of sense; the CPU affinity prints should have rung a bell for me. I feel dense.

So, at request time, I ask SBATCH like so:

#SBATCH -n 8
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-task=1

and then

srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive script.sh

or even

srun --ntasks=4 --distribution=arbitrary --exclusive script.sh

would work.

What does not work is the following:

...
#SBATCH -n 8
#SBATCH -G 8
srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive 

For some reason, I cannot request a pool of GPUs and take from it.

@andre-merzky

> Ok I can run things on Traverse using this setup. But there are some things I have learnt:
> ...
> For some reason, I cannot request a pool of GPUs and take from it.

I am not sure I appreciate the distinction - isn't 'this setup' also using GPUs from a pool of requested GPUs?

Given the first statement (I can run things on Traverse using this setup), it sounds like we should encode just this in RP to get you running on Traverse, correct?


lsawade commented Nov 3, 2021

Well, I'm not quite sure. It seems to me that if I request #SBATCH --gpus-per-task=1, I already prescribe how many GPUs a task uses, which worries me. Maybe it's a misunderstanding on my end...


andre-merzky commented Nov 8, 2021

This batch script here does not use that directive. The sbatch only needs to provision the right number of nodes; the per-task parameters should not matter (even if you need to specify them in your case for some reason), as we overwrite them in the srun commands anyway?


lsawade commented Nov 8, 2021

Exactly! But this does not seem to work!

#SBATCH -n 4
#SBATCH --gpus-per-task=1

srun -n 4 --gpus-per-task=1 a.o

works;

#SBATCH -n 4
#SBATCH --gpus=4

srun -n 4 --gpus-per-task=1 a.o

does not work!


Unless, I'm making a dumb mistake ...

@andre-merzky

Sorry, I have not worked on this further yet.

@andre-merzky

Hi @lsawade - I still can't make sense of it and wasn't able to reproduce it on other Slurm clusters :-( But either way, please do give the RS branch fix/traverse (radical-cybertools/radical.saga#840) a try. It now hardcodes the #SBATCH --gpus-per-task=1 for Traverse.


lsawade commented Jan 21, 2022

Hi @andre-merzky - So, I was getting errors in the submission, and I finally had a chance to go through the log. I found the error: the submitted SBATCH script can't work like this:

#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=32
#SBATCH --gpus-per-task=1
#SBATCH -J "pilot.0000"
#SBATCH -D "/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.019013.0000/pilot.0000/"
#SBATCH --output "bootstrap_0.out"
#SBATCH --error "bootstrap_0.err"
#SBATCH --partition "test"
#SBATCH --time 00:20:00

In this case, you are asking for 32 GPUs on a single node (32 tasks per node with one GPU per task), while a Traverse node only has 4 GPUs. I have no solution for this, because the alternative, requesting 4 tasks, seems stupid. And the research computing staff seem to be immovable in terms of the Slurm settings on Traverse.

@andre-merzky

We discussed this topic on this week's devel call. At this point we are inclined not to support Traverse: the Slurm configuration on Traverse contradicts the Slurm documentation, and also how other Slurm deployments work. To support Traverse we would basically have to break support on other Slurm resources.
We could in principle create a separate slurm_traverse launch method and pilot launcher in RP to accommodate the machine. That, however, is a fair amount of effort: not insurmountable, but still quite some work. Let's discuss on the HPC-Workflows call how to handle this. Maybe there is also a chance to iterate with the admins (although we wanted to stay out of the business of dealing with system admins directly :-/ )


mturilli commented Feb 4, 2022

We will have to write an executor specific to Traverse. This will require allocating specific resources, and we will report back once we have had some internal discussion. RADICAL remains available to discuss the configuration of new machines, in case that is useful/needed. Meanwhile, Lucas is using Summit while waiting for Traverse to become viable with EnTK.


lsawade commented Feb 8, 2022

@andre-merzky

Today I was working on something completely separate, but -- again -- I had issues with Traverse even for an embarrassingly parallel submission. It turned out that there seems to be an issue with how hardware threads are assigned.

If I just ask for --ntasks=5, I will not get 5 physical cores of the Power9 CPU, but rather 4 hardware threads from one core and 1 hardware thread from another.
So, the CPU pool on Traverse by default has size 128 per node (32 physical cores with 4 hardware threads each). I have to use the following to truly access 5 physical cores:

#SBATCH --ntasks=5
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1

I will check whether this has an impact on how we are assigning the tasks during submission.

Just an additional example to build understanding:

This

#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1

is OK.

This

#SBATCH --nodes=1
#SBATCH --ntasks=33
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1

is not OK.
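
To spell out the arithmetic behind these two examples (using the 128-CPU pool per node mentioned above, i.e. 32 physical cores with 4 hardware threads each): 32 tasks with --cpus-per-task=4 need 32 x 4 = 128 CPUs and just fit on one node, whereas 33 tasks need 33 x 4 = 132 CPUs, which exceeds the pool.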


lsawade commented Feb 11, 2022

I have confirmed my suspicions, and I have finally found a resource and task description that definitely works. Test scripts are located here: traverse-slurm-repo. But I will summarize below:

The sbatch header:

#!/bin/bash
#SBATCH -t00:05:00
#SBATCH -N 2
#SBATCH -n 64
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1
#SBATCH --output=mixed_gpu.txt
#SBATCH --reservation=test
#SBATCH --gres=gpu:4

So, in the sbatch header, I'm explicitly asking for 64 tasks (32 per node), where each task has access to 4 CPUs. In Slurm language, Power9 hardware threads apparently count as CPUs; hence, each physical core has to be assigned 4 CPUs. Then I also specify that each core is only assigned a single task. Finally, instead of implicitly specifying some notion of GPU need somewhere, I simply tell Slurm I want the 4 GPUs in each node with --gres=gpu:4.

If you want to provide the hostfile you will have to decorate the srun command as follows:

# Define Hostfile
export SLURM_HOSTFILE=<some_hostfile with <N> entries>

# Run command
srun --ntasks=<N> --gpus-per-task=1 --cpus-per-task=4 --ntasks-per-core=1 --distribution=arbitrary <my-executable>

dropping the --gpus-per-task if none are needed. Otherwise, if you want to let slurm handle the resource allocation, the following works as well:

srun --ntasks=$1 --gpus-per-task=1 --cpus-per-task=4 --ntasks-per-core=1 <my-executable>

again, dropping the --gpus-per-task if none are needed.
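
Putting the pieces together, a minimal sketch of a complete job script built from the header and the decorated srun above (walltime, the per-task host assignment, and the executable name are placeholders):

#!/bin/bash
#SBATCH -t 00:05:00
#SBATCH -N 2
#SBATCH -n 64
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1
#SBATCH --gres=gpu:4

# build a hostfile with one host name per task; here all 4 tasks are
# (arbitrarily) pinned to the first allocated node
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1  > hosts.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 >> hosts.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 >> hosts.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 >> hosts.list
export SLURM_HOSTFILE=hosts.list

# decorated srun as above; ./my-executable is a placeholder
srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --ntasks-per-core=1 \
     --distribution=arbitrary ./my-executable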

From past experience, I think this is relatively easy to put into EnTK?

@andre-merzky

@lsawade - thanks for your patience! In radical-saga and radical-pilot, you should now find two branches named fix/issue_138_hpcwf. They hopefully implement the right special cases for Traverse to work as expected. Would you please give them a try? Thank you!


lsawade commented Apr 1, 2022

Will give it a whirl!


lsawade commented Apr 6, 2022

@andre-merzky, I find the branch in the pilot repo but not in saga? Should I just use fix/traverse for saga?

@andre-merzky

@lsawade : Apologies, I missed a push for the branch... It should be there now in RS also.

@andre-merzky

Hey @lsawade - did you have the chance to look into this again?


lsawade commented May 2, 2022

Sorry, @andre-merzky, I thought I had updated the issue before I started driving on Friday...

So, the issue persists. An error is still thrown because the generated batch script uses --cpus_per_task (with underscores) instead of --cpus-per-task.
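
For reference, the invalid directive from the sbatch error below, next to the valid Slurm spelling:

#SBATCH --cpus_per_task=4    # rejected: "sbatch: unrecognized option"
#SBATCH --cpus-per-task=4    # valid form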


  python               : /home/lsawade/.conda/envs/conda-entk/bin/python3
  pythonpath           : 
  version              : 3.7.12
  virtualenv           : conda-entk

  radical.entk         : 1.14.0
  radical.gtod         : 1.13.0
  radical.pilot        : 1.13.0-v1.13.0-149-g211a82593@fix-issue_138_hpcwf
  radical.saga         : 1.13.0-v1.13.0-1-g7a950d53@fix-issue_138_hpcwf
  radical.utils        : 1.14.0

$ cat re.session.traverse.princeton.edu.lsawade.019111.0001/radical.log | grep -b10 ERROR | head -20
136162-1651239844.198 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : write: [   84] [   82] (cd ~ && "/usr/bin/cp" -v  "/tmp/rs_pty_staging_f19k3a1g.tmp" "tmp_jp8rdthi.slurm"\n)
136348-1651239844.202 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : read : [   84] [   60] ('/tmp/rs_pty_staging_f19k3a1g.tmp' -> 'tmp_jp8rdthi.slurm'\n)
136511-1651239844.244 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : read : [   84] [    1] ($)
136615-1651239844.244 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : copy done: ['/tmp/rs_pty_staging_f19k3a1g.tmp', '$']
136745-1651239844.245 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : flush: [   83] [     ] (flush pty read cache)
136868-1651239844.346 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : run_sync: sbatch 'tmp_jp8rdthi.slurm'; echo rm -f 'tmp_jp8rdthi.slurm'
137016-1651239844.347 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : write: [   83] [   61] (sbatch 'tmp_jp8rdthi.slurm'; echo rm -f 'tmp_jp8rdthi.slurm'\n)
137181-1651239844.352 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : read : [   83] [   91] (sbatch: unrecognized option '--cpus_per_task=4'\nTry "sbatch --help" for more information\n)
137375-1651239844.352 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : read : [   83] [   36] (rm -f tmp_jp8rdthi.slurm\nPROMPT-0->)
137514-1651239844.352 : radical.saga.cpi     : 715866 : 35185202950512 : DEBUG    : submit SLURM script (tmp_jp8rdthi.slurm) (0)
137636:1651239844.352 : radical.saga.cpi     : 715866 : 35185202950512 : ERROR    : NoSuccess: Couldn't get job id from submitted job! sbatch output:
137779-sbatch: unrecognized option '--cpus_per_task=4'
137827-Try "sbatch --help" for more information
137868-rm -f tmp_jp8rdthi.slurm
137893-
137894:1651239844.354 : pmgr_launching.0000  : 715866 : 35184434934128 : ERROR    : bulk launch failed
137990-Traceback (most recent call last):
138025-  File "/home/lsawade/.conda/envs/conda-entk/lib/python3.7/site-packages/radical/pilot/pmgr/launching/default.py", line 405, in work
138158-    self._start_pilot_bulk(resource, schema, pilots)
138211-  File "/home/lsawade/.conda/envs/conda-entk/lib/python3.7/site-packages/radical/pilot/pmgr/launching/default.py", line 609, in _start_pilot_bulk


mtitov commented May 17, 2022

@lsawade hi Lucas, can you please give it another try? That was a typo in the option setup and it has been fixed in that branch, so the stack would look like this:

% radical-stack           

  python               : /Users/mtitov/.miniconda3/envs/test_rp/bin/python3
  pythonpath           : 
  version              : 3.7.12
  virtualenv           : test_rp

  radical.entk         : 1.14.0
  radical.gtod         : 1.13.0
  radical.pilot        : 1.14.0-v1.14.0-119-ga6886ca58@fix-issue_138_hpcwf
  radical.saga         : 1.13.0-v1.13.0-9-g1875aa88@fix-issue_138_hpcwf
  radical.utils        : 1.14.0

@andre-merzky

@lsawade : ping :-)
