EnTK hangs on Traverse when using multiple Nodes #138
Hi @lsawade - this is a surprising one. The task's stderr shows:
$ cat *err
srun: Job 126172 step creation temporarily disabled, retrying (Requested nodes are busy)
This one does look like a Slurm problem. Is this reproducible? |
Reproduced! The step-creation message appears after a while: I kept checking the task's error file, and eventually the message showed up! |
@lsawade, would you please open a ticket with Traverse support? Maybe our srun command is not well-formed for Traverse's Slurm installation? Please include the srun command:
/usr/bin/srun --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --nodelist=/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.018666.0003/pilot.0000/unit.000000//unit.000000.nodes --export=ALL,NODE_LFS_PATH="/tmp" write-sources "-f" "/tigress/lsawade/entkdatabase/C200709121110A/C200709121110A.cmt" "-p" "/home/lsawade/gcmt3d/workflow/params"
and the nodelist file, which just contains:
|
It throws the following error:
If I take out the |
Hmm, is that node name not valid somehow? |
I tried running it with the nodename as a string and that worked
Note that I'm using salloc and hence a different nodename |
I found the solution. When Slurm takes in a file for a nodelist, one has to use the node file option, --nodefile, instead of --nodelist:
|
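For posterity, a minimal sketch of the difference (the node name and echo payload are stand-ins taken from later comments; run inside an allocation that contains that node):

# Write the node list to a file, as RP does for its tasks
echo "traverse-k02g1" > task.nodes

# Variant that did not work on Traverse: passing the file to --nodelist
srun --nodes 1 --ntasks 1 --nodelist=$PWD/task.nodes /bin/echo "Hello world!"

# Working variant: passing the file to --nodefile
srun --nodes 1 --ntasks 1 --nodefile=$PWD/task.nodes /bin/echo "Hello world!"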
Oh! Thanks for tracking that down, we'll fix this! |
It is puzzling though, that |
@lsawade : the fix has been released, please let us know if that problem still happens! |
@andre-merzky, will test! |
Sorry for the extraordinarily late feedback, but the issue seems to persist. It already hangs in the first (hello-world) task.

My stack:

python        : /home/lsawade/.conda/envs/ve-entk/bin/python3
pythonpath    :
version       : 3.8.2
virtualenv    : ve-entk
radical.entk  : 1.6.0
radical.gtod  : 1.5.0
radical.pilot : 1.6.2
radical.saga  : 1.6.1
radical.utils : 1.6.2

My script:

from radical.entk import Pipeline, Stage, Task, AppManager
import traceback, sys, os

hostname = os.environ.get('RMQ_HOSTNAME', 'localhost')
port = int(os.environ.get('RMQ_PORT', 5672))
password = os.environ.get('RMQ_PASSWORD', None)
username = os.environ.get('RMQ_USERNAME', None)

specfem = "/scratch/gpfs/lsawade/MagicScripts/specfem3d_globe"

if __name__ == '__main__':

    p = Pipeline()

    # Hello World ########################################################
    test_stage = Stage()
    test_stage.name = "HelloWorldStage"

    # Create 'Hello world' task
    t = Task()
    t.cpu_reqs = {'cpu_processes': 1, 'cpu_process_type': None, 'cpu_threads': 1, 'cpu_thread_type': None}
    t.pre_exec = ['module load openmpi/gcc']
    t.name = "HelloWorldTask"
    t.executable = '/bin/echo'
    t.arguments = ['Hello world!']
    t.download_output_data = ['STDOUT', 'STDERR']

    # Add task to stage and stage to pipeline
    test_stage.add_tasks(t)
    p.add_stages(test_stage)

    #########################################################
    specfem_stage = Stage()
    specfem_stage.name = 'SimulationStage'

    for i in range(2):

        # Create Task
        t = Task()
        t.name = f"SIMULATION.{i}"
        tdir = f"/home/lsawade/simple_entk_specfem/specfem_run_{i}"
        t.pre_exec = [
            # Load necessary modules
            'module load openmpi/gcc',
            'module load cudatoolkit/11.0',
            # Change to your specfem run directory
            f'rm -rf {tdir}',
            f'mkdir {tdir}',
            f'cd {tdir}',
            # Create data structure in place
            f'ln -s {specfem}/bin .',
            f'ln -s {specfem}/DATABASES_MPI .',
            f'cp -r {specfem}/OUTPUT_FILES .',
            'mkdir DATA',
            f'cp {specfem}/DATA/CMTSOLUTION ./DATA/',
            f'cp {specfem}/DATA/STATIONS ./DATA/',
            f'cp {specfem}/DATA/Par_file ./DATA/'
        ]
        t.executable = './bin/xspecfem3D'
        t.cpu_reqs = {'cpu_processes': 4, 'cpu_process_type': 'MPI', 'cpu_threads': 1, 'cpu_thread_type': 'OpenMP'}
        t.gpu_reqs = {'gpu_processes': 4, 'gpu_process_type': 'MPI', 'gpu_threads': 1, 'gpu_thread_type': 'CUDA'}
        t.download_output_data = ['STDOUT', 'STDERR']

        # Add task to stage
        specfem_stage.add_tasks(t)

    p.add_stages(specfem_stage)

    res_dict = {
        'resource': 'princeton.traverse',  # 'local.localhost',
        'schema'  : 'local',
        'walltime': 20,  # 2 * 30,
        'cpus'    : 16,  # 2 * 10 * 1,
        'gpus'    : 8,   # 2 * 4 * 2,
    }

    appman = AppManager(hostname=hostname, port=port, username=username, password=password, resubmit_failed=False)
    appman.resource_desc = res_dict
    appman.workflow = set([p])
    appman.run()
Tarball: |
Bugger... - the code though is now using --nodefile:

$ grep srun task.0000.sh
task.0000.sh:/usr/bin/srun --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --gpus-per-task 0 --nodefile=/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.018719.0005/pilot.0000/task.0000//task.0000.nodes --export=ALL,NODE_LFS_PATH="/tmp" /bin/echo "Hello world!"

but that task indeed never returns. Does that line work on an interactive node? FWIW:

$ cat task.0000.nodes
traverse-k02g1 |
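To answer the interactive-node question, one way to test it by hand might look like this (a sketch; the salloc parameters are placeholders and the srun line is a trimmed version of the generated one):

# Grab an interactive allocation (placeholders, adjust to the machine)
salloc --nodes=1 --ntasks=1 --time=00:10:00

# Point a nodefile at the node the allocation actually landed on
scontrol show hostnames "$SLURM_JOB_NODELIST" | head -1 > test.nodes

# Re-run the generated command against that nodefile
srun --exclusive --cpu-bind=none --nodes 1 --ntasks 1 --cpus-per-task 1 --nodefile=$PWD/test.nodes /bin/echo "Hello world!"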
Yes: in interactive mode, with the nodefile changed to the node I land on, it works.

Edit: In my interactive job I'm using one node only, let me try with two...

Update: It also works when using the two nodes in the interactive job and editing the nodefile accordingly. |
Hmm, where does that leave us... - so it is not the Can you switch your workload to, say, |
See Update above
|
What do you mean with switching my workload to |
I also tested running the entire |
Slurm on Traverse seems to be working in a strange way. Lucas is in contact with the research service at Princeton. |
Two things that have come up:
|
Most quick debugging discussions were held on Slack, but here is a summary for posterity:

Error

Stack
|
My apologies, that error is now fixed in RP. |
Getting a new one again!
|
So the way I install EnTK and the pilot at the moment is as follows:

# Install EnTK
conda create -n conda-entk python=3.7 -c conda-forge -y
conda activate conda-entk
pip install radical.entk

(Note, I'm not changing the pilot here and just keep the default one.)

Then, I get the pilot from source and create the static environment:

# Create environment
module load anaconda3
conda create -n ve -y python=3.7
conda activate ve

# Install Pilot
git clone git@github.com:radical-cybertools/radical.pilot.git
cd radical.pilot
pip install .

# Create static environment
./bin/radical-pilot-create-static-ve -p /scratch/gpfs/$USER/ve.rp/

Log out, log in:

conda activate conda-entk
python workflow.py |
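As an aside, the version listing further up looks like the output of the radical-stack helper (shipped with radical.utils, if I recall correctly), so a quick sanity check of what the client environment actually sees would be:

# Check the client-side RADICAL stack after installation
conda activate conda-entk
radical-stack   # prints python/virtualenv info and the radical.* versions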
Is there any news here? |
Alright, I got the workflow manager to -- at least -- run. Not hanging, yay! One of the issues is that when I create the static environment using However, I'm sort of back to square one. A serial task executes, and
I'll attach the tarball. It is also important to note that the manager seems to stop scheduling other jobs once the first task fails. I wasn't able to find anything about it in the log. |
Hey @lsawade - the reason for the behavior eludes me completely. I can confirm that the same is observed on at least one other Slurm cluster (Expanse @ SDSC), and I opened a ticket there to hopefully get some useful feedback. At the moment I simply don't know how we can possibly resolve this. I am really sorry for that; I understand that this has been blocking progress for several months now :-/ |
Yeah, I have had a really long thread with the people from the research computing group and they did not understand why this is not working either. Maybe we should contact the |
Yes, I think we should resort to that. I'll open a ticket if the XSEDE support is not able to suggest a solution within a week. |
We got some useful feedback from XSEDE after all: Slurm seems indeed to be unable to do correct auto-placement for non-node-local tasks. I find this surprising, and it may still be worthwhile to open a Slurm ticket about this. Either way though: a workaround is to start the job steps with a specific node file. From your example above:

srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 0 &
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 show_devices.sh 1 &

should work as expected when written as:

export SLURM_HOSTFILE=host1.list
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 --distribution=arbitrary show_devices.sh 0 &
export SLURM_HOSTFILE=host2.list
srun -n 6 --cpus-per-task 1 --gpus-per-task 1 --distribution=arbitrary show_devices.sh 1 &

where the host files look like, for example:

$ cat host2.list
exp-1-57
exp-1-57
exp-6-58
exp-6-58
exp-6-58
exp-6-58

Now, that brings us back to RP / EnTK: we actually do use a hostfile, we just miss out on --distribution=arbitrary (and the SLURM_HOSTFILE that goes with it). |
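For context, show_devices.sh itself is not shown in the thread; judging from the log lines further down (jobid.stepid.procid, START/STOP, host, visible GPUs), it is presumably something along these lines (a reconstruction, not the actual script; the sleep length clearly varies between runs):

#!/bin/bash
# Presumed shape of show_devices.sh, reconstructed from the outputs below
ID="${SLURM_JOB_ID}.${SLURM_STEP_ID}.${SLURM_PROCID}"
echo "$ID START $(date) @ $(hostname) : $CUDA_VISIBLE_DEVICES"
sleep 10
echo "$ID STOP $(date)"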
Hi @andre-merzky, I have been playing with this and I can't seem to get it to work. I explain what I do here: I'm not sure whether it's me or Traverse. Can you adjust this mini example to see whether it runs on XSEDE? Things you would have to change are the automatic writing of the hostfile and how many tasks per job step. If you give me the hardware setup of XSEDE, I could also adjust the script and give you something that should run out of the box to check. |
The hardware setup on Expanse is really similar to Traverse: 4 GPUs/node. I pasted something incorrect above, apologies! Too many scripts lying around :-/ This is the original script:

$ cat test2.slurm
#!/bin/bash
#SBATCH -t00:10:00
#SBATCH --account UNC100
#SBATCH --nodes 3
#SBATCH --gpus 12
#SBATCH -n 12
#SBATCH --output=test2.out
#SBATCH --error=test2.out
my_srun() {
export SLURM_HOSTFILE="$1"
srun -n 6 --gpus=6 --cpus-per-task=1 --gpus-per-task=1 --distribution=arbitrary show_devices.sh
}
cyclic() {
scontrol show hostnames "${SLURM_JOB_NODELIST}" > host1.cyclic.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" >> host1.cyclic.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" > host2.cyclic.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" >> host2.cyclic.list
my_srun host1.cyclic.list > cyclic.1.out 2>&1 &
my_srun host2.cyclic.list > cyclic.2.out 2>&1 &
wait
}
block() {
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 > host1.block.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 >> host1.block.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 >> host1.block.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -1 | tail -1 >> host1.block.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 >> host1.block.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 >> host1.block.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 > host2.block.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -2 | tail -1 >> host2.block.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -3 | tail -1 >> host2.block.list
my_srun host1.block.list > block.1.out 2>&1 &
my_srun host2.block.list > block.2.out 2>&1 &
wait
}
block
cyclic

These are the resulting node files:

$ for f in *list; do echo $f; cat $f; echo; done
host1.block.list
exp-6-57
exp-6-57
exp-6-57
exp-6-57
exp-6-59
exp-6-59
host1.cyclic.list
exp-6-57
exp-6-59
exp-10-58
exp-6-57
exp-6-59
exp-10-58
host2.block.list
exp-6-59
exp-6-59
exp-10-58
exp-10-58
exp-10-58
exp-10-58
host2.cyclic.list
exp-6-57
exp-6-59
exp-10-58
exp-6-57
exp-6-59
exp-10-58

and these are the resulting outputs:

$ for f in *out; do echo $f; cat $f; echo; done
block.1.out
6664389.1.2 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.1 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.0 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.3 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-57 : 0,1,2,3
6664389.1.5 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.1.4 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.1.2 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.1.1 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.1.0 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.1.3 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.1.4 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.1.5 STOP Mon Oct 25 02:54:52 PDT 2021
block.2.out
6664389.0.1 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.0.2 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.0 START Mon Oct 25 02:54:42 PDT 2021 @ exp-6-59 : 0,1
6664389.0.3 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.5 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.4 START Mon Oct 25 02:54:42 PDT 2021 @ exp-10-58 : 0,1,2,3
6664389.0.0 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.0.1 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.0.4 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.0.3 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.0.5 STOP Mon Oct 25 02:54:52 PDT 2021
6664389.0.2 STOP Mon Oct 25 02:54:52 PDT 2021
cyclic.1.out
6664389.2.3 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.2.2 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.2.4 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.2.5 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.2.0 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.2.1 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.2.3 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.2.2 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.2.4 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.2.0 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.2.1 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.2.5 STOP Mon Oct 25 02:55:02 PDT 2021
cyclic.2.out
6664389.3.3 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.3.5 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.3.4 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.3.0 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-57 : 0,1
6664389.3.2 START Mon Oct 25 02:54:52 PDT 2021 @ exp-10-58 : 0,1
6664389.3.1 START Mon Oct 25 02:54:52 PDT 2021 @ exp-6-59 : 0,1
6664389.3.5 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.3.3 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.3.4 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.3.2 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.3.1 STOP Mon Oct 25 02:55:02 PDT 2021
6664389.3.0 STOP Mon Oct 25 02:55:02 PDT 2021 |
So, I have some good news: I have also tested this on another cluster, where it works as expected. The annoying news is that it does not seem to work on Traverse. At least I was able to rule out a user error... So, how do we proceed? I'm sure it's a setting in the Slurm configuration on Traverse.

UPDATE: The unexpected/unwanted output on Traverse:

block.1.out
srun: Job 258710 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 258710
258710.3.3 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.2 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.0 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.1 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g2: 0
258710.3.5 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g3: 0
258710.3.4 START Mon Oct 25 19:40:25 EDT 2021 @ traverse-k05g3: 0
258710.3.0 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.1 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.2 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.3 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.4 STOP Mon Oct 25 19:41:25 EDT 2021
258710.3.5 STOP Mon Oct 25 19:41:25 EDT 2021
block.2.out
258710.2.0 START Mon Oct 25 19:39:24 EDT 2021 @ traverse-k05g3: 0
258710.2.1 START Mon Oct 25 19:39:24 EDT 2021 @ traverse-k05g3: 0
258710.2.0 STOP Mon Oct 25 19:40:24 EDT 2021
258710.2.1 STOP Mon Oct 25 19:40:24 EDT 2021
cyclic.1.out
258710.0.1 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g3: 0
258710.0.0 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g2: 0
258710.0.3 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g3: 0
258710.0.2 START Mon Oct 25 19:37:23 EDT 2021 @ traverse-k05g2: 0
258710.0.1 STOP Mon Oct 25 19:38:23 EDT 2021
258710.0.3 STOP Mon Oct 25 19:38:23 EDT 2021
258710.0.0 STOP Mon Oct 25 19:38:23 EDT 2021
258710.0.2 STOP Mon Oct 25 19:38:23 EDT 2021
cyclic.2.out
srun: Job 258710 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 258710
258710.1.0 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g2: 0
258710.1.1 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g3: 0
258710.1.2 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g2: 0
258710.1.3 START Mon Oct 25 19:38:24 EDT 2021 @ traverse-k05g3: 0
258710.1.0 STOP Mon Oct 25 19:39:24 EDT 2021
258710.1.2 STOP Mon Oct 25 19:39:24 EDT 2021
258710.1.1 STOP Mon Oct 25 19:39:24 EDT 2021
258710.1.3 STOP Mon Oct 25 19:39:24 EDT 2021

Does it almost look like there is a misunderstanding between Slurm and CUDA? The visible devices should not all be 0.

PS: I totally stole the way you made the block and cyclic functions as well as the printing. Why did I not think of that...? |
Ok, I can run things on Traverse using this setup. But there are some things I have learnt:

On Traverse, to not give a job step the entire CPU affinity of the involved nodes, I have to use the --exclusive flag on the step. Furthermore, I cannot use a plain GPU pool requested via -G/--gpus.

So, at request time, I write the SBATCH header like so:

#SBATCH -n 8
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-task=1

and then

srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive script.sh

or even

srun --ntasks=4 --distribution=arbitrary --exclusive script.sh

would work. What does not work is the following:

...
#SBATCH -n 8
#SBATCH -G 8
srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --distribution=arbitrary --exclusive

For some reason, I cannot request a pool of GPUs and take from it. |
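Put together, a minimal batch script following the working pattern described here might look like the sketch below (executable name and runtime are placeholders; the hostfile / --distribution=arbitrary part from the earlier comment is left out for brevity):

#!/bin/bash
#SBATCH -t 00:05:00
#SBATCH -n 8
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-task=1

# Two concurrent 4-task steps; --exclusive keeps each step on its own
# CPUs instead of inheriting the whole node's CPU affinity
srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --exclusive ./script.sh &
srun --ntasks=4 --gpus-per-task=1 --cpus-per-task=4 --exclusive ./script.sh &
wait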
I am not sure I appreciate the distinction - isn't 'this setup' also using GPUs from a pool of requested GPUs? Given the first statement ( |
Well, I'm not quite sure. It seems to me that if I request, |
This batch script here does not use that directive. The sbatch only needs to provision the right number of nodes - the |
Exactly! But this does not seem to work:

#SBATCH -n 4
#SBATCH --gpus-per-task=1
srun -n 4 --gpus-per-task=1 a.o

works;

#SBATCH -n 4
#SBATCH --gpus=4
srun -n 4 --gpus-per-task=1 a.o

does not work! Unless I'm making a dumb mistake ... |
Sorry, I have not worked on this any further yet. |
Hi @lsawade - I still can't make sense of it and wasn't able to reproduce it on other Slurm clusters :-( But either way, please do give the RS branch |
Hi @andre-merzky - So, I was getting errors in the submission, and I finally had a chance to go through the log. And, I found the error, the submitted
In this case, you are asking for 32 GPUs on a single node. I have no solution for this because the alternative, requesting |
We discussed this topic on this week's devel call. At this point we are inclined not to support Traverse: the Slurm configuration on Traverse contradicts the Slurm documentation and differs from how other Slurm deployments work. To support Traverse we would basically have to break support on other Slurm resources. |
We will have to write an executor specific to Traverse. This will require allocating specific resources, and we will report back once we have had some internal discussion. RADICAL remains available to discuss the configuration of new machines, in case that is useful/needed. Meanwhile, Lucas is using Summit while waiting for Traverse to become viable with EnTK. |
Today I was working on something completely separate, but -- again -- I had issues with Traverse, even for an embarrassingly parallel submission. It turned out that there seems to be an issue with how hardware threads are assigned. If I just ask for

#SBATCH --ntasks=5
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1

I will check whether this has an impact on how we are assigning the tasks during submission.

Just an additional example to build understanding: this

#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1

is OK. This

#SBATCH --nodes=1
#SBATCH --ntasks=33
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1

is not OK. (32 tasks at 4 CPUs each exactly fill one node's 128 hardware threads; the 33rd task no longer fits, which would explain the cutoff.) |
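If the 32-vs-33 cutoff really is the hardware-thread limit, the node layout should confirm it; two standard ways to check (the node name is taken from the outputs above, nothing Traverse-specific assumed):

# Hardware threads, cores and SMT level as Slurm sees them for one node
scontrol show node traverse-k05g2 | grep -E 'CPUTot|CoresPerSocket|ThreadsPerCore'

# Or, from inside a job on the node
lscpu | grep -E '^CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket'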
I have confirmed my suspicions. I have finally found a resource and task description that definitely works. Test scripts are located here: traverse-slurm-repo, but I will summarize below.

The batch header:

#!/bin/bash
#SBATCH -t00:05:00
#SBATCH -N 2
#SBATCH -n 64
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-core=1
#SBATCH --output=mixed_gpu.txt
#SBATCH --reservation=test
#SBATCH --gres=gpu:4

If you want to provide the hostfile, you will have to decorate the srun like so:

# Define Hostfile
export SLURM_HOSTFILE=<some_hostfile with <N> entries>
# Run command
srun --ntasks=<N> --gpus-per-task=1 --cpus-per-task=4 --ntasks-per-core=1 --distribution=arbitrary <my-executable>

Dropping the hostfile, this also works:

srun --ntasks=$1 --gpus-per-task=1 --cpus-per-task=4 --ntasks-per-core=1 <my-executable>

again dropping the --distribution=arbitrary flag. From past experience, I think this is relatively easy to put into EnTK? |
@lsawade - thanks for your patience! In radical-saga and radical-pilot, you should now find two branches named |
Will give it a whirl! |
@andre-merzky , I find the branch in the pilot but not in saga? Should I just use fix/traverse for saga? |
@lsawade : Apologies, I missed a push for the branch... It should be there now in RS also. |
Hey @lsawade - did you have the chance to look into this again? |
Sorry, @andre-merzky , I thought I had updated the issue before I started driving on Friday... So, the issue persists. An error is still thrown when
$ cat re.session.traverse.princeton.edu.lsawade.019111.0001/radical.log | grep -b10 ERROR | head -20
|
@lsawade hi Lucas, can you please give it another try? That was a typo in the option setup and it has been fixed in that branch; the stack would then look like this:
|
@lsawade : ping :-) |
Hi,
I don't know whether this is related to #135. It is weird because I got everything running on a single node, but as soon as I use more than one node, EnTK seems to hang. I checked out the submission script and it looks fine to me; so did the node list.
The workflow already hangs in the submission of the first task, which is a single core, single thread task.
Stack
Client zip
client.session.zip
Session zip
sandbox.session.zip