disappearing units under ORTE on Titan #1235
I have experienced the same problem. Some CUs disappear, which causes the run to hit the walltime limit. The problem seems related to the number of CUs: using one core per CU, I have no problems up to 512 CUs, but I fail consistently with 1024 or more.
Alessio, can you please paste a compute unit description? Thanks!
This is my CU description:

```python
cu = rp.ComputeUnitDescription()
cu.pre_exec = ["module load boost",
               "module load fftw",
               "module load cudatoolkit",
               "module use --append /lustre/atlas/world-shared/bip103/modules",
               "module load openmpi/DEVEL-STATIC"]
cu.executable = [PATH + "gmx_mpi"]
cu.arguments = ["mdrun", "-ntomp", "1", "-nb", "cpu", "-s", "topol.tpr", "-c", "out.gro"]
cu.input_staging = ["topol.tpr"]
cu.mpi = True
cu.cores = cores
```
Thanks Alessio! What is `PATH`?

P.S.: please use triple-backticks to format verbatim text. FWIW:

```python
def test(a, b, c=None):
    pass
```
PATH is the path for gromacs in my RAM disk space. It corresponds to:
You need to specify …
Ah, that's the trick :) I was about to parse through the LRMS code... Thanks, will do.
#1277 is currently tested in this context.
[Moving discussion from #1277 back to this ticket] Hey Alessio, in your unit description, you have:

This looks like the wrong module; it should be the same as in https://github.com/radical-cybertools/radical.pilot/pull/1277/files#diff-4d8182f639ff06b24cd3f05d2c7e27abL68. But I am not sure this makes a difference, really, given the mode in which pre_exec works in ORTE. Thanks for opening up the gromacs install, I'll do some tests with that.
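For illustration only (this sketch is not from the thread): the pre_exec as it would look when pointed at the world-shared csc230 module tree; the module path and the openmpi module name are taken from the script Andre posts below.

```python
# Hedged sketch: pre_exec pointing at the shared csc230 module tree,
# matching the modules used in Andre's script further down this thread.
cud.pre_exec = ["module load boost",
                "module load fftw",
                "module load cudatoolkit",
                "module use --append /lustre/atlas/world-shared/csc230/modules/",
                "module load openmpi/2017_03_09_6da4dbb"]
```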
Hi Ale, I fixed an error in the titan ORTE config (a subdir missing in a path spec). The following script seems to work with your gromacs installation. So please update your RP installation from the …

```python
#!/usr/bin/env python

import os
import sys

import radical.pilot as rp
import radical.utils as ru

PATH = "/lustre/atlas/proj-shared/csc230/gromacs/gromacs/bin/"


# ------------------------------------------------------------------------------
#
if __name__ == '__main__':

    report = ru.LogReporter(name='radical.pilot')
    report.title('Getting Started (RP version %s)' % rp.version_detail)

    resource    = 'ornl.titan'
    queue       = 'debug'
    pilot_cores = 16
    cu_cores    = 2
    cu_num      = 2

    session = rp.Session()

    try:
        report.info('read config')
        config = ru.read_json('./config.json')
        report.ok('>>ok\n')

        report.header('submit pilots')
        pmgr = rp.PilotManager(session=session)

        pd_init = {
            'resource'      : resource,
            'queue'         : queue,
            'project'       : 'CSC230',
            'access_schema' : 'local',
            'runtime'       : 60,
            'exit_on_error' : True,
            'cores'         : pilot_cores + 16,
        }
        pdesc = rp.ComputePilotDescription(pd_init)
        pilot = pmgr.submit_pilots(pdesc)

        report.header('submit units')
        umgr = rp.UnitManager(session=session)
        umgr.add_pilots(pilot)

        report.info('create ' + str(cu_num) + ' unit description(s)\n\t')
        cuds = list()
        for i in range(0, cu_num):
            cud = rp.ComputeUnitDescription()
            cud.pre_exec = ["module load boost",
                            "module load fftw",
                            "module load cudatoolkit",
                            "module use --append /lustre/atlas/world-shared/csc230/modules/",
                            "module load openmpi/2017_03_09_6da4dbb"]
            cud.executable = [PATH + "gmx_mpi"]
            cud.arguments = ["mdrun", "-ntomp", "1", "-nb", "cpu",
                             "-s", "topol.tpr", "-c", "out.gro"]
            cud.input_staging = ["topol.tpr"]
            cud.mpi = True
            cud.cores = cu_cores
            cuds.append(cud)
            report.progress()
        report.ok('>>ok\n')

        umgr.submit_units(cuds)

        report.header('gather results')
        umgr.wait_units()

    except Exception as e:
        report.error('caught Exception: %s\n' % e)
        raise

    except (KeyboardInterrupt, SystemExit) as e:
        report.warn('exit requested\n')

    finally:
        report.header('finalize')
        session.close(cleanup=False)

    report.header()
# ------------------------------------------------------------------------------
```
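Not part of Andre's script, but a minimal sketch of how one might make "disappearing" units visible: capture the units returned by `submit_units()` and, after `wait_units()`, count how many reached a final state. It assumes the `umgr` and `cuds` variables from the script above; the state constants are the standard radical.pilot ones.

```python
# Hedged sketch, reusing `umgr` and `cuds` from the script above.
units = umgr.submit_units(cuds)
umgr.wait_units()

final = [u for u in units if u.state in [rp.DONE, rp.FAILED, rp.CANCELED]]
print('%d / %d units reached a final state' % (len(final), len(units)))

for u in units:
    if u.state not in [rp.DONE, rp.FAILED, rp.CANCELED]:
        # candidates for "disappearing" units
        print('unit %s stuck in state %s' % (u.uid, u.state))
```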
Hi Andre,
I did the pull and tried again with 1024 units on 1024 cores. The problem is still there.
Ale
P.S.: I was using the correct OpenMPI module on Titan. I sent you an older version that I had on my PC. Sorry.
At what scale does the problem appear at this point? How many cores are you using per CU? And can you send me a tarball of the pilot sandbox, please? Thanks, Andre
As before, 1024 cores. Each CU uses one core.
I would indeed like to look into some of the unit sandboxes, too - it would be great if you could make them available on Titan. Thanks!
I have put it here: /lustre/atlas/proj-shared/csc230/
Uh, I am afraid the unit sandboxes are not there either?
They are coming right now.
Wohoo, got an error message:
@marksantcroos, any idea? Have you seen this before by any chance?
I have trouble re-creating the interactive DVM session documented above:
I see the same error when using the new ORTE install. @marksantcroos, any idea what's up with this?
No, but good news that you found an error. I'll report it to OpenMPI.
Mark, I am handing this over to you at this point, if that's OK with you. But please do let me know if anything can be done from my end!
Alessio, did you run into the issue also with the script that Andre posted, or only with your own code? (In other words, I would like to be able to reproduce it.)
Only with mine, but Andre's script was essentially my script, with the instructions of a method inside the main script instead of in a separate file. I think you would get the very same result if you run Andre's script with 1024 units on 1024 cores.
Anyway, I have put the script here:
/lustre/atlas/proj-shared/csc230/example/
You can use mine if you prefer.
Alessio
Update: I can reproduce this now. TBC.
The hanging is likely caused by open-mpi/ompi#1132. The other error (open-mpi/ompi#3192) might be unrelated and therefore remains unresolved.
Assuming you are using the …
I'll do it, but I cannot do it right now. I am running tests for a deadline that is very close (April 3), and all four of my job slots on Titan are constantly busy.
Thanks Mark, I'll give it a go!
Mark, it took me a while to fix some problems wrt. pre/post exec on my end, and in the end I reverted to confirming the fix in ORTE:

```
(ve_test)merzky1@titan-ext1:~/test_ale $ radical-stack
python            : 2.7.9
virtualenv        : /autofs/nccs-svm1_home1/merzky1/ve_test
radical.utils     : v0.45-2-g82050c5@devel
saga-python       : split-4-gaa285ca@devel
radical.pilot     : split-30-g5b41647@fix-titan_orte
radical.analytics : v0.1-137-g05f47f3@devel
```

With this stack I am able to run MPI units at scale (2k tested). I'll continue to run tests, but I think this is good for merging. I would like to redeploy w/o the … Thanks!
This installation has a local edit that requires a proper fix, so currently you can't.
Got it. Is that fix likely to be required on other machines too, or is it Titan-specific? If the former, what's the best path forward? I assume a PR to OpenMPI?
It's Cray-specific. It's being worked on.
Update:
So you'll probably want to stay on …
Mark, sorry for the delay.
1024 gromacs units executed successfully. This is the first time I have been able to do it.
Good.
There is currently a limit of 64k units per pilot. Not concurrent CUs, but over the pilot's lifetime.
Stalled units.
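As an aside (not from the thread): one hypothetical way to stay below the 64k-units-per-pilot limit mentioned above is to submit the workload in chunks, with a fresh pilot per chunk. The chunk size and the `all_cuds`/`pd_init` names below are illustrative and assume a setup like the script earlier in this thread.

```python
# Hypothetical sketch: run the workload in chunks, one pilot per chunk,
# so no single pilot ever sees more than ~64k units over its lifetime.
CHUNK = 60000
for start in range(0, len(all_cuds), CHUNK):
    pilot = pmgr.submit_pilots(rp.ComputePilotDescription(pd_init))
    umgr  = rp.UnitManager(session=session)
    umgr.add_pilots(pilot)
    umgr.submit_units(all_cuds[start:start + CHUNK])
    umgr.wait_units()
    pilot.cancel()
```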
At long last, we can consider this one solved. Thanks to all, and Mark specifically! :)
Vivek runs a workload towards `ornl.titan`, thus using ORTE (not ORTELIB) for its pre-exec support, under this stack:

The workload is a single unit (at this point), with the following LM-resulting script:
I see the CU started in the logs, and the pre-exec commands get executed all right (all files exist), but the workload itself (the MPI task) seems to never run (no output nor stdout files), nor will the agent ever collect the unit -- the agent eventually times out while apparently still waiting for it.
I tried to reproduce the behavior like this, with an interactive node:
and then
and
which is approximately the expected runtime, which can also be confirmed via aprun and mpirun.
Note that when I specify `-host nid02259,nid02259,nid02259,nid02259`, the run returns immediately with an error:

but I don't see stdout/stderr. I also never see the task hanging.
I guess that needs some debugging on the DVM layer, to see why the MPI task does not start, and possibly on the RP layer, to see why the job is not collected (iff it returns).
Mark, could you help out with #1230 to get a handle on that?