disappearing units under ORTE on Titan #1235

Closed
andre-merzky opened this issue Feb 18, 2017 · 37 comments
@andre-merzky (Member)

Vivek runs a workload towards ornl.titan, thus using ORTE (not ORTELIB) for its pre-exec support, under this stack:

radical-stack
python            : 2.7.9
virtualenv        : /ccs/home/vivekb/ves/seis_env_titan
radical.utils     : titan@no-branch
saga-python       : titan@no-branch
radical.pilot     : titan-4-g24e1956@no-branch

The workload is (at this point) a single unit, resulting in the following LM launch script:

$ cat radical_pilot_cu_launch_script.sh 
#!/bin/sh


# Change to working directory for unit
cd /lustre/atlas2/csc230/scratch/vivekb/radical.pilot.sandbox/rp.session.titan-ext7.vivekb.017215.0002/pilot.0000/unit.000000
# Environment variables
export RP_SESSION_ID=rp.session.titan-ext7.vivekb.017215.0002
export RP_PILOT_ID=pilot.0000
export RP_AGENT_ID=agent_1
export RP_SPAWNER_ID=agent.executing.0.child
export RP_UNIT_ID=unit.000000

# Pre-exec commands
module swap PrgEnv-pgi PrgEnv-gnu
module use --append /lustre/atlas/world-shared/bip103/modules
module load openmpi/DEVEL-STATIC
tar xf ipdata.tar
mkdir DATABASES_MPI
mkdir OUTPUT_FILES
cp -rf /lustre/atlas/scratch/vivekb/csc230/modules/specfem3d_globe/bin .
sed -i "s:^NUMBER_OF_SIMULTANEOUS_RUNS.*:NUMBER_OF_SIMULTANEOUS_RUNS = 1:g" DATA/Par_file
# The command to run
/lustre/atlas/world-shared/bip103/openmpi/static-nodebug/bin/orterun  --hnp "1694367744.0;tcp://10.128.0.92,160.91.205.242:57011"  --bind-to none -np 4 -host 820,820,820,820 ./bin/xmeshfem3D
RETVAL=$?
# Post-exec commands
tar cf opdata.tar DATA/ DATABASES_MPI/ OUTPUT_FILES/ bin/

# Exit the script with the return code from the command
exit $RETVAL

I see the CU start in the logs, and the pre-exec commands get executed all right (all files exist), but the workload itself (the MPI task) seems to never run (no output nor stdout files), nor does the agent ever collect the unit -- it eventually times out while apparently still waiting for it.

I tried to reproduce the behavior on an interactive node, like this:

$ module swap PrgEnv-pgi PrgEnv-gnu
$ module use --append /lustre/atlas/world-shared/bip103/modules
$ module load openmpi/DEVEL-STATIC
$ /usr/bin/stdbuf -oL /lustre/atlas/world-shared/bip103/openmpi/static-nodebug/bin/orte-dvm > dvm.log 2>&1 &
$ cat dvm.log 
VMURI: 1321074688.0;tcp://10.128.36.164,160.91.205.204:39508
DVM ready

and then

$ time /lustre/atlas/world-shared/bip103/openmpi/static-nodebug/bin/orterun --hnp "1321074688.0;tcp://10.128.36.164,160.91.205.204:39508" -np 4  hostname        
nid02259
nid02259
nid02259
nid02259
[ORTE] Task: 0 is launched! (Job ID: [20158,2])
[ORTE] Task: 0 returned: 0 (Job ID: [20158,2])
r:0m0.048s  u:0m0.008s  s:0m0.008s

and

time /lustre/atlas/world-shared/bip103/openmpi/static-nodebug/bin/orterun --hnp "1321074688.0;tcp://10.128.36.164,160.91.205.204:39508" --bind-to none -np 4 ./bin/xmeshfem3D
[ORTE] Task: 0 is launched! (Job ID: [20158,5])
[ORTE] Task: 0 returned: 0 (Job ID: [20158,5])
r:0m35.705s  u:0m0.008s  s:0m0.008s

which is approximately the expected runtime; this can also be confirmed via aprun and mpirun.

Note that when I specify -host nid02259,nid02259,nid02259,nid02259, the run returns immediately with an error:

$ time /lustre/atlas/world-shared/bip103/openmpi/static-nodebug/bin/orterun --hnp "1321074688.0;tcp://10.128.36.164,160.91.205.204:39508"  -np 4  --bind-to none  -host nid02259,nid02259,nid02259,nid02259 ./bin/xmeshfem3D
r:0m0.031s  u:0m0.004s  s:0m0.012s

$ echo $?
69

but I don't see any stdout/stderr, nor do I ever see the task hang.

I guess this needs some debugging on the DVM layer, to see why the MPI task does not start, and possibly on the RP layer, to see why the job is not collected (if it returns at all).

Mark, could you help out with #1230 to get a handle on that?

andre-merzky added this to the Future Release milestone Feb 18, 2017
andre-merzky self-assigned this Feb 18, 2017
@AA919 commented Feb 19, 2017

I have experienced the same problem: some CUs disappear, which causes the job to hit its wall-time limit. The problem seems related to the number of CUs: using one core per CU, I have no problems up to 512, but I consistently fail with 1024 or more.

@andre-merzky (Member Author)

Alessio, can you please paste a compute unit description? Thanks!

@AA919 commented Feb 19, 2017

This is my CU description:

        cu = rp.ComputeUnitDescription()
        cu.pre_exec = ["module load boost",
          "module load fftw",
          "module load cudatoolkit",
          "module use --append /lustre/atlas/world-shared/bip103/modules",
          "module load openmpi/DEVEL-STATIC"]
        cu.executable = [PATH+"gmx_mpi"]
        cu.arguments = ["mdrun","-ntomp","1", "-nb", "cpu","-s","topol.tpr","-c","out.gro"]
        cu.input_staging = ["topol.tpr"]
        cu.mpi = True
        cu.cores = cores

@andre-merzky (Member Author)

Thanks Alessio! What is PATH set to?

PS: please use triple backticks to format verbatim text. FWIW, ```python will also enable Python syntax highlighting, like this:

def test(a, b, c=None):
  pass

@AA919 commented Feb 19, 2017

PATH is the path for gromacs in my ram disk space. It corresponds to: /lustre/atlas/scratch/aleang9/csc108/gromacs/bin/

@marksantcroos (Contributor) commented Feb 20, 2017

> Note that when I specify -host nid02259,nid02259,nid02259,nid02259, the run returns immediately with an error:

You need to specify 2259 instead of nid02259 to duplicate RP's behaviour.
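
For reference, a minimal sketch of that host-name conversion (a hypothetical helper for illustration, not RP's actual LRMS code): on Titan, orterun's -host option wants the bare node number, so the 'nid' prefix and the leading zeros have to be stripped.

def titan_node_id(hostname):
    # hypothetical helper: 'nid02259' -> '2259'
    # (strip the 'nid' prefix, then drop leading zeros)
    if hostname.startswith('nid'):
        hostname = hostname[len('nid'):]
    return str(int(hostname))

# e.g. build the -host argument for 4 ranks on the same node:
host_list = ','.join([titan_node_id('nid02259')] * 4)   # -> '2259,2259,2259,2259'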

@andre-merzky (Member Author)

Ah, that's the trick :) I was about to parse through the LRMS code... Thanks, will do.

ibethune modified the milestones: 0.46, Future Release Mar 2, 2017
@andre-merzky (Member Author)

#1277 is currently being tested in this context.

@andre-merzky (Member Author)

[Moving discussion from #1277 back to this ticket]

Hey Alessio,

in your unit description, you have:

                     "module use --append /lustre/atlas/world-shared/bip103/modules",
                     "module load openmpi/STATIC"]

This looks like the wrong module; it should be the same as in

https://github.com/radical-cybertools/radical.pilot/pull/1277/files#diff-4d8182f639ff06b24cd3f05d2c7e27abL68 . But I am not sure this really makes a difference, given the mode in which pre_exec works under ORTE.

Thanks for opening up the gromacs install, I'll do some tests with that.

@andre-merzky (Member Author)

Hi Ale,

I fixed an error in the titan ORTE config (a subdir was missing in a path spec).

The following script seems to work with your gromacs installation. So please update your RP installation from the fix/titan_orte branch and give that script a try; if that works OK, evolve it from there toward your use case and scale (a sizing sketch follows after the script). Thanks!

#!/usr/bin/env python

import os
import sys

import radical.pilot as rp
import radical.utils as ru

PATH="/lustre/atlas/proj-shared/csc230/gromacs/gromacs/bin/"


#------------------------------------------------------------------------------
#
if __name__ == '__main__':

    report = ru.LogReporter(name='radical.pilot')
    report.title('Getting Started (RP version %s)' % rp.version_detail)

    resource    = 'ornl.titan'
    queue       = 'debug'
    pilot_cores = 16
    cu_cores    =  2
    cu_num      =  2

    session = rp.Session()

    try:

        report.info('read config')
        config = ru.read_json('./config.json')

        report.ok('>>ok\n')
        report.header('submit pilots')

        pmgr = rp.PilotManager(session=session)
        pd_init = {
                'resource'      : resource,
                'queue'         : queue,
                'project'       : 'CSC230',
                'access_schema' : 'local',
                'runtime'       : 60,
                'exit_on_error' : True,
                'cores'         : pilot_cores + 16,
                }
        pdesc = rp.ComputePilotDescription(pd_init)
        pilot = pmgr.submit_pilots(pdesc)

        report.header('submit units')

        umgr = rp.UnitManager(session=session)
        umgr.add_pilots(pilot)

        report.info('create ' + str(cu_num) + ' unit description(s)\n\t')

        cuds = list()
        for i in range(0, cu_num):

            cud = rp.ComputeUnitDescription()
            cud.pre_exec      = ["module load boost",
                                 "module load fftw",
                                 "module load cudatoolkit",
                                 "module use --append /lustre/atlas/world-shared/csc230/modules/",
                                 "module load openmpi/2017_03_09_6da4dbb"]
            cud.executable    = [PATH + "gmx_mpi"] 
            cud.arguments     = ["mdrun","-ntomp","1", "-nb", "cpu","-s","topol.tpr","-c","out.gro"]
            cud.input_staging = ["topol.tpr"]
            cud.mpi           = True
            cud.cores         = cu_cores
            
            cuds.append(cud)
            report.progress()

        report.ok('>>ok\n')
        umgr.submit_units(cuds)

        report.header('gather results')
        umgr.wait_units()
    

    except Exception as e:
        report.error('caught Exception: %s\n' % e)
        raise

    except (KeyboardInterrupt, SystemExit) as e:
        report.warn('exit requested\n')

    finally:
        report.header('finalize')
        session.close(cleanup=False)

    report.header()


#-------------------------------------------------------------------------------
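
As a hypothetical sizing sketch for evolving this toward the failing case (1024 single-core CUs), presumably only the parameters at the top of the script need to change; the queue name and core counts below are assumptions and would need adjusting to Titan's queue policies and the allocation:

resource    = 'ornl.titan'
queue       = 'batch'      # assumption: the debug queue is too small at this scale
pilot_cores = 1024         # one core per CU, 1024 CUs
cu_cores    = 1
cu_num      = 1024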

@AA919 commented Mar 15, 2017 via email

@andre-merzky (Member Author)

At what scale does the problem appear at this point? How many cores are you using per CU? And can you send me a tarball of the pilot sandbox, please?

Thanks, Andre

@AA919 commented Mar 15, 2017

As before, at 1024 cores; each CU uses one core.
I attached the sandbox. I had to remove umgr.log and the CU folders because the archive would be too big otherwise. If you need them, I can put the archive somewhere on Titan.

sandbox.tar.gz

@andre-merzky (Member Author)

I would indeed like to look into some of the unit sandboxes, too -- it would be great if you could make them available on Titan then. Thanks!

@AA919 commented Mar 15, 2017

I have put it here: /lustre/atlas/proj-shared/csc230/
Not the best place, but at least I am sure that you can access it.

@andre-merzky (Member Author)

Uh, I am afraid the unit sandboxes are not there either?

@AA919 commented Mar 15, 2017

They are coming right now.

@andre-merzky (Member Author)

Woohoo, got an error message:

$ cat *ERR
[nid12737:01598] PMIX ERROR: UNPACK-PAST-END in file /lustre/atlas/world-shared/csc230/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/client/pmix_client.c at line 113
[nid12737:01598] UNEXPECTED MESSAGE tag = 256

@marksantcroos, any idea? Have you seen this before, by any chance?

@andre-merzky (Member Author)

I have trouble re-creating the interactive DVM session documented above:

(ve_test)merzky1@titan-ext1:~/sandbox/rp.session.titan-ext1.merzky1.017240.0003/pilot.0000 $ qsub -I -A CSC230 -q debug -l nodes=3,walltime=30:00
qsub: waiting for job 3273741 to start
qsub: job 3273741 ready

merzky1@titan-login7:~ $ module swap PrgEnv-pgi PrgEnv-gnu
merzky1@titan-login7:~ $ module use --append /lustre/atlas/world-shared/bip103/modules
merzky1@titan-login7:~ $ module load openmpi/DEVEL-STATIC
merzky1@titan-login7:~ $ /usr/bin/stdbuf -oL /lustre/atlas/world-shared/bip103/openmpi/static-nodebug/bin/orte-dvm > dvm.log 2>&1 &
[1] 22323
merzky1@titan-login7:~ $ cat dvm.log 
VMURI: 3718578176.0;tcp://10.128.0.95,160.91.205.244:58957
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

I see the same error when using the new ORTE install. @marksantcroos, any idea what's up with this?

@marksantcroos (Contributor)

> @marksantcroos, any idea? Have you seen this before, by any chance?

No, but good news that you found an error. I'll report it to openmpi.

@andre-merzky (Member Author)

Mark, I am handing this over to you at this point, if that's OK with you. But please do let me know if anything can be done from my end!

@marksantcroos (Contributor)

Alessio, did you also run into the issue with the script that Andre posted, or only with your own code?
(In other words, I would like to be able to reproduce it.)

@AA919 commented Mar 16, 2017 via email

@marksantcroos (Contributor)

Update: I can reproduce this now. TBC.

@marksantcroos (Contributor)

The hanging is likely caused by open-mpi/ompi#1132.
I'm preparing a shared installation without the sorting for you guys to test out.

The other error (open-mpi/ompi#3192) might be unrelated, and therefore remains unresolved.

@marksantcroos (Contributor)

Assuming you are using the fix/titan_orte branch, a fix has been committed that points to the installation with the workaround.
Please give it a try.

@AA919 commented Mar 24, 2017

I'll do it, but I cannot do it right now: I am running tests for a very close deadline (April 3), and all four of my job slots on Titan are constantly busy.

@andre-merzky (Member Author)

Thanks Mark, I'll give it a go!

@andre-merzky (Member Author)

Mark, it took me a while to fix some problems w.r.t. pre/post-exec on my end, and in the end I reverted to just confirming the fix in ORTE.
I am successfully running this stack now:

(ve_test)merzky1@titan-ext1:~/test_ale $ radical-stack 
python            : 2.7.9
virtualenv        : /autofs/nccs-svm1_home1/merzky1/ve_test
radical.utils     : v0.45-2-g82050c5@devel
saga-python       : split-4-gaa285ca@devel
radical.pilot     : split-30-g5b41647@fix-titan_orte
radical.analytics : v0.1-137-g05f47f3@devel

and am able to run MPI units at scale (2k tested).

I'll continue to run tests, but I think this is good for merging. I would like to redeploy without the _unsorted suffix on the ompi module though: I would like to keep only one canonical ompi installation per host. What would I need to add as a configuration option?

Thanks!

@marksantcroos (Contributor) commented Mar 28, 2017

> I would like to redeploy without the _unsorted suffix on the ompi module though: I would like to keep only one canonical ompi installation per host. What would I need to add as a configuration option?

This installation has a local edit that requires a proper fix. So currently you can't.

@andre-merzky (Member Author)

Got it. Is that fix likely to be required on other machines too, or is it Titan-specific? If the former, what's the best path forward? I assume a PR to openmpi?

@marksantcroos (Contributor)

It's Cray-specific. It's being worked on.

@marksantcroos (Contributor)

Update:

So you'll probably want to stay on 2017_03_24_6da4dbb-unsorted regardless.

@AA919 commented Apr 6, 2017

Mark, sorry for the delay.
I will work on it starting today.

@AA919 commented Apr 7, 2017

1024 gromacs CUs executed successfully -- the first time I am able to do it.
In other words, it seems to work.
I am increasing the scale of the experiment to determine the maximum number of CUs that we are able to run.
What problem does open-mpi/ompi#3192 (the PMIX error) cause?

@marksantcroos (Contributor)

> In other words, it seems to work.

Good.

> I am increasing the scale of the experiment to determine the maximum number of CUs that we are able to run.

There is currently a limit of 64k units per pilot -- not concurrent CUs, but cumulative over the pilot's lifetime (see the sketch after this comment).

> What problem does open-mpi/ompi#3192 (the PMIX error) cause?

Stalled units.
Note that I'm actually not sure anymore whether it still exists, as I ran into a different issue that stops me now.
Anyway, you can continue with the version that you are using currently.
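
Regarding the 64k-units-per-pilot cap above, a minimal sketch of one way a client script could respect it, assuming the cap is cumulative over a pilot's lifetime. UNIT_CAP and unit_waves are placeholders, and pd_init, pmgr and session are reused from the earlier example script; this is not an RP-documented pattern, just an illustration.

UNIT_CAP  = 64 * 1024        # assumed cumulative per-pilot limit
submitted = 0

pilot = pmgr.submit_pilots(rp.ComputePilotDescription(pd_init))
umgr  = rp.UnitManager(session=session)
umgr.add_pilots(pilot)

for wave in unit_waves:      # 'unit_waves': lists of ComputeUnitDescriptions
    if submitted + len(wave) > UNIT_CAP:
        umgr.wait_units()    # drain the current pilot before rotating
        pilot = pmgr.submit_pilots(rp.ComputePilotDescription(pd_init))
        umgr  = rp.UnitManager(session=session)
        umgr.add_pilots(pilot)
        submitted = 0
    umgr.submit_units(wave)
    submitted += len(wave)

umgr.wait_units()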

@andre-merzky (Member Author)

At long last, we can consider this one solved. Thanks to all, and Mark specifically! :)
