disappearing units under ORTE on Titan #1235

Closed
andre-merzky opened this issue Feb 18, 2017 · 37 comments
@andre-merzky (Member)

Vivek runs a workload towards ornl.titan, thus using ORTE (not ORTELIB) for its pre-exec support, under this stack:

radical-stack
python            : 2.7.9
virtualenv        : /ccs/home/vivekb/ves/seis_env_titan
radical.utils     : titan@no-branch
saga-python       : titan@no-branch
radical.pilot     : titan-4-g24e1956@no-branch

The workload is (at this point) a single unit, resulting in the following LM launch script:

$ cat radical_pilot_cu_launch_script.sh 
#!/bin/sh


# Change to working directory for unit
cd /lustre/atlas2/csc230/scratch/vivekb/radical.pilot.sandbox/rp.session.titan-ext7.vivekb.017215.0002/pilot.0000/unit.000000
# Environment variables
export RP_SESSION_ID=rp.session.titan-ext7.vivekb.017215.0002
export RP_PILOT_ID=pilot.0000
export RP_AGENT_ID=agent_1
export RP_SPAWNER_ID=agent.executing.0.child
export RP_UNIT_ID=unit.000000

# Pre-exec commands
module swap PrgEnv-pgi PrgEnv-gnu
module use --append /lustre/atlas/world-shared/bip103/modules
module load openmpi/DEVEL-STATIC
tar xf ipdata.tar
mkdir DATABASES_MPI
mkdir OUTPUT_FILES
cp -rf /lustre/atlas/scratch/vivekb/csc230/modules/specfem3d_globe/bin .
sed -i "s:^NUMBER_OF_SIMULTANEOUS_RUNS.*:NUMBER_OF_SIMULTANEOUS_RUNS = 1:g" DATA/Par_file
# The command to run
/lustre/atlas/world-shared/bip103/openmpi/static-nodebug/bin/orterun  --hnp "1694367744.0;tcp://10.128.0.92,160.91.205.242:57011"  --bind-to none -np 4 -host 820,820,820,820 ./bin/xmeshfem3D
RETVAL=$?
# Post-exec commands
tar cf opdata.tar DATA/ DATABASES_MPI/ OUTPUT_FILES/ bin/

# Exit the script with the return code from the command
exit $RETVAL

I see the CU start in the logs, and the pre-exec commands get executed all right (all files exist), but the workload itself (the MPI task) seems to never run (no output nor stdout files), nor does the agent ever collect the unit -- it eventually times out while apparently still waiting for it.

I tried to reproduce the behavior on an interactive node, like this:

$ module swap PrgEnv-pgi PrgEnv-gnu
$ module use --append /lustre/atlas/world-shared/bip103/modules
$ module load openmpi/DEVEL-STATIC
$ /usr/bin/stdbuf -oL /lustre/atlas/world-shared/bip103/openmpi/static-nodebug/bin/orte-dvm > dvm.log 2>&1 &
$ cat dvm.log 
VMURI: 1321074688.0;tcp://10.128.36.164,160.91.205.204:39508
DVM ready

and then

$ time /lustre/atlas/world-shared/bip103/openmpi/static-nodebug/bin/orterun --hnp "1321074688.0;tcp://10.128.36.164,160.91.205.204:39508" -np 4  hostname        
nid02259
nid02259
nid02259
nid02259
[ORTE] Task: 0 is launched! (Job ID: [20158,2])
[ORTE] Task: 0 returned: 0 (Job ID: [20158,2])
r:0m0.048s  u:0m0.008s  s:0m0.008s

and

time /lustre/atlas/world-shared/bip103/openmpi/static-nodebug/bin/orterun --hnp "1321074688.0;tcp://10.128.36.164,160.91.205.204:39508" --bind-to none -np 4 ./bin/xmeshfem3D
[ORTE] Task: 0 is launched! (Job ID: [20158,5])
[ORTE] Task: 0 returned: 0 (Job ID: [20158,5])
r:0m35.705s  u:0m0.008s  s:0m0.008s

which is approximately the expected runtime; this can also be confirmed via aprun and mpirun.

Note that when I specify -host nid02259,nid02259,nid02259,nid02259, the run returns immediately with an error:

$ time /lustre/atlas/world-shared/bip103/openmpi/static-nodebug/bin/orterun --hnp "1321074688.0;tcp://10.128.36.164,160.91.205.204:39508"  -np 4  --bind-to none  -host nid02259,nid02259,nid02259,nid02259 ./bin/xmeshfem3D
r:0m0.031s  u:0m0.004s  s:0m0.012s

$ echo $?
69

but I don't see any stdout/stderr, nor do I ever see the task hang.

I guess this needs some debugging on the DVM layer, to see why the MPI task does not start, and possibly on the RP layer, to see why the job is not collected (if it returns at all).

Mark, could you help out with #1230 to get a handle on that?

andre-merzky added this to the Future Release milestone Feb 18, 2017
andre-merzky self-assigned this Feb 18, 2017
@AA919 commented Feb 19, 2017

I have experienced the same problem: some CUs disappear, which causes the job to hit its wall-time limit. The problem seems related to the number of CUs: using one core per CU, I have no problems up to 512, but I consistently fail with 1024 or more.

@andre-merzky (Member Author)

Alessio, can you please paste a compute unit description? Thanks!

@AA919 commented Feb 19, 2017

This is my CU description:

        cu = rp.ComputeUnitDescription()
        cu.pre_exec = ["module load boost",
          "module load fftw",
          "module load cudatoolkit",
          "module use --append /lustre/atlas/world-shared/bip103/modules",
          "module load openmpi/DEVEL-STATIC"]
        cu.executable = [PATH+"gmx_mpi"]
        cu.arguments = ["mdrun","-ntomp","1", "-nb", "cpu","-s","topol.tpr","-c","out.gro"]
        cu.input_staging = ["topol.tpr"]
        cu.mpi = True
        cu.cores = cores

@andre-merzky (Member Author)

Thanks Alessio! What is PATH set to?

PS: please use triple backticks to format verbatim text. FWIW, ```python will also enable Python syntax highlighting, like this:

def test(a, b, c=None):
  pass

@AA919 commented Feb 19, 2017

PATH is the path for gromacs in my ram disk space. It corresponds to: /lustre/atlas/scratch/aleang9/csc108/gromacs/bin/

@marksantcroos (Contributor) commented Feb 20, 2017

> Note that when I specify -host nid02259,nid02259,nid02259,nid02259, the run returns immediately with an error:

You need to specify 2259 instead of nid02259 to duplicate RP's behaviour.
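
For reference, a minimal sketch of that host-name conversion (a hypothetical helper for illustration, not RP's actual LRMS code): on Titan, orterun's -host option wants the bare node number, so the 'nid' prefix and the leading zeros have to be stripped.

def titan_node_id(hostname):
    # hypothetical helper: 'nid02259' -> '2259'
    # (strip the 'nid' prefix, then drop leading zeros)
    if hostname.startswith('nid'):
        hostname = hostname[len('nid'):]
    return str(int(hostname))

# e.g. build the -host argument for 4 ranks on the same node:
host_list = ','.join([titan_node_id('nid02259')] * 4)   # -> '2259,2259,2259,2259'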

@andre-merzky (Member Author)

Ah, that's the trick :) I was about to parse through the LRMS code... Thanks, will do.

ibethune modified the milestones: 0.46, Future Release Mar 2, 2017
@andre-merzky (Member Author)

#1277 is currently being tested in this context.

@andre-merzky (Member Author)

[Moving discussion from #1277 back to this ticket]

Hey Alessio,

in your unit description, you have:

                     "module use --append /lustre/atlas/world-shared/bip103/modules",
                     "module load openmpi/STATIC"]

This looks like the wrong module; it should be the same as in

https://github.com/radical-cybertools/radical.pilot/pull/1277/files#diff-4d8182f639ff06b24cd3f05d2c7e27abL68 . But I am not sure this really makes a difference, given the mode in which pre_exec works under ORTE.

Thanks for opening up the gromacs install, I'll do some tests with that.

@andre-merzky (Member Author)

Hi Ale,

I fixed an error in the titan ORTE config (a subdir was missing in a path spec).

The following script seems to work with your gromacs installation. So please update your RP installation from the fix/titan_orte branch and give that script a try; if that works OK, evolve it from there toward your use case and scale (a sizing sketch follows after the script). Thanks!

#!/usr/bin/env python

import os
import sys

import radical.pilot as rp
import radical.utils as ru

PATH="/lustre/atlas/proj-shared/csc230/gromacs/gromacs/bin/"


#------------------------------------------------------------------------------
#
if __name__ == '__main__':

    report = ru.LogReporter(name='radical.pilot')
    report.title('Getting Started (RP version %s)' % rp.version_detail)

    resource    = 'ornl.titan'
    queue       = 'debug'
    pilot_cores = 16
    cu_cores    =  2
    cu_num      =  2

    session = rp.Session()

    try:

        report.info('read config')
        config = ru.read_json('./config.json')

        report.ok('>>ok\n')
        report.header('submit pilots')

        pmgr = rp.PilotManager(session=session)
        pd_init = {
                'resource'      : resource,
                'queue'         : queue,
                'project'       : 'CSC230',
                'access_schema' : 'local',
                'runtime'       : 60,
                'exit_on_error' : True,
                'cores'         : pilot_cores + 16,
                }
        pdesc = rp.ComputePilotDescription(pd_init)
        pilot = pmgr.submit_pilots(pdesc)

        report.header('submit units')

        umgr = rp.UnitManager(session=session)
        umgr.add_pilots(pilot)

        report.info('create ' + str(cu_num) + ' unit description(s)\n\t')

        cuds = list()
        for i in range(0, cu_num):

            cud = rp.ComputeUnitDescription()
            cud.pre_exec      = ["module load boost",
                                 "module load fftw",
                                 "module load cudatoolkit",
                                 "module use --append /lustre/atlas/world-shared/csc230/modules/",
                                 "module load openmpi/2017_03_09_6da4dbb"]
            cud.executable    = [PATH + "gmx_mpi"] 
            cud.arguments     = ["mdrun","-ntomp","1", "-nb", "cpu","-s","topol.tpr","-c","out.gro"]
            cud.input_staging = ["topol.tpr"]
            cud.mpi           = True
            cud.cores         = cu_cores
            
            cuds.append(cud)
            report.progress()

        report.ok('>>ok\n')
        umgr.submit_units(cuds)

        report.header('gather results')
        umgr.wait_units()
    

    except Exception as e:
        report.error('caught Exception: %s\n' % e)
        raise

    except (KeyboardInterrupt, SystemExit) as e:
        report.warn('exit requested\n')

    finally:
        report.header('finalize')
        session.close(cleanup=False)

    report.header()


#-------------------------------------------------------------------------------
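
As a hypothetical sizing sketch for evolving this toward the failing case (1024 single-core CUs), presumably only the parameters at the top of the script need to change; the queue name and core counts below are assumptions and would need adjusting to Titan's queue policies and the allocation:

resource    = 'ornl.titan'
queue       = 'batch'      # assumption: the debug queue is too small at this scale
pilot_cores = 1024         # one core per CU, 1024 CUs
cu_cores    = 1
cu_num      = 1024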

@AA919 commented Mar 15, 2017 via email

@andre-merzky (Member Author)

At what scale does the problem appear at this point? How many cores are you using per CU? And can you send me a tarball of the pilot sandbox, please?

Thanks, Andre

@AA919 commented Mar 15, 2017

As before, at 1024 cores; each CU uses one core.
I attached the sandbox. I had to remove umgr.log and the CU folders because the archive would be too big otherwise. If you need them, I can put the archive somewhere on Titan.

sandbox.tar.gz

@andre-merzky (Member Author)

I would indeed like to look into some of the unit sandboxes, too -- it would be great if you could make them available on Titan then. Thanks!

@AA919 commented Mar 15, 2017

I have put it here: /lustre/atlas/proj-shared/csc230/
Not the best place, but at least I am sure that you can access it.

@andre-merzky (Member Author)

Uh, I am afraid the unit sandboxes are not there either?

@AA919 commented Mar 15, 2017

They are coming right now.

@andre-merzky (Member Author)

Woohoo, got an error message:

$ cat *ERR
[nid12737:01598] PMIX ERROR: UNPACK-PAST-END in file /lustre/atlas/world-shared/csc230/openmpi/src/ompi/opal/mca/pmix/pmix2x/pmix/src/client/pmix_client.c at line 113
[nid12737:01598] UNEXPECTED MESSAGE tag = 256

@marksantcroos, any idea? Have you seen this before, by any chance?

@andre-merzky (Member Author)

I have trouble re-creating the interactive DVM session documented above:

(ve_test)merzky1@titan-ext1:~/sandbox/rp.session.titan-ext1.merzky1.017240.0003/pilot.0000 $ qsub -I -A CSC230 -q debug -l nodes=3,walltime=30:00
qsub: waiting for job 3273741 to start
qsub: job 3273741 ready

merzky1@titan-login7:~ $ module swap PrgEnv-pgi PrgEnv-gnu
merzky1@titan-login7:~ $ module use --append /lustre/atlas/world-shared/bip103/modules
merzky1@titan-login7:~ $ module load openmpi/DEVEL-STATIC
merzky1@titan-login7:~ $ /usr/bin/stdbuf -oL /lustre/atlas/world-shared/bip103/openmpi/static-nodebug/bin/orte-dvm > dvm.log 2>&1 &
[1] 22323
merzky1@titan-login7:~ $ cat dvm.log 
VMURI: 3718578176.0;tcp://10.128.0.95,160.91.205.244:58957
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

I see the same error when using the new ORTE install. @marksantcroos, any idea what's up with this?

@marksantcroos (Contributor)

> @marksantcroos, any idea? Have you seen this before, by any chance?

No, but good news that you found an error. I'll report it to openmpi.

@andre-merzky (Member Author)

Mark, I am handing this over to you at this point, if that's OK with you. But please do let me know if anything can be done from my end!

@marksantcroos (Contributor)

Alessio, did you also run into the issue with the script that Andre posted, or only with your own code?
(In other words, I would like to be able to reproduce it.)

@AA919 commented Mar 16, 2017 via email

@marksantcroos (Contributor)

Update: I can reproduce this now. TBC.

@marksantcroos (Contributor)

The hanging is likely caused by open-mpi/ompi#1132.
I'm preparing a shared installation without the sorting for you guys to test out.

The other error (open-mpi/ompi#3192) might be unrelated, and therefore remains unresolved.

@marksantcroos (Contributor)

Assuming you are using the fix/titan_orte branch, a fix has been committed that points to the installation with the workaround.
Please give it a try.

@AA919 commented Mar 24, 2017

I'll do it, but I cannot do it right now: I am running tests for a very close deadline (April 3), and all four of my job slots on Titan are constantly busy.

@andre-merzky (Member Author)

Thanks Mark, I'll give it a go!

@andre-merzky (Member Author)

Mark, it took me a while to fix some problems w.r.t. pre/post-exec on my end, and in the end I reverted to just confirming the fix in ORTE.
I am successfully running this stack now:

(ve_test)merzky1@titan-ext1:~/test_ale $ radical-stack 
python            : 2.7.9
virtualenv        : /autofs/nccs-svm1_home1/merzky1/ve_test
radical.utils     : v0.45-2-g82050c5@devel
saga-python       : split-4-gaa285ca@devel
radical.pilot     : split-30-g5b41647@fix-titan_orte
radical.analytics : v0.1-137-g05f47f3@devel

and am able to run MPI units at scale (2k tested).

I'll continue to run tests, but I think this is good for merging. I would like to redeploy without the _unsorted suffix on the ompi module though: I would like to keep only one canonical ompi installation per host. What would I need to add as a configuration option?

Thanks!

@marksantcroos (Contributor) commented Mar 28, 2017

> I would like to redeploy without the _unsorted suffix on the ompi module though: I would like to keep only one canonical ompi installation per host. What would I need to add as a configuration option?

This installation has a local edit that requires a proper fix. So currently you can't.

@andre-merzky (Member Author)

Got it. Is that fix likely to be required on other machines too, or is it Titan-specific? If the former, what's the best path forward? I assume a PR to openmpi?

@marksantcroos (Contributor)

It's Cray-specific. It's being worked on.

@marksantcroos (Contributor)

Update:

So you'll probably want to stay on 2017_03_24_6da4dbb-unsorted regardless.

@AA919 commented Apr 6, 2017

Mark, sorry for the delay.
I will work on it starting today.

@AA919 commented Apr 7, 2017

1024 gromacs CUs executed successfully -- the first time I am able to do it.
In other words, it seems to work.
I am increasing the scale of the experiment to determine the maximum number of CUs that we are able to run.
What problem does open-mpi/ompi#3192 (the PMIX error) cause?

@marksantcroos (Contributor)

> In other words, it seems to work.

Good.

> I am increasing the scale of the experiment to determine the maximum number of CUs that we are able to run.

There is currently a limit of 64k units per pilot -- not concurrent CUs, but cumulative over the pilot's lifetime (see the sketch after this comment).

> What problem does open-mpi/ompi#3192 (the PMIX error) cause?

Stalled units.
Note that I'm actually not sure anymore whether it still exists, as I ran into a different issue that stops me now.
Anyway, you can continue with the version that you are using currently.
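
Regarding the 64k-units-per-pilot cap above, a minimal sketch of one way a client script could respect it, assuming the cap is cumulative over a pilot's lifetime. UNIT_CAP and unit_waves are placeholders, and pd_init, pmgr and session are reused from the earlier example script; this is not an RP-documented pattern, just an illustration.

UNIT_CAP  = 64 * 1024        # assumed cumulative per-pilot limit
submitted = 0

pilot = pmgr.submit_pilots(rp.ComputePilotDescription(pd_init))
umgr  = rp.UnitManager(session=session)
umgr.add_pilots(pilot)

for wave in unit_waves:      # 'unit_waves': lists of ComputeUnitDescriptions
    if submitted + len(wave) > UNIT_CAP:
        umgr.wait_units()    # drain the current pilot before rotating
        pilot = pmgr.submit_pilots(rp.ComputePilotDescription(pd_init))
        umgr  = rp.UnitManager(session=session)
        umgr.add_pilots(pilot)
        submitted = 0
    umgr.submit_units(wave)
    submitted += len(wave)

umgr.wait_units()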

@andre-merzky (Member Author)

At long last, we can consider this one solved. Thanks to all, and Mark specifically! :)
