Sarich/eos config #410

Merged
merged 8 commits into master from sarich/eos_config
Aug 22, 2016

sarich
Contributor

@sarich sarich commented Aug 16, 2016

initial configuration xml setup for eos.ccs.ornl.gov

Test suite: Passes cime_developer tests except for
SMS_D_Ln9_Mmpi-serial.f19_g16_rx1.A.eos_intel.20160816_191851

/autofs/nccs-svm1_proj/cli112/sarich/cime/externals/pio1/pio/piolib_mod.F90(1566): error #6404: This name does not have a type, and must have an explicit type.   [MPI_ORDER_FORTRAN]
            mpi_order_fortran,mpidatatype, iodesc2%filetype, ierr)
------------^

That code is recent, from commit b874c8a

Guessing CIME_MODEL=acme, set environment variable if this is incorrect
ERI.f45_g37.X.eos_intel (Overall: PASS), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
PASS SHAREDLIB_BUILD
PASS MODEL_BUILD
PASS RUN
PASS COMPARE_base_hybrid
PASS COMPARE_base_rest
PASS MEMLEAK
ERR_Ld3.f45_g37_rx1.A.eos_intel (Overall: PASS), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
PASS SHAREDLIB_BUILD
PASS MODEL_BUILD
PASS RUN
PASS MEMLEAK
PASS COMPARE_base_rest
ERS_Ld3.ne30_g16_rx1.A.eos_intel (Overall: PASS), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
PASS SHAREDLIB_BUILD
PASS MODEL_BUILD
ERS_N2_Ld3.f19_g16_rx1.A.eos_intel (Overall: PASS), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
PASS SHAREDLIB_BUILD
PASS MODEL_BUILD
PASS RUN
PASS COMPARE_base_rest
PASS MEMLEAK
NCK_Ld3.f45_g37_rx1.A.eos_intel (Overall: PASS), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
PASS SHAREDLIB_BUILD
PASS MODEL_BUILD
PASS RUN
PASS COMPARE_base_multiinst
PASS MEMLEAK
SEQ_Ln9.f19_g16_rx1.A.eos_intel (Overall: PASS), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
PASS SHAREDLIB_BUILD
PASS MODEL_BUILD
PASS RUN
PASS COMPARE_base_seq
PASS MEMLEAK
SMS_D_Ln9_Mmpi-serial.f19_g16_rx1.A.eos_intel (Overall: FAIL), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
FAIL SHAREDLIB_BUILD

Closes #282

Code review:

@jayeshkrishna
Contributor

Looks like mpi-serial lib does not recognize MPI 3 (mpi_order_fortran).
This change was added to PIO1 in 21e0758 by @jedwards4b

@sarich
Contributor Author

sarich commented Aug 16, 2016

Thanks Jayesh, I'll check the mpi-serial libraries

@jayeshkrishna
Contributor

Ray Loy@ALCF is the maintainer of the mpi-serial library

@rljacob
Member

rljacob commented Aug 16, 2016

Ray only works on them in his "spare time" so we may have to add this ourselves.

@jayeshkrishna
Contributor

@sarich : After looking into the code it looks like MPI_ORDER_FORTRAN is only used if I/O is done using MPI (instead of netcdf/pnetcdf) with PIO. Is the configuration different on eos?
The testcase compiles successfully on blues.

@jayeshkrishna
Contributor

FYI, for mpi-serial PNETCDF paths (PNETCDFROOT/PNETCDF_PATH) should not be set.

@jedwards4b
Contributor

It shouldn't hurt to set them, they will be ignored.

@jayeshkrishna
Contributor

With the current code, if the PNETCDF paths are not set, MPIIO is turned off (on blues, for example, MPIIO is turned off with mpi-serial, and we don't hit this error with the mpi-serial testcase). The quick fix might be to turn off MPIIO by not setting the PNETCDF paths.
The other solution is to fix the mpi-serial lib.
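
As a minimal sketch of the rule described above (not the actual PIO or CIME build logic), the decision can be written as a single predicate. The function name `mpiio_enabled` and the dict of settings are hypothetical; only the variable names PNETCDFROOT and PNETCDF_PATH come from this thread.

```python
def mpiio_enabled(settings):
    """Return True when either PNETCDF path is set to a non-empty value.

    `settings` is a hypothetical dict of machine settings; an empty or
    missing value counts as "not set", which turns MPIIO off.
    """
    return any(settings.get(name) for name in ("PNETCDFROOT", "PNETCDF_PATH"))


# With mpi-serial the paths should be left unset, so MPIIO stays off.
print(mpiio_enabled({}))
print(mpiio_enabled({"PNETCDF_PATH": "/hypothetical/pnetcdf"}))
```

Under this reading, a machine file that loads a parallel-netcdf module for an mpi-serial build would switch MPIIO on and reach the MPI_ORDER_FORTRAN code path.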

@sarich
Contributor Author

sarich commented Aug 17, 2016

Thanks Jayesh, you led me right to the problem.

There was a typo in the line of my eos xml file that should have prevented the cray-parallel-netcdf module from loading, so the module was still being loaded. With that fixed, this test now runs.

@jayeshkrishna
Contributor

Great, so is this ready to be merged? Shall I go ahead with testing and merging the PR?

@sarich
Contributor Author

sarich commented Aug 17, 2016

Yes, it's ready now.

@jayeshkrishna
Contributor

@sarich : I keep getting these errors in my runs (the cime_developer tests build correctly),

Not able to fully resolve item '$MEMBERWORK/cli112/ERI.f45_g37.X.eos_intel.20160817_151530/run

It looks like you have hard-coded the value of $PROJECT (to cli112) in the config files. My $PROJECT in env is cli115.

@sarich
Contributor Author

sarich commented Aug 17, 2016

Yeah, forgot about that.
There was a problem with $PROJECT not resolving correctly so I hard-coded it in.

That may be fixed now, I'll change it back to $PROJECT and check.

@sarich
Contributor Author

sarich commented Aug 17, 2016

I've updated the config to use $PROJECT now. It works if PROJECT is defined as an environment variable, but create_newcase doesn't work if you try to set PROJECT using the --project= option.

@jayeshkrishna
Contributor

ok, thanks

@jedwards4b
Contributor

Seems to work for me to use the --project option to create_newcase

@sarich
Contributor Author

sarich commented Aug 17, 2016

OK, it looks like --project is working now, but I'm getting warning messages from the script:

sarich@eos-ext2:/ccs/proj/cli112/sarich/cime/scripts> ./create_newcase -case T -compset X -res f19_g16 --project=cli112
Compset longname is 2000_XATM_XLND_XICE_XOCN_XROF_XGLC_XWAV
Compset specification file is /autofs/nccs-svm1_proj/cli112/sarich/cime/scripts/Tools/../../driver_cpl/cime_config/config_compsets.xml
Pes specification file is /autofs/nccs-svm1_proj/cli112/sarich/cime/scripts/Tools/../../cime_config/acme/allactive/config_pesall.xml

Not able to fully resolve item '/ccs/home/sarich/acme_scratch/$PROJECT/T/bld'
Not able to fully resolve item '/lustre/atlas/scratch/sarich/$PROJECT/T/run'

Pes setting: grid is a%1.9x2.5_l%1.9x2.5_oi%gx1v6_r%r05_m%gx1v6_g%null_w%null
Pes setting: compset is 2000_XATM_XLND_XICE_XOCN_XROF_XGLC_XWAV
Pes setting: grid match is a%1.9x2.5
Pes setting: machine match is edison|eos
Pes setting: compset_match is any
Pes setting: pesize match is any
Could not find a queue matching task count 384, falling back to depreciated default walltime parameter
Job case.run queue batch walltime 01:15:00
Job case.test queue batch walltime 01:15:00
Job case.st_archive queue batch walltime 01:15:00
Job case.lt_archive queue batch walltime 01:15:00
Compset is: 2000_XATM_XLND_XICE_XOCN_XROF_XGLC_XWAV
Grid is: a%1.9x2.5_l%1.9x2.5_oi%gx1v6_r%r05_m%gx1v6_g%null_w%null
Components in compset are: ['xatm', 'xlnd', 'xice', 'xocn', 'xrof', 'xglc', 'xwav', 'sesp']
Creating Case directory /autofs/nccs-svm1_proj/cli112/sarich/cime/scripts/T

@rljacob
Member

rljacob commented Aug 17, 2016

That's a bug. Just opened issue #417

@jayeshkrishna
Contributor

@sarich : Can you try running the cime_developer test suite on eos with the "--test-root" option? (I merged your branch locally to master and ran the tests.) I am seeing the following error in TestStatus.log:

Errput: Building test for ERI in directory /autofs/nccs-svm1_home1/jayesh/acme/ESMCI_cime/scripts/testroot/ERI.f45_g37.X.eos_intel.20160817_160928
Not able to fully resolve item '$MEMBERWORK/cli115/ERI.f45_g37.X.eos_intel.20160817_160928/run'
Not able to fully resolve item '$MEMBERWORK/cli115/archive/ERI.f45_g37.X.eos_intel.20160817_160928'
Not able to fully resolve item '$MEMBERWORK/cli115/ERI.f45_g37.X.eos_intel.20160817_160928/run'
Not able to fully resolve item '$MEMBERWORK/cli115/ERI.f45_g37.X.eos_intel.20160817_160928/run'
2016-08-17 12:16:59: RUN PASSED for test 'ERI.f45_g37.X.eos_intel'.

and later in TestStatus.log

Check case OK
Submitting job script qsub    -q batch -l walltime=01:15:00 -A cli115 case.test

Errput: Not able to fully resolve item '$MEMBERWORK/cli115/ERI.f45_g37.X.eos_intel.20160817_160928/run'

I don't have a directory $MEMBERWORK/cli115/ERI.f45_g37.X.eos_intel.20160817_160928, but I do have $MEMBERWORK/cli115. Is the script unable to resolve $MEMBERWORK?

@jedwards4b
Contributor

Try changing $MEMBERWORK to $ENV{MEMBERWORK} in config_machines.xml

@jayeshkrishna
Contributor

Thanks @jedwards4b ! After changing $MEMBERWORK to $ENV{MEMBERWORK} in config_machines.xml I no longer see the "$MEMBERWORK" directory in the build directory. Waiting for the job to run.
(What are the criteria for deciding whether to use $VAR or $ENV{VAR}?)

@jedwards4b
Contributor

Use $ENV{VAR} if you know that the variable should be resolved from the environment and $VAR if it should be resolved from xml.
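
To illustrate the distinction, here is a minimal sketch of a resolver that treats the two forms differently. It is not CIME's actual implementation: the names `resolve` and `is_fully_resolved`, the `xml_vars` dict standing in for values read from the case xml files, and the scratch path are all hypothetical. Only the `$VAR` and `$ENV{VAR}` syntax is taken from this thread.

```python
import os
import re

# $ENV{NAME} is tried first, then plain $NAME.
_TOKEN = re.compile(r"\$ENV\{(\w+)\}|\$(\w+)")


def resolve(value, xml_vars, environ=None):
    """Expand $ENV{NAME} from the environment and $NAME from xml_vars.

    References that cannot be resolved are left in place so the caller
    can detect them afterwards.
    """
    environ = os.environ if environ is None else environ

    def substitute(match):
        env_name, xml_name = match.group(1), match.group(2)
        if env_name is not None:
            return environ.get(env_name, match.group(0))
        return xml_vars.get(xml_name, match.group(0))

    return _TOKEN.sub(substitute, value)


def is_fully_resolved(value):
    """True when no $NAME or $ENV{NAME} reference remains."""
    return _TOKEN.search(value) is None


xml_vars = {"PROJECT": "cli115", "CASE": "T"}     # stand-in for xml values
environ = {"MEMBERWORK": "/hypothetical/scratch"}  # stand-in for os.environ

# $MEMBERWORK is only in the environment, so the plain form stays unresolved.
print(resolve("$MEMBERWORK/$PROJECT/$CASE/run", xml_vars, environ))
# The $ENV{...} form is looked up in the environment and resolves.
print(resolve("$ENV{MEMBERWORK}/$PROJECT/$CASE/run", xml_vars, environ))
```

In this sketch an unresolved `$MEMBERWORK` survives as literal text, which would be consistent with the directory literally named "$MEMBERWORK" reported earlier in the thread.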

@jayeshkrishna
Contributor

Thanks. So it looks like it should be $ENV{MEMBERWORK} , right?
Also how does the $PROJECT get resolved in the line below (config_machines.xml : eos)? I thought it was picked up from env.

<RUNDIR>$ENV{MEMBERWORK}/$PROJECT/$CASE/run</RUNDIR>

@jedwards4b
Contributor

PROJECT is an xml variable in env_batch.xml, my PR #419 will make sure that it is found.

@jayeshkrishna
Contributor

Got it, the xml files (like env_batch.xml) in the case/test dir.

@sarich
Contributor Author

sarich commented Aug 17, 2016

Jayesh,

I do see this behavior now at runtime; I was looking in the wrong place. I'll try $ENV{MEMBERWORK} as well.

On Wed, Aug 17, 2016 at 12:29 PM, jayeshkrishna wrote:

I still have the same issue (cime_developer test suite builds fine but fails at runtime, ./cs.status.* does not show anything for RUN phase, the last reported phase is MODEL_BUILD. The runtime error is mentioned above, "Errput: Not able to fully resolve item '$MEMBERWORK/cli115/...", and I have a directory named, "$MEMBERWORK", in the test build directories)
Can someone else try this branch out and see if it works for them? @wilke / @jgfouca ?

@rljacob rljacob added this to the CIME5.1.0 milestone Aug 17, 2016
@jayeshkrishna
Contributor

@sarich : Are you able to run the cime_developer test suite successfully? I don't see the $MEMBERWORK directories any more, but still don't see the status of RUN when running ./cs.status.* .

@sarich
Contributor Author

sarich commented Aug 18, 2016

Jayesh,

The cime_developer tests are running successfully for me now. Were your
jobs still in the queue? There's a long wait sometimes.

@jayeshkrishna
Contributor

Ok, I will try again (first a single test and then the test suite) now and see if it goes through.

@jayeshkrishna
Contributor

jayeshkrishna commented Aug 18, 2016

I ran the following testcase on eos and still don't see the status of the RUN stage. I merged your branch (33f93bf) to master (4953517)

./create_test ERI.f45_g37.X --test-root=/autofs/nccs-svm1_home1/jayesh/acme/ESMCI_cime/scripts/tmp_testroot

This is the output I get from cs.status.* (No jobs in queue)

eos-ext1 scripts/tmp_testroot> ./cs.status.20160818_163448 
ERI.f45_g37.X.eos_intel (Overall: PASS), details:
  PASS CREATE_NEWCASE
  PASS XML
  PASS SHAREDLIB_BUILD
  PASS MODEL_BUILD
eos-ext1 scripts/tmp_testroot> qstat -u jayesh
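
A status listing like the one above can report "Overall: PASS" even though no RUN phase was ever recorded. The following is a small sketch for spotting that case by parsing the text output. The functions `parse_status` and `missing_phases` are hypothetical helpers, not part of CIME, and the assumption that RUN is the phase to look for is taken only from this discussion.

```python
def parse_status(text):
    """Map each test name to the (result, phase) pairs listed under it."""
    tests = {}
    current = None
    for raw in text.splitlines():
        parts = raw.split()
        if "(Overall:" in raw and parts:
            # Header line: the first token is the test name.
            current = parts[0]
            tests[current] = []
        elif current and len(parts) >= 2 and parts[0] in ("PASS", "FAIL"):
            tests[current].append((parts[0], parts[1]))
    return tests


def missing_phases(tests, expected=("RUN",)):
    """Return {test: [expected phases never reported]}."""
    report = {}
    for name, results in tests.items():
        seen = {phase for _, phase in results}
        absent = [phase for phase in expected if phase not in seen]
        if absent:
            report[name] = absent
    return report


SAMPLE = """\
ERI.f45_g37.X.eos_intel (Overall: PASS), details:
  PASS CREATE_NEWCASE
  PASS XML
  PASS SHAREDLIB_BUILD
  PASS MODEL_BUILD
"""

print(missing_phases(parse_status(SAMPLE)))
```

A test that appears in the result has not reported the phase at all, which is a different condition from a phase that ran and failed.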

@sarich
Contributor Author

sarich commented Aug 18, 2016

This is happening to all your tests?

Is there anything in the logs to indicate if it ran or not?

@jayeshkrishna
Contributor

Looks like I have the wrong python version. Let me retry.

@jayeshkrishna
Contributor

jayeshkrishna commented Aug 18, 2016

Any idea where the python version is picked up from? software_environment.txt in the case directory shows python 2.7.9 but the job output .o* has this error message,

ERROR: Python 2, minor version 7+ is required, you have 2.6

also my shell env has python 2.7.9,

eos-ext1 tmp_testroot/ERI.f45_g37.X.eos_intel.20160818_163448> which python
/sw/xc30/python/2.7.9/sles11.3_gnu4.8.2/bin/python

@jayeshkrishna
Contributor

env_mach_specific.xml has

    <modules>
      <command name="load">cmake/2.8.11.2</command>
      <command name="load">python/2.7.9</command>
    </modules>
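
For reference, a sketch of how a block like this could be turned into the shell commands it implies. This is an illustration only and assumes each `<command>` maps to a single `module <name> <argument>` call; the helper `module_commands` is hypothetical, and the real CIME handling of env_mach_specific.xml may differ.

```python
import xml.etree.ElementTree as ET

MODULES_XML = """
<modules>
  <command name="load">cmake/2.8.11.2</command>
  <command name="load">python/2.7.9</command>
</modules>
"""


def module_commands(xml_text):
    """Turn each <command name="X">arg</command> into 'module X arg'."""
    root = ET.fromstring(xml_text)
    commands = []
    for node in root.iter("command"):
        argument = (node.text or "").strip()
        # A command without an argument yields just 'module X'.
        commands.append(" ".join(["module", node.get("name"), argument]).strip())
    return commands


for line in module_commands(MODULES_XML):
    print(line)
```

Printing the commands this way makes it easy to run them by hand in the same shell the batch job uses and compare the resulting PATH.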

@sarich
Contributor Author

sarich commented Aug 18, 2016

I don't know where that version could come from.

The create_test script itself uses what's in the user's environment, but the build and run scripts should use the env_mach_specific information. But if your environment isn't using at least 2.7.9 then things should stop before anything gets done:

./create_test SMS_D_Ln9_Mmpi-serial.f19_g16_rx1.A
ERROR: Python 2, minor version 7+ is required, you have 2.6

@jgfouca
Contributor

jgfouca commented Aug 18, 2016

Does this machine do a purge in its env setup? That could cause the python to become too old.

@jayeshkrishna
Contributor

jayeshkrishna commented Aug 18, 2016

I don't see a module purge ( ) for the eos machine in config_machines.xml .

@jayeshkrishna
Contributor

Since tests are working for @sarich , I will sit with him tomorrow to sort this out.

@jayeshkrishna
Contributor

jayeshkrishna commented Aug 19, 2016

Update: @sarich and I are debugging this issue. At first glance it looks like a module environment issue; we use two different shells ( @sarich - bash, @jayeshkrishna - tcsh ).
The job script (case.test) fails while checking the version of python.
In my case (tcsh) the version of python at runtime is the default version (/usr/bin/python - 2.6.9), although environment variables set by the module env (such as _LMFILES_) do include entries for python 2.7.9. However, the PATH in my runtime environment (printed just before the CIME python version check) does not contain a path to python 2.7.9, so the default version of python gets picked up when CIME does the python version check.
So right now it looks like "module load python/2.7.9" does not work as expected.
(We also see the modules behaviour where _LMFILES_ gets too long and is split into _LMFILES_001, _LMFILES_002, etc. So first I will try unloading all explicitly loaded modules and see if that works.)

@worleyph
Contributor

Note that there were some issues with csh/tcsh support on Titan that got resolved a day or so ago (problem started at the beginning of the week?) after we complained. You should submit a help ticket - perhaps there is something simple that the OLCF staff can do to resolve this. Also, @mrnorman recently went back to a "module rm" style env_mach_specific from "module purge" because of its fragility. You might check what style is used on eos.

@jayeshkrishna
Contributor

Thanks @worleyph . For eos @sarich is not using "module purge" (he is rm'ing the modules instead).

@jayeshkrishna
Contributor

jayeshkrishna commented Aug 19, 2016

I added "module load python/2.7.9" to ".bashrc" (since I use tcsh, this file did not exist on eos for me) and the test is running now (and PASSes)!
(PBS scripts on eos currently use /bin/bash, #PBS -S /bin/bash, and case.test, the script submitted via qsub, is a python program)
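
When the interpreter seen inside a batch job differs from the one in the login shell, a short diagnostic can help. The sketch below reports which interpreter is running, which `python` is first on PATH, and which PATH entries mention python. It is written for Python 3 (`shutil.which` is not available in 2.7), the helper names are hypothetical, and the minimum of (2, 7) only approximates the CIME check quoted earlier.

```python
import os
import shutil
import sys


def python_report(environ=None):
    """Summarise the interpreter in use and what PATH would select."""
    environ = os.environ if environ is None else environ
    path = environ.get("PATH", "")
    return {
        "running": sys.executable,
        "version": "%d.%d" % sys.version_info[:2],
        "first_on_path": shutil.which("python", path=path),
        "python_path_entries": [
            entry for entry in path.split(os.pathsep) if "python" in entry.lower()
        ],
    }


def meets_minimum(version, minimum=(2, 7)):
    """Compare a (major, minor, ...) tuple against a minimum version."""
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    for key, value in python_report().items():
        print("%s: %s" % (key, value))
    print("meets minimum:", meets_minimum(sys.version_info))
```

Running it once from the login shell and once from inside the submitted job script should show whether the module-provided python directory actually reaches PATH in the batch environment.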

@jayeshkrishna jayeshkrishna merged commit 38ab9f1 into master Aug 22, 2016
@jgfouca jgfouca deleted the sarich/eos_config branch March 21, 2017 20:27