Sarich/eos config #410
Looks like mpi-serial lib does not recognize MPI 3 (mpi_order_fortran).
Thanks Jayesh, I'll check the mpi-serial libraries.
Ray Loy@ALCF is the maintainer of the mpi-serial library.
Ray only works on them in his "spare time", so we may have to add this ourselves.
@sarich: After looking into the code, it looks like MPI_ORDER_FORTRAN is only used if I/O is done using MPI (instead of netcdf/pnetcdf) with PIO. Is the configuration different on eos?
FYI, for mpi-serial the PNETCDF paths (PNETCDFROOT/PNETCDF_PATH) should not be set.
It shouldn't hurt to set them; they will be ignored.
With the current code, if the PNETCDF paths are not set, MPIIO is turned off (on blues, for example, MPIIO is turned off with mpi-serial, and we don't hit this error with the mpi-serial testcase). The quick fix might be to turn off MPIIO (by not setting the PNETCDF paths).
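As a rough illustration of the behavior described above (MPIIO is off unless a PNETCDF path is set), here is a minimal Python sketch. It is not the actual CIME or PIO build logic; the function name `mpiio_enabled` and the dictionary-based interface are hypothetical, and only the variable names PNETCDFROOT and PNETCDF_PATH come from this thread.

```python
def mpiio_enabled(env):
    """Return True only if a PNETCDF path variable is set and non-empty.

    Hypothetical helper: it models the rule described in this thread,
    not the real build-system code.
    """
    return any(env.get(name) for name in ("PNETCDFROOT", "PNETCDF_PATH"))
```

Passing `os.environ` as `env` would apply the same check to the current shell environment.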
Thanks Jayesh, you led me right to the problem. There was a typo in my eos xml file that should have prevented the cray-parallel-netcdf module from loading. This test now runs.
Great, so is this ready to be merged (shall I go ahead with testing and merging the PR)?
Yes, it's ready now.
@sarich : I keep getting these errors in my runs (the cime_developer tests build correctly),
It looks like you have hard-coded the value of $PROJECT (to cli112) in the config files. My $PROJECT in env is cli115.
Yeah, forgot about that. That may be fixed now, I'll change it back to $PROJECT and check.
I've updated it to use $PROJECT now. It works if PROJECT is defined as an environment variable, but create_newcase doesn't work if you try to set PROJECT using --project=
ok, thanks
Seems to work for me to use the --project option to create_newcase
OK, it looks like --project is working now, but I'm getting a warning: sarich@eos-ext2:/ccs/proj/cli112/sarich/cime/scripts> ./create_newcase Not able to fully resolve item
That's a bug. Just opened issue #417
@sarich : Can you try running the cime_developer on eos (I merged your branch locally to master and ran the tests) with the "--test-root" option? I am seeing the following error in the TestStatus.log,
and later in TestStatus.log
I don't have a directory $MEMBERWORK/cli115/ERI.f45_g37.X.eos_intel.20160817_160928. I do have $MEMBERWORK/cli115. Is the script unable to resolve $MEMBERWORK?
Try changing $MEMBERWORK to $ENV{MEMBERWORK} in config_machines.xml
Thanks @jedwards4b! After changing $MEMBERWORK to $ENV{MEMBERWORK} in config_machines.xml I no longer see the "$MEMBERWORK" directory in the build directory. Waiting for the job to run.
Use $ENV{VAR} if you know that the variable should be resolved from the environment, and $VAR if it should be resolved from xml.
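The distinction between the two forms can be sketched in a few lines of Python. This is an illustration only, not CIME's actual resolver: the `resolve` function, its arguments, and the sample values in the usage note are hypothetical.

```python
import re

# $ENV{NAME} is looked up in the environment, a bare $NAME in the xml variables.
_PATTERN = re.compile(r"\$ENV\{(\w+)\}|\$(\w+)")

def resolve(text, xml_vars, env):
    """Expand variable references in text, leaving unknown ones untouched."""
    def expand(match):
        env_name, xml_name = match.group(1), match.group(2)
        if env_name is not None:
            # $ENV{NAME}: only the environment is consulted.
            return env.get(env_name, match.group(0))
        # $NAME: only the xml variables are consulted.
        return xml_vars.get(xml_name, match.group(0))
    return _PATTERN.sub(expand, text)
```

Under these assumptions, a path written as `$MEMBERWORK/$PROJECT` keeps a literal `$MEMBERWORK` component when MEMBERWORK exists only in the environment, which matches the stray "$MEMBERWORK" directory reported in this thread, while `$ENV{MEMBERWORK}/$PROJECT` expands fully.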
Thanks. So it looks like it should be $ENV{MEMBERWORK}, right?
PROJECT is an xml variable in env_batch.xml; my PR #419 will make sure that it is found.
Got it, the xml files (like env_batch.xml) in the case/test dir.
Jayesh, I do see this behavior now at runtime; I was looking in the wrong place. I'll
an environment variable
@sarich: Are you able to run the cime_developer test suite successfully? I don't see the $MEMBERWORK directories any more, but I still don't see the status of RUN when running ./cs.status.* .
Jayesh, the cime_developer tests are running successfully for me now. Were your
Ok, I will try again now (first a single test and then the test suite) and see if it goes through.
I ran the following testcase on eos and still don't see the status of the RUN stage. I merged your branch (33f93bf) to master (4953517)
This is the output I get from cs.status.* (No jobs in queue)
This is happening to all your tests? Is there anything in the logs to indicate if it ran or not?
Looks like I have the wrong python version. Let me retry.
Any idea where the python version is picked up from? software_environment.txt in the case directory shows python 2.7.9 but the job output .o* has this error message,
also my shell env has python 2.7.9,
env_mach_specific.xml has
I don't know where that version could come from. The create_test script itself uses what's in the user's environment, but ./create_test SMS_D_Ln9_Mmpi-serial.f19_g16_rx1.A
Does this machine do a purge in its env setup? That could cause the python to become too old.
I don't see a module purge ( ) for the eos machine in config_machines.xml.
Since tests are working for @sarich, I will sit with him tomorrow to sort this out.
Update: @sarich and I are working on debugging this issue. At first glance it looks like a module environment issue, and we use two different shells (@sarich - bash, @jayeshkrishna - tcsh).
Note that there were some issues with csh/tcsh support on Titan that got resolved a day or so ago (problem started at the beginning of the week?) after we complained. You should submit a help ticket; perhaps there is something simple that the OLCF staff can do to resolve this. Also, @mrnorman recently went back to a "module rm" style env_mach_specific from "module purge" because the latter was fragile. You might check what style is used on eos.
I added "module load python/2.7.9" to ".bashrc" (since I use tcsh, this file did not exist on eos for me) and the test is running now (and PASSes)!
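When the login shell and the batch job disagree about the interpreter, as happened here, a small diagnostic run in both places can help locate the difference. The sketch below is not part of CIME; the function names and the `(2, 7)` minimum are assumptions chosen for illustration (the thread only mentions that 2.7.9 is the expected version).

```python
import sys

def describe_python():
    """Return the running interpreter's path and version as one line."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return "python executable: {} version: {}".format(sys.executable, version)

def meets_minimum(version_info, minimum=(2, 7)):
    """Compare (major, minor) against an assumed minimum version."""
    return tuple(version_info[:2]) >= tuple(minimum)

if __name__ == "__main__":
    print(describe_python())
    if not meets_minimum(sys.version_info):
        print("interpreter is older than the assumed minimum")
```

Running it once interactively and once from the batch script would show whether the job picks up a different python than the one reported in software_environment.txt.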
initial configuration xml setup for eos.ccs.ornl.gov
Test suite: Passes cime_developer tests except for
SMS_D_Ln9_Mmpi-serial.f19_g16_rx1.A.eos_intel.20160816_191851
That code is recent, from commit b874c8a
Guessing CIME_MODEL=acme, set environment variable if this is incorrect
ERI.f45_g37.X.eos_intel (Overall: PASS), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
PASS SHAREDLIB_BUILD
PASS MODEL_BUILD
PASS RUN
PASS COMPARE_base_hybrid
PASS COMPARE_base_rest
PASS MEMLEAK
ERR_Ld3.f45_g37_rx1.A.eos_intel (Overall: PASS), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
PASS SHAREDLIB_BUILD
PASS MODEL_BUILD
PASS RUN
PASS MEMLEAK
PASS COMPARE_base_rest
ERS_Ld3.ne30_g16_rx1.A.eos_intel (Overall: PASS), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
PASS SHAREDLIB_BUILD
PASS MODEL_BUILD
ERS_N2_Ld3.f19_g16_rx1.A.eos_intel (Overall: PASS), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
PASS SHAREDLIB_BUILD
PASS MODEL_BUILD
PASS RUN
PASS COMPARE_base_rest
PASS MEMLEAK
NCK_Ld3.f45_g37_rx1.A.eos_intel (Overall: PASS), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
PASS SHAREDLIB_BUILD
PASS MODEL_BUILD
PASS RUN
PASS COMPARE_base_multiinst
PASS MEMLEAK
SEQ_Ln9.f19_g16_rx1.A.eos_intel (Overall: PASS), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
PASS SHAREDLIB_BUILD
PASS MODEL_BUILD
PASS RUN
PASS COMPARE_base_seq
PASS MEMLEAK
SMS_D_Ln9_Mmpi-serial.f19_g16_rx1.A.eos_intel (Overall: FAIL), details:
PASS CREATE_NEWCASE
PASS XML
PASS SETUP
FAIL SHAREDLIB_BUILD
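For long status listings like the one above, a short script can pull out the first failing phase of each test. This is a minimal sketch that assumes the line layout shown here ("NAME (Overall: STATUS), details:" followed by "STATUS PHASE" lines); the function names are hypothetical, the PEND status is an assumption not shown in this listing, and this is not CIME's own status tooling.

```python
import re

_HEADER = re.compile(r"^(\S+) \(Overall: (\w+)\), details:$")
_PHASE = re.compile(r"^(PASS|FAIL|PEND)\s+(\S+)$")

def parse_status(lines):
    """Map each test name to (overall, [(status, phase), ...])."""
    results = {}
    current = None
    for raw in lines:
        line = raw.strip()
        header = _HEADER.match(line)
        if header:
            current = header.group(1)
            results[current] = (header.group(2), [])
            continue
        phase = _PHASE.match(line)
        if phase and current is not None:
            results[current][1].append((phase.group(1), phase.group(2)))
    return results

def first_failure(results):
    """Return {test: phase} for the first non-PASS phase of each test."""
    failures = {}
    for name, (_overall, phases) in results.items():
        for status, phase in phases:
            if status != "PASS":
                failures[name] = phase
                break
    return failures
```

Feeding it the listing above would single out SHAREDLIB_BUILD for the mpi-serial test and report nothing for the tests whose phases all pass.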
Closes #282
Code review: