Implementation of multiple drivers
Implementation of multiple MCT drivers as an option for multi-instance simulations. If multi-instance is enabled, N drivers are run, each with one instance.

Also (changes not directly related to multi-driver):

    Changed interface of check_lockedfiles (check_lockedfiles.py) to take a case instead of a caseroot.
    Use case.get_env instead of EnvBuild in check_lockedfiles.py
    Changed check_case (case_submit.py) to not take a caseroot input.
    Cleaned up memleak testing in _check_for_memleak (system_tests_common.py)
    Fixed bad format in build_xcpl_nml (buildnml.py)

Test suite: scripts_regression_tests.py
Test baseline: NA
Test namelist changes: NA
Test status: bit for bit
Fixes: #1704
Fixes: #1714

User interface changes?: new --multi-driver option to create_newcase and _C# modifier to tests

Update gh-pages html (Y/N)?: Y

Code review: @gold2718
goldy authored Sep 6, 2017
2 parents 4d9a8d7 + dd12a68 commit c7efee1
Showing 41 changed files with 779 additions and 386 deletions.
12 changes: 11 additions & 1 deletion config/config_tests.xml
@@ -125,7 +125,6 @@ NCR multi-instance validation vs single instance - concurrent PE for instance
do an initial run test with NINST 1 (suffix: base)
do an initial run test with NINST 2 (suffix: multiinst for both _0001 and _0002)
compare base and _0001 and _0002
(***note that NCR_script and NCK_script are the same - but NCR_build.csh and NCK_build.csh are different***)
NOC multi-instance validation for single instance ocean (default length)
do an initial run test with NINST 2 (other than ocn), with mod to instance 1 (suffix: inst1_base, inst2_mod)
@@ -517,6 +516,17 @@ NODEFAIL Tests restart upon detected node failure. Generates fake failu
<CONTINUE_RUN>FALSE</CONTINUE_RUN>
</test>

<test NAME="MCC">
<DESC>multi-driver validation vs single-instance (default length)</DESC>
<INFO_DBUG>1</INFO_DBUG>
<DOUT_S>FALSE</DOUT_S>
<CONTINUE_RUN>FALSE</CONTINUE_RUN>
<REST_OPTION>none</REST_OPTION>
<HIST_OPTION>$STOP_OPTION</HIST_OPTION>
<HIST_N>$STOP_N</HIST_N>
<MULTI_DRIVER>TRUE</MULTI_DRIVER>
</test>

<test NAME="NCK">
<DESC>multi-instance validation vs single instance (default length)</DESC>
<INFO_DBUG>1</INFO_DBUG>
122 changes: 68 additions & 54 deletions doc/source/users_guide/multi-instance.rst
@@ -1,95 +1,109 @@
.. _multi-instance:

**TODO: Need to update PE elements and explain + and - values**


Multi-instance component functionality
======================================

The CIME coupling infrastructure is capable of running multiple component instances under one model executable.
One caveat: If N multiple instances of any one active component are used, the same number of multiple instances of ALL active components are required.
More details are discussed below.

The primary motivation for this development was to be able to run an ensemble Kalman-Filter for data assimilation and parameter estimation (UQ, for example).
However, it also provides the ability to run a set of experiments within a single model executable where each instance can have a different namelist, and to have all the output go to one directory.

An F compset is used in the following example. Using the multiple-instance code involves the following steps:
The CIME coupling infrastructure is capable of running multiple
component instances (ensembles) under one model executable. There are
two modes of ensemble capability, single driver in which all component
instances are handled by a single driver/coupler component or
multi-driver in which each instance includes a separate driver/coupler
component. In the multi-driver mode the entire model is duplicated
for each instance while in the single driver mode only active
components need be duplicated. In most cases the multi-driver mode
will give better performance and should be used.

The primary motivation for this development was to be able to run an
ensemble Kalman-Filter for data assimilation and parameter estimation
(UQ, for example). However, it also provides the ability to run a set
of experiments within a single model executable where each instance
can have a different namelist, and to have all the output go to one
directory.

An F compset is used in the following example. Using the
multiple-instance code involves the following steps:

1. Create the case.
::

> create_newcase --case Fmulti --compset F --res ne30_g16
> create_newcase --case Fmulti --compset F2000_DEV --res f19_f19_mg17
> cd Fmulti

2. Assume this is the out-of-the-box pe-layout:
2. Assume this is the out-of-the-box pe-layout:
::

NTASKS(ATM)=128, NTHRDS(ATM)=1, ROOTPE(ATM)=0, NINST(ATM)=1
NTASKS(LND)=128, NTHRDS(LND)=1, ROOTPE(LND)=0, NINST(LND)=1
NTASKS(ICE)=128, NTHRDS(ICE)=1, ROOTPE(ICE)=0, NINST(ICE)=1
NTASKS(OCN)=128, NTHRDS(OCN)=1, ROOTPE(OCN)=0, NINST(OCN)=1
NTASKS(GLC)=128, NTHRDS(GLC)=1, ROOTPE(GLC)=0, NINST(GLC)=1
NTASKS(WAV)=128, NTHRDS(WAV)=1, ROOTPE(WAV)=0, NINST(WAV)=1
NTASKS(CPL)=128, NTHRDS(CPL)=1, ROOTPE(CPL)=0

The atm, lnd and rof are active components in this compset. The ocn is a prescribed data component, cice is a mixed prescribed/active component (ice-coverage is prescribed), and glc and wav are stub components.

Let's say we want to run two instances of CAM in this experiment.
We will also have to run two instances of CLM, CICE and RTM.
However, we can run either one or two instances of DOCN, and we can ignore glc and wav since they do not do anything in this compset as stub components.

To run two instances of CAM, CLM, CICE, RTM and DOCN, invoke the following commands in your **$CASEROOT** directory:
Comp NTASKS NTHRDS ROOTPE
CPL : 144/ 1; 0
ATM : 144/ 1; 0
LND : 144/ 1; 0
ICE : 144/ 1; 0
OCN : 144/ 1; 0
ROF : 144/ 1; 0
GLC : 144/ 1; 0
WAV : 144/ 1; 0
ESP : 1/ 1; 0

The atm, lnd, rof and glc are active components in this compset. The ocn is
a prescribed data component, cice is a mixed prescribed/active
component (ice-coverage is prescribed), and wav and esp are stub
components.

Let's say we want to run two instances of CAM in this experiment. We
will also have to run two instances of CLM, CICE, RTM and GLC. However, we
can run either one or two instances of DOCN, and we can ignore the
stub components since they do not do anything in this compset.

To run two instances of CAM, CLM, CICE, RTM, GLC and DOCN, invoke the following :ref:`xmlchange<modifying-an-xml-file>` commands in your **$CASEROOT** directory:
::

> ./xmlchange NINST_ATM=2
> ./xmlchange NINST_LND=2
> ./xmlchange NINST_ICE=2
> ./xmlchange NINST_ROF=2
> ./xmlchange NINST_GLC=2
> ./xmlchange NINST_OCN=2

As a result, you will have two instances of CAM, CLM and CICE (prescribed), RTM, and DOCN, each running concurrently on 64 MPI tasks.
As a result, you will have two instances of CAM, CLM and CICE (prescribed), RTM, GLC, and DOCN, each running concurrently on 72 MPI tasks and all using the same driver/coupler component. In this single driver/coupler mode the number of tasks for each component instance is NTASKS_COMPONENT/NINST_COMPONENT and the total number of tasks is the same as for the single instance case.

Now consider the multi-driver mode.
To use this mode, change the MULTI_DRIVER setting:
::
> ./xmlchange MULTI_DRIVER=TRUE

**TODO: put in reference to xmlchange".**
This configuration will run each component instance on the original 144 tasks but will generate two copies of the model (in the same executable) for a total of 288 tasks.
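The task counts in the two modes can be checked with a short sketch. The function names below are hypothetical and are not part of CIME, and integer division is an assumption of this sketch; the arithmetic follows the NTASKS_COMPONENT/NINST_COMPONENT rule and the 144/288 figures given in the text.

```python
def tasks_per_instance(ntasks, ninst, multi_driver):
    """Tasks available to one component instance (hypothetical helper, not a CIME API)."""
    if multi_driver:
        # Multi-driver: every instance keeps the full original task count.
        return ntasks
    # Single driver: the component's tasks are divided among its instances
    # (integer division is an assumption of this sketch).
    return ntasks // ninst


def total_tasks(ntasks, ninst, multi_driver):
    """Total tasks used by the case (hypothetical helper, not a CIME API)."""
    # Multi-driver duplicates the whole model once per instance.
    return ntasks * ninst if multi_driver else ntasks


print(tasks_per_instance(144, 2, multi_driver=False))
print(total_tasks(144, 2, multi_driver=False))
print(total_tasks(144, 2, multi_driver=True))
```

With the 144-task layout and two instances, this gives 72 tasks per instance and 144 tasks in total for the single-driver mode, and 288 tasks in total for the multi-driver mode.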

3. Set up the case
::

> ./case.setup

A new **user_nl_xxx_NNNN** file (where NNNN is the number of the component instances) is generated when **case.setup** is called.
A new **user_nl_xxx_NNNN** file is generated for each component instance when case.setup is called (where xxx is the component type and NNNN is the number of the component instance).
With the **env_mach_pes.xml** settings above, **case.setup** creates these files in **$CASEROOT**:
::

user_nl_cam_0001, user_nl_cam_0002
user_nl_cice_0001, user_nl_cice_0002
user_nl_clm_0001, user_nl_clm_0002
user_nl_rtm_0001, user_nl_rtm_0002
user_nl_docn_0001, user_nl_docn_0002
user_nl_cam_0001 user_nl_clm_0001 user_nl_docn_0001 user_nl_cice_0001
user_nl_cism_0001 user_nl_mosart_0001
user_nl_cam_0002 user_nl_clm_0002 user_nl_docn_0002 user_nl_cice_0002
user_nl_cism_0002 user_nl_mosart_0002
user_nl_cpl
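The naming pattern of these files can be sketched as follows. The helper name is hypothetical and not a CIME function, and the rule that a single instance gets no numeric suffix is an assumption inferred from the unsuffixed **user_nl_cpl** in the listing.

```python
def user_nl_names(component, ninst):
    """Return the expected user_nl file names for one component.

    Hypothetical helper for illustration only; not part of CIME.
    """
    if ninst == 1:
        # Assumption: a single instance has no _NNNN suffix.
        return ["user_nl_{}".format(component)]
    # Instances are numbered from 0001 with four zero-padded digits.
    return ["user_nl_{}_{:04d}".format(component, index)
            for index in range(1, ninst + 1)]


for comp in ("cam", "clm", "docn", "cice", "cism", "mosart"):
    print(" ".join(user_nl_names(comp, 2)))
print(" ".join(user_nl_names("cpl", 1)))
```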

Also, **case.setup** creates the following ``*_in_*`` files and ``*txt*`` files in **$CASEROOT/CaseDocs**:
::

atm_in_0001, atm_in_0002
docn.streams.txt.prescribed_0001, docn.streams.txt.prescribed_0002
docn_in_0001, docn_in_0002
docn_ocn_in_0001, docn_ocn_in_0002
drv_flds_in, drv_in
ice_in_0001, ice_in_0002
lnd_in_0001, lnd_in_0002
rof_in_0001, rof_in_0002

The namelist for each component instance can be modified by changing the corresponding **user_nl_xxx_NNNN** file.
Modifying **user_nl_cam_0002** will result in your namelist changes being active ONLY for the second instance of CAM.
The namelist for each component instance can be modified by changing the corresponding **user_nl_xxx_NNNN** file.
Modifying **user_nl_cam_0002** will result in your namelist changes being active ONLY for the second instance of CAM.
To change the DOCN stream txt file for instance 0002, copy **docn.streams.txt.prescribed_0002** to your **$CASEROOT** directory with the name **user_docn.streams.txt.prescribed_0002** and modify it accordingly.

Also keep these important points in mind:

#. Note that these changes can be made at create_newcase time with the option --ninst #, where # is a positive integer. Use the additional logical option --multi-driver to invoke the multi-driver mode.

#. **Multiple component instances can differ ONLY in namelist settings; they ALL use the same model executable.**

#. Multiple-instance implementation supports only one coupler component.
#. Calling **case.setup** with ``--clean`` *DOES NOT* remove the **user_nl_xxx_NN** (where xxx is the component name) files created by **case.setup**.

#. A special variable NINST_LAYOUT is provided for some experimental compsets. Its value should be
'concurrent' for all but a few special cases, and it cannot be used if MULTI_DRIVER=TRUE.

#. In **create_test** these options can be invoked with testname modifiers _N# for the single driver mode and _C# for the multi-driver mode. These options are mutually exclusive; they cannot be combined.

#. Calling **case.setup** with ``--clean`` *DOES NOT* remove the **user_nl_xxx_NNNN** files created by **case.setup**.
#. In create_newcase you may use --ninst # to set the number of instances and --multi-driver for multi-driver mode.

#. Multiple instances generally should run concurrently, which is the default setting in **env_mach_pes.xml**.
The serial setting is only for EXPERT USERS in upcoming development code implementations.
#. In multi-driver mode you will always get 1 instance of each component for each driver/coupler. If you change a case using xmlchange MULTI_DRIVER=TRUE, you will get a number of driver/couplers equal to the maximum NINST value over all components.
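
The rule for the number of driver/couplers can be illustrated with a small sketch. The function name and the dictionary layout are hypothetical and not part of CIME; the logic only restates the rule above: one driver in single-driver mode, otherwise the maximum NINST value over all components.

```python
def driver_count(ninst_by_component, multi_driver):
    """Number of driver/coupler instances (hypothetical helper, not a CIME API)."""
    if not multi_driver:
        # Single-driver mode: one driver/coupler serves every component instance.
        return 1
    # Multi-driver mode: one driver/coupler per instance, set by the largest NINST.
    return max(ninst_by_component.values())


# Hypothetical per-component NINST values for illustration.
ninst = {"ATM": 2, "LND": 2, "ICE": 2, "ROF": 2, "GLC": 2, "OCN": 1}
print(driver_count(ninst, multi_driver=False))
print(driver_count(ninst, multi_driver=True))
```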
3 changes: 1 addition & 2 deletions scripts/Tools/check_case
@@ -50,9 +50,8 @@ def _main_func(description):

parse_command_line(sys.argv, description)

check_lockedfiles()

with Case(read_only=False) as case:
check_lockedfiles(case)
create_namelists(case)
build_complete = case.get_value("BUILD_COMPLETE")

4 changes: 3 additions & 1 deletion scripts/Tools/check_lockedfiles
@@ -5,6 +5,7 @@ This script compares xml files

from standard_script_setup import *
from CIME.check_lockedfiles import check_lockedfiles
from CIME.case import Case

def parse_command_line(args, description):
parser = argparse.ArgumentParser(
@@ -40,7 +41,8 @@ def _main_func(description):

caseroot = parse_command_line(sys.argv, description)

check_lockedfiles(caseroot)
with Case(case_root=caseroot, read_only=True) as case:
check_lockedfiles(case)

if __name__ == "__main__":
_main_func(__doc__)
23 changes: 15 additions & 8 deletions scripts/create_newcase
@@ -46,9 +46,14 @@ OR
help="Specify a compiler. "
"To see list of supported compilers for each machine, use the utility query_config in this directory")

parser.add_argument("--multi-driver",action="store_true",
help="Specify that ninst should modify number of driver/coupler instances "
"default is to have one driver/coupler supporting multiple component instances.")

parser.add_argument("--ninst",default=1,
help="Specify number of component instances"
"Set the number of component instances in the case.")
help="Specify number of model ensemble instances. "
"Default is multiple components and one driver/coupler. Use --multi-driver to "
"run multiple driver/couplers in the ensemble.")

parser.add_argument("--mpilib", "-mpilib",
help="Specify the mpilib. "
@@ -155,8 +160,8 @@ OR
return args.case, args.compset, args.res, args.machine, args.compiler,\
args.mpilib, args.project, args.pecount, \
args.user_mods_dir, args.pesfile, \
args.user_grid, args.gridfile, args.srcroot, args.test, args.ninst, \
args.walltime, args.queue, args.output_root, args.script_root, \
args.user_grid, args.gridfile, args.srcroot, args.test, args.multi_driver, \
args.ninst, args.walltime, args.queue, args.output_root, args.script_root, \
run_unsupported, args.answer, args.input_dir

###############################################################################
@@ -167,8 +172,8 @@ def _main_func(description):
casename, compset, grid, machine, compiler, \
mpilib, project, pecount, \
user_mods_dir, pesfile, \
user_grid, gridfile, srcroot, test, ninst, walltime, queue, \
output_root, script_root, run_unsupported, \
user_grid, gridfile, srcroot, test, multi_driver, ninst, walltime, \
queue, output_root, script_root, run_unsupported, \
answer, input_dir = parse_command_line(sys.argv, cimeroot, description)

if script_root is None:
@@ -187,9 +192,11 @@ def _main_func(description):

with Case(caseroot, read_only=False) as case:
# Configure the Case
case.create(casename, srcroot, compset, grid, user_mods_dir=user_mods_dir, machine_name=machine, project=project,
case.create(casename, srcroot, compset, grid, user_mods_dir=user_mods_dir,
machine_name=machine, project=project,
pecount=pecount, compiler=compiler, mpilib=mpilib,
pesfile=pesfile,user_grid=user_grid, gridfile=gridfile, ninst=ninst, test=test,
pesfile=pesfile,user_grid=user_grid, gridfile=gridfile,
multi_driver=multi_driver, ninst=ninst, test=test,
walltime=walltime, queue=queue, output_root=output_root,
run_unsupported=run_unsupported, answer=answer,
input_dir=input_dir)
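
How the two new options behave on the command line can be seen in a stripped-down sketch. This is not the real create_newcase parser: `build_parser` is a hypothetical name, only the two options are reproduced, and `type=int` is an assumption added here (the diff declares `--ninst` with `default=1` and no `type`, so a value given on the command line would arrive as a string).

```python
import argparse


def build_parser():
    """Minimal stand-in for the two options added to create_newcase (illustrative only)."""
    parser = argparse.ArgumentParser(description="multi-driver option sketch")
    # store_true: the flag is False unless --multi-driver is given.
    parser.add_argument("--multi-driver", action="store_true",
                        help="Run one driver/coupler per ensemble instance.")
    # type=int is an assumption of this sketch, not taken from the diff.
    parser.add_argument("--ninst", default=1, type=int,
                        help="Number of model ensemble instances.")
    return parser


# argparse maps --multi-driver to the attribute name multi_driver.
args = build_parser().parse_args(["--ninst", "3", "--multi-driver"])
print(args.multi_driver, args.ninst)
```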
34 changes: 34 additions & 0 deletions scripts/lib/CIME/SystemTests/mcc.py
@@ -0,0 +1,34 @@
"""
Implementation of CIME MCC test: Compares ensemble methods
This does two runs: In the first we run a three member ensemble using the
MULTI_DRIVER capability, then we run a second single instance case and compare
"""
from CIME.XML.standard_module_setup import *
from CIME.SystemTests.system_tests_compare_two import SystemTestsCompareTwo
from CIME.case_setup import case_setup

logger = logging.getLogger(__name__)


class MCC(SystemTestsCompareTwo):

def __init__(self, case):
self._comp_classes = []
self._test_instances = 3
SystemTestsCompareTwo.__init__(self, case,
separate_builds = True,
run_two_suffix = 'single_instance',
run_two_description = 'single instance',
run_one_description = 'multi driver')

def _case_one_setup(self):
# The multi-driver case multiplies the number of tasks by the
# number of requested drivers/couplers.
self._case.set_value("MULTI_DRIVER",True)
self._case.set_value("NINST", self._test_instances)
case_setup(self._case, test_mode=False, reset=True)

def _case_two_setup(self):
self._case.set_value("NINST", 1)
case_setup(self._case, test_mode=True, reset=True)
51 changes: 33 additions & 18 deletions scripts/lib/CIME/SystemTests/pre.py
@@ -85,26 +85,41 @@ def run_phase(self): # pylint: disable=arguments-differ
else:
pause_comps = pause_comps.split(':')

multi_driver = self._case.get_value("MULTI_DRIVER")

for comp in pause_comps:
if comp == "cpl":
if multi_driver:
ninst = self._case.get_value("NINST_MAX")
else:
ninst = 1
else:
ninst = self._case.get_value("NINST_{}".format(comp.upper()))

comp_name = self._case.get_value('COMP_{}'.format(comp.upper()))
rname = '*.{}.r.*'.format(comp_name)
restart_files_1 = glob.glob(os.path.join(rundir1, rname))
expect((len(restart_files_1) > 0), "No case1 restart files for {}".format(comp))
restart_files_2 = glob.glob(os.path.join(rundir2, rname))
expect((len(restart_files_2) > len(restart_files_1)),
"No pause (restart) files found in case2 for {}".format(comp))
# Do cprnc of restart files.
rfile1 = restart_files_1[len(restart_files_1) - 1]
# rfile2 has to match rfile1 (same time string)
parts = os.path.basename(rfile1).split(".")
glob_str = "*.{}".format(".".join(parts[len(parts)-4:]))
restart_files_2 = glob.glob(os.path.join(rundir2, glob_str))
expect((len(restart_files_2) == 1),
"Missing case2 restart file, {}".format(glob_str))
rfile2 = restart_files_2[0]
ok = cprnc(comp, rfile1, rfile2, self._case, rundir2)[0]
logger.warning("CPRNC result for {}: {}".format(os.path.basename(rfile1), "PASS" if (ok == should_match) else "FAIL"))
compare_ok = compare_ok and (should_match == ok)
for index in range(1,ninst+1):
if ninst == 1:
rname = '*.{}.r.*'.format(comp_name)
else:
rname = '*.{}_{:04d}.r.*'.format(comp_name, index)

restart_files_1 = glob.glob(os.path.join(rundir1, rname))
expect((len(restart_files_1) > 0), "No case1 restart files for {}".format(comp))
restart_files_2 = glob.glob(os.path.join(rundir2, rname))
expect((len(restart_files_2) > len(restart_files_1)),
"No pause (restart) files found in case2 for {}".format(comp))
# Do cprnc of restart files.
rfile1 = restart_files_1[len(restart_files_1) - 1]
# rfile2 has to match rfile1 (same time string)
parts = os.path.basename(rfile1).split(".")
glob_str = "*.{}".format(".".join(parts[len(parts)-4:]))
restart_files_2 = glob.glob(os.path.join(rundir2, glob_str))
expect((len(restart_files_2) == 1),
"Missing case2 restart file, {}".format(glob_str))
rfile2 = restart_files_2[0]
ok = cprnc(comp, rfile1, rfile2, self._case, rundir2)[0]
logger.warning("CPRNC result for {}: {}".format(os.path.basename(rfile1), "PASS" if (ok == should_match) else "FAIL"))
compare_ok = compare_ok and (should_match == ok)

expect(compare_ok,
"Not all restart files {}".format("matched" if should_match else "failed to match"))
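
The per-instance restart-file patterns built in the loop above can be tried in isolation. The sketch below lifts only the pattern construction into a hypothetical helper (`restart_patterns` is not part of CIME) and checks it against made-up file names with `fnmatch`, which uses the same wildcard rules as `glob`.

```python
import fnmatch


def restart_patterns(comp_name, ninst):
    """Glob patterns for a component's restart files, one per instance (illustrative)."""
    if ninst == 1:
        # A single instance carries no _NNNN suffix in the file name.
        return ['*.{}.r.*'.format(comp_name)]
    # Multiple instances are numbered from 0001 with four zero-padded digits.
    return ['*.{}_{:04d}.r.*'.format(comp_name, index)
            for index in range(1, ninst + 1)]


# Made-up restart file names for illustration only.
files = ["case.cam_0001.r.0001-01-02-00000.nc",
         "case.cam_0002.r.0001-01-02-00000.nc"]
for pattern in restart_patterns("cam", 2):
    print(pattern, fnmatch.filter(files, pattern))
```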