Implementation of multiple drivers

Implementation of multiple MCT drivers as an option for multi-instance simulations. If multi-instance is enabled, N drivers are run, each with one instance.

Also (changes not directly related to multi-driver):

Changed interface of check_lockedfiles (check_lockedfiles.py) to take a case instead of a caseroot.
Use case.get_env instead of EnvBuild in check_lockedfiles.py.
Changed check_case (case_submit.py) to not take a caseroot input.
Cleaned up memleak testing in _check_for_memleak (system_tests_common.py).
Fixed bad format in build_xcpl_nml (buildnml.py).

Test suite: scripts_regression_tests.py
Test baseline: NA
Test namelist changes: NA
Test status: bit for bit

Fixes: #1704
Fixes: #1714

User interface changes?: new --multi-driver option to create_newcase and _C# modifier to tests
Update gh-pages html (Y/N)?: Y

Code review: @gold2718
ESMCI · Sep 6, 2017 · c7efee1
2 parents 4d9a8d7 + dd12a68

Showing 41 changed files with 779 additions and 386 deletions.
diff --git a/config/config_tests.xml b/config/config_tests.xml
@@ -125,7 +125,6 @@ NCR    multi-instance validation vs single instance - concurrent PE for instance
        do an initial run test with NINST 1 (suffix: base)
        do an initial run test with NINST 2 (suffix: multiinst for both _0001 and _0002)
         compare base and _0001 and _0002
-       (***note that NCR_script and NCK_script are the same - but NCR_build.csh and NCK_build.csh are different***)
 
 NOC    multi-instance validation for single instance ocean (default length)
        do an initial run test with NINST 2 (other than ocn), with mod to instance 1 (suffix: inst1_base, inst2_mod)
@@ -517,6 +516,17 @@ NODEFAIL          Tests restart upon detected node failure. Generates fake failu
     <CONTINUE_RUN>FALSE</CONTINUE_RUN>
   </test>
 
+  <test NAME="MCC">
+    <DESC>multi-driver validation vs single-instance (default length)</DESC>
+    <INFO_DBUG>1</INFO_DBUG>
+    <DOUT_S>FALSE</DOUT_S>
+    <CONTINUE_RUN>FALSE</CONTINUE_RUN>
+    <REST_OPTION>none</REST_OPTION>
+    <HIST_OPTION>$STOP_OPTION</HIST_OPTION>
+    <HIST_N>$STOP_N</HIST_N>
+    <MULTI_DRIVER>TRUE</MULTI_DRIVER>
+  </test>
+
   <test NAME="NCK">
     <DESC>multi-instance validation vs single instance (default length)</DESC>
     <INFO_DBUG>1</INFO_DBUG>

diff --git a/doc/source/users_guide/multi-instance.rst b/doc/source/users_guide/multi-instance.rst
@@ -1,95 +1,109 @@
 .. _multi-instance:
 
-**TODO: Need to update PE elements and explain + and - values**
-
-
 Multi-instance component functionality
 ======================================
 
-The CIME coupling infrastructure is capable of running multiple component instances under one model executable. 
-One caveat: If N multiple instances of any one active component are used, the same number of multiple instances of ALL active components are required. 
-More details are discussed below.
-
-The primary motivation for this development was to be able to run an ensemble Kalman-Filter for data assimilation and parameter estimation (UQ, for example). 
-However, it also provides the ability to run a set of experiments within a single model executable where each instance can have a different namelist, and to have all the output go to one directory. 
-
-An F compset is used in the following example. Using the multiple-instance code involves the following steps:
+The CIME coupling infrastructure is capable of running multiple
+component instances (ensembles) under one model executable.  There are
+two modes of ensemble capability, single driver in which all component
+instances are handled by a single driver/coupler component or
+multi-driver in which each instance includes a separate driver/coupler
+component.  In the multi-driver mode the entire model is duplicated
+for each instance while in the single driver mode only active
+components need be duplicated.  In most cases the multi-driver mode
+will give better performance and should be used.
+
+The primary motivation for this development was to be able to run an
+ensemble Kalman-Filter for data assimilation and parameter estimation
+(UQ, for example).  However, it also provides the ability to run a set
+of experiments within a single model executable where each instance
+can have a different namelist, and to have all the output go to one
+directory.
+
+An F compset is used in the following example. Using the
+multiple-instance code involves the following steps:
 
 1. Create the case.
 ::
 
-   > create_newcase --case Fmulti --compset F --res ne30_g16 
+   > create_newcase --case Fmulti --compset F2000_DEV --res f19_f19_mg17
    > cd Fmulti
 
-2. Assume this is the out-of-the-box pe-layout: 
+2. Assume this is the out-of-the-box pe-layout:
 ::
 
-   NTASKS(ATM)=128, NTHRDS(ATM)=1, ROOTPE(ATM)=0, NINST(ATM)=1
-   NTASKS(LND)=128, NTHRDS(LND)=1, ROOTPE(LND)=0, NINST(LND)=1
-   NTASKS(ICE)=128, NTHRDS(ICE)=1, ROOTPE(ICE)=0, NINST(ICE)=1
-   NTASKS(OCN)=128, NTHRDS(OCN)=1, ROOTPE(OCN)=0, NINST(OCN)=1
-   NTASKS(GLC)=128, NTHRDS(GLC)=1, ROOTPE(GLC)=0, NINST(GLC)=1
-   NTASKS(WAV)=128, NTHRDS(WAV)=1, ROOTPE(WAV)=0, NINST(WAV)=1
-   NTASKS(CPL)=128, NTHRDS(CPL)=1, ROOTPE(CPL)=0
-
-The atm, lnd and rof are active components in this compset. The ocn is a prescribed data component, cice is a mixed prescribed/active component (ice-coverage is prescribed), and glc and wav are stub components.
-
-Let's say we want to run two instances of CAM in this experiment. 
-We will also have to run two instances of CLM, CICE and RTM. 
-However, we can run either one or two instances of DOCN, and we can ignore glc and wav since they do not do anything in this compset as stub components.
-
-To run two instances of CAM, CLM, CICE, RTM and DOCN, invoke the following commands in your **$CASEROOT** directory:
+   Comp  NTASKS  NTHRDS  ROOTPE
+   CPL :    144/     1;      0
+   ATM :    144/     1;      0
+   LND :    144/     1;      0
+   ICE :    144/     1;      0
+   OCN :    144/     1;      0
+   ROF :    144/     1;      0
+   GLC :    144/     1;      0
+   WAV :    144/     1;      0
+   ESP :      1/     1;      0
+
+The atm, lnd, rof and glc are active components in this compset. The ocn is
+a prescribed data component, cice is a mixed prescribed/active
+component (ice-coverage is prescribed), and wav and esp are stub
+components.
+
+Let's say we want to run two instances of CAM in this experiment.  We
+will also have to run two instances of CLM, CICE, RTM and GLC.  However, we
+can run either one or two instances of DOCN, and we can ignore the
+stub components since they do not do anything in this compset.
+
+To run two instances of CAM, CLM, CICE, RTM, GLC and DOCN, invoke the following :ref:`xmlchange<modifying-an-xml-file>` commands in your **$CASEROOT** directory:
 ::
 
    > ./xmlchange NINST_ATM=2
    > ./xmlchange NINST_LND=2
    > ./xmlchange NINST_ICE=2
    > ./xmlchange NINST_ROF=2
+   > ./xmlchange NINST_GLC=2
    > ./xmlchange NINST_OCN=2
 
-As a result, you will have two instances of CAM, CLM and CICE (prescribed), RTM, and DOCN, each running concurrently on 64 MPI tasks.
+As a result, you will have two instances of CAM, CLM and CICE (prescribed), RTM, GLC, and DOCN, each running concurrently on 72 MPI tasks and all using the same driver/coupler component.   In this single driver/coupler mode the number of tasks for each component instance is NTASKS_COMPONENT/NINST_COMPONENT and the total number of tasks is the same as for the single instance case.
+
+Now consider the multi-driver mode.
+To use this mode change
+::
+   > ./xmlchange MULTI_DRIVER=TRUE
 
-**TODO: put in reference to xmlchange".**
+This configuration will run each component instance on the original 144 tasks but will generate two copies of the model (in the same executable) for a total of 288 tasks.
 
 3. Set up the case
 ::
 
    > ./case.setup
 
-A new **user_nl_xxx_NNNN** file (where NNNN is the number of the component instances) is generated when **case.setup** is called. 
+A new **user_nl_xxx_NNNN** file is generated for each component instance when case.setup is called (where xxx is the component type and NNNN is the number of the component instance).
 When calling **case.setup** with the **env_mach_pes.xml** file specifically, these files are created in **$CASEROOT**:
 ::
 
-   user_nl_cam_0001,  user_nl_cam_0002
-   user_nl_cice_0001, user_nl_cice_0002
-   user_nl_clm_0001,  user_nl_clm_0002
-   user_nl_rtm_0001,  user_nl_rtm_0002
-   user_nl_docn_0001, user_nl_docn_0002
+   user_nl_cam_0001 user_nl_clm_0001 user_nl_docn_0001 user_nl_cice_0001
+   user_nl_cism_0001 user_nl_mosart_0001
+   user_nl_cam_0002 user_nl_clm_0002 user_nl_docn_0002 user_nl_cice_0002
+   user_nl_cism_0002 user_nl_mosart_0002
    user_nl_cpl
 
-Also, **case.setup** creates the following ``*_in_*`` files and ``*txt*`` files in **$CASEROOT/CaseDocs**:
-::
-
-   atm_in_0001, atm_in_0002
-   docn.streams.txt.prescribed_0001, docn.streams.txt.prescribed_0002
-   docn_in_0001, docn_in_0002
-   docn_ocn_in_0001, docn_ocn_in_0002
-   drv_flds_in, drv_in
-   ice_in_0001, ice_in_0002
-   lnd_in_0001, lnd_in_0002
-   rof_in_0001, rof_in_0002
-
-The namelist for each component instance can be modified by changing the corresponding **user_nl_xxx_NNNN** file. 
-Modifying **user_nl_cam_0002** will result in your namelist changes being active ONLY for the second instance of CAM. 
+The namelist for each component instance can be modified by changing the corresponding **user_nl_xxx_NNNN** file.
+Modifying **user_nl_cam_0002** will result in your namelist changes being active ONLY for the second instance of CAM.
 To change the DOCN stream txt file for instance 0002, copy **docn.streams.txt.prescribed_0002** to your **$CASEROOT** directory with the name **user_docn.streams.txt.prescribed_0002** and modify it accordingly.
 
 Also keep these important points in mind:
 
+#. These changes can be made at create_newcase time with the option --ninst #, where # is a positive integer. Use the additional logical option --multi-driver to invoke the multi-driver mode.
+
 #. **Multiple component instances can differ ONLY in namelist settings; they ALL use the same model executable.**
 
-#. Multiple-instance implementation supports only one coupler component.
+#. Calling **case.setup** with ``--clean`` *DOES NOT* remove the **user_nl_xxx_NNNN** (where xxx is the component name) files created by **case.setup**.
+
+#. A special variable NINST_LAYOUT is provided for some experimental compsets. Its value should be
+   'concurrent' for all but a few special cases, and it cannot be used if MULTI_DRIVER=TRUE.
+
+#. In **create_test** these options can be invoked with testname modifiers _N# for the single driver mode and _C# for the multi-driver mode.  These options are mutually exclusive and cannot be combined.
 
-#. Calling **case.setup** with ``--clean`` *DOES NOT* remove the **user_nl_xxx_NN** files created by **case.setup**.
+#. In create_newcase you may use --ninst # to set the number of instances and --multi-driver for multi-driver mode.
 
-#. Multiple instances generally should un concurrently, which is the default setting in **env_mach_pes.xml**. 
-   The serial setting is only for EXPERT USERS in upcoming development code implementations.
+#. In multi-driver mode you will always get 1 instance of each component for each driver/coupler. If you change a case using xmlchange MULTI_DRIVER=TRUE, you will get a number of driver/couplers equal to the maximum NINST value over all components.
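
The task arithmetic described in the updated guide (72 tasks per instance in single driver mode, 288 total tasks in multi-driver mode for the 144-task example) can be sketched as follows. This is only an illustration of the numbers in the text: `ensemble_layout` and `driver_count` are hypothetical helpers, not CIME functions, and the sketch assumes NTASKS divides evenly by NINST.

```python
def ensemble_layout(ntasks, ninst, multi_driver=False):
    """Return (tasks_per_instance, total_tasks) for one component.

    Hypothetical helper, not part of CIME. It restates the arithmetic
    from the multi-instance guide and nothing more.
    """
    if ninst < 1:
        raise ValueError("ninst must be a positive integer")
    if multi_driver:
        # Every copy of the model keeps the original NTASKS, and the
        # whole model is duplicated once per instance.
        return ntasks, ntasks * ninst
    # Single driver: instances share the component's tasks.
    # Assumption: NTASKS is evenly divisible by NINST.
    return ntasks // ninst, ntasks


def driver_count(ninst_by_component, multi_driver=False):
    """Number of driver/couplers: 1, or the maximum NINST in multi-driver mode."""
    if not multi_driver:
        return 1
    return max(ninst_by_component.values())


if __name__ == "__main__":
    print(ensemble_layout(144, 2))                     # single driver
    print(ensemble_layout(144, 2, multi_driver=True))  # multi-driver
    print(driver_count({"ATM": 2, "LND": 2, "OCN": 1}, multi_driver=True))
```

For the guide's example, the single driver call gives 72 tasks per instance on 144 total tasks, and the multi-driver call gives 144 tasks per instance on 288 total tasks.
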
diff --git a/scripts/Tools/check_case b/scripts/Tools/check_case
@@ -50,9 +50,8 @@ def _main_func(description):
 
     parse_command_line(sys.argv, description)
 
-    check_lockedfiles()
-
     with Case(read_only=False) as case:
+        check_lockedfiles(case)
         create_namelists(case)
         build_complete = case.get_value("BUILD_COMPLETE")
 

diff --git a/scripts/Tools/check_lockedfiles b/scripts/Tools/check_lockedfiles
@@ -5,6 +5,7 @@ This script compares xml files
 
 from standard_script_setup import *
 from CIME.check_lockedfiles import check_lockedfiles
+from CIME.case import Case
 
 def parse_command_line(args, description):
     parser = argparse.ArgumentParser(
@@ -40,7 +41,8 @@ def _main_func(description):
 
     caseroot = parse_command_line(sys.argv, description)
 
-    check_lockedfiles(caseroot)
+    with Case(case_root=caseroot, read_only=True) as case:
+        check_lockedfiles(case)
 
 if __name__ == "__main__":
     _main_func(__doc__)
diff --git a/scripts/create_newcase b/scripts/create_newcase
@@ -46,9 +46,14 @@ OR
                         help="Specify a compiler. "
                         "To see list of supported compilers for each machine, use the utility query_config in this directory")
 
+    parser.add_argument("--multi-driver",action="store_true",
+                        help="Specify that ninst should modify the number of driver/coupler instances. "
+                        "Default is to have one driver/coupler supporting multiple component instances.")
+
     parser.add_argument("--ninst",default=1,
-                        help="Specify number of component instances"
-                        "Set the number of component instances in the case.")
+                        help="Specify number of model ensemble instances. "
+                        "Default is multiple components and one driver/coupler.  Use --multi-driver to "
+                        "run multiple driver/couplers in the ensemble.")
 
     parser.add_argument("--mpilib", "-mpilib",
                         help="Specify the mpilib. "
@@ -155,8 +160,8 @@ OR
     return args.case, args.compset, args.res, args.machine, args.compiler,\
         args.mpilib, args.project, args.pecount, \
         args.user_mods_dir, args.pesfile, \
-        args.user_grid, args.gridfile, args.srcroot, args.test, args.ninst, \
-        args.walltime, args.queue, args.output_root, args.script_root, \
+        args.user_grid, args.gridfile, args.srcroot, args.test, args.multi_driver, \
+        args.ninst, args.walltime, args.queue, args.output_root, args.script_root, \
         run_unsupported, args.answer, args.input_dir
 
 ###############################################################################
@@ -167,8 +172,8 @@ def _main_func(description):
     casename, compset, grid, machine, compiler, \
         mpilib, project, pecount,  \
         user_mods_dir, pesfile, \
-        user_grid, gridfile, srcroot, test, ninst, walltime, queue, \
-        output_root, script_root, run_unsupported, \
+        user_grid, gridfile, srcroot, test, multi_driver, ninst, walltime, \
+        queue, output_root, script_root, run_unsupported, \
         answer, input_dir = parse_command_line(sys.argv, cimeroot, description)
 
     if script_root is None:
@@ -187,9 +192,11 @@ def _main_func(description):
 
     with Case(caseroot, read_only=False) as case:
         # Configure the Case
-        case.create(casename, srcroot, compset, grid, user_mods_dir=user_mods_dir, machine_name=machine, project=project,
+        case.create(casename, srcroot, compset, grid, user_mods_dir=user_mods_dir,
+                    machine_name=machine, project=project,
                     pecount=pecount, compiler=compiler, mpilib=mpilib,
-                    pesfile=pesfile,user_grid=user_grid, gridfile=gridfile, ninst=ninst, test=test,
+                    pesfile=pesfile,user_grid=user_grid, gridfile=gridfile,
+                    multi_driver=multi_driver, ninst=ninst, test=test,
                     walltime=walltime, queue=queue, output_root=output_root,
                     run_unsupported=run_unsupported, answer=answer,
                     input_dir=input_dir)
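
A minimal, standalone sketch of how the two ensemble options parse, using only `argparse`. The parser here is a reduced stand-in for the one in create_newcase (the help strings are shortened and every other option is omitted). One detail worth noting: because `--ninst` is declared without `type=`, a value given on the command line arrives as a string while the default stays the integer 1. Whether downstream code converts it is not visible in this diff.

```python
import argparse


def build_parser():
    """Reduced stand-in for the create_newcase parser (illustration only)."""
    parser = argparse.ArgumentParser(description="ensemble options sketch")
    parser.add_argument("--multi-driver", action="store_true",
                        help="Run one driver/coupler per ensemble instance.")
    parser.add_argument("--ninst", default=1,
                        help="Number of model ensemble instances.")
    return parser


if __name__ == "__main__":
    parser = build_parser()

    # argparse maps --multi-driver to the attribute name multi_driver.
    args = parser.parse_args(["--ninst", "3", "--multi-driver"])
    print(args.multi_driver, repr(args.ninst))

    # With no options, the defaults apply: single driver, one instance.
    defaults = parser.parse_args([])
    print(defaults.multi_driver, repr(defaults.ninst))
```
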

diff --git a/scripts/lib/CIME/SystemTests/mcc.py b/scripts/lib/CIME/SystemTests/mcc.py
@@ -0,0 +1,34 @@
+"""
+Implementation of CIME MCC test: Compares ensemble methods
+
+This does two runs: In the first we run a three member ensemble using the
+ MULTI_DRIVER capability, then we run a second single instance case and compare
+"""
+from CIME.XML.standard_module_setup import *
+from CIME.SystemTests.system_tests_compare_two import SystemTestsCompareTwo
+from CIME.case_setup import case_setup
+
+logger = logging.getLogger(__name__)
+
+
+class MCC(SystemTestsCompareTwo):
+
+    def __init__(self, case):
+        self._comp_classes = []
+        self._test_instances = 3
+        SystemTestsCompareTwo.__init__(self, case,
+                                       separate_builds = True,
+                                       run_two_suffix = 'single_instance',
+                                       run_two_description = 'single instance',
+                                       run_one_description = 'multi driver')
+
+    def _case_one_setup(self):
+        # The multi-driver case multiplies the total number of tasks by
+        # the number of requested drivers/couplers.
+        self._case.set_value("MULTI_DRIVER",True)
+        self._case.set_value("NINST", self._test_instances)
+        case_setup(self._case, test_mode=False, reset=True)
+
+    def _case_two_setup(self):
+        self._case.set_value("NINST", 1)
+        case_setup(self._case, test_mode=True, reset=True)
diff --git a/scripts/lib/CIME/SystemTests/pre.py b/scripts/lib/CIME/SystemTests/pre.py
@@ -85,26 +85,41 @@ def run_phase(self): # pylint: disable=arguments-differ
         else:
             pause_comps = pause_comps.split(':')
 
+        multi_driver = self._case.get_value("MULTI_DRIVER")
+
         for comp in pause_comps:
+            if comp == "cpl":
+                if multi_driver:
+                    ninst = self._case.get_value("NINST_MAX")
+                else:
+                    ninst = 1
+            else:
+                ninst = self._case.get_value("NINST_{}".format(comp.upper()))
+
             comp_name = self._case.get_value('COMP_{}'.format(comp.upper()))
-            rname = '*.{}.r.*'.format(comp_name)
-            restart_files_1 = glob.glob(os.path.join(rundir1, rname))
-            expect((len(restart_files_1) > 0), "No case1 restart files for {}".format(comp))
-            restart_files_2 = glob.glob(os.path.join(rundir2, rname))
-            expect((len(restart_files_2) > len(restart_files_1)),
-                   "No pause (restart) files found in case2 for {}".format(comp))
-            # Do cprnc of restart files.
-            rfile1 = restart_files_1[len(restart_files_1) - 1]
-            # rfile2 has to match rfile1 (same time string)
-            parts = os.path.basename(rfile1).split(".")
-            glob_str = "*.{}".format(".".join(parts[len(parts)-4:]))
-            restart_files_2 = glob.glob(os.path.join(rundir2, glob_str))
-            expect((len(restart_files_2) == 1),
-                   "Missing case2 restart file, {}", glob_str)
-            rfile2 = restart_files_2[0]
-            ok = cprnc(comp, rfile1, rfile2, self._case, rundir2)[0]
-            logger.warning("CPRNC result for {}: {}".format(os.path.basename(rfile1), "PASS" if (ok == should_match) else "FAIL"))
-            compare_ok = compare_ok and (should_match == ok)
+            for index in range(1,ninst+1):
+                if ninst == 1:
+                    rname = '*.{}.r.*'.format(comp_name)
+                else:
+                    rname = '*.{}_{:04d}.r.*'.format(comp_name, index)
+
+                restart_files_1 = glob.glob(os.path.join(rundir1, rname))
+                expect((len(restart_files_1) > 0), "No case1 restart files for {}".format(comp))
+                restart_files_2 = glob.glob(os.path.join(rundir2, rname))
+                expect((len(restart_files_2) > len(restart_files_1)),
+                       "No pause (restart) files found in case2 for {}".format(comp))
+                # Do cprnc of restart files.
+                rfile1 = restart_files_1[len(restart_files_1) - 1]
+                # rfile2 has to match rfile1 (same time string)
+                parts = os.path.basename(rfile1).split(".")
+                glob_str = "*.{}".format(".".join(parts[len(parts)-4:]))
+                restart_files_2 = glob.glob(os.path.join(rundir2, glob_str))
+                expect((len(restart_files_2) == 1),
+                       "Missing case2 restart file, {}".format(glob_str))
+                rfile2 = restart_files_2[0]
+                ok = cprnc(comp, rfile1, rfile2, self._case, rundir2)[0]
+                logger.warning("CPRNC result for {}: {}".format(os.path.basename(rfile1), "PASS" if (ok == should_match) else "FAIL"))
+                compare_ok = compare_ok and (should_match == ok)
 
         expect(compare_ok,
                "Not all restart files {}".format("matched" if should_match else "failed to match"))
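
The instance handling added to the PRE test can be read in isolation as two small steps: pick the instance count for a paused component, then build one restart-file glob per instance. The sketch below restates that logic without any CIME dependency. `pause_ninst` and `restart_patterns` are hypothetical names, and the plain dictionary stands in for `self._case.get_value`.

```python
import fnmatch


def pause_ninst(comp, values, multi_driver):
    """Instance count for a paused component, mirroring the diff's branches."""
    if comp == "cpl":
        # The coupler only has multiple instances in multi-driver mode.
        return values["NINST_MAX"] if multi_driver else 1
    return values["NINST_{}".format(comp.upper())]


def restart_patterns(comp_name, ninst):
    """One glob per instance; a single instance carries no _NNNN suffix."""
    if ninst == 1:
        return ['*.{}.r.*'.format(comp_name)]
    return ['*.{}_{:04d}.r.*'.format(comp_name, index)
            for index in range(1, ninst + 1)]


if __name__ == "__main__":
    values = {"NINST_MAX": 2, "NINST_ATM": 2}  # illustrative values only
    print(restart_patterns("cam", pause_ninst("atm", values, False)))
    print(restart_patterns("cpl", pause_ninst("cpl", values, False)))
    # The file name below is made up to show what a pattern matches.
    print(fnmatch.fnmatch("case.cam_0002.r.0001-01-02-00000.nc",
                          restart_patterns("cam", 2)[1]))
```
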