Merge cime5.2 changes from acme 03292017 #1287

agsalin · 2017-03-29T23:38:34Z

Pull the ACME repo changes to CIME made since the last subtree split (late January?) through March 29 back into CIME master.

Test suite: scripts_regression_tests on penn pass
Test baseline:
Test namelist changes:
Test status: bit for bit

User interface changes?:

Code review: PLEASE!

Anvil (system name 'anvil') will be used for production runs, so added syslog.anvil script to archive checkpoint data and modified provenance.py to collect Anvil-specific provenance information.

get_timing.py is used to generate the performance summary file acme_timing.$case.$lid from the raw global performance data acme_timing_stats.$lid. The computation of TOT Run Time uses the formula:

    tmax = tmax + wtmax + correction

where

    tmax = self.gettime(' CPL:RUN_LOOP ')[1]
    wtmax = self.gettime(' CPL:TPROF_WRITE ')[1]
    correction = max(0, ocnrunitime - ocnwaittime)

Here tmax is the maximum time any process spends in the RUN loop. wtmax is the maximum time any process spends in the phase where checkpoint timing data is output, including the barrier waiting for all processes to enter this phase. If one component is running on nodes separate from the other components and takes very little time, wtmax will reflect the time that this component spends waiting at the barrier while the other components are still in the RUN loop, so summing tmax and wtmax double counts this time.

This error is not significant during typical production runs, but it does affect short benchmark runs where checkpoint performance data is written frequently, which are the type of runs used to evaluate PE layouts and set performance optimization targets. As such it is important to fix this as soon as possible.

The solution proposed here is to use

    wtmin = self.gettime(' CPL:TPROF_WRITE ')[0]
    tmax = tmax + wtmin + correction

Since CPL:TPROF_WRITE includes barriers before and after the performance data write (t_prf), the minimum will capture the cost of the t_prf call even if the process achieving the minimum is not the one that spends the most time in t_prf.

Note that I do not understand the role of 'correction' in the above formula (it is perhaps extrapolating what the TOT time would be if the OCN were simulated for the same amount of time as the ATM; it is typically a little less), and I do not know who wrote this script. In the cases used to diagnose the 'tmax + wtmax' issue, 'correction' was zero. With the current coupling frequency, I do not expect 'correction' to be very large in any case. It would be worthwhile to query the author as to the intent, but that is distinct from resolving this issue with double counting RUN loop time.
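To illustrate the double counting described above, here is a minimal sketch in Python. It is not the get_timing.py code: the function name `total_run_time`, its parameters, and all timing numbers are hypothetical. The only things taken from the commit message are the two formulas and the convention that index 0 of a timer result is the minimum over processes and index 1 is the maximum.

```python
def total_run_time(run_loop, tprof_write, ocn_run_init, ocn_wait, use_min):
    """Hypothetical stand-in for the TOT Run Time computation.

    run_loop and tprof_write are (min, max) pairs over all processes,
    mirroring the [0] / [1] indexing quoted in the commit message.
    """
    tmax = run_loop[1]                              # slowest process in the RUN loop
    correction = max(0, ocn_run_init - ocn_wait)    # same form as the quoted formula
    # Old formula adds the max of CPL:TPROF_WRITE, new formula adds the min.
    write_cost = tprof_write[0] if use_min else tprof_write[1]
    return tmax + write_cost + correction


# Made-up numbers: one lightly loaded component reaches the barrier early
# and waits 96 s there, while the actual write costs about 2 s.
run_loop = (4.0, 100.0)
tprof_write = (2.0, 98.0)

old_total = total_run_time(run_loop, tprof_write, 0.0, 0.0, use_min=False)
new_total = total_run_time(run_loop, tprof_write, 0.0, 0.0, use_min=True)
print(old_total, new_total)
```

With these invented inputs the old formula reports nearly twice the RUN loop time, because the barrier wait of the idle component overlaps the RUN loop of the others, while the minimum only reflects the write itself.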

This commit makes necessary modifications to support grizzly, a LANL internal machine. It also adds support for the intel compiler.

Fixes incorrect value for node size on grizzly

The TestStatus.log is pretty useless without this info. [BFB]

Rather than create a separate file per component each time syslog.anvil 'wakes' to take a snapshot of application progress, create one file per component and append updates. Also add a file for tracking ROF progress, change names of these files to all have the same suffix ('.step'), and add a file to capture all per simulated day timing information from the CPL log file. Other changes are (a) remove an unused variable, (b) change logic for how long to wait before starting checkpointing to when acme.log file has at least $ncores lines instead of $nnodes lines, and (c) change 'grep -a -i' to 'grep -Fa', for improved efficiency.

Update LANL IC machine support

This PR will fix machine support for LANL IC machines grizzly and wolf that had been broken in the move to CIME5. It also includes the addition of intel support for grizzly. [BFB]

When trying to run acme on cori-haswell or cori-knl we get an error about git not being in the path. To mitigate this I added a module load git to the machines file for cori-*. [BFB]

Support for lawrencium-lr2 and lawrencium-lr3 is updated

Including simplifying our MKL link flags.

* A_WCYCL1850S / ne30_oEC_ICG on 32 nodes at 2.31 SYPD: -pecount S
* A_WCYCL1850S / ne30_oEC_ICG on 59 nodes at 4.17 SYPD: default
* FC5AV1C-04P2 / ne30_ne30 on 115 nodes at 11.32 SYPD: -pecount L

[BFB] - Bit-For-Bit

Also add a flag to report compile times. And keep Mira/Cetus flags identical. [BFB] - Bit-For-Bit

Add full performance data and provenance capture support for Anvil

Anvil (system name 'anvil') will be used for production runs, so added syslog.anvil script to archive checkpoint data and modified provenance.py to collect Anvil-specific provenance information.

As part of this PR, prototyping some new functionality that will eventually be imported into syslog.$mach for the other supported systems (most motivated by comments in the review of this PR):

a) rather than create a separate file per component each time syslog.anvil 'wakes' to take a snapshot of application progress, create one file per component and append updates.
b) add a file for tracking ROF progress
c) change names of these files to all have the same suffix ('.step')
d) add a file to capture all per simulated day timing information from the CPL log file
e) remove an unused variable
f) change logic for how long to wait before starting checkpointing to when acme.log file has at least $ncores lines instead of $nnodes lines
g) change 'grep -a -i' to 'grep -Fa', for improved efficiency.

[BFB]

Add IBM compiler macro

Also add a flag to report compile times. And keep Mira/Cetus settings identical. [BFB] - Bit-For-Bit

- Corrects the module command for language = python
- Remove the duplicate 'account' entry in the run script

…s' (PR #1281) Adds support for LBL Lawrencium cluster [BFB]

Removing modules before we add them to ensure no errors happen.

Add T42 atm, lnd and ocn mask to config_grids.xml for SCM functionality

This includes the cime changes for PR #1252. [BFB]

* bogensch/atm/EUL_SCM_cime:
  Add grid configure for T42 for SCM

[BFB]

@worleyph

Replaces existing generic, poor-performing ne30 PE layouts on Edison with better values. 173 node and 375 node A_WCYCL1850 layouts were created by @worleyph a year or so ago and the 114 node F-compset layout was created by @PeterCaldwell. [BFB]

Changes to use module craype-mic-knl to build/run on KNL nodes of Cori. Including simplifying our MKL link flags on cori-knl only for now. [BFB]

Merge commit '83634512187f89f012c73f3e42e0a1c1dd0b3ab8' into rljacob/cime/uptocime5.2.0

Bring in cime5.2.0 with a git subtree merge --squash

Most conflicts for .py files resolved by using cime5.2.0 version.

Conflicts:
    cime/cime_config/acme/allactive/config_pes.xml
    cime/cime_config/acme/allactive/config_pesall.xml
    cime/cime_config/acme/allactive/testmods_dirs/cam/outfrq9s/shell_commands
    cime/cime_config/acme/allactive/testmods_dirs/cam/outfrq9s/xmlchange_cmnds
    cime/cime_config/acme/allactive/testmods_dirs/force_netcdf_pio/shell_commands
    cime/cime_config/acme/allactive/testmods_dirs/force_netcdf_pio/xmlchange_cmnds
    cime/cime_config/acme/config_grids.xml
    cime/cime_config/acme/machines/Makefile
    cime/cime_config/acme/machines/config_batch.xml
    cime/cime_config/acme/machines/config_build.xml
    cime/cime_config/acme/machines/config_compilers.xml
    cime/cime_config/acme/machines/config_machines.xml
    cime/cime_config/acme/testmods_dirs/allactive/cam/outfrq9s/xmlchange_cmnds
    cime/cime_config/acme/testmods_dirs/allactive/force_netcdf_pio/xmlchange_cmnds
    cime/components/data_comps/datm/cime_config/config_component.xml
    cime/components/data_comps/dlnd/cime_config/config_component.xml
    cime/driver_cpl/cime_config/buildnml
    cime/driver_cpl/cime_config/config_component.xml
    cime/scripts/Tools/taskmaker
    cime/utils/python/CIME/SystemTests/homme.py
    cime/utils/python/CIME/XML/machines.py
    cime/utils/python/CIME/XML/pes.py
    cime/utils/python/CIME/bless_test_results.py
    cime/utils/python/CIME/build.py
    cime/utils/python/CIME/compare_test_results.py
    cime/utils/python/CIME/provenance.py
    cime/utils/python/CIME/task_maker.py
    cime/utils/python/CIME/test_scheduler.py
    cime/utils/python/CIME/utils.py
    cime/utils/python/update_acme_tests.py

ACME is still using the older algorithm in shr_orb_cosz so restore that.

rljacob · 2017-03-30T03:28:22Z

Was this first merged to a maint-cime5.2 branch? I'd like to tag that (5.2.1) before we mingle it with cime5.3 code.

agsalin · 2017-03-30T12:59:19Z

This branch, agsalin/cime52-with-acmesplit-03292017, could be called maint-cime5.2 and the current head tagged 5.2.1.
I checked out the cime5.2.0 tag, created a branch, and merged the ACME code into it.

jgfouca · 2017-03-30T23:08:17Z

scripts/lib/CIME/build.py

-                  debug, compiler, mpilib, complist, ninst_build, smp_value,
-                  model_only):
+def post_build(case, logs):
+###############################################################################


This doesn't look right. It's showing lots of code added that should have already been there.

@agsalin , the only recent change (last two months) to this file on the ACME side was one line:

@@ -292,6 +292,7 @@ def case_build(caseroot, case, sharedlib_only=False, model_only=False):
                                cimeroot, libroot, lid, compiler)
     if not sharedlib_only:
+        os.environ["INSTALL_SHAREDPATH"] = os.path.join(exeroot, sharedpath) # for MPAS makefile generators

I just made a commit to add this one line to the current master. Should fix this.

jgfouca · 2017-03-30T23:10:23Z

scripts/lib/CIME/aprun.py

+                                 pio_numtasks, pio_async_interface,
+                                 compiler, machine, run_exe):
+###############################################################################
+    """


@jedwards4b @mvertens this is the replacement for task_maker.

Two long routines were included twice. Instead of trying to fix the merge, I checked out CIME master of this file and added the 1 MPAS-related line that was the only commit on the ACME side since the 5.2 merge.

jgfouca · 2017-03-31T17:43:29Z

scripts/lib/CIME/BuildTools/configure.py

@@ -37,7 +37,7 @@ def configure(machobj, output_dir, macros_format, compiler, mpilib, debug, sysos
    """
    # Macros generation.
    suffixes = {'Makefile': 'make', 'CMake': 'cmake'}
-    macro_maker = Compilers(machobj)
+    macro_maker = Compilers(machobj, compiler=compiler, mpilib=mpilib)


This was a critical fix for the v1 macro writer.

jgfouca · 2017-03-31T17:47:36Z

scripts/lib/CIME/BuildTools/macrowriterbase.py

@@ -15,6 +15,59 @@
 from CIME.XML.standard_module_setup import *
 logger = logging.getLogger(__name__)

+def _get_components(value):
+    """
+    >>> value = '-something ${shell ${NETCDF_PATH}/bin/nf-config --flibs} -lblas -llapack'


This was a big change; I can't quite remember exactly why it was needed, but shell and environment variables weren't being handled correctly.

jgfouca · 2017-03-31T17:48:09Z

scripts/lib/CIME/XML/compilers.py

@@ -149,7 +149,7 @@ def write_macros_file(self, macros_file="Macros.make", output_format="make", xml
            for compiler_node in reversed(self.compiler_nodes):
                _add_to_macros(compiler_node, macros)
            write_macros_file_v1(macros, self.compiler, self.os,
-                                        self.machine, macros_file="Macros.make",
+                                        self.machine, macros_file=macros_file,


Key bug fix.

jgfouca · 2017-03-31T17:51:31Z

scripts/lib/CIME/test_scheduler.py

-            for _, _, running_phase in threads_in_flight.values():
-                if (running_phase == SHAREDLIB_BUILD_PHASE):
-                    return self._proc_pool + 1
+            if get_model() == "cesm":


ACME still can't handle sharing shared libs.

jgfouca · 2017-03-31T17:51:59Z

scripts/lib/CIME/utils.py

@@ -777,6 +777,78 @@ def compute_total_time(job_cost_map, proc_pool):

    return current_time

+def format_time(time_format, input_format, input_time):
+    """


@mfdeakin added this, maybe he can say why it was needed.

This was needed because CIME was failing with allocation walltimes greater than 23:59:59. This is a limitation of the Python date parser: it doesn't accept values >= 24 in the hours field, instead expecting them to roll over into the days field.
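The parser limitation mentioned above can be reproduced with the standard library. The sketch below is not the format_time implementation added in this PR; the helper names `strptime_accepts`, `walltime_to_seconds` and `seconds_to_walltime` are hypothetical and only illustrate one way to handle an hours field of 24 or more by doing the arithmetic directly.

```python
import time


def strptime_accepts(walltime):
    """Return True if time.strptime can parse walltime as HH:MM:SS."""
    try:
        time.strptime(walltime, "%H:%M:%S")
        return True
    except ValueError:
        # %H only matches 00-23, so larger hour counts are rejected.
        return False


def walltime_to_seconds(walltime):
    """Convert 'HH:MM:SS' to seconds; the hours field may exceed 23."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_walltime(total_seconds):
    """Convert seconds back to 'HH:MM:SS' without rolling over into days."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return "%02d:%02d:%02d" % (hours, minutes, seconds)


print(strptime_accepts("23:59:59"), strptime_accepts("48:00:00"))
print(seconds_to_walltime(walltime_to_seconds("48:00:00")))
```

Splitting on ':' avoids the date parser entirely, which is enough for a fixed HH:MM:SS layout; a general converter between arbitrary formats would need more than this.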

jedwards4b

I think that these changes are minor...

jedwards4b · 2017-04-03T20:23:00Z

scripts/lib/CIME/XML/env_batch.py

@@ -270,6 +269,10 @@ def get_submit_args(self, case, job):
                    if flag == "-n" and rval<= 0:
                        rval = 1

+                    if flag == "-q" and rval == "batch" and case.get_value("MACH") == "blues":


You should be able to do this in xml and not in code.

@jedwards4b I couldn't figure out how to do that in XML. It's a pretty unusual case: the -q option needs to be provided unless the queue is "batch". I don't think we can express that in XML.
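For readers following the thread, the rule being discussed ("provide -q unless the queue is 'batch'") can be sketched as below. This is an assumption-laden illustration, not the get_submit_args code: the diff above is truncated after the condition, so the choice to skip the flag, the function name `build_submit_args`, and its arguments are all hypothetical.

```python
def build_submit_args(flag_values, machine):
    """Join (flag, value) pairs into a submit-argument string.

    Assumption: on 'blues' the '-q' flag is dropped when the queue is
    'batch'; every other flag/value pair is passed through unchanged.
    """
    parts = []
    for flag, value in flag_values:
        if flag == "-q" and value == "batch" and machine == "blues":
            continue  # hypothetical handling of the special case
        parts.append("%s %s" % (flag, value))
    return " ".join(parts)


print(build_submit_args([("-q", "batch"), ("-n", 4)], "blues"))
print(build_submit_args([("-q", "debug"), ("-n", 4)], "blues"))
```

The condition depends on the value of the argument and not only on its name, which is why it is awkward to express as a static per-machine XML entry.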

jedwards4b · 2017-04-03T20:26:33Z

scripts/lib/CIME/case.py

        self.tasks_per_numa = int(math.ceil(self.tasks_per_node / 2.0))
        smt_factor = max(1,int(self.get_value("MAX_TASKS_PER_NODE") / pes_per_node))

        threads_per_node = self.tasks_per_node * self.thread_count
        threads_per_core = 1 if (threads_per_node <= pes_per_node) else smt_factor
        self.cores_per_task = self.thread_count / threads_per_core

-        return total_tasks
+        if self.get_value("MACH") == "titan":


can we check if mpirun executable is aprun instead of hard coding titan?

jedwards4b · 2017-04-03T20:28:04Z

scripts/lib/CIME/case.py


+        # special case for aprun
+        if executable == "aprun":


Use this instead of test for titan above at line 142?

jedwards4b · 2017-04-03T20:29:11Z

scripts/lib/CIME/case_setup.py

@@ -210,7 +210,7 @@ def _case_setup_impl(case, caseroot, clean=False, test_mode=False, reset=False,
                logger.info("Finished testcase.setup")

        # Some tests need namelists created here (ERP) - so do this if are in test mode
-        if test_mode:
+        if test_mode or get_model() == "acme":


It's been a while since I fixed this. I think one of our models had buildnml stuff that depended on it being run at setup time.

jedwards4b · 2017-04-03T20:30:15Z

scripts/lib/CIME/preview_namelists.py

@@ -94,9 +95,10 @@ def create_namelists(case):
            raise

        if do_run_cmd:
-            logger.debug("   Running %s buildnml"%compname)
+            logger.info("   Running %s buildnml"%compname)


too noisy for default in my opinion.

I think I changed this because in the old implementation, all the output from buildnml was completely lost and I had trouble finding something.

jedwards4b · 2017-04-03T20:34:16Z

src/share/util/shr_orb_mod.F90

@@ -78,6 +78,8 @@ real(SHR_KIND_R8) pure FUNCTION shr_orb_cosz(jday,lat,lon,declin,dt_avg)
      shr_orb_cosz =  shr_orb_avg_cosz(jday, lat, lon, declin, dt_avg)
   else
      shr_orb_cosz = sin(lat)*sin(declin) - &
+   !   &              cos(lat)*cos(declin)*cos(jday*2.0_SHR_KIND_R8*pi + lon)


remove commented code?

@rljacob , thoughts?

Yes you can.

jedwards4b · 2017-04-04T14:55:25Z

scripts/lib/CIME/XML/env_mach_specific.py

-                                           check_members=check_members,
-                                           default=arg_node.get("default"))
-                args[arg_node.get("name")] = arg_value
+        if exe_only:


This logic looks wrong - if exe_only is False then args is not assigned either? I think you mean "if not exe_only"?

Thanks, yes

jedwards4b · 2017-04-04T17:20:59Z

@mvertens is on vacation until next week. Do we want another reviewer or do we want to wait?

rljacob · 2017-04-04T18:23:28Z

Last week we agreed we just needed your review.

jgfouca · 2017-04-04T19:15:17Z

@jedwards4b I think we're good. All tests are passing now with a fix for the problem you saw. I'll resolve conflicts and merge.

jedwards4b · 2017-04-04T19:16:27Z

Sounds good - thanks

* master: fix for fortran unit tests update stubs file Set a default value for esmf_logging so it does not have to appear in drv_in. Allow a custom input root through create_test. Fix pylint issues Fixed non-ASCII character in description. Added log kind flag to ESMF_Initialize call

PeterCaldwell and others added 30 commits January 24, 2017 15:22

added 173 and 375 node options for Edison

922150b

made 114 node hyperthreaded the default for ne30 F-compset

45a5327

add full performance data and provenance capture support for Anvil

3320f78

Anvil (system name 'anvil') will be used for production runs, so added syslog.anvil script to archive checkpoint data and modified provenance.py to collect Anvil-specific provenance information.

Fixes support for LANL_IC machine grizzly

2959855

This commit makes necessary modifications to support grizzly, a LANL internal machine. It also adds support for the intel compiler.

Fixes max_tasks_per node

b7fa4de

Fixes incorrect value for node size on grizzly

Add homme.log to TestStatus.log for HOMME test

b237c07

The TestStatus.log is pretty useless without this info. [BFB]

Add grid configure for T42 for SCM

446a939

Removed 114 node F-compset layout that failed SMS-D test

ff3e6d6

Update support for LANL IC machine wolf.

b6da43f

Fix shortname for ne30np4_oEC60to30v3 grid

80883c5

Merge branch 'vanroekel/machines/LANL_update' (PR #1260)

57af7eb

Update LANL IC machine support

This PR will fix machine support for LANL IC machines grizzly and wolf that had been broken in the move to CIME5. It also includes the addition of intel support for grizzly. [BFB]

Added git to modules for cori

57b8f12

When trying to run acme on cori-haswell or cori-knl we get an error about git not being in the path. To mitigate this I added a module load git to the machines file for cori-*. [BFB]

Updates the support for LBL Lawrencium machines

7acca9b

Support for lawrencium-lr2 and lawrencium-lr3 is updated

Changes to use module craype-mic-knl to build/run on KNL nodes of Cori.

ef5e0d3

Including simplifying our MKL link flags.

Add default PE configurations for ne30 A_WCYCL cases

ad10e15

* A_WCYCL1850S / ne30_oEC_ICG on 32 nodes at 2.31 SYPD: -pecount S
* A_WCYCL1850S / ne30_oEC_ICG on 59 nodes at 4.17 SYPD: default
* FC5AV1C-04P2 / ne30_ne30 on 115 nodes at 11.32 SYPD: -pecount L

[BFB] - Bit-For-Bit

Add IBM compiler macro

9fbddba

Also add a flag to report compile times. And keep Mira/Cetus flags identical. [BFB] - Bit-For-Bit

Merge branch 'azamat/mira/add-cpribm-macro' (PR #1277)

c66197d

Add IBM compiler macro

Also add a flag to report compile times. And keep Mira/Cetus settings identical. [BFB] - Bit-For-Bit

Fixes support for LBL Lawrencium machines

5fa630f

- Corrects the module command for language = python
- Remove the duplicate 'account' entry in the run script

Merge branch 'bishtgautam/machinefiles/update-support-for-lbl-machine…

9808efa

…s' (PR #1281) Adds support for LBL Lawrencium cluster [BFB]

Fixing Module issues on titan

1879505

Removing modules before we add them to ensure no errors happen.

Merge branch 'bogensch/atm/EUL_SCM_cime' (PR #1253)

1246f3b

Add T42 atm, lnd and ocn mask to config_grids.xml for SCM functionality

This includes the cime changes for PR #1252. [BFB]

* bogensch/atm/EUL_SCM_cime:
  Add grid configure for T42 for SCM

Pat's config file that fixed runtime issues.

d003d7a

[BFB]

Merge branch 'ndk/machinefiles/cori-knl-craype-mic-knl' (PR #1270)

e554702

Changes to use module craype-mic-knl to build/run on KNL nodes of Cori. Including simplifying our MKL link flags on cori-knl only for now. [BFB]

Restore ACME orbit calculation

6bbed68

ACME is still using the older algorithm in shr_orb_cosz so restore that.

Trying to get acme_developer to work

17ffcd4

agsalin requested a review from jgfouca March 29, 2017 23:38

rljacob added the in progress label Mar 29, 2017

jgfouca requested review from jedwards4b and mvertens March 30, 2017 23:04

jgfouca requested changes Mar 30, 2017

Fix build.py merge issues

c63a4e1

Two long routines were included twice. Instead of trying to fix the merge, I checked out CIME master of this file and added the 1 MPAS-related line that was the only commit on the ACME side since the 5.2 merge.

rljacob assigned jgfouca Mar 31, 2017

jgfouca reviewed Mar 31, 2017

jgfouca mentioned this pull request Apr 2, 2017

First try to support titan restart #1275

Merged

jedwards4b requested changes Apr 3, 2017

Make optimized nodes related to aprun, not titan

d80c428

jedwards4b approved these changes Apr 4, 2017

jgfouca approved these changes Apr 4, 2017

Fix inverted if statement

03001c8

jgfouca merged commit fbf1b0c into master Apr 4, 2017

jgfouca removed the in progress label Apr 4, 2017

rljacob changed the title ~~Merge changes from acme 03292017~~ Merge cime5.2 changes from acme 03292017 Apr 4, 2017

jgfouca deleted the agsalin/merge-from-acme-03292017 branch April 7, 2017 16:58

rljacob mentioned this pull request Apr 22, 2017

Error with walltimemax >= 24:00 #1251

Closed
