Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge cime5.2 changes from acme 03292017 #1287

Merged
merged 93 commits into from
Apr 4, 2017

Conversation

agsalin
Copy link
Contributor

@agsalin agsalin commented Mar 29, 2017

Pull back ACME repo changes of CIME since last subtree split (late January?) up to March 29 back into CIME master.

Test suite: scripts_regression_tests on penn pass
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #1271
Fixes #1247
Fixes #1251

User interface changes?:

Code review: PLEASE!

PeterCaldwell and others added 30 commits January 24, 2017 15:22
Anvil (system name 'anvil') will be used for production runs,
so added syslog.anvil script to archive checkpoint data and
modified provenance.py to collect Anvil-specific provenance
information.
get_timing.py is used to generate the performance summary
file acme_timing.$case.$lid from the raw global performance data
acme_timing_stats.$lid . The computation of TOT Run Time uses
the formula:

        tmax = tmax + wtmax + correction

where

        tmax  = self.gettime(' CPL:RUN_LOOP ')[1]
        wtmax = self.gettime(' CPL:TPROF_WRITE ')[1]
        correction = max(0, ocnrunitime - ocnwaittime)

Here tmax is the maximum time any process spends in the RUN
loop. wtmax is the maximum time any process spends in the phase where
checkpoint timing data is output, including the barrier waiting for
all processes to enter this phase. If one component is running on
nodes separate from the other components and takes very little time,
wtmax will reflect the time that this component is waiting at the
barrier while the other components are in the RUN loop, double
counting this time after they are summed. This error is not
significant during typical production runs, but it does affect short
benchmark runs where checkpoint  performance data is written
frequently, which is the type of runs used to evaluate PE layouts and
set performance optimization targets. As such it is important to fix
this as soon as possible.

The solution proposed here is to use

        wtmin = self.gettime(' CPL:TPROF_WRITE ')[0]
        tmax = tmax + wtmin + correction

Since CPL:TPROF_WRITE includes barriers before and after the performance
data write (t_prf), the minimum will capture the cost of the t_prf call
even if the process achieving the minimum is not the one that spends
the most time in t_prf.

Note that I do not understand the role of 'correction' in the above
formula - it is perhaps  extrapolating what the TOT time would be if
the OCN is simulated the same  amount of time as the ATM (it is
typically a little less), and do not know who wrote this script. In
the cases used to diagnose the 'tmax + wtmax' issue, 'correction' was
zero. With the current coupling frequency, I do not expect
'correction' to be very large in any case, but it would be worth while
querying the author as to the intent, but that is distinct from
resolving this issue with double counting RUN loop time.
This commit makes necessary modifications to support grizzly, a LANL
internal machine.  It also adds support for the intel compiler.
Fixes incorrect value for node size on grizzly
The TestStatus.log is pretty useless without this info.

[BFB]
Rather than create a separate file per component each
time syslog.anvil 'wakes' to take a snapshot of application
progress, create one file per component and append
updates. Also add a file for tracking ROF progress, change
names of these files to all have the same suffix ('.step'),
and add a file to capture all per simulated day timing
information from the CPL log file.

Other changes are (a) remove an unused variable, (b)
change logic for how long to wait before starting checkpointing
to when acme.log file has at least $ncores lines instead of
$nnodes lines, and (c) change 'grep -a -i' to 'grep -Fa', for
improved efficiency.
Update LANL IC machine support

This PR will fix machine support for LANL IC machines grizzly and wolf that had
been broken in the move to CIME5. It also includes the addition of intel support
for grizzly.

[BFB]
When trying to run acme on cori-haswell or cori-knl we get an error about git
not being in the path. To mitigate this I added a module load git to the
machines file for cori-*.

[BFB]
Support for lawrencium-lr2 and lawrencium-lr3 is
updated
* A_WCYCL1850S / ne30_oEC_ICG on 32 nodes at 2.31 SYPD: -pecount S
* A_WCYCL1850S / ne30_oEC_ICG on 59 nodes at 4.17 SYPD: default
* FC5AV1C-04P2 / ne30_ne30 on 115 nodes at 11.32 SYPD: -pecount L

[BFB] - Bit-For-Bit
Also add a flag to report compile times.
And keep Mira/Cetus flags identical.

[BFB] - Bit-For-Bit
Add full performance data and provenance capture support for Anvil

Anvil (system name 'anvil') will be used for production runs,
so added syslog.anvil script to archive checkpoint data and
modified provenance.py to collect Anvil-specific provenance
information.

As part of this PR, prototyping some new functionality that
will eventually be imported into syslog.$mach for the other
supported systems (most motivated by comments in the
review of this PR):

 a) rather than create a separate file per component each
    time syslog.anvil 'wakes' to take a snapshot of application
    progress, create one file per component and append updates.
 b) add a file for tracking ROF progress
 c) change names of these files to all have the same suffix ('.step')
 d) add a file to capture all per simulated day timing
    information from the CPL log file
 e) remove an unused variable
 f) change logic for how long to wait before starting checkpointing
    to when acme.log file has at least $ncores lines instead of
    $nnodes lines
 g) change 'grep -a -i' to 'grep -Fa', for improved efficiency.

[BFB]
Add IBM compiler macro
Also add a flag to report compile times.
And keep Mira/Cetus settings identical.

[BFB] - Bit-For-Bit
- Corrects the module command for language = python
- Remove the duplicate 'account' entry in the run script
…s' (PR #1281)

Adds support for LBL Lawrencium cluster

[BFB]
Removing modules before we add them to ensure no errors happen.
Add T42 atm, lnd and ocn mask to config_grids.xml for SCM funcationality

This includes the cime changes for PR #1252.

[BFB]

* bogensch/atm/EUL_SCM_cime:
  Add grid configure for T42 for SCM
Replaces existing generic, poor-performing ne30 PE layouts on Edison
with better values. 173 node and 375 node A_WCYCL1850 layouts were
created by @worleyph a year or so ago and the 114 node F-compset
layout was created by @PeterCaldwell.

[BFB]
Changes to use module craype-mic-knl to build/run on KNL nodes of Cori.
Including simplifying our MKL link flags on cori-knl only for now.

[BFB]
Merge commit '83634512187f89f012c73f3e42e0a1c1dd0b3ab8' into rljacob/cime/uptocime5.2.0

Bring in cime5.2.0 with a git subtree merge --squash

Most conflicts for .py files resolved by using cime5.2.0 version.

Conflicts:
	cime/cime_config/acme/allactive/config_pes.xml
	cime/cime_config/acme/allactive/config_pesall.xml
	cime/cime_config/acme/allactive/testmods_dirs/cam/outfrq9s/shell_commands
	cime/cime_config/acme/allactive/testmods_dirs/cam/outfrq9s/xmlchange_cmnds
	cime/cime_config/acme/allactive/testmods_dirs/force_netcdf_pio/shell_commands
	cime/cime_config/acme/allactive/testmods_dirs/force_netcdf_pio/xmlchange_cmnds
	cime/cime_config/acme/config_grids.xml
	cime/cime_config/acme/machines/Makefile
	cime/cime_config/acme/machines/config_batch.xml
	cime/cime_config/acme/machines/config_build.xml
	cime/cime_config/acme/machines/config_compilers.xml
	cime/cime_config/acme/machines/config_machines.xml
	cime/cime_config/acme/testmods_dirs/allactive/cam/outfrq9s/xmlchange_cmnds
	cime/cime_config/acme/testmods_dirs/allactive/force_netcdf_pio/xmlchange_cmnds
	cime/components/data_comps/datm/cime_config/config_component.xml
	cime/components/data_comps/dlnd/cime_config/config_component.xml
	cime/driver_cpl/cime_config/buildnml
	cime/driver_cpl/cime_config/config_component.xml
	cime/scripts/Tools/taskmaker
	cime/utils/python/CIME/SystemTests/homme.py
	cime/utils/python/CIME/XML/machines.py
	cime/utils/python/CIME/XML/pes.py
	cime/utils/python/CIME/bless_test_results.py
	cime/utils/python/CIME/build.py
	cime/utils/python/CIME/compare_test_results.py
	cime/utils/python/CIME/provenance.py
	cime/utils/python/CIME/task_maker.py
	cime/utils/python/CIME/test_scheduler.py
	cime/utils/python/CIME/utils.py
	cime/utils/python/update_acme_tests.py
ACME is still using the older algorithm in
shr_orb_cosz so restore that.
@agsalin agsalin requested a review from jgfouca March 29, 2017 23:38
@rljacob
Copy link
Member

rljacob commented Mar 30, 2017

Was this first merged to a maint-cime5.2 branch? I'd like to tag that (5.2.1) before we mingle it with cime5.3 code.

@agsalin
Copy link
Contributor Author

agsalin commented Mar 30, 2017

This branch agsalin/cime52-with-acmesplit-03292017 could be called maint-cime5.2 and the current head tagged 5.2.1.
I checked out the cime5.2.0 tag, created a branch, and merged in ACME code into it.

@jgfouca jgfouca requested review from jedwards4b and mvertens March 30, 2017 23:04
debug, compiler, mpilib, complist, ninst_build, smp_value,
model_only):
def post_build(case, logs):
###############################################################################
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right. It's showing lots of code added that should have already been there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agsalin , the only recent change (last two months) to this file on the ACME side was one line:

@@ -292,6 +292,7 @@ def case_build(caseroot, case, sharedlib_only=False, model_only=False):
                                cimeroot, libroot, lid, compiler)
 
     if not sharedlib_only:
+        os.environ["INSTALL_SHAREDPATH"] = os.path.join(exeroot, sharedpath) # for MPAS makefile generators

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just made a commit to add this one line to the current master. Should fix this.

pio_numtasks, pio_async_interface,
compiler, machine, run_exe):
###############################################################################
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jedwards4b @mvertens this is the replacement for task_maker.

Two long routines were included twice.

Instead of trying to fix the merge,
I checked out CIME master of this file and
added the 1 MPAS-related line that was the
only commit on the ACME side since the
5.2 merge.
@@ -37,7 +37,7 @@ def configure(machobj, output_dir, macros_format, compiler, mpilib, debug, sysos
"""
# Macros generation.
suffixes = {'Makefile': 'make', 'CMake': 'cmake'}
macro_maker = Compilers(machobj)
macro_maker = Compilers(machobj, compiler=compiler, mpilib=mpilib)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a critical fix for the v1 macro writer.

@@ -15,6 +15,59 @@
from CIME.XML.standard_module_setup import *
logger = logging.getLogger(__name__)

def _get_components(value):
"""
>>> value = '-something ${shell ${NETCDF_PATH}/bin/nf-config --flibs} -lblas -llapack'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a big change that I can't quite remember why it was needed. shell and environment variables weren't being handled correctly.

@@ -149,7 +149,7 @@ def write_macros_file(self, macros_file="Macros.make", output_format="make", xml
for compiler_node in reversed(self.compiler_nodes):
_add_to_macros(compiler_node, macros)
write_macros_file_v1(macros, self.compiler, self.os,
self.machine, macros_file="Macros.make",
self.machine, macros_file=macros_file,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key bug fix.

for _, _, running_phase in threads_in_flight.values():
if (running_phase == SHAREDLIB_BUILD_PHASE):
return self._proc_pool + 1
if get_model() == "cesm":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACME still can't handle sharing shared libs.

@@ -777,6 +777,78 @@ def compute_total_time(job_cost_map, proc_pool):

return current_time

def format_time(time_format, input_format, input_time):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mfdeakin added this, maybe he can say why it was needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was needed because CIME was failing with allocation walltimes greater than 23:59:59. This was a limitation of the Python date parser; it doesn't accept >=24 in the hours field; instead expecting it to roll over in to the days field.

Copy link
Contributor

@jedwards4b jedwards4b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that these changes are minor...

@@ -270,6 +269,10 @@ def get_submit_args(self, case, job):
if flag == "-n" and rval<= 0:
rval = 1

if flag == "-q" and rval == "batch" and case.get_value("MACH") == "blues":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to do this in xml and not in code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jedwards4b I couldn't figure out how to do that in XML. It's a pretty unusual case: the -q option needs to be provided unless the queue is "batch". I don't think we can express that in XML.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

self.tasks_per_numa = int(math.ceil(self.tasks_per_node / 2.0))
smt_factor = max(1,int(self.get_value("MAX_TASKS_PER_NODE") / pes_per_node))

threads_per_node = self.tasks_per_node * self.thread_count
threads_per_core = 1 if (threads_per_node <= pes_per_node) else smt_factor
self.cores_per_task = self.thread_count / threads_per_core

return total_tasks
if self.get_value("MACH") == "titan":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we check if mpirun executable is aprun instead of hard coding titan?


# special case for aprun
if executable == "aprun":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use this instead of test for titan above at line 142?

@@ -210,7 +210,7 @@ def _case_setup_impl(case, caseroot, clean=False, test_mode=False, reset=False,
logger.info("Finished testcase.setup")

# Some tests need namelists created here (ERP) - so do this if are in test mode
if test_mode:
if test_mode or get_model() == "acme":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been a while since I fixed this. I think one of our models had buildnml stuff that depended on it being run at setup time.

@@ -94,9 +95,10 @@ def create_namelists(case):
raise

if do_run_cmd:
logger.debug(" Running %s buildnml"%compname)
logger.info(" Running %s buildnml"%compname)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too noisy for default in my opinion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I changed this because in the old implementation, all the output from buildnml was completely lost and I had trouble finding something.

@@ -78,6 +78,8 @@ real(SHR_KIND_R8) pure FUNCTION shr_orb_cosz(jday,lat,lon,declin,dt_avg)
shr_orb_cosz = shr_orb_avg_cosz(jday, lat, lon, declin, dt_avg)
else
shr_orb_cosz = sin(lat)*sin(declin) - &
! & cos(lat)*cos(declin)*cos(jday*2.0_SHR_KIND_R8*pi + lon)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove commented code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rljacob , thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you can.

check_members=check_members,
default=arg_node.get("default"))
args[arg_node.get("name")] = arg_value
if exe_only:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic looks wrong - if exe_only is False then args is not assigned either? I think you mean
if not exe_only?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, yes

@jedwards4b
Copy link
Contributor

@mvertens is on vacation until next week. Do we want another reviewer or do we want to wait?

@rljacob
Copy link
Member

rljacob commented Apr 4, 2017

Last week we agreed we just needed your review.

@jgfouca
Copy link
Contributor

jgfouca commented Apr 4, 2017

@jedwards4b I think we're good. All tests are passing now with a fix for the problem you saw. I'll resolve conflicts and merge.

@jedwards4b
Copy link
Contributor

Sounds good - thanks

* master:
  fix for fortran unit tests
  update stubs file
  Set a default value for esmf_logging so it does not have to appear in drv_in.
  Allow a custom input root through create_test.
  Fix pylint issues
  Fixed non-ASCII character in description.
  Added log kind flag to ESMF_Initialize call
@jgfouca jgfouca merged commit fbf1b0c into master Apr 4, 2017
@rljacob rljacob changed the title Merge changes from acme 03292017 Merge cime5.2 changes from acme 03292017 Apr 4, 2017
@jgfouca jgfouca deleted the agsalin/merge-from-acme-03292017 branch April 7, 2017 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.