-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge cime5.2 changes from acme 03292017 #1287
Conversation
Anvil (system name 'anvil') will be used for production runs, so added syslog.anvil script to archive checkpoint data and modified provenance.py to collect Anvil-specific provenance information.
get_timing.py is used to generate the performance summary file acme_timing.$case.$lid from the raw global performance data acme_timing_stats.$lid . The computation of TOT Run Time uses the formula: tmax = tmax + wtmax + correction where tmax = self.gettime(' CPL:RUN_LOOP ')[1] wtmax = self.gettime(' CPL:TPROF_WRITE ')[1] correction = max(0, ocnrunitime - ocnwaittime) Here tmax is the maximum time any process spends in the RUN loop. wtmax is the maximum time any process spends in the phase where checkpoint timing data is output, including the barrier waiting for all processes to enter this phase. If one component is running on nodes separate from the other components and takes very little time, wtmax will reflect the time that this component is waiting at the barrier while the other components are in the RUN loop, double counting this time after they are summed. This error is not significant during typical production runs, but it does affect short benchmark runs where checkpoint performance data is written frequently, which is the type of runs used to evaluate PE layouts and set performance optimization targets. As such it is important to fix this as soon as possible. The solution proposed here is to use wtmin = self.gettime(' CPL:TPROF_WRITE ')[0] tmax = tmax + wtmin + correction Since CPL:TPROF_WRITE includes barriers before and after the performance data write (t_prf), the minimum will capture the cost of the t_prf call even if the process achieving the minimum is not the one that spends the most time in t_prf. Note that I do not understand the role of 'correction' in the above formula - it is perhaps extrapolating what the TOT time would be if the OCN is simulated the same amount of time as the ATM (it is typically a little less), and do not know who wrote this script. In the cases used to diagnose the 'tmax + wtmax' issue, 'correction' was zero. With the current coupling frequency, I do not expect 'correction' to be very large in any case, but it would be worth while querying the author as to the intent, but that is distinct from resolving this issue with double counting RUN loop time.
This commit makes necessary modifications to support grizzly, a LANL internal machine. It also adds support for the intel compiler.
Fixes incorrect value for node size on grizzly
The TestStatus.log is pretty useless without this info. [BFB]
Rather than create a separate file per component each time syslog.anvil 'wakes' to take a snapshot of application progress, create one file per component and append updates. Also add a file for tracking ROF progress, change names of these files to all have the same suffix ('.step'), and add a file to capture all per simulated day timing information from the CPL log file. Other changes are (a) remove an unused variable, (b) change logic for how long to wait before starting checkpointing to when acme.log file has at least $ncores lines instead of $nnodes lines, and (c) change 'grep -a -i' to 'grep -Fa', for improved efficiency.
Update LANL IC machine support This PR will fix machine support for LANL IC machines grizzly and wolf that had been broken in the move to CIME5. It also includes the addition of intel support for grizzly. [BFB]
When trying to run acme on cori-haswell or cori-knl we get an error about git not being in the path. To mitigate this I added a module load git to the machines file for cori-*. [BFB]
Support for lawrencium-lr2 and lawrencium-lr3 is updated
Including simplifying our MKL link flags.
* A_WCYCL1850S / ne30_oEC_ICG on 32 nodes at 2.31 SYPD: -pecount S * A_WCYCL1850S / ne30_oEC_ICG on 59 nodes at 4.17 SYPD: default * FC5AV1C-04P2 / ne30_ne30 on 115 nodes at 11.32 SYPD: -pecount L [BFB] - Bit-For-Bit
Also add a flag to report compile times. And keep Mira/Cetus flags identical. [BFB] - Bit-For-Bit
Add full performance data and provenance capture support for Anvil Anvil (system name 'anvil') will be used for production runs, so added syslog.anvil script to archive checkpoint data and modified provenance.py to collect Anvil-specific provenance information. As part of this PR, prototyping some new functionality that will eventually be imported into syslog.$mach for the other supported systems (most motivated by comments in the review of this PR): a) rather than create a separate file per component each time syslog.anvil 'wakes' to take a snapshot of application progress, create one file per component and append updates. b) add a file for tracking ROF progress c) change names of these files to all have the same suffix ('.step') d) add a file to capture all per simulated day timing information from the CPL log file e) remove an unused variable f) change logic for how long to wait before starting checkpointing to when acme.log file has at least $ncores lines instead of $nnodes lines g) change 'grep -a -i' to 'grep -Fa', for improved efficiency. [BFB]
Add IBM compiler macro Also add a flag to report compile times. And keep Mira/Cetus settings identical. [BFB] - Bit-For-Bit
- Corrects the module command for language = python - Remove the duplicate 'account' entry in the run script
…s' (PR #1281) Adds support for LBL Lawrencium cluster [BFB]
Removing modules before we add them to ensure no errors happen.
Add T42 atm, lnd and ocn mask to config_grids.xml for SCM funcationality This includes the cime changes for PR #1252. [BFB] * bogensch/atm/EUL_SCM_cime: Add grid configure for T42 for SCM
Replaces existing generic, poor-performing ne30 PE layouts on Edison with better values. 173 node and 375 node A_WCYCL1850 layouts were created by @worleyph a year or so ago and the 114 node F-compset layout was created by @PeterCaldwell. [BFB]
Changes to use module craype-mic-knl to build/run on KNL nodes of Cori. Including simplifying our MKL link flags on cori-knl only for now. [BFB]
Merge commit '83634512187f89f012c73f3e42e0a1c1dd0b3ab8' into rljacob/cime/uptocime5.2.0 Bring in cime5.2.0 with a git subtree merge --squash Most conflicts for .py files resolved by using cime5.2.0 version. Conflicts: cime/cime_config/acme/allactive/config_pes.xml cime/cime_config/acme/allactive/config_pesall.xml cime/cime_config/acme/allactive/testmods_dirs/cam/outfrq9s/shell_commands cime/cime_config/acme/allactive/testmods_dirs/cam/outfrq9s/xmlchange_cmnds cime/cime_config/acme/allactive/testmods_dirs/force_netcdf_pio/shell_commands cime/cime_config/acme/allactive/testmods_dirs/force_netcdf_pio/xmlchange_cmnds cime/cime_config/acme/config_grids.xml cime/cime_config/acme/machines/Makefile cime/cime_config/acme/machines/config_batch.xml cime/cime_config/acme/machines/config_build.xml cime/cime_config/acme/machines/config_compilers.xml cime/cime_config/acme/machines/config_machines.xml cime/cime_config/acme/testmods_dirs/allactive/cam/outfrq9s/xmlchange_cmnds cime/cime_config/acme/testmods_dirs/allactive/force_netcdf_pio/xmlchange_cmnds cime/components/data_comps/datm/cime_config/config_component.xml cime/components/data_comps/dlnd/cime_config/config_component.xml cime/driver_cpl/cime_config/buildnml cime/driver_cpl/cime_config/config_component.xml cime/scripts/Tools/taskmaker cime/utils/python/CIME/SystemTests/homme.py cime/utils/python/CIME/XML/machines.py cime/utils/python/CIME/XML/pes.py cime/utils/python/CIME/bless_test_results.py cime/utils/python/CIME/build.py cime/utils/python/CIME/compare_test_results.py cime/utils/python/CIME/provenance.py cime/utils/python/CIME/task_maker.py cime/utils/python/CIME/test_scheduler.py cime/utils/python/CIME/utils.py cime/utils/python/update_acme_tests.py
ACME is still using the older algorithm in shr_orb_cosz so restore that.
Was this first merged to a maint-cime5.2 branch? I'd like to tag that (5.2.1) before we mingle it with cime5.3 code. |
This branch agsalin/cime52-with-acmesplit-03292017 could be called maint-cime5.2 and the current head tagged 5.2.1. |
scripts/lib/CIME/build.py
Outdated
debug, compiler, mpilib, complist, ninst_build, smp_value, | ||
model_only): | ||
def post_build(case, logs): | ||
############################################################################### |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right. It's showing lots of code added that should have already been there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agsalin , the only recent change (last two months) to this file on the ACME side was one line:
@@ -292,6 +292,7 @@ def case_build(caseroot, case, sharedlib_only=False, model_only=False):
cimeroot, libroot, lid, compiler)
if not sharedlib_only:
+ os.environ["INSTALL_SHAREDPATH"] = os.path.join(exeroot, sharedpath) # for MPAS makefile generators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just made a commit to add this one line to the current master. Should fix this.
pio_numtasks, pio_async_interface, | ||
compiler, machine, run_exe): | ||
############################################################################### | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jedwards4b @mvertens this is the replacement for task_maker.
Two long routines were included twice. Instead of trying to fix the merge, I checked out CIME master of this file and added the 1 MPAS-related line that was the only commit on the ACME side since the 5.2 merge.
@@ -37,7 +37,7 @@ def configure(machobj, output_dir, macros_format, compiler, mpilib, debug, sysos | |||
""" | |||
# Macros generation. | |||
suffixes = {'Makefile': 'make', 'CMake': 'cmake'} | |||
macro_maker = Compilers(machobj) | |||
macro_maker = Compilers(machobj, compiler=compiler, mpilib=mpilib) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a critical fix for the v1 macro writer.
@@ -15,6 +15,59 @@ | |||
from CIME.XML.standard_module_setup import * | |||
logger = logging.getLogger(__name__) | |||
|
|||
def _get_components(value): | |||
""" | |||
>>> value = '-something ${shell ${NETCDF_PATH}/bin/nf-config --flibs} -lblas -llapack' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a big change that I can't quite remember why it was needed. shell and environment variables weren't being handled correctly.
@@ -149,7 +149,7 @@ def write_macros_file(self, macros_file="Macros.make", output_format="make", xml | |||
for compiler_node in reversed(self.compiler_nodes): | |||
_add_to_macros(compiler_node, macros) | |||
write_macros_file_v1(macros, self.compiler, self.os, | |||
self.machine, macros_file="Macros.make", | |||
self.machine, macros_file=macros_file, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Key bug fix.
for _, _, running_phase in threads_in_flight.values(): | ||
if (running_phase == SHAREDLIB_BUILD_PHASE): | ||
return self._proc_pool + 1 | ||
if get_model() == "cesm": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACME still can't handle sharing shared libs.
@@ -777,6 +777,78 @@ def compute_total_time(job_cost_map, proc_pool): | |||
|
|||
return current_time | |||
|
|||
def format_time(time_format, input_format, input_time): | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mfdeakin added this, maybe he can say why it was needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was needed because CIME was failing with allocation walltimes greater than 23:59:59. This was a limitation of the Python date parser; it doesn't accept >=24 in the hours field; instead expecting it to roll over in to the days field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that these changes are minor...
@@ -270,6 +269,10 @@ def get_submit_args(self, case, job): | |||
if flag == "-n" and rval<= 0: | |||
rval = 1 | |||
|
|||
if flag == "-q" and rval == "batch" and case.get_value("MACH") == "blues": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to do this in xml and not in code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jedwards4b I couldn't figure out how to do that in XML. It's a pretty unusual case: the -q option needs to be provided unless the queue is "batch". I don't think we can express that in XML.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay
scripts/lib/CIME/case.py
Outdated
self.tasks_per_numa = int(math.ceil(self.tasks_per_node / 2.0)) | ||
smt_factor = max(1,int(self.get_value("MAX_TASKS_PER_NODE") / pes_per_node)) | ||
|
||
threads_per_node = self.tasks_per_node * self.thread_count | ||
threads_per_core = 1 if (threads_per_node <= pes_per_node) else smt_factor | ||
self.cores_per_task = self.thread_count / threads_per_core | ||
|
||
return total_tasks | ||
if self.get_value("MACH") == "titan": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we check if mpirun executable is aprun instead of hard coding titan?
|
||
# special case for aprun | ||
if executable == "aprun": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use this instead of test for titan above at line 142?
@@ -210,7 +210,7 @@ def _case_setup_impl(case, caseroot, clean=False, test_mode=False, reset=False, | |||
logger.info("Finished testcase.setup") | |||
|
|||
# Some tests need namelists created here (ERP) - so do this if are in test mode | |||
if test_mode: | |||
if test_mode or get_model() == "acme": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been a while since I fixed this. I think one of our models had buildnml stuff that depended on it being run at setup time.
@@ -94,9 +95,10 @@ def create_namelists(case): | |||
raise | |||
|
|||
if do_run_cmd: | |||
logger.debug(" Running %s buildnml"%compname) | |||
logger.info(" Running %s buildnml"%compname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too noisy for default in my opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I changed this because in the old implementation, all the output from buildnml was completely lost and I had trouble finding something.
@@ -78,6 +78,8 @@ real(SHR_KIND_R8) pure FUNCTION shr_orb_cosz(jday,lat,lon,declin,dt_avg) | |||
shr_orb_cosz = shr_orb_avg_cosz(jday, lat, lon, declin, dt_avg) | |||
else | |||
shr_orb_cosz = sin(lat)*sin(declin) - & | |||
! & cos(lat)*cos(declin)*cos(jday*2.0_SHR_KIND_R8*pi + lon) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove commented code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rljacob , thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you can.
check_members=check_members, | ||
default=arg_node.get("default")) | ||
args[arg_node.get("name")] = arg_value | ||
if exe_only: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic looks wrong - if exe_only is False then args is not assigned either? I think you mean
if not exe_only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, yes
@mvertens is on vacation until next week. Do we want another reviewer or do we want to wait? |
Last week we agreed we just needed your review. |
@jedwards4b I think we're good. All tests are passing now with a fix for the problem you saw. I'll resolve conflicts and merge. |
Sounds good - thanks |
* master: fix for fortran unit tests update stubs file Set a default value for esmf_logging so it does not have to appear in drv_in. Allow a custom input root through create_test. Fix pylint issues Fixed non-ASCII character in description. Added log kind flag to ESMF_Initialize call
Pull back ACME repo changes of CIME since last subtree split (late January?) up to March 29 back into CIME master.
Test suite: scripts_regression_tests on penn pass
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes #1271
Fixes #1247
Fixes #1251
User interface changes?:
Code review: PLEASE!