From 555fcc748d2cbb1c68d7175b7a10b2ed0147ad50 Mon Sep 17 00:00:00 2001 From: Patrick Worley Date: Tue, 31 Jan 2017 23:31:56 -0500 Subject: [PATCH] fix calculation of TOT Run Time in get_timing.py get_timing.py is used to generate the performance summary file acme_timing.$case.$lid from the raw global performance data acme_timing_stats.$lid . The computation of TOT Run Time uses the formula: tmax = tmax + wtmax + correction where tmax = self.gettime(' CPL:RUN_LOOP ')[1] wtmax = self.gettime(' CPL:TPROF_WRITE ')[1] correction = max(0, ocnrunitime - ocnwaittime) Here tmax is the maximum time any process spends in the RUN loop. wtmax is the maximum time any process spends in the phase where checkpoint timing data is output, including the barrier waiting for all processes to enter this phase. If one component is running on nodes separate from the other components and takes very little time, wtmax will reflect the time that this component is waiting at the barrier while the other components are in the RUN loop, double counting this time after they are summed. This error is not significant during typical production runs, but it does affect short benchmark runs where checkpoint performance data is written frequently, which is the type of runs used to evaluate PE layouts and set performance optimization targets. As such it is important to fix this as soon as possible. The solution proposed here is to use wtmin = self.gettime(' CPL:TPROF_WRITE ')[0] tmax = tmax + wtmin + correction Since CPL:TPROF_WRITE includes barriers before and after the performance data write (t_prf), the minimum will capture the cost of the t_prf call even if the process achieving the minimum is not the one that spends the most time in t_prf. Note that I do not understand the role of 'correction' in the above formula - it is perhaps extrapolating what the TOT time would be if the OCN is simulated the same amount of time as the ATM (it is typically a little less), and do not know who wrote this script. In the cases used to diagnose the 'tmax + wtmax' issue, 'correction' was zero. With the current coupling frequency, I do not expect 'correction' to be very large in any case, but it would be worth while querying the author as to the intent, but that is distinct from resolving this issue with double counting RUN loop time. --- utils/python/CIME/get_timing.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/utils/python/CIME/get_timing.py b/utils/python/CIME/get_timing.py index 8f1f592e168..b69b1637964 100644 --- a/utils/python/CIME/get_timing.py +++ b/utils/python/CIME/get_timing.py @@ -256,7 +256,7 @@ def getTiming(self): nmax = self.gettime(' CPL:INIT ')[1] tmax = self.gettime(' CPL:RUN_LOOP ')[1] - wtmax = self.gettime(' CPL:TPROF_WRITE ')[1] + wtmin = self.gettime(' CPL:TPROF_WRITE ')[0] fmax = self.gettime(' CPL:FINAL ')[1] for k in components: if k != "CPL": @@ -281,7 +281,7 @@ def getTiming(self): correction = max(0, ocnrunitime - ocnwaittime) - tmax = tmax + wtmax + correction + tmax = tmax + wtmin + correction ocn.tmax += ocnrunitime for m in self.models.values():