fix calculation of TOT Run Time in get_timing.py
get_timing.py is used to generate the performance summary
file acme_timing.$case.$lid from the raw global performance data
acme_timing_stats.$lid. The computation of TOT Run Time uses
the formula:

        tmax = tmax + wtmax + correction

where

        tmax  = self.gettime(' CPL:RUN_LOOP ')[1]
        wtmax = self.gettime(' CPL:TPROF_WRITE ')[1]
        correction = max(0, ocnrunitime - ocnwaittime)

Here tmax is the maximum time any process spends in the RUN
loop. wtmax is the maximum time any process spends in the phase where
checkpoint timing data is output, including the barrier waiting for
all processes to enter this phase. If one component is running on
nodes separate from the other components and takes very little time,
wtmax will reflect the time that this component is waiting at the
barrier while the other components are in the RUN loop, so this time
is double counted when tmax and wtmax are summed. This error is not
significant during typical production runs, but it does affect short
benchmark runs where checkpoint performance data is written
frequently, which are the type of run used to evaluate PE layouts and
set performance optimization targets. As such it is important to fix
this as soon as possible.
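
The double counting can be illustrated with a small sketch. The
process names and all timings below are hypothetical and are not taken
from a real run; the sketch assumes the CPL:TPROF_WRITE timer starts
when a process arrives at the entry barrier, as described above.

```python
# Hypothetical per-process RUN loop times in seconds (not from a real run).
# "busy" stands for the processes running most components, "idle" for a
# lightweight component on separate nodes.
run_loop = {"busy": 100.0, "idle": 10.0}

# Hypothetical cost of the performance data write itself.
T_PRF_COST = 2.0

# Every process waits at the entry barrier until the slowest one arrives,
# so its TPROF_WRITE time is (barrier wait) + (write cost).
barrier_release = max(run_loop.values())
tprof_write = {
    name: (barrier_release - t) + T_PRF_COST for name, t in run_loop.items()
}

tmax = max(run_loop.values())      # 100.0, from "busy"
wtmax = max(tprof_write.values())  # 92.0, from "idle" (90 s of barrier wait)
wtmin = min(tprof_write.values())  # 2.0, from "busy"

old_total = tmax + wtmax  # counts the 90 s of RUN loop overlap twice
new_total = tmax + wtmin  # matches the elapsed time in this model

print(f"old: {old_total}  new: {new_total}")
```

In this model the elapsed time is the slowest RUN loop plus the write
cost, which is what tmax + wtmin yields; tmax + wtmax adds the idle
component's barrier wait on top of it.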

The solution proposed here is to use

        wtmin = self.gettime(' CPL:TPROF_WRITE ')[0]
        tmax = tmax + wtmin + correction

Since CPL:TPROF_WRITE includes barriers before and after the performance
data write (t_prf), the minimum will capture the cost of the t_prf call
even if the process achieving the minimum is not the one that spends
the most time in t_prf.
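
A minimal sketch of why the minimum still captures the t_prf cost,
assuming an idealized model in which both barriers release all
processes at the same instant. The function name tprof_write_times and
all numbers are hypothetical; they are not part of get_timing.py.

```python
def tprof_write_times(arrivals, t_prf_costs):
    """Per-process time inside a barrier / t_prf / barrier region.

    arrivals:    time at which each process reaches the entry barrier
    t_prf_costs: time each process spends in its own t_prf call
    """
    # The entry barrier releases when the last process arrives.
    release = max(arrivals)
    # The exit barrier releases when the slowest t_prf call completes.
    finish = release + max(t_prf_costs)
    # Each timer runs from that process's arrival until the common finish.
    return [finish - a for a in arrivals]


# Hypothetical values: process 0 arrives early and has the most expensive
# t_prf call; processes 1 and 2 arrive last with cheap t_prf calls.
arrivals = [10.0, 100.0, 100.0]
t_prf_costs = [2.0, 0.5, 0.1]

times = tprof_write_times(arrivals, t_prf_costs)
wtmin, wtmax = min(times), max(times)

print(times, wtmin, wtmax)
```

Under these assumptions the minimum is attained by the last process to
arrive, and it equals the largest t_prf cost even though that process's
own t_prf call is cheap, because it cannot leave the region until the
exit barrier releases.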

Note that I do not understand the role of 'correction' in the above
formula, and I do not know who wrote this script. It is perhaps
extrapolating what the TOT time would be if the OCN were simulated for
the same amount of time as the ATM (it is typically a little less). In
the cases used to diagnose the 'tmax + wtmax' issue, 'correction' was
zero. With the current coupling frequency, I do not expect
'correction' to be very large in any case. It would be worthwhile
querying the author as to the intent, but that is distinct from
resolving this issue with double counting RUN loop time.
Patrick Worley committed Feb 1, 2017
1 parent 7e13433 commit 555fcc7
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions utils/python/CIME/get_timing.py
@@ -256,7 +256,7 @@ def getTiming(self):

 nmax = self.gettime(' CPL:INIT ')[1]
 tmax = self.gettime(' CPL:RUN_LOOP ')[1]
-wtmax = self.gettime(' CPL:TPROF_WRITE ')[1]
+wtmin = self.gettime(' CPL:TPROF_WRITE ')[0]
 fmax = self.gettime(' CPL:FINAL ')[1]
 for k in components:
     if k != "CPL":
@@ -281,7 +281,7 @@ def getTiming(self):

 correction = max(0, ocnrunitime - ocnwaittime)

-tmax = tmax + wtmax + correction
+tmax = tmax + wtmin + correction
 ocn.tmax += ocnrunitime

 for m in self.models.values():