fix calculation of TOT Run Time in get_timing.py · ESMCI/cime@555fcc7

Commit

fix calculation of TOT Run Time in get_timing.py

get_timing.py is used to generate the performance summary
file acme_timing.$case.$lid from the raw global performance data
acme_timing_stats.$lid . The computation of TOT Run Time uses
the formula:

        tmax = tmax + wtmax + correction

where

        tmax  = self.gettime(' CPL:RUN_LOOP ')[1]
        wtmax = self.gettime(' CPL:TPROF_WRITE ')[1]
        correction = max(0, ocnrunitime - ocnwaittime)

Here tmax is the maximum time any process spends in the RUN
loop. wtmax is the maximum time any process spends in the phase where
checkpoint timing data is output, including the barrier waiting for
all processes to enter this phase. If one component runs on nodes
separate from the other components and takes very little time, wtmax
will reflect the time that this component waits at the barrier while
the other components are still in the RUN loop, so that time is
counted twice when tmax and wtmax are summed. This error is not
significant during typical production runs, but it does affect short
benchmark runs where checkpoint performance data is written
frequently, which are the runs used to evaluate PE layouts and
set performance optimization targets. As such it is important to fix
this as soon as possible.
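
The double counting can be illustrated with a small numeric sketch. The
process labels and all timing values below are hypothetical and are not
taken from any real acme_timing_stats file; 'correction' is taken as zero,
as in the cases used to diagnose the issue.

```python
# Hypothetical per-process timings (seconds) for one checkpoint interval.
# "fast" stands for the lightly loaded component on its own nodes;
# "slow" stands for the processes running the other components.
run_loop = {"slow": 100.0, "fast": 10.0}
write_cost = 2.0  # assumed cost of the timing-data write itself

# Every process leaves the barrier when the slowest one arrives, so the
# time charged to CPL:TPROF_WRITE is the barrier wait plus the write cost.
barrier_release = max(run_loop.values())
tprof_write = {p: (barrier_release - t) + write_cost
               for p, t in run_loop.items()}

tmax = max(run_loop.values())       # longest time in the RUN loop
wtmax = max(tprof_write.values())   # dominated by the fast process waiting

actual_wallclock = barrier_release + write_cost
reported_old = tmax + wtmax         # old formula, correction == 0
overcount = reported_old - actual_wallclock

print(f"actual={actual_wallclock} reported={reported_old} overcount={overcount}")
```

In this sketch the overcount equals the time the fast process waits at the
barrier, which is RUN loop time already included in tmax.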

The solution proposed here is to use

        wtmin = self.gettime(' CPL:TPROF_WRITE ')[0]
        tmax = tmax + wtmin + correction

Since CPL:TPROF_WRITE includes barriers before and after the performance
data write (t_prf), the minimum will capture the cost of the t_prf call
even if the process achieving the minimum is not the one that spends
the most time in t_prf.
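
As a rough sketch of why the minimum is the better estimate, the helper
below evaluates both formulas on (min, max) pairs, mirroring the [0]/[1]
indexing of gettime described above. The function name tot_run_time and
the timing values are hypothetical, chosen only for illustration.

```python
def tot_run_time(run_loop, tprof_write, correction=0.0, use_min=True):
    """Return TOT Run Time from (min, max) timing pairs.

    run_loop and tprof_write are (min, max) tuples in seconds.
    use_min=True applies the proposed formula (tmax + wtmin + correction);
    use_min=False applies the old one (tmax + wtmax + correction).
    """
    tmax = run_loop[1]
    write_time = tprof_write[0] if use_min else tprof_write[1]
    return tmax + write_time + correction


# Hypothetical case: one process spends 10 s in the RUN loop, then waits
# 90 s at the barrier; the slowest spends 100 s. The write itself costs 2 s.
run_loop = (10.0, 100.0)
tprof_write = (2.0, 92.0)

old_total = tot_run_time(run_loop, tprof_write, use_min=False)
new_total = tot_run_time(run_loop, tprof_write)

print(f"old={old_total} new={new_total}")
```

The process that arrives last at the barrier waits for no one, so its
CPL:TPROF_WRITE time is close to the cost of the write alone, and that is
the value the minimum picks up.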

Note that I do not understand the role of 'correction' in the above
formula, and I do not know who wrote this script. It is perhaps
extrapolating what the TOT time would be if the OCN were simulated for
the same amount of time as the ATM (it is typically a little less). In
the cases used to diagnose the 'tmax + wtmax' issue, 'correction' was
zero. With the current coupling frequency, I do not expect
'correction' to be very large in any case. It would be worthwhile to
ask the author about the intent, but that is distinct from
resolving this issue of double counting RUN loop time.


Patrick Worley committed Feb 1, 2017

1 parent 7e13433 commit 555fcc7

utils/python/CIME/get_timing.py

@@ -256,7 +256,7 @@ def getTiming(self):
            nmax  = self.gettime(' CPL:INIT ')[1]
            tmax  = self.gettime(' CPL:RUN_LOOP ')[1]
-           wtmax = self.gettime(' CPL:TPROF_WRITE ')[1]
+           wtmin = self.gettime(' CPL:TPROF_WRITE ')[0]
            fmax  = self.gettime(' CPL:FINAL ')[1]
            for k in components:
                if k != "CPL":
@@ -281,7 +281,7 @@ def getTiming(self):
            correction = max(0, ocnrunitime - ocnwaittime)
-           tmax = tmax + wtmax + correction
+           tmax = tmax + wtmin + correction
            ocn.tmax += ocnrunitime
            for m in self.models.values():
