
performance issues and topics #386

Open
rgknox opened this issue May 23, 2018 · 23 comments
@rgknox
Contributor

rgknox commented May 23, 2018

As per the discussion in #379, we have been seeing decreased performance in our FATES simulations.

Let's use this space to document information that helps elucidate the performance degradation, and to discuss places where the code could be improved or may have suffered a regression.

@rgknox
Contributor Author

rgknox commented May 23, 2018

Running list of things that may be causing slowness:

  1. Simply more cohorts and patches in our simulations (we can change the maximums allowed).
  2. More PFTs. More PFTs make for larger arrays and larger loops in the canopy structure and radiation calculations.
  3. There is a new procedure in PlantGrowth that seeks to keep DBH and structural biomass on-allometry. Without it, numerical integration truncation errors make plants slowly drift off allometry, but I'm not sure the routine is really worth it. See https://github.com/NGEET/fates/blob/master/biogeochem/FatesAllometryMod.F90#L1778
     (no discernible difference in simulations with/without the structure reset)
  4. During canopy promotion and demotion, we have some iterative while loops. These (for myriad reasons) could be having a hard time converging. I have a new PR that removes the need for these loops and calculates demotion and promotion to much higher precision. This may speed up the code and also solve #378 (Balance Check failure in fire runs). See https://github.com/rgknox/fates/tree/rgknox-area-fixes
  5. Lots of checks. We have so many "ifs" in the code checking for problems that this really slows things down. We can't use pre-processing flags to remove them. We could do an audit of the checks.
  6. Features. We ask the code to do more things and decide between more options. Deciding between options slows things down.
  7. Allometry. Now that allometry is more "modular", there are some purposefully redundant calls. For instance, to determine how much storage carbon there is, the code will automatically calculate leaf biomass, even if it already knows the value. I kept these in because they ensure things are calculated correctly (see the sketch after this list).
  8. The code that calculates our history outputs is getting bulkier and bulkier, with more variables and dimensions, some of them updated every model time-step. Even though we may not write out all of these history variables, we still have to allocate and calculate them all.
  9. Changes in the stomatal conductance may be causing the temperature iteration to take longer to converge.
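To illustrate item 7, here is a minimal, hypothetical sketch of the kind of purposefully redundant call described there (the routine names and coefficients are placeholders, not the actual FatesAllometryMod interfaces): the storage-target routine re-derives the leaf-biomass target internally rather than trusting a value supplied by the caller.

module AllomRedundancySketch
  ! Hypothetical sketch only: illustrates a storage allometry that recomputes
  ! the leaf target internally (the "purposefully redundant" call) so callers
  ! can never pass in a stale or inconsistent value.
  implicit none
  private
  public :: bleaf_allom_sketch, bstore_allom_sketch
  integer, parameter :: r8 = selected_real_kind(12)

contains

  subroutine bleaf_allom_sketch(dbh, bl)
    real(r8), intent(in)  :: dbh   ! diameter at breast height [cm]
    real(r8), intent(out) :: bl    ! target leaf biomass [kgC]
    bl = 0.05_r8 * dbh**1.5_r8     ! stand-in power law, not real coefficients
  end subroutine bleaf_allom_sketch

  subroutine bstore_allom_sketch(dbh, cushion, bstore)
    real(r8), intent(in)  :: dbh
    real(r8), intent(in)  :: cushion  ! storage target as a multiple of the leaf target
    real(r8), intent(out) :: bstore   ! target storage carbon [kgC]
    real(r8) :: bl
    call bleaf_allom_sketch(dbh, bl)  ! redundant if the caller already knows bl,
                                      ! but guarantees internal consistency
    bstore = cushion * bl
  end subroutine bstore_allom_sketch

end module AllomRedundancySketch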

@rgknox
Contributor Author

rgknox commented May 23, 2018

I've also been able to get DDT (a debugger and profiler) working on cheyenne. I think it would be a good idea to profile the code.

@rosiealice
Contributor

rosiealice commented May 23, 2018 via email

@rgknox
Contributor Author

rgknox commented May 23, 2018

thanks! added to my list

@rosiealice
Contributor

rosiealice commented May 23, 2018 via email

@jenniferholm
Contributor

Hi all,
Regarding the slowness issue with the new canopy conductance bug fixes, it looks like (at least one) main reason is the number of PFTs indexed. When running the same case with 3 PFT indices instead of 14, the run is back to its regular speed, or maybe only a tiny bit slower (8 mins vs. ~5-6 mins for 1 simulation year). The 14-PFT-index case was taking ~35 mins for 1 year.

@ckoven
Contributor

ckoven commented May 24, 2018

@rgknox if your profiling shows option 8 (history bloat) on your list to be important, maybe we should make a global switch called something like 'history_detail_level', with values of 1-5, that turns the calculations on and off via something like the following?

  1. only calculate the bulk values
  2. add a handful of 1-D structured values
  3. add the full set of 1-D structured values
  4. add a handful of 2-D or larger structured values
  5. the full set of everything we've got possible now
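A minimal sketch of what such a switch could look like, assuming hypothetical names (hist_detail_level and the update_* routines are stand-ins, not existing FATES or HLM code):

module HistDetailLevelSketch
  ! Hypothetical sketch of a 1-5 detail-level switch that gates history
  ! calculations; only the levels at or below the chosen value are computed.
  implicit none
  private
  public :: update_history_diagnostics
  integer, public :: hist_detail_level = 5   ! 1 = bulk only ... 5 = everything

contains

  subroutine update_history_diagnostics()
    call update_bulk_values()                                ! level 1: always
    if (hist_detail_level >= 2) call update_core_1d()        ! handful of 1-D fields
    if (hist_detail_level >= 3) call update_full_1d()        ! full 1-D set
    if (hist_detail_level >= 4) call update_core_multidim()  ! handful of 2-D+ fields
    if (hist_detail_level >= 5) call update_full_multidim()  ! everything possible now
  end subroutine update_history_diagnostics

  ! Empty stubs standing in for the real per-level diagnostic updates
  subroutine update_bulk_values()
  end subroutine update_bulk_values

  subroutine update_core_1d()
  end subroutine update_core_1d

  subroutine update_full_1d()
  end subroutine update_full_1d

  subroutine update_core_multidim()
  end subroutine update_core_multidim

  subroutine update_full_multidim()
  end subroutine update_full_multidim

end module HistDetailLevelSketch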

@rgknox
Contributor Author

rgknox commented May 24, 2018

Thanks for that info @jenniferholm , that is very helpful.

@ckoven, I think you're onto something. We could add "group" tags to the output variables, and if all of the members of a group are turned off, then we forgo allocations and calculations for that group. This would minimize the number of logical evaluations, since we would be evaluating groups.

Let's try to get a sense of how important it is. Maybe profiling will show the history is not that burdensome.

@jenniferholm
Contributor

Hi all,
I was wondering if we could circle back around to this decreased-performance issue. Could we maybe split up to-do tasks between us from Ryan's list above (and the additions from Charlie and Rosie) to help increase performance? (If you are interested, of course!)

I'm highly interested, because global runs with multiple PFTs are pretty expensive and slow now, so I have some motivation to help with this. Is anybody else running global simulations, or single-site cases with >3 PFTs, since the new stomatal conductance bug fix and having speed issues? @jkshuman @huitang-earth ?

@rgknox - do you have suggestions for which items on your list would lead to the greatest performance increase? Since this didn't happen before the conductance bug fix, I can look into the temperature iteration and whether it takes longer to converge when there are larger arrays with more PFTs. I can also help with option 5, but I'm not sure how critical that one is.

@walkeranthonyp

walkeranthonyp commented Jun 22, 2018

Sounds like it could be a little of all of the above. Profiling would certainly be a good idea, especially if you have the capacity to do that.

Generally photosynthesis can be pretty expensive because it is calculated so many times (as you all know). Being wrapped in the canopy numerical solve, it's a numerical solve within a numerical solve. Because of this, any reduction in the runtime of a single photosynthesis solve could have a potentially large speed-up benefit. I'm curious @jenniferholm: did you run the 14-PFT case prior to the rb bugfix? Is the relative slowdown equivalent to the 3-PFT case, or is there some compound issue where many PFTs and the bugfix together cause a greater relative slowdown?

@rgknox could you briefly outline the rb bug, its fix, and the result? I did see some of the discussion after the fact, but I didn't catch what the problem was or what effect the fix had. @ckoven mentioned it on the modelling call the other day, and if I remember right the fix resulted in increased transpiration. I guess that means the fix resulted in lower rb. Does this also mean that GPP increased?

@ckoven
Contributor

ckoven commented Jun 22, 2018

@walkeranthonyp the recent bugfix I referred to on the call was #379. The change reduces rather than increases transpiration (because canopy conductance was previously too high by a factor of LAI).

@rgknox
Contributor Author

rgknox commented Jun 26, 2018

I'm going to take a look at item 3 (to see whether forcing plants to stay tightly on allometry is worth it)

@rgknox
Contributor Author

rgknox commented Jul 1, 2018

re item 3: I ran two simulations with tropical evergreens at BCI for 30 years. One simulation used master 91785, and the other removed the calls that reset dbh to match structure. There was no discernible difference in wall-time (528 vs. 521 seconds per day), and the results were nearly identical.

In fact, master actually ran slightly faster. Here are some comparisons.

reset-test_plots.pdf

@rgknox
Contributor Author

rgknox commented Jul 1, 2018

@walkeranthonyp, the fixes to canopy conductance were submitted via pull request #379. In summary, we had originally thought that the host land models expected FATES to pass them canopy resistance per unit leaf area, but we realized that it should just be the mean canopy resistance.

https://github.com/NGEET/fates/pull/379/files

See the diffs in the subroutine ScaleLeafLayerFluxToCohort(); that is where conductance is calculated and weighted by leaf area for each cohort. We then normalize by total canopy leaf area to get a mean.

#379
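In other words (my notation, a sketch of the weighting described above rather than the code's variable names): with g_i the conductance of leaf layer i and \Delta L_i the leaf area it represents, the quantity passed up is

\bar{g}_{canopy} = \frac{\sum_i g_i \, \Delta L_i}{\sum_i \Delta L_i}

i.e. an LAI-weighted mean canopy conductance, rather than a per-unit-leaf-area value.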

@walkeranthonyp

Thanks @rgknox and @ckoven for pointing me to the previous discussion. I think I'm up-to-date (on that at least).

WRT performance issues, I'm looking into the photosynthesis calculation side of things. As noted by @everpassenger in issue #379, calculating 40 canopy layers for every canopy photosynthesis calculation is expensive! I've been checking this out with the rudimentary version of canopy MAAT. It's not set up exactly like FATES, but it does have a multi-layer canopy, sun/shade and diffuse/direct partitioning, plus a number of other assumptions.
(attached figure: canopy_layers_par)
The figure shows the difference in instantaneous canopy assimilation compared with a 40-layer canopy calculation at different PAR levels. Colours are 10-50 layers in steps of 10, tested with Medlyn or Ball gs, and LAI of 3, 5, and 7.

A rough guide is a drop in A of 2-5 % for a 20-layer canopy and 1-2 % for a 30-layer canopy. It's roughly the same across various CO2 and VPD levels. The question is how much accuracy is sufficient? It's unlikely that in the real world the canopy variables and parameters follow a completely smooth differential anyway.

On another tack, I think I have an idea for how to come up with an analytical solution that includes a non-zero rb. Actually one already exists, but it's a cubic and no-one really knows which root to use. I think I can come up with a quadratic solution; I just need to find a little time to implement it and test it out.
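For context, a sketch (in standard textbook notation, not FATES variable names) of where that cubic comes from: take a single Farquhar-type limiting rate, a Ball-Berry-style conductance model, and the CO2 diffusion path through the leaf boundary layer and stomata, with A net assimilation, c_a / c_s / c_i the ambient / leaf-surface / intercellular CO2, g_b and g_s the boundary-layer and stomatal conductances, h_s leaf-surface humidity, \Gamma^* the compensation point, R_d dark respiration, and a, b, e lumped Farquhar parameters:

A   = \frac{a\,(c_i - \Gamma^*)}{e\,c_i + b} - R_d
g_s = g_0 + g_1 \frac{A\, h_s}{c_s}
c_s = c_a - \frac{1.4\,A}{g_b}
c_i = c_s - \frac{1.6\,A}{g_s}

Eliminating c_i and c_s leaves a cubic in A when g_b is finite (the existing analytical solution referred to above, e.g. Baldocchi 1994), which is why picking the physically meaningful root is the awkward part.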

@rosiealice
Contributor

rosiealice commented Jul 9, 2018 via email

@walkeranthonyp

walkeranthonyp commented Jul 10, 2018

Ah OK, I was going on what was said earlier in these threads (plus I remember a comment you made, @rosiealice, on a call about Gordon thinking 40 layers was what was needed to properly approximate that integral). That'll teach me not to look at the code in depth.

In any case, we have the tools to investigate how many layers we want to simulate.

Also I want to get it out there that I think an analytical, or semi-analytical, solution to the coupled photosynthesis equations is likely and that will save a good amount of time.

@rgknox
Contributor Author

rgknox commented Apr 7, 2021

A place to improve efficiency: we currently allocate some largish arrays using the maxpft global constant. Some examples here:

fates/main/EDTypesMod.F90

Lines 415 to 443 in 8827a6e

real(r8) :: pft_agb_profile(maxpft,n_dbh_bins) ! binned above ground biomass, for patch fusion: KgC/m2
real(r8) :: canopy_layer_tlai(nclmax) ! total leaf area index of each canopy layer
! used to determine attenuation of parameters during
! photosynthesis m2 veg / m2 of canopy area (patch without bare ground)
real(r8) :: total_canopy_area ! area that is covered by vegetation : m2
real(r8) :: total_tree_area ! area that is covered by woody vegetation : m2
real(r8) :: zstar ! height of smallest canopy tree -- only meaningful in "strict PPA" mode
real(r8) :: c_stomata ! Mean stomatal conductance of all leaves in the patch [umol/m2/s]
real(r8) :: c_lblayer ! Mean boundary layer conductance of all leaves in the patch [umol/m2/s]
! UNITS for the ai profiles
! [ m2 leaf / m2 contributing crown footprints]
real(r8) :: tlai_profile(nclmax,maxpft,nlevleaf) ! total leaf area in each canopy layer, pft, and leaf layer.
real(r8) :: elai_profile(nclmax,maxpft,nlevleaf) ! exposed leaf area in each canopy layer, pft, and leaf layer
real(r8) :: tsai_profile(nclmax,maxpft,nlevleaf) ! total stem area in each canopy layer, pft, and leaf layer
real(r8) :: esai_profile(nclmax,maxpft,nlevleaf) ! exposed stem area in each canopy layer, pft, and leaf layer
real(r8) :: layer_height_profile(nclmax,maxpft,nlevleaf)
real(r8) :: canopy_area_profile(nclmax,maxpft,nlevleaf) ! fraction of crown area per canopy area in each layer
! they will sum to 1.0 in the fully closed canopy layers
! but only in leaf-layers that contain contributions
! from all cohorts that donate to canopy_area
! layer, pft, and leaf layer:-
integer :: canopy_mask(nclmax,maxpft) ! is there any of this pft in this canopy layer?
integer :: nrad(nclmax,maxpft) ! number of exposed leaf layers for each canopy layer and pft
integer :: ncan(nclmax,maxpft) ! number of total leaf layers for each canopy layer and pft

We already do dynamic allocation of various patch-level arrays; we could easily change these to use the actual number of PFTs instead of the maximum, as sketched below.
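A minimal sketch of that change, with hypothetical names and stand-in dimension sizes (not the real EDTypesMod code): the profile arrays become allocatable and are sized by the number of PFTs actually in use.

module PatchProfileAllocSketch
  ! Hypothetical sketch: size per-patch profile arrays by the number of PFTs
  ! actually in use (numpft) instead of the compile-time maxpft constant.
  implicit none
  private
  public :: patch_profiles_type, init_patch_profiles
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: nclmax = 2, nlevleaf = 30     ! stand-in dimension sizes

  type :: patch_profiles_type
     real(r8), allocatable :: tlai_profile(:,:,:)     ! (nclmax, numpft, nlevleaf)
     real(r8), allocatable :: elai_profile(:,:,:)     ! (nclmax, numpft, nlevleaf)
     integer,  allocatable :: canopy_mask(:,:)        ! (nclmax, numpft)
  end type patch_profiles_type

contains

  subroutine init_patch_profiles(p, numpft)
    type(patch_profiles_type), intent(inout) :: p
    integer, intent(in) :: numpft                     ! PFTs actually used in this run
    allocate(p%tlai_profile(nclmax, numpft, nlevleaf))
    allocate(p%elai_profile(nclmax, numpft, nlevleaf))
    allocate(p%canopy_mask(nclmax, numpft))
    p%tlai_profile = 0._r8
    p%elai_profile = 0._r8
    p%canopy_mask  = 0
    ! Loops written over size(p%tlai_profile, 2) then shrink automatically
    ! when fewer PFTs are active.
  end subroutine init_patch_profiles

end module PatchProfileAllocSketch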

@rgknox
Contributor Author

rgknox commented Jul 28, 2021

In the following piece of code, we do some pretty laborious loops for diagnostics at the model timestep.

A typical cohort loop (right now) maxes out at 100-150 entities. This loop, assuming 12 PFTs, 3 canopy layers, and 30 leaf layers, would be 12 * 3 * 30 = 1080 iterations.

https://github.com/NGEET/fates/blob/master/main/FatesHistoryInterfaceMod.F90#L3683-L3686

The solution, aside from wrapping the whole loop structure in some type of hard-coded logical, would require us to determine whether any of these variables are being sent to history. I've looked into this before, but I don't remember finding where the history field list is saved.

@ekluzek @billsacks

@billsacks
Member

Interesting. I could see how this could be useful in other places, too.

You could get at this information from histFileMod, but I don't know if there will be an easy way to get the information you want. Some issues I see are:

  • You would need to know the string names of all relevant history variables. Hard-coding these in the place where you're doing the calculations could be a maintenance problem, in that someone might change the history field name but forget to change the check in this other place. (And if the logic is: calculate these diagnostics as long as any of these history variables are present on any file, then you wouldn't notice the problem until someone does a run where none of them are present, so it could easily slip through testing.)
  • You would probably need to do a bunch of string comparisons against all history variables included in all files. These string comparisons could themselves be a significant expense if you're doing them every time step.

I thought a bit about how we could handle this in a way that avoids these issues. My initial thought would be to create a module like this:

module HistExpensiveDiagnostics

  ! This module defines an enumeration and a logical array for tracking whether a
  ! particular set of expensive diagnostics need to be calculated.
  !
  ! There are two parts to using this:
  !
  ! (1) In a hist_addfld call for a diagnostic that falls into one of these expensive
  ! diagnostic categories, add the optional argument "expensive_diagnostic", with value
  ! set to one of the expensive_diagnostics_* integer values defined below. For example:
  !
  !    hist_addfld1d(..., expensive_diagnostic = expensive_diagnostics_foo)
  !
  ! (2) At the point where this diagnostic variable is calculated, wrap the calculation
  ! in a conditional like:
  !
  !    if (diagnostics_needed(expensive_diagnostics_foo)) then ...

  implicit none
  private

  integer, parameter, public :: expensive_diagnostics_fates_something = 1
  integer, parameter, public :: expensive_diagnostics_something_else  = 2
  integer, parameter, public :: expensive_diagnostics_max             = 2

  logical, public :: diagnostics_needed(expensive_diagnostics_max)

end module HistExpensiveDiagnostics

You would then need to add some infrastructure in histFileMod that stores the expensive_diagnostic argument for each history variable, based on its value (if present) in the addfld call. Then, when a variable is actually added to a history stream and has this expensive_diagnostic field set, the relevant entry in diagnostics_needed (which would have been initialized to .false. everywhere) is set to .true.

This might not be the best design, so happy to hear alternatives. I think it wouldn't be too hard to implement: it would mainly be a matter of making sure you find all the relevant places that need to be changed in histFileMod (as I recall, there are a number of blocks of near-duplicated code in that module, so you probably need to make similar changes in a few places).
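A rough, self-contained sketch of that bookkeeping (all names here are hypothetical; it only illustrates the two steps, not the real histFileMod interfaces): the addfld-side call records the optional tag, and activating the field on a history stream flips the shared flag.

module HistExpensiveBookkeepingSketch
  ! Hypothetical sketch: store the optional expensive_diagnostic tag at field
  ! registration, and set diagnostics_needed when the field is actually placed
  ! on a history stream.
  use HistExpensiveDiagnostics, only : diagnostics_needed, expensive_diagnostics_max
  implicit none
  private
  public :: register_field, activate_field

  type :: field_info
     character(len=64) :: name = ' '
     integer           :: expensive_diagnostic = 0   ! 0 = not an expensive diagnostic
  end type field_info

  integer, parameter :: max_fields = 1000
  type(field_info)   :: fields(max_fields)
  integer            :: nfields = 0

contains

  ! Analogous to a hist_addfld call: record the field and its optional tag
  subroutine register_field(name, expensive_diagnostic)
    character(len=*), intent(in)           :: name
    integer,          intent(in), optional :: expensive_diagnostic
    nfields = nfields + 1
    fields(nfields)%name = name
    if (present(expensive_diagnostic)) then
       fields(nfields)%expensive_diagnostic = expensive_diagnostic
    end if
  end subroutine register_field

  ! Called when a field is actually added to a history stream
  subroutine activate_field(name)
    character(len=*), intent(in) :: name
    integer :: i, tag
    do i = 1, nfields
       if (trim(fields(i)%name) == trim(name)) then
          tag = fields(i)%expensive_diagnostic
          if (tag >= 1 .and. tag <= expensive_diagnostics_max) then
             diagnostics_needed(tag) = .true.
          end if
       end if
    end do
  end subroutine activate_field

end module HistExpensiveBookkeepingSketch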

@rgknox
Contributor Author

rgknox commented Jul 29, 2021

Thanks for your ideas @billsacks
On the fates side, all of our history variables are defined as objects. Each object has a unique integer identifier, which is used to find or identify the variable when it comes time for use.
Definition call for one of our expensive variables:
https://github.com/NGEET/fates/blob/master/main/FatesHistoryInterfaceMod.F90#L5110-L5114
Alias call:
https://github.com/NGEET/fates/blob/master/main/FatesHistoryInterfaceMod.F90#L3492
Usage:
https://github.com/NGEET/fates/blob/master/main/FatesHistoryInterfaceMod.F90#L3711-L3712

I feel we could add your diagnostics_needed as a component of the object. EDIT: Or...it might be better as an array like you say, because attaching it as a scalar logical to the object would prevent us from doing checks on groups of variables. I think I like the array method better...

The string matching would only need to be done once at initialization, and to save time we would only do string matching on the variables flagged as expensive, so it shouldn't be too bad. I think we have some string-matching utilities lying around here somewhere.

@ckoven
Contributor

ckoven commented May 19, 2022

see also #859

@rosiealice
Contributor

@rgknox reports that the two-stream model is slightly slower than the Norman code. This is due to the cohort-scale resolution of the new radiation calculations.

@ckoven noted that the number of cohorts could be halved by using the deterministic rather than the stochastic sorting mode. @rgknox notes that for gridded runs, the speed of the model is perhaps most related to the speed of the slowest or most complicated grid cell.

If this were true, then the best performance increases should be attained by reducing the maximum number of patches or cohorts, as opposed to changing the mean number.
