
Mesh file for sparse grid for the NUOPC coupler #1731

Open

ekluzek opened this issue Apr 29, 2022 · 85 comments
Labels: enhancement (new capability or improved behavior of existing capability)

Comments

@ekluzek (Collaborator) commented Apr 29, 2022

We need a mesh file that can be used with the NUOPC coupler for the sparse grid.

Here's a sample case for the MCT coupler:

/glade/work/oleson/PPE.n11_ctsm5.1.dev030/cime/scripts/ctsm51c8BGC_PPEn11ctsm51d030_2deg_GSWP3V1_Sparse400_Control_2000

@slevis-lmwg (Contributor) commented:

@ekluzek in case it's relevant:
I wrote instructions in a Google doc for creating a high-res sparse-grid mesh file for #1773. Search for "mesh" in the doc to find the relevant section.

Very briefly:

  1. To start we need a file containing 1D or 2D latitude and longitude variables for the grid of interest. If such a file exists, I would be happy to try to generate the mesh file (a sketch of such a file follows this list).
  2. If the lat/lon file also includes a mask of the sparse grid, we would then run mesh_mask_modifier to get that mask into the mesh file.
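For illustration, a minimal sketch of what such a lat/lon + mask file could look like before it goes into the SCRIP/mesh steps discussed below. This is a hedged example in Python/xarray; the file name latlon_mask.nc is hypothetical, and the grid sizes are borrowed from the 2-degree workflow later in this thread.

```python
import numpy as np
import xarray as xr

# Hypothetical 2-degree-like grid (96 lat x 144 lon) with a sparse-grid mask.
lat = np.linspace(-90.0, 90.0, 96)
lon = np.arange(0.0, 360.0, 2.5)                 # 144 longitudes
landmask = np.zeros((96, 144), dtype=np.int32)   # set to 1 at the sparse-grid points

ds = xr.Dataset(
    {"landmask": (("lat", "lon"), landmask)},
    coords={"lat": lat, "lon": lon},
)

# ncks --rgr infer expects lat/lon (and the mask) without _FillValue attributes.
encoding = {v: {"_FillValue": False} for v in ("lat", "lon", "landmask")}
ds.to_netcdf("latlon_mask.nc", encoding=encoding)
```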

@ekluzek (Collaborator, Author) commented Sep 30, 2022

Using the new mesh_modifier tool, I was able to get a mesh file from the domain file. The mesh file for the atm forcing is different, though, in that it's not modifying the 2D grid; it's a simple list of 400 points. So I need to create a SCRIP grid file that describes that list of points from the domain file, and then convert it into ESMF mesh format.

@slevis-lmwg (Contributor) commented Sep 30, 2022

So this is what you need to do:

  1. Generate scrip.nc from <file_with_lat_lon_2d>.nc, where the latter I think is the domain file that you mentioned:
     ncks --rgr infer --rgr scrip=scrip.nc <file_with_lat_lon_2d>.nc foo.nc
     (foo.nc contains only metadata and will not be used)
  2. Generate the mesh file from scrip.nc:
     module load esmf
     ESMF_Scrip2Unstruct scrip.nc lnd_mesh.nc 0
     (This mesh file's mask = 1 everywhere.)
  3. At this point I suspect that you need to update this mesh file's mask using the mesh_modifier tool to distinguish between land and ocean.
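As a quick sanity check (a sketch, not part of the recipe above; it assumes Python with xarray and the lnd_mesh.nc name used above), the mesh mask can be inspected directly. elementMask should be all 1s right after ESMF_Scrip2Unstruct, and should show the land/ocean split after the mask update.

```python
import xarray as xr

# Hypothetical check of the mesh produced by ESMF_Scrip2Unstruct.
mesh = xr.open_dataset("lnd_mesh.nc")
mask = mesh["elementMask"]
print("elements:", mask.sizes["elementCount"], "masked-in:", int(mask.sum()))
```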

@ekluzek (Collaborator, Author) commented Oct 3, 2022

Awesome, thanks @slevisconsulting, the above helped me get mesh files created. I got everything set up Friday, but when I run the case it fails, so I need to debug what's happening and get a working case. The mesh files I created are in:

/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17

Hopefully, the crash I'm seeing is something simple I can figure out.

@ekluzek (Collaborator, Author) commented Oct 7, 2022

The crash has to do with the connectivity on the forcing grid, which is just a list of the 400 points. The suggestion from ESMF is to make each forcing grid point's vertices sit just a tiny bit around its cell center.
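A hedged sketch of that suggestion (Python/numpy; the input file names are hypothetical): give each of the 400 forcing points its own four corners, offset by a tiny epsilon, so no two cells share vertices.

```python
import numpy as np

# Hypothetical inputs: lat_c/lon_c hold the 400 forcing-point centers (degrees).
lat_c = np.loadtxt("sparse400_lat.txt")
lon_c = np.loadtxt("sparse400_lon.txt")
eps = 0.01   # tiny offset around each center

# SCRIP corner arrays, dimensioned (grid_size, grid_corners), counterclockwise,
# so every element gets its own four corners and no nodes are shared.
corner_lat = np.stack([lat_c - eps, lat_c - eps, lat_c + eps, lat_c + eps], axis=1)
corner_lon = np.stack([lon_c - eps, lon_c + eps, lon_c + eps, lon_c - eps], axis=1)

# corner_lat/corner_lon, the centers (grid_center_lat/lon), and grid_imask = 1
# would then be written to a SCRIP grid file and converted with ESMF_Scrip2Unstruct.
```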

Because of the time it's taking to do this project I also plan to bring this work as a user-mod to master and add a test for it.

@ekluzek (Collaborator, Author) commented Oct 11, 2022

We talked about this at the standup this morning. An idea from there was to try it with the new land mesh, but without the atm forcing mesh. I tried that and it works. So there's something going on with the new forcing mesh that only has the 400 points, which is something I did suspect.

@ekluzek (Collaborator, Author) commented Oct 12, 2022

OK, I got a case to work! I couldn't use ncks to make the SCRIP grid file, as it would "correct" my vertices to turn it into a regular grid. I was able to use curvilinear_to_SCRIP inside NCL to write out a SCRIP grid file that I could then convert to a working mesh file. Using unstructured_to_ESMF inside NCL didn't generate a mesh that I could use. One clue in the final mesh file is that nodeCount was 1600 (4x the number of points, 400), which shows that all of the points are isolated from each other. The mesh files that did NOT work all had fewer total nodes than that, which means elements shared nodes with each other.
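For reference, a sketch (assuming Python/xarray and the standard ESMF unstructured-mesh variable names used elsewhere in this thread) of how to check that a candidate mesh has this fully isolated structure:

```python
import xarray as xr

mesh = xr.open_dataset("candidate_mesh.nc")   # hypothetical file name
n_elem = mesh.sizes["elementCount"]
n_node = mesh.sizes["nodeCount"]
conn = mesh["elementConn"].values

# Fully isolated quad cells: 4 unique nodes per element, none shared.
print(n_elem, "elements,", n_node, "nodes")
print("isolated:", n_node == 4 * n_elem and len(set(conn.ravel().tolist())) == n_node)
```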

@slevis-lmwg (Contributor) commented:

@ekluzek I understand that you used curvilinear_to_SCRIP instead of ncks, but I didn't follow what you ran to go from SCRIP to successful MESH.

@ekluzek moved this from In Progress to Done (or no longer holding things up) in CESM: infrastructure / cross-component SE priorities, Oct 13, 2022
@wwieder (Contributor) commented Feb 2, 2023

Seems like this connects with #1919 too. In general we need better documentation on how to do this.

@slevis-lmwg (Contributor) commented:

@ekluzek and I met to compare notes (this issue #1731 vs. discussion #1919):

@adrifoster (Collaborator) commented:

@ekluzek and @slevis-lmwg I tried running a sparse grid simulation using the steps we talked about on the call today. I got a bunch of ESMF errors. It seems like I should be following what @ekluzek did above?

I'm not sure how to do what you did above, Erik. Do you remember the steps you took?

My log files for the case can be found in:

/glade/scratch/afoster/ctsm51FATES_SP_OAAT_Control_2000/run

@ekluzek (Collaborator, Author) commented Oct 4, 2023

It looks like the issue is in datm, from the PET file as you point out. So one thing to check would be to see if just the change for the MASK_MESH works. I think that should, so that would be good to try.

Another thing to try would be the datm mesh file I created
/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17/360x720_gswp3.0v1.c170606v4.dense400_ESMFmesh_c20221012.nc

which does differ from your file.

From reading above, I ran into trouble with just using ncks because it would change the vertices on me, so I couldn't use files created with it. I think that might be related to the warnings we saw when we worked on this that pointed to issues at the south pole.

(By the way, the unclear messages from ESMF are another example of error checking that doesn't help you figure out the problem. I think this might be a place where better error checking could be added to help us figure out what's wrong.)

@ekluzek (Collaborator, Author) commented Oct 4, 2023

You can also try the land mesh file I created...

/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17/fv1.9x2.5_sparse400_181205v4_ESMFmesh_c20220929.nc

NOTE: For this you would use it for both

ATM_DOMAIN_MESH
LND_DOMAIN_MESH

and leave MASK_MESH as it was before.

OR -- you would reverse the mask and use it for MASK_MESH, and leave LND_DOMAIN_MESH/ATM_DOMAIN_MESH as they were.

I set it up changing ATM_DOMAIN_MESH/LND_DOMAIN_MESH because that is what made sense to me. But, as we saw when talking with @slevis-lmwg, it's more general and simpler to swap out the MASK_MESH file.

That mesh file is also different from yours, and it's not just the mask. So again maybe there was something going on with the ncks conversion?

@slevis-lmwg (Contributor) commented Oct 4, 2023

@adrifoster I do see, four posts up, that I wrote, "We think that his ncks attempt failed because he applied it to a 'domain' file," followed by a suggestion for how to resolve it. So the "domain" shortcut probably did not work.

If things continue to fail, let's meet again and go through the full process, as I recommend it above. Let's plan on an hour.

@adrifoster (Collaborator) commented:

It looks like the issue is in datm, from the PET file as you point out. So one thing to check would be to see if just the change for the MASK_MESH works. I think that should, so that would be good to try.

Okay, I removed the info in user_nl_datm_streams and submitted that (so the only thing that changed was MASK_MESH). That failed for a different reason...

56: Which are NaNs =  F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F T F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F F F F F F F F F
56: NaN found in field Sl_lfrin at gridcell index           88
56: ERROR:  ERROR: One or more of the CTSM cap export_1D fields are NaN
87: # of NaNs =            3
87: Which are NaNs =  F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
87: F F F F F F F F F F F F F F F F F F F F F F F F F F F F T F F F F F F F F F F F
87: F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F T
87: T F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
87: F F F F F F F F F F F F F F F F F F F F F F F F
87: NaN found in field Sl_lfrin at gridcell index           60
87: NaN found in field Sl_lfrin at gridcell index          111
87: NaN found in field Sl_lfrin at gridcell index          112

Will try your next suggestion next.

@adrifoster (Collaborator) commented:

That still failed... Thanks @slevis-lmwg for joining the meeting with @mvertens this afternoon. @ekluzek do you want to join as well?

@slevis-lmwg: (this comment was marked as outdated)

@slevis-lmwg: (this comment was marked as outdated)

@slevis-lmwg (Contributor) commented:

@adrifoster I hope the run still works when you now point to the dense400 datm files and the datm mesh that I generated (previous post).

@adrifoster (Collaborator) commented:

Unfortunately that did not work.

see logs /glade/scratch/afoster/ctsm51FATES_SP_OAAT_Control_2000_nuopctest/run

@adrifoster (Collaborator) commented:

I'm using DATM data in:

/glade/p/cgd/tss/people/oleson/atm_forcing.datm7.GSWP3.0.5d.v1.c210222_400/

@adrifoster (Collaborator) commented Oct 17, 2023

Okay thanks to @slevis-lmwg for helping me fix a separate error. This seemed to work! And the timing is faster!

Updated table:

| driver | land grid | DATM grid | cost (pe-hrs/simulated_year) | throughput (simulated_years/day) |
| ------ | --------- | ----------- | ----- | ------ |
| mct    | full      | full DATM   | 53.73 | 659.32 |
| mct    | sparse    | full DATM   | 6.2   | 418.35 |
| mct    | sparse    | subset DATM | 3.56  | 676.56 |
| nuopc  | full      | full DATM   | 67.502 | 525.9 |
| nuopc  | sparse    | full DATM   | 6.61  | 391.9  |
| nuopc  | sparse    | subset DATM | 5.66  | 458.23 |

Thank you @slevis-lmwg !!!

@adrifoster (Collaborator) commented:

Here are the updated results for the new PE layout that @slevis-lmwg, @ekluzek, @jedwards4b, @mvertens, and I discussed today:

---------------- TIMING PROFILE ---------------------
  Case        : nuopc_grid_400DATM
  LID         : 3887011.chadmin1.ib0.cheyenne.ucar.edu.231018-094900
  Machine     : cheyenne
  Caseroot    : /glade/work/afoster/FATES_calibration/nuopc_mct_testing/cases/nuopc_grid_400DATM
  Timeroot    : /glade/work/afoster/FATES_calibration/nuopc_mct_testing/cases/nuopc_grid_400DATM/Tools
  User        : afoster
  Curr Date   : Wed Oct 18 09:51:42 2023
  Driver      : CMEPS
  grid        : a%1.9x2.5_l%1.9x2.5_oi%null_r%null_g%null_w%null_z%null_m%gx1v7
  compset     : 2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV_SESP
  run type    : startup, continue_run = FALSE (inittype = TRUE)
  stop option : nmonths, stop_n = 12
  run length  : 365 days (364.9791666666667 for ocean)

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        104         4        104    x 1       1      (1     ) 
  atm = datm       4           0        4      x 1       1      (1     ) 
  lnd = clm        104         4        104    x 1       1      (1     ) 
  ice = sice       72          36       72     x 1       1      (1     ) 
  ocn = socn       72          36       72     x 1       1      (1     ) 
  rof = srof       72          36       72     x 1       1      (1     ) 
  glc = sglc       72          36       72     x 1       1      (1     ) 
  wav = swav       72          36       72     x 1       1      (1     ) 
  esp = sesp       1           0        1      x 1       1      (1     ) 

  total pes active           : 108 
  mpi tasks per node         : 36 
  pe count for cost estimate : 108 

  Overall Metrics: 
    Model Cost:               4.45   pe-hrs/simulated_year 
    Model Throughput:       582.35   simulated_years/day 

    Init Time   :       8.409 seconds 
    Run Time    :     148.365 seconds        0.406 seconds/day 
    Final Time  :       3.806 seconds 


Runs Time in total seconds, seconds/model-day, and model-years/wall-day 
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components 

    TOT Run Time:     148.365 seconds        0.406 seconds/mday       582.35 myears/wday 
    CPL Run Time:      26.339 seconds        0.072 seconds/mday      3280.24 myears/wday 
    ATM Run Time:      60.671 seconds        0.166 seconds/mday      1424.07 myears/wday 
    LND Run Time:      96.070 seconds        0.263 seconds/mday       899.34 myears/wday 
    ICE Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    OCN Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ROF Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    CPL COMM Time:     33.747 seconds        0.092 seconds/mday      2560.25 myears/wday 

The land and atm are much closer now. @jedwards4b does this seem like a good layout to land on?

To create these cases I used the script /glade/work/afoster/FATES_calibration/nuopc_mct_testing/setup_case, though the PE layout is incorrect in the script, so that will need to get updated either manually or in the script.

see the script /glade/work/afoster/FATES_calibration/nuopc_mct_testing/run_grid_tests for how to use setup_case

## setup case arguments

# driver=$1 # mct or nuopc
# grid=$2   # full or grid (grid means sparse)
# datm=$3   # 400DATM or fullDATM (400DATM means subset DATM)
## Note CTSM is on ctsm5.1.dev130 and fates on sci.1.66.1_api.25.5.0 so that MCT can be run

@jedwards4b (Contributor) commented:

@adrifoster I think that layout is pretty good.

@slevis-lmwg (Contributor) commented Oct 19, 2023

I will organize my instructions for generating mesh files into one post and hide (not delete) corresponding obsolete posts:

Generating mesh files for sparse grid I-cases where land and datm are on different grids

1) Generate sparse grid mesh for the land model
Starting with this file /glade/u/home/forrest/ppe_representativeness/output_v4/clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc
generate landmask.nc:

a) In matlab (% are comments)

rcent = ncread('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','rcent');
landmask = round(~isnan(rcent));  % "~" means "not"
nccreate('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','landmask','Dimensions',{'lon',144,'lat',96})
ncwrite('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','landmask',landmask);
nccreate('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','mod_lnd_props','Dimensions',{'lon',144,'lat',96})
ncwrite('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','mod_lnd_props',landmask);

b) After matlab

mv clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc landmask.nc
ncks --rgr infer --rgr scrip=scrip.nc landmask.nc foo.nc
module load esmf
ESMF_Scrip2Unstruct scrip.nc lnd_mesh.nc 0

In a run where you point to the default atmosphere drivers (not the sparse version), set the three mesh paths to this lnd_mesh.nc in env_run.xml.

@adrifoster showed that this works. It's not the full story if you want to run faster, but I believe we have not seen correct output from simulations that followed the next step. The time savings have not been sufficient motivation to continue problem-solving.

@adrifoster points out that, if using xarray for (1a), you have to set the encoding correctly, otherwise NCO complains:

# need encoding for ncks to work
encoding = {'lat': {'_FillValue': False},
            'lon': {'_FillValue': False},
            'landmask': {'_FillValue': False}}

dompft.to_netcdf(file_out, encoding=encoding)

2) Generate the sparse grid mesh for datm

The 1D datm domain file for mct runs came from combining the 2D gswp3 data and the dense400 mask. So... in matlab I will take the 1D mask from the domain file
/glade/p/cgd/tss/people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc
and put it in a 2D mask in a 2D gswp3 file
/glade/p/cgd/tss/CTSM_datm_forcing_data/atm_forcing.datm7.GSWP3.0.5d.v1.1.c181207/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc
That will be our starting file for the rest of the steps.

In matlab:

longxy = ncread('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','LONGXY');
latixy = ncread('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','LATIXY');
lon = ncread('../../../people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','xc');
lat = ncread('../../../people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','yc');
mask = zeros(720,360);                                                                                
for cell = 1:400                                                                                      
  [i,j] = find(latixy==lat(cell) & longxy==lon(cell));                                                  
  mask(i,j) = 1;                                                                                        
end

In a copy of the datm file, so as to avoid overwriting the original, still in matlab:

nccreate('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','landmask','Dimensions',{'lon',720,'lat',360})
ncwrite('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','landmask',mask);
nccreate('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','mod_lnd_props','Dimensions',{'lon',720,'lat',360})
ncwrite('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','mod_lnd_props',mask);

After matlab

mv clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc landmask.nc
ncks --rgr infer --rgr scrip=scrip.nc landmask.nc foo.nc
module load esmf
ESMF_Scrip2Unstruct scrip.nc lnd_mesh.nc 0

@adrifoster showed that this mesh file works when still pointing to the global datm data.

Next I modify this mesh file from global (259200 elements, with mask = 1 at the 400 cells of the sparse grid) to a 400-element vector, the same as the domain and datm files that @adrifoster was using in the mct version of this work.

In matlab, I read variables from the global datm_mesh/lnd_mesh.nc file and the 400-element domain file...

elementMask = ncread('lnd_mesh_259200.nc','elementMask');
elementArea = ncread('lnd_mesh_259200.nc','elementArea');
centerCoords = ncread('lnd_mesh_259200.nc','centerCoords');
numElementConn = ncread('lnd_mesh_259200.nc','numElementConn');
elementConn = ncread('lnd_mesh_259200.nc','elementConn');
nodeCoords = ncread('lnd_mesh_259200.nc','nodeCoords');
xc = ncread('/glade/p/cgd/tss/people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','xc');
yc = ncread('/glade/p/cgd/tss/people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','yc');

node_index = 0;
for cell = 1:400
  [i] = find(centerCoords(1,:) == xc(cell) & centerCoords(2,:) == yc(cell));

  centerCoords_new(:,cell) = centerCoords(:,i);
  elementArea_new(cell) = elementArea(i);
  elementMask_new(cell) = elementMask(i);
  numElementConn_new(cell) = numElementConn(i);

  for nodes_per_element = 1:4
    node_index = node_index + 1;
    nodeCoords_new(:,node_index) = nodeCoords(:,i);    
    elementConn_new(nodes_per_element,cell) = node_index;
  end
end

nccreate('lnd_mesh_400_wo_att.nc','centerCoords','Dimensions',{'coordDim', 2, 'elementCount', 400})
ncwrite('lnd_mesh_400_wo_att.nc','centerCoords',centerCoords_new)
nccreate('lnd_mesh_400_wo_att.nc','elementArea','Dimensions',{'elementCount', 400})
ncwrite('lnd_mesh_400_wo_att.nc','elementArea',elementArea_new);
nccreate('lnd_mesh_400_wo_att.nc','elementConn','Dimensions',{'maxNodePElement', 4, 'elementCount', 400})
ncwrite('lnd_mesh_400_wo_att.nc','elementConn',elementConn_new);
nccreate('lnd_mesh_400_wo_att.nc','elementMask','Dimensions',{'elementCount', 400})
ncwrite('lnd_mesh_400_wo_att.nc','elementMask',elementMask_new);
nccreate('lnd_mesh_400_wo_att.nc','nodeCoords','Dimensions',{'coordDim', 2, 'nodeCount', 1600})
ncwrite('lnd_mesh_400_wo_att.nc','nodeCoords',nodeCoords_new);
nccreate('lnd_mesh_400_wo_att.nc','numElementConn','Dimensions',{'elementCount', 400})
ncwrite('lnd_mesh_400_wo_att.nc','numElementConn',numElementConn_new);

To work, this file needs all the same variable attributes as found in other working mesh files. I copied the attributes manually from the original file into an ascii version of the new file and used ncgen to generate a new netcdf. In this step I also corrected some variables from double to int and added global attributes.

This is the mesh file for running with the 400-element datm files.
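An alternative to the hand-edit/ncgen step above, sketched in Python with netCDF4 (file names as used above). Note this only copies attributes; it does not handle the double-to-int type change, which still has to be done separately.

```python
import netCDF4 as nc

src = nc.Dataset("lnd_mesh_259200.nc")           # working global mesh with good attributes
dst = nc.Dataset("lnd_mesh_400_wo_att.nc", "a")  # new 400-element mesh from the matlab step

# Copy variable attributes (units, long_name, ...) for variables present in both files.
# _FillValue is skipped because netCDF4 only allows setting it at variable creation.
for name, var in dst.variables.items():
    if name in src.variables:
        atts = {k: src.variables[name].getncattr(k)
                for k in src.variables[name].ncattrs() if k != "_FillValue"}
        var.setncatts(atts)

# Copy global attributes (e.g. gridType) too.
dst.setncatts({k: src.getncattr(k) for k in src.ncattrs()})
src.close()
dst.close()
```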

If necessary, convert type netcdf4 files to cdf5 (type classic is fine). To check and to change the type:

ncdump -k netcdf4_file.nc
nccopy -k cdf5 netcdf4_file.nc cdf5_file.nc

@adrifoster will test. To run faster, perform load balancing (discussion above).

@jedwards4b (Contributor) commented:

You can avoid the last step of converting from netcdf4 by creating the file in the right format in the first place. Example: nccreate("myFile.nc","Var1",Datatype="double",Format="classic")
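If you are on the xarray route from earlier in this thread, the analogous move is to request a non-netcdf4 format at write time. A sketch with hypothetical file names; per the note above, classic format is fine:

```python
import xarray as xr

ds = xr.open_dataset("lnd_mesh_400_wo_att.nc")                      # hypothetical input
ds.to_netcdf("lnd_mesh_400_classic.nc", format="NETCDF3_CLASSIC")   # write as classic
```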

@slevis-lmwg (Contributor) commented:

From a conversation with @oehmke, @mvertens, @adrifoster, @ekluzek, and @slevis-lmwg:
Still in matlab for now, I return to the global datm mesh file that works and repeat the 400-element loop, but with a nested loop of 4 nodes per element. The resulting file should now include:

- nodeCount = 1600
- nodeCoords dimensioned (1600, 2)
- elementConn = 1 ... 1600 (values equal to the nodeCoords indices)

I also remove origGridDim and origGridRank to avoid confusion about the contents of this file.

@slevis-lmwg (Contributor) commented:

@oehmke thank you for meeting with us this morning.
The new file is
/glade/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh_400.nc (ascii version available in same directory)

Let me know if you see anything wrong with this one.

@oehmke commented Nov 13, 2023 via email

@adrifoster (Collaborator) commented:

So I tested this new mesh, but the output looks the same to me. I ran it twice just to make sure. The datm.streams.xml is pointing to that new file.

I didn't update LND_DOMAIN_MESH, ATM_DOMAIN_MESH, or MASK_MESH; should I have?

(attached plot: subset_datm_tsa)

rundir: /glade/scratch/afoster/ctsm51FATES_SP_OAAT_Control_testCLIM_2000/run
casedir: /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_testCLIM_2000

@oehmke commented Nov 14, 2023 via email

Hmmm. Maybe that was a problem, but not the only one. Let me experiment a bit with the file here and see if I find anything.

@adrifoster (Collaborator) commented:

Okay, thank you! If it helps at all, I used this script to set up that case: /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/setup_run_PPE_FATES.sh

@slevis-lmwg (Contributor) commented Nov 17, 2023

@adrifoster
Here is a new PE layout suggestion from @mvertens that may speed up your global datm runs:

  component       comp_pes    root_pe
  ---------       --------    -------
  cpl = cpl        36          0
  atm = datm        8          0
  lnd = clm       136          8
  all others        1          0

Also, for when we have time to work on the sparse datm further:
Mariana suggested trying to figure out whether the datm is passing correct info through the coupler to the land by making the following xmlchanges to the cpl namelist:

HIST_N=1
HIST_OPTION=nsteps

With these lines changed, we would compare what we get from an MCT case versus a NUOPC case. If I remember right, you don't have any working MCT cases, so we'd need to start over by checking out some old version of the model. Again, this is for when we have time/motivation to get the sparse datm working.

@adrifoster (Collaborator) commented:

@mvertens and @jedwards4b I've moved over to Derecho and am having trouble optimizing my PE layout...

Right now the timing and throughput are not as good as on Cheyenne for a PE layout suggested by @ekluzek. Should I increase the ATM ntasks?

---------------- TIMING PROFILE ---------------------
  Case        : ctsm51FATES_SP_OAAT_SatPhen_2000
  LID         : 2611669.desched1.231204-182003
  Machine     : derecho
  Caseroot    : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_SatPhen_2000
  Timeroot    : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_SatPhen_2000/Tools
  User        : afoster
  Curr Date   : Mon Dec  4 19:17:05 2023
  Driver      : CMEPS
  grid        : a%1.9x2.5_l%1.9x2.5_oi%null_r%null_g%null_w%null_z%null_m%gx1v7
  compset     : 2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV_SESP
  run type    : startup, continue_run = TRUE (inittype = FALSE)
  stop option : nyears, stop_n = 10
  run length  : 3650 days (3650.0 for ocean)

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        124         4        124    x 1       1      (1     ) 
  atm = datm       4           0        4      x 1       1      (1     ) 
  lnd = clm        124         4        124    x 1       1      (1     ) 
  ice = sice       124         4        124    x 1       1      (1     ) 
  ocn = socn       124         4        124    x 1       1      (1     ) 
  rof = srof       124         4        124    x 1       1      (1     ) 
  glc = sglc       124         4        124    x 1       1      (1     ) 
  wav = swav       124         4        124    x 1       1      (1     ) 
  esp = sesp       1           0        1      x 1       1      (1     ) 

  total pes active           : 128 
  mpi tasks per node         : 128 
  pe count for cost estimate : 128 

  Overall Metrics: 
    Model Cost:              12.10   pe-hrs/simulated_year 
    Model Throughput:       253.84   simulated_years/day 

    Init Time   :       8.846 seconds 
    Run Time    :    3403.731 seconds        0.933 seconds/day 
    Final Time  :       0.896 seconds 

@adrifoster (Collaborator) commented:

Sorry, here is the time for each component:

Runs Time in total seconds, seconds/model-day, and model-years/wall-day 
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components 

    TOT Run Time:    3403.731 seconds        0.933 seconds/mday       253.84 myears/wday 
    CPL Run Time:     297.157 seconds        0.081 seconds/mday      2907.55 myears/wday 
    ATM Run Time:    2890.147 seconds        0.792 seconds/mday       298.95 myears/wday 
    LND Run Time:     832.000 seconds        0.228 seconds/mday      1038.46 myears/wday 
    ICE Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    OCN Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ROF Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    CPL COMM Time:   2345.235 seconds        0.643 seconds/mday       368.41 myears/wday 

@jedwards4b (Contributor) commented:

@adrifoster One issue may be that with only 4 tasks for atm you are using the serial netcdf interface. Try changing PIO_STRIDE_ATM=1 and see if that helps. I'll build a case from your sandbox and play around with the pelayout a bit.

@adrifoster (Collaborator) commented:

Thanks @jedwards4b, just FYI the script I used to build this case is here:

/glade/work/afoster/FATES_calibration/FATES_SP_OAAT/setup_run_PPE_FATES_globalDATM.sh

@jedwards4b (Contributor) commented:

Model Cost: 10.94 pe-hrs/simulated_year
Model Throughput: 280.83 simulated_years/day

TOT Run Time: 3076.618 seconds 0.843 seconds/mday 280.83 myears/wday
CPL Run Time: 345.099 seconds 0.095 seconds/mday 2503.63 myears/wday
ATM Run Time: 1864.728 seconds 0.511 seconds/mday 463.34 myears/wday
LND Run Time: 1309.281 seconds 0.359 seconds/mday 659.90 myears/wday

This is with
./xmlchange NTASKS=64,ROOTPE=64,ROOTPE_ATM=0,PIO_STRIDE=32

@adrifoster (Collaborator) commented:

Thank you @jedwards4b!!

@slevis-lmwg (Contributor) commented:

Model Cost: 10.94 pe-hrs/simulated_year
Model Throughput: 280.83 simulated_years/day

TOT Run Time: 3076.618 seconds 0.843 seconds/mday 280.83 myears/wday
CPL Run Time: 345.099 seconds 0.095 seconds/mday 2503.63 myears/wday
ATM Run Time: 1864.728 seconds 0.511 seconds/mday 463.34 myears/wday
LND Run Time: 1309.281 seconds 0.359 seconds/mday 659.90 myears/wday

This is with ./xmlchange NTASKS=64,ROOTPE=64,ROOTPE_ATM=0,PIO_STRIDE=32

@jedwards4b I am revisiting this PE layout for a different 400-point sparse grid application and wanted to ask about PIO_STRIDE=32:

  1. I assume you are referring to PIO_ASYNCIO_STRIDE in env_mach_pes.xml, right?
  2. The documentation says to also set PIO_ASYNC_INTERFACE to TRUE. I assume it doesn't hurt to set it to TRUE for all the components, unless you tell me that I should limit it to ATM, LND, and maybe CPL.
  3. You didn't mention whether I should also change PIO_ASYNCIO_NTASKS from 0 to something and PIO_ASYNCIO_ROOTPE from 1 to something else.

Thanks!

@jedwards4b (Contributor) commented:

Sam, there is no ASYNCIO in this case, so I'm confused about your question.

@slevis-lmwg (Contributor) commented:

Sorry to confuse you. I, too, was confused, thinking I would find PIO_STRIDE in env_mach_pes.xml, but I have located it in env_run.xml, so I think I'm all set. Thanks @jedwards4b.

@jedwards4b (Contributor) commented:

I suggest that you should use xmlchange and xmlquery so that you don't need to know which file to look in.

@ekluzek added the "enhancement" label (new capability or improved behavior of existing capability) and removed the "discussion" label, Aug 14, 2024
@ekluzek (Collaborator, Author) commented Aug 14, 2024

@adrifoster you got something working here. Is there more that needs to be done in this space? It does seem like we should make this something easy to do and easy to get the right PE layout and mesh file needed for NUOPC.
