slow CICE6 IO in the GFSV17 HR1 test on wcoss2 #1895

Closed
junwang-noaa opened this issue Sep 11, 2023 · 8 comments
Labels: bug

@junwang-noaa (Collaborator)

Description

This issue was found while investigating GFSv17 scalability (issue #1367). On WCOSS2, the GFSv17 HR1 test (without wave) runs have a large coupling overhead. It turned out that CICE6 restart writing is very slow (~100 s to write one restart file). However, this slowness does not show up in the cpld_bmark_p8 test on WCOSS2. Further investigation is required to resolve this issue.

To Reproduce:

  1. Run HR1 without wave; set atm layout 32x32, ocn 240, cice 960.
  2. Run the experiment for 96 hours with a 24-hour restart frequency.
  3. Check the time for the total and component run phases.


@junwang-noaa (Collaborator, Author)

CICE has a PIO option; pnetcdf can speed up the I/O performance, but CICE can't use pnetcdf on WCOSS2. On C5, PIO was built with pnetcdf and there is no slow I/O issue.

@junwang-noaa (Collaborator, Author)

Parallel netCDF-4 uses HDF5 storage underneath (that is where its parallel I/O capability comes from). PnetCDF uses the CDF-5 (netCDF 64-bit data) format to write out netCDF files with parallel capability.
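
For illustration, a minimal sketch of the distinction (using the netCDF4-python bindings rather than CICE's Fortran PIO path; file names are illustrative, and the underlying netcdf-c build must support both parallel HDF5 and PnetCDF):

```python
# The `format` argument selects the storage layer: NETCDF4 goes through
# parallel HDF5, NETCDF3_64BIT_DATA (CDF-5) goes through PnetCDF.
from mpi4py import MPI
from netCDF4 import Dataset

comm = MPI.COMM_WORLD
rank, nprocs = comm.rank, comm.size

cases = [("NETCDF4", "hdf5_storage.nc"),             # parallel I/O via HDF5
         ("NETCDF3_64BIT_DATA", "cdf5_pnetcdf.nc")]  # parallel I/O via PnetCDF

for fmt, path in cases:
    nc = Dataset(path, "w", parallel=True, comm=comm,
                 info=MPI.Info(), format=fmt)
    nc.createDimension("x", nprocs)
    v = nc.createVariable("data", "f8", ("x",))
    v.set_collective(True)   # collective access is typically faster for restarts
    v[rank] = float(rank)    # each task writes its own element
    nc.close()
```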

@DeniseWorthen (Collaborator)

@junwang-noaa I've run a C768 coupled test on C5, using PIO + netcdf (i.e., serial netcdf) for ICE on 960 procs, and I see no large signal at CICE's restart frequency. In the test case below, I've turned off MOM6 restarts (by inserting a line in the cap to turn off the alarm), and every 3 hours, at the restart frequency, the ModelAdvance time for CICE is:

144:146:       10800  CICE ModelAdvance time:   4.99342396600014
288:290:       21600  CICE ModelAdvance time:   6.88059686300016
432:434:       32400  CICE ModelAdvance time:   5.70175658800008
576:578:       43200  CICE ModelAdvance time:   6.80486823100000

Run is in /lustre/f2/scratch/Denise.Worthen/ciceio/c768.p960
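
For reference, a small script along these lines (log file name hypothetical) can pull the ModelAdvance times out of a run log in the format above and flag outliers:

```python
# Scan a run log for the "CICE ModelAdvance time" lines shown above.
import re

pat = re.compile(r"(\d+)\s+CICE ModelAdvance time:\s+([0-9.Ee+-]+)")

times = []
with open("ufs.log") as f:           # hypothetical log file name
    for line in f:
        m = pat.search(line)
        if m:
            times.append((int(m.group(1)), float(m.group(2))))

mean = sum(t for _, t in times) / max(len(times), 1)
for step, t in times:
    note = "  <-- well above mean" if t > 2 * mean else ""
    print(f"{step:>8d} s  {t:8.3f} s{note}")
```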

@DeniseWorthen (Collaborator)

With the box rearranger + pnetcdf + 8 iotasks, I get

72:144:146:       10800  CICE ModelAdvance time:   2.59570083900007
144:288:290:       21600  CICE ModelAdvance time:   3.57405910699981
216:432:434:       32400  CICE ModelAdvance time:   2.59559376000016
288:576:578:       43200  CICE ModelAdvance time:   3.56517140899996
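
Conceptually, the box rearranger with a small number of iotasks aggregates each compute task's data onto a few dedicated I/O tasks before anything touches the file system. A rough mpi4py sketch of that idea (this is not the PIO implementation; sizes and names are illustrative):

```python
# "Rearranger + N iotasks" in miniature: compute tasks hand their blocks to a
# few I/O tasks, and only those tasks open the file. Assumes the number of MPI
# tasks is a multiple of NIOTASKS.
from mpi4py import MPI
import numpy as np
from netCDF4 import Dataset

NIOTASKS = 8          # as in the test above
BLK = 4               # elements owned by each compute task

comm = MPI.COMM_WORLD
rank, nprocs = comm.rank, comm.size
gsize = nprocs // NIOTASKS            # compute tasks per I/O task
group = rank // gsize

# One sub-communicator per group; rank 0 of each group acts as the I/O task.
gcomm = comm.Split(group, rank)
is_io = gcomm.rank == 0

local = np.full(BLK, rank, dtype="f8")
blocks = gcomm.gather(local, root=0)  # the "rearrangement" step

# Only the I/O tasks open the file (in parallel) and write aggregated chunks.
iocomm = comm.Split(0 if is_io else MPI.UNDEFINED, rank)
if is_io:
    nc = Dataset("demo_restart.nc", "w", parallel=True, comm=iocomm,
                 info=MPI.Info(), format="NETCDF3_64BIT_DATA")  # CDF-5/PnetCDF
    nc.createDimension("x", nprocs * BLK)
    v = nc.createVariable("data", "f8", ("x",))
    v.set_collective(True)
    chunk = np.concatenate(blocks)
    start = group * gsize * BLK
    v[start:start + chunk.size] = chunk
    nc.close()
```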

@DeniseWorthen (Collaborator)

@junwang-noaa Are your tests on WCOSS2 done with aerosols? I've always been testing no-wave, no-aerosols.

@junwang-noaa (Collaborator, Author)

No, the tests I am running do not include aerosols or waves.

@junwang-noaa (Collaborator, Author)

junwang-noaa commented Oct 25, 2023

The GFSv17 HR2 load balance issue was analyzed with the S2S configuration, in order to fit into the operational time window (~7 min/forecast day).

The results are shown in the Google sheet CICE in HR2 test3:
https://docs.google.com/spreadsheets/d/1OAsKobHWPGzVmeoJfjuU6OAkElQgl21OWEJ5F7-hBlc/edit#gid=398996980
The recommended configuration is: atm 32x32, and 8 write groups with 64 tasks in each group; MOM6/CICE use 240 tasks each. The total run time is 813 s for a 48-hour forecast. The I/O impact of each component can be seen in the plots below.
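
As a sanity check on those numbers (simple arithmetic on the figures above):

```python
# The recommended configuration fits the ~7 min/forecast-day window.
total_s, fcst_hours = 813, 48
per_day_s = total_s / (fcst_hours / 24)   # seconds per forecast day
tasks = 32 * 32 + 8 * 64 + 240 + 240      # atm + write tasks + ocn + ice
print(f"{per_day_s:.1f} s/day = {per_day_s / 60:.2f} min/day on {tasks} tasks")
# -> 406.5 s/day = 6.78 min/day, inside the ~7 min/forecast-day target
```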

@junwang-noaa (Collaborator, Author)

In summary:

  1. In this configuration atm (0.35 s/time step) runs slower than ice (0.24 s/step); ocn takes 3 s per ocean step (one ocean step = 12 atm/ice steps). At run time, when no I/O is involved, ice/ocn are waiting for atm (see the balance check sketched after this list).
  2. The radiation in FV3ATM, called hourly, takes an additional 0.55 s each time.
  3. atm does not show a run time change at history output time.
  4. cice takes an additional 1.8 s each time it writes history files, which makes the atmosphere wait 1.8 s.
  5. ocean takes an additional 3.5 s each time it writes history files, which makes both ice and atm wait an additional 1.7 s.
  6. With this configuration, the times for writing restart files for the three components are: 21 s for atm, 23 s for ice, and 19 s for ocean.
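
The per-step numbers in item 1 reduce to a simple comparison on a common (atm-step) basis:

```python
# Per-atm-step cost from the numbers above: atm is the bottleneck outside I/O.
atm, ice = 0.35, 0.24   # s per atm/ice step
ocn = 3.0 / 12          # 3 s per ocean step, 12 atm steps per ocean step
print(f"atm {atm:.2f} s, ice {ice:.2f} s, ocn {ocn:.2f} s per atm step")
# -> atm 0.35 s is the slowest, so ice/ocn wait for atm when no I/O occurs
```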

From these results, even though it would be ideal to further reduce the history file writing time in CICE, it is not a blocker for GFSv17 to reach the operational time window with the current configuration.

The issue will be closed.
