Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Final Updates for HR2 #1827

Merged
merged 11 commits into from
Aug 30, 2023

Conversation

JessicaMeixner-NOAA
Copy link
Contributor

@JessicaMeixner-NOAA JessicaMeixner-NOAA commented Aug 29, 2023

Description

This PR has the final updates for HR2. There was one added variable for diagnostic output: iopt_diag=3 and required from land team. This also points to HR2 updates for initial conditions which update the land states in sfc* files compared to HR1.
Lastly, it was determined that the setting to enable running on hera WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20 actually will fail on WCOSS2. While not ideal, a setting that would run out of the box on both hera and wcoss2 for C768 could not be found, therefore a comment was added noting the need for a different setting on WCOSS2 until a solution can be found.

Resolves #1500 (although note this was technically already completed and this PR itself does not update the ufs-weather-model hash)

Type of change

  • New Feature - HR2 updates

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO

How has this been tested?

  • 2 Day C768 S2SW tests have been tested on orion, hera and wcoss2

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • I have made corresponding changes to the documentation if necessary

Comment on lines 175 to 177
if [[ ${machine} = "HERA" ]]; then
export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the indentation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this just for Hera? Is this load balanced?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was an issue of tabs, this has been updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this just for Hera? Is this load balanced?

It's an issue of memory on hera. The jobs will not run on hera with lower than 20 here and will not run on wcoss2 with 20 here.

parm/config/gfs/config.coupled_ic Show resolved Hide resolved
@aerorahul
Copy link
Contributor

Can someone explain why 20 in wcoss2 fails?

@JessicaMeixner-NOAA
Copy link
Contributor Author

Can someone explain why 20 in wcoss2 fails?

I know @GeorgeVandenberghe-NOAA was looking into some of this. There were machine issues on wcoss2, but despite many tries by myself and @jiandewang 20 never worked on WCOSS2, but 10 does. I know this is not ideal but I don't know of another solution.

@aerorahul
Copy link
Contributor

Is there an issue open for this anywhere and is it being tracked?
How much memory is actually used at this resolution? Is the write grid component load balanced for these choices?
A few lines above, for CDUMP=gdas, would there be similar issues with memory?

@JessicaMeixner-NOAA
Copy link
Contributor Author

JessicaMeixner-NOAA commented Aug 30, 2023

There's ongoing work led by @junwang-noaa to determine optimal load balancing for operational targets. This admittedly is not an extensive load balancing but an attempt to ensure that everyone could run HR2 with C768 out of the box for various machines. We did confirm we had similar performance for this set up as we did for HR1, which was not surprising. I have not tested in cycled mode, therefore did not add likes for the gdas component, but yes we'd likely be in a similar situation and I'd be happy to add additional lines there as this soon will be something that is tested.

@aerorahul
Copy link
Contributor

There's ongoing work led by @junwang-noaa to determine optimal load balancing for operational targets. This admittedly is not an extensive load balancing but an attempt to ensure that everyone could run HR2 with C768 out of the box for various machines. We did confirm we had similar performance for this set up as we did for HR1, which was not surprising. I have not tested in cycled mode, therefore did not add likes for the gdas component, but yes we'd likely be in a similar situation and I'd be happy to add additional lines there as this soon will be something that is tested.

Thanks.
I am leery of introducing machine specific logic in this file. We are actively working towards reducing collecting machine specific logic in as few places as possible. This is and has been our goal all along. This development runs contrary to our efforts in improving the system while supporting existing developments.

Running any configuration out of the box is an ideal desire from users. Though achieving this requires significant refactoring of the configuration system and is beyond the scope for any upcoming effort.

Having said that, I am not against merging this in the interest of getting this for HR2, though I would suggest not bringing this in develop and keep it in a HR2 tag/branch.

@JessicaMeixner-NOAA
Copy link
Contributor Author

I know this wasn't ideal, but at some point I don't know how to deal with the fact that one machine has significantly less memory than the others and requires a different settings. I know there are efforts to determine ideal settings for multiple machines, but in the meantime I'm not sure a way around it. I do not want to branch off of develop as I think that encourages people into their own sand-boxes and counters the efforts we've made to get everyone back to the develop branch. How about we leave the 20 here and maybe make a comment about the WCOSS2 setting as a compromise? Alternatively we'll just leave the 20 there.

@junwang-noaa
Copy link
Contributor

@JessicaMeixner-NOAA is there an issue available on tracking the issues with write grid component settings in HR2 in the tested platforms? Can you list the error message, model revision, platforms etc so that we can confirm it is the memory issue? If I remember correctly you mentioned about hera node issues before, can you include those information in the issue too?

@JessicaMeixner-NOAA
Copy link
Contributor Author

The node failure issue on hera has an open hera helpdesk issue-- I'll ask to see what the status is.

Many people have experienced this issue and an issue was previously opened on global-workflow for this: #1793 which was closed when we moved to 20 instead of 10. We've frequently ran into memory issues that were simply solved by changing settings, so this was not seen as a red-flag or cause for opening an issue on the model side. There's also a general workflow issue open about memory issues: #1664

If a specific ufs-weather-model issue is needed I can create one and add the information I have, although I finally cleaned up a lot of my runs as we were running out of space and we have successful configurations, so do not have much in terms of failed log files at this point.

@junwang-noaa
Copy link
Contributor

@JessicaMeixner-NOAA Thanks for the information. May I ask if there are diag_table updates in HR2 from HR1? Have we output more fields in history files? I am trying to understand what may cause the memory increase.

@JessicaMeixner-NOAA
Copy link
Contributor Author

There were a few additional variables added to the diag_table:
image

For full changes in workflow:
prototype/hr1...develop (Note this doesn't have the final changes from this PR though, which are small).

@junwang-noaa
Copy link
Contributor

What is the size of the sfc file compared to HR1?

@JessicaMeixner-NOAA
Copy link
Contributor Author

What is the size of the sfc file compared to HR1?

I don't have the size of HR1 sfc files readily available. HR2 sfc NetCDF files are 1.3G.

@jiandewang
Copy link
Contributor

What is the size of the sfc file compared to HR1?

I don't have the size of HR1 sfc files readily available. HR2 sfc NetCDF files are 1.3G.

just hecked HR1 sfc,
1313637221 2023-03-18 02:28 ./gfs.20200619/00/atmos/gfs.t00z.sfcf378.nc

@JessicaMeixner-NOAA
Copy link
Contributor Author

Thanks @jiandewang

The likely fix for the hera node failures will be implemented in the next maintenance on Sept 6th.

@GeorgeVandenberghe-NOAA
Copy link

@junwang-noaa
Copy link
Contributor

Just to confirm, the node failure has no impact on the write grid comp settings?

@JessicaMeixner-NOAA
Copy link
Contributor Author

Just to confirm, the node failure has no impact on the write grid comp settings?

Not to my knowledge no and while we didn't always see the "node failure" as a cause of an issue, the issues were widespread beyond just me I know at least 3-4 people who all independently had the failures that were fixed by going from WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10 to 20 on hera. @jiandewang do you have a log file? Otherwise the failure will have to be reproduced.

@GeorgeVandenberghe-NOAA the 10/20 is referring to WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS

@GeorgeVandenberghe-NOAA
Copy link

@WalterKolczynski-NOAA
Copy link
Contributor

WalterKolczynski-NOAA commented Aug 30, 2023

20*4*6 = 480, so the write group size itself should fit in a whole number of nodes on WCOSS2. I don't know that it necessarily would be assigned such by UFS though.

@WalterKolczynski-NOAA
Copy link
Contributor

How about we leave the 20 here and maybe make a comment about the WCOSS2 setting as a compromise?

Discussed it with Rahul. Let's go with this to get HR2 in.

@JessicaMeixner-NOAA
Copy link
Contributor Author

@WalterKolczynski-NOAA and @aerorahul -- I'll make this update now!

and removing hera exception and added a note that 10 is needed for WCOSS2
@GeorgeVandenberghe-NOAA
Copy link

@WalterKolczynski-NOAA WalterKolczynski-NOAA self-assigned this Aug 30, 2023
@WalterKolczynski-NOAA WalterKolczynski-NOAA merged commit 63270da into NOAA-EMC:develop Aug 30, 2023
@JessicaMeixner-NOAA
Copy link
Contributor Author

Thank you!!!! @WalterKolczynski-NOAA can this be tagged for HR2? Let me know if you want me to make an issue for that request, I'd be happy to.

@WalterKolczynski-NOAA
Copy link
Contributor

Already on it. Just need to hop on a machine and clone so I can make the tag. Will be prototype/hr2.

@WalterKolczynski-NOAA
Copy link
Contributor

@JessicaMeixner-NOAA Tag created: https://github.com/NOAA-EMC/global-workflow/releases/tag/prototype%2Fhr2

@JessicaMeixner-NOAA
Copy link
Contributor Author

Thank you @WalterKolczynski-NOAA !!!

@JessicaMeixner-NOAA JessicaMeixner-NOAA deleted the HR2updates branch March 3, 2024 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update ufs-weather-model hash to HR2 code tag
6 participants