Create compute build option #3186

Merged

@DavidHuber-NOAA DavidHuber-NOAA commented Dec 20, 2024

Description

This creates scripts to run compute-node builds and also refactors the build_all.sh script to make it easier to build all executables.

Instead of using various options to control which components are built, build_all.sh now takes a list of one or more systems to build:

  • gfs builds everything needed for forecast-only gfs (UFS model with unstructured wave grid, gfs_utils, ufs_utils, upp, ww3 pre/post for unstructured wave grid)
  • gefs builds everything needed for GEFS (UFS model with structured wave grid, gfs_utils, ufs_utils, upp, ww3 pre/post for structured wave grid)
  • sfs builds everything needed for SFS (UFS model in hydrostatic mode with unstructured wave grid, gfs_utils, ufs_utils, upp, ww3 pre/post for structured wave grid)
  • gsi builds GSI-based DA components (gsi_enkf, gsi_monitor, gsi_utils)
  • gdas builds JEDI-based DA components (gdas app, gsi_monitor, gsi_utils)

all will build all of the above (mostly for testing)
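
The mapping from system names to components listed above can be sketched roughly as follows. This is only an illustration, not the logic in build_all.sh: the function names components_for_system and components_for_systems are hypothetical, the component labels are paraphrased from the list above, and the wave-grid and hydrostatic-mode differences between gfs, gefs, and sfs are left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of build_all.sh): print the components that
# one system pulls in, following the list in this PR description.
components_for_system() {
  case "$1" in
    gfs | gefs | sfs) echo "ufs gfs_utils ufs_utils upp ww3_prepost" ;;
    gsi) echo "gsi_enkf gsi_monitor gsi_utils" ;;
    gdas) echo "gdas gsi_monitor gsi_utils" ;;
    *)
      echo "unknown system: $1" >&2
      return 1
      ;;
  esac
}

# Print the de-duplicated union of components for a list of systems,
# one per line, in first-seen order. "all" expands to every system.
components_for_systems() {
  local system expanded name list component seen
  seen=" "
  for system in "$@"; do
    if [ "${system}" = "all" ]; then
      expanded="gfs gefs sfs gsi gdas"
    else
      expanded="${system}"
    fi
    for name in ${expanded}; do
      list=$(components_for_system "${name}") || return 1
      for component in ${list}; do
        case "${seen}" in
          *" ${component} "*) ;; # already listed, skip
          *)
            seen="${seen}${component} "
            echo "${component}"
            ;;
        esac
      done
    done
  done
}
```

For example, `components_for_systems gfs gsi gdas` lists gsi_monitor and gsi_utils once even though both gsi and gdas need them.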

Examples:
Build for forecast-only GFS:
./build_all.sh gfs
Build cycled GFS including coupled DA:
./build_all.sh gfs gsi gdas
Build GEFS:
./build_all.sh gefs
Build everything (for testing purposes):
./build_all.sh all
Other options, such as -d to build in debug mode, remain unchanged.

The full script signature is now:

./build_all.sh [-a UFS_app][-c build_config][-d][-f][-h][-v] [gfs] [gefs] [sfs] [gsi] [gdas] [all]
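
A signature of this shape (options first, then a list of systems) is commonly parsed with getopts followed by a shift. The sketch below only illustrates that pattern and is not copied from build_all.sh: the function name parse_build_args and all variable names are hypothetical, and since the meanings of -f and -v are not stated here, the sketch just records that they were given.

```shell
#!/usr/bin/env bash
# Hypothetical parser for: [-a UFS_app][-c build_config][-d][-f][-h][-v] [systems...]
# Results are left in the variables ufs_app, build_config, debug, flag_f,
# verbose, and systems (a space-separated string).
parse_build_args() {
  local option
  OPTIND=1 # reset so the function can be called more than once
  ufs_app=""
  build_config=""
  debug="NO"
  flag_f="NO"
  verbose="NO"
  systems=""
  while getopts ":a:c:dfhv" option; do
    case "${option}" in
      a) ufs_app="${OPTARG}" ;;
      c) build_config="${OPTARG}" ;;
      d) debug="YES" ;;
      f) flag_f="YES" ;;
      v) verbose="YES" ;;
      h)
        echo "usage: [-a UFS_app][-c build_config][-d][-f][-h][-v] [systems...]"
        return 0
        ;;
      :)
        echo "option -${OPTARG} requires an argument" >&2
        return 1
        ;;
      *)
        echo "unknown option -${OPTARG}" >&2
        return 1
        ;;
    esac
  done
  shift $((OPTIND - 1))
  systems="$*" # everything after the options is the list of systems
}
```

Calling `parse_build_args -d gfs gsi` would leave debug set to YES and systems set to "gfs gsi".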

Additionally, there is a new script to build components on the compute nodes using the job scheduler instead of the login node. This method takes the load off of the login nodes and may be faster in some cases. Compute build is invoked using the build_compute.sh script, which behaves similarly to the new build_all.sh:

./build_compute.sh [-h][-v][-A <hpc-account>] [gfs] [gefs] [sfs] [gsi] [gdas] [all]

Compute build generates a rocoto workflow and then calls rocotorun repeatedly until either a build fails or all builds succeed, at which point the script exits. Because the script calls rocotorun itself, you don't need to set up your own cron job, though advanced users can still use the regular rocoto tools on build.xml and build.db.
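
The run-until-done behavior can be pictured as a simple polling loop. The sketch below is a generic illustration under stated assumptions, not the code in build_compute.sh: poll_until_done and its two callbacks are hypothetical, and the RUNNING/FAILED/SUCCEEDED contract for the status callback is invented for the example. In practice the advance callback might wrap `rocotorun -w build.xml -d build.db`, and the status callback might summarize rocotostat output.

```shell
#!/usr/bin/env bash
# poll_until_done ADVANCE_CMD STATUS_CMD [INTERVAL_SECONDS]
#   ADVANCE_CMD: command or function that advances the workflow one step
#                (hypothetical; e.g. a wrapper around rocotorun).
#   STATUS_CMD:  command or function that prints RUNNING, FAILED, or
#                SUCCEEDED (hypothetical contract for this sketch).
# Returns 0 when all builds succeed, 1 when a build fails, and 2 when the
# advance command itself fails.
poll_until_done() {
  local advance_cmd status_cmd interval state
  advance_cmd="$1"
  status_cmd="$2"
  interval="${3:-60}"
  while true; do
    "${advance_cmd}" || return 2
    state=$("${status_cmd}")
    case "${state}" in
      SUCCEEDED)
        echo "all builds succeeded"
        return 0
        ;;
      FAILED)
        # The loop stops here; jobs already submitted are not cancelled.
        echo "a build failed; other build jobs may still be running" >&2
        return 1
        ;;
      *) sleep "${interval}" ;;
    esac
  done
}
```

Stopping at the first failure keeps the loop simple, at the cost that the workflow database is no longer refreshed for the jobs that are still running.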

Some things to note with the compute build:

  • When a build fails, other build jobs are not cancelled and will continue to run.
  • Since the script stops running rocotorun once one build fails, the rocoto database will no longer update with the status of the remaining jobs after that point.
  • Similarly, if the terminal running build_compute.sh gets disconnected, the rocoto database will no longer update.
  • In either of the above cases, you could run rocotorun yourself manually to update the database as long as the job information hasn't aged off the scheduler yet.

Resolves #3131

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? YES (build jobs are no longer specifiable)
  • Does this change require a documentation update? YES
  • Does this change require an update to any of the following submodules? NO

How has this been tested?

  • Built and linked all executables with both build_all.sh and build_compute.sh
  • Ran a C96C48_hybatmDA case on Hera
  • Ran a C48_S2SWA_gefs case on Hera

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • Any new scripts have been added to the .github/CODEOWNERS file with owners
  • I have made corresponding changes to the system documentation if necessary

@aerorahul aerorahul left a comment

looks good. thanks for taking the suggestions and updating to accommodate all systems.
A few more suggestions based on the code.

workflow/build_opts.yaml (outdated, resolved)
workflow/build_opts.yaml (outdated, resolved)
ush/compute_build.py (outdated, resolved)
workflow/build_compute.py (outdated, resolved)
sorc/build_compute.sh (outdated, resolved)
@DavidHuber-NOAA

@aerorahul Thanks for the quick review. I'm rerunning build tests now and will re-request a review when complete.

sorc/build_all.sh (outdated, resolved)
@aerorahul

> @aerorahul What do you think of the workaround I came up with to add cmake to the default modules? This will cause cmake to load when the UPP compile_upp.sh script runs module reset.

It works. Though, will this not be applied for all machines? It might have unintended consequences.

@DavidHuber-NOAA

DavidHuber-NOAA commented Dec 24, 2024

It will be applied to all machines that do not have cmake in the search path. Having tested compute nodes on Hera, Gaea-C5, Gaea-C6, Hercules, and S4, I can verify that these all have cmake in their search paths.

I could go one step further and 1) add a detect_machine.sh call to build_upp.sh and 2) only run this block if we are on WCOSS2. Thoughts?
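
A minimal sketch of such a guard, assuming the machine name is available as a lowercase string after detect_machine.sh runs (the variable name MACHINE_ID used in the comments is an assumption, and needs_cmake_workaround is a hypothetical helper, not existing code):

```shell
#!/usr/bin/env bash
# needs_cmake_workaround MACHINE [TOOL]
# Succeeds only when MACHINE is wcoss2 and TOOL (default: cmake) is not
# already on the search path. Hypothetical helper for illustration.
needs_cmake_workaround() {
  local machine tool
  machine="$1"
  tool="${2:-cmake}"
  [ "${machine}" = "wcoss2" ] && ! command -v "${tool}" >/dev/null 2>&1
}

# Possible use in a build script (MACHINE_ID is assumed to be set already):
#   if needs_cmake_workaround "${MACHINE_ID}"; then
#     ... apply the temporary cmake workaround here ...
#   fi
```

With this guard, the workaround block would be skipped on every machine other than WCOSS2, and also on WCOSS2 itself once cmake is on the search path.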

@DavidHuber-NOAA

DavidHuber-NOAA commented Dec 24, 2024

> @DavidHuber-NOAA Creating a git diff patch and applying it via build_compute.sh on WCOSS2 might be sufficient until we get the permanent fix in the upp module file. Just a thought.

This could also work, though the patch can only be applied once. Thus, if build_compute.sh needs to be called again, it will need to be smart enough to know that the patch was already applied or just ignore the resulting error.
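
One way to make the patch step safe to repeat is to test whether the patch reverses cleanly before applying it. The sketch below uses `git apply --reverse --check` for that test. It is an illustration only: apply_patch_once is a hypothetical helper, and it is not what build_compute.sh does.

```shell
#!/usr/bin/env bash
# apply_patch_once PATCH_FILE
# Applies PATCH_FILE to the current directory unless it is already applied.
# Hypothetical helper for illustration.
apply_patch_once() {
  local patch_file
  patch_file="$1"
  # If the patch can be cleanly reversed, it has already been applied.
  if git apply --reverse --check "${patch_file}" 2>/dev/null; then
    echo "already applied"
    return 0
  fi
  git apply "${patch_file}" && echo "applied"
}
```

Checking first avoids having to ignore the error from a second application, so a genuine failure (for example, a patch that no longer matches the source) still surfaces as a non-zero exit status.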

@aerorahul

aerorahul commented Dec 24, 2024

Since all of this is temporary, just do what's necessary and easy. Hopefully, upp will make a fix soon.

sorc/build_all.sh (fixed)
sorc/build_upp.sh (fixed)
@WenMeng-NOAA

> @WenMeng-NOAA We would like to build upp on the compute nodes on all platforms. Can we include cmake in wcoss2.lua in the upp modulefiles?

@aerorahul Sure, please submit a PR to the UPP repos.

@DavidHuber-NOAA

Test compute build on WCOSS2 was successful.

@aerorahul aerorahul left a comment

Lgtm

@RussTreadon-NOAA

FYI: updating $HOMEgfs/sorc/gdas.cd/modulefiles/GDAS/wcoss2.intel.lua to NCO's test installation of spack-stack/1.6.0 along with adding cmake to the UPP wcoss2 modulefile allows ./build_compute.sh -v gdas to successfully run to completion on Dogwood.

@WalterKolczynski-NOAA

UPP now builds successfully on compute nodes.

@WalterKolczynski-NOAA added the CI-Wcoss2-Passed label and removed the CI-Wcoss2-Failed label on Dec 24, 2024
@WalterKolczynski-NOAA WalterKolczynski-NOAA merged commit d85214d into NOAA-EMC:develop Dec 24, 2024
6 checks passed
tsga added a commit to tsga/global-workflow that referenced this pull request Jan 4, 2025
danholdaway added a commit to danholdaway/global-workflow that referenced this pull request Jan 27, 2025
Labels
CI-Hera-Passed, CI-Hercules-Passed, CI-Orion-Passed, CI-Wcoss2-Passed
Development

Successfully merging this pull request may close these issues.

Build using compute nodes
6 participants