Version 2.3 Functional Release

@gdicker1 released this 01 Aug 23:39
· 85 commits to main since this release

The EarthWorks v2.3 release introduces these new features and fixes:

Optimizing OpenACC Data Transfers in RRTMG-P: Updates to the interface between RRTMG-P and CAM allow variables to stay resident on the GPU between timesteps. This improves performance by reducing data-transfer overhead.

Fix Correctness Issue: In longer science runs, marked differences were found between the outputs of EarthWorks v2.1 and those of previous versions. This release addresses the issue by reverting MPAS-A to a version without OpenACC offloading.

Fix GPU Builds with Modified Physics Columns: Increasing pcols previously caused build failures during the final linking step. This release addresses them by adding the -mcmodel=medium flag, for GPU builds only. Increasing the number of physics columns processed per MPI rank (and thus per GPU) reduces kernel launch overhead and provides a performance boost.
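
The effect of pcols on kernel launch overhead can be illustrated with a rough counting model. The sketch below is not EarthWorks or CAM code. It assumes that physics work on each rank is split into chunks of at most pcols columns and that each chunk triggers a fixed number of GPU kernel launches. The function names and all numbers (columns_per_rank, kernels_per_chunk) are hypothetical and chosen only for illustration.

```python
import math


def chunks_per_rank(columns_per_rank: int, pcols: int) -> int:
    """Chunks needed to cover one rank's columns (assumed model: pcols columns per chunk)."""
    return math.ceil(columns_per_rank / pcols)


def launches_per_step(columns_per_rank: int, pcols: int, kernels_per_chunk: int) -> int:
    """Kernel launches per physics step, assuming a fixed launch count per chunk."""
    return chunks_per_rank(columns_per_rank, pcols) * kernels_per_chunk


# Hypothetical values, not measurements from Derecho.
columns_per_rank = 4096
kernels_per_chunk = 100

for pcols in (16, 256, 4096):
    n = launches_per_step(columns_per_rank, pcols, kernels_per_chunk)
    print(f"pcols={pcols:5d} -> {n} kernel launches per physics step")
```

Under these assumptions the launch count falls in proportion to pcols until one chunk covers all of a rank's columns. Real speedups depend on per-kernel work and memory limits, which this model ignores.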

Fix Archiving Step for MPAS-O and MPAS-SI: Some test failures from the last release are addressed by adding a file that describes which output files to archive for MPAS-O and MPAS-SI.

Description of Model Configurations (Compsets)

See EarthWorks Supported Configurations in the GitHub wiki for more details.

Testing

Tested Systems

NSF NCAR’s Derecho Supercomputer

The tests in this release occurred on Derecho.

CPU-only hardware: Derecho’s CPU-only nodes consist of dual-socket, 64-core, 3rd Gen AMD EPYC™ 7763 Milan processors with 256 GB of DDR4 memory.

CPU/GPU hybrid hardware: Derecho’s GPU nodes consist of a single-socket, 64-core, 3rd Gen AMD EPYC™ 7763 Milan processor with 512 GB of DDR4 memory plus 4 NVIDIA A100 GPUs, each with 40 GB of onboard memory.

Tested Software Stacks

Compiler Versions

Derecho:

  • ifort (Intel Classic compiler version 2023.2.1)
  • ifx (Intel OneAPI version 2023.2.1)
  • nvfortran (NVHPC Fortran compiler version 24.3)
  • gfortran (GNU Fortran compiler version 12.2.0)

Libraries

Derecho:

  • MPI (Cray MPICH version 8.1.27)
  • Parallel-NetCDF (version 1.12.3)
  • PIO2 (version 2.6.2)
  • ESMF (version 8.6.0)

Testing Results

Derecho create_test Results

To test this release, CPU-only tests were carried out on Derecho using the ew-pr and ew-rel categories as described in the v2.2 release.

5 Day Smoke Tests (ew-pr)

  • 120km FHS94 with GNU, Intel-OneAPI, and NVHPC (Overall: PASS)
  • 120km FKESSLER with NVHPC (Overall: PASS)
  • 120km QPC6 with NVHPC (Overall: PASS)
  • 120km F2000climoEW with NVHPC (Overall: PASS)
  • 120km FullyCoupledEW with NVHPC (Overall: PASS)
  • 120km CHAOS2000dev with NVHPC, Intel, and GNU (Overall: PASS)

All these runs ran to completion and no differences were found when comparing between two runs of this release. These results can serve as baselines going forward.

11 Day Exact Restart Tests (ew-rel)

  • 120km, 32L CHAOS2000dev with NVHPC and Intel (Overall: FAIL)
  • 120km, 58L CHAOS2000dev with NVHPC and Intel (Overall: FAIL)
  • 60km CHAOS2000dev with NVHPC and Intel (Overall: FAIL)
    • These tests failed during initialization because too few resources (nodes) were requested.
  • 30km CHAOS2000dev with NVHPC and Intel (Overall: FAIL)
  • 15km, 58L CHAOS2000dev with NVHPC and Intel (Overall: FAIL)
    • These tests did not progress far into the simulation; future releases will increase the resources (nodes) requested.

For tests without another note, NVHPC runs failed during initialization (see Known Issues below) and Intel runs failed when comparing the original run to the restart run.

Known Issues

  • NVHPC compilers (tested nvfortran version 24.3, continues from previous releases): Initializing from restart fails.
    • Additional details: Any configuration (tested with QPC6, F2000climoEW, FullyCoupledEW) that attempts to restart from a previous run will fail in the CAM subroutine dyn_init.
    • Resolutions affected: all supported resolution/level combinations.
    • Workaround: Run without restart.