Add/fix build capability for Gaea-C5, Gaea-C6, and container #800

Merged
13 commits merged into NOAA-EMC:develop on Nov 12, 2024

Conversation

@DavidBurrows-NCO (Contributor) commented Oct 23, 2024

Resolves #799

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?
Cloned and built on Gaea-C5, Gaea-C6, and in a container.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

@RussTreadon-NOAA (Contributor) left a comment


Looks OK to me.

Two comments:

  1. I can't test the container.
  2. Additional changes are needed to run GSI/EnKF ctests on Gaea-C6. @DavidBurrows-NCO , do you plan on adding these changes to this PR or will a new issue and PR be opened to activate GSI/EnKF ctests on Gaea-C6?

@@ -155,7 +155,11 @@ target_link_libraries(gsi_fortran_obj PUBLIC nemsio::nemsio)
target_link_libraries(gsi_fortran_obj PUBLIC ncio::ncio)
target_link_libraries(gsi_fortran_obj PUBLIC w3emc::w3emc_d)
target_link_libraries(gsi_fortran_obj PUBLIC sp::sp_d)
target_link_libraries(gsi_fortran_obj PUBLIC bufr::bufr_d)
if(DEFINED ENV{USE_BUFR4})
Looks OK to me. Cross checking with @DavidHuber-NOAA . PR #791 upgrades to bufr/12.1.0. Not sure how the bufr logic added here might impact Dave's PR.
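
For context, the new if(DEFINED ENV{USE_BUFR4}) branch keys off an environment variable at configure time. A minimal usage sketch, assuming the branch selects a 4-byte BUFR target (the bufr::bufr_4 target name is an assumption on my part, not something this diff confirms):

# Hypothetical build sequence: with USE_BUFR4 set, the CMake branch above
# presumably links a 4-byte BUFR target (e.g. bufr::bufr_4) instead of bufr::bufr_d.
export USE_BUFR4=YES
cmake -DCMAKE_BUILD_TYPE=Release ..   # configure step reads USE_BUFR4 from the environment
make -j 8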

@DavidBurrows-NCO (Contributor, Author)

@RussTreadon-NOAA Thanks for taking a look.

  1. @mark-a-potts Do you want to say anything about the container building?
  2. I have some time today, so let me look into the regression tests on C6. I'll get back to you today.

@DavidBurrows-NCO (Contributor, Author)

Hello @RussTreadon-NOAA. I've made good progress on the GSI regression tests. I'm currently using the same walltime/processor configuration for C6 as for C5. This can be adjusted, but here are the current results:

Test project /gpfs/f6/bil-fire8/scratch/David.Burrows/oct24_gsi/c5c6con_branch/build
    Start 3: rrfs_3denvar_rdasens
    Start 1: global_4denvar
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
    Start 2: rtma
1/6 Test #2: rtma .............................   Passed  843.31 sec
2/6 Test #6: global_enkf ......................   Passed  844.64 sec
3/6 Test #5: hafs_3denvar_hybens ..............   Passed  1029.12 sec
4/6 Test #4: hafs_4denvar_glbens ..............   Passed  1154.60 sec
5/6 Test #1: global_4denvar ...................   Passed  1561.03 sec

rrfs_3denvar_rdasens_loproc_updat keeps hitting the wall clock limit even after I increased it to 60 minutes. It freezes in the same spot each time. I've attached a text file of the output log. I don't see anything informative in the working directory. Please let me know your thoughts. Thanks!
rrfs_output_wall_clock_C6_for_russ.odt

@RussTreadon-NOAA (Contributor)

Thank you @DavidBurrows-NCO for the update. This looks good. We've had problems with the rrfs_3denvar_rdasens test on other machines. Tagging regional DA staff: @TingLei-NOAA , @ShunLiu-NOAA , @hu5970

@RussTreadon-NOAA (Contributor)

C6 ctest results

@DavidBurrows-NCO , I obtained similar C6 ctest results:

Test project /gpfs/f6/ira-sti/scratch/Russ.Treadon/git/gsi/pr800/build
    Start 2: rtma
    Start 1: global_4denvar
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #2: rtma .............................   Passed  851.24 sec
2/6 Test #6: global_enkf ......................   Passed  851.31 sec
3/6 Test #5: hafs_3denvar_hybens ..............   Passed  975.95 sec
4/6 Test #4: hafs_4denvar_glbens ..............   Passed  1275.93 sec
5/6 Test #1: global_4denvar ...................   Passed  1565.65 sec

rrfs_3denvar_rdasens_loproc_updat hung. The job was killed when it reached the specified 30-minute wall clock limit.

gsi_metguess_mod*create_: alloc() for met-guess done
 guess_grids*create_chemges_grids: trouble getting number of chem/gases
 metvardeb333-2d name ps
 metvardeb333-2d name z
 metvardeb333-2d name t2m
 metvardeb333-2d name q2m
 metvardeb333-3d name u
 metvardeb333-3d name v
 metvardeb333-3d name w
 metvardeb333-3d name tv
 metvardeb333-3d name q
 metvardeb333-3d name oz
 metvardeb333-3d name delp
 metvardeb333-3d name ql
 metvardeb333-3d name qr
 metvardeb333-3d name qs
 metvardeb333-3d name qi
 metvardeb333-3d name qg
 metvardeb333-3d name dbz
 metvardeb333-3d name fed
  fv3lam_io_dynmetvars3d_nouv is wtsendelp
  fv3lam_io_tracermevars3d_nouv is qozqlqrqsqiqg
  fv3lam_io_phymetvars3d_nouv is dbzfed
 the metvarname z will be dealt separately
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 207256804.0 ON c6n0992 CANCELLED AT 2024-11-05T13:18:35 DUE TO TIME LIMIT ***

C5 ctest results

Ctest behavior on C5 is very different. The following tests ran and failed:

Test project /gpfs/f5/nggps_emc/scratch/Russ.Treadon/git/gsi/pr800/build
    Start 2: rtma
    Start 1: global_4denvar
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #2: rtma .............................***Failed  484.22 sec
2/6 Test #6: global_enkf ......................***Failed  484.22 sec
3/6 Test #1: global_4denvar ...................***Failed  554.70 sec

rtma
The low task count rtma job failed at the end of the run. Timing statistics are printed from task 0, but the traceback in /gpfs/f5/nggps_emc/scratch/Russ.Treadon/Russ.Treadon/gsi_tmp/ptmp/tmpreg_rtma/rtma_loproc_updat/stdout cites mpi_finalize:

 GENSTATS_GPS:  no profiles to process (nprof_gfs=           0 ), EXIT routine
 gsi_metguess_mod*destroy_: dealloc() for met-guess done
 glbsoi: complete
[000]gsisub(): : complete.


     ENDING DATE-TIME    NOV 05,2024  12:28:09.735  310  TUE   2460620
     PROGRAM GSI_ANL HAS ENDED.
* . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libpthread-2.31.s  00001512DD61F910  Unknown               Unknown  Unknown
libmlx5.so.1.24.4  00001512D240FC4B  Unknown               Unknown  Unknown
libmlx5.so.1.24.4  00001512D245A81A  Unknown               Unknown  Unknown
libfabric.so.1.23  00001512D715FA73  Unknown               Unknown  Unknown
libfabric.so.1.23  00001512D7168CF7  Unknown               Unknown  Unknown
libfabric.so.1.23  00001512D717293C  Unknown               Unknown  Unknown
libfabric.so.1.23  00001512D71733EB  Unknown               Unknown  Unknown
libfabric.so.1.23  00001512D7175DD4  Unknown               Unknown  Unknown
libmpi_intel.so.1  00001512D9066FFE  Unknown               Unknown  Unknown
libmpi_intel.so.1  00001512D8EBA484  Unknown               Unknown  Unknown
libmpi_intel.so.1  00001512D7945854  MPI_Finalize          Unknown  Unknown
libmpifort_intel.  00001512DD6742B9  pmpi_finalize__       Unknown  Unknown
gsi.x              000000000041502F  MAIN__                    643  gsimain.f90

...

*****************RESOURCE STATISTICS*******************************
The total amount of wall time                        = 187.684936
The total amount of time in user mode                = 178.017540
The total amount of time in sys mode                 = 4.573581
The maximum resident set size (KB)                   = 1872772
Number of page faults without I/O activity           = 309201
Number of page faults with I/O activity              = 301
Number of times filesystem performed INPUT           = 2440
Number of times filesystem performed OUTPUT          = 0
Number of Voluntary Context Switches                 = 8665
Number of InVoluntary Context Switches               = 177
*****************END OF RESOURCE STATISTICS*************************

forrtl: severe (174): SIGSEGV, segmentation fault occurred

global_enkf
Here the low task count job ran to completion but the high task count job failed. Again, task 0 wrote timing statistics to stdout, but /gpfs/f5/nggps_emc/scratch/Russ.Treadon/Russ.Treadon/gsi_tmp/ptmp/tmpreg_global_enkf/global_enkf_hiproc_updat/stderr contains

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libpthread-2.31.s  0000152997AAD910  Unknown               Unknown  Unknown
libmlx5.so.1.24.4  000015298B0C8C4B  Unknown               Unknown  Unknown
libmlx5.so.1.24.4  000015298B11381A  Unknown               Unknown  Unknown
libfabric.so.1.23  0000152991661A73  Unknown               Unknown  Unknown
libfabric.so.1.23  000015299166ACF7  Unknown               Unknown  Unknown
libfabric.so.1.23  000015299167493C  Unknown               Unknown  Unknown
libfabric.so.1.23  00001529916753EB  Unknown               Unknown  Unknown
libfabric.so.1.23  0000152991677DD4  Unknown               Unknown  Unknown
libmpi_intel.so.1  0000152993568FFE  Unknown               Unknown  Unknown
libmpi_intel.so.1  00001529933BC484  Unknown               Unknown  Unknown
libmpi_intel.so.1  0000152991E47854  MPI_Finalize          Unknown  Unknown
libmpifort_intel.  000015299F28C2B9  pmpi_finalize__       Unknown  Unknown
enkf.x             000000000046C65A  mpisetup_mp_mpi_c         134  mpisetup.f90
enkf.x             0000000000414D68  MAIN__                    281  enkf_main.f90
enkf.x             000000000041423D  Unknown               Unknown  Unknown
libc-2.31.so       0000152996DF924D  __libc_start_main     Unknown  Unknown
enkf.x             000000000041416A  Unknown               Unknown  Unknown

Line 134 of src/enkf/mpisetup.f90 is call mpi_finalize(ierr)

global_4denvar
The low task count job failed just like the low task count rtma. Timing statistics from task 0 are printed to /gpfs/f5/nggps_emc/scratch/Russ.Treadon/Russ.Treadon/gsi_tmp/ptmp/tmpreg_global_4denvar/global_4denvar_loproc_updat/stdout along with a traceback pointing to mpi_finalize

 destroy_ges_derivatives: successfully complete
 destroy_ges_tendencies: successfully complete
 gsi_chemguess_mod*destroy_: dealloc() for chem-tracer done
 gsi_metguess_mod*destroy_: dealloc() for met-guess done
 glbsoi: complete
[000]gsisub(): : complete.


     ENDING DATE-TIME    NOV 05,2024  12:29:31.208  310  TUE   2460620
     PROGRAM GSI_ANL HAS ENDED.
* . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libpthread-2.31.s  0000148FF5C7C910  Unknown               Unknown  Unknown
libmlx5.so.1.24.4  0000148FEA890C4B  Unknown               Unknown  Unknown
libmlx5.so.1.24.4  0000148FEA8DB81A  Unknown               Unknown  Unknown
libfabric.so.1.23  0000148FEED5FA73  Unknown               Unknown  Unknown
libfabric.so.1.23  0000148FEED68CF7  Unknown               Unknown  Unknown
libfabric.so.1.23  0000148FEED7293C  Unknown               Unknown  Unknown
libfabric.so.1.23  0000148FEED733EB  Unknown               Unknown  Unknown
libfabric.so.1.23  0000148FEED75DD4  Unknown               Unknown  Unknown
libmpi_intel.so.1  0000148FF0C66FFE  Unknown               Unknown  Unknown
libmpi_intel.so.1  0000148FF0ABA484  Unknown               Unknown  Unknown
libmpi_intel.so.1  0000148FEF545854  MPI_Finalize          Unknown  Unknown
libmpifort_intel.  0000148FF5CD62B9  pmpi_finalize__       Unknown  Unknown
gsi.x              000000000041502F  MAIN__                    643  gsimain.f90
gsi.x              0000000000414FBD  Unknown               Unknown  Unknown

...

*****************RESOURCE STATISTICS*******************************
The total amount of wall time                        = 439.916953
The total amount of time in user mode                = 403.179300
The total amount of time in sys mode                 = 8.852160
The maximum resident set size (KB)                   = 1257204
Number of page faults without I/O activity           = 443816
Number of page faults with I/O activity              = 1279
Number of times filesystem performed INPUT           = 402022
Number of times filesystem performed OUTPUT          = 0
Number of Voluntary Context Switches                 = 38081
Number of InVoluntary Context Switches               = 300
*****************END OF RESOURCE STATISTICS*************************

Not sure what's going on here. Do we need to (un)set certain environment variables on C5?

The remaining tests

hafs_3denvar_hybens
hafs_4denvar_glbens
rrfs_3denvar_rdasens

were all killed by the system after reaching the specified wall clock limit. Interestingly, the low task count hafs_3denvar and hafs_4denvar jobs ran to completion. The high task count jobs did not finish. It looks like they hung while reading radar data.

  for radar KDGX nsuper=          57  delazmmax=  0.248988632686689      T
  vrmin,max=  -12.1000000000000        9.08870967741935       errmin,max=
  0.000000000000000E+000   4.54726892432853
  deltiltmin,max=  3.818932603314051E-002  0.605471355033601
  deldistmin,max=  -387.290631756274       -5.25372147616690
slurmstepd: error: *** STEP 135229599.0 ON c5n0775 CANCELLED AT 2024-11-05T12:50:00 DUE TO TIME LIMIT ***

The C5 rrfs_3denvar_rdasens job hangs in the same way as observed on C6:

 gsi_metguess_mod*create_: alloc() for met-guess done
 guess_grids*create_chemges_grids: trouble getting number of chem/gases
 metvardeb333-2d name ps
 metvardeb333-2d name z
 metvardeb333-2d name t2m
 metvardeb333-2d name q2m
 metvardeb333-3d name u
 metvardeb333-3d name v
 metvardeb333-3d name w
 metvardeb333-3d name tv
 metvardeb333-3d name q
 metvardeb333-3d name oz
 metvardeb333-3d name delp
 metvardeb333-3d name ql
 metvardeb333-3d name qr
 metvardeb333-3d name qs
 metvardeb333-3d name qi
 metvardeb333-3d name qg
 metvardeb333-3d name dbz
 metvardeb333-3d name fed
  fv3lam_io_dynmetvars3d_nouv is wtsendelp
  fv3lam_io_tracermevars3d_nouv is qozqlqrqsqiqg
  fv3lam_io_phymetvars3d_nouv is dbzfed
 the metvarname z will be dealt separately
slurmstepd: error: *** STEP 135229565.0 ON c5n0801 CANCELLED AT 2024-11-05T12:36:00 DUE TO TIME LIMIT ***

@CoryMartin-NOAA (Contributor)

@RussTreadon-NOAA I vaguely recall something like this previously, like 6-9 months ago, where GSI would run but crash at the very end of execution. Do you recall this, or am I imagining it?

@RussTreadon-NOAA (Contributor)

@CoryMartin-NOAA , this sounds vaguely familiar. Let me check GSI issues and PRs for clues. The rrfs failure is a known problem.

@DavidBurrows-NCO (Contributor, Author)

Hi @RussTreadon-NOAA. I know it's not the solution you want, but I adjusted the node/processor configuration to match Hera on C6 and rrfs was successful:

David.Burrows@gaea65 07:49 build $ ctest -j 6
Test project /gpfs/f6/bil-fire8/scratch/David.Burrows/oct24_gsi/c5c6con_branch/build
    Start 1: global_4denvar
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  487.91 sec
2/6 Test #6: global_enkf ......................   Passed  848.19 sec
3/6 Test #2: rtma .............................   Passed  902.67 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  965.08 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1094.00 sec
6/6 Test #1: global_4denvar ...................   Passed  1561.08 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1561.09 sec

I'm working on C5 right now.

@RussTreadon-NOAA (Contributor)

@DavidBurrows-NCO : Changing the task count is consistent with the regional DA team recommendation. Thank you for looking at the C5 failures.

@RussTreadon-NOAA (Contributor)

@TingLei-NOAA informed me that he will be on leave and is unable to update the task count for rrfs_3denvar_rdasens on various machines.

@DavidBurrows-NCO , please commit the modified regression/regression_param.sh you used to get a successful rrfs_3denvar_rdasens run on Gaea C6. Thank you!
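
For reference, regression/regression_param.sh holds the per-machine walltime and task settings discussed in this thread. A minimal sketch of the kind of stanza involved; the variable names and values below are hypothetical, not the actual file contents:

# Hedged sketch of a per-machine resource stanza in the spirit of
# regression/regression_param.sh; names and values are illustrative only.
case $machine in
  Gaea)
    walltime="0:15:00"   # wall clock limit for rrfs_3denvar_rdasens
    ntasks_lo=20         # low task count run, mirroring the Hera layout that passed on C6
    ntasks_hi=40         # high task count run
    ;;
esac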

@RussTreadon-NOAA (Contributor)

@DavidBurrows-NCO , we need two peer reviews for GSI PRs. My review doesn't count as a peer review. Who would you like to review this PR?

@RussTreadon-NOAA self-requested a review November 7, 2024 19:30
@RussTreadon-NOAA (Contributor) left a comment

Approve.

ush/sub_gaeac5 (Outdated)
@@ -158,6 +158,7 @@ sbatch=${sbatch:-sbatch}
ofile=$DATA/subout$$
>$ofile
chmod 777 $ofile
export FI_VERBS_PREFER_XRC=0

Does this setting resolve what appears to be mpi_finalize problems on C5?

@DavidBurrows-NCO (Contributor, Author)

> Does this setting resolve what appears to be mpi_finalize problems on C5?

It appears so. Here is the notice from Seth Underwood regarding Gaea C5: "After the C5 update, users reported that some jobs failed during the MPI_Finalize call. We have alerted ORNL and HPE. HPE has suggested setting the environment variable FI_VERBS_PREFER_XRC=0 in the run script (setenv FI_VERBS_PREFER_XRC 0, for csh; export FI_VERBS_PREFER_XRC=0). This has resolved the error in our tests. Please add this variable to your run script(s) if you also hit this error. Please note that we do not see any issues preemptively setting this environment variable."

Now that I think the MPI_Finalize issue is resolved, I am going to adjust the resources and test a little more. I'll let you know when I have my final changes in place for you to look over.
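
For anyone hitting the same failure mode, the workaround from the notice amounts to a single environment variable in the run script; a minimal sketch, with syntax for both shell families:

# Work around the MPI_Finalize crashes reported after the Gaea C5 update by
# telling libfabric's verbs provider not to prefer XRC transport.
export FI_VERBS_PREFER_XRC=0     # sh/bash run scripts (as added to ush/sub_gaeac5 above)
# setenv FI_VERBS_PREFER_XRC 0   # csh/tcsh equivalent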

CoryMartin-NOAA previously approved these changes Nov 7, 2024
@RussTreadon-NOAA (Contributor)

Excellent! Thank you @DavidBurrows-NCO for working through various issues.

@DavidBurrows-NCO (Contributor, Author)

@RussTreadon-NOAA Quick question: if a particular test fails, but I check all the stdout files and they return rc=0, does that typically mean the job took too long to run? I assume there are set runtime thresholds for each test? Thanks

@RussTreadon-NOAA (Contributor)

Unfortunately, the checks in the GSI ctests are not very robust. Some of the timing and memory usage checks can yield false positives: the test reports Failed, but a check of the results does not indicate a problem. Since GSI has no code manager and we are transitioning to JEDI, it's unlikely the GSI ctests will be cleaned up to yield more consistent results. If there's a particular failure you'd like me to look at, give me the path to the run directory and I'll take a look.
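
As a rough illustration of how such a timing false positive arises, the check amounts to comparing the updat wall time against a fixed margin over the contrl wall time; the thresholds quoted later in this thread are consistent with a 10% margin (e.g. 201.010062 = 1.10 x 182.736420). A hedged sketch with hypothetical variable names, not the actual regression script logic:

# Hedged sketch of a wall-time threshold check; names are hypothetical.
time_updat=252.267314                              # updat gsi.x wall time (sec)
time_contrl=182.736420                             # contrl gsi.x wall time (sec)
timethresh=$(echo "$time_contrl * 1.10" | bc -l)   # 10% margin over the control run
if [ "$(echo "$time_updat > $timethresh" | bc -l)" -eq 1 ]; then
  echo "runtime $time_updat exceeded threshold $timethresh: time-thresh failure"
fi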

CoryMartin-NOAA previously approved these changes Nov 8, 2024
@RussTreadon-NOAA (Contributor)

Gaea C5 and C6 ctests
Install NOAA-EPIC:feature/c5c6conbuild at 3b98cf4 on Gaea C5 and C6. Run ctests with the following results. On both machines NOAA-EPIC:feature/c5c6conbuild is used as both the updat and contrl.

Gaea C5

Test project /gpfs/f5/nggps_emc/scratch/Russ.Treadon/git/gsi/pr800/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #6: global_enkf ......................   Passed  1570.25 sec
2/6 Test #3: rrfs_3denvar_rdasens .............   Passed  1747.56 sec
3/6 Test #1: global_4denvar ...................   Passed  2688.64 sec
4/6 Test #4: hafs_4denvar_glbens ..............   Passed  3742.89 sec
5/6 Test #5: hafs_3denvar_hybens ..............   Passed  3743.19 sec
6/6 Test #2: rtma .............................***Failed  4817.39 sec

83% tests passed, 1 tests failed out of 6

Total Test time (real) = 4817.41 sec

The following tests FAILED:
          2 - rtma (Failed)

The rtma test failed due to:

The runtime for rtma_loproc_updat is 252.267314 seconds.  This has exceeded maximum allowable threshold time of 201.010062 seconds, resulting in Failure time-thresh of the regression test.

Here are the gsi.x wall times for various runs of this test

rtma_hiproc_contrl/stdout:The total amount of wall time                        = 182.670720
rtma_hiproc_updat/stdout:The total amount of wall time                        = 180.407343
rtma_loproc_contrl/stdout:The total amount of wall time                        = 182.736420
rtma_loproc_updat/stdout:The total amount of wall time                        = 252.267314

Indeed, the loproc_updat wall time is considerably greater than the loproc_contrl wall time. Note, however, that the updat and contrl tests use the same gsi.x. Thus, the wall time difference reflects system load, I/O speed, or other system-related factors. This is not a fatal fail.
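
Wall-time listings in this form can be pulled straight from the test run directories with a standard grep, e.g.:

# Gather gsi.x wall times across the updat/contrl run directories
grep "The total amount of wall time" rtma_*/stdout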

Gaea C6

Test project /gpfs/f6/ira-sti/scratch/Russ.Treadon/git/gsi/pr800/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  487.90 sec
2/6 Test #6: global_enkf ......................***Failed  905.26 sec
3/6 Test #2: rtma .............................***Failed  1144.95 sec
4/6 Test #4: hafs_4denvar_glbens ..............   Passed  1206.88 sec
5/6 Test #5: hafs_3denvar_hybens ..............   Passed  1454.39 sec
6/6 Test #1: global_4denvar ...................   Passed  1866.25 sec

67% tests passed, 2 tests failed out of 6

Total Test time (real) = 1866.26 sec

The following tests FAILED:
          2 - rtma (Failed)
          6 - global_enkf (Failed)

The global_enkf and rtma failures are for the same reason:

The runtime for global_enkf_hiproc_updat is 238.742326 seconds.  This has exceeded maximum allowable threshold time of 209.509297 seconds, resulting in Failure timethresh2 of the regression test.
The runtime for rtma_hiproc_updat is 208.155676 seconds.  This has exceeded maximum allowable threshold time of 194.161564 seconds, resulting in Failure timethresh2 of the regression test.

Here are the wall times from the various tests

tmpreg_global_enkf/global_enkf_hiproc_contrl/stdout:The total amount of wall time                        = 190.462998
tmpreg_global_enkf/global_enkf_hiproc_updat/stdout:The total amount of wall time                        = 238.742326
tmpreg_global_enkf/global_enkf_loproc_contrl/stdout:The total amount of wall time                        = 159.689075
tmpreg_global_enkf/global_enkf_loproc_updat/stdout:The total amount of wall time                        = 147.498631
tmpreg_rtma/rtma_hiproc_contrl/stdout:The total amount of wall time                        = 176.510513
tmpreg_rtma/rtma_hiproc_updat/stdout:The total amount of wall time                        = 208.155676
tmpreg_rtma/rtma_loproc_contrl/stdout:The total amount of wall time                        = 171.044928
tmpreg_rtma/rtma_loproc_updat/stdout:The total amount of wall time                        = 167.905287

For both tests the hiproc_updat wall time is notably greater than the hiproc_contrl wall time. This is interesting since the updat and contrl runs use the same executables. The wall time differences reflect differences in system load, I/O speed, or other aspects of the system. This is not a fatal fail.

@RussTreadon-NOAA self-requested a review November 8, 2024 15:43
@RussTreadon-NOAA (Contributor) left a comment

Looks good. Please reduce the Gaea C6 wall clock limit for rrfs_3denvar_rdasens from 0:60:00 to 0:15:00.

regression/regression_param.sh (Outdated; resolved)
@RussTreadon-NOAA (Contributor)

This PR is awaiting the return of WCOSS2 to developers so WCOSS2 ctests can be run. Assuming reduction of the Gaea C6 rrfs_3denvar_rdasens wall time and acceptable WCOSS2 results, this PR can be merged into develop.

@RussTreadon-NOAA (Contributor) left a comment

Looks good to me. Approve.

CoryMartin-NOAA previously approved these changes Nov 8, 2024
@RussTreadon-NOAA (Contributor) left a comment

Approve.

@RussTreadon-NOAA (Contributor)

@DavidBurrows-NCO , NCO said the Cactus upgrade encountered some issues which they are working through. I'm not sure when the development WCOSS2 machine will come back online. While we wait, I installed this PR on both Dogwood and Cactus. I'm ready to run when either machine becomes dev.

@DavidBurrows-NCO (Contributor, Author)

@RussTreadon-NOAA Thanks for the info, and thanks for your quick back and forth with this PR. Have a good weekend!

@RussTreadon-NOAA (Contributor)

WCOSS2 ctests

Install feature/c5c6conbuild @ 8cf6434 and develop @ b0e3cba on Cactus. Run ctests with the following results:

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr800/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  730.56 sec
2/6 Test #6: global_enkf ......................   Passed  857.98 sec
3/6 Test #2: rtma .............................   Passed  974.22 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1220.23 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1280.17 sec
6/6 Test #1: global_4denvar ...................   Passed  1683.64 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1683.75 sec

@RussTreadon-NOAA merged commit 2136b87 into NOAA-EMC:develop Nov 12, 2024
4 checks passed