
Write component hangs in nf90_enddef with planned operational RRFS #2174

Closed
SamuelTrahanNOAA opened this issue Mar 6, 2024 · 76 comments · Fixed by NOAA-EMC/fv3atm#803 or #2193

@SamuelTrahanNOAA (Collaborator) commented Mar 6, 2024

Description

The head of develop hangs while writing NetCDF output files in the write component when running the version of RRFS planned for operations. This happens regardless of the compression settings, or lack thereof. The behavior is as follows:

  1. Model runs normally.
  2. Data is sent to a write component.
  3. The write component writes metadata to the file.
  4. The write component gets stuck in an nf90_enddef call. Some MPI ranks leave the call prematurely and the rest are stuck in an MPI_Allreduce within nf90_enddef. This is on line 483 of FV3/io/module_write_netcdf.F90 (see stack trace).

Commenting out some of the variables in the diag_table prevents this problem, but there isn't one specific set of variables that seems to cause it. Turning off the lake model or smoke model prevents the hang, but note that doing so disables the writing of many variables.

Using one thread (no OpenMP) appears to reduce the frequency of the hangs, while greatly increasing the number of write component ranks appears to increase it. These conclusions are uncertain since we haven't run enough tests to get a statistically representative sample.

I have been unable to reproduce the problem when the model is compiled in debug mode.

This problem has been confirmed on Jet, Hera, and WCOSS2, but hasn't been tested on other machines.

From a lot of forum searching, this problem has been reported in the distant past when a model sends different metadata on different ranks: for example, 13 variables on one rank but 14 on the others, or three attributes on one rank and five on the others. I haven't investigated that possibility, but I don't see how it is possible in this code.
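
For illustration only, here is a minimal sketch (not taken from the model; all names are made up) of that hypothesized failure mode, assuming a parallel netCDF-4 file whose metadata definitions must match on every rank:

program mismatched_metadata_sketch
  ! Sketch: if one rank defines a variable the others do not, the collective
  ! work inside nf90_enddef can hang, which is the symptom described above.
  use mpi
  use netcdf
  implicit none
  integer :: ncid, dimid, varid, extra, ncerr, myrank, ierr

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, myrank, ierr)

  ncerr = nf90_create('out.nc', ior(NF90_CLOBBER, NF90_NETCDF4), ncid, &
                      comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)
  ncerr = nf90_def_dim(ncid, 'x', 10, dimid)
  ncerr = nf90_def_var(ncid, 'shared_var', NF90_REAL, [dimid], varid)
  if (myrank == 0) then
    ! Rank 0 alone defines an extra variable: ranks no longer agree on metadata.
    ncerr = nf90_def_var(ncid, 'rank0_only_var', NF90_REAL, [dimid], extra)
  end if
  ncerr = nf90_enddef(ncid)   ! collective; a hang would show up here
  ncerr = nf90_close(ncid)
  call mpi_finalize(ierr)
end program mismatched_metadata_sketch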

To Reproduce:

1. Executables were compiled like so:

target=jet # or wcoss2
opts="-DAPP=HAFSW -DCCPP_SUITES=FV3_HRRR_gf,FV3_global_nest_v1 -D32BIT=ON -DCCPP_32BIT=ON -DFASTER=ON"
./compile.sh "$target" "$opts" 32bit intel YES NO

2. Copy one of these test directories:

Jet: /lfs4/BMC/nrtrr/Samuel.Trahan/smoke/sudheer-case
Hera: /scratch2/BMC/wrfruc/Samuel.Trahan/rrfs/sudheer-case
Cactus: /lfs/h2/oar/esrl/noscrub/samuel.trahan/ming-io-hang

3. Edit the job script

Each machine's test directory contains a job.sh script. Edit it as needed to point to your code.

4. Run the job script.

Send the script to sbatch on Jet or qsub on Cactus. Do not run it on a login node.

Additional context

This problem exists in the version of RRFS planned to go operational.

Output

This stack trace comes from gdb analyzing a running write component MPI rank while it is hanging waiting for an MPI_Allreduce. The arguments in the stack trace may be meaningless because gdb has trouble interpreting Intel-compiled code. However, the line numbers and function calls should be correct. Some may have been optimized out.

stack trace of stuck MPI process
#0  0x00002b6eab22803a in MPIDI_SHMGR_release_generic (opcode=2893772520, mpir_comm=0x7ffca32a54c8, root=27, localbuf=0x1ec, count=-1405432336, datatype=1329139008, errflag=0x7ffca32b2548, knomial_factor=4, 
    algo_type=MPIDI_SHMGR_ALGO_FLAT) at ../../src/mpid/ch4/src/intel/ch4_shm_coll_templates.h:206
#1  0x00002b6eab21bf85 in MPIDI_SHMGR_Release_bcast (comm=0x2b6eac7b76e8 <PVAR_TIMER_idle+8>, buf=0x7ffca32a54c8, count=27, datatype=492, errflag=0x2b6eac3acdf0 <MPIR_THREAD_GLOBAL_ALLFUNC_MUTEX>, algo_type=1329139008, radix=0)
    at ../../src/mpid/ch4/src/intel/ch4_shm_coll.c:2619
#2  0x00002b6eab118690 in MPIDI_Allreduce_intra_composition_zeta (sendbuf=<optimized out>, recvbuf=<optimized out>, count=<optimized out>, datatype=<optimized out>, op=<optimized out>, comm_ptr=<optimized out>, errflag=<optimized out>, 
    ch4_algo_parameters_container=<optimized out>) at ../../src/mpid/ch4/src/intel/ch4_coll_impl.h:1078
#3  MPID_Allreduce_invoke (sendbuf=0xffffffffffffffff, recvbuf=0x7ffca32b2598, count=<optimized out>, datatype=<optimized out>, op=<optimized out>, comm=<optimized out>, errflag=<optimized out>, 
    ch4_algo_parameters_container=<optimized out>) at ../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1831
#4  MPIDI_coll_invoke (coll_sig=0x2b6eac7b76e8 <PVAR_TIMER_idle+8>, container=0x7ffca32a54c8, req=0x1b) at ../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3359
#5  0x00002b6eab0f47ec in MPIDI_coll_select (coll_sig=0x2b6eac7b76e8 <PVAR_TIMER_idle+8>, req=0x7ffca32a54c8) at ../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:130
#6  0x00002b6eab237387 in MPID_Allreduce (sendbuf=<optimized out>, recvbuf=<optimized out>, count=<optimized out>, datatype=<optimized out>, op=<optimized out>, comm=<optimized out>, errflag=<optimized out>)
    at ../../src/mpid/ch4/src/intel/ch4_coll.h:77
#7  MPIR_Allreduce (sendbuf=0x2b6eac7b76e8 <PVAR_TIMER_idle+8>, recvbuf=0x7ffca32a54c8, count=27, datatype=492, op=-1405432336, comm_ptr=0x2b6f4f390d40, errflag=0x7ffca32b2548) at ../../src/mpi/coll/intel/coll_impl.c:265
#8  0x00002b6eab0926e1 in PMPI_Allreduce (sendbuf=0x2b6eac7b76e8 <PVAR_TIMER_idle+8>, recvbuf=0x7ffca32a54c8, count=27, datatype=492, op=-1405432336, comm=1329139008) at ../../src/mpi/coll/allreduce/allreduce.c:417
#9  0x00002b6eabbade7e in PMPI_File_set_size (fh=0x2b6eac7b76e8 <PVAR_TIMER_idle+8>, size=140723045946568) at ../../../../../src/mpi/romio/mpi-io/set_size.c:69
#10 0x00002b6eafcc20e5 in H5FD__mpio_truncate () from /mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/intel/2021.5.0/hdf5-1.14.0-7pcehjm/lib/libhdf5.so.310
#11 0x00002b6eafca8445 in H5FD_truncate () from /mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/intel/2021.5.0/hdf5-1.14.0-7pcehjm/lib/libhdf5.so.310
#12 0x00002b6eafc922ad in H5F__flush () from /mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/intel/2021.5.0/hdf5-1.14.0-7pcehjm/lib/libhdf5.so.310
#13 0x00002b6eafc964ee in H5F_flush_mounts () from /mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/intel/2021.5.0/hdf5-1.14.0-7pcehjm/lib/libhdf5.so.310
#14 0x00002b6eafefb4ad in H5VL__native_file_specific () from /mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/intel/2021.5.0/hdf5-1.14.0-7pcehjm/lib/libhdf5.so.310
#15 0x00002b6eafee6571 in H5VL_file_specific () from /mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/intel/2021.5.0/hdf5-1.14.0-7pcehjm/lib/libhdf5.so.310
#16 0x00002b6eafc7ec02 in H5Fflush () from /mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/intel/2021.5.0/hdf5-1.14.0-7pcehjm/lib/libhdf5.so.310
#17 0x00002b6ea848fe82 in nc4_enddef_netcdf4_file () from /mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/intel/2021.5.0/netcdf-c-4.9.2-lg6bcpf/lib/libnetcdf.so.19
#18 0x00002b6ea848fe09 in NC4__enddef () from /mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/intel/2021.5.0/netcdf-c-4.9.2-lg6bcpf/lib/libnetcdf.so.19
#19 0x00002b6ea8407d50 in nc_enddef () from /mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/intel/2021.5.0/netcdf-c-4.9.2-lg6bcpf/lib/libnetcdf.so.19
#20 0x00002b6ea7f841fb in netcdf::nf90_enddef (ncid=-532069714, h_minfree=-532069809, v_align=<error reading variable: Cannot access memory at address 0x1b>, v_minfree=<error reading variable: Cannot access memory at address 0x1ec>, 
    r_align=1) at ./netcdf_file.F90:82
#21 0x0000000002bcb8d0 in get_dimlen_if_exists (ncid=<optimized out>, dim_name=<optimized out>, grid=..., dim_len=<optimized out>, rc=<optimized out>, .tmp.DIM_NAME.len_V$5138=<optimized out>)
    at /lfs4/BMC/nrtrr/Samuel.Trahan/smoke/community-20240228/FV3/io/module_write_netcdf.F90:483
#22 module_write_netcdf::write_netcdf (wrtfb=..., 
    filename='\000' <repeats 24 times>, '\061\000\000\000\000\000\000\000int\000\270\177\000\000\330\027\272[\270\177\000\000\000\000\000\000\000\000\000\000!\001\000\000\000\000\000\000\060\000\000\000\000\000\000\000\061', '\000' <repe
ats 23 times>, '\340^]\b', '\000' <repeats 20 times>, '\241\000\000\000\000\000\000\000\001\000\000\000\t\000\000\000\300\'+\243\374\177', '\000' <repeats 26 times>, '\001', '\000' <repeats 47 times>..., 
    use_parallel_netcdf=<error reading variable: Cannot access memory at address 0x1b>, mpi_comm=<error reading variable: Cannot access memory at address 0x1ec>, mype=1, grid_id=64, rc=0, .tmp.FILENAME.len_V$3649=10)
    at /lfs4/BMC/nrtrr/Samuel.Trahan/smoke/community-20240228/FV3/io/module_write_netcdf.F90:208
#23 0x00000000027782f6 in rtll (tlmd=<optimized out>, tphd=<optimized out>, almd=<optimized out>, aphd=<optimized out>, tlm0d=<optimized out>, tph0d=<optimized out>)
    at /lfs4/BMC/nrtrr/Samuel.Trahan/smoke/community-20240228/FV3/io/module_wrt_grid_comp.F90:2422
#24 module_wrt_grid_comp::wrt_run (wrt_comp=..., imp_state_write=..., exp_state_write=<error reading variable: Cannot access memory at address 0x1b>, clock=<error reading variable: Cannot access memory at address 0x1ec>, rc=1)
    at /lfs4/BMC/nrtrr/Samuel.Trahan/smoke/community-20240228/FV3/io/module_wrt_grid_comp.F90:2052
#25 0x0000000000acc092 in ESMCI::FTable::callVFuncPtr(char const*, ESMCI::VM*, int*) ()
    at /mnt/lfs4/HFIP/hfv3gfs/role.epic/spack-stack/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-6zav654sh2mjenj4s3h4w433vhg5oqzy/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
@SamuelTrahanNOAA added the bug label Mar 6, 2024
@SamuelTrahanNOAA (Collaborator Author)

I'm pinging @DusanJovic-NOAA and @junwang-noaa hoping they have some guesses.

@DusanJovic-NOAA (Collaborator)

Do we know which MPI rank returns from nf90_enddef routine early?

@SamuelTrahanNOAA (Collaborator Author)

Do we know which MPI rank returns from nf90_enddef routine early?

In my last run, it varied. Some ranks exited and others got stuck; it wasn't only one.

In the collapsed details below, the ranks labeled:

  • ENTER PROBLEMATIC ENDDEF: entered the enddef but never exited
  • EXIT PROBLEMATIC ENDDEF: entered the enddef and exited while other ranks waited forever
Who exited the nf90_enddef?
for n in $( seq 690 779 ) ; do grep -E "^$n:" slurm-879847.out | tail -1; done
690:  ENTER PROBLEMATIC ENDDEF
691:  ENTER PROBLEMATIC ENDDEF
692:  ENTER PROBLEMATIC ENDDEF
693:  ENTER PROBLEMATIC ENDDEF
694:  ENTER PROBLEMATIC ENDDEF
695:  ENTER PROBLEMATIC ENDDEF
696:  EXIT PROBLEMATIC ENDDEF
697:  EXIT PROBLEMATIC ENDDEF
698:  EXIT PROBLEMATIC ENDDEF
699:  EXIT PROBLEMATIC ENDDEF
700:  EXIT PROBLEMATIC ENDDEF
701:  EXIT PROBLEMATIC ENDDEF
702:  EXIT PROBLEMATIC ENDDEF
703:  EXIT PROBLEMATIC ENDDEF
704:  EXIT PROBLEMATIC ENDDEF
705:  EXIT PROBLEMATIC ENDDEF
706:  EXIT PROBLEMATIC ENDDEF
707:  EXIT PROBLEMATIC ENDDEF
708:  EXIT PROBLEMATIC ENDDEF
709:  EXIT PROBLEMATIC ENDDEF
710:  EXIT PROBLEMATIC ENDDEF
711:  EXIT PROBLEMATIC ENDDEF
712:  ENTER PROBLEMATIC ENDDEF
713:  ENTER PROBLEMATIC ENDDEF
714:  ENTER PROBLEMATIC ENDDEF
715:  ENTER PROBLEMATIC ENDDEF
716:  ENTER PROBLEMATIC ENDDEF
717:  ENTER PROBLEMATIC ENDDEF
718:  ENTER PROBLEMATIC ENDDEF
719:  ENTER PROBLEMATIC ENDDEF
720:  EXIT PROBLEMATIC ENDDEF
721:  EXIT PROBLEMATIC ENDDEF
722:  EXIT PROBLEMATIC ENDDEF
723:  EXIT PROBLEMATIC ENDDEF
724:  EXIT PROBLEMATIC ENDDEF
725:  EXIT PROBLEMATIC ENDDEF
726:  EXIT PROBLEMATIC ENDDEF
727:  EXIT PROBLEMATIC ENDDEF
728:  ENTER PROBLEMATIC ENDDEF
729:  ENTER PROBLEMATIC ENDDEF
730:  ENTER PROBLEMATIC ENDDEF
731:  ENTER PROBLEMATIC ENDDEF
732:  ENTER PROBLEMATIC ENDDEF
733:  ENTER PROBLEMATIC ENDDEF
734:  ENTER PROBLEMATIC ENDDEF
735:  ENTER PROBLEMATIC ENDDEF
736:  EXIT PROBLEMATIC ENDDEF
737:  EXIT PROBLEMATIC ENDDEF
738:  EXIT PROBLEMATIC ENDDEF
739:  EXIT PROBLEMATIC ENDDEF
740:  EXIT PROBLEMATIC ENDDEF
741:  EXIT PROBLEMATIC ENDDEF
742:  EXIT PROBLEMATIC ENDDEF
743:  EXIT PROBLEMATIC ENDDEF
744:  EXIT PROBLEMATIC ENDDEF
745:  EXIT PROBLEMATIC ENDDEF
746:  EXIT PROBLEMATIC ENDDEF
747:  EXIT PROBLEMATIC ENDDEF
748:  EXIT PROBLEMATIC ENDDEF
749:  EXIT PROBLEMATIC ENDDEF
750:  EXIT PROBLEMATIC ENDDEF
751:  EXIT PROBLEMATIC ENDDEF
752:  EXIT PROBLEMATIC ENDDEF
753:  EXIT PROBLEMATIC ENDDEF
754:  EXIT PROBLEMATIC ENDDEF
755:  EXIT PROBLEMATIC ENDDEF
756:  EXIT PROBLEMATIC ENDDEF
757:  EXIT PROBLEMATIC ENDDEF
758:  EXIT PROBLEMATIC ENDDEF
759:  ENTER PROBLEMATIC ENDDEF
760:  ENTER PROBLEMATIC ENDDEF
761:  ENTER PROBLEMATIC ENDDEF
762:  ENTER PROBLEMATIC ENDDEF
763:  ENTER PROBLEMATIC ENDDEF
764:  ENTER PROBLEMATIC ENDDEF
765:  EXIT PROBLEMATIC ENDDEF
766:  EXIT PROBLEMATIC ENDDEF
767:  ENTER PROBLEMATIC ENDDEF
768:  EXIT PROBLEMATIC ENDDEF
769:  ENTER PROBLEMATIC ENDDEF
770:  EXIT PROBLEMATIC ENDDEF
771:  ENTER PROBLEMATIC ENDDEF
772:  ENTER PROBLEMATIC ENDDEF
773:  EXIT PROBLEMATIC ENDDEF
774:  EXIT PROBLEMATIC ENDDEF
775:  EXIT PROBLEMATIC ENDDEF
776:  EXIT PROBLEMATIC ENDDEF
777:  EXIT PROBLEMATIC ENDDEF
778:  EXIT PROBLEMATIC ENDDEF
779:  EXIT PROBLEMATIC ENDDEF

@DusanJovic-NOAA (Collaborator)

Ok, thanks. I do not see any pattern in this rank sequence between ranks that got stuck and those that successfully returned from nf90_enddef.

@DusanJovic-NOAA (Collaborator)

In your description I see you mentioned that compression had no effect on how often this happens, but the number of variables written does have an effect. It also seems that in configurations with smaller domain sizes this does not happen, or not as frequently. So maybe it's worth trying different (smaller) chunk sizes.

@SamuelTrahanNOAA (Collaborator Author)

In your description I see you mentioned that compression had no effect on how often this happens, but the number of variables written does have an effect. It also seems that in configurations with smaller domain sizes this does not happen, or not as frequently. So maybe it's worth trying different (smaller) chunk sizes.

I personally haven't run those tests, and I know little about the model_configure options for chunking and compression. Can you suggest combinations of options to try in model_configure?

Here are the relevant lines from my last run. The zstandard_level of 4 was my change; that option is absent in the real-time RRFS parallels (which have the same bug). I added compression to speed up testing.

zstandard_level:         4
ideflate:                0
quantize_mode:           quantize_bitround
quantize_nsd:            0
ichunk2d:                -1
jchunk2d:                -1
ichunk3d:                -1
jchunk3d:                -1
kchunk3d:                -1

@DusanJovic-NOAA (Collaborator)

ichunk2d = -1 (and likewise the other chunk options) means the model will set the chunk size equal to the output grid size in the corresponding direction. Try setting ichunk2d/jchunk2d to half of the output grid size, for example. Similarly for ichunk3d/jchunk3d/kchunk3d; kchunk3d can be, for example, half the number of vertical layers.

To be honest, I do not see how or why this would make any difference to whether nf90_enddef hangs, but who knows.
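
For example, if the output grid were 1800 x 1060 with 65 vertical layers (made-up numbers; substitute half of your actual output grid dimensions and half of your actual number of layers), the model_configure entries would look like:

ichunk2d:                900
jchunk2d:                530
ichunk3d:                900
jchunk3d:                530
kchunk3d:                32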

@DusanJovic-NOAA (Collaborator)

I found that the model always hangs while writing the physics history file(s) (phyf???.nc). These files have about 260 variables. As you suggested, reducing the number of output variables in physics seems to help avoid the hangs in nf90_enddef.

Instead of commenting out some variables in diag_table, I made this change:

diff --git a/io/module_write_netcdf.F90 b/io/module_write_netcdf.F90
index d9d8ff9..3c3f5e0 100644
--- a/io/module_write_netcdf.F90
+++ b/io/module_write_netcdf.F90
@@ -477,6 +477,11 @@ contains
             ncerr = nf90_put_att(ncid, varids(i), 'grid_mapping', 'cubed_sphere'); NC_ERR_STOP(ncerr)
          end if
 
+         if (modulo(i,200) == 0) then
+           ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
+           ncerr = nf90_redef(ncid); NC_ERR_STOP(ncerr)
+         endif
+
        end do   ! i=1,fieldCount
 
        ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)

This change ends define mode after every 200 variables, immediately reenters it, and continues adding the rest of the variables. It seems to work (no hangs) in several test runs I made on wcoss2. There is nothing special about the number 200; I chose it arbitrarily, large enough to avoid ending/reentering define mode for files that have fewer variables.

Can you please try this change with your code/setup on both wcoss2 and jet?

@DusanJovic-NOAA (Collaborator)

And here are the timings of all history/restart writes from one of my test runs on wcoss2:

                                              dynf000.nc write time is   26.45372 at fcst   00:00
                                              phyf000.nc write time is   34.18413 at fcst   00:00
                                           ------- total write time is   60.79570 at Fcst   00:00
                                              dynf001.nc write time is   27.38813 at fcst   01:00
                                              phyf001.nc write time is   36.25545 at fcst   01:00
            RESTART/20240304.160000.fv_core.res.tile1.nc write time is   11.98606 at fcst   01:00
         RESTART/20240304.160000.fv_srf_wnd.res.tile1.nc write time is    1.12703 at fcst   01:00
          RESTART/20240304.160000.fv_tracer.res.tile1.nc write time is   24.19673 at fcst   01:00
                     RESTART/20240304.160000.phy_data.nc write time is   37.15952 at fcst   01:00
                     RESTART/20240304.160000.sfc_data.nc write time is   16.17145 at fcst   01:00
                                           ------- total write time is  154.44860 at Fcst   01:00
                                              dynf002.nc write time is   29.14509 at fcst   02:00
                                              phyf002.nc write time is   36.68917 at fcst   02:00
            RESTART/20240304.170000.fv_core.res.tile1.nc write time is   12.03668 at fcst   02:00
         RESTART/20240304.170000.fv_srf_wnd.res.tile1.nc write time is    1.70183 at fcst   02:00
          RESTART/20240304.170000.fv_tracer.res.tile1.nc write time is   25.06961 at fcst   02:00
                     RESTART/20240304.170000.phy_data.nc write time is   35.79864 at fcst   02:00
                     RESTART/20240304.170000.sfc_data.nc write time is   15.21344 at fcst   02:00
                                           ------- total write time is  155.85170 at Fcst   02:00
                                              dynf003.nc write time is   27.02799 at fcst   03:00
                                              phyf003.nc write time is   36.10061 at fcst   03:00
                                           ------- total write time is   63.29045 at Fcst   03:00
                                              dynf004.nc write time is   26.55296 at fcst   04:00
                                              phyf004.nc write time is   36.55510 at fcst   04:00
                                           ------- total write time is   63.26967 at Fcst   04:00
                                              dynf005.nc write time is   26.85602 at fcst   05:00
                                              phyf005.nc write time is   36.89835 at fcst   05:00
                                           ------- total write time is   63.91559 at Fcst   05:00
                                              dynf006.nc write time is   27.17454 at fcst   06:00
                                              phyf006.nc write time is   38.85850 at fcst   06:00
                                           ------- total write time is   66.19458 at Fcst   06:00
                                              dynf007.nc write time is   26.85234 at fcst   07:00
                                              phyf007.nc write time is   36.73923 at fcst   07:00
                                           ------- total write time is   63.75226 at Fcst   07:00
                                              dynf008.nc write time is   28.33648 at fcst   08:00
                                              phyf008.nc write time is   39.37756 at fcst   08:00
                                           ------- total write time is   68.01713 at Fcst   08:00
                                              dynf009.nc write time is   26.56586 at fcst   09:00
                                              phyf009.nc write time is   37.22793 at fcst   09:00
                                           ------- total write time is   63.95545 at Fcst   09:00
                                              dynf010.nc write time is   27.55396 at fcst   10:00
                                              phyf010.nc write time is   37.40796 at fcst   10:00
                                           ------- total write time is   65.12306 at Fcst   10:00
                                              dynf011.nc write time is   28.12703 at fcst   11:00
                                              phyf011.nc write time is   38.63406 at fcst   11:00
                                           ------- total write time is   66.92263 at Fcst   11:00
                                              dynf012.nc write time is   26.92893 at fcst   12:00
                                              phyf012.nc write time is   35.51953 at fcst   12:00
                                           ------- total write time is   62.60945 at Fcst   12:00
                                              dynf013.nc write time is   27.23213 at fcst   13:00
                                              phyf013.nc write time is   39.34664 at fcst   13:00
                                           ------- total write time is   66.74036 at Fcst   13:00
                                              dynf014.nc write time is   30.29397 at fcst   14:00
                                              phyf014.nc write time is   40.22186 at fcst   14:00
                                           ------- total write time is   70.67712 at Fcst   14:00
                                              dynf015.nc write time is   26.69101 at fcst   15:00
                                              phyf015.nc write time is   36.06051 at fcst   15:00
                                           ------- total write time is   62.91315 at Fcst   15:00
                                              dynf016.nc write time is   27.40320 at fcst   16:00
                                              phyf016.nc write time is   36.25180 at fcst   16:00
                                           ------- total write time is   63.81565 at Fcst   16:00
                                              dynf017.nc write time is   26.70780 at fcst   17:00
                                              phyf017.nc write time is   34.18888 at fcst   17:00
                                           ------- total write time is   61.05879 at Fcst   17:00
                                              dynf018.nc write time is   27.22682 at fcst   18:00
                                              phyf018.nc write time is   35.03558 at fcst   18:00
                                           ------- total write time is   62.42384 at Fcst   18:00

@SamuelTrahanNOAA (Collaborator Author)

This did not fix my test case on Jet. Some of the ranks still froze in the nf90_enddef. They froze in the same enddef as before, not the new one you added.

Which ranks got stuck this time?
for n in $( seq 690 779 ) ; do grep -E "^$n:" slurm-973875.out | tail -1; done
690:  ENTER PROBLEMATIC ENDDEF
691:  ENTER PROBLEMATIC ENDDEF
692:  ENTER PROBLEMATIC ENDDEF
693:  EXIT PROBLEMATIC ENDDEF
694:  ENTER PROBLEMATIC ENDDEF
695:  EXIT PROBLEMATIC ENDDEF
696:  ENTER PROBLEMATIC ENDDEF
697:  ENTER PROBLEMATIC ENDDEF
698:  ENTER PROBLEMATIC ENDDEF
699:  ENTER PROBLEMATIC ENDDEF
700:  ENTER PROBLEMATIC ENDDEF
701:  ENTER PROBLEMATIC ENDDEF
702:  ENTER PROBLEMATIC ENDDEF
703:  ENTER PROBLEMATIC ENDDEF
704:  ENTER PROBLEMATIC ENDDEF
705:  ENTER PROBLEMATIC ENDDEF
706:  ENTER PROBLEMATIC ENDDEF
707:  ENTER PROBLEMATIC ENDDEF
708:  ENTER PROBLEMATIC ENDDEF
709:  ENTER PROBLEMATIC ENDDEF
710:  ENTER PROBLEMATIC ENDDEF
711:  ENTER PROBLEMATIC ENDDEF
712:  ENTER PROBLEMATIC ENDDEF
713:  ENTER PROBLEMATIC ENDDEF
714:  ENTER PROBLEMATIC ENDDEF
715:  ENTER PROBLEMATIC ENDDEF
716:  ENTER PROBLEMATIC ENDDEF
717:  ENTER PROBLEMATIC ENDDEF
718:  ENTER PROBLEMATIC ENDDEF
719:  ENTER PROBLEMATIC ENDDEF
720:  ENTER PROBLEMATIC ENDDEF
721:  ENTER PROBLEMATIC ENDDEF
722:  ENTER PROBLEMATIC ENDDEF
723:  ENTER PROBLEMATIC ENDDEF
724:  ENTER PROBLEMATIC ENDDEF
725:  ENTER PROBLEMATIC ENDDEF
726:  ENTER PROBLEMATIC ENDDEF
727:  ENTER PROBLEMATIC ENDDEF
728:  ENTER PROBLEMATIC ENDDEF
729:  ENTER PROBLEMATIC ENDDEF
730:  ENTER PROBLEMATIC ENDDEF
731:  ENTER PROBLEMATIC ENDDEF
732:  ENTER PROBLEMATIC ENDDEF
733:  ENTER PROBLEMATIC ENDDEF
734:  ENTER PROBLEMATIC ENDDEF
735:  ENTER PROBLEMATIC ENDDEF
736:  ENTER PROBLEMATIC ENDDEF
737:  ENTER PROBLEMATIC ENDDEF
738:  ENTER PROBLEMATIC ENDDEF
739:  ENTER PROBLEMATIC ENDDEF
740:  ENTER PROBLEMATIC ENDDEF
741:  ENTER PROBLEMATIC ENDDEF
742:  ENTER PROBLEMATIC ENDDEF
743:  ENTER PROBLEMATIC ENDDEF
744:  ENTER PROBLEMATIC ENDDEF
745:  ENTER PROBLEMATIC ENDDEF
746:  ENTER PROBLEMATIC ENDDEF
747:  ENTER PROBLEMATIC ENDDEF
748:  ENTER PROBLEMATIC ENDDEF
749:  ENTER PROBLEMATIC ENDDEF
750:  ENTER PROBLEMATIC ENDDEF
751:  ENTER PROBLEMATIC ENDDEF
752:  ENTER PROBLEMATIC ENDDEF
753:  ENTER PROBLEMATIC ENDDEF
754:  ENTER PROBLEMATIC ENDDEF
755:  ENTER PROBLEMATIC ENDDEF
756:  ENTER PROBLEMATIC ENDDEF
757:  ENTER PROBLEMATIC ENDDEF
758:  ENTER PROBLEMATIC ENDDEF
759:  ENTER PROBLEMATIC ENDDEF
760:  ENTER PROBLEMATIC ENDDEF
761:  ENTER PROBLEMATIC ENDDEF
762:  ENTER PROBLEMATIC ENDDEF
763:  ENTER PROBLEMATIC ENDDEF
764:  ENTER PROBLEMATIC ENDDEF
765:  ENTER PROBLEMATIC ENDDEF
766:  ENTER PROBLEMATIC ENDDEF
767:  ENTER PROBLEMATIC ENDDEF
768:  ENTER PROBLEMATIC ENDDEF
769:  ENTER PROBLEMATIC ENDDEF
770:  ENTER PROBLEMATIC ENDDEF
771:  ENTER PROBLEMATIC ENDDEF
772:  ENTER PROBLEMATIC ENDDEF
773:  ENTER PROBLEMATIC ENDDEF
774:  ENTER PROBLEMATIC ENDDEF
775:  ENTER PROBLEMATIC ENDDEF
776:  ENTER PROBLEMATIC ENDDEF
777:  ENTER PROBLEMATIC ENDDEF
778:  ENTER PROBLEMATIC ENDDEF
779:  ENTER PROBLEMATIC ENDDEF

@SamuelTrahanNOAA (Collaborator Author)

I have a test case on Hera now. The issue description has been updated with the path.

Hera: /scratch2/BMC/wrfruc/Samuel.Trahan/rrfs/sudheer-case

@DusanJovic-NOAA (Collaborator)

Thanks. I'm running that test case on Hera right now with this change (the diff is against the current head of the develop branch):

diff --git a/io/module_write_netcdf.F90 b/io/module_write_netcdf.F90
index d9d8ff9..d3a3433 100644
--- a/io/module_write_netcdf.F90
+++ b/io/module_write_netcdf.F90
@@ -341,7 +341,12 @@ contains
           if (lsoil > 1) dimids_soil = [im_dimid,jm_dimid,lsoil_dimid,           time_dimid]
        end if
  
+       ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
+
        do i=1, fieldCount
+
+         ncerr = nf90_redef(ncid); NC_ERR_STOP(ncerr)
+
          call ESMF_FieldGet(fcstField(i), name=fldName, rank=rank, typekind=typekind, rc=rc); ESMF_ERR_RETURN(rc)
  
          par_access = NF90_INDEPENDENT
@@ -477,11 +482,11 @@ contains
             ncerr = nf90_put_att(ncid, varids(i), 'grid_mapping', 'cubed_sphere'); NC_ERR_STOP(ncerr)
          end if
  
+         ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
+
        end do   ! i=1,fieldCount
  
-       ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
     end if
-    ! end of define mode
  
     !
     ! write dimension variables and lon,lat variables

Here, for every variable, we enter and leave define mode. So far, the first four files (phyf000, 001, 002, and 003) have been written without hangs in nf90_enddef.

My run directory is: /scratch1/NCEPDEV/stmp2/Dusan.Jovic/sudheer-case

@DusanJovic-NOAA (Collaborator)

According to the nc_enddef documentation here, specifically:

It's not necessary to call nc_enddef() for netCDF-4 files. With netCDF-4 files, nc_enddef() is called when needed by the netcdf-4 library.

which means we do not need to call nf90_redef/nf90_enddef at all, since the history files are netCDF-4 files created with NF90_NETCDF4 mode. @edwardhartnett, can you confirm this?

I'll try to remove all nf90_redef/nf90_enddef calls and see what happens.

@edwardhartnett (Contributor)

@DusanJovic-NOAA you are correct, a file created with NC_NETCDF4 does not need to call enddef(), but I believe redef() must still be called.

For example, if you define some metadata, and then call nc_put_vara_float() (or some other data-writing function), then netCDF-4 will notice that you have not called nc_enddef(), and will call it for you.

But does that work for nc_redef()? I don't think so.

However, whether called explicitly by the programmer, or internally by the netCDF library, enddef()/redef() is an expensive operation. All buffers are flushed to disk. So try to write all your metadata (including all attributes), then write data. Don't switch back and forth.

In the case of the fragment of the code I see here, it seems like there's a loop:

for some cases
     redef()
     write attribute
     enddef()
     write data
end

What would be better would be two loops, the first to write all the attributes, the second to do all the data writes.

redef()
for some cases
     write attribute
end
enddef()
for some cases
     write data
end
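
In netcdf-fortran terms, that two-phase pattern would look roughly like the sketch below (varids, longNames, and fieldData are placeholders, not the model's actual variables):

! Phase 1: all metadata in a single define-mode pass
ncerr = nf90_redef(ncid)
do i = 1, fieldCount
  ncerr = nf90_put_att(ncid, varids(i), 'long_name', trim(longNames(i)))
end do
ncerr = nf90_enddef(ncid)

! Phase 2: all data writes, with no further define-mode switches
do i = 1, fieldCount
  ncerr = nf90_put_var(ncid, varids(i), fieldData(:,:,i))
end do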

@SamuelTrahanNOAA (Collaborator Author)

All of the variable data is written in a later loop except the dimension variables. Those are written in calls to subroutine add_dim inside the metadata-defining loop. It does have the required call to nf90_redef.

       if (lm > 1) then
         call add_dim(ncid, "pfull", pfull_dimid, wrtgrid, mype, rc)
         call add_dim(ncid, "phalf", phalf_dimid, wrtgrid, mype, rc)
       ... more of the same ...

  subroutine add_dim(ncid, dim_name, dimid, grid, mype, rc)
       ...
       ncerr = nf90_def_var(ncid, dim_name, NF90_REAL8, dimids=[dimid], varid=dim_varid); NC_ERR_STOP(ncerr)
       ...
       ncerr = nf90_enddef(ncid=ncid); NC_ERR_STOP(ncerr)
       ncerr = nf90_put_var(ncid, dim_varid, values=valueListR8); NC_ERR_STOP(ncerr)
       ncerr = nf90_redef(ncid=ncid); NC_ERR_STOP(ncerr)

@DusanJovic-NOAA (Collaborator)

@edwardhartnett Thanks for the confirmation.

@SamuelTrahanNOAA Yes, all variables are written in the second loop over all fields, after all dimensions and attributes are defined and written. The only exceptions are the four 'dimension variables', or coordinates (pfull, phalf, zsoil and time), where we define them, end define mode, write the coordinate values, and reenter define mode. But those are small variables, and since there are just four of them and no other large variables have been written yet, I do not think exiting/reentering define mode costs much, if it has any impact on performance at all.

I'll run the test now with all enddef/redef calls removed to see if that works.

@DusanJovic-NOAA (Collaborator)

Documentation of nc_redef says:

For netCDF-4 files (i.e. files created with NC_NETCDF4 in the cmode in their call to nc_create()), it is not necessary to call nc_redef() unless the file was also created with NC_STRICT_NC3. For straight-up netCDF-4 files, nc_redef() is called automatically, as needed.

@edwardhartnett (Contributor)

OK, so you could take out the redef() and enddef().

Usually when netCDF hangs on a parallel operation it's because a collective operation is done, but not all tasks participated. Are all programs running this metadata code?

@SamuelTrahanNOAA (Collaborator Author)

Are all programs running this metadata code?

A way to test that is to put an MPI_Barrier before each NetCDF call.
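
A sketch of what that could look like (the tag, communicator, and rank arguments are placeholders, not the model's actual variables):

subroutine barrier_check(tag, comm, mype)
  ! Call this immediately before each netCDF call in the write component.
  ! If some rank never reaches the call, the job hangs here instead, and the
  ! last 'entering:' line printed identifies the call where the ranks diverged.
  use mpi
  implicit none
  character(len=*), intent(in) :: tag
  integer, intent(in) :: comm, mype
  integer :: ierr
  if (mype == 0) print *, 'entering: ', trim(tag)
  call mpi_barrier(comm, ierr)
  if (mype == 0) print *, 'all ranks reached: ', trim(tag)
end subroutine barrier_check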

@DusanJovic-NOAA (Collaborator)

Without any explicit calls to nf90_redef/nf90_enddef, the model works fine for about 5 hours but then hangs while writing a physics history file. The last file (forecast hour 6) is only partially written (~30 MB) before the model hangs:

-rw-r--r-- 1 Dusan.Jovic h-nems 1685751925 Mar 11 18:25 phyf000.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1865073247 Mar 11 18:29 phyf001.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1878394918 Mar 11 18:33 phyf002.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1881375125 Mar 11 18:37 phyf003.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1876109574 Mar 11 18:41 phyf004.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1879258803 Mar 11 18:46 phyf005.nc
-rw-r--r-- 1 Dusan.Jovic h-nems   30817232 Mar 11 18:49 phyf006.nc

ncdump -h of phyf006.nc prints all metadata and exits without any error. Also comparing metadata and global attributes with nccmp does not report any difference between 005 and 006 files:

nccmp -mg phyf005.nc phyf006.nc

@SamuelTrahanNOAA (Collaborator Author)

Have we reached the point where we should involve NetCDF and HDF5 developers in this conversation?

@DusanJovic-NOAA (Collaborator)

Let me try your suggestion to insert an MPI_Barrier before each NetCDF call.

@DusanJovic-NOAA (Collaborator)

Now it hangs on the second history file (phyf001.nc):

-rw-r--r-- 1 Dusan.Jovic h-nems 1685751925 Mar 11 19:29 phyf000.nc
-rw-r--r-- 1 Dusan.Jovic h-nems   30817232 Mar 11 19:33 phyf001.nc

Interestingly, the file size is exactly the same (30817232 bytes) as in the previous run, where the model hung at phyf006.nc. It also never hangs while writing dynf???.nc files, always at phyf???.nc.

@SamuelTrahanNOAA (Collaborator Author)

Do you know where it is hanging?

You can find out by sshing to one of the compute nodes running your job. Then start gdb on a running process. It may take a few tries to figure out which ranks are associated with the frozen quilt server.

@SamuelTrahanNOAA (Collaborator Author)

Interestingly the file size is exactly the same (30817232 bytes) as in the previous run where model hangs at phyf006.nc.

I suspect this is the size of the file's metadata.

@DusanJovic-NOAA (Collaborator)

The only thing that seems to help avoid the hangs is reducing the number of fields written out in the history file. At the moment, writing out all the fields specified in diag_table creates 260 variables. What is special about 260? It is just slightly larger than 256. Could it be that 256 is, for whatever reason, some kind of limit?

I'm running now with just 4 fields commented out in diag_table, the last 4, just to see what happens.

# Aerosols emission for smoke
"gfs_sfc",     "emdust",       "emdust",        "fv3_history2d",  "all",  .false.,  "none",  2
"gfs_sfc",     "coef_bb_dc",   "coef_bb_dc",    "fv3_history2d",  "all",  .false.,  "none",  2
"gfs_sfc",     "min_fplume",   "min_fplume",    "fv3_history2d",  "all",  .false.,  "none",  2
"gfs_sfc",     "max_fplume",   "max_fplume",    "fv3_history2d",  "all",  .false.,  "none",  2
"gfs_sfc",     "hwp",          "hwp",           "fv3_history2d",  "all",  .false.,  "none",  2
#"gfs_sfc",     "hwp_ave",      "hwp_ave",       "fv3_history2d",  "all",  .false.,  "none",  2
#"gfs_sfc",     "frp_output",   "frp_output",    "fv3_history2d",  "all",  .false.,  "none",  2
#"gfs_phys",    "ebu_smoke",    "ebu_smoke",     "fv3_history",    "all",  .false.,  "none",  2
#"gfs_phys",    "ext550",       "ext550",        "fv3_history",    "all",  .false.,  "none",  2

This should create a file with 256 variables.

@SamuelTrahanNOAA (Collaborator Author)

Disabling only the last two variables (ebu_smoke and ext550) is enough to get it to run reliably. There are other sets of variables one can remove to achieve the same thing; that's just the one I remember off the top of my head.

@DusanJovic-NOAA (Collaborator)

Ok, so that means there is nothing special about a 256-variable limit, which is good. That should also mean there are no issues in the nf90_* calls, since in that case (two fewer variables) everything works fine.

@SamuelTrahanNOAA (Collaborator Author)

Ok, so that means there is nothing special about a 256-variable limit, which is good. That should also mean there are no issues in the nf90_* calls, since in that case (two fewer variables) everything works fine.

There must be an issue somewhere in there. The model freezes at an MPI_Allreduce deep within the HDF5 library.

@edwardhartnett (Contributor)

The answer to how you find it is to isolate this code into a one-file test, with the minimum code and processors needed to cause the problem. Once you have such a test you will know whether you have found a netCDF bug or not.

@DusanJovic-NOAA (Collaborator)

Ok, can somebody take this program and run it on Hera with the gnu compiler on 2 MPI tasks?

test_netcdf.F90.txt

Remove .txt extension.

@edwardhartnett (Contributor)

How about making this a unit test for fv3atm?

@DusanJovic-NOAA (Collaborator)

How about making this a unit test for fv3atm?

This test program does not use or test any code or functions from fv3atm. Why would it be a unit test for fv3atm?

@SamuelTrahanNOAA (Collaborator Author)

Perhaps one of the existing regression tests will reproduce the problem if we use the proper input.nml and diag_table?

@SamuelTrahanNOAA (Collaborator Author)

I tried modifying hrrr_gf_debug, but it ran to completion. I'll try one of the conus13km cases next.

@edwardhartnett (Contributor)

It's all about saving programmer time by eliminating debugging, which is expensive and unpredictable (for example: this issue).

As a unit test, this code will test the IO stack in a way that is very useful. (Isn't this IO code in fv3atm? That's why I suggested fv3atm as the home for the test code.) When this test passes, you know that your I/O stack is set up correctly and provides everything your code needs. Consider how useful that would be to know - not just on our current machines, with current versions, but on some new machine, with new versions of all dependencies. 15 years from now, this test will still be useful.

In the test file you posted, I see the code is pretty simplified, for example doesn't do that redef/enddef business. The test program should do all the things that write grid component is doing. Ideally your test program will call all the same netCDF calls, in the same order, with the same parameters, as one run of your grid component (when it is failing). (To do this quickly, make your code a unit test first, and then use the CI to help iterate it.)

If the failure is a netCDF bug, the test program will help me find it. (And a simplified version of the test program will also go into netcdf-fortran). If not, the test program will help you understand where the bug in the code is. Either way, the test program goes into the repo to help future NOAA programmers with future IO issues.

All debugging efforts should result in unit tests which make it impossible for the project to debug the same problem again.

@edwardhartnett (Contributor)

@SamuelTrahanNOAA setting up a regression test to catch this is a good idea.

But a unit test is also needed, and needed first.

System tests should never be used for debugging, because they are expensive and don't provide the proper granularity.

For example, if you get a system test to fail with this issue, that does not help determine whether the bug is in the system or in netCDF. Nor will it be any use to say to the team of some third-party package: our system test is failing, we think it's your fault, please debug it for us.

@DusanJovic-NOAA (Collaborator)

In the test file you posted, I see the code is pretty simplified, for example doesn't do that redef/enddef business.

We changed the code so that it does not require redef/enddef, see my comment here:
#2174 (comment)

@DusanJovic-NOAA (Collaborator)

If the failure is a netCDF bug, the test program will help me find it. (And a simplified version of the test program will also go into netcdf-fortran).

Have you tried to run test program I posted here:
#2174 (comment)

@DusanJovic-NOAA (Collaborator)

Again, the test program I posted does not execute any function from fv3atm and does not use any module from fv3atm. It is just a simple one-file test program that calls a sequence of netcdf-fortran subroutines to create a file, define dimensions, add attributes to those dimensions, define 260 variables using those dimensions, end define mode, and close the file. Nothing specific to fv3atm.
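
For reference, the overall shape of that sequence is sketched below (this is NOT the attached test_netcdf.F90; the dimension sizes, variable names, and error handling are made up):

program test_netcdf_sketch
  ! Create a parallel netCDF-4 file, define dimensions and ~260 variables with
  ! attributes, end define mode, and close the file.
  use mpi
  use netcdf
  implicit none
  integer, parameter :: nvars = 260
  integer :: ncid, ncerr, ierr, xdim, ydim, tdim, i
  integer :: varids(nvars)
  character(len=16) :: vname

  call mpi_init(ierr)

  ncerr = nf90_create('test.nc', ior(NF90_CLOBBER, NF90_NETCDF4), ncid, &
                      comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)
  if (ncerr /= NF90_NOERR) call handle_err(ncerr)

  ncerr = nf90_def_dim(ncid, 'grid_xt', 100, xdim);         if (ncerr /= NF90_NOERR) call handle_err(ncerr)
  ncerr = nf90_def_dim(ncid, 'grid_yt', 100, ydim);         if (ncerr /= NF90_NOERR) call handle_err(ncerr)
  ncerr = nf90_def_dim(ncid, 'time', NF90_UNLIMITED, tdim); if (ncerr /= NF90_NOERR) call handle_err(ncerr)

  do i = 1, nvars
    write(vname, '(a,i0)') 'var', i
    ncerr = nf90_def_var(ncid, trim(vname), NF90_REAL, [xdim, ydim, tdim], varids(i))
    if (ncerr /= NF90_NOERR) call handle_err(ncerr)
    ncerr = nf90_put_att(ncid, varids(i), 'long_name', trim(vname))
    if (ncerr /= NF90_NOERR) call handle_err(ncerr)
  end do

  ncerr = nf90_enddef(ncid); if (ncerr /= NF90_NOERR) call handle_err(ncerr)
  ncerr = nf90_close(ncid);  if (ncerr /= NF90_NOERR) call handle_err(ncerr)

  call mpi_finalize(ierr)

contains
  subroutine handle_err(status)
    integer, intent(in) :: status
    print *, trim(nf90_strerror(status))
    call mpi_abort(MPI_COMM_WORLD, 1, ierr)
  end subroutine handle_err
end program test_netcdf_sketch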

@DusanJovic-NOAA (Collaborator)

@SamuelTrahanNOAA Can you please take my test program from above, or just grab this directory on Hera:

/scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/test_netcdf

and try to compile it and run it. I used the following commands:

module purge
module use /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core
module use /scratch1/NCEPDEV/jcsda/jedipara/spack-stack/modulefiles

module load stack-gcc/9.2.0
module load stack-openmpi/4.1.5
module load cmake/3.23.1
module load netcdf-fortran/4.6.0

cd /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/test_netcdf
mkdir build && cd build
cmake ..
make

Then run the ./test_netcdf program on 2 MPI tasks, for example (in an interactive session on a compute node):

$ srun -l -n 2 ./test_netcdf  
1:  file: /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/test_netcdf/test_netcdf.F90 line:           90 NetCDF: Problem with HDF5 dimscales.
1: --------------------------------------------------------------------------
1: MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
1: with errorcode 1.
1:  
1: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
1: You may or may not see output from other processes, depending on
1: exactly when Open MPI kills them.
1: --------------------------------------------------------------------------
0: slurmstepd: error: *** STEP 57180374.0 ON h23c50 CANCELLED AT 2024-03-14T14:25:51 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: h23c50: tasks 0-1: Killed
srun: Terminating StepId=57180374.0

@SamuelTrahanNOAA (Collaborator Author)

I ran it here:

/scratch2/BMC/wrfruc/Samuel.Trahan/rrfs/test-netcdf

There are test.sh and build.sh scripts to automate it.

Job output is here:

/scratch2/BMC/wrfruc/Samuel.Trahan/rrfs/test-netcdf/slurm-57181278.out

It failed with the message you mentioned:

+ srun -l -n 2 ./build/test_netcdf
1:  file: /scratch2/BMC/wrfruc/Samuel.Trahan/rrfs/test-netcdf/test_netcdf.F90 line:           90 NetCDF: Problem with HDF5 dimscales.
1: --------------------------------------------------------------------------
1: MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
1: with errorcode 1.
1: 
1: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
1: You may or may not see output from other processes, depending on
1: exactly when Open MPI kills them.
1: --------------------------------------------------------------------------
0: slurmstepd: error: *** STEP 57181278.0 ON h10c19 CANCELLED AT 2024-03-14T14:57:26 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: h10c19: tasks 0-1: Killed
srun: Terminating StepId=57181278.0

@DusanJovic-NOAA (Collaborator)

Thanks. @SamuelTrahanNOAA

@SamuelTrahanNOAA (Collaborator Author)

I tried the hrrr_gf test with that diag_table and the gnu compiler. It succeeded.

The model may not be outputting all of the variables due to namelist differences. I haven't checked that yet.

@DusanJovic-NOAA (Collaborator)

Test run on wcoss2 (ming_io_hang) finished successfully with the latest updates.

@DusanJovic-NOAA (Collaborator)

I also tested the two classic netcdf formats (CDF-2 and CDF-5); no issues. I updated the write_netcdf routine to enable those two formats, for debugging purposes and as an alternative option. It is currently hard-coded to netcdf4.
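
For reference, the difference is only in the cmode flags passed to nf90_create; a sketch (not the actual write_netcdf code, and filename is a placeholder):

! netCDF-4/HDF5 (the current hard-coded choice)
ncerr = nf90_create(trim(filename), ior(NF90_CLOBBER, NF90_NETCDF4), ncid)
! CDF-2: classic format with 64-bit offsets
ncerr = nf90_create(trim(filename), ior(NF90_CLOBBER, NF90_64BIT_OFFSET), ncid)
! CDF-5: classic format with 64-bit data
ncerr = nf90_create(trim(filename), ior(NF90_CLOBBER, NF90_64BIT_DATA), ncid)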

@edwardhartnett (Contributor)

Where is this test code going to be maintained?

Or is this a one-time effort, and the results to be discarded when you are done?

@DusanJovic-NOAA (Collaborator)

It's just a short test program that reproduces the same error we see in the full program. I personally have no intention of maintaining it after this issue is closed.

@SamuelTrahanNOAA (Collaborator Author)

Generally, a unit test specific to one library goes in that library's own unit test suite. Perhaps after the missing constant is added to the NetCDF Fortran library, this could be a unit test for it in that library?

I'm going to update the ufs-weather-model regression test for RRFS in the near future. I hope it'll reproduce the bug so we can ensure the write component doesn't break in this specific way in the future.

@DusanJovic-NOAA (Collaborator)

Should I open a PR for the changes in write_netcdf? Or do we need to run more tests of the 'sudheer-case' and 'ming-io-hang' cases on Hera/Jet and WCOSS2?

@SamuelTrahanNOAA (Collaborator Author)

We should try these changes in the RRFS parallels for a few days. Can you put it in a branch for them to try? Perhaps a draft pull request? I'd rather they not have on-disk code changes.

@DusanJovic-NOAA (Collaborator)

We should try these changes in the RRFS parallels for a few days. Can you put it in a branch for them to try? Perhaps a draft pull request? I'd rather they not have on-disk code changes.

See PR #2193

@edwardhartnett (Contributor)

Did you find the bug that was causing the hang in your write component?

@DusanJovic-NOAA (Collaborator)

I changed the nf90_create call to pass the NF90_NODIMSCALE_ATTACH flag, and we are currently running RRFS parallels to see if that change solves the issue.
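
Roughly, the change amounts to adding one flag to the create mode. A sketch (the filename and communicator names here are placeholders, and it assumes your netcdf-fortran defines NF90_NODIMSCALE_ATTACH, which was reportedly missing from the library at the time):

ncerr = nf90_create(trim(filename), &
                    ior(NF90_CLOBBER, ior(NF90_NETCDF4, NF90_NODIMSCALE_ATTACH)), &
                    ncid, comm=wrt_mpi_comm, info=MPI_INFO_NULL)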
