Write component hangs in nf90_enddef with planned operational RRFS #2174
Comments
I'm pinging @DusanJovic-NOAA and @junwang-noaa hoping they have some guesses.
Do we know which MPI rank returns from the nf90_enddef routine early?
In my last run, it was different. Some of the ranks exited and others got stuck; it wasn't only one. The collapsed details list the ranks under "Who exited the nf90_enddef?"
Ok, thanks. I do not see any pattern in this rank sequence between ranks that got stuck and those that successfully returned from nf90_enddef.
In your description you mention that compression has no effect on how often this happens, but the number of variables written does. It also seems that this does not happen, or happens less frequently, with smaller domain sizes. So maybe it's worth trying different (smaller) chunk sizes.
I personally haven't run those tests, and I know little about the model_configure options for chunking and compression. Can you suggest combinations of options in model_configure? Here are the relevant lines from my last run.
ichunk2d = -1 (and all other chunk options) means the model will set the chunk size equal to the output grid size in the corresponding direction. Try setting ichunk2d/jchunk2d to half of the output grid size, for example; similarly for ichunk3d/jchunk3d/kchunk3d. kchunk3d can be, for example, half of the number of vertical layers. To be honest, I do not see how or why this would make any difference in why nf90_enddef hangs, but who knows.
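For concreteness, a hypothetical model_configure fragment along those lines might look like this; the assumed grid size (1920 x 1080 with 64 vertical layers) and therefore the chunk values are placeholders for illustration, not the actual RRFS domain:

  ichunk2d:   960
  jchunk2d:   540
  ichunk3d:   960
  jchunk3d:   540
  kchunk3d:   32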
I found that the model always hangs while writing the physics history file(s) (phyf???.nc). These files have about 260 variables. As you suggested, reducing the number of output variables in physics seems to help avoid the hangs in nf90_enddef. Instead of commenting out some variables in diag_table, I made this change:

diff --git a/io/module_write_netcdf.F90 b/io/module_write_netcdf.F90
index d9d8ff9..3c3f5e0 100644
--- a/io/module_write_netcdf.F90
+++ b/io/module_write_netcdf.F90
@@ -477,6 +477,11 @@ contains
         ncerr = nf90_put_att(ncid, varids(i), 'grid_mapping', 'cubed_sphere'); NC_ERR_STOP(ncerr)
       end if

+      if (modulo(i,200) == 0) then
+        ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
+        ncerr = nf90_redef(ncid); NC_ERR_STOP(ncerr)
+      endif
+
     end do ! i=1,fieldCount

     ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)

This change ends define mode after 200 variables, then immediately reenters define mode and continues adding the rest of the variables. It seems to work (no hangs) in several test runs I made on wcoss2. There is nothing special about the number 200; I just chose it randomly, large enough to avoid ending/reentering define mode for files that have fewer variables. Can you please try this change with your code/setup on both wcoss2 and jet.
And here are the timings of all history/restart writes from one of my test runs on wcoss2:
This did not fix my test case on Jet. Some of the ranks still froze in nf90_enddef, in the same enddef as before, not the new one you added. The ranks that got stuck this time are listed in the collapsed details.
I have a test case on Hera now. The PR description has been updated with the path.
Thanks. I'm running that test case on Hera right now with this change (the diff is against the current head of the develop branch):

diff --git a/io/module_write_netcdf.F90 b/io/module_write_netcdf.F90
index d9d8ff9..d3a3433 100644
--- a/io/module_write_netcdf.F90
+++ b/io/module_write_netcdf.F90
@@ -341,7 +341,12 @@ contains
       if (lsoil > 1) dimids_soil = [im_dimid,jm_dimid,lsoil_dimid, time_dimid]
     end if

+    ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
+
     do i=1, fieldCount
+
+      ncerr = nf90_redef(ncid); NC_ERR_STOP(ncerr)
+
       call ESMF_FieldGet(fcstField(i), name=fldName, rank=rank, typekind=typekind, rc=rc); ESMF_ERR_RETURN(rc)

       par_access = NF90_INDEPENDENT
@@ -477,11 +482,11 @@ contains
         ncerr = nf90_put_att(ncid, varids(i), 'grid_mapping', 'cubed_sphere'); NC_ERR_STOP(ncerr)
       end if

+      ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
+
     end do ! i=1,fieldCount

-    ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
   end if

-  ! end of define mode
   !
   ! write dimension variables and lon,lat variables

Here, for every variable, we enter and leave define mode. So far the first 4 files (phyf000, 001, 002 and 003) were written without hangs in nf90_enddef. My run directory is: /scratch1/NCEPDEV/stmp2/Dusan.Jovic/sudheer-case
According to the nc_enddef documentation here, specifically: "It's not necessary to call nc_enddef() for netCDF-4 files. With netCDF-4 files, nc_enddef() is called when needed by the netcdf-4 library." This means we do not need to call nf90_redef/nf90_enddef at all, since the history files are netCDF-4 files created with NF90_NETCDF4 mode. @edwardhartnett can you confirm this? I'll try to remove all nf90_redef/nf90_enddef calls and see what happens.
@DusanJovic-NOAA you are correct, a file created with NC_NETCDF4 does not need to call enddef(), but I believe redef() must still be called. For example, if you define some metadata and then call nc_put_vara_float() (or some other data-writing function), netCDF-4 will notice that you have not called nc_enddef() and will call it for you. But does that work for nc_redef()? I don't think so. However, whether called explicitly by the programmer or internally by the netCDF library, enddef()/redef() is an expensive operation: all buffers are flushed to disk. So try to write all your metadata (including all attributes) first, then write data; don't switch back and forth. In the fragment of code I see here, there is a loop that does both. What would be better is two loops: the first to write all the attributes, the second to do all the data writes.
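As a minimal, self-contained sketch of that two-pass structure (serial, with invented file, dimension and variable names; this is not the fv3atm write component code):

  program two_pass_write
    use netcdf
    implicit none
    integer, parameter :: nvars = 10, nx = 4, ny = 3
    integer :: ncid, x_dimid, y_dimid, varids(nvars), i, ncerr
    real :: field(nx, ny)
    character(len=16) :: vname

    field = 1.0
    ncerr = nf90_create('two_pass.nc', NF90_NETCDF4, ncid)
    if (ncerr /= NF90_NOERR) stop 1
    ncerr = nf90_def_dim(ncid, 'x', nx, x_dimid)
    ncerr = nf90_def_dim(ncid, 'y', ny, y_dimid)

    ! Pass 1: define every variable and all of its attributes while still in define mode.
    do i = 1, nvars
      write(vname, '(a,i0)') 'var', i
      ncerr = nf90_def_var(ncid, trim(vname), NF90_FLOAT, [x_dimid, y_dimid], varids(i))
      ncerr = nf90_put_att(ncid, varids(i), 'long_name', 'example field '//trim(vname))
    end do

    ! Leave define mode exactly once, after all metadata is in place.
    ncerr = nf90_enddef(ncid)

    ! Pass 2: write the data for every variable; no nf90_redef/nf90_enddef in between.
    do i = 1, nvars
      ncerr = nf90_put_var(ncid, varids(i), field)
    end do

    ncerr = nf90_close(ncid)
  end program two_pass_write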
All of the variable data is written in a later loop, except for the dimension variables. Those are written in calls to subroutine add_dim inside the metadata-defining loop, which does have the required call to nf90_redef:

  if (lm > 1) then
    call add_dim(ncid, "pfull", pfull_dimid, wrtgrid, mype, rc)
    call add_dim(ncid, "phalf", phalf_dimid, wrtgrid, mype, rc)
  ... more of the same ...

  subroutine add_dim(ncid, dim_name, dimid, grid, mype, rc)
  ...
    ncerr = nf90_def_var(ncid, dim_name, NF90_REAL8, dimids=[dimid], varid=dim_varid); NC_ERR_STOP(ncerr)
  ...
    ncerr = nf90_enddef(ncid=ncid); NC_ERR_STOP(ncerr)
    ncerr = nf90_put_var(ncid, dim_varid, values=valueListR8); NC_ERR_STOP(ncerr)
    ncerr = nf90_redef(ncid=ncid); NC_ERR_STOP(ncerr)
@edwardhartnett Thanks for the confirmation. @SamuelTrahanNOAA Yes, all variables are written in the second loop over all fields, after all dimensions and attributes are defined and written. The only exceptions are the 4 'dimension variables', or coordinates (pfull, phalf, zsoil and time): we define them, end define mode, write the coordinate values and reenter define mode. But those are small variables, and I do not think it costs a lot to exit/reenter define mode, since there are just 4 of them and no other large variables have been written yet, if that has any performance impact at all. I'll run the test now with all enddef/redef calls removed to see if that works.
Documentation of nc_redef says: "For netCDF-4 files (i.e. files created with NC_NETCDF4 in the cmode in their call to nc_create()), it is not necessary to call nc_redef() unless the file was also created with NC_STRICT_NC3. For straight-up netCDF-4 files, nc_redef() is called automatically, as needed."
OK, so you could take out the redef() and enddef(). Usually when netCDF hangs on a parallel operation it's because a collective operation is done but not all tasks participated. Are all the MPI tasks running this metadata code?
A way to test that is to put an MPI_Barrier before each NetCDF call. |
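For instance, a debugging-only helper along these lines (a sketch; sync_check, wrt_comm and mype are invented names, not code from the model) turns a missing participant into an obvious hang at a labeled barrier instead of deep inside HDF5:

  ! Synchronize and log before each collective netCDF call, so the last label
  ! printed by a stuck rank identifies the call it never reached (or never left).
  subroutine sync_check(label, comm, mype)
    use mpi
    use iso_fortran_env, only: error_unit
    implicit none
    character(len=*), intent(in) :: label
    integer, intent(in) :: comm, mype
    integer :: ierr
    write(error_unit, '(a,i0,2a)') 'rank ', mype, ' at barrier before ', trim(label)
    call MPI_Barrier(comm, ierr)
  end subroutine sync_check

  ! Example use just before a suspect call:
  !   call sync_check('nf90_enddef', wrt_comm, mype)
  !   ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)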
Without any explicit calls to nf90_redef/nf90_enddef, the model works fine for about 5 hours but then hangs while writing a physics history file. The last file (forecast hour 6) is only partially written (~30 MB) before the model hangs:

ncdump -h of phyf006.nc prints all metadata and exits without any error. Also, comparing the metadata and global attributes with nccmp does not report any difference between the 005 and 006 files:
Have we reached the point where we should involve NetCDF and HDF5 developers in this conversation?
Let me try your suggestion to insert an MPI_Barrier before each NetCDF call.
Now it hangs on the second history file (phyf001.nc):
Interestingly, the file size is exactly the same (30817232 bytes) as in the previous run, where the model hung at phyf006.nc. It also never hangs while writing dynf???.nc files, always at phyf???.nc.
Do you know where it is hanging? You can find out by sshing to one of the compute nodes running your job and starting gdb on a running process. It may take a few tries to figure out which ranks are associated with the frozen quilt server.
I suspect this is the size of the file's metadata.
The only thing that seems to help avoid the hangs is reducing the number of fields written out in the history file. At the moment, writing out all the fields specified in diag_table creates 260 variables. What is special about 260? It is just slightly larger than 256. Could it be that 256 is, for whatever reason, some kind of limit? I'm running now with the last 4 fields commented out in diag_table, just to see what happens. That should create a file with 256 variables.
Disabling only the last two variables (ebu_smoke and ext550) is enough to get it to run reliably. There are other sets of variables one can remove to make it run reliably; that's just the one I remember off the top of my head.
Ok, so that means there is nothing special about a 256-variable limit, which is good. That should also mean there are no issues in the nf90_* calls, since in that case (two fewer variables) everything works fine.
There must be an issue somewhere in there. The model freezes at an MPI_Allreduce deep within the HDF5 library.
The answer to how you find it is to isolate this code into a one-file test, with the minimum code and processors needed to cause the problem. Once you have such a test you will know whether you have found a netCDF bug or not.
Ok, can somebody take this program and run it on Hera with the gnu compiler on 2 MPI tasks? Remove the .txt extension.
How about making this a unit test for fv3atm?
This test program does not use or test any code from fv3atm. Why would it be a unit test for fv3atm?
Perhaps one of the existing regression tests will reproduce the problem if we use the proper input.nml and diag_table?
I tried modifying hrrr_gf_debug, but it ran to completion. I'll try one of the conus13km cases next.
It's all about saving programmer time by eliminating debugging, which is expensive and unpredictable (for example: this issue). As a unit test, this code will exercise the IO stack in a way that is very useful. (Isn't this IO code in fv3atm? That's why I suggested fv3atm as the home for the test code.) When this test passes, you know that your I/O stack is set up correctly and provides everything your code needs. Consider how useful that would be to know, not just on our current machines with current versions, but on some new machine with new versions of all dependencies. Fifteen years from now, this test will still be useful.

In the test file you posted, I see the code is pretty simplified; for example, it doesn't do the redef/enddef business. The test program should do all the things the write grid component is doing. Ideally your test program will make all the same netCDF calls, in the same order, with the same parameters, as one run of your grid component (when it is failing). (To do this quickly, make your code a unit test first, and then use the CI to help iterate on it.)

If the failure is a netCDF bug, the test program will help me find it (and a simplified version of the test program will also go into netcdf-fortran). If not, the test program will help you understand where the bug in your code is. Either way, the test program goes into the repo to help future NOAA programmers with future IO issues. All debugging efforts should result in unit tests that make it impossible for the project to have to debug the same problem again.
@SamuelTrahanNOAA setting up a regression test to catch this is a good idea. But a unit test is also needed, and needed first. System tests should never be used for debugging, because they are expensive and don't provide the proper granularity. For example, if you get a system test to fail with this issue, that does not help determine whether the bug is in the system or in netCDF. Nor will it be any use to say to the team of some third-party package: our system test is failing, we think it's your fault, please debug it for us.
We changed the code so that it does not require redef/enddef, see my comment here:
Have you tried to run the test program I posted here:
Again, the test program I posted does not execute any function from fv3atm and does not use any module from fv3atm. It is just a simple one-file test program that calls a sequence of netcdf-fortran subroutines to create a file, define dimensions, add attributes to those dimensions, define 260 variables using those dimensions, end define mode and close the file. Nothing in it is specific to fv3atm.
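For readers without access to the attachment, a rough reconstruction of a reproducer with that same shape might look like the sketch below. It is built only from the description above (the dimension names, sizes and variable names are invented, and it is not the attached program), and it assumes netcdf-fortran and HDF5 built with parallel I/O support:

  program test_netcdf_sketch
    use mpi
    use netcdf
    implicit none
    integer, parameter :: nvars = 260
    integer :: ncid, im_dimid, jm_dimid, time_dimid, varid
    integer :: i, ncerr, ierr, mype
    character(len=16) :: vname

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, mype, ierr)

    ! Parallel netCDF-4 create: every rank participates in the metadata calls.
    ncerr = nf90_create('test.nc', ior(NF90_NETCDF4, NF90_MPIIO), ncid, &
                        comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)
    if (ncerr /= NF90_NOERR) call MPI_Abort(MPI_COMM_WORLD, 1, ierr)

    ncerr = nf90_def_dim(ncid, 'grid_xt', 200, im_dimid)
    ncerr = nf90_def_dim(ncid, 'grid_yt', 200, jm_dimid)
    ncerr = nf90_def_dim(ncid, 'time', NF90_UNLIMITED, time_dimid)

    ! Define a few hundred variables with attributes, mimicking the ~260
    ! fields of a phyf???.nc history file.
    do i = 1, nvars
      write(vname, '(a,i0)') 'var', i
      ncerr = nf90_def_var(ncid, trim(vname), NF90_FLOAT, &
                           [im_dimid, jm_dimid, time_dimid], varid)
      ncerr = nf90_put_att(ncid, varid, 'long_name', trim(vname))
    end do

    ! The single collective enddef where the full model was observed to hang.
    ncerr = nf90_enddef(ncid)
    if (ncerr /= NF90_NOERR) call MPI_Abort(MPI_COMM_WORLD, 1, ierr)

    ncerr = nf90_close(ncid)
    if (mype == 0) print *, 'nf90_enddef and nf90_close completed'
    call MPI_Finalize(ierr)
  end program test_netcdf_sketch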
@SamuelTrahanNOAA Can you please take my test program from above, or just grab this directory on Hera: /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/test_netcdf, and try to compile and run it. I used the following commands:

Then run the ./test_netcdf program on 2 MPI tasks, for example in an interactive session on a compute node.
I ran it here:

The job output is here: /scratch2/BMC/wrfruc/Samuel.Trahan/rrfs/test-netcdf/slurm-57181278.out

It failed with the message you mentioned:
Thanks. @SamuelTrahanNOAA
I tried the hrrr_gf test with that diag_table and the gnu compiler. It succeeded. The model may not be outputting all of the variables due to namelist differences. I haven't checked that yet.
Test run on wcoss2 (ming_io_hang) finished successfully with the latest updates.
I also tested two classic netCDF formats (CDF-2 and CDF-5). No issues. I updated the write_netcdf routine to enable those two formats, for debugging purposes and as an alternative option; currently it is hard-coded to netcdf4.
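In rough terms, the format selection for that kind of option might look like the fragment below (a sketch, not the actual write_netcdf change; nc_fmt, cmode, filename and wrt_comm are invented names). Note that parallel access to the classic formats typically requires a netCDF build with PnetCDF support:

  ! Map a user-facing format option onto an nf90_create mode.
  select case (trim(nc_fmt))
  case ('netcdf4')              ! HDF5-based netCDF-4 (current default)
    cmode = NF90_NETCDF4
  case ('cdf2')                 ! classic 64-bit-offset format (CDF-2)
    cmode = NF90_64BIT_OFFSET
  case ('cdf5')                 ! classic 64-bit-data format (CDF-5)
    cmode = NF90_64BIT_DATA
  case default
    cmode = NF90_NETCDF4
  end select
  ncerr = nf90_create(trim(filename), cmode, ncid, comm=wrt_comm, info=MPI_INFO_NULL); NC_ERR_STOP(ncerr)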
Where is this test code going to be maintained? Or is this a one-time effort, and the results to be discarded when you are done?
It's just a short test program that reproduces the same error we see in the full program. I personally have no intention of maintaining it after this issue is closed.
Generally, a unit test specific to one library goes in that library's own unit test suite. Perhaps after the missing constant is added to the NetCDF Fortran library, this could be a unit test for it in that library? I'm going to update the ufs-weather-model regression test for RRFS in the near future. I hope it'll reproduce the bug so we can ensure the write component doesn't break in this specific way in the future.
Should I open a PR for the changes in write_netcdf? Or do we need to run more tests of the 'sudheer-case' and 'ming-io-hang' cases on Hera/Jet and WCOSS2?
We should try these changes in the RRFS parallels for a few days. Can you put it in a branch for them to try? Perhaps a draft pull request? I'd rather they not have on-disk code changes.
See PR #2193
Did you find the bug that was causing the hang in your write component?
I changed the nf90_create call to pass the NF90_NODIMSCALE_ATTACH flag, and we are currently running RRFS parallels to see if that change solved the issue.
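In rough terms that change amounts to adding the flag to the create mode, as in the sketch below (filename and wrt_comm are placeholders; it also assumes NF90_NODIMSCALE_ATTACH is available in the linked netcdf-fortran, which the earlier comment about a missing constant suggests required an update):

  ! Create the file with dimension-scale attachment disabled, so the
  ! netCDF-4/HDF5 layer does not attach a dimension scale to every variable.
  cmode = ior(NF90_NETCDF4, NF90_NODIMSCALE_ATTACH)
  ncerr = nf90_create(trim(filename), cmode, ncid, comm=wrt_comm, info=MPI_INFO_NULL); NC_ERR_STOP(ncerr)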
Description
The head of develop hangs while writing NetCDF output files in the write component when running the version of RRFS planned for operations. This happens regardless of the compression settings or lack thereof. The behavior is like so:
Commenting out some of the variables in the diag_table will prevent this problem; there isn't one specific set of variables that seems to cause it. Turning off the lake model or smoke model prevents the hang, but note that doing so disables the writing of many variables.
Using one thread (no OpenMP) appears to reduce the frequency of the hangs, while increasing the number of write component ranks by enormous amounts appears to increase it. These conclusions are uncertain, since we haven't run enough tests to get a statistically representative sample.
I have been unable to reproduce the problem when the model is compiled in debug mode.
This problem has been confirmed on Jet, Hera, and WCOSS2, but hasn't been tested on other machines.
From a lot of forum searching, this problem has been identified in the distant past when a model sends different metadata on different ranks: for example, 13 variables on one rank but 14 on the others, or three attributes on one rank and five on the others. I haven't investigated that possibility, but I don't see how it could happen in this code.
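For illustration only, that kind of mismatch would look like the hypothetical fragment below (not taken from the model; the names are invented). Every rank must issue the same definitions for a file opened in parallel, otherwise a later collective operation can wait forever for the ranks that skipped a step:

  ! Hypothetical anti-pattern: rank-dependent metadata on a parallel file.
  if (mype > 0) then
    ! Every rank except 0 defines one extra variable ...
    ncerr = nf90_def_var(ncid, 'extra_field', NF90_FLOAT, [im_dimid, jm_dimid], varid)
  end if
  ! ... so when the metadata is flushed here, the collective operations inside
  ! HDF5 (such as the MPI_Allreduce seen in the stack trace) no longer match
  ! across ranks and the call can hang.
  ncerr = nf90_enddef(ncid)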
To Reproduce:
1. Executables were compiled like so:
2. Copy one of these test directories:
Jet:
/lfs4/BMC/nrtrr/Samuel.Trahan/smoke/sudheer-case
Hera:
/scratch2/BMC/wrfruc/Samuel.Trahan/rrfs/sudheer-case
Cactus:
/lfs/h2/oar/esrl/noscrub/samuel.trahan/ming-io-hang
3. Edit the job script

Each machine's test directory contains a job.sh script. Edit it as needed to point to your code.

4. Run the job script.

Send the script to sbatch on Jet or qsub on Cactus. Do not run it on a login node.

Additional context
This problem exists in the version of RRFS planned to go operational.
Output
This stack trace comes from gdb analyzing a running write component MPI rank while it is hanging waiting for an MPI_Allreduce. The arguments in the stack trace may be meaningless because gdb has trouble interpreting Intel-compiled code. However, the line numbers and function calls should be correct. Some may have been optimized out.
stack trace of stuck MPI process