-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes for uninitialized variables (snan testing) #579
Fixes for uninitialized variables (snan testing) #579
Conversation
@@ -252,8 +252,8 @@ subroutine create_blocks(nx_global, ny_global, ew_boundary_type, & | |||
!*** set last physical point if padded domain | |||
|
|||
else if (j_global(j,n) == ny_global .and. & | |||
j > all_blocks(n)%jlo .and. & | |||
j < all_blocks(n)%jhi) then | |||
j >= all_blocks(n)%jlo .and. & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, this change allows jlo=jhi, i.e. 1x1 domains, which weren't allowed before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. This is where the 1x1 blocks failed. The 1x1 and 2x2 are still producing different results from the other block sizes and decomps, but they are running and part of our test suite now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to elaborate a bit further. Before this fix, jhi was not set properly if the active block size was 1. It remained the default (ny_block - nghost), see line 194. This section of code is where the lo/hi indices are set for padded blocks. It worked fine as long as the block size was at least 2.
@@ -1684,11 +1682,14 @@ subroutine makemask | |||
tarean(:,:,iblk) = c0 | |||
tareas(:,:,iblk) = c0 | |||
|
|||
do j = 1, ny_block | |||
do i = 1, nx_block | |||
do j = jlo,jhi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the ghost cells for these masks filled later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are not, but the halos are never accessed, so there is no error when we initialize with snan. There are probably many arrays that have uninitialized memory, but that's fine. In many ways, that's what we want. We want to only initialize memory that we think is used. That way we can trap code (or learn more about our codes) when an snan actually triggers an error. If we initialized everything explicitly to zero just for the sake of doing so, the snan testing wouldn't tell us anything.
enddo | ||
enddo | ||
!$OMP END PARALLEL DO | ||
hm (:,:,:) = c0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to initialize to zero in all of the block (also valid for kmt) and angle?
Wouldn't it be more correct to update from ilo to ihi and jlo to jhi and then use icehaloupdate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few different grid initialiizations (popgrid, popgrid_nc, latlon, rectangular, etc), and they don't all initialize the variables in the same way. I believe zero-ing these out here is required for the rectangular grid option that is used for the box setups. It is not needed for the popgrid option that we use for our more standard gx3+ testing. It could be that a better solution is to make all the grid options behave in the same way, but it wasn't clear to me what those requirements should be, so I went this direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK so I took a look at each commit. If I understand correctly, the issues with tarea
and uvm
highlighted in #560 are fixed, and so #572 would be closed as well. Is that right ?
However, I'm not sure I understand completely how the fix to the fill
values in the halo updates routine fix the problem for uvm, since fillValue
is not used for uvm... unless there is something I'm missing....
On the Git side of things, I would prefer that we not keep my temporary fixes of initializing tarea and uvm to zero, only to remove them in latter commits. Even if we use a squash merge workflow, the commit messages of all commits on the branch are still kept (if not manuallt edited-out) in the merge commit message, so it can be very confusing to read when you are actually reading the commit log to try to understand the evolution of the codebase...
! tcx, if l is 0, then fact has no factors, just return | ||
if (l == 0) return | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I read the diff for this commit correctly, this here is the entirety of the bug fix for spacecurve, right ? I'm not familiar with the algorithm...
k = kmt(i,j,iblk) | ||
k = min(nint(kmt(i,j,iblk)),nlevel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this also something that was found with signalling NaNs ? not sure what this is changing ...
@phil-blain, thanks for having a look. I was hoping you'd find a few minutes. uvm=0 and tarea=0 are no longer being set. I think the halo fill fixed this. The halo fill operates even if no fill value is passed. It's set internally to zero by default, so it does matter in those cases. So yes, this addresses #560 and #572. The ice_spacecurve mod does fix the spacecurve error. The spacecurve implementation is not particularly robust in general, so my fix isn't wrong, but may not be how a robust implementation might deal with the issue. The k = "...nint(kmt)...", is needed in cases where kmt can be greater than nlevel. In that case, you'll end up referencing a element in the bathy arrays that is out of bounds. This whole implementation is a little dicey as nlevel is hardwired to 40 with fixed depths and that may have no relationship to the kmt file used by the CICE/ocean model or the depths defined there. But the idea is that you would normally read the depth file separately if you really had that information and were coupling to an ocean model. That's what we do in RASM when we use the bathymetry/landfast ice. For landfast ice, it also can't really matter what the depth is once it's greater than a few 10s of meters? Anyway, a better fix would be to get rid of this hardwired 40 levels and require that a depth profile be input when landfast ice is on. Also, kmt in CICE is a real, so we need to nint it to use the min function with nlevel. Some of the errors that this PR fixes appeared when I turned on the signalling nans and then ran the full test suite. There are still several outstanding issues that I couldn't fix quickly, so those are deferred and the snan is NOT on by default on the machines I test on. There is still work to do in this regard. #247 is still open. Finally, I understand the concerns about the git commit history, but I don't think it's worth making an effort to create a new branch or to try to clean up the history. I can edit some of it when I squash merge. In some ways, the squash merge does what we want and is why I advocate for it's use, it largely eliminates all that potentially chaotic history within the master branch commit series. Although I understand it does appear within an individual commit unless it's reviewed and cleaned up when we squash. Let me know if there are any other questions. thanks! |
I think that you could argue that use_bathymetry should be true as default. This reads in a depth from a bathymetry file. use_bathymetry=false is for a specific ocean setup used by Environment Canada (correct me if I am wrong @phil-blain) . In the case where kmt>nlevel I think that I would prefer an abort message. |
I agree with @TillRasmussen that an abort is appropriate if kmt > bathymetry file levels, when bathymetry is being read in. My general philosophy is that all namelist parameters relative to a given 'master namelist flag' should be set by default to the values desired when the master flag is turned on, but it sounds like the bathymetry file should be off by default, except for our tests, since coupled models read it in outside of the sea ice component. Is that correct? |
I have updated the bathymetry implementation to abort if kmt > nlevel. That seems like a good implementation. |
When CICE is compiled with the 'CICE_IN_NEMO' CPP macro, and 'calc_Tsfc' is false, ice_step::coupling_prep calls CICE_RunMod::srcflux_to_ocn to transfer heat fluxes on ice-free grid points to the ocean. The calculation in srcflux_to_ocn uses the Icepack parameter 'Lsub', but without declaring this variable nor querying it from Icepack, which results in a compile-time failure when using the standalone driver with 'CICE_IN_NEMO' defined. Fix that by correctly declaring 'Lsub' and querying it using the Icepack interface.
In subroutine 'init_grid2', the array 'tarea' is initialized in a loop on indices {ilo, ihi} and {jlo, jhi}, but is then used in subroutine 'makemask' in a loop on indices {1, nx_block} and {1, ny_block}, potentially causing an uninitialized value to be used (at line 1693). Initialize the whole array to zero first, to avoid using uninitialized values. The initialization to zero is actually commented out, following 10c446a (Dummy and unused variables. (CICE-Consortium#180), 2018-09-22) [1] , so uncomment it. [1] CICE-Consortium#180
In subroutine 'makemask', the array 'uvm' is initialized in a loop on indices {ilo, ihi} and {jlo, jhi}, but is then used in a loop on indices {1, nx_block} and {1, ny_block}, potentially causing an uninitialized value to be used (at line 1672). Initialize the whole array to zero first, to avoid using uninitialized values. The initialization to zero is actually commented out, following 10c446a (Dummy and unused variables. (CICE-Consortium#180), 2018-09-22) [1], so uncomment it. [1] CICE-Consortium#180
In subroutine 'init_coupler_flux', the initialization of the 'sst' array to 'Tf' is protected by a '#ifndef CICE_IN_NEMO' preprocessor directive. This has been the case since CICE-Consortium/CICE-svn-trunk@151b9af (cice-4.0~17, 2008-04-28), though that commit does not explain why. This is probably because in some conditions (depending on active CPPs at NEMO compilation time, as well as NEMO namelist parameters), the SST needs to be passed early to CICE, before calling CICE_Initialize (see the CICE coupling interface in NEMO [1]), and initializing it to 'Tf' in 'init_coupler_flux' would erase the values already passed from the ocean. If however the SST is *not* passed to CICE before CICE_Initialize is called, and 'CICE_IN_NEMO' is in use, then 'ice_init::set_state_var' reads from the uninitialized 'sst' array when placing ice where the ocean surface is cold. In ECCC's in-house version of the coupling interface (sbc_ice_cice), extensive changes have been made to work around bugs in the original version and limitations due to the sequence in which CICE and NEMO fields are initialized, such that we manually call 'ice_init::init_state' a second time at the end of subroutine sbc_ice_cice::cice_sbc_init. To avoid using a uninitialized array, remove the '#ifdef CICE_IN_NEMO' around the initialization of the 'sst' array to 'Tf', so that it is always initialized. These values will anyway be replaced by the "correct" ones when init_state is called a second time in our in-house version. [1] http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/releases/release-3.6/NEMOGCM/NEMO/OPA_SRC/SBC/sbcice_cice.F90#L176
Picked up by testing with initialized snan and debug options in decomps Update cheyenne debug compiler options to add snan initialization Update test suite to add debug tests for multiple decomps
- Fix bug in ice_boundary fill in halo, was not taking into account padding correctly - Fix bug in ice_blocks for cases where the active size is 1 gridcell wide - Update ice_grid initialization to clean up prior workarounds. - Update some intel compilers to add -init=snan,arrays for debug builds - Add new tests to the decomp test suite including tests with debugging on, tests with the active size 1 gridcell wide, and tests with a single block that's bigger than the whole grid.
Back off on some of the full debug compiler flags.
c8fb70c
to
52e8102
Compare
OK, I have read that part of the code more carefully and understand more what it's doing and what you are changing. If I understand correctly, the conversion to integer was already implicit since
To clarify, in our in-house CICE4 implementation we pass the dynamic water height (bathy + SSH) fro NEMO to CICE, so no we do not read the bathymetry field from CICE. I've not yet gotten around to revisit that in my work on coupling our NEMO with CICE 6, but the same approach will be used. I was planning to do that soon. With regards to what should be the default for |
Thanks @phil-blain, there are a few things about the bathymetry implementation that I haven't understood well. I really don't understand why we have a 40 level default at all. I guess it's for testing. The get_bathymetry_popfile was added by RASM because pop has a well defined bathymetry file different from whatever was implemented before. I thought it was easier to create a new "read" method rather than duplicate the file in another format. Hence, it's largely a duplicate of the default implementation. k = kmt() is not the same as k = nint(kmt()). The former is doing an "int(kmt())". We really do want an explicit nint in this case, although in practice, they seem to be producing the same result for our test cases (which may not have any landfast ice unfortunately). I will modify the k>puny vs k>0. I agree that's a more robust implementation. kmt is normally thought of as an integer but it's a real in CICE. I think that's where the k/kmt type problems are showing up in the bathymetry, confusion about the types. I suspect we have (kmt()>puny) elsewhere, but should k=nint(kmt()), k>0 when converted to integer as opposed to k=kmt(), k>puny which is what seems to have been done. I also agree we may want to rethink use of the 40 level hardcoded default, but as a separate PR. |
I have rerun the full test suite on cheyenne after rebasing the code to the current master. Results are as expected and here, https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#52e8102274c8a948f43e3e4e6b0428250706df58 |
I updated the k>puny line and ran a quick suite to make sure it didn't create any problems. all seems fine with that change. |
PR checklist
This fixes a handful of issues trapped by snan debug testing. See Halo Update Fill Implemention Bug with Padded Blocks #572, cicecore: fix some use of uninitialized variables #560
phil-blain, apcraig
bit-for-bit full test suite on cheyenne 3 compilers, https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#eec256efb85e5206cd6e7745188e2c44a093c028
This replaces #560.
See also #572, #247
This fixes several issues found by turning on snan debug tests with the intel/gnu compilers. There are still outstanding issues, so snan is not out by default. This fixes