Fixes for uninitialized variables (snan testing) #579

apcraig · 2021-03-16T23:40:28Z

PR checklist

This replaces #560.

This fixes several issues found by turning on snan debug tests with the intel/gnu compilers. There are still outstanding issues, so snan is not out by default. This fixes

adds several tests including decomp tests with debug on, padded decomp with 1 active gridcell width in i and j directions, a 1x1 decomp test, and some serial decomp tests.
fixes fill of halo in haloupdate. was not working corrected for padded blocks. required storing some additional information in the halo datatype at initialization.
fixes issues with uvm and tarea, fixed by haloupdate fill.
pulls in fixes from cicecore: fix some use of uninitialized variables #560 (sst initiallization, Lsub initialization)
fixes a bug in the size 1 padded grid cell decomp (ice_blocks)
fixes a spacecurve issue
adds snan debug options for several machines, but many still commented out due to outstanding issues.

eclare108213 · 2021-03-18T00:18:08Z

cicecore/cicedynB/infrastructure/ice_blocks.F90

@@ -252,8 +252,8 @@ subroutine create_blocks(nx_global, ny_global, ew_boundary_type, &
            !*** set last physical point if padded domain

            else if (j_global(j,n) == ny_global .and. &
-                     j > all_blocks(n)%jlo      .and. &
-                     j < all_blocks(n)%jhi) then
+                     j >= all_blocks(n)%jlo      .and. &


If I understand correctly, this change allows jlo=jhi, i.e. 1x1 domains, which weren't allowed before?

Correct. This is where the 1x1 blocks failed. The 1x1 and 2x2 are still producing different results from the other block sizes and decomps, but they are running and part of our test suite now.

Just to elaborate a bit further. Before this fix, jhi was not set properly if the active block size was 1. It remained the default (ny_block - nghost), see line 194. This section of code is where the lo/hi indices are set for padded blocks. It worked fine as long as the block size was at least 2.

eclare108213 · 2021-03-18T00:24:40Z

cicecore/cicedynB/infrastructure/ice_grid.F90

@@ -1684,11 +1682,14 @@ subroutine makemask
         tarean(:,:,iblk) = c0
         tareas(:,:,iblk) = c0

-         do j = 1, ny_block
-         do i = 1, nx_block
+         do j = jlo,jhi


Are the ghost cells for these masks filled later?

They are not, but the halos are never accessed, so there is no error when we initialize with snan. There are probably many arrays that have uninitialized memory, but that's fine. In many ways, that's what we want. We want to only initialize memory that we think is used. That way we can trap code (or learn more about our codes) when an snan actually triggers an error. If we initialized everything explicitly to zero just for the sake of doing so, the snan testing wouldn't tell us anything.

TillRasmussen · 2021-03-18T07:58:38Z

cicecore/cicedynB/infrastructure/ice_grid.F90

-         enddo
-      enddo
-      !$OMP END PARALLEL DO
+      hm (:,:,:) = c0


Is it necessary to initialize to zero in all of the block (also valid for kmt) and angle?
Wouldn't it be more correct to update from ilo to ihi and jlo to jhi and then use icehaloupdate

There are a few different grid initialiizations (popgrid, popgrid_nc, latlon, rectangular, etc), and they don't all initialize the variables in the same way. I believe zero-ing these out here is required for the rectangular grid option that is used for the box setups. It is not needed for the popgrid option that we use for our more standard gx3+ testing. It could be that a better solution is to make all the grid options behave in the same way, but it wasn't clear to me what those requirements should be, so I went this direction.

phil-blain

OK so I took a look at each commit. If I understand correctly, the issues with tarea and uvm highlighted in #560 are fixed, and so #572 would be closed as well. Is that right ?

However, I'm not sure I understand completely how the fix to the fill values in the halo updates routine fix the problem for uvm, since fillValue is not used for uvm... unless there is something I'm missing....

On the Git side of things, I would prefer that we not keep my temporary fixes of initializing tarea and uvm to zero, only to remove them in latter commits. Even if we use a squash merge workflow, the commit messages of all commits on the branch are still kept (if not manuallt edited-out) in the merge commit message, so it can be very confusing to read when you are actually reading the commit log to try to understand the evolution of the codebase...

phil-blain · 2021-03-22T19:38:43Z

cicecore/shared/ice_spacecurve.F90

+   ! tcx, if l is 0, then fact has no factors, just return
+   if (l == 0) return
+


So if I read the diff for this commit correctly, this here is the entirety of the bug fix for spacecurve, right ? I'm not familiar with the algorithm...

phil-blain · 2021-03-22T20:05:23Z

cicecore/cicedynB/infrastructure/ice_grid.F90

-               k = kmt(i,j,iblk)
+               k = min(nint(kmt(i,j,iblk)),nlevel)


Is this also something that was found with signalling NaNs ? not sure what this is changing ...

apcraig · 2021-03-22T22:39:14Z

@phil-blain, thanks for having a look. I was hoping you'd find a few minutes. uvm=0 and tarea=0 are no longer being set. I think the halo fill fixed this. The halo fill operates even if no fill value is passed. It's set internally to zero by default, so it does matter in those cases. So yes, this addresses #560 and #572.

The ice_spacecurve mod does fix the spacecurve error. The spacecurve implementation is not particularly robust in general, so my fix isn't wrong, but may not be how a robust implementation might deal with the issue.

The k = "...nint(kmt)...", is needed in cases where kmt can be greater than nlevel. In that case, you'll end up referencing a element in the bathy arrays that is out of bounds. This whole implementation is a little dicey as nlevel is hardwired to 40 with fixed depths and that may have no relationship to the kmt file used by the CICE/ocean model or the depths defined there. But the idea is that you would normally read the depth file separately if you really had that information and were coupling to an ocean model. That's what we do in RASM when we use the bathymetry/landfast ice. For landfast ice, it also can't really matter what the depth is once it's greater than a few 10s of meters? Anyway, a better fix would be to get rid of this hardwired 40 levels and require that a depth profile be input when landfast ice is on. Also, kmt in CICE is a real, so we need to nint it to use the min function with nlevel.

Some of the errors that this PR fixes appeared when I turned on the signalling nans and then ran the full test suite. There are still several outstanding issues that I couldn't fix quickly, so those are deferred and the snan is NOT on by default on the machines I test on. There is still work to do in this regard. #247 is still open.

Finally, I understand the concerns about the git commit history, but I don't think it's worth making an effort to create a new branch or to try to clean up the history. I can edit some of it when I squash merge. In some ways, the squash merge does what we want and is why I advocate for it's use, it largely eliminates all that potentially chaotic history within the master branch commit series. Although I understand it does appear within an individual commit unless it's reviewed and cleaned up when we squash.

Let me know if there are any other questions. thanks!

TillRasmussen · 2021-03-23T08:36:22Z

I think that you could argue that use_bathymetry should be true as default. This reads in a depth from a bathymetry file. use_bathymetry=false is for a specific ocean setup used by Environment Canada (correct me if I am wrong @phil-blain) . In the case where kmt>nlevel I think that I would prefer an abort message.

eclare108213 · 2021-03-23T14:53:15Z

I agree with @TillRasmussen that an abort is appropriate if kmt > bathymetry file levels, when bathymetry is being read in. My general philosophy is that all namelist parameters relative to a given 'master namelist flag' should be set by default to the values desired when the master flag is turned on, but it sounds like the bathymetry file should be off by default, except for our tests, since coupled models read it in outside of the sea ice component. Is that correct?

apcraig · 2021-03-24T18:16:06Z

I have updated the bathymetry implementation to abort if kmt > nlevel. That seems like a good implementation.

When CICE is compiled with the 'CICE_IN_NEMO' CPP macro, and 'calc_Tsfc' is false, ice_step::coupling_prep calls CICE_RunMod::srcflux_to_ocn to transfer heat fluxes on ice-free grid points to the ocean. The calculation in srcflux_to_ocn uses the Icepack parameter 'Lsub', but without declaring this variable nor querying it from Icepack, which results in a compile-time failure when using the standalone driver with 'CICE_IN_NEMO' defined. Fix that by correctly declaring 'Lsub' and querying it using the Icepack interface.

In subroutine 'init_grid2', the array 'tarea' is initialized in a loop on indices {ilo, ihi} and {jlo, jhi}, but is then used in subroutine 'makemask' in a loop on indices {1, nx_block} and {1, ny_block}, potentially causing an uninitialized value to be used (at line 1693). Initialize the whole array to zero first, to avoid using uninitialized values. The initialization to zero is actually commented out, following 10c446a (Dummy and unused variables. (CICE-Consortium#180), 2018-09-22) [1] , so uncomment it. [1] CICE-Consortium#180

In subroutine 'makemask', the array 'uvm' is initialized in a loop on indices {ilo, ihi} and {jlo, jhi}, but is then used in a loop on indices {1, nx_block} and {1, ny_block}, potentially causing an uninitialized value to be used (at line 1672). Initialize the whole array to zero first, to avoid using uninitialized values. The initialization to zero is actually commented out, following 10c446a (Dummy and unused variables. (CICE-Consortium#180), 2018-09-22) [1], so uncomment it. [1] CICE-Consortium#180

In subroutine 'init_coupler_flux', the initialization of the 'sst' array to 'Tf' is protected by a '#ifndef CICE_IN_NEMO' preprocessor directive. This has been the case since CICE-Consortium/CICE-svn-trunk@151b9af (cice-4.0~17, 2008-04-28), though that commit does not explain why. This is probably because in some conditions (depending on active CPPs at NEMO compilation time, as well as NEMO namelist parameters), the SST needs to be passed early to CICE, before calling CICE_Initialize (see the CICE coupling interface in NEMO [1]), and initializing it to 'Tf' in 'init_coupler_flux' would erase the values already passed from the ocean. If however the SST is *not* passed to CICE before CICE_Initialize is called, and 'CICE_IN_NEMO' is in use, then 'ice_init::set_state_var' reads from the uninitialized 'sst' array when placing ice where the ocean surface is cold. In ECCC's in-house version of the coupling interface (sbc_ice_cice), extensive changes have been made to work around bugs in the original version and limitations due to the sequence in which CICE and NEMO fields are initialized, such that we manually call 'ice_init::init_state' a second time at the end of subroutine sbc_ice_cice::cice_sbc_init. To avoid using a uninitialized array, remove the '#ifdef CICE_IN_NEMO' around the initialization of the 'sst' array to 'Tf', so that it is always initialized. These values will anyway be replaced by the "correct" ones when init_state is called a second time in our in-house version. [1] http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/releases/release-3.6/NEMOGCM/NEMO/OPA_SRC/SBC/sbcice_cice.F90#L176

Picked up by testing with initialized snan and debug options in decomps Update cheyenne debug compiler options to add snan initialization Update test suite to add debug tests for multiple decomps

- Fix bug in ice_boundary fill in halo, was not taking into account padding correctly - Fix bug in ice_blocks for cases where the active size is 1 gridcell wide - Update ice_grid initialization to clean up prior workarounds. - Update some intel compilers to add -init=snan,arrays for debug builds - Add new tests to the decomp test suite including tests with debugging on, tests with the active size 1 gridcell wide, and tests with a single block that's bigger than the whole grid.

Back off on some of the full debug compiler flags.

phil-blain · 2021-03-24T22:26:56Z

The k = "...nint(kmt)...", is needed in cases where kmt can be greater than nlevel. In that case, you'll end up referencing a element in the bathy arrays that is out of bounds. This whole implementation is a little dicey as nlevel is hardwired to 40 with fixed depths and that may have no relationship to the kmt file used by the CICE/ocean model or the depths defined there. But the idea is that you would normally read the depth file separately if you really had that information and were coupling to an ocean model. That's what we do in RASM when we use the bathymetry/landfast ice. For landfast ice, it also can't really matter what the depth is once it's greater than a few 10s of meters? Anyway, a better fix would be to get rid of this hardwired 40 levels and require that a depth profile be input when landfast ice is on. Also, kmt in CICE is a real, so we need to nint it to use the min function with nlevel.

OK, I have read that part of the code more carefully and understand more what it's doing and what you are changing. If I understand correctly, the conversion to integer was already implicit since k is declared as int, but it's more readable if it's explicit. I also agree with Till and Elizabeth that it makes more sense to just abort if kmt > nlevel. While we're there, I think we could change k > puny to k > 0 since k is an integer but puny is a double. It would be less confusing, I think (and it's already written that way in get_bathymetry_popfile). As an aside, the whole last part of these two subroutines is in fact duplicated, so a further clean-up could be to refactor this; but that can come latter.

I think that you could argue that use_bathymetry should be true as default. This reads in a depth from a bathymetry file. use_bathymetry=false is for a specific ocean setup used by Environment Canada (correct me if I am wrong @phil-blain) .

[...] My general philosophy is that all namelist parameters relative to a given 'master namelist flag' should be set by default to the values desired when the master flag is turned on, but it sounds like the bathymetry file should be off by default, except for our tests, since coupled models read it in outside of the sea ice component. Is that correct?

To clarify, in our in-house CICE4 implementation we pass the dynamic water height (bathy + SSH) fro NEMO to CICE, so no we do not read the bathymetry field from CICE. I've not yet gotten around to revisit that in my work on coupling our NEMO with CICE 6, but the same approach will be used. I was planning to do that soon.

With regards to what should be the default for use_bathymetry in our namelist, I think both approaches are valid. I'd say it depends if we want the namelist to have good defaults for coupled setups or standalone setups: for coupled, use_bathymetry = false (as it is now) is OK, because usually the ocean model is in charge of reading the bathymetry, but for standalone runs use_bathymetry = true would make more sense, since we do provide bathymetry files for our 3 grids, and that should give a more detailed field than the harcoded levels...

apcraig · 2021-03-24T22:55:11Z

Thanks @phil-blain, there are a few things about the bathymetry implementation that I haven't understood well. I really don't understand why we have a 40 level default at all. I guess it's for testing. The get_bathymetry_popfile was added by RASM because pop has a well defined bathymetry file different from whatever was implemented before. I thought it was easier to create a new "read" method rather than duplicate the file in another format. Hence, it's largely a duplicate of the default implementation.

k = kmt() is not the same as k = nint(kmt()). The former is doing an "int(kmt())". We really do want an explicit nint in this case, although in practice, they seem to be producing the same result for our test cases (which may not have any landfast ice unfortunately).

I will modify the k>puny vs k>0. I agree that's a more robust implementation. kmt is normally thought of as an integer but it's a real in CICE. I think that's where the k/kmt type problems are showing up in the bathymetry, confusion about the types. I suspect we have (kmt()>puny) elsewhere, but should k=nint(kmt()), k>0 when converted to integer as opposed to k=kmt(), k>puny which is what seems to have been done.

I also agree we may want to rethink use of the 40 level hardcoded default, but as a separate PR.

apcraig · 2021-03-24T22:57:16Z

I have rerun the full test suite on cheyenne after rebasing the code to the current master. Results are as expected and here, https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#52e8102274c8a948f43e3e4e6b0428250706df58

apcraig · 2021-03-25T00:10:31Z

I updated the k>puny line and ran a quick suite to make sure it didn't create any problems. all seems fine with that change.

apcraig requested review from eclare108213, TillRasmussen and phil-blain March 16, 2021 23:40

apcraig mentioned this pull request Mar 16, 2021

cicecore: fix some use of uninitialized variables #560

Closed

16 tasks

apcraig linked an issue Mar 16, 2021 that may be closed by this pull request

Halo Update Fill Implemention Bug with Padded Blocks #572

Closed

apcraig added Build Software Engineering Testing Type: Bug labels Mar 17, 2021

eclare108213 reviewed Mar 18, 2021

View reviewed changes

eclare108213 approved these changes Mar 18, 2021

View reviewed changes

TillRasmussen reviewed Mar 18, 2021

View reviewed changes

phil-blain reviewed Mar 22, 2021

View reviewed changes

phil-blain and others added 10 commits March 24, 2021 12:18

Fix uninitialized values and an array bounds issue in spacecurve

8478874

Picked up by testing with initialized snan and debug options in decomps Update cheyenne debug compiler options to add snan initialization Update test suite to add debug tests for multiple decomps

Update some uninitialized variables.

68db317

Back off on some of the full debug compiler flags.

revert cheyenne_gnu compiler flags

75bde7c

add 1x1, 2x2, 3x3 block tests

c6922c7

add abort to default bathymetry if kmt gt nlevel

52e8102

apcraig force-pushed the initialize-before-use branch from c8fb70c to 52e8102 Compare March 24, 2021 18:25

update bathymetry k/kmt calculation

17c6183

phil-blain approved these changes Mar 25, 2021

View reviewed changes

apcraig merged commit 6b399d1 into CICE-Consortium:master Mar 25, 2021

phil-blain mentioned this pull request Apr 22, 2021

More use of uninitialized arrays / variables (running 'base_suite' with -init=snan,arrays) #599

Open

apcraig deleted the initialize-before-use branch August 17, 2022 20:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes for uninitialized variables (snan testing) #579

Fixes for uninitialized variables (snan testing) #579

apcraig commented Mar 16, 2021

eclare108213 Mar 18, 2021

apcraig Mar 18, 2021

apcraig Mar 18, 2021

eclare108213 Mar 18, 2021

apcraig Mar 18, 2021

TillRasmussen Mar 18, 2021 •

edited

Loading

apcraig Mar 18, 2021

phil-blain left a comment

phil-blain Mar 22, 2021

phil-blain Mar 22, 2021

apcraig commented Mar 22, 2021

TillRasmussen commented Mar 23, 2021 •

edited

Loading

eclare108213 commented Mar 23, 2021

apcraig commented Mar 24, 2021

phil-blain commented Mar 24, 2021

apcraig commented Mar 24, 2021

apcraig commented Mar 24, 2021

apcraig commented Mar 25, 2021

		! tcx, if l is 0, then fact has no factors, just return
		if (l == 0) return

Fixes for uninitialized variables (snan testing) #579

Fixes for uninitialized variables (snan testing) #579

Conversation

apcraig commented Mar 16, 2021

PR checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TillRasmussen Mar 18, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

phil-blain left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

apcraig commented Mar 22, 2021

TillRasmussen commented Mar 23, 2021 • edited Loading

eclare108213 commented Mar 23, 2021

apcraig commented Mar 24, 2021

phil-blain commented Mar 24, 2021

apcraig commented Mar 24, 2021

apcraig commented Mar 24, 2021

apcraig commented Mar 25, 2021

TillRasmussen Mar 18, 2021 •

edited

Loading

TillRasmussen commented Mar 23, 2021 •

edited

Loading