Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uninitialized local variables in land code #1717

Closed
jqyin opened this issue Aug 15, 2017 · 22 comments
Closed

Uninitialized local variables in land code #1717

jqyin opened this issue Aug 15, 2017 · 22 comments
Assignees
Labels

Comments

@jqyin
Copy link
Contributor

jqyin commented Aug 15, 2017

It have been confirmed that the 2 clm-eca tests DIFFs on Melvin when merging PR #1649 and #1468 were caused by uninitialized local variables in existing land code (most likely in BGC related code). Any BGC-related code changes are likely to give non-BFB results.

@jqyin jqyin added the Land label Aug 15, 2017
@jqyin jqyin self-assigned this Aug 15, 2017
@ndkeen
Copy link
Contributor

ndkeen commented Aug 16, 2017

Yea, I should have noted this sooner, but I also found 2 COMPARE_base_rest fails with:

ERS.f19_g16.I1850CNECACNTBC.cori-knl_gnu.clm-eca

ERS.f19_g16.I1850CNECACTCBC.cori-knl_gnu.clm-eca

I see this on edison and cori-knl (and likely cori-haswell) with GNU.
These tests pass on all machines with Intel (both version 17 and v18)

@jqyin
Copy link
Contributor Author

jqyin commented Aug 16, 2017

@ndkeen Thanks for the information. Did you run with threading on cori-knl? I know it's the default PE layout on edison. If so, It's actually a separate issue. There are race conditions in BGC-related code.

@ndkeen
Copy link
Contributor

ndkeen commented Aug 16, 2017

Yes, these tests used threading (as is the default with acme_developer). I also just ran the test on KNL with GNU using 1 thread and they both passed.

create_test -j2 --machine=cori-knl --compiler=gnu ERS_P256x1.f19_g16.I1850CNECACNTBC.cori-knl_gnu.clm-eca ERS_P256x1.f19_g16.I1850CNECACTCBC.cori-knl_gnu.clm-eca

@ndkeen
Copy link
Contributor

ndkeen commented Aug 16, 2017

I also just ran the 2 tests on cori-haswell with GNU just to verify. They pass. But this machine does not request this test to use threading.

@jqyin
Copy link
Contributor Author

jqyin commented Aug 17, 2017

@ndkeen Thanks for confirming. I'm not aware of any supported test machine other than Melvin that doesn't set uninitialized local variable by default.

@worleyph
Copy link
Contributor

worleyph commented Aug 17, 2017

@rljacob , I'd suggest making this a specific test/requirement - running without default setting of uninitialized variables, or setting them to INF or NaN. @ndkeen , is there an Intel compiler setting that will disable zeroing out uninitialized variables?

@rljacob
Copy link
Member

rljacob commented Aug 17, 2017

@ndkeen which version of gnu?

@ndkeen
Copy link
Contributor

ndkeen commented Aug 17, 2017

The version of GNU on edison/cori is 6.3

This same thing has happened to us on other projects -- Intel compilers happily init vars to zero, while GNU will catch it (ie cause problems). Testing with both (or more) compilers is great, of course. I am unaware of flags that will do what we want, but there are always things we can improve upon.

@rljacob
Copy link
Member

rljacob commented Aug 17, 2017

For some reason, gnu 5.3.0 on melvin is the only compiler doing this. That's suspicious. @jgfouca does melvin have some flags other gnu machines don't?

@rljacob
Copy link
Member

rljacob commented Aug 17, 2017

Oh and I guess gnu 6.3 on edison/cori with threading.

@ndkeen
Copy link
Contributor

ndkeen commented Aug 17, 2017

If it helps, we can easily try the following versions on Cori:
gcc/4.9.3 gcc/5.2.0 gcc/5.3.0 gcc/6.1.0 gcc/6.2.0 gcc/6.3.0(default)

Yes, it looks like this is only something that happens with threading.

@rljacob
Copy link
Member

rljacob commented Aug 17, 2017

@jqyin do the machines where you couldn't reproduce this with gnu use threading?

@ndkeen
Copy link
Contributor

ndkeen commented Aug 17, 2017

I tried building these 2 tests with DEBUG=TRUE on KNL with GNU and they both failed right away here:

001:  Opened existing file /project/projectdirs/acme/inputdata/lnd/clm2/surfdata_map/surfdata_1.9x2.5_simyr1850_c150626.nc           1
001:  ERROR: ERROR: ERROR in /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_cori-update-pelayouts01/components/clm/src/biogeochem/ch4Mod.F90 at line 699

Which is in:
subroutine InitCold
at this assert:
SHR_ASSERT_ALL((ubound(cellorg_col) == (/bounds%endc, nlevsoi/)), errMsg(__FILE__, __LINE__))

May not be related to error above.

casedir:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/mf-upes01b/ERS_D.f19_g16.I1850CNECACTCBC.cori-knl_gnu.clm-eca.20170816_223421_ax8hm0

Which does bring up something. A while ago I suggested I think our tests should be a set independent of the compiler flags. Then we should run all of these tests with DEBUG=TRUE, DEBUG=FALSE, and any other way we can think of.

@worleyph
Copy link
Contributor

According to the ifort man page:

   -zero  Initializes  to  zero all local scalar variables of
      intrinsic type INTEGER, REAL, COMPLEX,  or  LOGICAL
      that  are	 saved	but  not  yet  initialized.   The
      default is -nozero.  Use -save on the command  line
      to  make all local variables specifically marked as
      SAVE.

(so -nozero is the default) but this applies only to saved variables.

@worleyph
Copy link
Contributor

However, ifort does have:

   -check [keyword[,
          keyword...]]

 ...
          check uninit      Enables  run-time  checking  for  uninitialized variables. If a variable is read before it is written, a run-time
                            error routine will be called. Only local scalar variables of intrinsic type INTEGER, REAL, COMPLEX,  and  LOGICAL
                            without the SAVE attribute are checked.

@jqyin
Copy link
Contributor Author

jqyin commented Aug 17, 2017

@rljacob No, all my tests were without threading and DEBUG=FALSE. It's the setting on Melvin and the same setting I used to test on other machines. I haven't seen it fail except on Melvin and confirmed that it passed with adding flag -finit-local-zero.

There are currently 3 different issues: 1) uninitialized local variables 2) data race (PR #1468 fixed a bug but I tested there are still race conditions) 3) DEBUG flag (maybe related to issue #1686 ). All 3 are legacy issues and are in current master.

I agree there should be tests for threading, DEBUG mode, etc in the first place. Currently, I'm working on 1) .

@jqyin
Copy link
Contributor Author

jqyin commented Aug 17, 2017

@worleyph Thanks for the tips about Intel compiler. Gnu has a " -Wuninitialized " flag as well. I haven't found a way to disable the zeroing so I can test it on my local cluster though.

@worleyph
Copy link
Contributor

@jqyin and @ndk, Thought that it would be worthwhile to try to duplicate this error using a different compiler (intel) and on a different platform. Looked at the pgf90 man page and saw nothing useful. Found an old PGgroup userforum page that suggested using valgrind to find uninitialized variables.

@jgfouca
Copy link
Member

jgfouca commented Aug 17, 2017

@rljacob I don't believe melvin is using any special flags. you can see the flags in config_compilers.xml

@rljacob
Copy link
Member

rljacob commented Aug 17, 2017

Does it use threading in the land model?

@singhbalwinder
Copy link
Contributor

I think we should change the default PE layout on Melvin (and all testing machines) so that it uses more than 1 threads by default. As I understand it, Melvin uses just 1 thread by default which defeats the purpose of all tests which vary threads. For example, ERP test initiate two simulations, one with default threads and other with half the number of default threads. If a default simulation is using just one thread, the second simulation will also use just one thread (which essentially converts ERP to ERS test).

@singhbalwinder
Copy link
Contributor

I agree there should be tests for threading, DEBUG mode, etc in the first place. Currently, I'm working on 1) .

That's a great idea. We should revisit land tests to add/modify tests to have all these variations.

@jqyin and @ndk, Thought that it would be worthwhile to try to duplicate this error using a different compiler (intel) and on a different platform. Looked at the pgf90 man page and saw nothing useful. Found an old PGgroup userforum page that suggested using valgrind to find uninitialized variables.

NAG compiler is another option we can use to track it down. NAG is better than any other compiler out there to catch uninitialized variables. Lahey compiler is another good one but it doesn't support advanced Fortran features.
In my experience, Intel sometimes skip some uninitialized variables depending upon the nature of the variable (static vs. automatic etc.).

bishtgautam pushed a commit that referenced this issue Sep 5, 2017
Initialize unset local variables in CNAllocationMod

Fixes #1717

[non-BFB]
bishtgautam pushed a commit that referenced this issue Sep 6, 2017
Initialize unset local variables in CNAllocationMod

Fixes #1717

[non-BFB]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants