Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Njeffery debug zbgc3 d #207

Merged
merged 17 commits into from
Oct 26, 2018
Merged

Conversation

njeffery
Copy link
Contributor

@eclare108213 : zbgc now runs and passes the gx1 test. I needed to increase the queue time to 2 hours.

Corrected blocks in ice_forcing_bgc.F90
Moved bgc initializations in CICE_InitMod.
Uncommented some parameter definitions related to bgc in init_zbgc.
Completed test successfully with flags -g gx1 -m wolf -e intel -s bgcgx1 -p 32x1

Adds missing namelist fields.
Also added gx1 tests for zbgc (bgc,bgcgx1) and skl (bgcskl,sklgx1)
Required nutrient forcing filenames are hard coded (nitrate_climatologyWOA_gx1v6f.nc and silicate_climatologyWOA_gx1v6f.nc)
Namelist field bgc_data_dir needs to be properly specified.
Note  *gx1 zbgc test is really slow and fails on wolf*
Corrected blocks in ice_forcing_bgc.F90
Moved bgc initializations in CICE_InitMod.
Uncommented some necessary parameter definitions related to bgc in init_zbgc.
@apcraig
Copy link
Contributor

apcraig commented Oct 13, 2018

travis is failing because ftp is down for semi-annual maintenance. I will trigger the travis build again once the ftp site is back up.

Copy link
Contributor

@apcraig apcraig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple quick comments. It looks like there are some indentation issues. Can you check that you are not using tabs. Second, it looks like this needs to updated to master. Last week we merged the tracer cpps changes onto master and those are now namelist and not cpps, both cice.settings and ice_in have changed and that needs to be reflected in the set_env and set_nml files. Sorry about that. If you need any help, let me know.

@eclare108213
Copy link
Contributor

@njeffery: nslyr=5 for the bgcISPOL and bgcNICE cases. Is that correct?

@eclare108213
Copy link
Contributor

I've updated the environment and namelist files for dynamic allocation changes in the code, and smoke-tested bgcgx1 and sklgx1 -- they pass, but I have not looked at actual results yet.
A number of namelist flags are redundant for the bgc tracers, e.g. I don't think we don't need both trzaero (on/off) and n_zaero (number of aerosols, 0=none could also mean off). I'll run a base_suite and continue working to simplify/clean up the code. @njeffery please do longer gx1 runs for both zbgc and skl and check results, as soon as you have a chance.

@njeffery
Copy link
Contributor Author

njeffery commented Oct 17, 2018 via email

@eclare108213
Copy link
Contributor

That's what I thought. I'll fix this.

@eclare108213
Copy link
Contributor

base_suite results are here:
https://github.com/CICE-Consortium/Test-Results/wiki/abd1d6613c.pinto.intel.181017.180843
The 2 pending tests are because we don't have 40 pes available to run. All of the other tests passed.

@apcraig apcraig requested a review from eclare108213 October 18, 2018 00:16
@apcraig
Copy link
Contributor

apcraig commented Oct 18, 2018

Should I run a full test suite on conrad with 4 compilers? Is now a good time or is there still work to do? Just let me know if that is desired and when.

@eclare108213
Copy link
Contributor

This branch seems to be working, from a testing standpoint. I have not checked the actual test output to see if the tests are doing reasonable things, only that they pass the new bgc tests now available in base_suite.
@njeffery please run these yourself and look to see if the tests are reasonable. I suspect that the ISPOL and NICE tests are not running in the correct time period and I do not know what data those are using -- we are mainly interested in exercising the code here, but let's not do something that gives our users fits.
@apcraig I changed the way RUNLENGTH is set in the scripts. It was very unintuitive to me, and it wasn't documented anywhere that I could find. Now its value essentially corresponds to the number of wallclock hours requested.
@duvivier I added documentation for RUNLENGTH and also the available grids. Please make sure that I didn't screw it up!

@eclare108213
Copy link
Contributor

We might not want to keep all of these new bgc tests, especially since a few of them take a while to run.

@eclare108213
Copy link
Contributor

@apcraig I think it would be helpful to run the tests on conrad now, since this part of the code hasn't really been subject to that level of testing yet.
I still want (someone, maybe me) to clean up the bgc namelist flags, but that can be a separate PR.

@apcraig
Copy link
Contributor

apcraig commented Oct 18, 2018

I tested on conrad and 4 compilers and everything is fine EXCEPT the new bgc cases. sklgx1 and bgcgx1 both fail. The main reason is the input data. The filenames are hardwired in the code and are not consistent with the data in the consortium,

../../cicecore/cicedynB/general/ice_forcing_bgc.F90: nit_file = trim(bgc_data_dir)//'nitrate_climatologyWOA_gx1v6f.nc'
../../cicecore/cicedynB/general/ice_forcing_bgc.F90: sil_file = trim(bgc_data_dir)//'silicate_climatologyWOA_gx1v6f.nc'

while the data files in the consortium tar file have _20150107 in the filename. These filenames have to be moved to namelist! Second, the bgc_data_dir is also hardwired to a path on some machine. This has to be generalized. I think the right change is to add

bgc_data_dir = 'ICE_MACHINE_INPUTDATA/CICE_data/forcing/gx1/WOA/MONTHLY/'

in set_nml.gx1 and to remove the bgc_data_dir setting in set_nml.sklgx1 and set_nml.bgcgx1.

Finally, I do not believe the sklgx1 and bgcgx1 need to be tied to the gx1 resolution. We should be able to move the data_dir and data filenames to set_nml.gx1 and the for the skl and bgc to NOT be grid dependent. I also prefer if the set_nml files be prepended with bgc. So something like

set_nml.bgcskl
set_nml.bgczbgc

or something like it would be good.

I can test again or even help with the implementation of these changes if you like. Just let me know.

@njeffery
Copy link
Contributor Author

njeffery commented Oct 18, 2018 via email

@eclare108213
Copy link
Contributor

I changed the options files for bgc. ISPOL and NICE are appropriate for testing Icepack but not CICE, so those options files have been removed. There is still an 'ISPOL' forcing option in the code but the Consortium does not have the necessary data files. I created bgcsklclim and bgczclim option files for use with the gx1 grid (the 'clim' option is only available for gx1) and set the bgc data_types to 'default' in the bgcskl and bgcz option files, for use with either gx3 or gx1.

The new tests in base_suite all pass for me, except the one which is commented out -- it does a comparison that is not BFB. Comparing the diagnostic output is interesting, maybe this provides some clues for what's going on with the roundrobin cases? or maybe not. I don't understand why the distribution type is different! Of course, I'm probably doing something stupid, assigning 4x1 and 'thread'......

 smoke          gx3     4x1        bgcz,thread        smoke_gx3_8x2_bgcz



 <   Processors:  total    =      4
 ---
 >   Processors:  total    =      8
 181c185
 <   Distribution type:            cartesian
 ---
 >   Distribution type:           roundrobin
 189c193
 <   max_blocks =                 4
 ---
 >   max_blocks =                10
 194,199c198,202
 <  ice_domain work_unit, max_work_unit =         3170          10
 <  ice_domain nocn =         4491       31691      261744
 <  ice_domain work_per_block =            3          11         106
 < (proc_decomposition)  Processors (X x Y) =    2 x    2
 <  ice: total number of blocks is          16
 <   Block size:  nx_block =     27
 ---
 >  ice_domain work_unit, max_work_unit =          739          10
 >  ice_domain nocn =            0        7385      261744
 >  ice_domain work_per_block =            0          11         475
 >  ice: total number of blocks is          79
 >   Block size:  nx_block =      7
 208,209d210
 <  min/max ULON:  0.000000000000000E+000   360.000005165255     
 <  min/max ULAT:  -76.1620403824018        89.9000004922619     

Note that I attempted to fix the problem with the gx1 ocean forcing file in this PR but didn't test it.

" ", "0", "15 minutes (default)", " "
" ", "1", "59 minutes", " "
" ", "2", "2 hours", " "
" ", "other N$<$8", "N hours", " "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be:
" ", "other :math:2 < N < 8", "N hours", " "

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The block of math text is set off with the ` on either side. Just to be clear. Same comment for below.

"ICE_ACCOUNT", " ", "batch account number", "set by cice.setup, .cice_proj or by default"
"ICE_QUEUE", " ", "batch queue name", "set by cice.setup or by default"
"ICE_THREADED", "true,false", "force threading in compile, will always compile threaded if NTHRDS is gt 1", "false"
"ICE_THREADED", "true, false", "force threading in compile, will always compile threaded if NTHRDS is gt 1", "false"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be:
"ICE_THREADED", "true, false", "force threading in compile, will always compile threaded if ICE_NTHRDS :math:> 1", "false"

Copy link
Contributor

@duvivier duvivier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two minor errors in the user guide documentation. I tested these offline with my sphinx build and the suggestions I request fix the format.

"ICE_ACCOUNT", " ", "batch account number", "set by cice.setup, .cice_proj or by default"
"ICE_QUEUE", " ", "batch queue name", "set by cice.setup or by default"
"ICE_THREADED", "true,false", "force threading in compile, will always compile threaded if NTHRDS is gt 1", "false"
"ICE_THREADED", "true, false", "force threading in compile, will always compile threaded if ICE_NTHRDS :math: > 1", "false"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll have to add ` signs around the >1 (this didn't render before in my comment, sorry!)

" ", "0", "15 minutes (default)", " "
" ", "1", "59 minutes", " "
" ", "2", "2 hours", " "
" ", "other :math: 2 < N < 8", "N hours", " "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll have to add signs around the 2<N<8 (this didn't render before in my comment, sorry! If I do it in the comment it formats it instead of showing characters. See2<N<8`)

duvivier
duvivier previously approved these changes Oct 19, 2018
@eclare108213
Copy link
Contributor

trying again... is this better?

@duvivier
Copy link
Contributor

Yes, that is perfect. :)

@duvivier duvivier mentioned this pull request Oct 19, 2018
@njeffery
Copy link
Contributor Author

@eclare108213 : I've added the missing bgc paramters. Ran a 10 week test with bgcz on the gx3 that passed.

@apcraig
Copy link
Contributor

apcraig commented Oct 23, 2018

I just ran the bgc tests on conrad with 4 compilers, hash acbd01a at https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks. The results are better. The cray compiler fails all cases with errors like

lib-4961 : WARNING
Subscript 1 is out of range for dimension 1 for array
'aerosno(*,,)' at line 298 in file 'ice_step_mod.F90' with bounds 1:0.

lib-4961 : WARNING
Subscript 1 is out of range for dimension 1 for array
'aeron' at line 1069 in file 'icepack_shortwave.F90' with bounds 1:0.
(skl_biogeochemistry)initial sk_bgc < 0, nn,nbtrcr,cinit(nn) 15, 16, -2.99999999999999989E-2
(skl_biogeochemistry) cinit < c0
(icepack_warnings_setabort) T :file /p/home/apcraig/cice-consortium/cice.debugZbgc3D/icepack/columnphysics/icepack_algae.F90 :line 592
(icepack_warnings_aborted) ... (sklbio)
(icepack_warnings_aborted) ... (icepack_biogeochemistry)

(z_biogeochemistry)very large bgc value
(z_biogeochemistry)k,m,hbri,hbri_old,bio_tmp,biomat_cons(k,m),ocean_bio(m)
(z_biogeochemistry) 1, 15, 6.5519641720615418E-2, 7.14807665163326444E-2, 1.97942537215752571E+198, 1.29693004287869
203E+197, 2.45710724577836671E+198
(z_biogeochemistry)react(k,m),iphin_N(k),biomat_brine(k,m)
(z_biogeochemistry) -5.76394390422860599E+193, 0.46685667284947335, 4.23995713595056273E+198
(z_biogeochemistry) very large bgc value
(icepack_warnings_setabort) T :file /p/home/apcraig/cice-consortium/cice.debugZbgc3D/icepack/columnphysics/icepack_alga
e.F90 :line 1341
(z_biogeochemistry)trcrn(nt_zbgc_frac+m-1): 1.
(z_biogeochemistry)in_init_cons(k,m): 1.40508727472858092E+197
(z_biogeochemistry)trcrn(bio_index(m) + k-1)
(z_biogeochemistry) 1.96568579662269342E+198
(z_biogeochemistry)Category,m: 1, 15

The pgi cases fails with an error message,

0: Null pointer for a3db (/p/home/apcraig/cice-consortium/cice.debugZbgc3D/cicecore/cicedynB/analysis/ice_history_bgc.F90:
2548)

The intel case fails with an undefined seg fault,

forrtl: severe (174): SIGSEGV, segmentation fault occurred

The gnu case fails with a seg fault in what seems to be a threaded region, although not totally sure,

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x60816f in ???
at /usr/src/packages/BUILD/glibc-2.11.3/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#1 0x593fd0 in ???
#2 0x594936 in ???
#3 0xb7bddd in gomp_thread_start
at ../../../cray-gcc-6.3.0-201701050407.93fe37becc347/libgomp/team.c:119
#4 0xb4e3b5 in start_thread
at /usr/src/packages/BUILD/glibc-2.11.3/nptl/pthread_create.c:301
#5 0xbe2128 in ???
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#0 0x60816f in ???
at /usr/src/packages/BUILD/glibc-2.11.3/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#1 0x593fd0 in ???
#2 0x594936 in ???
#3 0xb7bddd in gomp_thread_start
at ../../../cray-gcc-6.3.0-201701050407.93fe37becc347/libgomp/team.c:119
#4 0xb4e3b5 in start_thread
at /usr/src/packages/BUILD/glibc-2.11.3/nptl/pthread_create.c:301
#5 0xbe2128 in ???
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
_pmiu_daemon(SIGCHLD): [NID 01753] [c9-0c0s6n1] [Tue Oct 23 17:48:36 2018] PE RANK 0 exit signal Segmentation fault

The first step might be to fix the cray errors as those are clearly pointing at something concrete. Maybe the others fix too? At some point, we can also turn on debugging and test with/without openmp/mpi for a few cases to see if that provides any extra clues, although I'm not sure we're quite there yet.

@eclare108213
Copy link
Contributor

base_suite test results are here:
https://github.com/CICE-Consortium/Test-Results/wiki/acbd01a5c7.pinto.intel.181023.195142
The 2 failing tests are expected on pinto; otherwise everything passes.
I have a test suite sitting in the queue with the code at hash abcd01a, to see if the debug runs throw any errors as on conrad. If not, I'm not sure how to debug those issues. Could it be an optimization issue?

@eclare108213
Copy link
Contributor

My abcd01a tests also compiled and ran as advertised. @apcraig your errors look like a configuration problem to me, like maybe the namelist isn't getting read correctly so array allocations aren't being made. This could conceivably happen due to our having multiple, redundant namelist options for tracers, although the code checks at initialization ought to catch inconsistencies.
To start, could you post or email to me the files from at least one of the failing runs, which have the config information in them? ice_in, case settings, diagnostic log file if the case started, etc. I'll compare with my versions.

Update history field control, place control logicals inside init and accum subroutines for consistency.

Fix some indent issues in history subroutines.

Comment out OMP loop that was causing problems on conrad_cray.

Fix intent inout arguments in ice_forcing_bgc.F90, init_bgc_data for fed1, fep1.

Modify ice_arrays_column so more variables are dynamically allocated using icepack query routines.
Copy link
Contributor

@eclare108213 eclare108213 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base_suite passes on pinto: https://github.com/CICE-Consortium/Test-Results/wiki/9f47dacbd0.pinto.intel.181026.131608
I did not check whether answers change compared with master. If they do not, then I'm in favor of merging this and then moving forward with additional PRs as needed.

@apcraig
Copy link
Contributor

apcraig commented Oct 26, 2018

I have updated the test results from last night. 4 tests are still in the queue for some strange reason, but otherwise, everything is running, passing, and bit-for-bit. hash 9f47dac, https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks.

I will merge the PR now.

@apcraig apcraig merged commit e6b9fee into CICE-Consortium:master Oct 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants