Single threaded runs on Conrad crash in ice_broadcast.F90 #19

Closed
mattdturner opened this issue Aug 7, 2017 · 1 comment · Fixed by #25
Comments

@mattdturner
Contributor

I attempted to run a test on Conrad (Cray XC40) with the following configuration:

  • smoke test
  • grid = gx3
  • PE = 4x1
  • sets = diag1, run5day, thread

The modules loaded for the compile:

  • PrgEnv/intel-5.2.40
  • intel/17.0.2.174
  • cray-mpich/7.3.2
  • cray-netcdf/4.3.2
  • cray-hdf5/1.8.13

The error that I am encountering is:

Rank 0 [Mon Aug  7 15:24:19 2017] [c2-0c1s2n2] Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(1614): MPI_Bcast(buf=0x7fffffff4640, count=1, dtype=0x4c000829, root=-999, comm=0x84000004) failed
PMPI_Bcast(1576): Invalid root (value given was -999)

forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
cice               000000000079DC32  Unknown               Unknown  Unknown
cice               0000000000787250  Unknown               Unknown  Unknown
cice               0000000000787396  Unknown               Unknown  Unknown
cice               0000000000769629  Unknown               Unknown  Unknown
cice               000000000075F03A  Unknown               Unknown  Unknown
cice               0000000000456F96  ice_broadcast_mp_          81  ice_broadcast.F90
cice               00000000004670DF  ice_diagnostics_m         759  ice_diagnostics.F90
cice               0000000000402DBD  cice_runmod_mp_ci          63  CICE_RunMod.F90
cice               000000000040064B  MAIN__                     50  CICE.F90
cice               00000000004005DE  Unknown               Unknown  Unknown
cice               0000000000DA7991  Unknown               Unknown  Unknown
cice               00000000004004B9  Unknown               Unknown  Unknown

The error appears to result from the model failing to find the first diagnostic point. Its owning task is left at -999, the same value that MPI_Bcast rejects as the root:

  Find indices of diagnostic points

 found point   1
   lat    lon   TLAT   TLON     i     j   block  task
  90.0    0.0 -999.0 -999.0     0     0     0  -999

 found point   2
   lat    lon   TLAT   TLON     i     j   block  task
 -65.0  -45.0  -64.2  -45.0    25    10     2     1
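
To make the connection between the diagnostic output and the MPI error concrete, here is a minimal Python sketch (not the CICE Fortran code) of how an unset owner-task sentinel can reach a broadcast call, and of a guard that would catch it earlier. The function names `find_owner` and `checked_root`, the crude distance measure, and the data layout are all assumptions made for illustration; only the -999 sentinel and the 0..size-1 root requirement come from the log and from MPI itself.

```python
# Hypothetical sketch, not CICE code: shows how an unset "owner task"
# sentinel can reach a broadcast call, and a guard that catches it first.

UNSET = -999  # sentinel value seen in the log for TLAT, TLON and task


def find_owner(target, cells):
    """Return (task, distance) for the cell nearest to target.

    cells is a list of (lat, lon, task) tuples. If no candidate is ever
    accepted (for example, the list is empty), the sentinel comes back
    unchanged, which mirrors "found point 1 ... task -999" in the log.
    """
    best_task, best_dist = UNSET, float("inf")
    for lat, lon, task in cells:
        dist = abs(lat - target[0]) + abs(lon - target[1])  # crude distance
        if dist < best_dist:
            best_task, best_dist = task, dist
    return best_task, best_dist


def checked_root(task, comm_size):
    """Reject a broadcast root outside 0..comm_size-1, as MPI_Bcast does."""
    if not 0 <= task < comm_size:
        raise ValueError(f"invalid broadcast root {task} for {comm_size} tasks")
    return task


if __name__ == "__main__":
    cells = [(-64.2, -45.0, 1), (10.0, 20.0, 0)]
    task, _ = find_owner((-65.0, -45.0), cells)
    print("point 2 owner:", checked_root(task, 4))
    try:
        checked_root(find_owner((90.0, 0.0), [])[0], 4)
    except ValueError as err:
        print("point 1:", err)
```

A check of this kind would turn the abort inside the MPI library into an error at the point where the search result is first used, which may make the failing diagnostic point easier to identify.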

Note that the model runs successfully in hybrid MPI+OpenMP configurations (conrad_smoke_gx3_8x2_diag1_run5day) and pure-MPI configurations (conrad_smoke_gx3_4x1_debug_diag1_run5day). The error appears only in the 4x1 case built with the thread option (CICE_THREADED enabled).

PASS conrad_smoke_gx3_8x2_diag1_run5day build
PASS conrad_smoke_gx3_8x2_diag1_run5day run
PASS conrad_smoke_gx3_8x2_diag24_run1year_medium build
PASS conrad_smoke_gx3_4x1_debug_diag1_run5day build
PASS conrad_smoke_gx3_4x1_debug_diag1_run5day run
PASS conrad_smoke_gx3_8x2_debug_diag1_run5day build
PASS conrad_smoke_gx3_8x2_debug_diag1_run5day run
PASS conrad_smoke_gx3_4x2_diag1_run5day build
PASS conrad_smoke_gx3_4x2_diag1_run5day run
PASS conrad_smoke_gx3_4x2_diag1_run5day bfbcomp conrad_smoke_gx3_8x2_diag1_run5day.t00
PASS conrad_smoke_gx3_4x1_diag1_run5day_thread build
FAIL conrad_smoke_gx3_4x1_diag1_run5day_thread run

Log file from the run: cice.runlog.txt
Log of the compile: cice.buildlog.txt

@dabail10
Contributor

dabail10 commented Sep 7, 2017

I have an OMP fix for the diagnostic code: delete the OMP directives around the loop in the subroutine init_diags where latdis/londis are computed. That loop is not thread-safe.

Dave
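
The fix described above removes the threading from the search loop. For readers wondering what a thread-safe version of such a nearest-point search could look like, the following Python sketch shows one common pattern: each block computes its own local minimum with no shared state, and the per-block results are combined serially afterwards. This is an alternative technique shown for illustration, not the change that was made in CICE; `nearest_in_block`, `nearest_overall`, and the data layout are hypothetical, and Python threads stand in only loosely for OpenMP threads.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest_in_block(block_id, cells, target):
    """Return (distance, block_id, index) for the closest cell in one block.

    Only local variables are written here, so several blocks can be
    searched at the same time without interfering with each other.
    """
    best = (float("inf"), block_id, -1)
    for idx, (lat, lon) in enumerate(cells):
        dist = (lat - target[0]) ** 2 + (lon - target[1]) ** 2
        if dist < best[0]:
            best = (dist, block_id, idx)
    return best


def nearest_overall(blocks, target, workers=2):
    """Search every block in parallel, then reduce the results serially."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_block = list(
            pool.map(
                lambda item: nearest_in_block(item[0], item[1], target),
                enumerate(blocks),
            )
        )
    # Serial reduction: tuples compare by distance first, then block, then
    # index, so ties are broken the same way on every run.
    return min(per_block, default=(float("inf"), -1, -1))


if __name__ == "__main__":
    blocks = [[(0.0, 0.0), (10.0, 10.0)], [(-64.2, -45.0)]]
    print(nearest_overall(blocks, (-65.0, -45.0)))
```

The unsafe variant, by contrast, would have every thread compare against and overwrite a single shared "best distance" and "best index", so two threads could interleave their compare and update steps and leave the result unset or inconsistent.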

apcraig added a commit that referenced this issue Sep 14, 2017
Resolution to Issue #19 (single threaded runs crash)
JFLemieux73 pushed a commit to JFLemieux73/CICE that referenced this issue Nov 18, 2021
* Bug fix for missing allocation of CD variables

* Adjusting indents and fixing a comment
apcraig added a commit that referenced this issue Nov 8, 2022
apcraig added a commit that referenced this issue Aug 28, 2023
anton-seaice pushed a commit to anton-seaice/CICE that referenced this issue Dec 1, 2023