
CH4 Conservation Error in CH4Mod during diffusion #260

Closed · blcc opened this issue Apr 29, 2021 · 11 comments

Comments
@blcc (Contributor) commented Apr 29, 2021

Hi, I encountered an error when testing the current NorESM code on Betzy, on both the master and noresm2 branches.

After git clone and checkout_externals (Externals.cfg and Externals_continuous_development.cfg are identical):

cime/scripts/create_newcase --case ~/work/noresm2_cases/noresm_test005 --compset NHIST --res f19_tn14 --mach betzy --project nn9039k 
cd ~/work/noresm2_cases/noresm_test005 && ./case.setup && ./case.build && ./case.submit

The job stopped during initialization; here is the cesm.log:

[skip]
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
(seq_domain_areafactinit) : min/max mdl2drv   0.999999999983463       1.00000000001725    areafact_o_OCN
(seq_domain_areafactinit) : min/max drv2mdl   0.999999999982750       1.00000000001654    areafact_o_OCN
(seq_domain_areafactinit) : min/max mdl2drv   0.999999999983463       1.00000000001725    areafact_i_ICE
(seq_domain_areafactinit) : min/max drv2mdl   0.999999999982750       1.00000000001654    areafact_i_ICE
 CH4 Conservation Error in CH4Mod during diffusion, nstep, c, errch4 (mol /m^2.timestep)           0      107431                     NaN
 Latdeg,Londeg=   48.3157894736841        40.0000000000000     
 ENDRUN:
 ERROR: 
  ERROR: CH4 Conservation Error in CH4Mod during diffusionERROR in ch4Mod.F90 at line 3948
Image              PC                Routine            Line        Source             
cesm.exe           00000000029224C6  Unknown               Unknown  Unknown
cesm.exe           00000000025A6B80  shr_abort_mod_mp_         114  shr_abort_mod.F90
cesm.exe           0000000001B81AFF  abortutils_mp_end          50  abortutils.F90
cesm.exe           00000000021693E7  ch4mod_mp_ch4_tra        3947  ch4Mod.F90
cesm.exe           000000000215BE02  ch4mod_mp_ch4_           2045  ch4Mod.F90
cesm.exe           0000000001B8D091  clm_driver_mp_clm         960  clm_driver.F90
cesm.exe           0000000001B7689A  lnd_comp_mct_mp_l         456  lnd_comp_mct.F90
cesm.exe           00000000004376A0  component_mod_mp_         728  component_mod.F90
cesm.exe           000000000041B85B  cime_comp_mod_mp_        2724  cime_comp_mod.F90
cesm.exe           00000000004372E7  MAIN__                    125  cime_driver.F90
cesm.exe           0000000000419512  Unknown               Unknown  Unknown
libc-2.17.so       00002AE442507545  __libc_start_main     Unknown  Unknown
cesm.exe           0000000000419429  Unknown               Unknown  Unknown
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 143 in communicator MPI_COMM_WORLD
with errorcode 1001.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 179268.0 ON b4139 CANCELLED AT 2021-04-29T10:47:01 ***

I also tried the Externals.cfg from an older, working NorESM checkout, but the error still exists.

Is this a bug, or am I using the wrong command or branch?

Thanks,
Ping-Gin

@adagj (Contributor) commented Apr 29, 2021

@blcc
Hi, can you attach the README from this specific run, or tell us the release/tag you used? You should use master or noresm-release2.0.4 when running on Betzy. For details, see
https://github.com/NorESMhub/NorESM/tree/noresm2
and
https://noresm-docs.readthedocs.io/en/noresm2/configurations/platforms.html

Remember that you need to rerun checkout_externals (which reads Externals.cfg) when changing branch or tag.
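
For reference, a typical sequence when switching branch or tag is something like this (a sketch, assuming the standard manage_externals layout, run from the top-level NorESM directory):

git checkout release-noresm2.0.4        # or: git checkout master
./manage_externals/checkout_externals   # re-reads Externals.cfg for that branch/tag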

Best regards,
Ada

@blcc (Contributor, Author) commented Apr 29, 2021


Thank you Ada. The README.case is attached.
README.case.txt

I tried master but got the same error. I will try release 2.0.4.
Best regards,
Ping-Gin

@adagj (Contributor) commented Apr 29, 2021

Thanks, then it is probably not related to the Betzy settings.
One other issue we have had on Betzy, which might be useful to check, is files getting corrupted when copying them to Betzy, causing NaNs in the data. Are you using restart files? If so, you can check whether the checksums of the restart files you use on Betzy match the ones stored on NIRD (e.g. using sha256sum $FILENAME).
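
For example, a minimal check could look like this on each machine (a sketch; the restart-file patterns are the standard CESM ones and the file names are illustrative):

cd $RUNDIR                                        # run directory on Betzy
sha256sum *.r.*.nc *.rs.*.nc > checksums_betzy.txt
# generate the same list for the copies on NIRD, then compare
diff checksums_betzy.txt checksums_nird.txt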

If that doesn't help: @DirkOlivie @monsieuralok, maybe you can help?

@blcc (Contributor, Author) commented Apr 29, 2021

I tried release-noresm2.0.4, but got the same error.
Now I suspect the input data on Betzy is corrupted.
I'll check it later. Thanks.

@DirkOlivie (Contributor)

Hi Ping-Gin,

if the error is still there, could you also paste the last lines of the lnd.log file (the error is in the land component) in this issue?

In the land model, a correction is currently about to be applied: see
NorESMhub/CTSM#11
A pull request has been created, and this will probably soon be available in the code: see
NorESMhub/CTSM#12

This is possibly related to your problem, but I am not sure.

Best regards,
Dirk

@blcc (Contributor, Author) commented May 3, 2021

Thanks @DirkOlivie, I ran some tests with the merged code but still got the same error.
However, the problem disappeared when I changed NTASKS_LND from the default 192 to 128.
I guess CTSM somehow has a bug when NTASKS is 192, which makes some points of t_soisno and some other variables NaN, and it finally crashes in ch4Mod.F90.
The easiest way to avoid this problem is to change the default PE setting in cime_config/config_pes.xml, and to add a warning about it in the documentation or code.
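
For an existing case, the same workaround can be applied with standard CIME commands, roughly like this (a sketch, run in the case directory):

./xmlchange NTASKS_LND=128   # reduce land-component tasks from the default 192
./case.setup --reset         # regenerate the PE layout
./case.build                 # rebuild, then resubmit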

Ping-Gin

@adagj (Contributor) commented May 3, 2021

@blcc thanks for the clarification, we will add it to the documentation.
When building, did you use the --pecount option (https://noresm-docs.readthedocs.io/en/noresm2/configurations/platforms.html#hpc-platforms, section 4.1.2.1)?
@monsieuralok, can you make sure that the --pecount option uses 128 NTASKS for the land component?
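
For reference, using --pecount would look something like this (a sketch, reusing the command from the top of this issue):

cime/scripts/create_newcase --case ~/work/noresm2_cases/noresm_test005 --compset NHIST --res f19_tn14 --mach betzy --project nn9039k --pecount M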

@blcc (Contributor, Author) commented May 3, 2021

Thanks @adagj, I did not use the --pecount option; it seems the M set (8 nodes) was applied automatically.

@monsieuralok

@adagj @DirkOlivie I guess we should open this issue with CESM, as reported earlier by others: ESCOMP/CTSM#135. It might be that it is solved in a newer version, but it is difficult to check a newer version in the same framework.

@DirkOlivie (Contributor)

I have experienced the same error, "ERROR: CH4 Conservation Error in CH4Mod during diffusionERROR in ch4Mod.F90", in an NHIST (1850-2014) simulation.

It happened after 20 years, at the moment of automatic resubmission (1870-01-01).

Resubmitting manually gave the same error message.
Resubmitting after recopying the 1870-01-01 restart files from the archive directory to the run directory gave the same problem.
Resubmitting after recopying earlier restart files (1860-01-01) worked, and the simulation later also passed 1870-01-01 without any problems.
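
For reference, the recopy step was roughly the following (a sketch; $DOUT_S_ROOT and $RUNDIR stand for the case's short-term archive and run directories, and the restart-set layout is the standard CESM one):

cp $DOUT_S_ROOT/rest/1860-01-01-00000/* $RUNDIR/   # restart files plus rpointer files
./case.submit                                      # resubmit from the earlier restart set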

@blcc (Contributor, Author) commented Jul 1, 2022

I'll close this issue since no one has had the same problem for a long time.
We can reopen it if it happens again.
