gefs_atmos_ensstat occasionally crashes with a segmentation fault error #3150

Open
EricSinsky-NOAA opened this issue Dec 10, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@EricSinsky-NOAA
Contributor

EricSinsky-NOAA commented Dec 10, 2024

What is wrong?

The gefs_atmos_ensstat task crashes with a reproducible segmentation fault for a significant number of cases in the EP6 GEFS runs.

What should have happened?

The gefs_atmos_ensstat task should not crash.

What machines are impacted?

WCOSS2

What global-workflow hash are you using?

40f4536

Steps to reproduce

  1. Clone the feature/gefs_reforecast branch, check out 40f4536, and build.
  2. Set up a GEFS experiment for the 20171116 case with FHOUT_GFS=3, FHMAX_GFS=240 and NPERT=10.
  3. Run the experiment up to the atmos_ensstat task, where the segmentation fault is expected.
    This issue will also be tested in the develop branch for the same case.

Additional information

@BoCui-NOAA found that the segmentation fault in ensstat occurs due to infinite values in the ensemble spread. These arise from extreme values in the pgrb data (generated in atmos_prod). Sometimes the extreme values produce an infinite spread; other times they produce very large nonphysical values that are not large enough to cause a segmentation fault. In the 20171116 case, the extreme values occurred in five pgrb variables (TSOIL, SOILW, WEASD, SNOD and PEVPR). They are presumably the undefined value 9.9990003E+20, which the ensstat program is interpreting as real data.
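To illustrate the effect described above, here is a minimal sketch (not the ensstat code) that computes the spread of a few member values with and without skipping the undefined value. The function name `ens_spread`, the `mask` argument, and the sample values are all hypothetical. awk works in double precision, so the unmasked result is a huge nonphysical number rather than infinity; in single precision, squaring a value near 1e21 exceeds the largest representable float (about 3.4e38), which would be one way to arrive at an infinite spread, though whether ensstat computes the spread that way is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of ensstat: computes the spread (population
# standard deviation) of ensemble member values read one per line from stdin.
# When the first argument is "mask", values at or above the undefined
# sentinel (9.999e20) are skipped instead of being treated as real data.
ens_spread() {
  local mode="${1:-nomask}"
  awk -v mode="$mode" '
    BEGIN { undef = 9.999e20 }
    {
      v = $1 + 0
      if (mode == "mask" && v >= undef) next
      n++; sum += v; sumsq += v * v
    }
    END {
      if (n == 0) { print "undef"; exit }
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0   # guard against rounding error
      printf "%.4g\n", sqrt(var)
    }'
}

# Four made-up member values (soil temperature in K), one of them undefined.
printf '%s\n' 280.1 279.8 9.9990003E+20 280.4 | ens_spread        # sentinel treated as data
printf '%s\n' 280.1 279.8 9.9990003E+20 280.4 | ens_spread mask   # sentinel skipped
```

With the sentinel treated as data, the spread is on the order of 1e20; with it skipped, the spread is a fraction of a kelvin.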

Do you have a proposed solution?

This issue is actively under investigation, and a proposed solution will be specified when more is known. In testing, removing the mod_icec call from the interp_atmos_master.sh script avoids the segmentation fault in the ensstat task, but the root cause is still unknown. Removing this call does not change the min, max and mean of the pgrb variables.

@EricSinsky-NOAA EricSinsky-NOAA added bug Something isn't working triage Issues that are triage labels Dec 10, 2024
@EricSinsky-NOAA EricSinsky-NOAA changed the title gefs_atmos_ensstat occasionally crashes with a segmenation fault error gefs_atmos_ensstat occasionally crashes with a segmentation fault error Dec 10, 2024
@WalterKolczynski-NOAA WalterKolczynski-NOAA removed the triage Issues that are triage label Dec 10, 2024
@EricSinsky-NOAA
Contributor Author

After further investigation with @BoCui-NOAA, it seems the mod_icec function in the interp_atmos_master.sh script is removing the bitmap for TSOIL, SOILW, WEASD, SNOD and PEVPR. This is likely what causes the problem in ensstat. When mod_icec is commented out in my testing, the bitmap remains in use for these five variables and the ensstat task completes without error.
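One way to spot records that have lost their bitmap is to look for a maximum equal to the undefined value. The sketch below is only an illustration: `flag_unmasked_records` is a hypothetical helper, and the line layout it parses (colon-separated fields with the variable name in the fourth field and `undef=` and `max=` fields from `-stats`) is an assumption about the wgrib2 inventory output that should be checked against the installed version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads wgrib2 inventory lines (as assumed to be printed
# by `wgrib2 FILE -s -stats`) on stdin and prints every record whose maximum
# is at or above the undefined sentinel, i.e. records where undefined points
# appear to be stored as real data.
# Assumed layout: recnum:offset:d=DATE:VAR:level:ftime:...:undef=N:...:max=X
flag_unmasked_records() {
  awk -F: '
    {
      max = ""; undef = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^max=/)   max   = substr($i, 5)
        if ($i ~ /^undef=/) undef = substr($i, 7)
      }
      # Threshold sits slightly below 9.999e20 to tolerate print rounding.
      if (max != "" && max + 0 >= 9.99e20)
        printf "%s undef=%s max=%s\n", $4, undef, max
    }'
}

# Typical use (needs wgrib2 and a real pgrb file; FILE is a placeholder):
#   $WGRIB2 "${FILE}" -match ":(TSOIL|SOILW|WEASD|SNOD|PEVPR):" -s -stats | flag_unmasked_records
```

A record that still uses its bitmap would be expected to show a nonzero `undef=` count and a physically plausible maximum, so it would not be listed.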

@EricSinsky-NOAA
Contributor Author

@WalterKolczynski-NOAA In interp_atmos_master.sh (line 56), should:

mod_icec "${output_file_prefix}_${grid}"; export err=$?; err_chk

be changed to something like this:

  count=$($WGRIB2 "${output_file_prefix}_${grid}" -match "LAND|ICEC" | wc -l)
  if [ "${count}" -eq 2 ]; then
    mod_icec "${output_file_prefix}_${grid}"; export err=$?; err_chk
  fi

Adding this condition is consistent with what is done in the standalone UPP script (lines 57-62). After applying the above change to interp_atmos_master.sh, the issue I described in my previous comment is resolved (the bitmap correctly remains in use for those five variables). Pinging @WenMeng-NOAA on this suggested change too, since she has developed many of these UPP scripts.
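For anyone who wants to try the guard outside the workflow, here is a standalone sketch of the same condition. The wrapper name `mod_icec_if_present` is hypothetical, and `mod_icec` and `err_chk` are replaced by stubs so the snippet runs without the workflow environment; the real functions behave differently.

```shell
#!/usr/bin/env bash
# Stubs standing in for the workflow's functions (placeholders only).
mod_icec() { echo "mod_icec called on $1"; }
err_chk()  { [ "${err:-0}" -eq 0 ]; }

# Hypothetical wrapper: runs mod_icec only when exactly two inventory lines
# of the file match LAND or ICEC, mirroring the condition proposed above.
# WGRIB2 is expected to name the wgrib2 executable; it defaults to "wgrib2".
mod_icec_if_present() {
  local file="$1" count
  count=$("${WGRIB2:-wgrib2}" "${file}" -match "LAND|ICEC" | wc -l)
  if [ "${count}" -eq 2 ]; then
    mod_icec "${file}"; export err=$?; err_chk
  fi
}
```

In the script this would be called as `mod_icec_if_present "${output_file_prefix}_${grid}"`. Files that lack either field are simply left untouched, so their bitmaps are not rewritten.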

@EricSinsky-NOAA
Contributor Author

It looks like mod_icec was brought into the global-workflow in PR #771 (Jun 10, 2022) in fv3gfs_dwn_nems.sh. When mod_icec was first brought into the global-workflow, it was correctly executed under the count condition. This condition was removed in PR #1822 (Sep 15, 2023) during the atmos product script simplifications in g-w.

@WenMeng-NOAA
Contributor

@EricSinsky-NOAA Thanks for catching the bug. CC'ing @ChristopherHill-NOAA on this issue so he can provide assistance from the GFS/GEFS post-processing perspective.
