Example cfg for NERSC #235

Closed

forsyth2
Collaborator

Example cfg for NERSC. This is the NERSC example for #233.

forsyth2 added the priority: low label Apr 22, 2022
forsyth2 self-assigned this Apr 22, 2022
forsyth2 force-pushed the examples-nersc branch 2 times, most recently from 04ddae7 to 6d76525 May 9, 2022 18:26

mapping_file = ""
vars = "RIVER_DISCHARGE_OVER_LAND_LIQ"

[tc_analysis]
Collaborator Author

@chengzhuzhang It's possible #169 (which added Tropical Cyclone analysis/diagnostics) wasn't tested on Cori. This task worked fine on the Compy example (#234). The Cori example cfg, however, fails with the following. Any thoughts?

Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......:  PMI2 init failed: 1
/var/spool/slurmd/job58523789/slurm_script: line 94: 42021 Aborted                 DetectNodes --verbosity 0 --in_connect ${result_dir}connect_CSne${res}_v2.dat --closedcontourcmd "PSL,300.0,4.0,0;_AVG(T200,T500),-0.6,4,0.30" --mergedist 6.0 --searchbymin PSL --outputcmd "PSL,min,0;_VECMAG(UBOT,VBOT),max,2" --timestride 1 --in_data_list ${result_dir}inputfile_${file_name}.txt --out ${result_dir}out.dat

Contributor

This looks like the kind of error that happens when you're using the wrong MPI (like from conda-forge). I might be on completely the wrong track but wanted to suggest that in case...
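
One way to check the wrong-MPI theory is to look at which MPI and PMI libraries the executable is dynamically linked against. A minimal sketch, assuming a Linux node where ldd is available; list_mpi_libs is a hypothetical helper name, not something from zppy or TempestExtremes:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the shared libraries of an executable whose
# names mention MPI or PMI. Prints nothing if ldd fails or nothing matches.
list_mpi_libs() {
    ldd "$1" 2>/dev/null | grep -iE 'mpi|pmi' || true
}

# Example: inspect whichever DetectNodes is first on PATH, if there is one.
if detect_nodes_path="$(command -v DetectNodes)"; then
    list_mpi_libs "$detect_nodes_path"
fi
```

Library paths inside a conda environment would be consistent with the conda-forge explanation; paths under the system or Spack installation would point elsewhere.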

Collaborator

I recall there was an issue when you tested tc_analysis.bash on Cori; I'm not sure whether this is related. From what I can tell, on a Cori compute node the TempestExtremes call (DetectNodes) tries to initialize an MPI process. Maybe you could try launching it with srun, e.g.:

srun -n 32 DetectNodes --verbosity 0 --in_connect ${result_dir}connect_CSne${res}_v2.dat --closedcontourcmd "PSL,300.0,4.0,0;_AVG(T200,T500),-0.6,4,0.30" --mergedist 6.0 --searchbymin PSL --outputcmd "PSL,min,0;_VECMAG(UBOT,VBOT),max,2" --timestride 1 --in_data_list ${result_dir}inputfile_${file_name}.txt --out ${result_dir}out.dat

Collaborator Author

Unfortunately, it looks like adding srun -n 32 didn't fix anything.

Collaborator

Hmm, I would suggest running the standalone DetectNodes command on a Haswell node, with and without srun, to see whether the error reproduces.
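
The with/without comparison could be scripted so both variants run in the same allocation. This is only a sketch: run_and_report is a hypothetical helper, and the DetectNodes arguments are left out here (they would be the same ones used in tc_analysis.bash).

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command and report its exit status without
# aborting the calling script, so both variants can be compared in one job.
run_and_report() {
    local label="$1"
    shift
    local status=0
    "$@" || status=$?
    echo "${label}: exit status ${status}"
}

# Intended use inside an interactive allocation (arguments elided here;
# they would be the same DetectNodes arguments as in tc_analysis.bash):
#   run_and_report "direct" DetectNodes <args>
#   run_and_report "srun"   srun -n 32 DetectNodes <args>
```

If the direct call aborts in MPI_Init while the srun call exits with status 0, that would suggest the launcher is what matters on that node type.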

Contributor

Can you also run the command which DetectNodes to make sure it's coming from Spack and not from conda-forge?
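
For a slightly more explicit version of that check, something like the following could go near the top of the batch script. It is a rough sketch: both function names are hypothetical, and the conda detection is a heuristic assumption (a path under $CONDA_PREFIX, or any path containing "conda"), not a reliable test.

```bash
#!/usr/bin/env bash
# Hypothetical helper: guess whether an executable path belongs to a conda
# environment. Heuristic only (an assumption): a path under $CONDA_PREFIX,
# or any path containing "conda", is reported as conda.
classify_exe_path() {
    local exe_path="$1"
    if [ -n "${CONDA_PREFIX:-}" ]; then
        case "$exe_path" in
            "$CONDA_PREFIX"/*) echo "conda"; return 0 ;;
        esac
    fi
    case "$exe_path" in
        *conda*) echo "conda" ;;
        *) echo "other" ;;
    esac
}

# Hypothetical helper: show where a command resolves on PATH.
report_exe_origin() {
    local name="$1" resolved
    if ! resolved="$(command -v "$name")"; then
        echo "$name: not found on PATH"
        return 1
    fi
    echo "$name -> $resolved ($(classify_exe_path "$resolved"))"
}

report_exe_origin DetectNodes || true
```

A result of "other" does not prove the binary comes from Spack; it only rules out the obvious conda locations.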

Collaborator Author

forsyth2 left a comment

This is still a draft. Making notes for reference.

@@ -59,16 +59,21 @@ fi
mkdir -p $result_dir
file_name=${caseid}_${start}_${end}

which DetectNodes
Collaborator Author

Remove this line

@@ -78,27 +83,34 @@ cd ${drc_in};eval ls ${caseid}.$atm_name.h2.*{${start}..${end}}*.nc >${result_di
cd ${result_dir}
# Detection threshold including:
# The sea-level pressure (SLP) must be a local minimum; SLP must have a sufficient decrease (300 Pa) compared to surrounding nodes within 4 degree radius; The average of the 200 hPa and 500 hPa level temperature decreases by 0.6 K in all directions within a 4 degree radius from the location of the SLP minimum
DetectNodes --verbosity 0 --in_connect ${result_dir}connect_CSne${res}_v2.dat --closedcontourcmd "PSL,300.0,4.0,0;_AVG(T200,T500),-0.6,4,0.30" --mergedist 6.0 --searchbymin PSL --outputcmd "PSL,min,0;_VECMAG(UBOT,VBOT),max,2" --timestride 1 --in_data_list ${result_dir}inputfile_${file_name}.txt --out ${result_dir}out.dat
echo "Run DetectNodes"
srun -n 32 DetectNodes --verbosity 0 --in_connect ${result_dir}connect_CSne${res}_v2.dat --closedcontourcmd "PSL,300.0,4.0,0;_AVG(T200,T500),-0.6,4,0.30" --mergedist 6.0 --searchbymin PSL --outputcmd "PSL,min,0;_VECMAG(UBOT,VBOT),max,2" --timestride 1 --in_data_list ${result_dir}inputfile_${file_name}.txt --out ${result_dir}out.dat
Collaborator Author

srun -n 32 is currently applied to VariableProcessor and both calls to DetectNodes. I should check which of the 3 are actually required. Should srun be used only on Cori, or is it fine to always use it? (Currently, tc_analysis_0051-0052 completes on Cori because of at least 1 of these 3 changes.)
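
If srun turns out to be needed only on Cori, one option would be to pick the launcher per machine. A minimal sketch, assuming NERSC_HOST is set to "cori" on Cori; launcher_prefix is a hypothetical helper, not existing zppy code:

```bash
#!/usr/bin/env bash
# Hypothetical helper: choose a launcher prefix per machine.
# Assumption: NERSC_HOST is set to "cori" on Cori; elsewhere print nothing.
launcher_prefix() {
    if [ "${NERSC_HOST:-}" = "cori" ]; then
        echo "srun -n 32"
    fi
}

# Intended use (left unquoted on purpose so the prefix splits into words):
#   $(launcher_prefix) DetectNodes <args>
```

On other machines the prefix expands to nothing, so the command runs exactly as it did before this change.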

ref_years = "51-52",
reference_data_path = /global/cscratch1/sd/forsyth/zppy_complete_run_nersc_output/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/post/atm/180x360_aave/clim
reference_data_path_climo_diurnal = /global/cscratch1/sd/forsyth/zppy_complete_run_nersc_output/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/post/atm/180x360_aave/clim_diurnal_8xdaily
reference_data_path_tc = /global/cscratch1/sd/forsyth/zppy_complete_run_nersc_output/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/post/atm/tc-analysis_0051_0052
Collaborator Author

Presumably, this change from 51_52 to 0051_0052 will get TC working on e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_0051-0052_vs_0051-0052.
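
To keep the directory name consistent with the zero-padded form, the suffix could be generated rather than typed by hand. A small sketch; tc_analysis_dirname is a hypothetical helper, and the tc-analysis_ prefix is simply copied from the path in this cfg:

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the zero-padded directory name for a year range.
# Pass plain integers (51, not 0051): printf treats a leading 0 as octal.
tc_analysis_dirname() {
    printf 'tc-analysis_%04d_%04d\n' "$1" "$2"
}

tc_analysis_dirname 51 52   # prints tc-analysis_0051_0052
```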

sets = "lat_lon","zonal_mean_xy","zonal_mean_2d","polar","cosp_histogram","meridional_mean_2d","enso_diags","qbo","diurnal_cycle","annual_cycle_zonal_mean","streamflow", "zonal_mean_2d_stratosphere",
streamflow_obs_ts = /global/cfs/cdirs/e3sm/e3sm_diags/obs_for_e3sm_diags/time-series

[[ atm_monthly_180x360_aave_tc_analysis ]]
Collaborator Author

e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_0051-0052 says "OK" for status but it doesn't produce a viewer. There is a FileNotFoundError: [Errno 2] No such file or directory: b'/global/cfs/cdirs/e3sm/e3sm_diags/obs_for_e3sm_diags/tc_analysis/IBTrACS.NA.v04r00.nc'.
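
Since the task reports "OK" even though the input is missing, a pre-flight check in the job script might make this failure visible earlier. A sketch under that assumption; require_readable_file is a hypothetical helper, and the path is the one from the FileNotFoundError:

```bash
#!/usr/bin/env bash
# Hypothetical helper: fail early with a clear message if an input is missing.
require_readable_file() {
    if [ ! -r "$1" ]; then
        echo "missing or unreadable: $1" >&2
        return 1
    fi
}

# Path taken from the FileNotFoundError above.
obs_file=/global/cfs/cdirs/e3sm/e3sm_diags/obs_for_e3sm_diags/tc_analysis/IBTrACS.NA.v04r00.nc
require_readable_file "$obs_file" || echo "TC observations not found; skipping"
```

Whether to skip or to fail the task outright in that case is a design choice; failing would at least keep the status from reading "OK".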

forsyth2 mentioned this pull request Jun 29, 2022
Collaborator Author

forsyth2 commented Aug 3, 2022

Replaced by #264 / #282.

forsyth2 closed this Aug 3, 2022
forsyth2 deleted the examples-nersc branch November 1, 2022 23:29
Labels
priority: low Low priority task