add srun for running e3sm_diags #497

Merged
forsyth2 merged 2 commits into main from srun_e3sm_diags
Sep 7, 2023

Conversation

chengzhuzhang
Collaborator

Fixes #485 for the PMI2_Init error and e3sm_diags slowness on Chrysalis.

@chengzhuzhang
Collaborator Author

chengzhuzhang commented Sep 6, 2023

@forsyth2 Adding srun -n 1 to the e3sm_diags bash script works when the standalone bash script is submitted via sbatch. However, I hit this error:

===== RUN E3SM DIAGS =====

slurmstepd: error: execve(): python: No such file or directory
srun: error: chr-0311: task 0: Exited with exit code 2

This happens when the job is submitted via zppy. Any insight into how to fix this?

@forsyth2
Collaborator

forsyth2 commented Sep 6, 2023

@chengzhuzhang That's really quite perplexing. The only thing I can think of is that the correct environment isn't being picked up when running with srun. But it doesn't even recognize python as a command here. And we've used srun in plenty of other places in zppy without a problem...
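
One way to test the "environment isn't being picked up" hypothesis is to check whether python resolves at all under the PATH that the job step sees. The snippet below is only a minimal sketch: resolve_in_env is a hypothetical helper (not part of zppy), and the example environment is a placeholder.

```python
import shutil


def resolve_in_env(command, env):
    """Return the path `command` would resolve to under env's PATH, or None.

    Hypothetical helper for illustration; `env` is a dict of environment
    variables, e.g. os.environ captured inside the job step.
    """
    # shutil.which returns None when the PATH is empty or has no match,
    # which corresponds to the "No such file or directory" case from execve().
    return shutil.which(command, path=env.get("PATH", ""))


# An environment without a PATH (placeholder for a stripped job-step environment).
print(resolve_in_env("python", {}))
```

Running the same check inside the bash script, both before and after the srun call, would show whether the step loses the activated environment.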

@@ -222,7 +222,7 @@ def e3sm_diags(config, scriptDir, existing_bundles, job_ids_file): # noqa: C901
p.pprint(c)
p.pprint(s)

- export = "NONE"
+ export = "ALL"
Collaborator Author

Set export to "ALL" so that the full environment is loaded.

Collaborator

Ah, that makes sense.
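
For context, sbatch's --export option controls which environment variables from the submitting shell are propagated to the job. The following is a rough sketch of how an export value like the one in this diff could end up on the submission command line. build_sbatch_command and the script name are hypothetical and are not zppy's actual code. The sketch also restricts the value to the two settings discussed here, although sbatch accepts other forms.

```python
import shlex


def build_sbatch_command(script_file, export="ALL"):
    """Hypothetical helper: assemble an sbatch command as an argument list."""
    # Only the two values discussed in this PR are accepted in this sketch.
    if export not in ("ALL", "NONE"):
        raise ValueError(f"unsupported export value: {export!r}")
    return ["sbatch", f"--export={export}", script_file]


# Placeholder script name, for illustration only.
print(shlex.join(build_sbatch_command("e3sm_diags.bash", export="ALL")))
```

With "NONE", the job does not inherit the submitting shell's variables, which is consistent with the python lookup failing inside the srun step.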

@forsyth2
Collaborator

forsyth2 commented Sep 6, 2023

@chengzhuzhang I was able to reproduce your initial error and was able to get E3SM Diags working with your export line fix. I'm currently running with these changes to double check the elapsed time returns to a reasonable duration.

@chengzhuzhang
Collaborator Author

@forsyth2 I think this PR is ready for review. With export="ALL", the correct environment is loaded and the e3sm_diags run launched successfully. I'm testing zppy on Chrysalis. Could you help test this branch with unified rc14 on Perlmutter and Compy? If things go okay, please go ahead and release a new zppy rc. Thank you.

@chengzhuzhang
Collaborator Author

> @chengzhuzhang I was able to reproduce your initial error and was able to get E3SM Diags working with your export line fix. I'm currently running with these changes to double check the elapsed time returns to a reasonable duration.

Sounds good. The model vs obs task's elapsed time fell to within one hour on Chrysalis. Hopefully there is no change on Perlmutter and Compy.

@forsyth2
Collaborator

forsyth2 commented Sep 6, 2023

> Could you help test this branch with unified rc14 on perlmutter and Compy, if things go okay, please go ahead and release a new zppy rc..thank you.

Yes, I'll run on those and then hopefully make the last RC before final release.

@forsyth2
Collaborator

forsyth2 commented Sep 6, 2023

Confirmed that, on Chrysalis, times indeed have been reduced:

$ tail -n 3 e3sm_diags_atm_monthly_180x360_aave_*.o*
==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.o385043 <==
==============================================
Elapsed time: 43 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1853.o385045 <==
==============================================
Elapsed time: 43 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1852-1853.o385044 <==
==============================================
Elapsed time: 43 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851.o385040 <==
==============================================
Elapsed time: 3562 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853.o385042 <==
==============================================
Elapsed time: 3604 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853.o385041 <==
==============================================
Elapsed time: 3586 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851.o385047 <==
==============================================
Elapsed time: 682 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851.o385046 <==
==============================================
Elapsed time: 40 seconds
==============================================
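
To compare timings across runs without reading each log by eye, output in the shape above can be parsed into a mapping from log file name to seconds. This is a small sketch that assumes the exact "==> name <==" and "Elapsed time: N seconds" line formats shown here. parse_elapsed is a hypothetical helper, not part of zppy.

```python
import re

# Line formats assumed from the `tail` output shown above.
HEADER = re.compile(r"^==> (?P<name>.+) <==$")
ELAPSED = re.compile(r"^Elapsed time: (?P<seconds>\d+) seconds$")


def parse_elapsed(tail_output):
    """Map each log file name to its reported elapsed time in seconds."""
    times = {}
    current = None
    for raw_line in tail_output.splitlines():
        line = raw_line.strip()
        header = HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        elapsed = ELAPSED.match(line)
        if elapsed and current is not None:
            times[current] = int(elapsed.group("seconds"))
    return times


sample = """\
==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851.o385040 <==
==============================================
Elapsed time: 3562 seconds
==============================================
"""
for name, seconds in parse_elapsed(sample).items():
    print(f"{name}: {seconds / 60:.1f} min")
```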

Collaborator

@forsyth2 forsyth2 left a comment

The complete_run test still passes on Perlmutter and Compy, so I think this is ready to merge.

@forsyth2 forsyth2 merged commit b943e6f into main Sep 7, 2023
@forsyth2 forsyth2 deleted the srun_e3sm_diags branch September 7, 2023 17:45
@forsyth2 forsyth2 added the semver: bug Bug fix (will increment patch version) label Sep 12, 2023
Development

Successfully merging this pull request may close these issues.

Unified rc12 and rc14 testing: e3sm_diags takes too long on Chrysalis
2 participants