MPI processes (Starting PEs : 1) does not match the expected number 24 #454

Open
YueZhang720 opened this issue Oct 23, 2024 · 5 comments

@YueZhang720

Your name

Yue Zhang

Your affiliation

HKUST(GZ)

What happened? What did you expect to happen?

After submitting the Slurm job, the following errors appear in the GCHP log:

FATAL: mpp_domains_define.inc: not all the pe_end are in the pelist

 Starting PEs :            1
 Starting Threads :           56

FATAL: mpp_domains_define.inc: not all the pe_end are in the pelist

 For k_split (remapping)=           1
n_split is set to 02 for resolution-dt=0025x0025x6-  600.000
Using n_zfilter : 000
Using n_sponge : 001
Using non_ortho :       T
 For k_split (remapping)=           1
n_split is set to 02 for resolution-dt=0025x0025x6-  600.000
Using n_zfilter : 000
Using n_sponge : 001
Using non_ortho :       T
 Starting PEs :            1
 Starting PEs :            1
 Starting Threads :           56

[... the same block of output ("Starting PEs :            1", "Starting Threads :           56", the n_split/n_zfilter/n_sponge settings, and "FATAL: mpp_domains_define.inc: not all the pe_end are in the pelist") repeats many more times in the log ...]

What are the steps to reproduce the bug?

I have tried GCHP 13.3.4 and GCHP 14.4.3, and both simulations report the same errors. I also used different versions of [email protected] and [email protected]; they didn't work either. What do you think caused this issue, and what should I do to solve it?

Please attach any relevant configuration and log files.

ExtData.txt
GCHP_log.txt
run_sh.txt
setCommonRunSettings.txt

What GCHP version were you using?

14.4.3

What environment were you running GCHP on?

Local cluster

What compiler and version were you using?

gcc 10.2.0

What MPI library and version were you using?

openmpi 5.0.5

Will you be addressing this bug yourself?

Yes, but I will need some help

Additional information

No response

YueZhang720 added the "category: Bug" label on Oct 23, 2024
@lizziel
Contributor

lizziel commented Oct 23, 2024

Hi @YueZhang720, this looks like an issue where each process runs ESMF as a single PE, which would explain why you are seeing the same output printed multiple times. Check your ESMF build. Was ESMF_COMM set to mpiuni? It needs to specify your MPI library, in this case openmpi. See the GCHP ReadTheDocs instructions for environment settings, which include the ESMF settings needed for the build: https://gchp.readthedocs.io/en/stable/getting-started/requirements.html.
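
For illustration, a minimal sketch of the ESMF build settings referred to above, assuming a gfortran/OpenMPI toolchain; the paths are placeholders, not values taken from this issue:

# Standard ESMF build variables; adapt the placeholder paths to your cluster.
export ESMF_DIR=/path/to/esmf/source              # placeholder: ESMF source tree
export ESMF_COMPILER=gfortran                     # matches the gcc 10.2.0 toolchain
export ESMF_COMM=openmpi                          # must name your MPI library, NOT mpiuni
export ESMF_INSTALL_PREFIX=/path/to/esmf/install  # placeholder: install location
make -j8 && make install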

lizziel self-assigned this on Oct 23, 2024
lizziel added the "category: Debug Help" and "topic: Runtime" labels and removed the "category: Bug" label on Oct 23, 2024
@lizziel
Contributor

lizziel commented Nov 12, 2024

@YueZhang720, were you able to resolve this issue?

@YueZhang720
Author

> @YueZhang720, were you able to resolve this issue?

It didn't work, so I tried GCHP 14.5.0 with [email protected]. When I use mpirun -np 6 ./gchp, this is the error message:

[node09:1761803] [[57744,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 320
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

When I use srun -n 48 -N 2 -m plane=24 --mpi=pmi2 ./gchp, the error is as follows:

--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM support. This usually happens
when OMPI was not configured --with-slurm and we weren't able
to discover a SLURM installation in the usual places.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node09:1762615] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM support. This usually happens
when OMPI was not configured --with-slurm and we weren't able
to discover a SLURM installation in the usual places.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node09:1762622] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM support. This usually happens
when OMPI was not configured --with-slurm and we weren't able
to discover a SLURM installation in the usual places.

Please configure as appropriate and try again.

Is there something wrong between my Slurm setup and ESMF? I have tried many times, but it hasn't worked out.
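
The second error message suggests the Open MPI installation itself lacks Slurm support. One way to check, sketched below, is to inspect how that Open MPI was configured; this assumes ompi_info from the same installation is on your PATH and is not a command taken from this issue:

# Show the configure line and any Slurm-related components of the active Open MPI
ompi_info | grep -i "configure command line"
ompi_info | grep -i slurm    # Slurm-related entries should appear if support was built in
                             # (exact component names vary across Open MPI versions)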

@lizziel
Contributor

lizziel commented Nov 21, 2024

Hi @YueZhang720, this still looks like an MPI issue. Do you have a system administrator on your cluster who can help look into the MPI configuration?
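
For reference, the srun error above points toward rebuilding Open MPI with Slurm support. A minimal sketch of such a rebuild is shown below; the install prefix and PMIx path are placeholders, and the exact flags depend on the Open MPI version and the cluster, so treat this as an illustration rather than a recipe:

# Reconfigure and rebuild Open MPI so it can be launched under Slurm.
# --with-slurm enables Slurm support; pointing at the cluster's PMIx (or PMI)
# installation is typically needed for direct launch with srun.
./configure --prefix=$HOME/opt/openmpi-5.0.5 \
            --with-slurm \
            --with-pmix=/path/to/pmix    # placeholder path
make -j8 && make install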


This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent this issue from being closed.

github-actions bot added the "stale" label on Dec 22, 2024