This issue was moved to a discussion.

ISSUES about pgigpu+openmpi with WCYCL1850 and CMPASO-NYF #5563

Closed
lulu1599 opened this issue Mar 28, 2023 · 11 comments

Comments

@lulu1599

Regarding E3SM maint-2.1, I tried to run it on a CPU platform with intel+impi, and it worked well with the case WCYCL1850+ne30pg2_EC30to60E2r2. However, when I tried to switch to a GPU platform using PGI+openmpi, I encountered the following issues:

  1. When I created the case with "--compset WCYCL1850 --res ne30pg2_EC30to60E2r2" on the GPU platform, the build failed. During "./case.build", I hit an NVLINK multiple-definition error:
    "nvlink error : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_1d_gpu_647_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'"
    I searched the forum but could not find a relevant solution. It seems to point to a compatibility issue between the PGI compiler and the source code (or perhaps my compile parameters), but I am not sure. Do you know what the cause might be?

  2. I then tried another case that activates only the ocean component, "--compset CMPASO-NYF --res T62_oEC60to30v3", to avoid the NVLINK problem. However, when I ran "./case.submit", the job did not run on the GPU; it still ran on the CPU. Did I miss something?

The config_machine.xml, cmake files, run scripts, and log files are attached. I hope an expert can help. Thanks!

config_machine.xml.txt
create_case.sh.txt
Depends.pgigpu.cmake.txt
pgigpu_lu-gpu.cmake.txt
preview_run.log

@philipwjones
Contributor

@lulu1599 On your second question, it looks like your files don't have the correct flags for building the GPU-enabled ocean. You will need a -DMPAS_OPENACC flag for preprocessing, and the Fortran flags should include -acc -Minfo=accel (the -ta flag is somewhat optional, but if you know the architecture it can sometimes help). It looks like the latter are enabled in LDFLAGS but not on the compile line. With the -Minfo flag on, the compiler should generate additional output about acceleration, which is a good way to see whether it is actually generating GPU instructions.
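As a rough aid for the flag check described above, the following minimal sketch reports which of the named flags are absent from a flag string. The flag names come from the reply; the function name `missing_flags` and the sample flag strings are hypothetical and stand in for whatever a machine's cmake file actually sets.

```python
import shlex

# Flags named in the reply above.
REQUIRED_CPP_DEFS = ["-DMPAS_OPENACC"]
REQUIRED_FFLAGS = ["-acc", "-Minfo=accel"]


def missing_flags(flag_string, required):
    """Return the required flags that do not appear as whole tokens in flag_string."""
    tokens = set(shlex.split(flag_string))
    return [flag for flag in required if flag not in tokens]


# Hypothetical values, mirroring the situation described in the reply:
# the OpenACC flags are on the link line but not on the compile line.
fflags = "-O2"
ldflags = "-acc -Minfo=accel"
cppdefs = "-DLINUX"

print("missing from Fortran compile flags:", missing_flags(fflags, REQUIRED_FFLAGS))
print("missing from link flags:", missing_flags(ldflags, REQUIRED_FFLAGS))
print("missing from preprocessor defs:", missing_flags(cppdefs, REQUIRED_CPP_DEFS))
```

The check matches whole tokens only, so a combined form such as `-Minfo=accel,inline` would be reported as missing; adapt the comparison if your build uses such forms.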

On the first, we have seen that in the past but I think it disappeared with later compiler versions - which version of Nvidia/PGI are you using?

@lulu1599
Author

lulu1599 commented Mar 29, 2023 via email

@lulu1599
Author

lulu1599 commented Mar 29, 2023 via email

@philipwjones
Contributor

Since the first one works, it seems you have a GPU-enabled executable, so I suspect this is an issue with how your job launcher/scheduler allocates resources and whether it supports multiple ranks per GPU.
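To illustrate what "multiple ranks per GPU" means in practice, here is a minimal sketch of the common round-robin mapping from a node-local MPI rank to a GPU index. It is an assumption for illustration only: the thread does not say how E3SM or the launcher assigns devices. `OMPI_COMM_WORLD_LOCAL_RANK` is the environment variable Open MPI sets for the node-local rank; `gpu_for_local_rank` and the value of `gpus_per_node` are hypothetical.

```python
import os


def gpu_for_local_rank(local_rank, gpus_per_node):
    """Round-robin mapping: ranks beyond the GPU count wrap around and share devices."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return local_rank % gpus_per_node


# Open MPI exports the node-local rank; default to 0 when run outside mpirun.
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
gpus_per_node = 4  # hypothetical node configuration

device = gpu_for_local_rank(local_rank, gpus_per_node)
print(f"local rank {local_rank} would use GPU {device}")
```

With more ranks than GPUs on a node, several ranks map to the same device, which only works if the scheduler and GPU configuration allow a device to be shared.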

@lulu1599
Author

OK, thanks! Your answers are very helpful!

@lulu1599
Author

lulu1599 commented Apr 6, 2023

Hi!

  1. I have another question: when I run the WCYCL1850 case on a GPU node, does the EAM component use the Kokkos library while mpas-o/mpas-si use OpenACC?
    I'm not sure whether the other components can be accelerated by the GPU. (I'm using E3SM v2.1.)

  2. Here are the cmake file and log file for my "--compset WCYCL1850 --res ne30pg2_EC30to60E2r2" case. I think it's a Kokkos error. Do you know how to fix it?
    e3sm.bldlog.230406-143814.txt
    pgigpu_hpc.cmake.txt

I'm new to running on GPUs. Thanks for your guidance!

@rljacob
Member

rljacob commented Apr 6, 2023

In v2.1, only mpas-o/mpas-si have GPU capability. EAM does use Kokkos in part of the code, but it's in "CPU mode".

@lulu1599
Author

lulu1599 commented Apr 7, 2023

OK, thanks for your reply.

@lulu1599
Author

Here's another question: when I use the mpas.part.* files in this path, the smallest partition file I can find is for 8 parts. I know this number corresponds to the number of ocean processes. How can I run with fewer than 8 ocean processes, such as 1, 2, or 4?

@philipwjones
Contributor

To create a new partition, use the gpmetis command-line tool from METIS (you'll need METIS on your machine; many of our supported machines have it available as a module). With the module loaded or METIS installed, run
gpmetis graph_file num_procs
where graph_file is the main graph file in the path you noted (the file with the same name but without the .part.x suffix).

That's for the ocean only. The sea-ice does some additional load balancing and the process is a bit more involved.
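The gpmetis step above can be scripted for several process counts. This is a minimal sketch, assuming `gpmetis` is on the PATH and that the output follows the `.part.N` suffix mentioned in the reply; the graph file name `mpas-o.graph.info` and the helper names are hypothetical. The counts 2 and 4 come from the question; a single-process run is left out because the thread does not say whether it needs a partition file.

```python
import os
import shutil
import subprocess


def gpmetis_command(graph_file, num_procs):
    """Build the argument list for `gpmetis graph_file num_procs`."""
    if num_procs < 1:
        raise ValueError("num_procs must be a positive integer")
    return ["gpmetis", graph_file, str(num_procs)]


def partition_file_name(graph_file, num_procs):
    """Partition file name, using the .part.N suffix described in the thread."""
    return f"{graph_file}.part.{num_procs}"


graph_file = "mpas-o.graph.info"  # hypothetical name; use the graph file in your input path

for num_procs in (2, 4):
    cmd = gpmetis_command(graph_file, num_procs)
    if shutil.which("gpmetis") is None or not os.path.exists(graph_file):
        # Dry run when METIS or the graph file is not available.
        print("would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
        print("expected output:", partition_file_name(graph_file, num_procs))
```

As noted in the reply, this covers the ocean only; the sea-ice partitioning involves additional load balancing and is not handled by this sketch.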

@lulu1599
Author

That’s really helpful!

@E3SM-Project E3SM-Project locked and limited conversation to collaborators Apr 22, 2023
@rljacob rljacob converted this issue into discussion #5623 Apr 22, 2023
