ISSUES about pgigpu+openmpi with WCYCL1850 and CMPASO-NYF #5563

lulu1599 · 2023-03-28T05:20:48Z

Regarding E3SM maint-2.1, I tried to run it on a CPU platform with intel+impi, and it worked well with the case WCYCL1850+ne30pg2_EC30to60E2r2. However, when I tried to switch to a GPU platform using PGI+openmpi, I encountered the following issues:

1. When I built with "--compset WCYCL1850 --res ne30pg2_EC30to60E2r2" on the GPU platform, the build failed. During "./case.build" I hit an NVLINK multiple-definition error:
"nvlink error : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_1d_gpu_647_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'"
I searched the forum but could not find a relevant solution. The error seems to point to a compatibility issue between the PGI compiler and the source code (or perhaps to my compile parameters), but I am not sure. Do you know what the cause might be?
2. I then tried another case that activates only the ocean component, "--compset CMPASO-NYF --res T62_oEC60to30v3", to avoid the NVLINK problem above. However, after I ran "./case.submit", the job did not run on the GPU and was still running on the CPU. Did I miss something?

The config_machine.xml, cmake files, running scripts, and log files are attached. Hope an expert can help me, thanks!

config_machine.xml.txt
create_case.sh.txt
Depends.pgigpu.cmake.txt
pgigpu_lu-gpu.cmake.txt
preview_run.log

philipwjones · 2023-03-28T16:04:19Z

@lulu1599 On your second issue, it looks like your files don't have the correct flags for building the GPU-enabled ocean. You will need a -DMPAS_OPENACC flag for preprocessing, and the Fortran flags should include -acc -Minfo=accel (the -ta flag is somewhat optional, but if you know the architecture it can sometimes help). It looks like the latter are enabled in LDFLAGS but not on the compile line. With the -Minfo flag on, the compiler should generate additional output about acceleration, which is a good way to check whether it is actually generating GPU instructions.

On the first issue, we have seen that in the past, but I think it disappeared with later compiler versions. Which version of Nvidia/PGI are you using?

lulu1599 · 2023-03-29T01:56:49Z

Thanks a lot! 1. I'm trying the flags you mentioned to see if they work. 2. My PGI version is 21.9-0 and my CUDA version is 11.4. Maybe I should try a newer PGI? Thanks again, Jingyu

lulu1599 · 2023-03-29T02:48:31Z

Hi! I have added these flags:
string(APPEND CPPDEFS " -DMPAS_OPENACC")
string(APPEND FFLAGS "-acc -Minfo=accel")
When I set the number of processes to 8 (-n 8) with --ngpu-per-node 8, e3sm.exe ran on the GPUs successfully. However, when I set the number of processes to 16 (-n 16) with --ngpu-per-node 8, I can't see my processes on the GPUs. Here is my run info; perhaps you can spot a clue?
![image](https://user-images.githubusercontent.com/51312558/228414079-d70b171e-4d7f-4541-b660-d800cc02818b.png)
![image](https://user-images.githubusercontent.com/51312558/228414141-cc5b3f93-9aae-4dab-bc5e-9e423e14e9c9.png)
![image](https://user-images.githubusercontent.com/51312558/228414154-92f0eb3c-450d-448d-b52b-fe490ff30690.png)
![image](https://user-images.githubusercontent.com/51312558/228414172-75c6fa31-2d68-4d52-9b29-f4b36f11b7b7.png)

philipwjones · 2023-03-29T17:34:59Z

Since the first one works, it seems you have a GPU-enabled executable, so I suspect this is an issue with how your job launcher/scheduler allocates resources and whether it supports multiple ranks per GPU.
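
For readers hitting the same 16-ranks-on-8-GPUs situation, here is a minimal sketch of one common way to share GPUs between ranks under Open MPI: a wrapper script that maps each node-local rank to a device round-robin. It is not taken from the E3SM scripts. The names `pick_gpu` and `NGPUS_PER_NODE` are hypothetical; `OMPI_COMM_WORLD_LOCAL_RANK` (set by Open MPI's launcher) and `CUDA_VISIBLE_DEVICES` are the only real interfaces assumed. Whether several ranks can actually share one device also depends on the GPU's compute mode and on the scheduler configuration.

```shell
#!/bin/sh
# Sketch only: bind each MPI rank to one GPU, round-robin by node-local rank.

# pick_gpu LOCAL_RANK NGPUS  -> prints the GPU index for that rank.
# (hypothetical helper, not part of E3SM or Open MPI)
pick_gpu() {
    echo $(( $1 % $2 ))
}

# Open MPI exports the node-local rank of each process in this variable.
LOCAL_RANK="${OMPI_COMM_WORLD_LOCAL_RANK:-0}"

# Hypothetical variable: number of GPUs on each node (8 in this thread).
NGPUS="${NGPUS_PER_NODE:-8}"

# Restrict this process to a single device.
CUDA_VISIBLE_DEVICES="$(pick_gpu "$LOCAL_RANK" "$NGPUS")"
export CUDA_VISIBLE_DEVICES

# Run the wrapped command, if one was given.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

With 16 ranks on one node and 8 GPUs, local ranks 0 and 8 would both land on device 0, ranks 1 and 9 on device 1, and so on. A launch line would look something like `mpirun -n 16 ./bind_gpu.sh ./e3sm.exe`, where `bind_gpu.sh` is whatever name the wrapper is saved under.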

lulu1599 · 2023-03-30T08:58:45Z

OK, thanks! Your answers are very helpful!

lulu1599 · 2023-04-06T06:45:24Z

Hi!

I have another question. When I run the WCYCL1850 case on a GPU node, does the EAM component use the Kokkos library while mpas-o/mpas-si use OpenACC?
I'm also unsure whether the other components can be accelerated on the GPU. (I'm using E3SM v2.1.)
Here are the cmake file and build log for my "--compset WCYCL1850 --res ne30pg2_EC30to60E2r2" case. I think it is a Kokkos error; do you know how to fix it?
e3sm.bldlog.230406-143814.txt
pgigpu_hpc.cmake.txt

I'm new to running on GPUs; thanks for your guidance!

rljacob · 2023-04-06T15:12:26Z

In v2.1, only mpas-o/mpas-si have GPU capability. EAM does use Kokkos in part of the code, but it's in "cpu mode".

lulu1599 · 2023-04-07T08:32:06Z

OK, thanks for your reply.

lulu1599 · 2023-04-11T01:02:16Z

Here's another question. Looking at the mpas.part.* files in this path, I found that the smallest partition file is for 8 processes. I know this is tied to the number of ocean processes. So how can I run with fewer than 8 ocean processes, such as 1, 2, or 4?

philipwjones · 2023-04-11T15:23:15Z

To create a new partition, use the gpmetis command line tool from metis (you'll need metis on your machine - many of our supported machines have it available as a module). If you have the module loaded or metis installed, just use
gpmetis graph_file num_procs
where graph_file is the main graph file in the path you noted (the file with same name but without the .part.x suffix).

That's for the ocean only. The sea-ice does some additional load balancing and the process is a bit more involved.
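
The ocean-only partitioning step above can be sketched as a small script. This is an illustration under assumptions, not an E3SM tool: the graph file name `mpas-o.graph.info` is a hypothetical placeholder for the main graph file in your mesh directory, and `part_file_name` and `make_partitions` are helper names made up here. The one behavior relied on is that gpmetis writes its result next to the input as `<graph_file>.part.<num_procs>`.

```shell
#!/bin/sh
# Sketch only: create ocean partition files for small process counts.

# Hypothetical placeholder; point this at the real graph file for your mesh.
GRAPH_FILE="${GRAPH_FILE:-mpas-o.graph.info}"

# part_file_name GRAPH NPARTS -> name of the file gpmetis will write.
part_file_name() {
    printf '%s.part.%s\n' "$1" "$2"
}

# make_partitions GRAPH N1 [N2 ...] -> run gpmetis once per process count.
make_partitions() {
    graph="$1"
    shift
    for nparts in "$@"; do
        gpmetis "$graph" "$nparts" || return 1
        echo "wrote $(part_file_name "$graph" "$nparts")"
    done
}

# Only attempt the partitioning when gpmetis and the graph file are present.
if command -v gpmetis >/dev/null 2>&1 && [ -f "$GRAPH_FILE" ]; then
    make_partitions "$GRAPH_FILE" 2 4
else
    echo "gpmetis or $GRAPH_FILE not found; nothing to do" >&2
fi
```

The resulting files would then sit alongside the existing partition files so that a run with the matching number of ocean processes can find them. As noted above, this covers the ocean only, not the sea-ice partitions.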

lulu1599 · 2023-04-13T02:32:19Z

That’s really helpful!

xylar mentioned this issue Mar 28, 2023

Revert z-star PR #5564

Merged

E3SM-Project locked and limited conversation to collaborators Apr 22, 2023

rljacob converted this issue into discussion #5623 Apr 22, 2023

This issue was moved to a discussion.
