Runtime error when turning on threading in a CAM simulation #941

Closed
sjsprecious opened this issue Dec 13, 2023 · 8 comments
Labels
bug Something isn't working correctly

Comments

@sjsprecious
Collaborator

What happened?

I tried to turn on the threading option in a CAM simulation (F2000climo compset, ne30pg3 resolution). I used one compute node on Derecho with 64 MPI tasks and 2 threads per MPI task. The case built successfully, but I encountered many runtime errors (a portion of them is listed below):

dec2481.hsn.de.hpc.ucar.edu 16: munmap_chunk(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 43: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 51: munmap_chunk(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 44: munmap_chunk(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 62: munmap_chunk(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 39: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 40: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 45: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 15: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 24: munmap_chunk(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 17: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 59: munmap_chunk(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 42: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 57: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 18: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 27: munmap_chunk(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 32: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 33: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 35: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 38: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 55: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 50: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 54: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 2: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 8: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 10: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 13: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 19: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 23: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 41: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 36: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 12: free(): invalid pointer
dec2481.hsn.de.hpc.ucar.edu 6: forrtl: error (76): Abort trap signal
dec2481.hsn.de.hpc.ucar.edu 6: Image              PC                Routine            Line        Source
dec2481.hsn.de.hpc.ucar.edu 6: libpthread-2.31.s  000014A3E75D48C0  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 6: libc-2.31.so       000014A3E2BEBCBB  gsignal               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 6: libc-2.31.so       000014A3E2BED355  abort                 Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 6: libc-2.31.so       000014A3E2C31AE7  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 6: libc-2.31.so       000014A3E2C39B6A  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 6: libc-2.31.so       000014A3E2C3B614  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 6: cesm.exe           000000000112BFFB  fvm_consistent_se         163  fvm_consistent_se_cslam.F90
dec2481.hsn.de.hpc.ucar.edu 6: libiomp5.so        000014A3E30F6053  __kmp_invoke_micr     Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 6: libiomp5.so        000014A3E30642F3  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 6: libiomp5.so        000014A3E3063232  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 6: libiomp5.so        000014A3E30F6DC1  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 6: libpthread-2.31.s  000014A3E75C86EA  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 6: libc-2.31.so       000014A3E2CB8A6F  clone                 Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 29: forrtl: error (76): Abort trap signal
dec2481.hsn.de.hpc.ucar.edu 29: Image              PC                Routine            Line        Source
dec2481.hsn.de.hpc.ucar.edu 29: libpthread-2.31.s  000014C84D0B88C0  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 29: libc-2.31.so       000014C8486CFCBB  gsignal               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 29: libc-2.31.so       000014C8486D1355  abort                 Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 29: libc-2.31.so       000014C848715AE7  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 29: libc-2.31.so       000014C84871DB6A  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 29: libc-2.31.so       000014C84871F614  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 29: cesm.exe           000000000112BFFB  fvm_consistent_se         163  fvm_consistent_se_cslam.F90
dec2481.hsn.de.hpc.ucar.edu 29: libiomp5.so        000014C848BDA053  __kmp_invoke_micr     Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 29: libiomp5.so        000014C848B482F3  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 29: libiomp5.so        000014C848B47232  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 29: libiomp5.so        000014C848BDADC1  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 29: libpthread-2.31.s  000014C84D0AC6EA  Unknown               Unknown  Unknown
dec2481.hsn.de.hpc.ucar.edu 29: libc-2.31.so       000014C84879CA6F  clone                 Unknown  Unknown

The complete list of errors can be found on Derecho at /glade/derecho/scratch/sunjian/cam6_run/F2000climo.ne30pg3_ne30pg3_mg17.derecho.intel.gpu00_pcols00016_mpi0064_thread002_rrtmgp/run/cesm.log.2648024.desched1.231212-143239.

What are the steps to reproduce the bug?

To reproduce the error on Derecho, run the following:

  • ./create_newcase --case /glade/derecho/scratch/sunjian/cam6/F2000climo.ne30pg3_ne30pg3_mg17.derecho.intel --mach derecho --res ne30pg3_ne30pg3_mg17 --compset F2000climo --compiler intel
  • cd /glade/derecho/scratch/sunjian/cam6/F2000climo.ne30pg3_ne30pg3_mg17.derecho.intel
  • ./xmlchange --file env_mach_pes.xml --id NTASKS --val 64
  • ./xmlchange --file env_mach_pes.xml --id NTHRDS --val 2
  • ./case.setup
  • ./case.build
  • ./case.submit
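The PE layout set by the two xmlchange calls above is 64 tasks with 2 threads each. A minimal sketch of the arithmetic behind fitting that layout on one node follows. It assumes 128 CPU cores per node and one thread per core, neither of which is stated in this issue, and `nodes_needed` is a hypothetical helper, not part of CIME.

```shell
#!/bin/sh
# Assumption: 128 CPU cores per node, one OpenMP thread per core.
CORES_PER_NODE=128

# Hypothetical helper: nodes required for a given NTASKS x NTHRDS layout.
nodes_needed() {
    ntasks=$1   # MPI tasks (NTASKS)
    nthrds=$2   # threads per MPI task (NTHRDS)
    total=$((ntasks * nthrds))
    # Ceiling division so a partially filled node still counts as one node.
    echo $(( (total + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

echo "64 tasks x 2 threads -> $(nodes_needed 64 2) node(s)"
```

Under these assumptions, 64 x 2 = 128 cores, which exactly fills one node.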

What CAM tag were you using?

cam6_3_139

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

/glade/derecho/scratch/sunjian/cam6/F2000climo.ne30pg3_ne30pg3_mg17.derecho.intel.gpu00_pcols00016_mpi0064_thread002_rrtmgp

Will you be addressing this bug yourself?

No

Extra info

No response

@sjsprecious sjsprecious added the bug Something isn't working correctly label Dec 13, 2023
@fvitt

fvitt commented Dec 13, 2023

I don't believe threading is supported when the SE dycore is used. Could you try threading with the FV dycore (--res f09_f09_mg17)?

@sjsprecious
Collaborator Author

Thanks @fvitt for your suggestion. Yes, switching to the FV dycore works with threading.

Is there a plan to support threading with the SE dycore in the future, or will it remain limited to the MPI-exclusive configuration?

@adamrher

I don't know of plans to support threading for the SE dycore.

@sjsprecious
Collaborator Author

Got it! Thanks @adamrher.

@sjsprecious
Collaborator Author

Hi @fvitt, it seems that I can only run the FV dycore with 2 threads per MPI task on Derecho. When I increase the number to 4, the simulation fails again with errors coming from the dycore. Is this expected, or should I set something specific for a larger number of threads?

@fvitt

fvitt commented Dec 14, 2023

@sjsprecious

In principle, you should be able to use 4 threads per MPI task.

When I tried threading on Derecho, I noticed the performance was quite poor, but the runs did not fail.

There is some discussion of how to run hybrid MPI+OpenMP jobs on slide 96 here:
https://www2.cisl.ucar.edu/sites/default/files/2023-08/2023%20Derecho%20Overview%20for%20NCAR%20Labs.pdf

I have not yet tried the suggestions for process binding. Do we have the correct mpiexec arguments for threading?

@sjsprecious
Collaborator Author

Hi @fvitt, thanks a lot for your suggestion. It turns out that I need to manually add the arguments you pointed to for a hybrid MPI/OpenMP job when I use more than 2 threads per MPI task. It is working now.

Yes, the threading performance of CAM is poor on Derecho, and the reason is again that the arguments you suggested for hybrid MPI/OpenMP jobs are not set in CAM by default. When I manually add those arguments to the MPI command, the performance of the hybrid MPI/OpenMP job is restored relative to the MPI-exclusive configuration.

CISL is working on a wrapper script that will avoid the need to add this long argument list manually. I will test it in CAM once it is in good shape.
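For readers unfamiliar with process binding in hybrid jobs, the general idea can be sketched as giving each MPI rank its own contiguous block of cores, one per thread. This is only an illustration: `core_range` is a hypothetical helper, the 0-based contiguous core numbering is an assumption, and none of this reproduces the actual mpiexec arguments or the CISL wrapper script.

```shell
#!/bin/sh
# Hypothetical helper: print the contiguous core range a local MPI rank
# would own if every rank is given `threads` consecutive cores.
core_range() {
    rank=$1      # local rank on the node, 0-based
    threads=$2   # OpenMP threads per MPI task
    first=$((rank * threads))
    last=$((first + threads - 1))
    echo "${first}-${last}"
}

# Example: 4 threads per task, first three ranks on a node.
for r in 0 1 2; do
    echo "rank $r -> cores $(core_range "$r" 4)"
done
```

Without a layout of this kind, the threads of different ranks may end up sharing cores, which is one plausible reason unbound hybrid runs perform poorly.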

@sjsprecious
Collaborator Author

The mpibind script was introduced by this PR (ESMCI/ccs_config_cesm#139). Closing this issue now.
