Alberto F. Martin edited this page Jul 28, 2020 · 7 revisions

Remarks/tips/lessons learned while using/developing MPI-parallel programs in Julia

  1. To the best of my knowledge (@amartinhuertas), the only way of "debugging" MPI.jl parallel programs at present is "print statement debugging". We have observed that messages printed to stdout using println by the Julia processes running on different MPI tasks are not atomic, but get broken up and intermixed nondeterministically. However, print("something\n") is more likely to end up on a single line than println("something") (thanks to @symonbyrne for this very useful trick). More definitive solutions are being discussed in this issue of MPI.jl.
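
The trick above can be tried with a small test script. The following is a minimal sketch, assuming MPI.jl is installed in the active project; the file name `print_debug.jl` and the `[rank N]` prefix are illustrative choices, not something prescribed by MPI.jl.

```shell
# Write a small MPI.jl test script; file name and message format are illustrative.
cat > print_debug.jl <<'EOF'
using MPI
MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)
# Build the full line, newline included, and emit it with one print call
# rather than println, so it is less likely to be split across tasks.
print("[rank $rank] something\n")
flush(stdout)
MPI.Finalize()
EOF

# Launch it on 4 tasks (requires mpirun and a Julia project with MPI.jl):
#   mpirun -np 4 julia --project=. print_debug.jl
```

Prefixing each message with the rank makes it easier to attribute lines to tasks even when their order is interleaved.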

  2. Some people have used tmpi (https://github.com/Azrael3000/tmpi) for running multiple sessions interactively, and we could try using the @mpi_do macro in MPIClusterManagers (I have explored neither of them). If I am not wrong, the first alternative may involve multiple gdb debuggers running in different terminal windows, and a deep knowledge of the low-level C code underlying Julia (see https://docs.julialang.org/en/v1/devdocs/debuggingtips/ for more details). I wonder whether, e.g., https://github.com/JuliaDebug/Debugger.jl could be combined with tmpi.

  3. To reduce JIT lag, it is essentially mandatory to build a custom system image of (some of) the GridapDistributed.jl dependencies, e.g., Gridap.jl. See https://github.com/gridap/Gridap.jl/tree/julia_script_creation_system_custom_images/compile for more details (TO BE UPDATED WHEN BRANCH julia_script_creation_system_custom_images is merged into master). Assuming that the Gridap.jl image is named Gridapv0.10.4.so, one may run the parallel MPI.jl program as:

    mpirun -np 4 julia -J ./Gridapv0.10.4.so --project=. test/MPIPETScDistributedPoissonTests.jl
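
Julia exits with an error if the file passed to `-J` does not exist, so a launcher script may want to fall back to the default system image in that case. A minimal sketch, in which the helper name `julia_cmd` and the fallback behaviour are assumptions made for illustration:

```shell
# Hypothetical helper: print the julia command line, adding -J only when the
# custom system image file is actually present.
julia_cmd() {
  sysimage=$1
  if [ -f "$sysimage" ]; then
    echo "julia -J $sysimage --project=."
  else
    echo "julia --project=."
  fi
}

# Usage (command substitution left unquoted on purpose so it splits into words):
#   mpirun -np 4 $(julia_cmd ./Gridapv0.10.4.so) test/MPIPETScDistributedPoissonTests.jl
```

Without the image the run still works, only with the usual JIT lag; paths containing spaces would need a different approach than plain word splitting.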
    
  4. Precompilation issues of MPI.jl in parallel runs. See here for more details.

  5. On NCI@Gadi (I do not know about other systems), I am getting per-task core dump files on crashes (e.g., SEGFAULT). This is a problem, since file system space on Gadi is limited and such core dump files are not particularly lightweight. I wrote to Gadi support and got the following answer (I have not yet explored any of it):

Hi,

I am really not sure what can be done here. You are running an MPI Julia program that in turn
calls PETSc. It looks like both PETSc and Julia have their own signal handlers, so they may potentially
override the core dump settings. I am not sure how Julia is calling PetscInitialize(), but if it does call it directly,
you may try adding -no_signal_handler to it. It appears you can also put it into ~username/.petscrc
(https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html).

Alternatively, it could be that the ulimit -c setting is not propagating to all nodes ... and yes,
it doesn't. I guess you can try using a wrapper to set it on all nodes (wrapper.csh):

    #!/bin/tcsh
    limit coredumpsize 0
    ./a.out

Replace a.out with the name of your program and then run mpirun ./wrapper.csh and see if this helps.

It is better to use a csh script, as you will get

    vsetenv BASH_FUNC_module%% failed
    vsetenv BASH_FUNC_switchml%% failed

errors from bash ... and it is too late for me to try to figure out where they come from.

Best wishes

Andrey
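
The wrapper idea from the reply above can also be sketched so that it takes the program as arguments instead of hard-coding `./a.out`. This version uses POSIX `sh` rather than tcsh and is untested on Gadi, so it may or may not hit the environment errors mentioned in the reply; the file name `nocore.sh` is illustrative.

```shell
# Create a wrapper that disables core dumps, then runs whatever command it is given.
cat > nocore.sh <<'EOF'
#!/bin/sh
# Set the core file size limit to 0 for this process and its children.
ulimit -c 0
# Replace the wrapper with the real program, keeping all arguments intact.
exec "$@"
EOF
chmod +x nocore.sh

# Each MPI task then starts through the wrapper, e.g.:
#   mpirun -np 4 ./nocore.sh julia --project=. test/MPIPETScDistributedPoissonTests.jl
```

Because the limit is set inside the wrapper, it applies on every node where a task is started, regardless of whether the login shell's `ulimit -c` setting propagates.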