Memory leak with UCX and OpenMPI 3.1.x #2921
Comments
Running strace on one of the MPI processes shows a steady stream of brk() calls that is not present when running without UCX (PML=ob1):

[pid 11573] 0.011080 brk(NULL) = 0x67dc000

I'll see if I can tell where they are being called from...
Attaching to the same MPI process as in the previous comment with gdb and setting breakpoints on mmap, brk, and sbrk shows:

(gdb) break mmap
Breakpoint 3, 0x00002b7a62311db0 in ucm_override_sbrk () from /lib64/libucm.so.0
Breakpoint 3, 0x00002b7a5c9bb4b0 in sbrk () from /usr/lib64/libc.so.6
Breakpoint 2, 0x00002b7a5c9bb440 in brk () from /usr/lib64/libc.so.6
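The ucm_override_sbrk frame in that backtrace shows UCX's libucm interposing the allocator's break calls. As an illustration of that general mechanism only (this is not UCX's actual implementation), a minimal LD_PRELOAD-style sbrk interposer might look like this:

/* Illustrative sketch, not UCX code: interpose sbrk by defining the
 * symbol in a shared object and forwarding to libc's version via
 * dlsym(RTLD_NEXT, ...). Load with LD_PRELOAD to log heap extensions. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

void *sbrk(intptr_t increment)
{
    static void *(*real_sbrk)(intptr_t);
    if (!real_sbrk)
        real_sbrk = (void *(*)(intptr_t))dlsym(RTLD_NEXT, "sbrk");

    void *old_break = real_sbrk(increment);

    if (increment != 0) {
        char msg[64];
        int n = snprintf(msg, sizeof msg, "sbrk(%+ld) -> %p\n",
                         (long)increment, old_break);
        write(STDERR_FILENO, msg, (size_t)n); /* avoid malloc in the hook */
    }
    return old_break;
}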
@ca-taylor thanks a lot! We need to release the datatype cache in OMPI. Will update when we have a fix.
Ok. I have a git-clone of the repo so I can pull and test whenever you are ready.
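For reference, here is a minimal reproducer sketch for the datatype-cache hypothesis above (hypothetical code, not one of the applications mentioned in the report): it creates, uses, and frees a fresh derived datatype each iteration, so resident memory should stay flat if the cache is released correctly and grow steadily if entries leak.

/* Hypothetical reproducer: churn derived datatypes so that a
 * per-datatype cache that is never released shows up as steady RSS
 * growth. Run with two or more ranks, e.g.
 *   mpirun -np 2 --mca pml ucx ./datatype_churn */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { N = 1024 };
    double *buf = malloc((size_t)N * N * sizeof(*buf));

    for (int i = 0; i < 100000; i++) {
        /* A fresh derived datatype each iteration forces a new cache entry. */
        MPI_Datatype col;
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &col);
        MPI_Type_commit(&col);

        if (size >= 2) {
            if (rank == 0)
                MPI_Send(buf, 1, col, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(buf, 1, col, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        MPI_Type_free(&col);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}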
We are seeing a gaping memory leak when running OpenMPI 3.1.x (or 2.1.2, for that matter) built with UCX support. The leak shows up whether the “ucx” PML is specified for the run or not. The applications in question are arepo and gizmo, but I have no reason to believe that others are not affected as well.
Basically the MPI processes grow without bound until SLURM kills the job or the host memory is exhausted.
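One way to quantify that growth from inside the process on Linux (an illustrative helper, not part of the original report) is to sample the resident set size from /proc/self/statm between iterations:

/* Illustrative helper: return this process's resident set size in KiB
 * by reading /proc/self/statm (field 2 is RSS in pages), or -1 on
 * failure. Linux-specific. */
#include <stdio.h>
#include <unistd.h>

long rss_kib(void)
{
    long size_pages, rss_pages;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld %ld", &size_pages, &rss_pages) != 2) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return rss_pages * (sysconf(_SC_PAGESIZE) / 1024);
}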
If I configure and build with “--without-ucx” the problem goes away.
Details:
—————————————
RHEL7.5
OpenMPI 3.1.2 (and any other version I’ve tried).
ucx 1.2.2-1.el7 (RH native)
RH native IB stack
Mellanox FDR/EDR IB fabric
Intel Parallel Studio 2018.1.163
Configuration Options:
—————————————————