collective operations with large data sizes #122
Ok, this is going to be tedious 😐

Opened a branch.

@francescopt Can I use your test code in a test? Thanks

sure, go ahead

Things are more annoying than expected, since the 32-bit limit is an MPI standard limit (all buffer sizes are specified as `int`).
The first "solution", since this problem appears for serialized type (I mean, for basic type, they're plain MPI issue, doesn't mean we should not deal with it at some point) and serialized types are typically not primitives. We could set a minimal archive size (like 8 or 16 bytes) and allocate the internal buffer with that type. To do that in a user friendly way (although we could templatize the slot size with a default value ?) we need a useful compromise between the maximum buffer size Any idea ? |
I just found out that the issue of the size of MPI data is apparently known, and there is a C library that addresses it: BigMPI. In the GitHub repository there is also an academic paper discussing the issue. As for the minimum buffer size: I am not an expert, but could the page size be a good choice in terms of efficiency?
Page size looks a little big. The underlying implementation will use different types of communication depending on message size (remote buffer on send for small messages, two steps (size + payload) for bigger communications, etc.).
The long-term fix could be different; for example, it could be based on Probe.
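For reference, a sketch of what a probe-based receive looks like in plain MPI (an illustration, not the proposed patch): the receiver learns the pending message size before allocating, so the library would not have to ship the size in a separate message.

```cpp
#include <mpi.h>
#include <vector>

// Illustration only: receive a message of unknown size by probing first.
std::vector<char> recv_probed(int src, int tag, MPI_Comm comm)
{
    MPI_Status status;
    MPI_Probe(src, tag, comm, &status);        // wait until a message is pending
    int count = 0;
    MPI_Get_count(&status, MPI_BYTE, &count);  // size of the pending message

    std::vector<char> buf(count);
    MPI_Recv(buf.data(), count, MPI_BYTE, src, tag, comm, MPI_STATUS_IGNORE);
    return buf;
}
```

Note that `MPI_Get_count` still reports an `int` count, so probing alone does not lift the 2^31 limit; it would have to be combined with a wider counting unit or chunking.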
OK, I guess this is more a matter of individual cases. For what I have in mind, a max size of 64 GB would be OK; that corresponds to a slot size of 64 GB / 2^31 = 32 bytes. I tried to experiment with the maximum size of scattered data.
The library crashes when performing a collective operation like `gather` when the size of the objects is very large, though still at sizes that should be manageable on supercomputers. The issue is not easy to reproduce, because it requires quite some memory available. The following program illustrates this:
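The original snippet is not reproduced above; a minimal sketch in the same spirit might look like the following (only the struct name `huge` comes from the report, the payload type and sizes are assumptions):

```cpp
#include <boost/mpi.hpp>
#include <boost/serialization/vector.hpp>
#include <cstddef>
#include <vector>

namespace mpi = boost::mpi;

// The serialize() member forces Boost.MPI to treat `huge` as a
// non-primitive (serialized) type rather than a plain MPI datatype.
struct huge {
    std::vector<char> data;
    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) { ar & data; }
};

int main(int argc, char* argv[])
{
    mpi::environment env(argc, argv);
    mpi::communicator world;

    huge h;
    h.data.resize(std::size_t(1) << 30);   // ~1 GB of payload per rank

    if (world.rank() == 0) {
        std::vector<huge> all;
        mpi::gather(world, h, all, 0);      // crashes once the gathered
    } else {                                // archives approach 2^31 bytes
        mpi::gather(world, h, 0);
    }
    return 0;
}
```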
The program creates an object taking 1 GB of memory. The struct `huge` is defined so as to force the library to use a non-primitive MPI type. When run with only 2 tasks, it crashes. On the JUWELS supercomputer, which has Boost 1.69, the same kind of error appears.
It appears that the program crashes around this line in `gather.hpp`. The same crash occurs even when running with 1 task.
Reducing the size of `huge` still makes the program crash; this time the crash appears to happen at this line of `gather.hpp`.
My impression is that the sizes are sometimes stored as `int` when they should be a `size_t`. For example, in the line above, `oasizes` is a `std::vector` of `int`: even if the size of a single object fits into an `int`, the total buffer of gathered objects could exceed 2^31.
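To make the failure mode concrete, here is a toy illustration (not library code; everything except the name `oasizes` is made up): each per-object size fits in an `int`, but the total does not.

```cpp
#include <cstdio>
#include <vector>

int main()
{
    // Three gathered archives of 1 GiB each: each size fits in an int.
    std::vector<int> oasizes(3, 1 << 30);

    long long total = 0;                    // wide enough for the real total
    for (int s : oasizes) total += s;

    // Storing the total back into an int (as the library effectively does)
    // cannot represent 3 GiB: on typical implementations the value wraps
    // modulo 2^32 and comes out negative.
    int truncated = static_cast<int>(total);

    std::printf("actual total: %lld bytes, as int: %d\n", total, truncated);
    return 0;
}
```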