Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in MPI_Reduce #195

Merged
merged 1 commit into from
Dec 10, 2021
Merged

Fix bug in MPI_Reduce #195

merged 1 commit into from
Dec 10, 2021

Conversation

csegarragonz
Copy link
Collaborator

@csegarragonz csegarragonz commented Dec 9, 2021

In this PR I solve a bug in distributed reduce operations that re-use the send and receive buffers across different MPI_Reduce calls. This bug was first introduced with #191.

When performing the locality-aware reduce in a local leader that is not co-located with the rank that originated the reduce, we perform the following operations:

  1. Receive data from local ranks.
  2. Reduce data from local ranks (and ours).
  3. Send reduced results to the rank that originated the reduce.

As the local leader is not expected to be receiving anything, it may be the case that, the recvBuffer in the method's signature is a nullptr (seems to be an unwritten agreement across MPI developers). Therefore, in (2) we were accumulating the partial reduce in the same buffer the local leader's data was provided (i.e. sendBuffer), and then sending this buffer in (3).

Unofrtunately, said sendBuffer is a pointer to the application's memory, and the application is not expecting that memory to be modified. In particular, if subsequent reduces are performed, the results returned by reduce are incorrect.

@csegarragonz csegarragonz self-assigned this Dec 9, 2021
@csegarragonz csegarragonz added bug Something isn't working mpi Related to the MPI implementation labels Dec 9, 2021
@eigenraven
Copy link
Collaborator

Does this fix any of the ignored mpi TSan race conditions by any chance? IIRC it was failing on reductions

@csegarragonz
Copy link
Collaborator Author

@KubaSz trying now. whilst we are at it, does #192 fix the ignore list entry here?

@csegarragonz csegarragonz merged commit 00d2d96 into master Dec 10, 2021
@csegarragonz csegarragonz deleted the reduce-bug branch December 10, 2021 09:33
@eigenraven
Copy link
Collaborator

@KubaSz trying now. whilst we are at it, does #192 fix the ignore list entry here?

Yes, it does, I forgot to remove it from the ignorelist

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working mpi Related to the MPI implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants