Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not use emplace for editing vector elements #143

Merged
merged 3 commits into from
Sep 20, 2021
Merged

Conversation

csegarragonz
Copy link
Collaborator

@csegarragonz csegarragonz commented Sep 17, 2021

Pretty simple, silly, yet nasty bug with the MPI implementation.

In two different std::vectors (which we previously filled by emplace_back-ing nullptrs, we used emplace(<offset>, <thing to be moved>) to modify the element at offset <offset>.

Turns out that this is not how emplace works. Citing cppreference:

Inserts a new element into the container directly before pos.

Thus, our vector elements were moved all around the place in a non-deterministic manner. The fact that this worked sometimes is because:

  • We lazy-initialise these vectors. If the element is not initialised, we create a new one. Thus, every time we created a new element at the right index, and used it (just once). This only happened a limited number of times.
  • Given that the rank distribution we test with is always lower-half host 1, upper-half host 2, and the vectors are sparse by nature, overwrites were occasional.

This would have worsen dramatically with more aggressive testing or different rank-host distributions. I checked, and this does not happen anywhere else in the codebase.

Should have read the reference better before hand.

@csegarragonz csegarragonz added the bug Something isn't working label Sep 17, 2021
@csegarragonz csegarragonz self-assigned this Sep 17, 2021
@@ -33,7 +33,7 @@ class MessageEndpoint
MessageEndpoint(const std::string& hostIn, int portIn, int timeoutMsIn);

// Delete assignment and copy-constructor as we need to be very careful with
// socping and same-thread instantiation
// scoping and same-thread instantiation
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caught this re-visiting the sources so figured out i'd just change it.

Copy link
Collaborator

@Shillaker Shillaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. With bugs like this it's good to add a test that would have failed before, but now passes (if possible). In this case perhaps it could combine calls to initRemoteMpiEndpoint and getUnackedMessageBuffer to check that things work?

Copy link
Collaborator

@Shillaker Shillaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on the tests. I've added some comments about how I don't think we need to mark them as test-only or throw exceptions, so once that stuff's been removed I think it's good to go.

include/faabric/scheduler/MpiWorld.h Outdated Show resolved Hide resolved
src/scheduler/MpiWorld.cpp Outdated Show resolved Hide resolved
src/scheduler/MpiWorld.cpp Outdated Show resolved Hide resolved
@csegarragonz csegarragonz merged commit b558ee2 into master Sep 20, 2021
@csegarragonz csegarragonz deleted the emplace-bug branch September 20, 2021 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants