-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add distributed coordination operations #161
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, very glad this is finally done 🎉
Just some minor comments.
Before merging though, could we do a bump faabric PR in faasm to make sure this does not break anything?
nSums = 1000; | ||
} | ||
|
||
// Spawn n-1 child threads to add to shared sums over several barriers so |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
several
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean by quoting a single word... The test is running the sum operations in a loop, so it's invoking several barriers. Does that make sense?
src/transport/PointToPointBroker.cpp
Outdated
{ | ||
std::vector<uint8_t> data(1, 0); | ||
|
||
ptpBroker.sendMessage(groupId, 0, groupIdx, data.data(), data.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always assume the master's index to be 0 right? Maybe it would make the code more readable if this value was hardcoded somwhere. It is sometimes hard to understand that the 0 correponds to the master index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point, we frequently hard-code this zero in the MPI code too, i.e. if(rank == 0) { // Do stuff for master }
, e.g. https://github.com/faasm/faabric/blob/master/src/scheduler/MpiWorld.cpp#L1216
I'll change for the ptp stuff but would be good to switch the MPI code at some point too.
Faasm PR checking here: faasm/faasm#531 |
Adds generic utilities to support distributed coordination primitives such as locks and barriers.
It works as follows:
PointToPointBroker
class, which deals withPointToPointGroups
, where each function in the group can coordinate with one another across hosts.PointToPointBroker
just deals with agroupId
(integer) to refer to each one.appId
and an optionalappIdx
, e.g. an MPI rank, or OpenMP thread ID), and potentially a group as well (with agroupId
and agroupIdx
). There may be many groups in a single app, therefore thegroupIdx
andappIdx
are treated separately (although may be set to the same thing if there's only one group in the app).groupId
andgroupIdx
set on the underlyingMessage
s. The scheduler uses this information to transparently set up the point-to-point mappings required to do the required messaging.PointToPointServer
. When the lock has been successfully acquired, a corresponding point-to-point message will be sent back to the group index that originally requested the lock. If the lock is requested locally, we do the same thing, just without the request to the remotePointToPointServer
.PointToPointServer
if remote.I've added a group lock/ unlock surrounding writing snapshot diffs. This could otherwise cause problems by overwriting regions of memory while another thread was in a critical section.
For posterity, this PR is a rewrite of #141