Blocking gets are unordered #233
Related comments from @minsii: "I am still confused how the out-of-order cores can reorder blocking gets and be visible to user programs", and:

For network-offloaded get: "Should the mechanism of (3) ensure that (2) has already been performed and completed?"

For active-message based get: "I could imagine out-of-order execution of (3) and (4) in the AM-based case, but (3) must be done when the program loads."

Reading again the slides @anshumang used in WG calls, I understood that the proposal is to require a fence where ordering is needed between independent get operations. Now thinking about the threaded program, where ...
How are you synchronizing between the T0 and T1 threads in the example above, and why does that synchronization not resolve the RAW consistency issue? A consistent view of memory across threads is a problem that should be addressed by the threading package, not by OpenSHMEM. For more details on why attempting to define a memory model in a library is a poor choice, see: http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf
4fbc3ae is a first attempt at updating the RMA and non-blocking RMA sections of the spec related to ordering.
As a matter of process, you might want to limit this proposal to the minimal set of edits needed to add/update the desired semantics. Standard-ese is notoriously hard to get right. There appear to be many edits in here that are unrelated to the goal of this proposal, but will likely delay its progress. It would also be helpful if, outside of the proposal, you enumerated the semantic updates that the proposal introduces to help a reader, since the semantic changes may be subtle.
Is this semantic intended to target a usage model like this?
If so, I'm not sure overloading ...
@jdinan Will update the PR with only the edits related to the proposed semantic change in the get APIs, as you suggest. I will go ahead and create a separate description of the semantic changes, noting that it may be dropped based on what is agreed on here.
@jdinan: I agree with your point. That is also the reason why I listed all possible use cases, in order to figure out whether we need to change ...
@anshumang The changes in anshumang@d31c087 look reasonable. One suggestion: the "destination PE" wording feels a bit odd. What about "calling PE"?
Regarding example 2, my reading of the specification suggests that statement PE2.c must see the value written at PE0.a without any additional OpenSHMEM calls by PE 2.
@jdinan That matches my interpretation too. But to guarantee this semantic, a memory barrier is needed in the implementation of get APIs on relaxed memory model architectures like Power, ARM, and GPUs. This ticket aims to relax the requirement that PE2.c must observe PE0.a. Expanding the scope of fence() to enforce ordering in such a case (PE2.c must observe PE0.a) is also part of the proposal (#232).
@anshumang @jdinan I think example 2 is a good justification. If I want to implement ...

Now my questions are: ...
@anshumang Before we can introduce a new semantic, I think we need a ticket to clarify the existing memory model. Once there is agreement on existing semantics, we should be able to address some of these inefficiencies. |
@minsii (1) Yes, I believe this program (assuming get/put operations) already has well-defined behavior in OpenSHMEM. (2) ...
I'm a bit lost between multiple tickets discussing the same stuff... @minsii I think you asked earlier how a "blocking" shmem_get call gets reordered? It is covered by the earlier example in #229, where 3 PEs implement synchronization using the blocking get routine. Within a shared memory space it is the equivalent of `load` instructions, which can be executed out of order (even though the call is blocking). @jdinan @anshumang probably we can extend the ...
@shamisp Thanks for the clarification. Actually, after thinking about the example again (PE1 and PE2 are on shared memory), it looks like introducing a memory barrier on PE2 is not sufficient to make the program correct. That is, if my put operations are handled as active messages, the store operations on PE1 can also be out of order. Does the current shmem_fence include a memory barrier on PE1?
My thought is that shmem_fence is supposed to guarantee ordering of the puts. So if your implementation uses active messages, you'll have to 'do something' to guarantee ordering. Other thoughts?

```
PE0                  PE1                PE2
a. p(x,1, PE1);      a. store x=1;
b. shmem_fence;      b. fence
c. p(y,1, PE1);      c. store y=1;      a. while(g(y, PE1) != 1); // while (load y != 1)
                                        b. memory barrier
                                        c. assert(g(x, PE1) == 1); // assert (load x == 1)
```
@shamisp Sorry for the confusion. The discussion on #229 progressed quickly, so the comments are spread across this ticket and #229. Also removed the long prefix.
@jdinan As you suggested, I will create a new ticket for clarifications of the spec before we move forward on this ticket.
@minsii @bcernohous If your library unpacks the messages out of the network buffer into the application buffer, technically you would have to put a store barrier somewhere, or introduce some artificial dependency that orders the stores. Practically, with a lot of interconnects, in order to process the completion queue that notifies you about message arrival, you have to execute some sort of barrier anyway.
@anshumang Semantically, we should decouple blocking semantics from ordering semantics. There is an implicit assumption there that instructions are executed in program order...
I suppose this is not for OpenSHMEM-1.5. |
Yes, this will land in 1.6. IIRC, the conclusion we reached in this discussion is that blocking fetching atomics are ordered and that nonblocking variants are not. We may also want to introduce a blocking variant that is unordered. |
Umbrella issue #229
A blocking get operation (shmem_g, shmem_get, shmem_iget), as defined in the OpenSHMEM specification v1.4, returns only after the data has been delivered to the destination array at the local PE. To preserve ordering among independent get operations, this requires implementations on relaxed memory model architectures to include appropriate memory barriers in each get operation, resulting in sub-optimal performance.
UPDATE: The description of blocking get enforces completion, but does that also imply ordering? If not, the specification could use more clarification on this point.
This proposal places the requirement on the programmer to use a fence where ordering is required between independent get operations.
UPDATE: The reordering can be observed from another thread on the calling PE. Is there any other way to observe the reordering? The suggestion from @jdinan below is to not consider the memory view from other threads when deciding OpenSHMEM semantics.
The completion semantics of get remain unchanged from the v1.4 specification: the result of the get is available for any dependent operation that appears after it, in program order.
In example 1, line b is data dependent on the result of the get operation in line a. Lines a and b are guaranteed to execute in program order. Hence, the output where j takes value 0 is prohibited.
Example 1

Input: x(1) = 1, i = 0, j = 0

```
a: i = g(x,1)
   |  data dependency
   v
b: j = i
```

Output: i = 1, j = 0
OpenSHMEM v1.4: prohibited
Current proposal: prohibited
In example 2, the get operations on lines a and c of PE2 are unrelated and can be reordered per this proposal. Hence, the outcome where j takes the value 0 at line c even after i takes the value 1 at line a is allowed, as observed from PE 2.
Example 2

Input: x(1) = 0, y(1) = 0, i = 0, j = 0

```
PE 0
a: p(x,1)
b: fence()
c: p(y,1)

PE 2
a: i = g(y, 1)
b: use i
   |  c can reach PE1 before a (requires fence for ordering)
   v
c: j = g(x, 1)
```

(UPDATE: b is unnecessary; the interpretation of spec v1.4 is that a and c are ordered even without b.)

Output: x(1) = 1, y(1) = 1, i = 1, j = 0
OpenSHMEM v1.4: prohibited
Current proposal: allowed
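Under the proposal, a program that needs the v1.4 outcome back would order the two gets explicitly. A sketch in the same notation (assuming fence() is extended to order gets at the calling PE, per #232):

```
PE 2
a: i = g(y, 1)
b: fence()       // orders a before c at the calling PE
c: j = g(x, 1)   // now guaranteed to observe x(1) = 1 once i == 1
```

This makes the barrier cost explicit and pay-per-use, instead of implicit in every blocking get.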