Blocking gets are unordered #233

Open
anshumang opened this issue Jul 25, 2018 · 23 comments

anshumang commented Jul 25, 2018

Umbrella issue #229

A blocking get operation (shmem_g, shmem_get, shmem_iget), as defined in the OpenSHMEM specification v1.4, returns only after the data has been delivered to the destination array at the local PE. For independent get operations, this requires implementations on relaxed-memory-model architectures to include appropriate memory barriers in each get operation, resulting in sub-optimal performance.
UPDATE : The description of blocking get enforces completion. But does that also imply ordering? If not, the specification could clarify this point.

This proposal places the requirement on the programmer to use a fence where ordering is required between independent get operations.
UPDATE : The reordering can be observed from another thread in the calling thread's PE. Is there any other way to observe the reordering? The suggestion from @jdinan below is to not consider the memory view from other threads when deciding OpenSHMEM semantics.

The completion semantics of get remain unchanged from the v1.4 specification: the result of the get is available for any dependent operation that appears after it, in program order.

In example 1, line b is data dependent on the result of the get operation in line a. Lines a and b are guaranteed to execute in program order. Hence, the output where j takes value 0 is prohibited.

Example 1
Input : x(1) = 1, i = 0, j = 0
a: i = g(x,1)
| data dependency
V
b: j = i
Output : i = 1, j = 0
OpenSHMEM v1.4 : prohibited
Current proposal : prohibited

In example 2, the get operations on lines a and c of PE 2 are unrelated and can be reordered under this proposal. Hence the outcome where j takes value 0 on line c even though i takes value 1 on line a is allowed, as observed from PE 2.

Example 2
Input : x(1) = 0, y(1) = 0, i = 0, j = 0
PE 0
a: p(x,1)
b: fence()
c: p(y,1)

PE 2
a: i = g(y, 1)
b: use i (UPDATE : b is unnecessary, the interpretation of spec v1.4 is that a and c are ordered even without b)
| c can reach PE1 before a (requires fence for ordering)
V
c: j = g(x, 1)

Output : x(1) = 1, y(1) = 1, i = 1, j = 0
OpenSHMEM v1.4 : prohibited
Current proposal : allowed

@anshumang (Author)

Related comments from @minsii -

I am still confused about how out-of-order cores can reorder blocking gets in a way that is visible to user programs, such that a fence between blocking gets becomes necessary. Below is my understanding; it might be incorrect or incomplete. Could you please give a more detailed explanation?

For network-offloaded get:

shmem_get(dest, P1)
  -- (1) CPU issues network read to P1
  -- (2) network transfers data from remote P1 to local dest buffer
  -- (3) CPU confirms local completion of (2) and then return to user program

Shouldn't the mechanism in (3) ensure that (2) has already been performed and completed?

For active-message based get:

shmem_get(dest, P1)
  -- (1) CPU issues read-request packet to P1
  -- (2) CPU waits till received ack from P1
  -- (3) CPU copies data into local dest buffer
  -- (4) return to user program
load dest;

I could imagine out-of-order execution of (3) and (4) in the AM-based case, but (3) must be complete by the time the program loads dest.

Reading again the slides @anshumang used in the WG calls, I understand that the proposal is to require fence() (a memory barrier, in this case?) to order the completion of two blocking gets on the local PE (seemingly needed only in the AM case). But such out-of-order execution seems never to be visible to a single-threaded user program.

Now consider a threaded program, where load dest may be performed by another core; then such unordered gets do become visible to the user program. But don't we always need additional cache-coherence synchronization between T0 and T1 in this case anyway?

T0                                       T1
shmem_get(dest1);                    
shmem_get(dest2);
                                         load dest2;
                                         load dest1;

jdinan (Collaborator) commented Jul 26, 2018

How are you synchronizing between T0 and T1 threads in the example above and why does that synchronization not resolve the RAW consistency issue? Consistent views of memory across threads is a problem that should be addressed by the threading package, not by OpenSHMEM.

For more details on why attempting to define a memory model in a library is a poor choice, see: http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf

anshumang (Author) commented Jul 26, 2018

4fbc3ae is a first attempt at updating the RMA and non-blocking RMA sections of the spec related to ordering.

jdinan (Collaborator) commented Jul 27, 2018

As a matter of process, you might want to limit this proposal to the minimal set of edits needed to add/update the desired semantics. Standard-ese is notoriously hard to get right. There appear to be many edits here that are unrelated to the goal of this proposal but will likely delay its progress. It would also be helpful if, outside of the proposal, you enumerated the semantic updates the proposal introduces, since the changes may be subtle.

jdinan (Collaborator) commented Jul 27, 2018

Furthermore, two successive calls to \FUNC{shmem_g} routine may deliver data out of order unless a call to \FUNC{shmem_fence} or a read of the return value is introduced between the two calls.

Is this semantic intended to target a usage model like this?

Thread 0                  Thread 1
x = shmem_g(&x, ...)
y = shmem_g(&y, ...)      while (!y) ;
                          assert(x > 0);

If so, I'm not sure overloading shmem_fence is the right solution here (or even in a non-threaded case since fence has non-local actions). I assume that the calling thread should observe the fetched results in order. I would prefer to support the above pattern only through shmem_wait as we presently do. Otherwise, users should use the threading model to resolve this.

@anshumang (Author)

@jdinan I will update the PR with only the edits related to the proposed semantic change in the get APIs, as you suggest. I will also create a separate description of the semantic changes, noting that it may be dropped based on what is agreed here.

minsii (Collaborator) commented Jul 27, 2018

@jdinan: I agree with your point. That is also the reason why I listed all possible use cases: to figure out whether we need to change shmem_fence to order blocking gets. My opinion is that it is not needed.

Consistent views of memory across threads is a problem that should be addressed by the threading package, not by OpenSHMEM

anshumang (Author) commented Jul 30, 2018

@jdinan here is a commit with changes restricted to get ordering

@anshumang (Author)

@jdinan @minsii I have updated example 2 in the issue description above. If independent gets are allowed to be reordered, they can reach the target in a different order and hence observe puts in a different order. Do we allow this reordering of gets? If so, can a fence between the gets be used to prevent the reordering?

jdinan (Collaborator) commented Aug 2, 2018

@anshumang The changes in anshumang@d31c087 look reasonable. One suggestion, the "destination PE" wording feels a bit odd. What about "calling PE"?

jdinan (Collaborator) commented Aug 2, 2018

Regarding example 2, my reading of the specification suggests that statement PE2.c must see the value written at PE0.a without any additional OpenSHMEM calls by PE 2.

@anshumang (Author)

@jdinan That matches my interpretation too. But to guarantee this semantic, a memory barrier is needed in the implementation of get APIs on relaxed memory model architectures like Power, ARM and GPUs. This ticket aims to relax the requirement that PE2.c must observe PE0.a.

Expanding the scope of fence() to enforce ordering in such a case (PE2.c must observe PE0.a) is also part of the proposal (#232).

minsii (Collaborator) commented Aug 6, 2018

@anshumang @jdinan I think example 2 is a good justification. If I implement shmem_get as a CPU load via shared memory (e.g., PE1 and PE2 share memory in the example), load x and load y on PE2 can be executed out of order.

   PE0                  PE2
a. p(x,1, PE1);     
b. fence;     
c. p(y,1, PE1);         a. while(g(y, PE1) != 1); // while (load y != 1)
                        c. assert(g(x, PE1) == 1);  // assert (load x == 1)

Now my questions are:
(1) Should the above program be supported? Or do we want to just use the new signal API for this use case?
(2) Is shmem_fence the appropriate API to enforce ordering of local memory operations? It is only for "remote ordering" in the current spec.

jdinan (Collaborator) commented Aug 7, 2018

@anshumang Before we can introduce a new semantic, I think we need a ticket to clarify the existing memory model. Once there is agreement on existing semantics, we should be able to address some of these inefficiencies.

jdinan (Collaborator) commented Aug 7, 2018

@minsii (1) Yes, I believe this program (assuming get/put operations) already has well-defined behavior in OpenSHMEM. (2) shmem_fence introduces a lot of overhead if all you need is a processor read fence.

shamisp (Contributor) commented Aug 9, 2018

I'm a bit lost between multiple tickets discussing the same stuff...

@minsii I think you asked earlier how a "blocking" shmem_get call can get reordered. It is covered by the earlier example in #229, where 3 PEs implement synchronization using the blocking get routine. Within a shared memory space it is equivalent to `load` instructions, which can be executed out of order (even though the call is blocking).

@jdinan @anshumang
anshumang@d31c087
Does "destination" mean the initiator PE? IMHO "target" and "initiator" are much better descriptors. I am also not sure what "copied data" means.

Probably we can extend shmem_fence() to shmem_fence(op), where op is load, store, or system-wide (all possible ops).

minsii (Collaborator) commented Aug 9, 2018

@shamisp Thanks for the clarification. Actually, after thinking about the example again (PE1 and PE2 are on shared memory), it looks like introducing a memory barrier on PE2 alone is not sufficient to make the program correct. That is, if the put operations are handled as active messages, the store operations on PE1 can also be out of order. Does the current shmem_fence include a memory barrier on PE1?

 PE0                  PE1             PE2
a. p(x,1, PE1);   a.store x=1;
b. shmem_fence;     
c. p(y,1, PE1);   c.store y=1;    a. while(g(y, PE1) != 1); // while (load y != 1)
                                  b. memory barrier
                                  c. assert(g(x, PE1) == 1);  // assert (load x == 1)

bcernohous commented Aug 9, 2018 via email

@anshumang anshumang changed the title Memory Model (ordering + reads from = happens before) : Blocking gets are unordered Blocking gets are unordered Aug 9, 2018
@anshumang (Author)

@shamisp Sorry for the confusion. The discussion on #229 progressed quickly so the comments are spread across this ticket and #229. Also removed the long prefix Memory Model (ordering + ...) from the ticket names to avoid confusion.

Does the destination mean initiator "PE" ?

It can be both: the initiator (get) and the target (put). @jdinan had similar feedback that it was confusing. Will update and write back here.

@jdinan Like you suggested, I will create a new ticket for clarifications of the spec before we can move forward on this ticket.

shamisp (Contributor) commented Aug 10, 2018

@minsii @bcernohous If your library unpacks messages out of the network buffer into the application buffer, technically you would have to put a store barrier somewhere, or introduce some artificial dependency that orders the stores.

Practically, with a lot of interconnects, in order to process the completion queue that notifies you about message arrival, you would have to execute some sort of barrier anyway.

shamisp (Contributor) commented Aug 10, 2018

@anshumang Semantically, we should decouple blocking semantics from ordering semantics. There is an implicit assumption there that instructions are executed in program order....

@naveen-rn (Contributor)

I suppose this is not for OpenSHMEM-1.5.

@naveen-rn naveen-rn added this to the OpenSHMEM 1.6 milestone Jan 15, 2020
@jdinan jdinan self-assigned this Jan 15, 2020
jdinan (Collaborator) commented Jan 15, 2020

Yes, this will land in 1.6. IIRC, the conclusion we reached in this discussion is that blocking fetching atomics are ordered and that nonblocking variants are not. We may also want to introduce a blocking variant that is unordered.

6 participants