Blocking gets are unordered #233

Open
anshumang opened this issue Jul 25, 2018 · 23 comments

anshumang commented Jul 25, 2018

Umbrella issue #229

A blocking get operation (shmem_g, shmem_get, shmem_iget), as defined in the OpenSHMEM specification v1.4, returns only after the data has been delivered to the destination array at the local PE. For independent get operations, this requires implementations on relaxed-memory-model architectures to include appropriate memory barriers in each get operation, resulting in sub-optimal performance.
UPDATE : The description of blocking get enforces completion. But does that also imply ordering? If not, the specification could clarify this point.

This proposal places the requirement on the programmer to use a fence where ordering is required between independent get operations.
UPDATE : The reordering can be observed from another thread in the calling thread's PE. Is there any other way to observe the reordering? The suggestion from @jdinan below is to not consider the memory view from other threads when deciding OpenSHMEM semantics.

The completion semantics of get remain unchanged from the v1.4 specification: the result of the get is available for any dependent operation that appears after it, in program order.

In example 1, line b is data dependent on the result of the get operation in line a. Lines a and b are guaranteed to execute in program order. Hence, the output where j takes value 0 is prohibited.

Example 1
Input : x(1) = 1, i = 0, j = 0
a: i = g(x,1)
| data dependency
V
b: j = i
Output : i = 1, j = 0
OpenSHMEM v1.4 : prohibited
Current proposal : prohibited

In example 2, the get operations on lines a and c of PE 2 are unrelated and can be reordered under this proposal. Hence the outcome where j takes value 0 on line c even though i takes value 1 on line a is allowed, as observed from PE 2.

Example 2
Input : x(1) = 0, y(1) = 0, i = 0, j = 0
PE 0
a: p(x,1)
b: fence()
c: p(y,1)

PE 2
a: i = g(y, 1)
b: use i (UPDATE : b is unnecessary, the interpretation of spec v1.4 is that a and c are ordered even without b)
| c can reach PE1 before a (requires fence for ordering)
V
c: j = g(x, 1)

Output : x(1) = 1, y(1) = 1, i = 1, j = 0
OpenSHMEM v1.4 : prohibited
Current proposal : allowed

@anshumang (Author)

Related comments from @minsii -

I am still confused about how out-of-order cores can reorder blocking gets in a way that is visible to user programs, such that a fence between blocking gets becomes necessary. Below is my understanding; it might be incorrect or incomplete. Could you please give a more detailed explanation?

For network-offloaded get:

shmem_get(dest, P1)
  -- (1) CPU issues network read to P1
  -- (2) network transfers data from remote P1 to local dest buffer
  -- (3) CPU confirms local completion of (2) and then return to user program

Shouldn't the mechanism in (3) ensure that (2) has already been performed and completed?

For active-message based get:

shmem_get(dest, P1)
  -- (1) CPU issues read-request packet to P1
  -- (2) CPU waits till received ack from P1
  -- (3) CPU copies data into local dest buffer
  -- (4) return to user program
load dest;

I could imagine out-of-order execution of (3) and (4) in the AM-based case, but (3) must be complete by the time the program loads dest.

Reading again the slides @anshumang used in the WG calls, I understand that the proposal is to require fence() (a memory barrier, in this case?) to order the completion of two blocking gets on the local PE (seemingly needed only in the AM case). But such out-of-order execution seems never to be visible to a single-threaded user program.

Now consider a threaded program, where load dest may be performed by another core; then such unordered gets do become visible to the user program. But don't we always need additional cache-coherence synchronization between T0 and T1 in this case anyway?

T0                                       T1
shmem_get(dest1);                    
shmem_get(dest2);
                                         load dest2;
                                         load dest1;

jdinan (Collaborator) commented Jul 26, 2018

How are you synchronizing between T0 and T1 threads in the example above and why does that synchronization not resolve the RAW consistency issue? Consistent views of memory across threads is a problem that should be addressed by the threading package, not by OpenSHMEM.

For more details on why attempting to define a memory model in a library is a poor choice, see: http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf

anshumang (Author) commented Jul 26, 2018

4fbc3ae is a first attempt at updating the RMA and non-blocking RMA sections of the spec related to ordering.

jdinan (Collaborator) commented Jul 27, 2018

As a matter of process, you might want to limit this proposal to the minimal set of edits needed to add/update the desired semantics. Standard-ese is notoriously hard to get right. There appear to be many edits here that are unrelated to the goal of this proposal but will likely delay its progress. It would also be helpful if, outside of the proposal, you enumerated the semantic updates the proposal introduces, since the changes may be subtle.

jdinan (Collaborator) commented Jul 27, 2018

Furthermore, two successive calls to \FUNC{shmem_g} routine may deliver data out of order unless a call to \FUNC{shmem_fence} or a read of the return value is introduced between the two calls.

Is this semantic intended to target a usage model like this?

Thread 0                  Thread 1
x = shmem_g(&x, ...)
y = shmem_g(&y, ...)      while (!y) ;
                          assert(x > 0);

If so, I'm not sure overloading shmem_fence is the right solution here (or even in a non-threaded case since fence has non-local actions). I assume that the calling thread should observe the fetched results in order. I would prefer to support the above pattern only through shmem_wait as we presently do. Otherwise, users should use the threading model to resolve this.

@anshumang (Author)

@jdinan I will update the PR with only the edits related to the proposed semantic change in the get APIs, as you suggest. I will also create a separate description of the semantic changes, noting that it may be dropped based on what is agreed here.

minsii (Collaborator) commented Jul 27, 2018

@jdinan: I agree with your point. That is also the reason why I listed all possible use cases: to figure out whether we need to change shmem_fence to order blocking gets. My opinion is that it is not needed.

Consistent views of memory across threads is a problem that should be addressed by the threading package, not by OpenSHMEM

anshumang (Author) commented Jul 30, 2018

@jdinan here is a commit with changes restricted to get ordering

@anshumang (Author)

@jdinan @minsii I have updated example 2 in the issue description above. If independent gets are allowed to be reordered, they can reach the target in a different order and hence observe puts in a different order. Do we allow this reordering of gets? If so, can a fence between the gets be used to prevent the reordering?

jdinan (Collaborator) commented Aug 2, 2018

@anshumang The changes in anshumang@d31c087 look reasonable. One suggestion, the "destination PE" wording feels a bit odd. What about "calling PE"?

jdinan (Collaborator) commented Aug 2, 2018

Regarding example 2, my reading of the specification suggests that statement PE2.c must see the value written at PE0.a without any additional OpenSHMEM calls by PE 2.

@anshumang (Author)

@jdinan That matches my interpretation too. But to guarantee this semantic, a memory barrier is needed in the implementation of get APIs on relaxed memory model architectures like Power, ARM and GPUs. This ticket aims to relax the requirement that PE2.c must observe PE0.a.

Expanding the scope of fence() to enforce ordering in such a case (PE2.c must observe PE0.a) is also part of the proposal (#232).

minsii (Collaborator) commented Aug 6, 2018

@anshumang @jdinan I think example 2 is a good justification. If I implement shmem_get as a CPU load via shared memory (e.g., PE1 and PE2 share memory in the example), load x and load y on PE2 can be executed out of order.

   PE0                  PE2
a. p(x,1, PE1);     
b. fence;     
c. p(y,1, PE1);         a. while(g(y, PE1) != 1); // while (load y != 1)
                        c. assert(g(x, PE1) == 1);  // assert (load x == 1)

Now my questions are:
(1) Should the above program be supported? Or do we want to just use the new signal API for this use case?
(2) Is shmem_fence the appropriate API to enforce ordering of local memory operations? It is only for "remote ordering" in the current spec.

jdinan (Collaborator) commented Aug 7, 2018

@anshumang Before we can introduce a new semantic, I think we need a ticket to clarify the existing memory model. Once there is agreement on existing semantics, we should be able to address some of these inefficiencies.

jdinan (Collaborator) commented Aug 7, 2018

@minsii (1) Yes, I believe this program (assuming get/put operations) already has well-defined behavior in OpenSHMEM. (2) shmem_fence introduces a lot of overhead if all you need is a processor read fence.

shamisp (Contributor) commented Aug 9, 2018

I'm a bit lost between multiple tickets discussing the same stuff...

@minsii I think you asked earlier how a "blocking" shmem_get call can get reordered. It is covered by the earlier example in #229, where 3 PEs implement synchronization using the blocking get routine. Within a shared memory space it is equivalent to `load` instructions, which can be executed out of order (even though the call is blocking).

@jdinan @anshumang
anshumang@d31c087
Does "destination" mean the initiator PE? IMHO "target" and "initiator" are much better descriptors. I am also not sure what "copied data" means.

Probably we can extend shmem_fence() to shmem_fence(op), where op is load, store, or system-wide (all possible ops).

minsii (Collaborator) commented Aug 9, 2018

@shamisp Thanks for the clarification. Actually, after thinking about the example again (PE1 and PE2 are on shared memory), it looks like introducing a memory barrier on PE2 alone is not sufficient to make the program correct. That is, if the put operations are handled as active messages, the store operations on PE1 can also be out of order. Does the current shmem_fence include a memory barrier on PE1?

 PE0                  PE1             PE2
a. p(x,1, PE1);   a.store x=1;
b. shmem_fence;     
c. p(y,1, PE1);   c.store y=1;    a. while(g(y, PE1) != 1); // while (load y != 1)
                                  b. memory barrier
                                  c. assert(g(x, PE1) == 1);  // assert (load x == 1)

bcernohous commented Aug 9, 2018 via email

@anshumang anshumang changed the title Memory Model (ordering + reads from = happens before) : Blocking gets are unordered Blocking gets are unordered Aug 9, 2018
@anshumang (Author)

@shamisp Sorry for the confusion. The discussion on #229 progressed quickly so the comments are spread across this ticket and #229. Also removed the long prefix Memory Model (ordering + ...) from the ticket names to avoid confusion.

Does the destination mean initiator "PE" ?

It can be both: the initiator (get) and the target (put). @jdinan had similar feedback that it was confusing. Will update and write back here.

@jdinan Like you suggested, I will create a new ticket for clarifications of the spec before we can move forward on this ticket.

shamisp (Contributor) commented Aug 10, 2018

@minsii @bcernohous If your library unpacks messages out of the network buffer into the application buffer, technically you would have to put a store barrier somewhere, or introduce some artificial dependency that orders the stores.

Practically, with a lot of interconnects, in order to process the completion queue that notifies you about message arrival, you would have to execute some sort of barrier anyway.

shamisp (Contributor) commented Aug 10, 2018

@anshumang Semantically, we should decouple blocking semantics from ordering semantics. There is an implicit assumption there that instructions are executed in program order....

@naveen-rn (Contributor)

I suppose this is not for OpenSHMEM-1.5.

@naveen-rn naveen-rn added this to the OpenSHMEM 1.6 milestone Jan 15, 2020
@jdinan jdinan self-assigned this Jan 15, 2020
jdinan (Collaborator) commented Jan 15, 2020

Yes, this will land in 1.6. IIRC, the conclusion we reached in this discussion is that blocking fetching atomics are ordered and that nonblocking variants are not. We may also want to introduce a blocking variant that is unordered.

6 participants