
Memory model umbrella ticket #229

Open
anshumang opened this issue Jul 19, 2018 · 25 comments

@anshumang

anshumang commented Jul 19, 2018

Summary: ordering (inside a PE) + reads from (between two PEs) = happens before (across all PEs)

The following items were discussed in the RMA WG on 6/21, 7/5, and 7/19 and are still open (except the one marked with ^). They are grouped below under 1) ordering, 2) reads from, and 3) happens before.

  1. ordering
  • Fetch AMOs are ordered (related: fence is not required to order fetch AMOs)
  • Blocking get/g/iget are not ordered (related: fence orders blocking get/g/iget) (Blocking gets are unordered #233)
  • fence also orders non-blocking get/g/iget (all fence behavior changes tracked in Behavior of fence #232)
  • Data-movement collective APIs using the same psync are ordered in the order of their issue
  2. reads from
  • All communication APIs progress without requiring quiet or barrier
  • Using shmem_put/iput/p to trigger shmem_wait_until is platform/implementation defined (related: shmem_atomic_set should be used to trigger shmem_wait_until)
  • New APIs to trigger shmem_wait_until that need to be single-copy atomic but not read-modify-write atomic
  • Trigger shmem_wait_until using the same type^
  • wait_until on remote symmetric memory
  • wait_until on local non-symmetric memory (related: when non-blocking fetch AMOs trigger wait_until, does it require the read-modify-write-fetch to be atomic?)
  3. happens before
    None yet
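
The three pieces compose as in the canonical notification pattern (a sketch only, not normative text; PE numbers and buffer names are made up for illustration):

PE 0                                      PE 1
shmem_put(data, ..., pe1);
shmem_fence();                   /* ordering inside PE 0 */
shmem_atomic_set(&flag, 1, pe1);
                                          shmem_wait_until(&flag, SHMEM_CMP_EQ, 1);  /* reads from */
                                          /* ... loads of data now happen after the put */

Ordering (put before atomic_set on PE 0) plus reads-from (wait_until observing the atomic_set on PE 1) yields a happens-before edge from the put to PE 1's subsequent load of data.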
@anshumang
Author

@spotluri @jdinan @manjugv @nspark @minsii @khamidouche and others
Please feel free to add anything that is missing. The plan is to create separate issues for tracking each of the items above and then create PRs for the proposed changes to the spec.

@anshumang anshumang self-assigned this Jul 23, 2018
@shamisp
Contributor

shamisp commented Jul 23, 2018

  1. Using shmem_wait_until in combination with AMOs only is a difficult one.
    It is somewhat simple if you are only looking at the shared-memory use case. It is much more complicated if the initiator of the AMO is located in a different coherency domain with respect to the target running the wait_until loop. The local load operation may not be atomic with respect to remote atomics.

  2. An AMO is an expensive operation compared to a regular PUT. It is slower and limited in the number of outstanding operations. On the other hand, waking up a remote PE with a regular PUT, even a partial one, can be a perfectly fine way of notification.
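
A sketch of the pattern in point 2 (illustrative only; whether a plain PUT may legally trigger shmem_wait_until is exactly the open item in the list above):

PE 0                                      PE 1
shmem_put(data, ..., pe1);
shmem_fence();
shmem_put(&flag, &one, 1, pe1);  /* cheap notification via plain put */
                                          shmem_wait_until(&flag, SHMEM_CMP_EQ, 1);

Where the flag update is delivered single-copy atomically this avoids the AMO rate limit; across coherency domains its behavior remains platform/implementation defined.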

@anshumang
Author

The definition of concurrency used in #204 is to be resolved in this ticket.

@anshumang anshumang changed the title Memory model clarifications Memory Model (ordering + reads from = happens before) Jul 23, 2018
@anshumang
Author

I have renamed the ticket (the title can be improved) to distinguish it from #172.

@minsii
Collaborator

minsii commented Jul 24, 2018

Is the related statement a mistake (get/g/iget -> put/p/iput)?

  • Blocking get/g/iget are not ordered (related: fence orders blocking get/g/iget)

Do you want to order the delivery of data returning to the local buffer, or order the read access of the remote memory? I think only the latter is useful. E.g., a user may want to do put_nbi -> fence -> get_nbi.

  • fence also orders non-blocking get/g/iget

I am not sure I understand this topic correctly. I have two questions: (1) Does wait_until check the return buffer of a nonblocking fetch AMO on the source PE? (2) What is the read-modify-write-fetch operation, and when is it needed?

  • wait_until on local non-symmetric memory (related : when non-blocking fetch AMOs trigger wait_until, does it require read-modify-write-fetch to be atomic?)

@anshumang
Author

@minsii

Is the related statement a mistake (get/g/iget -> put/p/iput)?
No. The proposal is to make g/get/iget unordered, since ordering them requires memory fences on some relaxed architectures (if the data from a get is used, ordering is enforced by the compiler/architecture).

Do you want to order the delivery of data returning to the local buffer, or order the read access of the remote memory?
The original context for this was a comment from @nspark on the draft that fence ordering blocking and non-blocking puts but only blocking gets may be non-intuitive. A follow-up question: why is ordering of the local buffer update (for a non-blocking get) not useful? Is it not a requirement for message passing to work?

I have two questions: (1) Does wait_until check the return buffer of a nonblocking fetch AMO on the source PE? (2) What is the read-modify-write-fetch operation, and when is it needed?
(1) Yes, this was suggested by @jdinan in the discussion on the mailing list.
(2) This is related to your comment in the same thread: "Will it increase the overhead of fetch AMOs if we need an atomicity guarantee? E.g., is there any network that supports atomicity for the returning data transfer of an AMO?" Is my understanding correct?

@bcernohous

bcernohous commented Jul 24, 2018 via email

@anshumang
Author

Thanks @bcernohous for the example. If I may use it to clarify my earlier comment: "fence orders get_nbi" implies that the local update is ordered. @minsii, comments?

@minsii
Collaborator

minsii commented Jul 24, 2018

@bcernohous: The example seems a little problematic to me. How do you guarantee that get_nbi(signal, pe1) reads signal on PE 1 after the update by put(signal=1, pe1)? Do you have to add another synchronization between PE 0 and PE n? E.g., PE n must issue its get_nbi operations after completion of PE 0's put(signal=1, pe1).

PE 0                 PE 1                   PE n

put(data, pe1)
fence();
put(signal=1, pe1)

                                            get_nbi(signal, pe1)
                                            fence()
                                            get_nbi(data, pe1)

@bcernohous

bcernohous commented Jul 24, 2018 via email

@bcernohous

bcernohous commented Jul 24, 2018 via email

@minsii
Collaborator

minsii commented Jul 24, 2018

@anshumang :

  • I am a little confused. If I understand correctly, g/get/iget is already unordered in the current semantics. Do you propose to only clarify the ordering semantics (no semantic change)?

No. The proposal is to make g/get/iget unordered, since ordering them requires memory fences on some relaxed architectures (if the data from a get is used, ordering is enforced by the compiler/architecture).

  • I do not expect any network hardware to support atomicity of the entire read-modify-write (on the remote object) + write (to the local return buffer) sequence. Actually, if we want to allow the wait_until + nonblocking fetch AMO combination, we also need atomicity within the local process between the write (to the local return buffer) in the nbi fetch AMO and the read (from the local return buffer) in wait_until.

(2) This is related to your comment in the same thread: "Will it increase the overhead of fetch AMOs if we need an atomicity guarantee? E.g., is there any network that supports atomicity for the returning data transfer of an AMO?" Is my understanding correct?

@anshumang
Author

anshumang commented Jul 24, 2018

@minsii From the description of shmem_get in Section 9.5.4 of spec v1.4: "The routines return after the data has been delivered to the dest array on the local PE." Does this not imply that shmem_get is ordered with respect to other operations?

@anshumang
Author

@bcernohous "fence orders non-blocking get" came up in the context of the new requirement that fence would order blocking gets (which are now unordered under the proposal). I think it could be helpful from a user's perspective to be able to order blocking and non-blocking gets the same way. Is there a fundamental performance issue with guaranteeing this?

@minsii
Collaborator

minsii commented Jul 24, 2018

@anshumang: A blocking get/g/iget must be complete at the return of the routine. In that sense, two blocking get operations are always ordered on the local PE. However, that is irrelevant to the shmem_fence semantics. For instance, there is no ordering between a blocking put and a blocking get. In the following example, the update of x by the put might be delivered on PE 1 after the return of the get; shmem_fence does not help.

shmem_put(x, PE1); /* local completion at return */
shmem_get(x, PE1); 

@anshumang
Author

@minsii

In that sense, two blocking get operations are always ordered on the local PE.

The proposal is to relax this requirement.
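
Under that proposal (my sketchy reading of it), a program that relies on two blocking gets completing in a particular order would need an explicit fence:

shmem_get(dest1, ..., pe1);  /* may complete out of order with the next get... */
shmem_fence();               /* ...unless a fence intervenes (per #233) */
shmem_get(dest2, ..., pe1);

The motivation is that guaranteeing order between independent gets requires memory fences on relaxed architectures even when the program never observes that order.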

@shamisp
Contributor

shamisp commented Jul 24, 2018

@minsii An out-of-order core may execute independent loads (a.k.a. shmem "blocking" gets) out of order.

@shamisp
Contributor

shamisp commented Jul 25, 2018

@anshumang What is really surprising is that the Cray T3D (the "father" of OpenSHMEM) used the Alpha, which is an out-of-order core. I just cannot imagine ordered loads on that platform. The original spec also had explicit ops for cache management, so my guess is the ordering happened through the cache-invalidation routines. Otherwise you could complete the load from local memory regardless of what the other side put there.

Looking at the original manual I only see barrier, no shmem_fence or shmem_quiet operations. My guess is these two were introduced post-1994.

@anshumang
Author

Thanks for the comments @shamisp Can you please add the pointer to the original spec?

@shamisp
Contributor

shamisp commented Jul 25, 2018

https://www.cs.cmu.edu/afs/cs/project/cmcl/link.iwarp/OldFiles/archive/fx-papers/cri-shmem-users-guide.ps

@minsii
Collaborator

minsii commented Jul 25, 2018

@shamisp I am still confused about how out-of-order cores can reorder blocking gets in a way that is visible to user programs, such that a fence between blocking gets becomes necessary. Below is my thinking; it might be incorrect/incomplete. Could you please give a more detailed explanation?

For network-offloaded get:

shmem_get(dest, P1)
  -- (1) CPU issues network read to P1
  -- (2) network transfers data from remote P1 to local dest buffer
  -- (3) CPU confirms local completion of (2) and then return to user program

Shouldn't the mechanism of (3) ensure that (2) has already been performed and completed?

For active-message based get:

shmem_get(dest, P1)
  -- (1) CPU issues read-request packet to P1
  -- (2) CPU waits till received ack from P1
  -- (3) CPU copies data into local dest buffer
  -- (4) return to user program
load dest;

I could imagine out-of-order execution of (3) and (4) in the AM-based case, but (3) must be done when program loads dest.

Reading again the slides @anshumang used in the WG calls, I understood that the proposal is to require fence() (a memory barrier in this case?) to order the completion of two blocking gets on the local PE (seemingly needed only for the AM case). But such out-of-order execution seems never visible to a single-threaded user program.

Now consider a threaded program, where load dest may be performed by another core; such unordered gets then become visible to the user program. But don't we always need additional cache-coherence synchronization between T0 and T1 in this case?

T0                                       T1
shmem_get(dest1);                    
shmem_get(dest2);
                                         load dest2;
                                         load dest1;

@anshumang
Author

Thanks for the code examples, @minsii.
I have created issue #233 for tracking the ordering of gets.
Maybe we can continue the discussion there? I have copied your example under #233.

@anshumang anshumang changed the title Memory Model (ordering + reads from = happens before) Memory Model umbrella ticket Aug 9, 2018
@anshumang anshumang changed the title Memory Model umbrella ticket Memory model umbrella ticket Aug 9, 2018
@anshumang
Author

anshumang commented Aug 23, 2018

Slides discussed in OpenSHMEM 2018 F2F

@anshumang
Author

References from the MPI RMA memory model and a generalized RMA memory model (coreRMA).

@anshumang
Author

anshumang commented Sep 26, 2018

Keynote by Will Deacon from OpenSHMEM Workshop 2018
