
Memory model umbrella ticket #229

Open
anshumang opened this issue Jul 19, 2018 · 25 comments

@anshumang

anshumang commented Jul 19, 2018

Summary: ordering (inside a PE) + reads from (between two PEs) = happens before (across all PEs)

The following items were discussed in the RMA WG on 6/21, 7/5, and 7/19 and are still open (except the one marked with ^). They are grouped below under 1) ordering, 2) reads from, and 3) happens before.

  1. ordering
  • Fetch AMOs are ordered (related: fence is not required to order fetch AMOs)
  • Blocking get/g/iget are not ordered (related: fence orders blocking get/g/iget) (Blocking gets are unordered #233)
  • fence also orders non-blocking get/g/iget (all fence behavior changes tracked in Behavior of fence #232)
  • Data-movement collective APIs using the same psync are ordered in the order of their issue
  2. reads from
  • All communication APIs progress without requiring quiet or barrier
  • Using shmem_put/iput/p to trigger shmem_wait_until is platform/implementation defined (related: shmem_atomic_set should be used to trigger shmem_wait_until)
  • New APIs to trigger shmem_wait_until that need to be single-copy atomic but not read-modify-write atomic
  • Trigger shmem_wait_until using the same type^
  • wait_until on remote symmetric memory
  • wait_until on local non-symmetric memory (related: when non-blocking fetch AMOs trigger wait_until, does it require the read-modify-write-fetch to be atomic?)
  3. happens before
    None yet
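
The three pieces compose as in the canonical notification pattern (a sketch only, not normative text; PE numbers and buffer names are made up for illustration):

PE 0                                      PE 1
shmem_put(data, ..., pe1);
shmem_fence();                   /* ordering inside PE 0 */
shmem_atomic_set(&flag, 1, pe1);
                                          shmem_wait_until(&flag, SHMEM_CMP_EQ, 1);  /* reads from */
                                          /* ... loads of data now happen after the put */

Ordering (put before atomic_set on PE 0) plus reads-from (wait_until observing the atomic_set on PE 1) yields a happens-before edge from the put to PE 1's subsequent load of data.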
@anshumang
Author

@spotluri @jdinan @manjugv @nspark @minsii @khamidouche and others
Please feel free to add anything that is missing. The plan is to create separate issues for tracking each of the items above and then create PRs for the proposed changes to the spec.

@anshumang anshumang self-assigned this Jul 23, 2018
@shamisp
Contributor

shamisp commented Jul 23, 2018

  1. Using shmem_wait_until in combination with AMOs only is a difficult one.
    It is somewhat simple if you are only looking at the shared-memory use case. It is much more complicated if the initiator of the AMO is located in a different coherency domain with respect to the target running the wait_until loop. The local load operation may not be atomic with respect to remote atomics.

  2. An AMO is an expensive operation compared to a regular PUT. It is slower and limited in the number of outstanding operations. On the other hand, waking up a remote PE with a regular PUT, even a partial one, can be a perfectly fine way of notification.
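
A sketch of the pattern in point 2 (illustrative only; whether a plain PUT may legally trigger shmem_wait_until is exactly the open item in the list above):

PE 0                                      PE 1
shmem_put(data, ..., pe1);
shmem_fence();
shmem_put(&flag, &one, 1, pe1);  /* cheap notification via plain put */
                                          shmem_wait_until(&flag, SHMEM_CMP_EQ, 1);

Where the flag update is delivered single-copy atomically this avoids the AMO rate limit; across coherency domains its behavior remains platform/implementation defined.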

@anshumang
Author

The definition of concurrency used in #204 is to be resolved in this ticket.

@anshumang anshumang changed the title Memory model clarifications Memory Model (ordering + reads from = happens before) Jul 23, 2018
@anshumang
Author

I have renamed the ticket (the title can be improved) to distinguish it from #172.

@minsii
Collaborator

minsii commented Jul 24, 2018

Is the related statement a mistake (get/g/iget -> put/p/iput)?

  • Blocking get/g/iget are not ordered (related: fence orders blocking get/g/iget)

Do you want to order the delivery of data returning to the local buffer, or order the read access of the remote memory? I think only the latter is useful. E.g., a user may want to do put_nbi -> fence -> get_nbi.

  • fence also orders non-blocking get/g/iget

I am not sure I understand this topic correctly. I have two questions: (1) Does wait_until check the return buffer of a nonblocking fetch AMO on the source PE? (2) What is the read-modify-write-fetch operation, and when is it needed?

  • wait_until on local non-symmetric memory (related : when non-blocking fetch AMOs trigger wait_until, does it require read-modify-write-fetch to be atomic?)

@anshumang
Author

@minsii

Is the related statement a mistake (get/g/iget -> put/p/iput)?
No. The proposal is to make g/get/iget unordered, since ordering them requires memory fences on some relaxed architectures (if the data from a get is used, ordering is enforced by the compiler/architecture).

Do you want to order the delivery of data returning to the local buffer, or order the read access of the remote memory?
The original context for this was a comment from @nspark on the draft that fence ordering blocking and non-blocking puts but only blocking gets may be non-intuitive. A follow-up question: why is ordering of the local buffer update (for a non-blocking get) not useful? Is it not a requirement for message passing to work?

I have two questions: (1) Does wait_until check the return buffer of a nonblocking fetch AMO on the source PE? (2) What is the read-modify-write-fetch operation, and when is it needed?
(1) Yes, this was suggested by @jdinan in the discussion on the mailing list.
(2) This is related to your comment in the same thread: "Will it increase the overhead of fetch AMOs if we need an atomicity guarantee? E.g., is there any network that supports atomicity for the returning data transfer of an AMO?" Is my understanding correct?

@bcernohous

bcernohous commented Jul 24, 2018 via email

@anshumang
Author

Thanks @bcernohous for the example. If I may use it to clarify my earlier comment: "fence orders get_nbi" implies that the local update is ordered. @minsii, comments?

@minsii
Collaborator

minsii commented Jul 24, 2018

@bcernohous: The example seems a little problematic to me. How do you guarantee that get_nbi(signal, pe1) reads signal on PE 1 after the update by put(signal=1, pe1)? Do you have to add another synchronization between PE 0 and PE n? E.g., PE n must issue its get_nbi operations after completion of PE 0's put(signal=1, pe1).

PE 0                 PE 1                   PE n

put(data, pe1)
fence();
put(signal=1, pe1)

                                            get_nbi(signal, pe1)
                                            fence()
                                            get_nbi(data, pe1)

@bcernohous

bcernohous commented Jul 24, 2018 via email

@bcernohous

bcernohous commented Jul 24, 2018 via email

@minsii
Collaborator

minsii commented Jul 24, 2018

@anshumang :

  • I am a little confused. If I understand correctly, g/get/iget is already unordered in the current semantics. Do you propose to only clarify the ordering semantics (no semantic change)?

No. The proposal is to make g/get/iget unordered, since ordering them requires memory fences on some relaxed architectures (if the data from a get is used, ordering is enforced by the compiler/architecture).

  • I do not expect any network hardware to support atomicity of the entire read-modify-write (on the remote object) + write (to the local return buffer) sequence. Actually, if we want to allow the wait_until + nonblocking fetch AMO combination, we also need atomicity within the local process between the write (to the local return buffer) in the nbi fetch AMO and the read (from the local return buffer) in wait_until.

(2) This is related to your comment in the same thread: "Will it increase the overhead of fetch AMOs if we need an atomicity guarantee? E.g., is there any network that supports atomicity for the returning data transfer of an AMO?" Is my understanding correct?

@anshumang
Author

anshumang commented Jul 24, 2018

@minsii From the description of shmem_get in Section 9.5.4 of spec v1.4: "The routines return after the data has been delivered to the dest array on the local PE." Does this not imply that shmem_get is ordered with respect to other operations?

@anshumang
Author

@bcernohous "fence orders non-blocking get" came up in the context of the new requirement that fence would order blocking gets (which are now unordered under the proposal). I think it could be helpful from a user's perspective to be able to order blocking and non-blocking gets the same way. Is there a fundamental performance issue with guaranteeing this?

@minsii
Collaborator

minsii commented Jul 24, 2018

@anshumang: A blocking get/g/iget must be complete at the return of the routine. In that sense, two blocking get operations are always ordered on the local PE. However, that is irrelevant to the shmem_fence semantics. For instance, there is no ordering between a blocking put and a blocking get. In the following example, the update of x by the put might be delivered on PE 1 after the return of the get; shmem_fence does not help.

shmem_put(x, PE1); /* local completion at return */
shmem_get(x, PE1); 

@anshumang
Author

@minsii

In that sense, two blocking get operations are always ordered on the local PE.

The proposal is to relax this requirement.
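
Under that proposal (my sketchy reading of it), a program that relies on two blocking gets completing in a particular order would need an explicit fence:

shmem_get(dest1, ..., pe1);  /* may complete out of order with the next get... */
shmem_fence();               /* ...unless a fence intervenes (per #233) */
shmem_get(dest2, ..., pe1);

The motivation is that guaranteeing order between independent gets requires memory fences on relaxed architectures even when the program never observes that order.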

@shamisp
Contributor

shamisp commented Jul 24, 2018

@minsii An out-of-order core may execute independent loads (a.k.a. shmem "blocking" gets) out of order.

@shamisp
Contributor

shamisp commented Jul 25, 2018

@anshumang What is really surprising is that the Cray T3D (the "father" of OpenSHMEM) used the Alpha, which is an out-of-order core. I just cannot imagine ordered loads on that platform. The original spec also had explicit ops for cache management, so my guess is the ordering happened through the cache-invalidation routines. Otherwise you could complete the load from local memory regardless of what the other side put there.

Looking at the original manual I only see barrier, no shmem_fence or shmem_quiet operations. My guess is these two were introduced post-1994.

@anshumang
Author

Thanks for the comments @shamisp Can you please add the pointer to the original spec?

@shamisp
Contributor

shamisp commented Jul 25, 2018

https://www.cs.cmu.edu/afs/cs/project/cmcl/link.iwarp/OldFiles/archive/fx-papers/cri-shmem-users-guide.ps

@minsii
Collaborator

minsii commented Jul 25, 2018

@shamisp I am still confused about how out-of-order cores can reorder blocking gets in a way that is visible to user programs, such that a fence between blocking gets becomes necessary. Below is my thinking; it might be incorrect/incomplete. Could you please give a more detailed explanation?

For network-offloaded get:

shmem_get(dest, P1)
  -- (1) CPU issues network read to P1
  -- (2) network transfers data from remote P1 to local dest buffer
  -- (3) CPU confirms local completion of (2) and then return to user program

Shouldn't the mechanism of (3) ensure that (2) has already been performed and completed?

For active-message based get:

shmem_get(dest, P1)
  -- (1) CPU issues read-request packet to P1
  -- (2) CPU waits till received ack from P1
  -- (3) CPU copies data into local dest buffer
  -- (4) return to user program
load dest;

I could imagine out-of-order execution of (3) and (4) in the AM-based case, but (3) must be done when program loads dest.

Reading again the slides @anshumang used in the WG calls, I understood that the proposal is to require fence() (a memory barrier in this case?) to order the completion of two blocking gets on the local PE (seemingly needed only for the AM case). But such out-of-order execution seems never visible to a single-threaded user program.

Now consider a threaded program, where load dest may be performed by another core; such unordered gets then become visible to the user program. But don't we always need additional cache-coherence synchronization between T0 and T1 in this case?

T0                                       T1
shmem_get(dest1);                    
shmem_get(dest2);
                                         load dest2;
                                         load dest1;

@anshumang
Author

Thanks for the code examples, @minsii.
I have created issue #233 for tracking the ordering of gets.
Maybe we can continue the discussion there? I have copied your example under #233.

@anshumang anshumang changed the title Memory Model (ordering + reads from = happens before) Memory Model umbrella ticket Aug 9, 2018
@anshumang anshumang changed the title Memory Model umbrella ticket Memory model umbrella ticket Aug 9, 2018
@anshumang
Author

anshumang commented Aug 23, 2018

Slides discussed in OpenSHMEM 2018 F2F

@anshumang
Author

References from the MPI RMA memory model and a generalized RMA memory model (coreRMA).

@anshumang
Author

anshumang commented Sep 26, 2018

Keynote by Will Deacon from OpenSHMEM Workshop 2018
