shmem_put_signal #206

Closed · bcernohous opened this issue Feb 28, 2018 · 26 comments · Fixed by #275

@bcernohous commented Feb 28, 2018

Discuss a proposal for shmem_put_signal.

Transfers contiguous data from a local data object to a specified processing element (PE) without blocking the caller and sets a remote flag to signal completion.

Cray's current implementation:

```c
void shmem_put_signal[_nb](void *target, const void *source, size_t len, uint64_t *sig_target, uint64_t sig_val, int pe)
```

Semantically (almost) equivalent to:

  • shmem_put[_nb](data)
  • shmem_fence()
  • shmem_put[_nb](signal)

except that a full shmem_fence is 'heavier' than shmem_put_signal, which only orders the data before the signal.

  • The target buffer and signal location must reside in symmetric memory.
  • All data size and type variants are supported.
  • signal is always a uint64_t.
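
As a rough illustration of the sequence above, here is a minimal sketch of those semantics built from standard OpenSHMEM calls (illustrative only; a native implementation would use a lighter-weight ordering mechanism than the full fence shown here):

```c
#include <shmem.h>

/* Reference sketch of the proposed semantics. A native shmem_put_signal
 * would order only the data before the signal instead of paying for a
 * full shmem_fence. */
void put_signal_reference(void *target, const void *source, size_t len,
                          uint64_t *sig_target, uint64_t sig_val, int pe)
{
    shmem_putmem(target, source, len, pe);     /* deliver the data       */
    shmem_fence();                             /* order data before flag */
    shmem_put64(sig_target, &sig_val, 1, pe);  /* set the remote signal  */
}
```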

Attaching slides for RMA WG 03 01 2018
https://github.com/openshmem-org/specification/wiki/RMA-WG-03-01-2018

OpenSHMEM_RMA_PUTSIGNAL.pptx

Edit (@nspark): Fixed some Markdown formatting issues.

@naveen-rn (Contributor)

After today's OpenSHMEM spec meeting and @jdinan's active-set changes discussion, I just realized: if we haven't discussed put_with_signal with a put size of 0, we should have a quick discussion on it.

@nspark (Contributor) commented Mar 19, 2018

I think that a shmem_put_signal call with a size of 0 should still send the signal update.

@naveen-rn (Contributor)

> I think that a shmem_put_signal call with a size of 0 should still send the signal update.

+1
Cray SHMEM currently sends a signal even for 0-size puts. I'm OK with these semantics.
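
For illustration, a sketch of the zero-size case under the Cray-style signature from the opening comment (dest, src, and sig here are hypothetical symmetric objects, with sig initialized to 0):

```c
/* Sender: zero-length data put; per this discussion, the signal must
 * still be delivered. Illustrative only. */
shmem_put_signal(dest, src, 0, &sig, 1, target_pe);

/* Receiver: observes the signal even though no data moved (assuming the
 * uint64_t wait variant from OpenSHMEM 1.4). */
shmem_uint64_wait_until(&sig, SHMEM_CMP_EQ, 1);
```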

@jdinan (Collaborator) commented Nov 2, 2018

@bcernohous I talked with @shefty about implementation and he indicated that FI_FENCE does not guarantee that data written at the target will be visible in memory prior to that of the fenced operation (assuming here that the signal is sent with the FI_FENCE flag). Sounds like this may not be the right way to implement put with signal on OFI.

@bcernohous (Author)

Why not? Does FI_FENCE only guarantee FI_TRANSMIT_COMPLETE? It doesn't specify that. What if you are using FI_DELIVERY_COMPLETE on the operations? If FI_DELIVERY_COMPLETE doesn't guarantee visibility, then nothing does, and it's not really a FI_FENCE issue.

> FI_DELIVERY_COMPLETE
> Applies to fi_writemsg. Indicates that a completion should be generated when the operation has been processed by the destination.

> FI_FENCE
> Applies to transmits. Indicates that the requested operation, also known as the fenced operation, and any operation posted after the fenced operation will be deferred until all previous operations targeting the same peer endpoint have completed. Operations posted after the fencing will see and/or replace the results of any operations initiated prior to the fenced operation.

@shefty commented Nov 5, 2018

FI_FENCE provides an ordering guarantee at the target for operations that aren't usually ordered (e.g. issuing an RDMA write after RDMA read), but does not specify anything about how the data must be written by a NIC into host memory.

I think these are the options to support the desired shmem semantics:

1. Submit two writes, both with FI_DELIVERY_COMPLETE, with FI_FENCE in between.
2. Use FI_ORDER_DATA.

Part of the issue with the shmem semantics is that the most widely deployed RDMA NICs support only FI_INJECT_COMPLETE (iWarp) or FI_TRANSMIT_COMPLETE (IB/RoCE) completion semantics without software intervention. And, architecturally, they provide no guarantees on how data must be written into memory (no FI_ORDER_DATA).

Polling on memory is in general not supported by these architectures, but FI_ORDER_DATA was included in OFI for this use case.
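
For concreteness, a hypothetical sketch of option 1 on an OFI endpoint, with the signal write fenced behind the data write. The endpoint, RMA message descriptors, and completion handling are assumed to be set up elsewhere; this is one reading of the option, not a definitive implementation (and the exact flag placement is debated later in this thread):

```c
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

/* Hypothetical sketch: FI_FENCE defers the signal write until all prior
 * operations to the same peer (here, the data write) have completed, and
 * FI_DELIVERY_COMPLETE makes each completion mean the data was processed
 * at the destination. 'ep', 'data_msg', and 'sig_msg' are placeholders. */
static ssize_t put_signal_fenced(struct fid_ep *ep,
                                 const struct fi_msg_rma *data_msg,
                                 const struct fi_msg_rma *sig_msg)
{
    ssize_t ret = fi_writemsg(ep, data_msg, FI_DELIVERY_COMPLETE);
    if (ret)
        return ret;
    return fi_writemsg(ep, sig_msg, FI_FENCE | FI_DELIVERY_COMPLETE);
}
```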

@bcernohous (Author)

> 1.) Submit 2 writes, both with FI_DELIVERY_COMPLETE and FI_FENCE in between.

That was my point, made more clearly. I agree. And 2) is true too, if you want ordering.

> FI_INJECT_COMPLETE (iWarp) or FI_TRANSMIT_COMPLETE (IB/RoCE)

I was aware of that, but this discussion has opened questions in my mind. How do we complete fi_read()s on these providers? The data has to be ready locally, so we can't really be getting either FI_INJECT_COMPLETE/FI_TRANSMIT_COMPLETE on the cntr/cq after we issue fi_read.

@shefty commented Nov 5, 2018

For operations that return data to the initiator (e.g. RMA reads), the default completion model is FI_DELIVERY_COMPLETE. But the data isn't guaranteed to be available until a completion has been generated. I.e. polling on memory still doesn't work, unless FI_ORDER_DATA has been set for the local endpoint.

My understanding of the put_signal call is that it requires data ordering. FI_ORDER_DATA should provide better performance. Fencing is a heavy operation, as it basically requires flushing the transmit queue prior to the fenced operation beginning.
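
A sketch of how option 2 might be requested when selecting a provider, by asking for FI_ORDER_DATA in the fi_getinfo hints (the version number is arbitrary and most hint fields are omitted; an assumption-laden fragment, not a complete setup):

```c
#include <rdma/fabric.h>

/* Sketch: request providers whose writes become visible in target memory
 * in order (FI_ORDER_DATA), so the signal cannot be observed before the
 * data it follows. */
static struct fi_info *select_ordered_provider(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    hints->caps = FI_RMA;
    hints->tx_attr->msg_order = FI_ORDER_DATA;  /* in-order data placement */

    if (fi_getinfo(FI_VERSION(1, 6), NULL, NULL, 0, hints, &info))
        info = NULL;                            /* no matching provider */
    fi_freeinfo(hints);
    return info;
}
```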

@jdinan (Collaborator) commented Nov 12, 2018

@shefty Want to make sure I understand what you mean by "FI_FENCE in between". Does this mean that put with signal can be implemented as two fi_write operations with FI_FENCE | FI_DELIVERY_COMPLETE set on both writes? Or is there a third operation involved?

@shefty commented Nov 12, 2018

I meant the first operation is write + FI_FENCE | FI_DELIVERY_COMPLETE. The second operation is write + FI_DELIVERY_COMPLETE. I don't know if the second FI_DELIVERY_COMPLETE flag is strictly necessary. That depends on the semantic needed at the initiator side of the put_signal.

@bcernohous (Author) commented Nov 12, 2018 via email

@shefty commented Nov 12, 2018

Uhm - yeah, what Bob said.

@naveen-rn (Contributor)

@shamisp @manjugv Based on my understanding from the UCX workshop discussions and other discussions with @jdinan, adding some extra clarifications:

  1. In this proposal, we don't specifically say whether the signal update is an atomic update or a single-element put update; we leave it as an implementation detail. Implementations can handle it based on their requirements. Theoretically, based on the PCIe specification, it seems only 32-bit writes can be guaranteed to complete fully. But most implementations appear to make a single 64-bit word write atomic, even though that is not in the spec.
  2. We say in this proposal that the signal is observable through shmem_wait_until. But we don't specifically say whether the read from shmem_wait_until guarantees full or partial completion. If required, this should be opened as a separate ticket and handled with respect to other operations like shmem_atomic_set, shmem_p, and shmem_put_signal.
  3. In this proposal, we don't allow overlapping dest and signal data buffers. We restrict this explicitly.
  4. We don't restrict users to using shmem_wait_until alone to read the signal update. Normal reads through load operations are also allowed. I'm not sure whether this has an impact on some networks. Needs clarification.

@shamisp (Contributor) commented Dec 12, 2018

@naveen-rn @manjugv @jdinan

  1. I think there was a strong consensus that in the PCIe case you must use an atomic operation, and apparently this is how interconnects (on PCIe) implement this. I'm not aware of any 32-bit word write guarantees. Can you please point out the page in the spec that guarantees this? The 64-bit comment was made about a very specific implementation of a PCIe root complex. Overall we should not rely on this.

We discussed these issues with rdma-core and UCX, and there is consensus that if users want to observe the full update they have to use atomic operations.

  2. shmem_wait_until - as noted above, atomics have to be used for full observability; otherwise there are no guarantees.

  3. There was a consensus that shmem_wait_until must be used. Apparently none of the interconnect standards has control over the order in which data will be delivered into memory. If you don't use shmem_wait, the code is not portable.

@naveen-rn (Contributor)

> I'm not aware of any 32-bit word write guarantees.

@shamisp This statement was based on my understanding from the discussion. I might have stated it incorrectly.

WRT this proposal, if I understand correctly, it looks to me like explicitly saying that the signal update is an atomic update is unnecessary. It is essentially an implementation detail; it can be anything.

But since we have explicitly included put-with-signal in the point-to-point sync routine explanation as follows, together with the "Note to implementors" section in shmem_wait_until, we can imply that implementations are expected to make sure the signal update is fully complete.

> The point-to-point synchronization routines can be used to portably ensure that memory access operations observe remote updates in the order enforced by the initiator PE using the put-with-signal, shmem_fence, and shmem_quiet routines.

Let me know if this looks correct.

@shamisp (Contributor) commented Dec 12, 2018

> > I'm not aware of any 32-bit word write guarantees.
>
> @shamisp This statement was based on my understanding from the discussion. I might have stated it incorrectly.

Ah. This was noted in the context of a "reasonable" implementation of a PCIe complex (but is not required by the spec). It makes sense to align the PCIe RC implementation with the CPU bus implementation. Even then you can run into weird alignment requirements, etc.

> WRT this proposal, if I understand correctly, it looks to me like explicitly saying that the signal update is an atomic update is unnecessary. It is essentially an implementation detail; it can be anything.

It really depends on what we want to claim about observability. If a partial update is allowed, it can be implemented as anything. It was mentioned that it is most useful to have the signal observed as a full value. As a result, it was suggested to define it as an atomic update.

> But since we have explicitly included put-with-signal in the point-to-point sync routine explanation as follows, together with the "Note to implementors" section in shmem_wait_until, we can imply that implementations are expected to make sure the signal update is fully complete.
>
> > The point-to-point synchronization routines can be used to portably ensure that memory access operations observe remote updates in the order enforced by the initiator PE using the put-with-signal, shmem_fence, and shmem_quiet routines.
>
> Let me know if this looks correct.

Correct - this is how it is defined now.
Since nobody is really polling on memory, I suggest bumping this up from the "Note" section to the main definition.

@naveen-rn (Contributor)

@shamisp @nspark
Some updates based on the RMA WG discussions on 03-Jan-2019. This clarifies some of the questions raised during the UCX working group meeting.

From the target side perspective:
We don't restrict users to using just the shmem_wait/test operations to read the signal update from put-with-signal. Instead, we update the following text from the p2p synchronization explanation in Section 9.9 to include put-with-signal alongside the existing references to fence and quiet. With this change, and the changes already available in OpenSHMEM 1.4 in the wait/test "Note to implementors" section, we tell users that if partial updates to the signal memory are not acceptable, they are expected to explicitly use shmem_wait/test operations to read signal updates on the receiver side.

> The point-to-point synchronization routines can be used to portably ensure that memory access operations observe remote updates in the order enforced by the initiator PE using the put-with-signal, shmem_fence, and shmem_quiet routines.
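
On the target side, the portable pattern would then look like the following sketch (assuming a symmetric uint64_t signal word and the uint64 wait variant from OpenSHMEM 1.4; names are illustrative):

```c
#include <shmem.h>

static uint64_t sig = 0;  /* symmetric signal word targeted by the sender */

/* Receiver-side sketch: reading the signal through a point-to-point
 * synchronization routine is the portable way to avoid observing a
 * partial update; a plain load is not guaranteed on all networks. */
void consume_after_signal(uint64_t expected)
{
    shmem_uint64_wait_until(&sig, SHMEM_CMP_EQ, expected);
    /* The payload delivered by the put-with-signal is now safe to read. */
}
```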

From the initiator side perspective:
The initiator has to guarantee that the target-side PE doesn't observe a partial update on the signal buffer through a shmem_wait/test operation. On a PCIe-based implementation, to guarantee these semantics the signal update must be atomic. But this is an implementation detail, so we don't specifically mention what type of operation the signal update has to be. This is left implicit for implementors.

If we have to clarify this further, then @anshumang's broader memory model proposal can handle it.

Also, there was another overall question from @manjugv comparing put-with-signal and put-with-inc. To elaborate on why we introduced the put-with-signal operation in this proposal rather than a put-with-inc operation:

  1. Based on @jdinan's statement, there are separate use cases for both operations. put-with-signal can be used in cases where the ability to pass extra information in the signal is useful; for example, he cited the ISx kernel as a use case.
  2. The other major point is that put-with-signal was introduced first because Cray has had an implementation of this feature in Cray SHMEM for a very long time, and users are making use of it. In the future, based on further requirements and usage models, we could extend this and add separate APIs for different variants of put-with-inc.

@shamisp (Contributor) commented Jan 4, 2019

> @shamisp @nspark
> Some updates based on the RMA WG discussions on 03-Jan-2019. This clarifies some of the questions raised during the UCX working group meeting.
>
> From the target side perspective:
> We don't restrict users to using just the shmem_wait/test operations to read the signal update from put-with-signal.

IMHO this is a big no-no and breaks on our (and many other) architectures.
(Putting aside that our wait/test definition is broken.)

> From the initiator side perspective:
> The initiator has to guarantee that the target-side PE doesn't observe a partial update on the signal buffer through a shmem_wait/test operation. On a PCIe-based implementation, to guarantee these semantics the signal update must be atomic. But this is an implementation detail, so we don't specifically mention what type of operation the signal update has to be. This is left implicit for implementors.
>
> If we have to clarify this further, then @anshumang's broader memory model proposal can handle it.

We are trying to describe an atomic operation without using the word "atomic", which is entertaining.
My preference is to define the signal as atomic and let implementations/hardware (PCIe, NIC, etc.) implement atomics as they want, not the other way around. Once you define the operation as atomic, it also becomes shmem_ptr compatible, since it maps onto native atomic operations.

@bcernohous (Author) commented Jan 4, 2019 via email

@naveen-rn (Contributor)

Based on some offline discussions and comments from @shamisp and others, it looks like we need further clarification from the OpenSHMEM committee on the following item:

The changes in the current proposal in content/p2p_sync_intro.tex (link) for reading the signal update on the target side through shmem_wait/test routines are complete. They satisfy the required partial/full update semantics on the signal without modifying the current p2p synchronization semantics. I suppose we could consider this part settled.

But the source-side semantics leave the atomicity guarantees on the signal update implicit; we felt this is an implementation detail. Do we need to explicitly state that this signal update is equivalent to OpenSHMEM atomics? We need some input from the OpenSHMEM committee regarding this change.

@jdinan (Collaborator) commented Jan 17, 2019

Sorry that I missed this discussion. For the most part, I think we're on the right track, with the exception of specifying that the signal update must be an atomic operation. Consider OpenSHMEM implementations that use only shared memory, and implementations that use PCIe alternatives like Gen-Z, CCIX, NVLink, etc. Does specifying that the signal update must be an atomic operation hurt the performance of these implementations?

@nspark (Contributor) commented Feb 28, 2019

@naveen-rn As I mentioned on today's RMA WG call, I/we have a strong desire to add specifiable operations for the signal update (specifically, set/store/write and add).

Such an API could look like:

```c
typedef enum {
  SHMEM_OP_SET,
  SHMEM_OP_ADD,
} shmem_op_t;

void shmem_put_signal(TYPE *dest, const TYPE *source, size_t nelems,
                      uint64_t *sig_addr, shmem_op_t op, uint64_t signal, int pe);

void shmem_put_signal(shmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems,
                      uint64_t *sig_addr, shmem_op_t op, uint64_t signal, int pe);
```

This would capture both the put-with-signal and counting puts (ref:PDF) use-cases in a single API.

(Note, I've suggested shmem_op_t for consideration here too: gmegan#84 (comment))
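
As a usage sketch of this proposed signature, counting puts could look like the following (the API, SHMEM_OP_ADD, and all variable names follow the proposal above and are hypothetical; counter is a symmetric uint64_t initialized to 0):

```c
/* Each sender: deliver its chunk and bump the remote counter.
 * Signature and SHMEM_OP_ADD are from the proposal above (hypothetical). */
shmem_put_signal(dest + my_offset, my_chunk, chunk_nelems,
                 &counter, SHMEM_OP_ADD, 1, target_pe);

/* Receiver: all chunks have landed once every sender has signaled
 * (assuming the uint64_t wait variant from OpenSHMEM 1.4). */
shmem_uint64_wait_until(&counter, SHMEM_CMP_EQ, nsenders);
```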

@naveen-rn (Contributor) commented Feb 28, 2019

I think the proposed change is implementable (efficiently); the only catch is that the operations are atomic only with respect to other operations with the same op code. Two operations with different op codes are not expected to be atomic with respect to each other.

I hope this is an acceptable limitation.

@nspark (Contributor) commented Mar 1, 2019

Yes, I think such a limitation is acceptable. At least for now, I'm only proposing SET and ADD operations. Whether these are mutually atomic or not, they're not commutative operations, so it would have to be a very specific case that would concurrently update the same signal location with both operations and have some reasonable behavior. Precluding such a specific case isn't a concern to me.

@jamesaross (Contributor)

The 1.4 specification presently states that atomicity is by datatype, not operation:

> 3.1 Atomicity Guarantees
> OpenSHMEM contains a number of routines that operate on symmetric data atomically (Section 9.7). These routines guarantee that accesses by OpenSHMEM’s atomic operations with the same datatype will be exclusive, but do not guarantee exclusivity in combination with other routines, either inside OpenSHMEM or outside.

I don't recall the logic for this decision, but @naveen-rn seems to suggest the opposite. Maybe section 3.1 should be revisited?

@naveen-rn (Contributor) commented Apr 12, 2019

Based on recent discussions in the RMA WG and with @jdinan, @bcernohous, and @davidozog:

To efficiently support put-with-signal with both SET and ADD operations, and also to read the signal updates, we could use one of the following semantic options.

SEMANTIC OPTION: 1

Sender Side

  1. put-with-signal should provide only the following guarantees:
  • It is compatible with wait/test - all point-to-point synchronization APIs.
  • Partial updates and other definitions should be handled as part of the wait/test clarification changes taken up by @jdinan in Wait/Test Clarifications #267. Irrespective of the type of change in #267, put-with-signal should work with wait/test; this is the only guaranteed usage model.
  2. Concurrent accesses to the signal address buffer by different put-with-signal operations using the same signal update operator are guaranteed not to introduce any partial updates.
  3. The put-with-signal API would be as follows:

```c
void put_with_signal(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
```

where the sig_op operator would be either the constant SHMEM_SIGNAL_ADD or SHMEM_SIGNAL_SET.

Receive Side

  1. As put-with-signal is compatible only with wait/test, the only way to read the signal updates is through the wait/test APIs. To accommodate the above semantics, we need to update the current shmem_wait_until API to return the value of the signal buffer, as follows:

```c
uint64_t shmem_wait_until(uint64_t *ivar, uint64_t cmp, uint64_t cmp_value);
```

  2. If users require the new shmem_wait_until_all, shmem_wait_until_any, and shmem_wait_until_some APIs to be updated to include a vector argument to return the value, we could accommodate those changes.

Under these semantics:

  1. The interaction between put-with-signal operations with different operators on the same signal buffer is not defined.
  2. The interaction between put-with-signal and other RMA/AMO/load/store operations is not defined.

Both of these might introduce a data race. The safe way is to use only the updated wait operations (see the sketch below).
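
A sketch of how Option 1's receive side would read the signal race-free, using the value-returning wait proposed above (hypothetical signature, not current spec; sig is a symmetric uint64_t):

```c
/* Option 1 sketch: the wait itself returns the observed signal value,
 * so no separate (racy) load of sig is needed. */
uint64_t observed = shmem_wait_until(&sig, SHMEM_CMP_NE, 0);
/* 'observed' holds the full signal value seen when the condition held. */
```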

SEMANTIC OPTION: 2

  1. Same as SEMANTIC OPTION: 1, except that instead of updating the wait routines we introduce a new shmem_fetch_signal API. I would prefer this change, as otherwise the wait APIs need unnecessary modification and their data-type variants would unnecessarily need to return values. (Remember, the signal is always of type uint64_t.)
  2. put-with-signal will be compatible with both wait/test and shmem_fetch_signal (see the sketch below).
  • shmem_fetch_signal is a scalar API; the need for a vector API is questionable. There are no obvious optimizations from introducing a vector variant - users could implement it themselves.
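
A corresponding sketch for Option 2, where the wait APIs stay unchanged and the proposed (hypothetical) shmem_fetch_signal provides the race-free read:

```c
/* Option 2 sketch: detect the change with the existing wait, then read
 * the full value with the proposed shmem_fetch_signal (hypothetical). */
shmem_uint64_wait_until(&sig, SHMEM_CMP_NE, 0);
uint64_t observed = shmem_fetch_signal(&sig);
```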

SEMANTIC OPTION: 3

We would prefer not to go with this design; it is included just for the sake of completeness. Make signal updates atomic with respect to other shmem_atomic operations. This semantic has performance issues and completely removes possibilities for future hardware optimization.

  • Future hardware may optimize SHMEM_SIGNAL_ADD without being compatible with shmem RMW atomics.

I would prefer to get some feedback from @nspark, @shamisp, @manjugv, @anshumang, and others on the proposed semantic options, and then work on the text changes based on some consensus.

NEED FOR THIS CHANGE

We are trying to introduce this change mainly to accommodate efficient use of the SHMEM_SIGNAL_ADD operation.

  1. With the SHMEM_SIGNAL_SET operation, the likelihood of using the same signal address buffer for two different concurrent signaling operations is very low. In fact, such usage looks erroneous and would definitely introduce a data race. So, when a wait observes a change in the signal address buffer, the receive side is free to read the change using any available option - a load, an atomic load, or shmem_atomic_fetch - depending on the platform.
  2. With the SHMEM_SIGNAL_ADD operation, using the same signal address buffer for two concurrent signaling operations is doable and seems useful. Though there is still a race, some additive use cases could benefit from this model. So, to read the signal update without any race, we need this change.
