shmem_put_signal #206
Comments
I just realized this after today's OpenSHMEM spec meeting and @jdinan's active set changes discussion: if we haven't discussed put_with_signal with a put size of 0, we should have a quick discussion on it. |
I think that a |
+1 |
@bcernohous I talked with @shefty about implementation and he indicated that |
Why not? Does FI_FENCE only guarantee FI_TRANSMIT_COMPLETE? It doesn't specify that. What if you are using FI_DELIVERY_COMPLETE on the operations? If FI_DELIVERY_COMPLETE doesn't guarantee visibility, then nothing does, and it's not really an FI_FENCE issue. FI_DELIVERY_COMPLETE FI_FENCE |
FI_FENCE provides an ordering guarantee at the target for operations that aren't usually ordered (e.g. issuing an RDMA write after an RDMA read), but does not specify anything about how the data must be written by a NIC into host memory. I think these are the options to support the desired shmem semantics. 1.) Submit 2 writes, both with FI_DELIVERY_COMPLETE and FI_FENCE in between. 2.) Use FI_ORDER_DATA. Part of the issue with the shmem semantics is that the most widely deployed RDMA NICs support only FI_INJECT_COMPLETE (iWarp) or FI_TRANSMIT_COMPLETE (IB/RoCE) completion semantics without software intervention. And architecturally, they provide no guarantees on how data must be written into memory (no FI_ORDER_DATA). Polling on memory is in general not supported by these architectures. But FI_ORDER_DATA was included by OFI for this use case. |
That was my point, made more clearly. I agree. And 2) is true too if you want ordering.
I was aware of that but this discussion has opened questions in my mind. How do we complete fi_read()'s on these providers? The data has to be ready locally so we can't really be getting either FI_INJECT_COMPLETE/FI_TRANSMIT_COMPLETE on the cntr/cq after we issue fi_read. |
For operations that return data to the initiator (e.g. RMA reads), the default completion model is FI_DELIVERY_COMPLETE. But the data isn't guaranteed to be available until a completion has been generated. I.e. polling on memory still doesn't work, unless FI_ORDER_DATA has been set for the local endpoint. My understanding of the put_signal call is that it requires data ordering. FI_ORDER_DATA should provide better performance. Fencing is a heavy operation, as it basically requires flushing the transmit queue prior to the fenced operation beginning. |
@shefty Want to make sure I understand what you mean by "FI_FENCE in between". Does this mean that put with signal can be implemented as two fi_write operations with |
I meant the first operation is write + FI_FENCE | FI_DELIVERY_COMPLETE. The second operation is write + FI_DELIVERY_COMPLETE. I don't know if the second FI_DELIVERY_COMPLETE flag is strictly necessary. That depends on the semantic needed at the initiator side of the put_signal. |
I think the FI_FENCE is on the second write. The second write is deferred until previous writes are complete.
FI_FENCE
Applies to transmits. Indicates that the requested operation, also known as the fenced operation, and any operation posted after the fenced operation will be deferred until all previous operations targeting the same peer endpoint have completed. Operations posted after the fencing will see and/or replace the results of any operations initiated prior to the fenced operation.
The ordering of operations starting at the posting of the fenced operation (inclusive) to the posting of a subsequent fenced operation (exclusive) is controlled by the endpoint’s ordering semantics.
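To make the quoted FI_FENCE text concrete, here is a minimal toy model in plain C. It is not libfabric code: `model_op_t` and `model_may_start` are hypothetical names invented for this sketch, and it assumes a single transmit queue targeting a single peer endpoint. It only encodes the rule that a fenced operation, and anything posted after it, is deferred until every operation posted before the fence has completed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one posted operation on a transmit queue.
 * This is an illustration only, not a libfabric data structure. */
typedef struct {
    bool fenced;     /* operation was posted with a fence flag */
    bool completed;  /* operation has completed */
} model_op_t;

/* Returns true if operation i may begin under the quoted rule:
 * find the latest fenced operation at or before i; every operation
 * posted before that fence must already be complete. */
static bool model_may_start(const model_op_t *q, size_t i)
{
    assert(q != NULL);

    bool have_fence = false;
    size_t fence = 0;
    for (size_t k = 0; k <= i; k++) {
        if (q[k].fenced) {
            fence = k;
            have_fence = true;
        }
    }
    if (!have_fence)
        return true;  /* no fence applies: only endpoint ordering rules remain */

    for (size_t k = 0; k < fence; k++) {
        if (!q[k].completed)
            return false;  /* an earlier operation is still outstanding */
    }
    return true;
}
```

In this model, a data write at index 0 followed by a fenced signal write at index 1 means the signal write cannot begin until the data write completes, which matches the reading that the fence belongs on the second write.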
From: Sean Hefty [mailto:[email protected]]
Sent: Monday, November 12, 2018 2:06 PM
To: openshmem-org/specification <[email protected]>
Cc: Bob Cernohous <[email protected]>; Mention <[email protected]>
Subject: Re: [openshmem-org/specification] shmem_put_signal (#206)
I meant the first operation is write + FI_FENCE | FI_DELIVERY_COMPLETE. The second operation is write + FI_DELIVERY_COMPLETE. I don't know if the second FI_DELIVERY_COMPLETE flag is strictly necessary. That depends on the semantic needed at the initiator side of the put_signal.
|
Uhm - yeah, what Bob said. |
@shamisp @manjugv Based on my understanding from the UCX workshop discussions and other discussions with @jdinan, adding some extra clarifications:
|
We discussed this issue with rdma-core and UCX, and there is consensus that if users want to observe a full update, they have to use atomic operations.
|
@shamisp This statement was based on my understanding from the discussion. I might have stated this incorrectly. WRT this proposal, if I understand correctly, to me it looks like saying that the signal update is an atomic update explicitly is unnecessary. This is essentially an implementation detail. It can be anything. But as we have explicitly included the put-with-signal in the point-to-point sync routine explanation as follows and with the "Note to implementors" section in
Let me know, if this looks correct. |
Ah. This was noted in the context of a "reasonable" implementation of the PCIe complex (but not required by the spec). It makes sense to align the PCIe RC implementation with the CPU bus implementation. Even then you can run into weird alignment requirements, etc.
It really depends on what we want to claim about observability. If a partial update is allowed, it can be implemented as anything. It was mentioned that it is most useful to have the signal observed as a full value. As a result, it was suggested to define it as an atomic update.
Correct - this is how it is defined now. |
@shamisp @nspark From the target side perspective:
From the initiator side perspective: If we have to clarify this further, then @anshumang broader memory model proposal can handle it. Also, there was another overall question from @manjugv in comparing put-with-signal and put-with-inc. To elaborate why we introduced put-with-signal operation in this proposal rather than put-with-inc operation -
|
IMHO this is a big no-no and breaks for our (and many others') architectures.
We are trying to describe an atomic operation without using the word "atomic", which is entertaining. |
Once you define the operation as atomic it also makes it shmem_ptr compatible, since it maps on native atomic operations.
I don’t understand that comment. CPU atomics (over a shmem_ptr) are not compatible with network/shmem API atomics.
From: Pavel Shamis (Pasha) [mailto:[email protected]]
Sent: Friday, January 04, 2019 9:08 AM
To: openshmem-org/specification <[email protected]>
Cc: Bob Cernohous <[email protected]>; Mention <[email protected]>
Subject: Re: [openshmem-org/specification] shmem_put_signal (#206)
@shamisp<https://github.com/shamisp> @nspark<https://github.com/nspark>
Some updates based on the RMA WG discussions on 03-Jan-2019. This clarifies some of the questions raised during the UCX working group meeting.
From the target side perspective:
We don't restrict users to use just the shmem_wait/test operations alone to read the signal update from put-with-signal.
IMHO this is a big no-no and breaks for our (and many others') architectures.
(Putting aside that our wait/test definition is broken)
From the initiator side perspective:
The initiator has to guarantee that the target side PE doesn't observe a partial update on the signal buffer through shmem_wait/test operation. On a PCIe based implementation - to guarantee this semantics, the signal update must be atomic. But this is an implementation detail. Hence, we don't specifically mention the type of operation that the signal update has to be. This is implicit to the implementors.
If we have to clarify this further, then @anshumang<https://github.com/anshumang> broader memory model proposal can handle it.
We are trying to describe an atomic operation without using the word "atomic", which is entertaining.
My preference is to define the signal as atomic and let implementations/hardware (PCIe, NIC, etc.) implement atomics as they want, and not the other way around. Once you define the operation as atomic, it also makes it shmem_ptr compatible, since it maps onto native atomic operations.
|
Based on some offline discussions and comments from @shamisp and others - it looks like we need further clarifications from the OpenSHMEM committee on the following item: The changes in the current proposal in file But, the source side semantics to specifically state about the atomicity guarantees on the signal update is implicit. We felt this is an implementation detail. Do we need to explicitly state that this signal update is equivalent to OpenSHMEM atomics? We need some input from the OpenSHMEM committee regarding this change. |
Sorry that I missed this discussion. For the most part, I think we're on the right track, with the exception of specifying that the signal update must be an atomic operation. Consider OpenSHMEM implementations that use only shared memory and implementations that use PCIe alternatives like Gen-Z, CCIX, NVLink, etc. Does specifying that the signal update must be an atomic operation hurt performance of these implementations? |
@naveen-rn As I mentioned on today's RMA WG call, I/we have a strong desire to add specifiable operations for the signal update (specifically, set/store/write and add). Such an API could look like:
typedef enum {
    SHMEM_OP_SET,
    SHMEM_OP_ADD,
} shmem_op_t;
void shmem_put_signal(TYPE *dest, const TYPE *source, size_t nelems,
                      uint64_t *sig_addr, shmem_op_t op, uint64_t signal, int pe);
void shmem_put_signal(shmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems,
                      uint64_t *sig_addr, shmem_op_t op, uint64_t signal, int pe);
This would capture both the put-with-signal and counting puts (ref:PDF) use-cases in a single API. (Note, I've suggested |
I think the proposed change is implementable (efficiently) - the only catch is that, the operations are atomic only with respect to other operations with the same op_code. Two different operations with different op_codes are not expected to be atomic. I hope this is an acceptable limitation. |
Yes, I think such a limitation is acceptable. At least for now, I'm only proposing |
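As a rough illustration of what a specifiable signal operation could mean at the target, the sketch below applies SET or ADD to a 64-bit signal word using C11 atomics. The enum names come from the proposal above; `model_signal_update` is a hypothetical helper, and using CPU atomics here is an assumption of this sketch, not something the proposal or the specification requires.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

/* Operation names as suggested in the proposal above. */
typedef enum {
    SHMEM_OP_SET,
    SHMEM_OP_ADD,
} shmem_op_t;

/* Hypothetical helper: apply the signal update as one indivisible
 * 64-bit operation, so a reader never observes a partial value. */
static void model_signal_update(_Atomic uint64_t *sig_addr,
                                shmem_op_t op, uint64_t signal)
{
    assert(sig_addr != NULL);

    switch (op) {
    case SHMEM_OP_SET:
        /* put-with-signal style: overwrite the signal word */
        atomic_store_explicit(sig_addr, signal, memory_order_release);
        break;
    case SHMEM_OP_ADD:
        /* counting-put style: accumulate into the signal word */
        atomic_fetch_add_explicit(sig_addr, signal, memory_order_release);
        break;
    }
}
```

With CPU atomics, SET and ADD happen to be atomic with respect to each other. A network implementation may only guarantee atomicity among updates that use the same operation, which is the limitation discussed above.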
The 1.4 specification presently states that atomicity is by datatype, not operation:
I don't recall the logic for this decision, but @naveen-rn seems to suggest the opposite. Maybe section 3.1 should be revisited? |
Based on recent discussions in the RMA WG and with @jdinan, @bcernohous and @davidozog: to efficiently support put-with-signal with both SET and ADD operations, and also to read the signal updates, we could use one of the following semantics.
SEMANTIC OPTION: 1
Sender Side
Receive Side
In this semantics:
SEMANTIC OPTION: 2
SEMANTIC OPTION: 3
We would prefer not to go into this design, adding it just for the sake of completeness. Make signal updates atomic with respect to other
I would prefer to get some feedback from @nspark, @shamisp, @manjugv, @anshumang and others on the proposed semantic options, and then work on the text changes based on some consensus.
NEED FOR THIS CHANGE
We are trying to introduce this change mainly to accommodate the efficient use of
|
Discuss a proposal for shmem_put_signal.
Transfers contiguous data from a local data object to a specified processing element (PE) without blocking the caller and sets a remote flag to signal completion.
Cray's current implementation:
Semantically (almost) equivalent to:
shmem_put[_nb](data)
shmem_fence()
shmem_put[_nb](signal)
except a shmem_fence is 'heavier' than the shmem_put_signal, which only fences the data before the signal.
uint64_t.
Attaching slides for RMA WG 03 01 2018
https://github.com/openshmem-org/specification/wiki/RMA-WG-03-01-2018
OpenSHMEM_RMA_PUTSIGNAL.pptx
Edit (@nspark): Fixed some Markdown formatting issues.
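The put, fence, signal sequence described in this issue can be modeled within a single address space using C11 atomics. This is only a sketch of the intended ordering, not an OpenSHMEM implementation: `target_mem_t`, `model_put_signal` and `model_signal_test` are hypothetical names, and a release store paired with an acquire load stands in for "fence the data before the signal".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the target PE's symmetric memory. */
typedef struct {
    char data[64];            /* destination of the put */
    _Atomic uint64_t signal;  /* 64-bit signal word */
} target_mem_t;

/* Model of put-with-signal: write the payload, then publish the signal.
 * The release store orders the payload before the signal update only;
 * it does not order unrelated earlier operations. */
static void model_put_signal(target_mem_t *t, const void *src,
                             size_t nbytes, uint64_t sig)
{
    assert(t != NULL && nbytes <= sizeof t->data);

    if (nbytes > 0)
        memcpy(t->data, src, nbytes);  /* the put of the data */
    /* A zero-size put degenerates to just the signal update. */
    atomic_store_explicit(&t->signal, sig, memory_order_release);
}

/* Model of a target-side test: returns nonzero once the signal holds
 * the expected value, after which the payload is safe to read. */
static int model_signal_test(target_mem_t *t, uint64_t expected)
{
    assert(t != NULL);
    return atomic_load_explicit(&t->signal, memory_order_acquire) == expected;
}
```

A target would poll `model_signal_test` and read `data` only after it returns nonzero. Note that the zero-size case still delivers the signal in this model; whether the specification should require that is the open question raised in the comments.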