
Are reductions as safe as intended? #74

Closed
krzikalla opened this issue Jun 6, 2017 · 14 comments

@krzikalla

Hi all,

consider the following (pseudo) code running on two PEs:

int reduction_arg = 1, dest = 0;
int just_two = 2;
shmem_int_sum_to_all(&dest, &reduction_arg, 1, ...);
if (ownPE == 1)
  shmem_int_put(&reduction_arg, &just_two, 1, 0);
else
  printf("%d", dest);

Will this always print 2 according to the spec? Or might it print 3 sometimes?
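For reference, here is one way the pseudocode above could be fleshed out into a complete OpenSHMEM program (a sketch only; the elided active-set arguments and the `pWrk`/`pSync` setup are filled in here as assumptions, following the OpenSHMEM 1.3 C API, and it must be launched on 2 PEs, e.g. with `oshrun -np 2`):

```c
#include <stdio.h>
#include <shmem.h>

/* Symmetric data objects: visible to remote puts and to the reduction. */
static int reduction_arg = 1, dest = 0;
static int pWrk[SHMEM_REDUCE_MIN_WRKDATA_SIZE];
static long pSync[SHMEM_REDUCE_SYNC_SIZE];

int main(void) {
    shmem_init();
    for (int i = 0; i < SHMEM_REDUCE_SYNC_SIZE; i++)
        pSync[i] = SHMEM_SYNC_VALUE;
    shmem_barrier_all();          /* ensure pSync is initialized on all PEs */

    int just_two = 2;
    shmem_int_sum_to_all(&dest, &reduction_arg, 1,
                         0, 0, shmem_n_pes(), pWrk, pSync);
    if (shmem_my_pe() == 1)
        shmem_int_put(&reduction_arg, &just_two, 1, 0); /* put to PE 0 */
    else
        printf("%d\n", dest);

    shmem_finalize();
    return 0;
}
```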

Consider the following scenario: both PEs enter the reduction at nearly the same time. At the start of the reduction processing they send the value of reduction_arg (1) to the respective other PE.
Then, for some reason, PE 0 is delayed. Meanwhile, PE 1 receives the value of PE 0, adds it to its own reduction_arg, stores it to dest and thus can complete the reduction and leave. Afterwards it puts 2 in the reduction_arg of PE 0. This seems to constitute a race, because if PE 0 now resumes its execution, it finds a value of 2 in its own reduction_arg, which it then uses to calculate the result.
Storing the original values would help, but the worker array is too small for this.

Is there something I have missed in the spec?

Thank you for any clarification
Olaf Krzikalla

@jeffhammond

Reductions return the result at all PEs, so every PE must wait to return until the result is generated. Thus, it is impossible for the shmem_int_put to occur before dest is written with the result of the collective.

@krzikalla
Author

So there is an implicit barrier at the end of every reduction? Shouldn't this be stated somewhere in the spec, so that the potential implementation space is narrowed accordingly?

@jdinan
Collaborator

jdinan commented Jun 6, 2017 via email

@naveen-rn
Contributor

I'm not sure whether the 1.4 spec change resolves this issue. The new change only seems to clarify use of the buffers by the local PE.

Thus, it is impossible for the shmem_int_put to occur before dest is written with the result of the collective.

In the 2-PE example, this is correct. But say we have 4 PEs participating in the reduction, and PE-1 and PE-3 return from the reduction while PE-0 and PE-2 are still computing it. Then PE-1 or PE-3 can still modify the source/dest buffers on PE-0/PE-2. As far as I understand, there is no implicit barrier after the all-reduce; it is the user's responsibility to add an active-set-based barrier to get this usage to work.
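The user-side fix described here could look roughly like this (a sketch; it assumes the surrounding reduction setup and a separate `barrier_pSync` array, since the OpenSHMEM spec requires distinct `pSync` arrays for back-to-back collectives):

```c
/* barrier_pSync is a symmetric array of SHMEM_BARRIER_SYNC_SIZE longs,
   initialized to SHMEM_SYNC_VALUE on every PE before first use. */
shmem_int_sum_to_all(&dest, &reduction_arg, 1,
                     0, 0, shmem_n_pes(), pWrk, reduce_pSync);

/* Barrier over the same active set: no PE passes this point until
   every PE has exited the reduction, so remote source/dest buffers
   are no longer in use by the collective. */
shmem_barrier(0, 0, shmem_n_pes(), barrier_pSync);

if (shmem_my_pe() == 1)
    shmem_int_put(&reduction_arg, &just_two, 1, 0); /* now safe */
```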

@jdinan
Collaborator

jdinan commented Jun 6, 2017

Ahh, I didn't read the example closely enough. Yes, according to the specification that is a race. Completion of the reduction at PE 1 does not guarantee completion at any other PE. This is a race even for two PEs, since OpenSHMEM does not define an ordering between operations performed by PE 1 within the reduction (e.g., which could be bounce buffered and converted to non-blocking) and the subsequent put.

@naveen-rn
Contributor

This is a race even for two PEs

Yes, you are correct - even for 2 PEs this is a race.

@jeffhammond

jeffhammond commented Jun 7, 2017 via email

@krzikalla
Author

@jeffhammond: If the implementation of a reduction computes it at a particular place, then it's actually an all-to-one followed by a one-to-all data dependency, isn't it? IMHO even then one PE could receive the result and proceed before another PE does.

@jdinan: This clarification is a good start. I think a statement about remote memory accesses is still needed. Something like this:

"Accessing memory involved in a collective routine while the PE is processing that collective results in undefined behavior. Since PEs can enter and exit collectives at different times, accessing such memory remotely requires some additional synchronization with the corresponding remote PE."

I guess someone can rephrase it better.

@jdinan
Collaborator

jdinan commented Jun 7, 2017

@jeffhammond Exit from all-reduce implies that all processes have reached the call to all-reduce, but it doesn't carry the barrier guarantee of ordering/completion of pending RMA operations.

@krzikalla This would be a good change. We likely need similar verbiage for all of the collectives.

@jeffhammond

@jdinan I meant execution barrier in the abstract sense of synchronizing all processes, not shmem_barrier. My previous post has been edited for clarity.

Do you really think we need to explicitly tell users that reductions do not synchronize RMA? If we are going to clarify anything, we should list all of the (few) operations that actually synchronize RMA, rather than note ones that do not.

AFAIK, the operations that remotely synchronize RMA in some way are shmem_fence, shmem_quiet, and shmem_barrier(_all).
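To illustrate the distinction between these three (a sketch; `a`, `b`, `x`, `y`, and `peer` are placeholder symmetric variables and a target PE, not from the thread):

```c
shmem_int_put(&a, &x, 1, peer);
shmem_fence();        /* ordering only: a later put to peer cannot be
                         delivered before this one, but neither is
                         guaranteed complete yet */
shmem_int_put(&b, &y, 1, peer);
shmem_quiet();        /* completion: both puts are now visible at peer */
shmem_barrier_all();  /* quiet semantics plus process synchronization:
                         all PEs have reached this point */
```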

@jdinan
Collaborator

jdinan commented Jun 7, 2017

I would hope that's not necessary. I think the change that @krzikalla suggested, which clarifies that no completion guarantees are made with respect to remote buffers, should cover it.

@jdinan jdinan added this to the OpenSHMEM 1.5 milestone Jan 31, 2020
@jdinan
Collaborator

jdinan commented Jan 31, 2020

Collectives section committee, please review and determine if any clarifications should be added.

@davidozog
Collaborator

I think @krzikalla's clarifying statement is a very good one. I'd like to propose a few minor edits if that's ok:

Accessing symmetric memory involved in a collective routine while the PE
is processing that collective results in undefined behavior. Since PEs can
enter and exit collectives at different times, accessing such memory remotely
may require some additional synchronization between communicating PEs.

"symmetric" memory because that's the problem this statement is tackling (right...?).
"may" because some implementation-specific collective algorithms are indeed synchronizing.
The last sentence was a little hard for me to parse as originally written, but I'm not quite sure the above is much of an improvement... 🤷‍♂

nspark added a commit to nspark/specification that referenced this issue Feb 6, 2020
@nspark
Contributor

nspark commented Mar 10, 2020

Closed by 1.5rc1
