
Are reductions as safe as intended? #74

Closed
krzikalla opened this issue Jun 6, 2017 · 14 comments

@krzikalla

Hi all,

consider the following (pseudo) code running on two PEs:

int reduction_arg = 1, dest = 0;
int just_two = 2;
shmem_int_sum_to_all(&dest, &reduction_arg, 1, ...);
if (ownPE == 1)
  shmem_int_put(&reduction_arg, &just_two, 1, 0);
else
  printf("%d", dest);

Will this always print 2 according to the spec? Or might it print 3 sometimes?
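For reference, here is one way the pseudocode above could be fleshed out into a complete OpenSHMEM program (a sketch only; the elided active-set arguments and the `pWrk`/`pSync` setup are filled in here as assumptions, following the OpenSHMEM 1.3 C API, and it must be launched on 2 PEs, e.g. with `oshrun -np 2`):

```c
#include <stdio.h>
#include <shmem.h>

/* Symmetric data objects: visible to remote puts and to the reduction. */
static int reduction_arg = 1, dest = 0;
static int pWrk[SHMEM_REDUCE_MIN_WRKDATA_SIZE];
static long pSync[SHMEM_REDUCE_SYNC_SIZE];

int main(void) {
    shmem_init();
    for (int i = 0; i < SHMEM_REDUCE_SYNC_SIZE; i++)
        pSync[i] = SHMEM_SYNC_VALUE;
    shmem_barrier_all();          /* ensure pSync is initialized on all PEs */

    int just_two = 2;
    shmem_int_sum_to_all(&dest, &reduction_arg, 1,
                         0, 0, shmem_n_pes(), pWrk, pSync);
    if (shmem_my_pe() == 1)
        shmem_int_put(&reduction_arg, &just_two, 1, 0); /* put to PE 0 */
    else
        printf("%d\n", dest);

    shmem_finalize();
    return 0;
}
```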

Consider the following scenario: both PEs enter the reduction at nearly the same time. At the start of the reduction processing they send the value of reduction_arg (1) to the respective other PE.
Then, for some reason, PE 0 is delayed. Meanwhile, PE 1 receives the value of PE 0, adds it to its own reduction_arg, stores it to dest and thus can complete the reduction and leave. Afterwards it puts 2 in the reduction_arg of PE 0. This seems to constitute a race, because if PE 0 now resumes its execution, it finds a value of 2 in its own reduction_arg, which it then uses to calculate the result.
Storing the original values would help, but the worker array is too small for this.

Is there something I have missed in the spec?

Thank you for any clarification
Olaf Krzikalla

@jeffhammond

Reductions return the result at all PEs, so every PE must wait to return until the result is generated. Thus, it is impossible for the shmem_int_put to occur before dest is written with the result of the collective.

@krzikalla
Author

So there is an implicit barrier at the end of every reduction? Shouldn't this be stated somewhere in the spec, so that the potential implementation space is narrowed accordingly?

@jdinan
Collaborator

jdinan commented Jun 6, 2017 via email

@naveen-rn
Contributor

I'm not sure whether the 1.4 spec change resolves this issue. The new change only seems to clarify use of the buffers by the local PE.

Thus, it is impossible for the shmem_int_put to occur before dest is written with the result of the collective.

In the 2-PE example, this is correct. But say we have 4 PEs participating in the reduction, and PE-1 and PE-3 return from the reduction while PE-0 and PE-2 are still computing it. Then PE-1 or PE-3 can still modify the source/dest buffers on PE-0/PE-2. As far as I understand, there is no implicit barrier after the all-reduce; it is the user's responsibility to add an active-set-based barrier to get this usage to work.
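The user-side fix described here could look roughly like this (a sketch; it assumes the surrounding reduction setup and a separate `barrier_pSync` array, since the OpenSHMEM spec requires distinct `pSync` arrays for back-to-back collectives):

```c
/* barrier_pSync is a symmetric array of SHMEM_BARRIER_SYNC_SIZE longs,
   initialized to SHMEM_SYNC_VALUE on every PE before first use. */
shmem_int_sum_to_all(&dest, &reduction_arg, 1,
                     0, 0, shmem_n_pes(), pWrk, reduce_pSync);

/* Barrier over the same active set: no PE passes this point until
   every PE has exited the reduction, so remote source/dest buffers
   are no longer in use by the collective. */
shmem_barrier(0, 0, shmem_n_pes(), barrier_pSync);

if (shmem_my_pe() == 1)
    shmem_int_put(&reduction_arg, &just_two, 1, 0); /* now safe */
```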

@jdinan
Collaborator

jdinan commented Jun 6, 2017

Ahh, I didn't read the example closely enough. Yes, according to the specification that is a race. Completion of the reduction at PE 1 does not guarantee completion at any other PE. This is a race even for two PEs, since OpenSHMEM does not define an ordering between operations performed by PE 1 within the reduction (e.g., which could be bounce buffered and converted to non-blocking) and the subsequent put.

@naveen-rn
Contributor

This is a race even for two PEs

Yes, you are correct - even for 2 PEs this is a race.

@jeffhammond

jeffhammond commented Jun 7, 2017 via email

@krzikalla
Author

@jeffhammond: If the implementation of a reduction computes it at a particular place, then it's actually an all-to-one followed by a one-to-all data dependency, isn't it? IMHO even then one PE could receive the result and proceed before another PE does.

@jdinan: This clarification is a good start. I think a statement about remote memory accesses is still needed. Something like this:

"Accessing memory involved in a collective routine while the PE is processing that collective results in undefined behavior. Since PEs can enter and exit collectives at different times, accessing such memory remotely requires some additional synchronization with the corresponding remote PE."

I guess someone can rephrase it better.

@jdinan
Collaborator

jdinan commented Jun 7, 2017

@jeffhammond Exit from all-reduce implies that all processes have reached the call to all-reduce, but it doesn't carry the barrier guarantee of ordering/completion of pending RMA operations.

@krzikalla This would be a good change. We likely need similar verbiage for all of the collectives.

@jeffhammond

@jdinan I meant execution barrier in the abstract sense of synchronizing all processes, not shmem_barrier. My previous post has been edited for clarity.

Do you really think we need to explicitly tell users that reductions do not synchronize RMA? If we are going to clarify anything, we should list all of the (few) operations that actually synchronize RMA, rather than note ones that do not.

AFAIK, the operations that remotely synchronize RMA in some way are shmem_fence, shmem_quiet, and shmem_barrier(_all).
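To illustrate the distinction between these three (a sketch; `a`, `b`, `x`, `y`, and `peer` are placeholder symmetric variables and a target PE, not from the thread):

```c
shmem_int_put(&a, &x, 1, peer);
shmem_fence();        /* ordering only: a later put to peer cannot be
                         delivered before this one, but neither is
                         guaranteed complete yet */
shmem_int_put(&b, &y, 1, peer);
shmem_quiet();        /* completion: both puts are now visible at peer */
shmem_barrier_all();  /* quiet semantics plus process synchronization:
                         all PEs have reached this point */
```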

@jdinan
Collaborator

jdinan commented Jun 7, 2017

I would hope that's not necessary. I think the change that @krzikalla suggested, which clarifies that no completion guarantees are made with respect to remote buffers, should cover it.

@jdinan jdinan added this to the OpenSHMEM 1.5 milestone Jan 31, 2020
@jdinan
Collaborator

jdinan commented Jan 31, 2020

Collectives section committee, please review and determine if any clarifications should be added.

@davidozog
Collaborator

I think @krzikalla's clarifying statement is a very good one. I'd like to propose a few minor edits if that's ok:

Accessing symmetric memory involved in a collective routine while the PE
is processing that collective results in undefined behavior. Since PEs can
enter and exit collectives at different times, accessing such memory remotely
may require some additional synchronization between communicating PEs.

"symmetric" memory because that's the problem this statement is tackling (right...?).
"may" because some implementation-specific collective algorithms are indeed synchronizing.
The last sentence was a little hard for me to parse as originally written, but I'm not quite sure the above is much of an improvement... 🤷‍♂

nspark added a commit to nspark/specification that referenced this issue Feb 6, 2020
@nspark
Contributor

nspark commented Mar 10, 2020

Closed by 1.5rc1
