Re-incorporate the teams example/tests from OpenSHMEM specification #907

Merged: 10 commits, Dec 12, 2019

davidozog
Member

@davidozog davidozog commented Nov 19, 2019

This reverts commit 90886c8, which removed some teams-related tests that are based on the examples from the specification. This PR is intended to resolve the outstanding issues with these tests.


test/shmemx/shmemx_team_split_2D.c fails with 12 PEs in the split_2D call at line 32 with the following error:


[0008] WARN:  ../../src/shmem_team.c:246: shmem_internal_team_split_strided
[0008]        Invalid start, stride, or size in team_split operation
[0008] ERROR: ../../src/shmem_team.c:329: shmem_internal_team_split_2d
[0008]        x-axis 2D strided split failed
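
The log does not show which start, stride, or size was rejected. As a reference point, here is a minimal sketch of the triplets one would expect for each PE in a 2D split, assuming the row-major mapping (x, y) = (pe % xrange, pe / xrange). The names team_triplet, triplet_is_valid, x_axis_team, and y_axis_team are hypothetical and are not the SOS internals in src/shmem_team.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of a strided team: member i is parent PE
 * start + i * stride, for 0 <= i < size. */
typedef struct {
    int start;
    int stride;
    int size;
} team_triplet;

/* Assumed validity rule for a strided split of a parent with npes PEs:
 * every member must be an existing parent PE. */
static bool triplet_is_valid(team_triplet t, int npes)
{
    if (t.start < 0 || t.start >= npes) return false;
    if (t.stride < 1 || t.size < 1) return false;
    return t.start + (t.size - 1) * t.stride < npes;
}

/* x-axis team of PE `me`: the PEs in the same row of width xrange. */
static team_triplet x_axis_team(int me, int npes, int xrange)
{
    assert(xrange >= 1 && me >= 0 && me < npes);
    team_triplet t;
    t.start = (me / xrange) * xrange;
    t.stride = 1;
    /* The last row is short when npes is not a multiple of xrange. */
    t.size = (t.start + xrange <= npes) ? xrange : npes - t.start;
    return t;
}

/* y-axis team of PE `me`: the PEs in the same column, xrange apart. */
static team_triplet y_axis_team(int me, int npes, int xrange)
{
    assert(xrange >= 1 && me >= 0 && me < npes);
    team_triplet t;
    t.start = me % xrange;
    t.stride = xrange;
    /* Number of rows that still contain this column. */
    t.size = (npes - t.start + xrange - 1) / xrange;
    return t;
}
```

Under these assumptions, PE 8 of 12 with xrange 3 would get an x-axis triplet of (6, 1, 3) and a y-axis triplet of (2, 3, 4), both of which pass triplet_is_valid. Comparing such expected values against what the failing call actually computes may help narrow down the bug.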

test/shmemx/shmemx_team_context.c fails with:

$ mpiexec -np 4 test/shmemx/shmemx_team_context
Send to neighbor fail due to invalid context
Send to neighbor fail due to invalid context
Send to neighbor fail due to invalid context
Send to neighbor fail due to invalid context
Send to neighbor fail due to invalid context
Fail to translate pe 4 from 3s context to 2s context
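
For reference, a minimal model of what PE translation between two strided teams should produce, assuming each team is described by a (start, stride, size) triplet relative to the world team and that a failed translation yields -1. The type strided_team and the helper names are hypothetical; this is a sketch of the expected semantics, not the SOS implementation.

```c
#include <assert.h>

/* Hypothetical strided team: member i is world PE start + i * stride. */
typedef struct {
    int start;
    int stride;
    int size;
} strided_team;

/* Team-relative index -> world PE, or -1 if the index is out of range. */
static int team_to_world(strided_team t, int team_pe)
{
    if (team_pe < 0 || team_pe >= t.size) return -1;
    return t.start + team_pe * t.stride;
}

/* World PE -> team-relative index, or -1 if the PE is not a member. */
static int world_to_team(strided_team t, int world_pe)
{
    assert(t.stride >= 1);
    int offset = world_pe - t.start;
    if (offset < 0 || offset % t.stride != 0) return -1;
    int idx = offset / t.stride;
    return idx < t.size ? idx : -1;
}

/* Translate an index in src into the matching index in dest by going
 * through the world PE number; -1 if the PE is not in both teams. */
static int translate_pe(strided_team src, int src_pe, strided_team dest)
{
    int world_pe = team_to_world(src, src_pe);
    return world_pe < 0 ? -1 : world_to_team(dest, world_pe);
}
```

As an illustration with 12 PEs (not the 4-PE run above): for a team of the multiples of 2, {0, 2, 6}, and a team of the multiples of 3, {0, 3, 4}, index 2 of the 3s team is world PE 6, which is index 3 of the 2s team, while index 1 (world PE 3) is not in the 2s team and translates to -1.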

test/shmemx/shmemx_team_sync.c (and shmemx_reduce.c?) also appear to have some bugs.

Related issues:
openshmem-org/specification#302
openshmem-org/specification#313
#901

shmemx_wait_until_all \
shmemx_wait_until_any \
shmemx_wait_until_some \
shmemx_test_all \
Member Author

I don't believe shmemx_test_all is actually a part of the spec, if that matters...

Member Author

Also, it looks like shmemx_test_all is another case of using shmem_p instead of shmem_atomic_set.

@@ -31,26 +31,10 @@ check_PROGRAMS += \
atomic_nbi \
put_signal \
put_signal_nbi \
shmemx_wait_until_all \
Member Author

shmemx_wait_until_all looks like it's a little different than the spec example (it uses shmem_p instead of shmem_atomic_set). Do we want to update that in this PR or separately?

Member Author

Actually, it looks like all the wait/test any/all/some routines have this problem.

Member

This was updated upstream in the spec (wait/test clarification proposal). One motivation for separating out the spec tests is that this should make it easier for us to sync with the upstream sources.

@@ -31,26 +31,10 @@ check_PROGRAMS += \
atomic_nbi \
put_signal \
put_signal_nbi \
shmemx_wait_until_all \
shmemx_wait_until_any \
Member Author

shmemx_wait_until_any also looks like it uses shmem_p instead of shmem_atomic_set...

Member

This is fixed upstream in the OpenSHMEM spec. One reason for this change is to make it easier for us to pull in updates from the spec.

@@ -31,26 +31,10 @@ check_PROGRAMS += \
atomic_nbi \
put_signal \
put_signal_nbi \
shmemx_wait_until_all \
shmemx_wait_until_any \
shmemx_wait_until_some \
Member Author

shmemx_wait_until_some also looks like it uses shmem_p instead of shmem_atomic_set...

shmemx_wait_until_any \
shmemx_wait_until_some \
shmemx_test_all \
shmemx_test_any \
Member Author

shmemx_test_any is another case of using shmem_p instead of shmem_atomic_set...

shmemx_wait_until_some \
shmemx_test_all \
shmemx_test_any \
shmemx_test_some \
Member Author

shmemx_test_some is another case of using shmem_p instead of shmem_atomic_set.

(Yup, they all have that problem - my bad).

@wrrobin
Collaborator

wrrobin commented Nov 21, 2019

@davidozog Yes, I saw those differences, but I kept them here because there are enough similarities. Also, the text before the header suggests that these tests are derived from the spec. I am OK with keeping them at either location. Alternatively, we could keep two versions of these tests, or change them to match the spec.

@jdinan
Member

jdinan commented Nov 21, 2019

@wrrobin Can you check for updates here: https://github.com/openshmem-org/specification/tree/master/example_code

Also, I think a few of the test files have different names in SOS. We should fix this to match the spec, so it is easier for us to keep them in sync.

@wrrobin
Collaborator

wrrobin commented Nov 21, 2019

@jdinan I noticed the name differences, and I thought SOS has the better names. Is it OK to change the file names in the spec repo? For example, many of the file names contain "example", which is unnecessary.

@wrrobin
Collaborator

wrrobin commented Nov 21, 2019

@jdinan, I believe we should keep the exact same code as the spec, right? For example, here is the diff for shmem_ctx.c (lines marked < are from the spec, lines marked > are from SOS):

43c43
<     int tl, i;
---
>     int tl, i, ret;
49,50c49,60
<     shmem_init_thread(SHMEM_THREAD_MULTIPLE, &tl);
<     if (tl != SHMEM_THREAD_MULTIPLE) shmem_global_exit(1);
---
>     ret = shmem_init_thread(SHMEM_THREAD_MULTIPLE, &tl);
>
>     if (tl != SHMEM_THREAD_MULTIPLE || ret != 0) {
>         printf("Init failed (requested thread level %d, got %d, ret %d)\n",
>                SHMEM_THREAD_MULTIPLE, tl, ret);
>
>         if (ret == 0) {
>             shmem_global_exit(1);
>         } else {
>             return ret;
>         }
>     }
62,63c72,73
<             printf("%d: Error creating context (%d)\n", me, ret);
<             shmem_global_exit(2);
---
>             printf("%d: Warning, unable to create context (%d)\n", me, ret);
>             ctx = SHMEM_CTX_DEFAULT;
69c79
<             long task = shmem_atomic_fetch_inc(ctx, &task_cntr, task_pe);
---
>             long task = shmem_ctx_long_atomic_fetch_inc(ctx, &task_cntr, task_pe);
73c83
<                 task = shmem_atomic_fetch_inc(ctx, &task_cntr, task_pe);
---
>                 task = shmem_ctx_long_atomic_fetch_inc(ctx, &task_cntr, task_pe);
79c89
<         shmem_ctx_destroy(ctx);
---
>         if (ctx != SHMEM_CTX_DEFAULT) shmem_ctx_destroy(ctx);
84a95,97
>     if (me == 0 && result)
>         printf("Error: total_done is %ld, expected %ld\n", total_done, ntasks * npes);
>

It looks like the SOS version is better coded. Should we merge those changes into the spec as doc edits?

}
shmemx_team_sync(SHMEMX_TEAM_WORLD);
}

Collaborator

What is the failure scenario for this test apart from the npes check?

if (status[i]) num_ignored++; \
} \
} \
if (nelems == 0 || num_ignored == nelems) { \
Member

Do we have a unit test that exercises the num_ignored == nelems case?

Collaborator

Good question. I don't think so.

Collaborator

In such a unit test, we should just check whether wait_until_all returns, right? There is no need to check the completion of the AMO since the wait set is empty.

Looking at the spec, there is one sentence at the end of the API description for wait_until_all which may not hold true in these cases.

Implementations must ensure that shmem_wait_until_all does not return before the update of the memory indicated by ivars is fully complete.

Do we have to add an "unless" clause for the case where the status array does not include all PEs?

Member Author

Doesn't this sentence cover the issue?

If all elements in status are set to 1 or nelems is 0, the wait set is empty and this routine returns immediately.
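
A unit test for this case could start from a check like the following. This is a minimal sketch of the empty-wait-set rule quoted above, written as a standalone helper so it can be exercised without a running OpenSHMEM job. The names wait_set_is_empty and check_wait_set_cases are hypothetical, and treating a NULL status as "include every element" is an assumption about the mask semantics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when no element would be waited on: either nelems is 0
 * or every status entry is nonzero (excluded from the wait set). */
static bool wait_set_is_empty(const int *status, size_t nelems)
{
    if (nelems == 0) return true;
    if (status == NULL) return false; /* assumed: no mask, all included */

    size_t num_ignored = 0;
    for (size_t i = 0; i < nelems; i++) {
        if (status[i]) num_ignored++;
    }
    return num_ignored == nelems;
}

/* Covers the cases discussed in the review: nelems == 0, every element
 * ignored, a partially masked set, and no mask at all. */
static void check_wait_set_cases(void)
{
    int all_ignored[4] = {1, 1, 1, 1};
    int some_ignored[4] = {1, 0, 1, 0};

    assert(wait_set_is_empty(all_ignored, 0));
    assert(wait_set_is_empty(all_ignored, 4));
    assert(!wait_set_is_empty(some_ignored, 4));
    assert(!wait_set_is_empty(NULL, 4));
}
```

A real test would additionally call the wait routine with an all-ones status array and confirm only that it returns, since there is no update to wait for in that case.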

Collaborator

Yes, it does. But the last sentence in the API description does not cover the special case. I guess it was confusing to me because that line is written as a separate paragraph. It might be easier to merge this sentence into the end of the first paragraph.

Member Author

Ah, I see the confusion. It might be correct, since there is no "update of the memory" in this special case right? But yeah, it could use some clarification - if I understand correctly, your suggestion could be a doc edit.

Collaborator

Yes, a doc edit should definitely be sufficient. Similar edits may apply to the any/some/vector APIs as well.

Member

@jdinan jdinan left a comment

@wrrobin Could you separate out moving files and merging upstream changes? Many of the upstream changes delete local changes that we need to keep (contexts, C99 compatibility, and error checking). I'd like to review those changes closely.

test/spec-example/shmem_ctx.c (outdated, resolved)
test/spec-example/shmem_ctx.c (outdated, resolved)
test/spec-example/shmem_ctx_pipelined_reduce.c (outdated, resolved)
@wrrobin
Collaborator

wrrobin commented Dec 4, 2019

@jdinan To keep these checks, do we want these tests in both directories, then? The one in spec_example would be exactly the same as in the spec, and the other one in unit would have all the checks. I had a similar doubt about the wait/test unit tests, so I did not remove their extra checks and kept them different from the spec examples.

@jdinan
Member

jdinan commented Dec 5, 2019

If we want to run the examples in the spec verbatim, we can do that off of the spec repository. It's useful to maintain copies in the SOS repository in cases where we make changes to adapt the examples into proper unit tests.

Edit: Is it helpful to add a README file with the above info to the spec-example directory to explain the motivation and how we are maintaining these tests?

@wrrobin
Collaborator

wrrobin commented Dec 11, 2019

@jdinan @davidozog Let me know if the changes look good.

@@ -51,6 +51,8 @@ install -d %{testdir}/unit
install test/unit/.libs/* %{testdir}/unit
install -d %{testdir}/shmemx
install test/shmemx/.libs/* %{testdir}/shmemx
install -d %{testdir}/spec-example
install test/spec-example/.libs/* %{testdir}/spec-example
Member

I am impressed that you remembered to update the spec file! 💯

@jdinan
Member

jdinan commented Dec 11, 2019

@wrrobin Looks like there is a merge conflict that also needs to be resolved.

@wrrobin
Collaborator

wrrobin commented Dec 11, 2019

Apologies for multiple commits. Merge conflict is resolved now.

@jdinan
Member

jdinan commented Dec 12, 2019

@davidozog This looks good to me, please take a look and merge when you are ready.

@davidozog
Member Author

davidozog commented Dec 12, 2019

@jdinan @wrrobin - looks like shmem_test_all also needs the status array, correct?

Want me to make that update?

@wrrobin
Collaborator

wrrobin commented Dec 12, 2019

@davidozog Oh, I did not check that. Yes, please go ahead if you want to.

@jdinan
Member

jdinan commented Dec 12, 2019

Let's merge these changes and handle shmem_test_all as a separate PR.

@jdinan jdinan closed this Dec 12, 2019
@jdinan jdinan reopened this Dec 12, 2019
@jdinan
Member

jdinan commented Dec 12, 2019

Oops. 🤕

@davidozog
Member Author

@jdinan - ok, let me just finish my review, then I'll post the separate PR for shmem_test_all.

Member Author

@davidozog davidozog left a comment

Looks good!

@davidozog davidozog merged commit b3bac65 into Sandia-OpenSHMEM:master Dec 12, 2019