Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Signal operation type for put with signal #929

Merged
merged 10 commits into from
Feb 24, 2020

Conversation

wrrobin
Copy link
Collaborator

@wrrobin wrrobin commented Feb 6, 2020

No description provided.

shmem_transport_xpmem_get(&old_signal_val, sig_addr, sizeof(uint64_t), pe,
shmem_internal_get_shr_rank(pe));
signal += old_signal_val;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be an atomic operation, use shmem_shr_transport_atomic.

Copy link
Collaborator Author

@wrrobin wrrobin Feb 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jdinan shmem_shr_transport_atomic is enabled through USE_SHR_ATOMICS. Unless we use --enable-shr-atomics flag, we will not be able to use this API, right? Please let me know if I am missing something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the target is waiting for several messages using the same signal variable, this will not work, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, you would need to write code to support builds with and without shared memory atomics.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jdinan. One question, from the current code, there seems to be no relation between the flags USE_SHR_ATOMICS and USE_XPMEM/USE_CMA. Was there any particular reason behind this? I am thinking shouldn't USE_SHR_ATOMICS be defined when either of these on-node transports is defined?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For nearly all builds, we can't enable shared memory atomics because the networking layer is not coherent with processor atomics. By default, we only enable shared memory copy to implement put/get. This is why shared memory atomics are a separate option, and disabled by default. CMA is essentially memory copy performed by the Linux kernel and it cannot support atomics.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For CMA, we should directly use the transport layer atomics then, as you also suggested later. But, for XPMEM, can we say that enabling USE_XPMEM should enable USE_SHR_ATOMICS as well? We can use the shared memory atomics if shmem_internal_get_shr_rank returns something other than -1, right? But, currently, shmem_shr_transport_use_atomic also seems to be guarded by USE_SHR_ATOMICS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it is always safe to use shared memory put/get because OpenSHMEM says overlapping operations (e.g. between local shared memory and remote network transport accesses) leads to undefined behavior. Atomics do allow this overlapping. Therefore, it is only safe to use shared memory atomics when you know the transport layer is coherent with processor atomics. Because of this, enabling XPMEM only enables shared memory put/get and there is an additional flag to enable shared memory atomics.

shmem_transport_cma_get(&old_signal_val, sig_addr, sizeof(uint64_t), pe,
shmem_internal_get_shr_rank(pe));
signal += old_signal_val;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be an atomic. CMA does not support atomics; use shmem_transport_atomic.

@wrrobin
Copy link
Collaborator Author

wrrobin commented Feb 12, 2020

@jdinan Can you please review the changes in shr_transport? There is one test failing in Travis that I am taking a look, but looks like it is related to the unit test change.

# define ONLY_XPMEM_TRANSPORT 1
#else
# define ONLY_XPMEM_TRANSPORT 0
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh dear, no. This will cause shared memory atomics to be enabled in some paths and not others. You should only need to check USE_SHR_ATOMICS. This flag is set by the configure script. If you are concerned about it being set incorrectly, please resolve those concerns at configure time, not here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, shmem_wait will not be using the correct atomics if USE_SHR_ATOMICS is not set.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I was thinking initially to use USE_SHR_ATOMICS by defining it when only XPMEM is enabled in the configure script. But for some reason, I misunderstood that we wanted a separate flag.

shmem_internal_get_shr_rank(pe));
shmem_internal_membar_acq_rel(); /* Memory fence to ensure target PE observes
stores in the correct order */
if (ONLY_XPMEM_TRANSPORT) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be replaced by #if USE_SHR_ATOMICS but other than change this looks looks ok.

@jdinan
Copy link
Member

jdinan commented Feb 12, 2020

The put_signal_nbi test is failing in the Travis row with huge pages and OFI fence enabled. Huge page support is auto-disabling, looks like possibly the issue is related to the OFI fence flag?

[0000] WARN:  ../../src/symmetric_heap_c.c:187: mmap_alloc
[0000]        file open failed, cannot use huge pages
         0: target[1] = 0 not matching 1
         0: target[2] = 0 not matching 2
         0: target[3] = 0 not matching 3
         0: target[4] = 0 not matching 4
         0: target[5] = 0 not matching 5
         0: target[6] = 0 not matching 6
         0: target[7] = 0 not matching 7
         0: target[8] = 0 not matching 8
         0: target[9] = 0 not matching 9
FAIL put_signal_nbi (exit status: 9)

@wrrobin
Copy link
Collaborator Author

wrrobin commented Feb 12, 2020

Yes, I am trying to reproduce the issue locally. The test is modified with this PR to include a check for ADD signal atomic. Earlier, with no signal op argument, this test was not failing.

@@ -414,6 +414,10 @@ AS_IF([test "$transport_xpmem" = "yes" -o "$transport_cma" = "yes"],
AC_DEFINE([ENABLE_HARD_POLLING], [1], [Enable hard polling])
])

if test "$enable_shr_atomics" = "no" -a "$transport_xpmem" = "yes" -a "$transport_ofi" = "no" -a "$transport_portals4" = "no" ; then
AC_DEFINE([USE_SHR_ATOMICS], [1], [If defined, the shared memory layer will perform processor atomics.])
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have always treated the above as an invalid build; the configure script should emit a warning any time there isn't a transport selected. That having been said, this seems like a good change. You might want to update the description of the --enable-shr-atomics to say "default: auto" rather than "default: disabled".

A minor fix here, the check for "$enable_shr_atomics" = "no" is not quite right (see Example 1.1 here). It's counterintuitive, but I think this check should be "$enable_shr_atomics" != "no" i.e. "not disabled".

@wrrobin wrrobin requested review from jdinan and davidozog February 13, 2020 15:39
Copy link
Member

@jdinan jdinan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wrrobin Based on earlier discussion, these changes look good to me. I unfortunately won't have time to review the unit test, perhaps @davidozog can help there? One last suggestion, try configuring with "--disable-shr-atomics" and "--enable-shr-atomics" and look in src/config.h under your build directory to ensure that the preprocessor macro is being defined correctly (I never trust myself to get the inverted logic of configure parameters correct).

@davidozog
Copy link
Member

@jdinan, do you have any opinion as to whether this should be included in the v1.4.5 release? All testing would have to be redone, so @wrrobin and I have tentatively decided to defer this merge.

@jdinan
Copy link
Member

jdinan commented Feb 13, 2020

Our CI testing is pretty thorough, apart from Portals, and this is an extension, so I wouldn't be opposed if you feel that it's ready. Since this PR includes an API change to bring the signaling API into compliance with the latest OpenSHMEM 1.5 draft, it might be nice to include.

@wrrobin
Copy link
Collaborator Author

wrrobin commented Feb 14, 2020

@jdinan I have tried with different combinations of --enable/disable-shr-atomics and with-xpmem/ofi and here are the results with the current configure script:

  1. --enable-shr-atomics -> USE_SHR_ATOMICS defined
  2. --disable-shr-atomics -> undefined
  3. --enable-shr-atomics --with-xpmem -> USE_SHR_ATOMICS defined
  4. --disable-shr-atomics --with-xpmem -> USE_SHR_ATOMICS defined
  5. --with-xpmem -> undefined
  6. --with-xpmem --with-ofi -> undefined
  7. --with-xpmem --with-ofi --enable-shr-atomics -> USE_SHR_ATOMICS defined
  8. --with-xpmem --with-ofi --disable-shr-atomics -> undefined

IIUC, the bold italic ones (4,5,7) above are producing incorrect configurations. Please let me know if you think so as well. For 7, I think it is producing the config that the user wants. But, I am not sure whether we can label the config as "auto" rather than "disabled".

If I keep the configure script condition as before

if test "$enable_shr_atomics" != "yes" -a "$transport_xpmem" = "yes" -a "$transport_ofi" = "no" -a "$transport_portals4" = "no" ; then
    AC_DEFINE([USE_SHR_ATOMICS], [1], [If defined, the shared memory layer will perform processor atomics.])
fi

then only the following result changes:
5. --with-xpmem -> USE_SHR_ATOMICS defined

which is the behavior we want to see. The 4th and 7th use-cases are still problematic.

@jdinan
Copy link
Member

jdinan commented Feb 14, 2020

With your current code:

if test "$enable_shr_atomics" != "no" -a "$transport_xpmem" = "yes" -a "$transport_ofi" = "no" -a "$transport_portals4" = "no" ; then
    AC_DEFINE([USE_SHR_ATOMICS], [1], [If defined, the shared memory layer will perform processor atomics.])
fi

For 4, I see:

$ ./configure --disable-shr-atomics --with-xpmem=$HOME/opt/xpmem --enable-pmi-simple
...
$ grep USE_SHR src/config.h
/* #undef USE_SHR_ATOMICS */

For 5:

$ ./configure --with-xpmem=$HOME/opt/xpmem --enable-pmi-simple
...
$ grep USE_SHR src/config.h
#define USE_SHR_ATOMICS 1

That's the right behavior for 7. User got what they asked for. Can you please re-check? The above look like the correct behavior.

Another possibility. There is a portability issue with string comparison in the test utility, which is why you often see if test "x$enable_shr_atomics" != "xno" .... I don't think Linux has this issue, which is why we have been lazy about using this style. You could try that if you are seeing weird behavior.

@wrrobin
Copy link
Collaborator Author

wrrobin commented Feb 14, 2020

I am not sure why I have seen those behaviors before. I am seeing the correct configuration now similar to yours. I am testing these on Cori and not sure now whether the behavior is random.

@wrrobin
Copy link
Collaborator Author

wrrobin commented Feb 14, 2020

@davidozog Can you please review the PR? If we do not see any other issues, we can merge this for the release.

Copy link
Member

@davidozog davidozog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a few comments/nit-picks.

src/transport_portals4.h Outdated Show resolved Hide resolved
test/shmemx/put_signal.c Show resolved Hide resolved
test/shmemx/put_signal.c Outdated Show resolved Hide resolved
test/shmemx/put_signal_nbi.c Outdated Show resolved Hide resolved
test/shmemx/put_signal.c Show resolved Hide resolved
@davidozog
Copy link
Member

@wrrobin and @jdinan - I've just run through the release checklist with this branch included and had no problems. I haven't yet tested the GNI provider with this patch though, because Cori is under maintenance - but I don't expect to see any issues.

Are we ok with merging this and including in a v1.4.5 release today?

@wrrobin
Copy link
Collaborator Author

wrrobin commented Feb 24, 2020

@davidozog Sounds good to me. I tested with GNI (Cori) last time and all the unit and shmemx tests passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants