Improve robustness of GPUDirect and fix silent errors #238

maddyscientist · 2015-05-13T19:53:25Z

This significantly improves the error checking and robustness when using GPU_COMMS to avoid silent errors, and removes a nasty hack when supporting staggered fermions. The additional error checking also applies to the non-GPU_COMMS path and in either case is only performed when HOST_DEBUG=yes.

Remove TIFR / naive staggered workaround
Check all GPU receive buffers when creating a message handler using cudaMemset
Check all GPU send buffers by trying a cudaMemcpy
Check all CPU receive buffers by trying a std::fill and catching any exception
Check all CPU send buffers by trying a std::copy and catching any exception
2-d variants of all of the above for strided communicators
Only allocate message handlers for the numbers of faces that are actually used
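
For illustration, the CPU-side checks listed above could look roughly like the sketch below. The helper names (`check_recv_buffer`, `check_send_buffer`, `validate_buffers`) are hypothetical and are not taken from comm_common.cpp; the GPU-side checks with cudaMemset and cudaMemcpy would follow the same pattern but are not shown. Whether an invalid pointer actually surfaces as a C++ exception is platform-dependent, so treat the try/catch as mirroring the description rather than as a guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <exception>
#include <vector>

// Hypothetical helper: touch every byte of a CPU receive buffer by zero-filling it.
// A receive buffer holds no live data yet, so overwriting it is harmless.
bool check_recv_buffer(char *buffer, std::size_t bytes) {
  try {
    std::fill(buffer, buffer + bytes, '\0');
  } catch (const std::exception &) {
    return false;
  }
  return true;
}

// Hypothetical helper: read every byte of a CPU send buffer by copying it into
// a scratch vector. The send buffer itself is left untouched (non-destructive).
bool check_send_buffer(const char *buffer, std::size_t bytes) {
  try {
    std::vector<char> scratch(bytes);
    std::copy(buffer, buffer + bytes, scratch.begin());
  } catch (const std::exception &) {
    return false;
  }
  return true;
}

// Debug-only wrapper: assert() compiles away under NDEBUG, loosely analogous
// to checks that only run in a debug build.
void validate_buffers(const char *send, char *recv, std::size_t bytes) {
  const bool send_ok = check_send_buffer(send, bytes);
  const bool recv_ok = check_recv_buffer(recv, bytes);
  assert(send_ok && recv_ok);
  (void)send_ok;
  (void)recv_ok;
}
```

The send-side check copies out of the buffer instead of writing into it, which keeps the check non-destructive for data that is about to be communicated.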

With respect to the last point: the fact that message handlers for all numbers of faces were declared irrespective of the operator was a source of silent failures. E.g., when doing Wilson fermions, the halo region / ghost zone is of depth one, but message handlers were previously created for depths (faces) 1 through 3, while the receive buffers were only sized for depth 1. This is a silent error if it goes unchecked. It didn't cause any problems when running on GPU-aware MPI (e.g., creating an MPI handle using invalid memory), but I guess this is totally undefined behaviour and could cause nasty things to happen on GDR.
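
A minimal sketch of the mismatch, using hypothetical names (`MsgHandle`, `handles_for_all_depths`, `handles_for_depth`, `count_overruns`) and assuming the message size scales linearly with the halo depth: creating handles for depths 1 through 3 over a buffer sized for depth 1 leaves two handles pointing past the end of the buffer, whereas creating only the requested depth leaves none.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a message handle: the halo depth it serves and
// the number of bytes it would transfer.
struct MsgHandle {
  int depth;
  std::size_t bytes;
};

// Behaviour described as problematic: one handle per depth 1..max_depth,
// irrespective of the operator. Assumes bytes scale linearly with depth.
std::vector<MsgHandle> handles_for_all_depths(int max_depth, std::size_t bytes_per_face) {
  std::vector<MsgHandle> handles;
  for (int depth = 1; depth <= max_depth; ++depth)
    handles.push_back({depth, static_cast<std::size_t>(depth) * bytes_per_face});
  return handles;
}

// Behaviour after the change: only the depth that is actually requested.
std::vector<MsgHandle> handles_for_depth(int n_face, std::size_t bytes_per_face) {
  assert(n_face >= 1);
  return {MsgHandle{n_face, static_cast<std::size_t>(n_face) * bytes_per_face}};
}

// Count handles whose message would not fit in a buffer of buffer_bytes.
std::size_t count_overruns(const std::vector<MsgHandle> &handles, std::size_t buffer_bytes) {
  std::size_t overruns = 0;
  for (const MsgHandle &h : handles)
    if (h.bytes > buffer_bytes) ++overruns;
  return overruns;
}
```

With a Wilson-like buffer sized for depth 1, `count_overruns(handles_for_all_depths(3, face), face)` reports the two oversized handles, while `count_overruns(handles_for_depth(1, face), face)` reports none.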

mathiaswagner · 2015-05-13T20:43:48Z

Changes look good to me but I did not yet compile (my local build machine is down again).
I am not in favor of the use of #define in general but here it looks to be a good choice - MILC does that a lot and it makes the code a pain to read.

The comments mention possible future improvements / optimizations. I have not yet checked but it might be good to have issues for them.

Automated testing would really be handy now ;-)

maddyscientist · 2015-05-13T20:46:46Z

The use of the macros here follows that used in malloc_quda.h, and it is used to get access to the file, function, line info string for debugging. I agree that in general macros should be avoided, but used sparingly they can be useful.
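
As a rough illustration of that pattern (the names `describe_call_site_` and `describe_call_site` are hypothetical and not the ones used in malloc_quda.h): the macro expands at the call site, so `__func__`, `__FILE__` and `__LINE__` refer to the caller and not to the helper.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical implementation function; the trailing underscore marks it as
// the target of the macro below and not something to call directly.
inline std::string describe_call_site_(const char *func, const char *file, int line,
                                       const char *what) {
  assert(func != nullptr && file != nullptr && what != nullptr);
  std::ostringstream out;
  out << what << " (" << func << ", " << file << ":" << line << ")";
  return out.str();
}

// The macro is what captures the caller's function, file and line; a plain
// function would only ever see its own location.
#define describe_call_site(what) describe_call_site_(__func__, __FILE__, __LINE__, (what))
```

A call such as `describe_call_site("recv buffer")` then yields a string that starts with the description and ends with the caller's function, file and line, which is the information needed to trace a bad buffer back to where its communicator was declared.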

mathiaswagner · 2015-05-13T20:50:40Z

I did get the intention and consider it to be a reasonable choice here.

maddyscientist · 2015-05-13T20:52:39Z

On making new issues for the improvements in the comments: I believe these have already been implemented in Justin's peer-2-peer branch, but I have yet to verify that. Once the immediate GPU_COMMS bugs have been verified fixed (issue #231) and this is merged in, one of my next tasks is to clean up the peer-2-peer branch and get it merged in too (as a new feature).

mathiaswagner · 2015-05-14T20:18:09Z

I don't think we can run any specific tests but as GPU_COMMS is anyway still experimental I will merge that in. The code looks good to me.

maddyscientist added 7 commits May 12, 2015 14:43

87f8cca  Fixed assertion failure for GDR receive buffer with Nface > 1 for Wilson fermions.

705adab  Small clean up of message handle creation in cuda_color_field.cpp and added debugging of communicator declaration in comm_common.cpp.

b03757e  Error checking for comms send buffers must be non-destructive. Added CPU comms buffer checking using std::fill and std::copy.

3b56ea3  In cudaColorSpinorField::createComms, only allocate the send message handlers for the requested number of faces.

36292b4  Added buffer validity checking for strided message handlers in lib/comm_common.cpp.

634d170  Only allocate receive messahe handlers for the requested number of faces.

00b01f1  Always allocate a ghost zone of one for Wilson-like and three for staggered fermions.

maddyscientist mentioned this pull request May 13, 2015

GPU_COMMS and clover bicg #231

Closed

maddyscientist added this to the QUDA 0.7.1 milestone May 13, 2015

maddyscientist added bug clean-up labels May 14, 2015

mathiaswagner pushed a commit that referenced this pull request May 14, 2015

Merge pull request #238 from lattice/hotfix/gdr

0859c72

Improve robustness of GPUDirect and fix silent errors

mathiaswagner merged commit 0859c72 into develop May 14, 2015

mathiaswagner deleted the hotfix/gdr branch May 14, 2015 20:18
