Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve robustness of GPUDirect and fix silent errors #238

Merged
merged 7 commits into from
May 14, 2015

Conversation

maddyscientist
Copy link
Member

This significantly improved the error checking and robustness when using GPU_COMMS to avoid silent errors and removes a nasty hack when supporting staggered fermions. The additional error checking also applies to the non-GPU_COMMS path and in either case is only performed when HOST_DEBUG=yes.

  • Remove TIFR / naive staggered work around
  • Check all GPU receive buffers when creating a message handler using cudaMemset
  • Check all GPU send buffers by trying a cudaMemcpy
  • Check all CPU receive buffers by trying a std::fill and catching any exception
  • Check all CPU send buffer by trying a std:copy and catching any exception
  • 2-d variants of all of the above for strided communicators
  • Only allocate message handlers for the numbers of faces that are actually used

With respect to the last point: that fact that message handlers for all numbers of faces were declared irrespective of the operator was a source of silent failures. E.g., when doing Wilson fermions, the halo region / ghost zone is of depth one, but message handlers were previously created for depths (face) 1 through 3. However, the size of the receive buffers allocated was only sized for depth 1. This is a silent error if it goes unchecked, it didn’t causes any problems when running on GPU-aware MPI (e.g., creating a MPI handle using invalid memory), but I guess this is totally undefined behaviour, and could causes nasty things to happen on GDR.

@mathiaswagner
Copy link
Member

Changes look good to me but I did not yet compile (my local build machine is down again).
I am not in favor of the use of #define in general but here it looks to be a good choice - MILC does that a lot and it makes the code a pain to read.

The comments mention possible future improvements / optimizations. I have not yet checked but it might be good to have issues for them.

Automated testing would really be handy now ;-)

@maddyscientist
Copy link
Member Author

The use of the macros here follows that used in malloc_quda.h, and it is used to get access to the file, function, line info string for debugging. I agree that in general macros should be avoided, but used sparingly they can be useful.

@mathiaswagner
Copy link
Member

I did get the intention and consider it to be a reasonable choice here.

@maddyscientist
Copy link
Member Author

On making new issues for the improvements in the comments: I believe these have already been implemented into Justin's peer-2-peer branch, but I have yet to verify that. Once the immediate GPU_COMMS bugs have been verified fixed (issue #231) and this is merged in, one of my next tasks is to clean the peer-2-peer branch and get it merged in too (as a new feature).

@mathiaswagner
Copy link
Member

I don't think we can run any specific tests but as GPU_COMMS is anyway still experimental I will merge that in. The code looks good to me.

mathiaswagner pushed a commit that referenced this pull request May 14, 2015
Improve robustness of GPUDirect and fix silent errors
@mathiaswagner mathiaswagner merged commit 0859c72 into develop May 14, 2015
@mathiaswagner mathiaswagner deleted the hotfix/gdr branch May 14, 2015 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants