
QMP peer-to-peer initialization now uses MPI_Allgather #466

Merged
merged 3 commits into develop from feature/fast-qmp-p2p-setup
May 4, 2016


maddyscientist
Member

This accelerates peer-to-peer initialization when using QMP. The previous QMP reduction approach was unreliable and did not scale, so we now call MPI_Allgather directly, which performs much better at large process counts.

… unscalable reduction approach (which was also unreliable due to using float reduction hack).
…e need to exchange hostnames twice. Fixed a bug in QMP device ordinal setting.
@maddyscientist
Member Author

Just pushed a commit that also

  • Moved peer-to-peer initialization into comm_init to avoid needing to exchange the hostname twice, reducing the setup overhead
  • Fixed a bug in QMP device ordinal setting that could lead to processes aliasing a GPU
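
To illustrate the aliasing issue, here is a minimal sketch of deriving a per-node device ordinal from an all-gathered hostname buffer. It is not the code in this pull request: `kHostnameLen`, `pack_hostnames` and `local_device_ordinal` are hypothetical names, and the flat buffer is filled by hand here where a real run would fill it with MPI_Allgather (one fixed-size slot per rank).

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical slot width per rank; the real code may use a different size.
constexpr std::size_t kHostnameLen = 128;

// Build a flat buffer with one fixed-size, NUL-padded hostname slot per rank.
// This mimics the receive-buffer layout that MPI_Allgather would produce if
// every rank contributed kHostnameLen chars (assumption for illustration).
std::vector<char> pack_hostnames(const std::vector<std::string>& names) {
  std::vector<char> buf(names.size() * kHostnameLen, '\0');
  for (std::size_t r = 0; r < names.size(); ++r) {
    // Copy at most kHostnameLen - 1 chars so each slot stays NUL-terminated.
    std::strncpy(&buf[r * kHostnameLen], names[r].c_str(), kHostnameLen - 1);
  }
  return buf;
}

// Device ordinal for `rank`: the number of lower-numbered ranks that report
// the same hostname. Ranks on one node therefore get 0, 1, 2, ... and no two
// of them select the same GPU (provided the node has enough devices).
int local_device_ordinal(const std::vector<char>& buf, int rank) {
  const char* mine = &buf[rank * kHostnameLen];
  int ordinal = 0;
  for (int r = 0; r < rank; ++r) {
    if (std::strncmp(mine, &buf[r * kHostnameLen], kHostnameLen) == 0) {
      ++ordinal;
    }
  }
  return ordinal;
}
```

Because every rank holds the full hostname table after a single collective, the same buffer could also be reused to decide which ranks are peer-to-peer candidates, which is presumably why exchanging hostnames once in comm_init is enough.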

@mathiaswagner
Member

Strict build fails with

forbids declaration of ‘hostname_buf’ with no type [-fpermissive]
 void comm_peer2peer_init(const *hostname_buf) {}
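
For reference, the error arises because the parameter has a qualifier but no type. A minimal sketch of a well-formed single-GPU stub follows; the `const char *` type is an assumption based on the parameter being a hostname buffer, not something confirmed in this thread.

```cpp
// Single-GPU stub: there are no peers, so peer-to-peer setup is a no-op.
// The parameter still needs a complete type; `const char *` is assumed here.
void comm_peer2peer_init(const char *hostname_buf) {
  (void)hostname_buf;  // unused in the single-GPU build
}
```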

@maddyscientist
Member Author

maddyscientist commented May 4, 2016

Fix pushed (was a single-GPU typo). Edit: looks like @mathiaswagner beat me to it 😄

Looks like the cmake build only does a single-GPU build: we should probably test both single- and multi-GPU builds here. Also, any reason for keeping around the configure-style builds now?

@kostrzewa
Member

@maddyscientist

Also, any reason for keeping around the configure style builds now?

Until now I have been unable to get the cmake build to work on Jureca @ JSC... The problem seems to be that the CMake cache is reset when the compiler is changed, which seems to be necessary because the wrong compiler is "auto-detected". I couldn't get it to compile by editing CMakeCache manually either, so currently I rely on make.inc, which works very well.
I need to identify exactly what goes wrong and file issues (CMake is clearly more comfortable and provides essential out-of-source builds), but there's the usual problem with time constraints...

@maddyscientist
Member Author

@kostrzewa Thanks for that feedback; @mathiaswagner can maybe help there, as he's the cmake expert 😉. In my comment above I actually meant the Jenkins builds: Jenkins is presently building with both configure and cmake, which is probably unnecessary. Rest assured, we won't be breaking configure-style builds anytime soon.

While I've got your attention @kostrzewa: any outstanding blockers affecting your job throughput on Jureca at the moment? I know the single-GPU issue is outstanding, but hopefully the peer-to-peer enablement has reduced the need for this.

@mathiaswagner
Member

All WIP. Slowly getting there.

NVIDIA GmbH, Wuerselen, Germany, Amtsgericht Aachen, HRB 8361
Managing Director: Karen Theresa Burns


This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.

@mathiaswagner
Member

The best way to set the compiler is to run cmake in a clean build directory with the compilers passed through the environment:
CXX=<C++ compiler> CC=<C compiler> cmake <path to source>
See:

https://cmake.org/Wiki/CMake_FAQ#I_change_CMAKE_C_COMPILER_in_the_GUI_but_it_changes_back_on_the_next_configure_step._Why.3F

https://cmake.org/Wiki/CMake_FAQ#How_do_I_use_a_different_compiler.3F


@mathiaswagner mathiaswagner merged commit fa3164e into develop May 4, 2016
@mathiaswagner mathiaswagner deleted the feature/fast-qmp-p2p-setup branch May 4, 2016 17:42