numa_affinity broken? #223
Comments
That will allow us to skip the first step of looking for the bus location in the files under Or am I missing something here? |
Also, I have noted that the current implementation does not allow true multi-threading within an MPI process (e.g., all OpenMP threads end up on a single core). Don Holmgren wrote a patch to numa_affinity.cpp to fix the issue, which I tried in my early versions of the eigcg solver. Need to revisit this. |
Can you create a branch with that patch? |
I also noticed after some hacks that it sets the affinity to just one core instead of the list. |
Done, in hotfix/numa branch |
Thanks. I will look into it. |
It might depend on the right 'numa.h' ??? Although the version still looks for the no-longer existing path Reading |
Ooops, yes, you need an old numa library for the compilation. I asked Don to revisit the stuff. |
http://www.open-mpi.org/projects/hwloc/ might be worth a look. There is also example code related to GPUs at https://www.open-mpi.org/faq/?category=runcuda |
Where does this stand at the moment? Is there an easy fix for this to tide us over? Did Don provide an updated variant that compiles with more recent numa? |
I have not heard of an update from Don. So, my summary:
I am not sure which of the files under |
I'm communicating with Don about the topic. Will let you know. |
Before I delve into this internally, does anyone know if the numa affinity was working in v340 of the driver? I'm trying to track down a performance regression in the driver since then, and numa working in v340 but not since then would explain this. |
Which CUDA version is 340? As I wrote, it worked on a Cray with 5.5, but I don't know the driver version. For a single GPU you can use taskset to force the affinity when launching. |
Around CUDA 6.5, but that doesn't mean that a user running 6.5 isn't running a more recent version. I guess I'll have to look into this in more detail |
For the issue of parsing cpulistaffinity / cpuaffinity I found something hopefully useful: and https://github.com/karelzak/util-linux/blob/master/lib/cpuset.c It seems to provide functions to parse the files mentioned above. |
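For illustration only, here is a minimal, self-contained sketch (not the util-linux code and not what is in the hotfix branch) of the kind of parsing those helpers do: turn a cpulist string such as "0-7,16-23" into a cpu_set_t and apply it with sched_setaffinity. The placeholder string and function name are hypothetical; the util-linux cpuset.c linked above provides a ready-made cpulist parser for the same job.

```cpp
// Hypothetical, minimal parser for a Linux "cpulist" string such as
// "0-7,16-23" (the format of the cpulistaffinity files); not the util-linux
// implementation.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // needed for sched_setaffinity / CPU_* macros on glibc
#endif
#include <sched.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Fill 'set' from a comma-separated list of cpu numbers and ranges.
static bool parse_cpulist(const char *list, cpu_set_t *set)
{
  CPU_ZERO(set);
  char *copy = strdup(list);
  bool ok = true;
  for (char *tok = strtok(copy, ","); tok; tok = strtok(nullptr, ",")) {
    int lo, hi;
    if (sscanf(tok, "%d-%d", &lo, &hi) == 2) { /* range, e.g. "0-7" */ }
    else if (sscanf(tok, "%d", &lo) == 1) { hi = lo; /* single cpu */ }
    else { ok = false; break; }
    for (int c = lo; c <= hi && c < CPU_SETSIZE; c++) CPU_SET(c, set);
  }
  free(copy);
  return ok;
}

int main()
{
  cpu_set_t set;
  const char *cpulist = "0-7,16-23"; // placeholder: read from cpulistaffinity
  if (parse_cpulist(cpulist, &set) &&
      sched_setaffinity(0, sizeof(set), &set) == 0)
    printf("bound process to %d cpus\n", CPU_COUNT(&set));
  return 0;
}
```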
I have given that a first try in hotfix/numa_quickfix_cpuset. Ugly proof of concept. |
The proof of concept is somewhat cleaned up and somewhat tested now. |
Thanks for pushing on this Mathias. Can you give a rundown of the implications of including LGPL code in QUDA? |
I believe we only need to mention the license. QUDA is open source anyway, so this should not be an issue. But I will have to read the LGPL, and maybe it is a good idea to ask the maintainer about it. |
My only concern is that QUDA must never be bound to GPL, so I want us to be very careful. |
I get that. I'll check, and once I come to a conclusion I will let you know. |
Unfortunately, I think we want to stay away from the LGPL. Since QUDA is itself a library, I'm pretty sure that bundling an LGPL-licensed source file would be tantamount to relicensing all of QUDA under the LGPL; see section 2 of LGPLv2.1: https://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html I'm being generous here and assuming that the author of "cpuset" intended to license it under v2 or v2.1 (and would be willing to edit the two source files to say as much), since LGPLv3 imposes additional restrictions that make it incompatible with GPLv2, among other problems. I liked the idea of using hwloc. I guess that didn't pan out? |
I have not yet followed the hwloc path. For now I will remove the code from github (force delete the branch) - just to be on the safe side. In case we gain further insight we can re-add it if we are sure it does not have any undesirable licensing impacts. |
http://docs.nvidia.com/deploy/nvml-api/index.html might be useful |
Thanks, @alexstrel ! This http://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g1c52b8aefbf804bcb0ab07f5d872c2a1 indeed looks promising. We might just need to check which requirements (NVML version, etc) come with it. I will have a look. |
Looks like we need at least a CUDA 6.5 driver: http://docs.nvidia.com/deploy/nvml-api/change-log.html#change-log I know that our old code somewhat worked on 5.5. I am not sure about 6.0. |
No problem on the pi0g cluster at FNAL; we have the latest driver (CUDA 7.0). |
I "think" that we will have great news about cuda in the next couple months. On Wed, Jun 10, 2015 at 2:33 PM, Alexei Strelchenko <
|
I am also aware that some machines like the Cray in Bloomington will probably not get CUDA 7.0 at all (at least not supported by Cray due to the management contract …). Anyway, this issue is about NUMA and not CUDA support. Just that the NVML functions for cpu affinity require a v340 driver. Mathias |
I think that the only machine that will need 5.5 support in the near future is Titan. All other machines of note have been upgraded to 6.5. So I think we should be ok with the strategy of using NVML in general, and drop back to the old way if running on CUDA 5.5. I'm not even sure if this is an issue on Titan anyway, since it has only one GPU and one CPU per node. Though I guess it could be an issue, since the Opteron processors used by Titan do have a NUMA architecture. |
I looked into NVML, and using NVML to set the cpu affinity is actually pretty easy. However, using it for QUDA brings up some questions.
We should definitely use an option in configure to choose whether to use the new behavior or fall back to the old version (which should still be somewhat functional for CUDA 5.5) |
The code works on a local machine with CUDA 7 and on a Cray with CUDA 6.5 and 5.5. However, on the Cray the
Right now the code uses the compile-time check for the CUDA version, i.e.
so compilation and linking also work, apart from the library path issue, on systems with older CUDA / driver versions. But that may need more checks. Maybe we can use some configure option like |
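For what it's worth, a rough sketch of that combination (a sketch under assumptions, not the actual code in the branch): guard the NVML call behind the compile-time CUDA_VERSION check and let the driver bind the process to the CPUs nearest the GPU via nvmlDeviceSetCpuAffinity. It needs linking against the NVML library, and the mapping from the CUDA device ordinal to the NVML index is simplified here (in general it should go through the PCI bus id, e.g. via nvmlDeviceGetHandleByPciBusId).

```cpp
// Sketch only: NVML-based affinity with a compile-time fallback for old CUDA.
#include <cuda.h>   // provides CUDA_VERSION, e.g. 6050 for CUDA 6.5
#include <cstdio>

#if CUDA_VERSION >= 6050
#include <nvml.h>

// Ask the driver to bind the calling process to the CPUs nearest the GPU.
// Assumes the NVML index passed in corresponds to the device we care about.
static bool set_numa_affinity(unsigned int nvml_index)
{
  if (nvmlInit() != NVML_SUCCESS) return false;
  nvmlDevice_t dev;
  bool ok = nvmlDeviceGetHandleByIndex(nvml_index, &dev) == NVML_SUCCESS
         && nvmlDeviceSetCpuAffinity(dev) == NVML_SUCCESS;
  nvmlShutdown();
  return ok;
}
#else
// Older toolkits / drivers: fall back to the old /proc based code (omitted).
static bool set_numa_affinity(unsigned int) { return false; }
#endif

int main()
{
  if (!set_numa_affinity(0))
    fprintf(stderr, "Failed to determine NUMA affinity via NVML\n");
  return 0;
}
```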
@mikeaclark @rbabich I think the nvml solution looks good, but before going further: Do you know whether we can distribute 'nvml.h' with quda? |
There's an ongoing thread about this, but it tentatively looks like we can distribute nvml.h with QUDA. We should just add another blurb to the LICENSE file, incorporating the text at the top of the header. It's a bit unfortunate that the user will have to link against the nvml shared library. There are two options there:
Option 2 is useful since it doesn't require that the build machine (e.g., the head node of a cluster) have the driver installed. If we wanted to support option 2 exclusively, we wouldn't have to distribute nvml.h at all, since it's included with the GDK. In that case, we'd want to define "--with-nvml" such that it takes the parent directory of the one where the library is installed (e.g., It's probably more user-friendly to support both options, though. In that case, we have to distribute the header anyway, so we might as well adopt the convention that --with-nvml points directly to the library path, e.g., |
The last paragraph is roughly what I had in mind. But of course that might break on a head node without GPUs. It does work on the Cray nodes. For non-Cray systems it is not even necessary to specify the location of the library, as it is in a default location. I think we can include the header and use the configure option to point to the library (full or stub). Then we have a solution that works in most cases without any user configuration, on a Cray with just specifying the path, and on GPU-less head nodes the user will have to install the GDK. |
I now have a solution that adds the following configure options:
If numa is not enabled none of them apply. At some point I would like to get rid of the old code. I am not sure about the naming in configure: I think NVML is the correct abbreviation but the library itself is named I will put in a pull request once the license questions have finally settled. |
@rbabich Just wanted to follow up on whether the internal discussion led to a final conclusion on whether we may include the header file in quda. |
Kate has pinged the relevant folks again... will hopefully know tomorrow. I'm 90% sure that we're okay, though. |
Got to ask here again, as I am doing some benchmarks with MILC and affinity improves things somewhat. Also, I don't like having this lying around on my desk / issue tracker forever. I can bug the people who decide that if you give me their contact data. |
If we go on with CMake I believe we can probably integrate this and avoid adding nvml.h: https://github.com/jirikraus/scripts/blob/master/NVML/cmake/FindNVML.cmake |
I noticed getting the message:
Failed to determine NUMA affinity for device 0 (possibly not applicable)
quite often. Looking in the code, it searches for something like
"/proc/driver/nvidia/gpus/%d/information", my_gpu
with my_gpu being the device id (i.e. an integer).
I checked on some machines and see
It looks like the directory names have changed from device ids to bus ids with the newer drivers. I don't know when that happened, but we should update numa_affinity.cpp to reflect this.
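For illustration, a minimal sketch of the suggested fix (assuming the newer drivers name the /proc/driver/nvidia/gpus/ entries after the PCI bus id; the exact formatting of the bus-id string may need adjusting), with a fallback to the old integer-id layout:

```cpp
// Hypothetical sketch: resolve the /proc/driver/nvidia/gpus/ entry by PCI bus
// id (newer drivers) and fall back to the old integer device-id layout.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  int my_gpu = 0; // CUDA device id
  char bus_id[32];
  if (cudaDeviceGetPCIBusId(bus_id, (int)sizeof(bus_id), my_gpu) != cudaSuccess) {
    fprintf(stderr, "cudaDeviceGetPCIBusId failed for device %d\n", my_gpu);
    return 1;
  }

  char path[256];
  // New layout: directory named after the PCI bus id.
  snprintf(path, sizeof(path), "/proc/driver/nvidia/gpus/%s/information", bus_id);
  FILE *f = fopen(path, "r");
  if (!f) {
    // Old layout: directory named after the integer device id.
    snprintf(path, sizeof(path), "/proc/driver/nvidia/gpus/%d/information", my_gpu);
    f = fopen(path, "r");
  }
  if (f) {
    printf("reading NUMA info from %s\n", path);
    fclose(f);
  } else {
    printf("Failed to determine NUMA affinity for device %d (possibly not applicable)\n", my_gpu);
  }
  return 0;
}
```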