
How to provide supported CUDA compute capabilities and runtime version where needed? #33542

Open

makortel opened this issue Apr 27, 2021 · 18 comments

@makortel (Contributor)

Currently cudaIsEnabled uses a hardcoded condition for the CUDA compute capability check.

#33057 adds a need for runTheMatrix to know the list of supported compute capabilities as well as the CUDA runtime version (the two leading parts, e.g. 11.2, if I understood correctly).

At this point we should look into having a single source for this information that would scale to adding other pieces of information, and also to other technologies beyond CUDA.

@makortel (Contributor, Author)

assign core,heterogeneous,pdmv

@cmsbuild (Contributor)

New categories assigned: heterogeneous,core,pdmv

@Dr15Jones,@smuzaffar,@jordan-martins,@fwyzard,@chayanit,@wajidalikhan,@makortel,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild (Contributor)

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel (Contributor, Author)

I can think of adding environment variables for these in the CUDA toolfile, but is that the best solution?
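For concreteness, a minimal sketch of how a Python consumer could read such variables; the variable names here are hypothetical, nothing exports them today:

```python
import os

# Hypothetical variable names (nothing in the CUDA toolfile sets these
# today); values as cuda.spec / the toolfile might define them.
supported = os.environ.get("CUDA_SUPPORTED_COMPUTE_CAPABILITIES", "").split()
runtime_version = os.environ.get("CUDA_RUNTIME_VERSION", "")  # e.g. "11.2"
# supported would then be e.g. ["6.0", "6.1", "7.0", "7.5"]
```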

@smuzaffar (Contributor) commented Apr 27, 2021

One other way could be that cuda.spec generates a Python script which one can import to get the CUDA version and supported compute capability information.
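A sketch of what such a generated module might contain; the module and attribute names are made up for illustration:

```python
# cuda_info.py -- hypothetical module that cuda.spec could generate and
# install on PYTHONPATH; all names and values here are illustrative.
cuda_version = "11.2.1"
cuda_runtime_version = "11.2"  # the two leading parts of the version
supported_compute_capabilities = ["6.0", "6.1", "7.0", "7.5"]
```

A consumer would then just do `from cuda_info import supported_compute_capabilities`.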

@makortel (Contributor, Author)

(to repeat also here) One question for the runTheMatrix use case is that different CPU platforms may support different sets of CUDA compute capabilities (e.g. currently ARM does not support Pascal 6.x, whereas x86 and PPC do). When we start to use multiple CPU architectures, I'd imagine job submission being done from some/any CPU node (e.g. x86 on lxplus).

How to have the correct CUDA compute capability list for ARM jobs? Would the job submission specify the target SCRAM_ARCH explicitly? (in which case we could use arch-specific defaults)

@makortel (Contributor, Author)

cudaIsEnabled is a C++ program, so how about e.g. a JSON file?

@smuzaffar (Contributor)

JSON is also possible, but then we still need an environment variable to point to it. Can we convert cudaIsEnabled into Python?
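As a sketch of the JSON variant, assuming a hypothetical CUDA_INFO_JSON variable pointing at a file written by cuda.spec:

```python
import json
import os

# CUDA_INFO_JSON is a hypothetical variable; the file it points at could
# contain e.g.
# {"runtime_version": "11.2", "compute_capabilities": ["6.0", "7.0", "7.5"]}
with open(os.environ["CUDA_INFO_JSON"]) as f:
    cuda_info = json.load(f)
supported = cuda_info["compute_capabilities"]
```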

@makortel (Contributor, Author)

I suppose it could be reimplemented in Python by interpreting the output of cudaComputeCapabilities, which would still communicate with CUDA via the C(++) API.

@makortel (Contributor, Author) commented Apr 27, 2021

Or we remove cudaIsEnabled altogether and change https://github.com/cms-sw/cmssw/blob/master/HeterogeneousCore/CUDACore/python/SwitchProducerCUDA.py#L9 to call cudaComputeCapabilities directly, if it would be OK to import that Python fragment in the CMSSW configuration.

@fwyzard (Contributor) commented Apr 27, 2021

Note that the compute capabilities that we have in CMSDIST's cuda-flags.file are those we compile for, but they are not an exhaustive list of those we can run on.

Currently (on x86 and power) we build for 6.0, 7.0, 7.5.
Most likely the code can then run on any 6.x and 7.x GPU (and fail on any 5.x and 8.x GPUs).
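That rule can be written down as a small predicate (this ignores PTX JIT forward compatibility, which could also let 7.5 PTX run on 8.x hardware):

```python
# Build targets as in cuda-flags.file; a cubin built for X.Y runs on devices
# with the same major version and minor >= Y, so 6.0/7.0/7.5 covers any
# 6.x and 7.x GPU but no 5.x or 8.x one.
BUILD_TARGETS = [(6, 0), (7, 0), (7, 5)]

def can_run_on(major, minor):
    return any(bmajor == major and bminor <= minor
               for bmajor, bminor in BUILD_TARGETS)

assert can_run_on(6, 1) and can_run_on(7, 5)
assert not can_run_on(5, 2) and not can_run_on(8, 0)
```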

@fwyzard (Contributor) commented Apr 27, 2021

I'm confused about what you mean by using cudaComputeCapabilities?

cudaComputeCapabilities prints the GPUs available on the current system, along with their compute capabilities, but it doesn't know anything about which capabilities CMSSW does or does not support.

@makortel (Contributor, Author)

> Note that the compute capabilities that we have in CMSDIST's cuda-flags.file are those we compile for, but they are not an exhaustive list of those we can run on.

Right, so we'd need to provide an explicit list of the compute capabilities on which the software can run (or that's what I understood #33057 to need).

> I'm confused about what you mean by using cudaComputeCapabilities?

In a Python program (be it a separate script or part of the configuration system), execute cudaComputeCapabilities, parse the compute capabilities of the devices from its output, and compare them to the contents of the Python fragment generated by cuda.spec.

(I think this is still heavily on the "what we could do" side rather than the "what we should do" side.)
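A rough sketch of what that could look like; the exact output format of cudaComputeCapabilities (one "&lt;index&gt; &lt;major&gt;.&lt;minor&gt; &lt;name&gt;" line per device) and the "at least one supported device" policy are assumptions here:

```python
import subprocess

def cuda_is_enabled(supported):
    # supported would come from the cuda.spec-generated fragment,
    # e.g. ["6.0", "6.1", "7.0", "7.5"]
    try:
        proc = subprocess.run(["cudaComputeCapabilities"],
                              capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return False  # no CUDA runtime, no devices, or the call failed
    # parse the second column (the capability) of each device line
    capabilities = [line.split()[1]
                    for line in proc.stdout.splitlines() if line.strip()]
    return any(cc in supported for cc in capabilities)
```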

@fwyzard (Contributor) commented Apr 27, 2021

> > I'm confused about what you mean by using cudaComputeCapabilities?
>
> In a Python program (be it a separate script or part of the configuration system), execute cudaComputeCapabilities, parse the compute capabilities of the devices from its output, and compare them to the contents of the Python fragment generated by cuda.spec.
>
> (I think this is still heavily on the "what we could do" side rather than the "what we should do" side.)

Ah, I see, you mean to decide whether "CUDA is available" locally, independently of the job description/matching.

@makortel (Contributor, Author)

> > > I'm confused about what you mean by using cudaComputeCapabilities?
> >
> > In a Python program (be it a separate script or part of the configuration system), execute cudaComputeCapabilities, parse the compute capabilities of the devices from its output, and compare them to the contents of the Python fragment generated by cuda.spec.
> >
> > (I think this is still heavily on the "what we could do" side rather than the "what we should do" side.)
>
> Ah, I see, you mean to decide whether "CUDA is available" locally, independently of the job description/matching.

Right, this was for the cudaIsEnabled / SwitchProducerCUDA purpose.

@fwyzard (Contributor) commented Apr 28, 2021

OK. Then I'd suggest we autodetect which GPUs are usable based on whether we are actually able to use them: #33561.
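For illustration, the same "try to actually use it" idea approximated in Python with ctypes against the CUDA runtime; the real check in #33561 is done in C++ inside CMSSW:

```python
import ctypes

def usable_cuda_devices():
    try:
        cudart = ctypes.CDLL("libcudart.so")
    except OSError:
        return []  # no CUDA runtime available at all
    count = ctypes.c_int(0)
    if cudart.cudaGetDeviceCount(ctypes.byref(count)) != 0:  # 0 == cudaSuccess
        return []
    usable = []
    for device in range(count.value):
        # cudaFree(0) forces context creation, i.e. actually tries to use
        # the device rather than just enumerating it
        if cudart.cudaSetDevice(device) == 0 and cudart.cudaFree(None) == 0:
            usable.append(device)
    return usable
```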

@makortel (Contributor, Author)

I agree, #33561 provides a much better way for cudaIsEnabled (thanks!). That leaves only runTheMatrix needing the list of compute capabilities we can run on, and the runtime version.

@fwyzard (Contributor) commented Apr 28, 2021

If we merge cms-sw/cmsdist#6851, then we will only need to give a "minimum compute capability".

Maybe we can add that value to cuda-flags.file and export it via an environment variable, or via `scram tool tag cuda SOMETHING`?
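With a single minimum, the check collapses to a plain tuple comparison; the environment variable name below is hypothetical:

```python
import os

# Hypothetical: suppose cuda-flags.file exported
# CUDA_MINIMUM_COMPUTE_CAPABILITY="6.0"
minimum = tuple(int(x) for x in
                os.environ.get("CUDA_MINIMUM_COMPUTE_CAPABILITY", "6.0").split("."))

def is_supported(major, minor):
    return (major, minor) >= minimum  # e.g. (7, 5) >= (6, 0) -> True
```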
