Add a generic mechanism to specify compute accelerators to use in the configuration #36699
Conversation
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36699/27772
Code checks have found code style and quality issues which could be resolved by applying the following patch(es).

(branch force-pushed from 98596a8 to 80a7513)

+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36699/27773
A new Pull Request was created by @makortel (Matti Kortelainen) for master. It involves the following packages:
@Martin-Grunewald, @perrotta, @makortel, @ahmad3213, @cmsbuild, @missirol, @fwyzard, @pmandrik, @smuzaffar, @Dr15Jones, @emanueleusai, @AdrianoDee, @jfernan2, @slava77, @jpata, @qliphy, @fabiocos, @pbo0, @clacaputo, @srimanob, @davidlange6, @rvenditti can you please review it and eventually sign? Thanks. cms-bot commands are listed here
test parameters:
@cmsbuild, please test
@Dr15Jones please review. @fwyzard could you take a look too?
+reconstruction
+1
```cpp
void ensureAvailableAccelerators(edm::ParameterSet const& parameterSet) {
  auto const& selectedAccelerators =
      parameterSet.getUntrackedParameter<std::vector<std::string>>("@selected_accelerators");
  ParameterSet const& optionsPset(parameterSet.getUntrackedParameterSet("options"));
```
Also for a follow-up PR: this could be moved inside the `if` block.
+Upgrade
+1
+operations
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)
+1
PR description:

This PR resolves #31760. It adds a new parameter that can be used to specify the compute accelerator(s) a job is allowed to use. Patterns with `*` and `?` wildcards are allowed (similar to shell). The default value is `*` (i.e. the intersection of what is defined in the job and what is available on the worker node), to preserve the current behavior with CUDA. An empty `vstring` is an error. If a specifically requested accelerator is not available on the worker node, the job is terminated with a specific exit code.

The recognized (and allowed) values are specified with instances of new `ProcessAccelerator`-derived classes whose objects are attached to the `Process`. The system implicitly adds a `cpu` label to denote the CPU fallback. Each `ProcessAccelerator` class defines the accelerator labels it recognizes, returns the subset of those labels that are enabled on a worker node, and has the possibility to customize (within restrictions) the `Process` right before the point where the python configuration is "serialized" for the C++ code on the worker node. These customizations must not change the configuration hash, which is ensured by wrapping the `Process` object into another class that (currently) gives access only to Services.

In order to interoperate with `ProcessAccelerator`, the `SwitchProducer`s are changed slightly: the case-specific functions that tell whether a case is enabled or not now take the list of labels of enabled accelerators as an argument.

Encapsulating the accelerator knowledge into specific configuration objects defined outside of the framework leaves the framework independent of the accelerator technologies, and makes the system flexible and easy to extend, e.g. for the use of Alpaka (or any portability technology) that can support multiple accelerators and communicate additional information about the chosen accelerator to the framework (such as the namespace of the EDModules corresponding to the chosen accelerator). It should also help with #30044.
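The wildcard semantics described above can be illustrated in plain Python (this is an illustration of the described behavior, not the cmssw implementation; the function name `select_accelerators` is invented for the sketch):

```python
from fnmatch import fnmatch


def select_accelerators(patterns, available):
    """Sketch of the described semantics: patterns may use shell-style
    * and ? wildcards; an empty pattern list is an error."""
    if not patterns:
        raise ValueError("the accelerator pattern list must not be empty")
    return [a for a in available if any(fnmatch(a, p) for p in patterns)]


# "cpu" is implicitly added by the system; "gpu-nvidia" is the label the PR
# picks for NVIDIA GPUs.
available = ["cpu", "gpu-nvidia"]
print(select_accelerators(["*"], available))      # ['cpu', 'gpu-nvidia'] (default)
print(select_accelerators(["gpu-*"], available))  # ['gpu-nvidia']
print(select_accelerators(["cpu"], available))    # ['cpu']
```

With the default `*` pattern the selection is exactly the intersection of what the job defines and what the worker node provides, matching the stated goal of preserving the current CUDA behavior.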
For CUDA this PR adds `ProcessAcceleratorCUDA`, and changes all current loads of `HeterogeneousCore.CUDAServices.CUDAService_cfi` to `HeterogeneousCore.CUDACore.ProcessAcceleratorCUDA_cfi`. The `ProcessAcceleratorCUDA` internally loads the `CUDAService` if it is not loaded already, and also adds the `CUDAService` to the MessageLogger categories. The logic of whether CUDA is enabled on a worker node or not is moved from `SwitchProducerCUDA` to `ProcessAcceleratorCUDA`. For the NVIDIA GPU accelerator label I picked `gpu-nvidia`.

PR validation:

Unit tests pass (on a machine without a GPU and on a machine with a GPU)
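The configuration-level change described for CUDA might look roughly like the fragment below. This is a sketch that requires a CMSSW environment to run, and the `process.options.accelerators` attribute name is an assumption inferred from the related `"@selected_accelerators"`/`"options"` code above; the PR text itself does not spell out the parameter name.

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("Demo")

# before this PR:
# process.load("HeterogeneousCore.CUDAServices.CUDAService_cfi")
# after this PR:
process.load("HeterogeneousCore.CUDACore.ProcessAcceleratorCUDA_cfi")

# Restrict the job to NVIDIA GPUs; the default "*" would instead allow
# anything that is both defined in the job and available on the node.
# NOTE: the option name "accelerators" is assumed, not confirmed by this text.
process.options.accelerators = ["gpu-nvidia"]
```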