
An alpaka module with an explicit cpu backend fails to run in a job with a list of accelerators that does not include the cpu #43780

Open
fwyzard opened this issue Jan 24, 2024 · 8 comments

Comments

@fwyzard
Contributor

fwyzard commented Jan 24, 2024

An @alpaka module with an explicit CPU backend such as

process.testProducerSerial = cms.EDProducer('TestAlpakaProducer@alpaka',
    size = cms.int32(99),
    alpaka = cms.untracked.PSet(
        backend = cms.untracked.string("serial_sync")
    )
)

will fail to run if the process is configured to exclude the CPU from the accelerators:

process.options.accelerators = [ 'gpu-nvidia' ]

with the message:

An exception of category 'UnavailableAccelerator' occurred while
   [0] Processing the python configuration file named writer.py
Exception Message:
Module testProducerSerial has the Alpaka backend set explicitly, but its accelerator is not available for the job because of the combination of the job configuration and accelerator availability on the machine. The following Alpaka backends are available for the job cuda_async.

Currently, the workaround is to use the alpaka_serial_sync:: variant explicitly:

process.testProducerSerial = cms.EDProducer('alpaka_serial_sync::TestAlpakaProducer',
    size = cms.int32(99)
)
@cmsbuild
Contributor

cmsbuild commented Jan 24, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @fwyzard Andrea Bocci.

@Dr15Jones, @makortel, @rappoccio, @smuzaffar, @antoniovilela, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Jan 24, 2024

assign core, heterogeneous

@cmsbuild
Contributor

New categories assigned: core,heterogeneous

@Dr15Jones, @fwyzard, @makortel, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign. Thanks

@makortel
Contributor

makortel commented Jan 24, 2024

This comment is mostly just thinking out loud. I want to find out whether module.alpaka.backend = 'serial_sync' could be made to work for this case, or whether that could cause any issues.

The process.options.accelerators setting specifies the set of accelerators that the job may use. I.e. with accelerators = ['gpu-nvidia', 'cpu'] the job can run on a machine without a GPU, whereas accelerators = ['gpu-nvidia'] would lead to a failure.
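
For illustration, the two settings being contrasted would look like this in the configuration (only this option is shown; the rest of the process setup is omitted):

# the job may use an NVIDIA GPU, but can fall back to the CPU on a machine without one
process.options.accelerators = [ 'gpu-nvidia', 'cpu' ]

# the job requires an NVIDIA GPU, and fails on a machine without one
process.options.accelerators = [ 'gpu-nvidia' ]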

The process.options.accelerators setting should drive the behavior of an @alpaka module when the Alpaka backend is not explicitly specified; see the sketch below.
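
As a sketch of that case, the producer from the issue description without an explicit backend (the alpaka PSet is simply omitted; TestAlpakaProducer and its size parameter are taken from the example above):

process.testProducer = cms.EDProducer('TestAlpakaProducer@alpaka',
    size = cms.int32(99)
)
# no alpaka.backend is set, so the backend is chosen based on process.options.accelerators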

Currently, the ProcessAcceleratorAlpaka (which plays the Python-side role in how the @alpaka modules are handled) requires that explicitly set backends must also be compatible with process.options.accelerators.

In a way, the CPU is a special "accelerator", as it is always (assumed to be) present, and non-Alpaka code will use the CPU anyway. So perhaps just allowing explicitly set host backends, irrespective of the contents of process.options.accelerators, would be OK.

If the previous case were allowed, what about setting the backend explicitly to anything? For example, with module.alpaka.backend = 'cuda_async' and accelerators = ['cpu'], should that work on a machine that has a GPU, or lead to an early failure? On first thought, I'm leaning towards "should continue to lead to failure". A sketch of that configuration follows.
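
For concreteness, the configuration described in that example would look roughly like this (adapted from the issue description; the module label testProducerCuda is hypothetical):

process.options.accelerators = [ 'cpu' ]

process.testProducerCuda = cms.EDProducer('TestAlpakaProducer@alpaka',
    size = cms.int32(99),
    alpaka = cms.untracked.PSet(
        backend = cms.untracked.string("cuda_async")
    )
)
# explicit cuda_async backend vs. CPU-only accelerators: the suggestion is that this should keep failing early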

@fwyzard
Contributor Author

fwyzard commented Jan 25, 2024

what about setting the backend explicitly to anything? For example, with module.alpaka.backend = 'cuda_async' and accelerators = ['cpu'], should that work on a machine that has a GPU, or lead to an early failure? On first thought, I'm leaning towards "should continue to lead to failure".

I agree, I think this should fail.

What about the case where a job uses an alpaka_cuda_async::producer explicitly?
Should that fail as well?
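
For concreteness, "explicitly" here means instantiating the backend-specific module type directly, e.g. (a sketch adapted from the workaround in the issue description; the module label matches the one in the exception message below):

process.testProducer = cms.EDProducer('alpaka_cuda_async::TestAlpakaProducer',
    size = cms.int32(99)
)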

@fwyzard
Contributor Author

fwyzard commented Jan 25, 2024

What about the case where a job uses an alpaka_cuda_async::producer explicitly?
Should that fail as well?

Actually, that fails because

----- Begin Fatal Exception 25-Jan-2024 01:25:05 CET-----------------------
An exception of category 'NotFound' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'process_path'
   [2] Calling method for module alpaka_cuda_async::TestAlpakaProducer/'testProducer'
Exception Message:
Service Request unable to find requested service with compiler type name ' alpaka_cuda_async::AlpakaService'.
----- End Fatal Exception -------------------------------------------------

@makortel
Contributor

What about the case where a job uses an alpaka_cuda_async::producer explicitly?
Should that fail as well?

Actually, that fails because

I'm glad that alpaka_cuda_async::producer alone fails. Theoretically a user could still hack it to work with an explicit process.add_(cms.Service('alpaka_cuda_async::AlpakaService')) and by somehow removing ProcessAcceleratorAlpaka from the process. But I hope that level of hacking is something that would be caught in code review (plus the removal of ProcessAcceleratorAlpaka would break all modules relying on the @alpaka suffix). A sketch of the hack is shown below.
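
For completeness, a sketch of the hack being described (shown only to illustrate the point; how exactly ProcessAcceleratorAlpaka would be removed is left out, since that removal is precisely what would break the @alpaka modules):

# load the CUDA backend service by hand, bypassing ProcessAcceleratorAlpaka
process.add_(cms.Service('alpaka_cuda_async::AlpakaService'))
# ... and somehow remove ProcessAcceleratorAlpaka from the process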
