CMSSW Fatal System Signal During Exit with Alpaka Caching Allocator #312
Comments
Paging @dan131riley. If anything comes to mind, please chime in!
Tagging @fwyzard: we have two backtraces here (one from me and one from Dan, both linked above).
I can try to reproduce and have a look, but first a couple of questions:
Thanks, Andrea! Some replies to your questions:
The workaround was propagated to our own copy of the caching allocator: 6ea9524.
I have been working on getting the setup to work in
I think it should work anywhere as long as
@fwyzard as far as I know, the current LST Alpaka integration is not using the Alpaka caching allocator service. If you look at my stack traces, both calls to the caching allocator destructor are in the exit handlers, which run after the CUDA service has been unloaded. It may not be worth your time looking at this until the LST CMSSW integration is using the allocator service.
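To make the ordering problem described above concrete, here is a minimal C++ sketch. It is not the CMSSW or Alpaka code: `FakeRuntime`, `FakeCachingAllocator`, `freeAllCached` and `simulateExit` are hypothetical names invented for illustration, and the scope exit stands in for the process exit handlers. It only shows the general pattern of an allocator destructor that runs after the runtime it depends on has already been shut down, and how an explicit earlier release changes the outcome.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a device runtime such as the CUDA service.
struct FakeRuntime {
  bool alive = true;
  void shutdown() { alive = false; }
};

// Hypothetical stand-in for a caching allocator that returns its cached
// blocks to the runtime when it is destroyed.
class FakeCachingAllocator {
public:
  FakeCachingAllocator(FakeRuntime& runtime, std::vector<std::string>& log)
      : runtime_(runtime), log_(log) {}

  ~FakeCachingAllocator() { freeAllCached(); }

  // Releases the cache once and records whether the runtime was still alive.
  void freeAllCached() {
    if (released_) return;
    released_ = true;
    log_.push_back(runtime_.alive ? "freed while runtime alive"
                                  : "freed after runtime shutdown");
  }

private:
  FakeRuntime& runtime_;
  std::vector<std::string>& log_;
  bool released_ = false;
};

// Models the end of the job: the runtime is shut down first, and the
// allocator destructor only runs afterwards (here: at scope exit).
std::vector<std::string> simulateExit(bool releaseBeforeShutdown) {
  std::vector<std::string> log;
  FakeRuntime runtime;
  {
    FakeCachingAllocator allocator(runtime, log);
    if (releaseBeforeShutdown) allocator.freeAllCached();
    runtime.shutdown();
  }  // allocator destructor runs here, after the runtime is gone
  assert(log.size() == 1);  // the cache is released exactly once
  return log;
}
```

With `simulateExit(false)` the only release happens in the destructor, after shutdown, which is the situation the backtraces point to. With `simulateExit(true)` the cache is released while the runtime is still alive and the destructor has nothing left to do. In the real code the late release would call into an unloaded CUDA runtime rather than append to a log.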
I went ahead and updated to more recent releases.
Right now we are using a copied version of the caching allocator, which can also be run with our standalone code. @fwyzard your fix is applied on the alpaka_upgrade branch, which is still waiting to be merged in #314. This error only occurs within CMSSW when the caching allocator is enabled, and it persists on this branch after your fix for the related issue was applied.
@GNiendorf I'm confused: does the crash happen when the application is run stand-alone (with the copy of the caching allocator with the fix) or does it happen within CMSSW?
It happens only within CMSSW, but when we run within CMSSW we are using our copied version of the CMSSW caching allocator, as Dan mentioned above. See here for our copied version of the alpaka interface: https://github.com/SegmentLinking/TrackLooper/tree/alpaka_upgrade/code/alpaka_interface
So there are two identical but independent instances of the caching allocator?
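For readers wondering what "identical but independent" would mean in practice, here is a small C++ sketch. The namespaces `framework_copy` and `standalone_copy`, the `CachingAllocator` struct and `getAllocator()` are hypothetical and are not taken from CMSSW or TrackLooper. The sketch assumes each copy of the code exposes its allocator through a function-local static, in which case each copy owns a separate object with its own cache and its own destructor call at exit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical copy of the allocator code as it might live in the framework.
namespace framework_copy {
  struct CachingAllocator {
    std::size_t cachedBytes = 0;
  };
  inline CachingAllocator& getAllocator() {
    static CachingAllocator instance;  // one instance per copy of the code
    return instance;
  }
}  // namespace framework_copy

// Hypothetical copy of the same code kept in the standalone package.
namespace standalone_copy {
  struct CachingAllocator {
    std::size_t cachedBytes = 0;
  };
  inline CachingAllocator& getAllocator() {
    static CachingAllocator instance;  // a second, unrelated instance
    return instance;
  }
}  // namespace standalone_copy

// Returns true if the two copies hold separate objects with separate state.
bool instancesAreIndependent() {
  framework_copy::getAllocator().cachedBytes = 1024;
  const void* a = static_cast<const void*>(&framework_copy::getAllocator());
  const void* b = static_cast<const void*>(&standalone_copy::getAllocator());
  return a != b && standalone_copy::getAllocator().cachedBytes == 0;
}
```

Under this assumption, a fix or an explicit cleanup applied to one copy has no effect on the other, so each instance would need to be handled on its own.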
@fwyzard Sorry for the late reply, we spent some time resolving a few CPU/GPU backend differences before coming back to this issue. Is there any documentation on how to use the CMSSW Alpaka caching allocator service correctly? Is it as simple as changing the include statement to point to the relevant CMSSW path, as I did here? Or is using the service more complicated?
Hi Gavin, if you have instances of a caching allocator in your code, you could try calling
The error is given as:
Fatal system signal has occurred during exit
./alpaka_setup.sh: line 59: 2562239 Aborted (core dumped) cmsRun step3_RAW2DIGI_RECO_VALIDATION_DQM_PU.py
Edit: Here is the full bt
Dan's bt with more information - Here
This Issue seems related.
Steps to reproduce (on cgpu1, taken from @VourMa's readme instructions). If you put this into a file, for example alpaka_setup.sh, and run chmod +x alpaka_setup.sh followed by ./alpaka_setup.sh, it should run automatically and produce the error at the very end. Make sure your GitHub username is set, though, or it will fail. This setup uses the 100-event step2 input file on cgpu1 that was made by Manos: