CMSSW Fatal System Signal During Exit with Alpaka Caching Allocator #312

GNiendorf · 2023-07-28T22:49:58Z

The error is given as:

Fatal system signal has occurred during exit
./alpaka_setup.sh: line 59: 2562239 Aborted (core dumped) cmsRun step3_RAW2DIGI_RECO_VALIDATION_DQM_PU.py

Edit: Here is the full backtrace.

Dan's backtrace, with more information: here.

This issue seems related.

Steps to reproduce (on cgpu1, taken from @VourMa's README instructions). If you save the following as a file, for example alpaka_setup.sh, and then run chmod +x alpaka_setup.sh followed by ./alpaka_setup.sh, it should run automatically and produce the error at the very end. Make sure your GitHub username is set, or it will fail. This setup uses the 100-event step2 input file on cgpu1 that was made by Manos:

# Clone the TrackLooper repo
git clone [email protected]:SegmentLinking/TrackLooper.git
cd TrackLooper/

# Source the setup script to configure the environment
source setup.sh

# Make the TrackLooper using the "-mc" option to turn the caching allocator on
sdl_make_tracklooper -mc

cd ..

# Create the working folder and move into it
mkdir workingFolder
cd workingFolder

# Set up CMSSW
cmsrel CMSSW_13_0_0_pre4
cd CMSSW_13_0_0_pre4/src
cmsenv

# Initialize git and add remote
git cms-init
git remote add SegLink [email protected]:SegmentLinking/cmssw.git

# Fetch and checkout specific branch
git fetch SegLink CMSSW_13_0_0_pre4_LST_X
git cms-addpkg RecoTracker Configuration
git checkout CMSSW_13_0_0_pre4_LST_X

# Create lst.xml
cat <<EOF >lst.xml
<tool name="lst" version="1.0">
  <client>
    <environment name="LSTBASE" default="$PWD/../../../TrackLooper"/>
    <environment name="LIBDIR" default="\$LSTBASE/SDL"/>
    <environment name="INCLUDE" default="\$LSTBASE"/>
  </client>
  <runtime name="LST_BASE" value="\$LSTBASE"/>
  <lib name="sdl"/>
</tool>
EOF

# Setup scram and env
scram setup lst.xml
cmsenv

# Modify the LSTProducer.cc file
sed -i 's/lst_.run(ctx.queue().getNativeHandle(),/lst_.run(ctx.queue(),/' ./RecoTracker/LST/plugins/alpaka/LSTProducer.cc

# Check dependencies
git cms-checkdeps -a -A

# Build
scram b -j 12

# Generate the step3 file
cmsDriver.py step3  -s RAW2DIGI,RECO:reconstruction_trackingOnly,VALIDATION:@trackingOnlyValidation,DQM:@trackingOnlyDQM --conditions auto:phase2_realistic_T21 --datatier GEN-SIM-RECO,DQMIO -n 10 --eventcontent RECOSIM,DQM --geometry Extended2026D88 --era Phase2C17I13M9 --pileup AVE_200_BX_25ns --pileup_input file:file.root --procModifiers gpu,trackingLST,trackingIters01 --no_exec

# Edit the configuration file
sed -i "28i process.load('Configuration.StandardSequences.Accelerators_cff')\nprocess.AlpakaServiceCudaAsync = cms.Service('AlpakaServiceCudaAsync')\nprocess.AlpakaServiceSerialSync = cms.Service('AlpakaServiceSerialSync')" step3_RAW2DIGI_RECO_VALIDATION_DQM_PU.py

sed -i "/process.mix.input.fileNames =/c \
process.mix.input.fileNames = cms.untracked.vstring(['file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/066fc95d-1cef-4469-9e08-3913973cd4ce.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/07928a25-231b-450d-9d17-e20e751323a1.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/26bd8fb0-575e-4201-b657-94cdcb633045.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/4206a9c5-44c2-45a5-aab2-1a8a6043a08a.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/55a372bf-a234-4111-8ce0-ead6157a1810.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/59ad346c-f405-4288-96d7-795f81c43fe8.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/7280f5ec-b71d-4579-a730-7ce2de0ff906.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/b93adc85-715f-477a-afc9-65f3241933ee.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/c7a0aa46-f55c-4b01-977f-34a397b71fba.root', 'file:/data2/segmentlinking/PUSamplesForCMSSW1263/CMSSW_12_3_0_pre5/RelValMinBias_14TeV/GEN-SIM/123X_mcRun4_realistic_v4_2026D88noPU-v1/e77fa467-97cb-4943-884f-6965b4eb0390.root'])" step3_RAW2DIGI_RECO_VALIDATION_DQM_PU.py

sed -i "s|fileNames = cms.untracked.vstring('file:step3_DIGI2RAW.root')|fileNames = cms.untracked.vstring('file:/ceph/cms/store/user/evourlio/LST/step2_21034.1_100Events.root')|" step3_RAW2DIGI_RECO_VALIDATION_DQM_PU.py

# Run the modified step3
cmsRun step3_RAW2DIGI_RECO_VALIDATION_DQM_PU.py


GNiendorf · 2023-07-28T22:55:02Z

Paging @dan131riley. If anything comes to mind, please chime in!

GNiendorf · 2023-08-08T18:05:14Z

Tagging @fwyzard, we have two backtraces here (one from me and one from Dan both linked above).

fwyzard · 2023-08-08T22:15:33Z

I can try to reproduce and have a look, but first a couple of questions:

- The recipe above mentions CMSSW_13_0_0_pre4; is the issue still present after merging the workaround in cms-sw/cmssw#42427 ("Replace the SFINAE check with static_assert [13.0.x]")?
- Do you have a recipe for a more recent release of CMSSW, like 13.0.10 or 13.2.0?
- Can I use the recipe on e.g. lxplus-gpu, or one of the online machines?

VourMa · 2023-08-08T23:09:42Z

Thanks, Andrea! Some replies to your questions:

> the recipe above mentions CMSSW_13_0_0_pre4; is the issue still present after merging the workaround in cms-sw/cmssw#42427 ?

The workaround was propagated to our own copy of the caching allocator: 6ea9524.
It is included in PR #314, which may not be merged yet, but Gavin tested it locally and the error seems to persist. The test was indeed done in CMSSW_13_0_0_pre4. @GNiendorf can comment if I got anything wrong.

> do you have a recipe for a more recent release of CMSSW - like 13.0.10 or 13.2.0 ?

I have been working on getting the setup to work in CMSSW_13_2_0_pre2. The version I prepared should be functional in any CMSSW version with the "new accelerator framework". I can tidy it up tomorrow and send you a few details.

> can I use the recipe on e.g. lxplus-gpu, or one of the online machines ?

I think it should work anywhere as long as cvmfs is available.

dan131riley · 2023-08-08T23:22:58Z

@fwyzard so far as I know, the current LST Alpaka integration is not using the Alpaka caching allocator service. If you look at my stack traces, both calls to the caching allocator destructor are in the exit handlers, which is going to be after the CUDA service was unloaded. It may not be worth your time looking at this until the LST CMSSW integration is using the allocator service.

VourMa · 2023-08-09T10:49:12Z

> I have been working on getting the setup to work in CMSSW_13_2_0_pre2. The version I prepared should be functional in any CMSSW version with the "new accelerator framework". I can tidy it up tomorrow and send you a few details.

I went ahead and updated to more recent releases.
If one chooses to work in CMSSW_13_2_0_pre2, then the README can be followed to the letter by applying the substitutions CMSSW_13_0_0_pre4(_LST_X) -> CMSSW_13_2_0_pre2(_LST_X).
If one chooses to work in any other release that includes cms-sw/cmssw#41341, then cherry-picking commits SegmentLinking/cmssw@05c3d73 and SegmentLinking/cmssw@a0aae36 from SegmentLinking/cmssw/CMSSW_13_2_0_pre2_LST_X, instead of pulling the specific _LST_X branch, should work as well.

GNiendorf · 2023-08-09T15:24:12Z

> @fwyzard so far as I know, the current LST Alpaka integration is not using the Alpaka caching allocator service. If you look at my stack traces, both calls to the caching allocator destructor are in the exit handlers, which is going to be after the CUDA service was unloaded. It may not be worth your time looking at this until the LST CMSSW integration is using the allocator service.

Right now we are using a copied version of the caching allocator, which can also be run with our standalone code. @fwyzard your fix is applied on the alpaka_upgrade branch, which is still waiting to be merged in #314. This error occurs only when the caching allocator is enabled and the code runs within CMSSW, and it persists on this branch after your fix for the related issue was applied.

fwyzard · 2023-08-09T15:26:39Z

@GNiendorf I'm confused: does the crash happen when the application is run stand-alone (with the copy of the caching allocator with the fix) or does it happen within CMSSW ?

GNiendorf · 2023-08-09T15:29:12Z

> @GNiendorf I'm confused: does the crash happen when the application is run stand-alone (with the copy of the caching allocator with the fix) or does it happen within CMSSW ?

It happens only within CMSSW, but we are using our copied version of the CMSSW caching allocator when we are running within CMSSW as Dan mentioned above. See here for our copied version of the alpaka interface: https://github.com/SegmentLinking/TrackLooper/tree/alpaka_upgrade/code/alpaka_interface

fwyzard · 2023-08-09T15:32:00Z

So there are two identical but independent instances of the caching allocator?
That could very well be the reason for the problem.

GNiendorf · 2023-09-22T15:17:16Z

@fwyzard Sorry for the late reply, we spent some time resolving a few CPU/GPU backend differences before coming back to this issue.

Is there any documentation on how to use the CMSSW Alpaka caching allocator service correctly? Is it as simple as changing the include statement to point to the relevant CMSSW path, as I did here? Or is using the service more complicated?

fwyzard · 2023-09-23T08:52:57Z

Hi Gavin,
the issue is that you must ensure that all memory allocated by the caching allocator has been freed before the alpaka objects are destroyed at the end of the job.

If you have instances of a caching allocator in your code, you could try calling freeAllCached() after the event processing is complete, and before the destruction of the alpaka devices (which should happen sometime during the destruction of the services, if I remember correctly).

GNiendorf added the bug label on Jul 28, 2023.

GNiendorf linked a pull request on Oct 29, 2023 that will close this issue: Upgrade to CMSSW 13_3_0_pre3 and use libraries directly from CMSSW #348 (merged).

ariostas mentioned this issue on Nov 10, 2023 in: Upgrade to CMSSW 13_3_0_pre3 and use libraries directly from CMSSW #348 (merged).

VourMa closed this as completed in #348 on Nov 14, 2023.
