HLT farm crash in run 379617 (part-2) #44786

Closed
mmusich opened this issue Apr 19, 2024 · 28 comments · Fixed by #44801
Comments


mmusich commented Apr 19, 2024

While reviewing the whole list of error streamer files from run 379617 (related issue #44769) stored on /eos/cms/store/group/tsg/FOG/debug/240417_run379617/, to ascertain whether CMSSW_14_0_5_patch2 fixed all of them, using the following script [1], I found a single instance that still crashes, taking as input the file /eos/cms/store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root.

To reproduce:

cmsrel CMSSW_14_0_5_patch2
cd CMSSW_14_0_5_patch2/src
cmsenv

and then running:

#!/bin/bash -ex

# CMSSW_14_0_5_patch2

hltGetConfiguration run:379617 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root  > hlt.py
  
cmsRun hlt.py &> hlt.log

On lxplus-gpu the following assertion is hit:

terminate called after throwing an instance of 'std::runtime_error'
  what():  
src/HeterogeneousCore/CUDAUtilities/src/CachingDeviceAllocator.h, line 617:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorAssert: device-side assert triggered



A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

src/RecoTracker/PixelSeeding/plugins/alpaka/BrokenLineFit.dev.cc:167: void alpaka_cuda_async::Kernel_BLFastFit<N, TrackerTraits>::operator()(const TAcc &, const reco::TrackSoA<TrackerTraits>::HitContainer *, const cms::alpakatools::OneToManyAssocRandomAccess<TrackerTraits::tindex_type, <expression>, TrackerTraits::maxNumberOfTuples> *, TrackingRecHitSoA<TrackerTraits>::Layout::ConstView, const pixelCPEforDevice::ParamsOnDeviceT<TrackerTraits> *, TrackerTraits::tindex_type *, double *, float *, double *, unsigned int, unsigned int, signed int) const [with TAcc = alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, 1UL>, unsigned int>; <template-parameter-2-2> = void; int N = 3; TrackerTraits = pixelTopology::Phase1]: block:[69,0,0], thread: [2,0,0] Assertion `fast_fit(3) == fast_fit(3)` failed.

while on lxplus (i.e. on CPU) no crash is observed.

@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI

[1]

#!/bin/bash -ex

# CMSSW_14_0_5_patch2         
                                                                                            
hltGetConfiguration run:379617 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input file:converted.root  > hlt.py

cat <<@EOF >> hlt.py
process.options.numberOfThreads = 32
process.options.numberOfStreams = 32
@EOF

# Define a function to execute each iteration of the loop
process_file() {
    inputfile="$1"
    outputfile="${inputfile%.root}"
    cp hlt.py hlt_${outputfile}.py
    sed -i "s/file:converted\.root/\/store\/group\/tsg\/FOG\/debug\/240417_run379617\/${inputfile}/g" hlt_${outputfile}.py
    cmsRun hlt_${outputfile}.py &> "${outputfile}.log"
}

# Export the function so it can be used by parallel
export -f process_file

# Find the root files and run the function in parallel using GNU Parallel
eos ls /eos/cms/store/group/tsg/FOG/debug/240417_run379617/ | grep '\.root$' | parallel -j 8 process_file

mmusich commented Apr 19, 2024

@cms-sw/tracking-pog-l2 FYI


cmsbuild commented Apr 19, 2024

cms-bot internal usage

@cmsbuild

A new Issue was created by @mmusich.

@rappoccio, @antoniovilela, @smuzaffar, @makortel, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel

assign hlt, heterogeneous, reconstruction

@makortel

FYI @AdrianoDee

@cmsbuild

New categories assigned: hlt,heterogeneous,reconstruction

@Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel

The failing assertion is a NaN check:

// no NaN here....
ALPAKA_ASSERT_ACC(fast_fit(0) == fast_fit(0));
ALPAKA_ASSERT_ACC(fast_fit(1) == fast_fit(1));
ALPAKA_ASSERT_ACC(fast_fit(2) == fast_fit(2));
ALPAKA_ASSERT_ACC(fast_fit(3) == fast_fit(3));
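For reference, the self-comparison idiom works as a NaN test because, under IEEE 754, NaN compares unequal to every value, including itself. A minimal standalone illustration (not CMSSW code):

#include <cassert>
#include <cmath>
#include <limits>

int main() {
  double good = 1.5;
  double bad = std::numeric_limits<double>::quiet_NaN();
  assert(good == good);     // holds for any non-NaN value
  assert(!(bad == bad));    // NaN is the only value for which x == x is false
  assert(std::isnan(bad));  // the equivalent standard-library check
  return 0;
}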


fwyzard commented Apr 19, 2024

Curious.

Unrelated to the crash, but I wonder if

ALPAKA_ASSERT_ACC(not isnan(fast_fit(0)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(1)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(2)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(3)));

wouldn't be easier to understand when reading the code?

@missirol

Added the following printf in alpaka/BrokenLine.h

printf("fastFit: 0=%f 1=%f 2=%f 3=%f tmp=%f a=(%f,%f) b=(%f,%f) c=(%f,%f)\n", result(0), result(1), result(2), result(3), tmp, a(0), a(1), b(0), b(1), c(0), c(1));

Output on GPU.

fastFit: 0=inf 1=inf 2=inf 3=-nan tmp=inf a=(0.368802,-3.760513) b=(0.934136,-9.524984) c=(-1.302937,13.285498)

Output on CPU.

fastFit: 0=-303360091.997263 1=-29751183.422276 2=304815482.465272 3=0.534923 tmp=-631410.897150 a=(0.368802,-3.760513) b=(0.934136,-9.524984) c=(-1.302937,13.285498)

On GPU, both riemannFit::cross2D(acc, b, a) and riemannFit::cross2D(acc, c, a) are zero in this case.
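A zero cross product means the two difference vectors are parallel, i.e. the three hits are collinear, and the fast fit divides by that cross product. A minimal standalone illustration (not the CMSSW code; cross2D here stands in for riemannFit::cross2D, assumed to be the usual 2D cross product):

#include <cstdio>

struct Vec2 { double x, y; };

// the usual 2D cross product: zero when a and b are parallel
double cross2D(Vec2 a, Vec2 b) { return a.x * b.y - a.y * b.x; }

int main() {
  Vec2 b{1.0, 2.0}, a{2.0, 4.0};     // parallel difference vectors (collinear hits)
  double tmp = 0.5 / cross2D(b, a);  // 0.5 / 0.0 -> inf
  std::printf("tmp=%f tmp-tmp=%f\n", tmp, tmp - tmp);  // prints inf and -nan
  return 0;
}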


missirol commented Apr 21, 2024

The CUDA implementation (tested by just using a different HLT menu [1]) also produces the nan discussed above. It also has similar assert calls (link), but (if I understand correctly) these do nothing when running the CUDA version, because NDEBUG is defined.
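A standalone illustration of why those asserts are inert in release builds (not CMSSW code):

#include <cassert>
#include <limits>

int main() {
  double x = std::numeric_limits<double>::quiet_NaN();
  // with -DNDEBUG the standard assert macro expands to nothing,
  // so this NaN check is compiled out entirely in release builds
  assert(x == x);  // aborts in a debug build; a no-op under -DNDEBUG
  return 0;
}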

How to proceed? Remove the ALPAKA_ASSERT_ACC calls in the Alpaka version? Or add some kind of safeguard against these zero/nan/inf values? Or something else?


[1]

#!/bin/bash

jobLabel=test_cmssw44786_cuda

if [ ! -f "${jobLabel}"_cfg.py ]; then

  https_proxy=http://cmsproxy.cms:3128/ \
  hltGetConfiguration run:379660 \
    --globaltag 140X_dataRun3_HLT_v3 \
    --data \
    --no-prescale \
    --no-output \
    --paths AlCa_PFJet40_v* \
    --max-events 1 \
    --input root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root \
    > "${jobLabel}"_cfg.py

  cat <<@EOF >> "${jobLabel}"_cfg.py

del process.hltL1sZeroBias

if hasattr(process, 'HLTAnalyzerEndpath'):
    del process.HLTAnalyzerEndpath

try:
    del process.MessageLogger
    process.load('FWCore.MessageLogger.MessageLogger_cfi')
    process.MessageLogger.cerr.enableStatistics = False
except:
    pass

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0

process.source.skipEvents = cms.untracked.uint32( 86 )

process.options.wantSummary = True
@EOF
fi

CUDA_LAUNCH_BLOCKING=1 \
cmsRun "${jobLabel}"_cfg.py &> "${jobLabel}".log


slava77 commented Apr 21, 2024

type tracking


fwyzard commented Apr 21, 2024

How to proceed? Remove the ALPAKA_ASSERT_ACC calls in the Alpaka version? Or add some kind of safeguard against these zero/nan/inf values? Or something else?

Naively I don't think that just removing the assert is a good approach (dust, carpet, you get the idea).

On GPU, both riemannFit::cross2D(acc, b, a) and riemannFit::cross2D(acc, c, a) are zero in this case.

So the three points are basically aligned.
@AdrianoDee @rovere @VinInn @slava77 would it make sense to check for this and handle it in some special way?
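One possible shape for such a check, as a hedged sketch (hypothetical code assuming a circumcenter-based fast fit; not a concrete proposal from this thread):

#include <cmath>

struct Vec2 { double x, y; };

double cross2D(Vec2 a, Vec2 b) { return a.x * b.y - a.y * b.x; }

// Hypothetical guard: returns false instead of dividing by zero when the
// three hits (the first hit at the origin, b and c relative to it) are aligned.
bool fastFitCurvature(Vec2 b, Vec2 c, double& curvature) {
  double denom = 2.0 * cross2D(b, c);
  if (denom == 0.0)
    return false;  // aligned hits: straight line, infinite radius
  double bn = b.x * b.x + b.y * b.y;
  double cn = c.x * c.x + c.y * c.y;
  double ux = (c.y * bn - b.y * cn) / denom;  // circumcenter, relative to the first hit
  double uy = (b.x * cn - c.x * bn) / denom;
  curvature = 1.0 / std::sqrt(ux * ux + uy * uy);
  return true;
}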


mmusich commented Apr 22, 2024

Naively I don't think that just removing the assert is a good approach (dust, carpet, you get the idea).

while agreeing with the approach, we need to make an assessment soon, because every day we spend discussing this is one less day we take data with the menu V1.1 (and all its physics trigger updates).
So we either:

  • bite the bullet and put online V1.1 with CMSSW_14_0_5_patch2, with the potential pitfall of this kind of crash (hopefully at low frequency: a single occurrence in run 379617; we still need to look at run 379613 though);
  • remove temporarily the assert in the alpaka code while something better is prepared by the experts;
  • prepare a "V1.05" menu with the other updates while reverting the pixel local reconstruction to CUDA (which AFAIU has the same problem anyway). This would be very labor intensive on the STORM side.


fwyzard commented Apr 22, 2024

I do not see any reason for

  • prepare a "V1.05" menu with the other updates while reverting the pixel local reconstruction to CUDA (which AFAIU has the same problem anyway). This would be very labor intensive on the STORM side.


fwyzard commented Apr 22, 2024

remove temporarily the assert in the alpaka code while something better is prepared by the experts.

I am loath to adopt "temporary" solutions, because they have the undesirable tendency to stick around much longer than intended.


VinInn commented Apr 22, 2024

  1. We never assert on NaN (please remove the asserts; I do not know who introduced them, and in any case none of them is safe with fast-math (on host)).
  2. Why are we asserting in Alpaka: is this not making the GPU version dramatically slower?
  3. We need to emit an error:
    a) we do not have (yet) a mechanism to propagate errors from device to host;
    b) most probably we can just let the NaN percolate and catch it later on the host;
    c) we should just invalidate that track.


fwyzard commented Apr 22, 2024

  1. We never assert on NaN (please remove the asserts; I do not know who introduced them, and in any case none of them is safe with fast-math (on host)).

Technically, I think you did :-p
The oldest commit I could find with that code already has the comment and assert, and they made their way through the patatrack integration in CMSSW and the porting to Alpaka.

  2. Why are we asserting in Alpaka: is this not making the GPU version dramatically slower?

Probably yes.
But we also have a lot of extra code being put online for the first time, so I decided it was a good idea to keep the asserts enabled while we commission the new code.
We can re-measure the impact of keeping asserts in GPU code, and decide what to do once all crashes have been solved.

  3. We need to emit an error:
  • a) we do not have (yet) a mechanism to propagate errors from device to host

Technically, a device-side assert does just that: it makes the kernel fail, which surfaces as an asynchronous error on the host.
But I agree it's not a very nice approach.

  • b) most probably we can just let the NaN percolate and catch it later on the host

OK - but is there already some code on the host that would catch it and ...

  • c) we should just invalidate that track

... do this ?
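For reference, a minimal sketch of the device-to-host error path described above: after a device-side assert fires, subsequent CUDA runtime calls report cudaErrorAssert, which a host-side helper like the cudaCheck in the stack trace at the top can turn into a C++ exception. This is a simplified stand-in, not the actual CMSSW cudaCheck:

#include <cuda_runtime.h>
#include <sstream>
#include <stdexcept>

// simplified stand-in for the cudaCheck appearing in the stack trace above
inline void cudaCheck(cudaError_t status, const char* expression) {
  if (status != cudaSuccess) {
    std::ostringstream msg;
    msg << expression << "\n"
        << cudaGetErrorName(status) << ": " << cudaGetErrorString(status);
    throw std::runtime_error(msg.str());
  }
}

// usage: after a kernel whose device-side assert fired, e.g.
//   cudaCheck(cudaDeviceSynchronize(), "cudaDeviceSynchronize()");
// this throws std::runtime_error with "cudaErrorAssert: device-side assert triggered"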


AdrianoDee commented Apr 22, 2024

OK - but is there already some code on the host that would catch it and ...
... do this ?

Yes, the track would be discarded later in the chain:

for (int i = 0; i < 5; ++i) {
  isNaN |= std::isnan(tracks_view[it].state()(i));
}
if (isNaN) {
#ifdef NTUPLE_DEBUG
  printf("NaN in fit %d size %d chi2 %f\n", it, tracks_view.hitIndices().size(it), tracks_view[it].chi2());
#endif
  continue;
}

I checked that this is the case for the event in question (when disabling the NaN checks). Note that this is not just a single event: it's actually a single triplet in the whole run.


fwyzard commented Apr 22, 2024

Curious: std::isnan is not constexpr (until C++23), so in principle it should not work in a kernel.


fwyzard commented Apr 22, 2024

c) we should just invalidate that track

By the way, why?

3 (almost) aligned hits would point to a very high-pT track. Why should we invalidate it?


VinInn commented Apr 22, 2024

a=(0.368802,-3.760513) b=(0.934136,-9.524984) c=(-1.302937,13.285498)

Could these be printed with full precision, using "%a" instead of "%f"? I would like to test whether it is FMA that makes the cross product zero, or something else.
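For reference, "%a" prints the hexadecimal floating-point representation, exact to the bit, so CPU and GPU inputs can be compared without decimal rounding. A tiny standalone example:

#include <cstdio>

int main() {
  float x = 0.1f;
  std::printf("%f -> %a\n", x, x);  // prints: 0.100000 -> 0x1.99999ap-4
  return 0;
}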


VinInn commented Apr 22, 2024

I added the printf and ran on CPU: got no output! Which code is running on a machine w/o GPU?


VinInn commented Apr 22, 2024

I had edited the file in interface/, not in interface/alpaka/...
Hope we get rid of all these duplications soon.


VinInn commented Apr 22, 2024

Found:

fastFit: 0=inf 1=inf 2=inf 3=-nan tmp=inf a=(0x1.79a72p-2,-0x1.e1588p+1) b=(0x1.de4706p-1,-0x1.30ccacp+3) c=(-0x1.4d8d4bp+0,0x1.a922ccp+3)

This is on CPU, I suppose:

fastFit: 0=-0x1.214e85bff4ca4p+28 1=-0x1.c5f78f6c1a45ap+24 2=0x1.22b1d7a771c18p+28 3=0x1.11e163587decap-1 tmp=-0x1.344e5cb57444ap+19 a=(0x1.79a724p-2,-0x1.e1588p+1) b=(0x1.de4704p-1,-0x1.30ccacp+3) c=(-0x1.4d8d4bp+0,0x1.a922ccp+3)

It is not a subtle precision issue: the cross product of those three double-precision numbers (any combination) is zero (on CPU as well), no matter how one computes it. On CPU the inputs differ by a few bits.
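The claim can be checked standalone from the hexfloat values above; a sketch (assuming cross2D(a, b) = a.x*b.y - a.y*b.x, with the GPU inputs copied verbatim as C++17 hexfloat literals):

#include <cstdio>

struct Vec2 { float x, y; };

float cross2D(Vec2 a, Vec2 b) { return a.x * b.y - a.y * b.x; }

int main() {
  // GPU inputs taken verbatim from the printout above (C++17 hexfloat literals)
  Vec2 a{0x1.79a72p-2f, -0x1.e1588p+1f};
  Vec2 b{0x1.de4706p-1f, -0x1.30ccacp+3f};
  Vec2 c{-0x1.4d8d4bp+0f, 0x1.a922ccp+3f};
  // both reported to come out exactly zero in single precision
  std::printf("cross2D(b,a)=%a cross2D(c,a)=%a\n", cross2D(b, a), cross2D(c, a));
  return 0;
}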


VinInn commented Apr 22, 2024

A track with zero curvature will sooner or later produce a NaN, and somebody will reject it.
We are not even sure that a very small value would survive further manipulations downstream: so, unless somebody goes around and makes sure C=0 is properly propagated, the best is just to keep the NaN.
For sure the asserts shall be removed, even if an exact ZERO in double precision is always a bit suspicious (it can still happen, there are only 53 bits after all...) and it can happen on CPU as well.
In the printout there are "plenty" of values that could be "inf" in float.


mmusich commented Apr 23, 2024

we still need to look at run 379613 though

For the record, we checked all the available error stream files for that run with the following script [1] in CMSSW_14_0_5_patch2, and found no other crashes.

This particular issue should be dealt with by #44808 (if accepted).

[1]

#!/bin/bash -ex

# CMSSW_14_0_5_patch2         
                                                                                            
hltGetConfiguration run:379613 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input file:converted.root  > hlt.py

cat <<@EOF >> hlt.py
process.options.numberOfThreads = 32
process.options.numberOfStreams = 32
@EOF

# Define a function to execute each iteration of the loop
process_file() {
    inputfile="$1"
    outputfile="${inputfile%.root}"
    cp hlt.py hlt_${outputfile}.py
    sed -i "s/file:converted\.root/\/store\/group\/tsg\/FOG\/debug\/240417_run379613\/${inputfile}/g" hlt_${outputfile}.py
    cmsRun hlt_${outputfile}.py &> "${outputfile}.log"
}

# Export the function so it can be used by parallel
export -f process_file

# Find the root files and run the function in parallel using GNU Parallel
eos ls /eos/cms/store/group/tsg/FOG/debug/240417_run379613/ | grep '\.root$' | parallel -j 8 process_file


mmusich commented May 7, 2024

+hlt


makortel commented May 7, 2024

+heterogeneous
