HLT farm crash in run 379617 (part-2) #44786

Closed
mmusich opened this issue Apr 19, 2024 · 28 comments · Fixed by #44801
Comments


mmusich commented Apr 19, 2024

While reviewing the whole list of error streamer files from run 379617 (related issue #44769) stored on /eos/cms/store/group/tsg/FOG/debug/240417_run379617/, to ascertain whether CMSSW_14_0_5_patch2 fixed all of them, using the following script [1], I found a single instance that still crashes, taking as input the file /eos/cms/store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root.

To reproduce:

cmsrel CMSSW_14_0_5_patch2
cd CMSSW_14_0_5_patch2/src
cmsenv

and then running:

#!/bin/bash -ex

# CMSSW_14_0_5_patch2

hltGetConfiguration run:379617 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root  > hlt.py
  
cmsRun hlt.py &> hlt.log

On lxplus-gpu the following assertion is hit:

terminate called after throwing an instance of 'std::runtime_error'
  what():  
src/HeterogeneousCore/CUDAUtilities/src/CachingDeviceAllocator.h, line 617:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorAssert: device-side assert triggered



A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

src/RecoTracker/PixelSeeding/plugins/alpaka/BrokenLineFit.dev.cc:167: void alpaka_cuda_async::Kernel_BLFastFit<N, TrackerTraits>::operator()(const TAcc &, const reco::TrackSoA<TrackerTraits>::HitContainer *, const cms::alpakatools::OneToManyAssocRandomAccess<TrackerTraits::tindex_type, <expression>, TrackerTraits::maxNumberOfTuples> *, TrackingRecHitSoA<TrackerTraits>::Layout::ConstView, const pixelCPEforDevice::ParamsOnDeviceT<TrackerTraits> *, TrackerTraits::tindex_type *, double *, float *, double *, unsigned int, unsigned int, signed int) const [with TAcc = alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, 1UL>, unsigned int>; <template-parameter-2-2> = void; int N = 3; TrackerTraits = pixelTopology::Phase1]: block:[69,0,0], thread: [2,0,0] Assertion `fast_fit(3) == fast_fit(3)` failed.

while on lxplus (i.e. on CPU) no crash is observed.

@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI

[1]

#!/bin/bash -ex

# CMSSW_14_0_5_patch2         
                                                                                            
hltGetConfiguration run:379617 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input file:converted.root  > hlt.py

cat <<@EOF >> hlt.py
process.options.numberOfThreads = 32
process.options.numberOfStreams = 32
@EOF

# Define a function to execute each iteration of the loop
process_file() {
    inputfile="$1"
    outputfile="${inputfile%.root}"
    cp hlt.py hlt_${outputfile}.py
    sed -i "s/file:converted\.root/\/store\/group\/tsg\/FOG\/debug\/240417_run379617\/${inputfile}/g" hlt_${outputfile}.py
    cmsRun hlt_${outputfile}.py &> "${outputfile}.log"
}

# Export the function so it can be used by parallel
export -f process_file

# Find the root files and run the function in parallel using GNU Parallel
eos ls /eos/cms/store/group/tsg/FOG/debug/240417_run379617/ | grep '\.root$' | parallel -j 8 process_file

mmusich commented Apr 19, 2024

@cms-sw/tracking-pog-l2 FYI


cmsbuild commented Apr 19, 2024

cms-bot internal usage

@cmsbuild

A new Issue was created by @mmusich.

@rappoccio, @antoniovilela, @smuzaffar, @makortel, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel

assign hlt, heterogeneous, reconstruction

@makortel

FYI @AdrianoDee

@cmsbuild

New categories assigned: hlt,heterogeneous,reconstruction

@Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel

The failing assertion is a NaN check:

// no NaN here....
ALPAKA_ASSERT_ACC(fast_fit(0) == fast_fit(0));
ALPAKA_ASSERT_ACC(fast_fit(1) == fast_fit(1));
ALPAKA_ASSERT_ACC(fast_fit(2) == fast_fit(2));
ALPAKA_ASSERT_ACC(fast_fit(3) == fast_fit(3));
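For reference, the self-comparison idiom works as a NaN test because, under IEEE 754, NaN compares unequal to every value, including itself. A minimal standalone illustration (not CMSSW code):

#include <cassert>
#include <cmath>
#include <limits>

int main() {
  double good = 1.5;
  double bad = std::numeric_limits<double>::quiet_NaN();
  assert(good == good);     // holds for any non-NaN value
  assert(!(bad == bad));    // NaN is the only value for which x == x is false
  assert(std::isnan(bad));  // the equivalent standard-library check
  return 0;
}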


fwyzard commented Apr 19, 2024

Curious.

Unrelated to the crash, but I wonder if

ALPAKA_ASSERT_ACC(not isnan(fast_fit(0)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(1)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(2)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(3)));

wouldn't be easier to understand when reading the code?

@missirol

Added the following printf in alpaka/BrokenLine.h

printf("fastFit: 0=%f 1=%f 2=%f 3=%f tmp=%f a=(%f,%f) b=(%f,%f) c=(%f,%f)\n", result(0), result(1), result(2), result(3), tmp, a(0), a(1), b(0), b(1), c(0), c(1));

Output on GPU.

fastFit: 0=inf 1=inf 2=inf 3=-nan tmp=inf a=(0.368802,-3.760513) b=(0.934136,-9.524984) c=(-1.302937,13.285498)

Output on CPU.

fastFit: 0=-303360091.997263 1=-29751183.422276 2=304815482.465272 3=0.534923 tmp=-631410.897150 a=(0.368802,-3.760513) b=(0.934136,-9.524984) c=(-1.302937,13.285498)

On GPU, both riemannFit::cross2D(acc, b, a) and riemannFit::cross2D(acc, c, a) are zero in this case.
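A zero cross product means the two difference vectors are parallel, i.e. the three hits are collinear, and the fast fit divides by that cross product. A minimal standalone illustration (not the CMSSW code; cross2D here stands in for riemannFit::cross2D, assumed to be the usual 2D cross product):

#include <cstdio>

struct Vec2 { double x, y; };

// the usual 2D cross product: zero when a and b are parallel
double cross2D(Vec2 a, Vec2 b) { return a.x * b.y - a.y * b.x; }

int main() {
  Vec2 b{1.0, 2.0}, a{2.0, 4.0};     // parallel difference vectors (collinear hits)
  double tmp = 0.5 / cross2D(b, a);  // 0.5 / 0.0 -> inf
  std::printf("tmp=%f tmp-tmp=%f\n", tmp, tmp - tmp);  // prints inf and -nan
  return 0;
}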


missirol commented Apr 21, 2024

The CUDA implementation (tested by just using a different HLT menu [1]) also produces the nan discussed above. It also has similar assert calls (link), but (if I understand correctly) these do nothing when running the CUDA version, because NDEBUG is defined.
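A standalone illustration of why those asserts are inert in release builds (not CMSSW code):

#include <cassert>
#include <limits>

int main() {
  double x = std::numeric_limits<double>::quiet_NaN();
  // with -DNDEBUG the standard assert macro expands to nothing,
  // so this NaN check is compiled out entirely in release builds
  assert(x == x);  // aborts in a debug build; a no-op under -DNDEBUG
  return 0;
}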

How to proceed? Remove the ALPAKA_ASSERT_ACC calls in the Alpaka version? Or add some kind of safeguard against these zero/nan/inf values? Or something else?


[1]

#!/bin/bash

jobLabel=test_cmssw44786_cuda

if [ ! -f "${jobLabel}"_cfg.py ]; then

  https_proxy=http://cmsproxy.cms:3128/ \
  hltGetConfiguration run:379660 \
    --globaltag 140X_dataRun3_HLT_v3 \
    --data \
    --no-prescale \
    --no-output \
    --paths AlCa_PFJet40_v* \
    --max-events 1 \
    --input root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root \
    > "${jobLabel}"_cfg.py

  cat <<@EOF >> "${jobLabel}"_cfg.py

del process.hltL1sZeroBias

if hasattr(process, 'HLTAnalyzerEndpath'):
    del process.HLTAnalyzerEndpath

try:
    del process.MessageLogger
    process.load('FWCore.MessageLogger.MessageLogger_cfi')
    process.MessageLogger.cerr.enableStatistics = False
except:
    pass

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0

process.source.skipEvents = cms.untracked.uint32( 86 )

process.options.wantSummary = True
@EOF
fi

CUDA_LAUNCH_BLOCKING=1 \
cmsRun "${jobLabel}"_cfg.py &> "${jobLabel}".log


slava77 commented Apr 21, 2024

type tracking


fwyzard commented Apr 21, 2024

How to proceed? Remove the ALPAKA_ASSERT_ACC calls in the Alpaka version? Or add some kind of safeguard against these zero/nan/inf values? Or something else?

Naively I don't think that just removing the assert is a good approach (dust, carpet, you get the idea).

On GPU, both riemannFit::cross2D(acc, b, a) and riemannFit::cross2D(acc, c, a) are zero in this case.

So the three points are basically aligned.
@AdrianoDee @rovere @VinInn @slava77 would it make sense to check for this and handle it in some special way?
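One possible shape for such a check, as a hedged sketch (hypothetical code assuming a circumcenter-based fast fit; not a concrete proposal from this thread):

#include <cmath>

struct Vec2 { double x, y; };

double cross2D(Vec2 a, Vec2 b) { return a.x * b.y - a.y * b.x; }

// Hypothetical guard: returns false instead of dividing by zero when the
// three hits (the first hit at the origin, b and c relative to it) are aligned.
bool fastFitCurvature(Vec2 b, Vec2 c, double& curvature) {
  double denom = 2.0 * cross2D(b, c);
  if (denom == 0.0)
    return false;  // aligned hits: straight line, infinite radius
  double bn = b.x * b.x + b.y * b.y;
  double cn = c.x * c.x + c.y * c.y;
  double ux = (c.y * bn - b.y * cn) / denom;  // circumcenter, relative to the first hit
  double uy = (b.x * cn - c.x * bn) / denom;
  curvature = 1.0 / std::sqrt(ux * ux + uy * uy);
  return true;
}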


mmusich commented Apr 22, 2024

Naively I don't think that just removing the assert is a good approach (dust, carpet, you get the idea).

while agreeing with the approach, we need to make an assessment soon, because every day we spend discussing this is one less day we take data with the menu V1.1 (and all its physics trigger updates).
So we either:

  • bite the bullet and put online V1.1 with CMSSW_14_0_5_patch2, with the potential pitfall of this kind of crash (hopefully at low frequency: a single occurrence in run 379617; we still need to look at run 379613 though);
  • remove temporarily the assert in the alpaka code while something better is prepared by the experts;
  • prepare a "V1.05" menu with the other updates while reverting the pixel local reconstruction to CUDA (which AFAIU has the same problem anyway). This would be very labor intensive on the STORM side.


fwyzard commented Apr 22, 2024

I do not see any reason for

  • prepare a "V1.05" menu with the other updates while reverting the pixel local reconstruction to CUDA (which AFAIU has the same problem anyway). This would be very labor intensive on the STORM side.


fwyzard commented Apr 22, 2024

remove temporarily the assert in the alpaka code while something better is prepared by the experts.

I am loath to adopt "temporary" solutions, because they have the undesirable tendency to stick around much longer than intended.


VinInn commented Apr 22, 2024

  1. We never assert on NaN (please remove the asserts; I do not know who introduced them, and in any case none of them is safe with fast-math (on host)).
  2. Why are we asserting in Alpaka: is this not making the GPU version dramatically slower?
  3. We need to emit an error:
    a) we do not have (yet) a mechanism to propagate errors from device to host;
    b) most probably we can just let the NaN percolate and catch it later on the host;
    c) we should just invalidate that track.


fwyzard commented Apr 22, 2024

  1. We never assert on NaN (please remove the asserts; I do not know who introduced them, and in any case none of them is safe with fast-math (on host)).

Technically, I think you did :-p
The oldest commit I could find with that code already has the comment and assert, and they made their way through the patatrack integration in CMSSW and the porting to Alpaka.

  2. Why are we asserting in Alpaka: is this not making the GPU version dramatically slower?

Probably yes.
But we also have a lot of extra code being put online for the first time, so I decided it was a good idea to keep the asserts enabled while we commission the new code.
We can re-measure the impact of keeping asserts in GPU code, and decide what to do once all crashes have been solved.

  3. We need to emit an error:
  • a) we do not have (yet) a mechanism to propagate errors from device to host

Technically, a device-side assert does just that: it makes the kernel fail, which surfaces as an asynchronous error on the host.
But I agree it's not a very nice approach.

  • b) most probably we can just let the NaN percolate and catch it later on the host

OK - but is there already some code on the host that would catch it and ...

  • c) we should just invalidate that track

... do this ?
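For reference, a minimal sketch of the device-to-host error path described above: after a device-side assert fires, subsequent CUDA runtime calls report cudaErrorAssert, which a host-side helper like the cudaCheck in the stack trace at the top can turn into a C++ exception. This is a simplified stand-in, not the actual CMSSW cudaCheck:

#include <cuda_runtime.h>
#include <sstream>
#include <stdexcept>

// simplified stand-in for the cudaCheck appearing in the stack trace above
inline void cudaCheck(cudaError_t status, const char* expression) {
  if (status != cudaSuccess) {
    std::ostringstream msg;
    msg << expression << "\n"
        << cudaGetErrorName(status) << ": " << cudaGetErrorString(status);
    throw std::runtime_error(msg.str());
  }
}

// usage: after a kernel whose device-side assert fired, e.g.
//   cudaCheck(cudaDeviceSynchronize(), "cudaDeviceSynchronize()");
// this throws std::runtime_error with "cudaErrorAssert: device-side assert triggered"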


AdrianoDee commented Apr 22, 2024

OK - but is there already some code on the host that would catch it and ...
... do this ?

Yes, the track would be discarded later in the chain:

for (int i = 0; i < 5; ++i) {
  isNaN |= std::isnan(tracks_view[it].state()(i));
}
if (isNaN) {
#ifdef NTUPLE_DEBUG
  printf("NaN in fit %d size %d chi2 %f\n", it, tracks_view.hitIndices().size(it), tracks_view[it].chi2());
#endif
  continue;
}

I checked that this is the case for the event in question (when disabling the NaN checks). Note that this is not just a single event: it's actually a single triplet in the whole run.


fwyzard commented Apr 22, 2024

Curious: std::isnan is not constexpr (until C++23), so in principle it should not work in a kernel.


fwyzard commented Apr 22, 2024

c) we should just invalidate that track

By the way, why?

3 (almost) aligned hits would point to a very high-pT track. Why should we invalidate it?


VinInn commented Apr 22, 2024

a=(0.368802,-3.760513) b=(0.934136,-9.524984) c=(-1.302937,13.285498)

Could these be printed with full precision, using "%a" instead of "%f"? I would like to test whether it is FMA that makes the cross product zero, or something else.
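For reference, "%a" prints the hexadecimal floating-point representation, exact to the bit, so CPU and GPU inputs can be compared without decimal rounding. A tiny standalone example:

#include <cstdio>

int main() {
  float x = 0.1f;
  std::printf("%f -> %a\n", x, x);  // prints: 0.100000 -> 0x1.99999ap-4
  return 0;
}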


VinInn commented Apr 22, 2024

I added the printf and ran on CPU: got no output! Which code is running on a machine w/o GPU?


VinInn commented Apr 22, 2024

I had edited the file in interface/, not in interface/alpaka/...
Hope we get rid of all these duplications soon.


VinInn commented Apr 22, 2024

Found:

fastFit: 0=inf 1=inf 2=inf 3=-nan tmp=inf a=(0x1.79a72p-2,-0x1.e1588p+1) b=(0x1.de4706p-1,-0x1.30ccacp+3) c=(-0x1.4d8d4bp+0,0x1.a922ccp+3)

This is on CPU, I suppose:

fastFit: 0=-0x1.214e85bff4ca4p+28 1=-0x1.c5f78f6c1a45ap+24 2=0x1.22b1d7a771c18p+28 3=0x1.11e163587decap-1 tmp=-0x1.344e5cb57444ap+19 a=(0x1.79a724p-2,-0x1.e1588p+1) b=(0x1.de4704p-1,-0x1.30ccacp+3) c=(-0x1.4d8d4bp+0,0x1.a922ccp+3)

It is not a subtle precision issue: the cross product of those three double-precision numbers (any combination) is zero (on CPU as well), no matter how one computes it. On CPU the inputs differ by a few bits.
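The claim can be checked standalone from the hexfloat values above; a sketch (assuming cross2D(a, b) = a.x*b.y - a.y*b.x, with the GPU inputs copied verbatim as C++17 hexfloat literals):

#include <cstdio>

struct Vec2 { float x, y; };

float cross2D(Vec2 a, Vec2 b) { return a.x * b.y - a.y * b.x; }

int main() {
  // GPU inputs taken verbatim from the printout above (C++17 hexfloat literals)
  Vec2 a{0x1.79a72p-2f, -0x1.e1588p+1f};
  Vec2 b{0x1.de4706p-1f, -0x1.30ccacp+3f};
  Vec2 c{-0x1.4d8d4bp+0f, 0x1.a922ccp+3f};
  // both reported to come out exactly zero in single precision
  std::printf("cross2D(b,a)=%a cross2D(c,a)=%a\n", cross2D(b, a), cross2D(c, a));
  return 0;
}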


VinInn commented Apr 22, 2024

A track with zero curvature will sooner or later produce a NaN, and somebody will reject it.
We are not even sure that a very small value would survive further manipulations downstream: so, unless somebody goes around and makes sure C=0 is properly propagated, the best is just to keep the NaN.
For sure the asserts shall be removed, even if an exact ZERO in double precision is always a bit suspicious (it can still happen, there are only 53 bits after all...) and it can happen on CPU as well.
In the printout there are "plenty" of values that could be "inf" in float.


mmusich commented Apr 23, 2024

we still need to look at run 379613 though

For the record, we checked all the available error stream files for that run with the following script [1] in CMSSW_14_0_5_patch2, and found no other crashes.

This particular issue should be dealt with by #44808 (if accepted).

[1]

#!/bin/bash -ex

# CMSSW_14_0_5_patch2         
                                                                                            
hltGetConfiguration run:379613 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input file:converted.root  > hlt.py

cat <<@EOF >> hlt.py
process.options.numberOfThreads = 32
process.options.numberOfStreams = 32
@EOF

# Define a function to execute each iteration of the loop
process_file() {
    inputfile="$1"
    outputfile="${inputfile%.root}"
    cp hlt.py hlt_${outputfile}.py
    sed -i "s/file:converted\.root/\/store\/group\/tsg\/FOG\/debug\/240417_run379613\/${inputfile}/g" hlt_${outputfile}.py
    cmsRun hlt_${outputfile}.py &> "${outputfile}.log"
}

# Export the function so it can be used by parallel
export -f process_file

# Find the root files and run the function in parallel using GNU Parallel
eos ls /eos/cms/store/group/tsg/FOG/debug/240417_run379613/ | grep '\.root$' | parallel -j 8 process_file


mmusich commented May 7, 2024

+hlt


makortel commented May 7, 2024

+heterogeneous
