Deal with late parents in HydjetHadronizer #39784

Dr15Jones · 2022-10-19T20:21:16Z

PR description:

There were cases where parent particles were later in the list than their children. This led to segmentation faults. This change creates those parents when needed and avoids creating them a second time later.

PR validation:

I was able to catch the segmentation fault in the debugger and found that the ID for the parent was larger than that of the child. This meant the parent had not yet been made, which led to the crash. After making this change, I ran the job 4 times and saw no further crashes. Previously the same job would crash more than 50% of the time.

Dr15Jones · 2022-10-19T20:24:17Z

please test

cmsbuild · 2022-10-19T20:28:00Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39784/32650

This PR adds an extra 16KB to repository

cmsbuild · 2022-10-19T20:28:28Z

A new Pull Request was created by @Dr15Jones (Chris Jones) for master.

It involves the following packages:

GeneratorInterface/HydjetInterface (generators)

@SiewYan, @mkirsano, @Saptaparna, @alberto-sanchez, @menglu21, @GurpreetSinghChahal can you please review it and eventually sign? Thanks.
@alberto-sanchez, @cbaus, @mkirsano this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

Dr15Jones · 2022-10-19T20:30:46Z

please test workflow 159.03

cmsbuild · 2022-10-19T23:41:39Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-430f76/28380/summary.html
COMMIT: 9162d5f
CMSSW: CMSSW_12_6_X_2022-10-19-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/39784/28380/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

/data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-430f76/159.03_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD

Summary:

No significant changes to the logs found
Reco comparison results: 12 differences found in the comparisons
Reco comparison had 6 failed jobs
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3391158
DQMHistoTests: Total failures: 12
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3391124
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 206 log files, 48 edm output root files, 48 DQM output files
TriggerResults: no differences found

Dr15Jones · 2022-10-24T12:56:29Z

ping
This fixes a recurring crash happening in the IBs.

perrotta · 2022-10-24T13:06:07Z

ping @cms-sw/generators-l2

perrotta · 2022-10-25T14:30:16Z

urgent

menglu21 · 2022-10-25T16:44:23Z

+1

cmsbuild · 2022-10-25T16:44:44Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

perrotta · 2022-10-25T16:54:05Z

+1

wouf · 2022-10-26T10:08:25Z

Dear @menglu21, @Dr15Jones, all, unfortunately I missed this PR (could you please notify me, or other HI GEN contacts, in the future when HI generators are changed?).
I can't reproduce the issue you are describing in CMSSW_12_6_X_2022-10-25-2300, either with or without @Dr15Jones's corrections. I used this config. Could you please provide more information about it?

The changes at lines 371 and 387 look unnecessary, because the value of the iterator ihy is not repeated until the primary_particle and particle vectors are re-initialised at line 335.

The condition at line 397 should never be executed; if it is, we have to check the logic (in my tests it was not).

Dr15Jones · 2022-10-26T14:04:20Z

I can't reproduce the issue you are describing in CMSSW_12_6_X_2022-10-25-2300, either with or without @Dr15Jones's corrections. I used this config. Could you please provide more information about it?

So you can see the failure in yesterday evening's IB (just before this PR was added to the IBs)
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc10/CMSSW_12_6_X_2022-10-25-1100/pyRelValMatrixLogs/run/159.03_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD/step1_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD.log#/

problems were also seen with the address sanitizer (ASAN) builds as seen in this issue #39350

This segmentation fault does not happen every time workflow 159.03 runs. When I was debugging the problem, it happened about half the time. I was fortunate that one of those times happened while running the job in the debugger, after I'd recompiled the code to contain debugging symbols. This let me see that the problem was caused by a mother particle having an index larger than the daughter's index (in the case I saw, exactly 1 larger). Requesting the mother therefore returned a nullptr.

Dr15Jones · 2022-10-26T14:13:40Z

@wouf wrote:

The condition at line 397 should never be executed; if it is, we have to check the logic (in my tests it was not).

Actually this was exactly what caused the segmentation fault: the value of mid was greater than ihy (i.e. the mother's index was larger than the daughter's index). For that case, I now create the mother as needed and add it to the requisite containers. Because particles can now be built in another part of the code, the changes at lines 371 and 380 are needed to avoid overwriting a mother particle that was already created on demand.
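
As a rough illustration of the approach described in this comment, here is a minimal sketch. It does not reproduce HydjetHadronizer.cc: `Particle`, `getOrCreate` and `linkMothers` are hypothetical names, and a plain struct stands in for HepMC::GenParticle. The idea shown is only that a slot is filled on first request and never rebuilt afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for HepMC::GenParticle; only what the sketch needs.
struct Particle {
  int pdgId;
};

// Returns the particle at `idx`, building it first if its slot is still empty.
// Growing the vector also covers an index at or past the current end.
Particle* getOrCreate(std::vector<std::unique_ptr<Particle>>& slots,
                      const std::vector<int>& pdgIds, std::size_t idx) {
  assert(idx < pdgIds.size());
  if (idx >= slots.size())
    slots.resize(idx + 1);
  if (!slots[idx])
    slots[idx] = std::make_unique<Particle>(Particle{pdgIds[idx]});
  return slots[idx].get();
}

// Walks the list in order. A mother index larger than the daughter index no
// longer yields a null pointer, and a mother built early is not overwritten
// when the loop reaches her own index. motherOf[i] < 0 means "no mother".
std::vector<const Particle*> linkMothers(std::vector<std::unique_ptr<Particle>>& slots,
                                         const std::vector<int>& pdgIds,
                                         const std::vector<int>& motherOf) {
  assert(motherOf.size() == pdgIds.size());
  std::vector<const Particle*> mothers(pdgIds.size(), nullptr);
  for (std::size_t ihy = 0; ihy < pdgIds.size(); ++ihy) {
    getOrCreate(slots, pdgIds, ihy);
    if (motherOf[ihy] >= 0)
      mothers[ihy] = getOrCreate(slots, pdgIds, static_cast<std::size_t>(motherOf[ihy]));
  }
  return mothers;
}
```

With a two-particle list where particle 0 names particle 1 as its mother, `linkMothers` returns a valid pointer for particle 0, and that pointer is the same object later found in slot 1.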

Dr15Jones · 2022-10-26T14:46:33Z

It appears that this fix is either incomplete or uncovered a different problem as we are now getting an exception in the workflow
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_12_6_X_2022-10-26-1100/pyRelValMatrixLogs/run/159.03_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD/step1_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD.log#/269-269

where the message is

----- Begin Fatal Exception 26-Oct-2022 13:45:02 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 3
   [1] Running path 'simulation_step'
   [2] Calling method for module HydjetGeneratorFilter/'generator'
Exception Message:
A std::exception was thrown.
vector::_M_range_check: __n (which is 86315) >= this->size() (which is 86315)
----- End Fatal Exception -------------------------------------------------

Dr15Jones · 2022-10-26T14:49:43Z

It looks like I missed the case where the container isn't already large enough. As far as I can tell, the new exception comes from this line:

cmssw/GeneratorInterface/HydjetInterface/src/HydjetHadronizer.cc

Line 391 in d03a9e7

HepMC::GenParticle* mother = primary_particle.at(mid);

wouf · 2022-10-26T15:19:25Z

I can't reproduce the issue you are describing in CMSSW_12_6_X_2022-10-25-2300, either with or without @Dr15Jones's corrections. I used this config. Could you please provide more information about it?

So you can see the failure in yesterday evening's IB (just before this PR was added to the IBs) https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc10/CMSSW_12_6_X_2022-10-25-1100/pyRelValMatrixLogs/run/159.03_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD/step1_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD.log#/

problems were also seen with the address sanitizer (ASAN) builds as seen in this issue #39350

This segmentation fault does not happen every time workflow 159.03 runs. When I was debugging the problem, it happened about half the time. I was fortunate that one of those times happened while running the job in the debugger, after I'd recompiled the code to contain debugging symbols. This let me see that the problem was caused by a mother particle having an index larger than the daughter's index (in the case I saw, exactly 1 larger). Requesting the mother therefore returned a nullptr.

Was it a multi-threaded job? Let me remind you that Hydjet is not thread-safe. Meanwhile, I ran it without your corrections about 6 times (each run with about 6-10 events). Do you mean this debug line, or something else? Please note that some of the arrays are exported from Fortran, so they are indexed from 1 (not 0), and the indices are shifted accordingly. On the other hand, some particles may have a mother index equal to their own (it's not really correct, but I've seen it). Could you please show me the exact printout? We need to understand which process it was (which PDG IDs were involved). Or could you give me a recipe to reproduce the issue?

wouf · 2022-10-26T15:46:31Z

@wouf wrote:

The condition at line 397 should never be executed; if it is, we have to check the logic (in my tests it was not).

Actually this was exactly what caused the segmentation fault: the value of mid was greater than ihy (i.e. the mother's index was larger than the daughter's index). For that case, I now create the mother as needed and add it to the requisite containers. Because particles can now be built in another part of the code, the changes at lines 371 and 380 are needed to avoid overwriting a mother particle that was already created on demand.

Then, if it is really possible that mid > ihy, it has to come from this line. We need the values of index[isub] and hyjets.khj[2][ihy] (the mother index from Fortran) for that case; if so, it may be a bug.

Dr15Jones · 2022-10-26T16:20:56Z

What I had seen was when mid > ihy the value was actually mid == ihy + 1.

The exception that appears after my change was added shows that the call primary_particle.at(mid); uses a value of mid exactly equal to primary_particle.size(). If this happens for the very last particle (i.e. ihy == hyjets.nhj-1), then it is the same mid == ihy + 1 condition I had seen.
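
The boundary case in this comment can be checked in isolation. The sketch below is only an illustration: the helper names are hypothetical and a `std::vector<int>` stands in for primary_particle. It relies on the standard behaviour of `std::vector::at`, which throws `std::out_of_range` for any index that is not smaller than `size()`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true when reading element `mid` through the bounds-checked accessor
// throws, which is what happens for mid == container.size().
bool lookupThrows(const std::vector<int>& container, std::size_t mid) {
  try {
    (void)container.at(mid);
  } catch (const std::out_of_range&) {
    return true;
  }
  return false;
}

// For the last element, ihy == size - 1. A mother index of ihy + 1 is then
// equal to size, i.e. one past the end of the container.
bool lastParticleOverruns(const std::vector<int>& container) {
  assert(!container.empty());
  const std::size_t ihy = container.size() - 1;
  const std::size_t mid = ihy + 1;
  return lookupThrows(container, mid);
}
```

For a container of five elements, `lookupThrows(v, 4)` is false while `lookupThrows(v, 5)` and `lastParticleOverruns(v)` are both true, matching the `__n >= this->size()` message in the exception above.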

Dr15Jones · 2022-10-26T19:12:43Z

So I added some printouts that are triggered when mid > ihy. I also increased the number of events processed to be 10 in the job and now I consistently see problems. An example of the output is

 ihy 70533 mid 70534 with daughter of type 310
 mid 70534 mother of type 313
 ihy 70535 mid 70538 with daughter of type 321
 mid 70538 mother of type 311
 ihy 70536 mid 70538 with daughter of type -211
 ihy 70539 mid 70546 with daughter of type 310
 mid 70546 mother of type -211
 ihy 70541 mid 70550 with daughter of type -311
 mid 70550 mother of type 310
 ihy 70542 mid 70550 with daughter of type 111
 ihy 70543 mid 70551 with daughter of type 310
 mid 70551 mother of type 223
 ihy 70544 mid 70552 with daughter of type 22
 mid 70552 mother of type -211
 ihy 70545 mid 70552 with daughter of type 22
 ihy 70548 mid 70564 with daughter of type 310
 mid 70564 mother of type 211
 ihy 70550 mid 70568 with daughter of type 310
 mid 70568 mother of type -211
 ihy 70552 mid 70572 with daughter of type -211
 mid 70572 mother of type 22
 ihy 70553 mid 70572 with daughter of type 211
 ihy 70554 mid 70572 with daughter of type 111
 ihy 70555 mid 70575 with daughter of type 22
 mid 70575 mother of type 22
 ihy 70556 mid 70575 with daughter of type 22
 ihy 70558 mid 70584 with daughter of type -211
 mid 70584 mother of type 22
 ihy 70559 mid 70584 with daughter of type 111
 ihy 70560 mid 70586 with daughter of type 22
 mid 70586 mother of type 211
 ihy 70561 mid 70586 with daughter of type 22
 ihy 70564 mid 70596 with daughter of type 211
 mid 70596 mother of type -211
 ihy 70565 mid 70596 with daughter of type -211
 ihy 70571 mid 70610 with daughter of type 22
 mid 70610 mother of type 22
 ihy 70572 mid 70610 with daughter of type 22
 ihy 70574 mid 70616 with daughter of type 22
 mid 70616 mother of type -211
 ihy 70575 mid 70616 with daughter of type 22
 ihy 70577 mid 70622 with daughter of type -211
 mid 70622 mother of type -321
 ihy 70578 mid 70622 with daughter of type 111
 ihy 70579 mid 70624 with daughter of type 22
 mid 70624 mother of type -311
 ihy 70580 mid 70624 with daughter of type -11
 ihy 70581 mid 70624 with daughter of type 11
 ihy 70583 mid 70634 with daughter of type 22
 mid 70634 mother of type 310
 ihy 70584 mid 70634 with daughter of type 22
 ihy 70589 mid 70646 with daughter of type 22

where the daughter type is the PDG ID of the particle at ihy and the mother type is the PDG ID of the particle at mid.

Dr15Jones · 2022-10-26T19:36:20Z

I'm using release CMSSW_12_6_X_2022-10-26-1100

Dr15Jones · 2022-10-26T19:36:52Z

I added more printout about how the mid is calculated

 ihy 70533 mid 70534 with daughter of type 310
  hyjets.khj[2][ihy] 1900002 hjoffset 1900000 isub 38 index[isub] 70532
 mid 70534 mother of type 313
 ihy 70535 mid 70538 with daughter of type 321
  hyjets.khj[2][ihy] 1900004 hjoffset 1900000 isub 38 index[isub] 70534
 mid 70538 mother of type 311
 ihy 70536 mid 70538 with daughter of type -211
  hyjets.khj[2][ihy] 1900004 hjoffset 1900000 isub 38 index[isub] 70534
 ihy 70539 mid 70546 with daughter of type 310
  hyjets.khj[2][ihy] 1900008 hjoffset 1900000 isub 38 index[isub] 70538
 mid 70546 mother of type -211
 ihy 70541 mid 70550 with daughter of type -311
  hyjets.khj[2][ihy] 1900010 hjoffset 1900000 isub 38 index[isub] 70540
 mid 70550 mother of type 310
 ihy 70542 mid 70550 with daughter of type 111
  hyjets.khj[2][ihy] 1900010 hjoffset 1900000 isub 38 index[isub] 70540
 ihy 70543 mid 70551 with daughter of type 310
  hyjets.khj[2][ihy] 1900011 hjoffset 1900000 isub 38 index[isub] 70540
 mid 70551 mother of type 223
 ihy 70544 mid 70552 with daughter of type 22
  hyjets.khj[2][ihy] 1900012 hjoffset 1900000 isub 38 index[isub] 70540
 mid 70552 mother of type -211
 ihy 70545 mid 70552 with daughter of type 22
  hyjets.khj[2][ihy] 1900012 hjoffset 1900000 isub 38 index[isub] 70540
 ihy 70548 mid 70564 with daughter of type 310
  hyjets.khj[2][ihy] 1900017 hjoffset 1900000 isub 38 index[isub] 70547
 mid 70564 mother of type 211
 ihy 70550 mid 70568 with daughter of type 310
  hyjets.khj[2][ihy] 1900019 hjoffset 1900000 isub 38 index[isub] 70549
 mid 70568 mother of type -211
 ihy 70552 mid 70572 with daughter of type -211
  hyjets.khj[2][ihy] 1900021 hjoffset 1900000 isub 38 index[isub] 70551
 mid 70572 mother of type 22
 ihy 70553 mid 70572 with daughter of type 211
  hyjets.khj[2][ihy] 1900021 hjoffset 1900000 isub 38 index[isub] 70551
 ihy 70554 mid 70572 with daughter of type 111
  hyjets.khj[2][ihy] 1900021 hjoffset 1900000 isub 38 index[isub] 70551
 ihy 70555 mid 70575 with daughter of type 22
  hyjets.khj[2][ihy] 1900024 hjoffset 1900000 isub 38 index[isub] 70551
 mid 70575 mother of type 22
 ihy 70556 mid 70575 with daughter of type 22
  hyjets.khj[2][ihy] 1900024 hjoffset 1900000 isub 38 index[isub] 70551
 ihy 70558 mid 70584 with daughter of type -211
  hyjets.khj[2][ihy] 1900027 hjoffset 1900000 isub 38 index[isub] 70557
 mid 70584 mother of type 22
 ihy 70559 mid 70584 with daughter of type 111
  hyjets.khj[2][ihy] 1900027 hjoffset 1900000 isub 38 index[isub] 70557
 ihy 70560 mid 70586 with daughter of type 22
  hyjets.khj[2][ihy] 1900029 hjoffset 1900000 isub 38 index[isub] 70557
 mid 70586 mother of type 211
 ihy 70561 mid 70586 with daughter of type 22
  hyjets.khj[2][ihy] 1900029 hjoffset 1900000 isub 38 index[isub] 70557
 ihy 70564 mid 70596 with daughter of type 211
  hyjets.khj[2][ihy] 1900033 hjoffset 1900000 isub 38 index[isub] 70563
 mid 70596 mother of type -211
 ihy 70565 mid 70596 with daughter of type -211
  hyjets.khj[2][ihy] 1900033 hjoffset 1900000 isub 38 index[isub] 70563
 ihy 70571 mid 70610 with daughter of type 22
  hyjets.khj[2][ihy] 1900040 hjoffset 1900000 isub 38 index[isub] 70570
 mid 70610 mother of type 22
 ihy 70572 mid 70610 with daughter of type 22
  hyjets.khj[2][ihy] 1900040 hjoffset 1900000 isub 38 index[isub] 70570
 ihy 70574 mid 70616 with daughter of type 22
  hyjets.khj[2][ihy] 1900043 hjoffset 1900000 isub 38 index[isub] 70573
 mid 70616 mother of type -211
 ihy 70575 mid 70616 with daughter of type 22
  hyjets.khj[2][ihy] 1900043 hjoffset 1900000 isub 38 index[isub] 70573
 ihy 70577 mid 70622 with daughter of type -211
  hyjets.khj[2][ihy] 1900046 hjoffset 1900000 isub 38 index[isub] 70576
 mid 70622 mother of type -321
 ihy 70578 mid 70622 with daughter of type 111
  hyjets.khj[2][ihy] 1900046 hjoffset 1900000 isub 38 index[isub] 70576
 ihy 70579 mid 70624 with daughter of type 22
  hyjets.khj[2][ihy] 1900048 hjoffset 1900000 isub 38 index[isub] 70576
 mid 70624 mother of type -311
 ihy 70580 mid 70624 with daughter of type -11
  hyjets.khj[2][ihy] 1900048 hjoffset 1900000 isub 38 index[isub] 70576
 ihy 70581 mid 70624 with daughter of type 11
  hyjets.khj[2][ihy] 1900048 hjoffset 1900000 isub 38 index[isub] 70576
 ihy 70583 mid 70634 with daughter of type 22
  hyjets.khj[2][ihy] 1900052 hjoffset 1900000 isub 38 index[isub] 70582
 mid 70634 mother of type 310
 ihy 70584 mid 70634 with daughter of type 22
  hyjets.khj[2][ihy] 1900052 hjoffset 1900000 isub 38 index[isub] 70582
 ihy 70589 mid 70646 with daughter of type 22
  hyjets.khj[2][ihy] 1900058 hjoffset 1900000 isub 38 index[isub] 70588
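
The printed numbers are consistent with a simple relation between the Fortran mother code and mid. The sketch below is inferred from the printout only and is not copied from HydjetHadronizer.cc: the stride of 50000 is an assumption (the printed hjoffset 1900000 equals 38 × 50000), and `decodeMother` and its parameter names are hypothetical.

```cpp
#include <cassert>

// Assumed number of index slots reserved per sub-event (38 * 50000 == 1900000).
constexpr int kSubEventStride = 50000;

struct MotherLookup {
  int isub;      // sub-event number decoded from the Fortran mother code
  int hjoffset;  // isub * kSubEventStride
  int mid;       // mother position in the C++ particle list
};

// `khj` plays the role of hyjets.khj[2][ihy]; `subEventStart` plays the role
// of the printed index[isub] and is passed in rather than looked up.
MotherLookup decodeMother(int khj, int subEventStart) {
  assert(khj >= 0);
  const int isub = khj / kSubEventStride;
  const int hjoffset = isub * kSubEventStride;
  return {isub, hjoffset, khj - hjoffset + subEventStart};
}
```

For the first entry of the printout, `decodeMother(1900002, 70532)` gives isub 38, hjoffset 1900000 and mid 70534; for the last entry, `decodeMother(1900058, 70588)` gives mid 70646.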

Dr15Jones · 2022-10-26T20:21:05Z

So I wondered why the value of index[isub] kept growing even though isub in the printout remained constant. I checked and found that, when the problem occurs, the value of isub bounces back and forth between two values (which differ by 1), and the weird mother ID seems to be related to the larger one.

 ihy 70533 mid 70534 with daughter of type 310
  hyjets.khj[2][ihy] 1900002 hjoffset 1900000 isub 38 index[isub] 70532 isub_l 38
 mid 70534 mother of type 313
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70535 mid 70538 with daughter of type 321
  hyjets.khj[2][ihy] 1900004 hjoffset 1900000 isub 38 index[isub] 70534 isub_l 38
 mid 70538 mother of type 311
 ihy 70536 mid 70538 with daughter of type -211
  hyjets.khj[2][ihy] 1900004 hjoffset 1900000 isub 38 index[isub] 70534 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70539 mid 70546 with daughter of type 310
  hyjets.khj[2][ihy] 1900008 hjoffset 1900000 isub 38 index[isub] 70538 isub_l 38
 mid 70546 mother of type -211
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70541 mid 70550 with daughter of type -311
  hyjets.khj[2][ihy] 1900010 hjoffset 1900000 isub 38 index[isub] 70540 isub_l 38
 mid 70550 mother of type 310
 ihy 70542 mid 70550 with daughter of type 111
  hyjets.khj[2][ihy] 1900010 hjoffset 1900000 isub 38 index[isub] 70540 isub_l 38
 ihy 70543 mid 70551 with daughter of type 310
  hyjets.khj[2][ihy] 1900011 hjoffset 1900000 isub 38 index[isub] 70540 isub_l 38
 mid 70551 mother of type 223
 ihy 70544 mid 70552 with daughter of type 22
  hyjets.khj[2][ihy] 1900012 hjoffset 1900000 isub 38 index[isub] 70540 isub_l 38
 mid 70552 mother of type -211
 ihy 70545 mid 70552 with daughter of type 22
  hyjets.khj[2][ihy] 1900012 hjoffset 1900000 isub 38 index[isub] 70540 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70548 mid 70564 with daughter of type 310
  hyjets.khj[2][ihy] 1900017 hjoffset 1900000 isub 38 index[isub] 70547 isub_l 38
 mid 70564 mother of type 211
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70550 mid 70568 with daughter of type 310
  hyjets.khj[2][ihy] 1900019 hjoffset 1900000 isub 38 index[isub] 70549 isub_l 38
 mid 70568 mother of type -211
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70552 mid 70572 with daughter of type -211
  hyjets.khj[2][ihy] 1900021 hjoffset 1900000 isub 38 index[isub] 70551 isub_l 38
 mid 70572 mother of type 22
 ihy 70553 mid 70572 with daughter of type 211
  hyjets.khj[2][ihy] 1900021 hjoffset 1900000 isub 38 index[isub] 70551 isub_l 38
 ihy 70554 mid 70572 with daughter of type 111
  hyjets.khj[2][ihy] 1900021 hjoffset 1900000 isub 38 index[isub] 70551 isub_l 38
 ihy 70555 mid 70575 with daughter of type 22
  hyjets.khj[2][ihy] 1900024 hjoffset 1900000 isub 38 index[isub] 70551 isub_l 38
 mid 70575 mother of type 22
 ihy 70556 mid 70575 with daughter of type 22
  hyjets.khj[2][ihy] 1900024 hjoffset 1900000 isub 38 index[isub] 70551 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70558 mid 70584 with daughter of type -211
  hyjets.khj[2][ihy] 1900027 hjoffset 1900000 isub 38 index[isub] 70557 isub_l 38
 mid 70584 mother of type 22
 ihy 70559 mid 70584 with daughter of type 111
  hyjets.khj[2][ihy] 1900027 hjoffset 1900000 isub 38 index[isub] 70557 isub_l 38
 ihy 70560 mid 70586 with daughter of type 22
  hyjets.khj[2][ihy] 1900029 hjoffset 1900000 isub 38 index[isub] 70557 isub_l 38
 mid 70586 mother of type 211
 ihy 70561 mid 70586 with daughter of type 22
  hyjets.khj[2][ihy] 1900029 hjoffset 1900000 isub 38 index[isub] 70557 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70564 mid 70596 with daughter of type 211
  hyjets.khj[2][ihy] 1900033 hjoffset 1900000 isub 38 index[isub] 70563 isub_l 38
 mid 70596 mother of type -211
 ihy 70565 mid 70596 with daughter of type -211
  hyjets.khj[2][ihy] 1900033 hjoffset 1900000 isub 38 index[isub] 70563 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70571 mid 70610 with daughter of type 22
  hyjets.khj[2][ihy] 1900040 hjoffset 1900000 isub 38 index[isub] 70570 isub_l 38
 mid 70610 mother of type 22
 ihy 70572 mid 70610 with daughter of type 22
  hyjets.khj[2][ihy] 1900040 hjoffset 1900000 isub 38 index[isub] 70570 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70574 mid 70616 with daughter of type 22
  hyjets.khj[2][ihy] 1900043 hjoffset 1900000 isub 38 index[isub] 70573 isub_l 38
 mid 70616 mother of type -211
 ihy 70575 mid 70616 with daughter of type 22
  hyjets.khj[2][ihy] 1900043 hjoffset 1900000 isub 38 index[isub] 70573 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70577 mid 70622 with daughter of type -211
  hyjets.khj[2][ihy] 1900046 hjoffset 1900000 isub 38 index[isub] 70576 isub_l 38
 mid 70622 mother of type -321
 ihy 70578 mid 70622 with daughter of type 111
  hyjets.khj[2][ihy] 1900046 hjoffset 1900000 isub 38 index[isub] 70576 isub_l 38
 ihy 70579 mid 70624 with daughter of type 22
  hyjets.khj[2][ihy] 1900048 hjoffset 1900000 isub 38 index[isub] 70576 isub_l 38
 mid 70624 mother of type -311
 ihy 70580 mid 70624 with daughter of type -11
  hyjets.khj[2][ihy] 1900048 hjoffset 1900000 isub 38 index[isub] 70576 isub_l 38
 ihy 70581 mid 70624 with daughter of type 11
  hyjets.khj[2][ihy] 1900048 hjoffset 1900000 isub 38 index[isub] 70576 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70583 mid 70634 with daughter of type 22
  hyjets.khj[2][ihy] 1900052 hjoffset 1900000 isub 38 index[isub] 70582 isub_l 38
 mid 70634 mother of type 310
 ihy 70584 mid 70634 with daughter of type 22
  hyjets.khj[2][ihy] 1900052 hjoffset 1900000 isub 38 index[isub] 70582 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70589 mid 70646 with daughter of type 22
  hyjets.khj[2][ihy] 1900058 hjoffset 1900000 isub 38 index[isub] 70588 isub_l 38

wouf · 2022-10-26T20:29:29Z

Thank you! But I still can't reproduce it. What I'm doing is:

ssh lxplus8
scram -a el8_amd64_gcc10 project CMSSW_12_6_X_2022-10-26-1100
git cms-addpkg GeneratorInterface/HydjetInterface
moving your changes in and compiling (even adding the condition if (mid > ihy) assert(false);)
cmsDriver.py Hydjet_Quenched_MinBias_5020GeV_cfi -s GEN,SIM -n 10 --conditions auto:phase1_2022_realistic_hi --beamspot Realistic25ns13p6TeVEarly2022Collision --datatier GEN-SIM --eventcontent RAWSIM --era Run3_pp_on_PbPb --geometry DB:Extended --relval 2000,1
But neither the issue nor a segfault was found.

Dr15Jones · 2022-10-26T20:40:53Z

It looks like the version of the command I'm basing mine on is

cmsDriver.py Hydjet_Quenched_MinBias_5020GeV_cfi -s GEN,SIM -n 1 --conditions auto:phase1_2022_realistic_hi --beamspot Realistic25ns13p6TeVEarly2022Collision --datatier GEN-SIM --eventcontent RAWSIM --era Run3_pp_on_PbPb --geometry DB:Extended --relval 2000,1 --fileout file:step1.root --nThreads 4

Dr15Jones · 2022-10-26T20:45:56Z

So I just tried with 1 thread and no problems seen! So now to see what is happening.

wouf · 2022-10-26T20:58:32Z

Hydjet is not thread-safe: please see this discussion. Could that be the reason?

Dr15Jones · 2022-10-26T21:14:11Z

I double checked and see that the Framework only runs the hydjet module on one thread at a time.

Dr15Jones · 2022-10-26T21:31:34Z

It is not a threading issue. I ran it single threaded and set the job to run 100 events and it happened for 2 different events.

perrotta · 2022-10-27T05:01:18Z

@Dr15Jones @wouf I've started the build of CMSSW_12_6_0_pre4 even if this issue is not yet solved, see also #39865

As such, a few HI workflows will fail in the relvals. If a solution is found shortly, we can still decide to stop that build and start a new one that includes the fix.

wouf · 2022-10-27T06:59:33Z

@perrotta @Dr15Jones, I have reproduced the issue. It happens in events with an ultra-high multiplicity per sub-event. Unfortunately this is due to an offset overflow, so the Hydjet core needs to be updated (it looks like 50 000 is not enough for such events).

 <--- 85480
 ---85480---> 85481
85481 MULTin ev.:89598 SubEv.#61 Part #85482, PDG: 311 (st. 2) mother=35485 (3050000, 3050000, 35484), childs (85483-85483), vtx (0,0,0)
 ---35486---> 85482
85482 MULTin ev.:89598 SubEv.#61 Part #85483, PDG: 310 (st. 1) mother=85482 (3099997, 3050000, 35484), childs (85481-85481), vtx (0,0,0)
 <--- 85482
 ---85482---> 85483
85483 MULTin ev.:89598 SubEv.#61 Part #85484, PDG: 2112 (st. 1) mother=35485 (3050000, 3050000, 35484), childs (35485-35485), vtx (0,0,0)
 ---35486---> 85484
85484 MULTin ev.:89598 SubEv.#61 Part #85485, PDG: -2112 (st. 1) mother=35485 (3050000, 3050000, 35484), childs (35485-35485), vtx (0,0,0)
 ---35486---> 85485
85485 MULTin ev.:89598 SubEv.#61 Part #85486, PDG: 111 (st. 2) mother=35485 (3050000, 3050000, 35484), childs (85487-85488), vtx (0,0,0)
 ---35486---> 85486
85486 MULTin ev.:89598 SubEv.#62 Part #85487, PDG: 22 (st. 1) mother=85487 (3100001, 3100000, 85485), childs (85486-85486), vtx (1.06919e-05,6.168e-05,0.000468078)
 <--- 85487
 ---85487---> 85487
85487 MULTin ev.:89598 SubEv.#62 Part #85488, PDG: 22 (st. 1) mother=85487 (3100001, 3100000, 85485), childs (85486-85486), vtx (1.06919e-05,6.168e-05,0.000468078)
 ---85487---> 85488
85488 MULTin ev.:89598 SubEv.#61 Part #85489, PDG: -311 (st. 2) mother=85488 (3050000, 3050000, 85487), childs (135493-135493), vtx (0,0,0)
 <--- 85488
 ---85488---> 85489
85489 MULTin ev.:89598 SubEv.#62 Part #85490, PDG: 310 (st. 1) mother=85493 (3100004, 3100000, 85488), childs (85492-85492), vtx (0,0,0)

Here Part #85487 and Part #85488 are originally from sub-event 61, but due to an overflow of the offset buffer the sub-event number in HydjetInterface has increased. (The first number in the brackets after the mother index is hyjets.khj[2][ihy], the mother index from the Fortran part; it should be less than 3100000 for sub-event 61.) The same happened in @Dr15Jones's printout.
So we have to roll back @Dr15Jones's solution, because such events are useless anyway. I will check the possibility of increasing the offset buffer in the Hydjet core.
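
The overflow can be illustrated with a small sketch. It assumes the mother code is formed as sub-event number times 50000 plus an index within the sub-event, which is consistent with the numbers in the printout but is not taken from the Hydjet source. The function names are hypothetical.

```cpp
#include <cassert>

// Assumed number of index slots reserved per sub-event.
constexpr int kStride = 50000;

// Hypothetical encoder: packs (sub-event, local index) into a single integer.
int encodeMother(int subEvent, int localIndex) {
  assert(subEvent >= 0 && localIndex >= 0);
  return subEvent * kStride + localIndex;
}

// Integer division recovers the sub-event only while localIndex < kStride.
int decodeSubEvent(int code) { return code / kStride; }

// True when the round trip reports a different sub-event than the one encoded.
bool overflowed(int subEvent, int localIndex) {
  return decodeSubEvent(encodeMother(subEvent, localIndex)) != subEvent;
}
```

Under this assumption, the code 3099997 decodes to sub-event 61 as expected, whereas a local index of 50001 in sub-event 61 encodes to 3100001, which decodes to sub-event 62. That is the jump seen between SubEv.#61 and SubEv.#62 in the printout.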

Dr15Jones · 2022-10-27T13:23:01Z

@wouf nice catch! I actually woke up this morning wondering if that was the problem. I'm making a new PR with this change reverted and it now throws an exception in the case where mid > ihy .
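
A guard of the kind mentioned here could look like the following sketch. It is not the code of the follow-up PR: `std::runtime_error` stands in for whatever exception type that PR actually uses, and the function names and message text are hypothetical.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Throws when the mother index is larger than the daughter index, instead of
// trying to build the mother on demand. mid == ihy is deliberately allowed,
// because only the mid > ihy case is named in the comment.
void checkMotherIndex(int ihy, int mid) {
  if (mid > ihy) {
    throw std::runtime_error("mother index " + std::to_string(mid) +
                             " is larger than daughter index " + std::to_string(ihy));
  }
}

// Convenience wrapper that reports the outcome as a bool.
bool motherIndexIsValid(int ihy, int mid) {
  try {
    checkMotherIndex(ihy, mid);
  } catch (const std::runtime_error&) {
    return false;
  }
  return true;
}
```

With the first pair from the earlier printout, `motherIndexIsValid(70533, 70534)` is false, while a mother that precedes her daughter, such as `motherIndexIsValid(70534, 70533)`, passes.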

Deal with late mothers in HydjetHadronizer

9162d5f

There were cases where parent particles were later in the list than their children. This led to segmentation faults. This change creates those parents when needed and avoids creating them a second time later.

cmsbuild added this to the CMSSW_12_6_X milestone Oct 19, 2022

cmsbuild added code-checks-pending generators-pending orp-pending pending-signatures tests-pending labels Oct 19, 2022

cmsbuild added tests-started and removed tests-pending labels Oct 19, 2022

cmsbuild added code-checks-approved and removed code-checks-pending labels Oct 19, 2022

cmsbuild added tests-approved and removed tests-started labels Oct 19, 2022

Dr15Jones mentioned this pull request Oct 25, 2022

ASAN/segmentation faults in gen::HydjetHadronizer::get_particles() #39350

Closed

cmsbuild added the urgent label Oct 25, 2022

cmsbuild added fully-signed generators-approved and removed generators-pending pending-signatures labels Oct 25, 2022

cmsbuild added orp-approved and removed orp-pending labels Oct 25, 2022

cmsbuild merged commit 6d2b369 into cms-sw:master Oct 25, 2022

Dr15Jones deleted the fixHydjetHadronizer branch October 26, 2022 14:46

perrotta mentioned this pull request Oct 27, 2022

range_check error in HydjetGeneratorFilter #39865

Closed

Dr15Jones mentioned this pull request Oct 27, 2022

Catch case where parent index is bad in Hydjet #39874

Merged
