Deal with late parents in HydjetHadronizer #39784

Merged
merged 1 commit into from
Oct 25, 2022

Conversation

Dr15Jones
Contributor

PR description:

There were cases where parent particles were later in the list than their children. This led to segmentation faults. This change creates those parents when needed and avoids creating them again later.

PR validation:

I was able to catch the segmentation fault in the debugger and found that the ID of the parent was larger than that of the child. This meant the parent had not yet been made, which led to the crash. After making this change, I ran the job 4 times and saw no further crashes. Previously the same job would crash more than 50% of the time.
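The failure mode described above can be sketched as follows. This is a minimal illustration, not the HydjetHadronizer code: `Particle`, `findMother` and `motherIsReady` are hypothetical names, and the sketch assumes particles are built one slot at a time in index order, so a mother with a larger index than its daughter has an empty slot when the daughter asks for it.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for HepMC::GenParticle (hypothetical, for illustration only).
struct Particle {
  int pdgId;
};

// Returns the mother stored at index mid, or nullptr if that slot has not
// been filled yet (or lies beyond the table).
const Particle* findMother(const std::vector<const Particle*>& built, std::size_t mid) {
  return mid < built.size() ? built[mid] : nullptr;
}

// True if the mother can safely be dereferenced while processing entry ihy.
// Slots are assumed to be filled in increasing index order, so only
// mothers with mid < ihy exist at that point.
bool motherIsReady(const std::vector<const Particle*>& built, std::size_t ihy, std::size_t mid) {
  return mid < ihy && findMother(built, mid) != nullptr;
}
```

Dereferencing the result of `findMother` without such a check is the kind of access that would produce the crash reported here.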

@Dr15Jones
Contributor Author

please test

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39784/32650

  • This PR adds an extra 16KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @Dr15Jones (Chris Jones) for master.

It involves the following packages:

  • GeneratorInterface/HydjetInterface (generators)

@SiewYan, @mkirsano, @Saptaparna, @alberto-sanchez, @menglu21, @GurpreetSinghChahal can you please review it and eventually sign? Thanks.
@alberto-sanchez, @cbaus, @mkirsano this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@Dr15Jones
Contributor Author

please test workflow 159.03

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-430f76/28380/summary.html
COMMIT: 9162d5f
CMSSW: CMSSW_12_6_X_2022-10-19-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/39784/28380/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-430f76/159.03_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 12 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3391158
  • DQMHistoTests: Total failures: 12
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3391124
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 206 log files, 48 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@Dr15Jones
Contributor Author

ping
This fixes a recurring crash happening in the IBs.

@perrotta
Contributor

ping @cms-sw/generators-l2

@perrotta
Contributor

urgent

@menglu21
Contributor

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

+1

@cmsbuild cmsbuild merged commit 6d2b369 into cms-sw:master Oct 25, 2022
@wouf
Contributor

wouf commented Oct 26, 2022

Dear @menglu21, @Dr15Jones, all, unfortunately I missed this PR (could you please notify me, or other HI GEN contacts, in the future in case of changes in HI generators?).
I can't reproduce the issue you are describing in CMSSW_12_6_X_2022-10-25-2300, either with or without @Dr15Jones's corrections. I used this config. Could you please provide more information about it?

The changes in lines 371 and 387 look unnecessary, because the value of the iterator ihy is not repeated before the primary_particle and particle vectors are re-initialised at line 335.

The condition at line 397 should never be executed; otherwise we have to check the logic (in my tests it was not).

@Dr15Jones
Contributor Author

I can't reproduce the issue you are describing in CMSSW_12_6_X_2022-10-25-2300, either with or without @Dr15Jones's corrections. I used this config. Could you please provide more information about it?

So you can see the failure in yesterday evening's IB (just before this PR was added to the IBs)
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc10/CMSSW_12_6_X_2022-10-25-1100/pyRelValMatrixLogs/run/159.03_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD/step1_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD.log#/

Problems were also seen with the address sanitizer (ASAN) builds, as reported in issue #39350.

This segmentation fault does not happen every time the workflow 159.03 runs. When I was debugging the problem, it happened about half the time. I was fortunate that one of those times happened while running the job with the debugger after I'd recompiled the code to contain the debugging symbols. This let me see that the problem was caused by a mother particle having an index larger (in the case I saw, just +1 larger) than the daughter index. Therefore requesting the mother returned a nullptr.

@Dr15Jones
Contributor Author

@wouf wrote:

The condition at line 397 should never be executed; otherwise we have to check the logic (in my tests it was not).

Actually this was exactly what caused the segmentation fault: the value of mid was greater than ihy (i.e. the mother's index was larger than the daughter's index). For that case, I now create the mother as needed and add it to the requisite containers. Since we can now build particles in another part of the code, the changes to lines 371 and 380 were made so that we avoid overwriting a mother particle that was already created as needed.
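The "create as needed, then do not overwrite" idea can be sketched like this. It is only an illustration under stated assumptions: `Particle` and `ParticleTable` are hypothetical stand-ins, and the real code works on HepMC::GenParticle pointers and several containers rather than a single table.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for HepMC::GenParticle (hypothetical, for illustration only).
struct Particle {
  int pdgId;
};

// Hypothetical table that builds each particle at most once.
class ParticleTable {
public:
  explicit ParticleTable(std::vector<int> pdgIds)
      : pdgIds_(std::move(pdgIds)), slots_(pdgIds_.size()) {}

  // Returns the particle at index i, building it first if no earlier call
  // (for example a daughter asking for a later mother) already did.
  // at() throws std::out_of_range if i lies beyond the table.
  Particle* getOrCreate(std::size_t i) {
    std::unique_ptr<Particle>& slot = slots_.at(i);
    if (!slot) {
      slot = std::make_unique<Particle>(Particle{pdgIds_[i]});
    }
    return slot.get();
  }

private:
  std::vector<int> pdgIds_;
  std::vector<std::unique_ptr<Particle>> slots_;
};
```

Calling `getOrCreate` for the mother first and for the same index again in the main loop returns the same object, which is the property the "avoid overwriting" changes are meant to guarantee.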

@Dr15Jones
Contributor Author

It appears that this fix is either incomplete or uncovered a different problem as we are now getting an exception in the workflow
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_12_6_X_2022-10-26-1100/pyRelValMatrixLogs/run/159.03_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD/step1_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD.log#/269-269

where the message is

----- Begin Fatal Exception 26-Oct-2022 13:45:02 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 3
   [1] Running path 'simulation_step'
   [2] Calling method for module HydjetGeneratorFilter/'generator'
Exception Message:
A std::exception was thrown.
vector::_M_range_check: __n (which is 86315) >= this->size() (which is 86315)
----- End Fatal Exception -------------------------------------------------

@Dr15Jones Dr15Jones deleted the fixHydjetHadronizer branch October 26, 2022 14:46
@Dr15Jones
Contributor Author

Dr15Jones commented Oct 26, 2022

Looks like I missed the case where the container isn't already large enough; it looks to me like the new exception comes from this line:

HepMC::GenParticle* mother = primary_particle.at(mid);
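For reference, a small sketch of why the message reads "__n (which is 86315) >= this->size() (which is 86315)": `std::vector::at` is bounds-checked and throws `std::out_of_range` for any index equal to or larger than `size()`. The helper name `atThrows` is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if v.at(i) throws std::out_of_range, i.e. if i >= v.size().
bool atThrows(const std::vector<int>& v, std::size_t i) {
  try {
    (void)v.at(i);
    return false;
  } catch (const std::out_of_range&) {
    return true;
  }
}
```

With a vector of size N, index N - 1 is the last valid one, so a mother index equal to the container size is always rejected.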

@wouf
Contributor

wouf commented Oct 26, 2022

I can't reproduce the issue you are describing in CMSSW_12_6_X_2022-10-25-2300, either with or without @Dr15Jones's corrections. I used this config. Could you please provide more information about it?

So you can see the failure in yesterday evening's IB (just before this PR was added to the IBs) https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc10/CMSSW_12_6_X_2022-10-25-1100/pyRelValMatrixLogs/run/159.03_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD/step1_HydjetQ_MinBias_5020GeV_2021_ppReco+HydjetQ_MinBias_5020GeV_2021_ppReco+DIGIHI2021PPRECO+RAWPRIMESIMHI18+RECOHI2022PROD+MINIHI2022PROD.log#/

Problems were also seen with the address sanitizer (ASAN) builds, as reported in issue #39350.

This segmentation fault does not happen every time the workflow 159.03 runs. When I was debugging the problem, it happened about half the time. I was fortunate that one of those times happened while running the job with the debugger after I'd recompiled the code to contain the debugging symbols. This let me see that the problem was caused by a mother particle having an index larger (in the case I saw, just +1 larger) than the daughter index. Therefore requesting the mother returned a nullptr.

Was it a multi-threaded job? Let me remind you that Hydjet is not thread-safe. Meanwhile I tried to run it without your corrections about 6 times (each run contained about 6-10 events). Do you mean this debug line, or something else? Please note that some of the arrays are exported from Fortran, so they are indexed from 1 (not 0), and the indices are shifted too. On the other hand, some particles may have a mother index equal to their own (it's not really correct, but I've seen it). Could you please show me the exact printout? We need to understand which process it was (which PDG IDs were involved). Or could you give me a recipe to reproduce the issue?

@wouf
Contributor

wouf commented Oct 26, 2022

@wouf wrote:

The condition at line 397 should never be executed; otherwise we have to check the logic (in my tests it was not).

Actually this was exactly what caused the segmentation fault: the value of mid was greater than ihy (i.e. the mother's index was larger than the daughter's index). For that case, I now create the mother as needed and add it to the requisite containers. Since we can now build particles in another part of the code, the changes to lines 371 and 380 were made so that we avoid overwriting a mother particle that was already created as needed.

Then, if it is really possible that mid > ihy, it has to come from this line. We need the values of index[isub] and hyjets.khj[2][ihy] (the mother index from Fortran) for that case; it may be a bug.

@Dr15Jones
Contributor Author

What I had seen was that when mid > ihy, the value was actually mid == ihy + 1.

The exception that appeared after my change was added shows that the call primary_particle.at(mid); is using a value of mid exactly equal to primary_particle.size(). If this happens for the very last particle (i.e. ihy == hyjets.nhj-1), then it matches the condition mid == ihy + 1 I had seen.

@Dr15Jones
Contributor Author

So I added some printouts that are triggered when mid > ihy. I also increased the number of events processed to be 10 in the job and now I consistently see problems. An example of the output is

 ihy 70533 mid 70534 with daughter of type 310
 mid 70534 mother of type 313
 ihy 70535 mid 70538 with daughter of type 321
 mid 70538 mother of type 311
 ihy 70536 mid 70538 with daughter of type -211
 ihy 70539 mid 70546 with daughter of type 310
 mid 70546 mother of type -211
 ihy 70541 mid 70550 with daughter of type -311
 mid 70550 mother of type 310
 ihy 70542 mid 70550 with daughter of type 111
 ihy 70543 mid 70551 with daughter of type 310
 mid 70551 mother of type 223
 ihy 70544 mid 70552 with daughter of type 22
 mid 70552 mother of type -211
 ihy 70545 mid 70552 with daughter of type 22
 ihy 70548 mid 70564 with daughter of type 310
 mid 70564 mother of type 211
 ihy 70550 mid 70568 with daughter of type 310
 mid 70568 mother of type -211
 ihy 70552 mid 70572 with daughter of type -211
 mid 70572 mother of type 22
 ihy 70553 mid 70572 with daughter of type 211
 ihy 70554 mid 70572 with daughter of type 111
 ihy 70555 mid 70575 with daughter of type 22
 mid 70575 mother of type 22
 ihy 70556 mid 70575 with daughter of type 22
 ihy 70558 mid 70584 with daughter of type -211
 mid 70584 mother of type 22
 ihy 70559 mid 70584 with daughter of type 111
 ihy 70560 mid 70586 with daughter of type 22
 mid 70586 mother of type 211
 ihy 70561 mid 70586 with daughter of type 22
 ihy 70564 mid 70596 with daughter of type 211
 mid 70596 mother of type -211
 ihy 70565 mid 70596 with daughter of type -211
 ihy 70571 mid 70610 with daughter of type 22
 mid 70610 mother of type 22
 ihy 70572 mid 70610 with daughter of type 22
 ihy 70574 mid 70616 with daughter of type 22
 mid 70616 mother of type -211
 ihy 70575 mid 70616 with daughter of type 22
 ihy 70577 mid 70622 with daughter of type -211
 mid 70622 mother of type -321
 ihy 70578 mid 70622 with daughter of type 111
 ihy 70579 mid 70624 with daughter of type 22
 mid 70624 mother of type -311
 ihy 70580 mid 70624 with daughter of type -11
 ihy 70581 mid 70624 with daughter of type 11
 ihy 70583 mid 70634 with daughter of type 22
 mid 70634 mother of type 310
 ihy 70584 mid 70634 with daughter of type 22
 ihy 70589 mid 70646 with daughter of type 22

where daughter type is the pdgID value for the particle at ihy and mother type is the pdgID of mid.

@Dr15Jones
Contributor Author

I'm using release CMSSW_12_6_X_2022-10-26-1100

@Dr15Jones
Contributor Author

I added more printout about how the mid is calculated

 ihy 70533 mid 70534 with daughter of type 310
  hyjets.khj[2][ihy] 1900002 hjoffset 1900000 isub 38 index[isub] 70532
 mid 70534 mother of type 313
 ihy 70535 mid 70538 with daughter of type 321
  hyjets.khj[2][ihy] 1900004 hjoffset 1900000 isub 38 index[isub] 70534
 mid 70538 mother of type 311
 ihy 70536 mid 70538 with daughter of type -211
  hyjets.khj[2][ihy] 1900004 hjoffset 1900000 isub 38 index[isub] 70534
 ihy 70539 mid 70546 with daughter of type 310
  hyjets.khj[2][ihy] 1900008 hjoffset 1900000 isub 38 index[isub] 70538
 mid 70546 mother of type -211
 ihy 70541 mid 70550 with daughter of type -311
  hyjets.khj[2][ihy] 1900010 hjoffset 1900000 isub 38 index[isub] 70540
 mid 70550 mother of type 310
 ihy 70542 mid 70550 with daughter of type 111
  hyjets.khj[2][ihy] 1900010 hjoffset 1900000 isub 38 index[isub] 70540
 ihy 70543 mid 70551 with daughter of type 310
  hyjets.khj[2][ihy] 1900011 hjoffset 1900000 isub 38 index[isub] 70540
 mid 70551 mother of type 223
 ihy 70544 mid 70552 with daughter of type 22
  hyjets.khj[2][ihy] 1900012 hjoffset 1900000 isub 38 index[isub] 70540
 mid 70552 mother of type -211
 ihy 70545 mid 70552 with daughter of type 22
  hyjets.khj[2][ihy] 1900012 hjoffset 1900000 isub 38 index[isub] 70540
 ihy 70548 mid 70564 with daughter of type 310
  hyjets.khj[2][ihy] 1900017 hjoffset 1900000 isub 38 index[isub] 70547
 mid 70564 mother of type 211
 ihy 70550 mid 70568 with daughter of type 310
  hyjets.khj[2][ihy] 1900019 hjoffset 1900000 isub 38 index[isub] 70549
 mid 70568 mother of type -211
 ihy 70552 mid 70572 with daughter of type -211
  hyjets.khj[2][ihy] 1900021 hjoffset 1900000 isub 38 index[isub] 70551
 mid 70572 mother of type 22
 ihy 70553 mid 70572 with daughter of type 211
  hyjets.khj[2][ihy] 1900021 hjoffset 1900000 isub 38 index[isub] 70551
 ihy 70554 mid 70572 with daughter of type 111
  hyjets.khj[2][ihy] 1900021 hjoffset 1900000 isub 38 index[isub] 70551
 ihy 70555 mid 70575 with daughter of type 22
  hyjets.khj[2][ihy] 1900024 hjoffset 1900000 isub 38 index[isub] 70551
 mid 70575 mother of type 22
 ihy 70556 mid 70575 with daughter of type 22
  hyjets.khj[2][ihy] 1900024 hjoffset 1900000 isub 38 index[isub] 70551
 ihy 70558 mid 70584 with daughter of type -211
  hyjets.khj[2][ihy] 1900027 hjoffset 1900000 isub 38 index[isub] 70557
 mid 70584 mother of type 22
 ihy 70559 mid 70584 with daughter of type 111
  hyjets.khj[2][ihy] 1900027 hjoffset 1900000 isub 38 index[isub] 70557
 ihy 70560 mid 70586 with daughter of type 22
  hyjets.khj[2][ihy] 1900029 hjoffset 1900000 isub 38 index[isub] 70557
 mid 70586 mother of type 211
 ihy 70561 mid 70586 with daughter of type 22
  hyjets.khj[2][ihy] 1900029 hjoffset 1900000 isub 38 index[isub] 70557
 ihy 70564 mid 70596 with daughter of type 211
  hyjets.khj[2][ihy] 1900033 hjoffset 1900000 isub 38 index[isub] 70563
 mid 70596 mother of type -211
 ihy 70565 mid 70596 with daughter of type -211
  hyjets.khj[2][ihy] 1900033 hjoffset 1900000 isub 38 index[isub] 70563
 ihy 70571 mid 70610 with daughter of type 22
  hyjets.khj[2][ihy] 1900040 hjoffset 1900000 isub 38 index[isub] 70570
 mid 70610 mother of type 22
 ihy 70572 mid 70610 with daughter of type 22
  hyjets.khj[2][ihy] 1900040 hjoffset 1900000 isub 38 index[isub] 70570
 ihy 70574 mid 70616 with daughter of type 22
  hyjets.khj[2][ihy] 1900043 hjoffset 1900000 isub 38 index[isub] 70573
 mid 70616 mother of type -211
 ihy 70575 mid 70616 with daughter of type 22
  hyjets.khj[2][ihy] 1900043 hjoffset 1900000 isub 38 index[isub] 70573
 ihy 70577 mid 70622 with daughter of type -211
  hyjets.khj[2][ihy] 1900046 hjoffset 1900000 isub 38 index[isub] 70576
 mid 70622 mother of type -321
 ihy 70578 mid 70622 with daughter of type 111
  hyjets.khj[2][ihy] 1900046 hjoffset 1900000 isub 38 index[isub] 70576
 ihy 70579 mid 70624 with daughter of type 22
  hyjets.khj[2][ihy] 1900048 hjoffset 1900000 isub 38 index[isub] 70576
 mid 70624 mother of type -311
 ihy 70580 mid 70624 with daughter of type -11
  hyjets.khj[2][ihy] 1900048 hjoffset 1900000 isub 38 index[isub] 70576
 ihy 70581 mid 70624 with daughter of type 11
  hyjets.khj[2][ihy] 1900048 hjoffset 1900000 isub 38 index[isub] 70576
 ihy 70583 mid 70634 with daughter of type 22
  hyjets.khj[2][ihy] 1900052 hjoffset 1900000 isub 38 index[isub] 70582
 mid 70634 mother of type 310
 ihy 70584 mid 70634 with daughter of type 22
  hyjets.khj[2][ihy] 1900052 hjoffset 1900000 isub 38 index[isub] 70582
 ihy 70589 mid 70646 with daughter of type 22
  hyjets.khj[2][ihy] 1900058 hjoffset 1900000 isub 38 index[isub] 70588
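The printed values are consistent with a simple relation between the quantities. The sketch below is inferred from the numbers in the printout, not copied from HydjetHadronizer; `computeMid` is a hypothetical name, and the note that hjoffset equals isub times 50000 is likewise only an observation (38 * 50000 = 1900000).

```cpp
// Mother index as implied by the printout: the Fortran mother index minus
// the sub-event offset gives a local index, which is then shifted by the
// position where that sub-event starts in the flat particle list.
constexpr int computeMid(int khj, int hjoffset, int indexIsub) {
  return khj - hjoffset + indexIsub;
}

// Values taken from the printout above.
static_assert(computeMid(1900002, 1900000, 70532) == 70534, "ihy 70533");
static_assert(computeMid(1900004, 1900000, 70534) == 70538, "ihy 70535");
static_assert(computeMid(1900058, 1900000, 70588) == 70646, "ihy 70589");
```

Under this reading, mid can only exceed ihy if the local index or the sub-event start used in the sum does not belong to the sub-event the particle really came from.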

@Dr15Jones
Contributor Author

So I wondered why the value of index[isub] kept growing even though isub in the printout remained constant. I checked and found that when the problem occurs, the value of isub bounces back and forth between two values (which differ by 1), and the weird mother ID seems to be related to the larger of the two.

 ihy 70533 mid 70534 with daughter of type 310
  hyjets.khj[2][ihy] 1900002 hjoffset 1900000 isub 38 index[isub] 70532 isub_l 38
 mid 70534 mother of type 313
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70535 mid 70538 with daughter of type 321
  hyjets.khj[2][ihy] 1900004 hjoffset 1900000 isub 38 index[isub] 70534 isub_l 38
 mid 70538 mother of type 311
 ihy 70536 mid 70538 with daughter of type -211
  hyjets.khj[2][ihy] 1900004 hjoffset 1900000 isub 38 index[isub] 70534 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70539 mid 70546 with daughter of type 310
  hyjets.khj[2][ihy] 1900008 hjoffset 1900000 isub 38 index[isub] 70538 isub_l 38
 mid 70546 mother of type -211
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70541 mid 70550 with daughter of type -311
  hyjets.khj[2][ihy] 1900010 hjoffset 1900000 isub 38 index[isub] 70540 isub_l 38
 mid 70550 mother of type 310
 ihy 70542 mid 70550 with daughter of type 111
  hyjets.khj[2][ihy] 1900010 hjoffset 1900000 isub 38 index[isub] 70540 isub_l 38
 ihy 70543 mid 70551 with daughter of type 310
  hyjets.khj[2][ihy] 1900011 hjoffset 1900000 isub 38 index[isub] 70540 isub_l 38
 mid 70551 mother of type 223
 ihy 70544 mid 70552 with daughter of type 22
  hyjets.khj[2][ihy] 1900012 hjoffset 1900000 isub 38 index[isub] 70540 isub_l 38
 mid 70552 mother of type -211
 ihy 70545 mid 70552 with daughter of type 22
  hyjets.khj[2][ihy] 1900012 hjoffset 1900000 isub 38 index[isub] 70540 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70548 mid 70564 with daughter of type 310
  hyjets.khj[2][ihy] 1900017 hjoffset 1900000 isub 38 index[isub] 70547 isub_l 38
 mid 70564 mother of type 211
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70550 mid 70568 with daughter of type 310
  hyjets.khj[2][ihy] 1900019 hjoffset 1900000 isub 38 index[isub] 70549 isub_l 38
 mid 70568 mother of type -211
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70552 mid 70572 with daughter of type -211
  hyjets.khj[2][ihy] 1900021 hjoffset 1900000 isub 38 index[isub] 70551 isub_l 38
 mid 70572 mother of type 22
 ihy 70553 mid 70572 with daughter of type 211
  hyjets.khj[2][ihy] 1900021 hjoffset 1900000 isub 38 index[isub] 70551 isub_l 38
 ihy 70554 mid 70572 with daughter of type 111
  hyjets.khj[2][ihy] 1900021 hjoffset 1900000 isub 38 index[isub] 70551 isub_l 38
 ihy 70555 mid 70575 with daughter of type 22
  hyjets.khj[2][ihy] 1900024 hjoffset 1900000 isub 38 index[isub] 70551 isub_l 38
 mid 70575 mother of type 22
 ihy 70556 mid 70575 with daughter of type 22
  hyjets.khj[2][ihy] 1900024 hjoffset 1900000 isub 38 index[isub] 70551 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70558 mid 70584 with daughter of type -211
  hyjets.khj[2][ihy] 1900027 hjoffset 1900000 isub 38 index[isub] 70557 isub_l 38
 mid 70584 mother of type 22
 ihy 70559 mid 70584 with daughter of type 111
  hyjets.khj[2][ihy] 1900027 hjoffset 1900000 isub 38 index[isub] 70557 isub_l 38
 ihy 70560 mid 70586 with daughter of type 22
  hyjets.khj[2][ihy] 1900029 hjoffset 1900000 isub 38 index[isub] 70557 isub_l 38
 mid 70586 mother of type 211
 ihy 70561 mid 70586 with daughter of type 22
  hyjets.khj[2][ihy] 1900029 hjoffset 1900000 isub 38 index[isub] 70557 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70564 mid 70596 with daughter of type 211
  hyjets.khj[2][ihy] 1900033 hjoffset 1900000 isub 38 index[isub] 70563 isub_l 38
 mid 70596 mother of type -211
 ihy 70565 mid 70596 with daughter of type -211
  hyjets.khj[2][ihy] 1900033 hjoffset 1900000 isub 38 index[isub] 70563 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70571 mid 70610 with daughter of type 22
  hyjets.khj[2][ihy] 1900040 hjoffset 1900000 isub 38 index[isub] 70570 isub_l 38
 mid 70610 mother of type 22
 ihy 70572 mid 70610 with daughter of type 22
  hyjets.khj[2][ihy] 1900040 hjoffset 1900000 isub 38 index[isub] 70570 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70574 mid 70616 with daughter of type 22
  hyjets.khj[2][ihy] 1900043 hjoffset 1900000 isub 38 index[isub] 70573 isub_l 38
 mid 70616 mother of type -211
 ihy 70575 mid 70616 with daughter of type 22
  hyjets.khj[2][ihy] 1900043 hjoffset 1900000 isub 38 index[isub] 70573 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70577 mid 70622 with daughter of type -211
  hyjets.khj[2][ihy] 1900046 hjoffset 1900000 isub 38 index[isub] 70576 isub_l 38
 mid 70622 mother of type -321
 ihy 70578 mid 70622 with daughter of type 111
  hyjets.khj[2][ihy] 1900046 hjoffset 1900000 isub 38 index[isub] 70576 isub_l 38
 ihy 70579 mid 70624 with daughter of type 22
  hyjets.khj[2][ihy] 1900048 hjoffset 1900000 isub 38 index[isub] 70576 isub_l 38
 mid 70624 mother of type -311
 ihy 70580 mid 70624 with daughter of type -11
  hyjets.khj[2][ihy] 1900048 hjoffset 1900000 isub 38 index[isub] 70576 isub_l 38
 ihy 70581 mid 70624 with daughter of type 11
  hyjets.khj[2][ihy] 1900048 hjoffset 1900000 isub 38 index[isub] 70576 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70583 mid 70634 with daughter of type 22
  hyjets.khj[2][ihy] 1900052 hjoffset 1900000 isub 38 index[isub] 70582 isub_l 38
 mid 70634 mother of type 310
 ihy 70584 mid 70634 with daughter of type 22
  hyjets.khj[2][ihy] 1900052 hjoffset 1900000 isub 38 index[isub] 70582 isub_l 38
 back to isub 37 from isub_l 38
 back to isub 38 from isub_l 37
 ihy 70589 mid 70646 with daughter of type 22
  hyjets.khj[2][ihy] 1900058 hjoffset 1900000 isub 38 index[isub] 70588 isub_l 38

@wouf
Contributor

wouf commented Oct 26, 2022

Thank you! But I still can't reproduce it. What I'm doing is:

  1. ssh lxplus8
  2. scram -a el8_amd64_gcc10 project CMSSW_12_6_X_2022-10-26-1100
  3. git cms-addpkg GeneratorInterface/HydjetInterface
  4. applying your changes and compiling (even adding the check if (mid > ihy) assert(false);)
  5. cmsDriver.py Hydjet_Quenched_MinBias_5020GeV_cfi -s GEN,SIM -n 10 --conditions auto:phase1_2022_realistic_hi --beamspot Realistic25ns13p6TeVEarly2022Collision --datatier GEN-SIM --eventcontent RAWSIM --era Run3_pp_on_PbPb --geometry DB:Extended --relval 2000,1
    But neither the issue nor a segfault was seen.

@Dr15Jones
Contributor Author

It looks like the version of the command I'm basing my job on is

cmsDriver.py Hydjet_Quenched_MinBias_5020GeV_cfi -s GEN,SIM -n 1 --conditions auto:phase1_2022_realistic_hi --beamspot Realistic25ns13p6TeVEarly2022Collision --datatier GEN-SIM --eventcontent RAWSIM --era Run3_pp_on_PbPb --geometry DB:Extended --relval 2000,1 --fileout file:step1.root --nThreads 4

@Dr15Jones
Contributor Author

So I just tried with 1 thread and saw no problems! Now to see what is happening.

@wouf
Contributor

wouf commented Oct 26, 2022

Hydjet is not thread-safe: please follow this discussion. Could that be the reason?

@Dr15Jones
Contributor Author

I double checked and see that the Framework only runs the hydjet module on one thread at a time.

@Dr15Jones
Contributor Author

It is not a threading issue. I ran it single threaded and set the job to run 100 events and it happened for 2 different events.

@perrotta
Contributor

@Dr15Jones @wouf I've started the build of CMSSW_12_6_0_pre4 even if this issue is not yet solved, see also #39865

As such, a few HI workflows will fail in the relvals. If a solution is found shortly, we can still decide to stop that build and start a new one including the fix.

@wouf
Contributor

wouf commented Oct 27, 2022

@perrotta @Dr15Jones, I have reproduced the issue. It happens in events with an ultra-high multiplicity per sub-event. Unfortunately this is due to an offset overflow, so the Hydjet core needs to be updated (it looks like 50 000 is not enough for such events).

 <--- 85480
 ---85480---> 85481
85481 MULTin ev.:89598 SubEv.#61 Part #85482, PDG: 311 (st. 2) mother=35485 (3050000, 3050000, 35484), childs (85483-85483), vtx (0,0,0)
 ---35486---> 85482
85482 MULTin ev.:89598 SubEv.#61 Part #85483, PDG: 310 (st. 1) mother=85482 (3099997, 3050000, 35484), childs (85481-85481), vtx (0,0,0)
 <--- 85482
 ---85482---> 85483
85483 MULTin ev.:89598 SubEv.#61 Part #85484, PDG: 2112 (st. 1) mother=35485 (3050000, 3050000, 35484), childs (35485-35485), vtx (0,0,0)
 ---35486---> 85484
85484 MULTin ev.:89598 SubEv.#61 Part #85485, PDG: -2112 (st. 1) mother=35485 (3050000, 3050000, 35484), childs (35485-35485), vtx (0,0,0)
 ---35486---> 85485
85485 MULTin ev.:89598 SubEv.#61 Part #85486, PDG: 111 (st. 2) mother=35485 (3050000, 3050000, 35484), childs (85487-85488), vtx (0,0,0)
 ---35486---> 85486
85486 MULTin ev.:89598 SubEv.#62 Part #85487, PDG: 22 (st. 1) mother=85487 (3100001, 3100000, 85485), childs (85486-85486), vtx (1.06919e-05,6.168e-05,0.000468078)
 <--- 85487
 ---85487---> 85487
85487 MULTin ev.:89598 SubEv.#62 Part #85488, PDG: 22 (st. 1) mother=85487 (3100001, 3100000, 85485), childs (85486-85486), vtx (1.06919e-05,6.168e-05,0.000468078)
 ---85487---> 85488
85488 MULTin ev.:89598 SubEv.#61 Part #85489, PDG: -311 (st. 2) mother=85488 (3050000, 3050000, 85487), childs (135493-135493), vtx (0,0,0)
 <--- 85488
 ---85488---> 85489
85489 MULTin ev.:89598 SubEv.#62 Part #85490, PDG: 310 (st. 1) mother=85493 (3100004, 3100000, 85488), childs (85492-85492), vtx (0,0,0)

Here Part #85487 and Part #85488 are originally from sub-event 61, but due to an overflow of the offset buffer the sub-event number in HydjetInterface has increased. (The first number in the brackets after the mother index is hyjets.khj[2][ihy], the mother index from the Fortran part; it should be less than 3100000 for sub-event 61.) The same happened in @Dr15Jones's printout.
So we have to roll back @Dr15Jones's solution, because such events are useless anyway. I will check the possibility of increasing the offset buffer in the Hydjet core.
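The overflow can be illustrated with a small sketch. It assumes, based only on the numbers in this thread, that the Fortran mother index packs the sub-event number and a local index into one integer with a stride of 50000; `kSubEventStride`, `encode`, `decodeSubEvent` and `decodeLocal` are hypothetical names, and the actual Hydjet encoding is not shown here.

```cpp
// Assumed stride between sub-events, taken from "50 000 is not enough".
constexpr int kSubEventStride = 50000;

// Assumed packing of (sub-event, local index) into a single integer.
constexpr int encode(int subEvent, int localIndex) {
  return subEvent * kSubEventStride + localIndex;
}

// Decoding as the interface side would have to do it.
constexpr int decodeSubEvent(int khj) { return khj / kSubEventStride; }
constexpr int decodeLocal(int khj) { return khj % kSubEventStride; }

// Within range: sub-event 61, local index 49997 round-trips correctly.
static_assert(encode(61, 49997) == 3099997, "matches the printout");
static_assert(decodeSubEvent(3099997) == 61, "still sub-event 61");

// Out of range: local index 50001 in sub-event 61 is indistinguishable
// from local index 1 in sub-event 62.
static_assert(encode(61, 50001) == 3100001, "matches the printout");
static_assert(decodeSubEvent(3100001) == 62, "misread as sub-event 62");
```

Once a sub-event holds more particles than the stride allows, the decoded sub-event number is wrong, so any mother index derived from it is wrong as well.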

@Dr15Jones
Contributor Author

@wouf nice catch! I actually woke up this morning wondering if that was the problem. I'm making a new PR that reverts this change and instead throws an exception when mid > ihy.
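A guard of that kind might look roughly like the sketch below. It is not the code from the follow-up PR: `checkMotherIndex` is a hypothetical name, and `std::runtime_error` stands in for whatever exception type the real code uses.

```cpp
#include <stdexcept>
#include <string>

// Rejects a mother index that points past the daughter, which in this
// thread was traced back to an overflow of the sub-event offset.
void checkMotherIndex(int ihy, int mid) {
  if (mid > ihy) {
    throw std::runtime_error("mother index " + std::to_string(mid) +
                             " is larger than daughter index " + std::to_string(ihy));
  }
}
```

Failing loudly here turns a silent bad lookup into a clear error for the affected event.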
