-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deal with late parents in HydjetHadronizer #39784
Conversation
There were cases where parent particles were later in the list than their children. This lead to segmentation faults. This change creates those parents when needed and avoids causing them to be created later.
please test |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39784/32650
|
A new Pull Request was created by @Dr15Jones (Chris Jones) for master. It involves the following packages:
@SiewYan, @mkirsano, @Saptaparna, @alberto-sanchez, @menglu21, @GurpreetSinghChahal can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test workflow 159.03 |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-430f76/28380/summary.html Comparison Summary@slava77 comparisons for the following workflows were not done due to missing matrix map:
Summary:
|
ping |
ping @cms-sw/generators-l2 |
urgent |
+1 |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
Dear @menglu21 , @Dr15Jones, all, unfortunately I missed this PR (could You please to notify me, or other HI GEN contacts, in the future in case of changes in HI generators!). The changes in lines 371, 387 looks useless, because the value of iterator Condition 397 have not to be executed, otherwise we have to check the logic (for my tests it have not). |
So you can see the failure in yesterday evening's IB (just before this PR was added to the IBs) problems were also seen with the address sanitizer (ASAN) builds as seen in this issue #39350 This segmentation fault does not happen every time the workflow 159.03 runs. When I was debugging the problem, it happened about half the time. I was fortunately that one of those times happened while running the job with the debugger after I'd recompiled the code to contain the debugging symbols. This let me see that the problem was caused by a mother particle having an index larger (in the case I saw just +1 larger) than the daughter index. Therefore requesting the mother returned a nullptr. |
@wouf wrote:
Actually this was exactly what caused the segmentation fault, the value of |
It appears that this fix is either incomplete or uncovered a different problem as we are now getting an exception in the workflow where the message is
|
Looks like I missed the case where the container wasn't large enough already as it looks to me like the new exception comes from this line
|
Was it multi-thread task? Let me remind, that Hydjet is not thread-safe. Meanwhile I tried to execute it without Your corrections about 6 times (each contained about 6-10 events). Do You mean this debug line? Or something else? Please note, that some of arrays exporting from Fortran, so they indexing from 1 (not 0), and indexes shifted too. On the other hand some particles may have mother index equal its own (it's not really correct, but I've seen it). Could You please to show me exact output printout? We need to understand which processes it was (which PDG was involved). Or could You give me recipe to reproduce the issue? |
Then, if it is really possible, that mid > ihy, it have to come from this line, we have to get the values of |
What I had seen was when The exception that is happening after my change was added shows that the call |
So I added some printouts that are triggered when
where daughter type is the pdgID value for the particle at |
I'm using release CMSSW_12_6_X_2022-10-26-1100 |
I added more printout about how the
|
So I wondered why the value of
|
Thank You! But I still can't reproduce it! What I'm doing is:
|
It looks like the version of the command I'm based on is
|
So I just tried with 1 thread and no problems seen! So now to see what is happening. |
Hydjet is not thread save: please follow this discussion. May it be the reason? |
I double checked and see that the Framework only runs the hydjet module on one thread at a time. |
It is not a threading issue. I ran it single threaded and set the job to run 100 events and it happened for 2 different events. |
@Dr15Jones @wouf I've started the build of CMSSW_12_6_0_pre4 even if this issue is not yet solved, see also #39865 As such there will be a few HI workflows which will fail in the relvals. It is intended that if a solution is found in short, we can even decide to stop that build and start a new one including the fix. |
@perrotta @Dr15Jones , I have reproduced the issue. This happened in events with an ultra high multiplicity per sub-event. Unfortunately this is due to offset overflow, so Hydjet core needs to be updated (looks like 50 000 is not enough for such events).
Here is Part #85487 and Part #85487 originally from sub-event 61, but due overflow of the offset buffer (first number in the bracket after mother index is hyjets.khj[2][ihy] - mother index from Fortran part, it should be less than 3100000 for sub-event 61) the sub-event number in HydjetInterface has increased. The same happened with the @Dr15Jones printout. |
@wouf nice catch! I actually woke up this morning wondering if that was the problem. I'm making a new PR with this change reverted and it now throws an exception in the case where |
PR description:
There were cases where parent particles were later in the list than their children. This lead to segmentation faults. This change creates those parents when needed and avoids causing them to be created later.
PR validation:
I was able to catch the segmentation fault in the debugger and found that the ID for the parent was larger than the child. This meant the parent had not yet been made which lead to the crash. After making this change, I ran the job 4 times and saw no further crashes. Previously the same job would crash > 50% of the time.