Fix double counting in HepMC to G4 handoff #39427
Conversation
type bugfix |
test parameters: |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39427/32138 |
A new Pull Request was created by @tvami (Tamas Vami) for master. It involves the following packages:
@cmsbuild, @civanch, @mdhildreth can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
@cmsbuild , please test |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-63f495/27616/summary.html HIGH_STATS Comparison Summary
@slava77 comparisons for the following workflows were not done due to missing matrix map: |
Oh well, it seems we cannot rely on these tests. Even with the HS samples we are too far in the tails of the distribution, so it doesn't see any changes |
I guess this also means this change doesn't break anything seriously, just to see the silver lining :) |
hi @civanch I'm wondering whether we should take advantage of the fact that 12_5_0 comes out next Tuesday, and try to push this in there to be part of the final 12_5_0 validation. What do you think? |
@tvami , thank you for the fix. We should backport to 12_5, 12_4, and 10_6. |
+1 |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
@shedprog actually said in the relevant meeting that asking for status 2 OR status >3 should be done since it looked like three was reserved. So let's at least change that. My hesitancy to submit the PR wasn't a technical worry. It's that this change is so central and will affect so many people in a small way that this needs to be carefully checked. i.e. I don't think this should come from just one analysis team, it should be coordinated by simulation coordination, etc etc. Maybe @civanch can comment about the best way to move forward -- is it really just making this change and hoping for the best? |
Let me copy paste PdmV's answer here " |
Dear @tvami @civanch (@CeliaFernandez @elusian FYI), you can see the tracking efficiency drops dramatically and the fake rate explodes. To make sure this is really the culprit, I ran |
Hi @mmusich -- interesting. Remember that this change is not to the reco, but to the SIM. So this is just informing GEANT of metastable particles it (incorrectly) didn't know about before. Without handling this correctly, we've seen two symptoms:
To help understand the source of this change you're seeing, can you provide more information on what kind of sample this is -- what's the physics process, and what's the analysis? Ultimately, I think you'll need to understand whether in your "before" samples there are duplicate particles in the G4 track collection in some way that are just being (correctly) removed in the new code. Or you might be seeing an object-level BG that was being suppressed by the bug, now coming back with the fix. This is going to need a microscopic study to figure out what's going on with the sample... It's for exactly this kind of reason that I kept pushing for a very detailed validation of this change before merging, so that low-level studies were already in hand from potentially affected analyses. |
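The duplicate check suggested above could be sketched roughly as follows. This is a toy, not CMSSW code: `SimTrackLite`, its fields, and the tolerance are illustrative stand-ins for the real `SimTrack`/`TrackingParticle` interfaces.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a G4/SimTrack entry; the actual
// CMSSW classes carry much more information.
struct SimTrackLite {
  int pdgId;
  double px, py, pz;  // momentum, GeV
  double vx, vy, vz;  // production vertex, cm
};

// Count pairs of tracks that look like exact duplicates: same PDG ID and
// (numerically) identical momentum and production vertex.
int countDuplicatePairs(const std::vector<SimTrackLite>& tracks, double tol = 1e-6) {
  auto same = [tol](double a, double b) { return std::fabs(a - b) < tol; };
  int nDup = 0;
  for (std::size_t i = 0; i < tracks.size(); ++i)
    for (std::size_t j = i + 1; j < tracks.size(); ++j) {
      const SimTrackLite& a = tracks[i];
      const SimTrackLite& b = tracks[j];
      if (a.pdgId == b.pdgId && same(a.px, b.px) && same(a.py, b.py) &&
          same(a.pz, b.pz) && same(a.vx, b.vx) && same(a.vy, b.vy) && same(a.vz, b.vz))
        ++nDup;
    }
  return nDup;
}
```

Running a check like this over the "before" and "after" samples would show directly whether the extra tracks are duplicates or genuinely new particles.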
yes, that's why I am showing you the plot of the tracking-particle eta. In case you are not aware, that's a purely simulated quantity.
I think @CeliaFernandez can best answer these details from the EXO side. I am merely reporting that this caused a massive change in the tracking particle distributions.
I think we are now seeing more tracking particles rather than fewer.
Right. It seems that EXO flagged this as a failure back in 12_6_0_pre3: https://cms-pdmv.cern.ch/valdb/?srch=12_6_0_pre3&PAGs=true&PFull=true but there was no follow-up. |
Hi @lawrenceleejr, this sample is one of the RelVals for standard campaign validation, which is used to validate displaced objects (displaced jet tracks, electrons, and muons). The DisplacedSUSY process simulates the pair production of long-lived stops (pdgIds ) that each decay through an RPV vertex to a lepton (e, mu, or tau with 1/3 probability each) and a b-quark. This sample thus produces both displaced jets and displaced leptons. You can see the PR where it was added for more details: #31692 and also the JIRA ticket where it was requested: https://its.cern.ch/jira/browse/PDMVRELVALS-96 As said by Marco, the number of tracking particles increases dramatically, and this seems to have an impact on the efficiencies and the fake rate. Specifically, on the EXO efficiency side, every lepton seems to be reconstructed (which gives inconsistent results, as the efficiency should fall with displacement). I think that if the problem were that we were not removing the duplicated particles in the "before" samples, we would observe the inverse effect, i.e. a decrease in the efficiency. |
Thanks for the details. I agree that this seems to be going in the other direction, which is why I said "Or you might be seeing an object-level BG that was being suppressed with the bug, now coming back with the fix." I fear that's more likely. I'm definitely not saying I know what's going on here, and it is surprising, but I do think it's crucial to understand exactly what is happening under the hood between these two samples. What is the origin of these new tracking particles that have shown up? The question is: should these new tracking particles be there or not? The change in this PR has no way of adding new particles to the truth record. It's just telling GEANT about particles that existed at the generator level. So after this PR the environment is noisier, with lots more tracking particles, but it might be that that's what it should be. The sculpting of some of the kinematics is interesting and I'm not yet sure what to make of it. As I wrote in my initial presentations on this bug, it's exactly this kind of signal that I suspected would see major changes: displaced production of heavy flavor. If the long-lived stop always decays to a b, then information about that b decay chain will be lost to GEANT, and GEANT will try to redecay things or otherwise do something undesired. This is primarily a problem when the decay is done at the generator level instead of in a dedicated GEANT step, as we've discussed with the SIM group. Is someone available to dig into the origin of these G4 tracks? |
I think that the massive increase in the fake rate and drop in the efficiency (even while the number of tracking particles goes up so much) is indicative that the tracking particles no longer relate to the tracker digis in the same way. |
Some of the reconstructed quantities, only for this sample, are affected, e.g. for the jets (although these are not plots optimized for this sample, just something to compare against) some variations are seen: |
Hi all, the specifics of the generator/Geant4 interface lie in the assumptions the interface makes about the generator output. In the case of displaced muons, the generator provides muon final-state radiation of a low-energy gamma and changes the status code of that muon. The interface was not aware of such a case and produced duplicated muons. I am not sure this fix helps in all possible cases. In our practice, generators do not notify us about the specifics of final states, so we react to each new case when it is identified. It is impossible from our side to put all possible constraints on generator developers in advance. So we take a pragmatic approach: when the exotica group identifies a problem, we provide a solution by updating the interface. Some solutions are general, others specific to only one case. Maybe not elegant, sometimes boring, but it works. The fix is usually simple; most of the time is lost in understanding that the problem exists. So for each new case we should start from scratch. First, we need the input python configuration file. Often, with limited statistics, it is possible to understand what happens. "Wrong" may mean an extra particle for G4 or a lost particle for G4; it may also be a wrong primary particle type or property for G4. This would be the first check in the given situation. Note that G4 itself does not add or remove a particle without a reason. If the number of tracking particles is so different, then ~10 events with a detailed "info" printout would be enough to check the input to G4 and the number of MC-truth particles produced by SIM. This check will tell us whether the problem is in the generator/G4 interface, in the MC-truth assignment, or downstream. |
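The "~10 events with detailed printout" check described above boils down to comparing per-event particle counts. A toy version of that comparison, where `EventCounts` is a hypothetical stand-in for numbers harvested from the "info" printout:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-event counts: particles handed to G4 as primaries,
// and MC-truth particles produced by SIM.
struct EventCounts {
  std::size_t nInputToG4;
  std::size_t nMCTruth;
};

// Return the indices of events where the two counts disagree; those are
// the events worth inspecting with the detailed printout.
std::vector<std::size_t> flagMismatchedEvents(const std::vector<EventCounts>& events) {
  std::vector<std::size_t> bad;
  for (std::size_t i = 0; i < events.size(); ++i)
    if (events[i].nInputToG4 != events[i].nMCTruth)
      bad.push_back(i);
  return bad;
}
```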
Hi both -- all of this is fascinating. The extra jet activity is interesting and I'm not sure where that could be coming from. IMO the way to proceed is to find one of those G4 tracks that didn't exist before but does now. Get its link to a GenParticle object. Understand if that's supposed to be there, already accounted for, or what. Does that make sense as a path forward? |
IIUC, the physics here in the DisplacedSUSY relval is m_stop = 800 GeV pair production (lifetime 50 cm). Looking at plain track distributions (not truth matched), high purity with pt > 1, and then also at pf ch met sum et: https://tinyurl.com/2pn7xnxk The new sample actually makes more sense from this perspective. |
but as was mentioned elsewhere, the lepton efficiencies vs displacement do not make sense https://tinyurl.com/2j25tr6a Here's an event display for the second event (try it from https://fireworks.cern.ch) for /store/relval/CMSSW_13_0_0_pre3/RelValDisplacedSUSY_14TeV/GEN-SIM-RECO/130X_mcRun3_2022_realistic_v2-v1/00000/3091a92f-402c-4edb-b249-7c67fde62eb3.root |
Just to understand -- this sample is from after this MR? Is it possible to show that these particular g4 tracks are not duplicated before this MR, i.e. to show that this duplication is the cause of all of this extra activity? It would also be interesting to understand in this sample whether, e.g., all of those g4 tracks coming from 1677 are separately responsible for g4 hits in the detector. Are these just the result of a single particle leaving each energy deposition as it passes through material? Even if these are treated as fully separate in g4, because of their exactly identical four-momenta, I would not expect so much extra activity in the detector at reco level, just extra energy deposition. |
The sample was created in 12_6_0_pre3 (which includes this PR). |
I understand... I'm not arguing about whether the change is caused by this PR. That has been clear. The question is: what's the actual underlying cause here? Any efficiency plot is at way too high a level to learn anything from if we're trying to microscopically understand this. A few weird facts: |
This is just from poking around in fireworks for a few minutes, but this needs a more serious study... This appears to be pointing to something else that's evil in the handoff to G4... |
I take it back: vtx 13 and 14 are displaced, and there are no repeated prompt particles. The only issue is that the leptons show up twice, which I guess is the original issue that this PR was supposed to fix. |
Yes I agree that the post-PR picture is really problematic now. Attaching a screenshot of a side by side comparison of two samples @tvami helped make before and after this PR (with only this change as the difference).
The right display is after, the left before. Before, we have a reasonable-looking displaced jet with a reco pT of 242 GeV. On the right, we can see the blue (genparticles) giving a displaced jet again, but this time it's reconstructed as a 7.2 TeV jet, which is of course insane. Skimming through the gen collections, they both look reasonable. No weird duplicates or anything at that level (as expected). The "after" g4 track collection does show many duplicates as discussed above, whereas the "before" g4 track collection has only a few. (This is all a very rough event display exploration, so not very detailed...) And those extra g4 tracks are clearly leaving energy depositions in the detector, giving rise to this huge unphysical jet. That said, the change in this PR only decides which of the GenParticles should be told to G4, so there's a unitarity that is obeyed here. So I don't really see a way this line could be the direct culprit, but I think it now requires rerunning the SIM with some debug info on a few events to try and understand better. It's almost as if another part of the code is stepping in and wildly duplicating g4 tracks under this specific condition. The next steps would be some studies directly running the SIM and probing where this explosion of g4 tracks is happening. But I'm not sure who has the time and expertise to dive into that... |
Hi all, is it possible to test #40820? It is an attempt to recover by reverting and adding, instead of (status > 3), the condition (status > 50 && status < 100). I am not fully sure about this fix, because I do not have information about the status codes in the problematic events with displaced jets. The main idea of that PR is to narrow the range of status codes used in the condition that creates a daughter for the pre-defined decay. |
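For illustration only, the two conditions under discussion can be written side by side. Neither function is the real Generator.cc code; they only encode the status-code logic quoted in this thread (status 2 OR status > 3 after this PR, versus the narrower 50 < status < 100 window proposed in #40820). Note how an R-hadron status like the 104 seen later in this thread passes the first but fails the second:

```cpp
#include <cassert>

// Rough form of the condition after this PR: status 2, or any
// generator-specific status above 3, gets pre-assigned daughters.
bool preassignDaughters_thisPR(int status) {
  return status == 2 || status > 3;
}

// Narrower condition proposed in #40820: status 2, or a status code
// in the open window (50, 100).
bool preassignDaughters_40820(int status) {
  return status == 2 || (status > 50 && status < 100);
}
```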
@civanch but won't this lead to further problems in the future? Who knows if at some point the gen community will pick whateverStatusCode+1 (101 in your case) to be something that should be assigned a daughter? Isn't the fix supposed to be more downstream, in the R-hadron handling part of the code, instead of the daughter-assigning part? |
Unfortunately, in the past a discussion with the generator people never even got started. Maybe I asked the wrong people, maybe at the wrong time, but so far we tend to introduce fixes after each new problem. |
I've tested the commits: see cursory analysis done with |
@tvami , this is the right place to decide the list of primary particles for the Geant4 simulation. If at this point any particle is lost, or an extra unwanted particle is added, then the simulation will definitely be wrong, and no filter downstream will be able to fully recover. |
I have a suspicion about the cause of this. It's another case of over-assuming about the status code. https://github.com/cms-sw/cmssw/blob/master/SimG4Core/Generators/src/Generator.cc#L395 If a particle is long-lived but doesn't have status 2, then the recursive child-assigning function that this PR was about is never even called. So if these R-hadrons have another status code (very possible -- would be quick to check) then GEANT is again not told about the relationship with the decay products. I'm not certain this is the problem yet, but it smells consistent... To fix this, it would be a similar fix to this PR, to figure out if. I definitely agree with @tvami -- going more exclusive with the status code checking 1) will probably not fix the issue and 2) just goes back in the dangerous direction of assuming too much about the status code, which will lead to problems in the future. |
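The gating pattern being described can be sketched abstractly. All names here are hypothetical (this is not the actual Generator.cc interface): the point is only that if daughters are pre-assigned solely for status-2 parents, a long-lived particle carrying any other status code never enters the recursion at all.

```cpp
#include <cassert>
#include <vector>

// Hypothetical minimal GenParticle stand-in.
struct GenPart {
  int status;
  std::vector<GenPart*> daughters;
  bool daughtersPreassigned;
};

// Recursively mark a particle's decay chain for pre-assigned decay in G4.
void assignDaughters(GenPart* p) {
  p->daughtersPreassigned = true;
  for (GenPart* d : p->daughters)
    assignDaughters(d);
}

// Sketch of the gate in question: only status-2 parents reach the recursion,
// so e.g. a long-lived R-hadron with status 104 is skipped entirely and
// GEANT never learns about its generator-level decay products.
void handOffToG4(GenPart* p) {
  if (p->status == 2)
    assignDaughters(p);
}
```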
Can confirm that the RHadrons here are given a status code other than 2. So they're never handed to the |
I think I see it -- or at least another problem, again in cmssw/SimG4Core/Generators/src/Generator.cc Line 248 in 803f553
In the event in the screenshot above, you have a stop RHadron with status 104. In this line of the code:
The first term is true: the status 104 > 3. [Edit: Now I'm not sure what's going on. This just needs an actual study. The structure of this code is too much to just read... at least this late in the day...] This hacky approach to such a central piece of code is causing a lot of recurring issues, and IMO it needs a bigger cleanup. I'd be happy to work with someone who has some time, but unfortunately I don't see anyone in my group having the time to do this for the next few months. |
@lawrenceleejr , thank you for this analysis. I agree that there may be a problem. Note that the "pre-assigned" decay mechanism was introduced for normal hadrons, which have long decay chains with many decay modes, each of which may have a very small branching ratio. Inside G4, decay modes of tau, B-mesons and baryons, and even D-mesons and baryons are simulated with limited accuracy. In PYTHIA and other generators we have the best accuracy for generating decay cascades; the cascades are mainly inside the beam pipe but partly go outside, where the pre-assigned decay is used. Here the method of predefined decay makes the simulation effective and reasonable. In the HepMC interface it is status=2. The particleAssignDaughters(..) method is needed to prepare recursively the list of particles for G4 tracking and the pre-defined decay channels for decays outside the beam pipe. The same code is applied to exotics, where the situation is not well defined. Exotic particles are not part of the G4 particle description; they come with an external description including their decay modes. Using predefined decay modes for such particles causes all kinds of problems and basically requires reacting to each new case, making a patch on top of old patches. This seems to be because each new case is different from the previous one. For me, an ideal situation would be for long-lived exotics to have status=1 and decay modes defined in the configuration. Unfortunately, that is not the case. Concerning #40820: is it possible to check the R-hadron case and propose a modification to address it? Or do we merge #40820 and fix R-hadrons in the next iteration? |
PR description:
The HSCP group discovered a bug in the handoff from HepMC to G4 for long-lived particles (either SM or BSM) which causes double counting of tracks. More details in [1]. This PR changes the condition on the status code so that special cases (like the status 91 mentioned in the talk below) are handled properly. As agreed with @civanch, we are submitting this PR.
[1] https://indico.cern.ch/event/1198290/contributions/5043130/attachments/2510066/4313973/091622_LLee_HSCP_SIMBug.pdf
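As a heavily simplified toy of the double counting (assumed logic, purely for illustration -- `GenRecEntry` and the selection functions below are not the actual HepMC-to-G4 interface): if a long-lived particle is written with a special status such as 91 alongside its final status-1 copy, a handoff that promotes both entries to G4 primaries tracks the particle twice, while a fixed selection skips the superseded entry.

```cpp
#include <cassert>
#include <vector>

// Hypothetical generator-record entry: a status code plus a flag marking
// whether a later (e.g. post-FSR, status-1) copy of the same particle
// exists in the record.
struct GenRecEntry {
  int status;
  bool hasLaterCopy;
};

// Buggy selection: every status-1 particle AND every special-status
// particle becomes a G4 primary, so a particle with both entries is
// double counted.
int countPrimariesBuggy(const std::vector<GenRecEntry>& rec) {
  int n = 0;
  for (const GenRecEntry& p : rec)
    if (p.status == 1 || p.status > 3)
      ++n;
  return n;
}

// Fixed selection: a special-status particle with a later copy in the
// record is skipped; only its final status-1 copy is handed to G4.
int countPrimariesFixed(const std::vector<GenRecEntry>& rec) {
  int n = 0;
  for (const GenRecEntry& p : rec)
    if (p.status == 1 || (p.status > 3 && !p.hasLaterCopy))
      ++n;
  return n;
}
```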
PR validation:
Code compiles, workflow 10026.0 runs and passes. Further HS tests are required beyond the Jenkins tests, possibly including a PdmV-level dedicated release validation campaign.
Proposed test here
If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:
Not a backport, but it affects UL samples, so should we consider backporting to 10_6_X?