-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Outline of HGCAL ML ntuples using NanoAOD #32187
Conversation
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32187/19926 |
A new Pull Request was created by @kdlong (Kenneth Long) for master. It involves the following packages: PhysicsTools/NanoAOD @cmsbuild, @santocch, @mariadalfonso, @gouskos, @fgolf can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
-1 There is no consensus on how to flatten the edm objects. similar request was made for PFcandidates |
@mariadalfonso indeed, I just meant to put this here as an example of a way we could consider going. I should also add that this is just something I cooked up today, it's not something that we are using widely for HGCAL ML trainings yet. I thought it was worth sharing because I think it's cleaner than the disjointed and independent ntuplizers that we are currently using. |
Please push on "Ready for review" button, whenever this PR will be ready to be reviewed/merged |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32187/21232
|
Pull request #32187 was updated. @SiewYan, @perrotta, @civanch, @gouskos, @mkirsano, @mdhildreth, @cmsbuild, @jpata, @fgolf, @slava77, @alberto-sanchez, @agrohsje, @mariadalfonso, @GurpreetSinghChahal can you please check and sign again. |
One important point raised by @rovere in the meeting today is the fact that the associations used here are strictly one to one, whereas the proper match in many cases is OneToManyWithQuality. There are a few reasons I used one to one here:
The concern raised is that having these maps in CMSSW implies that one to one matching is fully correct. For other ML algorithms, it would likely be important to maintain the mixed association of rechits to simclusters, and it is not fully correct to imply that PF associations a rechit to one candidate. Note that, however, at least in the present implementation, the SimHit and SimTrack to SimClusters really are one to one and the associations should be correct. I see a few options to address this:
|
@cmsbuild please test |
-1 Failed Tests: RelVals RelVals-INPUT RelVals
RelVals-INPUT
|
void beginRun(const edm::Run&, const edm::EventSetup& iSetup) override { | ||
// TODO: check that the geometry exists | ||
iSetup.get<CaloGeometryRecord>().get(caloGeom_); | ||
rhtools_.setGeometry(*caloGeom_); | ||
iSetup.get<TrackerDigiGeometryRecord>().get("idealForDigi", trackGeom_); | ||
// Believe this is ideal, but we're not so precise here... | ||
iSetup.get<GlobalTrackingGeometryRecord>().get(globalGeom_); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ES consumes should be used
based on issues reported in the static analysis
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-86974f/13067/llvm-analysis/
@kdlong |
@slava77: Looking at the failure I thought maybe I messed something up with treating pileup in the maps in CaloTruthMerger. Actually I think it probably makes most sense to produce those in a separate producer in any case, so I can restructure this and test further, just haven't had time yet. On the physics aspects, my preference is to keep the OneToOne associations but possibly add a OneToMany as well. Others should comment, and maybe we should discuss this more within the HGCAL DPG and PF group. |
Dear all,
|
Thanks Marco. Note that the PR was marked RFC = request for comment, that is, I made this to be a discussion. It was originally a bare bones overview of an idea, then I updated it with too much stuff. I'm not trying to force things into CMSSW central without involving relevant parties, it was and is meant to be a discussion. I can remove the associations from the PR if that is the strong preference of everyone. However, they are very important to the visualization setup, so having a solution that we agree on would be useful. A little more info: the RecHit to PFCand associations are purely based on PFCandidates and are not specific to HGCAL. The OneToOne assumption is very useful to the visualization but should be made with care. An additional OneToMany association could/should be added. I hope the PF group can comment on whether they would be interested in supporting this or not. The HGC specific associations are the SimCluster ones. A few are not ambiguous, SimTrack --> SimCluster and SimHit --> SimTrack for example. These are valuable for the visualization setup. I can certainly remove this from the CaloTruthAccumulator into a separate producer, though. The RecHit --> SimCluster is naturally oneToMany so the assumption of one to one here is an issue as you highlighted. I understand that the HGCAL DPG has developed an approach to this using associations to layer clusters. I think it's worth considering an implementation of oneToMany matching without the intermediate matching to layer clusters. Indeed it would be good to discuss this in the HGCAL DPG. |
@kdlong would you be able to split this in multiple independent RFC PRs? PFCandidates in the endcap are produced by TICL, so I would recommend having an associator that is coherent if you go through the full chain |
Hi,
The RecHit to PFCandidate association is indeed already OneToMany for PFClusters in the barrel or for Run1/2/3 given the weights/sharing which is used there, so it would indeed be good to eventually have a general solution to this.
Indeed it sounds like we should discuss this as well in an HGCal DPG meeting to make sure that all of these aspects are fully discussed. |
@bendavid sorry if my opinion was not clear. Rephrasing: I think that an association done through the two paths (the direct one and the full reco one) should produce the same association map. |
Ok, but in this case the "final consumer of the map" is actually the nanoaod output stage, which cannot currently handle OneToMany in a reasonable way, so the conversion has to be done upstream of that. |
@felicepantaleo Yes, good suggestion to separate the PRs, I may not get to this immediately but I will do it in the next week or so. For sure it's necessary to have coherent results whether you go through TICL or PF. I used the PFCand interface because I wanted this to also apply outside of the endcap and because I'm interested in a generic visualization of reconstruction (e.g., including PFSim and the current PF). Shouldn't TICL and PF give the same answer by construction, if the TICLCand is used to fill the PFCand? Independent of building association maps it should be validated that accessing rechits from the TICL reco chain and from the PFCands give the same result. I am happy to make some checks of this, but it would be very difficult for me to do in a short time scale. I'm not sure if the visualization group is interested in this, they should comment (relevant presentation here), but based on the feedback I got from presentations and discussion I understood that there is interest from colleagues to use this as a lightweight visualization setup for understanding reconstruction. It would require a decent bit of work to make it as precise and complete as fireworks, for example, but it's lightweight and flexible which is nice for the current use case. |
@kdlong you keep mentioning a visualization setup, but I see no trace of any visualization code in this or any other open PR. Is this a private tool or is it meant to be a centrally maintained and available tool? @bendavid @kdlong I fear there is a profound misunderstanding about the association and the way HGCAL DPG thought and implemented it. Finally, since this is marked as [RFC], I believe we should simply close this PR, partition it along the lines I suggested a couple of comments ago and start having a discussion we should have had before opening this [RFC] PR. For the future, I believe it would be better to open PR with a specific meaning/scope: in this specific one we ended up talking about visualization (which is not part of this PR), generator improvements, ntuple creation and associators. While I understand that, in the very end, everything should come together, for integration and discussion purposes, limiting the scope would improve communication and, in the end, the integration process. |
Hi, (The visualization use case which has been discussed here is indeed "just" one of the examples that was discussed for code/use cases running on top of the NANOAOD produced in this PR, but the actual visualization is not included in the PR, since as I understand it's "just" some 3d plotting scripts making scatter plots from the information in the NANOAOD-formatted output) One can discuss the utility of opening an "RFC" pull request, vs sending an email or giving a presentation saying "please comment on this git branch", but given the technical aspects of this work, it's surely useful to have concrete code to comment on. Now some comments have definitely been collected, and this has been discussed in a few places, but not in an HGCal DPG meeting as has been said, and given the open points/possible remaining misunderstandings this would surely be useful/critical. While there is evidently some lively discussion and strong opinions about the association, it's also true that as a demonstrator for NANOAOD-for-low-level-detector-and-truth stuff one does need SOME association to maximize its utility, and I hope we can find some reasonable solution here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even though chances are that the CommonTools/ code may end up being rewritten, hopefully the following are still useful
#include "SimDataFormats/TrackingAnalysis/interface/TrackingParticle.h" | ||
#include "SimDataFormats/CaloAnalysis/interface/SimClusterFwd.h" | ||
#include "SimDataFormats/CaloAnalysis/interface/SimCluster.h" | ||
#include "DataFormats/CaloRecHit/interface/CaloRecHit.h" | ||
#include "SimDataFormats/CaloHit/interface/PCaloHit.h" | ||
#include "SimDataFormats/CaloHit/interface/PCaloHitContainer.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include "SimDataFormats/TrackingAnalysis/interface/TrackingParticle.h" | |
#include "SimDataFormats/CaloAnalysis/interface/SimClusterFwd.h" | |
#include "SimDataFormats/CaloAnalysis/interface/SimCluster.h" | |
#include "DataFormats/CaloRecHit/interface/CaloRecHit.h" | |
#include "SimDataFormats/CaloHit/interface/PCaloHit.h" | |
#include "SimDataFormats/CaloHit/interface/PCaloHitContainer.h" | |
#include "DataFormats/CaloRecHit/interface/CaloRecHit.h" |
it looks like sim is not used; please cleanup this and the commented out code
: //caloSimhitCollectionTokens_(edm::vector_transform(pset.getParameter<std::vector<edm::InputTag>>("caloSimHits"), | ||
// [this](const edm::InputTag& tag) {return mayConsume<edm::PCaloHitContainer>(tag); })), | ||
//trackSimhitCollectionTokens_(edm::vector_transform(pset.getParameter<edm::InputTag>("trackSimHits"), | ||
// [this](const edm::InputTag& tag) {return mayConsume<std::vector<PSimHit>(tag); }), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this commented out code needed?
Please remove or add comments inline in the code why the commented out block is relevant
pfCollectionToken_(consumes<reco::PFCandidateCollection>(pset.getParameter<edm::InputTag>("pfCands"))) { | ||
for (auto& tag : caloRechitTags_) { | ||
const std::string& label = tag.instance(); | ||
//TODO: Can this be an edm::View? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this appropriately placed? IIRC, edm::View
s are for reading, not writing
// [this](const edm::InputTag& tag) {return mayConsume<std::vector<PSimHit>(tag); }), | ||
caloRechitTags_(pset.getParameter<std::vector<edm::InputTag>>("caloRecHits")), | ||
caloRechitCollectionTokens_(edm::vector_transform( | ||
caloRechitTags_, [this](const edm::InputTag& tag) { return mayConsume<edm::View<CaloRecHit>>(tag); })), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, mayConsume
should be avoided where possible; switch to consumes
and if some of the preconfigured input tags are excessively added, clean the condig
std::vector<edm::EDGetTokenT<edm::PCaloHitContainer>> caloSimhitCollectionTokens_; | ||
//std::vector<edm::EDGetTokenT<std::vector<PSimHit>> trackSimhitCollectionTokens_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::vector<edm::EDGetTokenT<edm::PCaloHitContainer>> caloSimhitCollectionTokens_; | |
//std::vector<edm::EDGetTokenT<std::vector<PSimHit>> trackSimhitCollectionTokens_; |
not used, apparently
for (size_t h = 0; h < caloRechitCollection->size(); h++) { | ||
const CaloRecHit& caloRh = caloRechitCollection->at(h); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (size_t h = 0; h < caloRechitCollection->size(); h++) { | |
const CaloRecHit& caloRh = caloRechitCollection->at(h); | |
for (auto const& caloRh : *caloRechitCollection) { |
@@ -0,0 +1,136 @@ | |||
// system include files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the file name has a typo, should be SimClusterRecHitAssociationProducer.cc
edm::Handle<edm::View<CaloRecHit>> caloRechitCollection; | ||
iEvent.getByToken(caloRechitCollectionTokens_.at(i), caloRechitCollection); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edm::Handle<edm::View<CaloRecHit>> caloRechitCollection; | |
iEvent.getByToken(caloRechitCollectionTokens_.at(i), caloRechitCollection); | |
edm::Handle<edm::View<CaloRecHit>> caloRechitCollection = iEvent.getHandle(caloRechitCollectionTokens_[i]); |
for (size_t h = 0; h < caloRechitCollection->size(); h++) { | ||
const CaloRecHit& caloRh = caloRechitCollection->at(h); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (size_t h = 0; h < caloRechitCollection->size(); h++) { | |
const CaloRecHit& caloRh = caloRechitCollection->at(h); | |
for (const auto& caloRh : *caloRechitCollection) { |
if (simClusterToRecEnergy->find(match) == simClusterToRecEnergy->end()) | ||
(*simClusterToRecEnergy)[match] = energy * fraction; | ||
else | ||
simClusterToRecEnergy->at(match) += energy * fraction; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't the default inserted value going to be zero? In this case the following is the same and is shorter
if (simClusterToRecEnergy->find(match) == simClusterToRecEnergy->end()) | |
(*simClusterToRecEnergy)[match] = energy * fraction; | |
else | |
simClusterToRecEnergy->at(match) += energy * fraction; | |
(*simClusterToRecEnergy)[match] += energy * fraction; |
@kdlong |
@slava77 I discussed further with @rovere I will split the PR into pieces, one with the configurations without associations, then make the associations in a separate PR. In the meantime I've updated the maps to be one to many and modified the producer. I need a bit more time to make these changes more robust/general for CMSSW. We should also start a conversation with XPOG about how to best store oneToMany associations across collections in this. Depending on how this proceeds we can either store only "Best" associations + number of associations per object or go straight to a oneToMany setup. |
PR description:
This isn't really meant to be merged. It's an outline of a WIP effort to have ntuples for machine learning based reconstruction in HGCAL in the NanoAOD framework.
ML efforts generally need ~flat trees. Normally we cook up a simple ntuplizer with an EDAnlayzer. But configs like those in this PR are much more readable and can easily be shared---i.e., schedule this config snippet instead of copy/paste these lines of C++ code. I'm curious if there is any interest to support this centrally. Recently discussed with @bendavid on the PF side, who is in favor.