[RFC] Outline of HGCAL ML ntuples using NanoAOD #32187

kdlong · 2020-11-19T21:36:21Z

PR description:

This isn't really meant to be merged. It's an outline of a WIP effort to have ntuples for machine learning based reconstruction in HGCAL in the NanoAOD framework.

ML efforts generally need ~flat trees. Normally we cook up a simple ntuplizer with an EDAnalyzer. But configs like those in this PR are much more readable and can easily be shared, i.e., schedule this config snippet instead of copy/pasting these lines of C++ code. I'm curious if there is any interest to support this centrally. Recently discussed with @bendavid on the PF side, who is in favor.

cmsbuild · 2020-11-19T21:42:52Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32187/19926

This PR adds an extra 16KB to repository
Found files with invalid states:
- PhysicsTools/NanoAOD/python/simTracks_cff.py:
  - Added: 964145b
  - Deleted: 359fa61

cmsbuild · 2020-11-19T21:43:16Z

A new Pull Request was created by @kdlong (Kenneth Long) for master.

It involves the following packages:

PhysicsTools/NanoAOD

@cmsbuild, @santocch, @mariadalfonso, @gouskos, @fgolf can you please review it and eventually sign? Thanks.
@gpetruc, @peruzzim, @swertz this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

mariadalfonso · 2020-11-19T21:58:28Z

-1

There is no consensus on how to flatten the edm objects.
This needs to be discussed widely and planned properly.

A similar request was made for PFCandidates:
#31795

mariadalfonso · 2020-11-19T21:59:43Z

@rovere

kdlong · 2020-11-19T22:58:03Z

@mariadalfonso indeed, I just meant to put this here as an example of a way we could consider going. I should also add that this is just something I cooked up today, it's not something that we are using widely for HGCAL ML trainings yet. I thought it was worth sharing because I think it's cleaner than the disjointed and independent ntuplizers that we are currently using.

silviodonato · 2020-12-07T18:31:28Z

Please push the "Ready for review" button whenever this PR is ready to be reviewed/merged.

cmsbuild · 2021-02-24T07:33:41Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32187/21232

This PR adds an extra 120KB to repository
Found files with invalid states:
- SimDataFormats/PFAnalysis/src/PFParticle.cc:
  - Added: ed71c6a
  - Deleted: 3241f80
- PhysicsTools/NanoAOD/python/nanoHGCML_cff.py:
  - Added: ed71c6a
  - Modified: d5d585c, 0dfc17f, 3241f80
  - Deleted: 5e3ee35
- PhysicsTools/NanoAOD/python/caloParticles_cff.py:
  - Added: ed71c6a
  - Deleted: 0ed9e50
- PhysicsTools/NanoAOD/python/trackSimHits_cff.py:
  - Added: ed71c6a
  - Modified: d5d585c, 4b0f962, 87f45c4
  - Deleted: 0ed9e50
- DPGAnalysis/CommonNanoAOD/python/simClusters_cff.py:
  - Added: 5e3ee35
  - Deleted: 87f45c4
- DPGAnalysis/CommonNanoAOD/plugins/PositionFromDetIDTableProducer.cc:
  - Added: 5e3ee35
  - Deleted: 87f45c4
- PhysicsTools/NanoAOD/python/pfCands_cff.py:
  - Added: ed71c6a
  - Deleted: 5e3ee35
- DPGAnalysis/CommonNanoAOD/python/hgcSimHits_cff.py:
  - Added: 5e3ee35
  - Deleted: 87f45c4
- SimDataFormats/PFAnalysis/interface/PFParticle.h:
  - Added: ed71c6a
  - Deleted: 3241f80
- DPGAnalysis/CommonNanoAOD/python/hgcSimTracks_cff.py:
  - Added: 5e3ee35
  - Deleted: 87f45c4
- SimDataFormats/PFAnalysis/src/classes.h:
  - Added: ed71c6a
  - Deleted: 3241f80
- PhysicsTools/NanoAOD/python/trackingParticles_cff.py:
  - Added: ed71c6a
  - Deleted: 0ed9e50
- PhysicsTools/NanoAOD/plugins/ObjectIndexFromAssociationProducer.cc:
  - Added: ed71c6a
  - Modified: 2f68bdd, 4b0f962, 6bbf9e9
  - Deleted: 0ed9e50
- PhysicsTools/NanoAOD/python/hgcRecHits_cff.py:
  - Added: d5d585c
  - Modified: 4b0f962, 87f45c4
  - Deleted: 0ed9e50
- PhysicsTools/NanoAOD/plugins/ObjectPropertyFromIndexMapProducer.cc:
  - Added: 4b0f962
  - Modified: 6bbf9e9
  - Deleted: 0ed9e50
- DPGAnalysis/CommonNanoAOD/python/hgcRecHits_cff.py:
  - Added: 5e3ee35
  - Deleted: 87f45c4
- DPGAnalysis/CommonNanoAOD/python/nanoHGCML_cff.py:
  - Added: 5e3ee35
  - Deleted: 87f45c4
- PhysicsTools/NanoAOD/python/hgcSimTracks_cff.py:
  - Added: ed71c6a
  - Modified: e0ce707, d5d585c
  - Deleted: 5e3ee35
- SimDataFormats/PFAnalysis/interface/PFParticleFwd.h:
  - Added: ed71c6a
  - Deleted: 3241f80
- CommonTools/RecoAlgos/plugins/SimHitRecHitAssocitionProducer.cc:
  - Added: d5d585c
  - Deleted: 4b0f962
- DPGAnalysis/CommonNanoAOD/plugins/SimHitPositionTableProducer.cc:
  - Added: 87f45c4
  - Modified: 6bbf9e9
  - Deleted: 0ed9e50
- PhysicsTools/NanoAOD/python/tracks_cff.py:
  - Added: d5d585c
  - Deleted: 0ed9e50
- PhysicsTools/NanoAOD/python/hgcSimHits_cff.py:
  - Added: ed71c6a
  - Modified: e0ce707, d5d585c
  - Deleted: 5e3ee35
- SimDataFormats/PFAnalysis/BuildFile.xml:
  - Added: ed71c6a
  - Deleted: 3241f80
- SimDataFormats/PFAnalysis/src/classes_def.xml:
  - Added: ed71c6a
  - Deleted: 3241f80
- PhysicsTools/NanoAOD/plugins/PositionFromDetIDTableProducer.cc:
  - Added: ed71c6a
  - Modified: 2f68bdd
  - Deleted: 5e3ee35
- PhysicsTools/NanoAOD/python/simClusters_cff.py:
  - Added: ed71c6a
  - Modified: e0ce707, d5d585c, 0dfc17f, 4b0f962
  - Deleted: 5e3ee35
There are other open Pull requests which might conflict with changes you have proposed:
- File SimDataFormats/Associations/src/classes_def.xml modified in PR(s): MultiCluster-CaloParticle associator #32941
- File SimGeneral/CaloAnalysis/plugins/CaloTruthAccumulator.cc modified in PR(s): Migrate most of MixingModule and PreMixingModule to EventSetup consumes #31697

cmsbuild · 2021-02-24T07:34:08Z

Pull request #32187 was updated. @SiewYan, @perrotta, @civanch, @gouskos, @mkirsano, @mdhildreth, @cmsbuild, @jpata, @fgolf, @slava77, @alberto-sanchez, @agrohsje, @mariadalfonso, @GurpreetSinghChahal can you please check and sign again.

kdlong · 2021-02-24T16:10:50Z

One important point raised by @rovere in the meeting today is the fact that the associations used here are strictly one to one, whereas the proper match in many cases is OneToManyWithQuality.

There are a few reasons I used one to one here:

- NanoAOD does not support nested lists. In principle you could make a new table and store a flattened list to represent something like RecHit_PFCandIndices, also keeping the row splits, but this is not trivial.
- One to one is much more convenient for visualization.
- For the specific Object Condensation ML algorithm we are using, one to one is appropriate for RecHits --> SimClusters, since we assign clusters based on "representative nodes", which therefore need one cluster assignment.
- edm::Association is way easier to work with than edm::AssociationMap.
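To make the "flattened list plus row splits" idea from the first point concrete, here is a minimal pure-Python sketch. It is only an illustration of the data layout, not code from this PR or from NanoAOD: the function names `flatten_with_row_splits` and `unflatten` and the toy `rechit_to_pfcand_indices` data are hypothetical.

```python
from typing import List, Sequence, Tuple


def flatten_with_row_splits(
    nested: Sequence[Sequence[int]],
) -> Tuple[List[int], List[int]]:
    """Flatten per-object index lists into (values, row_splits).

    row_splits has len(nested) + 1 entries; the indices belonging to
    object i are values[row_splits[i]:row_splits[i + 1]].
    """
    values: List[int] = []
    row_splits: List[int] = [0]
    for row in nested:
        values.extend(row)
        row_splits.append(len(values))
    return values, row_splits


def unflatten(values: Sequence[int], row_splits: Sequence[int]) -> List[List[int]]:
    """Rebuild the nested lists from the flat values and the row splits."""
    return [
        list(values[start:stop])
        for start, stop in zip(row_splits[:-1], row_splits[1:])
    ]


# Toy data (hypothetical): for each RecHit, the indices of all matched PF candidates.
rechit_to_pfcand_indices = [[0], [0, 2], [], [1]]
values, row_splits = flatten_with_row_splits(rechit_to_pfcand_indices)
```

In a flat-table format this would mean two extra columns of different lengths (one entry per match, one entry per RecHit plus one), which is part of why it is less convenient than a single index column.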

The concern raised is that having these maps in CMSSW implies that one to one matching is fully correct. For other ML algorithms, it would likely be important to maintain the mixed association of rechits to simclusters, and it is not fully correct to imply that PF associates a rechit to one candidate. Note, however, that at least in the present implementation, the SimHit and SimTrack to SimCluster associations really are one to one and should be correct. I see a few options to address this:

- We remove all associations from the PR and just have hits. This would kill a lot of the utility of the code for visualization of clustering and of reco algos.
- We keep the one to one matching but make it clearer that the associations are to the best match, e.g., by naming something like RecHitToBestSimClusterMatch.
- We do something in between, and possibly build the RecHit --> SimCluster maps in a way that makes it more explicit that they are a simplification of the fully correct OneToManyWithQuality, e.g., by starting with the OneToMany maps as input to build the associations.
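The last option could look roughly like the following sketch, which reduces a one-to-many map with quality to a single best-match index per RecHit. This is plain Python for illustration only: the `best_match` helper, the dictionary layout, the use of -1 for unmatched hits, and the assumption that a larger quality value means a better match are all hypothetical choices, not the behaviour of the existing associators.

```python
from typing import Dict, List, Tuple

# Hypothetical one-to-many input:
# RecHit index -> list of (SimCluster index, quality) pairs.
OneToManyWithQuality = Dict[int, List[Tuple[int, float]]]


def best_match(one_to_many: OneToManyWithQuality, n_rechits: int) -> List[int]:
    """Return, per RecHit, the SimCluster index with the highest quality.

    RecHits without any match get -1 (a placeholder chosen for this sketch).
    Assumes larger quality means a better match.
    """
    best = [-1] * n_rechits
    for rechit, matches in one_to_many.items():
        if matches:
            best[rechit] = max(matches, key=lambda match: match[1])[0]
    return best


# Toy data (hypothetical): RecHit 2 has no match at all.
toy_map = {0: [(3, 0.9), (5, 0.1)], 1: [(5, 0.6), (3, 0.4)], 3: [(7, 1.0)]}
rechit_best_simcluster = best_match(toy_map, n_rechits=4)
```

Building the one-to-one column this way keeps the one-to-many map as the single source of truth, so the simplification is explicit in the code rather than implied by the map type.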

slava77 · 2021-02-24T16:50:21Z

@cmsbuild please test

cmsbuild · 2021-02-24T19:50:24Z

-1

Failed Tests: RelVals RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-86974f/13067/summary.html
COMMIT: af52b9b
CMSSW: CMSSW_11_3_X_2021-02-23-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/32187/13067/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals

23434.999

----- Begin Fatal Exception 24-Feb-2021 19:10:12 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'PREMIXoutput_step'
   [2] Prefetching for module PoolOutputModule/'PREMIXoutput'
   [3] Calling method for module MixingModule/'mix'
Exception Message:
A std::exception was thrown.
_Map_base::at
----- End Fatal Exception -------------------------------------------------

RelVals-INPUT

- 23434.21: 23434.21_TTbar_14TeV+2026D49PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+DigiTriggerPU+RecoGlobalPU+MiniAODPU/step2_TTbar_14TeV+2026D49PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+DigiTriggerPU+RecoGlobalPU+MiniAODPU.log
- 23434.99: 23434.99_TTbar_14TeV+2026D49PU_PMXS1S2+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+PREMIX_PremixHLBeamSpot14PU+DigiTriggerPU+RecoGlobalPU+HARVESTGlobalPU/step2_TTbar_14TeV+2026D49PU_PMXS1S2+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+PREMIX_PremixHLBeamSpot14PU+DigiTriggerPU+RecoGlobalPU+HARVESTGlobalPU.log
- 23434.999: 23434.999_TTbar_14TeV+2026D49PU_PMXS1S2PR+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+PREMIX_PremixHLBeamSpot14PU+DigiTriggerPU+RecoGlobalPU+HARVESTGlobalPU/step2_TTbar_14TeV+2026D49PU_PMXS1S2PR+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+PREMIX_PremixHLBeamSpot14PU+DigiTriggerPU+RecoGlobalPU+HARVESTGlobalPU.log

slava77 · 2021-02-25T03:06:00Z

DPGAnalysis/CommonNanoAOD/plugins/HitPositionTableProducer.cc

+  void beginRun(const edm::Run&, const edm::EventSetup& iSetup) override {
+    // TODO: check that the geometry exists
+    iSetup.get<CaloGeometryRecord>().get(caloGeom_);
+    rhtools_.setGeometry(*caloGeom_);
+    iSetup.get<TrackerDigiGeometryRecord>().get("idealForDigi", trackGeom_);
+    // Believe this is ideal, but we're not so precise here...
+    iSetup.get<GlobalTrackingGeometryRecord>().get(globalGeom_);
+  }


ES consumes should be used here, based on the issues reported in the static analysis:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-86974f/13067/llvm-analysis/

slava77 · 2021-03-08T13:26:27Z

@kdlong
Please clarify on the status of this PR.
I think that the jenkins tests failed because of the changes in this PR. These should be addressed.
Are there any pending items from the earlier review (if any was already provided by xpog)?

kdlong · 2021-03-08T23:11:30Z

@slava77: Looking at the failure I thought maybe I messed something up with treating pileup in the maps in CaloTruthMerger. Actually I think it probably makes most sense to produce those in a separate producer in any case, so I can restructure this and test further, just haven't had time yet.

On the physics aspects, my preference is to keep the OneToOne associations but possibly add a OneToMany as well. Others should comment, and maybe we should discuss this more within the HGCAL DPG and PF group.

rovere · 2021-03-10T09:07:02Z

Dear all,
I already commented during the X-POG meeting and I will write my comments again here:

- The addition of a new GunProducer should not be part of this PR and should stay in another, stand-alone PR.
- The creation of flat ntuples from nanoAOD for anything that does not involve linking/association between collections is a nice development and falls under XPOG responsibilities.
- The associators proposed are not in line with what HGCAL DPG has been developing over the past years and I'm not in favour of putting them in the release. This will create confusion and would open the door to all possible discrepancies between results produced from the tuples and the validation results obtained within CMSSW. I'd appreciate it, in future, if this kind of development is discussed with the involved party (HGCAL DPG) before submitting a PR.
- The changes proposed to the Truth objects, namely to the CaloTruthAccumulator, have been neither agreed nor discussed with HGCAL DPG. As I said during the X-POG meeting, we need at the very least to understand what the cost is, in terms of memory and disk usage, of having these additional collections produced. This development, again, should not be part of this PR but should be confined to another, dedicated, PR.

rovere · 2021-03-10T09:13:20Z

FYI
@lecriste @felicepantaleo @cseez

kdlong · 2021-03-10T12:11:57Z

Thanks Marco. Note that the PR was marked RFC = request for comment, that is, I made this to be a discussion. It was originally a bare bones overview of an idea, then I updated it with too much stuff. I'm not trying to force things into CMSSW central without involving relevant parties, it was and is meant to be a discussion.

I can remove the associations from the PR if that is the strong preference of everyone. However, they are very important to the visualization setup, so having a solution that we agree on would be useful.

A little more info: the RecHit to PFCand associations are purely based on PFCandidates and are not specific to HGCAL. The OneToOne assumption is very useful to the visualization but should be made with care. An additional OneToMany association could/should be added. I hope the PF group can comment on whether they would be interested in supporting this or not.

The HGC specific associations are the SimCluster ones. A few are not ambiguous, SimTrack --> SimCluster and SimHit --> SimTrack for example. These are valuable for the visualization setup. I can certainly remove this from the CaloTruthAccumulator into a separate producer, though. The RecHit --> SimCluster is naturally oneToMany so the assumption of one to one here is an issue as you highlighted. I understand that the HGCAL DPG has developed an approach to this using associations to layer clusters. I think it's worth considering an implementation of oneToMany matching without the intermediate matching to layer clusters. Indeed it would be good to discuss this in the HGCAL DPG.

felicepantaleo · 2021-03-10T13:04:40Z

@kdlong would you be able to split this into multiple independent RFC PRs?
For the visualization, have you discussed with the CMS visualization team whether they are interested in your tool, and willing to support and maintain it? @alja @osschar
Regarding the simCluster associator, I agree with @rovere that a non-coherent association would lead to a maintenance disaster.
Wouldn't it be easier if you take the oneToMany associator developed in the DPG and apply some cut a posteriori when visualizing?

PFCandidates in the endcap are produced by TICL, so I would recommend having an associator that is coherent if you go through the full chain rechits -> layerclusters -> tracksters -> PFCandidates or rechits directly to PFCandidates. Could you please clarify why you think this is not necessary?

bendavid · 2021-03-10T13:48:32Z

Hi,
A few points:

- Currently it's not so straightforward to store one to many associations using NANOAOD as far as I understand. This could be a motivation for storing the one-to-one association using the best quality association at least until a good general solution could be implemented for nanoaod. (But if a corresponding OneToMany association can be produced at least at the EDM level I agree this may be useful)
- The RecHit to PFCandidate association is indeed already OneToMany for PFClusters in the barrel or for Run1/2/3 given the weights/sharing which is used there, so it would indeed be good to eventually have a general solution to this.
- I would tend to agree with @kdlong that it's useful to have RecHit -> Sim associations which are agnostic to any reco-level clustering.
- @felicepantaleo I'm not sure I understand your last comment. Are you saying that the rechit->PFCandidate associations should be produced and stored passing through layerclusters and tracksters? For the TICL-produced PFCandidates certainly the rechit->layercluster->trackster->pfcandidate chain is what must be used to produce the rechit->pfcandidate associations, but there shouldn't be any issue storing directly the rechit->PFCandidate association right? (This can then be used for future machine-learning algorithms etc, where only the association logic, but not the storage would need to change) Or was the issue here again about oneToOne vs oneToMany associations?

Indeed it sounds like we should discuss this as well in an HGCal DPG meeting to make sure that all of these aspects are fully discussed.

felicepantaleo · 2021-03-10T14:27:58Z

> @felicepantaleo I'm not sure I understand your last comment. Are you saying that the rechit->PFCandidate associations should be produced and stored passing through layerclusters and tracksters? For the TICL-produced PFCandidates certainly the rechit->layercluster->trackster->pfcandidate chain is what must be used to produce the rechit->pfcandidate associations, but there shouldn't be any issue storing directly the rechit->PFCandidate association right? (This can then be used for future machine-learning algorithms etc, where only the association logic, but not the storage would need to change) Or was the issue here again about oneToOne vs oneToMany associations?

@bendavid sorry if my opinion was not clear. Rephrasing: I think that an association done through the two paths (the direct one and the full reco one) should produce the same association map.
For visualization a transformation from oneToMany to oneToOne should be applied in the final consumer of the map.
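As a rough sketch of what "the two paths should produce the same association map" could mean in practice, the check below composes per-stage index maps along the full reco chain and compares the result with a directly produced map. It is a simplification for illustration: it uses single-index (one-to-one) maps with -1 for "no match" rather than the oneToMany maps discussed above, and the `compose` helper and all toy arrays are hypothetical.

```python
from typing import List


def compose(first: List[int], second: List[int]) -> List[int]:
    """Compose two index maps; -1 means 'no match' and is propagated."""
    return [second[index] if index >= 0 else -1 for index in first]


# Toy per-stage maps (hypothetical values):
rechit_to_layercluster = [0, 0, 1, -1, 2]   # RecHit 3 is not clustered
layercluster_to_trackster = [0, 1, 1]
trackster_to_pfcand = [4, 2]

# Path 1: rechits -> layerclusters -> tracksters -> PFCandidates.
via_chain = compose(
    compose(rechit_to_layercluster, layercluster_to_trackster),
    trackster_to_pfcand,
)

# Path 2: a directly produced rechit -> PFCandidate map (toy values).
direct_rechit_to_pfcand = [4, 4, 2, -1, 2]

paths_agree = via_chain == direct_rechit_to_pfcand
```

A comparison of this kind could serve as a validation step independent of how the maps are eventually stored.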

bendavid · 2021-03-10T14:36:03Z

Ok, but in this case the "final consumer of the map" is actually the nanoaod output stage, which cannot currently handle OneToMany in a reasonable way, so the conversion has to be done upstream of that.

kdlong · 2021-03-10T17:48:35Z

@felicepantaleo Yes, good suggestion to separate the PRs, I may not get to this immediately but I will do it in the next week or so.

For sure it's necessary to have coherent results whether you go through TICL or PF. I used the PFCand interface because I wanted this to also apply outside of the endcap and because I'm interested in a generic visualization of reconstruction (e.g., including PFSim and the current PF). Shouldn't TICL and PF give the same answer by construction, if the TICLCand is used to fill the PFCand? Independent of building association maps it should be validated that accessing rechits from the TICL reco chain and from the PFCands give the same result. I am happy to make some checks of this, but it would be very difficult for me to do in a short time scale.

I'm not sure if the visualization group is interested in this, they should comment (relevant presentation here), but based on the feedback I got from presentations and discussion I understood that there is interest from colleagues to use this as a lightweight visualization setup for understanding reconstruction. It would require a decent bit of work to make it as precise and complete as fireworks, for example, but it's lightweight and flexible which is nice for the current use case.

rovere · 2021-03-10T20:17:20Z

@kdlong you keep mentioning a visualization setup, but I see no trace of any visualization code in this or any other open PR. Is this a private tool or is it meant to be a centrally maintained and available tool?

@bendavid @kdlong I fear there is a profound misunderstanding about the association and the way HGCAL DPG thought and implemented it.
As for most of the code developed within the HGCAL DPG, the concepts and usage are documented here.
The building block is always RecHits based, it's only the final (or initial, or both) aggregation stage that can change.
The main concern, again, is that it is based on OneToMany and not OneToOne. I believe the former is the correct one.

Finally, since this is marked as [RFC], I believe we should simply close this PR, partition it along the lines I suggested a couple of comments ago and start having a discussion we should have had before opening this [RFC] PR.

For the future, I believe it would be better to open PRs with a specific scope: in this one we ended up talking about visualization (which is not part of this PR), generator improvements, ntuple creation and associators. While I understand that, in the very end, everything should come together, for integration and discussion purposes, limiting the scope would improve communication and, in the end, the integration process.

bendavid · 2021-03-10T21:57:07Z

Hi,
I suggested that this PR be opened (as it says in the description), based on code that was already written at the time, mainly to promote the idea and provide a concrete example of NANOAOD being used as a format for low level information for analysis, plotting, machine-learning training/validation, etc.

(The visualization use case which has been discussed here is indeed "just" one of the examples that was discussed for code/use cases running on top of the NANOAOD produced in this PR, but the actual visualization is not included in the PR, since as I understand it's "just" some 3d plotting scripts making scatter plots from the information in the NANOAOD-formatted output)

One can discuss the utility of opening an "RFC" pull request, vs sending an email or giving a presentation saying "please comment on this git branch", but given the technical aspects of this work, it's surely useful to have concrete code to comment on.

Now some comments have definitely been collected, and this has been discussed in a few places, but not in an HGCal DPG meeting as has been said, and given the open points/possible remaining misunderstandings this would surely be useful/critical.

While there is evidently some lively discussion and strong opinions about the association, it's also true that as a demonstrator for NANOAOD-for-low-level-detector-and-truth stuff one does need SOME association to maximize its utility, and I hope we can find some reasonable solution here.

slava77

even though chances are that the CommonTools/ code may end up being rewritten, hopefully the following are still useful

slava77 · 2021-02-25T03:08:04Z