Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DQM: new DQMStore. #28622

Merged
merged 112 commits into from
Feb 10, 2020
Merged

DQM: new DQMStore. #28622

merged 112 commits into from
Feb 10, 2020

Conversation

schneiml
Copy link
Contributor

@schneiml schneiml commented Dec 13, 2019

PR description:

This PR depends on many previous PR and contains them. Integration will take a while. [Edit: All dependencies are in!. Should be good to go now.]

This PR completely replaces the Core DQM infrastructure. Some things don't change, but are re-implemented. This state was achieved in previous pull requests:

  • DQM Histograms are stored and managed by edm::Service<DQMStore>.

    • Using a service makes things like online DQM and the DQM@HLT protobuf chain easier, since we can ask the DQMStore for a snapshot of the current histograms.
    • Using a service is required to handle job-level products, which we need for multi-run harvesting (and for historical reasons also normal harvesting).
    • Using a service allows legacy modules to work unchanged.
  • In non-legacy modules, all dependencies across the the DQMStore are passed to edm via DQMToken products, that do not contain any data. (As far as possible -- there are no job products, so the job-level dependencies cannot be expressed.)

    • The tokens are divided into 4 generations: DQMGenerationReco, DQMGenerationHarvesting, DQMGenerationQTest, denoted in the instance label.
    • DQMGenerationReco is produced by DQMEDAnalyzers, which consume only "normal" (non-DQM) products.
    • DQMGenerationHarvesting is produced by DQMEDHarvesters, which consume (by default) all DQMGenerationReco tokens. This allows all old code to work without explicitly declaring dependencies. DQMEDHarvesters provide a mechanism to consume more specific tokens, including DQMGenerationHarvesting tokens from other harvesters.
    • DQMGenerationQTest is produced only by the QualityTester module (of which we typically have many instances). The QualityTester has fully configurable dependencies to allow the more complicated setups sometimes required in harvesting, but typically consumes DQMGenerationHarvesting.
    • There is a hidden dependency between end run and end job in harvesting: Many harvesters effectively depend on DQMGenerationQTest products (the QTest results) and do not specify that, but things still work because they do their work in endJob.
  • There are six supported types of DQM modules:

    • DQMEDAnalyzer, based on edm::one::EDProducer. Used for the majority of histogram filling in RECO jobs. Soon to be based on edm::stream.
    • DQMOneEDAnalyzer based on edm::one::EDProducer. Used when begin/end job transitions are required. Can accept more edm::one specific options. Cannot save per-lumi histograms.
    • DQMOneLumiEDAnalyzer based on edm::one::EDProducer. Used when begin/end lumi transitions are needed. Blocks concurrent lumisections.
    • DQMGlobalEDAnalyzer based on edm::global::EDProducer. Used for DQM@HLT and a few random other things. Cannot save per-lumi histograms (this is a conflict with the fact that HLT typically saves only per lumi histograms, see Concurrent Lumis and DQM at HLT #28341).
    • DQMEDHarvester based on edm::one::EDProducer. Used in harvesting jobs to manipulate histograms in lumi, run, and job transitions.
    • edm::EDAnalyzer legacy modules. Can do filling and harvesting. Not safe to use in the presence of concurrent lumisections. Safe for multi-threaded running from the DQM framework side.
  • There are four supported file formats for DQM histograms:

    • DQMIO, TTree based ROOT files. Reading and writing implemented in EDM input and output modules in DQMServices/FwkIO: DQMRootSource and DQMRootOutputModule.
    • Legacy TDirectory based DQM_*.root ROOT files. Only write support, implemented in DQMServices/Core (see DQM: new DQMFileSaver #28588). Read support dropped in this PR, unneeded since references where removed. Only this format supports saving JOB histograms.
    • ProtoBuf streamer, a.k.a. fastHadd files. Implemented by DQMFileSaverPB/DQMFileSaver (see Replace mark-and-delete at next merge algorithm in DQMStore (75x) #11086) and the edm input module DQMProtobufReader, in "streamer" format for HLT/DAQ.
    • MEtoEDM edm based files. Read and written by the Pool I/O modules, after copying DQMStore content to edm products using MEtoEDMConverter.

In this PR, the following features are implemented/modified:

  • Mode-less DQMStore. While there where many different modes before (enableMultiThread, lsBasedMode, forceResetOnLumi, collate, etc.) that could be configured in various ways and changed the behavior of the DQMStore (sometimes fundamentally, making certain features only work in certain modes), the new DQMStore has only a single mode.
    • There are two options to control per-lumi saving (saveByLumi in the DQMStore) and harvesting (reScope in DQMRootSource). Both can be expressed in terms of Scope, see later.
    • Another option, assertLegacySafe, only adds assertions to make sure no operations that would be unsafe in the presence of legacy modules sneak in. It does not affect the behaviour.
    • The verbose option should not affect the behavior, though it has in the past (if only due to race conditions).
  • Data-race free. The new implementation tries very hard to prevent data races even when legacy APIs are used, using fine-grained locking.
    • This does not mean that there are no race conditions. Whenever histograms are read while they are being filled in a multi-threaded context, the results can be dependent on thread interleaving. But, there should be no undefined behavior on C++ level.
    • For safe manipulation of ROOT objects during booking, a callback interface is provided.
    • Currently, there is still only one global IBooker and booking methods cannot run concurrently, but it is easy to change this now (will probably be allowed with the edm::stream modules).
  • Global and local histograms for concurrent lumisections.
    • In the DQM API, we face the conflict that MonitorElement objects are held in the modules (so their life cycle has to match that of the module) but also represent histograms whose life cycle depends the data processed (run and lumi transitions). This caused conflicts since the introduction of multi-threading.
    • The new DQMStore resolves this conflict by representing each monitor element using (at least) two objects: A local MonitorElement, that follows the module life cycle but does not own data, and a global MonitorElement that owns histogram data but does not belong to any module. There may be multiple local MEs for one global ME if multiple modules fill the same histogram (edm::stream or even independent modules). There may be multiple global MEs for the same histogram if there are concurrent lumisections.
    • The live cycle of local MEs is driven by callbacks from each of the module base classes (enterLumi, leaveLumi). For legacy edm::EDAnalyzers, global begin/end run/lumi hooks are used, which only work as long as there are no concurrent lumisections. The local MEs are kept in a set of containers indexed by the moduleID, with special value 0 for all legacy modules and special values for DQMGlobalEDAnalyzers, where the local MEs need to match the life cycle of the runCache (module id + run number).
    • The live cycle of global MEs is driven by the enter/leave lumi calls (indirectly) and the cleanupLumi hook using the edm feature added in Added Service callbacks for Run and Lumi writing #28562, edm::Service callback for lumi/run cleanup #28521 . They are kept in a set of containers indexed by run and lumi number. For RUN MEs, the lumi number is 0; for JOB MEs, run and lumi are zero. The special pair (0,0) is also used for prototypes: Global MEs that are not currently associated to any run or lumi, but can (and have to, for the legacy guarantees) be recycled once a run or lumi starts.
    • If there are no concurrent lumisections, both local and global MEs live for the entire job and are always connected in the same way, which means all legacy interactions continue to work. assertLegacySafe (enabled by default) checks for this condition and crashes the job if it is violated.
  • Harvesting controlled by Scope.
    • In the old DQMStore, the handling of per-job vs. per-run historgrams as well as the handling of per-lumi histograms in online, offline, and harvesting is very confusing.
    • To clarify this situation, we introduce the concept of Scope:
      • The scope of a ME can be one of JOB, RUN, or LUMI.
      • The scope defines how long a global ME should be used before it is saved and replaced with a new histogram.
      • The scope must be set during booking.
    • By default, the scope for MEs is RUN.
      • Code can explicitly use IBooker::setScope() to change the scope to e.g. LUMI. This replaces the old setLumiFlag.
      • When setting saveByLumi option in the DQMStore, the default scope changes to LUMI for all modules that can support per-lumi saving. It could still be manually overridden in code.
      • In harvesting, the default scope is JOB. This works for single-run as well as multi-run harvesting, and emulates the old behavior. Moving to scope RUN for non-multi-run harvesting would be cleaner, but requires bigger changes to existing code.
    • When harvesting, we expect histograms to get merged. This merging can be controlled using a single option in DQMRootSource: reScope. This option sets the finest allowed scope when reading histograms from the input files.
      • When just merging files (e.g. in the DQMIO merge jobs), reScope is set to LUMI. The scope of MEs is not changed, histograms are only merged if a run is split over multiple files.
      • When harvesting, reScope is set to RUN. Now, MEs saved with scope LUMI will be switched to scope RUN and merged. The harvesting modules can observe increasing statistics in the histogram as a run is processed (like in online DQM).
      • For multi-run harvesting, reScope is set to JOB. Now, even RUN histograms are merged. This is the default, since it also works for today's single-run harvesting jobs.
  • In-order harvesting.
    • Harvesting jobs are always processed sequentially, like the data was taken: runs and lumisections are processed in increasing order. This is implemented in DQMRootSource.
  • Thread-safe MonitorElements.
    • There is (effectively) only one type of ME, and it uses locking on a local lock to allow safe access from multiple threads.
    • There are three different namespaces (reco, harvesting, legacy) providing this ME type, however the reco version does not expose some non-thread-safe APIs.
    • The MonitorElement can own the MonitorElementData that holds the actual histogram (that is the case for global MEs), or share it with one ME that owns the data and others that don't (this is the case for local MEs).
    • The MonitorElementData does not provide an API. It is always wrapped in a MonitorElement.
    • The MonitorElement has no state, apart from the pointer to the data and the is_owned flag.
      • Actually, it does have some state in the DQMNet::CoreObject, which is present in the MonitorElement to remain compatible with DQMNet for online DQM. The values there (dir, name, flags, qtests) are to be considered cached copies of the "real" values in MonitorElementData. The dir/name copy in the CoreObject is also used as a set index for the std::sets in the DQMStore, since it remains available even when the MonitorElementData is detached from local MEs (e.g. between lumisections).
    • The MonitorElement still allows access to the underlying ROOT object (getTH1()), but this is unsafe and should be avoided whenever possible. However, there is no universal replacement yet, and DQM framework code uses direct access to the ROOT object in many places.
  • Debugging tools: To understand how a ME is handled, set the process.DQMStore.trackME = cms.untracked.string("<ME Name>") option in the config file.
    - The DQMStore will then log all life cycle events affecting MEs matching this name. This does not include things done to the ME (like filling -- the DQMStore is not involved there), but it does include creation, reset, recycling, and saving of MEs.
    - The matching is a sub-string match on the full path, so it is also possible to watch folders or groups of MEs.
    - For the more difficult cases, it can make sense to put a debug breakpoint (std::raise(SIGNINT)) inside DQMStore::debugTrackME to inspect the stack when a certain ME is created/modified.
    - The previous functionality of logging the caller for all booking calls also still exists and can be enabled by setting process.DQMStore.verbose = 4.

PR validation:

Passes the relevant unit tests and also some runTheMatrixworkflows.

Known broken:

  • Anything with DQMStore::load() (should be unused, but there still is code referring to it, related to references. This can be removed, since references are gone for a while now.)
  • Standard Configuration files mostly still work, but need clean up. The DQMStore does not accept many options any more. Esp. check around DQMRootSource related to single/multi run harvesting.

Other flaws:

  • Some parts of the MonitorElement implementation can be moved around/removed.

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28622/13150

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @schneiml (Marcel Schneider) for master.

It involves the following packages:

CalibTracker/SiStripChannelGain
Calibration/LumiAlCaRecoProducers
DQM/BeamMonitor
DQM/CTPPS
DQM/CastorMonitor
DQM/DTMonitorClient
DQM/DTMonitorModule
DQM/EcalCommon
DQM/EcalPreshowerMonitorModule
DQM/HLTEvF
DQM/HcalCommon
DQM/HcalTasks
DQM/L1TMonitor
DQM/L1TMonitorClient
DQM/Physics
DQM/PixelLumi
DQM/RPCMonitorDigi
DQM/SiPixelMonitorDigi
DQM/SiPixelMonitorRawData
DQM/SiStripMonitorClient
DQM/SiStripMonitorHardware
DQM/SiStripMonitorPedestals
DQM/TrackingMonitor
DQM/TrackingMonitorClient
DQMOffline/CalibCalo
DQMOffline/JetMET
DQMOffline/L1Trigger
DQMOffline/Trigger
DQMServices/Components
DQMServices/Core
DQMServices/Demo
DQMServices/Examples
DQMServices/FileIO
DQMServices/FwkIO
DataFormats/Histograms
HLTrigger/Timer
HLTriggerOffline/Egamma
HLTriggerOffline/Exotica
HLTriggerOffline/Higgs
HLTriggerOffline/Muon
HLTriggerOffline/SUSYBSM
PhysicsTools/NanoAOD
SimTracker/SiPhase2Digitizer
Validation/DTRecHits
Validation/EcalDigis
Validation/EventGenerator
Validation/HGCalValidation
Validation/HcalDigis
Validation/HcalHits
Validation/RecoEgamma
Validation/RecoMuon
Validation/RecoParticleFlow
Validation/RecoTrack
Validation/SiTrackerPhase2V

The following packages do not have a category, yet:

DQMServices/Demo
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@SiewYan, @andrius-k, @schneiml, @kpedro88, @Martin-Grunewald, @rekovic, @fioriNTU, @tlampen, @alberto-sanchez, @pohsun, @santocch, @peruzzim, @civanch, @cmsbuild, @agrohsje, @fwyzard, @smuzaffar, @Dr15Jones, @efeyazgan, @mdhildreth, @jfernan2, @tocheng, @qliphy, @benkrikler, @mkirsano, @kmaeshima, @christopheralanwest, @franzoni, @fgolf can you please review it and eventually sign? Thanks.
@echabert, @rappoccio, @felicepantaleo, @forthommel, @jandrea, @robervalwalsh, @kpedro88, @argiro, @Martin-Grunewald, @bsunanda, @pfs, @ahinzmann, @threus, @mmusich, @seemasharmafnal, @venturia, @acimmino, @mmarionncern, @kreczko, @sethzenz, @apsallid, @makortel, @jhgoh, @lgray, @jdolen, @HuguesBrun, @cericeci, @battibass, @pieterdavid, @rociovilar, @vandreev11, @abbiendi, @barvic, @DryRun, @GiacomoSguazzoni, @tocheng, @VinInn, @cseez, @missirol, @nhanvtran, @gkasieczka, @rovere, @deguio, @schoef, @alesaggio, @idebruyn, @ebrondol, @mtosi, @dgulhan, @clelange, @jdamgov, @hdelanno, @trocino, @jan-kaspar, @gbenelli, @Fedespring, @calderona, @hatakeyamak, @wmtford, @gpetruc, @mariadalfonso, @folguera, @fioriNTU this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@schneiml
Copy link
Contributor Author

please test

It would be surprising if this actually passed.

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 13, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/3967/console Started: 2019/12/13 18:06

@@ -174,6 +173,7 @@ void AlcaBeamMonitor::bookHistograms(DQMStore::IBooker& ibooker, edm::Run const&
}
}
ibooker.setCurrentFolder(monitorName_ + "Service");
auto scope = ibooker.setScope(MonitorElementData::Scope::LUMI);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The better C++ way to do such behavior is to use RAII (Resource Acquisition Is Initialization - resource release is destruction).

{
     DQMScopeContext scope{ ibooker, MonitorElementData::Scope::LUMI };
     theValuesContainer_ = ...
}

at the end of the block scope, the scope object is destroyed and it resets the ibooker scope back to what it started at.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was not exactly how I expected this API to be used, but it turned out to be a common pattern for migrating the existing code. I might add a helper like this.

@cmsbuild
Copy link
Contributor

-1

Tested at: b9acc66

CMSSW: CMSSW_11_1_X_2019-12-13-1100
SCRAM_ARCH: slc7_amd64_gcc820
You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8cba95/3967/summary.html

I found follow errors while testing this PR

Failed tests: ClangBuild

  • Clang:

I found compilation warning while trying to compile with clang. Command used:

USER_CUDA_FLAGS='--expt-relaxed-constexpr' USER_CXXFLAGS='-Wno-register -fsyntax-only' scram build -k -j 32 COMPILER='llvm compile'

See details on the summary page.

@cmsbuild
Copy link
Contributor

Comparison not run due to Build errors/Fireworks only changes/No short matrix requested (RelVals and Igprof tests were also skipped)

@mtosi
Copy link
Contributor

mtosi commented Jun 25, 2020

ciao
sorry for coming to this PR only now, but running the usual script of the tracking efficiency monitoring which makes the stack plot of the efficiency per tracking iteration, we realized that from pre2 to pre3 the stack plot does not make much sense (while comparing the single distributions they seem comparable)

in particular, the resulting stack in 11_1_0_pre2
1110pre2_phase2_efficiency_eta_cum
while in 11_1_0_pre3
1110pre3_phase2_efficiency_eta_cum

if I loop on the single distributions I get the integral and the maximum of each histo and of the updated stack
in pre2
h effic 59.3535502967 90
stack 59.3535502967 0.929356098175
h effic 13.3359933533 90
stack 72.6895434856 0.93662327528
h effic 0.400576615764 90
stack 73.0901202559 0.940172374249
h effic 0.0202829334739 90
stack 73.1104031205 0.940172374249
h effic 0.228987028415 90
stack 73.3393899798 0.944059491158
h effic 0.632008572386 90
stack 73.9713985324 0.946399509907
h effic 0.0 90
stack 73.9713985324 0.946399509907
h effic 0.00213345608063 90
stack 73.973532021 0.946399509907

both the integral and the maximum increase
in pre3, instead

h effic 59.2632414947 90
stack 59.2632414947 0.922642588615
h effic 13.3893704087 90
stack 23.2090102354 0.443937152624
h effic 0.430286794202 90
stack 2.78062788786 0.0663475841284
h effic 0.0170889779401 90
stack 0.122144599902 0.004930596333
h effic 0.234343903809 90
stack 0.110274714562 0.00367779331282
h effic 0.642823928472 90
stack 0.105616362159 0.0040756505914
h effic 0.0 90
stack 0.105616360762 0.0040756505914
h effic 0.00245980154432 90
stack 0.033769877245 0.00149156176485

the stack gets smaller !?!?

so you see that in the latter case something gets wrong in the handling of the histo

do you have any suggestions ?
thanks

@fwyzard

@mtosi
Copy link
Contributor

mtosi commented Jun 26, 2020

I've checked the ROOT version in 11_1_0_pre2 and 11_1_0_pre3, by running
scram tool info root (thanks @fwyzard )
CMSSW_11_1_0_pre2 : 6.18.04-bcolbf
CMSSW_11_1_0_pre3 : 6.18.04-bcolbf2

therefore, I think this issue is not related to the ROOT version
@schneiml @silviodonato @slava77 @perrotta @Dr15Jones @jfernan2 @mmusich : any clue / suggestions ?

@jfernan2
Copy link
Contributor

@mtosi which is "the usual script" you are running?

@mtosi
Copy link
Contributor

mtosi commented Jun 26, 2020

ciao
thanks for the head up !
the script is https://github.com/cms-sw/cmssw/blob/CMSSW_11_1_0_pre2/Validation/RecoTrack/test/publicPlots/plot.py
and in particular the function for making the stacked plot is
https://github.com/cms-sw/cmssw/blob/CMSSW_11_1_0_pre2/Validation/RecoTrack/test/publicPlots/plot.py#L734
which calls
https://github.com/cms-sw/cmssw/blob/CMSSW_11_1_0_pre2/Validation/RecoTrack/test/publicPlots/plot.py#L187

the printout I quoted in my comment comes from that part, playing w/ the h and self._stack
but I ran it in CMSSW_11_1_0_pre6 on DQM files produced by PdmV for the different CMSSW validation
so, I ran the same script using the same setup area on different DQM files (produced by different CMSSW releases)
up to 11_1_0_pre2 I have the wanted stacked plot
starting from 11_1_0_pre3 I have the weird plot
looking at the list of PR merged in 11_1_0_pre3 [1], I though this one might have been the source --but I'm not sure :( --

thanks for your help !

[1]
https://github.com/cms-sw/cmssw/releases/CMSSW_11_1_0_pre3

@schneiml
Copy link
Contributor Author

Hi Mia,

after running the crystal ball for a while (I haven't really tried executing the script yet), maybe this is related: https://github.com/cms-sw/cmssw/pull/28622/files#diff-8e94a26c1f038018d957101ae048c7c5L228-R275

In theory that should make bare ROOT respect the efficiency flag we had in DQM since a long time.

@mtosi
Copy link
Contributor

mtosi commented Jun 26, 2020

thanks @schneiml
that was the other PR, indeed
what should I do ? I'm sorry, but as now it is not clear to me :(
thanks

@schneiml
Copy link
Contributor Author

Try resetting that bit in your Python code. There is a good chance that these plots have the efficiency flag in CMSSW set, and that is saved into the ROOT object now, and that that causes the different behaviour.

Now, if this is actually what we want is not that clear, but for a workaround, un-setting that bit in your plotting code should work (and prove that this is actually the cause, which I am not sure about).

void setEfficiencyFlag() {
auto access = this->accessMut();
if (access.value.object_)
access.value.object_->SetBit(TH1::kIsAverage);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtosi Looks like Github screwed up that link before. This is the relevant line.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, but I'm sorry, I'm still confused
the single histogram used by the stacked plot is an efficiency, right,
but it comes from
https://github.com/cms-sw/cmssw/blob/CMSSW_11_1_0_pre3/Validation/RecoTrack/python/PostProcessorTracker_cfi.py#L21
which is making use of the DQMGenericClient

what am I missing ?
thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forget, I missed your message above
and indeed, I'm trying to rollback this bit, but I'm not able to :(
how should this be done ?
thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

th1.SetBit(ROOT.kIsAverage, 0) or th1.ResetBit(ROOT.kIsAverage) would be my guess in Python. If you have trouble finding the constant, you can just use the value directly (1 << 18).

@mtosi
Copy link
Contributor

mtosi commented Jun 26, 2020

ciao
I can confirm that is the source of the problem,
but not knowing how to rollback that setting,
and experimenting that the Clone() clones even such a setting
I had to pass through a tmp histogram which stores all the relevant information (binning, content, style, etc)
it seems to work, but it is not super clean :(
1110pre3_phase2_efficiency_eta_cum
1110pre2_phase2_efficiency_eta_cum

@schneiml
Copy link
Contributor Author

Ok, maybe consider asking in the ROOT forum if this is the expected behaviour.

We could still roll back this change and go back to our own definition of the efficiency flag if there are systematic problems, but that should be pretty quickly (given that we have a sort-of-final 11_1 now and this code is actually used for large-scale productions now).

@mtosi
Copy link
Contributor

mtosi commented Jun 26, 2020

ciao
about asking ROOT, while looking for info about that setting (I was not aware of it :( ) I think it is working as expected / designed
then, the fact that I cannot unset the TH1 is probably a different thing

why do we move our MEs to make use of that setting is even another point, and you (DQM core) have decided that --I admit my ignorance on the reason :( --
I can image you have done because it supports some needs, even if they are not clear to me --I have to admit--

about having this "feature" in 11_1_0, but a rollback in 11_2 and even 11_1_X I do not see a major problem, because as far as I understood it affects the DQM/Validation only, and I think 11_1_0 will be used for the offline reconstruction of the samples for the HLT TDR

I managed to find a --very ugly-- solution to our issue, therefore I'm not feeling to push towards a rollback,
but I would appreciate if someone could explain to me what we gain w/ this setting, because it is not super clear to me, as now

said that thanks a lot for the support ! it was really helpful in finding the source of the issue

@schneiml
Copy link
Contributor Author

Regarding the "why" of that change: In the past, we had the TH1 and the "efficiency flag" stored separately. That meant anything handling DQM MEs had to manually check the efficiency flag and adapt it's behaviour accordingly (e.g. hAdd in harvesting, fastHadd, reference/overlay in DQMGUI) to handle our efficiency-stored-in-TH1 MEs. (TProfile always did the right thing by default).

According to the docs, setting this bit should make ROOT do the right thing by default, simplifying things downstream (also the flags field of the MEs is completely unused now, which helps e.g. DQMIO storage, which never (correctly) supported storing flags...). However, I took this decision based on the docs, and have not extensively tested which behaviors of ROOT actually change.

Your corner case is odd, because you have an efficiency, but due to how tracking works, summing up efficiencies makes sense in your application. I think it is fair that you have to adjust your code here since in the common case (e.g. summing up Lumisections, or runs, or datasets) the new behaviour is correct: 90% efficiency in 7 runs is 90% efficiency, not 630% efficiency. Only in the special case of tracking, 10% efficiency in each of 7 steps is actually 70% efficiency (and not 0.00001%, which would be another common interpretation...).

@fwyzard
Copy link
Contributor

fwyzard commented Jun 26, 2020

There is nothing special about tracking in these plots - the same behaviour applies whenever you have some kind of "iterative" reconstruction, or the like.

For example, we could make the same plots for the HLT muon efficiency (which use 3 iterations based on outside-in, inside-out, inside-out based on L1), etc.

@mtosi
Copy link
Contributor

mtosi commented Jun 26, 2020

uhmmm, averaging the efficiency through different LSs is different than summing the efficiency from independent categories

@schneiml
Copy link
Contributor Author

uhmmm, averaging the efficiency through different LSs is different than summing the efficiency from independent categories

Yes, that is the fundamental issue. But ROOT only offers one "hadd" (to my knowlegde?) and that one is predominantly used to assemle complete statistics from partial histograms (to my knowledge, again). IMO we should just make sure that this bit can be easily reset to change the behavior to what you need, and according to the docs that should absolutely work, so this feels like a ROOT issue to me.

@mtosi
Copy link
Contributor

mtosi commented Jun 26, 2020

can the DQM core handle this, please ?

@mtosi
Copy link
Contributor

mtosi commented Jun 29, 2020

@pcanal could you please help on this task ?
thanks

@fioriNTU
Copy link
Contributor

@mtosi , for what I can understand, this is a buggy behavior in ROOT, and we don't really want to fix a root bug through CMSSW. I think there are multiple ways to solve your issue, the more straightforward I can see is to compute the efficiiencies of the various steps summing up the numerators of the steps before, and then doing a simple Draw("same") with any need of a THStack. This would require minor changes to the script you linked.

@davidlange6
Copy link
Contributor

davidlange6 commented Jun 29, 2020 via email

@schneiml
Copy link
Contributor Author

@davidlange6 There is not really a bug, more of a feature.

DQM (since ~forever) has a mechanism called "Efficiency Flag", that is used to mark histograms that are not histograms so DQMGUI can handle them correctly.

This PR sets kIsAverage in ROOT if the efficiency flag is set so that ROOT also handles them correctly.

This now breaks a downstream script that relied on TH1::Add on a efficiency TH1 being handled like if it was a histogram.

According to @mtosi it was not that easy to just TH1::ResetBit(kIsAverage), which indicates a bug somewhere on ROOT side.

@mtosi
Copy link
Contributor

mtosi commented Jun 29, 2020

@schneiml SORRY
as I wrote in my previous messages, I was not able to reset the bit, but actually I was not running TH1::ResetBit(kIsAverage) !!!!!
now, I remade the check by using it, and I can confirm that it works --as expected--

sorry the noise, I'm going to make the PR for adding to our script the handling of it

@davidlange6
Copy link
Contributor

davidlange6 commented Jun 29, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.