EventSetup Records with large payloads #33436

makortel · 2021-04-15T01:43:06Z

Enabling concurrent IOVs risks increasing memory usage, because the payloads for all active IOVs need to be kept in memory as long as events from those IOVs are being processed. One way to limit this memory increase is to disable concurrency for EventSetup Records that have large payloads (and hopefully have long IOVs). The purpose of this issue is to identify such Records.

cmsbuild · 2021-04-15T01:43:27Z

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel · 2021-04-15T01:44:01Z

assign core, alca

cmsbuild · 2021-04-15T01:44:19Z

New categories assigned: core,alca

@Dr15Jones,@smuzaffar,@christopheralanwest,@tlampen,@pohsun,@yuanchao,@makortel,@francescobrivio,@malbouis you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel · 2021-04-15T01:48:54Z

@cms-sw/alca-l2 I can quickly think of

IdealGeometryRecord
SiPixelTemplateDBObjectRcd

as Records that can have large payloads. Could you comment on whether this is correct, and which other such Records we have? (Let's say "large" means more than 10 MB.)

christopheralanwest · 2021-04-15T03:08:10Z

assign db

I think that @ggovi is probably the best person to answer this question.

cmsbuild · 2021-04-15T03:08:30Z

New categories assigned: db

@ggovi you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich · 2021-04-15T08:44:34Z

@makortel

> can quickly think of
>
> IdealGeometryRecord
> SiPixelTemplateDBObjectRcd

These are not even close to the absolute largest, which is the per-pixel Gain Calibration used for offline reconstruction.
Here is a list of the worst offenders (5MB and above) extracted from the last open IOV of all tags in Prompt Reco:

  239M  SiPixelGainCalibrationOfflineRcd,-
   46M  EcalPulseCovariancesRcd,-
   35M  SiPixel2DTemplateDBObjectRcd,numerator
   22M  SiStripPedestalsRcd,-
   21M  DQMReferenceHistogramRootFileRcd,-
   20M  SiStripNoisesRcd,-
   20M  GBRWrapperRcd,PFGCorrectionBar
   20M  GBRWrapperRcd,PFGCorrectionEndHighR9
   20M  GBRWrapperRcd,PFGCorrectionEndLowR9
   20M  GBRWrapperRcd,PFEcalResolution
   13M  GBRWrapperRcd,PFLCCorrection
  9.8M  GBRWrapperRcd,PFLCorrectionBar
  9.8M  GBRWrapperRcd,PFLCorrectionEnd
  9.3M  CSCDBNoiseMatrixRcd,-
  7.3M  GBRWrapperRcd,wgbrph_EBCorrection
  6.7M  GBRDWrapperRcd,gedphoton_EECorrection_50ns
  6.1M  L1MuCSCPtLutRcd,-
  6.0M  SiPixelGainCalibrationForHLTRcd,-
  5.9M  GBRWrapperRcd,wgbrph_EBUncertainty
  5.8M  IdealGeometryRecord,-
  5.6M  GBRDWrapperRcd,gedphoton_EECorrection_25ns
  5.3M  GBRWrapperRcd,PFResolution
  5.3M  GBRWrapperRcd,PFGlobalCorrection

You can find the complete list here:
https://gist.github.com/mmusich/be5cfc4208f7146a333830f11d0a423e
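
For anyone who wants to apply the 10 MB threshold proposed earlier in the thread to a listing in this format, here is a minimal sketch. The helper names `parse_size_mb` and `records_above` are hypothetical, the sample is only a handful of rows copied from the list above, and the interpretation of the `K`/`M`/`G` suffixes as powers of 1024 is an assumption.

```python
def parse_size_mb(text):
    """Convert a size such as '239M' or '9.8M' to megabytes.

    Assumption: the suffix is one of K, M or G and uses powers of 1024.
    """
    units = {"K": 1.0 / 1024, "M": 1.0, "G": 1024.0}
    return float(text[:-1]) * units[text[-1]]


def records_above(listing, threshold_mb):
    """Return (record, label, size in MB) for entries larger than the threshold."""
    selected = []
    for line in listing.splitlines():
        if not line.strip():
            continue  # skip empty lines
        size, key = line.split()
        record, label = key.split(",")
        size_mb = parse_size_mb(size)
        if size_mb > threshold_mb:
            selected.append((record, label, size_mb))
    return selected


# A few rows copied from the list above (not the complete list)
listing = """\
  239M  SiPixelGainCalibrationOfflineRcd,-
   46M  EcalPulseCovariancesRcd,-
   35M  SiPixel2DTemplateDBObjectRcd,numerator
   13M  GBRWrapperRcd,PFLCCorrection
  9.8M  GBRWrapperRcd,PFLCorrectionBar
  5.8M  IdealGeometryRecord,-
"""

large = records_above(listing, threshold_mb=10.0)
for record, label, size_mb in large:
    print("%-40s %-20s %6.1f MB" % (record, label, size_mb))
```

With this sample, IdealGeometryRecord at 5.8M falls below the 10 MB threshold, consistent with the remark above that it is far from the largest payload.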

ggovi · 2021-04-15T08:59:47Z

Thanks Marco for this prompt answer. We then need to decide on the threshold. How will the exclusion list be implemented? Hard-coded or configurable?

ggovi · 2021-04-15T09:03:24Z

Another possibility is to avoid keeping the payloads in memory at all, given that they are all cached permanently in Frontier...

makortel · 2021-04-16T01:51:04Z

> Here is a list of the worst offenders (5MB and above) extracted from the last open IOV of all tags in Prompt Reco:
> ...
> you can find the complete list here:
> https://gist.github.com/mmusich/be5cfc4208f7146a333830f11d0a423e

Thanks @mmusich!

Does the list contain only the payloads in the CondDB? The EventSetup products created within CMSSW contribute to the memory requirement too. Does anyone have a hunch about those, or do they have to be looked for with a profiler?

makortel · 2021-04-16T02:10:56Z

> How will the exclusion list be implemented? Hard-coded or configurable?

The simplest way is to disable the concurrent IOV support for the relevant Records in the C++ code, along the lines of

cmssw/FWCore/Framework/test/Dummy2Record.h

Lines 13 to 16 in 53993f8

    class Dummy2Record : public edm::eventsetup::EventSetupRecordImplementation<Dummy2Record> {
    public:
      static constexpr bool allowConcurrentIOVs_ = false;
    };

The level of concurrency can also be set in the configuration per Record (for those for which the concurrency is not disabled), along the lines of

process.options.eventSetup = cms.untracked.PSet(
    numberOfConcurrentIOVs = cms.untracked.uint32(2), # default concurrency
    forceNumberOfConcurrentIOVs = cms.untracked.PSet(
        SiPixelGainCalibrationOfflineRcd = cms.untracked.uint32(1),
        EcalPulseCovariancesRcd = cms.untracked.uint32(1),
        ...
    )
)

I believe hardcoding would be good enough to get most of the threading efficiency benefits (also, I can't think of a natural place for a configuration that would automatically propagate to all applications).
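
To make the interplay of the two mechanisms concrete, here is a toy model in plain Python. It is only a sketch of the behaviour as described in this comment, not the framework's implementation: the function `effective_concurrency` and the dictionaries are hypothetical, and the assumption is that a Record with `allowConcurrentIOVs_ = false` always gets one IOV, a Record listed in `forceNumberOfConcurrentIOVs` gets the forced value, and everything else gets `numberOfConcurrentIOVs`.

```python
def effective_concurrency(record, allows_concurrent_iovs, default, forced):
    """Toy model of the number of concurrent IOVs a Record would get.

    allows_concurrent_iovs: record name -> bool, standing in for the
        hardcoded allowConcurrentIOVs_ flag (assumed True when absent).
    default: stands in for numberOfConcurrentIOVs.
    forced: record name -> int, standing in for forceNumberOfConcurrentIOVs.
    """
    if not allows_concurrent_iovs.get(record, True):
        return 1  # concurrency disabled in the C++ code
    return forced.get(record, default)


# Values taken from the snippets above; BeamSpotObjectsRcd is just an
# example of a Record that appears in neither list.
allows = {"Dummy2Record": False}
forced = {
    "SiPixelGainCalibrationOfflineRcd": 1,
    "EcalPulseCovariancesRcd": 1,
}
default = 2

for name in ["Dummy2Record", "SiPixelGainCalibrationOfflineRcd", "BeamSpotObjectsRcd"]:
    print(name, effective_concurrency(name, allows, default, forced))
```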

makortel · 2021-04-16T02:14:11Z

> Another possibility is to avoid keeping the payloads in memory at all, given that they are all cached permanently in Frontier...

I probably misunderstood, but I believe requesting the payloads from Frontier on each event would (significantly) decrease the event processing throughput.

mmusich · 2021-04-16T07:04:56Z

> Does the list contain only the payloads in the CondDB?

Correct.

> The EventSetup products created within CMSSW contribute to the memory requirement too. Does anyone have a hunch about those, or do they have to be looked for with a profiler?

I am wondering if it would be possible to get the data of the (non-persisted) Records by modifying this?
Otherwise yes, I think it needs to be profiled.

makortel · 2021-04-16T15:42:52Z

If I got it right (from condDbBrowser), the largest payload with a non-Run IOV is EcalPedestalsRcd (with a time IOV), with a size of 2.3 MB. So the maximum possible memory increase is not necessarily that large in practice (currently the framework synchronizes at Run boundaries anyway, and AFAIK we don't really have jobs processing multiple Runs).

Actually, how easy would it be to get a list of tags that have non-run IOVs?

mmusich · 2021-04-16T15:54:16Z

> Actually, how easy would it be to get a list of tags that have non-run IOVs?

Straightforward.
These are the only Records in Prompt Reco with non-Run IOVs:

| Record | Label | Tag | Time Type | Synchronization |
| ------ | ----- | --- | --------- | --------------- |
| BeamSpotObjectsRcd | - | BeamSpotObjects_PCL_byLumi_v0_prompt | Lumi | pcl |
| DTHVStatusRcd | - | DTHVStatus_V05_hlt | Time | express |
| DTKeyedConfigContainerRcd | - | DTKeyedConfig_V06_hlt | Hash | hlt |
| EcalLaserAPDPNRatiosRcd | - | EcalLaserAPDPNRatios_prompt_v2 | Time | pcl |
| LHCInfoRcd | - | LHCInfoEndFill_prompt_v2 | Time | prompt |
| LumiCorrectionsRcd | - | LumiPCC_Corrections_prompt | Lumi | pcl |
| SiPixelQualityFromDbRcd | - | SiPixelQuality_byPCL_prompt_v2 | Lumi | pcl |
| SiStripDetVOffRcd | - | SiStripDetVOff_v6_prompt | Time | prompt |

For the record, one can get it with:

from __future__ import print_function  # lets the print() calls below also run under Python 2

import CondCore.Utilities.conddblib as conddb

# Connect to the production conditions database
con = conddb.connect(url = conddb.make_url("pro"))
session = con.session()
TAG     = session.get_dbtype(conddb.Tag)
GTMAP   = session.get_dbtype(conddb.GlobalTagMap)

# All (record, label, tag) entries of the Prompt Reco Global Tag
GTMap = session.query(GTMAP.record, GTMAP.label, GTMAP.tag_name).\
        filter(GTMAP.global_tag_name == "112X_dataRun3_Prompt_v5").\
        order_by(GTMAP.record, GTMAP.label).\
        all()

print("| Record | Label | Tag | Time Type | Synchronization |")
print("| ------ | ----- | --- | --------- | --------------- |")
for record, label, tag in GTMap:
    # Look up the synchronization and time type of the tag
    synchronization, time_type = session.query(TAG.synchronization, TAG.time_type).filter(TAG.name == tag).all()[0]
    # Keep only tags whose IOVs are not Run-based
    if time_type != "Run":
        print("|", record, "|", label, "|", tag, "|", time_type, "|", synchronization, "|")

makortel · 2021-04-16T18:03:59Z

Thanks @mmusich! Correlating those to your earlier list gives

BeamSpotObjectsRcd: 64 kB
DTHVStatusRcd: 176 kB
DTKeyedConfigContainerRcd: 64 kB
EcalLaserAPDPNRatiosRcd: 1.2 MB
LHCInfoRcd: 100 kB
LumiCorrectionsRcd: 64 kB
SiPixelQualityFromDbRcd: 64 kB
SiStripDetVOffRcd: 136 kB

so ~1.9 MB in total. That alone sounds like something I'd expect us to be able to live with (i.e. at most a 2 MB memory increase per job during any IOV transition period). This number still misses all the ESProducts constructed within the job, but I'd imagine even a factor of 10 increase to be tolerable.
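
The arithmetic behind the ~1.9 MB figure can be reproduced with a short sketch. The sizes are the ones listed above; the function `extra_memory_mb` is hypothetical, 1.2 MB is taken as 1200 kB, and the model assumes that each additional concurrent IOV costs at most one extra copy of every payload.

```python
# Payload sizes in kB, as listed above (1.2 MB taken as 1200 kB)
payload_kb = {
    "BeamSpotObjectsRcd": 64,
    "DTHVStatusRcd": 176,
    "DTKeyedConfigContainerRcd": 64,
    "EcalLaserAPDPNRatiosRcd": 1200,
    "LHCInfoRcd": 100,
    "LumiCorrectionsRcd": 64,
    "SiPixelQualityFromDbRcd": 64,
    "SiStripDetVOffRcd": 136,
}


def extra_memory_mb(payload_kb, concurrent_iovs=2):
    """Upper bound on the extra memory (MB) from holding several IOVs at once.

    Assumption: every Record holds (concurrent_iovs - 1) additional copies
    of its payload during an IOV transition.
    """
    extra_copies = concurrent_iovs - 1
    return extra_copies * sum(payload_kb.values()) / 1000.0


total_mb = extra_memory_mb(payload_kb)
print("DB payloads only: %.1f MB" % total_mb)
print("with a factor of 10 for transient ESProducts: %.0f MB" % (10 * total_mb))
```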

makortel · 2021-04-26T18:43:33Z

Given that the largest possible increase from DB payloads would be around 2 MB, and that the transient ESProducts in non-Run IOV Records are unlikely to be (many) orders of magnitude larger, we could enable concurrent IOVs by default (when concurrent lumis are enabled), and deal with possible problems if they arise.

makortel · 2021-04-26T18:43:37Z

+1


cmsbuild added the pending-assignment label Apr 15, 2021

cmsbuild added alca-pending core-pending pending-signatures and removed pending-assignment labels Apr 15, 2021

makortel mentioned this issue Apr 15, 2021

Ask AlCa which records have big payloads etc cms-sw/framework-team#117

Closed

cmsbuild added the db-pending label Apr 15, 2021

makortel closed this as completed Apr 26, 2021

cmsbuild added core-approved and removed core-pending labels Apr 26, 2021

makortel mentioned this issue May 5, 2021

Disable concurrent IOVs for Records that have large payloads cms-sw/framework-team#118

Closed

makortel added a commit to makortel/cmssw that referenced this issue Mar 30, 2022

Enable concurrent IOVs by default in ConfigBuilder

054a62f

I was supposed to do this at the same time as cms-sw#35302 that followed cms-sw#34231 and the conclusion in cms-sw#33436

makortel mentioned this issue Mar 30, 2022

Enable concurrent IOVs by default in ConfigBuilder #37419

Merged
