Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HLT IBs tests: edmPluginHelp crashing for CkfTrackCandidateMaker plugin #42988

Closed
smuzaffar opened this issue Oct 11, 2023 · 26 comments
Closed

Comments

@smuzaffar
Copy link
Contributor

smuzaffar commented Oct 11, 2023

We have noticed that failure frequency for CMSSW IBs HLT jobs has increased. these jobs run on CERN HTCondor batch nodes. We have started to monitor these jobs and noticed that edmPluginHelp consumes a lot (150+GB RSS) of memory [a]. If I run edmPluginHelp -p CkfTrackCandidateMaker locally then I it segfaults ( looks like it goes in to some recursive loop). Any idea what might have caused this issue?

[a]

PID START   RSS  SIZE    VSZ          CMD
64931 12:42    56  1140  1929            /bin/tcsh ./runAll.csh IB
 5172 12:51   376  1072  1922            /bin/tcsh ./runOne.csh DATA IB
36422 13:31  1292  1072  1922            /bin/tcsh ./runIntegration.csh DATA IB
36535 13:31  1372   812  1025            /bin/bash /pool/condor/dir_57168/jenkins/workspace/ib-run-HLT/CMSSW_13_1_X_2023-10-10-2300/bin/el8_amd64_gcc11/hltIntegrationTests OnLine_HLT_GRun.py -x realData=1 -x globalTag=@ -d HLT_Integration_GRun_DATA -i file:../RelVal_Raw_GRun_DATA.root -n 100 -j 4 -a cpu
36885 13:32  1256   660  1010            /bin/bash /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02806/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-10-08-0000/bin/el8_amd64_gcc11/edmCheckMultithreading hlt.py
36891 13:32   448   660  1010            /bin/bash /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02806/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-10-08-0000/bin/el8_amd64_gcc11/edmCheckMultithreading hlt.py
37463 13:32 62746720 416575124 41670470  edmPluginHelp -p CkfTrackCandidateMaker
37464 13:32   476   660  1010            /bin/bash /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02806/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-10-08-0000/bin/el8_amd64_gcc11/edmCheckMultithreading hlt.py
37465 13:32  1776  1424   590            grep -q not implemented the function
36886 13:32   640   360   484            grep legacy
 5173 12:51   372  1072  1922            /bin/tcsh ./runOne.csh MC IB
45678 13:47  1292  1072  1922            /bin/tcsh ./runIntegration.csh MC IB
45777 13:47  1372   808  1024            /bin/bash /pool/condor/dir_57168/jenkins/workspace/ib-run-HLT/CMSSW_13_1_X_2023-10-10-2300/bin/el8_amd64_gcc11/hltIntegrationTests OnLine_HLT_GRun.py -x realData=0 -x globalTag=@ -d HLT_Integration_GRun_MC -i file:../RelVal_Raw_GRun_MC.root -n 100 -j 4 -a cpu
46096 13:47  1256   660  1010            /bin/bash /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02806/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-10-08-0000/bin/el8_amd64_gcc11/edmCheckMultithreading hlt.py
46102 13:47   448   660  1010            /bin/bash /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02806/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-10-08-0000/bin/el8_amd64_gcc11/edmCheckMultithreading hlt.py
46651 13:47 154348204 281120588 28125016  edmPluginHelp -p CkfTrackCandidateMaker
46652 13:47   472   660  1010            /bin/bash /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02806/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-10-08-0000/bin/el8_amd64_gcc11/edmCheckMultithreading hlt.py
46658 13:47  1780  1424   590            grep -q not implemented the function
46097 13:47   640   360   484            grep legacy
64932 12:42    48   492  2091            /usr/bin/coreutils --coreutils-prog-shebang=tee /usr/bin/tee -a /pool/condor/dir_57168/jenkins/workspace/ib-run-HLT/results/runIB.log

@cmsbuild
Copy link
Contributor

A new Issue was created by @smuzaffar Malik Shahzad Muzaffar.

@makortel, @Dr15Jones, @smuzaffar, @antoniovilela, @sextonkennedy, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@smuzaffar
Copy link
Contributor Author

edmPluginHelp -p CkfTrackCandidateMaker misbehaves for cmssw 13.1.X too

@smuzaffar
Copy link
Contributor Author

assign hlt,reconstruction

@cmsbuild
Copy link
Contributor

New categories assigned: hlt,reconstruction

@Martin-Grunewald,@mmusich,@jfernan2,@mandrenguyen,@missirol you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Martin-Grunewald
Copy link
Contributor

assign core

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Martin-Grunewald
Copy link
Contributor

Martin-Grunewald commented Oct 11, 2023

The HLT menu is large (in terms of modules configured), but querying a single module should not crash.
I assume you tested edmPluginHelp -p CkfTrackCandidateMaker even outside of HLT?
Doing so indeed segfaults for me as well, after putting out a lot of gibberish.

@mmusich
Copy link
Contributor

mmusich commented Oct 11, 2023

I just tested it standalone and it segfaults also in CMSSW_12_3_0.
My bet is that it's coming all the way from #36459 and #36658 (@missirol FYI)

@smuzaffar
Copy link
Contributor Author

smuzaffar commented Oct 11, 2023

edmPluginHelp runs from https://github.com/cms-sw/cmssw/blob/master/FWCore/Utilities/scripts/edmCheckMultithreading#L32-L34 for all the cms EDAnalyzer, EDFilter, EDProducer and OutputModule

@smuzaffar smuzaffar changed the title HLT IBs tests and memory issue HLT IBs tests: edmPluginHelp crashing for CkfTrackCandidateMaker plugin Oct 11, 2023
@Dr15Jones
Copy link
Contributor

On a very quick look, it appears to be a recursive dependencies between plugins (e.g plugin A uses plugin B and then plugin B says it uses plugin A). In that case, edmPluginHelp will keep working its way down the infinite hierarchy.

@Dr15Jones
Copy link
Contributor

I believe the problem is now understood. The edmPluginHelp code assumes the fillDescriptions builds a 'directed acyclic graph' of the dependencies for the PSets. However, the plugin CompositeTrajectoryFilter has a VPSet containing plugins of type TrajectoryFilter from which it also inherits. Since a CompositeTrajectoryFilter can load a CompositeTrajectoryFilter this breaks the acyclic assumption and leads to an infinite recursion while printing out the help and eventually would die with hitting the limit of the stack size on the call stack.

@Dr15Jones
Copy link
Contributor

There appears to be a minor bug in the formatting of the printout from edmPluginHelp. I see

Section 1.1.1.1.1.1.2 CkfBaseTrajectoryFilter Plugin description:
...
minGoodStripCharge
                type: PSet 
                see Section 1.1.1.1.1.1.1
...
Section 1.1.1.1.1.1.1 minGoodStripCharge PSet description:

        value
                        type: double 
                        default: 1620

        Section 1.1.1.1.1.1.3 ClusterShapeTrajectoryFilter Plugin description:

but it should have been

Section 1.1.1.1.1.1.2 CkfBaseTrajectoryFilter Plugin description:
...
minGoodStripCharge
                 type: PSet 
                 see Section 1.1.1.1.1.1.2.1
...
Section 1.1.1.1.1.1.2.1 minGoodStripCharge PSet description:

        value
                        type: double 
                        default: 1620

Section 1.1.1.1.1.1.3 ClusterShapeTrajectoryFilter Plugin description:

@mmusich
Copy link
Contributor

mmusich commented Oct 11, 2023

Setting aside for the moment the particular module CkfTrackCandidateMaker that clearly needs a fix or some other treatment, do we still need this:

edmCheckMultithreading hlt.py | grep legacy
in our HLT integration tests?
In other words is the framework still technically capable of supporting legacy modules ?
Based on #36404 (comment) I would say no.

@makortel
Copy link
Contributor

In other words is the framework still technically capable of supporting legacy modules ?
Based on #36404 (comment) I would say no.

Correct, the support for legacy modules was removed in 13_0_0_pre3.

@Martin-Grunewald
Copy link
Contributor

OK, then we should indeed remove it from our HLT tests!

@makortel
Copy link
Contributor

@wddgit Could you take a look if the machinery used by edmPluginHelp could be made to detect a cycle (and stop in such case), and the minor formatting bug Chris mentioned in #42988 (comment) ?

@wddgit
Copy link
Contributor

wddgit commented Oct 11, 2023

Yes I'll take a look. I'll finish the demonstrator of the delayed end lumi with concurrent lumis issue first and then I'll work on this next (probably start on it tomorrow).

@dan131riley
Copy link

The infinite recursion involves

        Section 1.1.1.1.1.1 PluginDescriptorTrajectoryFilterFactory Plugins description:
        Section 1.1.1.1.1.1.1 ChargeSignificanceTrajectoryFilter Plugin description:
        Section 1.1.1.1.1.1.2 CkfBaseTrajectoryFilter Plugin description:
        Section 1.1.1.1.1.1.3 ClusterShapeTrajectoryFilter Plugin description:
        Section 1.1.1.1.1.1.4 CompositeTrajectoryFilter Plugin description:

@mmusich
Copy link
Contributor

mmusich commented Oct 17, 2023

+hlt

  • HLT IB test problem is circumvented by remove edmCheckMultithreading from hltIntegrationTests #42993 and its backports, by skipping the call to edmPluginHelp
  • a fix in the reconstruction and/or core code to allow edmPluginHelp to work properly for CkfTrackCandidateMaker is still needed (technically outside of HLT responsibilities).

@wddgit
Copy link
Contributor

wddgit commented Oct 17, 2023

I'm actively working on fixing edmPluginHelp today. I'm not sure how long it will take. Hopefully, not too long. Seems like this shouldn't be that hard once I remember how this part of the code works.

@wddgit
Copy link
Contributor

wddgit commented Oct 26, 2023

Pull request #43126 was just submitted which will fix the problem.

@makortel
Copy link
Contributor

makortel commented Nov 1, 2023

+core

#43126 was merged

@makortel
Copy link
Contributor

makortel commented Nov 1, 2023

I don't think there is anything for reconstruction to do

@makortel
Copy link
Contributor

makortel commented Nov 1, 2023

unassign reconstruction

@makortel
Copy link
Contributor

makortel commented Nov 1, 2023

@cmsbuild, please close

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 1, 2023

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants