IB Failure from Workflow 537 (DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7) #34531
assign generators |
New categories assigned: generators @Saptaparna,@mkirsano,@SiewYan,@alberto-sanchez,@agrohsje,@GurpreetSinghChahal you have been requested to review this Pull request/Issue and eventually sign? Thanks |
A new Issue was created by @qliphy Qiang Li. @Dr15Jones, @perrotta, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
related to #34463 @theofil @colizz @Dominic-Stafford please have a look. |
You can reproduce the error locally; yet it works well with a single thread: |
Thanks @qliphy. This LHE hence does not include any systematics; therefore we hit the error when combining this LHE with the other three LHEs. P.S. that thread ran only 2 events. Maybe we are running too few events (10, split as 2, 2, 3, 3 across the four threads), which triggers this error? |
@colizz Thanks! I think you are right. With 2 threads, the events are split 5+5 between the 2 jobs and it works well: runTheMatrix.py -l 537 --job-reports -t 2 --ibeos @cms-sw/pdmv-l2 @smuzaffar Is there some way to set a larger number of events or fewer threads for this specific workflow 537? Or should we set generateConcurrently to false? @theofil |
@qliphy setting generateConcurrently to false could be an acceptable way to proceed for now, unless we find a better way to make it work, e.g. by your suggestion to set more events per thread. |
Thanks to @colizz and @qliphy for analyzing the various symptoms and figuring out the details. To summarize: I can confirm that running with 1 thread is OK, with 4 threads we get the error, and with 2 threads it is again fine. Therefore, although the issue shows up in the merging of the multiple LHEs, the problem seems to reside in the files we try to merge when statistics are low, rather than in the mergeLHE.py script per se. We can proceed either by:
To my mind, the second solution would be easy to implement, but the first would be preferable. Including @cms-sw/pdmv-l2 @smuzaffar @Dominic-Stafford for reference and comment |
@smuzaffar Is there some way to increase event number for a specific workflow in IB test? Somewhere in cms-bot? |
I’d suggest that we don’t want code that is known to blow up if not run on some arbitrary number of events.
|
@qliphy, yes, there are ways to do that, e.g. via Configuration/PyReleaseValidation/python where the workflow is defined. Many workflows override the default event counts. But I agree with @davidlange6 that we should understand the reason and fix the code to behave properly if an unexpected number of events is given |
I think this is more a physics issue than a bug: asking madgraph for only 2 events at NLO can result in the xs being zero or something similar, which causes the issue with calculating the systematics. Maybe @colizz could comment on whether it would be easy to modify mergeLHE.py to skip LHEs with problematic systematics, but otherwise I'm not sure what the proper behaviour would be. In practice I think generating 20 events should be enough to avoid the problem, and I would expect any user to want to generate far more events than this, so could we just increase the number of events in the test to 20, please? |
Yes, I agree this is more a physics issue of madgraph, and extending the event number may be a good solution. I think the message is that we cannot merely request 2 events from this gridpack. Even if we run in one thread, for 2 events the LHE production still fails (though silently). The LHE is then problematic and will cause a buggy EDM-format output. Therefore I would not suggest modifying mergeLHE.py |
At least for me, running 1000 events fails too, so there is a problem to fix that has nothing to do with a small number of processed events. The first error looks to be caused by the thread0 lhe file having its <init> tag always preceded by some whitespace, which the merging script does not expect. Always thread0, and only thread0. That's easy to fix in the parsing. Then the next error is that the header line counts are different. That's really true (by 1000+ lines of difference). Again, thread0 is different from the rest. Anyway, easy to reproduce |
@Dominic-Stafford is right. It is not a bug; we made an unfortunate choice. The systematics module of MG5 prints a summary of the cross section change. Now if at NLO one event has weight +w and the other -w, the cross section is 0.0 and the summary calculation involves dividing by 0. If we pick 3 events per thread the problem is solved. One per thread is also ok; just 2 is bad. |
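The cancellation described above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the actual MG5 systematics.py code; the function name and inputs are hypothetical:

```python
# Sketch: at NLO, event weights can be negative. With exactly two events of
# weights +w and -w, the total cross section is 0.0, so any relative-change
# summary that divides by it blows up.
def summary_ratio(weights, new_weights):
    """Relative cross-section change, as a systematics summary might compute it."""
    xsec = sum(weights)        # +w + (-w) == 0.0 for the bad 2-event case
    new_xsec = sum(new_weights)
    return new_xsec / xsec     # ZeroDivisionError when xsec == 0.0

weights = [0.7, -0.7]          # two NLO events whose weights cancel exactly
try:
    summary_ratio(weights, [0.68, -0.71])
except ZeroDivisionError:
    print("summary fails: total cross section is zero")
```

With 3 events per thread (or any odd number), an exact cancellation of this kind cannot occur, which is why the odd-events suggestion works.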
so how many events does it take to get three per thread? |
(with 4 threads) |
one might think that 4*3+some margin would be sufficient.... |
Hi @davidlange6, it is really the cancellation between two events; with an odd number of events per thread this will not happen, so 4*3 is perfectly fine. |
So is it really just me for whom this workflow crashes with 1000 events and 4 threads, or was this a speculative statement? I'd be happy to have screwed up the configuration somehow. |
Running 4 threads (really 4 streams) over 12 events in no way guarantees 3 events per stream. The events are assigned on a first-come-first-served basis, so it is very easy for the number of events to differ per stream, especially for small numbers of events. |
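A toy illustration of that point (the pull order below is hypothetical; CMSSW's actual stream scheduling is concurrent and nondeterministic):

```python
# Sketch: streams grab events first-come-first-served, so per-stream counts
# are not guaranteed to be equal -- 10 events over 4 streams can land as
# 2/2/3/3, matching the split reported earlier in this thread.
def assign_events(n_events, n_streams, pull_order):
    """pull_order[i] is the stream that happens to grab event i."""
    counts = [0] * n_streams
    for i in range(n_events):
        counts[pull_order[i]] += 1
    return counts

# One possible arrival order for 10 events over 4 streams:
order = [0, 1, 2, 3, 2, 3, 0, 1, 2, 3]
print(assign_events(10, 4, order))  # -> [2, 2, 3, 3]
```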
@davidlange6 do you still have the output line? |
an alternative to a single thread would be modifying the python code in the corresponding gridpack and adding a patch to genproduction for systematics.py. |
Hum, I started over and don't reproduce my problem. Though, that said, this does not give exit code 0
|
Thanks but I was confused. Reading the code
|
@davidlange6 @Dr15Jones @colizz in this case there is no problem, as the systematics module runs in the lhe step, so 3 avoids the problem. However, about first-come-first-served: I see in the logs: |
This is just because Herwig silently skips events which do not pass the merging, so at the end it throws this exception because it has reached the end of the lhe file before it has generated all the events CMSSW asked for. It's related to the fact that the lhe event information doesn't line up with the generated events, and so is on our to-do list, but it is not relevant to this issue |
@Dominic-Stafford ok, that was my understanding. |
@davidlange6 @Dr15Jones @smuzaffar @colizz @agrohsje @Dominic-Stafford would you change the number of events per thread to an odd number, or how do we proceed? |
+generators |
This issue is fully signed and ready to be closed. |
Looks like
could this be causing |
by the way you can download the |
Hi @smuzaffar, thanks for noticing this! Looking into the code, I agree that the real reason for the collapse is the extra spaces you mention.
As a further investigation, the LHE before running systematics https://github.com/cms-sw/genproductions/blob/5b8972e680e5bc0d5ce72e590e3674258ce59389/bin/MadGraph5_aMCatNLO/runcmsgrid_NLO.sh#L137 DOES have these extra spaces before <init> and </init>. After this step, the new LHE has no spaces. Since thread0 has a problem in the systematics step, its resulting LHE file retains these spaces.
Though these spaces won't trigger problems in any normal routine, I think for safety we'll adjust the code. |
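As a sketch of such an adjustment (assumed behavior, not the actual mergeLHE.py code): a strict string comparison misses "<init>" when the line carries leading whitespace, as the thread0 file did, while stripping the line first is tolerant of it:

```python
# Sketch: why leading whitespace breaks a strict tag comparison, and the
# whitespace-tolerant check that avoids the pitfall.
def is_init_tag(line):
    """Whitespace-tolerant check for the LHE <init> opening tag."""
    return line.strip() == "<init>"

thread0_line = "  <init>\n"            # thread0-style line, with extra spaces
strict_match = (thread0_line == "<init>")   # False: the comparison that failed
tolerant_match = is_init_tag(thread0_line)  # True
print(strict_match, tolerant_match)         # -> False True
```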
@smuzaffar: let me add: mcatnlo generates a file called cmsgrid_final.lhe. The systematics module crashes when printing the summary statistics (1/(+w - w) = 1/0), so instead of four copies of the proper file with systematic weights, mergeLHE.py gets the output from systematics three times and the output from mcatnlo once. This is what causes the difference you are reporting. |
Dears, I was trying to figure out why 538.0 is failing in gcc10, and then (today) I noticed the failure for 537.0 in gcc10 is the same as for 538.0. Also, I cannot find where/when 538.0 was added, and there is no record for it before July 16. I have two questions: do you expect the 537.0 failure in gcc10 to be fixed by 34861, and |
Workflows 537 and 538 were added here: #34463. The issue you've reported looks like it comes from ThePEG failing to read in the lhe files, so I wouldn't expect it to be fixed by 34861, though it may just be because we're using old gridpacks, as you suggest; I'll look into this. |
Using older gridpacks raises the question of how the program runs when compiled with gcc900 but not with gcc10.
Thank you. For any external change (assuming you need to patch ThePEG), do not hesitate to tell me, although I noticed you know how to patch and test it yourself |
I've done some tests, and I think this is simply a bug with the pointer not being created in ThePEG for gcc10, so the old gridpacks aren't the problem. I'll email the Herwig authors about this and ask if they can provide a patch |
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_12_0_DEVEL_X_2021-07-16-1100/pyRelValMatrixLogs/run/537.0_DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+HARVESTGEN/step1_DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+HARVESTGEN.log#/972-972
%MSG-w ExternalLHEProducer: ExternalLHEProducer:externalLHEProducer@beginRun 16-Jul-2021 20:37:26 CEST Run: 1
mergeLHE.py is not a relative path. Run it as a shell command.
%MSG
[INFO] >>> launch mergeLHE.py in /data/cmsbld/jenkins/workspace/ib-run-relvals/CMSSW_12_0_DEVEL_X_2021-07-16-1100/pyRelval/537.0_DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+HARVESTGEN
[INFO] >>> Merge 4 files: [thread0/cmsgrid_final.lhe, thread1/cmsgrid_final.lhe, thread2/cmsgrid_final.lhe, thread3/cmsgrid_final.lhe]
[INFO] >>> Write to output: cmsgrid_final.lhe
Traceback (most recent call last):
File "/cvmfs/cms-ib.cern.ch/nweek-02689/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_DEVEL_X_2021-07-16-1100/bin/slc7_amd64_gcc900/mergeLHE.py", line 418, in <module>
main()
File "/cvmfs/cms-ib.cern.ch/nweek-02689/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_DEVEL_X_2021-07-16-1100/bin/slc7_amd64_gcc900/mergeLHE.py", line 414, in main
lhe_merger.merge()
File "/cvmfs/cms-ib.cern.ch/nweek-02689/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_DEVEL_X_2021-07-16-1100/bin/slc7_amd64_gcc900/mergeLHE.py", line 201, in merge
line = next(self._f[i])
StopIteration
----- Begin Fatal Exception 16-Jul-2021 20:37:26 CEST-----------------------
An exception of category 'ExternalLHEProducer' occurred while
[0] Processing global begin Run run: 1
[1] Calling method for module ExternalLHEProducer/'externalLHEProducer'
Exception Message:
Child failed with exit code 1.
----- End Fatal Exception -------------------------------------------------
Another exception was caught while trying to clean up runs after the primary fatal exception.
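For reference, the StopIteration in the traceback above is what Python's next() raises when an iterator is exhausted. A minimal sketch of the failure mode, assuming merge() advances the open input files roughly in lockstep as the traceback suggests:

```python
# Sketch: next() on an already-exhausted iterator raises StopIteration, which
# here propagates out of the merge loop and makes the child exit with code 1.
files = [iter(["<event>\n", "</event>\n"]),  # a normal input file
         iter([])]                           # an input that ended early
try:
    line = next(files[1])  # analogous to the next(self._f[i]) call at line 201
except StopIteration:
    print("one input LHE ended before the others")
```

A merge loop that needs to tolerate short inputs could instead call next(it, None) and treat None as end-of-file for that input.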