-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CLANG_X] Segmentation violation in ThePEG::EventGenerator::doinit #45872
Comments
assign generators |
cms-bot internal usage |
A new Issue was created by @iarspider. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
@Dominic-Stafford @theofil can you please take a look? It seems related to Herwig. Thanks! |
I'll try to have a look next week (unless @theofil has time before then). @iarspider has the CLANG/C++ version or anything else like this changed which might have triggered the issue? |
I don't think so. |
Hi, I'll will also try to look at it next week starting on Tuesday. a) Could we get some instructions how to reproduce the problem in an lxplus session ? best, |
The recipe to reproduce would be along cmssw-el8
cmsrel CMSSW_14_2_CLANG_X_2024-09-05-2300
cd CMSSW_14_2_CLANG_X_2024-09-05-2300/src
cmsenv
runTheMatrix.py -l 535.0 -t 4 |
In case there is a connection to #45510 (as wondered in #45510 (comment)), we updated the C++ standard from 17 to 20 around Jul 12 (cms-sw/cmsdist#9288), and #45510 was opened on Jul 19. But I guess we have lost the information on whether the problem reported in #45510 started on 07-18-2300 IB or earlier (or on the first occurrence of the problem reported in this issue). |
@makortel thanks for the code. indeed the test is passed for CMSSW_14_2_X_2024-09-06-1100 but fails for CMSSW_14_2_CLANG_X_2024-09-05-2300 but both IBs have clang version 18.1.6, at least this is what I get from ideas ? |
The only difference between the default IB and the CLANG IB is that in the default IB the CMSSW code is compiled with Some thoughts
|
I've had a bit of a look at this, and found very weirdly that I can reproduce it locally when I do
That seems to be that Herwig has crashed due to some issue reading the LHE file (though I'm not exactly sure why as the lhe file is the same as the one that ran without issues in a non-CLANG build), and then we keep going despite not having produced the run file, causing a seg fault). For this new issue I'd definitely lay the blame at the fact we skip over any errors from Herwig here -it may or may not be the root cause of the full issue, but it certainly makes it harder to debug. We've kept this block for a long time as it's supposed to get around an issue with Herwig being called before the externalLHEProducer, but I think we should really get rid of it, as running past errors where Herwig would just exit could be the root of what we're seeing here, then deal with the issue with the sequence of calls if it's still occuring. I probably won't have time to try this in the next few days, so if you have time @theofil that would be good, otherwise I'll try to by the end of the week. |
I have checked our opensearch and found that workflow |
I haven't yet found the origin of the problem, but I can reply to this question:
yes @smuzaffar thanks a lot for the info. I see that the could relevant to the crash we see. I will try to have a look if this is really where the problem starts. Would replacing the Apart from the software changes we see in the are there other differences between the 2 IBs for what concerns their builds ? |
As @makortel mentioned we also have updated c++ standard (to c++20) for July 12th IB. By the way, build herwig7 and sherpa in debug mode, I get this stacktrace for workflow 535/step1
|
Note that all these failing workflows ( |
failed for both |
I had very little progress so far unfortunately. I compiled two versions of Herwig under
and run standalone Herwig MC generation, checking if we can generate simple processes without reading external LHE files. We cannot generate any event in While compiling the code in the two releases, I see many warnings regarding to the ThePEG regarding arithmetic operations that I was not used to see before, but that all seem innocent in the CMSSW_14_2_X_2024-09-06-1100 warnings case. However in the CMSSW_14_2_CLANG_X_2024-09-05-2300.txt warnings we see for fist time warning regarding the creation of the which later appears in the crash messages. This to me confirms that the problem is not related with the CMSSW HerwigInterface and there is not much we can do there, but rather with the external package ThePEG, which is needed by Herwig generator. Is there a reason why we build CMSSW with clang while the external packages, like ThePEG are still built with gcc ? Is it sound to use the same binary of the ThePEG in the two cases ? Is it possible to try to have the ThePEG built also with clang instead of gcc when CMSSW is built with clang ? |
Hi, I've had a bit more of a look at this. The first thing to note is it's happening also for standalone Herwig (no external LHE), and there it's happening consistently, so the block in the interface I mentioned earlier definitely isn't to blame. Running with valgrind, it gives the same segfault, but before that it gives a warning about |
The |
This is a compilation warning about EventInfoBase & operator=(const EventInfoBase &) = delete; so the warning is effectively asking to add an accompanying EventInfoBase (const EventInfoBase &) = default; I'd suggest to define the move constructor and assignment operator as well. I don't see how these warnings could affect the generated binaries though. On the other hand, some other questions came to my mind. @smuzaffar Did we rebuild all/relevant the externals with C++20? (or did we inherit those from the CPP20 IB flavor "for free"?) On a quick look it also seems to me we are not passing the C++ standard to ThePEG. Should we? |
we might have picked them from CPP20 IB flavor but by now most of externals should have been rebuilt as we have change a lot base level packages which should have re-trigger the build of these externals.
I can open a cmsdist PR to test it |
Failing CLANG IB relvals passed when ThePeg and Herwig7 were built with [a]
|
I have a feeling this discovery suggests we should, in principle, compile all CMSSW and externals with as close compilation flags as possible. |
I think it's really weird that c++20 vs c++1x changes this behavior. To me this smells like undefined behavior somewhere the initialization process. |
We have some code in headers that does one thing for one C++ version, and a different thing for an earlier C++ version. I wouldn't be surprised that other libraries, including |
I do see some compilation warnings [a] when we build [a]
|
Most of those warnings are quite scary... |
Thanks for getting to the bottom of this @smuzaffar. I'll check I can reproduce these errors in a stand-alone install and if so contact the Herwig authors about these warnings - I think at least some of them may have been resolved in more recent releases, maybe we can get them to give us a patch to fix them in 7.2. For reference, which gcc version did you use to build cmsdist? |
I used our default gcc i.e |
RelVals 535.0, 537.0, 538.0 failed with SIGSEGV in
ThePEG::EventGenerator::doinit
:The text was updated successfully, but these errors were encountered: