Multiple failures in NONLTO, CLANG and ASAN Unit Tests and RelVals due to PluginNotFound
#44821
A new Issue was created by @aandvalenzuela. @smuzaffar, @makortel, @rappoccio, @antoniovilela, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
The duplicate dictionary checker for those failing IBs says the
|
at the end of build phase we do run
|
https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_14_1_X/master/scram-project-build.file#L228 is where we run |
one can reproduce the crash on cmsdev4X nodes by starting
|
note that |
I wonder if we have a 'one definition violation'. Maybe valgrind could spot a problem? |
Running valgrind only showed "Invalid read of size 8" in the But that destructor comes from |
assign heterogeneous |
(jumping into the rabbit hole with @Dr15Jones) So the
Our
and seems to be the only CMSSW shared object having the One thing to note on the ROCm setup is that (as far as I can tell) we are taking the binaries from AMD's RHEL8 RPMs. I would assume those were built with the system GCC against the system libstdc++, that seem to be 8 (or at least |
The ROCm libraries get loaded by |
The
|
Disassembling things, the instructions of It seems like we have an ODR violation from trying to mix libraries that were built with (very) different versions of libstdc++, and thus if we need to keep the rocprofiler, we'd have to build it ourselves. |
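One rough way to compare which libstdc++ generation two shared objects were linked against is to scan them for the `GLIBCXX_x.y.z` symbol-version tags that the linker records, much like `strings lib.so | grep GLIBCXX`. The sketch below does that in Python; the function names are illustrative and not part of CMSSW or ROCm. It only hints at a version gap between libraries: it cannot detect an ODR violation by itself, since inlined or template code leaves no such tag.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# Matches symbol-version tags such as GLIBCXX_3.4.30 embedded in an ELF file.
# Names like GLIBCXX_DEBUG_MESSAGE_LENGTH are ignored because no digit follows.
_TAG = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)*)")


def glibcxx_versions(data: bytes) -> List[Tuple[int, ...]]:
    """Return the sorted, de-duplicated GLIBCXX version tags found in raw bytes."""
    found = {
        tuple(int(part) for part in match.group(1).split(b"."))
        for match in _TAG.finditer(data)
    }
    return sorted(found)


def newest_glibcxx(path: str) -> Optional[Tuple[int, ...]]:
    """Return the highest GLIBCXX tag referenced by the file, or None if absent."""
    versions = glibcxx_versions(Path(path).read_bytes())
    return versions[-1] if versions else None
```

Running `newest_glibcxx` on a vendor-provided library and on a library from the CMSSW toolchain, and comparing the two results, gives a quick first indication of whether they were built against very different libstdc++ releases.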
I'm not particularly interested in keeping rocprofiler (and in fact we did not have it until now). Unfortunately it seems to be a dependency of
|
I have opened cms-sw/cmsdist#9153 and #44824 to revert the ROCm update |
Adding here cms-sw/cmsdist#9143 (comment)
The trend continued: in CMSSW_14_1_X_2024-04-23-2300 the NONLTO and CLANG IBs failed, but none of the others. |
#44838 fixes |
thanks @makortel , I have tested it for NONLTO and confirm that
|
For LTO builds (where dd4hep is also built with LTO flags)
So maybe that is why the LTO-enabled IBs are not failing. |
I got a suggestion for a possible workaround from an AMD expert: can we |
As far as I can tell, GCC 12 does not provide any shared object that would provide |
I see, these are all good points.
I guess the only option is to build ROCm from the sources.
|
As a temporary workaround it might be enough to build a stub library to replace This seems to work to build CMSSW with ROCm 6.1.2:
|
Note: one reason to upgrade ROCm is that the current version of the kernel drivers, 6.2.x, is only compatible with ROCm 6.0.x and newer. Anecdotally, running with ROCm 5.6.x on the 6.2.x driver frequently hangs :-( |
Sounds to me like trying out our own stub library could be less painful than figuring out how to build the whole ROCm stack from the sources. Of course only time will tell how painful the maintenance of the stub library would be. |
Since the problem itself was worked around by downgrading ROCm, how about we close this issue (which is mostly about the problem), and continue the ROCm discussion either here or in another issue? |
OK for me. |
+1 |
please close |
This issue is fully signed and ready to be closed. |
I have opened cms-sw/cmsdist#9493 |
Hello,
There are multiple failures in NONLTO, CLANG and ASAN IBs (both in Unit Tests and RelVals) in the latest IBs (CMSSW_14_1_[FLAVOR]_X_2024-04-22-2300) reporting:
There are other variants of the exception, for example:
CondCore/SiPixelPlugins:
CondCore/CondDB:
I am not sure if it is related, but we had a ROCm update yesterday in #44777 and ROCm device builds fine (see log).
However, there was a similar issue in the past, reported at cmssw#40680 and related to a ROCm update, in which the missing plugins were not properly registered in the .edmplugincache file.
Thanks,
Andrea
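A quick way to check the plugin-registration hypothesis is to look for the expected plugin names in the cache file. The sketch below assumes each line of the cache holds three whitespace-separated fields (library file, plugin name, category) with spaces inside a field encoded as `%`; that layout is an assumption to verify against the actual file, and the function names and sample entries are hypothetical.

```python
def parse_plugin_cache(text):
    """Parse cache text into a list of (library, plugin, category) tuples.

    Assumed layout: three whitespace-separated fields per line, with '%'
    standing in for spaces inside a field. Other lines are skipped.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # blank or malformed line
        library, plugin, category = (field.replace("%", " ") for field in fields)
        entries.append((library, plugin, category))
    return entries


def missing_plugins(text, expected):
    """Return the expected plugin names that the cache text does not register."""
    registered = {plugin for _, plugin, _ in parse_plugin_cache(text)}
    return sorted(set(expected) - registered)
```

Feeding it the contents of the cache file together with the plugin names quoted in the exceptions would show whether those plugins are absent from the cache or registered but failing to load.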