Sherpa related workflows get stuck due to a problem with opening an openmpi session #45165
Comments
A new Issue was created by @ArturAkh. @Dr15Jones, @antoniovilela, @makortel, @sextonkennedy, @rappoccio, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here.
assign generator
assign generators
New categories assigned: generators. @alberto-sanchez, @bbilin, @GurpreetSinghChahal, @mkirsano, @menglu21, @SiewYan, you have been requested to review this Pull request/Issue and eventually sign. Thanks.
Ping @cms-sw/generators-l2
@shimashimarin did we also observe this in our recent tests?
Sorry for the late reply. I usually test the Sherpa processes locally or via private production, and I have not encountered such an issue. However, I noticed that OpenMPI is used here. The MPI parallelization mainly speeds up the integration step, i.e. Sherpack generation. Parallelization of event generation can be done simply by starting multiple independent instances of Sherpa. Therefore, I think OpenMPI sessions are not necessary for Sherpa event generation. Maybe we can test the Sherpa event production without using OpenMPI?
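The "multiple instances instead of MPI" approach mentioned above could look roughly like the sketch below. This is an illustrative assumption, not the CMSSW production workflow: the `Sherpa` binary name, the `Run.dat` card, and the `RANDOM_SEED=<n>` command-line override are taken from typical standalone Sherpa usage and may differ in your setup.

```shell
#!/bin/sh
# Hedged sketch: parallelize Sherpa event generation by launching N
# independent instances with distinct seeds, instead of one MPI job.
# SHERPA_BIN, the run card name, and the seed syntax are assumptions.
SHERPA_BIN=${SHERPA_BIN:-Sherpa}
NJOBS=${NJOBS:-4}
for i in $(seq 1 "$NJOBS"); do
  mkdir -p "job_${i}"
  # Each instance runs in its own directory; distinct random seeds
  # keep the generated event samples statistically independent.
  ( cd "job_${i}" && "$SHERPA_BIN" -f ../Run.dat "RANDOM_SEED=${i}" \
      > "sherpa_${i}.log" 2>&1 ) &
done
wait  # block until all instances have finished before merging output
```

Each instance writes into its own `job_<i>` directory, so the per-job logs and event files never collide.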
Just to note that avoiding OpenMPI from Sherpa in CMSSW would also avoid this thread-unsafe workaround: cmssw/GeneratorInterface/SherpaInterface/src/SherpackUtilities.cc, lines 154 to 161 (commit 5587561)
(reported in #46002 (comment)) |
Is anyone looking into avoiding the use of OpenMPI from Sherpa during event production? |
Hi @makortel, I haven't found time to work on it yet, but it seems that we can disable OpenMPI in SherpaHadronizer.cc. I will run some tests and let you know.
Dear all,
At KIT, we have been seeing problems with Sherpa-related workflows on our opportunistic resources (KIT-HoreKa). The jobs hang with a CPU usage of 0%, leading to very low efficiency (below 20%) on the HoreKa resources:
https://grafana-sdm.scc.kit.edu/d/qn-VJhR4k/lrms-monitoring?orgId=1&refresh=15m&var-pool=GridKa+Opportunistic&var-schedd=total&var-location=horeka&viewPanel=98&from=1717406527904&to=1717579327904
After some investigation of the situation, we have figured out the following:
So the entire process is unable to open an OpenMPI session. Even more problematic, the job does not fail cleanly but hangs (i.e. keeps running at 0% efficiency). When running locally, we often see this message in the logs:
Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
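As a quick sanity check on a worker node, one can inspect the usual suspects behind a failed OpenMPI session directory: a missing or unwritable `$TMPDIR`, or a `$TMPDIR` path long enough to overflow the Unix-domain-socket name limit (roughly 100 characters). This is a hedged diagnostic sketch, not an official tool; the 100-character threshold is an approximation of the `sockaddr_un` limit.

```shell
#!/bin/sh
# Hedged diagnostic sketch: check whether $TMPDIR is usable for
# OpenMPI's session directory. The length threshold approximates
# the Unix-socket path limit and is an assumption, not an exact value.
TMP=${TMPDIR:-/tmp}
echo "TMPDIR resolves to: $TMP (${#TMP} characters)"
if [ ! -d "$TMP" ] || [ ! -w "$TMP" ]; then
  echo "WARNING: $TMP is missing or not writable"
fi
if [ "${#TMP}" -gt 100 ]; then
  echo "WARNING: $TMP may be too long for OpenMPI session sockets"
fi
```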
According to our local physics group which had some experience with running Sherpa, this is a known problem.
Resetting the $TMPDIR variable to a different location allowed the process to work properly when run manually. We are not sure, though, whether this is the correct action to take for all worker nodes of an entire (sub)site. We would like to know how to resolve this issue, and whether something needs to be done about the OpenMPI libraries in the CMSSW software stack.
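The manual workaround described above could be wrapped as follows before launching the job. This is a sketch under assumptions: the scratch location is a placeholder you must adapt to your site, and `OMPI_MCA_orte_tmpdir_base` is OpenMPI's own environment knob for the session-directory base (the `orte_tmpdir_base` MCA parameter), set here in addition to `$TMPDIR` for good measure.

```shell
#!/bin/sh
# Hedged sketch of the $TMPDIR workaround: point temporary storage at
# a short, node-local, writable scratch path before starting the job.
# SCRATCH is a placeholder; choose a path appropriate for your nodes.
SCRATCH=${SCRATCH:-$PWD/ompi_tmp}
mkdir -p "$SCRATCH"
export TMPDIR="$SCRATCH"
# OpenMPI also honors its orte_tmpdir_base MCA parameter via the
# environment, which controls where the session directory is created:
export OMPI_MCA_orte_tmpdir_base="$SCRATCH"
echo "TMPDIR now points at: $TMPDIR"
```

The actual job (e.g. cmsRun) would then be started from the same shell so it inherits both variables.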
Best regards,
Artur Gottmann