Fixed destruction of Geant4 simulation #35002

Merged 1 commit into cms-sw:CMSSW_10_6_X on Aug 26, 2021

civanch
Contributor

@civanch civanch commented Aug 24, 2021

PR description:

When an exception is thrown, the real trace of the problem can be shadowed by the destruction of Geant4. This fix is a backport of #34820, which fixes issue #34271.

Should not affect mainstream production.

PR validation:

private

@cmsbuild cmsbuild added this to the CMSSW_10_6_X milestone Aug 24, 2021
@civanch
Contributor Author

civanch commented Aug 24, 2021

please test

@cmsbuild
Contributor

A new Pull Request was created by @civanch (Vladimir Ivantchenko) for CMSSW_10_6_X.

It involves the following packages:

  • SimG4Core/Application (simulation)

@civanch, @mdhildreth can you please review it and eventually sign? Thanks.
@makortel, @cvuosalo, @rovere, @bsunanda, @fabiocos, @slomeo this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-599f65/18007/summary.html
COMMIT: f434cc3
CMSSW: CMSSW_10_6_X_2021-08-22-0000/slc7_amd64_gcc700
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/35002/18007/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 35
  • DQMHistoTests: Total histograms compared: 3215686
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3215351
  • DQMHistoTests: Total skipped: 334
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 34 files compared)
  • Checked 143 log files, 29 edm output root files, 35 DQM output files
  • TriggerResults: no differences found

@civanch
Contributor Author

civanch commented Aug 25, 2021

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_10_6_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_12_1_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Contributor

qliphy commented Aug 26, 2021

+1

@cmsbuild cmsbuild merged commit a4e96c8 into cms-sw:CMSSW_10_6_X Aug 26, 2021
@qliphy
Contributor

qliphy commented Aug 27, 2021

@civanch This PR causes RelVal errors (with multi-threading) in the 10_6_X IB:
https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/relVal/CMSSW_10_6/2021-08-26-1100?selectedArchs=slc7_amd64_gcc700&selectedFlavors=X&selectedStatus=failed

I have tested that it works well with
"runTheMatrix.py -l 4007.0 --job-reports -t 4 --ibeos" under
CMSSW_10_6_X_2021-08-22-2300
but fails with
CMSSW_10_6_X_2021-08-22-2300 + PR 35002

Could you please check?
As we have been asked to make a new 10_6_X asap, this is rather urgent. Or should we revert this PR for the moment?

@perrotta
Contributor

As far as I can read in #34271 (comment), the actual fix was implemented in the geant4 external; what was integrated here is only an "extra protection".
As such, and given that it seems to cause errors in the tests, I think this backport should be reverted in the closed release cycle 10_6.
@civanch what do you think?

@civanch
Contributor Author

civanch commented Aug 27, 2021

@qliphy, @perrotta, I just ran locally on top of CMSSW_10_6_X_2021-08-22-2300. First I ran my private step1 test for 2018, which is fine. Then I updated to PR 35002, and it is fine too.

Then I ran runTheMatrix.py -l 4007.0, which fails at step2; then I reverted 35002, and it fails again.
Then I ran 10824.0 and it is fine. Then I added 35002 again and ran 10824.0: it is fine again.

My conclusion is that, in a few runs of Run2 WFs, there is no problem, while the Run1 WF has a problem independently of this PR. Note that this fix in the master was done for Geant4 10.7, while legacy uses Geant4 10.4.

Because this PR is minor and not important for production, except when it is crashing, it can be reverted. Please go ahead. However, I suspect that there is another problem which affects Run1 WFs.

@qliphy
Contributor

qliphy commented Aug 27, 2021

@civanch

Then I ran runTheMatrix.py -l 4007.0, which fails at step2; then I reverted 35002, and it fails again.
Then I ran 10824.0 and it is fine. Then I added 35002 again and ran 10824.0: it is fine again.

Perhaps you forgot to set up the grid certificate? Otherwise step2 will fail, as it (workflow 4007.0) needs to access pileup inputs.
Once you set up your grid certificate with "voms-proxy-init -voms cms", it should work well with CMSSW_10_6_X_2021-08-22-2300, running either
runTheMatrix.py -l 4007.0
or
runTheMatrix.py -l 4007.0 --job-reports -t 4 --ibeos

Note that IB tests run multi-threaded while PR tests do not.
So you should test with
runTheMatrix.py -l 4007.0 --job-reports -t 4 --ibeos
instead of
runTheMatrix.py -l 4007.0
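
To make the difference between the two invocations explicit, here is a minimal sketch that only builds and prints the command lines; it does not run anything. The helper name `relval_command` is hypothetical and not part of CMSSW, and the sketch assumes that `-t` sets the number of threads, as the commands in this thread suggest. Only the options quoted above are used.

```python
import shlex


def relval_command(workflow, threads=None):
    """Build the runTheMatrix.py argument list for one workflow.

    threads=None gives the plain form used in the PR tests; an integer
    adds the options quoted in this thread for the IB-style
    multi-threaded run. (Hypothetical helper, for illustration only.)
    """
    cmd = ["runTheMatrix.py", "-l", str(workflow)]
    if threads is not None:
        cmd += ["--job-reports", "-t", str(threads), "--ibeos"]
    return cmd


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # A grid proxy is needed first, because step2 of workflow 4007.0
    # reads pileup inputs.
    print("voms-proxy-init -voms cms")
    for threads in (None, 4):
        print(shlex.join(relval_command("4007.0", threads)))
```

Running both forms on the same IB, with and without PR 35002, separates a problem in the PR from a problem in the local setup.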
