Fix race condition in DAQ modules when exception is thrown in event processing (only affecting multithreading) - 76X #12201
…g, with another thread already requesting the next event from the source. The source can then open the next LS (internally) and report the number of events in the past LS to the FastMonitoringService. In this case, the preEndLumi triggered by the exception can run later than the source report, and the exception check was then (incorrectly) being skipped.
A new Pull Request was created by @smorovic (Srecko Morovic) for CMSSW_7_6_X.
Fix race condition in DAQ modules when exception is thrown in event processing (only affecting multithreading) - 76X
It involves the following packages:
EventFilter/Utilities
@mommsen, @cvuosalo, @cmsbuild, @emeschi, @slava77 can you please review it and eventually sign? Thanks.
@cmsbuild please test
The tests are being triggered in jenkins.
-1
runTheMatrix-results/25.0_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT/step3_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT.log
----- Begin Fatal Exception 30-Oct-2015 14:12:45 CET-----------------------
An exception of category 'FileFlushError' occurred while
[0] Calling File::flush()
Exception Message:
fdatasync() failed with system error 'Disk quota exceeded' (error code 122)
----- End Fatal Exception -------------------------------------------------
you can see the results of the tests here:
@cmsbuild please test
The tests are being triggered in jenkins.
+1
Fixing rare multi-threading race condition in event processing by DAQ modules. There should be no change in monitored quantities. #12200 is the 75X version of this PR, and it has already been approved by Reco. The code changes are satisfactory, and Jenkins tests against baseline CMSSW_7_6_X_2015-10-30-1100 show no significant differences, as expected.
+1
This pull request is fully signed and it will be integrated in one of the next CMSSW_7_6_X IBs (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @Degano, @smuzaffar
I am cleaning up the 76x queue aside from things for analysis workflows. I'm closing this pull request, please make sure the PR is in 80x. Thanks!
Port of #12200 (75X).
A rare race condition occurs when an exception is thrown while processing the last few events in a file and LS (lumisection). In this case, another thread may already have requested the next event from the source. If that event belongs to the next LS, the input source reports the total number of events in the previous LS to the FastMonitoringService.
Normally, when an exception occurs, we skip writing the JSON stream output (by catching the exception action callback in the FastMonitoringService), and hltd subsequently assigns the missing events as error events to close the micro-merge of that LS. However, this suppression did not happen once the input source had already reported the total number of events to the FastMonitoringService, which led to an incomplete micro-merge for some streams. The problem is present only in multithreading: in single-threaded mode the source cannot get a request for the next event before the exception on the currently processed event is thrown (i.e. event requests are aborted and the run/LS get closed).
In this update, JSON output is suppressed whenever an exception has been thrown, regardless of the input source report.
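For illustration only, the decision described above can be modelled roughly as follows. This is a simplified sketch, not the actual EventFilter/Utilities code: the class `LumiBookkeeper` and all of its method names are hypothetical, and the "old" check is only an approximation of the behaviour described in this PR (the exception check being skipped once the source has reported the LS total).

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <mutex>

// Hypothetical stand-in for per-lumisection bookkeeping.
// It is NOT the real FastMonitoringService interface.
class LumiBookkeeper {
public:
  // Would be called from the exception-action callback, on any thread.
  void onException() { exceptionSeen_.store(true); }

  // Would be called by the input source once it has opened the next LS
  // and therefore knows the total number of events in LS `ls`.
  void reportEventsInLumi(unsigned ls, unsigned nEvents) {
    assert(ls > 0);  // assumption of this sketch: LS numbering starts at 1
    std::lock_guard<std::mutex> guard(mutex_);
    sourceReport_[ls] = nEvents;
  }

  // Approximation of the pre-fix logic: a source report for this LS
  // short-circuits the decision, so the exception flag is never consulted.
  bool shouldWriteJsonOld(unsigned ls) const {
    std::lock_guard<std::mutex> guard(mutex_);
    if (sourceReport_.count(ls) != 0) {
      return true;
    }
    return !exceptionSeen_.load();
  }

  // Post-fix logic: an exception always suppresses JSON output,
  // whether or not the source has already reported this LS.
  bool shouldWriteJson(unsigned /*ls*/) const {
    return !exceptionSeen_.load();
  }

private:
  std::atomic<bool> exceptionSeen_{false};
  mutable std::mutex mutex_;
  std::map<unsigned, unsigned> sourceReport_;  // LS -> total events
};
```

In this model, if the source report for an LS arrives before the exception is recorded, `shouldWriteJsonOld` still allows the JSON output to be written, while `shouldWriteJson` suppresses it, so that the missing events can be accounted for as error events downstream.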