Fix race condition in DAQ modules when exception is thrown in event processing (only affecting multithreading) - 76X #12201

Closed
wants to merge 1 commit into from

smorovic
Contributor

Port of #12200 (75X).

A rare race condition occurs when an exception is thrown while processing the last few events in a file and LS. In this case, another thread may already have requested the next event from the source. If the next event belongs to the next LS, the input source reports the total number of events in the previous LS to the FastMonitoringService.

Normally, when an exception occurs, we skip writing the JSON stream output (by catching the exception action callback in the FastMonitoringService), and hltd subsequently assigns the missing events as error events to close the micro-merge of that LS. However, this suppression did not happen once the input source had already reported the total number of events to the FastMonitoringService. This led to an incomplete micro-merge for some streams. The problem is present only in multithreading, because in single-threaded mode the source cannot get a request for the next event before the exception on the currently processed event is thrown (i.e. event requests are aborted and the run/LS get closed).

In this update, JSON output is suppressed if an exception has been thrown, regardless of the input source report.
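
The ordering problem described above can be illustrated with a small, deterministic model. This is only a sketch under stated assumptions: it is not the EventFilter/Utilities code (which is C++), and `LumiState`, `MonitorModel`, `source_reports_total`, `exception_thrown` and `should_write_json` are hypothetical names chosen for illustration. It replays the problematic order of calls (source report first, exception callback second) and compares a check that is skipped after the source report with one that always looks at the exception flag.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LumiState:
    """Per-lumisection bookkeeping (hypothetical, for illustration only)."""
    source_total: Optional[int] = None  # set once the source reports the LS event count
    exception_seen: bool = False        # set by the exception callback


class MonitorModel:
    """Simplified stand-in for the monitoring bookkeeping, not the real service."""

    def __init__(self, always_check_exception: bool) -> None:
        self.always_check_exception = always_check_exception
        self.lumis: Dict[int, LumiState] = {}

    def _state(self, ls: int) -> LumiState:
        return self.lumis.setdefault(ls, LumiState())

    def source_reports_total(self, ls: int, total: int) -> None:
        self._state(ls).source_total = total

    def exception_thrown(self, ls: int) -> None:
        self._state(ls).exception_seen = True

    def should_write_json(self, ls: int) -> bool:
        state = self._state(ls)
        if not self.always_check_exception and state.source_total is not None:
            # Modelled old behaviour: the exception check is skipped
            # once the source has already reported the LS total.
            return True
        # Modelled new behaviour: the exception flag always decides.
        return not state.exception_seen


def run_scenario(always_check_exception: bool) -> bool:
    """Replay the race ordering: source report first, exception second."""
    model = MonitorModel(always_check_exception)
    model.source_reports_total(ls=5, total=100)
    model.exception_thrown(ls=5)
    return model.should_write_json(5)


if __name__ == "__main__":
    print("before fix, JSON written:", run_scenario(False))
    print("after fix, JSON written:", run_scenario(True))
```

In this model the first scenario still writes JSON output for the LS although an exception occurred, while the second suppresses it, which is the behaviour the description asks for.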

…g, while another thread has already requested the next event from the source. The source can then open the next LS (internally) and report the number of events in the past LS to the FastMonitoringService. In this case, preEndLumi triggered by the exception can run later than the source report, in which case the exception check was (incorrectly) being skipped.
@cmsbuild
Contributor

A new Pull Request was created by @smorovic (Srecko Morovic) for CMSSW_7_6_X.

Fix race condition in DAQ modules when exception is thrown in event processing (only affecting multithreading) - 76X

It involves the following packages:

EventFilter/Utilities

@mommsen, @cvuosalo, @cmsbuild, @emeschi, @slava77 can you please review it and eventually sign? Thanks.
@Martin-Grunewald this is something you requested to watch as well.
You can sign-off by replying to this message having '+1' in the first line of your reply.
You can reject by replying to this message having '-1' in the first line of your reply.
If you are a L2 or a release manager you can ask for tests by saying 'please test' or '@cmsbuild, please test' in the first line of a comment.
@Degano you are the release manager for this.
You can merge this pull request by typing 'merge' in the first line of your comment.

@slava77
Contributor

slava77 commented Oct 30, 2015

@cmsbuild please test

@cmsbuild
Contributor

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/9381/console

@cmsbuild
Contributor

-1
Tested at: 28b29c3
When I ran the RelVals I found an error in the following workflows:
25.0 step3

runTheMatrix-results/25.0_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT/step3_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT.log
----- Begin Fatal Exception 30-Oct-2015 14:12:45 CET-----------------------
An exception of category 'FileFlushError' occurred while
   [0] Calling File::flush()
Exception Message:
fdatasync() failed with system error 'Disk quota exceeded' (error code 122)
----- End Fatal Exception -------------------------------------------------

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-12201/9381/summary.html

@cvuosalo
Contributor

@cmsbuild please test

@cmsbuild
Contributor

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/9402/console

@cvuosalo
Contributor

+1

For #12201 28b29c3

Fixing rare multi-threading race condition in event processing by DAQ modules. There should be no change in monitored quantities. #12200 is the 75X version of this PR, and it has already been approved by Reco.

The code changes are satisfactory, and Jenkins tests against baseline CMSSW_7_6_X_2015-10-30-1100 show no significant differences, as expected.

@mommsen
Contributor

mommsen commented Nov 4, 2015

+1

@cmsbuild
Contributor

cmsbuild commented Nov 4, 2015

This pull request is fully signed and it will be integrated in one of the next CMSSW_7_6_X IBs (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @Degano, @smuzaffar

@davidlange6
Contributor

I am cleaning up the 76x queue aside from things for analysis workflows. I'm closing this pull request, please make sure the PR is in 80x. Thanks!

@smorovic smorovic deleted the exception-eols-fix-76X branch February 13, 2019 11:53