Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[L1-O2O] ESGetToken migration L1CondDBPayloadWriter and O2O unit tests #37602

Conversation

panoskatsoulis
Copy link
Contributor

@panoskatsoulis panoskatsoulis commented Apr 17, 2022

PR description:

This PR implements WriterProxy from CondTools/L1TriggerExt to inherit from edm::EDConsumerBase as it's required for running esConsumes() to register ESGetToken objects when is called by CondDBPayloadWriter

PR validation:

This pkg is part of the L1 O2O and there is not any central wf to be tested
Tested Locally and from custom patch on machine conddb-1 writing to sqlite file and Prep respectively
The local replica of the p5 script is commited together in this PR ( runL1-O2O-iov.sh )
if this PR is a backport please specify the original PR and why you need to backport that PR:

@cmsbuild
Copy link
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37602/29367

  • This PR adds an extra 16KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37602/29368

  • This PR adds an extra 16KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @panoskatsoulis (Panos) for master.

It involves the following packages:

  • CondTools/L1Trigger (db, l1)
  • L1TriggerConfig/L1TConfigProducers (l1)

@malbouis, @epalencia, @cmsbuild, @rekovic, @ggovi, @tvami, @cecilecaillol, @francescobrivio can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @missirol, @mmusich this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@panoskatsoulis
Copy link
Contributor Author

hold

@panoskatsoulis
Copy link
Contributor Author

Please wait as after enabling these changes the L1TGlobalPrescalesVetosOnlineProd online producer is crashing without obvious reasons. Throwing only this error that contains no much information

----- Begin Fatal Exception 17-Apr-2022 18:27:06 CEST-----------------------
An exception of category 'MakeDataException' occurred while
   [0] Processing  Event run: 4294967295 lumi: 1 event: 1 stream: 0
   [1] Running path 'p'
   [2] Calling method for module L1CondDBPayloadWriterExt/'L1CondDBPayloadWriterExt'
   [3] Using EventSetup component L1TGlobalPrescalesVetosOnlineProd/'' to make data L1TGlobalPrescalesVetosFract/'' in record L1TGlobalPrescalesVetosFractO2ORcd
Exception Message:
Error while making data "L1TGlobalPrescalesVetosFract" "" in Record L1TGlobalPrescalesVetosFractO2ORcd
----- End Fatal Exception -------------------------------------------------

I dont think tho that it's related to this change as the producer for the L1TriggerKeyExtRcd is running correctly and fetched the right data from the EventSetup.

@mmusich
Copy link
Contributor

mmusich commented Apr 18, 2022

is there any reason to wait for running the integration tests?

@makortel
Copy link
Contributor

I dont think tho that it's related to this change as the producer for the L1TriggerKeyExtRcd is running correctly and fetched the right data from the EventSetup.

After poking around the code I agree that the exception in #37602 (comment) is not related to the changes/problems here. The exception message effectively says that the ESProducer returned a nullptr product. From a quick look on the code, I and @Dr15Jones did not see an obvious way how that could happen (the ESProducer appears to throw an exception in case the product would be nullptr, and the framework should report that exception). Would this problem be easily reproduced offline?

@makortel
Copy link
Contributor

I dont think tho that it's related to this change as the producer for the L1TriggerKeyExtRcd is running correctly and fetched the right data from the EventSetup.

After poking around the code I agree that the exception in #37602 (comment) is not related to the changes/problems here. The exception message effectively says that the ESProducer returned a nullptr product. From a quick look on the code, I and @Dr15Jones did not see an obvious way how that could happen (the ESProducer appears to throw an exception in case the product would be nullptr, and the framework should report that exception). Would this problem be easily reproduced offline?

After poking around the code some more, we found out that when we changed the EventSetup system to do prefetching, in case the exception happens during ESProducer::produce(), we accidentally lost the exact type of the exception, and always report the MakeDataException. The L1-O2O appears to heavily rely on the type of the exception (based on many try {} catch(l1t::DataAlreadyPresentException&) we saw e.g. here

try {
setup.get<Record>().get(handle);
} catch (l1t::DataAlreadyPresentException& ex) {
return std::string();
}

). We will fix the EventSetup system to properly cache the exact exception and rethrow when the data product is requested from the EventSetup or Record.

Even if this problem would not have been there, the O2O would still not work because of the missing consumes calls (into which I'll return in a subsequent message).

@makortel
Copy link
Contributor

From a quick inspection from parts of the code, it looks to us that all of the database queries are based on hardcoded values and information from the configuration. We could not find a place in the limited number of source files we looked where IOVs was necessary to make the database query. If that is true, the relevant writers actually can be known at module construction time. Therefore, the writers can be made in the module (e.g. L1CondDBPayloadWriter) constructor, and passed in the edm::ConsumesCollector so they can call esConsumes() and get the token.

In the short term, it should be possible to make the EDAnalyzer(s) to have a configuration parameter telling which writers are needed. The present machinery could be used to print out a list writers being used now, and copy that in the configuration.

@panoskatsoulis
Copy link
Contributor Author

panoskatsoulis commented Apr 18, 2022

I dont think tho that it's related to this change as the producer for the L1TriggerKeyExtRcd is running correctly and fetched the right data from the EventSetup.

After poking around the code I agree that the exception in #37602 (comment) is not related to the changes/problems here. The exception message effectively says that the ESProducer returned a nullptr product. From a quick look on the code, I and @Dr15Jones did not see an obvious way how that could happen (the ESProducer appears to throw an exception in case the product would be nullptr, and the framework should report that exception). Would this problem be easily reproduced offline?

After poking around the code some more, we found out that when we changed the EventSetup system to do prefetching, in case the exception happens during ESProducer::produce(), we accidentally lost the exact type of the exception, and always report the MakeDataException. The L1-O2O appears to heavily rely on the type of the exception (based on many try {} catch(l1t::DataAlreadyPresentException&) we saw e.g. here

try {
setup.get<Record>().get(handle);
} catch (l1t::DataAlreadyPresentException& ex) {
return std::string();
}

). We will fix the EventSetup system to properly cache the exact exception and rethrow when the data product is requested from the EventSetup or Record.

Even if this problem would not have been there, the O2O would still not work because of the missing consumes calls (into which I'll return in a subsequent message).

thx for this, indeed this is making it worse throughout all the weekend to debug it, as these errors here [1] are very cryptic and without information what happened.

Here I'm continuing the explanation about the calls like these here [2-3]. The one in [2] as explained above succeeds after the changes here while the one in [3] keeps on failing. But why these 2 calls behave differently I think is the key for fixing it.
I see that the calls failing are dependent on extra producers that even if they are initialized with their constructors they, for some reasons in 12_3 wrt 12_2, don't be called to produce the extra ...O2ORcd required from the simple ...Rcd.
So for this reason, I already tried since yesterday to make some changes (locally) to construct the required Writers in the L1CondDBPayloadWriter constructor, but this keeps on failing. Probably, I don't use the edm::ConsumesCollector correctly. I will get back to you on this.

An other possible work around if this fails, I think, is to explicitly run the required "Online" producers from the configuration file, in the same path [4], so we have already in place the required ...O2ORcd when these are needed. As you said all the required records to be fetched are hardcoded already and we know what ESProducers are needed.

[1] https://cms-conddb.cern.ch/cmsDbBrowser/logs/show_O2O_log/Prep/L1TMenu/2022-04-17%2013:20:30.447741
[2]

token = m_writer.writePayload(iSetup, "L1TriggerKeyExtRcd@L1TriggerKeyExt");

[3]
token = m_writer.writePayload(iSetup, it->first);

[4]
process.p = cms.Path(process.L1CondDBPayloadWriterExt)

@makortel
Copy link
Contributor

+core

esConsumes was done in a way that works

@epalencia
Copy link
Contributor

+l1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Copy link
Contributor

+1

  • Finally...

@Dr15Jones
Copy link
Contributor

@panoskatsoulis Nicely done!

@davidlange6
Copy link
Contributor

davidlange6 commented Oct 11, 2022 via email

@panoskatsoulis
Copy link
Contributor Author

hi @davidlange6 , if I'm not mistaken the sql file is already copied loccally
from the master ...

## test dir
DATA_DIR=$CMSSW_BASE/$SCRAM_TEST_NAME
mkdir $DATA_DIR
## get sqlite from the write-only CMSSW_SEARCH_PATH
for dir in $(echo $CMSSW_SEARCH_PATH | tr ':' '\n') ; do
[ -e $dir/L1TriggerConfig/L1TConfigProducers/data/l1config.db ] || continue
cp $dir/L1TriggerConfig/L1TConfigProducers/data/l1config.db $DATA_DIR/
break
done
local_db=$DATA_DIR/l1config.db
sqlite3 $local_db -cmd "SELECT * from IOV;" ".q" > $DATA_DIR/iov_before

@davidlange6
Copy link
Contributor

davidlange6 commented Oct 11, 2022 via email

@panoskatsoulis
Copy link
Contributor Author

haha ok, it says here 20mins ago
sry for the spam
Screenshot from 2022-10-11 11-35-58

@smuzaffar
Copy link
Contributor

looks like most of @davidlange6 comments were pending in his mail queue and are only now being delivered :-)

@davidlange6
Copy link
Contributor

davidlange6 commented Oct 11, 2022 via email

@davidlange6
Copy link
Contributor

davidlange6 commented Oct 11, 2022 via email

@panoskatsoulis
Copy link
Contributor Author

Yea, I'm flooding a bunch of old GitHub threads this morning - no idea how..

On Oct 11, 2022, at 11:37 AM, Panos @.***> wrote: haha ok, it says here 20mins ago sry for the spam — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.

I almost got a heart attack and nightmares back 🤣

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.