HLT-Validation tests failing in IBs of CMSSW <= 12_1_X
#40013
A new Issue was created by @missirol Marino Missiroli. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign hlt |
New categories assigned: hlt @missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@cms-sw/orp-l2 HLT plans to go ahead with the PRs to fix these tests. If you have comments, please let us know. |
Corresponding PRs are listed below.
Script used to make these changes is copied below.
#!/bin/bash
replFile(){
sed -i "s|$1|root://eoscms.cern.ch//eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/STORM/$2|g" \
"${CMSSW_BASE}"/src/HLTrigger/Configuration/test/cmsDriver.csh \
"${CMSSW_BASE}"/src/Configuration/HLT/python/addOnTestsHLT.py \
"${CMSSW_BASE}"/src/Utilities/ReleaseScripts/scripts/addOnTests.py
}
sed -i "s|root://eoscms.cern.ch//eos/cms/store/data|/store/data|g" \
"${CMSSW_BASE}"/src/HLTrigger/Configuration/test/cmsDriver.csh \
"${CMSSW_BASE}"/src/Configuration/HLT/python/addOnTestsHLT.py \
"${CMSSW_BASE}"/src/Utilities/ReleaseScripts/scripts/addOnTests.py
sed -i "s|root://eoscms.cern.ch//eos/cms/store/hidata|/store/hidata|g" \
"${CMSSW_BASE}"/src/HLTrigger/Configuration/test/cmsDriver.csh \
"${CMSSW_BASE}"/src/Configuration/HLT/python/addOnTestsHLT.py \
"${CMSSW_BASE}"/src/Utilities/ReleaseScripts/scripts/addOnTests.py
replFile \
/store/data/Run2011B/MinimumBias/RAW/v1/000/178/479/3E364D71-F4F5-E011-ABD2-001D09F29146.root \
RAW/Run2011B_MinimumBias_run177719/065F5CDD-E6EC-E011-ACBF-001D09F26509.root
replFile \
/store/hidata/HIRun2011/HIHighPt/RAW/v1/000/182/838/F20AAF66-F71C-E111-9704-BCAEC532971D.root \
RAW/HIRun2011_HIHighPt_run182838/F20AAF66-F71C-E111-9704-BCAEC532971D.root
replFile \
/store/data/Run2012A/MuEG/RAW/v1/000/191/718/14932935-E289-E111-830C-5404A6388697.root \
RAW/Run2012A_MuEG_run191718/14932935-E289-E111-830C-5404A6388697.root
replFile \
/store/data/Run2015D/MuonEG/RAW/v1/000/256/677/00000/80950A90-745D-E511-92FD-02163E011C5D.root \
RAW/Run2015D_MuonEG_run256677/80950A90-745D-E511-92FD-02163E011C5D.root
replFile \
/store/hidata/HIRun2015/HIHardProbes/RAW-RECO/HighPtJet-PromptReco-v1/000/263/689/00000/1802CD9A-DDB8-E511-9CF9-02163E0138CA.root \
RAW/HIRun2015_HIHardProbes_run263718/08057733-02A5-E511-9C7D-02163E014606.root
replFile \
/store/data/Run2016B/JetHT/RAW/v1/000/272/762/00000/C666CDE2-E013-E611-B15A-02163E011DBE.root \
RAW/Run2016B_JetHT_run272762/C666CDE2-E013-E611-B15A-02163E011DBE.root
replFile \
/store/data/Run2017A/HLTPhysics4/RAW/v1/000/295/606/00000/36DE5E0A-3645-E711-8FA1-02163E01A43B.root \
RAW/Run2017A_HLTPhysics4_run295606/36DE5E0A-3645-E711-8FA1-02163E01A43B.root
replFile \
/store/data/Run2018D/EphemeralHLTPhysics1/RAW/v1/000/323/775/00000/2E066536-5CF2-B340-A73B-209640F29FF6.root \
RAW/Run2018D_EphemeralHLTPhysics1_run323775/2E066536-5CF2-B340-A73B-209640F29FF6.root
replFile \
/store/data/Run2018D/HIMinimumBias0/RAW/v1/000/325/112/00000/660F62BB-9932-D645-A4A4-0BBBDA3963E8.root \
RAW/Run2018D_HIMinimumBias0_run325112/660F62BB-9932-D645-A4A4-0BBBDA3963E8.root
replFile \
/store/hidata/HIRun2018A/HIHardProbes/RAW/v1/000/326/479/00000/853DBE29-53BA-9A44-9FDD-58E4E9064EB1.root \
RAW/HIRun2018A_HIHardProbes_run326479/0E2CC5D5-9D87-7348-9219-B00CD718C847.root |
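The substitution done by the script above can be sanity-checked on a scratch file before it is applied to the real test configurations. This is a minimal sketch, assuming GNU sed; the `repl_in` helper and the scratch file are illustrative and not part of the original script.

```shell
#!/bin/bash
# Same TSG EOS prefix as in the script above.
TSG_PREFIX="root://eoscms.cern.ch//eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/STORM"

# Hypothetical helper: same substitution as replFile, but applied to an
# explicit list of files instead of the three CMSSW test files.
repl_in(){
  local old="$1" new="$2"
  shift 2
  sed -i "s|${old}|${TSG_PREFIX}/${new}|g" "$@"
}

# Scratch file standing in for one of the test configurations.
scratch="$(mktemp)"
printf '%s\n' "--filein /store/data/Run2016B/JetHT/RAW/v1/000/272/762/00000/C666CDE2-E013-E611-B15A-02163E011DBE.root" > "${scratch}"

repl_in \
  /store/data/Run2016B/JetHT/RAW/v1/000/272/762/00000/C666CDE2-E013-E611-B15A-02163E011DBE.root \
  RAW/Run2016B_JetHT_run272762/C666CDE2-E013-E611-B15A-02163E011DBE.root \
  "${scratch}"

# Show the rewritten line, then clean up.
result="$(cat "${scratch}")"
echo "${result}"
rm -f "${scratch}"
```

Running the helper on a copy first makes it easy to diff the result against the original before editing files under `${CMSSW_BASE}`.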
@missirol if the intent is just to fix the IB tests, and there will be no need to turn them into new releases, I think the new PRs can all be merged, also taking into account that they were all tested successfully. |
Yes, this is the case. I think those PRs do the job, but I would like to implement whatever is the better solution according to Core-Sw. Let's see if we can clarify that (details below).
@cms-sw/core-l2, please share your preference(s). |
Well, recall that this problem appeared because the statement 'RAW data is never deleted', which held true for many years, was invalidated. I fear the situation alluded to in point 1) above will happen again, so I would rather have all the files in a TSG EOS area (they are for the MC tests already anyway). Overall these are a few files, so this will not create a disk space problem, and there are no complications in making sure we can access the bot cache (from outside and/or when running tests locally). |
The IB tests keep failing, so it would be good to converge. As I wrote in #40020 (comment), I think the better solution is to use the bot cache, like the vast majority of CMSSW wfs do (that mechanism is meant precisely to prevent issues like this one). @Martin-Grunewald , if you are convinced the current approach is better, please sign the PRs (and ask Core-Sw to do the same). Otherwise, I will update them to use the bot cache (this 2nd option will not require signatures by Core-Sw).
PS. These MC files are also in the bot cache now, after #40020 (comment). |
So in that case of bot usage, what is the procedure if we change any of the input files? We first need to copy them to the bot cache? Who has permissions? (Then we could copy them to TSG eos directly). |
I think it would have to be a file that is accessible via
There is no denying that we introduce a dependence on the cache, and the latter is not under our control (on the other hand, if the cache fails, most CMSSW tests will). What is not so nice about the current approach is that if someone moves the files in the TSG-EOS area (e.g. renaming folders), this will break the IB tests. That consequence is far from obvious, and it is likely to happen one day, because nothing prevents it other than some of us knowing. As long as the cache works, the advantage is that we just write the LFN of the files, and there is no other maintenance (or need to guard the EOS-TSG area). But to be clear, I don't have a strong opinion on these two solutions; I was waiting for a guideline from Core-Sw (#40020 (comment)), but maybe they don't have a strong opinion either. Tagging @smuzaffar , in case I'm missing something. |
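To see which test inputs would be affected by either choice, one can list the places where a full EOS URL is hard-coded instead of a plain LFN. This is a minimal sketch; the `list_eos_urls` helper and the demo file are hypothetical, and in practice the function would be pointed at the test files edited by the script earlier in this thread.

```shell
#!/bin/bash
# Hypothetical helper: print file name, line number and content of every
# line that hard-codes an EOS URL rather than a plain /store/... LFN.
list_eos_urls(){
  grep -Hn 'root://eoscms\.cern\.ch//eos/' "$@"
}

# Demo file with one LFN-style input and one hard-coded EOS URL.
demo="$(mktemp)"
printf '%s\n' \
  "--filein /store/data/Run2018D/EphemeralHLTPhysics1/RAW/v1/000/323/775/00000/2E066536-5CF2-B340-A73B-209640F29FF6.root" \
  "--filein root://eoscms.cern.ch//eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/STORM/RAW/Run2016B_JetHT_run272762/C666CDE2-E013-E611-B15A-02163E011DBE.root" \
  > "${demo}"

# Count the hard-coded URLs, print them, then clean up.
hits="$(list_eos_urls "${demo}" | grep -c .)"
list_eos_urls "${demo}"
rm -f "${demo}"
```

Only the second line of the demo file is reported, since the first one already uses an LFN.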
First I would suggest to start using LFN ( i.e.
Yes, it looks complicated, but it has worked fine for many years for relvals/addon and unit tests. Unused files are automatically deleted after 6 months. I also have no strong opinion: if you want to make use of the bot cache then I can help, but if you want to have full control over the files then feel free to maintain your own EOS area. |
OK, so it looks like this works fine. (The only caveat left is this 6-month deletion once unused: this would bite us in case a discontinued CMSSW release series gets resurrected after say 8 months or so because someone wants to add a PR, but some of the corresponding |
Okay, I will double-check things and update the PRs to use the cache in the next couple of days (if there are no issues). @smuzaffar , regarding #40020 (comment), could you please teach the bot to parse the HLT-Validation logs to look for files to cache ? |
Thanks for the update, @traylenator . For reference, Steve also explained that a user can disable mkdir -p ~/.hepix
touch ~/.hepix/off
and that "csh source is loaded with every shell vs only on login for bash". |
I updated the PRs so that the HLT-Val tests can access EDM files from the cms-bot cache. There is a separate issue (not discussed so far) for the CMSSW releases where the HLT-Val tests run in SLC6. There, the tests will continue to fail (what continues to fail is the "hlt-integration-tests" part of these tests). I reproduced the problem locally, and it looks like an incompatibility with the external files (e.g. JAVA Fwiw, #40004 has removed queries to ConfDB in IB tests, so this type of problem should never happen moving forward in recent releases. [1]
|
HLT-integration tests cannot run with SLC6 architectures, due to an incompatibility with the latest .jar files of ConfDB-v2. For further details, see cms-sw#40013 (comment)
We agreed to switch off the HLT-integration tests for IBs using SLC6, to avoid false positives due to the issue described in #40013 (comment). The PRs to 8_0_X, 9_4_X, 10_2_X and 10_3_X have been updated accordingly (the update of the 5_3_X PR is not necessary, as those tests do not run in that earlier cycle). I think all the PRs are now ready to go. |
+hlt With the expected exception of Thanks @smuzaffar for helping us make use of the ibeos cache. |
This issue is fully signed and ready to be closed. |
please close |
Hello @smuzaffar @missirol Need to reopen this. I find in several tests that the jobs fail to find the input file, while other jobs succeed (130, 126, 125 were tested), running the tests in my own developer areas. Succeeding tests use: while failing tests use: IOW, the path prefix is different; the failing tests do not seem to use the bot cache (/store/user/cmsbuild part in the path). Where is the /store/user/cmsbuild part of the file path inserted? |
Need to check the tcsh problem mentioned above... |
The actual file in IB EOS cache is [a]
|
Ah, the typo was a cut and paste error. |
Re #40013 (comment), I'm trying to figure out how we can protect the HLT tests from the '6-months removal' rule of the cache.
I'm not convinced by the 1st option. The 2nd point might suggest a solution to the 6-months problem.
The disadvantage of this option is that the DAS name of the files is not obvious (but it's still possible to figure it out), there is still some weak dependence on the TSG area (although that's intended, as backup option), and there may be some wasted space in the cache for a while because of a few identical files with different names. Thoughts? |
Martin noticed that the following file became "unused" in the cache: Based on #40013 (comment) , we will try to add a simple unit test to ensure that the few files we need are kept in the cache. In the meantime, could you please make this file available again in the cache (removing "unused") ? |
Hello @missirol, I just did it. Could you please confirm you can access it now? |
Thanks, Andrea. I see it.
Is it possible to figure out why it became 'unused' today? (I'm trying to understand better how the caching works, to avoid issues like this in the future.) |
(Continuing to write here for completeness, even though the issue is closed.) #40365 adds a unit test to keep the files cached, along the lines of what was described in item-1 of #40013 (comment). I'm not convinced this is the better long-term solution, but it should at least avoid issues like #40013 (comment). If others have any suggestions, I'd be happy to hear them. |
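The idea of a test that keeps a fixed list of input files in use can be sketched as below. This is only an illustration of the approach and not the test added in #40365: the `check_inputs` helper is hypothetical, and `test -e` on local scratch files stands in for whatever command would actually open the files through the cache.

```shell
#!/bin/bash
# Hypothetical helper: check_inputs CHECK_CMD FILE...
# Runs CHECK_CMD on each file, prints the ones that fail, and returns
# non-zero if at least one check failed.
check_inputs(){
  local check_cmd="$1"
  shift
  local failed=0 f
  for f in "$@"; do
    # check_cmd is deliberately unquoted so that "test -e" splits into words.
    if ! ${check_cmd} "${f}" >/dev/null 2>&1; then
      echo "MISSING: ${f}"
      failed=1
    fi
  done
  return "${failed}"
}

# Stand-in demo: one file that exists and one that does not.
present="$(mktemp)"
absent="${present}.does-not-exist"
if report="$(check_inputs "test -e" "${present}" "${absent}")"; then
  status=0
else
  status=1
fi
rm -f "${present}"
echo "${report}"
```

In a real test, the file list would be the handful of LFNs used by the HLT tests, and the check command would need to read each file the same way the tests do, so that the access is seen by the cache.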
The HLT-Validation tests are failing in IBs of CMSSW_12_1_X and lower release cycles, e.g. https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/ib/CMSSW_12_1_X
The main [*] reason for these failures is access to EDM files used by these tests (these files were removed from the CERN Tier-2 some days or weeks ago) [**].
TSG has copies of these files in its EOS area, so the fix is simple: we can replace the path to these files.
Questions to @cms-sw/orp-l2.
Do we want to fix this? Are PRs still accepted in release cycles as low as 5_3_X?
If yes, should we also make PRs to release cycles that do not appear on the IB dashboard (e.g. 12_2_X)?
[*] In 10_6_X (and likely some other cycle), these tests will continue to fail (even after fixing the file-access issue) due to other issues.
[**] It might be useful to know from experts if there are ways to cache these files similarly to what is done for the EDM files used in RelVal wfs and addOnTests.