T0 JobAccountant crashes due to multiple jobs generating same file #10870

Closed
germanfgv opened this issue Oct 13, 2021 · 29 comments

Comments

@germanfgv
Contributor

Impact of the bug
T0Agent

Describe the bug
Two or more jobs create output files with the same name. JobAccountant tries to add them to the DB and fails with an ORA-00001: unique constraint (WMBS_FILDETAILS_UNIQUE) violated error.

We have seen the issue affecting Express jobs, but it may be affecting Repack and PromptReco too.

How to reproduce it
Deploy a Tier0 replay with a significant number of jobs. As soon as the first batches of Express jobs finish, the JobAccountant will crash.

Expected behavior
Each job should create files with unique names

Additional context and error message
Here you can find the full error message of the component
ComponentLog.txt

Full JobAccountant logs can be found here:
/afs/cern.ch/user/c/cmst0/public/JobAccountant/JobAccountantLogs

Logs of a set of jobs generating files with the same name:
/afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs
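
To illustrate the failure mode described above, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for the Oracle backend. The table `file_details` and the helper `insert_files` are hypothetical simplifications, not the WMBS schema or WMCore code. It shows how a single repeated LFN in a batch makes the whole insert fail on a unique constraint.

```python
import sqlite3

def insert_files(conn, lfns):
    """Insert all LFNs in one transaction; return False if any LFN is a duplicate."""
    try:
        with conn:  # commits on success, rolls back the whole batch on error
            conn.executemany(
                "INSERT INTO file_details (lfn) VALUES (?)",
                [(lfn,) for lfn in lfns],
            )
        return True
    except sqlite3.IntegrityError:
        # Raised when the UNIQUE constraint on lfn is violated.
        return False

# Simplified, hypothetical table: only the uniqueness of lfn matters here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_details (id INTEGER PRIMARY KEY, lfn TEXT UNIQUE)")

# Two jobs reporting the same output name end up in the same batch.
batch = ["/store/a.root", "/store/b.root", "/store/a.root"]
print(insert_files(conn, batch))
```

In this sketch the rollback leaves the table empty, so the unique files in the batch are not recorded either.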

@amaltaro
Contributor

Possibly related to (if not the same root cause): #9633

@germanfgv
Contributor Author

germanfgv commented Oct 15, 2021

@amaltaro from what I understand, in #9633 duplicate files are created with a default name and that causes the JobAccountant to crash. In our case, however, that doesn't seem to be happening. For example, for jobs in /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g1/, the duplicate file is:

/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/86A2833A-2C31-11EC-AF9E-D0C08E80BEEF.root

Could this still have the same root cause?

@todor-ivanov
Contributor

Hi @german,

It could still be related to #9633, even though the duplicate filename is not due to some defaults as in the issue above. I hope we can get back to it during this week. Do you still have some logs on a machine which is experiencing it? I am not sure how easily reproducible the error is.

@germanfgv
Contributor Author

It is fairly easy to reproduce. It appears in every replay we run. Just tell me what you need and I'm sure I can provide it.

@amaltaro
Contributor

In order to fix it, I think we need to start making the uuid a function of:

  • job id/number
  • clock/timestamp
  • hostname

It should be possible to do this with the uuid5 algorithm. More info at: https://docs.python.org/3/library/uuid.html
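
A minimal sketch of that idea, assuming the three inputs are available at the point where the name is chosen. The namespace string, the function `job_file_uuid` and its argument names are hypothetical and not part of WMCore; only `uuid.uuid5` and `uuid.NAMESPACE_URL` come from the Python standard library. The timestamp value in the usage lines is arbitrary.

```python
import socket
import time
import uuid

# Hypothetical namespace for illustration; any fixed UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/wm-output-files")

def job_file_uuid(job_id, timestamp=None, hostname=None):
    """Derive a name-based (version 5) UUID from hostname, job id and timestamp."""
    if timestamp is None:
        timestamp = time.time_ns()
    if hostname is None:
        hostname = socket.gethostname()
    name = "%s/%s/%s" % (hostname, job_id, timestamp)
    return uuid.uuid5(NAMESPACE, name)

# Two jobs on the same host at the same instant differ only by job id.
a = job_file_uuid(1038, timestamp=1634132820, hostname="b7g18p9798.cern.ch")
b = job_file_uuid(1048, timestamp=1634132820, hostname="b7g18p9798.cern.ch")
print(a != b)
```

Because uuid5 is deterministic, uniqueness here depends entirely on the inputs: two calls with the same job id, timestamp and hostname return the same value.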

@amaltaro
Contributor

@khurtado this one is tightly coupled to #9011, and I think that the fix you will propose will actually close both this issue and #9011. I assigned this one to you and moved it to Work in progress as well.

@khurtado
Contributor

Thanks Alan!

@khurtado
Contributor

Here are my findings:

At runtime, we make a PSet tweak to change the name of the output files based on the output modules:

    if lfnBase != None:
        lfn = "%s/%s/%s.root" % (lfnBase, lfnGroup(job), modName)
        result.addParameter("process.%s.logicalFileName" % modName, lfn)

So, the output files have a pattern like:

/store/unmerged/HG2202_Val/RelValProdMinBias/GEN-SIM/HG2202_Val_OLD_Alanv4-v22/00000/RAWSIMoutput.root

After cmsRun is done, we change the lfn of the file, and with that the filename at stageout. So the file then looks like:

/store/unmerged/HG2202_Val/RelValProdMinBias/GEN-SIM/HG2202_Val_OLD_Alanv4-v22/00000/2AE85F14-94A1-EC11-BBF5-FA163EC7AA59.root

But here is the thing: we don't set the uuid for the filename. We basically grab the GUID from the generated Framework XML job report here:

filelfn = '%s.root' %(str(guid))
setattr(fileReport, 'lfn', os.path.join(dirname, filelfn))

The guid generated by FWCore:
https://github.com/cms-sw/cmssw/blob/master/FWCore/Utilities/src/Guid.cc#L18-L28

And we can't just change the file lfn to use our own uuid for the filename, since we also enforce and check the guid in the filename using this utility:

https://github.com/cms-sw/cmssw-wm-tools/blob/master/bin/cmssw_enforce_guid_in_filename.py

So it seems the best solution here is to report the issue to cmssw and have it fixed on that end.

@khurtado
Contributor

khurtado commented Mar 14, 2022

@germanfgv Do you happen to have or know anyone with direct access to the T2_CH_CERN worker nodes?
Specifically the nodes from here:

[khurtado@lxplus708 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs]$ find . -name "condor*out" -exec grep -H Hostname {} \; | sort
./g1/job_1038/condor.41731.37.out:Hostname:   b7g18p9798.cern.ch
./g1/job_1048/condor.41731.47.out:Hostname:   b7g18p9798.cern.ch
./g2/job_98/condor.41726.97.out:Hostname:   b7g18p7310.cern.ch
./g2/job_99/condor.41726.98.out:Hostname:   b7g18p7310.cern.ch
./g3/job_1137/condor.41731.136.out:Hostname:   b7g17p4406.cern.ch
./g3/job_1138/condor.41731.137.out:Hostname:   b7g17p4406.cern.ch
./g4/job_1169/condor.41731.168.out:Hostname:   b7g18p3673.cern.ch
./g4/job_1170/condor.41731.169.out:Hostname:   b7g18p3673.cern.ch
./g5/job_527/condor.41729.133.out:Hostname:   b7g10p4995.cern.ch
./g5/job_530/condor.41729.136.out:Hostname:   b7g10p4995.cern.ch
./g5/job_531/condor.41729.137.out:Hostname:   b7g10p4995.cern.ch
./g6/job_4623/condor.41780.47.out:Hostname:   b7g17p1733.cern.ch
./g6/job_4635/condor.41780.59.out:Hostname:   b7g17p1733.cern.ch

CMSSW is asking us to check the availability of /dev/urandom and the contents of /proc/sys/kernel/random/entropy_avail. I still don't know whether these change inside containers, but it should be trivial to check after invoking singularity.
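
A minimal sketch of such a check, which could be run both outside and inside the container to compare. The function name `random_source_status` and the returned dictionary keys are hypothetical; the two paths are the ones named above.

```python
import os

def random_source_status(urandom="/dev/urandom",
                         entropy_file="/proc/sys/kernel/random/entropy_avail"):
    """Report whether the random device exists and the entropy the kernel reports."""
    status = {"urandom_exists": os.path.exists(urandom), "entropy_avail": None}
    try:
        with open(entropy_file) as handle:
            status["entropy_avail"] = int(handle.read().strip())
    except (OSError, ValueError):
        # File missing or unreadable (e.g. non-Linux host): leave the value as None.
        pass
    return status

print(random_source_status())
```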

@khurtado
Contributor

@germanfgv From your logfiles on:

/afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs

Would you be able to get a pair of job logfiles with outputs other than DQMIO?

For example, your g3 or g4 directories do have RAW, ALCARECO in the list of duplicated files in the json, but only the DQMIO job logfiles are present.

The cmssw folks are requesting this on: cms-sw/cmssw#37240

To summarize: It looks like the DQMIO issue is understood, but they need more information for the ALCARECO, RAW, etc.

@germanfgv
Contributor Author

germanfgv commented Mar 15, 2022

@khurtado please check the logs that you can find here:
/afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4/job_1106

I'll check if we have more logs. If not, we can try and generate more examples.

@khurtado
Contributor

khurtado commented Mar 15, 2022

Hi @germanfgv. So, for g4, I can see the duplicated DQMIO file from job 1169 and 1170, but not 1106:

[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ find . -name "wmagentJob.log" -exec grep -H "LFN: \/store" {} \; | grep DQMIO
./job_1169/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/CFACDE90-2C31-11EC-A88A-C5C08E80BEEF.root
./job_1170/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/CFACDE90-2C31-11EC-A88A-C5C08E80BEEF.root
./job_1106/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/A1E93C9C-2C31-11EC-ABF5-9B8A8E80BEEF.root

I couldn't find duplicated names for ALCARECO and RAW though, am I perhaps looking in the wrong way?

# ALCARECO
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ find . -name "wmagentJob.log" -exec grep -H "LFN: \/store" {} \; | grep ALCARECO
./job_1169/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/98e014a9-8224-49b9-b5f9-5a77fca89a16.root
./job_1170/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/b0e5793e-1d8e-4a04-a5c4-f68ea475692e.root
./job_1106/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/814f7a01-2f71-4f9c-9190-a4dd256e123e.root

# RAW
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ find . -name "wmagentJob.log" -exec grep -H "LFN: \/store" {} \; | grep RAW
./job_1169/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/40ec5397-771e-4478-91d0-45e7c63aec5d.root
./job_1170/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/b22d070b-8f19-4811-a9a3-81d202072653.root
./job_1106/wmagentJob.log:LFN: /store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/8fcf123e-d1f5-4c75-959b-9d123bea738a.root
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$

@germanfgv
Contributor Author

According to file g4/dupPickles.json, line 11157, these are the dup LFNs:

    "dup_lfns": [
      "/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/814f7a01-2f71-4f9c-9190-a4dd256e123e.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/814f7a01-2f71-4f9c-9190-a4dd256e123e.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/A1E93C9C-2C31-11EC-ABF5-9B8A8E80BEEF.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/A1E93C9C-2C31-11EC-ABF5-9B8A8E80BEEF.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/8fcf123e-d1f5-4c75-959b-9d123bea738a.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/8fcf123e-d1f5-4c75-959b-9d123bea738a.root",
      "/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0001/0/Express-c2912657-08db-4d3d-9f8a-a8c949da8c68-0-logArchive.tar.gz",
      "/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0001/0/Express-c2912657-08db-4d3d-9f8a-a8c949da8c68-0-logArchive.tar.gz"
    ]

Back in the day, I also couldn't find them directly, that's why I just copied the logs for the DQM jobs. Maybe @amaltaro can clarify to us what this "dup_lfns" list means exactly.

@khurtado
Contributor

khurtado commented Mar 15, 2022

@germanfgv That's a good point. Looking at all the guids in the json for g4, I could only find duplicated guids for 2 DQMIO files.

[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ grep guid dupPickles.json | awk '{print $2}' | sort | uniq -c 2>&1| grep -v " 1"
      2 "CFACDE90-2C31-11EC-A88A-C5C08E80BEEF",
      2 "D372A94C-2C31-11EC-AA52-D0C08E80BEEF",
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ grep CFACDE90-2C31-11EC-A88A-C5C08E80BEEF dupPickles.json | grep PFN
              "OutputPFN": "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/CFACDE90-2C31-11EC-A88A-C5C08E80BEEF.root?eos.app=cmst0",
              "OutputPFN": "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/CFACDE90-2C31-11EC-A88A-C5C08E80BEEF.root?eos.app=cmst0",
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$ grep D372A94C-2C31-11EC-AA52-D0C08E80BEEF dupPickles.json | grep PFN
              "OutputPFN": "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/D372A94C-2C31-11EC-AA52-D0C08E80BEEF.root?eos.app=cmst0",
              "OutputPFN": "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/D372A94C-2C31-11EC-AA52-D0C08E80BEEF.root?eos.app=cmst0",
[khurtado@lxplus755 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs/g4]$
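
The same count can be sketched in Python, which avoids the fragile `grep -v " 1"` filter. This assumes the JSON is pretty-printed with one `"guid": "..."` entry per line, as the grep output above suggests; the function name `duplicated_guids` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:  "guid": "CFACDE90-2C31-11EC-A88A-C5C08E80BEEF",
GUID_LINE = re.compile(r'"guid"\s*:\s*"([^"]+)"')

def duplicated_guids(lines):
    """Return {guid: count} for every guid value that appears more than once."""
    counts = Counter()
    for line in lines:
        match = GUID_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return {guid: n for guid, n in counts.items() if n > 1}

# Usage, assuming dupPickles.json is in the current directory:
# with open("dupPickles.json") as handle:
#     print(duplicated_guids(handle))
```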

Maybe what happens is whenever there is a SQL WMBS error, dupLFNs lists all the LFNs listed in the job parameters?
E.g.: SQL error below, so listing all LFNs from parameters
@amaltaro?

[SQL: INSERT INTO wmbs_file_details (id, lfn, filesize, events,
                                            first_event, merged)
             VALUES (wmbs_file_details_SEQ.nextval, :lfn, :filesize, :events,
                     :first_event, :merged)]
[parameters: [{'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00000/92b873a5-603b-446c-b50e-4aebb8441650.root', 'filesize': 1436612, 'events': 2236, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00000/D960EA12-2C2C-11EC-B296-53878E80BEEF.root', 'filesize': 135147, 'events': 0, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00000/0059b267-4a6a-4ce3-9313-1a008b695744.root', 'filesize': 382715936, 'events': 2236, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0000/0/Express-fa0ff51e-0a73-4568-85e9-01ca1ccb896c-0-logArchive.tar.gz', 'filesize': 0, 'events': 0, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00000/be5d2ccc-6f8e-476d-921a-4aa925743197.root', 'filesize': 1479816, 'events': 2306, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00000/E4637F92-2C2C-11EC-A75F-040011ACBEEF.root', 'filesize': 135151, 'events': 0, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00000/c4b38c53-0fc0-440d-a387-abba8c6d1017.root', 'filesize': 394714912, 'events': 2306, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0000/0/Express-fcb3597a-8a81-4f02-85e9-ffba73866d56-0-logArchive.tar.gz', 'filesize': 0, 'events': 0, 'first_event': 0, 'merged': 0}  ... displaying 10 of 64 total bound parameter sets ...  
{'lfn': '/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00000/44820b14-9c79-405e-9188-47ba12f864c5.root', 'filesize': 395093657, 'events': 2308, 'first_event': 0, 'merged': 0}, {'lfn': '/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0000/0/Express-4b84d92e-b659-4339-a913-09fdc47dc356-0-logArchive.tar.gz', 'filesize': 0, 'events': 0, 'first_event': 0, 'merged': 0}]]

By the way, if that is the case, it would be good news, as they already know how to fix DQMIO, which uses an old uuid algorithm, but were surprised the others were also wrong, as they use a new algorithm.
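
The difference between the two naming schemes is visible in the names themselves. The sketch below parses three file-name stems copied from the g4 logs quoted earlier and reads the version field defined by the standard UUID layout. It only inspects the strings and says nothing definitive about how CMSSW generates them; the helper name `uuid_version` is hypothetical.

```python
import uuid

# File-name stems copied from the g4 job logs quoted earlier in this thread.
names = {
    "DQMIO": "CFACDE90-2C31-11EC-A88A-C5C08E80BEEF",
    "ALCARECO": "98e014a9-8224-49b9-b5f9-5a77fca89a16",
    "RAW": "40ec5397-771e-4478-91d0-45e7c63aec5d",
}

def uuid_version(stem):
    """Return the RFC 4122 version number encoded in a UUID string."""
    return uuid.UUID(stem).version

for tier, stem in names.items():
    print(tier, uuid_version(stem))
```

The DQMIO stem parses as version 1 (built from a timestamp and a node identifier), while the ALCARECO and RAW stems parse as version 4 (random).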

@amaltaro
Contributor

If I remember correctly - and the docstring is correct - this script is loading all the X pickle reports in the tail of the component log, listing the output files on those report files and comparing them against the files known to WMBS tables. At some point I think I also added a check for multiple job reports - from memory - with the same output LFN.

Just in case, here is the source code: https://github.com/amaltaro/ProductionTools/blob/master/removeDupJobAccountant.py

If you spot any mistake, I would be glad to follow up and get it fixed :-D

@khurtado
Contributor

khurtado commented Mar 16, 2022

@amaltaro
Here:
https://github.com/amaltaro/ProductionTools/blob/master/removeDupJobAccountant.py#L48-L49

I think it's assuming logFiles will throw unique pkl paths. However, if I look into a tier0 ComponentLog example, I do see some pkls being shown more than once. E.g.:

[khurtado@vocms047 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobAccountant]$ grep '/data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_10/Report.0.pkl'  ComponentLog
2022-03-12 03:15:56,422:139997933635328:INFO:AccountantWorker:Handling /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_10/Report.0.pkl
2022-03-12 05:08:43,681:140397941163776:INFO:AccountantWorker:Handling /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_10/Report.0.pkl

Or:

[khurtado@vocms047 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobAccountant]$ tail -n 500000  ComponentLog | grep 'install\/tier0\/JobCreator\/JobCache' | awk '{print  $3}' |  sort | uniq  -c | grep -v '1 '
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_10/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_11/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_12/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_13/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_15/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_19/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_20/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_22/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_23/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_24/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_25/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_27/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_28/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_29/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_30/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_31/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_33/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_34/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_372/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_373/Report.0.pkl
      2 /data/tier0/docker/container1/srv/wmagent/current/install/tier0/JobCreator/JobCache/Express_Run346512_StreamCalibration_Tier0_REPLAY_2022_ID220312030637_v213_220312_0307/Express/JobCollection_3_0/job_375/Report.0.pkl

So I think that will make lfn2PklDict sometimes have 1 lfn with more than 1 pkl path that is in reality the same pkl path. And that is why, in such cases, all output files from the pkl path are shown (ALCARECO, DQMIO, RAW, Log-tarball).
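
One way to avoid that false positive, sketched under the assumption that the reports arrive as (pkl path, list of LFNs) pairs. This is not the code in removeDupJobAccountant.py; the function `find_duplicate_lfns` and its input shape are hypothetical.

```python
from collections import defaultdict

def find_duplicate_lfns(reports):
    """Map each duplicated LFN to the distinct pickle paths that produced it.

    `reports` is an iterable of (pkl_path, [lfn, ...]) pairs; the same
    pkl_path may appear more than once, as it does in the ComponentLog.
    """
    lfn_to_pkls = defaultdict(set)  # a set ignores repeated pkl paths
    for pkl_path, lfns in reports:
        for lfn in lfns:
            lfn_to_pkls[lfn].add(pkl_path)
    # Only LFNs written by two or more different reports are real duplicates.
    return {lfn: sorted(pkls) for lfn, pkls in lfn_to_pkls.items() if len(pkls) > 1}
```

With this shape, a pickle that is logged twice contributes a single path per LFN, so its ALCARECO, RAW and log-tarball outputs are no longer reported.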

@amaltaro
Contributor

Thanks for this investigation, Kenyi.

I think you are right and I have just pushed in a commit to fix this issue: amaltaro/ProductionTools@af153ac

@germanfgv next time you need to run it, please make sure to fetch the latest master/head version.

@khurtado
Contributor

khurtado commented Mar 16, 2022

Thank you @amaltaro !
@germanfgv @jhonatanamado
Could you guys please submit another REPLAY to reproduce this error and check with the new change in the removeDupJobAccountant.py that Alan just pushed on github?

This is to confirm whether we are only seeing issues with DQMIO or more than that. If only DQMIO files are seen, this should be an easy fix for cmssw, as they just need to make DQMIO point to the new guid algorithm that other modules are already using.

@khurtado
Contributor

@germanfgv @jhonatanamado Just wondering if you got the chance to try another replay. Let me know if you need any additional info on this matter.

@jhonatanamado
Contributor

Hello @khurtado. I'm deploying a new replay and will give you the new results asap.

@jhonatanamado
Contributor

Hello @khurtado ,
Kenyi, you will find two logs here: /afs/cern.ch/user/j/jamadova/public/WMCore/JobAccountant
They are the ComponentLog and the log of removeDupJobAccountant.py with the changes proposed by Alan.
Let me know if you need more info.

@khurtado
Contributor

khurtado commented Mar 24, 2022

Hi @jhonatanamado
Thanks! So if I understand correctly

Found 406 unique pickle files to parse with a total of 319 output files and 1 duplicated files to process among them.
Duplicate files are:
['/store/unmerged/data/Tier0_REPLAY_2022/StreamCalibration/DQMIO/Express-v5/000/345/755/00000/2AA061C6-AB32-11EC-ADE7-B9C08E80BEEF.root']
See dupPickles.json for further details ...
Can we automatically delete those pickle files? Y/N
Y
Deleting /data/tier0/srv/wmagent/3.0.3/install/tier0/JobCreator/JobCache/Express_Run345755_StreamCalibration_Tier0_REPLAY_2022_ID220324053350_v5_220324_0535/Express/JobCollection_1_0/job_906/Report.0.pkl ...
  Done!

Now loading all LFNs from wmbs_file_details ...
Retrieved 60594 lfns from wmbs_file_details

Only 1 DQMIO was found, right?
@amaltaro Do you think we need more tests?
EDIT: For the record, after talking to Alan, we are considering DQMIO the only issue for now. If we spot issues with other modules in the future, we can create another issue with cmssw.

@jhonatanamado
Contributor

jhonatanamado commented Mar 24, 2022

Hi @khurtado, yes, JobAccountant starts failing with this file. I only deployed the replay and let it hit this first exception. The replay could produce more duplicate files, as we are used to seeing. I only posted the first exception because we are running a cronjob on all the machines (including the Production Agent) that restarts this component periodically. Do you want a full replay, to check which other files are affected once the duplicate file has been deleted and the component restarted?

@khurtado
Contributor

khurtado commented Mar 25, 2022

@jhonatanamado I have already asked cmssw to fix the DQMIO issue. As things are now and with the current tests, it seems that is the only output module problem, so let's wait for that to be fixed and if you spot more duplicated LFNs from other modules in the future, let us know.

@germanfgv
Contributor Author

@khurtado So far I have been unable to find examples of duplicate files other than DQMIO, neither in replays nor in production. I have looked at the records of around 20 JobAccountant duplicate file errors, and all of them were DQMIO files.

@drkovalskyi

Hi guys, we need to fix it asap. It affects Tier0 operations and detector commissioning. @khurtado, could you please point me to an issue that can be tracked with the CMSSW release managers, where you requested the problem to be fixed? If it was a private communication, whom did you contact and when is it expected to be fixed?

@khurtado
Contributor

@drkovalskyi Yes, here it is: cms-sw/cmssw#37240

@drkovalskyi

Thanks Kenyi.

@khurtado
Contributor

khurtado commented Apr 4, 2022

@germanfgv @drkovalskyi : Was this fixed with cms-sw/cmssw#37240 ? Can this issue be closed or is there anything needed from WMCore?

EDIT: It was reported during the WMChat meeting that no new occurrences have been seen since the fix, so closing this ticket.

@khurtado khurtado closed this as completed Apr 4, 2022