T0 JobAccountant crashes due to multiple jobs generating same file #10870
Comments
Possibly related to (if not the same root cause): #9633 |
@amaltaro from what I understand, in #9633 duplicate files are created with a default name, and that causes the JobAccountant to crash. In our case, however, that doesn't seem to be happening. For example, for jobs in
Could this still have the same root cause? |
Hi @german, it could still be related to #9633, even though the duplicate filename is not due to some defaults as in the issue above. I hope we can get back to it during this week. Do you still have some logs on a machine which is experiencing it? I am not sure how easily reproducible the error is. |
It is fairly easy to reproduce. It appears in every replay we run. Just tell me what you need I'm sure I can provide it for you. |
In order to fix it, I think we need to start making the uuid a function of:
It should be possible to adapt it with |
Thanks Alan! |
Here are my findings: At runtime, we make a PSet tweak to change the name of the output files based on the output modules:
WMCore/src/python/PSetTweaks/WMTweak.py Lines 522 to 524 in baf7ae5
So, the output files have a pattern like:
After cmsRun is done, we change the lfn of the file, and with that the filename at stageout. So the file then looks like:
But here is the thing, we don't set the WMCore/src/python/WMCore/FwkJobReport/FileInfo.py Lines 110 to 111 in 6cab2cb
The guid generated by FWCore: And we can't just change the file https://github.com/cms-sw/cmssw-wm-tools/blob/master/bin/cmssw_enforce_guid_in_filename.py So it seems the best solution here is to report the issue to cmssw and have it fixed on that end. |
@germanfgv Do you happen to have or know anyone with direct access to the
CMSSW is asking to check the availability of |
@germanfgv From your logfiles on:
Would you be able to get a pair of job logfiles with outputs other than DQMIO? For example, your The cmssw folks are requesting this on: cms-sw/cmssw#37240 To summarize: It looks like the DQMIO issue is understood, but they need more information for the ALCARECO, RAW, etc. |
@khurtado please check the logs that you can find here: I'll check if we have more logs. If not, we can try and generate more examples. |
Hi @germanfgv. So, for
I couldn't find duplicated names for ALCARECO and RAW though, am I perhaps looking in the wrong way?
|
According to file
Back in the day, I also couldn't find them directly, that's why I just copied the logs for the DQM jobs. Maybe @amaltaro can clarify to us what this "dup_lfns" list means exactly. |
@germanfgv That's a good point. Looking at all the guids in the json for
Maybe what happens is whenever there is a SQL WMBS error, dupLFNs lists all the LFNs listed in the job parameters?
By the way, if that is the case, it would be good news, as they already know how to fix DQMIO, which uses an old uuid algorithm, but were surprised the others were also wrong, as they use a new algorithm. |
If I remember correctly - and the docstring is correct - this script is loading all the X pickle reports in the tail of the component log, listing the output files on those report files and comparing them against the files known to WMBS tables. At some point I think I also added a check for multiple job reports - from memory - with the same output LFN. Just in case, here is the source code: https://github.com/amaltaro/ProductionTools/blob/master/removeDupJobAccountant.py If you spot any mistake, I would be glad to follow up and get it fixed :-D |
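For readers following along, the "same output LFN in multiple job reports" check described above can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in removeDupJobAccountant.py: the function name `find_duplicate_lfns`, the plain-dict report layout, the job ids and the LFNs are all hypothetical, and the real script reads pickled job reports and also compares against the WMBS tables.

```python
from collections import defaultdict


def find_duplicate_lfns(reports):
    """Return {lfn: sorted job ids} for every LFN claimed by more than one job.

    `reports` maps a job id to the list of output LFNs found in that job's
    report. This layout is an assumption made for the sketch.
    """
    jobs_by_lfn = defaultdict(set)
    for job_id, lfns in reports.items():
        for lfn in lfns:
            jobs_by_lfn[lfn].add(job_id)
    # Only LFNs produced by two or more distinct jobs are real duplicates.
    return {lfn: sorted(jobs) for lfn, jobs in jobs_by_lfn.items() if len(jobs) > 1}


# Hypothetical job reports; the job ids and LFNs are made up for illustration.
reports = {
    101: ["/store/example/DQMIO/aaaa.root", "/store/example/ALCARECO/bbbb.root"],
    102: ["/store/example/DQMIO/aaaa.root", "/store/example/ALCARECO/cccc.root"],
    103: ["/store/example/RAW/dddd.root"],
}
duplicates = find_duplicate_lfns(reports)
print(duplicates)
```

Keying the result by LFN rather than by job means that a job which merely shares one duplicated output does not drag its other, unique outputs into the list.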
@amaltaro I think it's assuming
Or:
So I think that will make |
Thanks for this investigation, Kenyi. I think you are right and I have just pushed in a commit to fix this issue: amaltaro/ProductionTools@af153ac @germanfgv next time you need to run it, please make sure to fetch the latest master/head version. |
Thank you @amaltaro! This is to confirm whether we are only seeing issues with DQMIO or more than that. If only DQMIO files are affected, this should be an easy fix for cmssw, as they just need to make DQMIO point to the new guid algorithm that other modules are already using. |
@germanfgv @jhonatanamado Just wondering if you got the chance to try another replay. Let me know if you need any additional info on this matter. |
Hello @khurtado. I'm deploying a new replay and will give you the new results asap. |
Hello @khurtado , |
Hi @jhonatanamado
Only 1 DQMIO was found, right? |
Hi @khurtado, yes, JobAccountant starts failing with this issue on that file. I only deployed the replay and let it hit this first exception. The replay could find more duplicate files, as we are used to seeing. I only posted the first exception because we are running a cronjob on all the machines (including the Production Agent) that restarts this component periodically. Do you want a full replay, to check which other files are affected once the duplicate file is deleted and the component is restarted? |
@jhonatanamado I have already asked cmssw to fix the DQMIO issue. As things are now and with the current tests, it seems that is the only output module problem, so let's wait for that to be fixed and if you spot more duplicated LFNs from other modules in the future, let us know. |
@khurtado So far I have been unable to find examples of duplicate files other than DQMIO, neither in replays nor in production. I have looked at the records for around 20 JobAccountant duplicate file errors, and all of them were DQMIO files. |
Hi guys, we need to fix it asap. It affects Tier0 operations and detector commissioning. @khurtado, could you please point me to an issue that can be tracked with the CMSSW release managers, where you requested the problem to be fixed? If it was a private communication, whom did you contact and when is it expected to be fixed? |
@drkovalskyi Yes, here it is: cms-sw/cmssw#37240 |
Thanks Kenyi. |
@germanfgv @drkovalskyi : Was this fixed with cms-sw/cmssw#37240 ? Can this issue be closed or is there anything needed from WMCore? EDIT: It was reported during the WMChat meeting that no new occurrences have been seen since the fix, so closing this ticket. |
Impact of the bug
T0Agent
Describe the bug
Two or more jobs create output files with the same name. JobAccountant tries to add them to the DB and fails due to a
ORA-00001: unique constraint (WMBS_FILDETAILS_UNIQUE) violated
We have seen the issue affecting Express jobs, but it may be affecting Repack and PromptReco too.
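The failure mode can be illustrated with a minimal sketch. It uses an in-memory SQLite database as a stand-in for the Oracle backend, and the table name `file_details`, the helper `register_file` and the LFN are hypothetical, not the actual WMBS schema or JobAccountant code. It only shows how a unique constraint on the file name rejects the second job's insert.

```python
import sqlite3

# In-memory SQLite stand-in for the Oracle database; the table below is a
# hypothetical simplification, not the real WMBS schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_details (id INTEGER PRIMARY KEY, lfn TEXT NOT NULL UNIQUE)"
)


def register_file(conn, lfn):
    """Insert one LFN; return True on success, False if it already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO file_details (lfn) VALUES (?)", (lfn,))
        return True
    except sqlite3.IntegrityError:
        # SQLite's counterpart of a unique constraint violation such as ORA-00001.
        return False


# Two jobs reporting the same (made-up) output LFN.
first = register_file(conn, "/store/example/DQMIO/aaaa.root")
second = register_file(conn, "/store/example/DQMIO/aaaa.root")
print(first, second)
```

In this sketch the duplicate is caught and reported instead of being allowed to propagate, which is only one possible way of handling it.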
How to reproduce it
Deploy a Tier0 replay with a significant number of jobs. As soon as the first batches of Express jobs finish, the JobAccountant will crash.
Expected behavior
Each job should create files with unique names
Additional context and error message
Here you can find the full error message of the component
ComponentLog.txt
Full JobAccountant logs can be found here:
/afs/cern.ch/user/c/cmst0/public/JobAccountant/JobAccountantLogs
Logs of a set of jobs generating files with the same name:
/afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs