ROOT files with duplicated GUIDs observed on production T0 replay workflows #37240
Comments
A new Issue was created by @khurtado Kenyi Hurtado. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@khurtado What CMSSW version was used in these tests? Was there only one GUID collision in the replay or many? Do you have any information about the hosts that produced the files whose GUIDs collided? |
The process-level GUID is created here cmssw/FWCore/Utilities/src/processGUID.cc Lines 4 to 7 in 62d26a2
Notably this calls the edm::Guid() constructor with usetime == false, leading to the branch in cmssw/FWCore/Utilities/src/Guid.cc Lines 21 to 27 in 62d26a2 being taken. |
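For illustration, here is a minimal standalone sketch of the two libuuid paths such a constructor can take; the wrapper function name is hypothetical and this is not the actual edm::Guid implementation:

```cpp
// Hypothetical sketch of the two libuuid generation paths; the real
// edm::Guid wraps this in its own class and string formatting.
#include <uuid/uuid.h>
#include <iostream>
#include <string>

std::string makeGuid(bool usetime) {
  uuid_t id;
  if (usetime) {
    uuid_generate_time(id);    // time/clock-sequence based UUID (version 1)
  } else {
    uuid_generate_random(id);  // fully random UUID (version 4)
  }
  char buf[37];                // 36 characters plus the terminating NUL
  uuid_unparse(id, buf);
  return buf;
}

int main() {
  // The process-level GUID path discussed above passes usetime == false,
  // i.e. the random branch is taken.
  std::cout << makeGuid(false) << '\n';
}
```

Build with something like `g++ guid_sketch.cc -luuid`.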
@makortel You can find logs of different sets of jobs generating files with the same GUID here: The logs seem to cover 2 replays, one with many incidents, involving 2 files/jobs but one of them (in
The condor logfiles have information about the hosts running these jobs. The collisions involve different condor jobs, but running on the same host for each collision incident:
And they involve the same or almost the same runtime date:
|
Thanks @khurtado. Can you tell if the affected hosts have |
I unfortunately don't have direct ssh access to these machines. I can probably submit some condor jobs matching to these hosts, but I don't know how long they would take to match and run. I will check if there is anybody with direct access to them for a quick check or submit those jobs otherwise. Do you know if |
Am I reading the logs correctly that the duplicates are all in the DQMIO output? DQMIO does not use the code Matti pointed to; it calls cmssw/DQMServices/FwkIO/plugins/DQMRootOutputModule.cc Lines 327 to 338 in ce4f5b2
That UUID comes from ROOT. So, if it is just the DQMIO getting duplicates, that's not unexpected, and the issue is with the DQM use of the ROOT UUID. |
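As a rough illustration of where that UUID comes from in the DQMIO case, here is a small standalone sketch assuming a plain ROOT installation; this is not the DQMRootOutputModule code itself:

```cpp
// Sketch: every TFile gets a ROOT-generated TUUID when it is created,
// and that UUID is what the DQMIO output ends up reporting as the file GUID.
#include "TFile.h"
#include "TUUID.h"
#include <iostream>

int main() {
  TFile f("dqm_sketch.root", "RECREATE");
  // ROOT assigns the UUID internally via TUUID (a time-based scheme),
  // independently of the framework's libuuid-based GUIDs.
  std::cout << f.GetUUID().AsString() << std::endl;
  f.Close();
  return 0;
}
```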
On a closer look my reference to cmssw/IOPool/Output/src/RootOutputFile.cc Line 199 in ce4f5b2
cmssw/IOPool/Output/src/RootOutputFile.cc Lines 230 to 239 in ce4f5b2
cmssw/FWCore/Utilities/src/GlobalIdentifier.cc Lines 5 to 8 in ce4f5b2
I see I changed the DQMIO output behavior from Then I see #28622 removed the whole call to So on one hand, we could easily change the |
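A minimal sketch of the direction suggested here, assuming the usual header location and default signature of edm::createGlobalIdentifier; the helper function around it is illustrative, not the actual DQMRootOutputModule change:

```cpp
// Sketch: take the file GUID from the framework's own generator
// (libuuid-backed) instead of the ROOT TUUID attached to the TFile.
#include "FWCore/Utilities/interface/GlobalIdentifier.h"
#include <string>

std::string makeDqmFileGuid() {
  // createGlobalIdentifier() is the same mechanism the framework's other
  // output files rely on, so DQMIO files would follow the same
  // (random-UUID) scheme and avoid the TUUID collisions described above.
  return edm::createGlobalIdentifier();
}
```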
@dan131riley We do see more occurrences in DQMIO, but if you look at the json in the
(The logArchive.tar.gz are using our own Question from someone not knowledgeable on the topic: If we have this known problem with DQMIO, why can't this output type use the new UUID generation algorithm too? |
(FYI @cms-sw/dqm-l2 since we mentioned DQM modules) |
It seems weird that the dups are either just DQMIO, or all of ALCARECO, DQMIO, RAW, and the logArchive. The actual logs only cover the DQMIO case--can we get some logs for jobs where the whole set is duplicated? I think we understand the DQMIO-only case, but the other case is still a mystery to me.
It got overlooked--until I started investigating this issue, I hadn't realized |
@makortel @dan131riley Sorry for the delay. We found a bug in the script that finds these duplicated LFNs. After fixing it, it seems the only problem is DQMIO. |
Hi everyone. This issue affects the Tier0 operations and the detector commissioning. Is there any chance to speed it up? |
@cms-sw/dqm-l2 @makortel Any follow up on this? How about the fix proposed by Matti #37240 (comment) ? |
@dan131riley is looking into this |
We will need to switch back to the |
Thanks Dan. |
If this can be available by tomorrow, as you write, we can wait for building CMSSW_12_2_2 till then (otherwise we would be even ready to build it even now) |
Andrea, this problem is what caused Express failure last weekend. Our operators noticed it by chance (we don't normally have coverage during weekends). The risk is that it may happen again and we can be blind for a few days. If you have no strong pressure to release 12_2_2 now, let's wait for the fix. Otherwise we will need a patch release right after that. |
That is exactly what I meant: if the PR is ready by tomorrow(ish), we can wait and build 12_2_2 with it. |
#37405 has the PR for master (comments and criticisms are welcome). Backports will likely wait until tomorrow morning (US EDT). |
Hi @perrotta @dan131riley, thank you for your prompt actions. I see that this has been backported into |
@germanfgv yes, this will be included in CMSSW_12_2_2_patch1. We are just trying to include another PR: #37417 (to be backported). Hopefully 12_2_2_patch1 can be made over the weekend. |
The issue is that we need the patch release before the weekend. Without this fix we risk Tier0 going down during the weekend. Is there any chance to get it today? |
I think it would be better to proceed without #37417 (even if it would lead to two separate patch releases). |
CMSSW_12_2_3 is going to be built: #37433 |
@perrotta the release is built! How long will it be until it is updated in /cvmfs? |
ohh.. I see the process is still going on for |
At 3pm we said that the release would take "a few hours" and that it would hopefully be ready before the evening at Fermilab: we are still on track, I guess... |
Yea, sorry I missed the fact that it was a full build at 3pm - it seems to rely on someone working through the FNAL evening to make this converge (at best). Given that no one spoke up for this extra development, it's unfortunate that we make someone work Friday evening just for it.
|
Probably a good reason to recover the ability to start branches in GitHub to facilitate patches in cases where the tip of the release branch has moved forward in some way. Useful for data taking.
|
@davidlange6 CMSSW_12_2_2 full build took less than six hours two days ago. That's why I asked today at the joint meeting by when it was needed, and the answer was "before this evening at Fermilab". If I am not wrong, it is now almost noon at Fermilab, and I think we are perfectly on schedule for what was asked for. |
Release CMSSW_12_2_3 is ready, see https://github.com/cms-sw/cmssw/releases/tag/CMSSW_12_2_3 |
Thanks @perrotta. It is not in |
Ping…can this issue be resolved? |
@germanfgv @drkovalskyi Could you confirm that the issues with duplicated GUIDs are gone? (I'd assume so from the silence, but would like to check explicitly) Thanks! |
Hello @makortel , From Tier0 we can confirm that the issues with duplicated GUIDs are gone since CMSSW_12_2_3. Thanks. |
+1 Thanks @jhonatanamado |
This issue is fully signed and ready to be closed. |
This is related to a WMCore issue:
dmwm/WMCore#10870
Bug description
When deploying T0 replays with a significant number of jobs, one of the WMCore components fails, complaining about duplicated LFNs. Our LFN patterns look like this:
where:
2AE85F14-94A1-EC11-BBF5-FA163EC7AA59
is the GUID extracted from the ROOT file through the framework XML job report. So we are basically observing 2 different jobs generating files with the same GUID.
We get the GUID from the framework XML job report here:
And since the GUID from the FW report seems to be generated here:
https://github.com/cms-sw/cmssw/blob/master/FWCore/Utilities/src/Guid.cc#L18-L28
I'm reporting the issue here.
How to reproduce
Deploy a Tier0 replay with a significant number of jobs. I think @germanfgv can help with this if needed. Lately, at least one incident per week has been reported this year.