
ROOT files with duplicated GUIDs observed on production T0 replay workflows #37240

Closed
khurtado opened this issue Mar 14, 2022 · 41 comments

@khurtado

khurtado commented Mar 14, 2022

This is related to a WMCore issue:

dmwm/WMCore#10870

Bug description
When deploying T0 replays with a significant number of jobs, one of the WMCore components fails, complaining about duplicated LFNs. Our LFN patterns look like this:

/store/unmerged/HG2202_Val/RelValProdMinBias/GEN-SIM/HG2202_Val_OLD_Alanv4-v22/00000/2AE85F14-94A1-EC11-BBF5-FA163EC7AA59.root

where: 2AE85F14-94A1-EC11-BBF5-FA163EC7AA59 is the GUID extracted from the ROOT file through the framework XML job report.

So we are basically observing 2 different jobs generating files with the same GUID.
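For illustration only (the production code lives in WMCore and is Python, not shown here), the GUID is simply the filename stem of the LFN; a minimal sketch of the extraction, with the LFN above hard-coded:

#include <filesystem>
#include <iostream>
#include <string>

int main() {
  // Example LFN from above; the GUID is the file name without the .root extension.
  const std::string lfn =
      "/store/unmerged/HG2202_Val/RelValProdMinBias/GEN-SIM/"
      "HG2202_Val_OLD_Alanv4-v22/00000/2AE85F14-94A1-EC11-BBF5-FA163EC7AA59.root";
  const std::string guid = std::filesystem::path(lfn).stem().string();
  std::cout << guid << '\n';  // prints 2AE85F14-94A1-EC11-BBF5-FA163EC7AA59
}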
We get the GUID from the framework XML job report here:

And since the GUID from the FW report seems to be generated here:
https://github.com/cms-sw/cmssw/blob/master/FWCore/Utilities/src/Guid.cc#L18-L28

I'm reporting the issue here.

How to reproduce
Deploy a Tier0 replay with a significant number of jobs. I think @germanfgv can help with this if needed. Lately, at least one incident per week has been reported this year.

@cmsbuild
Contributor

A new Issue was created by @khurtado Kenyi Hurtado.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign core

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@khurtado khurtado changed the title ROOT files with duplicated GUIDs observed on priductioin T0 replay workflows ROOT files with duplicated GUIDs observed on production T0 replay workflows Mar 14, 2022
@makortel
Contributor

@khurtado What CMSSW version was used in these tests? Was there only one GUID collision in the replay or many? Do you have any information about the hosts that produced the files whose GUIDs collided?

@makortel
Contributor

The process-level GUID is created here

Guid const& processGUID() {
  static Guid const guid;
  return guid;
}

Notably, this calls the edm::Guid() constructor with usetime == false, leading to the
  } else {
    // uuid_generate() defaults to uuid_generate_random() if /dev/urandom
    // is available; if /dev/urandom is not available, then it is better
    // to let uuid_generate() choose the best fallback rather than forcing
    // use of an inferior source of randomness
    ::uuid_generate(data_);
  }

branch being taken.
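For reference, a standalone sketch (not the edm::Guid code itself) of what that branch amounts to: libuuid's uuid_generate() falls back to a random, version-4 UUID whenever /dev/urandom is available (compile and link with -luuid):

#include <uuid/uuid.h>  // libuuid; link with -luuid
#include <cstdio>

int main() {
  uuid_t data;             // 16 raw bytes, analogous to edm::Guid's data_ member
  ::uuid_generate(data);   // random-based (version 4) when /dev/urandom is available
  char text[37];           // 36 characters plus the terminating NUL
  ::uuid_unparse_upper(data, text);
  std::printf("%s\n", text);
}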

@khurtado
Author

khurtado commented Mar 14, 2022

@makortel
They were using:
CMSSW_12_0_2_patch1 with scram arch: slc7_amd64_gcc900

You can find logs of different sets of jobs generating files with the same GUID here:
/afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs

The logs seem to cover 2 replays, one of them with many incidents. Each incident involves 2 files/jobs, except for one (in the g5 directory) that involves 3 files/jobs:

[khurtado@lxplus708 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs]$ find . -name "dupPickles.json" -exec head -n 2 {} \;
{
  "/data/tier0/srv/wmagent/3.0.1/install/tier0/JobCreator/JobCache/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/JobCollection_1_1/job_1058/Report.0.pkl": {
{
  "/data/tier0/srv/wmagent/3.0.1/install/tier0/JobCreator/JobCache/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/JobCollection_1_1/job_1064/Report.0.pkl": {
{
  "/data/tier0/srv/wmagent/3.0.1/install/tier0/JobCreator/JobCache/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/JobCollection_1_1/job_1038/Report.0.pkl": {
{
  "/data/tier0/srv/wmagent/3.0.1/install/tier0/JobCreator/JobCache/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/JobCollection_1_0/job_527/Report.0.pkl": {
{
  "/data/tier0/srv/wmagent/3.0.1/install/tier0/JobCreator/JobCache/Express_Run344063_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1548/Express/JobCollection_5_1/job_4623/Report.0.pkl": {

The condor logfiles have information about the hosts running these jobs. The collisions involve different condor jobs, but running on the same host for each collision incident:

[khurtado@lxplus708 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs]$ find . -name "condor*out" -exec grep -H Hostname {} \; | sort
./g1/job_1038/condor.41731.37.out:Hostname:   b7g18p9798.cern.ch
./g1/job_1048/condor.41731.47.out:Hostname:   b7g18p9798.cern.ch
./g2/job_98/condor.41726.97.out:Hostname:   b7g18p7310.cern.ch
./g2/job_99/condor.41726.98.out:Hostname:   b7g18p7310.cern.ch
./g3/job_1137/condor.41731.136.out:Hostname:   b7g17p4406.cern.ch
./g3/job_1138/condor.41731.137.out:Hostname:   b7g17p4406.cern.ch
./g4/job_1169/condor.41731.168.out:Hostname:   b7g18p3673.cern.ch
./g4/job_1170/condor.41731.169.out:Hostname:   b7g18p3673.cern.ch
./g5/job_527/condor.41729.133.out:Hostname:   b7g10p4995.cern.ch
./g5/job_530/condor.41729.136.out:Hostname:   b7g10p4995.cern.ch
./g5/job_531/condor.41729.137.out:Hostname:   b7g10p4995.cern.ch
./g6/job_4623/condor.41780.47.out:Hostname:   b7g17p1733.cern.ch
./g6/job_4635/condor.41780.59.out:Hostname:   b7g17p1733.cern.ch

And they ran at the same or almost the same local time:

[khurtado@lxplus708 /afs/cern.ch/user/c/cmst0/public/JobAccountant/jobs]$ find . -name "condor*out" -exec grep -H "Local time" {} \; | sort
./g1/job_1038/condor.41731.37.out:Local time: Wed Oct 13 16:25:29 CEST 2021
./g1/job_1048/condor.41731.47.out:Local time: Wed Oct 13 16:25:29 CEST 2021
./g2/job_98/condor.41726.97.out:Local time: Wed Oct 13 15:58:51 CEST 2021
./g2/job_99/condor.41726.98.out:Local time: Wed Oct 13 15:58:53 CEST 2021
./g3/job_1137/condor.41731.136.out:Local time: Wed Oct 13 16:26:53 CEST 2021
./g3/job_1138/condor.41731.137.out:Local time: Wed Oct 13 16:26:53 CEST 2021
./g4/job_1169/condor.41731.168.out:Local time: Wed Oct 13 16:27:32 CEST 2021
./g4/job_1170/condor.41731.169.out:Local time: Wed Oct 13 16:27:31 CEST 2021
./g5/job_527/condor.41729.133.out:Local time: Wed Oct 13 16:18:54 CEST 2021
./g5/job_530/condor.41729.136.out:Local time: Wed Oct 13 16:18:54 CEST 2021
./g5/job_531/condor.41729.137.out:Local time: Wed Oct 13 16:18:54 CEST 2021
./g6/job_4623/condor.41780.47.out:Local time: Wed Oct 13 17:34:21 CEST 2021
./g6/job_4635/condor.41780.59.out:Local time: Wed Oct 13 17:34:21 CEST 2021

@makortel
Contributor

Thanks @khurtado. Can you tell if the affected hosts have /dev/urandom? Or check contents of /proc/sys/kernel/random/entropy_avail on them?

@khurtado
Author

khurtado commented Mar 14, 2022

Unfortunately, I don't have direct ssh access to these machines. I can probably submit some condor jobs targeting these hosts, but I don't know how long they would take to match and run. I will check whether anybody has direct access to them for a quick check, or submit those jobs otherwise.

By the way, do you know whether /dev/urandom and /proc/sys/kernel/random/entropy_avail can vary depending on whether the jobs run inside a Singularity container or not? We run these jobs inside containers in production.

@dan131riley

Am I reading the logs correctly, that the duplicates are all in the DQMIO output? DQMIO does not use the code Matti pointed to; it calls TFile::GetUUID() instead:

void DQMRootOutputModule::openFile(edm::FileBlock const&) {
  //NOTE: I need to also set the I/O performance settings
  m_file = std::make_unique<TFile>(m_fileName.c_str(),
                                   "RECREATE",
                                   "1"  //This is the file format version number
  );

  edm::Service<edm::JobReport> jr;
  cms::Digest branchHash;
  std::string guid{m_file->GetUUID().AsString()};
  std::transform(guid.begin(), guid.end(), guid.begin(), (int (*)(int))std::toupper);

That UUID comes from TUUID(); see the comments here: https://github.com/root-project/root/blob/bd7c94375627ae10a401543ac84f296f5386c874/core/base/src/TUUID.cxx#L24-L30 TUUID() only implements the older UUID generation algorithm that uses the system time and ethernet address. We actually had to switch away from that algorithm to the one that uses /dev/urandom because we were getting duplicates in Docker containers, as containers apparently don't guarantee clock monotonicity across different containers on the same system.

So, if it is just the DQMIO getting duplicates, that's not unexpected and the issue is with the DQM use of the ROOT UUID.
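As an editorial aside, the two generators can be told apart directly from the GUID text: the UUID version nibble (the first hex digit of the third group) is 1 for the time-based TUUID algorithm and 4 for the random uuid_generate() one. A minimal sketch of the check, using GUID values quoted elsewhere in this issue:

#include <iostream>
#include <string>

// Return the RFC 4122 version digit of a UUID in the standard 8-4-4-4-12 text form.
char uuidVersion(const std::string& uuid) {
  return uuid.at(14);  // first character of the third group
}

int main() {
  std::cout << uuidVersion("891C47F4-2C31-11EC-BA14-95C08E80BEEF") << '\n';  // '1': time-based
  std::cout << uuidVersion("a48e105e-4352-4d0e-a084-cb077219d1dd") << '\n';  // '4': random
}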

@makortel
Contributor

DQMIO does not use the code Matti pointed to,

On closer look, my reference to edm::processGUID() was incorrect anyway; RootOutputFile doesn't use it either, but calls edm::createGlobalIdentifier():

  fid_ = FileID(createGlobalIdentifier());

  Service<JobReport> reportSvc;
  reportToken_ = reportSvc->outputFileOpened(file_,
                                             logicalFile_,                // PFN and LFN
                                             om_->catalog(),              // catalog
                                             moduleName,                  // module class name
                                             om_->moduleLabel(),          // module label
                                             fid_.fid(),                  // file id (guid)
                                             std::string(),               // data type (not yet known, so string is empty)
                                             md5alg.digest().toString(),  // branch hash
                                             branchNames);                // branch names being written

edm::createGlobalIdentifier() itself is just

  std::string createGlobalIdentifier(bool binary) {
    Guid guid;
    return binary ? guid.toBinary() : guid.toString();
  }

I see I changed the DQMIO output behavior from edm::createGlobalIdentifier() to TFile::GetUUID() in #28551, to fix an inconsistency between DQMIO input and output that was reported in #28534.

Then I see #28622 removed the whole call to edm::JobReport::inputFileOpened() from DQMRootSource (?!) that used TFile::GetUUID().

So on one hand, we could easily change DQMRootOutputModule to use edm::createGlobalIdentifier() without worrying about the consistency issue raised in #28534; but on the other hand, should DQMRootSource report inputFileOpened()?
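A rough sketch of that first option, hedged: it simply swaps the TUUID-derived string for the framework GUID. The header path and helper name here are assumptions for illustration; the actual change is the PR referenced later in this thread.

// Sketch only: produce the output-file GUID the way RootOutputFile does,
// instead of via TFile::GetUUID(). Header path assumed to be
// FWCore/Utilities/interface/GlobalIdentifier.h.
#include "FWCore/Utilities/interface/GlobalIdentifier.h"

#include <algorithm>
#include <cctype>
#include <string>

std::string makeOutputFileGuid() {
  std::string guid = edm::createGlobalIdentifier();  // random (version-4) UUID via uuid_generate()
  // DQMRootOutputModule upper-cases the GUID before reporting it to the JobReport.
  std::transform(guid.begin(), guid.end(), guid.begin(), (int (*)(int))std::toupper);
  return guid;
}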

@khurtado
Author

khurtado commented Mar 14, 2022

@dan131riley We do see more occurrences in DQMIO, but if you look at the JSON in the g3 or g4 directories from the T0 folks, you will see there are other datatiers with the problem, e.g.:

  "dup_lfns": [
      "/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/a48e105e-4352-4d0e-a084-cb077219d1dd.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/ALCARECO/Express-v2110131547/000/343/082/00001/a48e105e-4352-4d0e-a084-cb077219d1dd.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/891C47F4-2C31-11EC-BA14-95C08E80BEEF.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/StreamCalibration/DQMIO/Express-v2110131547/000/343/082/00001/891C47F4-2C31-11EC-BA14-95C08E80BEEF.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/523ec1e8-286f-47b2-b0a0-5534dbd7c492.root",
      "/store/unmerged/data/Tier0_REPLAY_2021/TestEnablesEcalHcal/RAW/Express-v2110131547/000/343/082/00001/523ec1e8-286f-47b2-b0a0-5534dbd7c492.root",
      "/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0001/0/Express-ac5ffd57-dc01-493f-ba17-951d97d7a523-0-logArchive.tar.gz",
      "/store/unmerged/data/logs/prod/2021/10/13/Express_Run343082_StreamCalibration_Tier0_REPLAY_2021_v2110131547_211013_1547/Express/0001/0/Express-ac5ffd57-dc01-493f-ba17-951d97d7a523-0-logArchive.tar.gz"
    ]
  },

(The logArchive.tar.gz files use our own UUID mechanism from Python, so we need to fix that independently of the ROOT files.)

Question from someone not knowledgeable on the topic: If we have this known problem with DQMIO, why can't this output type use the new UUID generation algorithm too?

@makortel
Contributor

(FYI @cms-sw/dqm-l2 since we mentioned DQM modules)

@dan131riley

@dan131riley We do see more occurrences in DQMIO, but if you look at the JSON in the g3 or g4 directories from the T0 folks, you will see there are other datatiers with the problem.

It seems weird that the dups are either just DQMIO, or all of ALCARECO, DQMIO, RAW, and the logArchive. The actual logs only cover the DQMIO case--can we get some logs for jobs where the whole set is duplicated? I think we understand the DQMIO-only case, but the other case is still a mystery to me.

Question from someone not knowledgeable on the topic: If we have this known problem with DQMIO, why can't this output type use the new UUID generation algorithm too?

It got overlooked--until I started investigating this issue, I hadn't realized TFile::GetUUID() was using the old algorithm.

@khurtado
Author

khurtado commented Mar 24, 2022

@makortel @dan131riley Sorry for the delay. We found a bug in the script that finds these duplicated LFNs. After fixing it, it seems the only problem is DQMIO.

@drkovalskyi
Contributor

Hi everyone. This issue affects the Tier0 operations and the detector commissioning. Is there any chance to speed it up?

@qliphy
Contributor

qliphy commented Mar 29, 2022

@cms-sw/dqm-l2 @makortel Any follow-up on this? How about the fix proposed by Matti in #37240 (comment)?

@makortel
Contributor

@dan131riley is looking into this

@dan131riley

We will need to switch back to createGlobalIdentifier(). I'm still looking into whether we can restore the DQMIO files to the input-files FJR; I should have a PR tomorrow at the latest.

@drkovalskyi
Contributor

Thanks Dan.

@perrotta
Contributor

We will need to switch back to createGlobalIdentifier(). I'm still looking into whether we can restore the DQMIO files to the input-files FJR; I should have a PR tomorrow at the latest.

If this can be available by tomorrow, as you write, we can wait until then to build CMSSW_12_2_2 (otherwise we would be ready to build it even now).

@drkovalskyi
Contributor

Andrea, this problem is what caused the Express failure last weekend. Our operators noticed it by chance (we don't normally have coverage during weekends). The risk is that it may happen again and we could be blind for a few days. If you have no strong pressure to release 12_2_2 now, let's wait for the fix. Otherwise we will need a patch release right after it.

@perrotta
Contributor

Andrea, this problem is what caused the Express failure last weekend. Our operators noticed it by chance (we don't normally have coverage during weekends). The risk is that it may happen again and we could be blind for a few days. If you have no strong pressure to release 12_2_2 now, let's wait for the fix. Otherwise we will need a patch release right after it.

That is exactly what I meant: if the PR is ready by tomorrow(ish), we can wait and build 12_2_2 with it.
Otherwise we build 12_2_2 now, since there are other requests for it, and build another patch release later when the fix is available (not the preferred option).

@dan131riley

#37405 has the PR for master (comments and criticisms are welcome). Backports will likely wait until tomorrow morning (US EDT).

@germanfgv
Contributor

Hi @perrotta @dan131riley, thank you for your prompt actions. I see that this has been backported into CMSSW_12_2_X. I would like to know when the patch release will be available. I'm guessing it will be CMSSW_12_2_2_patch1.

@qliphy
Contributor

qliphy commented Apr 1, 2022

Hi @perrotta @dan131riley, thank you for your prompt actions. I see that this has been backported into CMSSW_12_2_X. I would like to know when the patch release will be available. I'm guessing it will be CMSSW_12_2_2_patch1.

@germanfgv Yes, this will be included in CMSSW_12_2_2_patch1. We are just trying to include another PR: #37417 (to be backported). Hopefully 12_2_2_patch1 can be made over the weekend.

@drkovalskyi
Contributor

The issue is that we need the patch release before the weekend. Without this fix we risk Tier0 going down during the weekend. Is there any chance to get it today?

@makortel
Contributor

makortel commented Apr 1, 2022

I think it would be better to proceed without #37417 (even if it would lead to two separate patch releases).

@perrotta
Contributor

perrotta commented Apr 1, 2022

CMSSW_12_2_3 is going to be built: #37433

@germanfgv
Contributor

@perrotta The release is built! How long will it be until it is available in /cvmfs?

@germanfgv
Contributor

Oh, I see the build process is still going on for slc7_amd64_gcc900. Never mind!

@perrotta
Contributor

perrotta commented Apr 1, 2022

@perrotta The release is built! How long will it be until it is available in /cvmfs?

At 3 pm we said that the release would take "a few hours" and that it would hopefully be ready before the evening at Fermilab: we are still on track, I guess...

@davidlange6
Contributor

davidlange6 commented Apr 1, 2022 via email

@davidlange6
Contributor

davidlange6 commented Apr 1, 2022 via email

@perrotta
Contributor

perrotta commented Apr 1, 2022

@davidlange6 The CMSSW_12_2_2 full build took less than six hours two days ago. That's why I asked today at the joint meeting by when it was needed, and the answer was "before this evening at Fermilab". If I am not wrong, it is now almost noon at Fermilab, and I think we are perfectly on schedule for what was asked for.

@perrotta
Contributor

perrotta commented Apr 1, 2022

Release CMSSW_12_2_3 is ready, see https://github.com/cms-sw/cmssw/releases/tag/CMSSW_12_2_3

@germanfgv
Contributor

Thanks @perrotta. It is not in cvmfs yet, but we will try it as soon as it gets there.

@dan131riley

Ping…can this issue be resolved?

@makortel
Contributor

@germanfgv @drkovalskyi Could you confirm that the issues with duplicated GUIDs are gone? (I'd assume so from the silence, but would like to check explicitly) Thanks!

@jhonatanamado

Hello @makortel, from Tier0 we can confirm that the issues with duplicated GUIDs have been gone since CMSSW_12_2_3. Thanks.

@makortel
Contributor

+1

Thanks @jhonatanamado

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.
