Agents continuously failing to insert blocks into DBS #11965
Comments
@todor-ivanov as discussed in the meeting today - and right now with Andrea as well - let us put this back to ToDo and come back to it at the beginning of October (2 more weeks should not hurt us here).
Following a discussion in the Mattermost wm-ops thread with @amaltaro, related to the failure to insert data into DBS: the current T0 production agent is struggling to insert files into blocks. I see the following error message in the DBS3Upload component log:
This is present for the following blocks:
I suggest that you review #11106, which describes the actual issue with concurrent data insertion. In short, to make it work all pieces (like the dataset configuration, etc.) must be in place before concurrent injection can happen. To solve this problem, someone must first inject one block with all the necessary information, and only then can the concurrent pattern be used safely to inject the other blocks.
@vkuznet thank you for jumping into this discussion. I had a feeling that there was another obscure problem with DBS Server, and reviewing the ticket you pointed to (11106) - and according to your sentence above - I understand that, provided we have at least 1 block injected into DBS for a given dataset, the "concurrency error" should no longer happen, given that all the foundational information is already in the database. Correct? I picked one of the blocks provided by Antonio and queried DBS Server for the blocks of its dataset: as you can see, this dataset already has a bunch of blocks in the database. So, how come we are having a "concurrency error" here?
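For reference, the lookup boils down to something like the following (a sketch with the dbs3-client API, assuming a valid X509 proxy/cert; the dataset name is just a placeholder, not the real one):

```python
# Sketch: list the blocks DBS already knows for a given dataset.
from dbs.apis.dbsClient import DbsApi

dbsApi = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader")
dataset = "/SomePrimaryDataset/SomeProcessedDataset-v1/RAW"  # placeholder
for blk in dbsApi.listBlocks(dataset=dataset):
    print(blk["block_name"])
```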
If you inspect the code [1], in order to insert DBS blocks concurrently we need to have the following in place:
So, if all of this information is present and consistent across all blocks in DBS, then the answer is yes, the concurrency error (based on database content) should not arise. In other words, the DBS server first acquires or inserts this info into the DBS tables, and if two or more HTTP calls arrive at the same time this can cause a database error, which leads to a concurrency error from the DBS server. Whether that is the case for the discussed blocks I don't know. But it is possible to not have all the information present in the DB across all blocks if any of the above differ among them. You may look at the example bulkblocks JSON [2] to see how this information is actually structured and provided to DBS. In particular, see the information referenced in [1] https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L478
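For orientation, the bulkblocks payload is structured roughly as below (an abbreviated sketch from memory; the linked JSON example [2] is the authoritative reference). The point is that all of these auxiliary pieces must already exist, and be identical across blocks of the same dataset, for concurrent injection to be safe.

```python
# Rough shape of a bulkblocks payload (abbreviated; field values are placeholders).
bulkblock = {
    "acquisition_era": {"acquisition_era_name": "Run2023X", "start_date": 0},
    "processing_era": {"processing_version": 1, "description": ""},
    "primds": {"primary_ds_name": "SomePD", "primary_ds_type": "data"},
    "dataset": {"dataset": "/SomePD/SomeProc-v1/RAW", "dataset_access_type": "VALID"},
    "block": {"block_name": "/SomePD/SomeProc-v1/RAW#some-uuid", "open_for_writing": 0},
    "dataset_conf_list": [],  # output module configurations
    "file_conf_list": [],     # per-file output module configurations
    "files": [],              # each file record must carry lfn, size, checksums, file type, ...
    "file_parent_list": [],   # parentage information
}
```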
Valentin, unless there is a bug in the (T0)WMAgent, all the blocks for the same dataset should carry exactly the same metadata. That means the same acquisition era, primary dataset, etc etc etc. Having said that, if a block exists in DBS Server, we can conclude that all of its metadata is already available as well. IF that metadata is already available and we are trying to inject more blocks for the same dataset, hence the same metadata, there should be NO concurrency error. Based on your explanation and on the data shared by Antonio, I fail to see how we would hit a "concurrency error". That means there is more to it than what we have discussed/understood so far; or the error message is misleading... In any case, I would suggest having @todor-ivanov follow this up next week, comparing things against the DBS Server logs and the source code.
I looked further into the DBS code and I think I identified the issue. According to the DBS code:
Then, I looked at one of the dbs logs and found
So, indeed, the input file record DOES NOT contain the required file type attribute. To summarize, I suggest checking the JSON records T0 provides and ensuring the file type attribute is present.
For the record, here is how the DBS error looks in a log:
So, you have all the pointers to see which lines of code fail by inspecting the stack, and that is exactly what I did.
As far as I can tell, it should always be set like:
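For illustration, a file record inside the payload's files section with that attribute in place would look roughly like this (all values are placeholders; "EDM" is the usual value for WMAgent output):

```python
# Illustrative file record from the "files" section of a bulkblocks payload.
fileRecord = {
    "logical_file_name": "/store/data/SomeEra/SomePD/RAW/v1/000/000/001/file.root",
    "file_size": 123456789,
    "event_count": 1000,
    "check_sum": "123456",
    "adler32": "deadbeef",
    "file_type": "EDM",   # must always be present
    "file_lumi_list": [],
}
```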
@LinaresToine can you please change the component configuration and provide one of the block names that is failing to be inserted, in the following line:
then restart the component.
OK, I changed the config as suggested. Waiting on the loadFiles method to complete the cycle. I'll follow up.
I have placed the output JSON file. Another error is showing up in the DBS3Upload component for all 4 pending blocks:
An update from T0:
Now we have a total of
Because of these, we have
Here is the follow-up on the status of those blocks according to DBS. I had to create a script to query the DBS database directly, LFN by LFN, for all those blocks, and here is the accumulated result. From what I can see in those results, we can identify at least 3 different use cases:
I am going to filter out those which we know are already there. On top of that, I am considering checking their Rucio status as well.
p.s. DBSBlocksCheck.py is the script I used for accumulating those results, and blockDBSRecords.json is an updated version of the DBS records with the Rucio information per block as well.
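The gist of the check the script performs is roughly the following (a sketch, not the attached script itself; block and file names are placeholders):

```python
# Ask DBS, LFN by LFN, whether the records of a pending block are already there.
from dbs.apis.dbsClient import DbsApi

dbsApi = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader")

def checkBlock(lfns):
    """Return a map of LFN -> True/False depending on whether DBS knows it."""
    status = {}
    for lfn in lfns:
        status[lfn] = bool(dbsApi.listFiles(logical_file_name=lfn))
    return status

# usage sketch: the LFNs would come from the agent's dbsbuffer records
# summary = checkBlock(["/store/data/.../file.root"])
```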
Continuing to reduce the results into something more readable, here [1] is the final list of the block and file status at DBS for all of them. As one can see:
FYI: @germanfgv @LinaresToine [1]
@germanfgv @LinaresToine Could we check what is special about those 4 blocks reported as experiencing
@todor-ivanov as we discussed in the WMCore meeting, DBS3Upload should have a mechanism to identify blocks that have already been injected into DBS Server but failed to acknowledge the operation for some reason. If the component tries to inject a block already in the server, the server is supposed to return exit code 128, marking the block accordingly, which will trigger the execution of this block of code in the next cycle of the component. I don't think anything has changed in the DBS Server codebase lately, so I expect this feature to still be functional. But you might want to review the error message/code that we are getting for the problematic blocks.
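As a rough sketch of that mechanism (not the actual DBSUploadPoller code; the layout of the error record and the meaning of code 128 are assumptions based on what is quoted above):

```python
import json

BLOCK_EXISTS_CODE = 128  # code quoted above; treated here as an assumption

def blockAlreadyInDBS(responseBody):
    """Return True if the DBS server reply says the block already exists."""
    try:
        reply = json.loads(responseBody)
    except ValueError:
        return False
    records = reply if isinstance(reply, list) else [reply]
    for rec in records:
        if isinstance(rec, dict) and rec.get("error", {}).get("code") == BLOCK_EXISTS_CODE:
            return True
    return False
```

If this returned True, the component could mark the block as injected right away instead of retrying it forever.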
@todor-ivanov we are having similar problems with one agent that is ready to be shut down (after draining), but it still has one block that it fails to inject into DBS Server. Could you please look into submit12 and try to understand what the problem is with:
It seems to have been failing injection since Oct 15th.
Thanks @todor-ivanov! I tested this using my own script (DBSBlockCheck.py) and got the same result: all but 4 blocks are already available in DBS. I can check @amaltaro's idea tomorrow.
In reality, the error code the agent unwraps from the HTTP header is, for some reason, the one shown in [1].
Actually, it never tries to read the HTTP header and resolve the true DBS error, which is supposed to be done through the
And the reason why it happens like that is, obviously, because of the way the error returned by the server is handled here: WMCore/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py, lines 96 to 115 in 76fd3a9
But instead the exception is handled as a general exception, and this branch takes control: WMCore/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py, lines 116 to 119 in 76fd3a9
It has to have something to do with this line from the traceback:
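Rather than relying on that generic string, something along these lines (a pycurl sketch with placeholder URL and certificate paths; not the agent code) would let us capture both the header and the body of the reply and then decode the real DBSError from it:

```python
from io import BytesIO
import json
import pycurl

def postToDBS(url, payload, cert, key):
    """POST a JSON payload and return (status, raw header, raw body)."""
    body, header = BytesIO(), BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.POST, 1)
    curl.setopt(pycurl.POSTFIELDS, json.dumps(payload))
    curl.setopt(pycurl.HTTPHEADER, ["Content-Type: application/json"])
    curl.setopt(pycurl.SSLCERT, cert)
    curl.setopt(pycurl.SSLKEY, key)
    curl.setopt(pycurl.WRITEFUNCTION, body.write)
    curl.setopt(pycurl.HEADERFUNCTION, header.write)
    curl.perform()
    status = curl.getinfo(pycurl.RESPONSE_CODE)
    curl.close()
    return status, header.getvalue().decode(), body.getvalue().decode()
```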
And just to add to the observation: the 4 blocks which I mentioned are experiencing the error handled at WMCore/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py, lines 96 to 115 in 76fd3a9.
But as we can see [1]
@todor-ivanov actually, at some point, the 4 blocks started failing with a different error. This is the last appearance of the DBSError:
This is the first appearance of the pyCurl error:
This might be a clue to what caused the other 272 blocks to fail. It seems something changed on the DBS server side at some point between these two points in time.
Here you have the
@germanfgv I would like to mention that, according to the k8s production DBS cluster, we have been running the DBS pods for 209 days. Therefore, nothing has changed on the DBS side, and neither am I aware of any development, commits or PRs. The concurrency error may seem misleading since it is printed out with the concurrent call to file injection, but the file injection fails due to missing aux meta-data in the JSON payload. Please see this DBS code:
I reported MANY times that the most likely issue is the missing file data type in the JSON payload, and I strongly suggest starting with your JSON payload and checking whether it is there. In particular, the files section of the payload should contain the file data type. If the JSON payload is correct in terms of ALL required aux meta-data, I suggest that you move down the list, check the validity of the file(s), and finally look for the ORACLE insert error.
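A quick way to verify that on a dumped payload would be something like this (a sketch; the file path is a placeholder):

```python
# Check that every record in the "files" section of a dumped bulkblocks JSON
# carries a non-empty file_type.
import json

with open("/tmp/dumped_block.json") as fd:  # placeholder path
    block = json.load(fd)

missing = [rec["logical_file_name"] for rec in block.get("files", [])
           if not rec.get("file_type")]
print("files missing file_type:", missing or "none")
```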
@vkuznet We have 2 separate problems here:
I bring up the APS upgrade in reference to the issues parsing the response from the server, not as an explanation for the concurrency issues. After that, I would like to switch this agent temporarily to the Apache frontend. Regarding the JSON payload, the dumps we obtained from the agent show
My suggestions would be the following:
@germanfgv, and regarding switching to
About the 4 original blocks with the concurrency errors: I see all the files in DBS, but they are distributed among two "impostor" blocks.
I call them impostors because both contain files that belong to other blocks, and number 2 is not even in Rucio and all its files belong to another block according to the database. Here is a summary of all 5 blocks, the 2 impostors and the 3 originals:
About:
Just to put @vkuznet's words in perspective: I tried to completely simulate the whole agent environment in
Which actually contains the true DBS Error in the header and one can spot the error message in the printout:
So those are two nested DBS errors:
And there is a plethora of DBS server errors that we do not handle: https://github.com/dmwm/dbs2go/blob/8effd5a6bcb1c5b169348e3ac886891ad3aa1a2a/dbs/errors.go#L37-L81 [2]
FYI: @germanfgv @LinaresToine
[1]
[2]
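One way to at least surface all the nested errors would be a simple pass over the reply text, along these lines (a sketch; the "DBSError Code/Description" textual format is assumed from the printouts above):

```python
import re

# Pattern assumed from the error printouts above; adjust to the real format.
DBS_ERR_RE = re.compile(r"DBSError Code:\s*(\d+)\s+Description:\s*([^,]+)")

def unwrapDBSErrors(replyText):
    """Return a list of (code, description) tuples found in the server reply."""
    return [(int(code), desc.strip()) for code, desc in DBS_ERR_RE.findall(replyText)]
```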
I changed the DBSWriter instance that the component is accessing, switching it to the Apache frontend. Now we are left with the original 4 problematic blocks.
Here is a summary of the status and of our findings on this issue from the work with the T0 team over the whole last week. The problem is threefold:
The above two mostly concern the huge pile of blocks which we were accumulating without recognizing that their records were already in DBS, such that the agent should have stopped retrying them. Once we switched back to the Apache frontend all of those proceeded, and the sequential steps for the other workflows depending on the data also started.
As a strategy, we decided to split the problem into 5 steps: 2 OPS and 3 DEV
(the latter we never even imagined existed)
so:
Thank you for summarizing everything that has been going on in here. For the OPS2 issue above, I find deleting entries from the DBS Server database extremely dangerous. Even though it might require extra work, it would be much safer to actually recreate the lumis (or block) that are failing to get inserted into DBS. Did you and the T0 discuss this possibility, @germanfgv? About DEV1, unless I am missing some context, I do not think we should replicate every single status code from the DBS Server on the client side. IMO, the client should only deal with the status codes for which it can actually do something different. If there is no different execution flow, then reporting the error from the server is all we can do (which is already done in the generic exception, AFAICT).
@amaltaro we no longer have streamer files for these run/lumis, so it's not possible to recreate these blocks. We could consider making the changes in Rucio, but it would require removing files from one block and adding them to the other. Also, it would require doing the same in the agent's database.
Given the criticality and amount of information in DBS, it would be the last system in which I would delete things manually.
In Rucio, we would need to remove 4 files from one block and add them to another. In the agent's database, we would need to change the block of the 4 problematic files and mark them accordingly.
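For concreteness, the Rucio part of such a change would look roughly like this (block and file names are placeholders; the update of the agent's database is not covered here):

```python
# Move the 4 files from the "wrong" block to the "right" one in Rucio.
from rucio.client import Client

client = Client()
scope = "cms"
wrongBlock = "/SomePD/SomeProc-v1/RAW#wrong-uuid"   # placeholder
rightBlock = "/SomePD/SomeProc-v1/RAW#right-uuid"   # placeholder
files = [{"scope": scope, "name": "/store/data/placeholder/file_%d.root" % i}
         for i in range(4)]

client.detach_dids(scope=scope, name=wrongBlock, dids=files)
client.attach_dids(scope=scope, name=rightBlock, dids=files)
```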
After some discussions during the Tier0 meeting, I decided to have a quick look at the logs to see if we can get a better understanding of this issue. Some of this information is not yet in this thread, so let me write my observations here:
Having said that, I have the following questions/comments:
Without investigating the codebase too much, it is possible that those duplicate file ids ("Ignoring this file for now!") actually triggered the misbehavior of the component. This duplicate file id is identified here: |
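A guard of roughly this shape (a sketch, not the actual component code) would keep such a duplicate from propagating into the block being built:

```python
def dedupeFiles(fileRecords):
    """Keep the first occurrence of every file id; log and drop the rest."""
    seen = set()
    unique = []
    for rec in fileRecords:
        fid = rec["id"]
        if fid in seen:
            print("Ignoring duplicate file id %s (lfn=%s)" % (fid, rec.get("lfn")))
            continue
        seen.add(fid)
        unique.append(rec)
    return unique
```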
@amaltaro, a few observations:
Is it possible that the thread was killed because ORACLE timed out? Or that the connection to ORACLE was lost and an error was thrown? In other words, because of the polling cycle, if the thread is killed for whatever reason there is no guarantee that the transaction can be rolled back in Python. But the polling cycle will start the poller again, and it may execute the same injection of objects into the database, which may not be protected (if there is no UNIQUE constraint on an injected object), and that may explain the observed behavior.
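To illustrate the point: with a UNIQUE constraint in place, a re-executed injection becomes harmless (a self-contained sqlite sketch; the real schema is of course different):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE injected_block (block_name TEXT UNIQUE)")

def recordInjection(blockName):
    try:
        conn.execute("INSERT INTO injected_block (block_name) VALUES (?)", (blockName,))
        conn.commit()
    except sqlite3.IntegrityError:
        # a previous (possibly killed) cycle already inserted it -- that's fine
        pass

recordInjection("/SomePD/SomeProc-v1/RAW#some-uuid")
recordInjection("/SomePD/SomeProc-v1/RAW#some-uuid")  # retried cycle, no error
```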
A new development issue has been created with #12229, which will make WMAgent more resilient and error messages more friendly. Given that all operational issues have now been resolved - despite not being able to understand the T0 issues, even after tons of debugging by Todor and German - I think we can close this out. Todor, Andrea, others, please reopen it if anything is still pending from this debugging/operations. Thanks!
Impact of the bug
WMAgent
Describe the bug
There seems to be an unusual number of blocks that are continuously failing to be inserted into DBS Server, with a variety of errors, as can be seen in [1] and [2].
For [1], that/those blocks actually belong to a workflow that went all the way to completed in the system and then got rejected, as can be seen from this ReqMgr2 API.
For [2], that block belongs to a workflow that is currently in running-closed status. The block has been failing injection for about 10h.
This is based on vocms0255, I haven't yet checked the other agents.
How to reproduce it
Not sure
Expected behavior
For the rejected workflow (or aborted), we should make DBS3Upload aware that output data is no longer relevant and skip their injection into DBS Server. This might require persisting information in the DBSBuffer tables (like marking the block and relevant files as injected), otherwise the same blocks will come up every time we run a cycle of the DBS3Upload component.
For the malformed SQL statement error (note the typo "mailformed"(!)), we probably need to correlate this error with further information from DBS Server. Is it the same error as we have with concurrent HTTP requests? Or what is actually wrong with it? Maybe @todor-ivanov can shed some light on this. The expected behavior of this fix is to be determined.
Additional context and error message
[1]
[2]