Agents continuously failing to insert blocks into DBS #11965

Closed
amaltaro opened this issue Apr 11, 2024 · 38 comments

@amaltaro
Contributor

Impact of the bug
WMAgent

Describe the bug
There seems to be an unusual number of blocks that are continuously failing to be inserted into DBS Server, with a variety of errors, as can be seen in [1] and [2].

For [1], the block(s) actually belong to a workflow that went all the way to completed in the system and then got rejected, as can be seen from this ReqMgr2 API.

For [2], that block belongs to a workflow that is currently in running-closed status. The block has been failing injection for about 10h.

This is based on vocms0255, I haven't yet checked the other agents.

How to reproduce it
Not sure

Expected behavior
For the rejected workflow (or aborted), we should make DBS3Upload aware that output data is no longer relevant and skip their injection into DBS Server. This might require persisting information in the DBSBuffer tables (like marking the block and relevant files as injected), otherwise the same blocks will come up every time we run a cycle of the DBS3Upload component.
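A minimal sketch of the skipping idea described above, assuming the component can look up the request status of the workflow behind each block. The function name, the `TERMINAL_STATUSES` set and the dictionary layout are hypothetical illustrations, not existing DBS3Upload code.

```python
# Hypothetical sketch: none of these names exist in WMCore as written.
TERMINAL_STATUSES = {"rejected", "aborted"}  # the two statuses named in this issue


def filter_uploadable_blocks(blocks, workflow_status):
    """Split blocks into those to upload and those to skip.

    blocks: list of dicts with (assumed) keys 'name' and 'workflow'.
    workflow_status: dict mapping workflow name -> request status string.
    """
    to_upload, to_skip = [], []
    for block in blocks:
        status = workflow_status.get(block["workflow"])
        if status in TERMINAL_STATUSES:
            to_skip.append(block)
        else:
            to_upload.append(block)
    return to_upload, to_skip
```

The skipped blocks would still have to be persisted as handled in DBSBuffer, otherwise they would be selected again on the next cycle.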

For the malformed SQL statement (note the typo "mailformed" in the error message), we probably need to correlate this error with further information from DBS Server. Is it the same error as we have with concurrent HTTP requests? Or what is actually wrong here? Maybe @todor-ivanov can shed some light on this. The expected behavior of this fix is to be determined.

Additional context and error message
[1]

2024-04-11 15:32:06,562:140685583296256:ERROR:DBSUploadPoller:Error trying to process block /TKCosmics_38T/Run3Winter24Reco-TkAlCosmics0T-AlcaRecoTkAlCosmics0T_cosmics_133X_mcRun3_2024cosmics_realistic_deco_v1-v5/ALCARECO#a5225151-fe56-45b1-b4dc-244b4644c02d through DBS. Details: DBSError code: 0, message: , reason: 

[2]

2024-04-11 14:09:09,438:140685583296256:ERROR:DBSUploadPoller:Error trying to process block /SingleNeutrino_E-10-gun/Run3Winter24Reco-133X_mcRun3_2024_realistic_v10-v2/GEN-SIM-RECO#995c334f-6648-4c55-98a1-44afbed8a57f through DBS. Details: DBSError code: 131, message: 5d0aae4c60a9089bfd22c0602c1bcecffd88106ed1a4578923297eda9e7da9d2 unable to find dataset_id for /SingleNeutrino_E-10-gun/Run3Winter24Digi-133X_mcRun3_2024_realistic_v10-v2/GEN-SIM-RAW, error DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set, reason: DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set
@amaltaro
Contributor Author

@todor-ivanov as discussed in the meeting today - and right now with Andrea as well - let us put this back to ToDo and come back to this beginning of October (2 weeks more should not hurt us here).

@LinaresToine

Following discussion in mattermost wm-ops thread with @amaltaro.

Related to failure in inserting data to DBS, the current T0 production agent is struggling with inserting files into the blocks. I see the following error message in the DBS3Upload component log

Error trying to process block /AlCaP0/Run2024H-v1/RAW#983e96f3-5dca-4919-a3b4-fa291f145fb7 through DBS. Details: DBSError code: 110, message: d93d36f53eaf3097db5c9f50851359041c418a18727e6f363e6c18c37d3f25bb unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error, reason: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error

This is present for the following blocks:

  • /AlCaP0/Run2024H-v1/RAW#3bbaf481-068c-4fda-8656-663fa9a987a4
  • /Muon0/Run2024H-v1/RAW#7369ccdf-3d3a-4d32-bad9-b04b02f279d4
  • /AlCaP0/Run2024H-v1/RAW#92fab5d9-9a27-4a7c-a57e-4b2691c654cd
  • /AlCaP0/Run2024H-v1/RAW#983e96f3-5dca-4919-a3b4-fa291f145fb7

@vkuznet
Contributor

vkuznet commented Sep 26, 2024

I suggest that you review #11106, which describes the actual issue with concurrent data insertion. In short, to make it work we must have all pieces (like dataset configuration, etc.) in place before injecting concurrently. To solve this problem, someone must first inject one block with all the necessary information, and can then safely use the concurrent pattern to inject the other blocks.

@amaltaro
Contributor Author

@vkuznet thank you for jumping into this discussion.

I had a feeling that there was another obscure problem with DBS Server, and reviewing the ticket you pointed to (11106) - and according to your sentence above - I understand that, provided that we have at least 1 block injected into DBS for a given dataset, the "concurrency error" should no longer happen, given that all the foundation information is already in the database. Correct?

I picked one of the blocks provided by Antonio and queried DBS Server for its blocks:
https://cmsweb.cern.ch/dbs/prod/global/DBSReader/blocks?dataset=/AlCaP0/Run2024H-v1/RAW

as you can see, this dataset already has a bunch of blocks in the database. So, how come we are having a "concurrency error" here?
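The lookup above can be sketched in Python as follows. This is only an illustration: it assumes the `blocks` API returns a JSON list of records carrying a `block_name` field and that an X509 client certificate is needed to reach cmsweb. The helper names are made up for this example.

```python
import json
import ssl
import urllib.parse
import urllib.request

DBS_READER = "https://cmsweb.cern.ch/dbs/prod/global/DBSReader"  # URL from this thread


def blocks_url(dataset, base=DBS_READER):
    """Build the DBSReader 'blocks' URL for a dataset, keeping '/' readable."""
    query = urllib.parse.urlencode({"dataset": dataset}, safe="/")
    return base + "/blocks?" + query


def block_names(payload):
    """Extract block names from a decoded response (assumed: list of dicts)."""
    return [rec["block_name"] for rec in payload if "block_name" in rec]


def fetch_block_names(dataset, cert, key):
    """Query DBS using a client certificate (assumed to be required)."""
    ctx = ssl.create_default_context()
    ctx.load_cert_chain(certfile=cert, keyfile=key)
    with urllib.request.urlopen(blocks_url(dataset), context=ctx) as resp:
        return block_names(json.load(resp))
```

A non-empty result for the dataset means at least one block, and therefore its dataset-level metadata, is already in the database.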

@vkuznet
Contributor

vkuznet commented Sep 26, 2024

If you inspect the code [1], in order to insert DBS blocks concurrently we need to have in place:

  • dataset configuration
  • primary dataset info
  • processing era
  • acquisition era
  • data tier
  • physics group
  • dataset access type
  • processed dataset

So, if all of this information is present and consistent across all blocks in DBS, then the answer is yes: the concurrency error (based on database content) should not arise. In other words, the DBS server first acquires or inserts this info into the DBS tables, and if two or more HTTP calls arrive at the same time this can cause a database error, which leads to the concurrency error from the DBS server. Whether that is the case for the discussed blocks, I don't know. But it is possible that not all of the information is present in the DB across all blocks if any of the above differs among them.

You may look at the example bulkblocks JSON [2] to see how this information is actually structured and provided to DBS. In particular, the information in dataset_conf_list and file_conf_list is used to look up the aforementioned info, along with primds, processing_era, etc. So, if you inject multiple JSON documents, they need to have identical info for those attributes; otherwise you may run into the race conditions described in #11106.

[1] https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L478
[2] https://github.com/dmwm/dbs2go/blob/master/test/bulkblocks.json
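One way to check the "identical info" condition on the client side is to compare the shared sections of several block dumps. The sketch below is an assumption-laden illustration: it only looks at top-level keys named in the comment above, leaves out `file_conf_list` on the assumption that its entries are per-file, and ignores a `creation_date` field that is assumed to vary legitimately between dumps.

```python
import json

# Top-level keys named in the comment above. file_conf_list is left out on the
# assumption that its entries are per-file and differ between blocks anyway.
SHARED_KEYS = ("dataset_conf_list", "primds", "processing_era")


def _strip(value, ignore):
    """Recursively drop ignored field names (e.g. timestamps) from dicts."""
    if isinstance(value, dict):
        return {k: _strip(v, ignore) for k, v in value.items() if k not in ignore}
    if isinstance(value, list):
        return [_strip(v, ignore) for v in value]
    return value


def inconsistent_keys(dumps, keys=SHARED_KEYS, ignore=("creation_date",)):
    """Return the keys whose (stripped) values differ between block dumps."""
    differing = []
    for key in keys:
        seen = {json.dumps(_strip(d.get(key), ignore), sort_keys=True) for d in dumps}
        if len(seen) > 1:
            differing.append(key)
    return differing
```

An empty result for all dumps of one dataset would support the expectation that concurrent injection is not racing on differing metadata.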

@amaltaro
Contributor Author

Valentin, unless there is a bug in the (T0)WMAgent, all the blocks for the same dataset should carry exactly the same metadata. That means, same acquisition era, primary dataset, etc etc etc.

Having said that, if a block exists in DBS Server, we can conclude that all of its metadata is already available as well. IF that metadata is already available and we are trying to inject more blocks for the same dataset, hence the same meta-data, there should be NO concurrency error.

Based on your explanation and on the data shared by Antonio, I fail to see how we would hit a "concurrency error". That means there is more to what we have discussed/understood so far; or the error message is misleading...

In any case, I would suggest to have @todor-ivanov following this up next week, comparing things with the DBS Server logs and against the source code.

@vkuznet
Contributor

vkuznet commented Sep 27, 2024

I further looked into the dbs code and I think I identified the issue. According to the dbs code

Then, I looked at one of the dbs logs and found

[2024-09-24 00:45:17.228980109 +0000 UTC m=+2471302.202098481] fail to insert files chunks, trec &{IsFileValid:1 DatasetID:15071289 BlockID:37951592 CreationDate:1727138717 CreateBy:/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0 FilesMap:{mu:{state:0 sema:0} read:{_:[] _:{} v:<nil>} dirty:map[] misses:0} NErrors:2}

So, indeed the input file record DOES NOT contain the required file type attribute; see the File structure here: https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks.go#L65. The "file_type" must be present in the provided JSON, otherwise it will be assigned the default value 0, which is what the file injection then tries to get from the database, where it should be a non-zero value.

To summarize, I suggest checking the JSON records T0 provides and ensuring they include "file_type" along with the other file attributes (all of them are defined in this struct: https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks.go#L65). Without it the DBS code correctly fails, but it would probably be useful to adjust the error message to properly report the error.
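The suggested check could look roughly like this in Python. It is a sketch under assumptions: the dump is taken to have a top-level `files` list of records, each optionally carrying `logical_file_name`; only `file_type` is confirmed by this thread.

```python
import json


def files_missing_type(block_dump):
    """Return LFNs (or list positions) of file records lacking 'file_type'.

    Assumes a top-level 'files' list of dicts; the 'files' and
    'logical_file_name' key names are assumptions for this sketch.
    """
    missing = []
    for idx, rec in enumerate(block_dump.get("files", [])):
        if not rec.get("file_type"):
            missing.append(rec.get("logical_file_name", idx))
    return missing


def check_dump(path):
    """Load a JSON block dump from disk and report records without a type."""
    with open(path) as handle:
        return files_missing_type(json.load(handle))
```

An empty list means every file record in the dump carries a non-empty `file_type`.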

@vkuznet
Contributor

vkuznet commented Sep 27, 2024

For the record, here is how the DBS error looks in a log:

[2024-09-24 00:45:17.228980109 +0000 UTC m=+2471302.202098481] fail to insert files chunks, trec &{IsFileValid:1 DatasetID:15071289 BlockID:37951592 CreationDate:1727138717 CreateBy:/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0 FilesMap:{mu:{state:0 sema:0} read:{_:[] _:{} v:<nil>} dirty:map[] misses:0} NErrors:2}
[2024-09-24 00:45:17.229561212 +0000 UTC m=+2471302.202679583] 5ecdc2bdcd03492fd64efc269de332cdcf1c8a53c3e3cc07168b0c741f0270ba unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error
[2024-09-24 00:45:17.232415539 +0000 UTC m=+2471302.205533911] DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:5ecdc2bdcd03492fd64efc269de332cdcf1c8a53c3e3cc07168b0c741f0270ba unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error Stacktrace:
goroutine 300475111 [running]:
github.com/dmwm/dbs2go/dbs.Error({0xb054e0?, 0xc0009f2410?}, 0x6e, {0xc0004f60f0, 0xe6}, {0xa3b23e, 0x2b})
        /go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 +0x99
github.com/dmwm/dbs2go/dbs.(*API).InsertBulkBlocksConcurrently(0xc00036c000)
        /go/src/github.com/vkuznet/dbs2go/dbs/bulkblocks2.go:743 +0x2546
github.com/dmwm/dbs2go/web.DBSPostHandler({0xb08290, 0xc000012cd8}, 0xc000616700, {0xa1d753, 0xa})
        /go/src/github.com/vkuznet/dbs2go/web/handlers.go:544 +0x1374
github.com/dmwm/dbs2go/web.BulkBlocksHandler({0xb08290?, 0xc000012cd8?}, 0xc000a9f460?)
        /go/src/github.com/vkuznet/dbs2go/web/handlers.go:960 +0x3b
net/http.HandlerFunc.ServeHTTP(0xc00055f1a0?, {0xb08290?, 0xc000012cd8?}, 0x95d5a0?)
        /usr/local/go/src/net/http/server.go:2136 +0x29
github.com/dmwm/dbs2go/web.limitMiddleware.func1({0xb08290?, 0xc000012cd8?}, 0xc00055f1a0?)
        /go/src/github.com/vkuznet/dbs2go/web/middlewares.go:110 +0x32
net/http.HandlerFunc.ServeHTTP(0x7f8c001964c0?, {0xb08290?, 0xc000012cd8?}, 0xc0003

So, you have all the pointers to see which lines of code fail by inspecting the stack, and that is exactly what I did.
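When scanning many such log lines, the nested error records can be pulled out with a small regular expression. This is a sketch that assumes the `DBSError Code:… Description:… Function:…` layout seen in the lines quoted in this thread; the function name is made up for the example.

```python
import re

# Layout taken from the log lines quoted in this thread.
DBS_ERROR = re.compile(r"DBSError Code:(\d+) Description:(.+?) Function:(\S+)")


def parse_dbs_errors(line):
    """Return (code, description, function) for each DBSError in a log line."""
    return [(int(code), desc, func) for code, desc, func in DBS_ERROR.findall(line)]
```

The last tuple in the result corresponds to the innermost error, which is usually the function to look at first.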

@amaltaro
Contributor Author

As far as I can tell, it should always be set like:

      "file_type": "EDM",

@LinaresToine can you please change the component configuration and provide one of the block names that is failing to be inserted, in the following line:

config.DBS3Upload.dumpBlockJsonFor = ""

then restart DBS3Upload and you should soon get a JSON dump of the content that the component is POSTing to the DBS Server. Output file should be under the component directory (e.g. install/DBS3Upload/).

@LinaresToine

Ok, I changed the config as suggested. Waiting on the loadFiles method to complete the cycle. I'll follow up.

@LinaresToine

LinaresToine commented Sep 27, 2024

I have placed the output json file in /eos/home-c/cmst0/public/dbsError/dbsuploader_block.json.

Another error is showing up in the DBS3Upload component for all 4 pending blocks:

Hit a general exception while inserting block /AlCaP0/Run2024H-v1/RAW#92fab5d9-9a27-4a7c-a57e-4b2691c654cd. Error: (52, 'Empty reply from server')
Traceback (most recent call last):
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/WMComponent/DBS3Buffer/DBSUploadPoller.py", line 94, in uploadWorker
    dbsApi.insertBulkBlock(blockDump=block)
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 647, in insertBulkBlock
    result =  self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 474, in __callServer
    self.http_response = method_func(self.url, method, params, data, request_headers)
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/RestClient/RestApi.py", line 42, in post
    return http_request(self._curl)
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 56, in __call__
    curl_object.perform()
pycurl.error: (52, 'Empty reply from server')

@germanfgv
Contributor

An update from T0:
Here is a JSON dump for a successfully uploaded T0 DBS block:

/eos/home-c/cmst0/public/dbsError/dbsuploader_successful_block.json

Now we have a total of 276 blocks that we are unable to upload. We see the same error message for all of them. A list of these blocks can be found here:

/eos/home-c/cmst0/public/dbsError/failingBlocks.txt

Because of these, we have 121384 files in T0 that we have been unable to register in DBS. @todor-ivanov is trying to find a way for us to upload this information.

@todor-ivanov
Contributor

todor-ivanov commented Nov 4, 2024

Here is the follow-up on the status of those blocks according to DBS. I had to create a script to query the DBS database directly, lfn by lfn, for all those blocks, and here is the accumulated result:
blockDBSRecords.json: /eos/home-c/cmst0/public/dbsError/blockDBSRecords.json

So from what I can see from those results we can identify at least 3 different use cases:

  • blocks attempted to be uploaded to DBS twice - resulting in oracle 'duplicate error'
  • blocks processed twice resulting in block mismatch between what the agent knows about the block and what has been already uploaded to DBS for those blocks - in few of those cases the data uploaded to DBS is not complete i.e. some of the files are missing even from the previously uploaded block.
  • completely missing records - probably due to block fields misconfiguration

I am going to filter out those which we know are there. On top of that, I am considering checking their Rucio status as well.
FYI @germanfgv

p.s. Here: DBSBlocksCheck.py is the script I used for accumulating those results

p.s. And here: blockDBSRecords.json is an updated version of the DBS records, with updated Rucio information per block as well

@todor-ivanov
Contributor

todor-ivanov commented Nov 4, 2024

Continuing to reduce the results to something more readable, here [1] is the final list of the block and file status at DBS for all of them.

As one can see:

  • Almost all of those blocks are properly present at DBS - so for those I assume that the Agent did not properly handle the initial return code by DBS and it simply continues to retry.
  • Only 4 of them (probably the 4 originally reported) - are falling under one of the categories:
    • Block mismatch:
      • /AlCaP0/Run2024H-v1/RAW#3bbaf481-068c-4fda-8656-663fa9a987a4
      • /AlCaP0/Run2024H-v1/RAW#92fab5d9-9a27-4a7c-a57e-4b2691c654cd
        meaning, all files from this block are recorded as part of a different block (could be due to an attempt to reprocess the same block twice).
    • Block mismatch and partially recorded files:
      • /Muon0/Run2024H-v1/RAW#7369ccdf-3d3a-4d32-bad9-b04b02f279d4
      • /AlCaP0/Run2024H-v1/RAW#983e96f3-5dca-4919-a3b4-fa291f145fb7
        meaning, not only do the files already uploaded to DBS belong to a different block, but those which were already uploaded were not the whole block
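The categories above can be derived per block from a file-to-block lookup. The following sketch is not the DBSBlocksCheck.py script mentioned earlier. It is a hypothetical reimplementation of the classification idea, reusing the status labels that appear in the listing.

```python
def classify_block(block_name, lfns, dbs_block_of):
    """Classify the files of one agent block against what DBS has recorded.

    lfns: file names the agent holds for the block.
    dbs_block_of: dict mapping lfn -> block name recorded in DBS
                  (lfn absent when DBS has no record of the file).
    """
    statuses = set()
    for lfn in lfns:
        recorded = dbs_block_of.get(lfn)
        if recorded is None:
            statuses.add("MISSING")        # file not in DBS at all
        elif recorded != block_name:
            statuses.add("BLOCKMISMATCH")  # file registered under another block
        else:
            statuses.add("OK")
    return sorted(statuses)
```

A result containing both MISSING and BLOCKMISMATCH corresponds to the "block mismatch and partially recorded files" case.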

FYI: @germanfgv @LinaresToine

[1]

	blockName: /AlCaP0/Run2024H-v1/RAW#3bbaf481-068c-4fda-8656-663fa9a987a4: 
		blockDBSStatus: ['MISSING']
		filesDBSStatus: ['BLOCKMISMATCH']
	blockName: /Muon0/Run2024H-v1/RAW#7369ccdf-3d3a-4d32-bad9-b04b02f279d4: 
		blockDBSStatus: ['MISSING']
		filesDBSStatus: ['MISSING', 'BLOCKMISMATCH']
	blockName: /AlCaP0/Run2024H-v1/RAW#92fab5d9-9a27-4a7c-a57e-4b2691c654cd: 
		blockDBSStatus: ['MISSING']
		filesDBSStatus: ['BLOCKMISMATCH']
	blockName: /AlCaP0/Run2024H-v1/RAW#983e96f3-5dca-4919-a3b4-fa291f145fb7: 
		blockDBSStatus: ['MISSING']
		filesDBSStatus: ['MISSING', 'BLOCKMISMATCH']
	blockName: /Commissioning/Run2024I-v1/RAW#4113bc20-c92b-43a3-a767-06bccfe4af56: OK
	blockName: /ParkingSingleMuon6/Run2024I-v1/RAW#75ce0f9d-2153-4520-b79f-bb0df5f19227: OK
	blockName: /ZeroBias/Run2024I-v1/RAW#50e30425-0a03-46c5-8da0-26719c266dbc: OK
	blockName: /ParkingSingleMuon1/Run2024I-v1/RAW#fae36d0d-cc36-4c5e-bfde-009aa38f9b7c: OK
	blockName: /ParkingSingleMuon2/Run2024I-v1/RAW#3c125d76-99c4-4f65-8141-0ae9abcd0e1a: OK
	blockName: /ParkingSingleMuon7/Run2024I-v1/RAW#0a25242b-c6fb-4eca-8382-826f4e878021: OK
	blockName: /ParkingSingleMuon5/Run2024I-v1/RAW#b01693d9-d3d9-4ab2-8760-0b019652f89e: OK
	blockName: /ParkingSingleMuon8/Run2024I-v1/RAW#a57df37e-4d9d-465e-93b2-54d40f892429: OK
	blockName: /ParkingSingleMuon10/Run2024I-v1/RAW#627f4dd1-4c31-4f4b-bb85-48d19996ba4f: OK
	blockName: /ParkingSingleMuon9/Run2024I-v1/RAW#1b4bc705-f6de-4f55-9c8b-7cd490457341: OK
	blockName: /Tau/Run2024I-v1/RAW#2da0c3fb-4b3c-47a2-ac49-99cca62c226d: OK
	blockName: /BTagMu/Run2024I-v1/RAW#280bd893-a382-496c-8e6f-366309493acc: OK
	blockName: /ParkingDoubleMuonLowMass1/Run2024I-v1/RAW#b4492da6-fa05-4484-a034-aa2a87354735: OK
	blockName: /AlCaLowPtJet/Run2024I-v1/RAW#0002be22-10d9-4200-9845-bf112ec9291a: OK
	blockName: /Muon1/Run2024I-v1/RAW#499938c6-8357-4095-99db-91c90e600f0e: OK
	blockName: /ParkingDoubleMuonLowMass4/Run2024I-PromptReco-v1/AOD#09d3e003-1ca7-459f-8089-1f1d95f5ba20: OK
	blockName: /ParkingDoubleMuonLowMass1/Run2024I-PromptReco-v1/DQMIO#777a6f7a-058d-46a4-bfb9-d905b141fbd2: OK
	blockName: /ParkingDoubleMuonLowMass0/Run2024I-PromptReco-v1/AOD#6fbca7c8-560f-4805-af21-55d424e9877a: OK
	blockName: /Muon0/Run2024I-MuAlCalIsolatedMu-PromptReco-v1/ALCARECO#06571035-b278-4998-a9d3-2b523bb4fd0e: OK
	blockName: /Muon1/Run2024I-HcalCalHO-PromptReco-v1/ALCARECO#65cfc43b-aca0-4600-8bb3-db4261856f3b: OK
	blockName: /Tau/Run2024I-LogError-PromptReco-v1/RAW-RECO#645686dd-1135-4866-9e53-6438aa17600d: OK
	blockName: /Muon1/Run2024I-PromptReco-v1/NANOAOD#d26b31f5-0371-4fe2-9420-e87a87925fdd: OK
	blockName: /ParkingSingleMuon8/Run2024I-PromptReco-v1/MINIAOD#86c82707-6992-4126-b86f-182c5f5aa7fc: OK
	blockName: /Tau/Run2024I-PromptReco-v1/AOD#6627bc34-c746-49a9-ab02-550710731e1b: OK
	blockName: /Muon0/Run2024I-PromptReco-v1/MINIAOD#ebab670f-0588-422e-8206-c406f948bb06: OK
	blockName: /Muon0/Run2024I-PromptReco-v1/DQMIO#35e3beac-bd5d-4ba4-82c0-a5372e89e5a6: OK
	blockName: /ParkingDoubleMuonLowMass0/Run2024I-PromptReco-v1/DQMIO#2e38060a-bf22-4135-897d-f7d93684dede: OK
	blockName: /DisplacedJet/Run2024I-EXOLLPJetHCAL-PromptReco-v1/AOD#ab2bd1a4-6f42-4d9a-853f-1a7a5aa5f2f4: OK
	blockName: /ParkingDoubleMuonLowMass7/Run2024I-PromptReco-v1/MINIAOD#0821e60b-12b6-4993-8a67-0952379c34bb: OK
	blockName: /Muon1/Run2024I-EXOCSCCluster-PromptReco-v1/USER#5197c2ba-2f13-48e9-bddd-9c1fd071cd33: OK
	blockName: /ParkingVBF6/Run2024I-PromptReco-v1/NANOAOD#a36dad4c-dac8-4a43-bde9-5ac38a0f8b7d: OK
	blockName: /EphemeralZeroBias1/Run2024I-PromptReco-v1/MINIAOD#0c7b63de-5a48-4d98-8e9b-9c52c714703a: OK
	blockName: /JetMET1/Run2024I-PromptReco-v1/DQMIO#6d3cc1c5-48c8-4f9e-9f8b-1cb5b481d550: OK
	blockName: /Tau/Run2024I-PromptReco-v1/NANOAOD#473eddda-612e-4306-91bf-9dfcb3b5d108: OK
	blockName: /ParkingVBF6/Run2024I-PromptReco-v1/MINIAOD#38f7630c-e015-4a97-a331-e33b0cfa3604: OK
	blockName: /ParkingSingleMuon6/Run2024I-PromptReco-v1/AOD#bd0eeb38-b1d0-4157-ade6-fb3b65f57995: OK
	blockName: /ParkingVBF1/Run2024I-PromptReco-v1/AOD#1298b211-43f2-49c6-8788-2bde6e2a9e62: OK
	blockName: /ParkingVBF0/Run2024I-PromptReco-v1/MINIAOD#63e59425-3ff1-406c-ae18-48bf9f239354: OK
	blockName: /ParkingSingleMuon8/Run2024I-PromptReco-v1/AOD#85d58a9d-29b0-4f98-99bc-9c201ed2c6a2: OK
	blockName: /Tau/Run2024I-EXODisappTrk-PromptReco-v1/USER#1a96c708-64a9-4f62-819f-a19633154b16: OK
	blockName: /SpecialZeroBias5/Run2024I-PromptReco-v1/AOD#59e655a4-2897-47d0-ba11-287332c4e6b5: OK
	blockName: /ParkingVBF1/Run2024I-PromptReco-v1/NANOAOD#aa47f11c-7080-492b-ab91-ad19e6299fff: OK
	blockName: /ParkingVBF3/Run2024I-PromptReco-v1/NANOAOD#4330d839-9985-4b36-9d5e-b5aa5c19175f: OK
	blockName: /ParkingHH/Run2024I-PromptReco-v1/MINIAOD#e8d162fa-f391-439c-a7f1-8a8d39dda120: OK
	blockName: /Tau/Run2024I-PromptReco-v1/DQMIO#1a0ac20a-1d60-4d89-8133-e8559f1e4c13: OK
	blockName: /ParkingSingleMuon0/Run2024I-PromptReco-v1/MINIAOD#d8995d51-e005-4757-8439-850c005cbd57: OK
	blockName: /ParkingVBF5/Run2024I-PromptReco-v1/MINIAOD#5bff218a-4895-4f13-8148-c9e0bcf820b7: OK
	blockName: /Muon0/Run2024I-PromptReco-v1/AOD#306f5950-5eec-43d3-96f2-8dfbe22d322c: OK
	blockName: /EGamma0/Run2024I-PromptReco-v1/AOD#e9814b10-2545-4a83-8a3d-2501f5679ecd: OK
	blockName: /JetMET1/Run2024I-PromptReco-v1/NANOAOD#039f5f67-3f70-4797-9b2a-c6d698e52efd: OK
	blockName: /ParkingVBF1/Run2024I-PromptReco-v1/DQMIO#b1f45558-5e9a-493c-afed-7e133bb4a7e7: OK
	blockName: /DisplacedJet/Run2024I-PromptReco-v1/AOD#0edffee0-8286-4658-943c-8efc45f23ea4: OK
	blockName: /Muon1/Run2024I-PromptReco-v1/DQMIO#edae199f-fb22-48d3-9a8a-cdb15703bcbe: OK
	blockName: /DisplacedJet/Run2024I-EXODelayedJet-PromptReco-v1/AOD#b46c0dcb-c26a-47c2-a4b0-fcac9b9d63be: OK
	blockName: /Muon0/Run2024I-HcalCalIterativePhiSym-PromptReco-v1/ALCARECO#6b39e513-27ac-4e54-ad1e-a343b9d064fc: OK
	blockName: /JetMET1/Run2024I-PromptReco-v1/AOD#fa6562fe-0d1d-4d06-9bf0-a135edbcf172: OK
	blockName: /ParkingVBF0/Run2024I-PromptReco-v1/DQMIO#24686495-80bc-44de-a3b2-f39cfa971760: OK
	blockName: /NoBPTX/Run2024I-PromptReco-v1/AOD#99fd2849-5b43-4927-9596-6e6a33683d9c: OK
	blockName: /HLTPhysics/Run2024I-LogError-PromptReco-v1/RAW-RECO#8abbaf67-41c7-452f-816a-f978dd14cc1b: OK
	blockName: /EGamma0/Run2024I-LogError-PromptReco-v1/RAW-RECO#ab07e175-a29f-4203-a3f3-dceb2938ae33: OK
	blockName: /JetMET1/Run2024I-EXODisappTrk-PromptReco-v1/USER#98870a35-c0d8-4ece-9731-1ac081143000: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#3b08c77d-8e97-4aca-be54-f95b7ab76465: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#2b3eefa5-923b-4b42-9c5c-cf162453d59b: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#1d77a290-571c-442d-be95-531e4168e94d: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#23d1c315-25d0-47a8-813e-caa7a5f2a0f1: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#80570ad8-4b6a-4e00-bcb5-63ff743504d5: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#5abdb7b2-4a4a-4a10-a4b3-9cfae99bdf83: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#adc72e9a-7410-493a-a327-1611b18a4106: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#e0c75028-64eb-480f-abb6-910505a92973: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#c1bc74d9-2c3b-415d-928f-7ec8395868ad: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#f1e8e065-2a65-4bd7-9279-d88c423c0ea0: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#8147c17b-4550-4a14-9747-ca696aa03408: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#4645a8b5-b1ef-4008-b1f5-dff3fadb1855: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#a8e53c28-c9ad-40fa-88bf-b7c2f3e61a64: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#19e36aa2-2ec8-4974-891f-112279ec9393: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#ba34c69d-db70-455e-81cc-13a161727e80: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#b66d6bae-2647-44b9-8bad-320da54d0a29: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#33b16287-bc8d-421b-bd38-2059ad19dd87: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#a1f20e02-b988-4e3a-bd6f-68610bde0b97: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#d3c7711f-b7ba-4e4d-9db8-999cd6383551: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#a77f0bd9-bd98-418e-b39c-9bf859203fad: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#aff24402-3ead-4ab6-9d71-02b13721b7cf: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#aea1ab62-ffa2-447f-bede-dbd01a05708a: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#ebb35256-bdc6-40a7-ae6c-9de27a2094bf: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#36740b3b-31a6-4be6-a4e4-f76f5e1200ab: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#a95882ee-05b6-482a-bbf8-6f7ff8ab4354: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#0358032c-2997-440c-a658-461e011e87a0: OK
	blockName: /Cosmics/Run2024I-MuAlGlobalCosmics-PromptReco-v1/ALCARECO#f7f08dfb-6c23-441c-9137-09abad0a7d39: OK
	blockName: /ParkingDoubleMuonLowMass2/Run2024I-PromptReco-v1/DQMIO#e9d71494-b460-4680-a3b9-7a1c62fc4d01: OK
	blockName: /ParkingHH/Run2024I-PromptReco-v1/AOD#ba026985-4cfd-4a06-ba1b-bfb5af6cbb64: OK
	blockName: /MinimumBias/Run2024I-PromptReco-v1/NANOAOD#dfbf01f5-c43c-463f-b562-aee5a91da41e: OK
	blockName: /ParkingSingleMuon2/Run2024I-PromptReco-v1/AOD#f2995e54-af6b-456c-8b40-abb844b299a2: OK
	blockName: /EGamma0/Run2024I-HcalCalIterativePhiSym-PromptReco-v1/ALCARECO#0ad19bb8-d4fa-4565-9019-91ef6e7207ac: OK
	blockName: /EGamma1/Run2024I-EXODisappTrk-PromptReco-v1/USER#cf89035f-8d3d-4a95-9dd4-ae73b92cb865: OK
	blockName: /MinimumBias/Run2024I-SiStripCalMinBias-PromptReco-v1/ALCARECO#d4fdf46b-db01-4df9-9eac-e23672e14f84: OK
	blockName: /Muon0/Run2024I-EXODisappTrk-PromptReco-v1/USER#ae7335f9-c7ae-42bd-8304-0805347446dd: OK
	blockName: /ParkingDoubleMuonLowMass7/Run2024I-PromptReco-v1/NANOAOD#0f9d8bfa-a7f7-44f9-8bec-e509f8334490: OK
	blockName: /ParkingDoubleMuonLowMass7/Run2024I-PromptReco-v1/DQMIO#a004890b-d6cf-4ff0-b715-f1ba374e3d97: OK
	blockName: /ParkingSingleMuon4/Run2024I-PromptReco-v1/AOD#64b94383-cbaf-48a5-b194-3d15baa01adc: OK
	blockName: /Tau/Run2024I-PromptReco-v1/MINIAOD#23bc1b6d-3f75-4007-8618-52755f3fb1f3: OK
	blockName: /ParkingDoubleMuonLowMass3/Run2024I-PromptReco-v1/NANOAOD#3935aabb-be36-4e9d-a49f-afc523994fd5: OK
	blockName: /MinimumBias/Run2024I-SiStripCalZeroBias-PromptReco-v1/ALCARECO#d621c61d-52ea-4d11-b092-4068bfd61ddf: OK
	blockName: /EphemeralZeroBias7/Run2024I-PromptReco-v1/MINIAOD#f770ca99-fca5-4054-a377-9365c016069b: OK
	blockName: /SpecialZeroBias1/Run2024I-PromptReco-v1/MINIAOD#b8cc0194-13dd-4091-96fa-28f5be5c2134: OK
	blockName: /ParkingDoubleMuonLowMass4/Run2024I-TkAlJpsiMuMu-PromptReco-v1/ALCARECO#950e3d2c-5375-47c5-8f79-03df611b9422: OK
	blockName: /Commissioning/Run2024I-SiStripCalMinBias-PromptReco-v1/ALCARECO#0a2b5651-fe2e-436e-af4d-488cb00acf68: OK
	blockName: /ScoutingPFMonitor/Run2024I-PromptReco-v1/NANOAOD#c2e3fd58-1bf5-4769-9230-6c6ac11bf75f: OK
	blockName: /SpecialZeroBias5/Run2024I-SiStripCalMinBias-PromptReco-v1/ALCARECO#3ecd813c-ad88-4ced-acba-d78a9ebc9963: OK
	blockName: /SpecialZeroBias5/Run2024I-SiStripCalZeroBias-PromptReco-v1/ALCARECO#31c2c844-9601-4881-ba21-e999b89d7900: OK
	blockName: /JetMET1/Run2024I-HcalCalIsoTrkProducerFilter-PromptReco-v1/ALCARECO#0f1c47f1-3755-4a00-8d36-2cd47e42605c: OK
	blockName: /ParkingHH/Run2024I-PromptReco-v1/DQMIO#b3983892-1ab2-4d9b-a5da-c7e238846e1f: OK
	blockName: /SpecialZeroBias5/Run2024I-LogErrorMonitor-PromptReco-v1/USER#6ea22759-354e-49ba-a72b-2d29034979e2: OK
	blockName: /ParkingVBF6/Run2024I-PromptReco-v1/DQMIO#332ab512-bdef-44e0-a091-9615bdd417c6: OK
	blockName: /Muon0/Run2024I-TkAlZMuMu-PromptReco-v1/ALCARECO#6d4fa60b-d7c3-4280-ab08-c48d7cbf258d: OK
	blockName: /EphemeralZeroBias3/Run2024I-PromptReco-v1/NANOAOD#fde6778b-8361-427a-876f-e16b2e65978f: OK
	blockName: /TestEnablesEcalHcal/Run2024I-Express-v1/RAW#6a31d991-d964-4fca-9113-59d1c40d5759: OK
	blockName: /StreamExpressCosmics/Run2024I-SiPixelCalZeroBias-Express-v1/ALCARECO#13f46438-f23d-4121-85fb-896d224db127: OK
	blockName: /StreamExpressCosmics/Run2024I-SiStripCalCosmics-Express-v1/ALCARECO#9d90aa84-6f90-4ee3-8e33-a2718e9e59b2: OK
	blockName: /StreamALCAPPSExpress/Run2024I-PromptCalibProdPPSAlignment-Express-v1/ALCAPROMPT#939fa7d4-6ffa-4788-9b12-83436c0413b5: OK
	blockName: /StreamExpress/Run2024I-TkAlZMuMu-Express-v1/ALCARECO#6097a868-c5bd-43ae-a804-1e881fcf5bc4: OK
	blockName: /StreamExpress/Run2024I-SiPixelCalSingleMuonTight-Express-v1/ALCARECO#1b495798-f27a-49e4-8a21-87d8d2a236f6: OK
	blockName: /StreamExpress/Run2024I-PromptCalibProdSiPixelAliHGComb-Express-v1/ALCAPROMPT#aafb7697-c790-4d5e-9dac-4cf5fae1c4ce: OK
	blockName: /ParkingSingleMuon11/Run2024I-PromptReco-v1/AOD#f35b9721-1e35-434b-9e06-28b6c88f64fe: OK
	blockName: /ParkingDoubleMuonLowMass6/Run2024I-PromptReco-v1/AOD#90186a44-9236-4665-b48d-a8dd37ef0ff1: OK
	blockName: /Cosmics/Run2024I-CosmicTP-PromptReco-v1/RAW-RECO#7c8c249d-9515-4d08-8217-418a269b1a2e: OK
	blockName: /ParkingSingleMuon1/Run2024I-PromptReco-v1/MINIAOD#51342b65-75e2-42eb-b868-4df8ee7809b8: OK
	blockName: /ParkingSingleMuon4/Run2024I-PromptReco-v1/AOD#e8fd5b7b-24c4-49a1-8c26-7d7c2d37661a: OK
	blockName: /ParkingVBF5/Run2024I-PromptReco-v1/AOD#db278bd7-459c-4427-9c2f-b214640caaeb: OK
	blockName: /SpecialZeroBias1/Run2024I-PromptReco-v1/AOD#786cf920-b104-4e8f-bebb-30a30d090357: OK
	blockName: /EGamma0/Run2024I-EcalUncalWElectron-PromptReco-v1/ALCARECO#fd409f01-fba4-4d87-a1c5-04ccde5ee8ad: OK
	blockName: /Muon0/Run2024I-SiPixelCalSingleMuonLoose-PromptReco-v1/ALCARECO#5b6152f5-616f-4f55-ab65-eb8b1de0798b: OK
	blockName: /EGamma0/Run2024I-EGMJME-PromptReco-v1/RAW-RECO#55a6487b-d4f9-4d01-82c7-0f2ee35872d1: OK
	blockName: /EGamma1/Run2024I-HcalCalIterativePhiSym-PromptReco-v1/ALCARECO#67ac37a8-4d01-4a8b-a30f-4a714cfb2a0a: OK
	blockName: /HLTPhysics/Run2024I-LogErrorMonitor-PromptReco-v1/USER#1b6693e3-d43b-47e9-ae23-fd327e5af74e: OK
	blockName: /MuonShower/Run2024I-EXOCSCCluster-PromptReco-v1/USER#6d2d1137-e1d8-4ae0-bd81-065f8a050490: OK
	blockName: /ParkingVBF1/Run2024I-PromptReco-v1/MINIAOD#12935b72-988e-44ca-9c7c-9b9d8063d8b3: OK
	blockName: /Cosmics/Run2024I-LogError-PromptReco-v1/RAW-RECO#9b7ece4a-0f62-4b45-942a-e8366b905412: OK
	blockName: /EGamma1/Run2024I-WElectron-PromptReco-v1/USER#e8b3efe3-339d-4b5c-82df-bc78defb09ea: OK
	blockName: /NoBPTX/Run2024I-TkAlCosmicsInCollisions-PromptReco-v1/ALCARECO#c64e5941-7da4-47a8-ab32-284a5e059dca: OK
	blockName: /Commissioning/Run2024I-LogError-PromptReco-v1/RAW-RECO#af0433ba-b843-4b44-bfa3-70b5d7475863: OK
	blockName: /MinimumBias/Run2024I-PromptReco-v1/AOD#e45c0cff-39a4-440f-819a-2e522045618b: OK
	blockName: /ParkingSingleMuon7/Run2024I-PromptReco-v1/NANOAOD#9162a0ab-23e1-4fd1-870e-f55da0661a44: OK
	blockName: /Muon1/Run2024I-TkAlZMuMu-PromptReco-v1/ALCARECO#582d0da0-d896-46f3-b4de-e7001a1b4dea: OK
	blockName: /EphemeralZeroBias3/Run2024I-PromptReco-v1/MINIAOD#aeaed114-ebc8-4ef4-a625-5af2c3559120: OK
	blockName: /Commissioning/Run2024I-EcalActivity-PromptReco-v1/RAW-RECO#e65b7bb5-0732-432a-a181-4d9153161caa: OK
	blockName: /ParkingVBF3/Run2024I-PromptReco-v1/DQMIO#fecdfbd5-6a50-4100-92f7-90b05c452570: OK
	blockName: /ParkingVBF2/Run2024I-PromptReco-v1/DQMIO#43430e62-1bd6-4161-aa8d-3cf9b3da3e55: OK
	blockName: /ParkingDoubleMuonLowMass4/Run2024I-PromptReco-v1/NANOAOD#ceeadfe9-03f7-471f-93fb-a7d4ae3ab806: OK
	blockName: /EGamma1/Run2024I-LogErrorMonitor-PromptReco-v1/USER#2bd7cb69-577f-47a0-adaa-115b9df2e1b2: OK
	blockName: /Commissioning/Run2024I-PromptReco-v1/NANOAOD#2936bb45-005c-4f28-9ee0-c24b8e8b647e: OK
	blockName: /StreamExpress/Run2024I-TkAlMinBias-Express-v1/ALCARECO#c07f8345-698b-4a8d-87ac-39ef617874e0: OK
	blockName: /StreamCalibration/Run2024I-EcalTestPulsesRaw-Express-v1/ALCARECO#b05bf432-e2d5-4e77-8bc3-d51be7fadb5c: OK
	blockName: /StreamExpress/Run2024I-PromptCalibProdSiStripGains-Express-v1/ALCAPROMPT#82ba3ad1-7f7a-4a6b-ba44-e00632e53de5: OK
	blockName: /ExpressPhysics/Run2024I-Express-v1/FEVT#12d498f7-0eb0-4c0b-a124-f971dfab8ec8: OK
	blockName: /ParkingSingleMuon3/Run2024I-PromptReco-v1/AOD#82f71a1c-0f5c-4a24-88e9-24d02619104d: OK
	blockName: /Muon0/Run2024I-ZMu-PromptReco-v1/RAW-RECO#a0c2d025-fe28-40fc-b695-64e3465954d8: OK
	blockName: /ParkingSingleMuon1/Run2024I-PromptReco-v1/AOD#2b73f5f0-a63f-402e-94bc-6231bc31d130: OK
	blockName: /ParkingSingleMuon1/Run2024I-PromptReco-v1/AOD#45da3ad2-609d-4ed7-a77c-4a212635ea47: OK
	blockName: /EGamma0/Run2024I-PromptReco-v1/DQMIO#5ba6d8d0-5af2-4e46-9028-f8a83a38bd22: OK
	blockName: /ParkingDoubleMuonLowMass2/Run2024I-PromptReco-v1/NANOAOD#179594fb-d82a-4194-b8c7-ef8636ccade9: OK
	blockName: /ZeroBias/Run2024I-HcalCalIsolatedBunchSelector-PromptReco-v1/ALCARECO#623d20fe-86d2-4650-9c32-76fa0c792d6b: OK
	blockName: /HcalNZS/Run2024I-LogError-PromptReco-v1/RAW-RECO#ae1000be-4a3e-4b4d-b9dc-362c2028b5a9: OK
	blockName: /EGamma0/Run2024I-EcalESAlign-PromptReco-v1/ALCARECO#982d6c67-32ba-4a10-9239-4f460bf1c002: OK
	blockName: /ScoutingPFMonitor/Run2024I-PromptReco-v1/MINIAOD#37708d28-236f-4554-91d3-e3617b2c2a22: OK
	blockName: /ParkingSingleMuon2/Run2024I-PromptReco-v1/MINIAOD#ee62eeb6-9a38-4a12-94b4-6e2e03c5bf6c: OK
	blockName: /MuonShower/Run2024I-PromptReco-v1/MINIAOD#ab4cb22d-8acc-4661-b955-84786cd695db: OK
	blockName: /ParkingSingleMuon0/Run2024I-PromptReco-v1/NANOAOD#b2b7410c-cae4-48d2-aa08-1a4473ca9fcb: OK
	blockName: /EGamma1/Run2024I-EcalESAlign-PromptReco-v1/ALCARECO#7fb030a6-ce2b-40ba-9cfa-633e914c6ed4: OK
	blockName: /ZeroBias/Run2024I-PromptReco-v1/DQMIO#59bf6b32-52ee-4b87-a5f9-0eaf70f4eb00: OK
	blockName: /ParkingVBF4/Run2024I-PromptReco-v1/NANOAOD#023f6a88-3b2b-4826-ab9b-76a146d5c6a0: OK
	blockName: /SpecialZeroBias1/Run2024I-LogError-PromptReco-v1/RAW-RECO#9d2492d8-6426-4ece-bdf3-2c91113f4286: OK
	blockName: /MinimumBias/Run2024I-PromptReco-v1/MINIAOD#9029265d-b4db-458c-afc1-405d680f07da: OK
	blockName: /SpecialZeroBias0/Run2024I-PromptReco-v1/AOD#9c8e602e-c2ba-49c4-9b96-4b2ff143d3a4: OK
	blockName: /ParkingDoubleMuonLowMass1/Run2024I-TkAlUpsilonMuMu-PromptReco-v1/ALCARECO#6cdb8ade-6f9b-4b24-ab89-027326e095be: OK
	blockName: /SpecialZeroBias5/Run2024I-LogError-PromptReco-v1/RAW-RECO#a1de2483-ce3e-425d-85d6-5fea33a6ea67: OK
	blockName: /StreamExpressCosmics/Run2024I-Express-v1/DQMIO#699d704b-19aa-480c-8a8d-0fbebb0b0cd9: OK
	blockName: /StreamExpressCosmics/Run2024I-PromptCalibProdSiStripLA-Express-v1/ALCAPROMPT#6420dcde-2fa4-45bd-9c32-d10682923317: OK
	blockName: /StreamExpressCosmics/Run2024I-PromptCalibProdSiStrip-Express-v1/ALCAPROMPT#2e69c51a-1dca-4088-8aac-3743f810ce56: OK
	blockName: /StreamALCAPPSExpress/Run2024I-PPSCalMaxTracks-Express-v1/ALCARECO#652af1be-cb9a-4a1c-bbf4-c88c1686d301: OK
	blockName: /StreamExpress/Run2024I-PromptCalibProd-Express-v1/ALCAPROMPT#0e377dc2-065e-4e15-8566-bd910470baad: OK
	blockName: /StreamExpress/Run2024I-SiPixelCalSingleMuon-Express-v1/ALCARECO#3a744512-b3a2-447d-85f5-d0ec6254af59: OK
	blockName: /ZeroBias/Run2024I-SiStripCalMinBias-PromptReco-v1/ALCARECO#35f2cee8-79c7-4cab-9d86-dc50c003d893: OK
	blockName: /ZeroBias/Run2024I-SiStripCalMinBias-PromptReco-v1/ALCARECO#56647210-b149-4fdc-800f-a3e9523b3ea3: OK
	blockName: /HcalNZS/Run2024I-PromptReco-v1/DQMIO#045c4ec2-d9a4-44ce-9fd5-717415778bf5: OK
	blockName: /SpecialZeroBias5/Run2024I-PromptReco-v1/NANOAOD#d533eae7-c987-4e63-9e0e-0edb3e0bd246: OK
	blockName: /ParkingSingleMuon0/Run2024I-PromptReco-v1/AOD#a154328f-5edf-46bb-a395-d3d45a8b1ca6: OK
	blockName: /EGamma1/Run2024I-ZElectron-PromptReco-v1/RAW-RECO#5f92c14b-c6c0-4b8e-8913-71a11b54f598: OK
	blockName: /EGamma1/Run2024I-EXOMONOPOLE-PromptReco-v1/USER#6dcd1342-723d-4844-87b9-8248bf5db833: OK
	blockName: /StreamExpress/Run2024I-PromptCalibProdSiPixel-Express-v1/ALCAPROMPT#c9543d48-4c8f-4936-bd43-3a63dd1c174f: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#ef15476a-bbad-4545-9f91-7e77de5d6034: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#0758fd50-931f-488d-8a1e-663e1c6174e4: OK
	blockName: /SpecialZeroBias1/Run2024I-PromptReco-v1/DQMIO#1596222f-a243-4dc3-b9fe-a9ec03ad9adb: OK
	blockName: /SpecialZeroBias2/Run2024I-SiStripCalZeroBias-PromptReco-v1/ALCARECO#a8b97986-e1e1-4d7a-9b0f-4ae31589c714: OK
	blockName: /EGamma1/Run2024I-PromptReco-v1/MINIAOD#ee18201c-58ce-42a7-a60b-d1364ca32653: OK
	blockName: /ParkingVBF5/Run2024I-PromptReco-v1/NANOAOD#36beaf6c-97b1-475b-aba4-e9e7e8e694c3: OK
	blockName: /StreamExpressCosmics/Run2024I-PromptCalibProdSiPixelLAMCS-Express-v1/ALCAPROMPT#99955f16-752e-4dd0-ad52-d6eaf8d0f509: OK
	blockName: /Muon1/Run2024I-SiPixelCalSingleMuonLoose-PromptReco-v1/ALCARECO#14d95293-a39a-4d2a-a1f1-956affdde47c: OK
	blockName: /ParkingDoubleMuonLowMass0/Run2024I-PromptReco-v1/MINIAOD#4223ff94-a955-424e-b245-13c337e5b17d: OK
	blockName: /Muon1/Run2024I-EXODisappMuon-PromptReco-v1/USER#55154291-7763-457d-a791-96c65ef849ea: OK
	blockName: /Muon1/Run2024I-MUOJME-PromptReco-v1/RAW-RECO#3b55bfa9-ba65-4454-bc98-e60db1916b28: OK
	blockName: /ParkingSingleMuon9/Run2024I-PromptReco-v1/NANOAOD#439d8642-eccc-4e50-85ec-ce0a084b7fee: OK
	blockName: /Muon1/Run2024I-MuAlCalIsolatedMu-PromptReco-v1/ALCARECO#b9479dcc-06fb-4dce-8b11-d3cb0e957900: OK
	blockName: /Muon1/Run2024I-TkAlMuonIsolated-PromptReco-v1/ALCARECO#56f3de3c-f985-466c-b405-36fb5ff57720: OK
	blockName: /ParkingSingleMuon3/Run2024I-PromptReco-v1/AOD#04da00bb-4cec-4b87-a1d9-0747a3d4f02e: OK
	blockName: /Muon1/Run2024I-EXODisappTrk-PromptReco-v1/USER#f3e359cc-cb16-4020-8e84-4f083d1e9441: OK
	blockName: /JetMET0/Run2024I-JetHTJetPlusHOFilter-PromptReco-v1/RAW-RECO#13eb60ae-a6e7-4a47-b856-0f4f4e6e00d8: OK
	blockName: /Muon0/Run2024I-MUOJME-PromptReco-v1/RAW-RECO#a7967f2a-1035-4aea-afa0-26b9898938ce: OK
	blockName: /Muon0/Run2024I-ZMu-PromptReco-v1/RAW-RECO#966a0503-91c2-4e38-a002-bfac712cb168: OK
	blockName: /ZeroBias/Run2024I-SiStripCalMinBias-PromptReco-v1/ALCARECO#685c6155-a526-4509-bce9-812330416777: OK
	blockName: /ParkingDoubleMuonLowMass1/Run2024I-PromptReco-v1/MINIAOD#3b1b2f8c-a7da-4e53-902f-b4fa265774d2: OK
	blockName: /Muon1/Run2024I-PromptReco-v1/AOD#4f5fab34-198f-4eb0-81a4-09fc0c1a501c: OK
	blockName: /ParkingDoubleMuonLowMass5/Run2024I-PromptReco-v1/DQMIO#0fdb3015-38e1-47ef-bb71-237e7fbb1f08: OK
	blockName: /ParkingDoubleMuonLowMass6/Run2024I-PromptReco-v1/MINIAOD#ab102ad0-6081-42b6-9afe-9df7746af1d9: OK
	blockName: /Muon1/Run2024I-PromptReco-v1/MINIAOD#4adf338f-e98a-4e15-8e7f-a075cafbf918: OK
	blockName: /Muon1/Run2024I-HcalCalIterativePhiSym-PromptReco-v1/ALCARECO#a794ba12-a0d8-4c78-9d07-9b977908cb1f: OK
	blockName: /ParkingSingleMuon8/Run2024I-PromptReco-v1/NANOAOD#f594d1ee-148d-4b50-870b-a2f66c51efec: OK
	blockName: /ParkingSingleMuon9/Run2024I-PromptReco-v1/AOD#c60db50a-6e2e-4054-82d0-5d2f2622dd85: OK
	blockName: /ParkingSingleMuon11/Run2024I-PromptReco-v1/MINIAOD#9fab21d5-31e2-426a-aa0d-5ae91119d468: OK
	blockName: /ParkingSingleMuon3/Run2024I-PromptReco-v1/MINIAOD#ca5f6304-70f8-4504-b885-75bb293adf69: OK
	blockName: /ParkingSingleMuon3/Run2024I-PromptReco-v1/NANOAOD#7e049c44-3e1b-4e9d-92aa-8aa5a70db114: OK
	blockName: /Muon1/Run2024I-LogError-PromptReco-v1/RAW-RECO#65903046-1775-44ea-94bc-dfba5746ed0e: OK
	blockName: /Muon1/Run2024I-ZMu-PromptReco-v1/RAW-RECO#1fb3d4ad-ca60-4e64-a077-1437accb0f57: OK
	blockName: /ParkingSingleMuon11/Run2024I-PromptReco-v1/NANOAOD#b83a6f99-b768-4588-8577-18436f17bd0c: OK
	blockName: /ParkingDoubleMuonLowMass1/Run2024I-PromptReco-v1/AOD#41bdafe6-cdc1-40e1-9edc-85ef3d18e0ae: OK
	blockName: /ParkingSingleMuon4/Run2024I-PromptReco-v1/MINIAOD#b8ca91f4-608a-4612-8eeb-df6183a7b99f: OK
	blockName: /ParkingSingleMuon6/Run2024I-PromptReco-v1/MINIAOD#9b909863-233e-480c-9e3f-3850c0b67167: OK
	blockName: /ParkingDoubleMuonLowMass6/Run2024I-PromptReco-v1/AOD#b9849323-cc1a-445a-817f-c4e3212f74a2: OK
	blockName: /ParkingDoubleMuonLowMass7/Run2024I-PromptReco-v1/AOD#c0047424-6a0a-41c6-94ac-28aa41129d71: OK
	blockName: /ParkingVBF6/Run2024I-PromptReco-v1/AOD#d37b0146-4466-4ba8-a1ba-85c1ef4a902e: OK
	blockName: /Muon1/Run2024I-LogErrorMonitor-PromptReco-v1/USER#63a0eea8-35a0-4b70-adb5-39a6141c7bde: OK
	blockName: /ParkingVBF2/Run2024I-PromptReco-v1/NANOAOD#85077ea4-7bbd-45f7-a2df-2608128abe31: OK
	blockName: /ParkingVBF0/Run2024I-PromptReco-v1/AOD#8738e434-abdc-4d03-a60d-358ab2412188: OK
	blockName: /JetMET1/Run2024I-EXOSoftDisplacedVertices-PromptReco-v1/AOD#c2dd6617-dd8e-4bd8-9bd7-cdd81a858c36: OK
	blockName: /ParkingSingleMuon9/Run2024I-PromptReco-v1/MINIAOD#78af0c2f-e670-46e5-832a-d3828066fca7: OK
	blockName: /ParkingSingleMuon7/Run2024I-PromptReco-v1/AOD#f81d407a-8165-41c9-8a11-a5182d63d273: OK
	blockName: /Muon0/Run2024I-LogErrorMonitor-PromptReco-v1/USER#d7c8e5cc-0e23-424f-b000-55aa41070d63: OK
	blockName: /EphemeralZeroBias0/Run2024I-PromptReco-v1/MINIAOD#ad55216f-8224-404d-b8e9-daba73c85bb4: OK
	blockName: /ParkingDoubleMuonLowMass6/Run2024I-PromptReco-v1/NANOAOD#f590565c-1256-4f74-8e00-db331266d599: OK
	blockName: /Muon0/Run2024I-PromptReco-v1/NANOAOD#2ab3aa3d-a4ca-4773-8a58-85813617ea33: OK
	blockName: /Muon1/Run2024I-HcalCalHBHEMuonProducerFilter-PromptReco-v1/ALCARECO#455e3ec5-62d7-4f68-abbe-a34111a87076: OK
	blockName: /JetMET1/Run2024I-LogError-PromptReco-v1/RAW-RECO#1a77079a-29d9-4920-834c-eb523aeea080: OK
	blockName: /Muon0/Run2024I-EXODisappMuon-PromptReco-v1/USER#474dc48d-caf5-44d3-9a80-dcc2d5eec561: OK
	blockName: /JetMET1/Run2024I-JetHTJetPlusHOFilter-PromptReco-v1/RAW-RECO#f965d641-1a87-46b3-87a8-9d2a566fc604: OK
	blockName: /ParkingSingleMuon6/Run2024I-PromptReco-v1/NANOAOD#33455ccc-7ce1-4fb5-aa69-2048f8362f27: OK
	blockName: /ParkingSingleMuon10/Run2024I-PromptReco-v1/NANOAOD#fe088890-8bc3-416c-b263-408703b5efa4: OK
	blockName: /ParkingSingleMuon10/Run2024I-PromptReco-v1/AOD#df90c976-3437-46ee-af3e-fbc2221e48f4: OK
	blockName: /ParkingDoubleMuonLowMass6/Run2024I-PromptReco-v1/DQMIO#fe3a6889-bdb5-4959-a164-2db208d3e69b: OK
	blockName: /ParkingSingleMuon10/Run2024I-PromptReco-v1/MINIAOD#71008299-7fb4-4245-90d2-e980b9e195a1: OK
	blockName: /ParkingVBF2/Run2024I-PromptReco-v1/MINIAOD#7e7cb65b-c8db-4e36-b9a1-3475a031dedd: OK
	blockName: /AlCaP0/Run2024I-v1/RAW#0d3bd409-bcd8-4c59-bcd8-d7ecc3a1222b: OK
	blockName: /ParkingVBF3/Run2024I-PromptReco-v1/MINIAOD#2a6f15da-bd5c-4520-b70b-ab50ff65e04e: OK
	blockName: /ParkingSingleMuon4/Run2024I-PromptReco-v1/NANOAOD#f7f3e422-3bd6-4410-a955-9ed339b7219f: OK
	blockName: /ParkingVBF2/Run2024I-PromptReco-v1/AOD#ebaafffd-4020-46cf-9137-37c2832d3eac: OK
	blockName: /JetMET1/Run2024I-EXOMONOPOLE-PromptReco-v1/USER#77eabaa1-ead3-4534-a972-e788c3e7f050: OK
	blockName: /ParkingDoubleMuonLowMass5/Run2024I-PromptReco-v1/NANOAOD#8e5664a8-e3d1-4f92-b27e-3707ca3dc4df: OK
	blockName: /Muon0/Run2024I-EXOCSCCluster-PromptReco-v1/USER#700a9b24-dbc4-4bf9-8787-01da5ef26a06: OK
	blockName: /Muon0/Run2024I-LogError-PromptReco-v1/RAW-RECO#b5af0d19-9710-4757-af04-ba6f63ab4070: OK
	blockName: /ScoutingPFRun3/Run2024I-v1/HLTSCOUT#da913315-bb66-42ed-8c46-4f7f3714ef0c: OK
	blockName: /ParkingDoubleMuonLowMass5/Run2024I-PromptReco-v1/AOD#3d339afb-6be9-440d-802e-3572ab355d56: OK
	blockName: /ParkingDoubleMuonLowMass1/Run2024I-PromptReco-v1/NANOAOD#1c60e700-caf6-469a-9bb7-40122038ed33: OK
	blockName: /JetMET1/Run2024I-PromptReco-v1/MINIAOD#af8b556a-6dfa-43eb-890a-7f0cdea01f87: OK
	blockName: /JetMET1/Run2024I-EXOHighMET-PromptReco-v1/RAW-RECO#f09b2a46-61f1-4ae0-ac58-17d8c7db4fe3: OK
	blockName: /ParkingVBF3/Run2024I-PromptReco-v1/AOD#8a47209b-3d7a-4c13-a3d7-8a34397a9f94: OK
	blockName: /ScoutingPFRun3/Run2024I-PromptReco-v1/NANOAOD#2d01f47e-5f60-4115-b99a-3ccd5f843a71: OK
	blockName: /EphemeralZeroBias5/Run2024I-PromptReco-v1/MINIAOD#65830620-4305-4f9d-b084-555c17dc5610: OK
	blockName: /ParkingHH/Run2024I-PromptReco-v1/AOD#47f79776-ae5b-41ac-9965-d4981b9790d6: OK
	blockName: /ZeroBias/Run2024I-PromptReco-v1/AOD#d46e1e69-d0e3-4456-9bf1-060eb3731aec: OK
	blockName: /ParkingVBF0/Run2024I-PromptReco-v1/NANOAOD#40dc6115-2b40-43be-9e4e-cdd658e68dc7: OK
	blockName: /ParkingSingleMuon11/Run2024I-PromptReco-v1/AOD#0410c654-f369-40f4-858b-af27bbe4d94d: OK
	blockName: /JetMET0/Run2024I-PromptReco-v1/AOD#dc00eaf3-8f2c-4465-ba70-78335e4cb245: OK
	blockName: /ParkingDoubleMuonLowMass0/Run2024I-PromptReco-v1/NANOAOD#2c9e2ac2-b2aa-4868-9782-556e6193cbbb: OK
	blockName: /ParkingDoubleMuonLowMass5/Run2024I-PromptReco-v1/MINIAOD#3957f41c-be2a-4f09-b04c-6995d7c23eee: OK

@todor-ivanov
Contributor

@germanfgv @LinaresToine Could we check what is special about those 4 blocks reported as experiencing BLOCKMISMATCH records at DBS in my previous comment: #11965 (comment)? I am interested in finding out at least:

  • Were those the original 4 blocks reported initially by T0 team?
  • Were those 4 blocks handled by a separate agent?

@amaltaro
Contributor Author

amaltaro commented Nov 4, 2024

@todor-ivanov as we discussed in the WMCore meeting, DBS3Upload should have a mechanism to identify blocks that have already been injected into DBS Server, but failed to acknowledge the operation for some reason.

If the component tries to inject a block that is already in the server, the server is supposed to return error code 128, and the block gets marked as "check" here:
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L104

which will trigger the execution of this block of code:
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L848

in the next cycle of the component.

I don't think anything changed on the DBS Server codebase lately, so I expect this feature to be still functional. But you might want to revise the error message/code that we are getting for the problematic blocks.

@amaltaro
Contributor Author

amaltaro commented Nov 4, 2024

@todor-ivanov we are having similar problems with one agent that is ready to be shut down (after draining), but it still has one block that it fails to inject into DBS Server.

Could you please look into submit12 and try to understand what the problem is with:

2024-11-04 21:38:50,968:140011552839424:INFO:DBSUploadPoller:About to call insert block for: /XToYYprimeTo4Q_MX-2000_MY-30_MYprime-600_narrow_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM#fb282932-f32b-48de-82e5-a56cceb34cad
2024-11-04 21:38:51,654:140011552839424:ERROR:DBSUploadPoller:Error trying to process block /XToYYprimeTo4Q_MX-2000_MY-30_MYprime-600_narrow_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM#fb282932-f32b-48de-82e5-a56cceb34cad through DBS. Details: DBSError code: 131, message: fb20d909d3a86926e3d8d0498c1ebfc3f4ad617c6b5e5dcaeecde3662af8797b unable to find dataset_id for /XToYYprimeTo4Q_MX-2000_MY-30_MYprime-600_narrow_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM, error DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set, reason: DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set

It seems to be failing injection since Oct 15th.

@germanfgv
Contributor

  • Almost all of those blocks are properly present at DBS - so for those I assume that the Agent did not properly handle the initial return code from DBS and it simply continues to retry.

Thanks @todor-ivanov! I tested this using my own script (DBSBlockCheck.py) and got the same result. All but 4 blocks are already available in DBS.

All the blocks listed in /eos/home-c/cmst0/public/dbsError/failingBlocks.txt belong to the same agent, including the 4 problematic ones.

I can check @amaltaro's idea tomorrow.

@todor-ivanov
Contributor

todor-ivanov commented Nov 5, 2024

In reality, the error code the agent unwraps from the HTTP header is, for some reason, 52 instead of 128; see [1]. So the mechanism mentioned here: #11965 (comment) will never trigger.

[1]

2024-11-05 10:07:49,084:139753276044864:ERROR:DBSUploadPoller:Hit a general exception while inserting block /Tau/Run2024I-PromptReco-v1/DQMIO#1a0ac20a-1d60-4d89-8133-e8559f1e4c13. Error: (52, 'Empty reply from server')
Traceback (most recent call last):
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/WMComponent/DBS3Buffer/DBSUploadPoller.py", line 94, in uploadWorker
    dbsApi.insertBulkBlock(blockDump=block)
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 647, in insertBulkBlock
    result =  self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 474, in __callServer
    self.http_response = method_func(self.url, method, params, data, request_headers)
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/RestClient/RestApi.py", line 42, in post
    return http_request(self._curl)
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 56, in __call__
    curl_object.perform()
pycurl.error: (52, 'Empty reply from server')

@todor-ivanov
Contributor

todor-ivanov commented Nov 5, 2024

Actually, it never tries to read the HTTP header and resolve the true DBS error, which is supposed to be done through the DBSError class here:

srvCode = dbsError.getServerCode()

And the reason why it happens like that is, evidently, that the error raised by the pycurl client is not of type HTTPError. So this whole piece of code is never reached:

except HTTPError as ex:
    # DBS Go server errors are defined here:
    # https://github.com/dmwm/dbs2go/blob/master/dbs/errors.go
    dbsError = DBSError(ex.body)
    reason = dbsError.getReason()
    message = dbsError.getMessage()
    srvCode = dbsError.getServerCode()
    msg = f'DBSError code: {srvCode}, message: {message}, reason: {reason}'
    if srvCode == 128:
        # block already exist
        logging.warning("Block %s already exists. Marking it as uploaded.", name)
        results.put({'name': name, 'success': "check"})
    elif srvCode in [132, 133, 134, 135, 136, 137, 138, 139, 140]:
        # racing conditions
        logging.warning("Hit a transient data race condition injecting block %s, %s", name, msg)
        results.put({'name': name, 'success': "error", 'error': msg})
    else:
        msg = f"Error trying to process block {name} through DBS. Details: {msg}"
        logging.error(msg)
        results.put({'name': name, 'success': "error", 'error': msg})

Instead, the exception is handled as a general exception, and this branch takes control:

except Exception as ex:
    msg = f"Hit a general exception while inserting block {name}. Error: {str(ex)}"
    logging.exception(msg)
    results.put({'name': name, 'success': "error", 'error': msg})

It has to have something to do with this line from the traceback:

  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 56, in __call__
    curl_object.perform()
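
To make the dispatch problem concrete, here is a minimal, self-contained sketch. `FakeHTTPError` and `FakeCurlError` are hypothetical stand-ins for the RestClient `HTTPError` and for `pycurl.error`; the only assumption is that the two are unrelated exception types, which is what the traceback above suggests. It is an illustration, not the actual DBSUploadPoller code.

```python
class FakeHTTPError(Exception):
    """Hypothetical stand-in for RestClient's HTTPError (carries a response body)."""

    def __init__(self, body):
        super().__init__(body)
        self.body = body


class FakeCurlError(Exception):
    """Hypothetical stand-in for pycurl.error, raised with (errno, message)."""


def classify(call):
    """Return which except branch handles the exception raised by call()."""
    try:
        call()
    except FakeHTTPError as ex:
        # Only in this branch is a response body (and so a DBS code) available.
        return "http-branch", ex.body
    except Exception as ex:
        # Transport-level failures land here; there is no server code to read.
        return "general-branch", str(ex)
    return "ok", None


def raise_http():
    raise FakeHTTPError("DBSError Code:128")


def raise_curl():
    raise FakeCurlError(52, "Empty reply from server")


http_result = classify(raise_http)
curl_result = classify(raise_curl)
print(http_result)
print(curl_result)
```

The second call never reaches the branch that inspects the server code, so a block that already exists in DBS cannot be marked as "check" when the transport layer fails first.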

@todor-ivanov
Contributor

todor-ivanov commented Nov 5, 2024

And just to add to the observation: the 4 blocks which I mentioned as experiencing BLOCKMISMATCH at DBS behave differently. They fail with a proper DBS exception [1], all 4 of them. And it is indeed the concurrency error, DBSError Code:110. So for them the actual HTTP header is indeed parsed and the true DBS error encoded in it is received, so the exception is handled according to whatever logic is meant to be implemented by:

except HTTPError as ex:
    # DBS Go server errors are defined here:
    # https://github.com/dmwm/dbs2go/blob/master/dbs/errors.go
    dbsError = DBSError(ex.body)
    reason = dbsError.getReason()
    message = dbsError.getMessage()
    srvCode = dbsError.getServerCode()
    msg = f'DBSError code: {srvCode}, message: {message}, reason: {reason}'
    if srvCode == 128:
        # block already exist
        logging.warning("Block %s already exists. Marking it as uploaded.", name)
        results.put({'name': name, 'success': "check"})
    elif srvCode in [132, 133, 134, 135, 136, 137, 138, 139, 140]:
        # racing conditions
        logging.warning("Hit a transient data race condition injecting block %s, %s", name, msg)
        results.put({'name': name, 'success': "error", 'error': msg})
    else:
        msg = f"Error trying to process block {name} through DBS. Details: {msg}"
        logging.error(msg)
        results.put({'name': name, 'success': "error", 'error': msg})

But as we can see, DBS error code 110 is not handled in this logic. So I suspect the conversation on how to proceed with these cases needs to continue once we understand what exactly happened with those 4 blocks in the first place. It doesn't seem that a proper agreement has been reached on the actions required on both the client and the server side for situations like this.

[1]

2024-09-20 08:43:10,755:139632874354240:ERROR:DBSUploadPoller:Error trying to process block /AlCaP0/Run2024H-v1/RAW#92fab5d9-9a27-4a7c-a57e-4b2691c654cd through DBS. Details: DBSError code: 110, message: 5ecdc2bdcd03492fd64efc269de332cdcf1c8a53c3e3cc07168b0c741f0270ba unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error, reason: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error
2024-09-20 08:43:10,756:139632874354240:INFO:DBSUploadPoller:About to call insert block for: /AlCaP0/Run2024H-v1/RAW#983e96f3-5dca-4919-a3b4-fa291f145fb7
2024-09-20 08:43:10,757:139632874354240:INFO:DBSUploadPoller:Queueing block for insertion: /L1ScoutingSelection/Run2024H-v1/L1SCOUT#694f9058-382e-47d9-89cd-646541261cd7
2024-09-20 08:43:10,760:139632874354240:ERROR:DBSUploadPoller:Error trying to process block /AlCaP0/Run2024H-v1/RAW#3bbaf481-068c-4fda-8656-663fa9a987a4 through DBS. Details: DBSError code: 110, message: 997071d9311e283887ce5e57b0b1800467986e1c57f620aff5a39d98b881fb6c unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error, reason: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error
...

2024-09-20 08:43:10,799:139632874354240:ERROR:DBSUploadPoller:Error trying to process block /AlCaP0/Run2024H-v1/RAW#983e96f3-5dca-4919-a3b4-fa291f145fb7 through DBS. Details: DBSError code: 110, message: d93d36f53eaf3097db5c9f50851359041c418a18727e6f363e6c18c37d3f25bb unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error, reason: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error

...
2024-09-20 08:43:11,854:139632874354240:ERROR:DBSUploadPoller:Error trying to process block /Muon0/Run2024H-v1/RAW#7369ccdf-3d3a-4d32-bad9-b04b02f279d4 through DBS. Details: DBSError code: 110, message: e38e86de6869760af39faf5da584eceee0b0b9d1de48e57276593df8dd4c720e unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error, reason: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error
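
The branch structure quoted above can be restated as a small pure function, which makes it easy to see where code 110 ends up. This is only a sketch: the function name and the returned labels are hypothetical (the real code records "error" for both the transient and the generic case and differs only in how it logs), and the code ranges are copied from the excerpt.

```python
BLOCK_EXISTS = 128
# 132..140 inclusive, the codes listed in the quoted elif branch
TRANSIENT_RACE_CODES = set(range(132, 141))


def classify_server_code(srv_code):
    """Map a DBS server code to the branch the quoted handler would take."""
    if srv_code == BLOCK_EXISTS:
        return "check"  # block already exists, to be verified on the next cycle
    if srv_code in TRANSIENT_RACE_CODES:
        return "error-transient"  # logged as a warning
    return "error"  # everything else, including 110


for code in (128, 135, 110):
    print(code, classify_server_code(code))
```

Code 110 falls through to the generic branch, so it is reported as an error on every cycle with no special handling.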

@germanfgv
Contributor

@todor-ivanov actually, at some point, the 4 blocks started failing with the pycurl.error: (52, 'Empty reply from server'), before any other block had failed.

This is the last appearance of DBSError 110:

2024-09-27 14:31:30,711:140408533284416:ERROR:DBSUploadPoller:Error trying to process block /AlCaP0/Run2024H-v1/RAW#92fab5d9-9a27-4a7c-a57e-4b2691c654cd through DBS. Details: DBSError code: 110, message: ec6dab1b1b8d8ba3bf018be816846d73e007b5049b93947a1e7472786c73ece6 unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error, reason: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error

This is the first appearance of pyCurl error 52.

2024-09-27 15:03:00,192:140408533284416:ERROR:DBSUploadPoller:Hit a general exception while inserting block /AlCaP0/Run2024H-v1/RAW#92fab5d9-9a27-4a7c-a57e-4b2691c654cd. Error: (52, 'Empty reply from server')

This might be a clue about what caused the other 272 blocks to fail. It seems something changed in the DBS server at some point between 2024-09-27 14:31 and 2024-09-27 15:03. After that moment, the client has been unable to parse the server's error codes. This coincides exactly with the deployment of the APS-based CMSWEB cluster.

@amaltaro @todor-ivanov @vkuznet

@germanfgv
Contributor

Here you have the DBS3Upload ComponentLog, in case you want to check these dates:

/eos/user/c/cmst0/public/dbsError/ComponentLog
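
For anyone who wants to re-derive those two dates from the log, a minimal sketch follows. It assumes the line layout seen in the excerpts in this thread (timestamp first, then colon-separated fields) and matches on the two message fragments quoted above; the function name and the abbreviated sample lines are illustrative only.

```python
import re
from datetime import datetime

# Timestamp at the start of each ComponentLog line, e.g. "2024-09-27 14:31:30,711:"
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}:")


def error_window(lines):
    """Return (latest DBSError 110 time, earliest pycurl error 52 time)."""
    last_110 = None
    first_52 = None
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue  # continuation lines such as tracebacks carry no timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if "DBSError code: 110" in line:
            last_110 = stamp if last_110 is None else max(last_110, stamp)
        elif "(52, 'Empty reply from server')" in line:
            first_52 = stamp if first_52 is None else min(first_52, stamp)
    return last_110, first_52


# Abbreviated versions of the two log lines quoted in this thread.
sample_lines = [
    "2024-09-27 14:31:30,711:140408533284416:ERROR:DBSUploadPoller:"
    "Error trying to process block ... Details: DBSError code: 110, message: ...",
    "2024-09-27 15:03:00,192:140408533284416:ERROR:DBSUploadPoller:"
    "Hit a general exception while inserting block ... Error: (52, 'Empty reply from server')",
]
last_110, first_52 = error_window(sample_lines)
print(last_110, first_52)
```

To scan the real file, pass an open file handle instead of `sample_lines`, since the function only iterates over lines.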

@vkuznet
Contributor

vkuznet commented Nov 5, 2024

@germanfgv I would like to mention that, according to the k8s production DBS cluster, the DBS pods have been running for 209 days. Therefore, nothing has changed on the DBS side, nor am I aware of any development or commits/PRs. The concurrency error may seem misleading, since it is printed out by the concurrent call for file injection. But the file injection fails due to missing aux meta-data in the JSON payload. Please see this DBS code:

  • ConcurrencyErr occurs here
  • it happens because the insertFilesChunk function returns an error
  • and if you inspect the insertFilesChunk code base you'll see that the error occurs only on three occasions:
    1. failure to look up FILE_DATA_TYPES id
    2. wrong is_file_valid value
    3. oracle insert error

I have reported MANY times that the most likely issue is a missing file data type in the JSON payload, and I strongly suggest starting with your JSON payload to see if it is there. In particular, the files section of the payload should contain file_type, see example here.

If the JSON payload is correct in terms of ALL required aux meta-data, I suggest that you move down the list and check the validity of the file(s), and finally look for an ORACLE insert error.
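
The first two checks in that list can be run locally against a dumped payload before touching the server. The sketch below is a hedged illustration: it assumes the payload is a dict with a `files` list, that each entry names its file under `logical_file_name`, and that valid `is_file_valid` values are 0 or 1. Those key names and allowed values are assumptions to verify against the dbs2go example payload; the function name and sample LFNs are made up.

```python
def find_file_problems(payload):
    """List problems found in the 'files' section of a bulkblocks-style payload."""
    problems = []
    for index, entry in enumerate(payload.get("files", [])):
        name = entry.get("logical_file_name", f"<file #{index}>")
        if not entry.get("file_type"):
            problems.append(f"{name}: missing file_type")
        if entry.get("is_file_valid") not in (0, 1):
            problems.append(
                f"{name}: unexpected is_file_valid={entry.get('is_file_valid')!r}"
            )
    return problems


# Hypothetical payload fragment, for illustration only.
payload = {
    "files": [
        {"logical_file_name": "/store/example/a.root", "file_type": "EDM", "is_file_valid": 1},
        {"logical_file_name": "/store/example/b.root", "is_file_valid": 1},
        {"logical_file_name": "/store/example/c.root", "file_type": "EDM", "is_file_valid": 2},
    ]
}
problems = find_file_problems(payload)
for problem in problems:
    print(problem)
```

An empty result only rules out the first two causes; an Oracle insert error still has to be looked for in the server logs.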

@germanfgv
Contributor

@vkuznet We have 2 separate problems here:

  1. 4 blocks showing concurrency errors.
  2. 272 blocks that are already properly uploaded to DBS, but the agent is unable to parse the response from the server.

I bring up the APS upgrade in reference to the issues parsing the response from the server, not as an explanation for the concurrency issues. After 2024-09-27 15:03, the DBS client is unable to distinguish DBSError 128: Block already exists, from DBSError 110: Concurrency error (or any other HTTP error). They all show up as a pyCurl error 52. As the timing coincides with the deployment of the APS server, it seems very likely to me that the issue is related to that upgrade, especially since, as you mentioned, there have not been any other changes in the code.

I would like to switch this agent temporarily to the cmsweb-prod.cern.ch version of DBSWriter, simply to check if the 272 blocks without concurrency issues can move along. This will not put much pressure on the server, as this agent is no longer producing new data and simply needs to upload those 272 blocks. @vkuznet do you have anything against that plan?

Regarding the JSON payload, the dumps we obtained from the agent show "file_type": "EDM", as expected. This is why we've moved to check the validity of the files and Todor already found issues there. There are indeed files appearing in more than one block. Fixing this will be more complicated and we still need to understand what faulty agent logic caused it.
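
A quick way to confirm the "files appearing in more than one block" finding across a set of dumps is sketched below. It assumes each dump is a dict shaped like `{"block": {"block_name": ...}, "files": [{"logical_file_name": ...}, ...]}`; that layout, the function name, and the sample names are assumptions for illustration, not taken from the agent code.

```python
from collections import defaultdict


def files_in_multiple_blocks(block_dumps):
    """Map each LFN found in more than one block to the sorted block names."""
    owners = defaultdict(set)
    for dump in block_dumps:
        block_name = dump["block"]["block_name"]
        for entry in dump.get("files", []):
            owners[entry["logical_file_name"]].add(block_name)
    return {lfn: sorted(blocks) for lfn, blocks in owners.items() if len(blocks) > 1}


# Hypothetical dumps, for illustration only.
dumps = [
    {
        "block": {"block_name": "/Example/Run-v1/RAW#block-a"},
        "files": [
            {"logical_file_name": "/store/example/1.root"},
            {"logical_file_name": "/store/example/2.root"},
        ],
    },
    {
        "block": {"block_name": "/Example/Run-v1/RAW#block-b"},
        "files": [{"logical_file_name": "/store/example/2.root"}],
    },
]
duplicates = files_in_multiple_blocks(dumps)
print(duplicates)
```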

@vkuznet
Contributor

vkuznet commented Nov 5, 2024

My suggestions would be the following:

  • you need to understand the source of the 52 error; it can be many things, including a timeout.
  • to do that, you should stop using the DBS3Upload code, as it hides many things and prevents debugging the issue
  • instead, you should use plain curl with your payload, see the examples in the dbs2go documentation. In particular, you need the last example with the bulkblocks API, and I suggest using a gzipped payload.
    • a curl call bypasses pycurl but still uses the libcurl library, which gives you a better idea of the underlying error (i.e. it avoids the Python wrappers)
    • you can take one payload at a time and make your injection; moreover, you can insert into this curl call your custom User-agent header, which will allow you to trace this request in both the APS and DBS logs, e.g. curl -H "User-Agent: my-failed-dataset-block#123" ....
    • if curl is successful you may claim a timeout as the source of the 52 error, otherwise
    • you'll get a clear trace in the logs to see your request and debug it further.
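
As a rough sketch of the steps above, the helper below gzips one payload to disk and assembles a curl command line with a custom User-Agent, without executing it. Everything specific here is an assumption: the URL is a placeholder, the `Content-Encoding: gzip` header should be checked against the dbs2go documentation, and the certificate options (`--cert`, `--key`) needed for a real request are deliberately left out.

```python
import gzip
import json
import os
import shlex
import tempfile


def build_curl_command(payload, gz_path, url, tag):
    """Write payload gzipped to gz_path and return the curl argv that would POST it."""
    with gzip.open(gz_path, "wt", encoding="utf-8") as handle:
        json.dump(payload, handle)
    return [
        "curl", "-v", "-X", "POST",
        "-H", f"User-Agent: {tag}",  # lets the request be traced in server logs
        "-H", "Content-Type: application/json",
        "-H", "Content-Encoding: gzip",  # assumption: server accepts gzipped bodies
        "--data-binary", f"@{gz_path}",
        url,
    ]


workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "block.json.gz")
command = build_curl_command(
    {"files": []},  # placeholder payload
    gz_path,
    "https://dbs.example.invalid/bulkblocks",  # placeholder URL
    "my-failed-dataset-block#123",
)
print(shlex.join(command))
```

Printing the command rather than running it keeps the sketch safe to try; the printed line can be reviewed, completed with credentials, and then run by hand for one block at a time.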

@vkuznet
Contributor

vkuznet commented Nov 5, 2024

@germanfgv, regarding switching to cmsweb-prod: if you are interested in understanding the error, I suggest using the manual curl approach I described before. Afterwards, you may switch to cmsweb-prod to see whether you are able to inject them using the Apache FE.

@LinaresToine

LinaresToine commented Nov 6, 2024

About the 4 original blocks with the concurrency errors, specifically the AlCaP0 blocks:

I see all files in DBS, but they are distributed among two "impostor" blocks.

1. `/AlCaP0/Run2024H-v1/RAW#b51293e3-1563-47a3-a88f-1eb33790c417`
2. `/AlCaP0/Run2024H-v1/RAW#0392f25d-8397-40b3-8f6f-46266d92583b`

I call them impostors because both contain files that belong to other blocks; number 2 is not even in Rucio, and all its files belong to other blocks according to the database. Here is a summary of all 5 blocks: the 2 impostors and the 3 originals:

/AlCaP0/Run2024H-v1/RAW#b51293e3-1563-47a3-a88f-1eb33790c417 (impostor 1)

  • RucioInjector added 4 files
  • RucioInjector deleted the rule
  • DBSBUFFER_FILE table has those 4 files belonging to the given blockname
  • DAS shows 8 additional files
    • 5 belong to /AlCaP0/Run2024H-v1/RAW#3bbaf481-068c-4fda-8656-663fa9a987a4
    • 3 belong to /AlCaP0/Run2024H-v1/RAW#92fab5d9-9a27-4a7c-a57e-4b2691c654cd

/AlCaP0/Run2024H-v1/RAW#0392f25d-8397-40b3-8f6f-46266d92583b (impostor 2)

  • Never made it to Rucio
  • DBSBUFFER_FILE Table shows no files
  • DAS shows 12 files
    • 3 belong to /AlCaP0/Run2024H-v1/RAW#983e96f3-5dca-4919-a3b4-fa291f145fb7
    • 9 belong to /AlCaP0/Run2024H-v1/RAW#92fab5d9-9a27-4a7c-a57e-4b2691c654cd

/AlCaP0/Run2024H-v1/RAW#3bbaf481-068c-4fda-8656-663fa9a987a4 (original)

  • RucioInjector added 5 files
  • DBSBUFFER_FILE has 5 files
  • DAS shows no data
  • The 5 files are in DAS as part of /AlCaP0/Run2024H-v1/RAW#b51293e3-1563-47a3-a88f-1eb33790c417

/AlCaP0/Run2024H-v1/RAW#92fab5d9-9a27-4a7c-a57e-4b2691c654cd (original)

  • RucioInjector added 12 files
  • DBSBUFFER_FILE has 12 files
  • DAS shows no data with that blockname
  • DAS shows 9 files as part of /AlCaP0/Run2024H-v1/RAW#0392f25d-8397-40b3-8f6f-46266d92583b
  • DAS shows 3 files as part of /AlCaP0/Run2024H-v1/RAW#b51293e3-1563-47a3-a88f-1eb33790c417

/AlCaP0/Run2024H-v1/RAW#983e96f3-5dca-4919-a3b4-fa291f145fb7 (original)

  • RucioInjector added 12 files
  • DBSBUFFER_FILE has 12 files
  • DAS shows no data with that blockname
  • DAS shows 3 files as part of /AlCaP0/Run2024H-v1/RAW#0392f25d-8397-40b3-8f6f-46266d92583b
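The cross-check summarized above can be sketched as a comparison of two block-to-files maps, one built from the agent's DBSBUFFER_FILE table and one built from DAS. This is only an illustration under assumptions: the helper name `find_misplaced_files` and the short file names are hypothetical, and only the two block names are taken from the summary.

```python
def find_misplaced_files(buffer_blocks, das_blocks):
    """Return {lfn: (expected_block, actual_block)} for every file that DAS
    lists under a different block than the agent's buffer does."""
    # Invert the agent-side map: each file should belong to exactly one block.
    expected = {lfn: block for block, lfns in buffer_blocks.items() for lfn in lfns}
    misplaced = {}
    for block, lfns in das_blocks.items():
        for lfn in lfns:
            owner = expected.get(lfn)
            if owner is not None and owner != block:
                misplaced[lfn] = (owner, block)
    return misplaced


IMPOSTOR = "/AlCaP0/Run2024H-v1/RAW#b51293e3-1563-47a3-a88f-1eb33790c417"
ORIGINAL = "/AlCaP0/Run2024H-v1/RAW#3bbaf481-068c-4fda-8656-663fa9a987a4"

# Hypothetical file names, standing in for the real LFNs.
buffer_blocks = {IMPOSTOR: ["a.root", "b.root"], ORIGINAL: ["c.root"]}
das_blocks = {IMPOSTOR: ["a.root", "b.root", "c.root"]}

misplaced = find_misplaced_files(buffer_blocks, das_blocks)
print(misplaced)
```

Files that DAS reports but the buffer does not know about are skipped here; a fuller check would report those separately.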

@todor-ivanov
Contributor

todor-ivanov commented Nov 6, 2024

About:

to do that, you should stop using DBS3Upload code as it hides many things and prevent from debugging the issue

Just to put @vkuznet's words in perspective:

I tried to completely simulate the whole agent environment in preprod connected to DBS integration, falsely assuming everything should go smoothly and upon initial successful upload of the block I'll be able to reproduce the duplication error on a second attempt. But:

  • First, the initial upload failed for reasons which will be explained on the next line
  • Second, the error returned by DBS was completely ignored. Here is one proper pdb session within the whole agent env. [1], where we can clearly see that DBSApi is not able to complete its execution. The error originates from the dbsclient, which throws it from this very line:

https://github.com/dmwm/DBSClient/blob/1e6acbd55c55497cf747a2a0cf4539936138a04a/src/python/dbs/apis/dbsClient.py#L647:

    def insertBulkBlock(self, blockDump):
   ...
        result =  self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )

Which actually contains the true DBS Error in the header and one can spot the error message in the printout:

RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root \
      Error: nested DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set                                                                                     

-- So those are two nested DBS errors:

  • DBSError Code: 101 - The error from the wrapper API InsertBulkBlocksConcurrently, reflecting that the call to the database actually failed.
  • DBSError Code: 103 - The bottom error, giving the true reason why the call to the database failed: the Parentage file of this lfn: /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root is indeed missing at this instance of DBS (which is completely expected), so the sql query returned an empty result. All this is properly raised by the dbsclient. What happens at the WMAgent's DBSApi, though, is quite undesired: the error code is silently dropped and reduced to the upper-level HTTP 400 error, and the actual error carried inside the header is simply ignored by this line:

elif srvCode in [132, 133, 134, 135, 136, 137, 138, 139, 140]:

And there is a plethora of DBS server errors we do not handle: https://github.com/dmwm/dbs2go/blob/8effd5a6bcb1c5b169348e3ac886891ad3aa1a2a/dbs/errors.go#L37-L81 : [2]
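To illustrate the nesting described above, here is a minimal sketch that pulls every DBSError code out of the error message, outermost first. It is not how DBSClient or WMCore parse the response; `extract_dbs_codes` is a hypothetical helper, and it assumes the codes always appear in the `DBSError Code:<n>` form seen in the printout.

```python
import re

# Matches the "DBSError Code:<n>" fragments seen in the server message.
DBS_CODE_RE = re.compile(r"DBSError Code:(\d+)")


def extract_dbs_codes(message):
    """Return all DBSError codes found in the message, outermost first."""
    return [int(code) for code in DBS_CODE_RE.findall(message)]


message = (
    "HTTP Error 400: DBSError Code:101 Description:DBS DB error "
    "Function:dbs.bulkblocks.InsertBulkBlocksConcurrently "
    "Message:unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/"
    "RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root "
    "Error: nested DBSError Code:103 Description:DBS DB query error, "
    "e.g. mailformed SQL statement Function:dbs.GetID "
    "Message: Error: sql: no rows in result set"
)

codes = extract_dbs_codes(message)
outer_code, root_cause_code = codes[0], codes[-1]
print(outer_code, root_cause_code)
```

With the message above, the wrapper code is 101 and the root cause is 103, which is the information that gets lost when only the HTTP 400 is kept.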

FYI: @germanfgv @LinaresToine

[1]

5643 │> /data/WMAgent.venv3/lib64/python3.9/site-packages/dbs/apis/dbsClient.py(477)__callServer()                                                                                                                                                                                                                           
5644 │-> self.__parseForException(http_error)                                         |                                                                                                                                                                                                                                      
5645 │(Pdb)                                                                           |                                                                                                                                                                                                                                      
5646 │DBS Server error: [{'error': {'reason': 'DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set', 'message': 'unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8df\
      a-1722884306c5.root', 'function': 'dbs.bulkblocks.InsertBulkBlocksConcurrently', 'code': 101, 'stacktrace': '\ngoroutine 7968287 [running]:\ngithub.com/dmwm/dbs2go/dbs.Error({0xb2ca20?, 0xc0006842d0?}, 0x65, {0xc000702000, 0x84}, {0xa5c044, 0x2b})\n\t/go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 +0x99\n\
      github.com/dmwm/dbs2go/dbs.(*API).InsertBulkBlocksConcurrently(0xc000236070)\n\t/go/src/github.com/vkuznet/dbs2go/dbs/bulkblocks2.go:508 +0x605\ngithub.com/dmwm/dbs2go/web.DBSPostHandler({0xb2f790, 0xc000aa01e0}, 0xc000686c60, {0xa3e07d, 0xa})\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:562 +0x109e\n\
      github.com/dmwm/dbs2go/web.BulkBlocksHandler({0xb2f790?, 0xc000aa01e0?}, 0xc000033f60?)\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:978 +0x3b\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0xb2f790?, 0xc000aa01e0?}, 0x11?)\n\t/usr/local/go/src/net/http/server.go:2171 +0x29\ngithub.com/dmwm/dbs2go/web.limitMi\
      ddleware.func1({0xb2f790?, 0xc000aa01e0?}, 0xc0006c6650?)\n\t/go/src/github.com/vkuznet/dbs2go/web/middlewares.go:110 +0x32\nnet/http.HandlerFunc.ServeHTTP(0xc0003c0f30?, {0xb2f790?, 0xc000aa01e0?}, 0xc0003af450?)\n\t/usr/loca'}, 'http': {'method': 'POST', 'code': 400, 'timestamp': '2024-11-06 16:16:23.350982\
      889 +0000 UTC m=+5760929.544914892', 'path': '/dbs/int/global/DBSWriter/bulkblocks', 'user_agent': 'DBSClient/Unknown/', 'x_forwarded_host': 'cmsweb-testbed.cern.ch', 'x_forwarded_for': '188.184.96.94:20438, 188.184.96.94', 'remote_addr': '10.100.148.128:41393'}, 'exception': 400, 'type': 'HTTPError', 'messag\
      e': 'DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root Error: nested DBSError Code:103 Description:DBS DB query error, e.g.\
       mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set'}]                                                                                                                                                                                                                             
5647 │RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root \
      Error: nested DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set                                                                                                                                                            
5648 │> /data/WMAgent.venv3/lib64/python3.9/site-packages/dbs/apis/dbsClient.py(477)__callServer()                                                                                                                                                                                                                           
5649 │-> self.__parseForException(http_error)                                         |                                                                                                                                                                                                                                      
5650 │(Pdb)                                                                           |                                                                                                                                                                                                                                      
5651 │> /data/WMAgent.venv3/lib64/python3.9/site-packages/dbs/apis/dbsClient.py(486)__callServer()                                                                                                                                                                                                                           
5652 │-> self.__parseForException(data)                                               |                                                                                                                                                                                                                                      
5653 │(Pdb)                                                                           |                                                                                                                                                                                                                                      
5654 │--Return--                                                                      |                                                                                                                                                                                                                                      
5655 │> /data/WMAgent.venv3/lib64/python3.9/site-packages/dbs/apis/dbsClient.py(486)__callServer()->None                                                                                                                                                                                                                     
5656 │-> self.__parseForException(data)                                               |                                                                                                                                                                                                                                      
5657 │(Pdb)                                                                           |                                                                                                                                                                                                                                      
5658 │RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root \
      Error: nested DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set                                                                                                                                                            
5659 │> /data/WMAgent.venv3/lib64/python3.9/site-packages/dbs/apis/dbsClient.py(647)insertBulkBlock()                                                                                                                                                                                                                        
5660 │-> result =  self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )|                                                                                                                                                                                                                                      
5661 │(Pdb) p result                                                                  |                                                                                                                                                                                                                                      
5662 │*** NameError: name 'result' is not defined                                     |                                                                                                                                                                                                                                      
5663 │(Pdb) n                                                                         |                                                                                                                                                                                                                                      
5664 │--Return--                                                                      |                                                                                                                                                                                                                                      
5665 │> /data/WMAgent.venv3/lib64/python3.9/site-packages/dbs/apis/dbsClient.py(647)insertBulkBlock()->None                                                                                                                                                                                                                  
5666 │-> result =  self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )|                                                                                                                                                                                                                                      
5667 │(Pdb) p result                                                                  |                                                                                                                                                                                                                                      
5668 │*** NameError: name 'result' is not defined                                     |                                                                                                                                                                                                                                      
5669 │(Pdb) n                                                                         |                                                                                                                                                                                                                                      
5670 │RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root \
      Error: nested DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set                                                                                                                                                            
5671 │> /data/WMAgent.venv3/srv/WMCore/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py(94)uploadWorker()                                                                                                                                                                                                                
5672 │-> dbsApi.insertBulkBlock(blockDump=block)                                      |                                                                                                                                                                                                                                      
5673 │(Pdb)                                                                           |                                                                                                                                                                                                                                      
5674 │> /data/WMAgent.venv3/srv/WMCore/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py(96)uploadWorker()                                                                                                                                                                                                                
5675 │-> except HTTPError as ex:                                                      |                                                                                                                                                                                                                                      
5676 │(Pdb)                                                                                                                                                                                           

[2]

// DBS Error codes provides static representation of DBS errors, they cover 1xx range
const (
	GenericErrorCode               = iota + 100 // generic DBS error
	DatabaseErrorCode                           // 101 database error
	TransactionErrorCode                        // 102 transaction error
	QueryErrorCode                              // 103 query error
	RowsScanErrorCode                           // 104 row scan error
	SessionErrorCode                            // 105 db session error
	CommitErrorCode                             // 106 db commit error
	ParseErrorCode                              // 107 parser error
	LoadErrorCode                               // 108 loading error, e.g. load template
	GetIDErrorCode                              // 109 get id db error
	InsertErrorCode                             // 110 db insert error
	UpdateErrorCode                             // 111 update error
	LastInsertErrorCode                         // 112 db last insert error
	ValidateErrorCode                           // 113 validation error
	PatternErrorCode                            // 114 pattern error
	DecodeErrorCode                             // 115 decode error
	EncodeErrorCode                             // 116 encode error
	ContentTypeErrorCode                        // 117 content type error
	ParametersErrorCode                         // 118 parameters error
	NotImplementedApiCode                       // 119 not implemented API error
	ReaderErrorCode                             // 120 io reader error
	WriterErrorCode                             // 121 io writer error
	UnmarshalErrorCode                          // 122 json unmarshal error
	MarshalErrorCode                            // 123 marshal error
	HttpRequestErrorCode                        // 124 HTTP request error
	MigrationErrorCode                          // 125 Migration error
	RemoveErrorCode                             // 126 remove error
	InvalidRequestErrorCode                     // 127 invalid request error
	BlockAlreadyExists                          // 128 block xxx already exists in DBS
	FileDataTypesDoesNotExist                   // 129 FileDataTypes does not exist in DBS
	FileParentDoesNotExist                      // 130 FileParent does not exist in DBS
	DatasetParentDoesNotExist                   // 131 DatasetParent does not exist in DBS
	ProcessedDatasetDoesNotExist                // 132 ProcessedDataset does not exist in DBS
	PrimaryDatasetTypeDoesNotExist              // 133 PrimaryDatasetType does not exist in DBS
	PrimaryDatasetDoesNotExist                  // 134 PrimaryDataset does not exist in DBS
	ProcessingEraDoesNotExist                   // 135 ProcessingEra does not exist in DBS
	AcquisitionEraDoesNotExist                  // 136 AcquisitionEra does not exist in DBS
	DataTierDoesNotExist                        // 137 DataTier does not exist in DBS
	PhysicsGroupDoesNotExist                    // 138 PhysicsGroup does not exist in DBS
	DatasetAccessTypeDoesNotExist               // 139 DatasetAccessType does not exist in DBS
	DatasetDoesNotExist                         // 140 Dataset does not exist in DBS
	LastAvailableErrorCode                      // last available DBS error code
)

@germanfgv
Contributor

I changed the DBSWriter instance that the component is accessing from cmsweb.cern.ch to cmsweb-prod.cern.ch. As expected, we no longer get the pyCurl error 52 message. The 272 blocks that are already correct in the database were processed without issues, and this is allowing the agent to continue creating and uploading blocks.

Now we are left with the original 4 problematic blocks.

@todor-ivanov
Contributor

todor-ivanov commented Nov 8, 2024

Here is a summary of the status and our findings on this issue, from the work with the T0 Team over the whole last week.

The problem is threefold:

  • The agent loses the HTTP header, containing the actual DBSError code, when we switch the frontend to APS.
    After a conversation with @vkuznet, we might have a direction. There is obviously a slight difference between how the connection is handled with Apache and APS. Things might boil down to the keepAlive && keepAliveTimeout flags.

  • We are not distinguishing between all the possible situations that could have led to a specific error. We treat only one separately: DBS ErrorCode 128. On top of that, we do not even handle or recognize all the possible errors that the DBS Server returns to us.

The above two mostly concern the huge pile of blocks which we were accumulating without recognizing that their records were already in DBS, such that the agent should have stopped retrying. Once we switched back to the Apache frontend all of those proceeded, and the subsequent steps for the other workflows depending on the data also started.

  • The third aspect is more subtle, though. We had four blocks which were:
    • Having all their files already in Rucio
    • Having an overlap in DBS with another block, meaning some of their files were wrongly uploaded to DBS as part of a completely different block (as extra files to it), which should never happen! On the other hand, this rightfully blocked those records on the DBS server side by violating a UNIQUE table constraint at the file level, so the original block was held back at the agent. The reasons for the overlap are still unknown. One possible place to look is a concurrency issue in how we feed the 4 different input queues of DBSUploadPoller. The json dump of the original block, though, is absolutely correct. The problem is that the json for the originally uploaded block with the extra files is long gone and we cannot dump it to see what was actually uploaded.

As a strategy, we decided to split the problem into 5 steps: 2 OPS and 3 DEV

  • OPS1: Switching back the agent to Apache front end and getting rid of the big backlog - DONE (reported by @germanfgv in the previous comment)
  • OPS2: Deleting manually the files which are overlapping between the blocks, such that we can release the last 4 blocks as well - This one is tricky because it does not finish with just deleting the files from a single table. I need to manually go and find all the relations between the following DBS database tables and clean them all from any record related to those files:
    • TABLE: FILES
    • TABLE: FILE_PARENTS
    • TABLE: FILE_LUMIS
    • TABLE: ASSOCIATED_FILES

(the latter I never imagined even existed)

  • DEV1: Start properly handling a bigger subset of errors at the Agent returned by the DBS Server
  • DEV2: Debug and fix the bug which caused the blocks overlap only in DBS
  • DEV3: Debug and find out why we are losing the HTTP header when we switch to the APS frontend

so:

  • OPS1: is done.
  • OPS2: I am currently fighting with it, polishing all the queries and cleanup procedures - but I'd rather not execute anything on Friday 5 o'clock. I plan to proceed on Monday.
  • DEV1: We need a new issue to be created and worked on
  • DEV2: The T0 Team continues to search through the logs and repeats, for the other three locked blocks, the analysis of the data we did for the Muon0 block: /Muon0/Run2024H-v1/RAW#7369ccdf-3d3a-4d32-bad9-b04b02f279d4. See the good summary by @LinaresToine here: Agents continuously failing to insert blocks into DBS #11965 (comment)
  • DEV3: This is somewhat tricky. In the debugging session connected through cmsweb-testbed.cern.ch (which is currently an APS frontend), I can see that the dbsclient (which is a dependency of WMCore) does see the HTTP header, and the DBS errors are well recognizable in the object. See my comment: Agents continuously failing to insert blocks into DBS #11965 (comment)

@amaltaro
Contributor Author

amaltaro commented Nov 8, 2024

Thank you for summarizing everything that has been going on in here.

For the OPS2 issue above, I find deleting entries from the DBS Server database extremely dangerous. Even though it might require extra work, it would be much safer to actually recreate the lumis (or block) that is failing to get inserted into DBS. Did you and the T0 discuss this possibility? @germanfgv

About the DEV1, unless I am missing some context, I do not think we should replicate every single status code from the DBS Server to the client side. IMO, the client should only deal with the status code that it can actually do something different. If there is no different execution flow, then reporting the error from the server is what we can do (which is already done in the generic exception AFAICT).
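A minimal sketch of that idea: only codes that change the execution flow get their own entry, and everything else falls through to the generic path. The action labels and the `choose_action` helper are hypothetical and not WMCore code; code 128 (block already exists) is the only case named in this thread as being treated separately.

```python
# 128 is BlockAlreadyExists in the dbs2go error list quoted in this thread.
BLOCK_ALREADY_EXISTS = 128

# Hypothetical action labels: only codes with a distinct follow-up are listed.
ACTIONS = {
    BLOCK_ALREADY_EXISTS: "stop_retrying",
}
DEFAULT_ACTION = "report_error_and_retry"


def choose_action(srv_code):
    """Map a DBS server error code to what the client should do next."""
    return ACTIONS.get(srv_code, DEFAULT_ACTION)


for code in (128, 103, 110):
    print(code, choose_action(code))
```

Adding a new special case then means adding one entry to the table, without replicating the full server-side list on the client.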

@germanfgv
Contributor

@amaltaro we no longer have streamer files for these run/lumis, it's not possible to recreate these blocks.

We could consider making the changes in Rucio, but it would require removing files from one block and adding them to the other. It would also require doing the same in the agent's DBSBUFFER database.

@amaltaro
Contributor Author

amaltaro commented Nov 8, 2024

Given the criticality and amount of information in DBS, it would be the last system in which I would delete things manually.
For dbsbuffer, do I understand it right that we would only need to mark this block and its files as uploaded to DBS?
For Rucio, what would have to be done? Remove files/replicas from a DATASET? Would it need creation of a new DATASET + files/replicas?

@germanfgv
Contributor

In Rucio, we would need to remove 4 files from one block and add them to another. In the agent's database, we would need to change the block of the 4 problematic files and mark them as InDBS. I'm not sure how Rucio would react to this, but I think it will be ok, as all files belong to the same container.

@amaltaro
Contributor Author

After some discussions during the Tier0 meeting, I decided to have a quick look at the logs to see if we can have a better understanding of this issue.

I don't see some information in this thread, so let me write my observations here:

  1. before the problematic blocks were created in DBS3Upload, the component had a few Oracle issues like:
Exception Class: DBSUploadException
Message: Unhandled exception while loading uploadable files for DatasetPath.
(cx_Oracle.DatabaseError) ORA-25401: can not continue fetches
  2. after these Oracle issues, I noticed many files being reported as duplicated in the logs:
2024-09-19 17:23:01,916:139632874354240:INFO:DBSUploadPoller:Executing loadFiles method...
2024-09-19 17:23:11,876:139632874354240:ERROR:DBSBufferBlock:Duplicate file inserted into DBSBufferBlock: 1077894
Ignoring this file for now!
  3. based on Antonio's feedback above, the "impostor block" had the following timeline in the component:
### impostor block 1
2024-09-19 17:51:38,694:139632874354240:INFO:DBSUploadPoller:Queueing block for insertion: /AlCaP0/Run2024H-v1/RAW#b51293e3-1563-47a3-a88f-1eb33790c417
2024-09-19 17:52:47,723:139632874354240:INFO:DBSUploadPoller:About to call insert block for: /AlCaP0/Run2024H-v1/RAW#b51293e3-1563-47a3-a88f-1eb33790c417
  4. while the original block had this timeline (and has kept failing since then):
2024-09-19 17:51:38,698:139632874354240:INFO:DBSUploadPoller:Queueing block for insertion: /AlCaP0/Run2024H-v1/RAW#3bbaf481-068c-4fda-8656-663fa9a987a4
2024-09-19 17:52:49,777:139632874354240:ERROR:DBSUploadPoller:Error trying to process block /AlCaP0/Run2024H-v1/RAW#3bbaf481-068c-4fda-8656-663fa9a987a4 through DBS. Details: DBSError code: 110, message: 997071d9311e283887ce5e57b0b1800467986e1c57f620aff5a39d98b881fb6c unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error, reason: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error
  5. looking into RucioInjector, these 2 blocks above had the following timeline:
### impostor block 1
2024-09-19 15:30:19,570:139632735942208:INFO:RucioInjectorPoller:Block /AlCaP0/Run2024H-v1/RAW#b51293e3-1563-47a3-a88f-1eb33790c417 inserted into Rucio
2024-09-19 15:30:29,385:139632735942208:INFO:RucioInjectorPoller:Successfully inserted 4 files on block /AlCaP0/Run2024H-v1/RAW#b51293e3-1563-47a3-a88f-1eb33790c417
2024-09-19 17:57:20,982:139632735942208:INFO:RucioInjectorPoller:Closing block: /AlCaP0/Run2024H-v1/RAW#b51293e3-1563-47a3-a88f-1eb33790c417
### original block 1
2024-09-19 17:56:12,439:139632735942208:INFO:RucioInjectorPoller:Block /AlCaP0/Run2024H-v1/RAW#3bbaf481-068c-4fda-8656-663fa9a987a4 inserted into Rucio
2024-09-19 17:56:41,372:139632735942208:INFO:RucioInjectorPoller:Successfully inserted 5 files on block /AlCaP0/Run2024H-v1/RAW#3bbaf481-068c-4fda-8656-663fa9a987a4

Having said that, I have the following questions/comments:

  1. it looks like we have not closed the original blocks in Rucio. AFAIK it is not a big deal and it has no impact in anything else. It is, nonetheless, different than any other block created by WMAgent.
  2. is it possible that the list of files returned from dbsbuffer was not unique? File id is supposed to be unique (and sequential, AFAICT). How about lfns, do we have the same lfn under different file ids? Otherwise, how would we iterate through the same fileid twice?

Without investigating the codebase too much, it is possible that those duplicate file ids ("Ignoring this file for now!") actually triggered the misbehavior of the component. This duplicate file id is identified here:
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSBufferBlock.py#L105
and one of the places it is used (there is another in the same module) is in this block:
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L487
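The two uniqueness questions above could be checked with something like the sketch below. It assumes the dbsbuffer result can be reduced to `(file_id, lfn)` pairs; `find_duplicates` is a hypothetical helper, the LFNs are made up, and only the file id 1077894 comes from the log line quoted earlier.

```python
from collections import Counter, defaultdict


def find_duplicates(rows):
    """rows: iterable of (file_id, lfn) pairs.

    Returns (repeated_ids, shared_lfns): file ids that occur more than once,
    and lfns that appear under more than one file id."""
    rows = list(rows)
    id_counts = Counter(file_id for file_id, _ in rows)
    repeated_ids = sorted(fid for fid, count in id_counts.items() if count > 1)

    ids_per_lfn = defaultdict(set)
    for file_id, lfn in rows:
        ids_per_lfn[lfn].add(file_id)
    shared_lfns = {lfn: sorted(ids) for lfn, ids in ids_per_lfn.items() if len(ids) > 1}
    return repeated_ids, shared_lfns


# Hypothetical rows: one file id returned twice, one lfn under two ids.
rows = [
    (1077894, "/store/a.root"),
    (1077895, "/store/b.root"),
    (1077894, "/store/a.root"),
    (1077896, "/store/b.root"),
]
repeated_ids, shared_lfns = find_duplicates(rows)
print(repeated_ids, shared_lfns)
```

A non-empty first result would point at the query returning the same row twice; a non-empty second result would point at the same lfn being registered under different file ids.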

@vkuznet
Contributor

vkuznet commented Nov 19, 2024

@amaltaro, a few observations:

  • From ChatGPT: The Oracle error ORA-25401: can not continue fetches usually occurs in the context of Oracle's Transparent Application Failover (TAF) when a failover attempt disrupts an ongoing SQL FETCH operation. This error indicates that the fetch operation cannot proceed because the session was lost and re-established during the failover.
  • From what I see in DBSUploadPoller.py, it uses a thread object which invokes begin/rollback

Is it possible that the thread was killed because ORACLE timed out, or that the connection to ORACLE was lost and an error was thrown? How does DBSUploadPoller.py guarantee that transactions will be rolled back if the thread is killed? From what I read in the code, nothing is protected for such a use case, and the transaction will not be rolled back if the thread is killed. That may explain the weird behavior.

In other words, if the thread is killed for whatever reason, there is no guarantee that the transaction is rolled back in Python. The polling cycle will then start the poller again, and it may execute the same injection of objects into the database, which may not be protected (if there is no UNIQUE constraint on an injected object). That may explain the observed behavior.
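As an illustration of the two protections mentioned above (an explicit rollback on failure, and a UNIQUE constraint as the last line of defense), here is a small sketch. It uses an in-memory sqlite3 database as a stand-in for ORACLE; the table layout, `transaction` and `insert_block` are hypothetical and are not taken from DBSUploadPoller.py. Note that the rollback only covers exceptions raised inside the block; a thread that is killed outright never reaches it, which is why the constraint matters.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn):
    """Commit on success, roll back on any exception raised inside the block."""
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise


def insert_block(conn, block, lfns, fail_after=None):
    """Insert all files of a block in one transaction.

    fail_after simulates a connection loss after that many inserts."""
    with transaction(conn):
        for index, lfn in enumerate(lfns):
            if fail_after is not None and index == fail_after:
                raise RuntimeError("simulated lost connection")
            conn.execute("INSERT INTO files (lfn, block) VALUES (?, ?)", (lfn, block))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (lfn TEXT UNIQUE, block TEXT)")
conn.commit()

# First attempt fails half way: the partial insert is rolled back.
try:
    insert_block(conn, "blockA", ["f1", "f2"], fail_after=1)
except RuntimeError:
    pass
count_after_failure = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]

# The next polling cycle retries and succeeds.
insert_block(conn, "blockA", ["f1", "f2"])
count_after_retry = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]

# A file that already exists cannot be attached to a different block.
try:
    insert_block(conn, "blockB", ["f2"])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(count_after_failure, count_after_retry, duplicate_rejected)
```

Without the UNIQUE constraint on `lfn`, the last insert would succeed and the same file would end up in two blocks.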

@amaltaro
Contributor Author

A new development issue has been created with #12229, which will make WMAgent more resilient and error messages more friendly.

Given that all operational issues have now been resolved - despite not being able to understand the T0 issues, even after tons of debugging by Todor and German - I think we can close this out.

Todor, Andrea, others, please reopen it if anything is still pending from this debugging/operations. Thanks!

@github-project-automation github-project-automation bot moved this from In Progress to Done in WMCore quarterly developments Jan 21, 2025