
Blocks failing to be inserted with Error: concurrency error #98

Closed
amaltaro opened this issue Aug 3, 2023 · 7 comments

amaltaro commented Aug 3, 2023

Hello,
I found 2 central production blocks that have been failing to be inserted into DBS since June:

 /DYJetsToMuMu_M-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Winter23Reco-TRKRealistic_AlcaRecoRealisticTRK_REAL_126X_mcRun3_2022_realistic_postEE_v2-v2/GEN-SIM-RECO#c44686c3-86fa-4fc3-9902-ae328914b317
/DYJetsToMuMu_M-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Winter23Digi-TRKRealistic_REAL_126X_mcRun3_2022_realistic_postEE_v2-v2/GEN-SIM-RAW#e70265df-efb6-42ca-8858-1dcf669e68a2

and they belong to the following workflow: cmsunified_task_TRK-Run3Winter23wmLHEGS-00005__v1_T_230615_091700_9237

I see each of them has the maximum number of files defined in WMAgent (500 files), and the error message we get back in the client is:

2023-08-03 12:35:58,964:140192869803840:INFO:DBSUploadPoller:About to call insert block for: /DYJetsToMuMu_M-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Winter23Digi-TRKRealistic_REAL_126X_mcRun3_2022_realistic_postEE_v2-v2/GEN-SIM-RAW#e70265df-efb6-42ca-8858-1dcf669e68a2
2023-08-03 12:35:59,690:140192869803840:ERROR:DBSUploadPoller:Error trying to process block /DYJetsToMuMu_M-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Winter23Reco-TRKRealistic_AlcaRecoRealisticTRK_REAL_126X_mcRun3_2022_realistic_postEE_v2-v2/GEN-SIM-RECO#c44686c3-86fa-4fc3-9902-ae328914b317 through DBS. Details: DBSError code: 110, message: 4cdac4592325e3196087c4671726e76b3a35851ab1fa81ed3b4121ef2323d989 unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error, reason: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error

these blocks are inserted with the following dbsApi call:

            dbsApi.insertBulkBlock(blockDump=block)

as can be seen in this source code: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py

I am also attaching both block dumps (what WMAgent is posting to the DBS server) to make debugging easier. Please find them here: https://amaltaro.web.cern.ch/amaltaro/forWMCore/dbs2go_98/

amaltaro (Author) commented

@d-ylee Hi Dennis, this is still a problem for central production and we have to keep a service running only because these 2 blocks are not going through. Would you be able to check the server logs and help me understand why these blocks fail to get injected?

d-ylee (Contributor) commented Nov 6, 2023

@amaltaro Are these from global prod?

amaltaro (Author) commented Nov 6, 2023

@d-ylee yes, against cmsweb-prod cluster.

d-ylee (Contributor) commented Nov 7, 2023

@amaltaro I checked the logs around that time. They show that the concurrency error was happening in insertFilesViaChunks:

return Error(ConcurrencyErr, InsertErrorCode, "", "dbs.bulkblocks.insertFilesViaChunks")
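In other words, as I read it, a failure on any single file inside a chunk only bumps the shared trec.NErrors counter, and insertFilesViaChunks then turns a non-zero counter into this one generic concurrency error, so the per-file detail is lost. A self-contained toy sketch of that pattern (just an illustration of the shape, not the actual dbs2go code) is:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// insertOneFile stands in for the per-file INSERT; pretend one LFN fails.
func insertOneFile(lfn string) error {
	if lfn == "bad.root" {
		return errors.New("unique constraint violated") // the interesting detail
	}
	return nil
}

// insertChunk mimics the error handling in insertFilesChunk: a failure only
// bumps a shared counter, the underlying error itself is not propagated.
func insertChunk(lfns []string, nErrors *int64) {
	for _, lfn := range lfns {
		if err := insertOneFile(lfn); err != nil {
			atomic.AddInt64(nErrors, 1) // err is dropped here unless verbosity is raised
		}
	}
}

func main() {
	chunks := [][]string{{"a.root", "b.root"}, {"bad.root", "c.root"}}
	var nErrors int64
	var wg sync.WaitGroup
	for _, chunk := range chunks {
		wg.Add(1)
		go func(lfns []string) {
			defer wg.Done()
			insertChunk(lfns, &nErrors)
		}(chunk)
	}
	wg.Wait()
	if nErrors > 0 {
		// all the client ever sees is this generic message
		fmt.Printf("NErrors=%d: concurrency error\n", nErrors)
	}
}
```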

For /DYJetsToMuMu_M-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Winter23Reco-TRKRealistic_AlcaRecoRealisticTRK_REAL_126X_mcRun3_2022_realistic_postEE_v2-v2/GEN-SIM-RECO#c44686c3-86fa-4fc3-9902-ae328914b317, I see that the NErrors in insertBulkBlocksConcurrently is 13.

For /DYJetsToMuMu_M-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Winter23Digi-TRKRealistic_REAL_126X_mcRun3_2022_realistic_postEE_v2-v2/GEN-SIM-RAW#e70265df-efb6-42ca-8858-1dcf669e68a2, the NErrors is 7.

trec.NErrors is only incremented in the insertFilesChunk function:

func insertFilesChunk(

Within the function, there are three locations where trec.NErrors is incremented:

  1. dbs2go/dbs/bulkblocks2.go, lines 1028 to 1034 in 28e02bd:

     fileTypeID, err := GetID(tx, "FILE_DATA_TYPES", "file_type_id", "file_type", rrr.FileType)
     if err != nil {
         if utils.VERBOSE > 1 {
             log.Println("### trec unable to find file_type_id for", rrr.FileType, "lfn", lfn, "error", err)
         }
         trec.NErrors += 1
         return

  2. dbs2go/dbs/bulkblocks2.go, lines 1050 to 1055 in 28e02bd:

     if trec.IsFileValid == 0 {
         if rrr.IsFileValid != 0 && rrr.IsFileValid != 1 {
             log.Println("### wrong is_file_valid value for", rrr.FileType, "lfn", lfn, "error", err)
             trec.NErrors += 1
             return
         }

  3. dbs2go/dbs/bulkblocks2.go, lines 1078 to 1085 in 28e02bd:

     err = r.Insert(tx)
     if err != nil {
         if utils.VERBOSE > 1 {
             log.Printf("### trec unable to insert File record for lfn %s, error %v", lfn, err)
         }
         trec.NErrors += 1
         return
     }

For location 2, I do not see ### wrong is_file_valid... in the logs, so I don't think it is at 2.

It is hard to tell whether it is location 1 or 3 because in production the configuration sets verbosity to 0, so the messages corresponding to 1 or 3 are not printed in the logs.

For the 3rd location, the error could also be occurring inside the Files.Insert function:

func (r *Files) Insert(tx *sql.Tx) error {

This could be happening when the sql statement is being executed:

dbs2go/dbs/files.go, lines 292 to 314 in 28e02bd:

    _, err = tx.Exec(
        stm,
        r.FILE_ID,
        r.LOGICAL_FILE_NAME,
        r.IS_FILE_VALID,
        r.DATASET_ID,
        r.BLOCK_ID,
        r.FILE_TYPE_ID,
        r.CHECK_SUM,
        r.FILE_SIZE,
        r.EVENT_COUNT,
        r.ADLER32,
        r.MD5,
        r.AUTO_CROSS_SECTION,
        r.CREATION_DATE,
        r.CREATE_BY,
        r.LAST_MODIFICATION_DATE,
        r.LAST_MODIFIED_BY)
    if err != nil {
        if utils.VERBOSE > 0 {
            log.Println("unable to insert files, error", err)
        }
        return Error(err, InsertErrorCode, "", "dbs.files.Insert")

Beyond this, it is hard to see from the production logs why this error occurred. @amaltaro Do you know which 13 and 7 files could potentially have this issue?

Side note: I think we might need to move some of these messages out of the verbosity guards, especially in the Files.Insert function. As it stands, I cannot tell exactly which files failed to be inserted and caused the concurrency error.
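For example, something along these lines (a hypothetical sketch of what I mean, not a patch against the actual Files.Insert code) would keep the failing LFN visible even with verbosity 0, e.g. by wrapping it into the returned error or passing it as the reason string to the Error helper:

```go
package main

import (
	"errors"
	"fmt"
)

// insertFileRecord is a stand-in for the Files.Insert error path: instead of
// logging the detail only when utils.VERBOSE > 0, wrap the failing LFN into
// the returned error so it survives into the server log and the API response.
func insertFileRecord(lfn string, execErr error) error {
	if execErr != nil {
		return fmt.Errorf("dbs.files.Insert: unable to insert lfn %s: %w", lfn, execErr)
	}
	return nil
}

func main() {
	// simulate a failing tx.Exec for one file
	err := insertFileRecord("/store/mc/some/dataset/file.root", errors.New("insert failed"))
	if err != nil {
		fmt.Println(err) // the LFN is now part of the error, regardless of verbosity
	}
}
```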

vkuznet (Contributor) commented Nov 7, 2023

From my previous experience debugging these kinds of errors, they come from an inconsistent JSON configuration: for file/block injection the proper configuration must already be in place, and in a concurrent environment it is not. I already reported that the bulkblock DBS API is not concurrency-safe by design, i.e. it has a race condition; see my report in this ticket: dmwm/WMCore#11106. Without properly addressing the DBS data injection procedure, I still think we may experience this kind of error from time to time. To resolve it, someone should FIRST inject into DBS all the configuration for the dataset/block/files and only then start injecting the blocks/files.
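To make that ordering concrete, the injection would look roughly like the sketch below. This is only an illustration: the endpoint names are the usual DBS writer APIs, the real client also needs CMS grid-certificate authentication, and the payload file names are made up.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

// post sends one JSON payload to a DBS writer API endpoint.
func post(base, api, payloadFile string) error {
	data, err := os.ReadFile(payloadFile)
	if err != nil {
		return err
	}
	resp, err := http.Post(base+"/"+api, "application/json", bytes.NewReader(data))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Println(api, "->", resp.Status)
	return nil
}

func main() {
	base := "https://cmsweb.cern.ch/dbs/prod/global/DBSWriter"
	// 1. inject the configuration records first ...
	steps := []struct{ api, payload string }{
		{"primarydatasets", "primds.json"},
		{"outputconfigs", "outputconfig.json"},
	}
	for _, s := range steps {
		if err := post(base, s.api, s.payload); err != nil {
			panic(err)
		}
	}
	// 2. ... and only then the block/files payload
	if err := post(base, "bulkblocks", "blockdump.json"); err != nil {
		panic(err)
	}
}
```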

jenimal commented Jan 17, 2024

This workflow is in rejected-archived; don't worry, blow it away.

amaltaro (Author) commented

@d-ylee Dennis, I am going to shut down the agent that tries to inject those 2 blocks into DBS, as the workflow has been rejected and the output data is no longer needed. Feel free to close this issue out at your convenience - and thank you for helping out with this debugging!

d-ylee closed this as completed Jan 17, 2024