Agents continuously failing to insert blocks into DBS #11965
Comments
@todor-ivanov as discussed in the meeting today - and right now with Andrea as well - let us put this back to ToDo and come back to it at the beginning of October (2 more weeks should not hurt us here).
Following a discussion in the Mattermost wm-ops thread with @amaltaro, related to the failure to insert data into DBS: the current T0 production agent is struggling to insert files into blocks. I see the following error message in the DBS3Upload component log:
This is present for the following blocks:
I suggest that you review #11106, which describes the actual issue with concurrent data insertion. In short, to make it work all pieces (like the dataset configuration, etc.) must be in place before concurrent injection can happen. To solve this problem, someone must first inject one block with all the necessary information, and only then can the concurrent pattern be used safely to inject the other blocks.
@vkuznet thank you for jumping into this discussion. I had a feeling that there was another obscure problem with DBS Server, and reviewing the ticket you pointed to (11106) - and according to your sentence above - I understand that, provided we have at least 1 block injected into DBS for a given dataset, the "concurrency error" should no longer happen, given that all the foundational information is already in the database. Correct? I picked one of the blocks provided by Antonio and queried DBS Server for the blocks of its dataset: as you can see, this dataset already has a bunch of blocks in the database. So, how come we are having a "concurrency error" here?
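For reference, the lookup boils down to something like the following (a sketch with the dbs3-client API, assuming a valid X509 proxy/cert; the dataset name is just a placeholder, not the real one):

```python
# Sketch: list the blocks DBS already knows for a given dataset.
from dbs.apis.dbsClient import DbsApi

dbsApi = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader")
dataset = "/SomePrimaryDataset/SomeProcessedDataset-v1/RAW"  # placeholder
for blk in dbsApi.listBlocks(dataset=dataset):
    print(blk["block_name"])
```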
If you inspect the code [1], in order to insert DBS blocks concurrently we need to have the following in place:
So, if all of this information is present and consistent across all blocks in DBS, then the answer is yes, the concurrency error (based on database content) should not arise. In other words, the DBS server first acquires or inserts this info into the DBS tables, and if two or more HTTP calls arrive at the same time this can cause a database error, which leads to a concurrency error from the DBS server. Whether that is the case for the discussed blocks I don't know. But it is possible to not have all the information present in the DB across all blocks if any of the above differ among them. You may look at the example bulkblocks JSON [2] to see how this information is actually structured and provided to DBS. In particular, see the information referenced in [1] https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L478
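For orientation, the bulkblocks payload is structured roughly as below (an abbreviated sketch from memory; the linked JSON example [2] is the authoritative reference). The point is that all of these auxiliary pieces must already exist, and be identical across blocks of the same dataset, for concurrent injection to be safe.

```python
# Rough shape of a bulkblocks payload (abbreviated; field values are placeholders).
bulkblock = {
    "acquisition_era": {"acquisition_era_name": "Run2023X", "start_date": 0},
    "processing_era": {"processing_version": 1, "description": ""},
    "primds": {"primary_ds_name": "SomePD", "primary_ds_type": "data"},
    "dataset": {"dataset": "/SomePD/SomeProc-v1/RAW", "dataset_access_type": "VALID"},
    "block": {"block_name": "/SomePD/SomeProc-v1/RAW#some-uuid", "open_for_writing": 0},
    "dataset_conf_list": [],  # output module configurations
    "file_conf_list": [],     # per-file output module configurations
    "files": [],              # each file record must carry lfn, size, checksums, file type, ...
    "file_parent_list": [],   # parentage information
}
```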
Valentin, unless there is a bug in the (T0)WMAgent, all the blocks for the same dataset should carry exactly the same metadata. That means the same acquisition era, primary dataset, etc etc etc. Having said that, if a block exists in DBS Server, we can conclude that all of its metadata is already available as well. IF that metadata is already available and we are trying to inject more blocks for the same dataset, hence the same metadata, there should be NO concurrency error. Based on your explanation and on the data shared by Antonio, I fail to see how we would hit a "concurrency error". That means there is more to it than what we have discussed/understood so far; or the error message is misleading... In any case, I would suggest having @todor-ivanov follow this up next week, comparing things against the DBS Server logs and the source code.
I looked further into the DBS code and I think I identified the issue. According to the DBS code:
Then, I looked at one of the dbs logs and found
So, indeed, the input file record DOES NOT contain the required file type attribute. To summarize, I suggest checking the JSON records T0 provides and ensuring the file type attribute is present.
For the record, here is how the DBS error looks in a log:
So, you have all the pointers to see which lines of code fail by inspecting the stack, and that is exactly what I did.
As far as I can tell, it should always be set like:
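For illustration, a file record inside the payload's files section with that attribute in place would look roughly like this (all values are placeholders; "EDM" is the usual value for WMAgent output):

```python
# Illustrative file record from the "files" section of a bulkblocks payload.
fileRecord = {
    "logical_file_name": "/store/data/SomeEra/SomePD/RAW/v1/000/000/001/file.root",
    "file_size": 123456789,
    "event_count": 1000,
    "check_sum": "123456",
    "adler32": "deadbeef",
    "file_type": "EDM",   # must always be present
    "file_lumi_list": [],
}
```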
@LinaresToine can you please change the component configuration and provide one of the block names that is failing to be inserted, in the following line:
then restart the component.
OK, I changed the config as suggested. Waiting on the loadFiles method to complete the cycle. I'll follow up.
I have placed the output JSON file. Another error is showing up in the DBS3Upload component for all 4 pending blocks:
An update from T0:
Now we have a total of
Because of these, we have
Here is the follow-up on the status of those blocks according to DBS. I had to create a script to query the DBS database directly, LFN by LFN, for all those blocks, and here is the accumulated result. From what I can see in those results, we can identify at least 3 different use cases:
I am going to filter out those which we know are already there. On top of that, I am considering checking their Rucio status as well.
p.s. DBSBlocksCheck.py is the script I used for accumulating those results, and blockDBSRecords.json is an updated version of the DBS records with the Rucio information per block as well.
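The gist of the check the script performs is roughly the following (a sketch, not the attached script itself; block and file names are placeholders):

```python
# Ask DBS, LFN by LFN, whether the records of a pending block are already there.
from dbs.apis.dbsClient import DbsApi

dbsApi = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader")

def checkBlock(lfns):
    """Return a map of LFN -> True/False depending on whether DBS knows it."""
    status = {}
    for lfn in lfns:
        status[lfn] = bool(dbsApi.listFiles(logical_file_name=lfn))
    return status

# usage sketch: the LFNs would come from the agent's dbsbuffer records
# summary = checkBlock(["/store/data/.../file.root"])
```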
Continuing to reduce the results into something more readable, here [1] is the final list of the block and file status at DBS for all of them. As one can see:
FYI: @germanfgv @LinaresToine [1]
@germanfgv @LinaresToine Could we check what is special about those 4 blocks reported as experiencing
@todor-ivanov as we discussed in the WMCore meeting, DBS3Upload should have a mechanism to identify blocks that have already been injected into DBS Server but failed to acknowledge the operation for some reason. If the component tries to inject a block already in the server, the server is supposed to return exit code 128, marking the block accordingly, which will trigger the execution of this block of code in the next cycle of the component. I don't think anything has changed in the DBS Server codebase lately, so I expect this feature to still be functional. But you might want to review the error message/code that we are getting for the problematic blocks.
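As a rough sketch of that mechanism (not the actual DBSUploadPoller code; the layout of the error record and the meaning of code 128 are assumptions based on what is quoted above):

```python
import json

BLOCK_EXISTS_CODE = 128  # code quoted above; treated here as an assumption

def blockAlreadyInDBS(responseBody):
    """Return True if the DBS server reply says the block already exists."""
    try:
        reply = json.loads(responseBody)
    except ValueError:
        return False
    records = reply if isinstance(reply, list) else [reply]
    for rec in records:
        if isinstance(rec, dict) and rec.get("error", {}).get("code") == BLOCK_EXISTS_CODE:
            return True
    return False
```

If this returned True, the component could mark the block as injected right away instead of retrying it forever.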
@todor-ivanov we are having similar problems with one agent that is ready to be shut down (after draining), but it still has one block that it fails to inject into DBS Server. Could you please look into submit12 and try to understand what the problem is with:
It seems to have been failing injection since Oct 15th.
Thanks @todor-ivanov! I tested this using my own script (DBSBlockCheck.py) and got the same result: all but 4 blocks are already available in DBS. I can check @amaltaro's idea tomorrow.
In reality, the error code the agent unwraps from the HTTP header is, for some reason, the one shown in [1].
Actually, it never tries to read the HTTP header and resolve the true DBS error, which is supposed to be done through the
And the reason why it happens like that is, obviously, because of the way the error returned by the server is handled here: WMCore/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py, lines 96 to 115 in 76fd3a9
But instead the exception is handled as a general exception, and this branch takes control: WMCore/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py, lines 116 to 119 in 76fd3a9
It has to have something to do with this line from the traceback:
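Rather than relying on that generic string, something along these lines (a pycurl sketch with placeholder URL and certificate paths; not the agent code) would let us capture both the header and the body of the reply and then decode the real DBSError from it:

```python
from io import BytesIO
import json
import pycurl

def postToDBS(url, payload, cert, key):
    """POST a JSON payload and return (status, raw header, raw body)."""
    body, header = BytesIO(), BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.POST, 1)
    curl.setopt(pycurl.POSTFIELDS, json.dumps(payload))
    curl.setopt(pycurl.HTTPHEADER, ["Content-Type: application/json"])
    curl.setopt(pycurl.SSLCERT, cert)
    curl.setopt(pycurl.SSLKEY, key)
    curl.setopt(pycurl.WRITEFUNCTION, body.write)
    curl.setopt(pycurl.HEADERFUNCTION, header.write)
    curl.perform()
    status = curl.getinfo(pycurl.RESPONSE_CODE)
    curl.close()
    return status, header.getvalue().decode(), body.getvalue().decode()
```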
And just to add to the observation: the 4 blocks which I mentioned are experiencing the error handled at WMCore/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py, lines 96 to 115 in 76fd3a9.
But as we can see [1]
@todor-ivanov actually, at some point, the 4 blocks started failing with a different error. This is the last appearance of the DBSError:
This is the first appearance of the pyCurl error:
This might be a clue to what caused the other 272 blocks to fail. It seems something changed on the DBS server side at some point between these two points in time.
Here you have the
@germanfgv I would like to mention that, according to the k8s production DBS cluster, we have been running the DBS pods for 209 days. Therefore, nothing has changed on the DBS side, and neither am I aware of any development, commits or PRs. The concurrency error may seem misleading since it is printed out with the concurrent call to file injection, but the file injection fails due to missing aux meta-data in the JSON payload. Please see this DBS code:
I reported MANY times that the most likely issue is the missing file data type in the JSON payload, and I strongly suggest starting with your JSON payload and checking whether it is there. In particular, the files section of the payload should contain the file data type. If the JSON payload is correct in terms of ALL required aux meta-data, I suggest that you move down the list, check the validity of the file(s), and finally look for the ORACLE insert error.
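A quick way to verify that on a dumped payload would be something like this (a sketch; the file path is a placeholder):

```python
# Check that every record in the "files" section of a dumped bulkblocks JSON
# carries a non-empty file_type.
import json

with open("/tmp/dumped_block.json") as fd:  # placeholder path
    block = json.load(fd)

missing = [rec["logical_file_name"] for rec in block.get("files", [])
           if not rec.get("file_type")]
print("files missing file_type:", missing or "none")
```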
@vkuznet We have 2 separate problems here:
I bring up the APS upgrade in reference to the issues parsing the response from the server, not as an explanation for the concurrency issues. After that, I would like to switch this agent temporarily to the Apache frontend. Regarding the JSON payload, the dumps we obtained from the agent show
My suggestions would be the following:
@germanfgv, and regarding switching to
About the 4 original blocks with the concurrency errors: I see all the files in DBS, but they are distributed among two "impostor" blocks.
I call them impostors because both contain files that belong to other blocks, and number 2 is not even in Rucio and all its files belong to another block according to the database. Here is a summary of all 5 blocks, the 2 impostors and the 3 originals:
About:
Just to put @vkuznet's words in perspective: I tried to completely simulate the whole agent environment in
Which actually contains the true DBS Error in the header and one can spot the error message in the printout:
So those are two nested DBS errors:
And there is a plethora of DBS server errors that we do not handle: https://github.com/dmwm/dbs2go/blob/8effd5a6bcb1c5b169348e3ac886891ad3aa1a2a/dbs/errors.go#L37-L81 [2]
FYI: @germanfgv @LinaresToine
[1]
[2]
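One way to at least surface all the nested errors would be a simple pass over the reply text, along these lines (a sketch; the "DBSError Code/Description" textual format is assumed from the printouts above):

```python
import re

# Pattern assumed from the error printouts above; adjust to the real format.
DBS_ERR_RE = re.compile(r"DBSError Code:\s*(\d+)\s+Description:\s*([^,]+)")

def unwrapDBSErrors(replyText):
    """Return a list of (code, description) tuples found in the server reply."""
    return [(int(code), desc.strip()) for code, desc in DBS_ERR_RE.findall(replyText)]
```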
I changed the DBSWriter instance that the component is accessing, switching it to the Apache frontend. Now we are left with the original 4 problematic blocks.
Here is a summary of the status and of our findings on this issue from the work with the T0 team over the whole last week. The problem is threefold:
The above two mostly concern the huge pile of blocks which we were accumulating without recognizing that their records were already in DBS, such that the agent should have stopped retrying them. Once we switched back to the Apache frontend all of those proceeded, and the sequential steps for the other workflows depending on the data also started.
As a strategy, we decided to split the problem into 5 steps: 2 OPS and 3 DEV
(the latter we never even imagined existed)
so:
Thank you for summarizing everything that has been going on in here. For the OPS2 issue above, I find deleting entries from the DBS Server database extremely dangerous. Even though it might require extra work, it would be much safer to actually recreate the lumis (or block) that are failing to get inserted into DBS. Did you and the T0 discuss this possibility, @germanfgv? About DEV1, unless I am missing some context, I do not think we should replicate every single status code from the DBS Server on the client side. IMO, the client should only deal with the status codes for which it can actually do something different. If there is no different execution flow, then reporting the error from the server is all we can do (which is already done in the generic exception, AFAICT).
@amaltaro we no longer have streamer files for these run/lumis, so it's not possible to recreate these blocks. We could consider making the changes in Rucio, but it would require removing files from one block and adding them to the other. Also, it would require doing the same in the agent's database.
Given the criticality and amount of information in DBS, it would be the last system in which I would delete things manually.
In Rucio, we would need to remove 4 files from one block and add them to another. In the agent's database, we would need to change the block of the 4 problematic files and mark them accordingly.
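For concreteness, the Rucio part of such a change would look roughly like this (block and file names are placeholders; the update of the agent's database is not covered here):

```python
# Move the 4 files from the "wrong" block to the "right" one in Rucio.
from rucio.client import Client

client = Client()
scope = "cms"
wrongBlock = "/SomePD/SomeProc-v1/RAW#wrong-uuid"   # placeholder
rightBlock = "/SomePD/SomeProc-v1/RAW#right-uuid"   # placeholder
files = [{"scope": scope, "name": "/store/data/placeholder/file_%d.root" % i}
         for i in range(4)]

client.detach_dids(scope=scope, name=wrongBlock, dids=files)
client.attach_dids(scope=scope, name=rightBlock, dids=files)
```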
After some discussions during the Tier0 meeting, I decided to have a quick look at the logs to see if we can get a better understanding of this issue. Some of this information is not yet in this thread, so let me write my observations here:
Having said that, I have the following questions/comments:
Without investigating the codebase too much, it is possible that those duplicate file ids ("Ignoring this file for now!") actually triggered the misbehavior of the component. This duplicate file id is identified here: |
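A guard of roughly this shape (a sketch, not the actual component code) would keep such a duplicate from propagating into the block being built:

```python
def dedupeFiles(fileRecords):
    """Keep the first occurrence of every file id; log and drop the rest."""
    seen = set()
    unique = []
    for rec in fileRecords:
        fid = rec["id"]
        if fid in seen:
            print("Ignoring duplicate file id %s (lfn=%s)" % (fid, rec.get("lfn")))
            continue
        seen.add(fid)
        unique.append(rec)
    return unique
```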
@amaltaro, a few observations:
Is it possible that the thread was killed because ORACLE timed out? Or that the connection to ORACLE was lost and an error was thrown? In other words, because of the polling cycle, if the thread is killed for whatever reason there is no guarantee that the transaction can be rolled back in Python. But the polling cycle will start the poller again, and it may execute the same injection of objects into the database, which may not be protected (if there is no UNIQUE constraint on an injected object), and that may explain the observed behavior.
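To illustrate the point: with a UNIQUE constraint in place, a re-executed injection becomes harmless (a self-contained sqlite sketch; the real schema is of course different):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE injected_block (block_name TEXT UNIQUE)")

def recordInjection(blockName):
    try:
        conn.execute("INSERT INTO injected_block (block_name) VALUES (?)", (blockName,))
        conn.commit()
    except sqlite3.IntegrityError:
        # a previous (possibly killed) cycle already inserted it -- that's fine
        pass

recordInjection("/SomePD/SomeProc-v1/RAW#some-uuid")
recordInjection("/SomePD/SomeProc-v1/RAW#some-uuid")  # retried cycle, no error
```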
A new development issue has been created with #12229, which will make WMAgent more resilient and error messages more friendly. Given that all operational issues have now been resolved - despite not being able to understand the T0 issues, even after tons of debugging by Todor and German - I think we can close this out. Todor, Andrea, others, please reopen it if anything is still pending from this debugging/operations. Thanks!
Impact of the bug
WMAgent
Describe the bug
There seems to be an unusual number of blocks that are continuously failing to be inserted into DBS Server, with a variety of errors, as can be seen in [1] and [2].
For [1], that/those blocks actually belong to a workflow that went all the way to completed in the system and then got rejected, as can be seen from this ReqMgr2 API.
For [2], that block belongs to a workflow that is currently in running-closed status. The block has been failing injection for about 10h.
This is based on vocms0255, I haven't yet checked the other agents.
How to reproduce it
Not sure
Expected behavior
For the rejected workflow (or aborted), we should make DBS3Upload aware that output data is no longer relevant and skip their injection into DBS Server. This might require persisting information in the DBSBuffer tables (like marking the block and relevant files as injected), otherwise the same blocks will come up every time we run a cycle of the DBS3Upload component.
For the malformed SQL statement error (note the typo "mailformed"(!)), we probably need to correlate this error with further information from DBS Server. Is it the same error as we have with concurrent HTTP requests? Or what is actually wrong with it? Maybe @todor-ivanov can shed some light on this. The expected behavior of this fix is to be determined.
Additional context and error message
[1]
[2]