Successful (no job failure) workflows w/ missing output on DBS #11358
Comments
@vkuznet Valentin, can you please investigate this issue? A short summary of steps to debug it is:
I can help you once you get to this point, but this brief set of instructions should be enough to get started and might already be enough to find the problem. |
I can start looking into this issue next week, as I'm busy today and early next week with doctor appointments. Meanwhile, it would be useful if @amaltaro could provide the exact URL calls DBS3Upload makes and whether it logs these calls to DBS. From what I read, I see two possibilities here:
The problem will be much easier to debug if, for a given workflow, we identify the URL call DBS3Upload makes to DBS and see its parameters, i.e. if we know the file/block supplied via the HTTP call, we can easily check whether it is an issue with DBS or not. |
@vkuznet DBS3Upload uses only one API. You can also see what is logged around this code and check the logs of the relevant agent (once you figure out which agent it is; please see the instructions above). As you know, the block information can be extremely large, so we definitely do not log and/or dump it anywhere. There is a way to dump it though, but I'd suggest taking the first debugging steps without the whole information that is passed to the server (other than the block name). Given that you decided to be more verbose and explicit about the exception hit in the server, an analysis of the error message might be enough to figure out what the problem is. |
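For context, that single write call is the bulk block injection discussed further down in this thread. A minimal sketch of the call shape, assuming the standard dbsClient Python bindings; the DBS instance URL and the dump file name are placeholders, not taken from the actual component:

```python
# Minimal sketch of the single DBS write call used for block injection,
# assuming the standard dbsClient bindings; URL and file name are placeholders.
import json
from dbs.apis.dbsClient import DbsApi

writer = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSWriter")

with open("block_dump.json") as fd:  # hypothetical dump of one block payload
    block_dump = json.load(fd)

# One call carries the whole block: dataset, block metadata, files, parents,
# configs. If anything referenced inside is missing in DBS, the server fails.
writer.insertBulkBlocks(block_dump)
```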
From submit6 agent ComponentLog:
What it actually means is that the provided JSON has insufficient information to insert a specific record. In order to answer what is missing, we need the actual JSON of a specific failed block. My question is how to get this JSON. The most recent error I see in the ComponentLog is dated
but I do not see a block with ID 397d4e07-e918-4c6d-ab5b-b074f1b83ee2 anywhere in the dbs2go-global-w logs on that date. So, there are several questions here:
|
From the DBS logs (from cmsweb-prod of today), I see a few datasets which failed to be injected via the bulkblocks API, and all of them point to the same issue in the code:
The code line which fails to insert them is here: https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L917 and it tells me that the supplied JSON contains a parent which is not found in the DBS database; please see the entire block logic: https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L911-L926 Therefore, if it is the same issue, I bet that whoever constructs the JSON input did not check that the parents of these datasets/blocks are present in DBS. Again, without the actual JSON it is hard to pin-point further which parents are not present. But I think we have a clear clue of what is going on with the injections. I still feel weak and can't spend more time on this, but I hope @todor-ivanov can dig more deeply into this issue. For the record, the DBS logs can be found on vocms0750 and today's logs are
|
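Illustrating the kind of verification implied above, a hedged sketch that checks the parent LFNs referenced by a block dump against DBS; the dump file name and the JSON key names are assumptions about the usual bulkblocks layout, not taken from the actual component:

```python
# Hedged sketch: confirm every parent LFN referenced in a bulkblocks-style
# JSON dump is already registered in DBS before attempting an injection.
import json
from dbs.apis.dbsClient import DbsApi

reader = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader")

with open("block_dump.json") as fd:  # hypothetical dump file
    dump = json.load(fd)

# "file_parent_list" / "parent_logical_file_name" follow the usual bulkblocks
# dump layout; adjust if the actual payload differs.
parent_lfns = {
    entry["parent_logical_file_name"] for entry in dump.get("file_parent_list", [])
}

missing = [lfn for lfn in sorted(parent_lfns)
           if not reader.listFiles(logical_file_name=lfn)]
for lfn in missing:
    print("parent LFN not found in DBS:", lfn)
```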
@vkuznet Valentin, the DBS3Upload component provides a way to dump the block information into a file (JSON), but that required further work to become useful. I changed how we decide to dump the block data for a given block name, such that now we can dump the information for a single block and inspect what exactly gets posted to the DBS server. I have already patched submit6 with that patch, changed the component configuration with the block name you mentioned above (pasted here as well):
and restarted DBS3Upload. Once you have the data you need, please revert that config back to an empty string and restart the component. |
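A hedged sketch of what that per-block dump setting could look like on the agent side; the attribute name dumpBlock, the config file path and the block name are illustrative, not necessarily the names introduced by the patch:

```python
# Hedged sketch: flag a single block for dumping in the DBS3Upload section of
# the agent configuration. Attribute name, path and block name are assumptions.
from WMCore.Configuration import loadConfigurationFile

cfg_path = "/data/srv/wmagent/current/config/wmagent/config.py"  # assumed path
config = loadConfigurationFile(cfg_path)

# Dump the POST payload for this one block only; an empty string disables it.
config.DBS3Upload.dumpBlock = "/Primary/Processed-v1/NANOAODSIM#<block-uuid>"
print(config.DBS3Upload)

# After editing the real config.py accordingly, restart the DBS3Upload
# component so the new value is picked up; revert to "" once done.
```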
Alan, thanks for providing this, but so far I do not see anything on the submit6 agent. If I read your changes correctly, and reading the config, I see that the data should go to |
Yes, that agent seems to have an unusually high number of open blocks (or not, vocms0253 has 15919 blocks!!!):
that's why it's taking a while to associate files to their respective blocks. The first cycle is usually longer than the subsequent ones. You are right, that's the correct place where you should see the JSON file. |
@vkuznet I just figured out why the JSON dump never showed up for the block above, and the reason is that that block went through the system long ago (it had only a transient failure), sorry about that. It looks like we have another block that has been failing over the last few cycles of DBS3Upload. So I updated the component config to get a dump for:
|
@amaltaro, I got the JSON file for
I just checked the DBS global database directly and it has no such datasets registered, see
So, the problem, at least with this JSON and this block, is the lack of a parent dataset in DBS. Since I do not know how this JSON is constructed, I do not have any further comments, but I suggest figuring it out and seeing why we create a JSON (block) with parent info which is not present in the DBS DB. |
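For anyone reproducing this check, a minimal sketch of asking DBS directly whether a parent dataset is registered, using the DBSReader datasets API; the dataset name is a placeholder:

```python
# Minimal sketch: check whether a parent dataset exists in global DBS.
from dbs.apis.dbsClient import DbsApi

reader = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader")
parent_dataset = "/Primary/Campaign-ProcString-v1/MINIAODSIM"  # placeholder

# dataset_access_type="*" also matches PRODUCTION/INVALID datasets.
result = reader.listDatasets(dataset=parent_dataset, dataset_access_type="*")
print("registered" if result else "NOT registered in DBS:", parent_dataset)
```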
Thank you for pin-pointing this issue, Valentin. Based on this information, I investigated it a little further, and it looks like we lost the ability to spot such problems in DBS3Upload with the migration of DBS to the Golang implementation, as expected from these 2 lines of code: Now, on why and how it happened, we can look this up in WMStats by:
As such, StepChain workflows produce the unmerged files at one point and the merge is triggered by dataset + output module + datatier, asynchronously (they actually depend on many other constraints like size, events, etc.). So it's indeed possible for a dataset/block to be injected into the DBS server without having its parent data. I don't know how to properly resolve it, given that an ACDC workflow can be executed to recover the MINIAOD data, thus eventually allowing the insertion of this NANOAOD block into the DBS server to succeed.
|
@amaltaro, this seems to be an identical issue to the one I reported over here: #11106 In other words, the order in which data is injected into DBS matters, since dataset/block/file info depends on other information such as configs, parentage, etc., and the resolution requires careful inspection of the data injection procedure in WMCore. That said, indeed, these lines https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L102-L103 rely on the DBS Python exception logic rather than checking the actual DBS server error. Therefore, our effort to properly handle DBSErrors (#10962, #11173) also matters here. What currently can be done is the following:
I think the most reliable solution would be to add code to DBSUploadPoller which queries DBS for dataset parents. The reason is simple: it is the most explicit action. The code should not rely on matching a specific exception string from the DBS server, as the server implementation can change and so can the exception message. Instead, the poller should explicitly check whether the data is allowed to be inserted. For that it should check the following:
|
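To make the contrast concrete, a hedged sketch of the two approaches; the function names and the matched error text are illustrative and do not correspond to the actual DBSUploadPoller code:

```python
# Hedged sketch contrasting exception matching with explicit pre-checks.

def inject_by_forgiveness(writer, block_dump):
    """Current style: attempt the insert and interpret the failure afterwards."""
    try:
        writer.insertBulkBlocks(block_dump)
        return "injected"
    except Exception as exc:
        # Matching on the server's error text is fragile: it silently breaks
        # whenever the server implementation or its messages change.
        if "already exists" in str(exc).lower():
            return "already-in-dbs"
        raise


def inject_by_permission(writer, block_dump, preconditions):
    """Proposed style: explicitly verify preconditions, then insert.

    `preconditions` is an iterable of (name, callable) pairs, e.g. the parent
    dataset / parent file / output config lookups sketched earlier.
    """
    for name, check in preconditions:
        if not check():
            return "waiting-for: %s" % name
    writer.insertBulkBlocks(block_dump)
    return "injected"
```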
We can implement it in a "permission vs forgiveness" model; up to now we have adopted the forgiveness one (deal with exceptions). The reasons for this are:
We could try to look into this, but it becomes a considerably more expensive development. |
@amaltaro, everything really depends on the implementation. If the code uses a sequential mode, i.e. makes the DBS calls for configs, dataset, blocks, etc. in sequence, of course it will be slow. But if the code makes these calls in parallel, then at most you will add 1-2 seconds to the overall time, not more. The checks which I suggested doing before data injection are very simple DBS queries, since each query contains a specific value, e.g. |
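A hedged sketch of how such lookups could be run concurrently so that the pre-injection checks cost roughly one round trip; the dbsClient calls are real DBS3 client APIs, while the dataset and LFN values are placeholders:

```python
# Hedged sketch: run independent DBS existence checks in parallel.
from concurrent.futures import ThreadPoolExecutor
from dbs.apis.dbsClient import DbsApi

reader = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader")
parent = "/Primary/Campaign-ProcString-v1/MINIAODSIM"  # placeholder

checks = {
    "parent dataset": lambda: reader.listDatasets(dataset=parent,
                                                  dataset_access_type="*"),
    "output config": lambda: reader.listOutputConfigs(dataset=parent),
    "parent file": lambda: reader.listFiles(
        logical_file_name="/store/mc/placeholder/file.root"),
}

with ThreadPoolExecutor(max_workers=len(checks)) as pool:
    futures = {name: pool.submit(call) for name, call in checks.items()}
    missing = [name for name, fut in futures.items() if not fut.result()]

print("all preconditions satisfied" if not missing else "missing: %s" % missing)
```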
What I meant by a more expensive development is that we move away from a simple block dump provided to the server towards much more control and checks on the client side:
If we take this path, then I think the best would be to make this stateful (dependent on the local database information). |
If you are serious about eliminating data race conditions, then it is the only way, i.e. the client should check everything before injection. The implementation details can vary, though. If you need a cache you can use it, but that implies you need to maintain it and use resources for it. Making HTTP calls is cheap in my view, but of course it has its own drawbacks (everything depends on data volume, rate, etc.). Whether to retry or not is again up to the client. The bottom line is that the client should perform all checks before trying to inject new data. If this is implemented, then we will eliminate the data racing condition issue which is present right now, as well as the issue with failures due to data not existing in the DB (this one). The only failures that may happen in this scenario are due to database and/or DBS server issues, but in that case we'll have a strong separation of responsibilities between the client (which performed all necessary checks) and the server (which is provided with fully valid input and should insert the data). |
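On the cache trade-off mentioned above, a small sketch of memoizing the existence lookups within one polling cycle; keeping the cache from going stale is exactly the maintenance cost being discussed, and the dataset name is a placeholder:

```python
# Hedged sketch: cache "does this dataset exist in DBS?" lookups so repeated
# checks in one polling cycle do not become repeated HTTP calls.
from functools import lru_cache
from dbs.apis.dbsClient import DbsApi

reader = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader")


@lru_cache(maxsize=4096)
def dataset_exists(dataset):
    """Cached existence check; call dataset_exists.cache_clear() each cycle."""
    return bool(reader.listDatasets(dataset=dataset, dataset_access_type="*"))


for _ in range(3):  # repeated calls for the same parent hit DBS only once
    dataset_exists("/Primary/Campaign-ProcString-v1/MINIAODSIM")  # placeholder
print(dataset_exists.cache_info())
```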
Yes, given that we found this to actually be a problem with the workflow itself, in the sense that some merge jobs failed and that caused one of the parent datasets not to be created (now pending ACDC creation and execution), I would suggest closing this one out (once we hear back from Hasan). At some point, that/those blocks should succeed once the ACDC goes through and the (parent) data gets properly produced. As for the long-term and major refactoring of DBS3Upload, I would suggest updating GH issue #11106 with some of the comments made in this issue (my preference at this very moment is to include a database schema change) and planning it for future quarters. |
fine with me. |
Hi all, apologies for the late reply. We check the following endpoint [1], which I think should include the recovery info that ACDC will use, but it's empty for these workflows. That's why we didn't ACDC them. |
@haozturk I think you looked into the wrong workflow. A few replies above, I investigated the following workflow where we can actually find ACDC entries for the merge task (among others). |
@amaltaro hmm, that's not one of the workflows for which we opened this issue. It's not a "Successful (no job failure) workflow w/ missing output on DBS". If you check the workflows that I provided in the issue body, they don't have ACDC entries. Regarding that workflow in particular, it has a file-mismatch problem where an output file [1] is missing on DBS while it's on Rucio. We don't create ACDCs before fixing the mismatch. Do you think these issues are related? I haven't read the whole thread. I might be missing some context. [1] /store/mc/RunIISummer20UL16NanoAODv9/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/NANOAODSIM/106X_mcRun2_asymptotic_v17-v2/60000/E9D7E442-299E-FC4B-BA82-7D87E3A82B7D.root |
@amaltaro, you put too much faith in my abilities. At least I need to know where the GWQ log is: is it on the production server (cmsweb.cern.ch)? Is it available on vocms0750? Is it called workqueu*.log? Then, it would be more useful to know which WMCore code generates the specific log entries, how the nlumis are calculated and which log entries are generated. I'm asking because I assume that GWQ is on cmsweb, its logs are on vocms0750 and are called workqueue*.log; if so, then there is nothing in there:
return nothing. At the very least I need more information about GWQ, which patterns to look for, and in which log(s).
I made an effort to look up the given workflow in the workflow log. Here I made several assumptions which may or may not be true:
There is no other information about the given workflow, and I have no other clue where to look it up. @amaltaro, please provide further instructions on how to deal with it. Better yet, as I requested, it would be nice to look at the specific codebase which does the initial estimate of the nlumis numbers, see its logic, and check whether it logs anything. |
Valentin, see further comments and instructions below:
yes, logs are available in
this is the way to move forward, checking when the status transitions happened. Given that this one is about global workqueue, we don't really care about the whole lifetime of the workflow, but solely that status transition. Expanding on the content of that reqmgrInteractionTask log file (mentioned above):
The last line of this log represents the moment the workflow gets closed for further input data, so no more blocks/stats can be added to it. In these logs, we can see that this is indeed the case. In terms of source code, this code is quite complex, but it's performed by a Global WorkQueue CherryPy app that starts up from this module: Maybe the next step we can do is to clone this workflow into one of our dev setups and see if global workqueue would again find the 179 lumis. |
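A hedged sketch of the kind of log scan described here, looking for a given workflow in the reqmgrInteractionTask logs; the log location, file pattern and matched keywords are assumptions, not the actual log format:

```python
# Hedged sketch: find log lines mentioning a given workflow in the (assumed)
# reqmgrInteractionTask log files, to spot when it was acquired and closed.
import glob
import gzip

workflow = "pdmvserv_placeholder_workflow_name"                     # placeholder
log_glob = "/build/srv-logs/*/workqueue/reqmgr_interaction*.log*"   # assumed path

for path in sorted(glob.glob(log_glob)):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as fd:
        for line in fd:
            if workflow in line and ("lumis" in line or "closed" in line.lower()):
                print(path, line.rstrip())
```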
@amaltaro, I looked up the code and can reproduce the 179 nlumis number. The code
it has 4 blocks:
and
If you sum up num_lumis across all blocks you'll get 179 :)
While the
So, the issue here is that
The difference is an additional join with the DATASET_ACCESS_TYPES table in the dataset query. I need time to investigate the DBS queries further, as I inherited them from the Python-based code. I'll report later with more results on how the DBS queries differ for datasets and blocks, but obviously this is the root of the problem reported here. |
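The discrepancy can also be reproduced from the client side; a minimal sketch comparing the dataset-level filesummaries count with the sum of the per-block counts, where the dataset name is a placeholder:

```python
# Minimal sketch: dataset-level lumi count vs the sum over blocks.
from dbs.apis.dbsClient import DbsApi

reader = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader")
dataset = "/Primary/Campaign-ProcString-v1/MINIAODSIM"  # placeholder

ds_summary = reader.listFileSummaries(dataset=dataset, validFileOnly=1)
ds_lumis = ds_summary[0]["num_lumi"] if ds_summary else 0

block_sum = 0
for blk in reader.listBlocks(dataset=dataset):
    summary = reader.listFileSummaries(block_name=blk["block_name"], validFileOnly=1)
    if summary:
        block_sum += summary[0]["num_lumi"]

# When blocks share run/lumi pairs, the per-block sum (e.g. 179) can exceed
# the dataset-level count (e.g. 100).
print("dataset-level lumis:", ds_lumis, "sum over blocks:", block_sum)
```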
Updating the comment above that tagged the "wrong" Alan (sorry about that!) |
@vkuznet, without looking carefully into this, my hypothesis is that filesummaries is actually returning UNIQUE tuples of run/lumi:
which means that there are files in different blocks (and maybe even in the same block) that have exactly the same run/lumi tuple. Hence DBS returns a smaller number of lumis when queried by dataset. Given that this dataset is pretty small, I would suggest retrieving all the run/lumis for all files in this dataset, ordering them, and checking how many duplicates we have (if any). |
Yes, this is the case: different blocks have the same run/lumi tuples, and sorting them and taking the unique set gives me 100 unique lumis. So the mystery is solved. That said, the remedy in WMCore should be the following:
For example:
Now repeat this for all blocks and extract the run/lumi pairs. Then make a set of this list and take its size. Here is a simple Python script which does exactly that, and it returns 100 as expected:
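A minimal sketch of the same idea (not the author's original script), resolving the dataset into blocks and counting unique run/lumi pairs via the DBSReader filelumis API; the dataset name is a placeholder:

```python
# Minimal sketch: count unique (run, lumi) pairs across all blocks of a dataset.
from dbs.apis.dbsClient import DbsApi

reader = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader")
dataset = "/Primary/Campaign-ProcString-v1/MINIAODSIM"  # placeholder

pairs = set()
for blk in reader.listBlocks(dataset=dataset):
    for row in reader.listFileLumis(block_name=blk["block_name"], validFileOnly=1):
        for lumi in row["lumi_section_num"]:
            pairs.add((row["run_num"], lumi))

# For the dataset discussed above this yields 100 unique pairs, while the
# per-block num_lumi values sum to 179.
print("unique run/lumi pairs:", len(pairs))
```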
|
Thank you for this investigation, Valentin. Could you please check if the output dataset: also has those 179 lumis? If a merge job has multiple files with the same run/lumi, then the output would carry the unique information. But if the unmerged run/lumi is scattered in different merge jobs, then I think it's possible that the output dataset would have duplicate run/lumis. In that case, using the filesummaries for input discovery isn't wrong. |
In this issue #11403 (comment) I provided two examples of python functions, |
@amaltaro, regarding NANOAODSIM we also have a discrepancy. The nlumis for the dataset is 100, the dataset has two blocks, and the sum of their nlumis is 32+99=131. But using my code from #11403 (comment), I ran both functions. How would you like to move forward with this? My suggestion is to add both |
@amaltaro, is there anything left for this issue? My understanding is that we fully debugged the issue and now understand its cause. We provided tools (either the WMCore Python script or the wflow-dbs service) to data-ops, and I wonder if we need to keep this issue open. If so, it would be nice to list the action items required to move forward with it. Thanks. |
Yes, I think we can declare this issue as resolved. In the future, we still have to think of a more sustainable way to find how many total and how many unique lumis were expected to be processed (which might differ between the beginning and the end of the workflow lifetime as well). @haozturk please reopen it in case there is anything else missing. |
Qier was asking about the following workflow: which keeps coming back in operations as "noRecoveryDoc", thus without any ACDC documents to be recovered. I decided to run it through the service that Valentin deployed in cmsweb-testbed and here is the output:
From the report above, it looks like the input data contains thousands of duplicate lumis. Actually, having a second look at these input metrics:
it looks like the input dataset only has 1 valid file(!). I did a spot check and I think this is actually wrong, given that all 6 files in this block are actually valid: @vkuznet could you please review how you count those (output stats seem to be miscounted as well)? |
Alan, yes, there was an error (a mismatch of valid vs invalid files in the DBS API query). Now the server is fixed and reports:
|
Do you know what the problem is with this workflow? The num_invalid_files is 0, so it doesn't look like the invalidation issue.
@z4027163 the issue with that workflow is not related to invalid files, but to the number of unique (or duplicate) run/lumis in the input, see:
from the report above. Does it answer the remaining question that was reported at the CompOps meeting? |
@vkuznet Can you give more details on the meaning of "unique_file_lumis"? I am a bit surprised that the output has a higher value than the input dataset.
@z4027163, it is set over here: https://github.com/vkuznet/wflow-dbs/blob/main/dbs.go#L37 and calculated in this function: https://github.com/vkuznet/wflow-dbs/blob/main/dbs.go#L194. But in plain English, it is the number of lumis returned by the filelumis DBS API for a given block name (I resolve a dataset into block names, then query filelumis for every block and calculate the unique number of run-lumi pairs). |
@vkuznet according to the code, is it correct to say that |
@amaltaro , the
|
Impact of the bug
Workflow announcement
Describe the bug
We see lots of successful workflows w/ missing output on DBS. I'm not sure whether this is due to a delay or failure in DBS injection, or to a problem in job failure accounting.
How to reproduce it
Here are some affected workflows:
If you need a complete list, please let me know. I can provide it.
Expected behavior
If there is no job failure, we expect to see 100% output on DBS.
Additional context and error message
None